System Design
Problems
21.1
Designing a good spelling-correction system can be challenging. In Problem 17.2 we only computed the Levenshtein distance between a pair of strings, but a spell checker must find the set of dictionary words that are closest to a given word from the entire dictionary. Furthermore, plain Levenshtein distance may not be the right distance function because it ignores commonly misspelled words and the proximity of letters on a keyboard.
How would you build a spelling correction system?
Hint: Start with an appropriate notion of distance between words.
21.2
When a user submits the query "computation" to a search engine, it is quite possible that documents containing the words "computers", "compute", and "computing" are also relevant. If a query contains several keywords, it becomes difficult to search for all combinations of all variants of the words in the query.
Stemming is the process of reducing all variants of a given word to one common root, both in the query string and in the documents. For example, one might map "computers", "computing", and "computation" to the root "compute".
Design a stemming algorithm that is fast and effective.
Hint: The examples are suggestive of general rules.
21.3
Design an efficient algorithm that takes as input a set of text files and returns pairs of files that have substantial commonality.
Hint: Design a hash function that can be computed incrementally as a fixed-length substring window slides through a file.
21.4
You are building a social network where each user specifies a set of attributes. You would like to pair each user with another unpaired user that specified the same set of attributes.
You are given a sequence of users where each user has a unique 32-bit integer key and a set of attributes specified as strings. When you read a user, you should pair that user with another previously read user with identical attributes who is currently unpaired, if such a user exists. If the user cannot be paired, you should add that user to the unpaired set.
Hint: Map sets of attributes to strings.
21.5
YouTV.com is a successful online video sharing site. Hollywood studios complain that much of the material uploaded to the site violates copyright.
Design a feature that allows a studio to enter a set of videos that belong to it, and to determine which videos in the YouTV.com database match videos in that set.
Hint: Normalize the video format and create signatures.
Variant: Design an online music identification service.
21.6
The TeX system for typesetting beautiful documents was designed by Don Knuth. Unlike GUI-based document editing programs, TeX relies on a markup language which is compiled into a device-independent intermediate representation. TeX formats text, lists, tables, and embedded figures; supports a very rich set of fonts and mathematical symbols; automates section numbering, cross-referencing, and index generation; exports an API; and much more.
How would you implement TeX?
Hint: There are two aspects: building blocks such as fonts and symbols, and hierarchical layout.
21.7
Keyword-based search engines maintain a collection of several billion documents. One of the key computations performed by a search engine is retrieving all documents that contain the keywords in a query, and this must be done within a few tens of milliseconds.
Here consider a smaller version of the problem where the collection can fit within the RAM of a single computer.
Given a million documents with an average size of 10 kilobytes, design a program that can efficiently return the subset of documents containing a given set of words.
Hint: Build on the idea of a book’s index.
21.8
The PageRank algorithm assigns a rank to a web page based on the number of important pages that link to it. It amounts to the following:
- Build a matrix A based on the hyperlink structure of the web. Specifically, A[i][j] = 1/d_j if page j links to page i, where d_j is the number of distinct pages linked to from page j, and A[i][j] = 0 otherwise.
- Find a vector X satisfying X = ε[1] + (1 − ε)AX. Here ε is a scalar constant, e.g., ε = 0.15, and [1] represents a column vector of 1s. The value X[i] is the rank of the ith page.
Design a system that can compute the ranks of ten billion web pages in a reasonable amount of time.
Hint: This must be performed on an ensemble of machines. The right data structures will simplify the computation.
21.9
Modern datasets are huge. For example, a popular social network may contain over two trillion distinct items.
How would you sort a billion 1000-byte strings? How about a trillion 1000-byte strings?
Hint: Can a trillion 1000-byte strings fit on one machine?
21.10
You have n machines, or crawlers, for downloading the entire web. The responsibility for a given URL u is assigned to the crawler whose ID is Hash(u) mod n.
Downloading a web page takes away bandwidth from the web server hosting it. Implement crawling under the constraint that in any given minute your crawlers do not request more than b bytes from any website.
Hint: Use a server to coordinate the crawl.
21.11
Maintaining a set of prioritized jobs in a distributed system can be tricky. Applications include a search engine crawling web pages in some prioritized order, as well as event-driven simulation in molecular dynamics. In both cases the number of jobs is in the billions and each job has its own priority.
Design a system for maintaining a set of prioritized jobs that implements the following API: (1) insert a new job with a given priority; (2) delete a job; (3) find the highest-priority job. Each job has a unique ID. Assume the set cannot fit into a single machine’s memory.
Hint: How would you partition jobs across machines? Is it always essential to operate on the highest priority job?
21.12
A photomosaic is built from a collection of images called tiles and a target image. The photomosaic approximates the target image and is built by juxtaposing the tiles. Quality is defined by human perception.
Design a program that produces high-quality mosaics with minimal compute time.
Hint: How would you define the distance between two images?
21.13
Airlines often give customers who fly frequently with them a status, typically based on miles flown in the past twelve months. Some customers want to take a round-trip flight simply to maintain their status. The destination is immaterial; the goal is to minimize the cost-per-mile, i.e., the ratio of dollars spent to miles flown.
Design a system that will help its users find mileage runs.
Hint: Partition the implied features into independent tasks.
21.14
How would you design Connexus, a system by which users can share pictures? Address issues around access control such as public and private pictures, picture upload, organizing pictures, summarizing sets of pictures, allowing feedback, and displaying geographical and temporal information for pictures.
Hint: Think about the UI, in particular UI widgets that make for an engaging product.
21.15
Jingle, a search engine startup, has been very successful at providing a high-quality Internet search service. Many customers have approached Jingle and asked it to display paid advertisements for their products and services alongside search results.
Design an advertising system for Jingle.
Hint: Consider the stakeholders separately.
21.16
Jingle wants to generate more page views on its news site. A product manager has the idea to add to each article a sidebar of clickable snippets from articles that are likely to be of interest to someone reading the current article.
Design a system that automatically generates a sidebar of related articles.
Hint: This problem can be solved with various degrees of algorithmic sophistication: none at all, simple frequency analysis, or machine learning.
21.17
Jingle is developing a search feature for breaking news. New articles are collected from a variety of online news sources such as newspapers, bulletin boards, and blogs by a single lab machine at Jingle. Every minute, roughly one thousand articles are posted and each article is 100 kilobytes.
Jingle would like to serve these articles from a data center consisting of 1000 servers. For performance reasons, each server should have its own copy of articles that were recently added. The data center is far away from the lab machine.
Design an efficient way of copying one thousand files each 100 kilobytes in size from a single lab server to each of 1000 servers in a distant data center.
Hint: Exploit the data center.
21.18
Design the World Wide Web. Specifically, describe what happens when you enter a URL in a browser address bar and press return.
Hint: Follow the flow of information.
21.19
Estimate the hardware cost of the server hardware needed to build a photo-sharing app used by every person on the earth.
Hint: Use variables to denote quantities and costs, and relate them with equations. Then fill in reasonable values.
Answers
21.1
Start with a candidate generator based on small edit distance, typically a dictionary-backed search over words within one or two edits. Then improve ranking with richer signals such as keyboard proximity, phonetic similarity, historical correction data, and stemming. The important split is candidate generation versus candidate ordering.
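A minimal sketch of the candidate-generation half, enumerating every string within one edit of the input and intersecting with a dictionary. The dictionary here is a stand-in, and ranking is deliberately left out:

```python
import string

def edits1(word):
    """All strings within one edit (delete, transpose, replace, insert) of word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in string.ascii_lowercase]
    inserts = [l + c + r for l, r in splits for c in string.ascii_lowercase]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, dictionary):
    """Dictionary words within one edit; a ranking stage would order these."""
    return edits1(word) & dictionary
```

Two-edit candidates fall out by applying `edits1` to each result, at which point keyboard proximity and frequency data decide the ordering.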
21.2
The practical approach is rule-based rewriting: strip or replace common suffixes using a finite-state or trie-like rule engine, then layer in exceptions and language-specific patterns. Porter-style stemming is the canonical example. More sophisticated alternatives include learned rewrite rules or n-gram-based context.
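A toy suffix-stripping stemmer in this spirit; the suffix table and the minimum stem length of 3 are illustrative choices, not Porter's actual rules:

```python
def stem(word):
    """Strip or rewrite the first matching suffix, longest rules first.
    The rule table and minimum-stem-length threshold are illustrative."""
    suffixes = [("ational", "ate"), ("ization", "ize"), ("ies", "y"),
                ("ers", ""), ("ing", ""), ("es", ""), ("s", "")]
    for suffix, replacement in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word
```

A production stemmer adds ordered rule phases and exception lists, but the shape is the same: match a suffix, check a stem-length condition, rewrite.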
21.3
Represent each file by rolling hashes of all its length-k substrings and store those hashes in a table keyed by hash value. Collisions identify candidate overlaps, which can then be verified or post-processed. This scales well and generalizes naturally to distributed execution such as map-reduce.
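The incremental hashing idea can be sketched with a Rabin-Karp-style rolling hash; the base and modulus here are conventional but arbitrary choices:

```python
def rolling_hashes(text, k, base=256, mod=(1 << 61) - 1):
    """Hashes of every length-k substring, each derived in O(1) from the previous."""
    if len(text) < k:
        return []
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    hashes = [h]
    high = pow(base, k - 1, mod)  # weight of the character leaving the window
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * high) % mod  # drop the outgoing character
        h = (h * base + ord(text[i])) % mod      # bring in the incoming one
        hashes.append(h)
    return hashes
```

Hashing every window of every file this way and bucketing by hash value yields the candidate-overlap table the answer describes.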
21.4
Canonicalize each attribute set and use that canonical form as a hash key. Small fixed universes can use bit vectors; sparse or large universes can use sorted-string encodings of the attribute set. Then each arriving user is a constant-time hash lookup against unmatched users with the same key.
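A sketch of the streaming pairing logic, using a sorted tuple of attribute strings as the canonical key (names are illustrative):

```python
def pair_users(users):
    """users: iterable of (user_id, attribute_set) in arrival order.
    Returns the pairs formed; each lookup and update is expected O(1)."""
    unpaired = {}  # canonical attribute key -> waiting user_id
    pairs = []
    for user_id, attrs in users:
        key = tuple(sorted(attrs))
        if key in unpaired:
            pairs.append((unpaired.pop(key), user_id))
        else:
            unpaired[key] = user_id
    return pairs
```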
21.5
Reduce near-duplicate video detection to signature comparison. First normalize format, resolution, and compression as much as possible, then compute robust fingerprints over frames or frame regions. Use those signatures to search the corpus efficiently, and combine that with metadata and user-reporting workflows to reduce false positives and review cost.
21.6
Separate the problem into symbol representation and hierarchical layout. Symbols and fonts can be represented parametrically rather than as bitmaps, while the document itself is a tree of rectangles and layout constraints. Rendering then becomes assembling these components while satisfying hard formatting rules and soft aesthetic goals.
21.7
Use an inverted index: for each word, keep a postings list of document locations where it appears. Query evaluation becomes intersection of sorted postings lists, with standard optimizations such as compression, caching, intersecting the shortest lists first, and keeping high-quality or hot subsets in RAM.
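A minimal in-memory version of this design, with whitespace tokenization standing in for real text processing:

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns word -> sorted postings list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}

def query(index, words):
    """Intersect postings lists, shortest first, so the candidate set shrinks fastest."""
    postings = sorted((index.get(w, []) for w in words), key=len)
    if not postings:
        return []
    result = set(postings[0])
    for p in postings[1:]:
        result &= set(p)
    return sorted(result)
```

The production refinements the answer lists (compression, caching, hot subsets) all sit on top of exactly this structure.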
21.8
Store the web graph as sparse adjacency lists distributed across many machines. PageRank is then repeated sparse matrix-vector multiplication. Partition the graph and the rank vector across servers, perform local contributions, exchange boundary updates, and rely on a distributed filesystem or map-reduce style framework for fault tolerance.
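A single-machine analogue of the iteration, using the normalized recurrence x = ε/n + (1 − ε)Ax so x remains a probability distribution; dangling pages are ignored for brevity, which a production system would have to handle:

```python
def pagerank(links, eps=0.15, iters=50):
    """links: {page: list of pages it links to}; every page must appear as a key.
    Power iteration over the sparse adjacency lists."""
    n = len(links)
    x = {p: 1.0 / n for p in links}
    for _ in range(iters):
        nxt = {p: eps / n for p in links}       # teleportation term
        for p, outs in links.items():
            if outs:
                share = (1 - eps) * x[p] / len(outs)
                for q in outs:                  # scatter p's rank to its targets
                    nxt[q] += share
        x = nxt
    return x
```

The distributed version partitions `links` and `x` across machines and turns the scatter loop into a shuffle of (target page, share) messages.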
21.9
A billion 1000-byte strings, roughly a terabyte, can be handled by external sorting on one machine: split the input into RAM-sized chunks, sort each chunk, write the sorted runs to disk, then merge the runs. A trillion strings, roughly a petabyte, requires distributed range partitioning: sample keys, assign disjoint key ranges to machines, reshuffle records to their owning machines, and sort locally within each range.
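The single-machine half can be sketched directly; here each run is a temp file and `heapq.merge` performs the k-way merge:

```python
import heapq
import tempfile

def external_sort(records, chunk_size):
    """Sort an iterator of strings with bounded memory: sort fixed-size chunks,
    spill each as a sorted run to a temp file, then k-way merge the runs.
    chunk_size is the number of records held in RAM at once."""
    run_files = []
    chunk = []

    def spill():
        chunk.sort()
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(line + "\n" for line in chunk)
        f.seek(0)
        run_files.append(f)
        chunk.clear()

    for rec in records:
        chunk.append(rec)
        if len(chunk) >= chunk_size:
            spill()
    if chunk:
        spill()
    runs = ((line.rstrip("\n") for line in f) for f in run_files)
    return list(heapq.merge(*runs))
```

The distributed case wraps this same routine: each machine runs it on the key range it owns after the shuffle.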
21.10
Introduce a permission or quota service keyed by host. Crawlers ask it for approval before downloading from a site, and it tracks how many bytes have already been served in the current time window. If needed, shard this coordinator by host hash and add policy for priority and oversized files.
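A single-process sketch of the quota service's core bookkeeping; the class name, fixed-window policy, and injectable clock are all illustrative choices:

```python
import time
from collections import defaultdict

class CrawlQuota:
    """Per-host byte budget over a fixed time window. Crawlers call approve()
    before downloading; a real service would shard this state by host hash."""

    def __init__(self, max_bytes_per_window, window_seconds=60, clock=time.monotonic):
        self.max_bytes = max_bytes_per_window
        self.window = window_seconds
        self.clock = clock
        self.usage = defaultdict(lambda: [0.0, 0])  # host -> [window start, bytes used]

    def approve(self, host, nbytes):
        now = self.clock()
        start, used = self.usage[host]
        if now - start >= self.window:  # window expired: start a fresh one
            start, used = now, 0
        if used + nbytes > self.max_bytes:
            self.usage[host] = [start, used]
            return False
        self.usage[host] = [start, used + nbytes]
        return True
```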
21.11
Partition jobs across machines, typically by hashing job IDs for insert and delete. If strict global max extraction is required, collect local maxima from all shards and resolve the winner centrally. If approximate or eventually consistent priority is acceptable, relax the API and let clients consume from local or sampled top-priority shards for higher throughput.
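A single-process analogue of the strict-extraction design, with hash-partitioned shards, lazy deletion, and a max resolved across shard tops:

```python
import heapq

class ShardedPriorityQueue:
    """Jobs partitioned by hash(job_id) across local max-heaps (priorities
    negated for heapq). A sketch of the distributed design, not the real thing."""

    def __init__(self, num_shards):
        self.shards = [[] for _ in range(num_shards)]
        self.deleted = set()

    def insert(self, job_id, priority):
        shard = self.shards[hash(job_id) % len(self.shards)]
        heapq.heappush(shard, (-priority, job_id))

    def delete(self, job_id):
        self.deleted.add(job_id)  # lazy deletion; purged when it reaches a top

    def extract_max(self):
        best_shard = None
        for shard in self.shards:
            while shard and shard[0][1] in self.deleted:  # purge stale tops
                self.deleted.discard(heapq.heappop(shard)[1])
            if shard and (best_shard is None or shard[0] < best_shard[0]):
                best_shard = shard
        if best_shard is None:
            return None
        return heapq.heappop(best_shard)[1]
```

In the distributed setting the `extract_max` loop becomes an RPC fan-out collecting each machine's local maximum.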
21.12
Turn each target region and tile into a compact feature vector, such as average color or a coarse pixelization, then search for nearest neighbors in that feature space. The design is mostly about choosing the image distance metric and indexing tiles so matching is much faster than the naive scan.
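The simplest instance of this idea, with average color as the feature vector and squared Euclidean distance as the metric (both stand-ins for richer choices):

```python
def avg_color(pixels):
    """Mean RGB of a region; pixels is a non-empty list of (r, g, b) tuples."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def best_tile(region_pixels, tiles):
    """tiles: {tile_name: precomputed average color}. Returns the tile whose
    average color is nearest the region's, by squared Euclidean distance."""
    target = avg_color(region_pixels)

    def dist(color):
        return sum((a - b) ** 2 for a, b in zip(target, color))

    return min(tiles, key=lambda name: dist(tiles[name]))
```

Precomputing tile features once and indexing them (e.g., a k-d tree) is what turns the naive scan into something fast.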
21.13
Split the system into a user-facing alerting product and a backend data pipeline. Users define filters such as origin, date windows, and target cost-per-mile; the backend ingests fare and distance data, computes matching flights, persists alerts, and periodically reevaluates them. The key architectural boundary is UI/product management versus scheduled data acquisition and matching.
21.14
Treat it as three systems: media storage and metadata, a web frontend for browsing and interaction, and a mobile experience that takes advantage of camera, GPS, and notifications. Store images outside the database, keep rich metadata for search and grouping, and invest in UI features like streams, maps, sliders, comments, and upload workflows.
21.15
Model the stakeholders separately: users want relevant ads, advertisers want return on spend, and the platform wants revenue and safety. That naturally leads to an advertiser-facing campaign system plus an ad-serving/ranking system that uses query context, bids, user data, and learned relevance features to choose ads under latency constraints.
21.16
The core problem is related-article ranking. A simple system can use recency, editorial tags, or popularity; a better one adds text similarity with stop-word handling and weighting; the strongest version layers collaborative filtering over user reading sessions. The UI work is trivial compared with computing good recommendations.
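The middle tier can be sketched as Jaccard similarity over non-stop-word sets; the stop-word list here is a tiny illustrative sample:

```python
def similarity(text_a, text_b,
               stop_words=frozenset({"the", "a", "of", "and", "to"})):
    """Jaccard similarity of the two articles' non-stop-word vocabularies;
    a stand-in for weighted (e.g., tf-idf) text similarity."""
    words_a = set(text_a.lower().split()) - stop_words
    words_b = set(text_b.lower().split()) - stop_words
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)
```

Ranking candidate articles by this score against the current article yields the sidebar ordering.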
21.17
Do not copy from the lab server to all 1000 machines directly. Seed the data center by sending the bundle once, then let machines inside the data center fan it out to one another, ideally in a tree or peer-to-peer pattern. This exploits cheap internal bandwidth and reduces the bottleneck on the long-haul link.
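The payoff of the fan-out is easy to quantify: if every server that already holds the bundle forwards it to a fixed number of new servers per round, coverage grows geometrically. A small sketch of that arithmetic:

```python
def fanout_rounds(num_servers, fanout):
    """Rounds until one seeded copy reaches all servers, if every holder
    sends to `fanout` new servers per round."""
    have = 1
    rounds = 0
    while have < num_servers:
        have += have * fanout  # each holder recruits `fanout` more
        rounds += 1
    return rounds
```

So 1000 servers are covered in 10 rounds of pairwise forwarding, versus 1000 sequential long-haul transfers from the lab machine.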
21.18
The browser parses the URL, resolves the hostname via DNS, opens an HTTP connection over TCP/IP, sends the request, and the server maps it to either a static asset or application service. The server returns HTML, JSON, XML, or other content; the browser parses the response into a DOM, applies CSS, executes JavaScript, issues follow-on requests as needed, and renders the page.
21.19
Set up a back-of-the-envelope model in terms of users, uploads per user per day, image size, view rate, storage cost, request-serving capacity, and outbound bandwidth. Then estimate the number of machines required by the dominant serving bottleneck and add storage growth separately. The important part is structuring the estimate rather than defending one exact number.
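A hypothetical skeleton of such a model; every parameter and the peak-to-average factor are assumptions to be replaced with defensible values:

```python
import math

def yearly_photo_storage(users, uploads_per_user_per_day, image_bytes):
    """New photo bytes accumulated per year under the stated assumptions."""
    return users * uploads_per_user_per_day * 365 * image_bytes

def servers_needed(users, requests_per_user_per_day,
                   requests_per_server_per_sec, peak_factor=3):
    """Serving machines needed if peak load is peak_factor times the daily average."""
    avg_rps = users * requests_per_user_per_day / 86_400  # seconds per day
    return math.ceil(avg_rps * peak_factor / requests_per_server_per_sec)
```

Multiplying the machine count by a per-server cost and the storage by a per-byte-year cost completes the estimate; the structure matters more than the numbers.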