Classic Designs
Design a web search engine. Covers inverted indexes, TF-IDF ranking, PageRank, query parsing, index sharding, and serving optimization.
Designing a web search engine is one of the most comprehensive system design questions you can face. It touches on crawling, indexing, ranking, distributed systems, and serving latency. The core problem: given a text query, return the most relevant web pages from an index of billions of documents, in under 500 milliseconds. This article focuses on the indexing and serving components, assuming a web crawler (covered separately) feeds pages into the system.
Assumptions:
- 50 billion web pages in the index
- Average page: 50 KB of text content (after stripping HTML)
- Raw text corpus: 50B × 50 KB = 2.5 PB
- Inverted index size: typically 20-30% of raw text = ~600 TB
- Query volume: 100,000 queries/sec (Google handles ~100K/sec)
Index storage:
- 600 TB distributed across thousands of machines
- Each machine holds a shard of ~50 GB (fits in memory for fast serving)
- 600 TB / 50 GB = ~12,000 shardsAn inverted index is the foundational data structure of a search engine. Instead of mapping documents to their words (forward index), it maps words to the documents that contain them (inverted index).
Forward Index (what you have after crawling):
doc_1: "the quick brown fox"
doc_2: "the lazy brown dog"
doc_3: "the quick red fox jumps"
Inverted Index (what you build for searching):
"the" → [doc_1, doc_2, doc_3]
"quick" → [doc_1, doc_3]
"brown" → [doc_1, doc_2]
"fox" → [doc_1, doc_3]
"lazy" → [doc_2]
"dog" → [doc_2]
"red" → [doc_3]
"jumps" → [doc_3]Each entry in the inverted index is called a posting list. Each posting contains:
Term: "quick"
Posting list:
┌──────────┬───────────────┬──────────────────┬──────────────┐
│ doc_id │ term_frequency │ positions │ field │
├──────────┼───────────────┼──────────────────┼──────────────┤
│ doc_1 │ 1 │ [1] │ body │
│ doc_3 │ 1 │ [1] │ body │
│ doc_7 │ 3 │ [5, 12, 28] │ title, body │
└──────────┴───────────────┴──────────────────┴──────────────┘Positions are needed for phrase queries ("quick brown fox" -- the words must appear consecutively). Fields indicate whether the term appeared in the title, body, URL, or anchor text (which affects ranking).
class InvertedIndexBuilder:
def __init__(self):
self.index = {} # term -> list of postings
def add_document(self, doc_id, text):
tokens = tokenize_and_stem(text) # lowercase, stem, remove stop words
term_positions = {}
for position, token in enumerate(tokens):
if token not in term_positions:
term_positions[token] = []
term_positions[token].append(position)
for term, positions in term_positions.items():
if term not in self.index:
self.index[term] = []
self.index[term].append({
"doc_id": doc_id,
"tf": len(positions),
"positions": positions,
})
def finalize(self):
# Sort posting lists by doc_id for efficient merging
for term in self.index:
self.index[term].sort(key=lambda p: p["doc_id"])Posting lists for common terms ("the", "is", "a") can contain billions of entries. Compression is essential.
Raw doc_ids: [1, 5, 9, 14, 20, 35, 41]
Delta-encoded: [1, 4, 4, 5, 6, 15, 6] ← much smaller numbers
VByte compressed: significant space savings for large posting listsTF-IDF (Term Frequency - Inverse Document Frequency) is the foundational ranking algorithm. It answers: "How relevant is this document to this query?"
TF(t, d) = frequency of term t in document d / total terms in d
IDF(t) = log(N / df(t))
where N = total documents, df(t) = documents containing term t
TF-IDF(t, d) = TF(t, d) × IDF(t)
For a multi-word query, sum the TF-IDF scores:
Score(query, d) = Σ TF-IDF(t, d) for each term t in queryimport math
def tf_idf_score(query_terms, doc_id, index, doc_count, doc_lengths):
score = 0.0
for term in query_terms:
if term not in index:
continue
posting_list = index[term]
df = len(posting_list) # document frequency
idf = math.log(doc_count / (1 + df))
for posting in posting_list:
if posting["doc_id"] == doc_id:
tf = posting["tf"] / doc_lengths[doc_id]
score += tf * idf
break
return scoreBM25 is the modern evolution of TF-IDF, used by Elasticsearch and most production search engines. It adds term frequency saturation (diminishing returns for repeated terms) and document length normalization.
PageRank is Google's original insight: a page is important if many important pages link to it. It models the web as a graph where pages are nodes and hyperlinks are edges.
PageRank intuition:
- A page linked by many other pages is probably important.
- A link from an important page counts more than a link from an obscure page.
- A page that links to many pages dilutes its PageRank across all outgoing links.
Formula:
PR(A) = (1 - d) + d × Σ PR(T) / C(T)
for each page T that links to A
where d = damping factor (0.85), C(T) = number of outgoing links from Tdef compute_pagerank(graph, damping=0.85, iterations=20):
"""graph: dict of page_id -> list of pages it links to"""
n = len(graph)
pagerank = {page: 1.0 / n for page in graph}
for _ in range(iterations):
new_pr = {}
for page in graph:
incoming_score = sum(
pagerank[src] / len(graph[src])
for src in graph
if page in graph[src]
)
new_pr[page] = (1 - damping) / n + damping * incoming_score
pagerank = new_pr
return pagerankIn practice, PageRank is precomputed offline (it requires iterating over the entire web graph multiple times). The final PageRank scores are combined with TF-IDF at query time:
Final Score = alpha × text_relevance(TF-IDF/BM25) + beta × PageRank + gamma × freshness + ...Modern search engines use hundreds of ranking signals (PageRank, text relevance, freshness, user engagement, mobile-friendliness, etc.) combined by machine learning models.
User query: "best python tutorials 2024"
1. Tokenize: ["best", "python", "tutorials", "2024"]
2. Normalize: lowercase, stem ("tutorials" → "tutorial")
3. Remove stop words: (none in this case)
4. Expand synonyms: "best" → "best", "top", "greatest" (optional)
5. Detect intent: informational query (not navigational)For a multi-term query, retrieve the posting list for each term and compute the intersection (AND semantics) or union (OR semantics).
Query: "python tutorial"
Posting list for "python": [doc_2, doc_5, doc_8, doc_12, doc_15, ...]
Posting list for "tutorial": [doc_3, doc_5, doc_9, doc_12, doc_18, ...]
AND (intersection): [doc_5, doc_12, ...] ← documents containing both terms
OR (union): [doc_2, doc_3, doc_5, doc_8, doc_9, doc_12, ...]
Score each matching document and return the top K results.The intersection of two sorted posting lists can be computed in O(n + m) with a merge-like algorithm.
For queries that match millions of documents, scoring all of them is too slow. Techniques for early termination:
With 600 TB of index data, a single machine cannot hold or serve the index. It must be distributed.
Partition documents across shards. Each shard holds the complete inverted index for a subset of documents.
Shard 1: documents 1 - 4,000,000 (complete inverted index for these docs)
Shard 2: documents 4,000,001 - 8,000,000
...
Shard N: documents ...
Query execution:
1. Scatter query to all shards (in parallel)
2. Each shard returns its top K results
3. Merge results from all shards, re-rank, return global top KPartition terms across shards. Each shard holds the complete posting list for a subset of terms.
Shard 1: terms "a" - "m" (all postings for these terms)
Shard 2: terms "n" - "z"
Query "python tutorial":
1. "python" → Shard 2, "tutorial" → Shard 2 (same shard, lucky)
2. Single-shard query if terms fall on the same shard
3. Multi-shard query if terms span shardsDocument-based sharding is preferred because every shard can independently score and rank documents. Term-based sharding requires cross-shard communication to score documents.
Each shard is replicated across multiple machines for fault tolerance and to handle high query throughput. A shard with 3 replicas can handle 3x the query load.
The search engine receives new and updated pages from the web crawler. The indexing pipeline processes these into the inverted index.
Crawler → Kafka (raw pages) → Document Processor → Index Builder → Index Servers
Document Processor:
1. Parse HTML, extract text content
2. Extract title, meta description, headings
3. Extract outgoing links (for PageRank)
4. Detect language
5. Deduplicate (SimHash)
Index Builder:
1. Tokenize and stem the text
2. Build posting list entries
3. Merge new entries into the existing index
4. Recompute PageRank periodically (offline batch job)Rebuilding the entire index for every new page would take days. Instead, maintain two indexes:
At query time, search both indexes and merge results. Periodically, merge the delta index into the base index.
Serving search results in under 500ms requires careful optimization at every layer.
Latency budget (500ms total):
- Query parsing and planning: 10ms
- Scatter to shards: 5ms (network)
- Per-shard query execution: 200ms (the bottleneck)
- Merge and re-rank: 20ms
- Snippet generation: 50ms
- Network return: 5ms
- Buffer: 210ms ┌──────────────┐
Users ────>│ Load Balancer│
└──────┬───────┘
│
┌────┴────┐
│ Query │
│ Service │
└────┬────┘
│ scatter
┌───────────┼───────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Index │ │ Index │ │ Index │
│ Shard 1 │ │ Shard 2 │ │ Shard N │
│(+replicas)│(+replicas)│(+replicas)│
└─────────┘ └─────────┘ └─────────┘
Offline:
Crawler → Document Processor → Index Builder → Push to Shards
Web Graph → PageRank Computer → Score DB → Push to Shards