Classic Designs

Search Engine

Design a web search engine. Covers inverted indexes, TF-IDF ranking, PageRank, query parsing, index sharding, and serving optimization.

Designing a Web Search Engine

Designing a web search engine is one of the most comprehensive system design questions you can face. It touches on crawling, indexing, ranking, distributed systems, and serving latency. The core problem: given a text query, return the most relevant web pages from an index of billions of documents, in under 500 milliseconds. This article focuses on the indexing and serving components, assuming a web crawler (covered separately) feeds pages into the system.

Requirements

Functional Requirements

Accept a text query and return a ranked list of relevant web pages.
Support multi-word queries, phrase queries ("machine learning"), and Boolean operators.
Return results with a title, URL, and text snippet highlighting the matching terms.
Autocomplete and "Did you mean" spelling correction.

Non-Functional Requirements

Latency: Results must be served within 500ms (p99). Users perceive anything slower as broken.
Scale: Index 50+ billion web pages. Handle 100,000+ queries per second.
Freshness: New and updated pages should appear in search results within hours.
Relevance: The ranking algorithm must surface the most useful results, not just the most keyword-dense.

Capacity Estimation

Assumptions:
  - 50 billion web pages in the index
  - Average page: 50 KB of text content (after stripping HTML)
  - Raw text corpus: 50B × 50 KB = 2.5 PB
  - Inverted index size: typically 20-30% of raw text = ~600 TB
  - Query volume: 100,000 queries/sec (Google handles ~100K/sec)

Index storage:
  - 600 TB distributed across thousands of machines
  - Each machine holds a shard of ~50 GB (fits in memory for fast serving)
  - 600 TB / 50 GB = ~12,000 shards

The Inverted Index

An inverted index is the foundational data structure of a search engine. Instead of mapping documents to their words (forward index), it maps words to the documents that contain them (inverted index).

Forward Index (what you have after crawling):
  doc_1: "the quick brown fox"
  doc_2: "the lazy brown dog"
  doc_3: "the quick red fox jumps"

Inverted Index (what you build for searching):
  "the"   → [doc_1, doc_2, doc_3]
  "quick" → [doc_1, doc_3]
  "brown" → [doc_1, doc_2]
  "fox"   → [doc_1, doc_3]
  "lazy"  → [doc_2]
  "dog"   → [doc_2]
  "red"   → [doc_3]
  "jumps" → [doc_3]

Posting Lists

Each entry in the inverted index is called a posting list. Each posting contains:

Term: "quick"
Posting list:
  ┌──────────┬───────────────┬──────────────────┬──────────────┐
  │ doc_id   │ term_frequency │ positions        │ field        │
  ├──────────┼───────────────┼──────────────────┼──────────────┤
  │ doc_1    │ 1             │ [1]              │ body         │
  │ doc_3    │ 1             │ [1]              │ body         │
  │ doc_7    │ 3             │ [5, 12, 28]      │ title, body  │
  └──────────┴───────────────┴──────────────────┴──────────────┘

Positions are needed for phrase queries ("quick brown fox" -- the words must appear consecutively). Fields indicate whether the term appeared in the title, body, URL, or anchor text (which affects ranking).

Building the Index

class InvertedIndexBuilder:
    def __init__(self):
        self.index = {}  # term -> list of postings

    def add_document(self, doc_id, text):
        tokens = tokenize_and_stem(text)  # lowercase, stem, remove stop words
        term_positions = {}

        for position, token in enumerate(tokens):
            if token not in term_positions:
                term_positions[token] = []
            term_positions[token].append(position)

        for term, positions in term_positions.items():
            if term not in self.index:
                self.index[term] = []
            self.index[term].append({
                "doc_id": doc_id,
                "tf": len(positions),
                "positions": positions,
            })

    def finalize(self):
        # Sort posting lists by doc_id for efficient merging
        for term in self.index:
            self.index[term].sort(key=lambda p: p["doc_id"])

Index Compression

Posting lists for common terms ("the", "is", "a") can contain billions of entries. Compression is essential.

Variable-byte encoding (VByte): Encode small numbers in fewer bytes. Since posting lists are sorted by doc_id, store delta-encoded doc_ids (differences between consecutive IDs), which are small numbers.
Block-based compression: Group postings into blocks of 128, compress each block with techniques like PForDelta or SIMD-based decompression for CPU efficiency.

1

2

3

Raw doc_ids:      [1, 5, 9, 14, 20, 35, 41]
Delta-encoded:    [1, 4, 4, 5,  6, 15,  6]   ← much smaller numbers
VByte compressed: significant space savings for large posting lists

TF-IDF Ranking

TF-IDF (Term Frequency - Inverse Document Frequency) is the foundational ranking algorithm. It answers: "How relevant is this document to this query?"

TF(t, d) = frequency of term t in document d / total terms in d
IDF(t) = log(N / df(t))
  where N = total documents, df(t) = documents containing term t

TF-IDF(t, d) = TF(t, d) × IDF(t)

For a multi-word query, sum the TF-IDF scores:
  Score(query, d) = Σ TF-IDF(t, d) for each term t in query

import math

def tf_idf_score(query_terms, doc_id, index, doc_count, doc_lengths):
    score = 0.0
    for term in query_terms:
        if term not in index:
            continue
        posting_list = index[term]
        df = len(posting_list)  # document frequency
        idf = math.log(doc_count / (1 + df))

        for posting in posting_list:
            if posting["doc_id"] == doc_id:
                tf = posting["tf"] / doc_lengths[doc_id]
                score += tf * idf
                break
    return score

BM25 is the modern evolution of TF-IDF, used by Elasticsearch and most production search engines. It adds term frequency saturation (diminishing returns for repeated terms) and document length normalization.

PageRank

PageRank is Google's original insight: a page is important if many important pages link to it. It models the web as a graph where pages are nodes and hyperlinks are edges.

PageRank intuition:
  - A page linked by many other pages is probably important.
  - A link from an important page counts more than a link from an obscure page.
  - A page that links to many pages dilutes its PageRank across all outgoing links.

Formula:
  PR(A) = (1 - d) + d × Σ PR(T) / C(T)
    for each page T that links to A
    where d = damping factor (0.85), C(T) = number of outgoing links from T

def compute_pagerank(graph, damping=0.85, iterations=20):
    """graph: dict of page_id -> list of pages it links to"""
    n = len(graph)
    pagerank = {page: 1.0 / n for page in graph}

    for _ in range(iterations):
        new_pr = {}
        for page in graph:
            incoming_score = sum(
                pagerank[src] / len(graph[src])
                for src in graph
                if page in graph[src]
            )
            new_pr[page] = (1 - damping) / n + damping * incoming_score
        pagerank = new_pr

    return pagerank

In practice, PageRank is precomputed offline (it requires iterating over the entire web graph multiple times). The final PageRank scores are combined with TF-IDF at query time:

1

Final Score = alpha × text_relevance(TF-IDF/BM25) + beta × PageRank + gamma × freshness + ...

Modern search engines use hundreds of ranking signals (PageRank, text relevance, freshness, user engagement, mobile-friendliness, etc.) combined by machine learning models.

Query Processing

Query Parsing

1

2

3

4

5

6

7

User query: "best python tutorials 2024"

1. Tokenize: ["best", "python", "tutorials", "2024"]
2. Normalize: lowercase, stem ("tutorials" → "tutorial")
3. Remove stop words: (none in this case)
4. Expand synonyms: "best" → "best", "top", "greatest" (optional)
5. Detect intent: informational query (not navigational)

Query Execution on the Inverted Index

For a multi-term query, retrieve the posting list for each term and compute the intersection (AND semantics) or union (OR semantics).

Query: "python tutorial"

Posting list for "python":    [doc_2, doc_5, doc_8, doc_12, doc_15, ...]
Posting list for "tutorial":  [doc_3, doc_5, doc_9, doc_12, doc_18, ...]

AND (intersection): [doc_5, doc_12, ...]  ← documents containing both terms
OR (union):         [doc_2, doc_3, doc_5, doc_8, doc_9, doc_12, ...]

Score each matching document and return the top K results.

The intersection of two sorted posting lists can be computed in O(n + m) with a merge-like algorithm.

Early Termination

For queries that match millions of documents, scoring all of them is too slow. Techniques for early termination:

WAND (Weak AND): Skip documents that cannot possibly score high enough to enter the top K results. Maintain a threshold score; skip posting list entries whose maximum possible contribution is below the threshold.
Tiered index: Separate the index into tiers by document quality (PageRank). Search the high-quality tier first. If enough results are found, skip lower tiers.
Index pruning: Remove postings for very common terms in low-quality documents. Reduces index size and query time with minimal impact on result quality.

Index Sharding

With 600 TB of index data, a single machine cannot hold or serve the index. It must be distributed.

Document-Based Sharding

Partition documents across shards. Each shard holds the complete inverted index for a subset of documents.

Shard 1: documents 1 - 4,000,000 (complete inverted index for these docs)
Shard 2: documents 4,000,001 - 8,000,000
...
Shard N: documents ...

Query execution:
  1. Scatter query to all shards (in parallel)
  2. Each shard returns its top K results
  3. Merge results from all shards, re-rank, return global top K

Term-Based Sharding

Partition terms across shards. Each shard holds the complete posting list for a subset of terms.

1

2

3

4

5

6

7

Shard 1: terms "a" - "m" (all postings for these terms)
Shard 2: terms "n" - "z"

Query "python tutorial":
  1. "python" → Shard 2, "tutorial" → Shard 2 (same shard, lucky)
  2. Single-shard query if terms fall on the same shard
  3. Multi-shard query if terms span shards

Document-based sharding is preferred because every shard can independently score and rank documents. Term-based sharding requires cross-shard communication to score documents.

Replication

Each shard is replicated across multiple machines for fault tolerance and to handle high query throughput. A shard with 3 replicas can handle 3x the query load.

Crawler Integration

The search engine receives new and updated pages from the web crawler. The indexing pipeline processes these into the inverted index.

Crawler → Kafka (raw pages) → Document Processor → Index Builder → Index Servers

Document Processor:
  1. Parse HTML, extract text content
  2. Extract title, meta description, headings
  3. Extract outgoing links (for PageRank)
  4. Detect language
  5. Deduplicate (SimHash)

Index Builder:
  1. Tokenize and stem the text
  2. Build posting list entries
  3. Merge new entries into the existing index
  4. Recompute PageRank periodically (offline batch job)

Incremental Index Updates

Rebuilding the entire index for every new page would take days. Instead, maintain two indexes:

Base index: The full index, rebuilt periodically (e.g., weekly).
Delta index: A small, in-memory index of recently crawled pages. Updated in real-time.

At query time, search both indexes and merge results. Periodically, merge the delta index into the base index.

Serving Latency Optimization

Serving search results in under 500ms requires careful optimization at every layer.

Latency budget (500ms total):
  - Query parsing and planning: 10ms
  - Scatter to shards: 5ms (network)
  - Per-shard query execution: 200ms (the bottleneck)
  - Merge and re-rank: 20ms
  - Snippet generation: 50ms
  - Network return: 5ms
  - Buffer: 210ms

Key Optimizations

Keep the index in memory. Each shard should fit in RAM. Disk seeks add 10ms per posting list access; RAM access is microseconds.
SSD for overflow. If the full index does not fit in RAM, use SSDs. Avoid spinning disks entirely.
Caching: Cache results for popular queries. The same queries are searched thousands of times per second ("weather", "news", "facebook login").
Parallel shard queries: Scatter the query to all shards simultaneously. The total latency is the maximum shard latency, not the sum.
Tail latency management: With 12,000 shards, a single slow shard delays the entire query. Send redundant requests to replica shards and use the first response.

High-Level Architecture

             ┌──────────────┐
  Users ────>│ Load Balancer│
             └──────┬───────┘
                    │
               ┌────┴────┐
               │ Query   │
               │ Service │
               └────┬────┘
                    │ scatter
        ┌───────────┼───────────┐
        ▼           ▼           ▼
   ┌─────────┐ ┌─────────┐ ┌─────────┐
   │ Index   │ │ Index   │ │ Index   │
   │ Shard 1 │ │ Shard 2 │ │ Shard N │
   │(+replicas)│(+replicas)│(+replicas)│
   └─────────┘ └─────────┘ └─────────┘

Offline:
  Crawler → Document Processor → Index Builder → Push to Shards
  Web Graph → PageRank Computer → Score DB → Push to Shards

Interview Tips

Start with the inverted index. It is the single most important concept. Draw it out and explain posting lists, term frequency, and document frequency.
Explain BM25 over raw TF-IDF. Mention that TF-IDF is the conceptual foundation but BM25 is what production systems use due to term frequency saturation and length normalization.
Discuss PageRank at a high level. You do not need to derive the math. Explain the intuition (important pages are linked by important pages) and that it is precomputed offline.
Address the scatter-gather pattern for sharding. This is the key architectural pattern. Explain how queries are fanned out to all shards in parallel and results are merged.
Mention early termination. This separates senior-level answers from junior ones. Explain WAND or tiered indexing as techniques to avoid scoring every matching document.
Quantify latency budgets. Break down the 500ms budget across components. This shows you think about performance concretely, not abstractly.