← back

Classic Designs

Search Autocomplete

Design a typeahead system like Google search suggestions. Covers trie data structures, top-K queries, prefix caching, and real-time ranking.

Designing a Search Autocomplete System

A search autocomplete (typeahead) system provides real-time suggestions as a user types a query. Think of Google's search bar: after typing "how to", you instantly see suggestions like "how to tie a tie" and "how to boil eggs." This question tests your knowledge of trie data structures, top-K algorithms, caching strategies, and the balance between precomputation and real-time serving.

Requirements

Functional Requirements

  • As the user types each character, return the top 5-10 matching suggestions.
  • Suggestions are ranked by popularity (frequency of past queries).
  • Support personalization (optional): boost suggestions based on the user's history.
  • Handle multi-word queries, not just single-word prefixes.

Non-Functional Requirements

  • Latency: Suggestions must appear within 100ms of the user typing a character. Anything slower feels laggy.
  • Availability: The system should be highly available. A failed autocomplete is worse than no autocomplete (it breaks user expectations).
  • Scalability: Handle 100,000+ queries per second (Google scale).
  • Freshness: Trending queries (e.g., breaking news) should appear in suggestions within minutes, not days.

Capacity Estimation

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Assumptions:
  - 5 billion searches per day (Google scale)
  - Average query length: 20 characters → 20 autocomplete requests per query
  - 5B x 20 = 100 billion autocomplete requests/day
  - 100B / 86400 ≈ 1.16 million requests/sec

Unique queries:
  - ~500 million unique queries per day
  - Store top 10 million most popular queries (covers 99%+ of autocomplete hits)
  - Average query: 20 bytes → 10M x 20 = 200 MB (fits easily in memory)

Cache:
  - Unique prefixes for top queries: ~50 million prefixes
  - Each prefix maps to 10 suggestions x 20 bytes = 200 bytes
  - 50M x 200 = 10 GB (fits in a Redis cluster)

The Trie Data Structure

A trie (prefix tree) is the natural data structure for autocomplete. Each node represents a character, and paths from root to leaf represent complete strings. The key property: all strings sharing a common prefix share the same path from the root.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
Trie for queries: "tree", "try", "true", "toy", "top"

        (root)
       /      \
      t        ...
     / \
    r    o
   / \    \
  e   u    y  p
  |   |
  e   e

Each node stores:
  - Character
  - Is this node the end of a valid query?
  - Frequency count (how often this query was searched)
  - Top-K suggestions for this prefix (precomputed)

Storing Top-K at Each Node

The naive approach to autocomplete with a trie: given a prefix, traverse to the prefix node, then explore all subtrees to find the most popular completions. This is O(total characters in all matching queries) -- far too slow.

Optimization: At each node, precompute and store the top K suggestions for that prefix. When the user types "ho", you simply look up the node for "ho" and return its precomputed list.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False
        self.frequency = 0
        self.top_suggestions = []  # precomputed top-K

class AutocompleteTrie:
    def __init__(self, k=10):
        self.root = TrieNode()
        self.k = k

    def insert(self, query, frequency):
        node = self.root
        for char in query:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
            # Update top-K suggestions at every prefix node
            self._update_top_k(node, query, frequency)
        node.is_end = True
        node.frequency = frequency

    def _update_top_k(self, node, query, frequency):
        # Check if query is already in top suggestions
        for i, (q, f) in enumerate(node.top_suggestions):
            if q == query:
                node.top_suggestions[i] = (query, frequency)
                node.top_suggestions.sort(key=lambda x: -x[1])
                return
        # Add if we have room or if this query is more popular
        if len(node.top_suggestions) < self.k:
            node.top_suggestions.append((query, frequency))
            node.top_suggestions.sort(key=lambda x: -x[1])
        elif frequency > node.top_suggestions[-1][1]:
            node.top_suggestions[-1] = (query, frequency)
            node.top_suggestions.sort(key=lambda x: -x[1])

    def search(self, prefix):
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        return node.top_suggestions

Trade-off: This uses more memory (each node stores K suggestions) but makes queries O(prefix length) -- constant time regardless of how many completions exist.

Precomputation vs Real-Time

Offline Precomputation (Primary Path)

Query frequencies change slowly for most queries. The trie can be rebuilt periodically (e.g., every few hours) from aggregated query logs.

1
2
3
4
5
6
7
8
9
10
11
12
13
Data Collection Pipeline:

  Search Logs → Kafka → Aggregation Service → Query Frequency DB
                                                      │
                                                      ▼
                                              Trie Builder (offline)
                                                      │
                                                      ▼
                                              Trie Snapshot (serialized)
                                                      │
                                                      ▼
                                              Autocomplete Servers
                                              (load new trie into memory)

The aggregation service maintains a rolling window (e.g., last 7 days) of query frequencies, applying exponential decay so that recent queries are weighted more heavily.

Real-Time Trending (Secondary Path)

For breaking news and trending topics, waiting hours is too long. A separate real-time pipeline detects query spikes and injects them into the autocomplete system.

1
2
3
4
5
Search Logs → Kafka → Trend Detector → Trending Queries Cache
                                                │
                                                ▼
                                        Autocomplete Servers
                                        (merge trending with trie results)

The trend detector uses a sliding window counter (e.g., count queries in the last 15 minutes) and flags queries whose frequency exceeds a threshold relative to their historical baseline.

Caching by Prefix

Since autocomplete is extremely read-heavy and the same prefixes are queried millions of times, aggressive caching is essential.

Server-Side Cache

1
2
3
4
5
Prefix → Top-K Results Cache (Redis)

"how t"  → ["how to tie a tie", "how to boil eggs", ...]
"weath"  → ["weather today", "weather tomorrow", ...]
"pytho"  → ["python tutorial", "python download", ...]

Cache entries are keyed by the exact prefix string. With 50 million unique prefixes and 10 GB of cache, this covers the vast majority of requests.

Cache invalidation: When the trie is rebuilt, invalidate the entire cache (or use a versioned cache where each trie version has its own namespace).

Client-Side Cache

The browser or mobile app can cache results locally. When the user types "how", the client receives suggestions. When they type "how ", the client can check if it already has results for "how " from a previous keystroke's response that included child prefixes. This eliminates network round trips for many keystrokes.

Request Coalescing

Users type fast. By the time the response for "ho" arrives, the user may have already typed "how t". The client should:

  1. Debounce requests: Wait 50-100ms after the last keystroke before sending a request. This avoids sending requests for every single character.
  2. Cancel stale requests: If a new keystroke arrives, cancel the pending request for the previous prefix.
  3. Only display the latest: If responses arrive out of order, only display the response for the current prefix.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# Client-side pseudocode
class AutocompleteClient:
    def __init__(self):
        self.cache = {}
        self.pending_request = None
        self.debounce_timer = None

    def on_keystroke(self, prefix):
        # Check local cache first
        if prefix in self.cache:
            display(self.cache[prefix])
            return

        # Cancel pending request
        if self.pending_request:
            self.pending_request.cancel()

        # Debounce: wait 50ms before sending
        if self.debounce_timer:
            self.debounce_timer.cancel()

        self.debounce_timer = set_timer(50, lambda: self._send_request(prefix))

    def _send_request(self, prefix):
        self.pending_request = async_get(f"/autocomplete?q={prefix}")
        self.pending_request.then(lambda results: self._handle_response(prefix, results))

    def _handle_response(self, prefix, results):
        self.cache[prefix] = results
        display(results)

Data Collection Pipeline

The quality of autocomplete depends entirely on the quality of query frequency data. The pipeline must collect, clean, and aggregate search queries.

Collection

Log every search query with a timestamp. At Google scale, this is billions of events per day flowing into Kafka.

Cleaning

  • Remove queries with sensitive or offensive content (using a blocklist and ML classifier).
  • Remove bot traffic (identify by IP patterns, query rate, user agent).
  • Normalize queries: lowercase, trim whitespace, correct obvious typos.
  • Remove queries with personal information (emails, phone numbers, SSNs).

Aggregation

1
2
3
4
5
6
7
Raw logs:
  "python tutorial"  x 50,000 (today)
  "python tutorial"  x 45,000 (yesterday)
  "python tutorial"  x 40,000 (2 days ago)

Weighted frequency (exponential decay, half-life = 7 days):
  Score = 50000 * 1.0 + 45000 * 0.91 + 40000 * 0.82 + ...

Ranking and Personalization

Popularity-Based Ranking (Default)

The simplest ranking: sort suggestions by query frequency. This is what the trie stores. It works well for most users most of the time.

Personalized Ranking

For logged-in users, boost suggestions based on their search history. If a user frequently searches for "python documentation", then when they type "py", "python documentation" should rank higher for them than for an average user.

1
2
3
Final Score = alpha * global_popularity + (1 - alpha) * personal_relevance

Where alpha = 0.7 (global popularity still dominates)

Implementation: Store each user's recent queries in a lightweight profile. At query time, merge the global top-K from the trie with the user's personal top-K, re-rank, and return. This merge is fast because both lists are small (K=10).

Context-Aware Ranking

Additional signals for ranking:

  • Recency: Boost trending queries. A query that spiked in the last hour ranks higher than one with the same total frequency.
  • Location: "restaurants near me" should suggest local results. Use the user's IP or GPS to bias geographically relevant suggestions.
  • Language: Prefer suggestions in the user's language.
  • Time of day: "breakfast" ranks higher in the morning; "dinner" ranks higher in the evening.

High-Level Architecture

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
                    ┌──────────────┐
   User Query ────> │ Load Balancer│
                    └──────┬───────┘
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │ AC Svc 1 │ │ AC Svc 2 │ │ AC Svc N │
        │(trie in  │ │(trie in  │ │(trie in  │
        │ memory)  │ │ memory)  │ │ memory)  │
        └────┬─────┘ └────┬─────┘ └────┬─────┘
             │             │             │
             └─────────────┼─────────────┘
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
       ┌────────────┐ ┌──────────┐ ┌────────────┐
       │ Redis Cache│ │ Trending │ │ User       │
       │(prefix→top │ │ Queries  │ │ Profiles   │
       │ results)   │ │ Cache    │ │ (personal) │
       └────────────┘ └──────────┘ └────────────┘

  Offline Pipeline:
  Search Logs → Kafka → Aggregator → Trie Builder → Distribute to AC Servers

Trie Distribution

Each autocomplete server holds a complete copy of the trie in memory (~200 MB for top 10M queries). When the trie is rebuilt:

  1. The Trie Builder serializes the new trie to a file (e.g., stored in S3).
  2. Each AC server downloads the new trie snapshot and swaps it in atomically (double-buffering: load the new trie into memory, then swap the pointer).
  3. No downtime. Old requests complete on the old trie, new requests use the new trie.

For larger datasets, shard the trie by prefix range. Server group A handles prefixes a-m, server group B handles n-z. A routing layer directs each request to the correct shard.

Interview Tips

  • Draw the trie and explain the top-K optimization. The precomputed top-K at each node is the key insight. Without it, the naive approach is too slow.
  • Separate the offline pipeline from the serving path. The trie is built offline; the servers just do lookups. This separation is what makes the system fast and maintainable.
  • Discuss client-side optimizations. Debouncing, request cancellation, and client-side caching are practical details that show real-world experience.
  • Address freshness vs latency. Explain the two-path approach: offline trie for most queries, real-time trending overlay for breaking events.
  • Mention filtering offensive content. Google autocomplete famously filters out offensive, defamatory, and dangerous suggestions. Mention this as a product requirement.
  • Quantify memory usage. 10 million queries at 20 bytes each is 200 MB. The top-K metadata adds overhead, but the entire trie fits in memory on a single server. This is a key insight that simplifies the architecture.