Classic Designs

Web Crawler

Design a web crawler like Googlebot. Covers URL frontier, politeness policies, distributed crawling, deduplication, and handling traps.

Designing a Web Crawler

A web crawler (like Googlebot) systematically browses the internet, downloading pages so they can be indexed by a search engine. This question tests your understanding of graph traversal at internet scale, distributed systems coordination, politeness constraints, and deduplication. The challenge is not just fetching pages -- it is doing so efficiently, politely, and without getting stuck in traps.

Requirements

Functional Requirements

Given a set of seed URLs, crawl the web by following hyperlinks.
Download and store the content of each page.
Extract links from downloaded pages and add new URLs to the crawl frontier.
Respect robots.txt rules and crawl-delay directives.
Avoid crawling duplicate content.

Non-Functional Requirements

Scalability: Crawl billions of pages across hundreds of millions of domains.
Politeness: Do not overwhelm any single web server. Enforce per-domain rate limits.
Robustness: Handle malformed HTML, infinite loops, spider traps, slow servers, and network failures.
Freshness: Re-crawl pages periodically to keep the index up to date.
Extensibility: Support new content types (images, PDFs) and custom crawl policies.

Capacity Estimation

Assumptions:
  - Target: 1 billion pages per month
  - Average page size: 500 KB (HTML + headers)
  - Crawl rate: 1B / (30 x 24 x 3600) ~ 385 pages/sec

Storage:
  - Raw HTML: 1B x 500 KB = 500 TB/month
  - Compressed (5:1): ~100 TB/month
  - With 6 months retention: ~600 TB

Bandwidth:
  - 385 pages/sec x 500 KB = 192 MB/sec ~ 1.5 Gbps sustained

URL Frontier

The URL frontier is the heart of a web crawler. It is the data structure that holds the URLs to be crawled and determines the order in which they are fetched. A well-designed frontier must balance two competing goals: prioritization (crawl important pages first) and politeness (do not hammer any single host).

Priority Queue

Not all URLs are equal. The homepage of a major news site is more valuable than a deeply nested forum thread. The frontier should prioritize URLs based on:

PageRank or domain authority: Higher-ranked domains get crawled first.
Freshness: Pages that change frequently (news sites) should be re-crawled sooner.
Depth: Pages closer to seed URLs are generally more important.

URL Frontier Architecture:

  Incoming URLs
       │
       ▼
  ┌──────────────┐
  │  Prioritizer │  ← Assigns priority score based on PageRank, freshness, depth
  └──────┬───────┘
         │
         ▼
  ┌──────────────────────────────────────┐
  │  Priority Queues (multiple buckets)  │
  │  ┌─────────┐ ┌─────────┐ ┌────────┐ │
  │  │ High    │ │ Medium  │ │ Low    │ │
  │  │ Priority│ │ Priority│ │ Priority│ │
  │  └────┬────┘ └────┬────┘ └───┬────┘ │
  └───────┼───────────┼──────────┼──────┘
          │           │          │
          ▼           ▼          ▼
  ┌──────────────────────────────────────┐
  │  Politeness Enforcer                 │
  │  (per-host queues + rate limiter)    │
  └──────────────────────────────────────┘
         │
         ▼
    Fetch Workers

Politeness: Per-Host Queues

The politeness layer ensures that no single host receives more than one request at a time (or more than one request per crawl-delay interval). Each host gets its own FIFO queue. A worker thread is assigned to a host queue, fetches one URL, waits for the politeness delay, then fetches the next.

class PolitenessEnforcer:
    def __init__(self):
        self.host_queues = {}       # host -> deque of URLs
        self.last_fetch_time = {}   # host -> timestamp
        self.crawl_delays = {}      # host -> delay in seconds (from robots.txt)

    def add_url(self, url):
        host = extract_host(url)
        if host not in self.host_queues:
            self.host_queues[host] = deque()
        self.host_queues[host].append(url)

    def get_next_url(self, host):
        delay = self.crawl_delays.get(host, 1.0)  # default 1 second
        last = self.last_fetch_time.get(host, 0)
        if time.time() - last < delay:
            return None  # too soon, try another host
        url = self.host_queues[host].popleft()
        self.last_fetch_time[host] = time.time()
        return url

BFS vs DFS Crawling

BFS (Breadth-First Search)

BFS explores all links at the current depth before moving deeper. Starting from seed URLs, it first crawls all pages linked from the seeds, then all pages linked from those, and so on.

Pros: Discovers important pages (close to seeds) quickly. Natural priority -- shallow pages tend to be more important. Cons: Requires storing a large frontier in memory. At internet scale, the frontier can grow to billions of URLs.

DFS (Depth-First Search)

DFS follows a single path of links as deep as possible before backtracking.

Pros: Lower memory usage (only need to store the current path). Cons: Can get trapped in deep link chains or spider traps. Misses important shallow pages.

In practice, use BFS with priority adjustments. Pure BFS is the baseline, but the priority queue allows you to dynamically reorder the frontier so that high-value URLs jump ahead regardless of their depth.

Robots.txt

Every well-behaved crawler must respect the `robots.txt` file at the root of each domain. It specifies which paths are allowed or disallowed for specific user agents.

# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 2

User-agent: Googlebot
Allow: /admin/public/
Disallow: /admin/

Implementation: Before crawling any URL on a new domain, fetch and parse its robots.txt. Cache the parsed rules (they rarely change). Check every URL against the rules before adding it to the frontier.

from urllib.robotparser import RobotFileParser

class RobotsCache:
    def __init__(self):
        self.cache = {}  # host -> RobotFileParser

    def is_allowed(self, url, user_agent="MyCrawler"):
        host = extract_host(url)
        if host not in self.cache:
            rp = RobotFileParser()
            rp.set_url(f"https://{host}/robots.txt")
            rp.read()
            self.cache[host] = rp
        return self.cache[host].can_fetch(user_agent, url)

URL Deduplication

At internet scale, the same content can live at many different URLs. Without deduplication, the crawler wastes bandwidth re-fetching identical pages.

URL Normalization

Before adding a URL to the frontier, normalize it to a canonical form:

Convert the scheme and host to lowercase: `HTTP://Example.COM` becomes `http://example.com`.
Remove default ports: `http://example.com:80/` becomes `http://example.com/`.
Remove fragment identifiers: `http://example.com/page#section` becomes `http://example.com/page`.
Sort query parameters: `?b=2&a=1` becomes `?a=1&b=2`.
Remove trailing slashes consistently.
Resolve relative paths: `/a/b/../c` becomes `/a/c`.

Seen-URL Set

Maintain a set of all URLs that have been added to the frontier. Before adding a new URL, check this set. At billions of URLs, this set is too large for memory. Options:

Bloom filter: A probabilistic set with no false negatives. Uses about 10 bits per URL. For 10 billion URLs, that is about 12 GB -- fits in memory. A small false positive rate (1%) means you occasionally skip a URL you have not seen, which is acceptable.
Disk-backed hash table: Store URL hashes on disk (e.g., RocksDB). Slower but exact.

Content Fingerprinting with SimHash

Two different URLs can serve identical or near-identical content (e.g., `?sort=asc` vs `?sort=desc` on a product listing). URL normalization cannot catch this. Instead, compute a fingerprint of the page content.

SimHash is a locality-sensitive hash designed by Google specifically for web deduplication. Unlike cryptographic hashes (where a single changed character produces a completely different hash), SimHash produces similar hashes for similar documents.

def simhash(document):
    """Simplified SimHash implementation."""
    tokens = tokenize(document)
    vector = [0] * 64  # 64-bit hash

    for token in tokens:
        token_hash = hash_64bit(token)
        for i in range(64):
            if token_hash & (1 << i):
                vector[i] += 1
            else:
                vector[i] -= 1

    fingerprint = 0
    for i in range(64):
        if vector[i] > 0:
            fingerprint |= (1 << i)
    return fingerprint

# Two documents are near-duplicates if their SimHashes differ
# in fewer than k bits (Hamming distance < k, typically k=3)

Distributed Architecture

A single machine cannot crawl billions of pages. The crawler must be distributed across hundreds or thousands of workers.

                    ┌────────────────────┐
                    │  Seed URLs / URL   │
                    │  Frontier Service  │
                    └────────┬───────────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
       ┌────────────┐ ┌────────────┐ ┌────────────┐
       │  Crawler   │ │  Crawler   │ │  Crawler   │
       │  Worker 1  │ │  Worker 2  │ │  Worker N  │
       └─────┬──────┘ └─────┬──────┘ └─────┬──────┘
             │               │               │
             └───────────────┼───────────────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
       ┌────────────┐ ┌────────────┐ ┌────────────┐
       │  DNS Cache │ │Content Store│ │  Link     │
       │  (local)   │ │  (S3/HDFS) │ │  Extractor│
       └────────────┘ └────────────┘ └────────────┘

Domain-Based Partitioning

Assign each worker a set of domains (using consistent hashing on the domain name). This ensures that all URLs for a given domain are fetched by the same worker, making politeness enforcement straightforward. If a worker fails, its domains are redistributed to other workers.

DNS Caching

DNS resolution is expensive (50-200ms per lookup) and is required for every new domain. At crawl scale, DNS becomes a major bottleneck.

Local DNS cache: Each worker maintains a cache of DNS resolutions with a TTL.
Shared DNS cache: A dedicated DNS caching service (like a local BIND resolver or a shared Redis cache) serves all workers.
Pre-resolution: When adding URLs to the frontier, batch-resolve their DNS entries and store the IP addresses alongside the URLs.

Handling Traps and Infinite Loops

The web is adversarial. Some sites deliberately or accidentally create infinite URL spaces.

Spider Traps

Calendar traps: A dynamic calendar that generates URLs like `/calendar/2024/01/01`, `/calendar/2024/01/02`, etc., going infinitely into the future or past.
Session ID traps: URLs that include unique session IDs, creating infinite variations: `/page?sid=abc123`, `/page?sid=def456`.
Soft 404s: Pages that return HTTP 200 but contain "Page not found" content, often with links to other pages that also produce soft 404s.

Mitigation Strategies

Maximum depth limit: Do not follow links beyond depth N from any seed URL. Typically N=15-20.
Maximum pages per domain: Cap the number of pages crawled from any single domain per crawl cycle.
URL pattern detection: If the crawler sees many URLs with the same path structure but varying parameters, it likely hit a trap. Limit the number of URLs matching each pattern.
Page content similarity: If the last K pages from a domain all have the same SimHash, stop crawling that path.
Blacklisting: Maintain a list of known trap domains.

class TrapDetector:
    def __init__(self, max_depth=20, max_per_domain=50000):
        self.max_depth = max_depth
        self.max_per_domain = max_per_domain
        self.domain_counts = {}

    def should_crawl(self, url, depth):
        domain = extract_domain(url)
        count = self.domain_counts.get(domain, 0)

        if depth > self.max_depth:
            return False
        if count >= self.max_per_domain:
            return False

        self.domain_counts[domain] = count + 1
        return True

Re-Crawl Strategy

The web is not static. Pages change, new pages appear, and old pages disappear. A crawler must re-crawl pages to keep the index fresh.

Exponential backoff: If a page has not changed in the last N crawls, double the interval between re-crawls. If it changes frequently, crawl it more often.
Sitemap-driven: Many sites provide `sitemap.xml` files that list all pages and their last-modified dates. Use these to prioritize re-crawls.
HTTP conditional requests: Use `If-Modified-Since` or `If-None-Match` headers. The server responds with 304 Not Modified if the page has not changed, saving bandwidth.

Interview Tips

Start with the URL frontier. It is the most interesting and complex component. Explain the two-level architecture: priority queues for importance, per-host queues for politeness.
Discuss deduplication at both the URL and content level. URL normalization catches syntactic duplicates; SimHash catches semantic duplicates. Mention Bloom filters for the seen-URL set.
Mention robots.txt early. Interviewers want to see that you respect web standards and think about the ethics of crawling.
Address distributed coordination. Explain how you partition domains across workers and handle worker failures. Consistent hashing is the standard answer.
Talk about traps proactively. This shows you have practical experience or deep understanding of the problem. Maximum depth, per-domain caps, and URL pattern detection are the key defenses.
Quantify the DNS bottleneck. Mention that without caching, DNS resolution alone would limit you to 5-20 pages/sec per worker. Local caching brings that up to thousands.