Classic Designs
Design a web crawler like Googlebot. Covers URL frontier, politeness policies, distributed crawling, deduplication, and handling traps.
A web crawler (like Googlebot) systematically browses the internet, downloading pages so they can be indexed by a search engine. This question tests your understanding of graph traversal at internet scale, distributed systems coordination, politeness constraints, and deduplication. The challenge is not just fetching pages -- it is doing so efficiently, politely, and without getting stuck in traps.
Assumptions:
- Target: 1 billion pages per month
- Average page size: 500 KB (HTML + headers)
- Crawl rate: 1B / (30 x 24 x 3600) ~ 385 pages/sec
Storage:
- Raw HTML: 1B x 500 KB = 500 TB/month
- Compressed (5:1): ~100 TB/month
- With 6 months retention: ~600 TB
Bandwidth:
- 385 pages/sec x 500 KB = 192 MB/sec ~ 1.5 Gbps sustainedThe URL frontier is the heart of a web crawler. It is the data structure that holds the URLs to be crawled and determines the order in which they are fetched. A well-designed frontier must balance two competing goals: prioritization (crawl important pages first) and politeness (do not hammer any single host).
Not all URLs are equal. The homepage of a major news site is more valuable than a deeply nested forum thread. The frontier should prioritize URLs based on:
URL Frontier Architecture:
Incoming URLs
│
▼
┌──────────────┐
│ Prioritizer │ ← Assigns priority score based on PageRank, freshness, depth
└──────┬───────┘
│
▼
┌──────────────────────────────────────┐
│ Priority Queues (multiple buckets) │
│ ┌─────────┐ ┌─────────┐ ┌────────┐ │
│ │ High │ │ Medium │ │ Low │ │
│ │ Priority│ │ Priority│ │ Priority│ │
│ └────┬────┘ └────┬────┘ └───┬────┘ │
└───────┼───────────┼──────────┼──────┘
│ │ │
▼ ▼ ▼
┌──────────────────────────────────────┐
│ Politeness Enforcer │
│ (per-host queues + rate limiter) │
└──────────────────────────────────────┘
│
▼
Fetch WorkersThe politeness layer ensures that no single host receives more than one request at a time (or more than one request per crawl-delay interval). Each host gets its own FIFO queue. A worker thread is assigned to a host queue, fetches one URL, waits for the politeness delay, then fetches the next.
class PolitenessEnforcer:
def __init__(self):
self.host_queues = {} # host -> deque of URLs
self.last_fetch_time = {} # host -> timestamp
self.crawl_delays = {} # host -> delay in seconds (from robots.txt)
def add_url(self, url):
host = extract_host(url)
if host not in self.host_queues:
self.host_queues[host] = deque()
self.host_queues[host].append(url)
def get_next_url(self, host):
delay = self.crawl_delays.get(host, 1.0) # default 1 second
last = self.last_fetch_time.get(host, 0)
if time.time() - last < delay:
return None # too soon, try another host
url = self.host_queues[host].popleft()
self.last_fetch_time[host] = time.time()
return urlBFS explores all links at the current depth before moving deeper. Starting from seed URLs, it first crawls all pages linked from the seeds, then all pages linked from those, and so on.
Pros: Discovers important pages (close to seeds) quickly. Natural priority -- shallow pages tend to be more important. Cons: Requires storing a large frontier in memory. At internet scale, the frontier can grow to billions of URLs.
DFS follows a single path of links as deep as possible before backtracking.
Pros: Lower memory usage (only need to store the current path). Cons: Can get trapped in deep link chains or spider traps. Misses important shallow pages.
In practice, use BFS with priority adjustments. Pure BFS is the baseline, but the priority queue allows you to dynamically reorder the frontier so that high-value URLs jump ahead regardless of their depth.
Every well-behaved crawler must respect the `robots.txt` file at the root of each domain. It specifies which paths are allowed or disallowed for specific user agents.
# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 2
User-agent: Googlebot
Allow: /admin/public/
Disallow: /admin/Implementation: Before crawling any URL on a new domain, fetch and parse its robots.txt. Cache the parsed rules (they rarely change). Check every URL against the rules before adding it to the frontier.
from urllib.robotparser import RobotFileParser
class RobotsCache:
def __init__(self):
self.cache = {} # host -> RobotFileParser
def is_allowed(self, url, user_agent="MyCrawler"):
host = extract_host(url)
if host not in self.cache:
rp = RobotFileParser()
rp.set_url(f"https://{host}/robots.txt")
rp.read()
self.cache[host] = rp
return self.cache[host].can_fetch(user_agent, url)At internet scale, the same content can live at many different URLs. Without deduplication, the crawler wastes bandwidth re-fetching identical pages.
Before adding a URL to the frontier, normalize it to a canonical form:
`HTTP://Example.COM` becomes `http://example.com`.`http://example.com:80/` becomes `http://example.com/`.`http://example.com/page#section` becomes `http://example.com/page`.`?b=2&a=1` becomes `?a=1&b=2`.`/a/b/../c` becomes `/a/c`.Maintain a set of all URLs that have been added to the frontier. Before adding a new URL, check this set. At billions of URLs, this set is too large for memory. Options:
Two different URLs can serve identical or near-identical content (e.g., `?sort=asc` vs `?sort=desc` on a product listing). URL normalization cannot catch this. Instead, compute a fingerprint of the page content.
SimHash is a locality-sensitive hash designed by Google specifically for web deduplication. Unlike cryptographic hashes (where a single changed character produces a completely different hash), SimHash produces similar hashes for similar documents.
def simhash(document):
"""Simplified SimHash implementation."""
tokens = tokenize(document)
vector = [0] * 64 # 64-bit hash
for token in tokens:
token_hash = hash_64bit(token)
for i in range(64):
if token_hash & (1 << i):
vector[i] += 1
else:
vector[i] -= 1
fingerprint = 0
for i in range(64):
if vector[i] > 0:
fingerprint |= (1 << i)
return fingerprint
# Two documents are near-duplicates if their SimHashes differ
# in fewer than k bits (Hamming distance < k, typically k=3)A single machine cannot crawl billions of pages. The crawler must be distributed across hundreds or thousands of workers.
┌────────────────────┐
│ Seed URLs / URL │
│ Frontier Service │
└────────┬───────────┘
│
┌──────────────┼──────────────┐
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Crawler │ │ Crawler │ │ Crawler │
│ Worker 1 │ │ Worker 2 │ │ Worker N │
└─────┬──────┘ └─────┬──────┘ └─────┬──────┘
│ │ │
└───────────────┼───────────────┘
│
┌──────────────┼──────────────┐
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ DNS Cache │ │Content Store│ │ Link │
│ (local) │ │ (S3/HDFS) │ │ Extractor│
└────────────┘ └────────────┘ └────────────┘Assign each worker a set of domains (using consistent hashing on the domain name). This ensures that all URLs for a given domain are fetched by the same worker, making politeness enforcement straightforward. If a worker fails, its domains are redistributed to other workers.
DNS resolution is expensive (50-200ms per lookup) and is required for every new domain. At crawl scale, DNS becomes a major bottleneck.
The web is adversarial. Some sites deliberately or accidentally create infinite URL spaces.
`/calendar/2024/01/01`, `/calendar/2024/01/02`, etc., going infinitely into the future or past.`/page?sid=abc123`, `/page?sid=def456`.class TrapDetector:
def __init__(self, max_depth=20, max_per_domain=50000):
self.max_depth = max_depth
self.max_per_domain = max_per_domain
self.domain_counts = {}
def should_crawl(self, url, depth):
domain = extract_domain(url)
count = self.domain_counts.get(domain, 0)
if depth > self.max_depth:
return False
if count >= self.max_per_domain:
return False
self.domain_counts[domain] = count + 1
return TrueThe web is not static. Pages change, new pages appear, and old pages disappear. A crawler must re-crawl pages to keep the index fresh.
`sitemap.xml` files that list all pages and their last-modified dates. Use these to prioritize re-crawls.`If-Modified-Since` or `If-None-Match` headers. The server responds with 304 Not Modified if the page has not changed, saving bandwidth.