Fundamentals
The two fundamental performance metrics. Learn how to reason about latency budgets, throughput bottlenecks, and the relationship between them.
Latency and throughput are the two fundamental performance metrics of any system. Every system design interview involves reasoning about one or both. Yet many candidates confuse them, oversimplify their relationship, or fail to use concrete numbers. Mastering these concepts — and being able to do quick back-of-the-envelope calculations — is one of the highest-leverage skills you can develop.
Latency is the time it takes for a single operation to complete, measured from the moment a request is sent to the moment the response arrives. It answers the question: "How long does the user wait?"
Throughput is the number of operations a system completes per unit of time. It answers the question: "How many requests can the system handle?"
These are related but distinct. A system can have low latency but low throughput (a single fast worker), or high throughput but high latency (a batch processing pipeline that takes hours but processes millions of records).
Latency: |──── request ────>|<──── response ────|
t=0 t=200ms
"This request took 200ms"
Throughput: ████████████████████ → 1000 requests/second
"This system handles 1000 QPS"Every request passes through multiple stages, and latency accumulates at each one:
Client ──► DNS ──► CDN/LB ──► App Server ──► Cache ──► Database
│ │ │ │ │ │
5ms 50ms 1ms 10ms 1ms 20ms
(network) (lookup) (routing) (compute) (hit) (query)
Total: ~87ms (simplified)The components:
Averages are misleading for latency. If 99 requests take 10ms and 1 request takes 10 seconds, the average is ~109ms — which describes nobody's experience. Use percentiles instead.
import statistics
def compute_percentiles(latencies):
latencies.sort()
n = len(latencies)
return {
"p50": latencies[n // 2],
"p95": latencies[int(n * 0.95)],
"p99": latencies[int(n * 0.99)],
"p999": latencies[int(n * 0.999)],
}
# Example: most requests fast, some slow
latencies = [10] * 950 + [100] * 40 + [1000] * 9 + [5000] * 1 # ms
print(compute_percentiles(latencies))
# {'p50': 10, 'p95': 100, 'p99': 1000, 'p999': 5000}In a microservices architecture, a single user request might fan out to dozens of backend services. If each service has a P99 of 100ms, and you call 20 services in parallel, the probability that at least one takes 100ms+ is: 1 - (0.99)^20 = 18%. Your P99 latency at the edge is dramatically worse than any individual service's P99.
This is the tail-at-scale problem (described by Jeff Dean and Luiz Barroso at Google). It is why large-scale systems obsess over tail latency, not average latency.
A latency budget breaks down your total allowed latency into allocations for each component. If your SLA says page load must complete in 500ms:
Total budget: 500ms
├── Client rendering: 100ms
├── Network (user to LB): 50ms
├── Load balancer: 5ms
├── Application logic: 50ms
├── Cache lookup: 5ms
├── Database query: 30ms
├── External API call: 100ms
├── Network (LB to user): 50ms
└── Buffer for spikes: 110msDefining budgets forces you to identify bottlenecks before they surprise you in production. If your database query already takes 200ms, you know the 500ms SLA is impossible without caching or query optimization.
Throughput is typically measured in:
A system's throughput is determined by its bottleneck — the slowest component. This is the application of Amdahl's Law to system design:
Web Server (10,000 QPS) ──► App Server (5,000 QPS) ──► Database (1,000 QPS)
System throughput: 1,000 QPS (limited by database)Scaling the web server or app server does not help until you address the database bottleneck. This is why identifying the bottleneck is always the first step.
Latency and throughput are not independent. As throughput approaches a system's capacity, latency increases dramatically — this follows queueing theory.
Latency ▲
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱──── Latency explodes as
│ ╱ utilization approaches 100%
│ ╱
│ ╱
│ ╱
│ ╱
└────────────────────────────────────► Throughput
0% 100% utilizationA rule of thumb: keep systems at 60-70% utilization to maintain reasonable latency. At 90%+ utilization, queueing delays dominate and latency becomes unpredictable.
Batching is the classic technique to increase throughput at the cost of latency. Instead of processing items one by one, you wait until you have a batch and process them together.
# Without batching: low latency, lower throughput
def process_individually(items):
for item in items:
db.insert(item) # One round trip per item
# With batching: higher latency for first item, much higher throughput
def process_batch(items, batch_size=100):
batch = []
for item in items:
batch.append(item)
if len(batch) >= batch_size:
db.batch_insert(batch) # One round trip per 100 items
batch = []
if batch:
db.batch_insert(batch)Kafka is a great example of this trade-off in practice. Producers can be configured with `linger.ms` — how long to wait for more messages before sending a batch. Higher values mean more latency but better throughput.
Little's Law is an elegant relationship that connects three quantities:
*L = lambda W**
This is incredibly useful for capacity planning:
# Example: How many concurrent connections do we need?
throughput = 1000 # requests per second
latency = 0.2 # seconds (200ms average)
concurrency = throughput * latency # = 200 concurrent requests
# So your server needs to handle at least 200 concurrent connections
# If each connection uses a thread, you need at least 200 threads
# If each thread uses 1MB of stack, that is 200MB just for thread stacksAnother application: if your system handles 10,000 QPS with an average latency of 50ms, there are on average 500 requests in flight at any time. If latency doubles to 100ms, you will have 1,000 requests in flight — double the memory, connections, and threads.
These are the numbers every system designer should have memorized. They come from Jeff Dean's famous talk and have been updated for modern hardware:
┌──────────────────────────────────────┬──────────────┐
│ Operation │ Latency │
├──────────────────────────────────────┼──────────────┤
│ L1 cache reference │ 0.5 ns │
│ L2 cache reference │ 7 ns │
│ Main memory reference │ 100 ns │
│ SSD random read │ 16 μs │
│ Read 1 MB sequentially from memory │ 3 μs │
│ Read 1 MB sequentially from SSD │ 49 μs │
│ Read 1 MB sequentially from disk │ 825 μs │
│ Disk seek │ 2 ms │
│ Send packet CA → Netherlands → CA │ 150 ms │
│ Round trip within same datacenter │ 0.5 ms │
│ Round trip within same region │ ~1-5 ms │
│ Round trip cross-continent │ 50-150 ms │
│ Mutex lock/unlock │ 17 ns │
│ Redis GET (local datacenter) │ 0.1-0.5 ms │
│ Simple database query (indexed) │ 1-5 ms │
│ Complex database query │ 10-100 ms │
│ TLS handshake │ 2-10 ms │
│ DNS lookup (uncached) │ 20-120 ms │
└──────────────────────────────────────┴──────────────┘Using these numbers, you can quickly estimate system performance:
# Example: How long to read and process a 10 MB image?
# From SSD: 10 * 49μs = 490μs ≈ 0.5ms
# From HDD: 10 * 825μs = 8.25ms
# From memory: 10 * 3μs = 30μs
# Network transfer (1 Gbps link): 10MB / 125MB/s = 80ms
# Network transfer (10 Gbps link): 10MB / 1.25GB/s = 8ms
# So for a user uploading a 10MB image:
# Upload time (100 Mbps user connection): 10MB / 12.5MB/s = 800ms
# Server processing (resize, thumbnail): ~200ms
# Write to S3: ~50ms
# Total: ~1050ms# Example: Can this database handle our load?
# Requirements: 50,000 QPS, p99 < 20ms
# Single PostgreSQL: ~5,000-20,000 QPS for simple queries
# With connection pooling and read replicas:
# 1 primary + 4 replicas = 5 * 10,000 = 50,000 QPS (reads)
# Writes still limited to primary: ~10,000 QPS
# Redis cache for hot data: 100,000+ QPS per instance
# Strategy: Cache reads (90% hit rate) + 4 read replicas for cache misses
# 50,000 QPS → 45,000 from cache + 5,000 from DB → feasibleCommon techniques for reducing latency: