← back

Fundamentals

Latency and Throughput

The two fundamental performance metrics. Learn how to reason about latency budgets, throughput bottlenecks, and the relationship between them.

Latency and Throughput

Latency and throughput are the two fundamental performance metrics of any system. Every system design interview involves reasoning about one or both. Yet many candidates confuse them, oversimplify their relationship, or fail to use concrete numbers. Mastering these concepts — and being able to do quick back-of-the-envelope calculations — is one of the highest-leverage skills you can develop.

Definitions

Latency is the time it takes for a single operation to complete, measured from the moment a request is sent to the moment the response arrives. It answers the question: "How long does the user wait?"

Throughput is the number of operations a system completes per unit of time. It answers the question: "How many requests can the system handle?"

These are related but distinct. A system can have low latency but low throughput (a single fast worker), or high throughput but high latency (a batch processing pipeline that takes hours but processes millions of records).

1
2
3
4
5
6
Latency:    |──── request ────>|<──── response ────|
            t=0                                  t=200ms
            "This request took 200ms"

Throughput: ████████████████████ → 1000 requests/second
            "This system handles 1000 QPS"

Latency in Depth

Where latency comes from

Every request passes through multiple stages, and latency accumulates at each one:

1
2
3
4
5
6
Client ──► DNS ──► CDN/LB ──► App Server ──► Cache ──► Database
  │         │        │           │             │          │
  5ms      50ms     1ms        10ms          1ms       20ms
  (network) (lookup) (routing)  (compute)   (hit)     (query)

Total: ~87ms (simplified)

The components:

  • Network latency: Physical distance and number of hops. Speed of light in fiber is roughly 200,000 km/s, so a round trip across the US (~4,000 km) takes about 40ms minimum.
  • Processing time: CPU time spent executing your code.
  • Queueing time: Waiting for resources (threads, connections, locks).
  • I/O time: Reading from disk, querying a database, calling an external service.

Percentiles: P50, P95, P99

Averages are misleading for latency. If 99 requests take 10ms and 1 request takes 10 seconds, the average is ~109ms — which describes nobody's experience. Use percentiles instead.

  • P50 (median): Half of requests are faster than this. This is the "typical" experience.
  • P95: 95% of requests are faster. This captures the experience of most users, including those hit by occasional slowness.
  • P99: 99% of requests are faster. This is the tail latency — the worst experience that 1 in 100 users has.
  • P99.9: Critical for high-traffic systems. At 1 million requests per day, P99.9 means 1,000 users per day have this experience or worse.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
import statistics

def compute_percentiles(latencies):
    latencies.sort()
    n = len(latencies)
    return {
        "p50": latencies[n // 2],
        "p95": latencies[int(n * 0.95)],
        "p99": latencies[int(n * 0.99)],
        "p999": latencies[int(n * 0.999)],
    }

# Example: most requests fast, some slow
latencies = [10] * 950 + [100] * 40 + [1000] * 9 + [5000] * 1  # ms
print(compute_percentiles(latencies))
# {'p50': 10, 'p95': 100, 'p99': 1000, 'p999': 5000}

Why tail latency matters

In a microservices architecture, a single user request might fan out to dozens of backend services. If each service has a P99 of 100ms, and you call 20 services in parallel, the probability that at least one takes 100ms+ is: 1 - (0.99)^20 = 18%. Your P99 latency at the edge is dramatically worse than any individual service's P99.

This is the tail-at-scale problem (described by Jeff Dean and Luiz Barroso at Google). It is why large-scale systems obsess over tail latency, not average latency.

Latency budgets

A latency budget breaks down your total allowed latency into allocations for each component. If your SLA says page load must complete in 500ms:

1
2
3
4
5
6
7
8
9
10
Total budget:                 500ms
├── Client rendering:         100ms
├── Network (user to LB):     50ms
├── Load balancer:              5ms
├── Application logic:         50ms
├── Cache lookup:               5ms
├── Database query:            30ms
├── External API call:        100ms
├── Network (LB to user):     50ms
└── Buffer for spikes:        110ms

Defining budgets forces you to identify bottlenecks before they surprise you in production. If your database query already takes 200ms, you know the 500ms SLA is impossible without caching or query optimization.

Throughput in Depth

Measuring throughput

Throughput is typically measured in:

  • QPS (Queries Per Second) for web services
  • TPS (Transactions Per Second) for databases
  • Mbps or Gbps for network bandwidth
  • IOPS (I/O Operations Per Second) for storage

Bottlenecks

A system's throughput is determined by its bottleneck — the slowest component. This is the application of Amdahl's Law to system design:

1
2
3
Web Server (10,000 QPS) ──► App Server (5,000 QPS) ──► Database (1,000 QPS)

System throughput: 1,000 QPS (limited by database)

Scaling the web server or app server does not help until you address the database bottleneck. This is why identifying the bottleneck is always the first step.

The Latency-Throughput Trade-Off

Latency and throughput are not independent. As throughput approaches a system's capacity, latency increases dramatically — this follows queueing theory.

1
2
3
4
5
6
7
8
9
10
11
12
13
Latency ▲
        │                              ╱
        │                            ╱
        │                          ╱
        │                        ╱
        │                     ╱──── Latency explodes as
        │                  ╱        utilization approaches 100%
        │              ╱
        │          ╱
        │     ╱
        │ ╱
        └────────────────────────────────────► Throughput
        0%                                  100% utilization

A rule of thumb: keep systems at 60-70% utilization to maintain reasonable latency. At 90%+ utilization, queueing delays dominate and latency becomes unpredictable.

Batching: trading latency for throughput

Batching is the classic technique to increase throughput at the cost of latency. Instead of processing items one by one, you wait until you have a batch and process them together.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# Without batching: low latency, lower throughput
def process_individually(items):
    for item in items:
        db.insert(item)  # One round trip per item

# With batching: higher latency for first item, much higher throughput
def process_batch(items, batch_size=100):
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            db.batch_insert(batch)  # One round trip per 100 items
            batch = []
    if batch:
        db.batch_insert(batch)

Kafka is a great example of this trade-off in practice. Producers can be configured with `linger.ms` — how long to wait for more messages before sending a batch. Higher values mean more latency but better throughput.

Little's Law

Little's Law is an elegant relationship that connects three quantities:

*L = lambda W**

  • L = average number of items in the system (concurrency)
  • lambda = average arrival rate (throughput)
  • W = average time an item spends in the system (latency)

This is incredibly useful for capacity planning:

1
2
3
4
5
6
7
8
# Example: How many concurrent connections do we need?
throughput = 1000  # requests per second
latency = 0.2      # seconds (200ms average)
concurrency = throughput * latency  # = 200 concurrent requests

# So your server needs to handle at least 200 concurrent connections
# If each connection uses a thread, you need at least 200 threads
# If each thread uses 1MB of stack, that is 200MB just for thread stacks

Another application: if your system handles 10,000 QPS with an average latency of 50ms, there are on average 500 requests in flight at any time. If latency doubles to 100ms, you will have 1,000 requests in flight — double the memory, connections, and threads.

Common Latency Numbers

These are the numbers every system designer should have memorized. They come from Jeff Dean's famous talk and have been updated for modern hardware:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
┌──────────────────────────────────────┬──────────────┐
│ Operation                            │ Latency      │
├──────────────────────────────────────┼──────────────┤
│ L1 cache reference                   │ 0.5 ns       │
│ L2 cache reference                   │ 7 ns         │
│ Main memory reference                │ 100 ns       │
│ SSD random read                      │ 16 μs        │
│ Read 1 MB sequentially from memory   │ 3 μs         │
│ Read 1 MB sequentially from SSD      │ 49 μs        │
│ Read 1 MB sequentially from disk     │ 825 μs       │
│ Disk seek                            │ 2 ms         │
│ Send packet CA → Netherlands → CA    │ 150 ms       │
│ Round trip within same datacenter    │ 0.5 ms       │
│ Round trip within same region        │ ~1-5 ms      │
│ Round trip cross-continent           │ 50-150 ms    │
│ Mutex lock/unlock                    │ 17 ns        │
│ Redis GET (local datacenter)         │ 0.1-0.5 ms   │
│ Simple database query (indexed)      │ 1-5 ms       │
│ Complex database query               │ 10-100 ms    │
│ TLS handshake                        │ 2-10 ms      │
│ DNS lookup (uncached)                │ 20-120 ms    │
└──────────────────────────────────────┴──────────────┘

Back-of-envelope calculations

Using these numbers, you can quickly estimate system performance:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# Example: How long to read and process a 10 MB image?

# From SSD: 10 * 49μs = 490μs ≈ 0.5ms
# From HDD: 10 * 825μs = 8.25ms
# From memory: 10 * 3μs = 30μs

# Network transfer (1 Gbps link): 10MB / 125MB/s = 80ms
# Network transfer (10 Gbps link): 10MB / 1.25GB/s = 8ms

# So for a user uploading a 10MB image:
# Upload time (100 Mbps user connection): 10MB / 12.5MB/s = 800ms
# Server processing (resize, thumbnail): ~200ms
# Write to S3: ~50ms
# Total: ~1050ms
1
2
3
4
5
6
7
8
9
10
11
# Example: Can this database handle our load?

# Requirements: 50,000 QPS, p99 < 20ms
# Single PostgreSQL: ~5,000-20,000 QPS for simple queries
# With connection pooling and read replicas:
#   1 primary + 4 replicas = 5 * 10,000 = 50,000 QPS (reads)
#   Writes still limited to primary: ~10,000 QPS

# Redis cache for hot data: 100,000+ QPS per instance
# Strategy: Cache reads (90% hit rate) + 4 read replicas for cache misses
# 50,000 QPS → 45,000 from cache + 5,000 from DB → feasible

Optimizing Latency

Common techniques for reducing latency:

  1. Caching: Avoid recomputing or refetching data. Cache at every layer (browser, CDN, application, database).
  2. Geographical proximity: Serve users from nearby data centers. A CDN can reduce latency from 150ms to 10ms.
  3. Connection pooling: Avoid the overhead of establishing new connections (TCP handshake + TLS = 10-50ms).
  4. Async processing: Move non-critical work off the request path. Send the response first, process later.
  5. Precomputation: Compute results before they are requested (materialized views, pre-rendered pages).
  6. Compression: Trade CPU time for reduced data transfer time. Usually a net win for network-bound operations.

Optimizing Throughput

  1. Horizontal scaling: Add more machines to handle more requests in parallel.
  2. Batching: Amortize fixed overhead across multiple items.
  3. Asynchronous I/O: Use non-blocking I/O to handle many concurrent connections with few threads.
  4. Load shedding: When overloaded, reject low-priority requests early rather than letting everything slow down.
  5. Backpressure: When a downstream service is slow, slow down the upstream producer rather than buffering indefinitely.

Interview Tips

  1. Use concrete numbers. "A cross-country round trip is about 60ms, so with three sequential database calls, we have already used 180ms of our latency budget" is much stronger than "network calls are slow."
  1. Think in percentiles. When the interviewer asks about latency, always clarify: "Are we targeting P50 or P99? The P99 is typically 5-10x the P50." This shows operational maturity.
  1. Apply Little's Law. "If we need 10,000 QPS and each request takes 100ms, we need 1,000 concurrent handlers" is the kind of quick math that impresses interviewers.
  1. Identify the bottleneck first. Before optimizing anything, determine what limits your system's throughput. There is always one bottleneck, and optimizing anything else is waste.
  1. Discuss the trade-off explicitly. "We could batch writes to increase throughput from 1,000 to 10,000 TPS, but that adds up to 100ms of latency for each write. For this use case, that trade-off is acceptable because users do not need instant confirmation."
  1. Know the latency numbers. Memorize the table above. Being able to say "a Redis lookup in the same datacenter is under a millisecond, but a cross-region database call is 50ms" shows you have practical experience with real systems.