← back

Scalability Patterns

Rate Limiting

Protect services from abuse with token bucket, leaky bucket, fixed window, and sliding window algorithms. Distributed rate limiting with Redis.

Rate Limiting

Rate limiting controls how many requests a client can make to your service within a given time window. It protects your system from abuse (intentional or accidental), prevents cascading failures, ensures fair resource allocation, and controls costs.

Every major API implements rate limiting — Twitter, GitHub, Stripe, AWS — and it comes up frequently in system design interviews, both as a standalone design question and as a component within larger designs.

Why Rate Limit?

  • Prevent abuse: Block DDoS attacks, credential stuffing, scraping
  • Protect resources: Prevent a single misbehaving client from overwhelming your database or downstream services
  • Ensure fairness: Give all users a reasonable share of capacity
  • Control costs: Prevent a runaway process from generating a huge cloud bill
  • Comply with SLAs: If you depend on a rate-limited third-party API, you must limit your own outgoing rate

Rate Limiting Algorithms

Token Bucket

A bucket holds up to `N` tokens. Tokens are added at a fixed rate (e.g., 10 tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity          # Max tokens
        self.refill_rate = refill_rate    # Tokens per second
        self.tokens = capacity
        self.last_refill = time.time()

    def allow_request(self):
        now = time.time()
        # Add tokens based on elapsed time
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Usage: 10 requests/sec, burst up to 20
bucket = TokenBucket(capacity=20, refill_rate=10)

Key property: Allows short bursts up to the bucket capacity, then enforces the steady-state rate. A bucket with capacity 20 and rate 10/sec allows a burst of 20 requests, then 10/sec thereafter.

Pros: Simple, memory efficient (just two numbers per client), allows controlled bursting. Cons: Requires careful tuning of capacity and refill rate.

Used by: AWS API Gateway, Stripe.

Leaky Bucket

Requests enter a queue (the bucket). The queue is processed at a fixed rate. If the queue is full, new requests are dropped.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
from collections import deque
import time

class LeakyBucket:
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity       # Max queue size
        self.leak_rate = leak_rate     # Requests processed per second
        self.queue = deque()
        self.last_leak = time.time()

    def allow_request(self, request):
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True
        return False  # Queue full, drop request

    def _leak(self):
        now = time.time()
        elapsed = now - self.last_leak
        leaks = int(elapsed * self.leak_rate)
        for _ in range(min(leaks, len(self.queue))):
            self.queue.popleft()
        if leaks > 0:
            self.last_leak = now

Key property: Smooths out bursts — requests are processed at a constant rate regardless of arrival pattern.

Difference from token bucket: The token bucket allows bursts (up to capacity). The leaky bucket smooths out bursts (processes at constant rate). The leaky bucket shapes traffic; the token bucket polices traffic.

Pros: Produces a perfectly smooth output rate. Cons: A burst of requests fills the queue, and some may experience high latency waiting in the queue. Bursty traffic is penalized even if the average rate is within limits.

Fixed Window Counter

Divide time into fixed windows (e.g., 1-minute intervals). Count requests in each window. If the count exceeds the limit, reject.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
import time

class FixedWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = {}  # window_key -> count

    def allow_request(self, client_id):
        window_key = int(time.time() / self.window_seconds)
        key = f"{client_id}:{window_key}"

        self.counts.setdefault(key, 0)
        if self.counts[key] >= self.limit:
            return False

        self.counts[key] += 1
        return True

# Usage: 100 requests per minute
limiter = FixedWindowCounter(limit=100, window_seconds=60)

The boundary problem: A client could send 100 requests at 0:59 and another 100 at 1:01 — that is 200 requests in 2 seconds, even though the limit is 100/minute. The requests straddle the window boundary.

1
2
3
4
Window 1 [0:00 - 1:00]: ... 100 requests at 0:59
Window 2 [1:00 - 2:00]: 100 requests at 1:01 ...

200 requests in a 2-second span — effectively 2x the intended rate.

Pros: Very simple, low memory (one counter per window per client). Cons: The boundary problem allows up to 2x the intended rate in the worst case.

Sliding Window Log

Keep a log (sorted list) of all request timestamps. When a new request arrives, remove timestamps older than the window, then check if the count exceeds the limit.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import time

class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = {}  # client_id -> [timestamps]

    def allow_request(self, client_id):
        now = time.time()
        window_start = now - self.window_seconds

        if client_id not in self.logs:
            self.logs[client_id] = []

        # Remove expired entries
        self.logs[client_id] = [
            ts for ts in self.logs[client_id] if ts > window_start
        ]

        if len(self.logs[client_id]) >= self.limit:
            return False

        self.logs[client_id].append(now)
        return True

Pros: Perfectly accurate — no boundary problems. The window slides smoothly. Cons: High memory usage — must store every request timestamp. For a client making 1000 requests/minute, that is 1000 timestamps per client.

Sliding Window Counter

A hybrid of fixed window and sliding window log. Keep counts for the current and previous windows, then compute a weighted count based on how far into the current window you are.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
import time

class SlidingWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.windows = {}  # client_id -> {window_key: count}

    def allow_request(self, client_id):
        now = time.time()
        current_window = int(now / self.window_seconds)
        previous_window = current_window - 1

        # How far into the current window (0.0 to 1.0)
        window_progress = (now % self.window_seconds) / self.window_seconds

        if client_id not in self.windows:
            self.windows[client_id] = {}

        prev_count = self.windows[client_id].get(previous_window, 0)
        curr_count = self.windows[client_id].get(current_window, 0)

        # Weighted count: full current window + remaining fraction of previous
        weighted_count = curr_count + prev_count * (1 - window_progress)

        if weighted_count >= self.limit:
            return False

        self.windows[client_id][current_window] = curr_count + 1
        return True

Example: Limit is 100/minute. We are 40% into the current window. Previous window had 80 requests, current has 30.

1
2
weighted_count = 30 + 80 * (1 - 0.4) = 30 + 48 = 78
78 < 100 -> allow

Pros: Nearly as accurate as sliding window log, but with the memory efficiency of fixed window (just two counters per client). Cons: Approximate — the weighted estimate is not perfectly accurate, but it is very close in practice.

This is the best general-purpose algorithm. It combines accuracy, memory efficiency, and simplicity. Cloudflare and many other companies use this approach.

Algorithm Comparison

AlgorithmAccuracyMemoryBurst HandlingComplexity
Token BucketGoodVery lowAllows controlled burstsLow
Leaky BucketGoodLowSmooths burstsLow
Fixed WindowHas boundary issueVery lowAllows 2x burst at boundaryVery low
Sliding Window LogPerfectHighNo burst issueMedium
Sliding Window CounterNear-perfectLowMinimal burst issueLow

Distributed Rate Limiting

In a distributed system with multiple servers, each server cannot maintain its own counters independently — a client could send requests to different servers and bypass the limit.

Centralized Counter with Redis

Use Redis as a shared counter store. All servers check and increment the same counter.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
import redis
import time

r = redis.Redis()

def rate_limit(client_id, limit, window_seconds):
    key = f"ratelimit:{client_id}:{int(time.time() / window_seconds)}"

    # Atomic increment + set expiry
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, window_seconds)
    count, _ = pipe.execute()

    return count <= limit

For the sliding window counter in Redis:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
def sliding_window_rate_limit(client_id, limit, window_seconds):
    now = time.time()
    current_window = int(now / window_seconds)
    previous_window = current_window - 1
    window_progress = (now % window_seconds) / window_seconds

    curr_key = f"ratelimit:{client_id}:{current_window}"
    prev_key = f"ratelimit:{client_id}:{previous_window}"

    pipe = r.pipeline()
    pipe.incr(curr_key)
    pipe.expire(curr_key, window_seconds * 2)
    pipe.get(prev_key)
    results = pipe.execute()

    curr_count = results[0]
    prev_count = int(results[2] or 0)

    weighted = curr_count + prev_count * (1 - window_progress)
    return weighted <= limit

Race Conditions

With concurrent requests, two requests might both read count=99 (limit=100), both increment, and both succeed — resulting in count=101. Using Redis `INCR` (which is atomic) avoids this. Alternatively, use a Lua script for multi-step atomic operations.

1
2
3
4
5
6
7
8
# Lua script for atomic check-and-increment
lua_script = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
"""

Latency and Availability Trade-Offs

A centralized Redis counter adds a network round trip (~1ms) to every request. For very high-throughput systems, this can be a bottleneck.

Local counter + sync: Each server maintains a local counter and periodically syncs with Redis. This reduces Redis calls but allows temporary over-limit traffic.

Cell-based rate limiting: Split the rate limit across servers. If the limit is 1000/min and you have 10 servers, each server allows 100/min locally. Simple but inflexible — does not handle uneven traffic distribution.

Where to Implement Rate Limiting

1
2
3
4
5
6
7
Client -> CDN/Edge -> API Gateway -> Load Balancer -> Application -> Database

Rate limiting can happen at any layer:
  - CDN/Edge: Block DDoS, geographic blocking (Cloudflare, AWS Shield)
  - API Gateway: Per-API-key limits (Kong, AWS API Gateway)
  - Application: Business logic limits (e.g., 5 password attempts per hour)
  - Database: Connection limits (max_connections in PostgreSQL)

Best practice: Apply rate limiting as early as possible (at the edge or API gateway) to reject bad traffic before it consumes resources. Apply business-specific limits at the application layer.

Handling Rate-Limited Requests

HTTP Response

Return HTTP 429 (Too Many Requests) with informative headers:

1
2
3
4
5
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640000000

Client-Side Handling

Clients should implement exponential backoff with jitter:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
import random
import time

def make_request_with_retry(url, max_retries=5):
    for attempt in range(max_retries):
        response = http.get(url)
        if response.status_code != 429:
            return response

        # Exponential backoff with jitter
        base_delay = 2 ** attempt  # 1, 2, 4, 8, 16 seconds
        jitter = random.uniform(0, base_delay)
        time.sleep(base_delay + jitter)

    raise Exception("Rate limited after max retries")

The jitter prevents the thundering herd problem — without it, all rate-limited clients retry at exactly the same time.

Multi-Tier Rate Limiting

Production systems often apply multiple rate limits simultaneously:

1
2
3
4
Per IP:     1000 requests/minute  (DDoS protection)
Per API key:  100 requests/minute  (fair usage)
Per user:      10 requests/second  (burst protection)
Per endpoint:   5 requests/second  (expensive operation protection)

A request must pass all applicable limits. This allows fine-grained control.

Interview Tips

Know at least 3 algorithms. Token bucket, fixed window, and sliding window counter are the most practical. Be ready to code pseudocode for each and explain their tradeoffs.

Always discuss distributed rate limiting. A single-server solution is trivial. The interesting part is how to rate limit across multiple servers. Mention Redis as the coordination layer.

Mention the HTTP 429 response. Include `Retry-After` and rate limit headers. This shows you think about the client experience, not just the server implementation.

Address race conditions. If using Redis, explain that INCR is atomic or that you use Lua scripts for multi-step operations.

Discuss where to place the rate limiter. Edge/API gateway for DDoS and abuse protection. Application layer for business rules. This shows you understand the full request path.

For the standalone design question ("Design a rate limiter"), cover: requirements (what limits, what identifies a client), algorithm choice, distributed implementation with Redis, the API (HTTP 429 + headers), and scaling (what if you need to rate limit millions of clients).

Connect to other concepts: Rate limiting is closely related to load shedding (dropping requests when the system is overloaded), circuit breaking (stopping requests to a failing downstream), and backpressure (slowing producers when consumers cannot keep up). Mentioning these connections shows systems thinking.