Scalability Patterns
Protect services from abuse with token bucket, leaky bucket, fixed window, and sliding window algorithms. Distributed rate limiting with Redis.
Rate limiting controls how many requests a client can make to your service within a given time window. It protects your system from abuse (intentional or accidental), prevents cascading failures, ensures fair resource allocation, and controls costs.
Every major API implements rate limiting — Twitter, GitHub, Stripe, AWS — and it comes up frequently in system design interviews, both as a standalone design question and as a component within larger designs.
A bucket holds up to `N` tokens. Tokens are added at a fixed rate (e.g., 10 tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected.
import time
class TokenBucket:
def __init__(self, capacity, refill_rate):
self.capacity = capacity # Max tokens
self.refill_rate = refill_rate # Tokens per second
self.tokens = capacity
self.last_refill = time.time()
def allow_request(self):
now = time.time()
# Add tokens based on elapsed time
elapsed = now - self.last_refill
self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
self.last_refill = now
if self.tokens >= 1:
self.tokens -= 1
return True
return False
# Usage: 10 requests/sec, burst up to 20
bucket = TokenBucket(capacity=20, refill_rate=10)Key property: Allows short bursts up to the bucket capacity, then enforces the steady-state rate. A bucket with capacity 20 and rate 10/sec allows a burst of 20 requests, then 10/sec thereafter.
Pros: Simple, memory efficient (just two numbers per client), allows controlled bursting. Cons: Requires careful tuning of capacity and refill rate.
Used by: AWS API Gateway, Stripe.
Requests enter a queue (the bucket). The queue is processed at a fixed rate. If the queue is full, new requests are dropped.
from collections import deque
import time
class LeakyBucket:
def __init__(self, capacity, leak_rate):
self.capacity = capacity # Max queue size
self.leak_rate = leak_rate # Requests processed per second
self.queue = deque()
self.last_leak = time.time()
def allow_request(self, request):
self._leak()
if len(self.queue) < self.capacity:
self.queue.append(request)
return True
return False # Queue full, drop request
def _leak(self):
now = time.time()
elapsed = now - self.last_leak
leaks = int(elapsed * self.leak_rate)
for _ in range(min(leaks, len(self.queue))):
self.queue.popleft()
if leaks > 0:
self.last_leak = nowKey property: Smooths out bursts — requests are processed at a constant rate regardless of arrival pattern.
Difference from token bucket: The token bucket allows bursts (up to capacity). The leaky bucket smooths out bursts (processes at constant rate). The leaky bucket shapes traffic; the token bucket polices traffic.
Pros: Produces a perfectly smooth output rate. Cons: A burst of requests fills the queue, and some may experience high latency waiting in the queue. Bursty traffic is penalized even if the average rate is within limits.
Divide time into fixed windows (e.g., 1-minute intervals). Count requests in each window. If the count exceeds the limit, reject.
import time
class FixedWindowCounter:
def __init__(self, limit, window_seconds):
self.limit = limit
self.window_seconds = window_seconds
self.counts = {} # window_key -> count
def allow_request(self, client_id):
window_key = int(time.time() / self.window_seconds)
key = f"{client_id}:{window_key}"
self.counts.setdefault(key, 0)
if self.counts[key] >= self.limit:
return False
self.counts[key] += 1
return True
# Usage: 100 requests per minute
limiter = FixedWindowCounter(limit=100, window_seconds=60)The boundary problem: A client could send 100 requests at 0:59 and another 100 at 1:01 — that is 200 requests in 2 seconds, even though the limit is 100/minute. The requests straddle the window boundary.
Window 1 [0:00 - 1:00]: ... 100 requests at 0:59
Window 2 [1:00 - 2:00]: 100 requests at 1:01 ...
200 requests in a 2-second span — effectively 2x the intended rate.Pros: Very simple, low memory (one counter per window per client). Cons: The boundary problem allows up to 2x the intended rate in the worst case.
Keep a log (sorted list) of all request timestamps. When a new request arrives, remove timestamps older than the window, then check if the count exceeds the limit.
import time
class SlidingWindowLog:
def __init__(self, limit, window_seconds):
self.limit = limit
self.window_seconds = window_seconds
self.logs = {} # client_id -> [timestamps]
def allow_request(self, client_id):
now = time.time()
window_start = now - self.window_seconds
if client_id not in self.logs:
self.logs[client_id] = []
# Remove expired entries
self.logs[client_id] = [
ts for ts in self.logs[client_id] if ts > window_start
]
if len(self.logs[client_id]) >= self.limit:
return False
self.logs[client_id].append(now)
return TruePros: Perfectly accurate — no boundary problems. The window slides smoothly. Cons: High memory usage — must store every request timestamp. For a client making 1000 requests/minute, that is 1000 timestamps per client.
A hybrid of fixed window and sliding window log. Keep counts for the current and previous windows, then compute a weighted count based on how far into the current window you are.
import time
class SlidingWindowCounter:
def __init__(self, limit, window_seconds):
self.limit = limit
self.window_seconds = window_seconds
self.windows = {} # client_id -> {window_key: count}
def allow_request(self, client_id):
now = time.time()
current_window = int(now / self.window_seconds)
previous_window = current_window - 1
# How far into the current window (0.0 to 1.0)
window_progress = (now % self.window_seconds) / self.window_seconds
if client_id not in self.windows:
self.windows[client_id] = {}
prev_count = self.windows[client_id].get(previous_window, 0)
curr_count = self.windows[client_id].get(current_window, 0)
# Weighted count: full current window + remaining fraction of previous
weighted_count = curr_count + prev_count * (1 - window_progress)
if weighted_count >= self.limit:
return False
self.windows[client_id][current_window] = curr_count + 1
return TrueExample: Limit is 100/minute. We are 40% into the current window. Previous window had 80 requests, current has 30.
weighted_count = 30 + 80 * (1 - 0.4) = 30 + 48 = 78
78 < 100 -> allowPros: Nearly as accurate as sliding window log, but with the memory efficiency of fixed window (just two counters per client). Cons: Approximate — the weighted estimate is not perfectly accurate, but it is very close in practice.
This is the best general-purpose algorithm. It combines accuracy, memory efficiency, and simplicity. Cloudflare and many other companies use this approach.
| Algorithm | Accuracy | Memory | Burst Handling | Complexity |
|---|---|---|---|---|
| Token Bucket | Good | Very low | Allows controlled bursts | Low |
| Leaky Bucket | Good | Low | Smooths bursts | Low |
| Fixed Window | Has boundary issue | Very low | Allows 2x burst at boundary | Very low |
| Sliding Window Log | Perfect | High | No burst issue | Medium |
| Sliding Window Counter | Near-perfect | Low | Minimal burst issue | Low |
In a distributed system with multiple servers, each server cannot maintain its own counters independently — a client could send requests to different servers and bypass the limit.
Use Redis as a shared counter store. All servers check and increment the same counter.
import redis
import time
r = redis.Redis()
def rate_limit(client_id, limit, window_seconds):
key = f"ratelimit:{client_id}:{int(time.time() / window_seconds)}"
# Atomic increment + set expiry
pipe = r.pipeline()
pipe.incr(key)
pipe.expire(key, window_seconds)
count, _ = pipe.execute()
return count <= limitFor the sliding window counter in Redis:
def sliding_window_rate_limit(client_id, limit, window_seconds):
now = time.time()
current_window = int(now / window_seconds)
previous_window = current_window - 1
window_progress = (now % window_seconds) / window_seconds
curr_key = f"ratelimit:{client_id}:{current_window}"
prev_key = f"ratelimit:{client_id}:{previous_window}"
pipe = r.pipeline()
pipe.incr(curr_key)
pipe.expire(curr_key, window_seconds * 2)
pipe.get(prev_key)
results = pipe.execute()
curr_count = results[0]
prev_count = int(results[2] or 0)
weighted = curr_count + prev_count * (1 - window_progress)
return weighted <= limitWith concurrent requests, two requests might both read count=99 (limit=100), both increment, and both succeed — resulting in count=101. Using Redis `INCR` (which is atomic) avoids this. Alternatively, use a Lua script for multi-step atomic operations.
# Lua script for atomic check-and-increment
lua_script = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
"""A centralized Redis counter adds a network round trip (~1ms) to every request. For very high-throughput systems, this can be a bottleneck.
Local counter + sync: Each server maintains a local counter and periodically syncs with Redis. This reduces Redis calls but allows temporary over-limit traffic.
Cell-based rate limiting: Split the rate limit across servers. If the limit is 1000/min and you have 10 servers, each server allows 100/min locally. Simple but inflexible — does not handle uneven traffic distribution.
Client -> CDN/Edge -> API Gateway -> Load Balancer -> Application -> Database
Rate limiting can happen at any layer:
- CDN/Edge: Block DDoS, geographic blocking (Cloudflare, AWS Shield)
- API Gateway: Per-API-key limits (Kong, AWS API Gateway)
- Application: Business logic limits (e.g., 5 password attempts per hour)
- Database: Connection limits (max_connections in PostgreSQL)Best practice: Apply rate limiting as early as possible (at the edge or API gateway) to reject bad traffic before it consumes resources. Apply business-specific limits at the application layer.
Return HTTP 429 (Too Many Requests) with informative headers:
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640000000Clients should implement exponential backoff with jitter:
import random
import time
def make_request_with_retry(url, max_retries=5):
for attempt in range(max_retries):
response = http.get(url)
if response.status_code != 429:
return response
# Exponential backoff with jitter
base_delay = 2 ** attempt # 1, 2, 4, 8, 16 seconds
jitter = random.uniform(0, base_delay)
time.sleep(base_delay + jitter)
raise Exception("Rate limited after max retries")The jitter prevents the thundering herd problem — without it, all rate-limited clients retry at exactly the same time.
Production systems often apply multiple rate limits simultaneously:
Per IP: 1000 requests/minute (DDoS protection)
Per API key: 100 requests/minute (fair usage)
Per user: 10 requests/second (burst protection)
Per endpoint: 5 requests/second (expensive operation protection)A request must pass all applicable limits. This allows fine-grained control.
Know at least 3 algorithms. Token bucket, fixed window, and sliding window counter are the most practical. Be ready to code pseudocode for each and explain their tradeoffs.
Always discuss distributed rate limiting. A single-server solution is trivial. The interesting part is how to rate limit across multiple servers. Mention Redis as the coordination layer.
Mention the HTTP 429 response. Include `Retry-After` and rate limit headers. This shows you think about the client experience, not just the server implementation.
Address race conditions. If using Redis, explain that INCR is atomic or that you use Lua scripts for multi-step operations.
Discuss where to place the rate limiter. Edge/API gateway for DDoS and abuse protection. Application layer for business rules. This shows you understand the full request path.
For the standalone design question ("Design a rate limiter"), cover: requirements (what limits, what identifies a client), algorithm choice, distributed implementation with Redis, the API (HTTP 429 + headers), and scaling (what if you need to rate limit millions of clients).
Connect to other concepts: Rate limiting is closely related to load shedding (dropping requests when the system is overloaded), circuit breaking (stopping requests to a failing downstream), and backpressure (slowing producers when consumers cannot keep up). Mentioning these connections shows systems thinking.