Scalability Patterns
Distribute traffic across servers using round-robin, least connections, consistent hashing, and weighted algorithms. L4 vs L7 load balancing.
A load balancer distributes incoming network traffic across multiple servers so that no single server bears too much load. It is one of the first components you draw in any system design diagram, sitting between the client and your application servers.
Load balancing is essential for scalability (handle more traffic by adding servers), availability (if one server dies, traffic is routed to healthy ones), and performance (distribute load evenly so no server becomes a bottleneck).
"L4" and "L7" refer to the OSI model layers at which the load balancer operates: layer 4 (transport) and layer 7 (application). The distinction has major practical implications.
An L4 load balancer routes traffic based on IP address and TCP/UDP port. It does not inspect the contents of the packets — it just looks at the source/destination IP and port to make routing decisions.
Client -> L4 Load Balancer -> Backend Server
(routes based on IP:port)
The LB sees: source=1.2.3.4:50123, dest=lb.example.com:443
It forwards the entire TCP connection to a chosen backend.
Pros: Extremely fast, with no need to parse HTTP headers or decrypt TLS. Lower latency and higher throughput. Simple to operate.
Cons: Cannot make routing decisions based on URL path, headers, cookies, or request content. Cannot do content-based routing, header injection, or request modification.
Examples: AWS Network Load Balancer (NLB), HAProxy in TCP mode, IPVS (Linux Virtual Server).
An L7 load balancer inspects the full HTTP request — URL path, headers, cookies, query parameters, even the request body. It can make sophisticated routing decisions.
Client -> L7 Load Balancer -> Backend Server
(routes based on HTTP content)
The LB sees the full HTTP request:
GET /api/users/123 HTTP/1.1
Host: api.example.com
Cookie: session=abc123
It can route /api/* to API servers and /static/* to CDN origins.
Pros: Content-based routing (e.g., `/api` to API servers, `/images` to image servers). Header injection and request modification (e.g., `X-Request-Id`, `X-Forwarded-For`).
Cons: Higher latency (must parse HTTP), lower throughput, more complex to operate. Must decrypt TLS to inspect content.
Examples: AWS Application Load Balancer (ALB), Nginx, HAProxy in HTTP mode, Envoy.
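To make content-based routing concrete, here is a minimal sketch of the decision an L7 load balancer makes once it has parsed the request path. The pool names, the `ROUTES` table, and the `pick_pool_l7` function are hypothetical and chosen for illustration; real proxies express this in their own configuration language.

```python
# Hypothetical routing table: path prefix -> backend pool name.
ROUTES = [
    ("/api/", "api-pool"),
    ("/static/", "static-pool"),
]
DEFAULT_POOL = "web-pool"


def pick_pool_l7(path, routes=ROUTES, default=DEFAULT_POOL):
    """Return the pool for a request path; the longest matching prefix wins."""
    best = None
    for prefix, pool in routes:
        if path.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, pool)
    return best[1] if best else default


api_target = pick_pool_l7("/api/users/123")
static_target = pick_pool_l7("/static/logo.png")
fallback_target = pick_pool_l7("/")
```

An L4 load balancer cannot make this decision because it never sees the path.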
Need content-based routing? -> L7
Need TLS termination at LB? -> L7
Need WebSocket support? -> L7 (or L4 with passthrough)
Need maximum throughput? -> L4
Need simplest configuration? -> L4
Microservices with path routing? -> L7
In practice, most web applications use L7 load balancing because the routing flexibility is more valuable than the marginal performance difference.
Requests are distributed to servers sequentially: server 1, server 2, server 3, server 1, server 2, ...
    class RoundRobinBalancer:
        """Cycle through the servers in order, one request each."""

        def __init__(self, servers):
            self.servers = servers
            self.index = 0  # position of the next server to hand out

        def next_server(self):
            server = self.servers[self.index]
            # Advance and wrap around at the end of the list.
            self.index = (self.index + 1) % len(self.servers)
            return server

Pros: Simple, fair distribution if all servers and requests are equal.
Cons: Ignores server capacity and current load. A slow request on server 1 does not affect the distribution, so server 1 keeps getting new requests even if it is overloaded.
Like round robin, but servers have weights reflecting their capacity. A server with weight 3 gets 3x the traffic of a server with weight 1.
Server A (weight 3): gets 3 out of every 6 requests
Server B (weight 2): gets 2 out of every 6 requests
Server C (weight 1): gets 1 out of every 6 requests
Use case: When your server fleet is heterogeneous (e.g., mixing instance types in the cloud).
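A minimal sketch of weighted round robin, assuming integer weights: each server is repeated in the schedule as many times as its weight. The class name `WeightedRoundRobinBalancer` is an illustrative choice, not taken from any particular load balancer.

```python
import itertools


class WeightedRoundRobinBalancer:
    def __init__(self, weights):
        # weights: dict mapping server name -> positive integer weight.
        # Repeat each server `weight` times to build one full cycle.
        schedule = [server for server, w in weights.items() for _ in range(w)]
        self._cycle = itertools.cycle(schedule)

    def next_server(self):
        return next(self._cycle)


wrr = WeightedRoundRobinBalancer({"A": 3, "B": 2, "C": 1})
picks = [wrr.next_server() for _ in range(6)]
# picks == ["A", "A", "A", "B", "B", "C"]
```

This simple schedule sends each server its share in a burst. An implementation that interleaves the picks would spread the load more evenly within each cycle.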
Route each new request to the server with the fewest active connections. This naturally adapts to server load — a slow server accumulates connections and gets fewer new requests.
    class LeastConnectionsBalancer:
        def __init__(self, servers):
            # Map each server to its count of active connections.
            self.servers = {s: 0 for s in servers}

        def next_server(self):
            # Pick the server with the fewest active connections.
            server = min(self.servers, key=self.servers.get)
            self.servers[server] += 1
            return server

        def release(self, server):
            # Call when a connection to this server closes.
            self.servers[server] -= 1

Pros: Adapts to varying request durations. Better than round robin when requests have different processing times.
Cons: Slightly more overhead (must track active connections). Does not account for server capacity differences (use weighted least connections for that).
Best for: APIs with variable response times, WebSocket connections (long-lived), any workload where request duration varies significantly.
Hash the request (or a key from the request, like user ID) and map it to a position on a hash ring. Route to the nearest server clockwise from that position, wrapping around at the end of the ring.
              Server A (pos 90)
               /            \
Server D (pos 0)            Server B (pos 180)
               \            /
              Server C (pos 270)

hash("user:123") = 150 -> routes to Server B (nearest clockwise, at pos 180)
hash("user:456") = 300 -> routes to Server D (nearest clockwise, wrapping around to pos 0)
Pros: The same user always goes to the same server (great for caching). When a server is added or removed, only a fraction of requests are remapped (approximately 1/N).
Cons: Can be uneven without virtual nodes. More complex to implement.
Use case: Caching layers (each server caches a subset of data), stateful services, distributed systems that need request affinity.
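The ring lookup can be sketched with a sorted list and binary search. This is a minimal illustration under stated assumptions, not a production implementation: the `ConsistentHashRing` class, the choice of SHA-256, and the default of 100 virtual nodes per server are all illustrative.

```python
import bisect
import hashlib


class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, server)
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(key):
        # Use the first 8 bytes of SHA-256 as the ring position.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add(self, server):
        # Each server is placed at many positions (virtual nodes)
        # so that load spreads more evenly around the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server):
        self._ring = [(pos, s) for pos, s in self._ring if s != server]

    def get(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring entry at or after the key's position (clockwise),
        # wrapping to index 0 past the end of the list.
        idx = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["A", "B", "C", "D"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.get(k) for k in keys}
ring.remove("D")
after = {k: ring.get(k) for k in keys}
```

After `ring.remove("D")`, only keys that previously mapped to D move to a different server. Every other key keeps its original assignment, which is the property that makes this scheme attractive for caches.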
Hash the client's IP address to determine the server. Ensures the same client always reaches the same server.
Pros: Simple sticky sessions without cookies. Cons: Uneven distribution if traffic comes from a small number of IPs (e.g., corporate NAT). Does not work well with IPv6 proxies.
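A minimal sketch of IP hashing, assuming a fixed server list. The function name `pick_by_ip` and the example addresses are hypothetical.

```python
import hashlib
import ipaddress


def pick_by_ip(client_ip, servers):
    """Map a client IP (IPv4 or IPv6) to one server deterministically."""
    packed = ipaddress.ip_address(client_ip).packed  # canonical byte form
    digest = hashlib.sha256(packed).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


servers = ["app-1", "app-2", "app-3"]
first = pick_by_ip("1.2.3.4", servers)
second = pick_by_ip("1.2.3.4", servers)
v6 = pick_by_ip("2001:db8::1", servers)
```

Because the mapping uses a plain modulo, changing the number of servers remaps most clients. Consistent hashing avoids that.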
Route to the server with the lowest average response time. More sophisticated than least connections because it accounts for actual performance rather than just connection count.
Pros: Adapts to real server performance. Cons: Requires tracking response times, which adds overhead. Can oscillate if latency fluctuates.
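One way to track response times cheaply is an exponentially weighted moving average (EWMA), which also dampens oscillation by smoothing out short spikes. The sketch below is an assumption about how this could be done, not a description of any specific load balancer. The class name and the `alpha` smoothing factor are illustrative.

```python
class LeastResponseTimeBalancer:
    def __init__(self, servers, alpha=0.2):
        self.alpha = alpha  # weight given to the newest sample
        # Starting at 0.0 means untried servers are preferred at first.
        self.avg = {s: 0.0 for s in servers}

    def next_server(self):
        return min(self.avg, key=self.avg.get)

    def record(self, server, seconds):
        # Blend the new sample into the running average.
        old = self.avg[server]
        self.avg[server] = (1 - self.alpha) * old + self.alpha * seconds


lrt = LeastResponseTimeBalancer(["A", "B"])
lrt.record("A", 0.5)  # A answered in 500 ms
lrt.record("B", 0.1)  # B answered in 100 ms
fastest = lrt.next_server()
```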
A load balancer must know which servers are healthy. Unhealthy servers should be removed from the pool.
Passive health checks: The LB monitors responses from backends. If a server returns too many errors (e.g., 5xx) or connections time out, it is marked unhealthy.
If server returns 3 consecutive 5xx errors -> mark unhealthy
After 30 seconds -> try again -> if healthy, add back to pool
Active health checks: The LB periodically sends probe requests to each backend (e.g., GET /health every 10 seconds).
LB -> GET /health -> Server A returns 200 -> healthy
LB -> GET /health -> Server B returns 503 -> unhealthy (remove from pool)
LB -> GET /health -> Server C timeout -> unhealthy (remove from pool)
Best practice: Use both active and passive health checks. Active checks detect problems even when there is no traffic. Passive checks detect problems faster under load.
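The passive rule described above (three consecutive 5xx responses, retry after 30 seconds) could be sketched as follows. The `PassiveHealthTracker` class and its parameter names are hypothetical. The clock is injectable so the example can simulate the passage of time.

```python
import time


class PassiveHealthTracker:
    def __init__(self, servers, max_failures=3, retry_after=30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.retry_after = retry_after
        self.clock = clock
        self.failures = {s: 0 for s in servers}  # consecutive 5xx counts
        self.down_since = {}  # server -> time it was marked unhealthy

    def record(self, server, status_code):
        if status_code >= 500:
            self.failures[server] += 1
            if self.failures[server] >= self.max_failures:
                self.down_since[server] = self.clock()
        else:
            # Any success resets the streak and restores the server.
            self.failures[server] = 0
            self.down_since.pop(server, None)

    def available(self):
        # Healthy servers, plus unhealthy ones whose retry window has passed.
        now = self.clock()
        return [s for s in self.failures
                if s not in self.down_since
                or now - self.down_since[s] >= self.retry_after]


now = [0.0]  # fake clock for the example
tracker = PassiveHealthTracker(["A", "B"], clock=lambda: now[0])
for _ in range(3):
    tracker.record("A", 503)
pool_after_failures = tracker.available()
now[0] = 31.0
pool_after_wait = tracker.available()
```

In this sketch a server that fails again on its retry is marked unhealthy for another 30 seconds, because its failure streak was never reset.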
The health endpoint should check dependencies. A `/health` endpoint that always returns 200 is useless. It should verify that the server can reach the database, cache, and other critical dependencies.
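A framework-independent sketch of such an endpoint's logic: run one check per dependency and return 503 if any of them fails. The function names and the simulated cache failure are hypothetical. A real check would, for example, run a trivial query or ping.

```python
def health_status(checks):
    """Run each dependency check; return (http_status, details)."""
    details = {}
    for name, check in checks.items():
        try:
            check()
            details[name] = "ok"
        except Exception as exc:
            details[name] = f"failed: {exc}"
    healthy = all(value == "ok" for value in details.values())
    return (200 if healthy else 503), details


def check_database():
    pass  # placeholder: a real check might run a trivial query


def check_cache():
    raise ConnectionError("cache unreachable")  # simulated failure


status, details = health_status(
    {"database": check_database, "cache": check_cache}
)
```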
Some applications store session state on the server (in memory). If a user's requests are routed to different servers, they lose their session.
Approaches:
The better solution: Make your application stateless by storing session data in Redis or a database. This eliminates the need for sticky sessions entirely, because any server can then handle any request.
When sticky sessions are still useful: WebSocket connections (inherently sticky to a server), servers with large local caches that would be expensive to replicate.
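The stateless approach can be illustrated with two application servers that share one session store. In this sketch a plain dictionary stands in for Redis or a database. The `SessionStore` and `AppServer` classes are hypothetical.

```python
class SessionStore:
    """Stand-in for Redis or a database; a plain dict in this sketch."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, session):
        self._data[session_id] = session


class AppServer:
    def __init__(self, name, store):
        self.name = name
        self.store = store  # shared; no session state lives on the server

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.put(session_id, {"cart": cart})
        return cart


store = SessionStore()
server_a = AppServer("A", store)
server_b = AppServer("B", store)
server_a.add_to_cart("abc123", "book")        # first request lands on A
cart = server_b.add_to_cart("abc123", "pen")  # next request lands on B
```

The second request sees the item added by the first, even though a different server handled it.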
The load balancer itself can become a bottleneck or single point of failure. Here is how to address that.
Use DNS to return multiple IP addresses (one per load balancer). Clients randomly choose one. This distributes traffic across multiple LBs.
DNS query: api.example.com
Response: [52.1.1.1, 52.1.1.2, 52.1.1.3] (3 load balancers)
Client randomly picks one.
Pros: Simple, widely supported.
Cons: DNS TTL means changes are slow to propagate. No real health checking. Uneven distribution.
Active-passive: Two load balancers share a virtual IP (VIP). The active LB handles all traffic. If it fails, the passive LB takes over the VIP via a protocol like VRRP.
[Active LB] (owns VIP 10.0.0.1) <-> [Passive LB] (standby)
     |                                    |
     v                                    v
[Server Pool]                  (takes over VIP on failure)
Active-active: Multiple LBs are active simultaneously, each handling a portion of traffic. This is the most scalable approach.
     [DNS / Anycast]
            |
[LB 1]   [LB 2]   [LB 3]
    \       |       /
      [Server Pool]
Cloud solutions: AWS ELB/ALB/NLB are managed load balancers that are inherently highly available and auto-scale. You never manage the LB instances directly.
For multi-region deployments, global server load balancing (GSLB) routes users to the nearest datacenter using DNS-based geographic routing or anycast.
User in Tokyo -> DNS resolves to Tokyo LB -> Tokyo servers
User in London -> DNS resolves to London LB -> London servers
Implemented via: Route 53 latency-based routing (AWS), Cloudflare load balancing, Akamai GTM.
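The routing decision itself can be sketched as "pick the healthy region with the lowest latency for this user". The region names and latency figures below are made up for illustration, and `pick_region` is a hypothetical helper, not an API of any of the services named above.

```python
def pick_region(latencies_ms, healthy_regions):
    """Return the healthy region with the lowest measured latency."""
    candidates = {
        region: ms
        for region, ms in latencies_ms.items()
        if region in healthy_regions
    }
    if not candidates:
        raise LookupError("no healthy region available")
    return min(candidates, key=candidates.get)


# Illustrative latencies as seen from a user in Tokyo (made-up numbers).
tokyo_user = {"tokyo": 8, "london": 230, "virginia": 160}

normal = pick_region(tokyo_user, {"tokyo", "london", "virginia"})
failover = pick_region(tokyo_user, {"london", "virginia"})
```

When the Tokyo region is unhealthy, the same user is sent to the next-closest region instead.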
In a microservices architecture, you need load balancing at multiple levels:
External LB: Distributes client traffic to your edge (API gateway or front-end servers).
Internal LB / Service Mesh: Distributes traffic between microservices. Often done client-side (the calling service has a list of target instances and does the load balancing itself) or via a sidecar proxy (for example, Envoy in Istio or linkerd2-proxy in Linkerd).
Client -> External LB -> API Gateway -> Internal LB -> Service A
                                                    -> Service B
                                                    -> Service C
Service discovery (Consul, Eureka, Kubernetes DNS) provides the list of healthy instances. The load balancer (or client library) uses this list.
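Client-side load balancing on top of service discovery can be sketched as below. The `ServiceRegistry` class is a toy in-memory stand-in for a system like Consul or Eureka and does not mirror their APIs. All names and addresses are hypothetical.

```python
import random


class ServiceRegistry:
    """Toy in-memory stand-in for a service discovery system."""

    def __init__(self):
        self._instances = {}

    def register(self, service, address):
        self._instances.setdefault(service, set()).add(address)

    def deregister(self, service, address):
        self._instances.get(service, set()).discard(address)

    def healthy_instances(self, service):
        return sorted(self._instances.get(service, set()))


class ClientSideBalancer:
    def __init__(self, registry, rng=None):
        self.registry = registry
        self.rng = rng or random.Random()

    def pick(self, service):
        # Fetch the current instance list on every call so that
        # deregistered instances stop receiving traffic.
        instances = self.registry.healthy_instances(service)
        if not instances:
            raise LookupError(f"no healthy instances for {service}")
        return self.rng.choice(instances)


registry = ServiceRegistry()
registry.register("service-a", "10.0.1.5:8080")
registry.register("service-a", "10.0.1.6:8080")
balancer = ClientSideBalancer(registry)
registry.deregister("service-a", "10.0.1.5:8080")
target = balancer.pick("service-a")
```

Random choice is used here for brevity. Any of the algorithms described earlier could be applied to the instance list instead.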
Always include a load balancer in your design. It is the first thing you draw after the client. Even if the interviewer does not ask about it explicitly, having an LB shows you are thinking about scalability and availability.
Specify L4 vs L7. Do not just say "load balancer." Say "an L7 load balancer like ALB so we can route /api and /web traffic to different server pools." This shows depth.
Discuss the algorithm. For most web services, round robin or least connections is fine. If you need session affinity or are building a caching layer, mention consistent hashing. Justify your choice.
Address the LB as a SPOF. If the interviewer asks about single points of failure, explain how you would make the LB highly available (active-passive pair, or use a managed cloud LB).
Health checks matter. Mention both active and passive health checks. Explain that the health endpoint should verify the server's ability to serve requests (not just return 200 unconditionally).
For global systems: Mention GSLB or DNS-based geographic routing to direct users to the nearest datacenter. This dramatically reduces latency for international users.
Know the cloud offerings: AWS has ALB (L7), NLB (L4), and CLB (legacy). GCP has its own equivalents. Being able to name the right managed service shows practical knowledge.