Scalability Patterns

Load Balancing

Distribute traffic across servers using round-robin, least connections, consistent hashing, and weighted algorithms. L4 vs L7 load balancing.

A load balancer distributes incoming network traffic across multiple servers so that no single server bears too much load. It is one of the first components you draw in any system design diagram, sitting between the client and your application servers.

Load balancing is essential for scalability (handle more traffic by adding servers), availability (if one server dies, traffic is routed to healthy ones), and performance (distribute load evenly so no server becomes a bottleneck).

L4 vs L7 Load Balancing

The "L4" and "L7" refer to the OSI model layers at which the load balancer operates. This distinction has major practical implications.

Layer 4 (Transport Layer)

An L4 load balancer routes traffic based on IP address and TCP/UDP port. It does not inspect the contents of the packets — it just looks at the source/destination IP and port to make routing decisions.

Client -> L4 Load Balancer -> Backend Server
                (routes based on IP:port)

The LB sees: source=1.2.3.4:50123, dest=lb.example.com:443
It forwards the entire TCP connection to a chosen backend.
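
As a rough illustration of routing on connection metadata alone, the sketch below picks a backend by hashing the source and destination address and port. The `pick_backend` function, the backend addresses, and the use of a flow hash are assumptions made for this example; real L4 balancers differ in how they select and track flows.

```python
import hashlib

# Hypothetical backend pool; the addresses are placeholders.
BACKENDS = ["10.0.1.10:443", "10.0.1.11:443", "10.0.1.12:443"]

def pick_backend(src_ip, src_port, dst_ip, dst_port, backends=BACKENDS):
    """Choose a backend from connection metadata only (no payload inspection)."""
    flow = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(flow).digest()
    # The same flow always hashes to the same backend, so every packet
    # of one TCP connection lands on one server.
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]

# 203.0.113.10 stands in for the load balancer's own address.
backend = pick_backend("1.2.3.4", 50123, "203.0.113.10", 443)
```

Nothing in the function looks at HTTP, which is why this style of routing stays cheap.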

Pros: Extremely fast — no need to parse HTTP headers or decrypt TLS. Lower latency and higher throughput. Simple to operate.

Cons: Cannot see URL paths, headers, cookies, or request bodies, so it cannot do content-based routing, header injection, or request modification.

Examples: AWS Network Load Balancer (NLB), HAProxy in TCP mode, IPVS (Linux Virtual Server).

Layer 7 (Application Layer)

An L7 load balancer inspects the full HTTP request — URL path, headers, cookies, query parameters, even the request body. It can make sophisticated routing decisions.

Client -> L7 Load Balancer -> Backend Server
              (routes based on HTTP content)

The LB sees the full HTTP request:
  GET /api/users/123 HTTP/1.1
  Host: api.example.com
  Cookie: session=abc123

It can route /api/* to API servers and /static/* to CDN origins.
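
A minimal sketch of that path-based routing decision might look like the following. The pool names, addresses, and the `choose_pool` helper are hypothetical; production balancers express the same idea in configuration rather than application code.

```python
# Hypothetical routing table: (path prefix, backend pool). Checked in order.
ROUTES = [
    ("/api/", ["api-1:8080", "api-2:8080"]),
    ("/static/", ["static-1:8080"]),
]
# Pool used when no prefix matches.
DEFAULT_POOL = ["web-1:8080", "web-2:8080"]

def choose_pool(path):
    """Return the backend pool for a request path using first-match prefix routing."""
    for prefix, pool in ROUTES:
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL

pool = choose_pool("/api/users/123")
```

An algorithm such as round robin would then pick one server from the returned pool.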

Pros:

Content-based routing (e.g., route `/api` to API servers, `/images` to image servers)
Can terminate TLS and offload encryption from backends
Can inject/modify headers (e.g., add `X-Request-Id`, `X-Forwarded-For`)
Can compress responses, cache responses, rate limit
Can do A/B testing by routing based on cookies or headers

Cons: Higher latency (must parse HTTP), lower throughput, more complex to operate. Must decrypt TLS to inspect content.

Examples: AWS Application Load Balancer (ALB), Nginx, HAProxy in HTTP mode, Envoy.

Which to Choose?

Need content-based routing?     -> L7
Need TLS termination at LB?     -> L7
Need WebSocket support?          -> L7 (or L4 with passthrough)
Need maximum throughput?          -> L4
Need simplest configuration?     -> L4
Microservices with path routing? -> L7

In practice, most web applications use L7 load balancing because the routing flexibility is more valuable than the marginal performance difference.

Load Balancing Algorithms

Round Robin

Requests are distributed to servers sequentially: server 1, server 2, server 3, server 1, server 2, ...

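class RoundRobinBalancer:
    """Cycle through the servers in a fixed order."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self.servers = list(servers)  # copy so caller-side changes do not shift the cycle
        self.index = 0  # position of the next server to hand out

    def next_server(self):
        server = self.servers[self.index]
        # Wrap around to the first server after the last one.
        self.index = (self.index + 1) % len(self.servers)
        return server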

Pros: Simple, fair distribution if all servers and requests are equal. Cons: Ignores server capacity and current load. Slow requests on server 1 do not change the rotation, so server 1 keeps receiving new requests even when it is overloaded.

Weighted Round Robin

Like round robin, but servers have weights reflecting their capacity. A server with weight 3 gets 3x the traffic of a server with weight 1.

Server A (weight 3): gets 3 out of every 6 requests
Server B (weight 2): gets 2 out of every 6 requests
Server C (weight 1): gets 1 out of every 6 requests

Use case: When your server fleet is heterogeneous (e.g., mixing instance types in the cloud).
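
One simple way to sketch this is to repeat each server in the rotation as many times as its weight. The `WeightedRoundRobinBalancer` class below is an illustrative assumption, not how any specific product implements weighting.

```python
import itertools

class WeightedRoundRobinBalancer:
    """Round robin over a list in which each server appears `weight` times."""

    def __init__(self, weights):
        # weights: dict of server -> positive integer weight
        expanded = [server for server, w in weights.items() for _ in range(w)]
        if not expanded:
            raise ValueError("at least one server with a positive weight is required")
        self._cycle = itertools.cycle(expanded)

    def next_server(self):
        return next(self._cycle)

balancer = WeightedRoundRobinBalancer({"A": 3, "B": 2, "C": 1})
first_six = [balancer.next_server() for _ in range(6)]
```

This version hands out A, A, A, B, B, C, which matches the 3:2:1 ratio but sends requests to each server in bursts. Some balancers interleave the picks instead (Nginx uses a "smooth" weighted round robin) so that the same ratio is spread more evenly over time.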

Least Connections

Route each new request to the server with the fewest active connections. This naturally adapts to server load — a slow server accumulates connections and gets fewer new requests.

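class LeastConnectionsBalancer:
    """Pick the server with the fewest in-flight connections."""

    def __init__(self, servers):
        # Map each server to its current number of active connections.
        self.servers = {s: 0 for s in servers}

    def next_server(self):
        # Ties go to the server that was listed first.
        server = min(self.servers, key=self.servers.get)
        self.servers[server] += 1
        return server

    def release(self, server):
        # Call when a connection finishes; never let the count go negative.
        if self.servers[server] > 0:
            self.servers[server] -= 1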

Pros: Adapts to varying request durations. Better than round robin when requests have different processing times. Cons: Slightly more overhead (must track active connections). Does not account for server capacity differences (use weighted least connections for that).

Best for: APIs with variable response times, WebSocket connections (long-lived), any workload where request duration varies significantly.

Consistent Hashing

Hash the request (or a key from the request, like user ID) and map it to a position on a hash ring. Route to the first server found moving clockwise from that position, wrapping around at the end of the ring.

        Server A (pos 90)
       /                \
   (pos 0)           (pos 180)
      |                  |
   Server D          Server B
       \                /
        Server C (pos 270)

hash("user:123") = 150 -> routes to Server B (next server clockwise, at pos 180)
hash("user:456") = 300 -> routes to Server D (next server clockwise, wrapping around to pos 0)

Pros: The same user always goes to the same server (great for caching). When a server is added or removed, only a fraction of requests are remapped (approximately 1/N).

Cons: Can be uneven without virtual nodes. More complex to implement.

Use case: Caching layers (each server caches a subset of data), stateful services, distributed systems that need request affinity.
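
The ring lookup and the virtual-node fix for uneven distribution can be sketched as follows. The `ConsistentHashRing` class, the choice of SHA-256, and the default of 100 virtual nodes per server are assumptions for illustration only.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes; keys map to the next node clockwise."""

    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, server)
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add(self, server):
        # Each server gets many positions so load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server):
        self._ring = [(pos, s) for pos, s in self._ring if s != server]

    def get(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node at or after the key's position, wrapping to index 0.
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]

ring = ConsistentHashRing(["A", "B", "C", "D"])
owner = ring.get("user:123")
```

Removing a server only reassigns the keys that server owned; every other key keeps its current owner, which is the property that makes this approach attractive for caches.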

IP Hash

Hash the client's IP address to determine the server. Ensures the same client always reaches the same server.

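Pros: Simple sticky sessions without cookies. Cons: Uneven distribution if traffic comes from a small number of IPs (e.g., corporate NAT). Affinity also breaks when a client's address changes or when requests arrive through proxies that present different addresses, as can happen with IPv6.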
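
A minimal sketch of IP hashing, assuming a hypothetical `server_for_ip` helper and a stable hash from `hashlib` (Python's built-in `hash()` is salted per process for strings, so it would not give repeatable results across restarts):

```python
import hashlib

def server_for_ip(client_ip, servers):
    """Map a client IP to a server with a stable hash and a modulo."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]

# Placeholder server names and a documentation-range client address.
servers = ["web-1", "web-2", "web-3"]
chosen = server_for_ip("203.0.113.7", servers)
```

Unlike consistent hashing, the modulo step means that changing the number of servers remaps most clients.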

Least Response Time

Route to the server with the lowest average response time. More sophisticated than least connections because it accounts for actual performance rather than just connection count.

Pros: Adapts to real server performance. Cons: Requires tracking response times, which adds overhead. Can oscillate if latency fluctuates.
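
One way to track the average is an exponentially weighted moving average (EWMA), which favors recent samples. The class below is a sketch under that assumption; the `alpha` value and the choice of EWMA are illustrative and not taken from any particular balancer.

```python
class LeastResponseTimeBalancer:
    """Route to the server with the lowest smoothed response time."""

    def __init__(self, servers, alpha=0.2):
        self.alpha = alpha  # weight given to the newest sample
        # Start at 0.0 so every server looks fast until measured.
        self.avg = {s: 0.0 for s in servers}

    def next_server(self):
        return min(self.avg, key=self.avg.get)

    def record(self, server, seconds):
        # Blend the new sample into the running average.
        old = self.avg[server]
        self.avg[server] = (1 - self.alpha) * old + self.alpha * seconds

lb = LeastResponseTimeBalancer(["a", "b"])
lb.record("a", 0.5)
lb.record("b", 0.1)
fastest = lb.next_server()
```

A smaller `alpha` smooths out spikes and reduces oscillation, at the cost of reacting more slowly to a server that really has become slow.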

Health Checks

A load balancer must know which servers are healthy. Unhealthy servers should be removed from the pool.

Passive Health Checks

The LB monitors responses from backends. If a server returns too many errors (e.g., 5xx) or connections time out, it is marked unhealthy.

If server returns 3 consecutive 5xx errors -> mark unhealthy
After 30 seconds -> try again -> if healthy, add back to pool
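
The rule above can be sketched as a small tracker that the balancer updates after every response. The class name, the injectable `clock`, and the exact recovery behavior (let traffic through again after the cooldown and re-eject on the next failure) are assumptions for this example.

```python
import time

class PassiveHealthTracker:
    """Eject a server after consecutive 5xx responses; retry after a cooldown."""

    def __init__(self, max_failures=3, cooldown=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = {}    # server -> consecutive failure count
        self.ejected_at = {}  # server -> time it was marked unhealthy

    def record(self, server, status_code):
        if 500 <= status_code <= 599:
            self.failures[server] = self.failures.get(server, 0) + 1
            if self.failures[server] >= self.max_failures:
                self.ejected_at[server] = self.clock()
        else:
            # Any non-5xx response resets the streak and restores the server.
            self.failures[server] = 0
            self.ejected_at.pop(server, None)

    def is_available(self, server):
        ejected = self.ejected_at.get(server)
        if ejected is None:
            return True
        # After the cooldown, allow traffic again as a trial.
        return self.clock() - ejected >= self.cooldown
```

Timeouts could be fed into `record` as failures in the same way; they are left out here to keep the sketch short.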

Active Health Checks

The LB periodically sends probe requests to each backend (e.g., GET /health every 10 seconds).

LB -> GET /health -> Server A returns 200 -> healthy
LB -> GET /health -> Server B returns 503 -> unhealthy (remove from pool)
LB -> GET /health -> Server C timeout     -> unhealthy (remove from pool)
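
A single round of those probes might be sketched like this, using only the standard library. The `http_probe` and `run_health_checks` helpers, the plain-HTTP URL scheme, and the 2-second timeout are assumptions; a real balancer would run this on a timer and usually require several consecutive results before changing a server's state.

```python
import urllib.error
import urllib.request

def http_probe(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False

def run_health_checks(servers, probe=http_probe):
    """Probe every server once and return the set of healthy ones."""
    return {s for s in servers if probe(f"http://{s}/health")}
```

Passing a different `probe` makes the loop easy to exercise without a network, for example `run_health_checks(["a:8080"], probe=lambda url: True)`.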

Best practice: Use both active and passive health checks. Active checks detect problems even when there is no traffic. Passive checks detect problems faster under load.

The health endpoint should check dependencies. A `/health` endpoint that always returns 200 is useless. It should verify that the server can reach the database, cache, and other critical dependencies.
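
As a sketch of what such an endpoint could compute, the function below runs one check per dependency and returns 503 if any of them fails. The `health_status` name and the shape of the `checks` mapping are hypothetical; wiring it into a web framework is left out.

```python
def health_status(checks):
    """Run each dependency check and return (http_status, details).

    checks: dict of dependency name -> zero-argument callable returning True/False.
    """
    details = {}
    for name, check in checks.items():
        try:
            details[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check counts as unhealthy
            details[name] = f"error: {exc}"
    status = 200 if all(v == "ok" for v in details.values()) else 503
    return status, details

# Stand-in checks; real ones would ping the database, cache, and so on.
status, details = health_status({"database": lambda: True, "cache": lambda: False})
```

Each check should be cheap and have its own short timeout so the health endpoint itself does not hang.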

Sticky Sessions (Session Affinity)

Some applications store session state on the server (in memory). If a user's requests are routed to different servers, they lose their session.

Approaches:

Cookie-based: The LB inserts a cookie identifying the backend server. Subsequent requests from the same user are routed to the same server.
IP-based: Route based on client IP hash (fragile — breaks with NAT and mobile networks).

The better solution: Make your application stateless by storing session data in Redis or a database. This eliminates the need for sticky sessions entirely, so any server can handle any request and the load balancer is free to use whichever algorithm spreads load best.

When sticky sessions are still useful: WebSocket connections (inherently sticky to a server), servers with large local caches that would be expensive to replicate.
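
The cookie-based approach described above can be sketched as a routing function that honors an existing cookie when the pinned server is still healthy and otherwise picks a new one. The cookie name `lb_backend` and the `route_with_affinity` helper are made up for this example; real balancers usually store an opaque or encoded identifier rather than the server name.

```python
import random

COOKIE_NAME = "lb_backend"  # hypothetical cookie name

def route_with_affinity(cookies, healthy_servers, rng=random):
    """Return (server, set_cookie).

    set_cookie is None when the request's existing cookie was honored,
    otherwise a (name, value) pair the balancer should add to the response.
    """
    pinned = cookies.get(COOKIE_NAME)
    if pinned in healthy_servers:
        return pinned, None
    # No cookie, or the pinned server is gone: pick a new one and pin it.
    server = rng.choice(healthy_servers)
    return server, (COOKIE_NAME, server)

servers = ["web-1", "web-2"]
first_server, cookie = route_with_affinity({}, servers)
```

The fallback branch is what causes a user to lose in-memory session state when their server fails, which is the main argument for externalizing sessions.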

Scaling the Load Balancer

The load balancer itself can become a bottleneck or single point of failure. Here is how to address that.

DNS-Based Load Balancing

Use DNS to return multiple IP addresses (one per load balancer). Clients randomly choose one. This distributes traffic across multiple LBs.

DNS query: api.example.com
Response: [52.1.1.1, 52.1.1.2, 52.1.1.3]  (3 load balancers)
Client randomly picks one.

Pros: Simple, widely supported. Cons: DNS TTL means changes are slow to propagate. No real health checking. Uneven distribution.
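
Because DNS itself does little health checking, clients often compensate by trying another address when one fails. The sketch below illustrates that idea; `connect_via_dns` and the `try_connect` callback are hypothetical stand-ins for a real resolver and socket connection.

```python
import random

def connect_via_dns(addresses, try_connect, rng=random):
    """Try resolved addresses in random order; return the first that accepts."""
    candidates = list(addresses)
    rng.shuffle(candidates)
    for addr in candidates:
        if try_connect(addr):
            return addr
    return None  # every load balancer was unreachable

resolved = ["52.1.1.1", "52.1.1.2", "52.1.1.3"]
# Pretend the first load balancer is down.
picked = connect_via_dns(resolved, lambda addr: addr != "52.1.1.1")
```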

Active-Passive LB Pair

Two load balancers share a virtual IP (VIP). The active LB handles all traffic. If it fails, the passive LB takes over the VIP via a protocol like VRRP.

[Active LB] (owns VIP 10.0.0.1)  <-> [Passive LB] (standby)
      |                                      |
      v                                      v
  [Server Pool]                          (takes over VIP on failure)
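
In VRRP, the live router with the highest priority owns the virtual IP. The function below models only that election outcome; the node names and priorities are invented, and the real protocol relies on periodic advertisements and timers that are not shown here.

```python
def elect_active(nodes):
    """Return the node that should own the VIP, or None if all are down.

    nodes: dict of name -> (priority, is_alive)
    """
    alive = {name: prio for name, (prio, ok) in nodes.items() if ok}
    if not alive:
        return None
    # The highest-priority live node wins.
    return max(alive, key=alive.get)

nodes = {"lb-1": (200, True), "lb-2": (100, True)}
active = elect_active(nodes)          # lb-1 while it is alive
nodes["lb-1"] = (200, False)          # simulate a failure of the active LB
after_failover = elect_active(nodes)  # lb-2 takes over the VIP
```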

Active-Active LB

Multiple LBs are active simultaneously, each handling a portion of traffic. This is the most scalable approach.

[DNS / Anycast]
      |
   [LB 1]  [LB 2]  [LB 3]
      \      |      /
       [Server Pool]

Cloud solutions: AWS Elastic Load Balancing (ALB and NLB) provides managed load balancers that are highly available by design and scale automatically. You never manage the LB instances directly.

Global Server Load Balancing (GSLB)

For multi-region deployments, GSLB routes users to the nearest datacenter using DNS-based geographic routing or anycast.

User in Tokyo  -> DNS resolves to Tokyo LB  -> Tokyo servers
User in London -> DNS resolves to London LB -> London servers

Implemented via: Route 53 latency-based routing (AWS), Cloudflare load balancing, Akamai GTM.
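
The routing decision itself can be sketched as "pick the healthy region with the lowest measured latency for this user". The `nearest_region` helper and the latency figures are illustrative assumptions; managed services make this decision inside their DNS or anycast layer.

```python
def nearest_region(latencies_ms, healthy_regions):
    """Return the healthy region with the lowest latency, or None."""
    candidates = {r: ms for r, ms in latencies_ms.items() if r in healthy_regions}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)

# Hypothetical measurements for a user in Tokyo.
latencies = {"tokyo": 12, "london": 230}
region = nearest_region(latencies, {"tokyo", "london"})
fallback = nearest_region(latencies, {"london"})  # Tokyo is down
```

Filtering by health first is what lets GSLB double as regional failover.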

Load Balancing in Microservices

In a microservices architecture, you need load balancing at multiple levels:

External LB: Distributes client traffic to your edge (API gateway or front-end servers).

Internal LB / Service Mesh: Distributes traffic between microservices. Often done client-side (the calling service has a list of target instances and does the load balancing itself) or via a sidecar proxy (for example, Envoy in Istio or linkerd2-proxy in Linkerd).

Client -> External LB -> API Gateway -> Internal LB -> Service A
                                                    -> Service B
                                                    -> Service C

Service discovery (Consul, Eureka, Kubernetes DNS) provides the list of healthy instances. The load balancer (or client library) uses this list.
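
A client-side balancer built on top of service discovery can be sketched as follows. The `ClientSideBalancer` class and the `discover` callback are hypothetical; in practice the callback would query Consul, Eureka, or Kubernetes DNS, and here it reads from an in-memory dictionary.

```python
import itertools

class ClientSideBalancer:
    """Round robin over whatever instances discovery currently reports."""

    def __init__(self, discover):
        self.discover = discover  # callable: service name -> list of healthy instances
        self._counter = itertools.count()

    def next_instance(self, service):
        # Ask discovery on every call so new and removed instances are picked up.
        instances = self.discover(service)
        if not instances:
            raise LookupError(f"no healthy instances for {service}")
        return instances[next(self._counter) % len(instances)]

# In-memory stand-in for a service registry.
registry = {"service-a": ["10.0.0.1:8080", "10.0.0.2:8080"]}
lb = ClientSideBalancer(lambda name: registry.get(name, []))
picks = [lb.next_instance("service-a") for _ in range(4)]
```

A real client library would cache the instance list and refresh it in the background instead of querying discovery on every request.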

Interview Tips

Always include a load balancer in your design. It is the first thing you draw after the client. Even if the interviewer does not ask about it explicitly, having an LB shows you are thinking about scalability and availability.

Specify L4 vs L7. Do not just say "load balancer." Say "an L7 load balancer like ALB so we can route /api and /web traffic to different server pools." This shows depth.

Discuss the algorithm. For most web services, round robin or least connections is fine. If you need session affinity or are building a caching layer, mention consistent hashing. Justify your choice.

Address the LB as a SPOF. If the interviewer asks about single points of failure, explain how you would make the LB highly available (active-passive pair, or use a managed cloud LB).

Health checks matter. Mention both active and passive health checks. Explain that the health endpoint should verify the server's ability to serve requests (not just return 200 unconditionally).

For global systems: Mention GSLB or DNS-based geographic routing to direct users to the nearest datacenter. This dramatically reduces latency for international users.

Know the cloud offerings: AWS has ALB (L7), NLB (L4), and CLB (legacy). GCP has its own equivalents. Being able to name the right managed service shows practical knowledge.