Data Storage
Leader-follower, multi-leader, and leaderless replication. How to achieve high availability, handle failover, and manage replication lag.
Replication is the process of keeping copies of the same data on multiple machines. It serves three purposes: high availability (the system keeps working when a machine dies), low latency (serve reads from a replica near the user), and read scalability (spread read load across multiple machines).
Every production database uses replication, so this topic comes up in virtually every system design interview. The key is understanding the tradeoffs between different replication strategies and what happens when things go wrong.
One node is designated the leader (primary). All writes go to the leader. The leader sends a replication stream to one or more followers (replicas). Reads can go to any replica.
Writes
|
v
[Leader] --replication--> [Follower 1]
|
+---replication--> [Follower 2]
|
+---replication--> [Follower 3]
Reads can go to Leader or any FollowerThis is the most common setup. PostgreSQL, MySQL, MongoDB, and Redis all support it natively.
Pros: Simple, well-understood, no write conflicts. Cons: The leader is a single point of failure for writes. All writes bottleneck through one node.
Multiple nodes accept writes. Each leader replicates its changes to all other leaders. This is common in multi-datacenter setups where you want writes to be fast in each region.
Datacenter A Datacenter B
[Leader A] <--sync--> [Leader B]
| |
[Follower] [Follower]Pros: Write availability in each datacenter, lower write latency for local users. Cons: Write conflicts are inevitable. If two leaders modify the same row simultaneously, you need a conflict resolution strategy (last-write-wins, merge, or application-level resolution).
When to use: Multi-datacenter deployments, collaborative editing (Google Docs uses a form of this), offline-capable applications.
There is no leader. Any node can accept reads and writes. The client sends writes to multiple replicas simultaneously and reads from multiple replicas, using quorum rules to determine correctness.
Client writes to N replicas, requires W acknowledgments.
Client reads from N replicas, requires R responses.
As long as W + R > N, you're guaranteed to read the latest write.Example configurations (N=3):
Used by: Amazon DynamoDB, Apache Cassandra, Riak.
Pros: No single point of failure, high write availability. Cons: Complex conflict resolution, potential for stale reads if quorum is not met, harder to reason about consistency.
The leader waits for the follower to confirm it has written the data before acknowledging the write to the client.
Client -> Leader: Write X=5
Leader -> Follower: Replicate X=5
Follower -> Leader: ACK
Leader -> Client: Write confirmedGuarantee: If the leader dies, the follower is guaranteed to have the latest data. Zero data loss.
Cost: Every write is as slow as the slowest replica. If a follower is across the country (50ms round trip), every write takes at least 50ms. If a follower is down, writes stall entirely.
In practice: Fully synchronous replication is rarely used with more than one synchronous follower. A common compromise is semi-synchronous: one follower is synchronous (guaranteeing at least one up-to-date replica), and the rest are asynchronous.
The leader acknowledges the write immediately and replicates to followers in the background.
Client -> Leader: Write X=5
Leader -> Client: Write confirmed (immediately)
Leader -> Follower: Replicate X=5 (background)Guarantee: None — if the leader dies before replication completes, data is lost.
Benefit: Fast writes, no dependency on replica health. The system stays available even if followers lag or go down.
This is the default for most systems including PostgreSQL streaming replication, MySQL replication, and MongoDB replica sets. The tradeoff of potential data loss is usually acceptable for the performance gain.
With asynchronous replication, followers are always some amount of time behind the leader. This delay is called replication lag. Under normal conditions it is milliseconds, but under load it can grow to seconds or even minutes.
Read-after-write inconsistency: A user writes data via the leader, then reads from a follower that has not yet received the write. The user sees stale data — their own write appears to be lost.
User: POST /comment "Great article!" -> Leader (write succeeds)
User: GET /comments -> Follower (doesn't have it yet)
User sees: no comment. Confused.Solution — read-your-writes consistency:
Monotonic reads problem: A user makes two reads, hitting different replicas. The second replica is more behind than the first, so the user sees data go "backward in time."
Solution: Ensure each user always reads from the same replica (session affinity / sticky sessions).
Causal ordering violations: User A posts a message, User B reads it and replies. But a third user sees the reply before the original message because they are on a lagging replica.
Solution: Use causal consistency mechanisms or version vectors.
Failover is the process of promoting a follower to become the new leader when the current leader fails. This is the most dangerous moment in a replicated system.
Before: Client -> [Leader] -> [Follower A, Follower B]
Leader dies.
After: Client -> [Follower A (promoted)] -> [Follower B]Data loss on failover. If replication was asynchronous, the promoted follower may be missing recent writes. Those writes are permanently lost. When the old leader comes back, it has writes that no other node has — these are typically discarded.
Split-brain. Both the old leader and the new leader believe they are the leader and accept writes independently. This is catastrophic — data diverges and you may have conflicting writes that are extremely difficult to reconcile.
Network partition:
[Old Leader] <-- clients on side A
X (partition)
[New Leader] <-- clients on side B
Both accept writes. Data diverges.Mitigation strategies for split-brain:
Cascading failures. If the failover process is misconfigured, a leader failure can trigger a chain reaction. For example, GitHub experienced a major outage in 2018 when a failover caused MySQL replication to break in a cascading manner across multiple shards.
Many organizations prefer manual failover for critical databases despite the slower recovery time. The reasoning is that automatic failover can trigger incorrectly (e.g., due to a network blip rather than a real failure) and cause more damage than the original problem.
Netflix approach: They use automated failover but with extensive testing. Their Chaos Engineering practice (Chaos Monkey) regularly kills instances in production to ensure failover works correctly.
Uses streaming replication (async by default, can be configured synchronous). Supports hot standby (read queries on replicas). Failover tools: Patroni, pg_auto_failover, or cloud-managed (AWS RDS Multi-AZ).
Uses binary log (binlog) replication. Supports GTID-based replication for reliable failover. Orchestrator is a popular tool for managing MySQL failover. In the cloud, Aurora MySQL handles replication and failover automatically.
Uses replica sets — a group of nodes that maintain the same data. The primary accepts writes, secondaries replicate asynchronously. Automatic failover via election (Raft-like protocol). If the primary is unreachable, secondaries hold an election within ~10 seconds.
Uses asynchronous replication by default with Redis Sentinel for automatic failover. Redis can lose the most recent writes on failover (the writes that had not been replicated). The `WAIT` command can force synchronous replication for critical writes.
| Strategy | Consistency | Availability | Write Latency | Complexity |
|---|---|---|---|---|
| Single leader, sync | Strong | Lower (leader failure stalls writes) | Higher | Low |
| Single leader, async | Eventual | Good | Low | Low |
| Multi-leader | Eventual (conflicts) | Highest | Low (local writes) | High |
| Leaderless (quorum) | Tunable | High | Depends on W | High |
Always mention the CAP theorem connection. Replication is where the CAP theorem becomes real. In a network partition, you must choose between consistency (reject writes if you cannot reach replicas) and availability (accept writes on whatever nodes are reachable).
Discuss replication lag proactively. If you propose read replicas, immediately address how you handle read-after-write consistency. This shows depth.
Know the failover risks. Interviewers love asking "what happens when the primary goes down?" The answer involves data loss risk, split-brain, and recovery time. Do not just say "the replica takes over."
Be specific about numbers. Replication lag under normal conditions is typically 10–100ms. Failover typically takes 10–30 seconds for automatic systems, minutes for manual. During failover, writes are unavailable.
Quorum math matters. For leaderless replication, know that `W + R > N` guarantees reading the latest write. Be prepared to discuss what happens when this condition is not met (stale reads, but higher availability).
Start simple. In an interview, start with single-leader async replication (the most common setup) and add complexity only when the requirements demand it (multi-region -> multi-leader, zero-downtime -> synchronous replication).