Data Storage

Replication and Failover

Leader-follower, multi-leader, and leaderless replication. How to achieve high availability, handle failover, and manage replication lag.

Replication and Failover

Replication is the process of keeping copies of the same data on multiple machines. It serves three purposes: high availability (the system keeps working when a machine dies), low latency (serve reads from a replica near the user), and read scalability (spread read load across multiple machines).

Every production database uses replication, so this topic comes up in virtually every system design interview. The key is understanding the tradeoffs between different replication strategies and what happens when things go wrong.

Replication Topologies

Leader-Follower (Primary-Replica)

One node is designated the leader (primary). All writes go to the leader. The leader sends a replication stream to one or more followers (replicas). Reads can go to any replica.

     Writes
       |
       v
   [Leader] --replication--> [Follower 1]
       |
       +---replication--> [Follower 2]
       |
       +---replication--> [Follower 3]

   Reads can go to Leader or any Follower

This is the most common setup. PostgreSQL, MySQL, MongoDB, and Redis all support it natively.

Pros: Simple, well-understood, no write conflicts. Cons: The leader is a single point of failure for writes. All writes bottleneck through one node.

Multi-Leader (Active-Active)

Multiple nodes accept writes. Each leader replicates its changes to all other leaders. This is common in multi-datacenter setups where you want writes to be fast in each region.

1

2

3

4

   Datacenter A          Datacenter B
   [Leader A] <--sync--> [Leader B]
       |                      |
   [Follower]            [Follower]

Pros: Write availability in each datacenter, lower write latency for local users. Cons: Write conflicts are inevitable. If two leaders modify the same row simultaneously, you need a conflict resolution strategy (last-write-wins, merge, or application-level resolution).

When to use: Multi-datacenter deployments, collaborative editing (Google Docs uses a form of this), offline-capable applications.

Leaderless (Dynamo-Style)

There is no leader. Any node can accept reads and writes. The client sends writes to multiple replicas simultaneously and reads from multiple replicas, using quorum rules to determine correctness.

1

2

3

Client writes to N replicas, requires W acknowledgments.
Client reads from N replicas, requires R responses.
As long as W + R > N, you're guaranteed to read the latest write.

Example configurations (N=3):

W=2, R=2: balanced read/write latency
W=3, R=1: fast reads, slower writes
W=1, R=3: fast writes, slower reads

Used by: Amazon DynamoDB, Apache Cassandra, Riak.

Pros: No single point of failure, high write availability. Cons: Complex conflict resolution, potential for stale reads if quorum is not met, harder to reason about consistency.

Synchronous vs Asynchronous Replication

Synchronous Replication

The leader waits for the follower to confirm it has written the data before acknowledging the write to the client.

1

2

3

4

Client -> Leader: Write X=5
Leader -> Follower: Replicate X=5
Follower -> Leader: ACK
Leader -> Client: Write confirmed

Guarantee: If the leader dies, the follower is guaranteed to have the latest data. Zero data loss.

Cost: Every write is as slow as the slowest replica. If a follower is across the country (50ms round trip), every write takes at least 50ms. If a follower is down, writes stall entirely.

In practice: Fully synchronous replication is rarely used with more than one synchronous follower. A common compromise is semi-synchronous: one follower is synchronous (guaranteeing at least one up-to-date replica), and the rest are asynchronous.

Asynchronous Replication

The leader acknowledges the write immediately and replicates to followers in the background.

1

2

3

Client -> Leader: Write X=5
Leader -> Client: Write confirmed (immediately)
Leader -> Follower: Replicate X=5 (background)

Guarantee: None — if the leader dies before replication completes, data is lost.

Benefit: Fast writes, no dependency on replica health. The system stays available even if followers lag or go down.

This is the default for most systems including PostgreSQL streaming replication, MySQL replication, and MongoDB replica sets. The tradeoff of potential data loss is usually acceptable for the performance gain.

Replication Lag

With asynchronous replication, followers are always some amount of time behind the leader. This delay is called replication lag. Under normal conditions it is milliseconds, but under load it can grow to seconds or even minutes.

Problems Caused by Replication Lag

Read-after-write inconsistency: A user writes data via the leader, then reads from a follower that has not yet received the write. The user sees stale data — their own write appears to be lost.

1

2

3

User: POST /comment "Great article!"  -> Leader (write succeeds)
User: GET /comments                    -> Follower (doesn't have it yet)
User sees: no comment. Confused.

Solution — read-your-writes consistency:

After a write, route subsequent reads from that user to the leader (for a short window)
Track the last write timestamp and only read from replicas that are caught up past that timestamp
Use the leader for reads of "user's own data" (e.g., profile page after editing)

Monotonic reads problem: A user makes two reads, hitting different replicas. The second replica is more behind than the first, so the user sees data go "backward in time."

Solution: Ensure each user always reads from the same replica (session affinity / sticky sessions).

Causal ordering violations: User A posts a message, User B reads it and replies. But a third user sees the reply before the original message because they are on a lagging replica.

Solution: Use causal consistency mechanisms or version vectors.

Failover

Failover is the process of promoting a follower to become the new leader when the current leader fails. This is the most dangerous moment in a replicated system.

Automatic Failover Steps

Detect that the leader has failed. Typically done via heartbeats — if the leader does not respond within a timeout (e.g., 30 seconds), it is considered dead.

Choose a new leader. The replica with the most up-to-date data is the best candidate. This can be done by consensus (Raft, Paxos) or by a dedicated failover controller.

Reconfigure the system. Clients must send writes to the new leader. Other replicas must start replicating from the new leader. The old leader (if it comes back) must be demoted to a follower.

1

2

3

Before: Client -> [Leader] -> [Follower A, Follower B]
Leader dies.
After:  Client -> [Follower A (promoted)] -> [Follower B]

What Can Go Wrong

Data loss on failover. If replication was asynchronous, the promoted follower may be missing recent writes. Those writes are permanently lost. When the old leader comes back, it has writes that no other node has — these are typically discarded.

Split-brain. Both the old leader and the new leader believe they are the leader and accept writes independently. This is catastrophic — data diverges and you may have conflicting writes that are extremely difficult to reconcile.

1

2

3

4

5

6

Network partition:
  [Old Leader] <-- clients on side A
       X (partition)
  [New Leader] <-- clients on side B

Both accept writes. Data diverges.

Mitigation strategies for split-brain:

Fencing tokens: The new leader gets a monotonically increasing token. Any write with an old token is rejected.
STONITH (Shoot The Other Node In The Head): When a new leader is elected, forcibly power off the old leader (literally, via a management interface).
Quorum-based leader election: Use Raft or Paxos so only one leader can be elected at a time, requiring a majority of nodes to agree.

Cascading failures. If the failover process is misconfigured, a leader failure can trigger a chain reaction. For example, GitHub experienced a major outage in 2018 when a failover caused MySQL replication to break in a cascading manner across multiple shards.

Manual vs Automatic Failover

Many organizations prefer manual failover for critical databases despite the slower recovery time. The reasoning is that automatic failover can trigger incorrectly (e.g., due to a network blip rather than a real failure) and cause more damage than the original problem.

Netflix approach: They use automated failover but with extensive testing. Their Chaos Engineering practice (Chaos Monkey) regularly kills instances in production to ensure failover works correctly.

Replication in Practice

PostgreSQL

Uses streaming replication (async by default, can be configured synchronous). Supports hot standby (read queries on replicas). Failover tools: Patroni, pg_auto_failover, or cloud-managed (AWS RDS Multi-AZ).

MySQL

Uses binary log (binlog) replication. Supports GTID-based replication for reliable failover. Orchestrator is a popular tool for managing MySQL failover. In the cloud, Aurora MySQL handles replication and failover automatically.

MongoDB

Uses replica sets — a group of nodes that maintain the same data. The primary accepts writes, secondaries replicate asynchronously. Automatic failover via election (Raft-like protocol). If the primary is unreachable, secondaries hold an election within ~10 seconds.

Redis

Uses asynchronous replication by default with Redis Sentinel for automatic failover. Redis can lose the most recent writes on failover (the writes that had not been replicated). The `WAIT` command can force synchronous replication for critical writes.

Trade-Off Summary

Strategy	Consistency	Availability	Write Latency	Complexity
Single leader, sync	Strong	Lower (leader failure stalls writes)	Higher	Low
Single leader, async	Eventual	Good	Low	Low
Multi-leader	Eventual (conflicts)	Highest	Low (local writes)	High
Leaderless (quorum)	Tunable	High	Depends on W	High

Interview Tips

Always mention the CAP theorem connection. Replication is where the CAP theorem becomes real. In a network partition, you must choose between consistency (reject writes if you cannot reach replicas) and availability (accept writes on whatever nodes are reachable).

Discuss replication lag proactively. If you propose read replicas, immediately address how you handle read-after-write consistency. This shows depth.

Know the failover risks. Interviewers love asking "what happens when the primary goes down?" The answer involves data loss risk, split-brain, and recovery time. Do not just say "the replica takes over."

Be specific about numbers. Replication lag under normal conditions is typically 10–100ms. Failover typically takes 10–30 seconds for automatic systems, minutes for manual. During failover, writes are unavailable.

Quorum math matters. For leaderless replication, know that `W + R > N` guarantees reading the latest write. Be prepared to discuss what happens when this condition is not met (stale reads, but higher availability).

Start simple. In an interview, start with single-leader async replication (the most common setup) and add complexity only when the requirements demand it (multi-region -> multi-leader, zero-downtime -> synchronous replication).