← back

Infrastructure

Distributed Consensus

How nodes agree on values in an unreliable network. Covers Paxos, Raft, and ZAB protocols, and why consensus is the backbone of distributed systems.

Distributed Consensus

In a distributed system, multiple nodes must agree on a shared value: Who is the current leader? What is the committed state of the database? Which transaction should be applied next? This is the consensus problem, and it is one of the hardest challenges in computer science. Without consensus, distributed databases cannot replicate data consistently, leader election cannot work, and configuration changes cannot propagate safely.

Why Consensus Is Hard

The FLP impossibility result (Fischer, Lynch, Paterson, 1985) proves that in an asynchronous distributed system where even one process can crash, no deterministic algorithm can guarantee consensus. In practice, this means every consensus algorithm must make trade-offs between safety (never agreeing on the wrong value) and liveness (eventually making progress). Real systems prioritize safety and use timeouts to ensure liveness in most practical scenarios.

Paxos

Paxos, invented by Leslie Lamport, is the original consensus algorithm. It is provably correct but notoriously difficult to understand and implement.

Basic Paxos (Single-Decree)

Agrees on a single value through two phases:

Phase 1: Prepare

1
2
3
4
5
6
Proposer → Acceptors: Prepare(n)
  "I want to propose with proposal number n. Have you accepted anything?"

Acceptors → Proposer: Promise(n, accepted_value)
  "I promise not to accept any proposal numbered less than n.
   Here is the highest-numbered proposal I have already accepted (if any)."

Phase 2: Accept

1
2
3
4
5
6
7
8
Proposer → Acceptors: Accept(n, value)
  "Please accept this value with proposal number n."
  (If an Acceptor already accepted a value in Phase 1, the Proposer must
   use that value. Otherwise, it proposes its own.)

Acceptors → Proposer: Accepted(n, value)
  "I have accepted proposal n with this value."
  (Only if the Acceptor has not promised to a higher proposal number.)

A value is chosen when a majority of Acceptors have accepted it.

1
2
3
4
5
6
7
8
9
10
Example with 5 Acceptors (majority = 3):

Proposer P1: Prepare(1) → sent to all 5 Acceptors
  Acceptors A1, A2, A3 respond: Promise(1, null)
  (A4, A5 are slow or unreachable)

P1: Accept(1, "value_X") → sent to A1, A2, A3
  A1, A2, A3 respond: Accepted(1, "value_X")

Consensus reached: "value_X" is chosen (3/5 majority).

Multi-Paxos

Basic Paxos decides one value. Real systems need to decide a sequence of values (a log). Multi-Paxos optimizes this by electing a stable leader. The leader skips Phase 1 for subsequent proposals (since it already holds promises from a majority), reducing each decision to a single round trip.

Why Paxos Is Painful

  • The algorithm is correct but hard to implement. Lamport's paper uses an analogy about Greek legislators on an island, which many found confusing.
  • Multi-Paxos (the practical version) is underspecified. Implementers must fill in gaps for log compaction, membership changes, and snapshotting.
  • Google's Chubby team reported that their Paxos implementation was far more complex than the original algorithm suggested.

Raft

Raft was designed explicitly to be more understandable than Paxos while providing the same safety guarantees. It achieves consensus through leader election and log replication.

Leader Election

1
2
3
4
5
6
7
8
9
10
11
12
Node States:
  Follower  → Default state. Follows the leader.
  Candidate → Requesting votes to become leader.
  Leader    → Manages log replication. Only one leader per term.

Election flow:
1. A Follower's election timeout expires (no heartbeat from leader).
2. It becomes a Candidate, increments its term, and votes for itself.
3. It sends RequestVote RPCs to all other nodes.
4. Each node votes for at most one candidate per term (first-come-first-served).
5. If the Candidate receives votes from a majority, it becomes Leader.
6. The Leader sends periodic heartbeats to prevent new elections.
1
2
3
4
5
6
7
8
9
10
11
Timeline example (5-node cluster):

  Time 0: Node A is Leader (term 1), sending heartbeats.
  Time 1: Node A crashes. No more heartbeats.
  Time 2: Node C's election timeout expires first.
          C becomes Candidate (term 2), votes for itself.
  Time 3: C sends RequestVote(term=2) to B, D, E.
          B, D vote yes. E's timeout also expired, but C got there first.
  Time 4: C has 3 votes (C, B, D) = majority of 5.
          C becomes Leader (term 2).
  Time 5: C sends heartbeats. All nodes update to term 2.

Randomized timeouts prevent split votes. Each node's election timeout is randomly chosen (e.g., 150-300ms). This makes it unlikely that two nodes start an election simultaneously.

Log Replication

Once elected, the Leader accepts client requests and replicates them as log entries:

1
2
3
4
5
6
7
8
9
10
11
12
13
Leader's log:
  Index: 1    2    3    4    5
  Term:  1    1    1    2    2
  Cmd:   x←1  y←2  x←3  y←4  z←5

Replication:
1. Client sends command "z←5" to Leader.
2. Leader appends entry (index=5, term=2, cmd=z←5) to its log.
3. Leader sends AppendEntries(entries=[5]) to all Followers.
4. Each Follower appends the entry and responds with success.
5. Once a majority (3/5) have replicated entry 5, it is "committed."
6. Leader applies the command to its state machine and responds to client.
7. Followers learn about the commit via the next AppendEntries RPC.

Safety guarantee: If a log entry is committed, it will eventually be present in the logs of all healthy nodes. A new Leader is guaranteed to have all committed entries (because it must have received votes from a majority, and at least one voter must have the committed entry).

Log Consistency

If a Follower's log diverges from the Leader's (due to a crash during a previous term), the Leader finds the point of divergence and overwrites the Follower's inconsistent entries:

1
2
3
4
5
6
Leader log:    [1:x←1] [1:y←2] [2:x←3] [2:y←4]
Follower log:  [1:x←1] [1:y←2] [1:z←9]    ← diverges at index 3

Leader sends AppendEntries starting from index 3.
Follower detects mismatch, deletes entry at index 3 onwards.
Follower adopts Leader's entries: [1:x←1] [1:y←2] [2:x←3] [2:y←4]

ZAB (ZooKeeper Atomic Broadcast)

ZAB is the consensus protocol used by Apache ZooKeeper. It is similar to Raft but was developed independently (before Raft was published).

Key differences from Raft:

  • ZAB uses a transaction ID (zxid) that combines epoch (like Raft's term) and a counter.
  • Recovery protocol is more complex but handles some edge cases differently.
  • ZAB focuses on atomic broadcast (total order delivery) rather than log replication, though the result is equivalent.

Safety vs Liveness

Every consensus algorithm navigates this fundamental tension:

Safety: The algorithm never produces an incorrect result. In Raft:

  • Only one leader per term.
  • A committed entry is never lost.
  • The state machine at each node applies the same commands in the same order.

Liveness: The algorithm eventually makes progress. In Raft:

  • If a leader fails, a new leader is eventually elected.
  • Client requests are eventually processed.

Safety is always guaranteed. Liveness depends on network conditions. During a network partition where no majority is reachable, the system halts (no new commits) but never produces incorrect results. This is the CP choice in the CAP theorem.

Real Implementations

etcd (Raft)

etcd is a distributed key-value store that uses Raft for consensus. It is the backbone of Kubernetes, storing all cluster state (pod definitions, service configs, secrets).

1
2
3
4
5
6
7
8
Kubernetes cluster state flow:
  kubectl apply → API Server → etcd (Raft consensus)
                                ├── etcd node 1 (Leader)
                                ├── etcd node 2 (Follower)
                                └── etcd node 3 (Follower)

A kubectl command is only acknowledged after etcd commits it
to a majority of nodes.

Typical etcd clusters run 3 or 5 nodes. More nodes increase read availability but slow down writes (more nodes must acknowledge each write).

ZooKeeper (ZAB)

ZooKeeper provides distributed coordination primitives: locks, leader election, configuration management, and service discovery. Used by Kafka (legacy), HBase, Hadoop, and Solr.

1
2
3
4
5
ZooKeeper use cases:
  - Kafka broker leader election (legacy, now replaced by KRaft)
  - Distributed lock: /locks/resource_1 → ephemeral node
  - Service discovery: /services/payment/instance_1 → { host, port }
  - Configuration: /config/database_url → "postgres://..."

CockroachDB (Raft, multi-group)

CockroachDB uses Raft for every range (data partition). A single cluster might run thousands of independent Raft groups, one per range. This is called Multi-Raft and enables CockroachDB to scale horizontally while maintaining strong consistency.

TiKV (Raft)

TiKV (part of TiDB) also uses Multi-Raft. Each region (16-96 MB of data) is its own Raft group with its own leader. Region leaders are distributed across nodes for load balancing.

Membership Changes

One of the trickiest aspects of consensus is changing the set of nodes (adding or removing members) without losing safety.

Single-Server Changes (Raft)

Raft recommends changing one node at a time. Add or remove a single server as a special log entry. This guarantees that the old and new configurations overlap by a majority, preventing split-brain.

1
2
3
4
Cluster: {A, B, C} → add D → {A, B, C, D} → add E → {A, B, C, D, E}

Each step is a committed log entry. Between steps, the cluster operates
with the new configuration.

Joint Consensus (Raft Enhancement)

For changing multiple nodes simultaneously, Raft uses joint consensus: a transitional configuration that requires majority agreement from both the old and new configurations.

Performance Characteristics

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Raft write latency (typical):
  - Leader receives request: 0ms
  - Leader writes to local log: 1ms (disk sync)
  - Leader sends to followers: 0.5ms (network)
  - Followers write to log: 1ms (disk sync)
  - Followers respond: 0.5ms (network)
  - Total: ~3ms for a local-region cluster

Network partitions:
  - Minority partition: cannot elect a leader, no progress
  - Majority partition: elects a leader, continues operating

Throughput:
  - Batching: Leader batches multiple entries per AppendEntries RPC
  - Pipelining: Leader sends new entries before previous ones are committed
  - Typical: 10,000-100,000 writes/sec depending on hardware and configuration

Interview Tips

  • Know Raft well. Paxos is important historically, but Raft is what you will be asked to explain in detail. Be able to walk through leader election and log replication step by step.
  • Explain why consensus is hard. Mention the FLP impossibility result and the safety/liveness trade-off. This shows theoretical grounding.
  • Draw the state transitions. Follower → Candidate → Leader, with arrows showing what triggers each transition.
  • Connect to real systems. etcd for Kubernetes, ZooKeeper for Kafka, CockroachDB for distributed SQL. Interviewers want to know you understand where consensus is used, not just the algorithm.
  • Discuss the quorum size. 3 nodes tolerates 1 failure. 5 nodes tolerates 2. Adding more nodes increases fault tolerance but slows writes. Most production systems use 3 or 5.
  • Mention log compaction. The Raft log grows forever. Snapshotting the state machine and truncating the log is essential for production systems. This shows you think about operational concerns.
  • Distinguish consensus from replication. Consensus ensures agreement on order. Replication ensures data durability. They are related but not the same. A system with asynchronous replication (like MySQL) does not use consensus.