Infrastructure
How nodes agree on values in an unreliable network. Covers Paxos, Raft, and ZAB protocols, and why consensus is the backbone of distributed systems.
In a distributed system, multiple nodes must agree on a shared value: Who is the current leader? What is the committed state of the database? Which transaction should be applied next? This is the consensus problem, and it is one of the hardest challenges in computer science. Without consensus, distributed databases cannot replicate data consistently, leader election cannot work, and configuration changes cannot propagate safely.
The FLP impossibility result (Fischer, Lynch, Paterson, 1985) proves that in an asynchronous distributed system where even one process can crash, no deterministic algorithm can guarantee consensus. In practice, this means every consensus algorithm must make trade-offs between safety (never agreeing on the wrong value) and liveness (eventually making progress). Real systems prioritize safety and use timeouts to ensure liveness in most practical scenarios.
Paxos, invented by Leslie Lamport, is the original consensus algorithm. It is provably correct but notoriously difficult to understand and implement.
Agrees on a single value through two phases:
Phase 1: Prepare
Proposer → Acceptors: Prepare(n)
"I want to propose with proposal number n. Have you accepted anything?"
Acceptors → Proposer: Promise(n, accepted_value)
"I promise not to accept any proposal numbered less than n.
Here is the highest-numbered proposal I have already accepted (if any)."Phase 2: Accept
Proposer → Acceptors: Accept(n, value)
"Please accept this value with proposal number n."
(If an Acceptor already accepted a value in Phase 1, the Proposer must
use that value. Otherwise, it proposes its own.)
Acceptors → Proposer: Accepted(n, value)
"I have accepted proposal n with this value."
(Only if the Acceptor has not promised to a higher proposal number.)A value is chosen when a majority of Acceptors have accepted it.
Example with 5 Acceptors (majority = 3):
Proposer P1: Prepare(1) → sent to all 5 Acceptors
Acceptors A1, A2, A3 respond: Promise(1, null)
(A4, A5 are slow or unreachable)
P1: Accept(1, "value_X") → sent to A1, A2, A3
A1, A2, A3 respond: Accepted(1, "value_X")
Consensus reached: "value_X" is chosen (3/5 majority).Basic Paxos decides one value. Real systems need to decide a sequence of values (a log). Multi-Paxos optimizes this by electing a stable leader. The leader skips Phase 1 for subsequent proposals (since it already holds promises from a majority), reducing each decision to a single round trip.
Raft was designed explicitly to be more understandable than Paxos while providing the same safety guarantees. It achieves consensus through leader election and log replication.
Node States:
Follower → Default state. Follows the leader.
Candidate → Requesting votes to become leader.
Leader → Manages log replication. Only one leader per term.
Election flow:
1. A Follower's election timeout expires (no heartbeat from leader).
2. It becomes a Candidate, increments its term, and votes for itself.
3. It sends RequestVote RPCs to all other nodes.
4. Each node votes for at most one candidate per term (first-come-first-served).
5. If the Candidate receives votes from a majority, it becomes Leader.
6. The Leader sends periodic heartbeats to prevent new elections.Timeline example (5-node cluster):
Time 0: Node A is Leader (term 1), sending heartbeats.
Time 1: Node A crashes. No more heartbeats.
Time 2: Node C's election timeout expires first.
C becomes Candidate (term 2), votes for itself.
Time 3: C sends RequestVote(term=2) to B, D, E.
B, D vote yes. E's timeout also expired, but C got there first.
Time 4: C has 3 votes (C, B, D) = majority of 5.
C becomes Leader (term 2).
Time 5: C sends heartbeats. All nodes update to term 2.Randomized timeouts prevent split votes. Each node's election timeout is randomly chosen (e.g., 150-300ms). This makes it unlikely that two nodes start an election simultaneously.
Once elected, the Leader accepts client requests and replicates them as log entries:
Leader's log:
Index: 1 2 3 4 5
Term: 1 1 1 2 2
Cmd: x←1 y←2 x←3 y←4 z←5
Replication:
1. Client sends command "z←5" to Leader.
2. Leader appends entry (index=5, term=2, cmd=z←5) to its log.
3. Leader sends AppendEntries(entries=[5]) to all Followers.
4. Each Follower appends the entry and responds with success.
5. Once a majority (3/5) have replicated entry 5, it is "committed."
6. Leader applies the command to its state machine and responds to client.
7. Followers learn about the commit via the next AppendEntries RPC.Safety guarantee: If a log entry is committed, it will eventually be present in the logs of all healthy nodes. A new Leader is guaranteed to have all committed entries (because it must have received votes from a majority, and at least one voter must have the committed entry).
If a Follower's log diverges from the Leader's (due to a crash during a previous term), the Leader finds the point of divergence and overwrites the Follower's inconsistent entries:
Leader log: [1:x←1] [1:y←2] [2:x←3] [2:y←4]
Follower log: [1:x←1] [1:y←2] [1:z←9] ← diverges at index 3
Leader sends AppendEntries starting from index 3.
Follower detects mismatch, deletes entry at index 3 onwards.
Follower adopts Leader's entries: [1:x←1] [1:y←2] [2:x←3] [2:y←4]ZAB is the consensus protocol used by Apache ZooKeeper. It is similar to Raft but was developed independently (before Raft was published).
Key differences from Raft:
Every consensus algorithm navigates this fundamental tension:
Safety: The algorithm never produces an incorrect result. In Raft:
Liveness: The algorithm eventually makes progress. In Raft:
Safety is always guaranteed. Liveness depends on network conditions. During a network partition where no majority is reachable, the system halts (no new commits) but never produces incorrect results. This is the CP choice in the CAP theorem.
etcd is a distributed key-value store that uses Raft for consensus. It is the backbone of Kubernetes, storing all cluster state (pod definitions, service configs, secrets).
Kubernetes cluster state flow:
kubectl apply → API Server → etcd (Raft consensus)
├── etcd node 1 (Leader)
├── etcd node 2 (Follower)
└── etcd node 3 (Follower)
A kubectl command is only acknowledged after etcd commits it
to a majority of nodes.Typical etcd clusters run 3 or 5 nodes. More nodes increase read availability but slow down writes (more nodes must acknowledge each write).
ZooKeeper provides distributed coordination primitives: locks, leader election, configuration management, and service discovery. Used by Kafka (legacy), HBase, Hadoop, and Solr.
ZooKeeper use cases:
- Kafka broker leader election (legacy, now replaced by KRaft)
- Distributed lock: /locks/resource_1 → ephemeral node
- Service discovery: /services/payment/instance_1 → { host, port }
- Configuration: /config/database_url → "postgres://..."CockroachDB uses Raft for every range (data partition). A single cluster might run thousands of independent Raft groups, one per range. This is called Multi-Raft and enables CockroachDB to scale horizontally while maintaining strong consistency.
TiKV (part of TiDB) also uses Multi-Raft. Each region (16-96 MB of data) is its own Raft group with its own leader. Region leaders are distributed across nodes for load balancing.
One of the trickiest aspects of consensus is changing the set of nodes (adding or removing members) without losing safety.
Raft recommends changing one node at a time. Add or remove a single server as a special log entry. This guarantees that the old and new configurations overlap by a majority, preventing split-brain.
Cluster: {A, B, C} → add D → {A, B, C, D} → add E → {A, B, C, D, E}
Each step is a committed log entry. Between steps, the cluster operates
with the new configuration.For changing multiple nodes simultaneously, Raft uses joint consensus: a transitional configuration that requires majority agreement from both the old and new configurations.
Raft write latency (typical):
- Leader receives request: 0ms
- Leader writes to local log: 1ms (disk sync)
- Leader sends to followers: 0.5ms (network)
- Followers write to log: 1ms (disk sync)
- Followers respond: 0.5ms (network)
- Total: ~3ms for a local-region cluster
Network partitions:
- Minority partition: cannot elect a leader, no progress
- Majority partition: elects a leader, continues operating
Throughput:
- Batching: Leader batches multiple entries per AppendEntries RPC
- Pipelining: Leader sends new entries before previous ones are committed
- Typical: 10,000-100,000 writes/sec depending on hardware and configuration