Classic Designs
Design a system like DynamoDB or Cassandra. Covers consistent hashing, vector clocks, quorum reads/writes, and conflict resolution.
Designing a distributed key-value store like Amazon DynamoDB or Apache Cassandra is one of the most technically rich system design questions. It tests your understanding of partitioning, replication, consistency, failure detection, and conflict resolution -- the building blocks of every distributed database.
`put(key, value)` -- store a key-value pair.`get(key)` -- retrieve the value for a given key.`delete(key)` -- remove a key-value pair.With billions of key-value pairs, data must be distributed across multiple nodes. Consistent hashing is the standard approach.
Imagine a hash ring from 0 to 2^128. Each node is assigned a position on the ring (by hashing its IP or ID). Each key is also hashed, and it is stored on the first node clockwise from its hash position.
Hash Ring:
node_A (pos: 50)
│
●─────●──────────●── node_B (pos: 150)
│ │
│ Hash Ring │
│ │
●─────●──────────●── node_C (pos: 250)
│
node_D (pos: 350)
Key "user:42" hashes to 120 → stored on node_B (next clockwise node)
Key "user:99" hashes to 300 → stored on node_DIn practice, with only a few physical nodes, the data distribution is uneven. Virtual nodes solve this: each physical node owns multiple positions on the ring.
Physical node A → virtual positions: 50, 130, 270, 340
Physical node B → virtual positions: 80, 190, 310, 390
Physical node C → virtual positions: 20, 160, 230, 360
More virtual nodes per physical node = more uniform distributionWhen a physical node is added, it takes over some virtual positions from existing nodes, redistributing only a fraction of the data (roughly 1/N where N is the number of nodes). When a node is removed, its virtual positions are distributed among the remaining nodes.
For fault tolerance, each key is replicated across N nodes (typically N=3). After finding the primary node via consistent hashing, the data is also stored on the next N-1 nodes clockwise on the ring.
Key "user:42" → Primary: node_B, Replicas: node_C, node_D
If node_B fails, the data is still available on node_C and node_D.A subtlety: with virtual nodes, the next N positions on the ring might all belong to the same physical node. The system must skip virtual nodes on the same physical machine when selecting replicas to ensure fault tolerance across distinct hardware.
The system offers tunable consistency via quorum parameters:
The key invariant: if W + R > N, you get strong consistency because at least one node in the read quorum participated in the most recent write.
N=3, W=2, R=2 → Strong consistency (2+2 > 3)
N=3, W=1, R=1 → Eventual consistency (1+1 < 3, but fastest)
N=3, W=3, R=1 → Strong consistency, slow writes, fast reads
N=3, W=1, R=3 → Strong consistency, fast writes, slow readsdef put(key, value, W=2):
replicas = get_replicas(key) # Returns N nodes
acks = 0
for node in replicas:
try:
node.write(key, value, timestamp=now())
acks += 1
except NodeUnavailable:
continue
if acks < W:
raise WriteFailure("Could not achieve write quorum")
def get(key, R=2):
replicas = get_replicas(key)
responses = []
for node in replicas:
try:
responses.append(node.read(key))
except NodeUnavailable:
continue
if len(responses) >= R:
break
if len(responses) < R:
raise ReadFailure("Could not achieve read quorum")
# Return the value with the latest timestamp
return max(responses, key=lambda r: r.timestamp).valueWhen W < N, different replicas may have different versions of the same key. Vector clocks detect and resolve conflicts.
A vector clock is a list of (node, counter) pairs. Each node increments its own counter when it handles a write.
Initial state: user:42 = "Alice"
Write on node_A: vector_clock = [(A, 1)]
Write on node_B (concurrent): vector_clock = [(B, 1)]
These are concurrent writes — neither happened before the other.
The system detects a conflict and must resolve it.
Resolution strategies:
1. Last-writer-wins (LWW): use timestamps. Simple but loses data.
2. Application-level resolution: return both versions to the client,
let the application merge them (Amazon's shopping cart approach).DynamoDB uses last-writer-wins by default. Riak exposes conflicts to the application. Cassandra uses LWW with timestamps.
In a large cluster, how does each node know which other nodes are alive? A centralized health checker is a single point of failure. Instead, use a gossip protocol.
Each node periodically (e.g., every second) picks a random peer and exchanges membership information. Each node maintains a heartbeat counter that it increments. If a node's heartbeat has not been updated within a timeout, it is considered down.
Node A's membership list:
Node B: heartbeat=42, last_updated=10:30:01 → alive
Node C: heartbeat=39, last_updated=10:29:55 → alive
Node D: heartbeat=35, last_updated=10:29:30 → suspected down (>30s stale)
Gossip round:
Node A randomly contacts Node C
A sends: {B: 42, C: 39, D: 35}
C sends: {B: 43, C: 40, D: 36}
A updates: {B: 43, C: 40, D: 36} → D might actually be aliveThis is how Cassandra and DynamoDB detect failures. Information propagates through the cluster in O(log N) gossip rounds.
Over time, replicas can drift out of sync (due to network partitions, temporary failures, or hinted handoffs). Merkle trees efficiently detect and repair inconsistencies.
A Merkle tree is a hash tree where each leaf is the hash of a key-value pair, and each internal node is the hash of its children. Two nodes can compare their Merkle trees starting from the root:
Root: H(AB + CD)
/ \
H(AB) H(CD)
/ \ / \
H(A) H(B) H(C) H(D)
| | | |
key_A key_B key_C key_D
Node 1's root hash: abc123
Node 2's root hash: abc123 → Trees match, no sync needed
Node 1's root hash: abc123
Node 2's root hash: def456 → Trees differ, drill down
Node 1's left child: aaa Node 2's left child: aaa → Match
Node 1's right child: bbb Node 2's right child: ccc → Differ
→ Only sync keys C and DThis approach is logarithmic: instead of comparing every key-value pair between replicas, you only need to transfer the keys that actually differ.
When a node is temporarily down, writes destined for it are stored on another node with a "hint" indicating the intended recipient. When the failed node recovers, the hinted writes are forwarded to it.
Key "user:42" should be stored on nodes B, C, D (N=3)
Node D is temporarily down.
Write goes to nodes B, C, and E (next node on the ring).
E stores the data with a hint: "This belongs to D."
When D recovers, E sends the hinted data to D and deletes its local copy.This maintains availability during temporary failures while ensuring data eventually reaches the correct node.
Standard quorum requires W of the N designated replicas to acknowledge. A sloppy quorum relaxes this: any W nodes can acknowledge, even if they are not in the original replica set. Combined with hinted handoff, this maximizes availability at the cost of consistency guarantees.
DynamoDB uses sloppy quorums. Cassandra uses strict quorums by default but supports hinted handoff for temporary failures.
Each node needs an efficient local storage engine. Common choices:
Used by Cassandra, HBase, RocksDB. Optimized for write-heavy workloads.
Write path:
1. Write to WAL (write-ahead log) for durability.
2. Insert into in-memory sorted structure (memtable).
3. When memtable is full, flush to disk as an immutable SSTable.
4. Background compaction merges SSTables.
Read path:
1. Check memtable.
2. Check bloom filters for each SSTable (skip if key definitely not present).
3. Search SSTables from newest to oldest.Used by traditional databases. Balanced read/write performance. Used by etcd (via BoltDB).