← back

Classic Designs

Distributed Key-Value Store

Design a system like DynamoDB or Cassandra. Covers consistent hashing, vector clocks, quorum reads/writes, and conflict resolution.

Designing a Distributed Key-Value Store

Designing a distributed key-value store like Amazon DynamoDB or Apache Cassandra is one of the most technically rich system design questions. It tests your understanding of partitioning, replication, consistency, failure detection, and conflict resolution -- the building blocks of every distributed database.

Requirements

Functional Requirements

  • `put(key, value)` -- store a key-value pair.
  • `get(key)` -- retrieve the value for a given key.
  • `delete(key)` -- remove a key-value pair.
  • Support for large datasets that do not fit on a single machine.

Non-Functional Requirements

  • High availability: the system remains operational even when nodes fail.
  • Scalability: add nodes to increase capacity linearly.
  • Tunable consistency: allow the caller to choose between strong and eventual consistency.
  • Low latency: single-digit millisecond reads and writes.
  • Fault tolerance: no single point of failure.

Data Partitioning with Consistent Hashing

With billions of key-value pairs, data must be distributed across multiple nodes. Consistent hashing is the standard approach.

Basic Consistent Hashing

Imagine a hash ring from 0 to 2^128. Each node is assigned a position on the ring (by hashing its IP or ID). Each key is also hashed, and it is stored on the first node clockwise from its hash position.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
Hash Ring:

        node_A (pos: 50)
          │
    ●─────●──────────●── node_B (pos: 150)
    │                 │
    │    Hash Ring    │
    │                 │
    ●─────●──────────●── node_C (pos: 250)
          │
        node_D (pos: 350)

Key "user:42" hashes to 120 → stored on node_B (next clockwise node)
Key "user:99" hashes to 300 → stored on node_D

Virtual Nodes

In practice, with only a few physical nodes, the data distribution is uneven. Virtual nodes solve this: each physical node owns multiple positions on the ring.

1
2
3
4
5
Physical node A → virtual positions: 50, 130, 270, 340
Physical node B → virtual positions: 80, 190, 310, 390
Physical node C → virtual positions: 20, 160, 230, 360

More virtual nodes per physical node = more uniform distribution

When a physical node is added, it takes over some virtual positions from existing nodes, redistributing only a fraction of the data (roughly 1/N where N is the number of nodes). When a node is removed, its virtual positions are distributed among the remaining nodes.

Replication

For fault tolerance, each key is replicated across N nodes (typically N=3). After finding the primary node via consistent hashing, the data is also stored on the next N-1 nodes clockwise on the ring.

1
2
3
Key "user:42" → Primary: node_B, Replicas: node_C, node_D

If node_B fails, the data is still available on node_C and node_D.

Replica Placement

A subtlety: with virtual nodes, the next N positions on the ring might all belong to the same physical node. The system must skip virtual nodes on the same physical machine when selecting replicas to ensure fault tolerance across distinct hardware.

Consistency: Quorum Reads and Writes

The system offers tunable consistency via quorum parameters:

  • N: total number of replicas (e.g., 3)
  • W: number of replicas that must acknowledge a write (write quorum)
  • R: number of replicas that must respond to a read (read quorum)

The key invariant: if W + R > N, you get strong consistency because at least one node in the read quorum participated in the most recent write.

1
2
3
4
N=3, W=2, R=2 → Strong consistency (2+2 > 3)
N=3, W=1, R=1 → Eventual consistency (1+1 < 3, but fastest)
N=3, W=3, R=1 → Strong consistency, slow writes, fast reads
N=3, W=1, R=3 → Strong consistency, fast writes, slow reads
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
def put(key, value, W=2):
    replicas = get_replicas(key)  # Returns N nodes
    acks = 0
    for node in replicas:
        try:
            node.write(key, value, timestamp=now())
            acks += 1
        except NodeUnavailable:
            continue
    if acks < W:
        raise WriteFailure("Could not achieve write quorum")

def get(key, R=2):
    replicas = get_replicas(key)
    responses = []
    for node in replicas:
        try:
            responses.append(node.read(key))
        except NodeUnavailable:
            continue
        if len(responses) >= R:
            break
    if len(responses) < R:
        raise ReadFailure("Could not achieve read quorum")
    # Return the value with the latest timestamp
    return max(responses, key=lambda r: r.timestamp).value

Conflict Resolution with Vector Clocks

When W < N, different replicas may have different versions of the same key. Vector clocks detect and resolve conflicts.

A vector clock is a list of (node, counter) pairs. Each node increments its own counter when it handles a write.

1
2
3
4
5
6
7
8
9
10
11
12
Initial state: user:42 = "Alice"

Write on node_A: vector_clock = [(A, 1)]
Write on node_B (concurrent): vector_clock = [(B, 1)]

These are concurrent writes — neither happened before the other.
The system detects a conflict and must resolve it.

Resolution strategies:
1. Last-writer-wins (LWW): use timestamps. Simple but loses data.
2. Application-level resolution: return both versions to the client,
   let the application merge them (Amazon's shopping cart approach).

DynamoDB uses last-writer-wins by default. Riak exposes conflicts to the application. Cassandra uses LWW with timestamps.

Failure Detection: Gossip Protocol

In a large cluster, how does each node know which other nodes are alive? A centralized health checker is a single point of failure. Instead, use a gossip protocol.

Each node periodically (e.g., every second) picks a random peer and exchanges membership information. Each node maintains a heartbeat counter that it increments. If a node's heartbeat has not been updated within a timeout, it is considered down.

1
2
3
4
5
6
7
8
9
10
Node A's membership list:
  Node B: heartbeat=42, last_updated=10:30:01 → alive
  Node C: heartbeat=39, last_updated=10:29:55 → alive
  Node D: heartbeat=35, last_updated=10:29:30 → suspected down (>30s stale)

Gossip round:
  Node A randomly contacts Node C
  A sends: {B: 42, C: 39, D: 35}
  C sends: {B: 43, C: 40, D: 36}
  A updates: {B: 43, C: 40, D: 36} → D might actually be alive

This is how Cassandra and DynamoDB detect failures. Information propagates through the cluster in O(log N) gossip rounds.

Anti-Entropy with Merkle Trees

Over time, replicas can drift out of sync (due to network partitions, temporary failures, or hinted handoffs). Merkle trees efficiently detect and repair inconsistencies.

A Merkle tree is a hash tree where each leaf is the hash of a key-value pair, and each internal node is the hash of its children. Two nodes can compare their Merkle trees starting from the root:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
        Root: H(AB + CD)
       /              \
   H(AB)             H(CD)
   /    \            /    \
 H(A)  H(B)      H(C)  H(D)
  |      |        |      |
key_A  key_B    key_C  key_D

Node 1's root hash: abc123
Node 2's root hash: abc123  → Trees match, no sync needed

Node 1's root hash: abc123
Node 2's root hash: def456  → Trees differ, drill down
  Node 1's left child: aaa   Node 2's left child: aaa  → Match
  Node 1's right child: bbb  Node 2's right child: ccc → Differ
    → Only sync keys C and D

This approach is logarithmic: instead of comparing every key-value pair between replicas, you only need to transfer the keys that actually differ.

Handling Temporary Failures: Hinted Handoff

When a node is temporarily down, writes destined for it are stored on another node with a "hint" indicating the intended recipient. When the failed node recovers, the hinted writes are forwarded to it.

1
2
3
4
5
6
7
Key "user:42" should be stored on nodes B, C, D (N=3)
Node D is temporarily down.

Write goes to nodes B, C, and E (next node on the ring).
E stores the data with a hint: "This belongs to D."

When D recovers, E sends the hinted data to D and deletes its local copy.

This maintains availability during temporary failures while ensuring data eventually reaches the correct node.

Sloppy Quorum

Standard quorum requires W of the N designated replicas to acknowledge. A sloppy quorum relaxes this: any W nodes can acknowledge, even if they are not in the original replica set. Combined with hinted handoff, this maximizes availability at the cost of consistency guarantees.

DynamoDB uses sloppy quorums. Cassandra uses strict quorums by default but supports hinted handoff for temporary failures.

Storage Engine

Each node needs an efficient local storage engine. Common choices:

LSM Tree (Log-Structured Merge Tree)

Used by Cassandra, HBase, RocksDB. Optimized for write-heavy workloads.

1
2
3
4
5
6
7
8
9
10
Write path:
  1. Write to WAL (write-ahead log) for durability.
  2. Insert into in-memory sorted structure (memtable).
  3. When memtable is full, flush to disk as an immutable SSTable.
  4. Background compaction merges SSTables.

Read path:
  1. Check memtable.
  2. Check bloom filters for each SSTable (skip if key definitely not present).
  3. Search SSTables from newest to oldest.

B+ Tree

Used by traditional databases. Balanced read/write performance. Used by etcd (via BoltDB).

Interview Tips

  • Draw the consistent hash ring. Start with basic consistent hashing, then introduce virtual nodes. This visual grounds the entire design.
  • State the quorum formula. W + R > N for strong consistency. Then explain the trade-offs of different W/R configurations.
  • Explain the CAP trade-off. DynamoDB and Cassandra choose AP (availability and partition tolerance) by default, with tunable consistency. Contrast with CP systems like ZooKeeper.
  • Walk through a failure scenario. "Node B goes down. Here is how the system handles reads, writes, and recovery." This demonstrates deep understanding.
  • Mention vector clocks but acknowledge complexity. In practice, most systems use last-writer-wins for simplicity. Vector clocks are theoretically correct but hard to manage at scale.
  • Distinguish permanent vs temporary failures. Hinted handoff handles temporary failures. Permanent failures require re-replication from surviving replicas using Merkle trees to identify missing data.