Fundamentals
You can only guarantee two of three: consistency, availability, and partition tolerance. Learn how real systems make this trade-off and what it means for your design.
The CAP theorem is one of the most cited — and most misunderstood — concepts in distributed systems. It provides a framework for reasoning about the fundamental trade-offs you face when building systems that span multiple machines connected by a network. In interviews, demonstrating a nuanced understanding of CAP separates candidates who have read a blog post from those who have actually built distributed systems.
In 2000, Eric Brewer conjectured (and it was later proved by Gilbert and Lynch in 2002) that a distributed data store can provide at most two of the following three guarantees simultaneously:
Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time. This is linearizability — the system behaves as if there is a single copy of the data.
Availability (A): Every request receives a non-error response, without the guarantee that it contains the most recent write. The system always responds, even if the data might be stale.
Partition Tolerance (P): The system continues to operate despite arbitrary message loss or failure of part of the network. Nodes cannot communicate with each other, but they keep running.
Here is what many people miss: in a distributed system, network partitions will happen. Cables get cut, switches fail, data centers lose connectivity. Partition tolerance is not a choice — it is a requirement for any system that spans multiple machines.
This means the real choice is between CP (consistency + partition tolerance) and AP (availability + partition tolerance) during a partition. When the network is healthy, you can have all three. The CAP theorem only forces a trade-off when things go wrong.
Consistency
╱╲
╱ ╲
╱ CP ╲
╱ ╲
╱────────╲
╱ Pick ╲
╱ during ╲
╱ partition ╲
╱ ╲
╱────────────────────╲
Availability ──── Partition Tolerance
AP (always required)A CP system refuses to respond (or returns an error) if it cannot guarantee the response is up to date. During a partition, the minority side of the partition stops accepting writes to prevent inconsistency.
HBase: Built on top of HDFS, HBase uses a single active RegionServer per region. If a RegionServer is unreachable, that region becomes unavailable until failover completes. You never get stale data, but you might not get any data.
MongoDB (with majority write concern): When configured with `w: "majority"` and `readConcern: "linearizable"`, MongoDB will not acknowledge writes unless a majority of replica set members confirm them. During a partition, the minority side cannot accept writes.
ZooKeeper: Used for distributed coordination, ZooKeeper prioritizes consistency. If a ZooKeeper node cannot reach a quorum, it stops serving requests. This is exactly what you want for configuration data and distributed locks — stale lock information would be catastrophic.
etcd: The backbone of Kubernetes, etcd uses the Raft consensus protocol and provides strong consistency. If a leader cannot reach a majority of nodes, it steps down rather than risk split-brain.
# CP behavior during a partition
class CPDatabase:
def write(self, key, value):
ack_count = self.replicate_to_all_nodes(key, value)
if ack_count < self.quorum_size():
raise UnavailableError("Cannot reach quorum — rejecting write")
return "OK"
def read(self, key):
if not self.can_reach_quorum():
raise UnavailableError("Cannot confirm data is current")
return self.local_store.get(key)An AP system always responds to requests, even if some nodes have not received the latest updates. During a partition, both sides continue serving traffic, which means different clients might see different versions of the data.
Cassandra: Every node can accept reads and writes. If a partition splits the cluster, both sides keep operating. When the partition heals, Cassandra reconciles conflicts using last-write-wins (by timestamp) or custom conflict resolution. You configure the trade-off through consistency levels — `ONE`, `QUORUM`, `ALL` — tuning per query.
Amazon DynamoDB: Designed for extreme availability, DynamoDB uses consistent hashing with virtual nodes. It accepts writes even when some replicas are unreachable. Conflicts are resolved during reads using vector clocks (in the original Dynamo paper) or last-writer-wins.
CouchDB: Designed for offline-first applications, CouchDB embraces eventual consistency. Each node operates independently, and conflicts are resolved during replication using a deterministic revision tree.
DNS: The Domain Name System is perhaps the most visible AP system. DNS records propagate asynchronously with TTLs. During network issues, DNS servers serve cached (potentially stale) records rather than returning errors. Availability is paramount — a DNS outage breaks the entire internet.
# AP behavior during a partition
class APDatabase:
def write(self, key, value):
self.local_store.put(key, value, timestamp=now())
self.async_replicate(key, value) # Best effort — may fail
return "OK" # Always succeeds locally
def read(self, key):
return self.local_store.get(key) # Returns local copy, may be stale
def on_partition_heal(self):
self.anti_entropy_sync() # Reconcile diverged dataThe CAP theorem only describes behavior during partitions, but systems make trade-offs even when things are running normally. Daniel Abadi proposed the PACELC extension:
If there is a Partition (P), choose between Availability (A) and Consistency (C). Else (E), when the system is running normally, choose between Latency (L) and Consistency (C).
This captures the day-to-day trade-off more accurately:
┌───────────────┬──────────────────────┬────────────────────────┐
│ System │ During Partition │ Normal Operation │
│ │ (PAC) │ (ELC) │
├───────────────┼──────────────────────┼────────────────────────┤
│ DynamoDB │ PA (availability) │ EL (low latency) │
│ Cassandra │ PA (availability) │ EL (low latency) │
│ MongoDB │ PC (consistency) │ EC (consistency) │
│ HBase │ PC (consistency) │ EC (consistency) │
│ Cosmos DB │ PA or PC (tunable) │ EL or EC (tunable) │
│ CockroachDB │ PC (consistency) │ EC (consistency) │
│ Spanner │ PC (consistency) │ EC (consistency, GPS) │
└───────────────┴──────────────────────┴────────────────────────┘Spanner is interesting here — it chooses consistency everywhere (PC/EC) but uses TrueTime (GPS + atomic clocks) to minimize the latency penalty of synchronous replication.
This framing is misleading. You do not choose CA and ignore P. Network partitions happen whether you want them to or not. The choice is only between C and A during a partition.
A CP system is not always unavailable. It is fully available when the network is healthy. It only sacrifices availability during partitions, and only for the minority partition (or whichever side cannot reach quorum).
AP systems are eventually consistent — they converge to a consistent state once the partition heals. Many AP systems offer tunable consistency (like Cassandra's consistency levels) that let you trade latency for stronger guarantees on a per-query basis.
If your system runs on a single machine, CAP does not apply — it is not a distributed system. If it runs on multiple machines, partitions are possible, so it cannot be CA. Single-node PostgreSQL is not CA in the CAP sense; it is simply not a distributed system.
When designing a system, ask yourself:
# Hybrid approach: different consistency for different data
class HybridService:
def update_account_balance(self, user_id, amount):
# CP: use strong consistency for financial data
self.db.write(
key=f"balance:{user_id}",
value=amount,
consistency="quorum" # Wait for majority
)
def update_profile_picture(self, user_id, url):
# AP: use eventual consistency for non-critical data
self.db.write(
key=f"avatar:{user_id}",
value=url,
consistency="one" # Write to one node, replicate async
)