Data Storage
Relational databases vs document stores, key-value stores, wide-column, and graph databases. How to choose the right storage model for your use case.
Choosing the right database is one of the most consequential decisions in system design. It affects your data model, query patterns, scaling strategy, consistency guarantees, and operational complexity for years to come. The SQL vs NoSQL framing is an oversimplification — the real question is which type of database best fits your access patterns, consistency requirements, and scaling needs.
Relational databases store data in tables with predefined schemas. Tables relate to each other through foreign keys. Queries use SQL, a declarative language that lets you express what data you want without specifying how to retrieve it.
Relational databases guarantee ACID transactions:
# ACID transaction: transfer money between accounts
def transfer(db, from_account, to_account, amount):
with db.transaction() as tx:
from_balance = tx.query(
"SELECT balance FROM accounts WHERE id = %s FOR UPDATE",
[from_account]
)
if from_balance < amount:
raise InsufficientFunds()
tx.execute(
"UPDATE accounts SET balance = balance - %s WHERE id = %s",
[amount, from_account]
)
tx.execute(
"UPDATE accounts SET balance = balance + %s WHERE id = %s",
[amount, to_account]
)
# Both updates commit atomically, or neither doesExamples: PostgreSQL, MySQL, SQL Server, Oracle, CockroachDB, Google Spanner
"NoSQL" is an umbrella term for databases that do not use the relational model. There are four major categories, each designed for different access patterns.
Store data as flexible JSON-like documents. Each document can have a different structure. Documents are grouped into collections.
# MongoDB document — no fixed schema
user = {
"_id": "user_123",
"name": "Alice",
"email": "alice@example.com",
"addresses": [
{"type": "home", "city": "Seattle", "zip": "98101"},
{"type": "work", "city": "San Francisco", "zip": "94105"}
],
"preferences": {
"theme": "dark",
"notifications": True
}
}
# Related data is embedded in the document — no joins neededStrengths: Flexible schema, natural mapping to application objects, fast reads when data is co-located in a document, easy horizontal scaling.
Weaknesses: No joins (data denormalization leads to update anomalies), limited transaction support across documents, can lead to data duplication.
Use when: Your data is hierarchical, semi-structured, or varies between records. Content management, user profiles, product catalogs.
Examples: MongoDB, CouchDB, Amazon DocumentDB, Firestore
The simplest NoSQL model: a hash map that maps keys to values. The database treats values as opaque blobs — all intelligence is in the application.
# Redis-style key-value operations
redis.set("session:abc123", json.dumps({"user_id": 456, "role": "admin"}))
redis.expire("session:abc123", 3600) # TTL: 1 hour
session = json.loads(redis.get("session:abc123"))Strengths: Extreme performance (sub-millisecond latency), simple API, easy to scale (partition by key hash), predictable performance.
Weaknesses: No querying by value — you can only look up by exact key. No relationships, no aggregations.
Use when: You need a cache, session store, rate limiter, or any access pattern that is purely key-based lookup.
Examples: Redis, Memcached, Amazon DynamoDB (also a document store), etcd
Data is organized by rows and column families, but unlike relational tables, each row can have different columns. Rows are sorted by key, making range scans efficient.
Row Key │ Column Family: profile │ Column Family: activity
─────────────────┼──────────────────────────┼──────────────────────────
user_alice │ name: "Alice" │ last_login: "2024-01-15"
│ email: "alice@ex.com" │ login_count: 342
│ city: "Seattle" │
─────────────────┼──────────────────────────┼──────────────────────────
user_bob │ name: "Bob" │ last_login: "2024-01-14"
│ phone: "+1234567890" │
│ │Strengths: Handles massive datasets (petabytes), excellent write throughput, efficient range scans, flexible column structure.
Weaknesses: No joins, no multi-row transactions (typically), complex data modeling, steep learning curve.
Use when: You have massive time-series data, event logs, IoT sensor data, or analytics workloads with known access patterns.
Examples: Apache Cassandra, HBase, Google Bigtable, ScyllaDB
Store data as nodes (entities) and edges (relationships). Queries traverse the graph to find connected data. Optimized for relationship-heavy queries that would require expensive joins in SQL.
(Alice)──FRIENDS──►(Bob)
│ │
LIKES WORKS_AT
│ │
▼ ▼
(Python) (Google)
▲ ▲
KNOWS WORKS_AT
│ │
(Carol)──FRIENDS──►(Dave)# Neo4j Cypher query: "Find friends of friends who work at Google"
query = """
MATCH (me:Person {name: 'Alice'})-[:FRIENDS]->(friend)-[:FRIENDS]->(fof)
WHERE (fof)-[:WORKS_AT]->(:Company {name: 'Google'})
AND NOT (me)-[:FRIENDS]->(fof)
RETURN fof.name
"""
# This would require multiple self-joins in SQL — slow and complexStrengths: Natural modeling of highly connected data, efficient traversal of relationships, intuitive query language for graph problems.
Weaknesses: Not great for bulk analytics or aggregations, smaller ecosystem, harder to scale horizontally.
Use when: Social networks, recommendation engines, fraud detection, knowledge graphs, network topology.
Examples: Neo4j, Amazon Neptune, ArangoDB, JanusGraph
Relational databases follow the ACID model. Many NoSQL databases follow BASE:
┌──────────────────┬───────────────────────┬──────────────────────────┐
│ Property │ ACID │ BASE │
├──────────────────┼───────────────────────┼──────────────────────────┤
│ Consistency │ Immediate, strong │ Eventual │
│ Availability │ May sacrifice for │ Prioritized │
│ │ consistency │ │
│ Focus │ Correctness │ Performance & scale │
│ Transactions │ Full multi-row │ Limited or single-row │
│ Schema │ Rigid, predefined │ Flexible, schema-on-read │
│ Scaling │ Vertical first, then │ Horizontal from start │
│ │ shard │ │
│ Best for │ Financial, inventory │ Social, analytics, IoT │
└──────────────────┴───────────────────────┴──────────────────────────┘┌────────────────┬───────────┬──────────┬────────────┬────────────┬───────────┐
│ │ Relational│ Document │ Key-Value │ Wide-Column│ Graph │
├────────────────┼───────────┼──────────┼────────────┼────────────┼───────────┤
│ Data model │ Tables │ JSON docs│ Key→blob │ Row+columns│ Nodes+ │
│ │ │ │ │ │ edges │
│ Schema │ Fixed │ Flexible │ None │ Flexible │ Flexible │
│ Query language │ SQL │ Custom │ Get/Set │ CQL/custom │ Cypher/ │
│ │ │ │ │ │ Gremlin │
│ Joins │ Yes │ Limited │ No │ No │ Traversal │
│ Transactions │ ACID │ Single- │ Single-key │ Row-level │ Yes │
│ │ │ doc │ │ │ │
│ Scale writes │ Hard │ Easy │ Easy │ Very easy │ Hard │
│ Scale reads │ Replicas │ Replicas │ Replicas │ Replicas │ Replicas │
│ Latency │ Low │ Low │ Very low │ Low │ Varies │
│ Complexity │ Medium │ Low │ Very low │ High │ Medium │
└────────────────┴───────────┴──────────┴────────────┴────────────┴───────────┘Use this flowchart when choosing a database in a system design interview:
What queries will you run most? If you need flexible ad-hoc queries with joins, start with SQL. If you have known access patterns that are key-based, NoSQL is likely better.
Do you need ACID transactions across multiple entities? SQL. Can you tolerate eventual consistency for higher availability and throughput? NoSQL.
Will you need to handle millions of writes per second? Wide-column stores like Cassandra excel here. If your data fits on a single powerful machine, PostgreSQL is hard to beat.
Are relationships the core of your data model (social graph, recommendation engine)? Use a graph database. Is your data hierarchical with variable structure? Use a document store.
Do you have a team experienced with a specific database? That matters. PostgreSQL with a skilled DBA often outperforms a poorly operated Cassandra cluster.
# Decision pseudocode
def choose_database(requirements):
if requirements.needs_acid_transactions:
if requirements.needs_global_scale:
return "Google Spanner or CockroachDB"
return "PostgreSQL or MySQL"
if requirements.data_model == "key_value_lookups":
if requirements.needs_persistence:
return "DynamoDB or Redis with AOF"
return "Redis or Memcached"
if requirements.data_model == "hierarchical_documents":
return "MongoDB or Firestore"
if requirements.data_model == "time_series_or_events":
return "Cassandra, ScyllaDB, or TimescaleDB"
if requirements.data_model == "highly_connected":
return "Neo4j or Amazon Neptune"
# When in doubt
return "PostgreSQL — it handles most use cases well"Real-world systems often use multiple databases, each optimized for a specific use case:
E-commerce system:
├── PostgreSQL — Orders, accounts, inventory (ACID transactions)
├── Elasticsearch — Product search (full-text search, facets)
├── Redis — Session cache, rate limiting (sub-ms lookups)
├── Cassandra — Event logging, clickstream (high write throughput)
└── Neo4j — Product recommendations ("customers also bought")This approach uses each database for its strengths but introduces complexity: data synchronization between stores, eventual consistency between them, and operational overhead of managing multiple database technologies.