Data Storage

SQL vs NoSQL

Relational databases vs document stores, key-value stores, wide-column, and graph databases. How to choose the right storage model for your use case.

SQL vs NoSQL

Choosing the right database is one of the most consequential decisions in system design. It affects your data model, query patterns, scaling strategy, consistency guarantees, and operational complexity for years to come. The SQL vs NoSQL framing is an oversimplification — the real question is which type of database best fits your access patterns, consistency requirements, and scaling needs.

The Relational Model (SQL)

Relational databases store data in tables with predefined schemas. Tables relate to each other through foreign keys. Queries use SQL, a declarative language that lets you express what data you want without specifying how to retrieve it.

ACID Properties

Relational databases guarantee ACID transactions:

Atomicity: A transaction either completes entirely or not at all. If transferring money from account A to B, both the debit and credit happen, or neither does.
Consistency: Transactions move the database from one valid state to another. Constraints (foreign keys, unique indexes, check constraints) are enforced.
Isolation: Concurrent transactions do not interfere with each other. Different isolation levels (read uncommitted, read committed, repeatable read, serializable) trade performance for strictness.
Durability: Once a transaction commits, it survives crashes. Data is written to disk before the commit is acknowledged.

# ACID transaction: transfer money between accounts
def transfer(db, from_account, to_account, amount):
    with db.transaction() as tx:
        from_balance = tx.query(
            "SELECT balance FROM accounts WHERE id = %s FOR UPDATE",
            [from_account]
        )

        if from_balance < amount:
            raise InsufficientFunds()

        tx.execute(
            "UPDATE accounts SET balance = balance - %s WHERE id = %s",
            [amount, from_account]
        )
        tx.execute(
            "UPDATE accounts SET balance = balance + %s WHERE id = %s",
            [amount, to_account]
        )
        # Both updates commit atomically, or neither does

Strengths of SQL databases

Flexible querying: SQL can express complex joins, aggregations, subqueries, and window functions. You do not need to know your queries at design time.
Data integrity: Foreign keys, unique constraints, and check constraints prevent invalid data at the database level.
Mature ecosystem: Decades of tooling, optimization, monitoring, and operational knowledge.
Transactions: Multi-row, multi-table transactions with full ACID guarantees.

Weaknesses

Rigid schema: Schema changes (ALTER TABLE) on large tables can be slow and require careful migration.
Scaling writes: Relational databases scale reads easily (read replicas) but scaling writes typically requires sharding, which breaks joins and transactions across shards.
Impedance mismatch: Object-oriented application code must be mapped to relational tables, often requiring an ORM.

When to use SQL

Your data has clear relationships (users, orders, products)
You need complex queries with joins and aggregations
Data integrity is critical (financial systems, inventory)
Your schema is reasonably stable
You need multi-row transactions

Examples: PostgreSQL, MySQL, SQL Server, Oracle, CockroachDB, Google Spanner

The NoSQL Family

"NoSQL" is an umbrella term for databases that do not use the relational model. There are four major categories, each designed for different access patterns.

Document Stores

Store data as flexible JSON-like documents. Each document can have a different structure. Documents are grouped into collections.

# MongoDB document — no fixed schema
user = {
    "_id": "user_123",
    "name": "Alice",
    "email": "alice@example.com",
    "addresses": [
        {"type": "home", "city": "Seattle", "zip": "98101"},
        {"type": "work", "city": "San Francisco", "zip": "94105"}
    ],
    "preferences": {
        "theme": "dark",
        "notifications": True
    }
}
# Related data is embedded in the document — no joins needed

Strengths: Flexible schema, natural mapping to application objects, fast reads when data is co-located in a document, easy horizontal scaling.

Weaknesses: No joins (data denormalization leads to update anomalies), limited transaction support across documents, can lead to data duplication.

Use when: Your data is hierarchical, semi-structured, or varies between records. Content management, user profiles, product catalogs.

Examples: MongoDB, CouchDB, Amazon DocumentDB, Firestore

Key-Value Stores

The simplest NoSQL model: a hash map that maps keys to values. The database treats values as opaque blobs — all intelligence is in the application.

1

2

3

4

5

# Redis-style key-value operations
redis.set("session:abc123", json.dumps({"user_id": 456, "role": "admin"}))
redis.expire("session:abc123", 3600)  # TTL: 1 hour

session = json.loads(redis.get("session:abc123"))

Strengths: Extreme performance (sub-millisecond latency), simple API, easy to scale (partition by key hash), predictable performance.

Weaknesses: No querying by value — you can only look up by exact key. No relationships, no aggregations.

Use when: You need a cache, session store, rate limiter, or any access pattern that is purely key-based lookup.

Examples: Redis, Memcached, Amazon DynamoDB (also a document store), etcd

Wide-Column Stores

Data is organized by rows and column families, but unlike relational tables, each row can have different columns. Rows are sorted by key, making range scans efficient.

Row Key          │ Column Family: profile   │ Column Family: activity
─────────────────┼──────────────────────────┼──────────────────────────
user_alice       │ name: "Alice"            │ last_login: "2024-01-15"
                 │ email: "alice@ex.com"    │ login_count: 342
                 │ city: "Seattle"          │
─────────────────┼──────────────────────────┼──────────────────────────
user_bob         │ name: "Bob"              │ last_login: "2024-01-14"
                 │ phone: "+1234567890"     │
                 │                          │

Strengths: Handles massive datasets (petabytes), excellent write throughput, efficient range scans, flexible column structure.

Weaknesses: No joins, no multi-row transactions (typically), complex data modeling, steep learning curve.

Use when: You have massive time-series data, event logs, IoT sensor data, or analytics workloads with known access patterns.

Examples: Apache Cassandra, HBase, Google Bigtable, ScyllaDB

Graph Databases

Store data as nodes (entities) and edges (relationships). Queries traverse the graph to find connected data. Optimized for relationship-heavy queries that would require expensive joins in SQL.

   (Alice)──FRIENDS──►(Bob)
      │                  │
   LIKES              WORKS_AT
      │                  │
      ▼                  ▼
  (Python)           (Google)
      ▲                  ▲
   KNOWS              WORKS_AT
      │                  │
   (Carol)──FRIENDS──►(Dave)

# Neo4j Cypher query: "Find friends of friends who work at Google"
query = """
MATCH (me:Person {name: 'Alice'})-[:FRIENDS]->(friend)-[:FRIENDS]->(fof)
WHERE (fof)-[:WORKS_AT]->(:Company {name: 'Google'})
AND NOT (me)-[:FRIENDS]->(fof)
RETURN fof.name
"""
# This would require multiple self-joins in SQL — slow and complex

Strengths: Natural modeling of highly connected data, efficient traversal of relationships, intuitive query language for graph problems.

Weaknesses: Not great for bulk analytics or aggregations, smaller ecosystem, harder to scale horizontally.

Use when: Social networks, recommendation engines, fraud detection, knowledge graphs, network topology.

Examples: Neo4j, Amazon Neptune, ArangoDB, JanusGraph

ACID vs BASE

Relational databases follow the ACID model. Many NoSQL databases follow BASE:

Basically Available: The system guarantees availability (in the CAP sense).
Soft state: The state of the system may change over time, even without input, due to eventual consistency.
Eventually consistent: The system will become consistent over time, given that no new updates are made.

┌──────────────────┬───────────────────────┬──────────────────────────┐
│ Property         │ ACID                  │ BASE                     │
├──────────────────┼───────────────────────┼──────────────────────────┤
│ Consistency      │ Immediate, strong     │ Eventual                 │
│ Availability     │ May sacrifice for     │ Prioritized              │
│                  │ consistency            │                          │
│ Focus            │ Correctness           │ Performance & scale      │
│ Transactions     │ Full multi-row        │ Limited or single-row    │
│ Schema           │ Rigid, predefined     │ Flexible, schema-on-read │
│ Scaling          │ Vertical first, then  │ Horizontal from start    │
│                  │ shard                  │                          │
│ Best for         │ Financial, inventory  │ Social, analytics, IoT   │
└──────────────────┴───────────────────────┴──────────────────────────┘

Comparison Table

┌────────────────┬───────────┬──────────┬────────────┬────────────┬───────────┐
│                │ Relational│ Document │ Key-Value  │ Wide-Column│ Graph     │
├────────────────┼───────────┼──────────┼────────────┼────────────┼───────────┤
│ Data model     │ Tables    │ JSON docs│ Key→blob   │ Row+columns│ Nodes+    │
│                │           │          │            │            │ edges     │
│ Schema         │ Fixed     │ Flexible │ None       │ Flexible   │ Flexible  │
│ Query language │ SQL       │ Custom   │ Get/Set    │ CQL/custom │ Cypher/   │
│                │           │          │            │            │ Gremlin   │
│ Joins          │ Yes       │ Limited  │ No         │ No         │ Traversal │
│ Transactions   │ ACID      │ Single-  │ Single-key │ Row-level  │ Yes       │
│                │           │ doc      │            │            │           │
│ Scale writes   │ Hard      │ Easy     │ Easy       │ Very easy  │ Hard      │
│ Scale reads    │ Replicas  │ Replicas │ Replicas   │ Replicas   │ Replicas  │
│ Latency        │ Low       │ Low      │ Very low   │ Low        │ Varies    │
│ Complexity     │ Medium    │ Low      │ Very low   │ High       │ Medium    │
└────────────────┴───────────┴──────────┴────────────┴────────────┴───────────┘

Decision Framework

Use this flowchart when choosing a database in a system design interview:

Step 1: Understand your access patterns

What queries will you run most? If you need flexible ad-hoc queries with joins, start with SQL. If you have known access patterns that are key-based, NoSQL is likely better.

Step 2: Consider your consistency needs

Do you need ACID transactions across multiple entities? SQL. Can you tolerate eventual consistency for higher availability and throughput? NoSQL.

Step 3: Evaluate your scale

Will you need to handle millions of writes per second? Wide-column stores like Cassandra excel here. If your data fits on a single powerful machine, PostgreSQL is hard to beat.

Step 4: Look at your data relationships

Are relationships the core of your data model (social graph, recommendation engine)? Use a graph database. Is your data hierarchical with variable structure? Use a document store.

Step 5: Consider operational complexity

Do you have a team experienced with a specific database? That matters. PostgreSQL with a skilled DBA often outperforms a poorly operated Cassandra cluster.

# Decision pseudocode
def choose_database(requirements):
    if requirements.needs_acid_transactions:
        if requirements.needs_global_scale:
            return "Google Spanner or CockroachDB"
        return "PostgreSQL or MySQL"

    if requirements.data_model == "key_value_lookups":
        if requirements.needs_persistence:
            return "DynamoDB or Redis with AOF"
        return "Redis or Memcached"

    if requirements.data_model == "hierarchical_documents":
        return "MongoDB or Firestore"

    if requirements.data_model == "time_series_or_events":
        return "Cassandra, ScyllaDB, or TimescaleDB"

    if requirements.data_model == "highly_connected":
        return "Neo4j or Amazon Neptune"

    # When in doubt
    return "PostgreSQL — it handles most use cases well"

The Polyglot Persistence Pattern

Real-world systems often use multiple databases, each optimized for a specific use case:

1

2

3

4

5

6

E-commerce system:
├── PostgreSQL     — Orders, accounts, inventory (ACID transactions)
├── Elasticsearch  — Product search (full-text search, facets)
├── Redis          — Session cache, rate limiting (sub-ms lookups)
├── Cassandra      — Event logging, clickstream (high write throughput)
└── Neo4j          — Product recommendations ("customers also bought")

This approach uses each database for its strengths but introduces complexity: data synchronization between stores, eventual consistency between them, and operational overhead of managing multiple database technologies.

Common Mistakes in Interviews

Choosing NoSQL because it scales. PostgreSQL scales further than most people think. Do not jump to Cassandra unless you can articulate why you need it (write-heavy workload, multi-region, specific access patterns).

Ignoring the query patterns. "We will use MongoDB" without explaining what queries you will run. The data model should be driven by access patterns, not by what is trendy.

Forgetting about transactions. If your system involves money, inventory, or any data where partial updates cause corruption, you need ACID transactions. Saying "we will use DynamoDB for the payment service" without addressing this is a red flag.

Over-engineering with polyglot persistence. Using five databases for a system that could run on PostgreSQL alone adds massive operational complexity. Start simple and add specialized stores when you have specific performance needs.

Interview Tips

Start with PostgreSQL as your default. It handles relational data, JSON documents (JSONB), full-text search, time-series data (with TimescaleDB), and even basic key-value workloads. Only deviate when you have a specific reason.

Justify your choice with access patterns. "I am choosing Cassandra for the event store because we need to write 500,000 events per second with a simple time-range query pattern, and Cassandra's write-optimized LSM tree architecture handles this well."

Acknowledge the trade-offs. "By choosing DynamoDB, we get unlimited horizontal scaling and single-digit millisecond reads, but we lose the ability to do ad-hoc joins and complex queries. We will use a separate analytics database for those."

Mention schema evolution. "Document stores like MongoDB make schema changes easy — we can add fields without migrating existing documents. With PostgreSQL, we would need an ALTER TABLE migration."

Discuss the operational angle. "Our team has deep PostgreSQL expertise, which is a factor. Running a Cassandra cluster well requires specific operational knowledge that we would need to build."

Use the right tool for each component. "The user service uses PostgreSQL for account data (ACID transactions for billing), Redis for session caching (sub-millisecond lookups), and Elasticsearch for user search (full-text matching on names and bios)."