Data Storage

Database Sharding

Horizontally partition data across multiple database instances. Covers hash-based, range-based, and directory-based sharding strategies and their trade-offs.

Database Sharding

Sharding is the practice of splitting a single database into multiple smaller databases, each holding a subset of the data. Every shard is a fully functional database that runs on its own server. When your data grows beyond what a single machine can handle — either in storage, write throughput, or query performance — sharding is how you scale horizontally.

This is one of the most important topics in system design interviews because almost every large-scale system eventually needs it, and the tradeoffs are nuanced.

Why Shard?

A single database server has hard limits: CPU, memory, disk I/O, and network bandwidth. Vertical scaling (bigger hardware) eventually hits a ceiling, and it gets exponentially more expensive. Sharding lets you distribute both data and load across multiple machines.

Key benefits:

Write scalability — each shard handles a fraction of total writes
Read scalability — queries only scan a subset of data
Storage scalability — total capacity grows linearly with shard count
Fault isolation — a single shard failure only affects a portion of users

Sharding Strategies

Hash-Based Sharding

You apply a hash function to a shard key (e.g., `user_id`) and use modulo to determine which shard stores that row.

1

2

def get_shard(user_id, num_shards):
    return hash(user_id) % num_shards

Pros: Even distribution of data across shards, simple to implement.

Cons: Adding or removing shards requires rehashing most keys. If you go from 4 to 5 shards, roughly 80% of data needs to move. This is why consistent hashing is often used instead.

Range-Based Sharding

You partition data based on ranges of the shard key. For example, users with IDs 1–1M go to shard 1, 1M–2M go to shard 2, and so on.

1

2

3

Shard 1: user_id [1, 1_000_000)
Shard 2: user_id [1_000_000, 2_000_000)
Shard 3: user_id [2_000_000, 3_000_000)

Pros: Range queries are efficient — you know exactly which shard to hit. Adding new shards is straightforward (just extend the range).

Cons: Prone to hotspots. If recent users are more active, the shard holding the highest user IDs gets disproportionate traffic. Time-based shard keys are especially dangerous here.

Directory-Based Sharding

A lookup service (directory) maintains a mapping from shard key to shard location. Every query first consults the directory to find the right shard.

1

2

3

4

5

Directory Table:
  user_id_range  ->  shard_location
  [1, 500K)      ->  shard-1.db.internal
  [500K, 1.2M)   ->  shard-2.db.internal
  [1.2M, 2M)     ->  shard-3.db.internal

Pros: Maximum flexibility — you can move data between shards without changing the sharding logic. Supports any partitioning scheme.

Cons: The directory is a single point of failure and a potential bottleneck. It must be highly available and fast (often cached in memory or backed by a replicated store like ZooKeeper).

Choosing a Shard Key

This is the single most critical decision in sharding. A bad shard key leads to hotspots, cross-shard queries, and operational nightmares.

Good shard key properties:

High cardinality — many distinct values to distribute across shards
Even distribution — each shard gets roughly equal data and traffic
Query alignment — most queries should be answerable from a single shard

Example — Instagram: Instagram shards by `user_id`. Most queries are user-centric: "show me this user's photos," "show me this user's feed." By sharding on `user_id`, these queries hit a single shard.

Example — Pinterest: Pinterest uses a compound shard key. Pins are sharded by the board they belong to, which means loading a board page (the most common operation) requires hitting only one shard.

Anti-pattern — sharding by timestamp: If you shard by creation date, the latest shard gets all the writes while older shards sit idle. This creates a massive hotspot.

The Hotspot Problem

Even with a good hash function, hotspots happen. A celebrity user with 100M followers generates disproportionate load on their shard when they post. This is sometimes called the "celebrity problem" or "hot partition" problem.

Mitigation strategies:

Split hot shards — detect hotspots and split them further
Add a random suffix — append a random digit to the shard key, spreading one logical key across multiple shards (at the cost of scatter-gather reads)
Application-level caching — cache hot data in Redis or Memcached to absorb read spikes
Rate limiting — throttle writes to hot partitions

Cross-Shard Queries

Once data is sharded, any query that spans multiple shards becomes expensive. For example, if users are sharded by `user_id`, a query like "find all users in San Francisco" must fan out to every shard.

1

2

Client -> Query Router -> [Shard 1, Shard 2, ..., Shard N]
                       <- Merge results from all shards

Strategies to minimize cross-shard queries:

Denormalize — duplicate data so each shard has what it needs
Secondary indexes — maintain a global index (like Elasticsearch) for cross-cutting queries
Application-level joins — fetch from multiple shards in your application code and merge results

In interviews, always acknowledge this tradeoff. Sharding optimizes for one access pattern at the expense of others.

Resharding

When load changes or shards become unbalanced, you need to redistribute data. This is one of the hardest operational challenges in sharding.

The Naive Approach (Modulo Rehashing)

If you add a shard and use modulo-based assignment, most keys change shards. For `N` shards becoming `N+1`, approximately `N/(N+1)` of all keys need to move.

Consistent Hashing

Consistent hashing minimizes data movement. When a new shard is added, only `K/N` keys need to move (where `K` is total keys and `N` is new shard count). Each shard is mapped to positions on a hash ring:

1

2

3

4

5

6

7

        Shard A
       /       \
   Key 1       Shard B
      |           |
   Key 2       Key 3
       \       /
        Shard C

Virtual nodes (multiple positions per physical shard) further improve balance.

Online Resharding

Production systems use a process like this:

Create the new shard(s)
Start double-writing to both old and new locations
Backfill historical data from old shard to new shard
Verify consistency
Switch reads to the new shard layout
Stop writing to old locations and clean up

This is essentially how Vitess (YouTube's MySQL sharding layer) handles resharding.

Real-World Examples

Instagram (PostgreSQL Sharding)

Instagram shards their PostgreSQL databases by `user_id`. Each logical shard ID is embedded into the generated IDs themselves — the shard ID is encoded in the high bits of the primary key. This means you can determine which shard owns a row just by looking at its ID, without consulting a directory.

1

2

ID structure (64 bits):
  [41 bits: timestamp] [13 bits: shard_id] [10 bits: sequence]

Pinterest assigns each piece of data a 64-bit ID that encodes the shard number. They use a fixed number of logical shards (much larger than physical shards) and map multiple logical shards to each physical machine. When they need to scale, they split physical machines without changing the logical sharding scheme.

Vitess (YouTube)

Vitess is a database clustering system for MySQL that handles sharding transparently. It supports range-based and hash-based sharding, online resharding, and acts as a proxy that routes queries to the correct shard.

Common Trade-Offs

Aspect	Single DB	Sharded DB
Complexity	Simple	Significantly more complex
Joins	Native SQL joins	Cross-shard joins are expensive
Transactions	ACID guaranteed	Distributed transactions needed
Schema changes	One migration	Must coordinate across all shards
Consistency	Strong	Eventual (across shards)
Operational cost	Low	High

Interview Tips

When to propose sharding: Only when you have demonstrated that a single database cannot handle the load. Always start with vertical scaling, read replicas, and caching first. Sharding is a last resort because of its complexity.

Always specify your shard key and justify it. Explain what access patterns it optimizes for and what queries become harder.

Acknowledge the tradeoffs. Cross-shard queries, distributed transactions, and resharding complexity are the three biggest costs. An interviewer wants to see that you understand these.

Know the numbers. A single PostgreSQL instance can typically handle ~10K transactions per second and store a few terabytes. If your back-of-envelope math shows you need more, sharding is justified.

Mention consistent hashing when discussing how to distribute data. It shows you understand how to handle the resharding problem gracefully.

Common interview question pattern: "How would you shard a database for a social network?" The answer depends entirely on the access patterns — user-centric queries suggest sharding by `user_id`, but you need to discuss how to handle the follower graph (which is inherently cross-shard).