Data Storage
Horizontally partition data across multiple database instances. Covers hash-based, range-based, and directory-based sharding strategies and their trade-offs.
Sharding is the practice of splitting a single database into multiple smaller databases, each holding a subset of the data. Every shard is a fully functional database that runs on its own server. When your data grows beyond what a single machine can handle — either in storage, write throughput, or query performance — sharding is how you scale horizontally.
This is one of the most important topics in system design interviews because almost every large-scale system eventually needs it, and the tradeoffs are nuanced.
A single database server has hard limits: CPU, memory, disk I/O, and network bandwidth. Vertical scaling (bigger hardware) eventually hits a ceiling, and it gets exponentially more expensive. Sharding lets you distribute both data and load across multiple machines.
Key benefits:
You apply a hash function to a shard key (e.g., `user_id`) and use modulo to determine which shard stores that row.
def get_shard(user_id, num_shards):
return hash(user_id) % num_shardsPros: Even distribution of data across shards, simple to implement.
Cons: Adding or removing shards requires rehashing most keys. If you go from 4 to 5 shards, roughly 80% of data needs to move. This is why consistent hashing is often used instead.
You partition data based on ranges of the shard key. For example, users with IDs 1–1M go to shard 1, 1M–2M go to shard 2, and so on.
Shard 1: user_id [1, 1_000_000)
Shard 2: user_id [1_000_000, 2_000_000)
Shard 3: user_id [2_000_000, 3_000_000)Pros: Range queries are efficient — you know exactly which shard to hit. Adding new shards is straightforward (just extend the range).
Cons: Prone to hotspots. If recent users are more active, the shard holding the highest user IDs gets disproportionate traffic. Time-based shard keys are especially dangerous here.
A lookup service (directory) maintains a mapping from shard key to shard location. Every query first consults the directory to find the right shard.
Directory Table:
user_id_range -> shard_location
[1, 500K) -> shard-1.db.internal
[500K, 1.2M) -> shard-2.db.internal
[1.2M, 2M) -> shard-3.db.internalPros: Maximum flexibility — you can move data between shards without changing the sharding logic. Supports any partitioning scheme.
Cons: The directory is a single point of failure and a potential bottleneck. It must be highly available and fast (often cached in memory or backed by a replicated store like ZooKeeper).
This is the single most critical decision in sharding. A bad shard key leads to hotspots, cross-shard queries, and operational nightmares.
Good shard key properties:
Example — Instagram:
Instagram shards by `user_id`. Most queries are user-centric: "show me this user's photos," "show me this user's feed." By sharding on `user_id`, these queries hit a single shard.
Example — Pinterest: Pinterest uses a compound shard key. Pins are sharded by the board they belong to, which means loading a board page (the most common operation) requires hitting only one shard.
Anti-pattern — sharding by timestamp: If you shard by creation date, the latest shard gets all the writes while older shards sit idle. This creates a massive hotspot.
Even with a good hash function, hotspots happen. A celebrity user with 100M followers generates disproportionate load on their shard when they post. This is sometimes called the "celebrity problem" or "hot partition" problem.
Mitigation strategies:
Once data is sharded, any query that spans multiple shards becomes expensive. For example, if users are sharded by `user_id`, a query like "find all users in San Francisco" must fan out to every shard.
Client -> Query Router -> [Shard 1, Shard 2, ..., Shard N]
<- Merge results from all shardsStrategies to minimize cross-shard queries:
In interviews, always acknowledge this tradeoff. Sharding optimizes for one access pattern at the expense of others.
When load changes or shards become unbalanced, you need to redistribute data. This is one of the hardest operational challenges in sharding.
If you add a shard and use modulo-based assignment, most keys change shards. For `N` shards becoming `N+1`, approximately `N/(N+1)` of all keys need to move.
Consistent hashing minimizes data movement. When a new shard is added, only `K/N` keys need to move (where `K` is total keys and `N` is new shard count). Each shard is mapped to positions on a hash ring:
Shard A
/ \
Key 1 Shard B
| |
Key 2 Key 3
\ /
Shard CVirtual nodes (multiple positions per physical shard) further improve balance.
Production systems use a process like this:
This is essentially how Vitess (YouTube's MySQL sharding layer) handles resharding.
Instagram shards their PostgreSQL databases by `user_id`. Each logical shard ID is embedded into the generated IDs themselves — the shard ID is encoded in the high bits of the primary key. This means you can determine which shard owns a row just by looking at its ID, without consulting a directory.
ID structure (64 bits):
[41 bits: timestamp] [13 bits: shard_id] [10 bits: sequence]Pinterest assigns each piece of data a 64-bit ID that encodes the shard number. They use a fixed number of logical shards (much larger than physical shards) and map multiple logical shards to each physical machine. When they need to scale, they split physical machines without changing the logical sharding scheme.
Vitess is a database clustering system for MySQL that handles sharding transparently. It supports range-based and hash-based sharding, online resharding, and acts as a proxy that routes queries to the correct shard.
| Aspect | Single DB | Sharded DB |
|---|---|---|
| Complexity | Simple | Significantly more complex |
| Joins | Native SQL joins | Cross-shard joins are expensive |
| Transactions | ACID guaranteed | Distributed transactions needed |
| Schema changes | One migration | Must coordinate across all shards |
| Consistency | Strong | Eventual (across shards) |
| Operational cost | Low | High |
When to propose sharding: Only when you have demonstrated that a single database cannot handle the load. Always start with vertical scaling, read replicas, and caching first. Sharding is a last resort because of its complexity.
Always specify your shard key and justify it. Explain what access patterns it optimizes for and what queries become harder.
Acknowledge the tradeoffs. Cross-shard queries, distributed transactions, and resharding complexity are the three biggest costs. An interviewer wants to see that you understand these.
Know the numbers. A single PostgreSQL instance can typically handle ~10K transactions per second and store a few terabytes. If your back-of-envelope math shows you need more, sharding is justified.
Mention consistent hashing when discussing how to distribute data. It shows you understand how to handle the resharding problem gracefully.
Common interview question pattern: "How would you shard a database for a social network?" The answer depends entirely on the access patterns — user-centric queries suggest sharding by `user_id`, but you need to discuss how to handle the follower graph (which is inherently cross-shard).