Scaling: Vertical vs Horizontal

Understand when to scale up a single machine and when to scale out across many: the trade-offs, the bottlenecks, and where each approach makes sense.

Every system eventually hits a wall. Users grow, data accumulates, and the single server that worked fine for your first thousand users starts buckling under load. At that point you face a fundamental decision: do you make your existing machine bigger, or do you add more machines? This is the vertical vs horizontal scaling question, and it comes up in nearly every system design interview.

Vertical Scaling (Scaling Up)

Vertical scaling means giving your existing server more resources — more CPU cores, more RAM, faster disks, better network cards. You take the same machine and make it beefier.

Why it is appealing

Vertical scaling is simple. Your application code does not change. You do not need to worry about distributing state across machines, coordinating between nodes, or handling network partitions. A single powerful machine means a single point of coordination: one database, one cache, one application process that can access all the data it needs through local memory.

For many early-stage systems, vertical scaling is the right answer. A modern server with 128 CPU cores, 2 TB of RAM, and NVMe storage can handle a surprising amount of work. Companies like Stack Overflow famously served millions of users from a small number of very powerful machines.

The limits

There is a hard ceiling on how big a single machine can get. You cannot buy a server with a million cores. As you approach the high end of available hardware, costs increase super-linearly — a machine with twice the resources often costs more than twice as much.

More critically, a single machine is a single point of failure. If it goes down, everything goes down. You get zero fault tolerance from vertical scaling alone.

Vertical Scaling Curve:

Performance ▲
             │         ╭── Hardware ceiling
             │        ╱
             │      ╱
             │    ╱
             │  ╱
             │╱
             └──────────────────► Cost
             (super-linear cost growth)

Horizontal Scaling (Scaling Out)

Horizontal scaling means adding more machines to your system. Instead of one powerful server, you have ten, a hundred, or a thousand commodity servers working together.

Why it matters

Horizontal scaling is not bound by the hardware ceiling of a single machine: when you need more capacity, you add more nodes. It also gives you fault tolerance, because if one machine dies, the others keep serving traffic. Cloud providers make this especially easy, since new instances can typically be launched within minutes.
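One rough way to quantify the fault-tolerance benefit is sketched below. It assumes node failures are independent, which rarely holds exactly in practice because shared racks, networks, and deployments correlate failures. Under that assumption, if each node is up with probability a, at least one of n nodes is up with probability 1 - (1 - a)^n. The function name and the 99% figure are illustrative.

```python
def availability_of_any(node_availability: float, node_count: int) -> float:
    """Probability that at least one of `node_count` independent nodes is up."""
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # All nodes are down with probability (1 - a)^n; subtract from 1.
    return 1.0 - (1.0 - node_availability) ** node_count


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, f"{availability_of_any(0.99, n):.6f}")
```

Each extra redundant node multiplies the chance of a total outage by the per-node failure probability. That is why even two or three modest machines can beat one very reliable machine on availability.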

The complexity cost

Horizontal scaling introduces significant architectural complexity:

State management: If your application stores user sessions in memory, a user whose request lands on server A cannot have their next request go to server B (unless you externalize session state).
Data consistency: With multiple database replicas, you need to decide how to keep them in sync and what happens when they disagree.
Network overhead: Communication between machines over the network is orders of magnitude slower than a local function call.
Coordination: Distributed locking, leader election, and consensus protocols become necessary.

Stateless vs stateful services

The key insight for horizontal scaling is to make your services stateless. A stateless service stores no local data between requests — all state lives in an external store (database, cache, object storage). When a service is stateless, any instance can handle any request, and you can add or remove instances freely.

# Placeholder for application logic (hypothetical; stands in for the real handler)
def process(session, request):
    return {"session": session, "request": request}

# Stateful: hard to scale horizontally
class SessionServer:
    def __init__(self):
        self.sessions = {}  # Local state, tied to this machine

    def handle_request(self, user_id, request):
        # Returns None if the session was created on a different server
        session = self.sessions.get(user_id)
        return process(session, request)

# Stateless: easy to scale horizontally
class StatelessServer:
    def __init__(self, redis_client):
        self.redis = redis_client  # External state store shared by all instances

    def handle_request(self, user_id, request):
        # Any server can read this key; returns None if no session exists yet
        session = self.redis.get(f"session:{user_id}")
        return process(session, request)
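To see why externalized state matters, here is a minimal self-contained sketch. `InMemoryStore` is a hypothetical stand-in for Redis (a plain dict shared by all servers) so the example runs without a network. `CounterServer` and the round-robin loop are likewise illustrative, not part of any real library.

```python
import itertools


class InMemoryStore:
    """Hypothetical stand-in for an external store such as Redis."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value


class CounterServer:
    """Stateless server: counts requests per user in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # shared, external to this instance

    def handle_request(self, user_id):
        key = f"session:{user_id}"
        count = self.store.get(key, 0) + 1
        self.store.set(key, count)
        return f"{self.name} served request #{count} for {user_id}"


def demo():
    store = InMemoryStore()
    servers = [CounterServer(f"server-{i}", store) for i in "AB"]
    round_robin = itertools.cycle(servers)  # naive load balancer
    return [next(round_robin).handle_request("alice") for _ in range(4)]


if __name__ == "__main__":
    for line in demo():
        print(line)
```

The request counter keeps increasing even though consecutive requests alternate between the two servers, because neither server holds the count itself. The read-then-write in `handle_request` is not atomic. A real deployment would rely on an atomic operation in the store (Redis offers `INCR`, for example) to avoid lost updates under concurrency.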

Comparison Table

┌──────────────────────┬────────────────────────┬────────────────────────┐
│ Dimension            │ Vertical               │ Horizontal             │
├──────────────────────┼────────────────────────┼────────────────────────┤
│ Complexity           │ Low                    │ High                   │
│ Cost curve           │ Super-linear           │ Near-linear            │
│ Upper limit          │ Hardware ceiling       │ Practically unlimited  │
│ Fault tolerance      │ None (single machine)  │ Built-in redundancy    │
│ Downtime to scale    │ Often yes (reboot)     │ No (add nodes live)    │
│ Data consistency     │ Simple (one node)      │ Requires coordination  │
│ Network overhead     │ None (local)           │ Significant            │
│ Best for             │ Early stage, databases │ Web servers, stateless │
└──────────────────────┴────────────────────────┴────────────────────────┘

Database Scaling — The Hard Part

Stateless application servers are straightforward to scale horizontally. Databases are not. Your database is inherently stateful, and scaling it out requires one of several strategies:

Read replicas: Replicate data to follower databases that handle read queries. The leader handles all writes. This works well when your workload is read-heavy, as most web applications are. Because replication is usually asynchronous, a follower can briefly return stale data.
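The leader/follower split can be sketched as a small router. This is an illustration under simplifying assumptions, not how any particular database driver works. `FakeDatabase` and `ReplicatedCluster` are hypothetical names, and replication is modeled as an explicit `replicate()` call so that the window in which followers lag behind the leader is visible.

```python
import itertools


class FakeDatabase:
    """Hypothetical in-memory database node used only for illustration."""

    def __init__(self, name):
        self.name = name
        self.rows = {}


class ReplicatedCluster:
    """Routes writes to the leader and spreads reads across followers."""

    def __init__(self, follower_count):
        if follower_count < 1:
            raise ValueError("follower_count must be at least 1")
        self.leader = FakeDatabase("leader")
        self.followers = [FakeDatabase(f"follower-{i}")
                          for i in range(follower_count)]
        self._next_follower = itertools.cycle(self.followers)

    def write(self, key, value):
        self.leader.rows[key] = value  # all writes go to the leader

    def read(self, key):
        follower = next(self._next_follower)  # round-robin over followers
        return follower.name, follower.rows.get(key)

    def replicate(self):
        """Copy the leader's data to every follower (a background process in real systems)."""
        for follower in self.followers:
            follower.rows = dict(self.leader.rows)


if __name__ == "__main__":
    cluster = ReplicatedCluster(follower_count=2)
    cluster.write("user:1", "Ada")
    print(cluster.read("user:1"))  # value is None: replication has not run yet
    cluster.replicate()
    print(cluster.read("user:1"))  # value is now visible on the follower
```

The first read misses the freshly written row. Applications that need to read their own writes immediately often route those specific reads to the leader instead.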

Sharding: Split your data across multiple databases, each responsible for a subset of the data. This is the path to truly horizontal database scaling, but it introduces complexity around cross-shard queries, rebalancing, and application-level routing.
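A minimal sketch of application-level routing, assuming simple hash-modulo placement (one of several possible schemes; the function names are illustrative). It also measures the rebalancing problem: how many keys change shards when a shard is added.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash.

    Python's built-in hash() is avoided because string hashing is
    randomized per process, so it would route inconsistently.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    keys = list(keys)
    if not keys:
        return 0.0
    moved = sum(1 for k in keys
                if shard_for(k, old_count) != shard_for(k, new_count))
    return moved / len(keys)


if __name__ == "__main__":
    user_ids = [f"user-{i}" for i in range(10_000)]
    print("user-42 lives on shard", shard_for("user-42", 4))
    print("fraction moved going from 4 to 5 shards:",
          round(moved_fraction(user_ids, 4, 5), 2))
```

With modulo placement, going from 4 to 5 shards relocates roughly four out of five keys, which is why rebalancing is listed among the costs. Schemes such as consistent hashing are commonly used to reduce that movement.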

Vertical scaling first: Many experienced engineers advocate scaling your database vertically as long as possible. A single, powerful database machine is far simpler to operate than a sharded cluster. Large Amazon RDS instance classes offer 128 or more vCPUs and 1 TB or more of RAM, which handles a lot of queries.

Typical scaling journey:

1. Single server (app + DB)
2. Separate app and DB servers
3. Add read replicas for the DB
4. Scale app servers horizontally (stateless)
5. Add caching layer (Redis/Memcached)
6. Shard the database (only when necessary)

Auto-Scaling

In cloud environments, horizontal scaling can be automated. Auto-scaling groups monitor metrics like CPU utilization, request queue depth, or custom application metrics, and add or remove instances accordingly.

Key considerations for auto-scaling:

Cool-down periods: Prevent thrashing by waiting between scale events.
Warm-up time: New instances need time to start, load caches, and begin handling traffic effectively.
Predictive scaling: For predictable traffic patterns (like daily peaks), schedule capacity changes ahead of time rather than reacting.
Minimum and maximum bounds: Always set a floor (availability) and ceiling (cost control).

# Simplified auto-scaling logic
def evaluate_scaling(metrics, config):
    avg_cpu = metrics.get_avg_cpu(window_minutes=5)
    current_instances = metrics.get_instance_count()

    if avg_cpu > config.scale_up_threshold:  # e.g., 70%
        desired = min(current_instances + config.scale_up_step,
                      config.max_instances)
    elif avg_cpu < config.scale_down_threshold:  # e.g., 30%
        desired = max(current_instances - config.scale_down_step,
                      config.min_instances)
    else:
        desired = current_instances

    return desired
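The simplified logic above reacts on every evaluation. A sketch of how a cool-down period could be layered on top follows. `ScalingConfig` and `CooldownScaler` are hypothetical names, the default values are illustrative, and real auto-scaling services implement this differently.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalingConfig:
    """Hypothetical configuration; field names and defaults are illustrative."""
    scale_up_threshold: float = 70.0    # percent CPU
    scale_down_threshold: float = 30.0  # percent CPU
    step: int = 1
    min_instances: int = 2
    max_instances: int = 10
    cooldown_seconds: float = 300.0


class CooldownScaler:
    """Threshold-based scaler that refuses to act during the cool-down window."""

    def __init__(self, config: ScalingConfig):
        self.config = config
        self._last_change: Optional[float] = None

    def desired_instances(self, avg_cpu: float, current: int,
                          now: Optional[float] = None) -> int:
        now = time.monotonic() if now is None else now
        cfg = self.config
        in_cooldown = (self._last_change is not None
                       and now - self._last_change < cfg.cooldown_seconds)
        if in_cooldown:
            return current  # too soon after the last change; hold steady

        if avg_cpu > cfg.scale_up_threshold:
            desired = min(current + cfg.step, cfg.max_instances)
        elif avg_cpu < cfg.scale_down_threshold:
            desired = max(current - cfg.step, cfg.min_instances)
        else:
            desired = current

        if desired != current:
            self._last_change = now  # start a new cool-down window
        return desired


if __name__ == "__main__":
    scaler = CooldownScaler(ScalingConfig())
    print(scaler.desired_instances(85.0, current=3, now=0.0))
    print(scaler.desired_instances(85.0, current=4, now=60.0))
    print(scaler.desired_instances(85.0, current=4, now=400.0))
```

Passing `now` explicitly keeps the example deterministic. The second call falls inside the 300-second window and leaves the instance count unchanged even though CPU is still high.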

Real-World Examples

Netflix: Horizontal scaling with thousands of stateless microservices on AWS. Each service auto-scales independently. Data is sharded across Cassandra clusters.

Stack Overflow: Primarily vertical scaling. Serves hundreds of millions of page views per month with a handful of powerful servers. The team chose simplicity over distributed complexity.

Slack: Hybrid approach. Application servers scale horizontally, but they invested heavily in scaling their MySQL databases vertically before eventually sharding.

Trade-Offs and When to Choose

Start vertical when:

You are early stage and engineering time is your scarcest resource
Your workload fits comfortably on one machine
You are running a relational database and want to avoid sharding complexity

Go horizontal when:

You need fault tolerance and high availability
Your traffic is unpredictable and you need elastic scaling
You have reached the limits of vertical scaling
Your services are stateless or can be made stateless

Interview Tips

When discussing scaling in an interview, demonstrate that you understand the nuance:

Do not jump to horizontal scaling immediately. Acknowledge that vertical scaling is simpler and often sufficient. This shows maturity.

Identify what is stateless and what is stateful. Separate them. Scale the stateless parts horizontally first — that is the easy win.

Explain the database scaling journey. Show that you understand read replicas, caching, and sharding as a progression, not a first step.

Mention specific numbers. "A single PostgreSQL instance on modern hardware can handle tens of thousands of simple queries per second" shows practical knowledge.

Discuss auto-scaling trade-offs. Mention warm-up time, cool-down periods, and predictive vs reactive scaling to demonstrate operational awareness.
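When quoting numbers, a quick back-of-envelope estimate helps tie them to a design. The sketch below is one simple way to size a stateless tier. The function name, the 70% utilization target, and the traffic figures are all hypothetical and chosen only to show the arithmetic.

```python
import math


def instances_needed(peak_qps: float, qps_per_instance: float,
                     target_utilization: float = 0.7) -> int:
    """Instances required to serve peak_qps while staying under target utilization."""
    if peak_qps < 0 or qps_per_instance <= 0:
        raise ValueError("peak_qps must be >= 0 and qps_per_instance must be > 0")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_qps = qps_per_instance * target_utilization  # leave headroom per instance
    return max(1, math.ceil(peak_qps / usable_qps))


if __name__ == "__main__":
    # Hypothetical workload: 12,000 requests/s at peak, 1,500 requests/s per instance.
    print(instances_needed(12_000, 1_500, target_utilization=0.7))
```

If the result is one or two instances, that is a signal that a single larger machine may be enough. A result in the dozens points toward a horizontally scaled tier.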

The interviewer wants to see that you can reason about trade-offs, not that you always reach for the most complex solution. Sometimes the best answer is a bigger machine.