Circuit Breaker Pattern

Prevent cascading failures by detecting when a downstream service is unhealthy and failing fast. Covers closed, open, and half-open states.

Imagine your e-commerce checkout service calls a payment gateway. The payment gateway is experiencing intermittent failures, responding to only 10% of requests while the rest time out after 30 seconds. Without protection, every checkout attempt blocks for 30 seconds, your thread pool fills up, your checkout service becomes unresponsive, and the failure cascades to the product page, the shopping cart, and eventually the entire site. One degraded dependency has taken out your entire system.

The circuit breaker pattern prevents this cascading failure. Like an electrical circuit breaker that trips to prevent a fire, a software circuit breaker detects that a downstream service is failing and stops sending requests to it, failing fast instead of blocking.

The Three States

A circuit breaker is a state machine with three states:

                    failure threshold
                        exceeded
    ┌────────┐  ──────────────────────>  ┌────────┐
    │ CLOSED │                           │  OPEN  │
    │(normal)│                           │ (fail  │
    └────────┘                           │  fast) │
        ↑                                └─┬────┬─┘
        │                    reset timeout │    ↑ trial request
        │                          expires │    │ fails
        │        ┌──────────┐              │    │
        │        │HALF-OPEN │ <────────────┘    │
        └────────│(testing) │ ──────────────────┘
     trial       └──────────┘
     requests
     succeed

Closed State (Normal Operation)

The circuit is closed, and requests flow through normally. The circuit breaker monitors the failure rate. If the failure rate exceeds a threshold (e.g., 50% of the last 100 requests fail), the circuit trips to the Open state.

Open State (Fail Fast)

The circuit is open. All requests are immediately rejected with a fallback response (cached data, default value, or error message) without contacting the downstream service. This protects the downstream service from additional load and prevents your service from wasting resources waiting for timeouts.

The circuit stays open for a configured reset timeout (e.g., 30 seconds). After this period, it transitions to Half-Open.

Half-Open State (Testing Recovery)

The circuit breaker allows a limited number of trial requests through to the downstream service. If these requests succeed, the circuit transitions back to Closed. If they fail, the circuit goes back to Open and the reset timeout restarts.

import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the downstream service."""


class CircuitBreaker:
    """Count-based circuit breaker. Not thread-safe: guard call() with a lock
    if one breaker is shared across threads."""

    def __init__(self, failure_threshold=0.5, window_size=100,
                 reset_timeout=30, half_open_max_calls=5, min_calls=10):
        self.failure_threshold = failure_threshold  # failure rate (0..1) that trips the circuit
        self.window_size = window_size
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.half_open_max_calls = half_open_max_calls
        self.min_calls = min_calls                  # samples required before the circuit may trip

        self.state = State.CLOSED
        self.results = deque(maxlen=window_size)  # True=success, False=failure
        self.opened_at = 0.0
        self.half_open_calls = 0
        self.half_open_successes = 0

    def call(self, func, *args, **kwargs):
        if self.state == State.OPEN:
            # time.monotonic() is unaffected by wall-clock adjustments
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = State.HALF_OPEN
                self.half_open_calls = 0
                self.half_open_successes = 0
            else:
                raise CircuitOpenError("Circuit is open, failing fast")

        if self.state == State.HALF_OPEN:
            if self.half_open_calls >= self.half_open_max_calls:
                raise CircuitOpenError("Half-open call limit reached")
            self.half_open_calls += 1

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_success(self):
        if self.state == State.HALF_OPEN:
            # Close only after every permitted trial call has succeeded
            self.half_open_successes += 1
            if self.half_open_successes >= self.half_open_max_calls:
                self.state = State.CLOSED
                self.results.clear()
            return
        self.results.append(True)

    def _record_failure(self):
        if self.state == State.HALF_OPEN:
            # A single failed trial call reopens the circuit
            self._trip()
            return
        self.results.append(False)
        if len(self.results) >= self.min_calls:
            failure_rate = self.results.count(False) / len(self.results)
            if failure_rate >= self.failure_threshold:
                self._trip()

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = time.monotonic()  # restarts the reset timeout

Configuration Parameters

Getting the thresholds right is critical. Too sensitive and the circuit trips on normal fluctuations. Too lenient and the circuit never trips when it should.

Failure Threshold

The percentage of failed requests that triggers the circuit to open. Common values:

  • 50% for services with moderate error tolerance.
  • 25% for critical dependencies where a failure rate of even one in four is alarming.
  • 80% for noisy services where some failures are expected.

Sliding Window

Use a sliding window (time-based or count-based) to calculate the failure rate:

  • Count-based: Track the last N requests (e.g., last 100).
  • Time-based: Track all requests in the last T seconds (e.g., last 60 seconds).

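Time-based windows are generally preferred because they cover the same span of time regardless of traffic volume. The last 100 requests might span hours on a quiet endpoint and a fraction of a second on a busy one.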
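A time-based window can be sketched as a queue of timestamped outcomes from which old entries are evicted. The TimeWindow class, its method names, and the injectable clock are illustrative assumptions, not part of any particular library:

```python
import time
from collections import deque


class TimeWindow:
    """Failure rate over the last `window_seconds` (hypothetical helper)."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock       # injectable so tests can control time
        self.events = deque()    # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded):
        self.events.append((self.clock(), succeeded))
        self._evict()

    def _evict(self):
        # Drop outcomes that are older than the window
        cutoff = self.clock() - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def count(self):
        self._evict()
        return len(self.events)

    def failure_rate(self):
        self._evict()
        if not self.events:
            return 0.0
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)
```

A breaker built on such a window would still want a minimum request count before tripping, since a quiet 60 seconds may contain only a handful of calls.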

Reset Timeout

How long the circuit stays open before testing recovery. Too short and you hammer a recovering service. Too long and you keep serving fallbacks after the service has already recovered.

A good starting point is 30 seconds, with adaptive backoff: if the half-open test fails, double the reset timeout (up to a maximum).
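The doubling rule fits in a few lines. In this sketch the function name and the 300-second cap are assumptions chosen for illustration; the 30-second base comes from the text above:

```python
def next_reset_timeout(current, probe_succeeded, base=30.0, maximum=300.0):
    """Return the reset timeout, in seconds, for the next open period."""
    if probe_succeeded:
        return base                    # recovered: start from the base again
    return min(current * 2, maximum)   # still failing: back off, but cap it
```

Starting from 30 seconds, repeated failed probes give 60, 120, 240, and then 300 seconds, where the timeout stays until a probe succeeds.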

What Counts as a Failure

Not all errors should trip the circuit:

  • Count: Timeouts, 5xx server errors, connection refused.
  • Don't count: 4xx client errors (bad request, not found), business logic errors, rate limit responses (429).

A 400 Bad Request means your request was wrong, not that the downstream service is unhealthy.
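One possible way to encode these rules is a small classifier that the breaker consults before recording an outcome. The function name and signature are hypothetical, and the exception types are Python's built-ins; a real HTTP client will raise its own exception classes that would need mapping:

```python
def counts_as_failure(status_code=None, exception=None):
    """Decide whether an outcome should count against the circuit breaker."""
    if exception is not None:
        # Timeouts and connection problems point at an unhealthy dependency.
        # ConnectionRefusedError is a subclass of ConnectionError.
        return isinstance(exception, (TimeoutError, ConnectionError))
    if status_code is not None:
        # Only 5xx counts; 4xx responses (including 429) are ignored.
        return 500 <= status_code <= 599
    return False
```

Any other exception, such as a business-logic error, falls through to "not a failure" and leaves the breaker's statistics untouched.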

Fallback Strategies

When the circuit is open, you need a fallback:

Cached Response

Return the last known good response from a local cache. This works well for read-heavy endpoints where stale data is acceptable for a short period.

Default Value

Return a sensible default. For example, if the recommendation service is down, show trending items instead of personalized recommendations.

Graceful Degradation

Disable the non-critical feature entirely. If the review service is down, still show the product page without reviews.

Queue for Later

Accept the request and queue it for processing when the service recovers. This works for writes that can tolerate delay.

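# circuit_breaker, recommendation_service, and cache are assumed to be
# module-level objects configured elsewhere.
def get_product_recommendations(user_id):
    try:
        return circuit_breaker.call(recommendation_service.get, user_id)
    except CircuitOpenError:
        # Circuit is open: serve cached recommendations, else trending items.
        # Errors raised by the service itself while the circuit is closed
        # still propagate to the caller.
        cached = cache.get(f"recs:{user_id}")
        if cached:
            return cached
        return get_trending_items()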
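The queue-for-later strategy might look like the sketch below. All names here (submit_order, drain, pending_writes) are hypothetical, CircuitOpenError is redefined locally so the snippet stands alone, and the in-memory deque is a stand-in for what would need to be a durable queue in production:

```python
from collections import deque


class CircuitOpenError(Exception):
    """Stand-in for the error a breaker raises while open (assumed name)."""


pending_writes = deque()  # in-memory only; a real system needs a durable queue


def submit_order(order, send):
    """Try to send now; if the circuit is open, park the order for later."""
    try:
        send(order)
        return "sent"
    except CircuitOpenError:
        pending_writes.append(order)
        return "queued"


def drain(send):
    """Replay queued orders, oldest first; stop at the first rejection."""
    delivered = 0
    while pending_writes:
        try:
            send(pending_writes[0])
        except CircuitOpenError:
            break
        pending_writes.popleft()
        delivered += 1
    return delivered
```

Because a replay can be interrupted and repeated, the downstream write should be idempotent, for example by carrying an order ID the service can deduplicate on.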

The Bulkhead Pattern

The circuit breaker protects against cascading failures from one dependency. The bulkhead pattern isolates dependencies from each other, preventing a slow dependency from consuming all your resources.

Named after ship bulkheads that contain flooding to one compartment, the pattern assigns separate thread pools or connection pools to each dependency.

Without bulkhead:
  Shared thread pool (100 threads)
  ├── Payment Service calls: 90 threads (all waiting on timeout!)
  ├── Inventory Service calls: 5 threads
  └── Email Service calls: 5 threads
  → Payment timeout causes thread starvation for ALL services

With bulkhead:
  Payment pool:   40 threads max (90 requests → 40 active, 50 rejected fast)
  Inventory pool: 30 threads max (5 requests → all served normally)
  Email pool:     30 threads max (5 requests → all served normally)
  → Payment issues are contained
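In Python, a bulkhead can be approximated with one semaphore per dependency. This is a minimal sketch: the Bulkhead class and BulkheadFullError are illustrative names, and the limits mirror the pool sizes in the diagram above:

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's pool has no free slot (hypothetical name)."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Do not wait for a slot: reject at once so callers fail fast.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no capacity left for this dependency")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per dependency, sized as in the diagram above.
bulkheads = {
    "payment": Bulkhead(40),
    "inventory": Bulkhead(30),
    "email": Bulkhead(30),
}
```

A semaphore limits concurrency without dedicating threads, so a stuck payment call still occupies the caller's thread; separate thread pools give stronger isolation at the cost of more threads.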

Combining Circuit Breaker with Bulkhead

In practice, you use both patterns together:

  1. Bulkhead isolates the resource pool for each dependency.
  2. Circuit breaker monitors each dependency's health and trips when it degrades.
  3. Timeout caps how long any individual request can wait.
  4. Retry re-attempts failed requests a limited number of times before giving up.

Request flow:
  Client → Bulkhead (thread pool limit)
         → Circuit Breaker (check state)
         → Retry (with backoff, limited attempts)
         → Timeout (cap wait time per attempt)
         → Downstream Service
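As a rough sketch of the retry and timeout layers, the Python below gives each attempt its own timeout, in line with the description of the timeout as a cap on any individual request. The function names, pool size, and delay values are assumptions for illustration:

```python
import time
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)  # small shared pool for the sketch


def call_with_timeout(func, timeout_s):
    """Run func in the pool and stop waiting after timeout_s seconds."""
    future = _pool.submit(func)
    # The worker thread keeps running after a timeout; only the wait is capped.
    return future.result(timeout=timeout_s)


def call_with_retry(func, attempts=3, timeout_s=1.0, base_delay_s=0.1):
    """Try func up to `attempts` times with exponential backoff between tries."""
    for attempt in range(attempts):
        try:
            return call_with_timeout(func, timeout_s)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller (or breaker) see it
            time.sleep(base_delay_s * 2 ** attempt)
```

Wrapping call_with_retry in a circuit breaker's call method means the breaker records one outcome per logical request, after the retries have been exhausted.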

Real-World Implementations

Netflix Hystrix (Java, now in maintenance mode)

Netflix pioneered the circuit breaker pattern at scale with Hystrix. It combined circuit breaker, bulkhead (thread pool isolation), timeout, and fallback into a single library. Every call to an external service was wrapped in a Hystrix command.

Key design decisions and defaults:

  • Thread pool isolation by default (bulkhead pattern).
  • 10-second rolling window for failure rate calculation.
  • 20-request minimum before the circuit can trip (avoids tripping on low traffic).
  • 5-second sleep window before transitioning to half-open.

Resilience4j (Java, de facto successor to Hystrix)

A lightweight, functional alternative to Hystrix. It provides circuit breaker, bulkhead, rate limiter, retry, and time limiter as separate, composable modules.

Resilience4j circuit breaker configuration:
  slidingWindowType: COUNT_BASED
  slidingWindowSize: 100
  failureRateThreshold: 50
  waitDurationInOpenState: 30s
  permittedNumberOfCallsInHalfOpenState: 10
  recordExceptions: [TimeoutException, IOException]
  ignoreExceptions: [BusinessException]

Envoy Proxy / Service Mesh

In a service mesh architecture (Istio, Linkerd), the circuit breaker is implemented at the proxy level rather than in application code. In Istio, for example, Envoy sidecar proxies monitor the health of upstream services and trip the circuit automatically. This approach is language-agnostic and requires no application changes.

AWS and Cloud-Native

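AWS SDK clients have built-in retry with exponential backoff. API Gateway can shield backend services with request throttling and integration timeouts. Elastic Load Balancers stop routing to targets that fail health checks. None of these is a full three-state circuit breaker, but they provide similar protection at the infrastructure level.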

Monitoring and Alerting

A circuit breaker is only useful if you know when it trips. Essential metrics:

  • Circuit state changes (closed → open, open → half-open, half-open → closed/open).
  • Failure rate per dependency over time.
  • Fallback invocations (how often the fallback path is used).
  • Rejected requests (requests that failed fast due to open circuit).
  • Recovery time (how long circuits stay open).

Alert when a circuit opens. It means a dependency is failing, and your service is in degraded mode.
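A minimal sketch of how these metrics might be collected follows. The BreakerMetrics class and its hook names are hypothetical; a real breaker would call them from its state-transition and rejection paths, and the counters would be exported to whatever metrics system is in use:

```python
from collections import Counter


class BreakerMetrics:
    """Collects circuit breaker events for one dependency (illustrative)."""

    def __init__(self, dependency, alert=print):
        self.dependency = dependency
        self.alert = alert           # callable that delivers an alert message
        self.counters = Counter()
        self.transitions = []        # (old_state, new_state) in order

    def on_state_change(self, old, new):
        self.transitions.append((old, new))
        self.counters[f"transition.{old}->{new}"] += 1
        if new == "open":
            # An opening circuit means the service is now degraded
            self.alert(f"circuit for {self.dependency} opened")

    def on_rejected(self):
        self.counters["rejected"] += 1

    def on_fallback(self):
        self.counters["fallback"] += 1
```

Recording a timestamp alongside each transition would also let you derive recovery time, the interval between a circuit opening and closing again.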

Interview Tips

  • Explain the problem first. Before describing the pattern, paint the cascading failure scenario. The interviewer needs to feel the pain that motivates the solution.
  • Draw the state machine. Show the three states, the transitions between them, and what triggers each transition. A diagram conveys this faster and less ambiguously than prose.
  • Discuss configuration carefully. Saying "50% failure threshold" is good. Explaining why you chose that number (and when you would choose differently) is great.
  • Mention the bulkhead pattern. Circuit breaker and bulkhead are complementary. Bringing up bulkhead unprompted shows depth.
  • Talk about fallback strategies. The circuit breaker trips -- then what? Having a concrete fallback (cached data, default response, graceful degradation) shows you think about the user experience during failures.
  • Address monitoring. In production, a tripped circuit breaker is an operational event that needs alerting. Mentioning this shows production awareness.
  • Know the difference between retry and circuit breaker. Retry helps with transient failures (single request). Circuit breaker helps with sustained failures (many requests). They work together: retry within the circuit breaker, and the circuit breaker tracks retry failures.