Scalability Patterns
Prevent cascading failures by detecting when a downstream service is unhealthy and failing fast. Covers closed, open, and half-open states.
Imagine your e-commerce checkout service calls a payment gateway. The gateway is experiencing intermittent failures: it responds to only 10% of requests, while the rest time out after 30 seconds. Without protection, most checkout attempts block for 30 seconds, your thread pool fills up, and your checkout service becomes unresponsive. The failure then cascades to the product page and the shopping cart, and eventually the entire site goes down. One degraded dependency has taken out your entire system.
The circuit breaker pattern prevents this cascading failure. Like an electrical circuit breaker that trips to prevent a fire, a software circuit breaker detects that a downstream service is failing and stops sending requests to it, failing fast instead of blocking.
A circuit breaker is a state machine with three states:
               failure threshold exceeded
┌──────────┐ ─────────────────────────────> ┌─────────────┐
│  CLOSED  │                                │    OPEN     │
│ (normal) │                                │ (fail fast) │
└──────────┘                                └─────────────┘
     ↑                                          │      ↑
     │ success threshold          reset timeout │      │ trial request
     │ reached                          expires │      │ fails
     │              ┌─────────────┐             │      │
     └───────────── │  HALF-OPEN  │ <───────────┘      │
                    │  (testing)  │ ───────────────────┘
                    └─────────────┘
The circuit is closed, and requests flow through normally. The circuit breaker monitors the failure rate. If the failure rate exceeds a threshold (e.g., 50% of the last 100 requests fail), the circuit trips to the Open state.
The circuit is open. All requests are immediately rejected with a fallback response (cached data, default value, or error message) without contacting the downstream service. This protects the downstream service from additional load and prevents your service from wasting resources waiting for timeouts.
The circuit stays open for a configured reset timeout (e.g., 30 seconds). After this period, it transitions to Half-Open.
The circuit breaker allows a limited number of trial requests through to the downstream service. If these requests succeed, the circuit transitions back to Closed. If they fail, the circuit goes back to Open and the reset timeout restarts.
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the downstream service."""


class CircuitBreaker:
    # Note: this sketch is not thread-safe; guard state changes with a lock
    # before sharing one breaker across threads.
    def __init__(self, failure_threshold=0.5, window_size=100,
                 reset_timeout=30, half_open_max_calls=5, min_calls=10):
        self.failure_threshold = failure_threshold
        self.window_size = window_size
        self.reset_timeout = reset_timeout
        self.half_open_max_calls = half_open_max_calls
        self.min_calls = min_calls  # do not judge the failure rate on fewer calls
        self.state = State.CLOSED
        self.results = deque(maxlen=window_size)  # True=success, False=failure
        self.last_failure_time = 0.0
        self.half_open_calls = 0

    def call(self, func, *args, **kwargs):
        if self.state == State.OPEN:
            # time.monotonic() is unaffected by system clock adjustments
            if time.monotonic() - self.last_failure_time > self.reset_timeout:
                self.state = State.HALF_OPEN
                self.half_open_calls = 0
            else:
                raise CircuitOpenError("Circuit is open, failing fast")

        if self.state == State.HALF_OPEN:
            if self.half_open_calls >= self.half_open_max_calls:
                raise CircuitOpenError("Half-open call limit reached")
            self.half_open_calls += 1

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_success(self):
        self.results.append(True)
        if self.state == State.HALF_OPEN:
            # Close once every permitted trial call has gone through successfully
            if self.half_open_calls >= self.half_open_max_calls:
                self.state = State.CLOSED
                self.results.clear()

    def _record_failure(self):
        self.results.append(False)
        self.last_failure_time = time.monotonic()
        if self.state == State.HALF_OPEN:
            self.state = State.OPEN  # a failed trial reopens the circuit
            return
        if self.state == State.CLOSED and len(self.results) >= self.min_calls:
            failure_rate = self.results.count(False) / len(self.results)
            if failure_rate >= self.failure_threshold:
                self.state = State.OPEN

Getting the thresholds right is critical. Too sensitive and the circuit trips on normal fluctuations. Too lenient and the circuit never trips when it should.
The failure rate threshold is the percentage of failed requests that triggers the circuit to open; 50% is the value used in the examples here.
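Use a sliding window, either time-based or count-based, to calculate the failure rate. A count-based window looks at the last N calls; a time-based window looks at the calls made in the last N seconds. Time-based windows are generally preferred because they adapt to varying traffic volumes.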
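A time-based window can be sketched with a deque of timestamped outcomes. This is a minimal illustration, not a production implementation: the class name `TimeWindowFailureRate`, the 60-second default, and the `min_calls` guard are assumptions chosen for the example, and the clock is injectable only so the behavior can be tested.

```python
import time
from collections import deque


class TimeWindowFailureRate:
    """Failure rate over the last `window_seconds` seconds (hypothetical helper)."""

    def __init__(self, window_seconds=60.0, min_calls=10, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.min_calls = min_calls  # below this, the rate is not trusted
        self.clock = clock          # injectable so tests can control time
        self.events = deque()       # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded):
        self.events.append((self.clock(), succeeded))
        self._evict()

    def _evict(self):
        # Drop outcomes that have aged out of the window.
        cutoff = self.clock() - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()

    def failure_rate(self):
        """Return the failure rate in [0, 1], or None if there are too few calls."""
        self._evict()
        if len(self.events) < self.min_calls:
            return None
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)
```

Returning `None` for a sparse window keeps a handful of failures during a quiet period from tripping the circuit, which is the same job the minimum-call check does in a count-based window.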
How long the circuit stays open before testing recovery. Too short and you hammer a recovering service. Too long and you waste time waiting when the service has recovered.
A good starting point is 30 seconds, with adaptive backoff: if the half-open test fails, double the reset timeout (up to a maximum).
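The doubling rule can be sketched as a small helper that the breaker consults each time it opens. The class name `ResetTimeoutBackoff` and the 300-second cap are assumptions for illustration; only the 30-second starting point and the doubling come from the text.

```python
class ResetTimeoutBackoff:
    """Tracks the reset timeout, doubling it after each failed half-open test."""

    def __init__(self, base=30.0, factor=2.0, maximum=300.0):
        self.base = base
        self.factor = factor
        self.maximum = maximum  # assumed cap; tune to your recovery expectations
        self.current = base

    def on_probe_failure(self):
        # The half-open test failed: wait longer before the next attempt.
        self.current = min(self.current * self.factor, self.maximum)
        return self.current

    def on_recovery(self):
        # The circuit closed again: return to the base timeout.
        self.current = self.base
        return self.current
```

With these defaults the timeout grows 30, 60, 120, 240 and then stays at 300 seconds until the service recovers.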
Not all errors should trip the circuit. A 400 Bad Request means your request was wrong, not that the downstream service is unhealthy. Timeouts, connection errors, and 5xx responses, by contrast, do point at the dependency.
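One way to encode this distinction is a small classifier that the breaker calls before recording an outcome. The function name `counts_as_failure` and the exact rules are assumptions for this sketch; where to draw the line (for example, how to treat 429 Too Many Requests) depends on the dependency.

```python
def counts_as_failure(status_code=None, exc=None):
    """Decide whether an outcome should count against the circuit (a sketch)."""
    if exc is not None:
        # Timeouts and connection problems suggest the dependency is unhealthy.
        # Other exceptions (validation errors, bugs in our own code) do not.
        return isinstance(exc, (TimeoutError, ConnectionError))
    if status_code is None:
        return False
    # 5xx means the server failed; 4xx means the request itself was wrong.
    return status_code >= 500
```

A breaker would then record a failure only when this returns `True`, and let every other error propagate to the caller without touching the failure rate.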
When the circuit is open, you need a fallback:
Return the last known good response from a local cache. This works well for read-heavy endpoints where stale data is acceptable for a short period.
Return a sensible default. For example, if the recommendation service is down, show trending items instead of personalized recommendations.
Disable the non-critical feature entirely. If the review service is down, still show the product page without reviews.
Accept the request and queue it for processing when the service recovers. This works for writes that can tolerate delay.
def get_product_recommendations(user_id):
    try:
        return circuit_breaker.call(recommendation_service.get, user_id)
    except CircuitOpenError:
        # Fallback: return cached recommendations or trending items
        cached = cache.get(f"recs:{user_id}")
        if cached:
            return cached
        return get_trending_items()

The circuit breaker protects against cascading failures from one dependency. The bulkhead pattern isolates dependencies from each other, preventing a slow dependency from consuming all your resources.
Named after ship bulkheads that contain flooding to one compartment, the pattern assigns separate thread pools or connection pools to each dependency.
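A bulkhead does not have to be a dedicated thread pool. The sketch below uses a semaphore per dependency instead, which caps concurrent calls and rejects the excess immediately. The names `Bulkhead` and `BulkheadFullError` are hypothetical, and the limits are illustrative.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's bulkhead has no free slots."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: reject immediately instead of queueing.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per dependency, so a slow payment service cannot
# use up the capacity reserved for inventory or email calls.
payment_bulkhead = Bulkhead("payment", max_concurrent=40)
inventory_bulkhead = Bulkhead("inventory", max_concurrent=30)
email_bulkhead = Bulkhead("email", max_concurrent=30)
```

The semaphore variant runs the call on the caller's thread, so it limits concurrency but cannot abandon a call that hangs; pair it with a timeout on the call itself.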
Without bulkhead:
  Shared thread pool (100 threads)
  ├── Payment Service calls: 90 threads (all waiting on timeout!)
  ├── Inventory Service calls: 5 threads
  └── Email Service calls: 5 threads
  → Payment timeout causes thread starvation for ALL services
With bulkhead:
  Payment pool: 40 threads max (90 requests → 40 active, 50 rejected fast)
  Inventory pool: 30 threads max (5 requests → all served normally)
  Email pool: 30 threads max (5 requests → all served normally)
  → Payment issues are contained
In practice, you use both patterns together:
Request flow:
  Client → Bulkhead (thread pool limit)
         → Circuit Breaker (check state)
         → Timeout (cap wait time)
         → Retry (with backoff, limited attempts)
         → Downstream Service
Netflix pioneered the circuit breaker pattern at scale with Hystrix. It combined circuit breaker, bulkhead (thread pool isolation), timeout, and fallback into a single library. Every call to an external service was wrapped in a Hystrix command.
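The idea of wrapping each call in a command object can be sketched in Python. This is not Hystrix's API: the `Command` class, its parameters, and the defaults are assumptions for the example, which shows only the dedicated pool, the timeout, and the fallback.

```python
from concurrent.futures import ThreadPoolExecutor


class Command:
    """Wraps one dependency call with its own pool, a timeout, and a fallback."""

    def __init__(self, func, fallback, timeout_seconds=1.0, pool_size=10):
        self.func = func
        self.fallback = fallback
        self.timeout_seconds = timeout_seconds
        # A dedicated pool per command keeps one dependency's slow calls
        # off the threads that serve other dependencies.
        self.pool = ThreadPoolExecutor(max_workers=pool_size)

    def execute(self, *args, **kwargs):
        future = self.pool.submit(self.func, *args, **kwargs)
        try:
            return future.result(timeout=self.timeout_seconds)
        except Exception:
            # Covers both a timeout while waiting and an error raised by func.
            future.cancel()  # only has an effect if the call has not started
            return self.fallback(*args, **kwargs)
```

Note that `ThreadPoolExecutor` queues work when all workers are busy rather than rejecting it, so a stricter bulkhead would also need a bound on the queue. A timed-out call keeps running on its worker thread until it returns; the caller simply stops waiting for it.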
Resilience4j is a lightweight, functional alternative to Hystrix. It provides circuit breaker, bulkhead, rate limiter, retry, and time limiter as separate, composable modules.
Resilience4j circuit breaker configuration:
  slidingWindowType: COUNT_BASED
  slidingWindowSize: 100
  failureRateThreshold: 50
  waitDurationInOpenState: 30s
  permittedNumberOfCallsInHalfOpenState: 10
  recordExceptions: [TimeoutException, IOException]
  ignoreExceptions: [BusinessException]
In a service mesh architecture (Istio, Linkerd), the circuit breaker is implemented at the proxy level rather than in application code. In Istio, for example, Envoy sidecar proxies cap connections and pending requests to each upstream service and eject hosts that keep failing (outlier detection). This has the advantage of being language-agnostic and not requiring application changes.
AWS SDK clients have built-in retries with exponential backoff. API Gateway can throttle traffic to protect backend services, and Elastic Load Balancing stops routing to targets that fail health checks. These are not circuit breakers in the strict sense, but they play a similar role at the infrastructure level.
A circuit breaker is only useful if you know when it trips. Alert when a circuit opens: it means a dependency is failing and your service is in degraded mode.
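A minimal way to make a breaker observable is to count call outcomes and state transitions and to fire a hook when the circuit opens. The class `BreakerMetrics`, the metric names, and the `on_open` callback are all assumptions for this sketch; in practice the counters would feed whatever metrics system you already run.

```python
from collections import Counter


class BreakerMetrics:
    """Counts call outcomes and state transitions (hypothetical helper)."""

    def __init__(self, on_open=None):
        self.counts = Counter()
        self.on_open = on_open  # alert hook, called when the circuit opens

    def record_call(self, outcome):
        # outcome is one of "success", "failure", or "rejected" (failed fast)
        self.counts[f"calls.{outcome}"] += 1

    def record_transition(self, old_state, new_state):
        self.counts[f"transitions.{old_state}->{new_state}"] += 1
        if new_state == "open" and self.on_open is not None:
            self.on_open(old_state)
```

A breaker would call `record_transition` wherever it assigns a new state and `record_call` once per request, so the rejected count shows how much traffic is being served by fallbacks.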