The Cascading Failure Problem

Service A calls Service B. Service B becomes slow because its database is overloaded. Service A waits for B's response, holding open hundreds of connections. Soon Service A runs out of threads and starts queuing requests. Then Service A becomes slow. Service C, which depends on A, also starts queuing.

Within minutes, the entire system is unresponsive even though only one component had the original problem. This is a cascading failure, and it's how most major outages happen.

The fix is to limit how dependencies can poison their callers. Two patterns dominate: circuit breakers and bulkheads.

Circuit Breaker

Borrowed from electrical engineering. When current exceeds what the wiring can safely carry, the breaker trips and cuts the circuit. You stop pushing electricity through an overloaded wire because continuing would make things worse.

In code: when a remote call has been failing repeatedly, stop making that call for a while. Return an error immediately instead of waiting for a timeout. This protects both your service (don't waste resources waiting) and the failing service (give it room to recover).

The Three States

CLOSED: normal operation. Calls go through and failures are counted. When failures exceed the threshold, trip to OPEN.
OPEN: calls are rejected immediately, without attempting the remote call. After the cooldown timer expires, move to HALF-OPEN.
HALF-OPEN: allow a few test requests through and watch carefully. On success, move to CLOSED; on failure, go back to OPEN.

The cooldown timer (typically 30 to 60 seconds) gives the failing service time to recover before being hammered again. The HALF-OPEN state probes carefully to confirm health before resuming full traffic.
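To make the state machine concrete, here is a minimal single-threaded sketch in Python. The CircuitBreaker class, its parameter names, and the CircuitBreakerOpen exception are illustrative, not taken from any particular library; a production breaker would also need thread safety and a sliding failure window.

import time

class CircuitBreakerOpen(Exception):
    """Raised when a call is rejected because the breaker is OPEN."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30, half_open_probes=3):
        self.failure_threshold = failure_threshold   # consecutive failures before tripping
        self.cooldown_seconds = cooldown_seconds     # how long to stay OPEN
        self.half_open_probes = half_open_probes     # successes needed to close again
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0
        self.probes_left = 0

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                raise CircuitBreakerOpen("rejecting call: breaker is OPEN")
            self.state = "HALF-OPEN"                 # cooldown elapsed: probe cautiously
            self.probes_left = self.half_open_probes
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == "HALF-OPEN":
            self._trip()                             # still unhealthy: back to OPEN
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()                             # threshold exceeded: trip to OPEN

    def _on_success(self):
        if self.state == "HALF-OPEN":
            self.probes_left -= 1
            if self.probes_left <= 0:
                self.state = "CLOSED"                # probes passed: resume normal traffic
        self.failures = 0

    def _trip(self):
        self.state = "OPEN"
        self.opened_at = time.monotonic()
        self.failures = 0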

Tuning Parameters

Three knobs you'll need to set (example values below):

Failure threshold: how many failures before tripping. Common: 50% over a 10-second sliding window, or 5 consecutive failures.
Cooldown duration: how long to stay OPEN. Too short and you'll re-overload the failing service. Too long and you'll keep returning errors after recovery.
HALF-OPEN test count: how many test requests to send. Too few and you might miss a failure pattern; too many and you risk re-overloading.
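Mapping those knobs onto the sketch above, with hypothetical values (this variant counts consecutive failures rather than a failure rate over a sliding window):

breaker = CircuitBreaker(failure_threshold=5, cooldown_seconds=30, half_open_probes=3)
result = breaker.call(lambda: http_client.call("service-b/..."))  # http_client is a placeholder client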

Real Implementations

Hystrix (deprecated but historically influential): Netflix's original library.
Resilience4j: the modern Java go-to.
Polly: .NET's standard.
opossum: a circuit breaker library for Node.js.
Service mesh (Envoy, Istio, Linkerd): bake circuit breaking into the network layer so applications don't need explicit code.

Bulkheads

Named after the watertight partitions in a ship's hull: if one compartment floods, the others stay dry. The principle: isolate resources so that exhaustion in one place doesn't take down everything.

Concrete implementations:

1. Thread Pool Isolation

Each external dependency gets its own thread pool. If Service B is slow and exhausts its pool, calls to Service C still work (they have their own pool).

from concurrent.futures import ThreadPoolExecutor  # http_client below is a placeholder HTTP client

# Without bulkheads: one shared pool, so B's slowness blocks C
shared_pool = ThreadPoolExecutor(max_workers=100)
shared_pool.submit(http_client.call, "service-b/...")   # hangs, tying up shared workers
shared_pool.submit(http_client.call, "service-c/...")   # also blocked once the shared pool fills up

# With bulkheads: a separate pool per dependency
service_b_pool = ThreadPoolExecutor(max_workers=50)
service_c_pool = ThreadPoolExecutor(max_workers=50)
service_b_pool.submit(http_client.call, "service-b/...")  # hangs, but only inside B's pool
service_c_pool.submit(http_client.call, "service-c/...")  # unaffected, runs normally

2. Connection Pool Isolation

The same idea applied to database connections, HTTP connection pools, and message queue connections. One slow consumer cannot starve the others.

3. Service-Level Bulkheads

Run separate instances of the same service for different consumer tiers. Premium customers get one cluster. Free users get another. A free-tier traffic spike doesn't affect paying customers.

4. Tenant Isolation

In multi-tenant SaaS, give each tenant (or each tier) limits on resources. One bad tenant cannot consume all CPU/memory and starve the rest.
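One minimal sketch of that idea: a per-tier concurrency cap, enforced with a semaphore per tier. The tier names and limits here are illustrative.

import threading

TIER_LIMITS = {"free": 10, "premium": 100}   # max in-flight requests per tier (illustrative)
tier_slots = {tier: threading.BoundedSemaphore(limit) for tier, limit in TIER_LIMITS.items()}

def handle_request(tier, fn):
    slots = tier_slots[tier]
    if not slots.acquire(blocking=False):
        raise RuntimeError(f"{tier} tier is over its concurrency limit")  # shed load, don't starve others
    try:
        return fn()
    finally:
        slots.release()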

Combining Circuit Breakers and Bulkheads

Use both. Bulkheads contain the blast radius. Circuit breakers cut off the contained area when it's clearly burning.

Example: when Service A calls Service B, A uses all of the following (combined in the sketch after this list):

A bulkhead: 50 threads max for B-bound calls. Even if B is slow, only 50 of A's threads are tied up.
A circuit breaker: if 50% of B calls fail in 10 seconds, stop calling B for 30 seconds.
A timeout: any single B call must complete in 2 seconds or be aborted.
A retry policy: on transient failures, retry once with exponential backoff.
A fallback: if B is unreachable, return cached data or a sensible default.
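A sketch of how those layers compose on the caller's side, reusing the CircuitBreaker from the earlier sketch. The pool size, timeout, backoff delays, and the fetch_from_b and cached_response_for helpers are all illustrative, and this breaker trips on consecutive failures rather than a 50% failure rate.

import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

service_b_pool = ThreadPoolExecutor(max_workers=50)   # bulkhead: at most 50 threads tied up by B
b_breaker = CircuitBreaker(failure_threshold=5, cooldown_seconds=30, half_open_probes=3)

def call_b_resiliently(request):
    def attempt():
        future = service_b_pool.submit(fetch_from_b, request)   # fetch_from_b: hypothetical B client
        return future.result(timeout=2.0)                       # timeout: stop waiting after 2 seconds

    for delay in (0.1, 0.2):                                    # one retry, with backoff between attempts
        try:
            return b_breaker.call(attempt)                      # circuit breaker wraps every attempt
        except CircuitBreakerOpen:
            break                                               # breaker is OPEN: go straight to the fallback
        except (FutureTimeout, ConnectionError):
            time.sleep(delay)                                   # transient failure: back off, then retry
    return cached_response_for(request)                         # fallback: hypothetical stale-cache lookup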

This combination is sometimes called resilient inter-service communication. It's the difference between a system that gracefully degrades and one that catastrophically fails.

The Fallback Question

When the breaker trips, what do you return? Options:

Cached data: stale, but better than nothing.
Default value: "0 unread messages" instead of an error.
Reduced functionality: hide the feature that depends on the broken service.
Error response: tell the caller honestly. They can decide what to do.

The right fallback depends on the use case. Showing stale data is fine for a dashboard. It's not fine for a price quote on a trading platform.
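As a tiny illustration of the first two options, reusing the breaker from the earlier sketches (fetch_unread_count, unread_cache, and user_id are hypothetical):

try:
    unread = breaker.call(fetch_unread_count)    # normal path
except CircuitBreakerOpen:
    unread = unread_cache.get(user_id, 0)        # cached value if present, otherwise a safe default of 0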

The One Thing to Remember

Most outages aren't caused by single component failures. They're caused by failures cascading through dependencies. Circuit breakers stop the spread by refusing to keep calling broken services. Bulkheads contain the blast radius so one failure can't starve the rest. Use both, and assume every service you depend on will eventually misbehave; design for that day before it arrives.