What is the difference between a circuit breaker and a bulkhead?

A circuit breaker decides *when* to stop calling a failing service — it trips open after errors cross a threshold so calls fail fast instead of hanging. A bulkhead decides *how much* of your system a single dependency is allowed to consume by isolating its resources into a dedicated pool. One controls timing, the other controls blast radius.

Why are retries not enough to handle a slow service?

Retrying a struggling service adds load to something already drowning. Even with exponential backoff, you accelerate its collapse instead of helping it recover. Retries handle transient blips, not sustained degradation.

What are the three states of a circuit breaker?

Closed (normal, requests pass through), Open (failures crossed the threshold, all calls fail fast), and Half-Open (after a cooldown, a single test request probes whether the dependency has recovered; success closes the breaker again, failure re-opens it).

What is the gap that a circuit breaker alone leaves open?

Between when a dependency starts failing and when the breaker actually trips, requests still flow and threads still get stuck. If the threshold is five failures but a hundred requests arrive in that window, the breaker stopped the bleeding but never prevented the initial hemorrhage. A bulkhead closes that gap by capping resources from the very first request.

Does Resilience4j support both patterns?

Yes. Resilience4j ships a CircuitBreaker module and two bulkhead implementations — SemaphoreBulkhead (limits concurrent calls) and ThreadPoolBulkhead (full thread-pool isolation). You can decorate the same call with both.

Circuit Breakers and Bulkheads — How to Stop One Slow Service From Burning Down Your System

TL;DR

A slow dependency — not a failed one — can starve every thread in your gateway and take down a healthy API. The failure mode is resource exhaustion, not error propagation.
Retries make it worse. Adding load to a drowning service accelerates its collapse.
Circuit breaker = when to stop calling. Three states (Closed → Open → Half-Open). Fails fast in milliseconds instead of waiting 30s for a timeout.
Bulkhead = how much a failure can consume. Isolates resources per dependency so one slow service can't exhaust the shared pool.
They solve different problems. Use both: the bulkhead contains the blast radius from request one, the breaker stops wasting effort once the failure pattern is clear.
Resilience4j gives you both — set a failure threshold + timeout for the breaker, concurrent-call limits for the bulkhead.

Why failure containment matters

Your distributed system is humming. Requests flow, services respond, everything's fast. Then one service gets slow. Maybe its database is overloaded, maybe a downstream dependency is choking. It's not down — just slow.

Here's how that quietly kills you. Your gateway has a hundred threads. Each request to the slow service holds a thread waiting for a response. 50 threads stuck, then 80, then 90. Now there are no threads left — not even for services that are perfectly healthy.

One slow dependency just took down your entire API. Not because it failed, but because it was slow enough to consume every resource you had.
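To put rough numbers on that, Little's law says the average number of busy threads is roughly arrival rate times latency. The sketch below is only back-of-the-envelope arithmetic: the 100-thread pool comes from the example above, while the 10 requests per second and the `ThreadDemand` class are assumptions made for illustration.

```java
/** Back-of-the-envelope thread demand: busy threads ~= arrival rate x latency. */
final class ThreadDemand {

    /** Average number of threads held by calls to one dependency (rounded up). */
    static long threadsHeld(long requestsPerSecond, long latencyMillis) {
        return (requestsPerSecond * latencyMillis + 999) / 1000;
    }

    /** True when this one dependency alone demands at least the whole shared pool. */
    static boolean starvesPool(long requestsPerSecond, long latencyMillis, int poolSize) {
        return threadsHeld(requestsPerSecond, latencyMillis) >= poolSize;
    }
}
```

Under these assumptions, a dependency answering in 100 ms holds about one thread, while the same traffic against a 30-second hang would need 300 threads, three times the whole pool.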

This is the core insight: resilience in distributed systems isn't about error handling. Failure is guaranteed. Resilience is about failure containment — deciding where failures stop spreading and how much of your system they're allowed to consume.

Why retries are the wrong first instinct

Developers reach for retries first. Service B is struggling? Send the request again. With exponential backoff, of course — we're civilized.

But think about what you're actually doing. Retrying a struggling service is adding load to something already drowning. You're not helping it recover, you're accelerating its collapse. Retries are great for transient blips — a dropped packet, a momentary GC pause. They are exactly the wrong tool for sustained degradation.
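The amplification is easy to quantify. As a rough sketch (the class name, the retry count, and the base delay are illustrative assumptions, not values from any particular client), the worst case is when every attempt fails:

```java
/** Worst-case load amplification caused by client-side retries. */
final class RetryAmplification {

    /** Attempts that reach the service when every call fails and is retried maxRetries times. */
    static long worstCaseAttempts(long originalRequests, int maxRetries) {
        return originalRequests * (1L + maxRetries);
    }

    /** Exponential backoff schedule: base, 2x base, 4x base, ... one delay per retry. */
    static long[] backoffDelaysMillis(long baseDelayMillis, int maxRetries) {
        long[] delays = new long[maxRetries];
        for (int i = 0; i < maxRetries; i++) {
            delays[i] = baseDelayMillis << i;
        }
        return delays;
    }
}
```

With three retries, 100 failing requests become 400 attempts. Backoff spreads those attempts out over time, but it does not reduce their number.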

⚠️ Warning: Retry without a circuit breaker is a retry storm waiting to happen. When a service degrades, every client retries simultaneously, multiplying load on the exact service that needs it least.

So if retries aren't the answer, what is? Two patterns — and the trick is understanding that they solve different problems.

How a circuit breaker works

A circuit breaker works exactly like the one in your house. When current flows normally, the breaker is closed and requests pass through. When failures spike — too many errors, too many timeouts — the breaker trips open. Requests don't even attempt the call. They fail immediately.

That immediate failure is the whole point. Without a breaker, each request waits for a timeout — 30 seconds of a thread doing nothing. With the breaker open, you fail in milliseconds and your threads stay free.

The three states

Closed — everything normal, requests flow.
Open — failures crossed the threshold, every call fails fast.
Half-Open — after a cooldown, one test request is allowed through. Succeeds → back to Closed. Fails → back to Open.

Half-open is the recovery probe. Instead of flooding a recovering service with traffic the instant the cooldown ends, you send a single test request. Healthy again? Close the breaker and let traffic flow. Still failing? Re-open it and wait out another cooldown. This is what stops a recovering service from getting immediately re-buried.
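The three states fit in a small state machine. The following is a minimal JDK-only sketch, not Resilience4j's implementation: it trips on consecutive failures rather than a failure rate, and the class name, the injectable clock, and the fallback parameter are all assumptions made for illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Minimal circuit breaker sketch: CLOSED -> OPEN -> HALF_OPEN -> CLOSED or OPEN. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures that trip the breaker
    private final long cooldownMillis;    // time spent OPEN before one probe is allowed
    private final LongSupplier clock;     // injectable time source, in milliseconds
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;
    private boolean probeInFlight = false;

    SimpleCircuitBreaker(int failureThreshold, long cooldownMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
        this.clock = clock;
    }

    synchronized State state() { return state; }

    /** Runs the action if permitted; otherwise (or on failure) returns the fallback. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!tryAcquirePermission()) {
            return fallback.get();            // fail fast, the dependency is never called
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized boolean tryAcquirePermission() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= cooldownMillis) {
            state = State.HALF_OPEN;          // cooldown over: allow a single probe
            probeInFlight = false;
        }
        if (state == State.CLOSED) return true;
        if (state == State.HALF_OPEN && !probeInFlight) {
            probeInFlight = true;
            return true;
        }
        return false;                         // OPEN, or a probe is already running
    }

    private synchronized void onSuccess() {
        if (state == State.OPEN) return;      // late result from before the trip
        consecutiveFailures = 0;
        if (state == State.HALF_OPEN) {
            state = State.CLOSED;             // probe succeeded: resume normal traffic
            probeInFlight = false;
        }
    }

    private synchronized void onFailure() {
        if (state == State.OPEN) return;      // late result from before the trip
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;               // trip, or re-open after a failed probe
            openedAt = clock.getAsLong();
            consecutiveFailures = 0;
            probeInFlight = false;
        }
    }
}
```

The action runs outside the lock, so slow calls do not block other callers from checking the breaker; only the state bookkeeping is synchronized.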

The gap a circuit breaker leaves open

Here's the question that catches most people. The circuit breaker protects your system from a failing dependency — but what happens in those first few seconds before the breaker trips?

Between when Service B starts failing and when the breaker opens, requests are still going through. Threads are still getting stuck. If the breaker trips after five failures, what happens if a hundred requests arrive in that window?

The circuit breaker stopped the bleeding — but it didn't prevent the initial hemorrhage. One slow service consumed shared resources before the breaker could even react.

The breaker reacts to a pattern. Patterns take time to establish. During that time, damage spreads. You need something that contains resources from request one — not from request threshold.
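A small deterministic simulation makes the window concrete. This is a sketch under stated assumptions: every call hangs until a 30-second timeout and then counts as one failure, the breaker trips after five failures, and 10 requests arrive per second (the arrival rate and the `PreTripWindow` class are hypothetical).

```java
/** Second-by-second sketch of how many calls pile up before a breaker trips. */
final class PreTripWindow {

    /**
     * Returns the peak number of calls stuck in flight before the breaker opens.
     * Pass Integer.MAX_VALUE as concurrencyCap to model "no bulkhead".
     * All arguments are expected to be positive.
     */
    static int peakInFlight(int arrivalsPerSecond, int timeoutSeconds,
                            int failureThreshold, int concurrencyCap) {
        int[] started = new int[timeoutSeconds];   // calls started in each second (ring buffer)
        int inFlight = 0;
        int failures = 0;
        int peak = 0;
        for (int t = 0; ; t++) {
            int slot = t % timeoutSeconds;
            failures += started[slot];             // calls started timeoutSeconds ago time out now
            inFlight -= started[slot];
            started[slot] = 0;
            if (failures >= failureThreshold) {
                return peak;                       // breaker opens, no new calls are admitted
            }
            int admitted = Math.min(arrivalsPerSecond, concurrencyCap - inFlight);
            started[slot] = admitted;
            inFlight += admitted;
            peak = Math.max(peak, inFlight);
        }
    }
}
```

With no cap, 300 calls are stuck by the time the first timeouts are even recorded. With a cap of 20 concurrent calls, the peak stays at 20, no matter how long the breaker takes to notice.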

How a bulkhead works

Ships solved this problem centuries ago. A hull breach doesn't sink a ship because bulkheads — sealed walls — divide the hull into compartments. Water floods one section, the rest stay dry.

The bulkhead pattern does the same with your resources. Instead of one shared thread pool, you isolate resources per dependency.

Service B gets 20 threads. Service C gets 20. Service D gets 20. Now when B goes slow, it can only consume its own allocation — 20 threads, not 100. Services C and D keep running because they have their own isolated pools.

Requests to B beyond its allocation get rejected immediately. They don't even queue. The blast radius is contained to exactly one compartment — from the very first request, no threshold required.
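In its simplest form, a bulkhead is a counting semaphore that never blocks. The following is a minimal JDK-only sketch, not Resilience4j's implementation; `SimpleBulkhead` and its `onRejected` parameter are hypothetical names chosen for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Minimal semaphore bulkhead sketch: cap concurrent calls, reject the overflow. */
final class SimpleBulkhead {
    private final Semaphore permits;

    SimpleBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the action if a permit is free; otherwise returns onRejected immediately. */
    <T> T call(Supplier<T> action, Supplier<T> onRejected) {
        if (!permits.tryAcquire()) {          // non-blocking: never queue, never wait
            return onRejected.get();
        }
        try {
            return action.get();
        } finally {
            permits.release();                // always free the slot, even on exceptions
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

One instance per dependency gives each its own compartment: a slow Service B can hold at most its own permits, and callers over the limit get an immediate answer instead of a stuck thread.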

Two patterns, two different jobs

This is where engineers conflate the two. They are not interchangeable.

Dimension	Circuit Breaker	Bulkhead
Core question	When do I stop calling a failing service?	How much can a failure consume?
Mechanism	Trips open after failure threshold	Caps concurrent calls / dedicated pool
Kicks in	After a failure pattern is established	From request one, immediately
Protects against	Wasting effort on a known-bad service	Resource exhaustion / cascading starvation
Failure mode it misses	The pre-trip window	Doesn't know the service is "bad" — just full
Analogy	House electrical breaker	Ship's watertight compartments

Read that "Kicks in" row again — it's the heart of it. The breaker reacts to a pattern over time. The bulkhead acts on a limit, instantly. That's why you want both:

Without protection — one failure cascades everywhere.
With just a circuit breaker — the cascade stops, but there's a window where damage spreads.
With both — the failure is isolated from the start (bulkhead), and the breaker ensures you stop wasting effort once the pattern is clear.

The bulkhead contains the blast radius from the moment failure starts. The circuit breaker stops the bleeding once the pattern is obvious. Your system degrades gracefully instead of collapsing entirely.

Wiring both with Resilience4j

In practice, libraries like Resilience4j give you both. The configuration is simple — the thinking behind it is what matters.

Resilience4j ships two bulkhead flavors: a SemaphoreBulkhead that limits concurrent calls, and a ThreadPoolBulkhead that gives full thread-pool isolation with a bounded queue. Source: Resilience4j Bulkhead docs.

Semaphore bulkhead + circuit breaker (Spring Boot YAML)

resilience4j:
  circuitbreaker:
    instances:
      serviceB:
        failure-rate-threshold: 50        # % of failures that trips the breaker
        sliding-window-size: 10           # evaluate over the last 10 calls
        wait-duration-in-open-state: 10s  # cooldown before half-open probe
        permitted-number-of-calls-in-half-open-state: 1
  bulkhead:
    instances:
      serviceB:
        max-concurrent-calls: 20          # the compartment wall
        max-wait-duration: 0              # over the limit? reject instantly, don't queue

Breaking it down:

failure-rate-threshold — the percentage of failures in the sliding window that flips the breaker to Open.
sliding-window-size — how many recent calls the breaker evaluates.
wait-duration-in-open-state — the cooldown before the breaker moves to Half-Open; permitted-number-of-calls-in-half-open-state: 1 then limits the probe to a single call.
max-concurrent-calls — the bulkhead's hard limit. This is your watertight wall.
max-wait-duration: 0 — fail fast on saturation; never let excess requests queue and hold threads.
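To see how the threshold and the window size interact, here is a small JDK-only sketch of a count-based sliding window. It is an illustration, not Resilience4j's implementation; in particular, waiting for a full window before tripping is a simplifying assumption, and `FailureRateWindow` is a hypothetical name.

```java
/** Count-based sliding window sketch: failure rate over the last N recorded calls. */
final class FailureRateWindow {
    private final boolean[] failed;   // ring buffer of the most recent outcomes
    private int recorded = 0;         // outcomes stored so far, at most the window size
    private int next = 0;             // slot that the next outcome will overwrite
    private int failures = 0;         // failures currently inside the window

    FailureRateWindow(int slidingWindowSize) {
        this.failed = new boolean[slidingWindowSize];
    }

    void record(boolean callFailed) {
        if (recorded == failed.length) {
            if (failed[next]) failures--;     // evict the oldest outcome
        } else {
            recorded++;
        }
        failed[next] = callFailed;
        if (callFailed) failures++;
        next = (next + 1) % failed.length;
    }

    double failureRatePercent() {
        return recorded == 0 ? 0.0 : 100.0 * failures / recorded;
    }

    /** Trips only once the window is full, so one early failure cannot open the breaker. */
    boolean shouldOpen(double failureRateThreshold) {
        return recorded == failed.length && failureRatePercent() >= failureRateThreshold;
    }
}
```

With a window of 10 and a threshold of 50, four failures in the last ten calls leave the breaker closed; the fifth failure inside the window trips it.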

Thread-pool bulkhead (full isolation)

For complete isolation, the thread-pool bulkhead runs the call on its own dedicated pool:

resilience4j:
  thread-pool-bulkhead:
    instances:
      serviceB:
        core-thread-pool-size: 5
        max-thread-pool-size: 20          # the compartment, isolated from the gateway pool
        queue-capacity: 20                # bounded — past this, reject

Source: Resilience4j thread-pool bulkhead config.
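For intuition, a plain JDK `ThreadPoolExecutor` with a bounded queue behaves much like such a compartment. This is a sketch of the idea rather than how Resilience4j builds its pool; `IsolatedPools` and `forDependency` are hypothetical names.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** One dedicated, bounded pool per dependency. */
final class IsolatedPools {

    /** Builds a pool that rejects work once its threads and its queue are both full. */
    static ThreadPoolExecutor forDependency(int coreThreads, int maxThreads, int queueCapacity) {
        return new ThreadPoolExecutor(
            coreThreads,
            maxThreads,
            60L, TimeUnit.SECONDS,                      // idle threads above core are reclaimed
            new ArrayBlockingQueue<>(queueCapacity),    // bounded queue, never unbounded
            new ThreadPoolExecutor.AbortPolicy());      // saturated: throw RejectedExecutionException
    }
}
```

A call such as `IsolatedPools.forDependency(5, 20, 20)` mirrors the YAML values above. One JDK detail worth knowing: `ThreadPoolExecutor` only grows beyond its core size once the queue is full, so the queue fills before the extra threads appear.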

💡 Tip: Use SemaphoreBulkhead (the default) for most synchronous calls — it's lightweight and works across threading models. Reach for ThreadPoolBulkhead when you want the dependency running on a genuinely separate pool, fully decoupled from your serving threads.

Decorating a call with both

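import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.vavr.control.Try;
import java.util.function.Supplier;

// Order matters. The nesting mirrors Resilience4j's aspect order:
// the CircuitBreaker is the outer layer and wraps the Bulkhead.
// `bulkhead`, `circuitBreaker` and `serviceBClient` are assumed to be created elsewhere.
Supplier<String> decorated = CircuitBreaker.decorateSupplier(
    circuitBreaker,
    Bulkhead.decorateSupplier(bulkhead, () -> serviceBClient.call())
);

String result = Try.ofSupplier(decorated)
    .recover(throwable -> "fallback response")   // graceful degradation
    .get();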

Important: Resilience4j's aspect order is fixed — Retry wraps CircuitBreaker wraps RateLimiter wraps TimeLimiter wraps Bulkhead. Reordering annotations on a method won't change it; adjust aspect-order properties if you need something different. Source: Resilience4j Spring Boot guide.
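To see what "wraps" means for execution order, here is a small JDK-only sketch that composes plain `Supplier` wrappers in the order listed above and records which layer is entered first. It does not use Resilience4j at all; `WrapOrder` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Shows that the outermost wrapper is entered first and the wrapped call runs last. */
final class WrapOrder {

    /** Wraps inner so that name is logged just before inner runs. */
    static <T> Supplier<T> wrap(String name, List<String> log, Supplier<T> inner) {
        return () -> {
            log.add(name);
            return inner.get();
        };
    }

    /** Composes the layers in the stated aspect order and returns the order of entry. */
    static List<String> entryOrder() {
        List<String> log = new ArrayList<>();
        Supplier<String> call = () -> {
            log.add("call");
            return "ok";
        };
        Supplier<String> decorated =
            wrap("Retry", log,
                wrap("CircuitBreaker", log,
                    wrap("RateLimiter", log,
                        wrap("TimeLimiter", log,
                            wrap("Bulkhead", log, call)))));
        decorated.get();
        return log;
    }
}
```

The recorded order is Retry, CircuitBreaker, RateLimiter, TimeLimiter, Bulkhead, then the call itself. Because Retry is outermost, every retry attempt passes through the breaker and the bulkhead again.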

Production checklist

Size bulkheads per dependency, not globally — a slow non-critical service should never be allowed to claim more than its fair share of threads.
Set max-wait-duration low (or zero) — queuing on a saturated bulkhead just relocates the thread-starvation problem.
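Always pair a breaker with a timeout — without one, a "slow but not failing" call may never be recorded as a failure, so the breaker never trips. Resilience4j's slow-call thresholds help, but a call's duration is only measured once it returns. Slowness is the failure mode that bites hardest.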
Provide a fallback — fail-fast is only graceful if there's a sensible degraded response (cached data, default, or a clear error).
Tune the half-open probe to a single call — flooding a recovering service re-opens the breaker immediately.
Emit metrics on rejections and state transitions — a bulkhead silently rejecting calls or a breaker flapping Open/Closed is a signal, not noise.
Pair retries with a breaker, never naked — route every retry attempt through the breaker so an open circuit cuts a retry storm short.
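The timeout and fallback items can be sketched with plain JDK classes. This is a minimal illustration under stated assumptions, not Resilience4j's TimeLimiter: `TimeBoxedCall` is a hypothetical helper, and the caller supplies the executor, ideally a dedicated pool for that dependency.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Time-boxes a call and degrades to a fallback value instead of hanging. */
final class TimeBoxedCall {

    /** Runs action on executor; returns fallback if it fails or exceeds the timeout. */
    static <T> T callWithTimeout(Supplier<T> action, long timeoutMillis,
                                 T fallback, ExecutorService executor) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(action, executor);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);              // caller stops waiting; the worker is not interrupted
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                  // the action itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }
}
```

The caller's thread is freed at the timeout, but the worker thread stays busy until the underlying call returns. That is one more reason to run such calls on a bounded, per-dependency pool and to set timeouts on the client itself.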

When to use which

Circuit breaker — any call to a dependency that can fail or hang: downstream services, third-party APIs, databases under load. Especially anything where a 30s timeout would otherwise pin a thread.
Bulkhead — whenever multiple dependencies share a thread pool and you can't afford one of them to starve the others. Effectively: any multi-dependency gateway or aggregation service.
Both — the default for production microservices. The bulkhead handles the pre-trip window; the breaker handles the sustained failure.
Neither / just retries — only for genuinely transient, isolated, low-volume calls where degradation can't cascade.

Conclusion

I treat circuit breakers and bulkheads as table stakes for any service that fans out to dependencies — and I'm deliberate about the fact that they're solving two different problems, not one. The bulkhead is the wall that contains the fire from the first spark; the breaker is the alarm that tells everyone to stop walking into the burning room once it's clearly on fire.

The configuration in Resilience4j really is a handful of lines. The discipline is in the thinking: resilience isn't about preventing failure — in distributed systems, failure is guaranteed. It's about building walls so a fire in one room doesn't burn down the house. Start by adding a breaker with a timeout to your riskiest downstream call, then put a bulkhead around it, and watch a single slow dependency stop being an outage.

