Series: System Design · Architecture Patterns — Pillar 7 of 8

Systems Design

#	Post	What it covers
00	Architecture Patterns: How Systems Are Structured	Twenty patterns covering monoliths, microservices, events, resilience, deployment, and data processing. How to structure systems that survive growth.
01	Monolithic Architecture: The Default That Gets Abandoned Too Early	Monoliths are fast to build and easy to operate. Learn when they're the right choice, when they break down, and how to know the difference.
02	Microservices: The Architecture You Earn, Not Choose	Microservices enable independent scaling and team autonomy — but at significant cost. Learn what you actually get, what you pay, and when it's worth it.
03	Serverless: Pay for What You Use, Not What You Provision	Serverless scales to zero and charges per invocation. Learn where it shines, where it fails, and how to design around cold starts and vendor lock-in.
04	Event-Driven Architecture: Decoupling Through Events	Event-driven systems communicate via events rather than direct calls. Learn how producers, consumers, and event brokers work — and the consistency tradeoffs involved.
05	Message Queues: Decoupling Produce from Consume	Message queues decouple producers and consumers, enable load levelling, and provide durability. Learn how they work and when to use Kafka vs SQS vs RabbitMQ.
06	Pub/Sub: Broadcasting Events to Multiple Consumers	Pub/sub decouples publishers from subscribers through topics. Learn how it differs from message queues and when to use Kafka, SNS, or Google Pub/Sub.
07	CQRS: When Reads and Writes Need Different Models	CQRS separates writes from reads so each can be optimised independently. Learn how it works, when it's worth the complexity, and when it isn't.
08	Event Sourcing: The Ledger, Not the Balance	Event sourcing stores state as a sequence of events. Learn how it works, what you get (audit log, time travel), and what it costs (complexity, schema evolution).
09	The Saga Pattern: Distributed Transactions Without Locks	The Saga pattern manages distributed transactions across services using compensating transactions. Learn choreography vs orchestration and when to use each.
10	The Outbox Pattern: Atomic Writes and Event Publishing	The Outbox pattern solves the dual-write problem — publishing an event and writing to a database atomically. Learn how it works using CDC or polling.
11	The Circuit Breaker: Stopping Cascading Failures	Circuit breakers prevent cascading failures by fast-failing calls to unhealthy dependencies. Learn the three states, how to configure them, and where to apply them.
12	The Bulkhead Pattern: Containing Failures Through Resource Isolation ← you are here	Bulkheads isolate thread pools and connections per dependency so one failure can't exhaust resources needed by others. Learn how to apply them in practice.
13	The Sidecar Pattern: Cross-Cutting Concerns Without Code Changes	The sidecar pattern deploys a helper process alongside each service for logging, metrics, TLS, and service discovery — without modifying the service itself.
14	Service Mesh: A Programmable Network for Microservices	A service mesh handles service-to-service traffic, mTLS, circuit breaking, and observability via a fleet of sidecar proxies. Learn how it works and when to use it.
15	Service Discovery: Finding Services in a Dynamic Environment	Service discovery lets services find each other in dynamic environments. Learn client-side vs server-side discovery, health checks, and DNS vs registry approaches.
16	The Strangler Fig: Replacing a Legacy System Without Burning It Down	The Strangler Fig replaces a legacy system incrementally by routing specific functionality to new implementations while the old system keeps running.
17	Backend for Frontend: One API Per Client Type	BFF creates dedicated API backends per client type. Learn why one general API struggles to serve mobile and web well, and how BFF solves it.
18	ETL Pipelines: Moving Data from Operations to Analytics	ETL moves data from operational systems into analytical stores. Learn how pipelines work, what ELT is, and how to design reliable data movement at scale.
19	Batch vs Stream Processing: How Fresh Do Your Answers Need to Be?	Batch processing accumulates data and then processes it in bulk; streaming processes each event as it arrives. Learn the tradeoffs and when each is right.
20	MapReduce: Processing Petabytes in Parallel	MapReduce processes massive datasets in parallel by splitting work into map and reduce phases. Learn how it works and why Spark has largely replaced it.
21	Architecture Patterns: Wrap-Up	A recap of all 20 architecture patterns across decomposition, async communication, data patterns, resilience, and data processing. How they connect.

The Bulkhead Pattern: Containing Failures Through Resource Isolation

The problem

Your URL shortener's API service calls three downstream services: Analytics, User, and Link. All three share the same thread pool — the default pool with 200 threads.

The Analytics service starts responding slowly (100ms instead of 2ms). Threads pile up waiting for analytics responses, and with enough traffic all 200 are blocked. When a request arrives that needs only the User or Link service, and has nothing to do with analytics, there is no thread left to handle it. It times out.

The Analytics service's degradation has starved User and Link functionality of resources, even though neither depends on analytics.
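How much traffic counts as "enough" follows from Little's Law: average concurrency equals arrival rate times latency. The sketch below is only an illustration. The 2,000 requests per second figure and the helper name `threads_in_use` are assumptions, not part of the scenario above.

```python
def threads_in_use(requests_per_second: int, latency_ms: int) -> float:
    """Little's Law: average in-flight calls = arrival rate x latency."""
    return requests_per_second * latency_ms / 1000


POOL_SIZE = 200        # shared pool size from the scenario
ANALYTICS_RPS = 2_000  # assumed rate of requests that call Analytics

healthy = threads_in_use(ANALYTICS_RPS, 2)     # 4.0 threads busy at 2ms
degraded = threads_in_use(ANALYTICS_RPS, 100)  # 200.0 threads busy at 100ms
pool_exhausted = degraded >= POOL_SIZE         # True: nothing left for User or Link
```

At the assumed rate, a 50x latency increase turns 4 busy threads into 200, which is the whole shared pool.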

The core idea

The Bulkhead pattern isolates resource pools (threads, connections, semaphores) per downstream dependency. If Analytics has its own pool of 20 threads, a slow analytics response can only consume those 20 threads — User and Link operations have their own separate pools and are unaffected.

The failure is contained to the compartment (bulkhead) around that dependency.

The analogy: watertight compartments in a ship

A ship without internal bulkheads: a hole in the hull floods the entire vessel. A ship with watertight compartments: a hull breach floods one compartment, the rest of the ship stays afloat. The damage is localised.

Without bulkheads in software: a slow dependency floods the shared thread pool. With bulkheads: a slow dependency only fills its dedicated pool — other dependencies keep working.

How it works

Thread pool isolation

Each downstream dependency gets a dedicated thread pool. Calls to that dependency run on its pool. If the pool is saturated (all threads busy), new calls fail immediately — they don't spill into other pools.

API Server thread pools:

Analytics pool (20 threads):
  ← analytics calls
  ← if all 20 busy: reject new analytics calls immediately

User pool (30 threads):
  ← user calls
  ← unaffected by analytics saturation

Link pool (50 threads):
  ← link calls
  ← unaffected by analytics saturation

With thread pool isolation, analytics saturation affects only analytics-dependent operations. User and Link operations continue normally.
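A minimal sketch of per-dependency thread pools, using Python's standard `concurrent.futures.ThreadPoolExecutor`. The class and exception names are hypothetical. One detail to be aware of: `ThreadPoolExecutor` queues excess work without bound, so the sketch pairs each pool with a semaphore to get the "reject immediately" behaviour described above.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class BulkheadFullError(Exception):
    """Raised when every worker in a dependency's pool is busy."""


class ThreadPoolBulkhead:
    def __init__(self, name: str, max_workers: int) -> None:
        self.name = name
        self._executor = ThreadPoolExecutor(
            max_workers=max_workers, thread_name_prefix=name
        )
        # One slot per worker; the executor alone would queue instead of rejecting.
        self._slots = threading.BoundedSemaphore(max_workers)

    def submit(self, fn, *args) -> Future:
        # Non-blocking acquire: fail fast rather than wait for a free worker.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} pool is saturated")
        try:
            future = self._executor.submit(fn, *args)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot when the call finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _f: self._slots.release())
        return future


# Pool sizes taken from the diagram above.
analytics_pool = ThreadPoolBulkhead("analytics", 20)
user_pool = ThreadPoolBulkhead("user", 30)
link_pool = ThreadPoolBulkhead("link", 50)
```

A caller would catch `BulkheadFullError` around `analytics_pool.submit(...)` and degrade (for example, skip recording the click) while user and link calls continue on their own pools.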

Semaphore isolation

A lighter-weight alternative to separate thread pools. Instead of dedicating threads, a semaphore limits the number of concurrent calls to a dependency. No new thread is created — the calling thread is used directly, but the semaphore prevents more than N concurrent calls.

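import asyncio

analytics_semaphore = asyncio.Semaphore(20)

async def record_click(event):
    # asyncio.Semaphore waits for a free slot by default; it never raises.
    # To reject immediately when 20 calls are in flight, check locked() first.
    if analytics_semaphore.locked():
        raise RuntimeError("analytics bulkhead full")
    async with analytics_semaphore:
        await analytics_service.record(event)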

Semaphore isolation has lower overhead (no thread pool management), but the call still runs on the caller's own thread, so it offers weaker protection than thread pool isolation: a call that hangs keeps that thread occupied until a timeout fires.
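To see the limit in action, here is a self-contained sketch with a stand-in for the analytics call. `AsyncBulkhead`, `BulkheadFullError`, `slow_record`, and `demo` are hypothetical names. Because `asyncio.Semaphore` waits rather than raising when it is exhausted, the sketch checks `locked()` before acquiring so that excess calls are rejected instead of queued.

```python
import asyncio


class BulkheadFullError(Exception):
    """Raised when the concurrency limit has been reached."""


class AsyncBulkhead:
    def __init__(self, max_concurrent: int) -> None:
        self._sem = asyncio.Semaphore(max_concurrent)

    async def run(self, coro_fn, *args):
        # locked() is True when no slot is free; reject instead of waiting.
        if self._sem.locked():
            raise BulkheadFullError("bulkhead full")
        async with self._sem:
            return await coro_fn(*args)


async def slow_record(event: int) -> int:
    # Stand-in for a degraded analytics call.
    await asyncio.sleep(0.05)
    return event


async def demo(calls: int = 25, limit: int = 20) -> tuple:
    bulkhead = AsyncBulkhead(limit)
    results = await asyncio.gather(
        *(bulkhead.run(slow_record, i) for i in range(calls)),
        return_exceptions=True,
    )
    rejected = sum(isinstance(r, BulkheadFullError) for r in results)
    return calls - rejected, rejected  # (accepted, rejected)
```

Running `asyncio.run(demo())` with 25 concurrent calls and a limit of 20 accepts the first 20 and rejects the remaining 5.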

Connection pool isolation

Beyond thread pools, databases and external services have connection pools. If your application has a single database connection pool of 50 connections shared across all queries, a slow analytics query can occupy all 50 connections.

Without bulkhead: one connection pool for everything
  All 50 connections busy with slow analytics queries
  → User lookups and link writes queue behind them

With bulkhead: separate pools
  analytics_pool: max 10 connections
  users_pool: max 20 connections
  links_pool: max 20 connections
  → Analytics pool saturates; users and links unaffected
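The separate pools above can be sketched with a small fixed-size pool built on the standard library. This is an illustration, not a production pool: `ConnectionPool`, `PoolExhaustedError`, and the in-memory SQLite `connect` helper are all assumptions made for the example. Most real database drivers and ORMs ship their own pool with a configurable maximum size.

```python
import queue
import sqlite3
from contextlib import contextmanager


class PoolExhaustedError(Exception):
    """Raised when no connection frees up within the timeout."""


class ConnectionPool:
    def __init__(self, name: str, size: int, connect) -> None:
        self.name = name
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout: float = 0.1):
        try:
            conn = self._idle.get(timeout=timeout)
        except queue.Empty:
            raise PoolExhaustedError(f"{self.name} pool exhausted") from None
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the query failed.
            self._idle.put(conn)


def connect():
    # In-memory SQLite stands in for a real database connection.
    return sqlite3.connect(":memory:", check_same_thread=False)


# Pool sizes taken from the listing above.
analytics_pool = ConnectionPool("analytics", 10, connect)
users_pool = ConnectionPool("users", 20, connect)
links_pool = ConnectionPool("links", 20, connect)
```

A slow analytics query can hold at most 10 connections here. A user lookup draws from `users_pool` and never waits behind it.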

Combining with circuit breakers

Bulkheads and circuit breakers are complementary:

Bulkhead: limits the number of concurrent calls allowed to a dependency. Prevents resource exhaustion.
Circuit breaker: stops calls to a dependency that is already failing, for a cool-down period. Prevents waiting for timeouts.

A bulkhead alone doesn't stop a slow dependency from filling its own pool (those threads block until timeout), so every call that needs it still waits or is rejected. A circuit breaker alone doesn't prevent a fast-failing dependency from using too many threads (each call fails quickly, but many can be in flight).

Together: the bulkhead caps concurrent calls; the circuit breaker trips when failures are too frequent, fast-failing subsequent attempts.

Incoming analytics call
  → Bulkhead: is analytics pool full? Yes → reject immediately
  → Circuit breaker: is analytics circuit open? Yes → reject immediately
  → Otherwise: make the actual call
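The check order above can be sketched as a small wrapper. All names here are hypothetical, and the circuit breaker is deliberately simplified: it opens after N consecutive failures and, once the cool-down has passed, lets calls through again without limiting how many trial calls run at once.

```python
import threading
import time


class CallRejectedError(Exception):
    """Raised when the bulkhead is full or the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed
        self._lock = threading.Lock()

    def allow(self) -> bool:
        with self._lock:
            if self._opened_at is None:
                return True
            # After the cool-down, permit trial calls (a simplified half-open state).
            return self._clock() - self._opened_at >= self._reset_after

    def record_success(self) -> None:
        with self._lock:
            self._failures = 0
            self._opened_at = None

    def record_failure(self) -> None:
        with self._lock:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()


class GuardedDependency:
    def __init__(self, max_concurrent: int, breaker: CircuitBreaker) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._breaker = breaker

    def call(self, fn, *args):
        # 1. Bulkhead: reject if the concurrency limit is reached.
        if not self._slots.acquire(blocking=False):
            raise CallRejectedError("bulkhead full")
        try:
            # 2. Circuit breaker: reject if the dependency is known to be failing.
            if not self._breaker.allow():
                raise CallRejectedError("circuit open")
            # 3. Otherwise make the actual call and record the outcome.
            try:
                result = fn(*args)
            except Exception:
                self._breaker.record_failure()
                raise
            self._breaker.record_success()
            return result
        finally:
            self._slots.release()
```

A caller would wrap the analytics client as something like `GuardedDependency(20, CircuitBreaker())` and treat `CallRejectedError` as a signal to skip or defer the analytics work.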

Tradeoffs

Resource allocation. Dedicated pools mean each dependency has guaranteed capacity — but also that idle capacity in one pool can't be borrowed by another. A misconfigured pool that's too small drops legitimate requests; too large wastes memory.

Complexity. Managing multiple pools — sizing them correctly, monitoring them separately, alerting on per-pool saturation — adds operational complexity. Worth it in microservices where dependency failures are common; overkill for a monolith with a single database.

Partial availability vs complete failure. With bulkheads, a failing dependency causes partial degradation (only analytics-related features are affected). Without them, a failing dependency can cause complete service failure. The tradeoff is operational complexity for improved availability.

When to apply

Apply bulkheads when:

Your service calls multiple downstream dependencies
You've experienced (or are concerned about) cascade failures where one slow dependency brings down others
Different downstream dependencies have different criticality — user auth must work even if analytics is down

Can skip when:

Your service has only one downstream dependency
You're running in a single-threaded async environment where threads aren't the scarce resource (a semaphore or connection limit can still be worth having there)
Dependencies are all equally critical and share the same SLA

The one thing to remember

A bulkhead contains failure by giving each dependency its own resource pool. A dependency that goes slow or stops responding can only consume its allocated resources; other dependencies are unaffected. Used with circuit breakers (which stop calls to a dependency that keeps failing) and timeouts (which bound how long any single call can block), bulkheads form the core of a resilient service that degrades gracefully rather than failing completely when any single downstream dependency misbehaves.

← Previous: Circuit Breaker — when a downstream service starts failing, the circuit breaker prevents a cascade by fast-failing calls rather than waiting for timeouts.

→ Next: Sidecar — deploying a helper process alongside each service to handle cross-cutting concerns like logging, metrics, and service discovery.
