The Saga Pattern: Distributed Transactions Without Locks

Series: System Design · Architecture Patterns — Pillar 7 of 8
The problem
A premium user upgrades their URL shortener plan. This involves:
1. Charging the user's payment method (Stripe)
2. Updating the user's subscription tier (User Service)
3. Increasing their link quota (Link Service)
4. Sending a confirmation email (Notification Service)
In a monolith with one database, this is one database transaction — if any step fails, the whole thing rolls back. Atomicity is free.
In a microservices system, each step touches a different service with its own database. No single transaction spans all four. If step 3 fails after steps 1 and 2 succeed, the user has been charged and their tier upgraded, but their quota was never increased. The system is inconsistent.
The naive solution, two-phase commit (2PC) across all four services, requires every participant to hold locks while it waits for the coordinator's decision. One slow service keeps locks held in all the others. If the coordinator crashes after participants have voted to commit, they stay blocked, still holding their locks, until it recovers. In a distributed system, 2PC costs both availability and performance.
The Saga pattern is the practical alternative.
The core idea
A saga is a sequence of local transactions, one per service. Each local transaction updates its service's own data and publishes an event or sends a command to trigger the next step. If a step fails, compensating transactions undo the changes made by the preceding successful steps.
There's no distributed lock. There's no coordinator holding all participants hostage. Each service acts on its local database only, and the saga achieves eventual consistency across services through a series of forward steps and, when necessary, compensating rollbacks.
The analogy: a travel booking with cancellation policies
You book a flight, hotel, and rental car for a trip. Each booking is independent — a separate transaction with a separate company. If your flight gets cancelled, you call the hotel and rental car company to cancel those separately. There's no single travel authority that atomically books all three.
If the hotel is sold out after you've booked the flight, you cancel the flight — invoking its cancellation policy (the compensating transaction). The system returns to a consistent state through explicit undo operations, not through a rollback that spans all three parties.
How the saga works
Forward flow (happy path)
Saga: UpgradeSubscription
  Step 1: BillingService.ChargePlan(user_id, plan_id)
    → Charges the credit card
    → Publishes: PaymentSucceeded
  Step 2: UserService.UpgradeTier(user_id, plan_id)
    → Updates subscription tier
    → Publishes: TierUpgraded
  Step 3: LinkService.IncreaseLinkQuota(user_id, plan_id)
    → Increases link quota
    → Publishes: QuotaIncreased
  Step 4: NotificationService.SendConfirmation(user_id, plan_id)
    → Sends email
    → Publishes: SagaCompleted
Each step only touches its own service. No distributed lock. No cross-service transaction.
Failure and compensation
If step 3 fails (Link Service is down):
Step 3 FAILS: LinkService.IncreaseLinkQuota → error
Compensating transactions (in reverse order):
  Compensate step 2: UserService.DowngradeTier(user_id, original_plan)
  Compensate step 1: BillingService.RefundCharge(user_id, charge_id)
Compensation is explicit business logic, not a technical rollback. A refund is not "undoing" a charge at the database level — it's executing the business operation of returning funds. Compensation may fail too, requiring retry and monitoring.
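The forward flow and reverse-order compensation can be sketched in a few lines of Python. This is a minimal in-process illustration, not a production saga engine: `SagaStep`, `run_saga`, `SagaFailed`, and the logging helpers are hypothetical names, and each "service call" is a plain function that appends to a list.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[dict], None]        # forward local transaction
    compensation: Callable[[dict], None]  # business-level undo


class SagaFailed(Exception):
    """Raised after a step failed and earlier steps were compensated."""


def run_saga(steps: List[SagaStep], ctx: dict) -> None:
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action(ctx)
        except Exception as exc:
            # Undo only the steps that actually succeeded, newest first.
            for done in reversed(completed):
                done.compensation(ctx)
            raise SagaFailed(f"{step.name} failed: {exc}") from exc
        completed.append(step)


log: List[str] = []


def record(message: str) -> Callable[[dict], None]:
    # Stand-in for a real service call: it only logs what happened.
    return lambda ctx: log.append(message)


def failing_quota(ctx: dict) -> None:
    raise RuntimeError("Link Service is down")


upgrade = [
    SagaStep("ChargePlan", record("charged"), record("refunded")),
    SagaStep("UpgradeTier", record("tier upgraded"), record("tier downgraded")),
    SagaStep("IncreaseLinkQuota", failing_quota, record("quota restored")),
    SagaStep("SendConfirmation", record("email sent"), record("nothing to undo")),
]

try:
    run_saga(upgrade, {"user_id": "u1", "plan_id": "pro"})
except SagaFailed:
    pass

# log is now ["charged", "tier upgraded", "tier downgraded", "refunded"]
```

The sketch assumes compensations themselves succeed. In a real system each compensation would be retried and monitored rather than called once.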
Choreography vs orchestration
Two implementation styles:
Choreography: services react to events published by previous steps. No central coordinator.
BillingService charges → publishes PaymentSucceeded
↓
UserService listens for PaymentSucceeded → upgrades tier → publishes TierUpgraded
↓
LinkService listens for TierUpgraded → increases quota → publishes QuotaIncreased
↓
NotificationService listens for QuotaIncreased → sends email
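Services are autonomous: each decides what to do based on the events it observes. Appending a step to the end of the flow means adding a new listener, and existing services don't change. Inserting a step mid-chain is different, because the downstream service must be re-pointed at the new step's event.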
Downside: the saga's overall flow is implicit, distributed across multiple services. Debugging "why did this saga fail?" requires correlating events across all participants.
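To make the implicit flow concrete, here is a minimal choreography sketch in Python. The `EventBus` class is a hypothetical in-memory stand-in for a real broker with synchronous delivery, and the handler functions are illustrative. Only the event names come from the flow described above.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class EventBus:
    """In-memory stand-in for a broker; delivery is synchronous."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.published: List[str] = []  # event types, in publish order

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.published.append(event_type)
        for handler in list(self._subscribers[event_type]):
            handler(payload)


bus = EventBus()
sent_emails: List[str] = []


def upgrade_tier(event: dict) -> None:
    # UserService: run the local transaction, then announce the result.
    bus.publish("TierUpgraded", event)


def increase_quota(event: dict) -> None:
    # LinkService: same shape, different local transaction.
    bus.publish("QuotaIncreased", event)


def send_confirmation(event: dict) -> None:
    # NotificationService: last step, nothing further to trigger here.
    sent_emails.append(event["user_id"])


# Each service only knows which event it listens for.
bus.subscribe("PaymentSucceeded", upgrade_tier)
bus.subscribe("TierUpgraded", increase_quota)
bus.subscribe("QuotaIncreased", send_confirmation)

# BillingService charges the card and publishes the first event.
bus.publish("PaymentSucceeded", {"user_id": "u1", "plan_id": "pro"})
```

Note that no single piece of code lists the four steps in order. The sequence only exists as the combination of three `subscribe` calls, which is exactly why tracing a failed saga means correlating events across services.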
Orchestration: a central coordinator (the saga orchestrator) explicitly tells each service what to do and handles failures.
Saga Orchestrator:
1. send ChargePlan to BillingService
2. on PaymentSucceeded: send UpgradeTier to UserService
3. on TierUpgraded: send IncreaseLinkQuota to LinkService
4. on QuotaIncreased: send SendConfirmation to NotificationService
5. on any failure: execute compensation steps in reverse
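The flow is explicit and traceable in one place, and compensations are easier to implement and test. Downside: the orchestrator is one more service to build, maintain, and scale, and it must persist saga state so that it can resume in-flight sagas after a crash.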
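A sketch of the orchestrator as a small state machine, assuming replies arrive one at a time for the step currently in flight. The class name, the `outbox` list standing in for a message sender, and the status strings are all hypothetical. The command names follow the steps listed above.

```python
from typing import Dict, List, Tuple

Command = Tuple[str, str]  # (command name, target service)

# Forward steps in execution order.
FORWARD: List[Command] = [
    ("ChargePlan", "BillingService"),
    ("UpgradeTier", "UserService"),
    ("IncreaseLinkQuota", "LinkService"),
    ("SendConfirmation", "NotificationService"),
]

# Compensating command for each forward command that needs one.
COMPENSATION: Dict[str, Command] = {
    "ChargePlan": ("RefundCharge", "BillingService"),
    "UpgradeTier": ("DowngradeTier", "UserService"),
}


class UpgradeOrchestrator:
    def __init__(self) -> None:
        self.step = 0                     # index of the step in flight
        self.status = "RUNNING"
        self.outbox: List[Command] = []   # commands that would be sent
        self._send(FORWARD[0])

    def _send(self, command: Command) -> None:
        self.outbox.append(command)

    def on_reply(self, succeeded: bool) -> None:
        """Handle the reply for the step currently in flight."""
        if self.status != "RUNNING":
            return
        if succeeded:
            self.step += 1
            if self.step == len(FORWARD):
                self.status = "COMPLETED"
            else:
                self._send(FORWARD[self.step])
            return
        # Failure: undo every completed step, newest first.
        self.status = "COMPENSATING"
        for name, _service in reversed(FORWARD[: self.step]):
            if name in COMPENSATION:
                self._send(COMPENSATION[name])


saga = UpgradeOrchestrator()
saga.on_reply(True)    # PaymentSucceeded
saga.on_reply(True)    # TierUpgraded
saga.on_reply(False)   # IncreaseLinkQuota failed
```

After these three replies, `saga.outbox` holds `ChargePlan`, `UpgradeTier`, `IncreaseLinkQuota`, `DowngradeTier`, `RefundCharge` in that order. The sketch stops at `COMPENSATING`; a fuller version would wait for each compensation reply before marking the saga finished.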
Handling failure in practice
Idempotency. Each saga step must be idempotent — if the same command is delivered twice (due to retry), the result is the same. Use idempotency keys: ChargePlan with a saga ID won't charge twice if the same saga ID is presented again.
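As a rough illustration of the idempotency-key idea, the sketch below deduplicates charges on the saga ID. `BillingService`, `charge_plan`, and the `ch_N` charge IDs are hypothetical, and the in-memory dictionary stands in for a database table.

```python
from typing import Dict


class BillingService:
    """Hypothetical billing service that deduplicates on saga ID."""

    def __init__(self) -> None:
        self._charges: Dict[str, str] = {}  # saga_id -> charge_id
        self.charge_count = 0               # number of real charges made

    def charge_plan(self, saga_id: str, user_id: str, plan_id: str) -> str:
        # A repeated saga_id returns the stored result instead of charging again.
        if saga_id in self._charges:
            return self._charges[saga_id]
        self.charge_count += 1              # the real charge would happen here
        charge_id = f"ch_{self.charge_count}"
        self._charges[saga_id] = charge_id
        return charge_id


billing = BillingService()
first = billing.charge_plan("saga-42", "u1", "pro")
retry = billing.charge_plan("saga-42", "u1", "pro")  # redelivered command
```

Both calls return the same charge ID and only one charge is made. With a real database, the lookup and the insert would need to be atomic, for example via a unique constraint on the saga ID, so that two concurrent deliveries cannot both pass the check.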
Compensation failures. A compensation step can fail too (the refund API is down). Compensations must be retried with backoff until they succeed, and if they fail repeatedly, they must alert for human intervention. Design for the worst case.
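One possible shape for "retry with backoff, then alert" is sketched below. The function name, its parameters, and the default delays are assumptions chosen for illustration. The `sleep` parameter is injectable so the example runs without actually waiting.

```python
import time
from typing import Callable, List, Optional


def compensate_with_retry(
    compensation: Callable[[], None],
    alert: Callable[[str], None],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the compensation succeeds, False after alerting."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            compensation()
            return True
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff: base_delay, then 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)
    # Retries exhausted: hand the problem to a human.
    alert(f"compensation failed after {max_attempts} attempts: {last_error}")
    return False


delays: List[float] = []
calls = {"n": 0}


def flaky_refund() -> None:
    # Fails twice, then succeeds, like an API recovering from an outage.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("refund API is down")


ok = compensate_with_retry(flaky_refund, alert=print, sleep=delays.append)
# ok is True and delays == [0.5, 1.0]
```

In practice the retry loop would usually live in a durable queue or scheduler rather than a blocking function, so that pending compensations survive a process restart.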
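Out-of-order events. In a choreographed saga, events may arrive out of order. Services must handle receiving a compensation event before the corresponding forward step, or a step's event before the previous step's completion is confirmed. The usual remedy is to keep per-saga state in each service: a compensation that arrives first is recorded, and the late forward step is then skipped, while idempotent handlers make duplicates harmless.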
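A sketch of a consumer that tolerates a compensation arriving before its forward step, assuming each event carries a saga ID. `UserServiceHandler`, the state labels, and the default `"free"` tier are hypothetical.

```python
from typing import Dict


class UserServiceHandler:
    """Hypothetical consumer that tolerates compensation arriving first."""

    def __init__(self) -> None:
        self.tier: Dict[str, str] = {}        # user_id -> current tier
        self.saga_state: Dict[str, str] = {}  # saga_id -> APPLIED | CANCELLED
        self.previous: Dict[str, str] = {}    # saga_id -> tier before upgrade

    def on_upgrade(self, saga_id: str, user_id: str, plan_id: str) -> None:
        # Skip duplicates and upgrades whose saga was already cancelled.
        if saga_id in self.saga_state:
            return
        self.previous[saga_id] = self.tier.get(user_id, "free")
        self.tier[user_id] = plan_id
        self.saga_state[saga_id] = "APPLIED"

    def on_compensate(self, saga_id: str, user_id: str) -> None:
        if self.saga_state.get(saga_id) == "APPLIED":
            self.tier[user_id] = self.previous[saga_id]
        # Record the cancellation even if the upgrade has not arrived yet,
        # so that a late upgrade is ignored.
        self.saga_state[saga_id] = "CANCELLED"


handler = UserServiceHandler()
handler.on_compensate("saga-1", "u1")      # compensation arrives first
handler.on_upgrade("saga-1", "u1", "pro")  # late forward step is skipped
```

After both calls the user has no upgraded tier, because the cancellation marker was stored before the upgrade arrived. The same per-saga record also makes duplicate deliveries of either event harmless.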
Observability. Track saga state explicitly — which step is each saga instance at? Which have failed? A saga store (a table tracking saga_id, current_step, status) enables dashboards and alerting.
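A saga store along these lines might look like the following sketch, which uses Python's built-in `sqlite3` module with an in-memory database. The `saga_id`, `current_step`, and `status` columns come from the description above. The `updated_at` column, the status values, and the helper function names are assumptions. The upsert syntax requires SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE saga (
        saga_id      TEXT PRIMARY KEY,
        current_step TEXT NOT NULL,
        status       TEXT NOT NULL,  -- e.g. RUNNING, COMPLETED, COMPENSATING
        updated_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
    """
)


def record_step(saga_id: str, current_step: str, status: str) -> None:
    # Upsert so that replaying the same transition is harmless.
    with conn:
        conn.execute(
            """
            INSERT INTO saga (saga_id, current_step, status)
            VALUES (?, ?, ?)
            ON CONFLICT(saga_id) DO UPDATE SET
                current_step = excluded.current_step,
                status       = excluded.status,
                updated_at   = CURRENT_TIMESTAMP
            """,
            (saga_id, current_step, status),
        )


def sagas_with_status(status: str) -> list:
    # The kind of query a dashboard or alert rule would run.
    rows = conn.execute(
        "SELECT saga_id, current_step FROM saga WHERE status = ? ORDER BY saga_id",
        (status,),
    )
    return rows.fetchall()


record_step("saga-1", "ChargePlan", "RUNNING")
record_step("saga-1", "IncreaseLinkQuota", "COMPENSATING")
record_step("saga-2", "UpgradeTier", "RUNNING")

stuck = sagas_with_status("COMPENSATING")
# stuck == [("saga-1", "IncreaseLinkQuota")]
```

An alert on sagas that stay in a non-terminal status for too long, using `updated_at`, is a natural next step once this table exists.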
Tradeoffs
Eventual consistency vs atomicity. A saga does not provide ACID guarantees across services. Between step 1 (charge) and step 3 (quota increase), the system is in an intermediate state. This window may be visible to users (they were charged but their quota hasn't changed yet). Where users can notice it, say so in the UI, for example by showing the upgrade as pending until the saga completes.
Complexity of compensations. Every forward step that can be followed by a failure needs a compensating step, and each one must be designed, tested, and maintained. Steps that cannot be undone, such as a sent email, belong at the end of the saga. For sagas with many steps, the compensation logic can be as complex as the forward logic.
Choreography vs orchestration tradeoff. Choreography is more decoupled and extensible; orchestration is more observable and maintainable. Orchestration is usually the better choice for complex business workflows.
The one thing to remember
A saga replaces a distributed ACID transaction with a sequence of local transactions and compensating actions. Each step is atomic within its own service; cross-service consistency is achieved eventually through the forward chain or, on failure, through explicit compensations in reverse. This is not free — it requires idempotent steps, compensation logic for every forward operation, and explicit saga state tracking. But it avoids the availability and performance problems of distributed locks that make 2PC impractical in microservices.