Event-Driven Architecture: Decoupling Through Events

Series: System Design · Architecture Patterns — Pillar 7 of 8
| # | Post | What it covers |
|---|---|---|
| 00 | Architecture Patterns: How Systems Are Structured | Twenty patterns covering monoliths, microservices, events, resilience, deployment, and data processing. How to structure systems that survive growth. |
| 01 | Monolithic Architecture: The Default That Gets Abandoned Too Early | Monoliths are fast to build and easy to operate. Learn when they're the right choice, when they break down, and how to know the difference. |
| 02 | Microservices: The Architecture You Earn, Not Choose | Microservices enable independent scaling and team autonomy — but at significant cost. Learn what you actually get, what you pay, and when it's worth it. |
| 03 | Serverless: Pay for What You Use, Not What You Provision | Serverless scales to zero and charges per invocation. Learn where it shines, where it fails, and how to design around cold starts and vendor lock-in. |
| 04 | Event-Driven Architecture: Decoupling Through Events ← you are here | Event-driven systems communicate via events rather than direct calls. Learn how producers, consumers, and event brokers work — and the consistency tradeoffs involved. |
| 05 | Message Queues: Decoupling Produce from Consume | Message queues decouple producers and consumers, enable load levelling, and provide durability. Learn how they work and when to use Kafka vs SQS vs RabbitMQ. |
| 06 | Pub/Sub: Broadcasting Events to Multiple Consumers | Pub/sub decouples publishers from subscribers through topics. Learn how it differs from message queues and when to use Kafka, SNS, or Google Pub/Sub. |
| 07 | CQRS: When Reads and Writes Need Different Models | CQRS separates writes from reads so each can be optimised independently. Learn how it works, when it's worth the complexity, and when it isn't. |
| 08 | Event Sourcing: The Ledger, Not the Balance | Event sourcing stores state as a sequence of events. Learn how it works, what you get (audit log, time travel), and what it costs (complexity, schema evolution). |
| 09 | The Saga Pattern: Distributed Transactions Without Locks | The Saga pattern manages distributed transactions across services using compensating transactions. Learn choreography vs orchestration and when to use each. |
| 10 | The Outbox Pattern: Atomic Writes and Event Publishing | The Outbox pattern solves the dual-write problem — publishing an event and writing to a database atomically. Learn how it works using CDC or polling. |
| 11 | The Circuit Breaker: Stopping Cascading Failures | Circuit breakers prevent cascading failures by fast-failing calls to unhealthy dependencies. Learn the three states, how to configure them, and where to apply them. |
| 12 | The Bulkhead Pattern: Containing Failures Through Resource Isolation | Bulkheads isolate thread pools and connections per dependency so one failure can't exhaust resources needed by others. Learn how to apply them in practice. |
| 13 | The Sidecar Pattern: Cross-Cutting Concerns Without Code Changes | The sidecar pattern deploys a helper process alongside each service for logging, metrics, TLS, and service discovery — without modifying the service itself. |
| 14 | Service Mesh: A Programmable Network for Microservices | A service mesh handles service-to-service traffic, mTLS, circuit breaking, and observability via a fleet of sidecar proxies. Learn how it works and when to use it. |
| 15 | Service Discovery: Finding Services in a Dynamic Environment | Service discovery lets services find each other in dynamic environments. Learn client-side vs server-side discovery, health checks, and DNS vs registry approaches. |
| 16 | The Strangler Fig: Replacing a Legacy System Without Burning It Down | The Strangler Fig replaces a legacy system incrementally by routing specific functionality to new implementations while the old system keeps running. |
| 17 | Backend for Frontend: One API Per Client Type | BFF creates dedicated API backends per client type. Learn why one general API struggles to serve mobile and web well, and how BFF solves it. |
| 18 | ETL Pipelines: Moving Data from Operations to Analytics | ETL moves data from operational systems into analytical stores. Learn how pipelines work, what ELT is, and how to design reliable data movement at scale. |
| 19 | Batch vs Stream Processing: How Fresh Do Your Answers Need to Be? | Batch processing accumulates data and then processes it in bulk; stream processing handles each event as it arrives. Learn the tradeoffs and when each is right. |
| 20 | MapReduce: Processing Petabytes in Parallel | MapReduce processes massive datasets in parallel by splitting work into map and reduce phases. Learn how it works and why Spark has largely replaced it. |
| 21 | Architecture Patterns: Wrap-Up | A recap of all 20 architecture patterns across decomposition, async communication, data patterns, resilience, and data processing. How they connect. |
The problem
When a new short link is created in your URL shortener, several things must happen:
- The link is stored in the database
- An analytics record is initialised
- A QR code is generated
- The user's team link count is incremented for billing
- A webhook is fired to any registered endpoints
- The new link appears in the team's real-time dashboard feed
In a synchronous architecture, the link creation API call does all of this in sequence. The user waits while the system performs all six operations, some fast (the database write) and some slow (QR code generation, webhook delivery). If the webhook delivery times out, does the whole link creation fail? If the analytics service is down, should users be blocked from creating links?
The operations are logically connected but operationally independent. The link creation shouldn't depend on the webhook delivery succeeding. The QR code generation shouldn't block the user's response. The analytics initialisation shouldn't make link creation fragile.
Decoupling these operations is what event-driven architecture does.
The core idea
In an event-driven architecture, services communicate by publishing events to a shared event bus or broker rather than calling each other directly. A service that does something interesting publishes an event describing what happened. Other services subscribe to events they care about and react asynchronously.
The producer doesn't know who's listening. The consumer doesn't care who published the event. They're coupled only by the event's schema — not by each other's availability or implementation.
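A minimal sketch of that contract, assuming a toy in-process `EventBus` class (a hypothetical stand-in, not a real broker client). It delivers synchronously and keeps nothing durable, so it illustrates only the shape of the coupling: the publisher names a topic and an event, never a consumer.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """Toy in-process stand-in for a broker: maps topic names to subscriber callbacks."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Hand the event to every subscriber of the topic; return how many were notified."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = EventBus()
analytics_records: list[str] = []
qr_jobs: list[str] = []

# Two consumers register interest; the producer code below never mentions them.
bus.subscribe("link.events", lambda e: analytics_records.append(e["id"]))
bus.subscribe("link.events", lambda e: qr_jobs.append(e["id"]))

delivered = bus.publish("link.events", {"type": "LinkCreated", "id": "x7Kp2"})
```

Adding a third consumer is one more `subscribe` call; the `publish` line stays unchanged.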
The analogy: a news agency and its subscribers
A news agency (producer) writes and publishes articles. It doesn't know who reads them — subscribers include newspapers, radio stations, blogs, and individual readers. Each subscriber processes articles on their own schedule, in their own way.
If a newspaper subscriber is closed for the day, the news agency doesn't stop publishing. The newspaper gets the articles when it reopens. The news agency's work is not blocked by any subscriber's availability.
Adding a new subscriber (a new media outlet) doesn't require the news agency to change anything — the outlet simply subscribes to the feed.
This is the decoupling that event-driven architecture provides: producers publish facts about what happened. Consumers react when they're ready.
How it works
The three roles
Producer: a service that publishes an event when something happens. Events describe facts: "LinkCreated", "UserSignedUp", "PaymentProcessed". They're named in past tense — they represent something that already happened, not a command for someone to do something.
Event broker: the infrastructure that receives events from producers, stores them durably, and delivers them to consumers. Examples: Apache Kafka, AWS SNS/SQS, Google Pub/Sub, RabbitMQ, AWS EventBridge.
Consumer: a service that subscribes to events and processes them asynchronously.
Link Service publishes:
    Event: LinkCreated { id: "x7Kp2", user_id: 123, url: "...", created_at: ... }
    → to Kafka topic: link.events

Analytics Service subscribes to link.events:
    Receives LinkCreated → initialises analytics record

QR Service subscribes to link.events:
    Receives LinkCreated → generates QR code, stores in S3

Billing Service subscribes to link.events:
    Receives LinkCreated → increments monthly link count

Webhook Service subscribes to link.events:
    Receives LinkCreated → fires registered webhooks

Dashboard Service subscribes to link.events:
    Receives LinkCreated → pushes to connected WebSocket clients
Beyond its own database write, the Link Service makes one network call (the publish to Kafka). The five consumers react independently. The user's response is returned after the database write and the Kafka publish, typically under 10ms, regardless of how long QR generation or webhook delivery takes.
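The timing behaviour can be sketched with the standard library alone. Here a `queue.Queue` stands in for the broker topic and a single thread stands in for the QR Service; the dictionaries, the `create_link` handler and the 50 ms sleep are illustrative assumptions. A plain queue hands each message to one consumer only, so this models one of the five subscribers rather than the full fan-out.

```python
import queue
import threading
import time

link_events: queue.Queue = queue.Queue()  # stand-in for the link.events topic
links_db: dict[str, str] = {}             # stand-in for the links table
qr_codes: dict[str, str] = {}             # stand-in for QR code storage


def create_link(link_id: str, url: str) -> dict:
    """Request handler: store the link, publish one event, return immediately."""
    links_db[link_id] = url
    link_events.put({"type": "LinkCreated", "id": link_id, "url": url})
    return {"id": link_id, "status": "created"}


def qr_worker() -> None:
    """Consumer: runs on its own thread and processes events at its own pace."""
    while True:
        event = link_events.get()
        if event is None:  # sentinel used only to stop this demo
            break
        time.sleep(0.05)   # simulate slow QR rendering
        qr_codes[event["id"]] = f"qr-for-{event['url']}"


worker = threading.Thread(target=qr_worker)
worker.start()

# The handler returns without waiting for the QR code to exist.
response = create_link("x7Kp2", "https://example.com/long")

link_events.put(None)  # ask the worker to stop once it has drained the queue
worker.join()
```

The response is built before the worker has finished, which is the same property the Kafka-based design gives the user-facing request.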
Event schema and versioning
Events are contracts between producers and consumers. Changing an event's schema is a breaking change for any consumer that depends on the changed fields.
Best practices:
- Additive changes only (backward compatible): add new fields, never remove or rename existing ones. Consumers that don't know about a new field ignore it.
- Version the event type: LinkCreated.v1, LinkCreated.v2. Consumers can handle both versions during a migration window.
- Schema registry: tools like Confluent Schema Registry (for Kafka/Avro) enforce schema compatibility at publish time.
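One way a consumer might follow the first two practices is sketched below. The `team_id` field added in v2, the `type` key and the `parse_link_created` helper are all hypothetical; the point is that the consumer reads only the fields it knows and maps both versions onto one internal shape.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkCreated:
    """The consumer's internal view of the event, independent of the wire version."""
    link_id: str
    url: str
    team_id: Optional[int] = None  # hypothetical field introduced in v2


def parse_link_created(payload: dict) -> LinkCreated:
    """Accept both versions; read known fields only and ignore anything else."""
    event_type = payload["type"]
    if event_type == "LinkCreated.v1":
        return LinkCreated(link_id=payload["id"], url=payload["url"])
    if event_type == "LinkCreated.v2":
        return LinkCreated(
            link_id=payload["id"],
            url=payload["url"],
            team_id=payload.get("team_id"),
        )
    raise ValueError(f"unsupported event type: {event_type}")


v1 = parse_link_created({"type": "LinkCreated.v1", "id": "x7Kp2", "url": "https://example.com"})
v2 = parse_link_created(
    {
        "type": "LinkCreated.v2",
        "id": "x7Kp2",
        "url": "https://example.com",
        "team_id": 42,
        "campaign": "spring",  # unknown to this consumer, silently ignored
    }
)
```

Rejecting unknown versions loudly, rather than guessing, keeps a producer-side mistake from turning into silently corrupted consumer state.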
Event ordering
Kafka assigns each event to a partition based on its key (e.g., link_id). Events with the same key land in the same partition and are delivered in the order they were written. Events with different keys may be processed out of order relative to each other.
If the Analytics Service needs to process LinkCreated before LinkClicked, publish both events to the same topic with link_id as the key. For any given link, they'll arrive in order.
Global ordering (all events, across all partitions) is expensive and usually unnecessary.
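The idea behind key-based partitioning can be sketched as a stable hash of the key modulo the partition count. The SHA-256 hash, the partition count of 4 and the helper names are assumptions for illustration; real Kafka clients use their own partitioner, so this mirrors the principle, not the exact mapping.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash, so equal keys always co-locate."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


# Each partition is an append-only list: order is preserved within it, not across them.
partitions: dict[int, list[tuple[str, str]]] = defaultdict(list)

events = [
    ("x7Kp2", "LinkCreated"),
    ("a1B2c", "LinkCreated"),
    ("x7Kp2", "LinkClicked"),
    ("a1B2c", "LinkClicked"),
]
for link_id, event_type in events:
    partitions[partition_for(link_id)].append((link_id, event_type))


def events_for(link_id: str) -> list[str]:
    """Read one link's events back from its partition, in the order they were appended."""
    return [etype for key, etype in partitions[partition_for(link_id)] if key == link_id]
```

Because both events for a link hash to the same partition, `events_for` returns them in publish order regardless of what other links are doing.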
Tradeoffs
Decoupling vs debuggability. A synchronous call graph is easy to trace: service A calls B calls C, the call stack tells you exactly what happened. An event-driven system produces a chain of causally related but temporally separated events across multiple services. Debugging "why did this webhook not fire?" requires correlating events across multiple consumers, potentially with minutes of delay between them. Distributed tracing with correlation IDs (a trace ID carried in every event) is essential.
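A minimal sketch of correlation ID propagation, assuming a hypothetical event envelope with `event_id` and `trace_id` fields and an illustrative `WebhookFired` follow-up event. Every event caused by another inherits its `trace_id`, so one filter reconstructs the chain.

```python
import uuid
from typing import Optional


def new_event(event_type: str, data: dict, trace_id: Optional[str] = None) -> dict:
    """Wrap a payload in an envelope; start a new trace unless one is passed in."""
    return {
        "event_id": str(uuid.uuid4()),
        "trace_id": trace_id or str(uuid.uuid4()),
        "type": event_type,
        "data": data,
    }


def follow_up(cause: dict, event_type: str, data: dict) -> dict:
    """Build an event caused by another one, inheriting its trace_id."""
    return new_event(event_type, data, trace_id=cause["trace_id"])


link_created = new_event("LinkCreated", {"id": "x7Kp2"})
webhook_fired = follow_up(link_created, "WebhookFired", {"link_id": "x7Kp2", "status": 200})
unrelated = new_event("LinkCreated", {"id": "other"})

# Stand-in for a central log: filtering by trace_id recovers the causal chain.
log = [link_created, unrelated, webhook_fired]
trace = [e["type"] for e in log if e["trace_id"] == link_created["trace_id"]]
```

Answering "why did this webhook not fire?" then becomes a search for the trace ID and a check of which expected event type is missing.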
Availability vs consistency. The producer writes to the database and publishes an event. If the event publish fails (Kafka is temporarily unavailable), the database write succeeds but the consumers never receive the event. The system is inconsistent. The Outbox pattern (post 10) solves this.
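As a rough preview of the polling variant of that pattern, the sketch below uses an in-memory SQLite database. The table layout, the `published` flag and the `relay_once` helper are assumptions made for illustration, not a prescribed schema. The link row and the event row are written in one transaction, and a separate relay publishes pending rows afterwards.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (id TEXT PRIMARY KEY, url TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE outbox ("
    "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
    "payload TEXT NOT NULL, "
    "published INTEGER NOT NULL DEFAULT 0)"
)


def create_link(link_id: str, url: str) -> None:
    """Insert the link and its event in one transaction: both commit or neither does."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO links (id, url) VALUES (?, ?)", (link_id, url))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "LinkCreated", "id": link_id, "url": url}),),
        )


def relay_once(publish) -> int:
    """Publish every unpublished outbox row in order, marking each one afterwards."""
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        publish(json.loads(payload))  # if this raises, the row stays pending for a retry
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


create_link("x7Kp2", "https://example.com/long")
sent: list[dict] = []      # stand-in for the broker
relayed = relay_once(sent.append)
```

If the process crashes between publishing and marking a row, the relay publishes that event again on the next pass, which is one reason consumers need to tolerate duplicates.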
Eventual consistency. Consumers process events asynchronously — there's a lag between the producer publishing and the consumer acting. In the URL shortener, a new link may not appear in the analytics dashboard for a few seconds after creation. This is usually acceptable; sometimes it isn't. Know which operations require synchronous confirmation and which can tolerate eventual consistency.
Message ordering and duplicates. Distributed brokers can deliver messages out of order or more than once (at-least-once delivery). Consumers must be idempotent — processing the same event twice should produce the same result. Use event IDs and deduplication logic.
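A minimal sketch of an idempotent consumer, using the billing counter as the example. The `BillingConsumer` class and the `event_id` field are hypothetical, and the in-memory set is a simplification: a real service would keep processed IDs in durable storage, ideally updated in the same transaction as the side effect.

```python
class BillingConsumer:
    """Counts links per billing period, ignoring events it has already applied."""

    def __init__(self) -> None:
        self.link_count = 0
        self._seen_event_ids: set[str] = set()

    def handle(self, event: dict) -> bool:
        """Apply the event once; return False if it was a duplicate delivery."""
        event_id = event["event_id"]
        if event_id in self._seen_event_ids:
            return False
        self.link_count += 1
        self._seen_event_ids.add(event_id)
        return True


consumer = BillingConsumer()
event = {"event_id": "evt-001", "type": "LinkCreated", "id": "x7Kp2"}

first = consumer.handle(event)
second = consumer.handle(event)  # redelivery of the same event
```

Without the ID check, an at-least-once redelivery would bill the team for two links instead of one.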
When to use event-driven architecture
Use it when:
- Multiple services need to react to the same action (fan-out without coupling)
- Producers and consumers have different scaling requirements or availability profiles
- Operations can tolerate eventual consistency (analytics, notifications, QR generation)
- You want to add new consumers without modifying the producer
Don't use it for:
- Operations that require immediate confirmation (payment processing — the user needs to know if it succeeded or failed right now)
- Simple request-response flows where synchronous coupling isn't a problem
- Small systems where the operational overhead of a message broker exceeds the benefit
The one thing to remember
Event-driven architecture decouples producers from consumers by making "what happened" the shared language. The producer publishes facts; consumers react in their own time. The cost is eventual consistency and the need for idempotent consumers and distributed tracing. The benefit is resilience (a consumer's downtime doesn't affect the producer), scalability (each consumer scales independently), and extensibility (new consumers subscribe without touching the producer).
← Previous: Serverless — a third deployment model where you neither manage servers nor run persistent services; pay per invocation, scale to zero, focus purely on function logic.
→ Next: Message Queues — the concrete infrastructure behind event-driven systems: how queues work, what durability guarantees they provide, and how producers and consumers decouple their processing rates.




