Service Mesh: A Programmable Network for Microservices

Series: System Design · Architecture Patterns — Pillar 7 of 8
The problem
You have twenty microservices. Each needs consistent retry logic, circuit breaking, timeouts, load balancing, mTLS, distributed tracing, and traffic routing for canary deployments. You've solved this with sidecars — every service has an Envoy proxy co-deployed.
But now you have twenty Envoy instances. How do you configure them consistently? How do you push a new circuit breaker policy across all twenty services at once? How do you do a canary deployment for the Link Service — routing 5% of traffic to the new version — without touching application code? How do you get a unified view of service-to-service traffic across the entire cluster?
Each sidecar in isolation is a useful tool. A fleet of sidecars with a shared control plane is a service mesh.
The core idea
A service mesh is a dedicated infrastructure layer for service-to-service communication. It consists of a data plane (the fleet of sidecar proxies that handle actual traffic) and a control plane (the central management component that configures the proxies, distributes certificates, and collects telemetry). Together, they provide traffic management, security, and observability across all services — without changing a line of application code.
The analogy: a managed road network
Independent roads (sidecars alone) work — cars get from A to B — but each driver must know the routes, obey their own traffic rules, and manage their own navigation.
A managed road network (service mesh) adds: traffic signals that can be reconfigured centrally, tolls that enforce access rules, surveillance cameras that feed into a central dashboard, variable speed signs for flow control. Cars (services) still drive — but the network manages the conditions.
How a service mesh works
Data plane (Envoy sidecars)
Every service pod has an Envoy sidecar that intercepts all inbound and outbound network traffic via iptables redirection. Envoy handles:
- Load balancing across destination service instances
- Retries on transient failures
- Circuit breaking when a destination service degrades (Envoy calls the service being called the "upstream")
- mTLS — all service-to-service traffic is encrypted and mutually authenticated
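- Request tracing: Envoy generates spans and adds trace headers automatically, but the service must copy those headers from each inbound request onto its outbound calls so the spans join into a single trace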
The service code makes a plain HTTP call to analytics-service:8080. Envoy intercepts it, adds mTLS, applies retry policy, emits a trace span, and forwards it. From the service's perspective: a normal HTTP call.
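As a rough sketch of how a workload ends up behind a sidecar with mTLS enforced, the following Istio-flavoured config enables automatic sidecar injection for a namespace and then requires mutual TLS inside it. The namespace name `shortener` is an assumption made for illustration; it does not come from this series.

```yaml
# Hypothetical namespace. The label asks Istio to inject an Envoy
# sidecar into every pod created in this namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: shortener            # assumed name, for illustration only
  labels:
    istio-injection: enabled
---
# Require mTLS for every workload in the namespace: sidecars reject
# plaintext connections from other workloads.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: shortener
spec:
  mtls:
    mode: STRICT
```

Neither resource mentions a specific service, which is the point: the policy applies to whatever is deployed into the namespace.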
Control plane (Istio / Linkerd)
The control plane manages the proxy fleet (Envoy in Istio's case; Linkerd ships its own Rust-based proxy):
- Certificate authority: issues and rotates mTLS certificates for every service identity automatically
- Configuration distribution: pushes routing rules, retry policies, circuit breaker settings to each Envoy instance via the xDS API
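- Telemetry: configures which metrics (request rate, error rate, latency) each proxy emits. In current Istio releases the sidecars expose these metrics directly for Prometheus to scrape and Grafana to chart; the control plane is not in the telemetry data path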
Control Plane (Istiod):
  → Pushes routing config to all Envoy sidecars
  → Issues mTLS certificates (24h lifetime by default in Istio, rotated automatically)
  → Configures which metrics and traces the sidecars emit

Envoy sidecar (link-service pod):
  ← Config from control plane
  → Handles all traffic to/from link-service
  → Exposes metrics and reports trace spans

Envoy sidecar (analytics-service pod):
  ← Config from control plane
  → Handles all traffic to/from analytics-service
  → Exposes metrics and reports trace spans
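To make "pushing a circuit breaker policy" concrete, here is a minimal sketch of an Istio DestinationRule that the control plane would translate into Envoy configuration for every sidecar calling analytics-service. The numeric thresholds are illustrative assumptions, not recommendations.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: analytics-service
spec:
  host: analytics-service
  trafficPolicy:
    # Cap how much load a single caller can put on the destination.
    connectionPool:
      tcp:
        maxConnections: 100          # assumed limit
      http:
        http1MaxPendingRequests: 50  # assumed limit
        maxRequestsPerConnection: 10
    # Eject instances that keep failing, then let them back in later.
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```

Applying this one resource changes behaviour for every caller at once, with no redeploy of the calling services.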
Traffic management
The control plane enables sophisticated traffic routing that would otherwise require application code or multiple load balancer rules:
Canary deployment:
# Route 95% of traffic to stable, 5% to canary, with no application changes
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: link-service
spec:
  hosts:
  - link-service
  http:
  - route:
    - destination:
        host: link-service
        subset: stable
      weight: 95
    - destination:
        host: link-service
        subset: canary
      weight: 5
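The `stable` and `canary` subsets referenced by a VirtualService have to be defined somewhere, and in Istio that is a DestinationRule. A minimal sketch follows, assuming the two versions of the Link Service are distinguished by a pod label named `version`; that label scheme is an assumption for illustration.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: link-service
spec:
  host: link-service
  subsets:
  # Each subset selects the pods carrying the given labels.
  - name: stable
    labels:
      version: stable      # assumed pod label
  - name: canary
    labels:
      version: canary      # assumed pod label
```

Shifting more traffic to the canary is then a matter of editing the two weights in the VirtualService.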
Fault injection (for chaos engineering):
# Inject a 5-second delay on 10% of requests to analytics-service for testing
# (this is the http section of a VirtualService spec)
http:
- fault:
    delay:
      percentage:
        value: 10
      fixedDelay: 5s
  route:
  - destination:
      host: analytics-service
Header-based routing: route requests with header X-Beta-User: true to a beta version of the service.
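A header match of this kind might look like the following sketch. It assumes a `beta` subset has been defined in a DestinationRule alongside `stable`; that subset name is hypothetical. Istio expects header names in lowercase, hence `x-beta-user`.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: link-service
spec:
  hosts:
  - link-service
  http:
  # Rules are evaluated in order: beta users match first.
  - match:
    - headers:
        x-beta-user:
          exact: "true"
    route:
    - destination:
        host: link-service
        subset: beta       # hypothetical subset
  # Everyone else falls through to the stable version.
  - route:
    - destination:
        host: link-service
        subset: stable
```

This is shown as a standalone sketch; in practice the rules for one host usually live together in a single VirtualService, so a weighted canary split would be folded into the fallback route here.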
Retry and timeout policies: configure global defaults applied to all services without touching service code.
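As an illustration, a retry and timeout policy for calls to analytics-service could be sketched as below. This variant is scoped to one destination rather than mesh-wide, and the specific values are assumptions.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: analytics-service
spec:
  hosts:
  - analytics-service
  http:
  - route:
    - destination:
        host: analytics-service
    timeout: 10s               # overall budget for the request, retries included
    retries:
      attempts: 3              # assumed value
      perTryTimeout: 2s        # each attempt gets at most 2 seconds
      retryOn: 5xx,connect-failure
```

The calling service still issues one plain HTTP request; the sidecar performs the retries and enforces the deadline.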
Tradeoffs
Power vs complexity. A service mesh provides capabilities (mTLS everywhere, canary deployments, fault injection, unified observability) that would otherwise require significant investment in application code, or would be impractical to build at all. The cost is substantial: the control plane is a complex distributed system in its own right. Istiod has failure modes, upgrade procedures, and operational quirks that must be understood.
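Resource overhead. Each Envoy sidecar consumes 50–200 MB of RAM and adds ~1–3 ms of latency per service call. In a cluster with hundreds of pods this adds up: at 300 pods, the proxies alone would take roughly 15–60 GB of RAM, and a request that fans out across several services pays the latency cost on every call.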
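One way to keep this cost visible is to set the sidecar's resource requests explicitly per workload. The sketch below uses Istio's pod annotations for that; the image name, replica count, and resource figures are assumptions for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: link-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: link-service
  template:
    metadata:
      labels:
        app: link-service
      annotations:
        # Resource requests and limit for the injected Envoy container.
        sidecar.istio.io/proxyCPU: "100m"
        sidecar.istio.io/proxyMemory: "128Mi"
        sidecar.istio.io/proxyMemoryLimit: "256Mi"
    spec:
      containers:
      - name: link-service
        image: example.com/link-service:1.0.0   # hypothetical image
        ports:
        - containerPort: 8080
```

With explicit requests, the scheduler accounts for the proxy's share of each node, so the overhead shows up in capacity planning rather than as unexplained memory pressure.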
The complexity cliff. For small microservices deployments (under 5–10 services), a service mesh is almost certainly overkill: the operational overhead exceeds the benefit. The breakeven point differs for every organisation, but most teams don't need a mesh until they have genuine at-scale problems with service-to-service security or traffic management.
Linkerd vs Istio: Linkerd is simpler, lighter (Rust-based proxy), and easier to operate. Istio is more powerful, more configurable, and more complex. Linkerd is generally the better starting point; Istio fits organisations with complex traffic management requirements.
The one thing to remember
A service mesh solves service-to-service communication at scale by moving networking concerns (security, routing, observability, resilience) out of application code and into a managed infrastructure layer. The data plane (sidecar proxies) handles traffic; the control plane manages configuration and certificates. You pay for this in operational complexity and resource overhead — only justified when you have a genuine fleet of services with cross-cutting communication requirements that can't be handled by simpler means.