Observability: Understanding Your System at Runtime

Series: System Design · Distributed Systems — Pillar 8 of 8
The problem
A user reports that redirects for their links are slow: sometimes 3–4 seconds instead of the normal 10ms. You look at the application dashboard: everything is green. No errors are logged. CPU, memory, and average database query times all look normal.
But the user is experiencing real slowness. Somewhere in the system, something is wrong. Without instrumentation, you're debugging blind: the system is a black box, and the only signal you have is "users are unhappy."
This is the problem observability solves. An observable system tells you what it's doing at runtime — not just whether it's up, but why specific requests are slow, where in the call chain the latency is, what the error rate is per service, and which specific dependency is causing the problem.
The core idea
Observability is the property of a system that allows you to understand its internal state from its external outputs. In practice, observability consists of three types of telemetry data — logs, metrics, and traces — each answering a different question: what happened, how is the system behaving, and where did the time go.
The analogy: a pilot's cockpit instruments
An observable system is an aircraft with a full instrument panel. The pilot doesn't need to step outside to check if the engine is running: the tachometer shows RPM. They don't need to feel the altitude: the altimeter shows it. They don't need to guess whether the aircraft has drifted off course: the course deviation indicator shows it.
Each instrument answers a specific question. No single instrument tells the whole story — but together they give the pilot a complete picture of the aircraft's state.
An unobservable system is flying by feel in clouds: you know you're moving, you know roughly where you started, but you have no clear picture of where you are or what's about to go wrong.
The three pillars
Logs: what happened
Logs are timestamped, structured records of individual events in the system. They answer: "what exactly happened, and when?"
Structured logging (typically JSON) is the modern standard. Plain-text logs need brittle, format-specific parsing before a machine can query them; JSON logs can be filtered by field directly:
    {
      "timestamp": "2025-06-01T14:00:00.123Z",
      "level": "INFO",
      "service": "redirect-service",
      "trace_id": "abc123",
      "span_id": "def456",
      "message": "Redirect completed",
      "short_code": "x7Kp2",
      "destination": "https://example.com",
      "duration_ms": 8,
      "cache_hit": true,
      "user_agent": "Mozilla/5.0..."
    }
Every log entry carries the trace ID (to correlate with a distributed trace) and span ID (to identify the specific operation within the trace).
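One way to produce entries of this shape is a custom formatter on Python's standard logging module. This is a minimal sketch, not the redirect service's actual code: the JsonFormatter class, the build_logger helper, and the convention of passing fields through a "context" key in extra are all assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": "redirect-service",  # assumed service name
            "message": record.getMessage(),
        }
        # Anything passed as extra={"context": {...}} becomes an attribute
        # on the record; merge it in so IDs and parameters are queryable.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger() -> logging.Logger:
    logger = logging.getLogger("redirect-service")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.info(
    "Redirect completed",
    extra={"context": {
        "trace_id": "abc123",
        "span_id": "def456",
        "short_code": "x7Kp2",
        "duration_ms": 8,
        "cache_hit": True,
    }},
)
```

Because the trace and span IDs travel as ordinary fields, the aggregation backend can filter on them like any other key.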
Log aggregation: services ship logs to a central store (Elasticsearch/Kibana, Datadog, Loki/Grafana). Engineers query across all services by time, service, trace ID, user, error message, etc.
Key design principles: log at appropriate levels (DEBUG for development, INFO for normal operations, WARN for recoverable issues, ERROR for failures); include context (IDs, user references, relevant parameters); never log PII in production.
Metrics: how is the system behaving
Metrics are numeric measurements of system behaviour over time. They answer: "is the system healthy, and what's its current performance?"
The four Golden Signals (from Google's Site Reliability Engineering book):
- Latency: how long requests take (p50, p95, p99)
- Traffic: how many requests per second
- Errors: error rate (% of requests failing)
- Saturation: how full the system is (CPU %, memory %, queue depth)
RED method (for services): Rate, Errors, Duration — the minimum viable metrics for any service.
USE method (for infrastructure): Utilisation, Saturation, Errors — for CPUs, memory, disks, networks.
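To make the RED signals concrete, here is a small sketch that computes rate, error ratio, and latency percentiles from raw request samples. The percentile and red_summary functions and the sample data are hypothetical, and the nearest-rank percentile used here is only one of several common definitions. It also shows why a dashboard of averages can stay green while some users wait seconds.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests: list[tuple[float, int]], window_seconds: float) -> dict:
    """requests holds (duration_ms, http_status) pairs seen in the window."""
    durations = [duration for duration, _ in requests]
    failures = sum(1 for _, status in requests if status >= 500)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": failures / len(requests),
        "mean_ms": sum(durations) / len(durations),
        "p50_ms": percentile(durations, 50),
        "p99_ms": percentile(durations, 99),
    }


# 98 fast requests, one slow success, and one slow failure in a 10 s window.
window = [(8.0, 200)] * 98 + [(3200.0, 200), (3300.0, 500)]
summary = red_summary(window, window_seconds=10)
print(summary)
```

In this sample the median is 8ms and the mean is about 73ms, while p99 is 3200ms: the tail percentile is the only figure that reflects what the unlucky users see.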
Prometheus + Grafana is a widely used open-source stack. Services expose a /metrics endpoint in the Prometheus text format; Prometheus scrapes it on a schedule; Grafana visualises the stored time series.
    # Prometheus metrics exposed by the redirect service
    http_requests_total{method="GET",path="/r",status="200"} 1423904
    http_request_duration_seconds_bucket{le="0.01"} 1390000
    http_request_duration_seconds_bucket{le="0.1"} 1423500
    redirect_cache_hits_total 1401234
    redirect_cache_misses_total 22670
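The bucket lines above are cumulative: each le bucket counts every request at or below that bound. The sketch below illustrates that bookkeeping with a hand-rolled Histogram class. The class is hypothetical and written only to show the idea; a real service would normally use an official Prometheus client library rather than formatting these lines itself.

```python
from bisect import bisect_left


class Histogram:
    """Cumulative-bucket histogram rendered in Prometheus-style text lines."""

    def __init__(self, name: str, bounds: list[float]):
        self.name = name
        self.bounds = sorted(bounds)
        # One counter per bound, plus a final slot for values above all bounds.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, value: float) -> None:
        # bisect_left returns the index of the first bound >= value,
        # which is the smallest bucket that contains the observation.
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value

    def render(self) -> str:
        lines, running = [], 0
        for bound, count in zip(self.bounds, self.counts):
            running += count  # cumulative: includes all smaller buckets
            lines.append(f'{self.name}_bucket{{le="{bound}"}} {running}')
        running += self.counts[-1]
        lines.append(f'{self.name}_bucket{{le="+Inf"}} {running}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {running}")
        return "\n".join(lines)


histogram = Histogram("http_request_duration_seconds", [0.01, 0.1])
for seconds in (0.008, 0.008, 0.05, 3.2):
    histogram.observe(seconds)
print(histogram.render())
```

Percentiles such as p99 are then estimated from these bucket counts at query time, so their accuracy depends on how well the bucket bounds match the latencies you care about.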
Alerting: alert rules fire when metrics cross thresholds. p99 latency > 100ms → PagerDuty alert. Error rate > 1% → alert. Queue depth > 1000 → alert.
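A threshold rule can be sketched as a comparison over recent samples. The AlertRule type, the metric names, and the sample history below are hypothetical. The for_samples field is an assumption added here, not something the thresholds above specify: it requires a sustained breach so that one noisy sample does not page anyone, similar in spirit to the "for" clause in Prometheus alerting rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    threshold: float
    for_samples: int  # consecutive breaching samples required before firing


def firing_alerts(rules: list[AlertRule], history: dict[str, list[float]]) -> list[str]:
    """Return the rules whose metric exceeded its threshold for the last N samples."""
    fired = []
    for rule in rules:
        recent = history.get(rule.metric, [])[-rule.for_samples:]
        if len(recent) == rule.for_samples and all(v > rule.threshold for v in recent):
            fired.append(rule.name)
    return fired


rules = [
    AlertRule("HighP99Latency", "p99_latency_ms", threshold=100, for_samples=3),
    AlertRule("HighErrorRate", "error_rate_pct", threshold=1, for_samples=3),
    AlertRule("QueueBacklog", "queue_depth", threshold=1000, for_samples=3),
]
history = {
    "p99_latency_ms": [40, 180, 950, 3200],  # sustained breach
    "error_rate_pct": [0.1, 0.2, 2.5, 0.3],  # single spike, then recovery
    "queue_depth": [12, 15, 9, 11],          # healthy
}
print(firing_alerts(rules, history))  # ['HighP99Latency']
```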
Distributed tracing: where did the time go
A trace records the journey of a single request through all the services it touched. It answers: "for this specific request, which service took how long, and where exactly was the time spent?"
Without tracing, a request that touches 5 services produces 5 separate log streams. Correlating them manually to understand which service was slow is laborious and error-prone.
With tracing, the entire request's journey is captured in one trace:
    Trace: abc123 (total: 3247ms)
    [Redirect Service: 3247ms]
    ├─ DNS lookup: 1ms
    ├─ [Redis Cache: 8ms] → MISS
    └─ [PostgreSQL Query: 3230ms] ← THE SLOW PART
       └─ query: SELECT destination FROM links WHERE short_code='x7Kp2'
          indexes used: none (missing index on short_code)
          rows scanned: 50,000,000
This trace immediately identifies the problem: the query scans the whole table because there is no index on short_code. Without tracing, the application logs show "query executed" and the database logs show "slow query", but linking the two to the affected requests requires manual correlation.
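A trace is a tree of spans, and finding the slow part amounts to finding the span with the most time of its own. The sketch below models the trace above with a hypothetical Span type. It assumes child spans run one after another, so a parent's self time is its duration minus the sum of its children; concurrent children would need real start and end timestamps.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)

    def self_time_ms(self) -> float:
        """Time in this span not accounted for by its (sequential) children."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def walk(span: Span):
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def hottest_span(root: Span) -> Span:
    """Return the span that spent the most time in its own work."""
    return max(walk(root), key=lambda s: s.self_time_ms())


trace = Span("Redirect Service", 3247, [
    Span("DNS lookup", 1),
    Span("Redis Cache", 8),
    Span("PostgreSQL Query", 3230),
])
slowest = hottest_span(trace)
print(slowest.name, slowest.self_time_ms())
```

The root span lasts 3247ms but only 8ms of that is its own work, so the search lands on the PostgreSQL span rather than the service that merely waited for it.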
OpenTelemetry: the vendor-neutral standard for instrumentation. Services create spans (individual operations within a trace) with the OpenTelemetry SDKs and typically export them over OTLP, the OpenTelemetry Protocol. Backends (Jaeger, Honeycomb, Datadog) store and visualise the traces.
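Trace propagation: when Service A calls Service B, it passes the trace ID and its own span ID in the W3C Trace Context traceparent HTTP header, which has the form version-traceid-parentid-flags (a 32-hex-character trace ID and a 16-hex-character span ID). Service B creates a child span under that parent span, so the entire distributed call chain is captured under one trace ID.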
    # Request headers from Service A to Service B
    traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
                    ↑ trace ID                       ↑ parent span ID
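On the receiving side, Service B has to parse that header and mint a new span ID for its own outgoing calls. The following is a minimal sketch of that step for the version-00 header shape only. The TraceContext type and both function names are hypothetical; in practice an OpenTelemetry SDK's propagator does this for you.

```python
import re
import secrets
from typing import NamedTuple

# version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    flags: str

    @property
    def sampled(self) -> bool:
        # The lowest flag bit says whether the caller recorded this trace.
        return bool(int(self.flags, 16) & 0x01)


def parse_traceparent(header: str) -> TraceContext:
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    ctx = TraceContext(*match.groups())
    # All-zero IDs are reserved as invalid.
    if set(ctx.trace_id) == {"0"} or set(ctx.parent_span_id) == {"0"}:
        raise ValueError("all-zero trace or span ID")
    return ctx


def child_traceparent(ctx: TraceContext) -> tuple[str, str]:
    """Return (new_span_id, header) for a call this service makes downstream."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return span_id, f"{ctx.version}-{ctx.trace_id}-{span_id}-{ctx.flags}"


incoming = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
span_id, outgoing = child_traceparent(incoming)
print(incoming.trace_id, incoming.parent_span_id, incoming.sampled)
print(outgoing)
```

The trace ID is copied unchanged at every hop while the span ID is replaced, which is what lets the backend reassemble the parent-child tree.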
The relationship between pillars
The three pillars answer different questions but are most powerful combined:
1. An alert fires (metric): redirect p99 latency > 500ms.
2. Open the traces for that time period and find the slow ones.
3. Identify which span is slow (e.g., PostgreSQL query: 3.2s).
4. Check the logs for that trace ID to confirm the exact query, parameters, and context.
5. Root cause identified in minutes, not hours.
Each pillar narrows the investigation. Metrics surface the symptom. Traces locate the service and operation. Logs provide the specific context.
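The hand-off from traces to logs works because both carry the same trace ID. The sketch below shows that join over in-memory sample data. The function names and records are hypothetical; a real investigation would run equivalent queries against the tracing and log backends.

```python
import json


def slow_trace_ids(traces: list[dict], threshold_ms: float) -> list[str]:
    """Pick the traces whose total duration exceeds the alert threshold."""
    return [t["trace_id"] for t in traces if t["duration_ms"] > threshold_ms]


def logs_for_trace(raw_lines: list[str], trace_id: str) -> list[dict]:
    """Parse JSON log lines and keep the entries that belong to one trace."""
    entries = (json.loads(line) for line in raw_lines)
    return [e for e in entries if e.get("trace_id") == trace_id]


traces = [
    {"trace_id": "abc123", "duration_ms": 3247},
    {"trace_id": "zzz999", "duration_ms": 9},
]
raw_logs = [
    '{"trace_id": "zzz999", "level": "INFO", "message": "Redirect completed"}',
    '{"trace_id": "abc123", "level": "INFO", "message": "Cache miss"}',
    '{"trace_id": "abc123", "level": "WARN", "message": "Slow query"}',
]

for trace_id in slow_trace_ids(traces, threshold_ms=500):
    for entry in logs_for_trace(raw_logs, trace_id):
        print(trace_id, entry["level"], entry["message"])
```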
Tradeoffs
Sampling. High-traffic services usually can't afford to record 100% of traces because the storage cost is too high. Most tracing systems sample, recording 1% or 10% of traces, which means rare slow requests may not be captured. Tail-based sampling (deciding after a trace completes, and always keeping those with high latency or errors) mitigates this.
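A sampling policy of this kind can be sketched as two rules: always keep slow or failed traces, and keep a fixed fraction of everything else. The function names, the 1% rate, and the 500ms cut-off below are assumptions for illustration. Because the decision needs the final duration and error status, it can only be made once the trace has finished, which means spans must be buffered until then.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Deterministic probabilistic choice: hashing the trace ID means every
    service that sees the same ID reaches the same decision."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def keep_trace(
    trace_id: str,
    duration_ms: float,
    had_error: bool,
    rate: float = 0.01,
    slow_ms: float = 500.0,
) -> bool:
    """Keep every slow or failed trace, plus a random fraction of the rest."""
    if had_error or duration_ms >= slow_ms:
        return True
    return head_sample(trace_id, rate)


print(keep_trace("abc123", duration_ms=3247, had_error=False))  # True: slow
print(keep_trace("def456", duration_ms=8, had_error=True))      # True: failed
kept = sum(keep_trace(f"trace-{i}", 8, False, rate=0.1) for i in range(10_000))
print(f"kept {kept} of 10000 ordinary traces")
```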
Cardinality explosion. Each unique combination of label values is stored as its own time series, so metrics with high-cardinality labels (one label value per user ID, per URL, per session) create millions of series, more than most metrics systems can store and query efficiently. Keep metric label cardinality low; use logs or traces for high-cardinality data.
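The arithmetic behind the explosion is a product: the worst-case number of time series for one metric is the number of distinct values of each label multiplied together. The label names and counts in this sketch are made up for illustration.

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the number
    of distinct values each label can take."""
    return prod(label_cardinalities.values())


bounded = {"method": 5, "path": 20, "status": 10}
with_user_id = {**bounded, "user_id": 1_000_000}

print(max_series(bounded))       # 1000
print(max_series(with_user_id))  # 1000000000
```

Not every combination occurs in practice, so the real count is usually lower, but a single unbounded label is enough to multiply an otherwise modest metric by the size of the user base.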
Operational overhead. Running a log aggregation stack (Elasticsearch), a metrics system (Prometheus + Grafana), and a tracing backend (Jaeger) means maintaining three separate systems. Managed services (Datadog, Honeycomb, New Relic) trade higher cost for operational simplicity.
Instrumentation discipline. Observability requires intentional instrumentation — it doesn't happen automatically. Every service must emit structured logs, expose metrics, and propagate trace headers. This is an ongoing engineering effort, not a one-time setup.
The one thing to remember
Observability is not monitoring — it's the property of a system that lets you ask arbitrary questions about its behaviour at runtime. Logs answer what happened. Metrics answer how the system is performing. Traces answer where time was spent in a specific request. Together, they turn a distributed black box into a system you can reason about during incidents. In a microservices system, observability is not optional — without it, production debugging is archaeology.