The Three Pillars of Observability: And Why You Need All Three

Series: The Modern SDLC · Post 13 of 17 · ← Post 12: Release Management · Post 14: Alerting and On-Call →


There is a particular kind of production incident that every engineer eventually experiences. An alert fires. The error rate is up. You open the dashboards and can see that something is wrong — the number is bad, the graph is spiking — but you can't see why. You know the system is broken; you don't know where or how. You start guessing. You tail logs. You restart services hoping something resets. An hour later you find the problem — a slow database query introduced in a deployment three days ago — and wonder why it took so long.

That experience is the absence of observability. Monitoring told you something was wrong. Observability would have told you what was wrong, where it was happening, and — crucially — why.

The distinction matters in practice. Monitoring is a set of checks on known failure modes. Observability is the property of a system that allows you to understand its internal state from its external outputs — without having to add new instrumentation every time something unexpected happens. A monitored system tells you when it crosses a threshold. An observable system lets you ask arbitrary questions and get answers.

The three pillars — metrics, logs, and distributed traces — are the data types that make a system observable. Each answers a different question. Each is necessary. None is sufficient alone.


The one thing to remember

Metrics tell you something is wrong. Logs tell you what happened. Traces tell you why. You need all three because each covers what the others miss.


The three questions and which pillar answers them

Before diving into implementation, it's worth being precise about what each pillar is for.

Metrics answer "what is happening right now?" They're numeric measurements aggregated over time — error rates, latency distributions, request counts, CPU utilisation. They're cheap to store, fast to query, and the foundation of alerting. When your error rate spikes to 5%, a metric is what tells you. A metric alone can't tell you which requests are failing, why they're failing, or which service in your stack is responsible.

Logs answer "what happened in detail?" They're timestamped records of discrete events — requests received, errors thrown, state transitions, external calls made. Rich in detail, expensive to store at volume, and the primary tool for diagnosis once you know something is wrong. Logs tell you the error message, the stack trace, the request payload. They can't efficiently tell you "what's the p95 latency of requests from users in Germany over the last six hours" — that's a metrics question.

Traces answer "why did it happen, and where?" A distributed trace follows a single request as it flows through every service, database call, queue message, and external API. It shows exactly where time was spent and where errors originated. Traces are the tool that transforms "the checkout is slow" into "the checkout is slow because the inventory service is making three sequential database queries that could be parallelised." They can't efficiently tell you aggregate statistics across all requests — that's a metrics question.

The three pillars are not alternatives. They're complementary data types that work together. An effective debugging workflow is: metrics fire an alert, you jump to a trace of a failing request, the trace points to a specific service and span, you query the logs for that span to see the exact error. That workflow requires all three.


Pillar 1 — Metrics: the foundation of alerting

Metrics are the starting point of any observability practice. If you instrument nothing else, instrument the four golden signals — the minimum set that gives you meaningful visibility into a service's health.

Latency — how long requests take. Not just average latency — that number is a lie. A service where 99% of requests complete in 10ms and 1% take 30 seconds has an average latency that looks acceptable while a meaningful fraction of users have a terrible experience. Instrument latency as a histogram and measure p50, p95, and p99. Alert on p95 or p99, not on the mean.

Traffic — requests per second, or events per second for non-HTTP workloads. Traffic context is what makes other metrics meaningful. A 5% error rate at 10 requests per second is 0.5 errors per second. A 5% error rate at 10,000 requests per second is 500 errors per second. The error rate alone doesn't tell you the impact.

Errors — the rate of requests that fail. Distinguish between different failure types: 5xx errors (your problem), 4xx errors (usually the client's problem, but worth monitoring for sudden spikes), and application-level errors that return 200 but represent a business failure.

Saturation — how full the system is. CPU utilisation, memory pressure, queue depth, database connection pool usage, disk space. Saturation metrics are leading indicators — they tell you that the system is approaching a limit before it reaches it.
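To make the latency point concrete, here is a minimal sketch comparing the mean with percentiles. The sample is synthetic and is an assumption for illustration: 2% of requests are slow (slightly more than the 1% in the example above) so that the p99 boundary falls inside the slow tail. The `percentile` helper is a simple nearest-rank implementation, not what a metrics backend does with histogram buckets.

```python
import math
import statistics

# Synthetic sample (assumption): 98% of requests take 10 ms, 2% take 30 s.
latencies_ms = [10.0] * 9_800 + [30_000.0] * 200


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


mean_ms = statistics.fmean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms  p50={p50:.0f} ms  p95={p95:.0f} ms  p99={p99:.0f} ms")
```

The mean lands around 610 ms, which could pass for a slightly slow service, and p50 and p95 both report 10 ms. Only p99 exposes the 30-second tail, which is the argument for alerting on high percentiles rather than the mean.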

Beyond the golden signals, instrument business metrics: orders per minute, signups per hour, revenue per minute, active user count. These are the metrics your stakeholders actually care about, and a drop in business metrics is often the first real signal of a user-impacting problem — sometimes before technical metrics show anything unusual.

Cardinality discipline is worth understanding early. Prometheus and most metrics systems work well with low-cardinality labels — service name, endpoint, status code. They break badly with high-cardinality labels — user ID, request ID, customer account number. Adding a user ID as a metric label means creating a separate time series for every user, which can be millions of series and terabytes of storage. High-cardinality analysis is traces work, not metrics work.
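The arithmetic behind that warning is easy to sketch. The label counts below are hypothetical, and the product is an upper bound, since not every combination of label values occurs in practice.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-count metric.
low_cardinality = {"service": 20, "endpoint": 30, "status_code": 5}
high_cardinality = {**low_cardinality, "user_id": 1_000_000}

low_series = series_count(low_cardinality)
high_series = series_count(high_cardinality)

print(f"without user_id: {low_series:,} series")
print(f"with user_id:    {high_series:,} series")
```

One extra label turns a few thousand potential series into a few billion, for a single metric.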

Tooling: Prometheus is the open-source default for Kubernetes environments, with pull-based collection, PromQL for queries, and a rich ecosystem of exporters for common services. Grafana for dashboards over Prometheus (and fifty other data sources). Datadog, New Relic, and Honeycomb are SaaS platforms with lower operational overhead than self-hosted Prometheus, at the cost of a bill that grows with hosts, data volume, or seats depending on the vendor.


Pillar 2 — Logs: diagnosis in detail

Logs are the most familiar observability tool and the most misused. A stream of unstructured text lines is technically logs. It's not useful logs.

Structured logging is the practice that turns logs from text to data. Every log line is a JSON object with consistent, queryable fields: timestamp, level, service, trace_id, span_id, message, and whatever domain-specific fields are relevant. This makes logs filterable by any field, aggregatable across services, and correlatable with traces.

The contrast is stark. An unstructured log line: "2024-11-15 14:32:01 ERROR User 123 failed to checkout". A structured log line: {"timestamp":"2024-11-15T14:32:01Z","level":"error","service":"checkout","user_id":"123","event":"checkout_failed","reason":"payment_declined","amount":4999,"currency":"GBP","trace_id":"abc123def456"}. The second is filterable by user, by event type, by failure reason. The first requires grep.
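As a rough sketch of how the structured version can be produced with nothing but Python's standard `logging` module: the `JsonFormatter` class and the `fields` key passed through `extra` are conventions invented for this example, not part of any library. In practice a dedicated structured logging library would likely do this for you.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "event": record.getMessage(),
        }
        # Domain-specific fields arrive via extra={"fields": {...}} (our own convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter("checkout"))
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "checkout_failed",
    extra={"fields": {
        "user_id": "123",
        "reason": "payment_declined",
        "amount": 4999,
        "currency": "GBP",
        "trace_id": "abc123def456",
    }},
)
```

Every field is now a key a log backend can filter or aggregate on, instead of a substring to grep for.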

Log levels exist for a reason. ERROR means something failed that shouldn't have — someone should investigate. WARN means something unexpected happened that was handled — worth monitoring for trends. INFO means a normal significant event occurred — request received, job completed, cache refreshed. DEBUG means detailed internal state useful during development — off in production, or at most available on demand for a specific service. The failure mode is treating DEBUG as INFO and flooding production logs with high-volume noise that makes finding real signal expensive.

Correlation IDs are the most valuable single addition to any logging practice. Every request that enters your system at the API gateway or load balancer gets assigned a unique ID. That ID is propagated through every service call in the chain and included in every log line generated during that request. When a customer reports a problem at 14:32 and you need to trace what happened, you filter logs for that timestamp, find the correlation ID, and then retrieve every log line from every service that handled that request. Without correlation IDs, reconstructing the story of a single request across multiple services is archaeology.

In OpenTelemetry, the correlation ID is the trace_id — and it's set automatically by the SDK when you instrument your services. If you're not using OpenTelemetry yet, adding a correlation ID header manually is five lines of middleware code and worth doing immediately.
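For teams not yet on OpenTelemetry, the manual version might look something like the WSGI sketch below. The `X-Correlation-ID` header name, the `CorrelationIdMiddleware` class, and the `CorrelationIdFilter` class are all assumptions made for illustration. A real service also has to forward the header on every outbound call, which this sketch leaves out.

```python
import contextvars
import logging
import uuid

# Hypothetical header name; WSGI exposes "X-Correlation-ID" under this key.
WSGI_HEADER = "HTTP_X_CORRELATION_ID"

# Holds the ID of the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdMiddleware:
    """Reuse the caller's correlation ID or mint a new one, and echo it in the response."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        cid = environ.get(WSGI_HEADER) or uuid.uuid4().hex
        token = correlation_id.set(cid)

        def start_with_id(status, headers, exc_info=None):
            return start_response(status, list(headers) + [("X-Correlation-ID", cid)], exc_info)

        try:
            return self.app(environ, start_with_id)
        finally:
            # Simplification: a streaming response body would outlive this reset.
            correlation_id.reset(token)


class CorrelationIdFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True
```

Wrapping the application (`app = CorrelationIdMiddleware(app)`) and adding the filter to the log handler is enough for every log line written during a request to carry the same ID.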

What never to log: passwords, tokens, API keys, credit card numbers, full payment details, unmasked PII (names, email addresses, phone numbers, national ID numbers). Log the user ID, not the user's data. Log that a payment was processed, not the card number it was processed against. GDPR, PCI-DSS, and HIPAA all have specific requirements here, and a log aggregation system is a surprisingly common source of compliance violations.
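One cheap safety net is to scrub known-sensitive keys before a record reaches the logger. The sketch below is illustrative only: the key list is hypothetical, and matching on key names will not catch secrets embedded in free-text messages, so it complements the discipline of not logging sensitive data rather than replacing it.

```python
# Hypothetical deny-list; a real one would come from your data classification policy.
SENSITIVE_KEYS = {"password", "token", "api_key", "card_number", "email", "phone"}


def redact(fields):
    """Return a copy of fields with sensitive values masked, recursing into nested dicts."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


event = {
    "user_id": "123",
    "password": "hunter2",
    "payment": {"card_number": "4111111111111111", "amount": 4999},
}
safe_event = redact(event)
print(safe_event)
```

The user ID and amount survive; the password and card number do not, and the original dictionary is left untouched.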

Retention and cost. Logs are expensive. A high-traffic service logging at DEBUG can produce gigabytes per hour. Tier your retention: hot (last seven days, fully indexed and searchable), warm (7–30 days, queryable with slightly higher latency), cold (30–365 days, compressed and cheap, retrieved on demand). Route high-volume low-value logs to cold storage or drop them. Datadog and Splunk bills have a habit of growing faster than any other infrastructure cost — retention policies are not optional.

Tooling: Grafana Loki is the cheapest option — it indexes only labels (like Prometheus) rather than full-text indexing the content of every log line. Dramatically cheaper than Elasticsearch at high volume. OpenSearch or Elasticsearch for full-text search requirements and compliance workloads that need rich query capability. Datadog Logs and Splunk for SaaS with deep integration with the rest of the platform.


Pillar 3 — Distributed traces: find the root cause fast

Distributed tracing is the pillar most teams adopt last, and the one that saves the most debugging time once it's in place.

A trace follows a single request through your entire system. Every service it touches, every database query it triggers, every external API it calls, every queue message it produces — all captured as a tree of spans with precise timing. The result is a waterfall diagram that shows exactly where time was spent and exactly where errors originated.

The debugging transformation is real. "The checkout is slow" becomes — in seconds — "the checkout is slow because the get_inventory call on line 47 of the order service is making three sequential queries to the product database that could be a single query." Without tracing, that diagnosis is a 90-minute investigation involving log correlation, database query analysis, and intuition. With tracing, it's a 90-second investigation involving clicking the slow span in a waterfall diagram.
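A toy model shows why the waterfall makes this so quick. The sketch below represents a trace as a tree of spans and ranks them by self time, meaning time not accounted for by child spans. The span names and durations are invented, and the self-time calculation assumes children run one after another, which real traces do not guarantee.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)

    def self_time_ms(self):
        """Time spent in this span itself, assuming its children ran sequentially."""
        return self.duration_ms - sum(child.duration_ms for child in self.children)


def walk(span, depth=0):
    """Yield (depth, span) for every span in the tree, parents first."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


# Hypothetical trace of one slow checkout request.
trace = Span("POST /checkout", 900, [
    Span("auth.verify_token", 40),
    Span("inventory.get_inventory", 810, [
        Span("db: SELECT stock", 260),
        Span("db: SELECT price", 270),
        Span("db: SELECT reservations", 265),
    ]),
])

for depth, span in walk(trace):
    print(f"{'  ' * depth}{span.name}: {span.duration_ms:.0f} ms (self {span.self_time_ms():.0f} ms)")

hotspots = sorted((s for _, s in walk(trace)), key=Span.self_time_ms, reverse=True)[:3]
```

The three spans with the most self time are the three sequential database queries under `inventory.get_inventory`, which is the kind of answer a trace viewer gives you visually.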

The core concepts:

A trace is the full record of one request's journey through the system. It has a unique trace_id shared across every service involved.

A span is one unit of work within the trace — one service call, one database query, one external HTTP call, one message publication. Spans have a start time, duration, status, and arbitrary metadata. They nest to form a tree — the root span is the incoming request, child spans are the work it triggers.

Context propagation is how the trace ID travels through the system. When service A calls service B, it includes the trace ID in the request headers (the W3C traceparent header is the standard). Service B starts a child span under the same trace. Without context propagation, you have a collection of isolated spans that can't be assembled into a coherent trace.

Auto-instrumentation means you don't have to write span creation code for every database call and HTTP request. OpenTelemetry provides auto-instrumentation for Node.js, Python, Java, Go, and .NET that hooks into popular frameworks and libraries (Express, Django, Spring, gRPC) and generates spans automatically, though maturity and coverage vary by language. Get traces from day one. Add custom spans for business logic as the need arises.
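Context propagation is less mysterious once you see the header. A version-00 W3C traceparent value has four dash-separated hex fields: version, a 32-character trace ID, a 16-character parent span ID, and flags whose lowest bit means "sampled". The sketch below parses one and builds the header a service would send downstream. The function names are made up for this example, defaulting to "sampled" for a brand-new trace is an assumption, and an OpenTelemetry SDK handles all of this for you.

```python
import re
import secrets

# Strict version-00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def child_traceparent(incoming=None):
    """Build the outbound header: same trace ID, fresh span ID for this service's work."""
    parsed = parse_traceparent(incoming) if incoming else None
    trace_id = parsed[0] if parsed else secrets.token_hex(16)
    sampled = parsed[2] if parsed else True  # assumption: sample new traces by default
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(outgoing)
```

The trace ID stays constant across every hop while the span ID changes, which is what lets a backend reassemble isolated spans into one tree.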

Sampling strategy is where most teams make their first mistake. Tracing every request at high traffic volume is expensive. Head-based sampling — decide at the start of a request whether to trace it, sample 10% of traffic — gives you baseline coverage. But it means 90% of error traces are dropped, which is exactly the wrong trade-off.

Tail-based sampling solves this: make the sampling decision at the end of the request, after you know whether it succeeded. Keep 100% of error traces, keep 100% of traces above a latency threshold, sample 10% of successful requests. Never drop the traces you most need to see. The OpenTelemetry Collector's tail sampling processor implements this without requiring application changes.
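The decision rule itself is small. Here is a sketch of the policy described above as a plain function. The `FinishedTrace` type, the one-second latency threshold, and the injectable `rng` parameter are assumptions for illustration. The hard part in production is buffering all spans of a trace until it completes, and that is the work the Collector does for you.

```python
import random
from dataclasses import dataclass


@dataclass
class FinishedTrace:
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(trace, latency_threshold_ms=1000.0, baseline_rate=0.10, rng=random.random):
    """Tail-based decision: always keep errors and slow traces, sample the rest."""
    if trace.has_error:
        return True
    if trace.duration_ms >= latency_threshold_ms:
        return True
    return rng() < baseline_rate


traces = [
    FinishedTrace("a", 20.0, True),      # error: always kept
    FinishedTrace("b", 2500.0, False),   # slow: always kept
    FinishedTrace("c", 20.0, False),     # fast and healthy: kept about 10% of the time
]
kept = [t.trace_id for t in traces if keep_trace(t)]
print(kept)
```

Because the decision runs after the outcome is known, error and slow traces can never lose the coin flip the way they do under head-based sampling.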

Trace-to-log correlation is the feature that unlocks the full debugging workflow. Include trace_id and span_id in every structured log line; OpenTelemetry's logging integrations can inject them for you once enabled for your logging library. Modern observability platforms use this to jump from a slow or failing span directly to the logs generated during that span. From "this span failed" to "here is the exact error message and stack trace" in one click.

Tooling: Grafana Tempo is the cheapest self-hosted trace backend — it uses object storage (S3, GCS) and is designed to be cost-effective at scale. Jaeger is the more mature open-source option with a better UI. Honeycomb is the best-in-class SaaS tool for high-cardinality trace analysis — worth the cost for complex distributed systems. Datadog APM provides traces, metrics, and logs in one platform with deep correlation between them.


OpenTelemetry: instrument once, send anywhere

OpenTelemetry is the most important development in observability in the last five years. It's a CNCF project and the de facto standard for producing and collecting telemetry data (metrics, logs, and traces) in a vendor-neutral format. Instrument your code with the OpenTelemetry SDK, and the choice of backend (Prometheus, Tempo, Datadog, Honeycomb, anything else) becomes independent of the instrumentation.

The strategic value: if you instrument with OpenTelemetry, you can migrate from Datadog to Grafana Cloud, or from Jaeger to Honeycomb, without touching application code. Your instrumentation investment is portable. The vendor you choose today doesn't lock in the vendor you're using in three years.

The pipeline: your application uses the OTel SDK to emit traces, metrics, and logs in OTLP (OpenTelemetry Protocol) format. The OTel Collector receives that data, applies transformations (filtering, sampling, attribute enrichment), and routes it to one or more backends. The Collector can fan out to multiple destinations simultaneously — send metrics to Prometheus, traces to Tempo, logs to Loki, and a copy of everything to Datadog during a migration.

Instrument from day one. Retrofitting OpenTelemetry into a large existing codebase is painful — touching every service, every framework integration, every outbound HTTP client. Instrumenting from the start means traces and metrics are there when you need them, not added during the first major incident when you suddenly wish you had them.


SLOs and error budgets: define what good means

An observable system is one where you can measure what's happening. An SLO-driven system is one where you've defined in advance what "good" looks like — and can tell when you're drifting away from it.

SLI (Service Level Indicator) is the metric you measure. The ratio of good events to total events. "The proportion of HTTP requests that complete in under 500ms." "The proportion of checkout requests that succeed." SLIs must be measurable from your existing telemetry.

SLO (Service Level Objective) is the target you set for an SLI over a time window. "99.5% of requests complete in under 500ms over a rolling 30 days." "99.9% of checkout requests succeed over a rolling 28 days." The SLO is your reliability promise — set it just tight enough to matter, not so tight it's impossible to maintain.

Error budget is the allowed failure room within the SLO. A 99.9% SLO over 30 days gives you 43.2 minutes of budget. While budget remains, the team can take risks. When budget is exhausted, freeze risky changes and focus on reliability. Error budgets turn reliability from a vague aspiration into a quantitative resource that engineering and product teams can negotiate over.
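The budget arithmetic is worth doing once by hand, or in a few lines of code. This sketch treats the budget as time, which assumes an availability-style SLO. For a request-based SLO the same fraction applies to request counts instead. The function name is made up for the example.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unreliability for a given SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


for slo in (0.995, 0.999, 0.99999):
    minutes = error_budget_minutes(slo)
    print(f"SLO {slo:.3%}: {minutes:.2f} minutes ({minutes * 60:.0f} seconds) per 30 days")
```

A 99.5% target leaves 216 minutes a month, 99.9% leaves 43.2 minutes, and five nines leaves under half a minute.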

Burn rate alerts are the practical implementation of SLO-based alerting. Instead of alerting on "error rate above 1%," alert on "we're consuming error budget faster than sustainable." A burn rate of 14.4× means you'll exhaust the 30-day budget in about 50 hours, with 2% of it gone every hour: page someone now. A burn rate of 3× means you'll exhaust it in 10 days: create a ticket, investigate during business hours. A burn rate of 1× spends exactly the budget over the window, and anything below that is within budget, so no alert at all. This typically produces far fewer false positives than threshold-based alerting.
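Burn rate is simply the observed error ratio divided by the error ratio the SLO allows, so the time to exhaustion is the window length divided by the burn rate. A minimal sketch, with function names invented for the example:

```python
def burn_rate(observed_error_ratio, slo):
    """How many times faster than sustainable the error budget is being spent."""
    return observed_error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate, window_days=30):
    """Hours until a full budget is gone if the current burn rate holds."""
    return window_days * 24 / rate


def budget_fraction_consumed(rate, elapsed_hours, window_days=30):
    """Share of the window's budget spent after elapsed_hours at this burn rate."""
    return rate * elapsed_hours / (window_days * 24)


for rate in (14.4, 3.0, 1.0):
    print(
        f"burn rate {rate:>4}x: budget lasts {hours_to_exhaustion(rate):.0f} h, "
        f"{budget_fraction_consumed(rate, 1):.2%} spent per hour"
    )
```

At 14.4× a 30-day budget lasts about 50 hours and 2% of it disappears every hour. At 3× it lasts 240 hours, or 10 days. For a 99.9% SLO, a 1.44% error ratio corresponds to a burn rate of 14.4×.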

The common mistake is setting SLOs at 99.999% (five nines) for everything. Almost no service needs five nines. Almost no team can maintain it. And the error budget — 26 seconds per month — is so small that any non-trivial deployment carries meaningful risk of exhausting it. Start at 99.5% or 99.9%, measure what you actually achieve, and tighten from there.


The observability maturity ladder

Observability is not binary. Teams improve incrementally, and knowing where you are on the ladder helps identify what's worth investing in next.

Level 1 — Dark. No telemetry. You learn about problems from user complaints. Every incident starts with "a customer reported..."

Level 2 — Monitoring. Basic metrics and uptime checks. You know the service is up or down. You know CPU and memory. You don't know why things fail.

Level 3 — Logs and metrics. Structured logs, four golden signals, basic dashboards. You can diagnose most single-service issues. Cross-service debugging is still painful.

Level 4 — Full observability. Metrics, structured logs, and distributed traces, all correlated by trace ID. SLOs defined. Burn rate alerts. Root cause in minutes, not hours.

Level 5 — Proactive. Anomaly detection, continuous profiling (Pyroscope, Parca), predictive alerting. You know about problems before users do.

Most teams should target Level 4 as their standard operating posture. The jump from Level 2 to Level 4 is the highest-value investment in this space. Getting from Level 4 to Level 5 is worthwhile but yields diminishing returns for most systems.

Getting started without being overwhelmed: OTel SDK with auto-instrumentation plus structured logging to stdout, routed to Grafana Cloud's free tier (which includes Prometheus for metrics, Loki for logs, and Tempo for traces). Production-grade observability for near-zero cost until you scale. Add SLOs and burn rate alerts in the first month. Migrate to a paid tier or self-hosted stack when you outgrow the free tier.


What goes wrong without observability

The one-hour incident. An alert fires. The on-call engineer spends forty-five minutes tailing logs, checking dashboards that don't tell the story, and guessing at root cause. With full observability, the same incident resolves in five minutes — alert fires, trace opens, slow span identified, root cause visible.

Averages hiding the problem. Average latency looks fine at 150ms. p99 latency is 8 seconds. One request in a hundred is a terrible experience, and the average you're watching says everything is fine. Histograms and percentiles are non-negotiable.

The missing trace. An incident happens. You have logs from service A and logs from service C. You know the request went through service B but you have no logs from service B for that request. Without distributed tracing and correlation IDs, you cannot reconstruct what service B did. The investigation stalls.

High cardinality on metrics. Someone adds user_id as a Prometheus label to help with debugging. Prometheus creates a time series for every user. With a million users, the metric system falls over. Cardinality belongs in traces, not metrics.

Observability as afterthought. The system ships without instrumentation. An incident happens. Engineers spend hours adding logging and metrics to find the problem. The next incident happens and the instrumentation that was added for the previous incident doesn't cover the new failure mode. Instrument from day one.


If you do one thing from this post

Add OpenTelemetry auto-instrumentation to one service. Configure it to export to Grafana Cloud's free tier — Prometheus for metrics, Tempo for traces, Loki for logs. Enable structured logging to stdout.

Then trigger a slow or failing request deliberately in a test environment. Open the trace. Follow the waterfall. Find the slow span. Click through to the logs for that span.

That workflow — from symptom to root cause without guessing — is what you're building toward. Experiencing it once, even in a test environment, makes the investment in observability concrete in a way that abstract arguments can't.


Next up: Post 14 — Alerting That Doesn't Burn Out Your Team
