DORA Metrics: The Four Numbers That Tell You Whether Your Engineering Is Actually Getting Better

Series: The Modern SDLC · Post 17 of 17 ← Post 16: Platform Engineering and FinOps · Post 0: Series Overview →

Most engineering teams have a vague sense of whether they're getting better or worse at delivering software. Sprints feel faster or slower. Incidents feel more or less frequent. Deployments feel more or less stressful. But "feels like" is a poor instrument for making improvement decisions. You can't tell whether a change to your process actually helped, whether a problem is getting better or worse, or where to focus your improvement effort next.

DORA metrics are the closest thing the industry has to an objective instrument for measuring delivery performance. They emerged from a decade of research by the DevOps Research and Assessment programme — now part of Google — surveying tens of thousands of engineers and organisations across industries and sizes. The finding that makes them significant: four specific metrics reliably predict not just technical delivery performance but organisational performance, including profitability, market share, and customer satisfaction.

The teams that score well on all four don't do so by focusing on the metrics. They do so by building the practices covered throughout this series: small batches, automated quality gates, reliable pipelines, good observability, fast incident response, continuous improvement. The metrics are the instrument, not the goal.

This is the last post in the series. It's also, in some ways, the one that ties all the others together.

The one thing to remember

DORA metrics are a diagnostic tool, not a scorecard. They tell you where your process has a constraint. They don't tell you what to do about it — the rest of this series does that.

The four metrics

Deploy frequency: how often you ship to production

What it measures: how often changes reach production. It matters because it is the forcing function for everything else. Small, frequent deployments reduce the size of each change, reduce the risk per deploy, and make the system of delivery practices work properly. Low deploy frequency is almost always a symptom of multiple other problems: long-lived branches, large PRs, slow CI, a deployment process that requires manual coordination, or cultural fear of the deployment process itself.

The bands:

Elite: multiple times per day
High: between once per day and once per week
Medium: between once per week and once per month
Low: less than once per month
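
As a rough illustration, the bands above can be turned into a small classifier. This is a sketch, not DORA's own method: the function names are hypothetical, a month is assumed to be 30 days, and "multiple times per day" is read as more than one deploy per day on average.

```python
from datetime import date


def deploys_per_day(deploy_dates: list[date], period_days: int) -> float:
    """Average production deployments per calendar day over the period."""
    return len(deploy_dates) / period_days


def deploy_frequency_band(per_day: float) -> str:
    """Map an average deploy rate onto the four bands listed above.

    Assumptions: 30-day month; "multiple times per day" means an
    average of more than one deploy per day.
    """
    if per_day > 1:
        return "elite"
    if per_day >= 1 / 7:    # at least once per week
        return "high"
    if per_day >= 1 / 30:   # at least once per month
        return "medium"
    return "low"


# Hypothetical data: 12 deployments in a 30-day window.
january_deploys = [date(2024, 1, day) for day in range(1, 13)]
band = deploy_frequency_band(deploys_per_day(january_deploys, 30))
print(band)
```

Twelve deploys in thirty days averages 0.4 per day, which lands in the "high" band under these assumptions.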

What moves it: trunk-based development (Post 3) eliminates the integration delay of long-lived branches. CI pipeline speed (Post 7) reduces the feedback cycle between writing and verifying code. Small PRs produce small deployments. Feature flags (Post 5 and Post 11) decouple deployment from release, removing the pressure to batch changes before deploying. A team that moves from weekly to daily deployment usually doesn't change their tooling — they change their batching behaviour.

The common misreading: treating low deploy frequency as a sign that the team is being careful. In practice, the opposite is true. Teams that deploy infrequently accumulate large batches of changes, which are harder to debug when something goes wrong, harder to roll back, and more likely to interact in unexpected ways. High-frequency deployment is safer than low-frequency deployment, not riskier.

Lead time for changes: how long from commit to production

What it measures: the efficiency of the entire delivery pipeline — from code written to value in users' hands. Long lead time means slow feedback on whether a change works in production, slow response to bugs, and slow delivery of new capabilities to users. It's the metric that most directly represents the velocity users and business stakeholders experience.

The bands:

Elite: less than one hour
High: between one day and one week
Medium: between one week and one month
Low: more than six months

What moves it: the three biggest contributors to long lead time are slow CI (Post 7), PR review queues (Post 5), and manual deployment steps (Post 11). A PR that waits two days for review contributes two days to lead time regardless of how fast everything else is. CI that takes forty-five minutes contributes forty-five minutes to every change. Manual approval processes that require coordination with a release manager add hours or days. Each of these has a specific fix: parallelised and cached CI, a team norm for review response time, and automated deployment pipelines.
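
To see why the review queue usually dominates, it can help to add the stages up for a single change. The sketch below uses the two-day review wait and forty-five-minute CI run from the paragraph above; the eight-hour approval wait is an assumed value for illustration only.

```python
from datetime import timedelta

# Stage durations for one hypothetical change. Review wait and CI time
# come from the examples in the text; the approval wait is assumed.
stages = {
    "review_wait": timedelta(days=2),
    "ci": timedelta(minutes=45),
    "manual_approval": timedelta(hours=8),
}

total = sum(stages.values(), timedelta())
bottleneck = max(stages, key=lambda name: stages[name])
share = stages[bottleneck] / total  # timedelta / timedelta gives a float

print(f"total lead time: {total}")
print(f"largest contributor: {bottleneck} ({share:.0%} of the total)")
```

With these numbers the review wait accounts for roughly 85% of the lead time, so halving CI time would barely move the total.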

The measurement: lead time is measured from the first commit in a PR to when that PR's changes reach production. Not from when the PR was opened, not from when the ticket was created — from the first code change. Tools like LinearB, Swarmia, and GitHub's own engineering insights calculate this automatically from repository and deployment data.
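
If you want to sanity-check what a tool reports, the calculation itself is simple. The sketch below assumes you can export, for each change, the timestamp of the first commit in the PR and the timestamp of the production deploy; the records are made up, and using the median rather than the mean is a choice made here so that one stuck PR doesn't dominate the number.

```python
from datetime import datetime
from statistics import median

# Hypothetical records: (first commit in the PR, production deploy time).
changes = [
    (datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 12, 0)),   # 3 hours
    (datetime(2024, 3, 4, 14, 0), datetime(2024, 3, 5, 16, 0)),  # 26 hours
    (datetime(2024, 3, 5, 10, 0), datetime(2024, 3, 7, 12, 0)),  # 50 hours
]


def lead_time_hours(first_commit_at: datetime, deployed_at: datetime) -> float:
    """Hours from the first code change to the change reaching production."""
    return (deployed_at - first_commit_at).total_seconds() / 3600


lead_times = [lead_time_hours(commit, deploy) for commit, deploy in changes]
median_lead_time = median(lead_times)
print(f"median lead time: {median_lead_time:.1f} hours")
```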

Change failure rate: what percentage of deployments cause incidents

What it measures: the quality of what you ship. A deployment that requires a hotfix or a rollback, or that produces a production incident, counts as a failure. The change failure rate is the fraction of deployments that result in one of these outcomes.

The bands:

Elite: 0–5%
High: 5–10%
Medium: 10–15%
Low: over 15%

What moves it: test coverage — specifically integration tests against real dependencies (Post 6) — is the strongest predictor of low change failure rate. Progressive delivery (Post 11) limits blast radius when failures occur: a canary that affects 1% of traffic and automatically rolls back is a change failure with minimal user impact. Feature flags provide an instant kill-switch. Code review quality (Post 5) catches logic errors before they reach production.

The critical relationship with deploy frequency: when deploy frequency increases, change failure rate often increases temporarily — you're shipping more often, and some of those ships will have problems. This is expected and acceptable. What's not acceptable is a sustained rise in change failure rate as frequency increases, which signals that the quality gates aren't keeping pace. Watch both metrics together. If deploy frequency doubles and change failure rate stays flat or improves, the quality practices are working. If it doubles and change failure rate doubles too, the quality gates need work.
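
The "watch both metrics together" check can be written down directly. This is a minimal sketch with hypothetical names and numbers; it assumes the two periods are the same length, so raw deployment counts are comparable.

```python
from dataclasses import dataclass


@dataclass
class Period:
    deployments: int
    failed_deployments: int  # needed a hotfix or rollback, or caused an incident

    @property
    def change_failure_rate(self) -> float:
        if self.deployments == 0:
            return 0.0
        return self.failed_deployments / self.deployments


def quality_gates_keeping_pace(before: Period, after: Period) -> bool:
    """True when frequency rose and the failure rate stayed flat or improved."""
    return (
        after.deployments > before.deployments
        and after.change_failure_rate <= before.change_failure_rate
    )


last_quarter = Period(deployments=20, failed_deployments=2)    # 10%
healthy = Period(deployments=40, failed_deployments=4)         # still 10%
degraded = Period(deployments=40, failed_deployments=8)        # 20%

print(quality_gates_keeping_pace(last_quarter, healthy))
print(quality_gates_keeping_pace(last_quarter, degraded))
```

The first comparison passes because frequency doubled while the rate held at 10%. The second fails because the rate doubled along with the frequency.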

Mean time to restore (MTTR): how long to recover from incidents

What it measures: resilience — the combination of how quickly problems are detected, how effectively the team responds, and how fast service is restored. Low MTTR means the system fails gracefully and recovers quickly. High MTTR means failures have extended impact.

The bands:

Elite: less than one hour
High: less than one day
Medium: between one day and one week
Low: more than six months
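
For reference, the number itself is just the average incident duration. The sketch below assumes each incident is recorded as a start time and a restore time; the two incidents are invented for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average of (restored_at - started_at) across incidents."""
    durations = [restored - started for started, restored in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents: (started_at, restored_at).
incidents = [
    (datetime(2024, 5, 2, 10, 0), datetime(2024, 5, 2, 10, 30)),   # 30 minutes
    (datetime(2024, 5, 9, 14, 0), datetime(2024, 5, 9, 15, 30)),   # 90 minutes
]

mttr = mean_time_to_restore(incidents)
print(f"MTTR: {mttr}")
```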

What moves it: observability maturity (Post 13) is the strongest lever — a team with metrics, structured logs, and distributed traces finds root cause five to ten times faster than a team without them. Fast rollback capability (Post 11) means mitigation takes minutes rather than hours. Good runbooks (Post 14) mean the on-call engineer has a documented response procedure rather than starting from first principles at 2am. Feature flags mean the first mitigation step often doesn't require a deployment at all.

The asymmetry worth noting: MTTR improvements are often faster to achieve than deploy frequency improvements. You can instrument a service and write runbooks in a week. Moving from monthly to daily deployment requires process and cultural changes that take months. For teams with high MTTR and low deploy frequency, investing in observability and incident response first produces faster returns.

The fifth metric: reliability

The 2021 DORA report added a fifth metric: operational performance and reliability, specifically whether teams consistently meet their SLO targets.

This addition was deliberate. The four original metrics could theoretically all score "elite" on a team that ships fast, ships often, recovers quickly, and ships rarely broken — but whose system is chronically unreliable due to architectural issues, infrastructure instability, or dependency problems unrelated to recent deployments. The reliability metric captures what the others miss: does the system actually work for users on an ongoing basis?

Elite teams achieve high scores on all five simultaneously. Reliability and speed are not a trade-off in elite teams — they're correlated. Teams that deploy frequently with high-quality automated gates tend to be more reliable than teams that deploy rarely with manual processes, because small changes are easier to verify, easier to roll back, and less likely to cause complex interactions.
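
One way to make "consistently meet their SLO targets" concrete is to count the periods in which the target was met. The sketch below assumes a request-based availability SLO of 99.9% evaluated weekly; the function name, target, and request counts are all hypothetical.

```python
def slo_met(good_requests: int, total_requests: int, target: float) -> bool:
    """True when the fraction of good requests reaches the SLO target."""
    return total_requests > 0 and good_requests / total_requests >= target


# Hypothetical weekly counts: (good requests, total requests).
weekly_counts = [
    (999_600, 1_000_000),  # 99.96%
    (998_700, 1_000_000),  # 99.87%
    (999_950, 1_000_000),  # 99.995%
]
target = 0.999

met = [slo_met(good, total, target) for good, total in weekly_counts]
attainment = sum(met) / len(met)
print(f"SLO met in {sum(met)} of {len(met)} weeks ({attainment:.0%})")
```

Here the SLO is met in two of three weeks. Note that none of the inputs mention deployments, which is the point of the metric: it captures unreliability the other four would miss.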

What moves each metric: the highest-leverage practices

DORA research identifies specific technical and cultural capabilities that most strongly predict high performance. These capabilities aren't opinions — they're correlations from surveying tens of thousands of teams over a decade. The practices that appear consistently across high-performing teams:

Trunk-based development is the single strongest predictor of high deploy frequency. Everything else that contributes — small PRs, fast integration, no merge hell — follows from the discipline of integrating to main frequently.

Comprehensive automated testing is the strongest predictor of low change failure rate. Not just high coverage numbers — tests that actually catch real bugs before production, particularly integration tests against real dependencies.

Loosely coupled architecture — the ability to deploy a service independently of other services — enables high deploy frequency and low change failure rate simultaneously. Tightly coupled systems require coordinated deployments, which are large and risky.

Good monitoring and observability is the strongest predictor of low MTTR. You can't fix what you can't see.

Version control for all production artefacts — application code, infrastructure code, configuration — enables reliable, repeatable deployments and fast, confident rollback.

Continuous integration where every commit triggers a build and test run catches problems early, when they're cheap to fix.

The common thread across all of these: they reduce batch size and feedback loop length. Smaller batches of change are easier to verify, easier to roll back, and easier to debug when they fail. Shorter feedback loops catch problems earlier, when they're cheaper to fix. Every high-performing practice in DORA research reduces batch size, reduces feedback loop length, or both.

Retrospectives that produce change

DORA metrics tell you there's a problem. Retrospectives are where the team decides what to do about it. Post 4 covered the mechanics of retrospectives — this section focuses on the conditions that make them produce lasting change rather than temporary conversation.

Start with data, not feelings. Open the retrospective with the metrics. "Our lead time increased from 2 days to 4 days this sprint — what happened?" is a different conversation than "how did this sprint feel?" Data anchors the discussion in what actually changed and produces more specific, actionable conclusions.

Name the constraint. Every delivery system has one binding constraint at any given time — the thing that, if improved, would most improve overall throughput. PR review time, CI pipeline speed, deployment approval process, on-call load, technical debt in a critical path — one of these is the current bottleneck. Identifying it is more valuable than generating a general list of improvements.

Follow through on last time's actions first. The most demoralising retro experience is raising the same issue sprint after sprint because last sprint's action wasn't completed. Open every retrospective by reviewing last time's action items. Were they done? Did they help? If not, why not? This single habit transforms the retrospective from a venting session into an improvement engine.

Limit to three actions. Ten action items from a retro get zero done. Three action items, each assigned to a named person with a due date and a ticket, get two done. The constraint forces prioritisation — which is the actual skill. The remaining items go to a visible backlog, reviewed at the next retro.

Rotate the facilitator. The same person facilitating every retrospective creates a dynamic where the discussion gravitates toward their concerns and their framing. Rotating through the team surfaces different perspectives and prevents the retrospective from becoming a leadership update session with audience participation.

Tech debt: managing it as first-class work

Technical debt accumulates in every codebase. The teams that manage it well don't have less of it — they have better visibility into it and better mechanisms for paying it down consistently. The teams that manage it poorly let it compound until it's the dominant force on their velocity.

Visibility first. Tech debt that isn't documented doesn't get prioritised. Every known piece of tech debt is a ticket in the backlog — with a description of the problem, the cost of not fixing it (slow builds, frequent bugs, difficult onboarding, unreliable deployments), and an estimate. Visible in the same backlog as product features. Prioritised by the same criteria. Never in a separate "debt tracker" that nobody opens.

Reserved capacity — not a debt sprint. The "debt sprint" — setting aside an entire sprint to address technical debt — is the most common tech debt management strategy and the least effective. It signals that debt is not normal engineering work, which makes it easy to cancel under product pressure. The better approach: reserve 20% of sprint capacity for tech debt, engineering excellence, and platform improvements in every sprint. One day per engineer per week. Visible in the sprint plan. Protected from product pressure. This is how debt gets paid down continuously rather than in bursts that never come.

The boy scout rule. Leave the code cleaner than you found it. Every PR includes a small improvement to the surrounding code — a renamed variable, a clarified comment, an extracted function, a test for an untested code path. Debt paid down in tiny increments on every PR, invisibly, continuously. The aggregate effect over a year is significant.

The cost of deferral. Tech debt has carrying cost. Code that's harder to understand takes longer to change. Systems that are harder to test produce more bugs. Infrastructure that's harder to operate creates more incidents. Document this cost when you log a debt item: "this authentication module takes three times longer to modify than it should, contributing approximately two hours of extra work per sprint." Making the cost concrete makes the prioritisation case.
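
Once the carrying cost is written down, the prioritisation case becomes simple arithmetic. The sketch below takes the two hours per sprint from the example above; the sixteen-hour fix estimate and the function name are assumptions for illustration.

```python
import math


def sprints_to_break_even(
    fix_estimate_hours: float, carrying_cost_hours_per_sprint: float
) -> int:
    """Number of sprints after which the fix has paid for itself."""
    return math.ceil(fix_estimate_hours / carrying_cost_hours_per_sprint)


# Assumed 16-hour fix against the 2 hours per sprint quoted in the debt ticket.
payback = sprints_to_break_even(
    fix_estimate_hours=16, carrying_cost_hours_per_sprint=2
)
print(f"fix pays for itself after {payback} sprints")
```

A debt item that pays back in eight sprints is an easy conversation with a product owner; one that pays back in eighty probably belongs further down the backlog.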

Using DORA as a diagnostic tool, not a scorecard

The most important thing to understand about DORA metrics is how not to use them.

Don't compare teams. Comparing deploy frequency across teams invites gaming (trivial deployments to inflate the number) and ignores context. A team maintaining a ten-year-old monolith and a team building greenfield microservices have different structural constraints. DORA metrics benchmark a team against itself over time, not against other teams.

Don't set DORA targets in performance reviews. Once DORA metrics appear in an engineer's performance review, they stop measuring real performance and start measuring the engineer's ability to influence the metric. Deploy frequency gets inflated with meaningless deployments. Change failure rate gets suppressed by under-reporting incidents. The signal disappears.

Don't look at one metric in isolation. A team can achieve elite deploy frequency by deploying trivial changes constantly while shipping nothing meaningful. A team can achieve elite change failure rate by deploying so rarely that every release is exhaustively vetted. The four metrics constrain each other — optimising one at the expense of others produces worse overall outcomes than balanced improvement across all four.

Do track trends, not snapshots. A single month's DORA numbers are noisy. Three months of consistent improvement in lead time is a meaningful signal. Six months of rising change failure rate is a serious concern. Track monthly, review quarterly, act on trends.
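
A simple way to separate a trend from noise is to require every recent month-over-month change to point the same way. The sketch below is one possible definition, not a standard one: it treats "three months of consistent improvement" as three consecutive month-over-month changes in the same direction, and the monthly values are invented.

```python
def consistent_trend(monthly_values: list[float], months: int = 3) -> str:
    """Classify the last `months` month-over-month changes."""
    recent = monthly_values[-(months + 1):]
    if len(recent) < months + 1:
        return "insufficient data"
    deltas = [later - earlier for earlier, later in zip(recent, recent[1:])]
    if all(delta < 0 for delta in deltas):
        return "falling"
    if all(delta > 0 for delta in deltas):
        return "rising"
    return "noisy"


# Hypothetical monthly medians.
lead_time_days = [4.0, 3.5, 3.1, 2.6]
change_failure_rate = [0.08, 0.09, 0.11, 0.12, 0.13, 0.15, 0.16]

print(consistent_trend(lead_time_days))                   # lead time trend
print(consistent_trend(change_failure_rate, months=6))    # failure rate trend
```

Whether "falling" is good news depends on the metric: falling lead time is an improvement, while a failure rate that has risen six months in a row is the serious concern described above.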

Do use them to start conversations. "Our MTTR increased from thirty minutes to ninety minutes over the last quarter — what changed?" is a productive engineering discussion that leads somewhere. "MTTR is too high" is not. The metrics create the context for honest, specific conversations about what's working and what needs to change.

The compound effect: why the series is a system

This is the last post in the series, which makes it the right place to say something that's been implicit throughout: these seventeen practices are not independent. They compound.

Teams that do trunk-based development (Post 3) have short-lived branches. Short-lived branches produce small PRs. Small PRs get fast reviews. Fast reviews reduce lead time. Reduced lead time enables higher deploy frequency. Higher deploy frequency means smaller batches. Smaller batches have lower change failure rates. Lower change failure rates build confidence. Confidence makes deployment feel safe. Deployment feeling safe enables higher frequency. The virtuous cycle accelerates.

The inverse is equally true. Long-lived branches produce large PRs. Large PRs get slow, shallow reviews. Slow reviews delay merges. Delayed merges increase lead time. Long lead times encourage batching. Large batches have higher change failure rates. High change failure rates create fear. Fear reduces deploy frequency. Low deploy frequency produces large batches. The vicious cycle accelerates.

The implication: where you start matters less than the direction you're moving. You don't need to implement all seventeen practices simultaneously. You need to find the current binding constraint — the one thing that, if improved, would move your DORA metrics most — and fix that. Then find the next constraint. Then the next.

That's the practice of continuous improvement. Not a destination. Not a state you reach. A direction you maintain.

If you do one thing from this post — and from this series

Take the DORA DevOps Quick Check. It's a free survey at dora.dev that takes ten minutes and places your team in an elite/high/medium/low band across all four metrics.

Do it with your team, not alone. The conversations that happen while answering the questions are as valuable as the result — they surface different perceptions of the same process, expose assumptions that were invisible because they were shared, and produce a shared understanding of where the team actually is rather than where it thinks it is.

The result tells you where to start. The series tells you what to do when you get there.

The series: complete

This is Post 17 — the end of the series. We've covered the entire modern software development lifecycle, from the first conversation about a problem worth solving to the metrics that tell you whether you're building and delivering well.

A brief map of where everything connects:

The conception phase (Post 1) produces a project brief that scopes the work and defines success. Architecture decisions (Post 2) shape what's possible for the entire life of the system. The developer toolchain (Post 3) and agile practices (Post 4) determine how fast the team can move. Development practices (Post 5) and testing strategy (Post 6) determine the quality of what's built. The CI pipeline (Post 7) and DevSecOps (Post 8) are the quality gates that ensure only good things proceed. Infrastructure as Code (Post 9) and containers (Post 10) define how the system runs. Continuous delivery and GitOps (Post 11) and release management (Post 12) define how it gets there. Observability (Post 13), alerting (Post 14), and incident management (Post 15) define what happens when things go wrong. Platform engineering and FinOps (Post 16) ensure the whole system is sustainable. And DORA metrics (Post 17) tell you whether any of it is actually getting better.

None of these phases stands alone. Each one produces outputs that the next depends on, and each one creates feedback that flows back to the previous ones. Production incidents improve observability requirements. Observability gaps change alerting design. Alerting experience improves deployment practices. Deployment experience changes architecture decisions. Architecture decisions shape what testing is possible. Testing results change development practices.

It's a loop. The goal is to make the loop faster, tighter, and more honest over time.

That's modern software development. Not a process to follow — a system to improve.

Thanks for reading the series. If it's been useful, the best thing you can do is share a specific post with one engineer on your team who'd benefit from it. These ideas compound when a whole team holds them.
