Availability in System Design: What the Nines Actually Mean

Foundations Series
| # | Post | What it covers |
|---|---|---|
| 00 | Intro | What the Foundations pillar covers and why it matters |
| 01 | Availability ← you are here | Uptime, the nines, and why 99% isn't good enough |
| 02 | Reliability | Correctness over time — when uptime isn't enough |
| 03 | Latency vs Throughput vs Bandwidth | The three numbers that define system performance |
| 04 | ACID vs BASE | Two philosophies for handling data under pressure |
| 05 | CAP Theorem | The impossibility result every distributed system runs into |
| 06 | PACELC Theorem | What CAP doesn't tell you about latency |
| 07 | Consistency Models | The spectrum from "always correct" to "eventually correct" |
| 08 | Single Point of Failure | Why one weak link breaks the whole chain |
| 09 | High Availability vs Fault Tolerance | Similar goals, very different strategies |
| 10 | Wrap-up | How all nine concepts connect |
Availability in System Design: What the Nines Actually Mean
The problem
It's 11:47 PM on Black Friday. Your e-commerce platform has been running smoothly all day, handling ten times normal traffic. Then one database node starts returning timeouts. Within four minutes, the checkout flow is broken for every user. By the time the on-call engineer wakes up, reads the alert, and deploys a fix, 23 minutes have passed.
Your system was "up" for 99.84% of November. That sounds impressive. It means your customers couldn't buy anything for roughly 12 hours across the month — including your most important 23 minutes of the year.
Availability isn't just a percentage. It's a promise — and the way you measure it determines whether you're actually keeping it.
The core idea
Availability is the proportion of time a system is operational and able to serve requests. It sounds simple. The hard part is that "operational" needs a definition — a system that returns error 500 on every request is technically "up" — and that even tiny reductions in availability compound into serious downtime when you're running at scale.
The analogy: a hospital emergency department
Think of your system like a hospital A&E. "Available" doesn't mean the building is standing — it means a patient who walks in right now gets assessed and treated. The hospital can be physically open while being effectively unavailable: all doctors on break, the triage system crashed, the pharmacy out of stock.
This is the distinction that matters in system design. Availability isn't about whether your servers are running. It's about whether a user who makes a request right now gets a useful response.
Like an A&E, you design for the worst-case arrival rate, not the average. And like an A&E, you build redundancy — multiple doctors, multiple wards — not because the first doctor will definitely fail, but because the cost of failure is too high to rely on one.
How it works
The nines
Availability is expressed as a percentage of uptime over a period, and the industry shorthand is "the nines":
| Availability | Downtime per year | Downtime per month |
|---|---|---|
| 99% ("two nines") | ~3.65 days | ~7.3 hours |
| 99.9% ("three nines") | ~8.77 hours | ~43.8 minutes |
| 99.99% ("four nines") | ~52.6 minutes | ~4.4 minutes |
| 99.999% ("five nines") | ~5.26 minutes | ~26 seconds |
The jump from 99% to 99.9% saves you three and a half days of downtime a year. The jump from 99.9% to 99.99% saves you another eight hours. Each additional nine is roughly ten times harder to achieve than the last.
Five nines — the gold standard for telecoms and critical infrastructure — means your entire year's allowable downtime is the length of a coffee break.
Measuring it correctly
The basic formula is:
Availability = Uptime / (Uptime + Downtime)
But the definition of "downtime" is where teams go wrong. A common mistake is only counting full outages — times when the system returned no response at all. A more honest measure includes:
Degraded responses: requests that succeeded but were too slow to be useful
Partial outages: the checkout works but search is broken
Error responses: the system is responding, but with 500s
Teams that measure availability honestly tend to have better systems. Teams that game the metric tend to have very available dashboards and very unhappy users.
Availability in series vs parallel
When components are in series (each request must pass through A, then B, then C), their availabilities multiply:
System availability = A × B × C
= 0.999 × 0.999 × 0.999
= 0.997 (99.7%)
Each component you add to a critical path reduces overall availability. A chain of five 99.9% components gives you only 99.5%.
When components are in parallel (any one of them can serve the request), the math inverts in your favour:
System unavailability = (1 - A) × (1 - B)
= 0.001 × 0.001 = 0.000001
System availability = 1 - 0.000001 = 99.9999%
Two 99.9% components running in parallel give you a combined 99.9999%. This is the core insight behind redundancy: run more than one instance of anything that matters.
The tradeoffs
Higher availability costs money — in infrastructure, engineering time, and operational complexity. A system running two redundant nodes costs twice as much to run. A system designed for five nines needs automated failover, health checks, chaos engineering, 24/7 on-call rotations, and runbooks for edge cases that may never happen.
The more subtle cost is consistency. When you run multiple nodes to increase availability, those nodes can temporarily disagree about the state of the world. You'll meet this tradeoff properly in the CAP Theorem post, but the preview is: high availability and strong consistency are genuinely in tension. You often can't have both — and pretending otherwise leads to bugs that are very hard to find.
There's also an availability/complexity curve. Adding a third redundant node past a certain point stops adding meaningful availability and starts adding operational complexity. At some point, your biggest availability risk isn't hardware failure — it's a botched deployment by an exhausted engineer at 2 AM.
When to care about which level
99% is usually not enough for anything customer-facing. Seven hours of downtime a month is enough to destroy trust and trigger SLA penalties.
99.9% is a reasonable baseline for most production web services. It allows about 45 minutes of downtime per month — enough for a planned maintenance window if you schedule carefully.
99.99% is the target for revenue-critical paths — checkout flows, authentication, payment processing. Getting here requires investment: multi-AZ deployments, automated failover, thorough health checks.
99.999% is for critical infrastructure — telecommunications, air traffic control, financial clearing systems. Achieving it usually means treating every single change as a potential outage and moving very slowly.
The right number is the one your users actually need, not the highest one you can defensibly claim. A developer tooling product with users in one timezone doesn't need five nines. A global payments API does.
The one thing to remember
Availability is not binary. "The site is up" and "the site is down" are the two endpoints of a spectrum — most real incidents live in the middle, where the system is responding but not usefully. Design your metrics, your alerts, and your SLAs around the question users actually care about: can someone accomplish what they came here to do, right now?
Further reading
← No previous post — this is the start of the Foundations series
Next up: Reliability — Availability tells you whether your system responds. Reliability tells you whether it responds correctly. They're related but not the same — and confusing them leads to systems that are always up and frequently wrong.




