Availability in System Design: What the Nines Actually Mean

Foundations Series

| #  | Post | What it covers |
|----|------|----------------|
| 00 | Intro | What the Foundations pillar covers and why it matters |
| 01 | Availability ← you are here | Uptime, the nines, and why 99% isn't good enough |
| 02 | Reliability | Correctness over time: when uptime isn't enough |
| 03 | Latency vs Throughput vs Bandwidth | The three numbers that define system performance |
| 04 | ACID vs BASE | Two philosophies for handling data under pressure |
| 05 | CAP Theorem | The impossibility result every distributed system runs into |
| 06 | PACELC Theorem | What CAP doesn't tell you about latency |
| 07 | Consistency Models | The spectrum from "always correct" to "eventually correct" |
| 08 | Single Point of Failure | Why one weak link breaks the whole chain |
| 09 | High Availability vs Fault Tolerance | Similar goals, very different strategies |
| 10 | Wrap-up | How all nine concepts connect |

The problem

It's 11:47 PM on Black Friday. Your e-commerce platform has been running smoothly all day, handling ten times normal traffic. Then one database node starts returning timeouts. Within four minutes, the checkout flow is broken for every user. By the time the on-call engineer wakes up, reads the alert, and deploys a fix, 23 minutes have passed.

Your system was "up" for 99.84% of November. That sounds impressive. It means your customers couldn't buy anything for roughly 69 minutes across the month (0.16% of November's 720 hours), including your most important 23 minutes of the year.

Availability isn't just a percentage. It's a promise — and the way you measure it determines whether you're actually keeping it.


The core idea

Availability is the proportion of time a system is operational and able to serve requests. It sounds simple. The hard part is that "operational" needs a definition — a system that returns error 500 on every request is technically "up" — and that even tiny reductions in availability compound into serious downtime when you're running at scale.


The analogy: a hospital emergency department

Think of your system like a hospital A&E. "Available" doesn't mean the building is standing — it means a patient who walks in right now gets assessed and treated. The hospital can be physically open while being effectively unavailable: all doctors on break, the triage system crashed, the pharmacy out of stock.

This is the distinction that matters in system design. Availability isn't about whether your servers are running. It's about whether a user who makes a request right now gets a useful response.

Like an A&E, you design for the worst-case arrival rate, not the average. And like an A&E, you build redundancy — multiple doctors, multiple wards — not because the first doctor will definitely fail, but because the cost of failure is too high to rely on one.


How it works

The nines

Availability is expressed as a percentage of uptime over a period, and the industry shorthand is "the nines":

| Availability | Downtime per year | Downtime per month |
|--------------|-------------------|--------------------|
| 99% ("two nines") | ~3.65 days | ~7.3 hours |
| 99.9% ("three nines") | ~8.77 hours | ~43.8 minutes |
| 99.99% ("four nines") | ~52.6 minutes | ~4.4 minutes |
| 99.999% ("five nines") | ~5.26 minutes | ~26 seconds |

The jump from 99% to 99.9% saves you about 3.3 days of downtime a year. The jump from 99.9% to 99.99% saves you roughly another eight hours. Each additional nine cuts the allowed downtime by a factor of ten, and is typically far harder to achieve than the last.

Five nines — the gold standard for telecoms and critical infrastructure — means your entire year's allowable downtime is the length of a coffee break.
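The figures in the table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up for this example, and it assumes a 365.25-day year with a month taken as one twelfth of that, which is one common convention rather than the only one.

```python
# Downtime allowed by an availability target over a year and an average month.
# Assumption: a 365.25-day year; a "month" is one twelfth of that year.

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float, period_seconds: float) -> float:
    """Return the seconds of downtime permitted in the period at this availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_seconds


def describe(seconds: float) -> str:
    """Format a duration using the largest sensible unit."""
    if seconds >= 86_400:
        return f"{seconds / 86_400:.2f} days"
    if seconds >= 3_600:
        return f"{seconds / 3_600:.2f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.1f} minutes"
    return f"{seconds:.0f} seconds"


for target in (99.0, 99.9, 99.99, 99.999):
    per_year = allowed_downtime_seconds(target, SECONDS_PER_YEAR)
    per_month = allowed_downtime_seconds(target, SECONDS_PER_YEAR / 12)
    print(f"{target}%: {describe(per_year)} per year, {describe(per_month)} per month")
```

Swapping in a 365-day year or a 30-day month shifts the results slightly, which is why published "nines" tables often differ in the last digit.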

Measuring it correctly

The basic formula is:

Availability = Uptime / (Uptime + Downtime)

But the definition of "downtime" is where teams go wrong. A common mistake is only counting full outages — times when the system returned no response at all. A more honest measure includes:

  • Degraded responses: requests that succeeded but were too slow to be useful

  • Partial outages: the checkout works but search is broken

  • Error responses: the system is responding, but with 500s

Teams that measure availability honestly tend to have better systems. Teams that game the metric tend to have very available dashboards and very unhappy users.
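One way to make this concrete is to count good requests instead of minutes of uptime. This is a request-based proxy for the time-based formula above, not a replacement for it. The sketch below is hypothetical throughout: the `Request` record, the `is_good` rule, and the 1000 ms threshold are assumptions chosen for illustration, not a standard.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time taken to respond


def is_good(request: Request, slow_threshold_ms: float = 1000.0) -> bool:
    """A request is good only if it succeeded and was fast enough to be useful."""
    succeeded = request.status < 500
    fast_enough = request.latency_ms <= slow_threshold_ms
    return succeeded and fast_enough


def request_availability(requests: list[Request], slow_threshold_ms: float = 1000.0) -> float:
    """Fraction of requests that were good; 1.0 when there was no traffic."""
    if not requests:
        return 1.0
    good = sum(is_good(r, slow_threshold_ms) for r in requests)
    return good / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 4500.0),  # succeeded, but too slow to be useful
    Request(500, 80.0),    # responded, but with an error
    Request(200, 95.0),
]
print(f"{request_availability(sample):.0%}")  # prints "50%"
```

A check that only asked "did the server respond?" would report 100% for the same four requests. To surface partial outages, the same calculation can be run separately per user journey (checkout, search, and so on) rather than over all traffic at once.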

Availability in series vs parallel

When components are in series (each request must pass through A, then B, then C), their availabilities multiply:

System availability = A × B × C
= 0.999 × 0.999 × 0.999
= 0.997  (99.7%)

Each component you add to a critical path reduces overall availability. A chain of five 99.9% components gives you only 99.5%.
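The multiplication is easy to check in code. This is a minimal sketch; `series_availability` is a name invented for the example, and it assumes component failures are independent of one another.

```python
from math import prod


def series_availability(components: list[float]) -> float:
    """Availability of a chain where every component must work.

    Inputs are fractions (0.999), not percentages (99.9).
    Assumes failures are independent.
    """
    return prod(components)


print(f"{series_availability([0.999] * 3):.4f}")  # prints "0.9970"
print(f"{series_availability([0.999] * 5):.4f}")  # prints "0.9950"
```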

When components are in parallel (any one of them can serve the request), the math inverts in your favour:

System unavailability = (1 - A) × (1 - B)
= 0.001 × 0.001 = 0.000001

System availability = 1 - 0.000001 = 99.9999%

Two 99.9% components running in parallel give you a combined 99.9999%, provided their failures are independent and failover between them works reliably. This is the core insight behind redundancy: run more than one instance of anything that matters.
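The parallel case can be sketched the same way. `parallel_availability` is a hypothetical helper, and the result only holds under the assumption that replicas fail independently; replicas sharing a rack, a region, or a bad deploy will do worse than this figure.

```python
from math import prod


def parallel_availability(replicas: list[float]) -> float:
    """Availability when any one replica can serve the request.

    Inputs are fractions (0.999), not percentages (99.9).
    Assumes failures are independent.
    """
    all_down = prod(1 - a for a in replicas)  # probability every replica is down at once
    return 1 - all_down


print(f"{parallel_availability([0.999, 0.999]):.6f}")  # prints "0.999999"
```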


The tradeoffs

Higher availability costs money: in infrastructure, engineering time, and operational complexity. A system running two redundant nodes costs roughly twice as much to run. A system designed for five nines needs automated failover, health checks, chaos engineering, 24/7 on-call rotations, and runbooks for edge cases that may never happen.

The more subtle cost is consistency. When you run multiple nodes to increase availability, those nodes can temporarily disagree about the state of the world. You'll meet this tradeoff properly in the CAP Theorem post, but the preview is: high availability and strong consistency are genuinely in tension. You often can't have both — and pretending otherwise leads to bugs that are very hard to find.

There's also an availability/complexity curve. Past a certain point, adding another redundant node stops adding meaningful availability and starts adding operational complexity. At some point, your biggest availability risk isn't hardware failure; it's a botched deployment by an exhausted engineer at 2 AM.
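A rough model shows why extra nodes stop helping. The sketch below combines the series and parallel formulas from earlier: redundant nodes in parallel, in series with something they all share. Both numbers are assumptions picked for illustration. `SHARED` stands in for any common dependency (the deployment process, for instance), and the model ignores the extra complexity each node brings.

```python
def system_availability(node: float, nodes: int, shared: float) -> float:
    """Redundant nodes in parallel, in series with a dependency they all share."""
    redundant_tier = 1 - (1 - node) ** nodes
    return shared * redundant_tier


NODE = 0.999     # assumed availability of a single node
SHARED = 0.9995  # assumed availability of whatever every node depends on

for n in range(1, 5):
    print(n, f"{system_availability(NODE, n, SHARED):.6f}")
```

Under these assumptions the second node buys almost all of the available improvement. From the third node on, the result is pinned just below the shared dependency's 99.95%, so further effort is better spent on that dependency than on more replicas.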


When to care about which level

99% is usually not enough for anything customer-facing. Seven hours of downtime a month is enough to destroy trust and trigger SLA penalties.

99.9% is a reasonable baseline for most production web services. It allows about 44 minutes of downtime per month, enough for a planned maintenance window if you schedule carefully.

99.99% is the target for revenue-critical paths — checkout flows, authentication, payment processing. Getting here requires investment: multi-AZ deployments, automated failover, thorough health checks.

99.999% is for critical infrastructure — telecommunications, air traffic control, financial clearing systems. Achieving it usually means treating every single change as a potential outage and moving very slowly.

The right number is the one your users actually need, not the highest one you can defensibly claim. A developer tooling product with users in one timezone doesn't need five nines. A global payments API does.
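A quick way to sanity-check a target is to ask whether a realistic incident would fit inside it. The sketch below tests the 23-minute outage from the opening scenario against each level. The helper names are hypothetical, and the average-month length (one twelfth of a 365.25-day year) is an assumption.

```python
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12  # average month, an assumption


def monthly_allowance_minutes(target_percent: float) -> float:
    """Minutes of downtime a target permits in an average month."""
    return (1 - target_percent / 100) * MINUTES_PER_MONTH


def fits_within_target(incident_minutes: float, target_percent: float) -> bool:
    """True if a single incident of this length stays inside the monthly allowance."""
    return incident_minutes <= monthly_allowance_minutes(target_percent)


incident = 23  # minutes, the outage from the opening scenario
for target in (99.0, 99.9, 99.99, 99.999):
    verdict = "within" if fits_within_target(incident, target) else "exceeds"
    print(f"{target}%: {verdict} the monthly allowance")
```

A single 23-minute incident fits inside a 99% or 99.9% month but uses up several months' worth of allowance at 99.99%. That is one reason the higher targets depend on automated failover: they leave no room to wait for a human to respond.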


The one thing to remember

Availability is not binary. "The site is up" and "the site is down" are the two endpoints of a spectrum — most real incidents live in the middle, where the system is responding but not usefully. Design your metrics, your alerts, and your SLAs around the question users actually care about: can someone accomplish what they came here to do, right now?


Next up: Reliability — Availability tells you whether your system responds. Reliability tells you whether it responds correctly. They're related but not the same — and confusing them leads to systems that are always up and frequently wrong.

Systems Design

Part 11 of 50


Cloud Tuned