Foundations Series

#	Post	What it covers
00	Intro	What the Foundations pillar covers and why it matters
01	Availability ← you are here	Uptime, the nines, and why 99% isn't good enough
02	Reliability	Correctness over time — when uptime isn't enough
03	Latency vs Throughput vs Bandwidth	The three numbers that define system performance
04	ACID vs BASE	Two philosophies for handling data under pressure
05	CAP Theorem	The impossibility result every distributed system runs into
06	PACELC Theorem	What CAP doesn't tell you about latency
07	Consistency Models	The spectrum from "always correct" to "eventually correct"
08	Single Point of Failure	Why one weak link breaks the whole chain
09	High Availability vs Fault Tolerance	Similar goals, very different strategies
10	Wrap-up	How all nine concepts connect

Availability in System Design: What the Nines Actually Mean

The problem

It's 11:47 PM on Black Friday. Your e-commerce platform has been running smoothly all day, handling ten times normal traffic. Then one database node starts returning timeouts. Within four minutes, the checkout flow is broken for every user. By the time the on-call engineer wakes up, reads the alert, and deploys a fix, 23 minutes have passed.

Your system was "up" for 99.84% of November. That sounds impressive. It means your customers couldn't buy anything for roughly 69 minutes across the month (0.16% of 720 hours), including your most important 23 minutes of the year.

Availability isn't just a percentage. It's a promise — and the way you measure it determines whether you're actually keeping it.

The core idea

Availability is the proportion of time a system is operational and able to serve requests. It sounds simple. The hard part is that "operational" needs a definition (a system that returns a 500 error on every request is technically "up"), and that even small reductions in availability add up to serious downtime over a month or a year.

The analogy: a hospital emergency department

Think of your system like a hospital A&E. "Available" doesn't mean the building is standing — it means a patient who walks in right now gets assessed and treated. The hospital can be physically open while being effectively unavailable: all doctors on break, the triage system crashed, the pharmacy out of stock.

This is the distinction that matters in system design. Availability isn't about whether your servers are running. It's about whether a user who makes a request right now gets a useful response.

Like an A&E, you design for the worst-case arrival rate, not the average. And like an A&E, you build redundancy — multiple doctors, multiple wards — not because the first doctor will definitely fail, but because the cost of failure is too high to rely on one.

How it works

The nines

Availability is expressed as a percentage of uptime over a period, and the industry shorthand is "the nines":

Availability	Downtime per year	Downtime per month
99% ("two nines")	~3.65 days	~7.3 hours
99.9% ("three nines")	~8.77 hours	~43.8 minutes
99.99% ("four nines")	~52.6 minutes	~4.4 minutes
99.999% ("five nines")	~5.26 minutes	~26 seconds

The jump from 99% to 99.9% saves you about 3.3 days of downtime a year. The jump from 99.9% to 99.99% saves you roughly another eight hours. Each additional nine cuts the downtime budget by a factor of ten, and as a rule of thumb each one is much harder to achieve than the last.

Five nines — the gold standard for telecoms and critical infrastructure — means your entire year's allowable downtime is the length of a coffee break.
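To make the arithmetic behind the table concrete, here is a minimal Python sketch that converts an availability target into a downtime budget. It assumes a 365.25-day year and an average month of one twelfth of a year; the function name `downtime_seconds` is made up for this illustration.

```python
# Sketch: convert an availability target into a downtime budget.
# Assumption: a 365.25-day year, and a month that is 1/12 of that year.

SECONDS_PER_YEAR = 365.25 * 24 * 3600
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12


def downtime_seconds(availability_percent: float, period_seconds: float) -> float:
    """Return the downtime allowed in the period at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * period_seconds


for target in (99.0, 99.9, 99.99, 99.999):
    per_year = downtime_seconds(target, SECONDS_PER_YEAR)
    per_month = downtime_seconds(target, SECONDS_PER_MONTH)
    print(f"{target}%: {per_year / 3600:.2f} h/year, {per_month / 60:.1f} min/month")
```

Different calendar conventions (a 365-day year, a 30-day month) shift the figures slightly, which is why published "nines" tables rarely agree to the last digit.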

Measuring it correctly

The basic formula is:

Availability = Uptime / (Uptime + Downtime)

But the definition of "downtime" is where teams go wrong. A common mistake is only counting full outages — times when the system returned no response at all. A more honest measure includes:

Degraded responses: requests that succeeded but were too slow to be useful
Partial outages: the checkout works but search is broken
Error responses: the system is responding, but with 500s

Teams that measure availability honestly tend to have better systems. Teams that game the metric tend to have very available dashboards and very unhappy users.
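One way to apply this stricter definition is to measure availability per request rather than per minute of uptime. The sketch below is an illustration, not a standard: the `Request` type, the 2-second latency threshold, and the choice to treat only 5xx responses as failures are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to respond, in milliseconds


# Hypothetical threshold: slower responses count as unavailable.
LATENCY_THRESHOLD_MS = 2000.0


def is_good(req: Request) -> bool:
    """Good means no server error (5xx) and a response within the threshold."""
    return req.status < 500 and req.latency_ms <= LATENCY_THRESHOLD_MS


def availability(requests: list[Request]) -> float:
    """Fraction of requests that received a useful response."""
    if not requests:
        raise ValueError("no requests to measure")
    good = sum(1 for r in requests if is_good(r))
    return good / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 5400.0),  # succeeded, but too slow to be useful
    Request(500, 80.0),    # responded, but with an error
    Request(200, 95.0),
]
print(f"{availability(sample):.0%}")  # prints 50%
```

A full-outage-only measure would score this sample at 100%, because every request got a response. Partial outages can be handled the same way by computing the figure separately for each user-facing feature, such as checkout and search.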

Availability in series vs parallel

When components are in series (each request must pass through A, then B, then C), their availabilities multiply:

System availability = A × B × C
= 0.999 × 0.999 × 0.999
= 0.997  (99.7%)

Each component you add to a critical path reduces overall availability. A chain of five 99.9% components gives you only 99.5%.
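The series calculation can be sketched in a few lines of Python. The function name `series_availability` is made up for this example, and the multiplication assumes component failures are independent of each other.

```python
import math


def series_availability(components: list[float]) -> float:
    """Availability of a chain in which every component must work.

    Assumption: component failures are independent.
    """
    return math.prod(components)


three = series_availability([0.999] * 3)
five = series_availability([0.999] * 5)
print(f"three in series: {three:.4f}")
print(f"five in series:  {five:.4f}")
```

The same function works for mixed values, so one weak dependency on the critical path (say, a single 99% service) is easy to spot: it caps the whole chain below 99%.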

When components are in parallel (any one of them can serve the request), the math inverts in your favour:

System unavailability = (1 - A) × (1 - B)
= 0.001 × 0.001 = 0.000001

System availability = 1 - 0.000001 = 99.9999%

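Two 99.9% components running in parallel give you a combined 99.9999%, provided their failures are independent. A shared power supply, network, or bad deployment can take out both at once, and then the multiplication no longer holds. This is the core insight behind redundancy: run more than one instance of anything that matters.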
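The parallel case can be sketched the same way, by multiplying unavailabilities instead of availabilities. The function name `parallel_availability` is made up for this example, and the result is only valid under the assumption that replicas fail independently.

```python
def parallel_availability(components: list[float]) -> float:
    """Availability of a group in which any one component can serve the request.

    Assumption: failures are independent. Correlated failures (shared power,
    shared network, a bad deployment to every replica) make the real figure lower.
    """
    unavailability = 1.0
    for a in components:
        unavailability *= 1 - a
    return 1 - unavailability


two = parallel_availability([0.999, 0.999])
print(f"two in parallel: {two:.6f}")
```

In practice the figure from this formula is best read as an upper bound, since real replicas usually share at least some infrastructure.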

The tradeoffs

Higher availability costs money: in infrastructure, engineering time, and operational complexity. A system running two redundant nodes costs roughly twice as much in infrastructure as a single node. A system designed for five nines needs automated failover, health checks, chaos engineering, 24/7 on-call rotations, and runbooks for edge cases that may never happen.

The more subtle cost is consistency. When you run multiple nodes to increase availability, those nodes can temporarily disagree about the state of the world. You'll meet this tradeoff properly in the CAP Theorem post, but the preview is: high availability and strong consistency are genuinely in tension. You often can't have both — and pretending otherwise leads to bugs that are very hard to find.

There's also an availability/complexity curve. Past a certain point, adding another redundant node stops adding meaningful availability and starts adding operational complexity. At that point, your biggest availability risk isn't hardware failure; it's a botched deployment by an exhausted engineer at 2 AM.

When to care about which level

99% is usually not enough for anything customer-facing. Seven hours of downtime a month is enough to destroy trust and trigger SLA penalties.

99.9% is a reasonable baseline for most production web services. It allows about 44 minutes of downtime per month, enough for a planned maintenance window if you schedule carefully.

99.99% is the target for revenue-critical paths — checkout flows, authentication, payment processing. Getting here requires investment: multi-AZ deployments, automated failover, thorough health checks.

99.999% is for critical infrastructure — telecommunications, air traffic control, financial clearing systems. Achieving it usually means treating every single change as a potential outage and moving very slowly.

The right number is the one your users actually need, not the highest one you can defensibly claim. A developer tooling product with users in one timezone doesn't need five nines. A global payments API does.
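A quick way to sanity-check a target is to see how much of its monthly downtime allowance a single incident would consume. The sketch below uses the 23-minute outage from the opening scenario; the function names and the 30-day month are assumptions made for the illustration.

```python
def monthly_allowance_minutes(target_percent: float, days_in_month: int = 30) -> float:
    """Downtime allowed in one month at the given availability target."""
    return (1 - target_percent / 100) * days_in_month * 24 * 60


def remaining_minutes(
    target_percent: float,
    incident_minutes: list[float],
    days_in_month: int = 30,
) -> float:
    """Allowance left after the listed incidents; negative means the target was missed."""
    return monthly_allowance_minutes(target_percent, days_in_month) - sum(incident_minutes)


for target in (99.9, 99.99):
    left = remaining_minutes(target, [23.0])
    state = "within target" if left >= 0 else "target missed"
    print(f"{target}%: {left:.2f} min remaining ({state})")
```

Under these assumptions, one 23-minute incident uses more than half of a 99.9% month and overshoots a 99.99% month several times over, which is a useful reality check before committing to a number in an SLA.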

The one thing to remember

Availability is not binary. "The site is up" and "the site is down" are the two endpoints of a spectrum — most real incidents live in the middle, where the system is responding but not usefully. Design your metrics, your alerts, and your SLAs around the question users actually care about: can someone accomplish what they came here to do, right now?
