Reliability in System Design: When Being Up Isn't Enough


Foundations Series

# | Post | What it covers
00 | Intro | What the Foundations pillar covers and why it matters
01 | Availability | Uptime, the nines, and why 99% isn't good enough
02 | Reliability ← you are here | Correctness over time, when uptime isn't enough
03 | Latency vs Throughput vs Bandwidth | The three numbers that define system performance
04 | ACID vs BASE | Two philosophies for handling data under pressure
05 | CAP Theorem | The impossibility result every distributed system runs into
06 | PACELC Theorem | What CAP doesn't tell you about latency
07 | Consistency Models | The spectrum from "always correct" to "eventually correct"
08 | Single Point of Failure | Why one weak link breaks the whole chain
09 | High Availability vs Fault Tolerance | Similar goals, very different strategies
10 | Wrap-up | How all nine concepts connect


The problem

Imagine your bank's mobile app never crashes. It responds instantly, every time, 24 hours a day, 365 days a year. Your availability team is celebrating five nines.

Then your customers start noticing something: occasionally, a transfer goes through twice. Sometimes a balance is displayed in the wrong currency. Once in a while, a payment marked "failed" was actually debited.

The system is perfectly available. It is not reliable.

Availability gets the headlines because outages are visible — users complain immediately, the status page goes red, engineers scramble. Reliability failures are quieter and often more dangerous. Wrong data looks just like right data until someone checks.


The core idea

Reliability is the probability that a system performs its intended function correctly, consistently, over a given period of time, under expected conditions. Not just "is it responding?" but "is it doing the right thing?" — and doing it the same way on the ten-thousandth request as on the first.

A reliable system is correct, consistent, and predictable. When it can't fulfil a request correctly, it fails loudly rather than silently returning bad data.


The analogy: a mechanical watch vs a stopped clock

A stopped clock always shows a time, and it's right twice a day. A cheap digital watch might keep imperfect time, drifting by a few minutes a month. A quality mechanical watch keeps near-perfect time for decades with regular servicing.

The stopped clock has zero reliability despite occasional correctness. The cheap watch has moderate reliability — it's right most of the time, but you can't trust it for anything precise. The mechanical watch is reliable: its behaviour is consistent, predictable, and correct within a known tolerance.

The analogy carries further: reliability doesn't mean perfect. It means consistently correct within defined tolerances, and it means the watch tells you when it needs servicing rather than silently drifting into uselessness.

In systems, that last part is critical. A reliable system fails visibly. A watch that silently shows the wrong time is more dangerous than one with a dead battery.


How it works

Defining correct

Before you can measure reliability, you have to define what "correct" means for your system. This sounds obvious. In practice, many teams ship systems with no formal specification of correctness — which means they have no way to detect reliability failures until customers report them.

Correctness definitions typically look like:

  • A payment transaction should debit and credit exactly once, or not at all

  • A search query should return results ranked by the same algorithm in the same order for the same inputs

  • A user's profile update should be reflected in every subsequent read of that profile

Each of these is a testable invariant. If your system violates one, that's a reliability failure — regardless of whether the system was "up" at the time.
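
As a minimal sketch of what "testable invariant" can mean in practice, the check below encodes the first rule (debit and credit exactly once, or not at all) against a list of ledger entries. The `LedgerEntry` shape and the `exactly_once_violations` helper are hypothetical names chosen for illustration, not part of any particular payment system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEntry:
    # Hypothetical record shape, assumed for this example only.
    transaction_id: str
    kind: str  # "debit" or "credit"
    amount_cents: int


def exactly_once_violations(entries):
    """Return IDs of transactions that break 'exactly once, or not at all'."""
    counts = Counter((e.transaction_id, e.kind) for e in entries)
    tx_ids = {e.transaction_id for e in entries}
    # A transaction that appears at all must have exactly one debit
    # and exactly one credit. Transactions with no entries are fine.
    return sorted(
        tx
        for tx in tx_ids
        if counts[(tx, "debit")] != 1 or counts[(tx, "credit")] != 1
    )


entries = [
    LedgerEntry("tx-1", "debit", 500),
    LedgerEntry("tx-1", "credit", 500),
    LedgerEntry("tx-2", "debit", 900),   # debited, never credited
    LedgerEntry("tx-3", "debit", 250),
    LedgerEntry("tx-3", "debit", 250),   # debited twice
    LedgerEntry("tx-3", "credit", 250),
]
violations = exactly_once_violations(entries)
print(violations)
```

A check like this can run as a deployment test against fixtures or as a periodic job against real data. Either way, a violation is flagged whether or not the service was "up" when it happened.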

Mean Time Between Failures (MTBF)

The standard reliability metric is MTBF — the average time between failures in a system. A higher MTBF means the system fails less often.

MTBF = Total operational time / Number of failures

If a service runs for 10,000 hours and has 4 failures in that time:

MTBF = 10,000 / 4 = 2,500 hours between failures

MTBF pairs with Mean Time To Recovery (MTTR) — how long it takes to restore service after a failure. These two numbers together give you a real picture of system health:

System | MTBF | MTTR | Reality
A | 500 hrs | 2 mins | Fails often, recovers fast: frustrating but manageable
B | 5,000 hrs | 4 hrs | Rarely fails, but when it does it's a serious incident
C | 500 hrs | 4 hrs | Fails often and recovers slowly: this is the one that ends careers
D | 5,000 hrs | 2 mins | Rarely fails, recovers almost instantly: what you're aiming for

High availability work focuses on MTTR — detect and recover fast. Reliability work focuses on MTBF — make failures happen less often in the first place.
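
To see how the two numbers interact, the sketch below reproduces the MTBF arithmetic and then applies the common steady-state approximation, availability = MTBF / (MTBF + MTTR), to the four systems in the table. That formula is a standard textbook approximation rather than something stated in this post, and the function names are illustrative.

```python
def mtbf(total_operational_hours, failures):
    """Mean time between failures, in hours."""
    return total_operational_hours / failures


def steady_state_availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is up, assuming steady-state behaviour."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# The worked example from the text: 10,000 hours, 4 failures.
print(mtbf(10_000, 4))  # 2500.0

# (MTBF hours, MTTR hours) for the four systems in the table.
systems = {
    "A": (500, 2 / 60),
    "B": (5_000, 4),
    "C": (500, 4),
    "D": (5_000, 2 / 60),
}
for name, (mtbf_hours, mttr_hours) in systems.items():
    availability = steady_state_availability(mtbf_hours, mttr_hours)
    print(f"System {name}: {availability:.3%}")
```

Under this approximation, system A comes out slightly more available than system B (about 99.993% versus 99.920%) even though it fails ten times as often. An availability figure alone hides failure frequency, which is why MTBF is worth tracking separately.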

Failure modes: how systems become unreliable

Reliability failures come from predictable places:

Hardware degradation. Disks fail, memory errors accumulate, network cards flap. Hardware rarely fails all at once; it degrades. Systems that don't detect gradual degradation can silently produce wrong results long before they fail outright.

Software bugs under edge conditions. A function works correctly for 99.9% of inputs and produces subtly wrong results for the remaining 0.1%. In a system handling millions of requests a day, that's thousands of quietly incorrect responses per day.

Concurrency and race conditions. Two processes read the same value, both decide to update it, and one update silently overwrites the other. The system never crashes. The data is wrong.
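
The lost update described above can be replayed deterministically without threads. The sketch below simulates the interleaving by hand, then shows one common mitigation, a version check (optimistic concurrency), that turns the silent overwrite into a visible conflict. `VersionedRecord` is a hypothetical in-memory stand-in. In a real concurrent system the compare-and-write step would also need to be atomic, for example via a conditional update in the database.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale read."""


class VersionedRecord:
    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def blind_write(self, new_value):
        # Last writer wins: nothing checks what the writer originally read.
        self.value = new_value
        self.version += 1

    def checked_write(self, new_value, expected_version):
        # Refuse the write if another writer got there first.
        if self.version != expected_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {self.version}"
            )
        self.value = new_value
        self.version += 1


# Lost update: both writers read 100, then each applies its own deposit.
balance = VersionedRecord(100)
read_a, _ = balance.read()
read_b, _ = balance.read()
balance.blind_write(read_a + 50)
balance.blind_write(read_b + 30)    # silently discards the +50
lost_update_result = balance.value  # 130, although 180 was intended

# Same interleaving with version checks: the second write fails loudly.
balance = VersionedRecord(100)
read_a, ver_a = balance.read()
read_b, ver_b = balance.read()
balance.checked_write(read_a + 50, ver_a)
try:
    balance.checked_write(read_b + 30, ver_b)
    conflict_detected = False
except VersionConflict:
    conflict_detected = True
```

The second writer now has to re-read and retry, which costs a round trip but keeps the data correct.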

Dependency failures. Your service is correct; the third-party API it calls returns malformed data. If your service doesn't validate and handle that case, it propagates the error downstream — reliably producing wrong results.

Configuration drift. A config change is applied to 7 of 8 nodes, leaving the eighth subtly misconfigured. If traffic is balanced evenly, roughly 12.5% of requests are routed to that node and see different behaviour. Not broken enough to alert, wrong enough to matter.
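
One way to catch drift like this is to fingerprint each node's effective configuration and flag any node that disagrees with the majority. The sketch below assumes configs are available as plain dictionaries. The node names, keys, and helper functions are made up for illustration.

```python
import hashlib
import json
from collections import Counter


def config_fingerprint(config):
    """Hash a config dict in a key-order-independent way."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drifted_nodes(node_configs):
    """Return nodes whose config differs from the most common one."""
    fingerprints = {
        node: config_fingerprint(cfg) for node, cfg in node_configs.items()
    }
    majority, _ = Counter(fingerprints.values()).most_common(1)[0]
    return sorted(node for node, fp in fingerprints.items() if fp != majority)


# Seven nodes received the change; the eighth still has an old timeout.
expected = {"timeout_ms": 200, "retries": 3}
node_configs = {f"node-{i}": dict(expected) for i in range(1, 8)}
node_configs["node-8"] = {"timeout_ms": 2000, "retries": 3}

drifted = find_drifted_nodes(node_configs)
print(drifted)
```

Comparing against the majority is a simplification. If the intended config is known, comparing each node against that directly avoids the case where most of the fleet is wrong.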

Building for reliability

Define your invariants explicitly. Write down what correct behaviour means. Turn those definitions into automated tests that run on every deployment.

Validate at boundaries. Every input from an external system — user input, third-party API response, message queue payload — should be validated before it enters your system. Never trust data you didn't generate.
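
A boundary check can be as small as a parsing function that either returns a well-formed value or raises. The sketch below validates a payment payload from an external source. The field names, the `ALLOWED_CURRENCIES` set, and the `InvalidPayload` exception are assumptions for the example, not a real API.

```python
class InvalidPayload(ValueError):
    """Raised when external data does not match what we expect."""


ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # hypothetical allow-list


def parse_payment(payload):
    """Return a clean payment dict, or raise InvalidPayload."""
    if not isinstance(payload, dict):
        raise InvalidPayload("payload must be an object")

    amount = payload.get("amount_cents")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise InvalidPayload(f"amount_cents must be a positive integer, got {amount!r}")

    currency = payload.get("currency")
    if not isinstance(currency, str) or currency not in ALLOWED_CURRENCIES:
        raise InvalidPayload(f"unsupported currency: {currency!r}")

    # Only the validated fields cross the boundary; extras are dropped.
    return {"amount_cents": amount, "currency": currency}


good = parse_payment({"amount_cents": 1250, "currency": "EUR", "extra": "ignored"})

bad_payloads = [
    {"amount_cents": "1250", "currency": "EUR"},  # amount sent as a string
    {"amount_cents": -5, "currency": "EUR"},      # negative amount
    {"amount_cents": 1250, "currency": "eur"},    # wrong case
    {"amount_cents": 1250},                       # missing currency
    "not even an object",
]
rejected = []
for payload in bad_payloads:
    try:
        parse_payment(payload)
    except InvalidPayload:
        rejected.append(payload)
```

Every malformed payload is stopped at the edge, so code further in can assume the shape it receives is the shape it expects.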

Fail loudly. A system that returns an error is more reliable than one that returns a wrong answer. When your system can't guarantee correctness, surfacing the error is the reliable choice. Silent failures are the enemy of reliability.
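
The difference between a silent and a loud failure is often a single default value. The sketch below contrasts two versions of a currency conversion helper. The rates are made-up illustrative numbers and both function names are hypothetical.

```python
class RateUnavailable(LookupError):
    """Raised when no exchange rate is known for a currency."""


# Made-up illustrative rates, not real market data.
RATES_TO_USD = {"EUR": 1.10, "GBP": 1.25}


def to_usd_silent(amount, currency):
    # Anti-pattern: an unknown currency falls back to a rate of 1.0,
    # producing a plausible-looking but wrong number.
    return amount * RATES_TO_USD.get(currency, 1.0)


def to_usd_loud(amount, currency):
    # An unknown currency is surfaced as an explicit error.
    try:
        rate = RATES_TO_USD[currency]
    except KeyError:
        raise RateUnavailable(f"no USD rate for {currency!r}") from None
    return amount * rate


silent_result = to_usd_silent(100, "JPY")  # 100.0, which looks fine and is wrong

try:
    to_usd_loud(100, "JPY")
    failed_loudly = False
except RateUnavailable:
    failed_loudly = True
```

The silent version never shows up in an error dashboard. The loud version does, and the caller gets to decide what to show the user instead.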

Test failure modes explicitly. Chaos engineering — deliberately introducing failures in controlled environments — finds reliability gaps that normal testing misses. The failure modes that hurt you in production are usually the ones nobody thought to test.
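
Real chaos engineering usually injects faults at the infrastructure level, but the same idea works at unit-test scale. The sketch below wraps a dependency so that chosen calls fail, then checks that a retrying caller either recovers or gives up with an explicit error. `FaultInjector` and `fetch_with_retry` are hypothetical helpers written for this example.

```python
class FaultInjector:
    """Wraps a callable and raises ConnectionError on chosen calls (1-based)."""

    def __init__(self, func, fail_on_calls):
        self._func = func
        self._fail_on = set(fail_on_calls)
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls in self._fail_on:
            raise ConnectionError(f"injected failure on call {self.calls}")
        return self._func(*args, **kwargs)


def fetch_with_retry(fetch, key, attempts=3):
    """Try fetch(key) up to `attempts` times, then fail explicitly."""
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(key)
        except ConnectionError as err:
            last_error = err
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


store = {"user:1": "alice"}

# Two injected failures: the third attempt succeeds.
flaky = FaultInjector(store.__getitem__, fail_on_calls={1, 2})
recovered = fetch_with_retry(flaky, "user:1")

# Three injected failures: the caller must give up loudly, not return junk.
always_down = FaultInjector(store.__getitem__, fail_on_calls={1, 2, 3})
try:
    fetch_with_retry(always_down, "user:1")
    gave_up = False
except RuntimeError:
    gave_up = True
```

Making the failures deterministic keeps the test repeatable. Randomised injection finds more surprises but is harder to debug when it does.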

Monitor for correctness, not just uptime. Availability dashboards tell you the system is responding. Reliability monitoring checks whether responses are correct: comparing outputs against known-good results, tracking error rates by type, alerting on data inconsistencies.
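
One simple form of correctness monitoring is reconciliation: recompute a value from its source of truth and compare it with what the system is serving. The sketch below recomputes account balances from a ledger and reports any mismatch. The data shapes and the `reconcile_balances` name are assumptions for illustration.

```python
def reconcile_balances(stored_balances, ledger):
    """Compare stored balances with balances recomputed from the ledger.

    stored_balances: dict of account -> balance in cents
    ledger: iterable of (account, delta_cents) tuples
    Returns a dict of account -> {"stored": ..., "expected": ...}
    for every account where the two disagree.
    """
    computed = {}
    for account, delta_cents in ledger:
        computed[account] = computed.get(account, 0) + delta_cents

    mismatches = {}
    for account in sorted(set(stored_balances) | set(computed)):
        stored = stored_balances.get(account, 0)
        expected = computed.get(account, 0)
        if stored != expected:
            mismatches[account] = {"stored": stored, "expected": expected}
    return mismatches


ledger = [
    ("acct-1", 1000),
    ("acct-1", -250),
    ("acct-2", 400),
    ("acct-2", 400),  # second deposit never reached the stored balance
]
stored_balances = {"acct-1": 750, "acct-2": 400}

mismatches = reconcile_balances(stored_balances, ledger)
print(mismatches)
```

A job like this can run on a schedule and alert when `mismatches` is non-empty, which catches the kind of failure an uptime check never sees.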


The tradeoffs

Reliability vs speed of development. Rigorous invariant testing, boundary validation, and failure mode coverage take time to build and maintain. Teams under delivery pressure often deprioritise this work — and build up a quiet debt of reliability failures they won't discover until production.

Reliability vs performance. Validation has a cost. Checking every input, logging every operation, running consistency checks on writes — these all add latency. Systems that need extreme performance sometimes relax reliability checks on non-critical paths. The key word is sometimes, and the decision should always be deliberate.

Reliability vs availability. This is the tension worth sitting with. A system that returns errors when it can't guarantee correctness has lower availability than one that always returns something. But the something it returns might be wrong. For financial systems, medical records, and anything else where incorrect data causes real-world harm, lower availability with higher reliability is the right tradeoff. You'd rather a user see "service unavailable" than the wrong account balance.


When reliability matters most

Financial systems. Double charges, missed credits, incorrect balances — these have direct real-world consequences and often legal liability. Reliability is non-negotiable.

Healthcare. Incorrect dosage calculations, missing allergy records, wrong patient data — reliability failures here can cause direct physical harm.

Anywhere data is the product. Analytics platforms, reporting tools, recommendation engines — if the data is wrong, the product is wrong, even if the system never goes down.

Distributed systems under eventual consistency. When nodes can temporarily disagree, the window between a write and its propagation is a reliability risk. Any system that reads during that window might read stale data. Knowing this and designing around it is a reliability concern, not just a consistency one.
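
The stale-read window can be made concrete with a toy model. The sketch below simulates a primary and a lagging replica in memory, then shows one possible way to design around the window: the writer keeps the version its write produced and asks the replica for at least that version. `ReplicatedStore` is a hypothetical single-process simulation, not a real replication protocol.

```python
class StaleRead(Exception):
    """Raised when a replica has not yet caught up to the required version."""


class ReplicatedStore:
    def __init__(self):
        self.primary = {}   # key -> (value, version)
        self.replica = {}   # key -> (value, version)
        self._pending = []  # writes not yet applied to the replica

    def write(self, key, value):
        version = self.primary.get(key, (None, 0))[1] + 1
        self.primary[key] = (value, version)
        self._pending.append((key, value, version))
        return version

    def replicate(self):
        # Simulates replication catching up after some delay.
        for key, value, version in self._pending:
            self.replica[key] = (value, version)
        self._pending.clear()

    def read_replica(self, key, min_version=0):
        value, version = self.replica.get(key, (None, 0))
        if version < min_version:
            raise StaleRead(f"{key}: replica at v{version}, need v{min_version}")
        return value


store = ReplicatedStore()
store.write("profile:42", "old name")
store.replicate()

# A new write lands on the primary but has not propagated yet.
version = store.write("profile:42", "new name")
stale_value = store.read_replica("profile:42")  # returns "old name"

# Asking for the version we just wrote makes the staleness visible.
try:
    store.read_replica("profile:42", min_version=version)
    stale_detected = False
except StaleRead:
    stale_detected = True

store.replicate()
fresh_value = store.read_replica("profile:42", min_version=version)
```

What to do on `StaleRead` is a design choice: retry, fall back to the primary, or return an error. Any of those is more reliable than serving the old value as if it were current.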


The one thing to remember

A system that fails loudly is more reliable than one that fails silently. Returning an error is honest. Returning wrong data is a lie your system tells with confidence. When you can't guarantee correctness, surfacing the failure is the reliable choice — and the one your users, your on-call team, and your data integrity will thank you for.


← Previous: Availability — whether your system responds at all

→ Next: Latency vs Throughput vs Bandwidth — reliability tells you if the response is correct; these three numbers tell you how fast it arrives and how many you can handle. They're often confused and rarely all optimised at once.
