Reliability in System Design: When Being Up Isn't Enough

Foundations Series
| # | Post | What it covers |
|---|---|---|
| 00 | Intro | What the Foundations pillar covers and why it matters |
| 01 | Availability | Uptime, the nines, and why 99% isn't good enough |
| 02 | Reliability ← you are here | Correctness over time — when uptime isn't enough |
| 03 | Latency vs Throughput vs Bandwidth | The three numbers that define system performance |
| 04 | ACID vs BASE | Two philosophies for handling data under pressure |
| 05 | CAP Theorem | The impossibility result every distributed system runs into |
| 06 | PACELC Theorem | What CAP doesn't tell you about latency |
| 07 | Consistency Models | The spectrum from "always correct" to "eventually correct" |
| 08 | Single Point of Failure | Why one weak link breaks the whole chain |
| 09 | High Availability vs Fault Tolerance | Similar goals, very different strategies |
| 10 | Wrap-up | How all nine concepts connect |
The problem
Imagine your bank's mobile app never crashes. It responds instantly, every time, 24 hours a day, 365 days a year. Your availability team is celebrating five nines.
Then your customers start noticing something: occasionally, a transfer goes through twice. Sometimes a balance is displayed in the wrong currency. Once in a while, a payment marked "failed" was actually debited.
The system is perfectly available. It is not reliable.
Availability gets the headlines because outages are visible — users complain immediately, the status page goes red, engineers scramble. Reliability failures are quieter and often more dangerous. Wrong data looks just like right data until someone checks.
The core idea
Reliability is the probability that a system performs its intended function correctly, consistently, over a given period of time, under expected conditions. Not just "is it responding?" but "is it doing the right thing?" — and doing it the same way on the ten-thousandth request as on the first.
A reliable system is correct, consistent, and predictable. When it can't fulfil a request correctly, it fails loudly rather than silently returning bad data.
The analogy: a mechanical watch vs a stopped clock
A stopped clock is always there to be read, yet it is right only twice a day. A cheap digital watch might keep imperfect time, drifting by a few minutes a month. A quality mechanical watch keeps near-perfect time for decades with regular servicing.
The stopped clock has zero reliability despite occasional correctness. The cheap watch has moderate reliability — it's right most of the time, but you can't trust it for anything precise. The mechanical watch is reliable: its behaviour is consistent, predictable, and correct within a known tolerance.
The analogy carries further: reliability doesn't mean perfect. It means consistently correct within defined tolerances, and it means the watch tells you when it needs servicing rather than silently drifting into uselessness.
In systems, that last part is critical. A reliable system fails visibly. A watch that silently shows the wrong time is more dangerous than one with a dead battery.
How it works
Defining correct
Before you can measure reliability, you have to define what "correct" means for your system. This sounds obvious. In practice, many teams ship systems with no formal specification of correctness — which means they have no way to detect reliability failures until customers report them.
Correctness definitions typically look like:
A payment transaction should debit and credit exactly once, or not at all
A search query should return results ranked by the same algorithm in the same order for the same inputs
A user's profile update should be reflected in every subsequent read of that profile
Each of these is a testable invariant. If your system violates one, that's a reliability failure — regardless of whether the system was "up" at the time.
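As a rough illustration of what "testable invariant" can mean in practice, the sketch below checks the payment invariant against a list of ledger entries. The `LedgerEntry` shape and the `find_invariant_violations` helper are hypothetical names invented for this example, not part of any particular ledger system.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LedgerEntry:
    # Hypothetical record shape; a real ledger will carry more fields.
    transaction_id: str
    kind: str  # "debit" or "credit"
    amount: Decimal


def find_invariant_violations(entries):
    """Return {transaction_id: reason} for each transaction that breaks the
    'debit and credit exactly once, or not at all' invariant."""
    by_txn = defaultdict(list)
    for entry in entries:
        by_txn[entry.transaction_id].append(entry)

    violations = {}
    for txn_id, txn_entries in by_txn.items():
        debits = [e for e in txn_entries if e.kind == "debit"]
        credits = [e for e in txn_entries if e.kind == "credit"]
        if len(debits) != 1 or len(credits) != 1:
            violations[txn_id] = (
                f"expected 1 debit and 1 credit, "
                f"found {len(debits)} and {len(credits)}"
            )
        elif debits[0].amount != credits[0].amount:
            violations[txn_id] = "debit and credit amounts differ"
    return violations


entries = [
    LedgerEntry("t1", "debit", Decimal("25.00")),
    LedgerEntry("t1", "credit", Decimal("25.00")),
    LedgerEntry("t2", "debit", Decimal("10.00")),  # debited twice
    LedgerEntry("t2", "debit", Decimal("10.00")),
    LedgerEntry("t2", "credit", Decimal("10.00")),
]
violations = find_invariant_violations(entries)
print(violations)
```

A transaction that never happened has no entries and so never appears in the result, which covers the "or not at all" half of the invariant. A check like this could run as a deployment test against fixtures or as a periodic job against production data.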
Mean Time Between Failures (MTBF)
The standard reliability metric is MTBF — the average time between failures in a system. A higher MTBF means the system fails less often.
MTBF = Total operational time / Number of failures
If a service runs for 10,000 hours and has 4 failures in that time:
MTBF = 10,000 / 4 = 2,500 hours between failures
MTBF pairs with Mean Time To Recovery (MTTR) — how long it takes to restore service after a failure. These two numbers together give you a real picture of system health:
| System | MTBF | MTTR | Reality |
|---|---|---|---|
| A | 500 hrs | 2 mins | Fails often, recovers fast — frustrating but manageable |
| B | 5,000 hrs | 4 hrs | Rarely fails, but when it does it's a serious incident |
| C | 500 hrs | 4 hrs | Fails often and recovers slowly; this is the one that ends careers |
| D | 5,000 hrs | 2 mins | Rarely fails, recovers instantly — what you're aiming for |
High availability work focuses on MTTR — detect and recover fast. Reliability work focuses on MTBF — make failures happen less often in the first place.
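To make the interplay concrete, here is a small sketch that computes MTBF from the worked example and then estimates availability for the four systems in the table. It assumes the common steady-state approximation, availability = MTBF / (MTBF + MTTR), which this post does not state itself; the function names are illustrative.

```python
def mtbf_hours(operational_hours, failures):
    """MTBF = total operational time / number of failures."""
    if failures <= 0:
        raise ValueError("MTBF is undefined without at least one observed failure")
    return operational_hours / failures


def steady_state_availability(mtbf, mttr):
    """Assumed approximation: fraction of time the system is up."""
    return mtbf / (mtbf + mttr)


print(mtbf_hours(10_000, 4))  # the worked example above: 2500.0 hours

# (MTBF in hours, MTTR in hours) for systems A-D from the table.
systems = {
    "A": (500, 2 / 60),
    "B": (5000, 4),
    "C": (500, 4),
    "D": (5000, 2 / 60),
}
availability = {
    name: steady_state_availability(mtbf, mttr)
    for name, (mtbf, mttr) in systems.items()
}
for name, value in availability.items():
    print(f"System {name}: {value:.4%}")
```

Under this approximation, System A comes out with higher availability than System B even though it fails ten times as often. That is one reason an availability figure on its own says little about how frequently users hit a failure.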
Failure modes: how systems become unreliable
Reliability failures come from predictable places:
Hardware degradation. Disks fail, memory errors accumulate, network cards flap. Hardware doesn't fail immediately; it degrades. Systems that don't account for this gradual degradation can silently produce wrong results long before they fully fail.
Software bugs under edge conditions. A function works correctly for 99.9% of inputs and produces subtly wrong results for the remaining 0.1%. In a system handling millions of requests per day, that's thousands of quietly incorrect responses every day.
Concurrency and race conditions. Two processes read the same value, both decide to update it, and one update silently overwrites the other. The system never crashes. The data is wrong.
Dependency failures. Your service is correct; the third-party API it calls returns malformed data. If your service doesn't validate and handle that case, it propagates the error downstream — reliably producing wrong results.
Configuration drift. A config change is applied to 7 of 8 nodes. The eighth node is subtly misconfigured. With traffic spread evenly, roughly 12.5% of requests are routed to it, and for those requests behaviour is different. Not broken enough to alert, wrong enough to matter.
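The lost-update race from the concurrency item can be reproduced deterministically without threads. The sketch below uses a hypothetical in-memory `VersionedStore` as a stand-in for a database row with a version column, and shows one common mitigation (optimistic concurrency with a version check). It is an illustration under those assumptions, not a prescription.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale read."""


class VersionedStore:
    # Hypothetical single-value store. In a real system the check-and-write
    # in compare_and_set must be atomic (a lock or a conditional UPDATE).
    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def blind_write(self, new_value):
        # Last writer wins: no check that the value is unchanged since the read.
        self.value = new_value
        self.version += 1

    def compare_and_set(self, new_value, expected_version):
        # Reject the write if someone else updated the value after our read.
        if self.version != expected_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {self.version}"
            )
        self.value = new_value
        self.version += 1


# Lost update: both processes read 100, then each applies its own deposit.
balance = VersionedStore(100)
a_value, _ = balance.read()
b_value, _ = balance.read()
balance.blind_write(a_value + 50)
balance.blind_write(b_value + 20)  # silently discards the +50
lost_update_result = balance.value

# Same interleaving with a version check: the second write fails loudly.
balance = VersionedStore(100)
a_value, a_version = balance.read()
b_value, b_version = balance.read()
balance.compare_and_set(a_value + 50, a_version)
try:
    balance.compare_and_set(b_value + 20, b_version)
    conflict_detected = False
except VersionConflict:
    conflict_detected = True
    # Retry from a fresh read so the +20 lands on top of the +50.
    fresh_value, fresh_version = balance.read()
    balance.compare_and_set(fresh_value + 20, fresh_version)

print(lost_update_result, conflict_detected, balance.value)
```

In the first run nothing crashes and the balance is simply wrong. In the second, the stale write is rejected and the caller gets a chance to retry with current data.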
Building for reliability
Define your invariants explicitly. Write down what correct behaviour means. Turn those definitions into automated tests that run on every deployment.
Validate at boundaries. Every input from an external system — user input, third-party API response, message queue payload — should be validated before it enters your system. Never trust data you didn't generate.
Fail loudly. A system that returns an error is more reliable than one that returns a wrong answer. When your system can't guarantee correctness, surfacing the error is the reliable choice. Silent failures are the enemy of reliability.
Test failure modes explicitly. Chaos engineering — deliberately introducing failures in controlled environments — finds reliability gaps that normal testing misses. The failure modes that hurt you in production are usually the ones nobody thought to test.
Monitor for correctness, not just uptime. Availability dashboards tell you the system is responding. Reliability monitoring checks whether responses are correct: comparing outputs against known-good results, tracking error rates by type, alerting on data inconsistencies.
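Two of these practices, validating at boundaries and failing loudly, fit in one short sketch. The payload shape, the `SUPPORTED_CURRENCIES` set, and the `parse_balance_response` function are assumptions made up for illustration. The point is only that malformed external data raises an error instead of being replaced by a default.

```python
from decimal import Decimal, InvalidOperation

SUPPORTED_CURRENCIES = {"USD", "EUR", "GBP"}  # assumption for this sketch


class InvalidPayload(ValueError):
    """Raised when external data fails validation at the boundary."""


def parse_balance_response(payload):
    """Validate a third-party response before it enters the system.

    Returns (amount, currency) or raises InvalidPayload. It never fills in
    a default for a missing or malformed field.
    """
    if not isinstance(payload, dict):
        raise InvalidPayload("payload must be a JSON object")

    currency = payload.get("currency")
    if not isinstance(currency, str) or currency not in SUPPORTED_CURRENCIES:
        raise InvalidPayload(f"unsupported or missing currency: {currency!r}")

    raw_amount = payload.get("amount")
    if not isinstance(raw_amount, str):
        raise InvalidPayload("amount must be a decimal string")
    try:
        amount = Decimal(raw_amount)
    except InvalidOperation:
        raise InvalidPayload(f"amount is not a decimal: {raw_amount!r}") from None
    if not amount.is_finite():
        raise InvalidPayload(f"amount must be finite: {raw_amount!r}")

    return amount, currency


print(parse_balance_response({"amount": "12.50", "currency": "EUR"}))

try:
    parse_balance_response({"amount": "12.50"})  # currency missing
except InvalidPayload as exc:
    print("rejected:", exc)
```

The tempting alternative is to fall back to a default currency when the field is missing. That is exactly how a balance ends up displayed in the wrong currency without any error being logged.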
The tradeoffs
Reliability vs speed of development. Rigorous invariant testing, boundary validation, and failure mode coverage take time to build and maintain. Teams under delivery pressure often deprioritise this work — and build up a quiet debt of reliability failures they won't discover until production.
Reliability vs performance. Validation has a cost. Checking every input, logging every operation, running consistency checks on writes — these all add latency. Systems that need extreme performance sometimes relax reliability checks on non-critical paths. The key word is sometimes, and the decision should always be deliberate.
Reliability vs availability. This is the tension worth sitting with. A system that returns errors when it can't guarantee correctness has lower availability than one that always returns something. But the something it returns might be wrong. For financial systems, medical records, and anything else where incorrect data causes real-world harm, lower availability with higher reliability is the right tradeoff. You'd rather a user see "service unavailable" than the wrong account balance.
When reliability matters most
Financial systems. Double charges, missed credits, incorrect balances — these have direct real-world consequences and often legal liability. Reliability is non-negotiable.
Healthcare. Incorrect dosage calculations, missing allergy records, wrong patient data — reliability failures here can cause direct physical harm.
Anywhere data is the product. Analytics platforms, reporting tools, recommendation engines — if the data is wrong, the product is wrong, even if the system never goes down.
Distributed systems under eventual consistency. When nodes can temporarily disagree, the window between a write and its propagation is a reliability risk. Any system that reads during that window might read stale data. Knowing this and designing around it is a reliability concern, not just a consistency one.
The one thing to remember
A system that fails loudly is more reliable than one that fails silently. Returning an error is honest. Returning wrong data is a lie your system tells with confidence. When you can't guarantee correctness, surfacing the failure is the reliable choice — and the one your users, your on-call team, and your data integrity will thank you for.
← Previous: Availability — whether your system responds at all
→ Next: Latency vs Throughput vs Bandwidth — reliability tells you if the response is correct; these three numbers tell you how fast it arrives and how many you can handle. They're often confused and rarely all optimised at once.




