Foundations Series

#	Post	What it covers
00	Intro	What the Foundations pillar covers and why it matters
01	Availability	Uptime, the nines, and why 99% isn't good enough
02	Reliability	Correctness over time — when uptime isn't enough
03	Latency vs Throughput vs Bandwidth	The three numbers that define system performance
04	ACID vs BASE	Two philosophies for handling data under pressure
05	CAP Theorem	The impossibility result every distributed system runs into
06	PACELC Theorem	What CAP doesn't tell you about latency
07	Consistency Models	The spectrum from "always correct" to "eventually correct"
08	Single Point of Failure	Why one weak link breaks the whole chain
09	High Availability vs Fault Tolerance ← you are here	Similar goals, very different strategies
10	Wrap-up	How all nine concepts connect

High Availability vs Fault Tolerance: Similar Goals, Very Different Strategies

The problem

Two teams are presenting their resilience strategies to the same CTO.

Team A has built a system with automated health checks, multi-AZ deployments, and a tested failover runbook. When a node fails, the load balancer detects it within 10 seconds, routes traffic away, and a replacement node is provisioned within 90 seconds. Total user-facing disruption: roughly 10–15 seconds during the detection window.

Team B has built a system where every component runs as an active-active cluster with synchronised state. When a node fails, the remaining nodes absorb its traffic instantaneously. There is no detection window. There is no failover. There is no disruption — users never know anything happened.

Both teams solved the same problem. The cost difference between their solutions is significant. The right choice depends entirely on what "unacceptable downtime" means for their specific workload.

High availability and fault tolerance are not the same strategy with different marketing names. They're different engineering philosophies, built for different tolerances, at different cost points. Conflating them leads either to systems that stay down longer than they should or to systems that cost far more than the workload justifies.

The core idea

High availability (HA) is a design approach that minimises downtime through rapid detection and recovery. When a component fails, the system detects the failure quickly, routes around it, and restores normal operation — accepting that there is a brief interruption between failure and recovery.

Fault tolerance is a design approach that eliminates downtime entirely by running redundant components in parallel, synchronised in real time. When a component fails, the system continues operating without interruption — because another component was already running and ready to take over instantly.

The distinction hinges on one word: interruption. HA minimises it. Fault tolerance eliminates it.

The analogy: a hospital's power supply

A hospital is the clearest real-world illustration of both strategies coexisting for different systems within the same building.

High availability: the emergency generator. When mains power fails, the hospital's emergency generator kicks in after a 5–10 second gap. During that window, non-critical systems go dark. The generator was designed for this: it starts, stabilises, and restores power quickly. Most of the hospital doesn't miss a beat. But there is a gap.

Fault tolerance: the operating theatre's UPS. The operating theatre cannot tolerate even a 5-second power interruption. With a patient on bypass, a ventilator mid-cycle, or a surgeon mid-incision, a brief gap is potentially fatal. So the theatre runs on an Uninterruptible Power Supply that is always online and always carrying the load in parallel with mains power. When mains power fails, there is no gap because there is no switchover: the backup was already active.

Same hospital. Same goal of keeping the lights on. Two completely different strategies, applied based on how much interruption is tolerable for each system.

How high availability works

HA systems are built around three mechanisms: redundancy, detection, and failover.

Redundancy means running multiple instances of critical components — multiple application servers, multiple database replicas, multiple load balancers — so that the failure of any one instance doesn't eliminate the capability.

Detection means continuously monitoring component health so failures are identified quickly. Health checks, heartbeats, and watchdog processes reduce the window between a failure occurring and the system knowing about it. The faster the detection, the shorter the disruption.
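The detection step can be sketched as a small state tracker. This is a minimal illustration, not any particular load balancer's behaviour: the class name, the `threshold` parameter, and the 5-second check interval are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Marks a node unhealthy after `threshold` consecutive failed checks."""

    threshold: int = 3  # hypothetical: failed checks needed to declare failure
    consecutive_failures: int = 0
    healthy: bool = True

    def record(self, check_passed: bool) -> bool:
        """Record one health-check result and return the current health state."""
        if check_passed:
            # A single passing check resets the counter and restores the node.
            self.consecutive_failures = 0
            self.healthy = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.healthy = False
        return self.healthy


def worst_case_detection_seconds(interval_s: float, threshold: int) -> float:
    """Upper bound on detection time when the node dies just after a passing check.

    Ignores per-check timeouts, which would add to this figure.
    """
    return interval_s * threshold


print(worst_case_detection_seconds(interval_s=5, threshold=2))  # 10
```

Requiring several consecutive failures trades detection speed for fewer false alarms: a lower threshold or shorter interval shrinks the disruption window but makes a single dropped check more likely to trigger an unnecessary failover.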

Failover is the process of routing traffic or responsibility away from the failed component and onto a healthy one. Failover can be:

Automatic — the system detects failure and reroutes without human intervention. Faster, requires confidence in the detection mechanism.
Manual — an on-call engineer triggers failover after confirming the failure. Slower, provides a human sanity check before rerouting.
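Both failover modes come down to the same operation: stop sending traffic to the failed component. The sketch below assumes a simple round-robin router with hypothetical backend names; `mark` could be called by an automated health checker or by an engineer working through a runbook.

```python
from typing import Dict, List, Optional


class Router:
    """Round-robin over backends, skipping any that are marked unhealthy."""

    def __init__(self, backends: List[str]) -> None:
        self.health: Dict[str, bool] = {b: True for b in backends}
        self._order = list(backends)
        self._next = 0

    def mark(self, backend: str, healthy: bool) -> None:
        """Update a backend's state (automatic detector or manual operator)."""
        self.health[backend] = healthy

    def pick(self) -> Optional[str]:
        """Return the next healthy backend, or None if none are available."""
        for _ in range(len(self._order)):
            candidate = self._order[self._next]
            self._next = (self._next + 1) % len(self._order)
            if self.health[candidate]:
                return candidate
        return None


router = Router(["app-1", "app-2", "app-3"])  # hypothetical backend names
router.mark("app-2", False)                   # failover: route around app-2
print([router.pick() for _ in range(4)])      # ['app-1', 'app-3', 'app-1', 'app-3']
```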

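A typical HA architecture for a web service puts a load balancer in front of several application servers, backed by a primary database with a standby replica.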

The HA promise: when the primary database fails, the replica is promoted (automatically or manually), the application reconfigures its connection, and service resumes. Total downtime: the time from failure detection to failover completion — typically seconds to minutes depending on the implementation.

Where HA is measured: against two targets. Recovery Time Objective (RTO) is the maximum acceptable time the system is down after a failure. Recovery Point Objective (RPO) is the maximum acceptable amount of data lost, expressed as a window of time. HA systems target low RTO (seconds to minutes) and low RPO (seconds of data loss at most). They don't promise zero.
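To make the two measures concrete, here is a small sketch that computes what a single incident actually achieved, so it can be compared against the objectives. The function name and the incident timeline are hypothetical, and the calculation assumes that every write after the last replicated one is lost when the replica is promoted.

```python
from datetime import datetime, timedelta
from typing import Tuple


def measure_recovery(
    last_replicated_write: datetime,
    failure_at: datetime,
    service_restored_at: datetime,
) -> Tuple[timedelta, timedelta]:
    """Return (achieved recovery time, achieved recovery point) for one incident."""
    recovery_time = service_restored_at - failure_at
    # Writes accepted after the last replicated one are lost on promotion.
    recovery_point = max(failure_at - last_replicated_write, timedelta(0))
    return recovery_time, recovery_point


# Hypothetical incident timeline, for illustration only.
failure = datetime(2024, 1, 1, 12, 0, 0)
recovery_time, recovery_point = measure_recovery(
    last_replicated_write=failure - timedelta(seconds=2),
    failure_at=failure,
    service_restored_at=failure + timedelta(seconds=45),
)
print(recovery_time.total_seconds(), recovery_point.total_seconds())  # 45.0 2.0
```

In this example the service was down for 45 seconds and lost 2 seconds of writes. Whether that is acceptable depends on the RTO and RPO the team committed to.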

How fault tolerance works

Fault tolerance removes the recovery window entirely by ensuring the backup is always active, always synchronised, and always ready to take over without a switchover event.

The key architectural difference from HA: active-active vs active-passive.

HA typically uses active-passive: one component is active, one is on standby. When the active fails, the passive is promoted. There's a transition.

Fault tolerance uses active-active: multiple components handle every operation simultaneously. When one fails, the others continue without interruption — because they were already handling the load.

For this to work, all active nodes must be synchronised in real time. Every write must be applied to all nodes before being acknowledged. Every node must be capable of handling the full load independently.

Fault-tolerant database cluster:

  Write ──► Node A (active) ──┐
                              ├── All three confirm before ACK
  Write ──► Node B (active) ──┤
                              │
  Write ──► Node C (active) ──┘

  If Node A fails:
  ┌──────────────────────────────────────────────┐
  │  Node B and C continue serving all traffic   │
  │  No failover. No promotion. No interruption. │
  └──────────────────────────────────────────────┘
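The behaviour in the diagram can be sketched with an in-memory model. This is a toy under strong assumptions: the `Node` and `ActiveActiveCluster` classes are hypothetical, node failure is known instantly and perfectly, and there is no network. A real cluster also has to agree on which nodes are alive, which this sketch does not attempt.

```python
from typing import Dict, List, Optional


class Node:
    """One active replica holding a full copy of the data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self.data: Dict[str, str] = {}

    def apply(self, key: str, value: str) -> bool:
        """Apply a write and confirm it; a dead node confirms nothing."""
        if not self.alive:
            return False
        self.data[key] = value
        return True


class ActiveActiveCluster:
    def __init__(self, nodes: List[Node]) -> None:
        self.nodes = nodes

    def live_nodes(self) -> List[Node]:
        return [n for n in self.nodes if n.alive]

    def write(self, key: str, value: str) -> bool:
        """Acknowledge only after every live node has confirmed the write."""
        live = self.live_nodes()
        if not live:
            return False
        acks = [n.apply(key, value) for n in live]
        return all(acks)

    def read(self, key: str) -> Optional[str]:
        """Any live node can serve the read because all hold the same state."""
        live = self.live_nodes()
        if not live:
            raise RuntimeError("no live nodes")
        return live[0].data.get(key)


a, b, c = Node("A"), Node("B"), Node("C")
cluster = ActiveActiveCluster([a, b, c])
cluster.write("balance", "100")
a.alive = False                        # Node A fails
print(cluster.read("balance"))         # 100
print(cluster.write("balance", "90"))  # True
```

There is no promotion step anywhere in this code. After Node A fails, reads and writes simply continue against B and C, which already held the full state.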

The synchronous replication required for true fault tolerance is expensive: every write waits for confirmation from all nodes, so write latency is set by the slowest node in the cluster. Geographic distribution amplifies this cost significantly.
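A rough model of that cost, assuming the write is sent to all nodes in parallel and acknowledged when the last confirmation arrives. The round-trip figures are invented for illustration and are not measurements.

```python
from typing import Sequence


def sync_write_latency_ms(node_round_trips_ms: Sequence[float]) -> float:
    """Latency of a write that must be confirmed by every node.

    Assumes parallel fan-out, so the acknowledgement arrives when the
    slowest confirmation does.
    """
    if not node_round_trips_ms:
        raise ValueError("at least one node is required")
    return max(node_round_trips_ms)


# Hypothetical round-trip times in milliseconds.
same_datacentre = [0.8, 1.1, 0.9]
three_continents = [0.8, 75.0, 140.0]
print(sync_write_latency_ms(same_datacentre))   # 1.1
print(sync_write_latency_ms(three_continents))  # 140.0
```

Adding a fast node never helps, and adding a distant one always hurts, which is why geographic spread and low write latency pull in opposite directions.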

Where fault tolerance is used: aircraft flight control systems, financial clearing infrastructure, hospital life-support systems, telecommunications core networks, air traffic control. Environments where even a two-second gap causes irreversible harm.

Side by side

	High Availability	Fault Tolerance
Goal	Minimise downtime	Eliminate downtime
On failure	Detect, failover, recover	Continue without interruption
Interruption	Brief (seconds to minutes)	None
Redundancy model	Active-passive (typically)	Active-active
State synchronisation	Asynchronous replication	Synchronous replication
Cost	Moderate	High
Complexity	Moderate	High
Latency impact	Low	Higher (synchronous writes)
RTO	Seconds to minutes	Zero (no recovery needed)
RPO	Near-zero to seconds	Zero
Typical use cases	Most production web services	Life-critical, financial clearing, telecoms

The tradeoffs

Fault tolerance's synchronous replication tax. Every write waiting for all nodes to confirm adds latency. In a geographically distributed fault-tolerant cluster, this is the round-trip time to the furthest node — potentially hundreds of milliseconds. Systems that need both fault tolerance and low write latency must co-locate their nodes, sacrificing geographic redundancy.

HA's recovery window is often acceptable. For most user-facing web services, a 10–30 second failover window — while not ideal — doesn't cause irreversible harm. Users see an error, retry, and succeed. Designing for true fault tolerance to eliminate those 10 seconds costs significantly more than it returns for most workloads.
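A back-of-the-envelope calculation shows why. The failure rate below (one failover a month) is an assumption for the sake of the arithmetic, and the model counts only failover windows as downtime.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def annual_downtime_seconds(failovers_per_year: int, seconds_per_failover: float) -> float:
    """Total downtime per year if every failure costs one failover window."""
    return failovers_per_year * seconds_per_failover


def availability(failovers_per_year: int, seconds_per_failover: float) -> float:
    """Fraction of the year the service is up under the same assumption."""
    downtime = annual_downtime_seconds(failovers_per_year, seconds_per_failover)
    return 1.0 - downtime / SECONDS_PER_YEAR


# Hypothetical: one failover per month, each taking the full 30 seconds.
print(annual_downtime_seconds(12, 30))  # 360
print(f"{availability(12, 30):.4%}")    # 99.9989%
```

Even at the slow end of the failover range, a monthly failure adds up to six minutes of downtime a year. Fault tolerance would buy back those six minutes, and the question is whether they are worth the added cost.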

The operational complexity of active-active. Fault-tolerant active-active clusters must handle split-brain scenarios — where nodes lose connectivity to each other and each believes the others have failed. Without a correct resolution strategy, both partitions continue operating independently, diverging in state. Resolving this cleanly requires consensus protocols (Raft, Paxos) that add their own complexity and latency.
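The core rule behind those protocols is a majority quorum: a partition may keep accepting writes only if it can reach more than half of the cluster. The sketch below shows that rule in isolation. It is not an implementation of Raft or Paxos, and the function name is hypothetical.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True if this partition holds a strict majority of the cluster.

    `reachable_nodes` counts the local node plus every peer it can contact.
    """
    if cluster_size <= 0 or not 0 <= reachable_nodes <= cluster_size:
        raise ValueError("invalid node counts")
    return reachable_nodes > cluster_size // 2


# A five-node cluster splits into partitions of three and two.
print(has_quorum(3, 5))  # True: this side keeps serving writes
print(has_quorum(2, 5))  # False: this side must stop accepting writes

# A four-node cluster splits evenly: neither side has a majority.
print(has_quorum(2, 4))  # False
```

Because two disjoint partitions cannot both hold a strict majority, at most one side continues and state cannot diverge. The even split also shows why clusters are usually sized with an odd number of nodes.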

Testing fault tolerance is harder than testing HA. HA failover can be tested by killing a node and timing the recovery. True fault tolerance testing requires simulating simultaneous failures, network partitions, and partial degradation — and verifying that the system genuinely continues without interruption in all cases. This testing is expensive to build and maintain.

Choosing between them

The decision framework is simpler than the implementation:

Use HA when: your workload can tolerate a brief recovery window (seconds to low minutes), you need multi-region or multi-AZ resilience without the cost of synchronous replication, or you're building a user-facing web service where brief disruption is recoverable.

Use fault tolerance when: any interruption causes irreversible harm (a transaction mid-flight, a life-support system, an aircraft in operation), regulatory or contractual requirements mandate zero downtime, or the cost of downtime vastly exceeds the cost of fault-tolerant infrastructure.

Use both, layered: most large production systems apply HA at the application tier (where brief failovers are acceptable) and fault tolerance selectively at the data layer for the most critical write paths. The payment processing service might use synchronous replication to a standby for its transaction log while the recommendation engine runs standard HA replicas.

The one thing to remember

High availability and fault tolerance solve the same problem at different cost points and for different tolerances. HA asks: "how quickly can we recover?" Fault tolerance asks: "how do we never need to?" Before choosing, define your actual requirement: can your workload tolerate a 15-second recovery window? If yes — and for most workloads the answer is yes — HA is the right investment. Reserve fault tolerance for the systems where the honest answer is no.

← Previous: Single Point of Failure — identifying the components whose failure breaks everything

→ Next: Wrap-up — all nine Foundations concepts pulled together, showing how they connect and what they set up for the pillars ahead.
