Foundations Series

#	Post	What it covers
00	Intro	What the Foundations pillar covers and why it matters
01	Availability	Uptime, the nines, and why 99% isn't good enough
02	Reliability	Correctness over time — when uptime isn't enough
03	Latency vs Throughput vs Bandwidth	The three numbers that define system performance
04	ACID vs BASE	Two philosophies for handling data under pressure
05	CAP Theorem	The impossibility result every distributed system runs into
06	PACELC Theorem	What CAP doesn't tell you about latency
07	Consistency Models	The spectrum from "always correct" to "eventually correct"
08	Single Point of Failure ← you are here	Why one weak link breaks the whole chain
09	High Availability vs Fault Tolerance	Similar goals, very different strategies
10	Wrap-up	How all nine concepts connect

Single Point of Failure: Why One Weak Link Breaks the Whole Chain

The problem

It's 3am on a Saturday. Your on-call engineer gets paged: the entire platform is down. They try to SSH into the primary database server to investigate. They can't: that server is also the bastion host through which all SSH access is routed. The bastion is down because it shares the same network interface as the primary database. The primary database is down because the SSD it runs on failed.

One SSD. The entire platform — application servers, APIs, admin tooling, the ability to even begin diagnosing the problem — all of it offline because of one physical disk.

Nobody designed it this way. It accumulated. A shortcut here, a "we'll fix that later" there, and gradually the architecture developed a hidden spine: a single component that everything depended on, that nobody was watching, that had no fallback.

This is a single point of failure. And the most dangerous ones are always the ones you didn't know were there.

The core idea

A single point of failure (SPOF) is any component in a system whose failure causes the entire system — or a critical part of it — to stop functioning. Remove that component, and everything that depends on it goes down with it.

The definition sounds simple. The practice of finding SPOFs is not — because they hide in unexpected places, accumulate silently over time, and often only reveal themselves at the worst possible moment.

The analogy: one power strip, one wall socket

Picture an office where every device — computers, monitors, servers, the router, the phone charging dock — is plugged into a single power strip. That strip is plugged into one wall socket on one circuit.

The individual devices are fine. The monitors are fine. The computers are fine. But the entire office depends on one socket in one wall. A blown fuse on that circuit, a faulty strip, someone accidentally kicking the plug — and everything stops simultaneously.

The fix seems obvious: distribute across multiple circuits, use a UPS for critical devices, add redundancy at the power layer. But the fix costs money and takes time, so it doesn't happen until the day the whole office goes dark during a client presentation.

Every SPOF in a software system follows the same pattern: visible in retrospect, avoidable in principle, neglected until it fails.

How it works

Where SPOFs hide

The instinct when auditing for SPOFs is to look at servers and hardware. That's necessary but not sufficient. SPOFs hide across four categories:

Infrastructure SPOFs are the most visible: a single database primary with no replica, one load balancer with no standby, a single availability zone hosting all services. These show up on architecture diagrams and are relatively straightforward to eliminate through redundancy.

Software and configuration SPOFs are subtler: a single deployment pipeline that all services depend on, a shared configuration service with no fallback, a feature flag system that becomes a dependency for every request. When the deployment pipeline goes down, nobody can ship a fix. When the config service fails, services that can't load their configuration fail to start.

Dependency SPOFs are the ones that feel unfair: a third-party payment provider with no fallback, a single DNS provider whose outage takes down your domain resolution, an external authentication service that every login flows through. Your infrastructure is fine. Someone else's isn't. Your users don't care about the distinction.

Human SPOFs are the most overlooked and often the most dangerous: one engineer who understands how the legacy billing system works, one person with the production database credentials, one team member who wrote the critical deployment script and didn't document it. When that person is on holiday, unreachable, or has left the company, nobody left can operate or repair the system they owned.

The anatomy of a failure cascade

SPOFs are dangerous not just because they fail, but because their failure tends to cascade. A component that other components depend on creates a dependency graph — and when the depended-upon component fails, the failure propagates outward.

          ┌─────────────────────────────┐
          │       Load Balancer         │  ← SPOF
          └──────────────┬──────────────┘
                         │
          ┌──────────────▼──────────────┐
          │         App Servers         │  ← Healthy but unreachable
          └──────────────┬──────────────┘
                         │
          ┌──────────────▼──────────────┐
          │      Primary Database       │  ← Healthy but irrelevant
          └─────────────────────────────┘

The database is fine. The app servers are fine. The load balancer fails, and both layers below it become unreachable — not because they're broken, but because the single entry point is gone. The SPOF doesn't just fail itself; it takes everything downstream with it.
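The cascade in the diagram can be reproduced with a small reachability check. This is a minimal sketch, not a monitoring tool: the `ROUTES` map, the component names, and the `stranded` helper are all assumptions made up for illustration, and real traffic paths are rarely this linear.

```python
from collections import deque

# Hypothetical request path mirroring the diagram: traffic enters at the
# load balancer and flows downward.
ROUTES = {
    "load_balancer": ["app_servers"],
    "app_servers": ["primary_database"],
    "primary_database": [],
}


def reachable_from(entry, routes, failed):
    """Return the set of components traffic can still reach from `entry`."""
    if entry in failed:
        return set()
    seen = {entry}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        for nxt in routes.get(node, []):
            if nxt not in failed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def stranded(entry, routes, failed):
    """Healthy components that became unreachable because of `failed`."""
    healthy = set(routes) - set(failed)
    return healthy - reachable_from(entry, routes, failed)


print(sorted(stranded("load_balancer", ROUTES, {"load_balancer"})))
# ['app_servers', 'primary_database']
```

Failing the database instead strands nothing upstream, which is why the position of a component in the request path matters as much as its own reliability.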

Auditing for SPOFs

A systematic SPOF audit asks one question about every component: what happens if this fails right now?

Work through each layer:

DNS — if your DNS provider goes down, can users reach your service? (For many systems the answer is no: DNS is one of the most overlooked SPOFs in production systems.)

Load balancing — is there one load balancer, or are there multiple instances with automatic failover?

Application tier — are your application servers distributed across multiple availability zones? If one AZ goes down, do the others handle the load?

Database primary — if the primary fails, how long until a replica is promoted? Is promotion automatic or manual? Who does it at 3am?

External dependencies — for each third-party service your system calls, what happens if it's unavailable? Do you have a circuit breaker, a fallback, a graceful degradation path?

Deployment pipeline — if CI/CD goes down during an incident, can you still deploy a hotfix? How?

Access and credentials — if your primary on-call engineer is unreachable, who else can access production? Are credentials documented somewhere secure?

Knowledge — for each critical system, is there more than one person who understands it well enough to debug it at 3am?
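The external-dependencies question above mentions a circuit breaker and a fallback. A minimal sketch of that idea follows, assuming a simple failure counter and a fixed cool-down. The class, its parameters, and the `charge_card` and `queue_for_retry` functions are hypothetical, and production libraries handle concurrency and half-open probing more carefully.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency for a while."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the dependency until the cool-down has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


def charge_card():
    # Simulated outage of a hypothetical payment provider.
    raise ConnectionError("payment provider unavailable")


def queue_for_retry():
    # Graceful degradation: accept the order and charge later.
    return "queued"


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
for _ in range(3):
    print(breaker.call(charge_card, queue_for_retry))
```

The first two calls reach the failing provider. The third is short-circuited straight to the fallback, so a dead dependency stops consuming request time while it recovers.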

Eliminating SPOFs

The standard approach to eliminating infrastructure SPOFs is redundancy: run multiple instances of every critical component, detect failures automatically, and fail over without human intervention.

Before:                          After:
                                 
  Load Balancer                  Load Balancer A ─── Load Balancer B
       │                                │                   │
  App Server                    App Server 1         App Server 2
       │                         (AZ us-east-1a)    (AZ us-east-1b)
  Database                              │                   │
                                 Database Primary ── Database Replica
                                  (us-east-1a)        (us-east-1b)
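The "detect failures automatically, and fail over without human intervention" step can be sketched as a selection function over redundant instances. This assumes each instance exposes some health probe. The `Instance` type, the `is_healthy` callable, and the instance names are hypothetical, and a real system would run the check on a schedule rather than once.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Instance:
    name: str
    zone: str
    is_healthy: Callable[[], bool]  # hypothetical health probe


def pick_active(instances: List[Instance], current: Optional[Instance] = None) -> Instance:
    """Keep the current instance while it is healthy, else fail over."""
    if current is not None and current.is_healthy():
        return current
    for candidate in instances:
        if candidate is not current and candidate.is_healthy():
            return candidate
    raise RuntimeError("no healthy instance left: total outage")


lb_a = Instance("lb-a", "us-east-1a", is_healthy=lambda: False)  # simulated failure
lb_b = Instance("lb-b", "us-east-1b", is_healthy=lambda: True)

active = pick_active([lb_a, lb_b], current=lb_a)
print(active.name)  # lb-b
```

Note that the function itself runs somewhere. If only one process ever calls it, that process is the next thing to audit.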

But redundancy alone isn't enough — the redundancy has to actually work when needed. Three common failure modes of "redundant" systems that turn out to still have SPOFs:

Untested failover. The replica exists. Nobody has ever tested promoting it. When the primary fails, the promotion process is manual, poorly documented, and takes 45 minutes. The SPOF was the untested runbook.

Shared fate. Two load balancers exist, but they're both in the same availability zone, share the same network interface, or sit behind the same upstream router. One failure takes both. Redundancy must be fault-isolated — the redundant components must fail independently.

The single coordinator. You've added redundancy at every layer, but all redundant components coordinate through a single service — a configuration manager, a leader election service, a shared cache. That coordinator is now the SPOF.
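Shared fate in particular lends itself to a mechanical check: group redundant components by role and see whether every instance sits in the same fault domain. The sketch below assumes a hand-maintained inventory. The `INVENTORY` tuples, the two fault-domain kinds, and the names are illustrative only.

```python
from collections import defaultdict

# Hypothetical inventory: (component, role, availability zone, upstream router).
INVENTORY = [
    ("lb-a", "load_balancer", "us-east-1a", "router-1"),
    ("lb-b", "load_balancer", "us-east-1a", "router-1"),
    ("db-primary", "database", "us-east-1a", "router-1"),
    ("db-replica", "database", "us-east-1b", "router-2"),
]


def shared_fate(inventory):
    """Return {role: [fault domains that every instance of the role shares]}.

    A role with a single instance is reported too, since it trivially
    shares every domain with itself.
    """
    by_role = defaultdict(list)
    for _name, role, zone, router in inventory:
        by_role[role].append({"zone": zone, "router": router})

    findings = {}
    for role, placements in by_role.items():
        shared = [
            f"{kind}={placements[0][kind]}"
            for kind in ("zone", "router")
            if len({p[kind] for p in placements}) == 1
        ]
        if shared:
            findings[role] = shared
    return findings


print(shared_fate(INVENTORY))
# {'load_balancer': ['zone=us-east-1a', 'router=router-1']}
```

The two load balancers look redundant on a diagram, yet one zone or router failure takes both. The database pair passes because its instances fail independently.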

The tradeoffs

Redundancy costs money. Running two of everything roughly doubles the infrastructure bill for those components. Running across multiple availability zones or regions adds data transfer costs and replication overhead. The decision about which SPOFs to eliminate is partly an engineering decision and partly a financial one.

Redundancy adds complexity. Every redundant component needs a failover mechanism, health checks, monitoring, and operational runbooks. A system with aggressive redundancy has more moving parts than one without — which creates new failure modes (split-brain, partial failover, inconsistent state during transition) that need their own mitigations.

Not all SPOFs are equal. The right approach is risk-based: identify every SPOF, estimate the probability and impact of each one failing, and prioritise elimination accordingly. A SPOF in your payment processing flow warrants immediate investment. A SPOF in your internal analytics dashboard can wait.
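One simple way to turn "probability and impact" into an ordering is expected loss: probability multiplied by cost. The sketch below assumes you can put rough numbers on both. Every figure in `SPOFS` is an invented placeholder, not data, and a real assessment might weigh factors this ignores.

```python
# Hypothetical SPOF register: probabilities and costs are illustrative guesses.
SPOFS = [
    {"name": "payment provider", "annual_failure_prob": 0.10, "impact_cost": 500_000},
    {"name": "analytics dashboard", "annual_failure_prob": 0.30, "impact_cost": 5_000},
    {"name": "single DNS provider", "annual_failure_prob": 0.05, "impact_cost": 200_000},
]


def expected_loss(spof):
    """Expected yearly loss: chance of failure times cost when it happens."""
    return spof["annual_failure_prob"] * spof["impact_cost"]


def prioritise(spofs):
    """Order SPOFs from highest to lowest expected loss."""
    return sorted(spofs, key=expected_loss, reverse=True)


for spof in prioritise(SPOFS):
    print(f"{spof['name']}: {expected_loss(spof):,.0f}")
```

With these placeholder numbers the analytics dashboard fails most often but ranks last, which matches the point above: frequency alone is not the priority signal.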

Accept the SPOFs you've decided to accept. The goal isn't zero SPOFs — it's zero unacknowledged SPOFs. A documented, risk-assessed, monitored SPOF that the team has consciously accepted is a different animal from one nobody knows is there.

When the SPOF is a person

Human SPOFs deserve their own treatment because they're systematically underweighted in technical post-mortems. The pattern is consistent:

One person built a critical system. They're the only one who fully understands it. The documentation is sparse or outdated. They leave, or go on holiday, or are simply unreachable during an incident — and suddenly a technical problem becomes a knowledge problem.

The mitigations are straightforward in principle and chronically underprioritised in practice:

Runbooks — written procedures for every critical operational task, detailed enough for an engineer who has never seen the system before to follow at 3am
Pair operations — never let one person be the only one who has performed a critical operation (database failover, credential rotation, disaster recovery)
Bus factor audits — for each critical system, ask: how many people would have to be hit by a bus before we couldn't operate this? Any answer of "one" is a SPOF
Knowledge transfer sessions — regular walkthroughs of critical systems, recorded and documented, updated when the system changes
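A bus factor audit can start as a plain lookup of who is able to operate each critical system unaided. The sketch below assumes that list is maintained by hand. The `OPERATORS` mapping, the system names, and the people are hypothetical.

```python
# Hypothetical record of who can operate each critical system unaided.
OPERATORS = {
    "legacy billing": {"dana"},
    "database failover": {"dana", "lee"},
    "deployment script": {"sam"},
    "credential rotation": {"lee", "sam", "dana"},
}


def bus_factor_report(operators, minimum=2):
    """Return systems whose number of capable people is below `minimum`."""
    return {
        system: sorted(people)
        for system, people in operators.items()
        if len(people) < minimum
    }


print(bus_factor_report(OPERATORS))
# {'legacy billing': ['dana'], 'deployment script': ['sam']}
```

Each entry in the report is a candidate for a runbook, a paired operation, or a knowledge transfer session before the named person is next unavailable.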

The one thing to remember

The most dangerous single point of failure is the one you don't know about. Hardware SPOFs show up on architecture diagrams. Human SPOFs show up in post-mortems. The practice that finds them both is the same: regularly ask "what happens if this — exactly this — fails right now?" and follow the dependency chain all the way down until you either find a fallback or find a gap.

← Previous: Consistency Models — what distributed systems promise about data correctness

→ Next: High Availability vs Fault Tolerance — once you know where your single points of failure are, the next question is how aggressively to protect against them. HA and fault tolerance are two very different answers to that question.
