Foundations Series

#	Post	What it covers
00	Intro	What the Foundations pillar covers and why it matters
01	Availability	Uptime, the nines, and why 99% isn't good enough
02	Reliability	Correctness over time — when uptime isn't enough
03	Latency vs Throughput vs Bandwidth	The three numbers that define system performance
04	ACID vs BASE	Two philosophies for handling data under pressure
05	CAP Theorem	The impossibility result every distributed system runs into
06	PACELC Theorem	What CAP doesn't tell you about latency
07	Consistency Models	The spectrum from "always correct" to "eventually correct"
08	Single Point of Failure	Why one weak link breaks the whole chain
09	High Availability vs Fault Tolerance	Similar goals, very different strategies
10	Wrap-up ← you are here	How all nine concepts connect

System Design Foundations: Wrap-Up

You've covered nine concepts. Before moving on to the next pillar, it's worth pulling the thread — because these nine don't sit in isolation. They form a web of connected tradeoffs, and the real skill is knowing how they pull against each other in a live system.

What we covered

Availability

Availability is the proportion of time your system successfully serves requests. It is usually expressed as "the nines" (99%, 99.9%, 99.99%): each additional nine cuts the permitted downtime by a factor of ten, from roughly 3.65 days a year at 99% to under an hour at 99.99%, and tends to be far harder to achieve. The key insight: measure what users experience, not just whether your servers are running. A system returning 500 errors on every request is technically "up."
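
To make the nines concrete, here is a minimal sketch of the arithmetic. The function names are illustrative, not from any library, and the year is simplified to 365 days. The second function measures availability the way users see it, as the share of requests that succeeded.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: ignores leap years


def downtime_budget_seconds(availability_percent: float) -> float:
    """Seconds per year a system may be unavailable at the given target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def request_availability(successful: int, total: int) -> float:
    """Availability as users see it: the percentage of requests that succeeded."""
    if total == 0:
        return 100.0  # no traffic, so nothing failed
    return 100.0 * successful / total


for target in (99.0, 99.9, 99.99):
    hours = downtime_budget_seconds(target) / 3600
    print(f"{target}% allows about {hours:.2f} hours of downtime per year")

# A server that is running but fails every request scores zero here.
print(request_availability(successful=0, total=1000))
```

The request-based number is the one that catches the "technically up" case: the process is alive, but the figure users care about is 0%.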

The one thing to remember: availability is not binary. Design your metrics around whether users can actually accomplish what they came to do.

Reliability

A system can be highly available and deeply unreliable — always responding, often with wrong answers. Reliability is about correctness over time: does the system behave according to its specification, consistently, including under unexpected conditions?

The one thing to remember: reliability requires you to define "correct" before you can measure it. Teams that skip this step ship confident, wrong systems.

Latency vs Throughput vs Bandwidth

Three numbers that define performance, and three numbers that teams routinely confuse. Latency is the time one request takes from start to finish. Throughput is how many requests the system completes per second. Bandwidth is the maximum rate at which the network can carry data. Optimising one can hurt another; fixing the wrong one wastes engineering effort.
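
A rough way to see which number is the constraint is to split one transfer into its fixed delay and its time on the wire. This is a deliberately simplified model (no handshakes, no congestion, no server processing), and the function names are made up for illustration.

```python
def transfer_time_seconds(payload_bytes: float, latency_s: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Simplified delivery time: fixed delay plus time spent on the wire."""
    return latency_s + payload_bytes / bandwidth_bytes_per_s


def dominant_constraint(payload_bytes: float, latency_s: float,
                        bandwidth_bytes_per_s: float) -> str:
    """Say which term contributes more to the total transfer time."""
    wire_time = payload_bytes / bandwidth_bytes_per_s
    return "latency" if latency_s >= wire_time else "bandwidth"


LINK = 12_500_000  # assumed link: 100 Mbit/s expressed in bytes per second
RTT = 0.050        # assumed 50 ms of latency

# A small API response: the wire time is negligible next to the delay.
print(dominant_constraint(2_000, RTT, LINK))

# A 1 GB file: the delay is negligible next to the wire time.
print(dominant_constraint(1_000_000_000, RTT, LINK))

# Throughput of a client that sends small requests one after another.
print(1 / transfer_time_seconds(2_000, RTT, LINK), "requests per second")
```

Under these assumptions a fatter pipe does almost nothing for the small response, and a closer server does almost nothing for the large file.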

The one thing to remember: when a system "feels slow," diagnose which of the three is actually the constraint before reaching for a solution.

ACID vs BASE

Two philosophies for what a database does when things go wrong. ACID systems prioritise correctness: every transaction is atomic, consistent, isolated, durable. BASE systems prioritise availability: be basically available, allow a soft state, reach eventual consistency. Neither is universally better — they're different bets about which failure mode is more acceptable.

The one thing to remember: "eventually consistent" is not a get-out-of-jail-free card. It requires explicit reasoning about what happens during the window before consistency is reached.

CAP Theorem

In a distributed system, a network partition forces a choice: consistency (every read sees the most recent write) or availability (every node that is still running keeps responding). You cannot have both. This is not a limitation of today's technology; it is a proven impossibility result. Every distributed database makes a CAP choice, and its documentation usually tells you which.
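
The choice can be sketched with a toy replica. This is an illustration only: the `Replica` class, its `mode` flag, and the `Unavailable` exception are hypothetical and not modelled on any particular database.

```python
class Unavailable(Exception):
    """Raised when a replica refuses to answer rather than risk a stale reply."""


class Replica:
    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = None
        self.partitioned = False  # True while cut off from its peers

    def replicate(self, value):
        """Apply an update from peers; it never arrives during a partition."""
        if not self.partitioned:
            self.value = value

    def read(self):
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot confirm the latest value during a partition")
        return self.value  # in AP mode this may be stale


cp, ap = Replica("CP"), Replica("AP")
for replica in (cp, ap):
    replica.replicate("v1")
    replica.partitioned = True
    replica.replicate("v2")  # lost: the replica is isolated

print(ap.read())  # answers, but with the stale "v1"
try:
    cp.read()
except Unavailable as err:
    print("CP replica refused:", err)
```

Both replicas missed the same update. The AP one keeps answering with old data; the CP one gives up availability to avoid doing that.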

The one thing to remember: CAP is often misquoted as "pick two of three." Partitions are not optional in a distributed network, so the real choice is what to give up when one happens: consistency or availability. A "CA" distributed system doesn't exist.

PACELC Theorem

CAP describes the partition case. PACELC covers the rest of the time. Even when the network is healthy, distributed systems must trade off between latency and consistency. Low-latency reads serve from the nearest node (fast, potentially stale). Strongly consistent reads wait until enough replicas agree, often a majority rather than literally every node (accurate, slower). PACELC is why strong consistency has a cost even on a good day.
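
The cost on a good day can be put into numbers with a small model. It assumes the read is sent to all replicas in parallel and completes once a chosen number of replies has arrived; the function name and the latencies are invented for illustration.

```python
def read_latency_ms(replica_latencies_ms: list[float], required_replies: int) -> float:
    """Latency of a parallel read that completes after `required_replies` answers."""
    if not 1 <= required_replies <= len(replica_latencies_ms):
        raise ValueError("required_replies must be between 1 and the replica count")
    # The read finishes when the k-th fastest replica has answered.
    return sorted(replica_latencies_ms)[required_replies - 1]


# Assumed round-trip times: one local replica and two in remote regions.
latencies = [2.0, 40.0, 95.0]

print(read_latency_ms(latencies, 1))  # nearest replica only: 2.0
print(read_latency_ms(latencies, 2))  # majority of three: 40.0
print(read_latency_ms(latencies, 3))  # every replica: 95.0
```

Nothing has failed in this scenario. The extra milliseconds are simply what it costs to wait for more replicas before answering.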

The one thing to remember: PACELC is the question your performance team will ask after your architects have satisfied CAP. Both conversations are necessary.

Consistency Models

Not a binary choice but a spectrum, running from linearisability (what most people mean by "strong consistency") through sequential consistency, causal consistency, and session guarantees such as monotonic reads and read-your-own-writes, down to eventual consistency. Different models offer different guarantees at different performance costs. Knowing the spectrum lets you choose deliberately rather than accepting a database's default and hoping it's good enough.
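
As one example from the middle of the spectrum, read-your-own-writes and monotonic reads can be approximated by having each client remember the newest version it has seen. The sketch below is a toy: `SessionClient` is a hypothetical name, and replicas are plain dictionaries holding a version number and a value.

```python
class SessionClient:
    """Remembers the newest version this client has written or read."""

    def __init__(self):
        self.min_version = 0

    def write(self, primary: dict, value) -> None:
        primary["version"] += 1
        primary["value"] = value
        self.min_version = primary["version"]  # later reads must see this write

    def read(self, replicas: list):
        # Try replicas in order (nearest first) and skip any that are behind.
        for replica in replicas:
            if replica["version"] >= self.min_version:
                self.min_version = replica["version"]  # never go backwards
                return replica["value"]
        raise LookupError("no replica is fresh enough for this session")


primary = {"version": 0, "value": None}
lagging = {"version": 0, "value": None}  # has not received the write yet

client = SessionClient()
client.write(primary, "hello")

# The lagging replica is skipped, so the client sees its own write.
print(client.read([lagging, primary]))
```

A different client with no history would happily read from the lagging replica. The guarantee is per session, which is why it is cheaper than linearisability.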

The one thing to remember: "eventually consistent" is the bottom of the spectrum, not a synonym for "distributed." Most systems can afford something stronger than eventual consistency for most of their operations.

Single Point of Failure

Any component whose failure brings down the system. SPOFs hide in unexpected places: a shared database primary, a deployment pipeline, a single engineer with tribal knowledge. Finding them requires mapping every critical path. Eliminating them requires redundancy, which has costs — which is why SPOFs persist even in mature, well-funded systems.
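
The effect of a SPOF, and of removing it, shows up in the standard availability arithmetic. This sketch assumes component failures are independent and that failover between redundant copies is instant and perfect, which real systems only approximate. The function names are illustrative.

```python
from math import prod


def serial_availability(parts: list) -> float:
    """Every part is on the critical path, so all of them must be up."""
    return prod(parts)


def redundant_availability(single: float, copies: int) -> float:
    """The group is down only when every copy is down at the same time."""
    return 1 - (1 - single) ** copies


# Three components in a chain, each available 99.9% of the time:
# the chain is less available than any single link (about 0.9970).
print(f"{serial_availability([0.999, 0.999, 0.999]):.4f}")

# A lone database at 99% versus two independent copies of it (about 0.9999).
print(f"{redundant_availability(0.99, 1):.4f}")
print(f"{redundant_availability(0.99, 2):.4f}")
```

The independence assumption is the weak point. Two copies that share a rack, a power feed, or a deployment pipeline fail together, and the shared piece is the SPOF that was missed.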

The one thing to remember: the SPOF you don't know about is more dangerous than the one you've accepted. Audit regularly.

High Availability vs Fault Tolerance

Both aim to keep systems running. HA minimises downtime through rapid failover — there's a brief interruption, but recovery is fast. Fault tolerance eliminates the interruption entirely by running redundant components in parallel. Fault tolerance is the right choice when even a two-second gap is unacceptable. It's significantly more expensive to build.

The one thing to remember: most systems need HA. Very few need true fault tolerance. Conflating them leads to either underengineered systems that go down too long, or overengineered ones that cost too much.

How they connect

These nine concepts aren't independent. They form a set of connected tensions:

Availability vs Consistency — the CAP tradeoff at scale. You can chase high availability or strong consistency; you cannot maximise both in a distributed system under partition.

Reliability vs Availability — a system optimised purely for uptime may accept degraded responses rather than returning errors. That protects the availability metric while quietly destroying reliability.

Latency vs Consistency — the PACELC tradeoff. Serving reads from the nearest replica is fast. Serving them from a strongly consistent primary is correct. You're almost always trading one for the other.

Fault tolerance vs cost: the most fault-tolerant architecture is usually also the most expensive. Each full duplicate of a component roughly doubles that part of your infrastructure bill, before counting the extra operational work of keeping the copies in sync. The engineering question is always: what level of resilience does this workload actually need?

Throughput vs latency — batching improves throughput but increases latency per item. Stream processing minimises latency but reduces throughput efficiency. Real systems often run both in parallel for different data paths.
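
The batching side of this tension can be sketched with a toy cost model. It assumes each batch pays a fixed overhead plus a per-item cost, and that items arrive at a steady interval, so the first item in a batch waits for the rest to arrive. All names and numbers are invented for illustration.

```python
def batch_tradeoff(batch_size: int, per_batch_overhead_s: float,
                   per_item_cost_s: float, arrival_interval_s: float):
    """Return (items per second while processing, worst-case latency per item)."""
    processing = per_batch_overhead_s + batch_size * per_item_cost_s
    throughput = batch_size / processing
    # The first item to arrive waits for the remaining items to fill the batch.
    wait_to_fill = (batch_size - 1) * arrival_interval_s
    return throughput, wait_to_fill + processing


# Assumed costs: 10 ms overhead per batch, 1 ms per item, one item every 5 ms.
for size in (1, 10, 100):
    throughput, latency = batch_tradeoff(size, 0.010, 0.001, 0.005)
    print(f"batch={size:>3}  {throughput:7.1f} items/s  worst latency {latency:.3f} s")
```

Under these assumptions, larger batches spread the fixed overhead over more items, so processing throughput climbs, while the first item in each batch waits longer and longer.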

The skill isn't knowing each concept in isolation. It's knowing which tensions are active in your system, right now, and which end of each tradeoff your architecture is sitting on — by design or by accident.

The running example so far

In this pillar we've kept examples abstract. From Pillar 2 (Networking & Protocols) onwards, we'll anchor each concept to a concrete evolving system — a URL shortener that we'll build up, stress-test, and redesign as complexity grows. By the time we reach Distributed Systems (Pillar 8), that URL shortener will have encountered and solved most of the problems described in these nine posts.

What's next: Networking & Protocols

The next pillar is where your data actually travels. We'll cover:

The OSI model — the seven-layer stack that explains why HTTP, TCP, and IP all coexist
TCP vs UDP — reliability vs speed at the transport layer
HTTP vs HTTPS — what TLS actually does to a request
DNS — how a domain name becomes an IP address, and why it's harder than it sounds
CDN — how content gets closer to users

The Foundations pillar gave you the vocabulary for what a system should do. The Networking pillar explains the medium it does it through.

Start Pillar 2 → Networking & Protocols: Overview

Part of the System Design series. Tags: #systemdesign #distributedsystems #softwarearchitecture #backend #engineering
