System Design Foundations: Wrap-Up

Foundations Series
| # | Post | What it covers |
|---|---|---|
| 00 | Intro | What the Foundations pillar covers and why it matters |
| 01 | Availability | Uptime, the nines, and why 99% isn't good enough |
| 02 | Reliability | Correctness over time — when uptime isn't enough |
| 03 | Latency vs Throughput vs Bandwidth | The three numbers that define system performance |
| 04 | ACID vs BASE | Two philosophies for handling data under pressure |
| 05 | CAP Theorem | The impossibility result every distributed system runs into |
| 06 | PACELC Theorem | What CAP doesn't tell you about latency |
| 07 | Consistency Models | The spectrum from "always correct" to "eventually correct" |
| 08 | Single Point of Failure | Why one weak link breaks the whole chain |
| 09 | High Availability vs Fault Tolerance | Similar goals, very different strategies |
| 10 | Wrap-up ← you are here | How all nine concepts connect |
You've covered nine concepts. Before moving on to the next pillar, it's worth pulling the threads together, because these nine don't sit in isolation. They form a web of connected tradeoffs, and the real skill is knowing how they pull against each other in a live system.
What we covered
Availability
Availability is the proportion of time your system successfully serves requests. It is usually expressed as "the nines" (99%, 99.9%, 99.99%). Each additional nine cuts the permitted downtime by a factor of ten: roughly 3.65 days a year at 99%, 8.8 hours at 99.9%, and 53 minutes at 99.99%. Each step typically costs far more to achieve than the last. The key insight: measure what users experience, not just whether your servers are running. A system returning 500 errors on every request is technically "up."
The one thing to remember: availability is not binary. Design your metrics around whether users can actually accomplish what they came to do.
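To make the nines concrete, here is a minimal sketch that turns an availability target into a yearly downtime budget, and computes availability from request outcomes rather than server uptime. The function names are illustrative, and the year is assumed to be 365 days.

```python
# Downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes per year the system may be unavailable at the given target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def measured_availability(successful: int, total: int) -> float:
    """Request-based availability: percentage of requests that succeeded."""
    if total == 0:
        raise ValueError("no requests observed")
    return 100 * successful / total


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_minutes_per_year(target):.1f} min/year")
```

Counting successful requests instead of "server is running" minutes is one way to keep the metric tied to what users actually experience.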
Reliability
A system can be highly available and deeply unreliable — always responding, often with wrong answers. Reliability is about correctness over time: does the system behave according to its specification, consistently, including under unexpected conditions?
The one thing to remember: reliability requires you to define "correct" before you can measure it. Teams that skip this step ship confident, wrong systems.
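The gap between "responding" and "responding correctly" can be sketched with two metrics over the same request log. Everything here is hypothetical: the `Response` record, the sample log, and the idea that the expected answer is known for each request (in practice that is the hard part, and it is exactly the "define correct first" step).

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int    # HTTP status code returned
    body: str      # what the system answered
    expected: str  # what the specification says it should have answered


def availability(responses):
    """Share of requests that got a non-5xx response."""
    return sum(r.status < 500 for r in responses) / len(responses)


def correctness(responses):
    """Share of requests whose answer matched the specification."""
    ok = sum(r.status < 500 and r.body == r.expected for r in responses)
    return ok / len(responses)


log = [
    Response(200, "balance=100", "balance=100"),
    Response(200, "balance=90", "balance=100"),   # responded, but wrong
    Response(200, "balance=100", "balance=100"),
    Response(200, "balance=0", "balance=100"),    # responded, but wrong
]
print(availability(log), correctness(log))  # 1.0 0.5
```

The system in this log is fully available and only half reliable.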
Latency vs Throughput vs Bandwidth
Three numbers that define performance, and three numbers that teams routinely confuse. Latency is the time one request takes from start to finish. Throughput is how many requests the system completes per second. Bandwidth is the maximum data rate of the network link. Optimising one can hurt another; fixing the wrong one wastes engineering effort.
The one thing to remember: when a system "feels slow," diagnose which of the three is actually the constraint before reaching for a solution.
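A rough way to tell which number is the constraint is a back-of-the-envelope model. The sketch below assumes a deliberately simplified transfer (one round trip plus the time to push the payload through the link) and uses Little's Law to relate throughput to latency at a fixed concurrency. The function names and sample figures are illustrative, not measurements.

```python
def transfer_time_s(payload_bytes, rtt_s, bandwidth_bytes_per_s):
    """Simplified model: one round trip plus time to push the payload through the link."""
    return rtt_s + payload_bytes / bandwidth_bytes_per_s


def dominant_cost(payload_bytes, rtt_s, bandwidth_bytes_per_s):
    """Which term of the model above contributes more to the total time."""
    serialisation_s = payload_bytes / bandwidth_bytes_per_s
    return "latency" if rtt_s >= serialisation_s else "bandwidth"


def throughput_rps(concurrent_requests, latency_s):
    """Little's Law: steady-state throughput = requests in flight / time per request."""
    return concurrent_requests / latency_s


LINK = 12_500_000  # bytes per second, i.e. a 100 Mbit/s link (assumed)

print(dominant_cost(2_000, 0.080, LINK))        # small API response: latency-bound
print(dominant_cost(500_000_000, 0.080, LINK))  # large file: bandwidth-bound
print(throughput_rps(10, 0.050))                # 10 in flight at 50 ms each
```

A faster link does nothing for the first case, and a closer server does almost nothing for the second.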
ACID vs BASE
Two philosophies for what a database does when things go wrong. ACID systems prioritise correctness: every transaction is atomic, consistent, isolated, durable. BASE systems prioritise availability: be basically available, allow a soft state, reach eventual consistency. Neither is universally better — they're different bets about which failure mode is more acceptable.
The one thing to remember: "eventually consistent" is not a get-out-of-jail-free card. It requires explicit reasoning about what happens during the window before consistency is reached.
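Atomicity, the "A" in ACID, is easy to see in a small example. This sketch uses Python's built-in `sqlite3` module with an in-memory database; the `accounts` table and the `transfer` helper are made up for illustration. The credit is applied first on purpose, so that a failed debit has something to roll back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


print(transfer(conn, "alice", "bob", 150), balances(conn))  # rejected, nothing changed
print(transfer(conn, "alice", "bob", 30), balances(conn))   # applied in full
```

A BASE-style system would typically accept the credit, acknowledge it, and reconcile the overdraft later. That window is the part that needs explicit reasoning.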
CAP Theorem
In a distributed system, a network partition forces a choice: consistency (every read reflects the most recent write, whichever node serves it) or availability (every request to a working node gets a response). You cannot have both while the partition lasts. This is not a technology limitation; it is a proven impossibility result. Every distributed database makes a CAP choice, and the documentation usually tells you which.
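The one thing to remember: CAP is often misquoted as "pick two of three." Partitions are not optional in a distributed network, so the real choice is between consistency and availability while a partition lasts. "CA" systems don't exist in a distributed network.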
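The choice can be illustrated with a toy model. The sketch below is not how any real database implements it: `Replica`, the `"CP"` and `"AP"` mode labels, and the `partitioned` flag are all hypothetical, and there is no actual replication. It only shows what each choice looks like from the client's side during a partition.

```python
class Replica:
    """Toy replica holding one value; mode is 'CP' or 'AP'."""

    def __init__(self, mode):
        self.mode = mode
        self.value = "v1"
        self.partitioned = False  # True when this replica cannot reach its peers

    def write(self, value):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot confirm write with peers")
        self.value = value  # AP: accept locally, reconcile after the partition heals

    def read(self):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot confirm value is current")
        return self.value


def during_partition(mode):
    """Write to replica a, then read from both a and b while they are cut off."""
    a, b = Replica(mode), Replica(mode)
    a.partitioned = b.partitioned = True
    try:
        a.write("v2")
        return a.read(), b.read()
    except RuntimeError as err:
        return str(err)


print(during_partition("AP"))  # ('v2', 'v1'): both answer, but they disagree
print(during_partition("CP"))  # refuses rather than risk a wrong answer
```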
PACELC Theorem
CAP describes the partition case. PACELC covers the rest of the time. Even when the network is healthy, distributed systems must trade off between latency and consistency. Low-latency reads serve from the nearest node (fast, potentially stale). Strongly consistent reads wait for enough replicas to agree, whether that is a quorum or all of them (accurate, slower). PACELC is why strong consistency has a cost even on a good day.
The one thing to remember: PACELC is the question your performance team will ask after your architects have satisfied CAP. Both conversations are necessary.
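The "else" half of PACELC can be put into numbers with a small sketch. It assumes each replica answers after a fixed, known delay and that a consistent read must wait for a given number of replicas. The latencies are invented for illustration.

```python
def nearest_read_latency(replica_latencies_ms):
    """Read from whichever replica answers first; the value may be stale."""
    return min(replica_latencies_ms)


def quorum_read_latency(replica_latencies_ms, quorum):
    """Wait until `quorum` replicas have answered; latency is set by the slowest of them."""
    if not 1 <= quorum <= len(replica_latencies_ms):
        raise ValueError("quorum must be between 1 and the number of replicas")
    return sorted(replica_latencies_ms)[quorum - 1]


# Hypothetical round-trip times to three replicas: same zone, same region, other continent.
latencies_ms = [2, 45, 120]

print(nearest_read_latency(latencies_ms))    # 2
print(quorum_read_latency(latencies_ms, 2))  # 45, majority quorum
print(quorum_read_latency(latencies_ms, 3))  # 120, every replica must answer
```

No partition is involved here. The cost of the stronger read comes purely from waiting for more distant replicas.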
Consistency Models
Not a binary choice but a spectrum. Linearisability (what "strong consistency" usually means) sits at the top, followed by sequential consistency and causal consistency. Session guarantees such as read-your-own-writes and monotonic reads come next, and eventual consistency sits at the bottom. Different models offer different guarantees at different performance costs. Knowing the spectrum lets you choose deliberately rather than accepting a database's default and hoping it's good enough.
The one thing to remember: "eventually consistent" is the bottom of the spectrum, not a synonym for "distributed." Most systems can afford something stronger than eventual consistency for most of their operations.
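As one example of "something stronger than eventual", read-your-own-writes can be layered on top of a lagging follower with a version token. This is a minimal sketch under simplifying assumptions: a single primary, a single follower, replication triggered by hand, and hypothetical `Node` and `Session` classes.

```python
class Node:
    """Toy storage node that records the version of the last write it applied."""

    def __init__(self):
        self.version = 0
        self.data = {}

    def apply(self, version, key, value):
        self.data[key] = value
        self.version = version


class Session:
    """Client session that enforces read-your-own-writes via a version token."""

    def __init__(self, primary, follower):
        self.primary = primary
        self.follower = follower
        self.last_written = 0  # version of this session's most recent write

    def write(self, key, value):
        version = self.primary.version + 1
        self.primary.apply(version, key, value)
        self.last_written = version

    def read(self, key):
        # Use the follower only if it has caught up with this session's writes.
        caught_up = self.follower.version >= self.last_written
        source = self.follower if caught_up else self.primary
        return source.data.get(key)


primary, follower = Node(), Node()
session = Session(primary, follower)
session.write("name", "new")

print(follower.data.get("name"))  # None: the follower has not replicated yet
print(session.read("name"))       # new: the session falls back to the primary
```

Other clients reading from the follower still see stale data. The guarantee applies only to the session that made the write, which is why it is cheaper than linearisability.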
Single Point of Failure
Any component whose failure brings down the system. SPOFs hide in unexpected places: a shared database primary, a deployment pipeline, a single engineer with tribal knowledge. Finding them requires mapping every critical path. Eliminating them requires redundancy, which has costs — which is why SPOFs persist even in mature, well-funded systems.
The one thing to remember: the SPOF you don't know about is more dangerous than the one you've accepted. Audit regularly.
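Mapping critical paths can be partly automated once the dependencies are written down as a graph. The sketch below removes each component in turn and checks whether a request can still get from start to goal. The topology and component names are hypothetical, and this only finds technical SPOFs that appear in the graph, not the engineer with tribal knowledge.

```python
from collections import deque


def reachable(graph, start, goal, removed):
    """Breadth-first search from start to goal, pretending `removed` has failed."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, start, goal):
    """Components whose individual failure cuts every path from start to goal."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    candidates = nodes - {start, goal}
    return sorted(n for n in candidates if not reachable(graph, start, goal, n))


# Hypothetical request path: two app servers, but only one of everything else.
graph = {
    "user": ["dns"],
    "dns": ["lb"],
    "lb": ["app1", "app2"],
    "app1": ["db_primary"],
    "app2": ["db_primary"],
    "db_primary": ["data"],
}

print(single_points_of_failure(graph, "user", "data"))
# ['db_primary', 'dns', 'lb']
```

Running a check like this on every architecture change is one way to turn "audit regularly" into a habit.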
High Availability vs Fault Tolerance
Both aim to keep systems running. HA minimises downtime through rapid failover — there's a brief interruption, but recovery is fast. Fault tolerance eliminates the interruption entirely by running redundant components in parallel. Fault tolerance is the right choice when even a two-second gap is unacceptable. It's significantly more expensive to build.
The one thing to remember: most systems need HA. Very few need true fault tolerance. Conflating them leads to either underengineered systems that go down too long, or overengineered ones that cost too much.
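The difference shows up clearly in a toy timeline. The sketch below assumes one request per tick, a single failure at a known tick, an active-passive pair for HA, and an active-active pair for fault tolerance. The function names and numbers are illustrative only.

```python
def ha_served(ticks, fail_at, failover_ticks):
    """Active-passive: requests fail between the primary dying and the standby taking over."""
    return [not (fail_at <= t < fail_at + failover_ticks) for t in range(ticks)]


def ft_served(ticks, fail_at):
    """Active-active: every request goes to both replicas; one survivor is enough."""
    served = []
    for t in range(ticks):
        replica_a_up = t < fail_at  # replica A dies at fail_at
        replica_b_up = True         # replica B stays healthy in this scenario
        served.append(replica_a_up or replica_b_up)
    return served


ha = ha_served(ticks=10, fail_at=4, failover_ticks=2)
ft = ft_served(ticks=10, fail_at=4)

print(sum(ha), "of", len(ha), "requests served with HA")               # 8 of 10
print(sum(ft), "of", len(ft), "requests served with fault tolerance")  # 10 of 10
```

The fault-tolerant version never drops a request, but it also does every piece of work twice, which is where the extra cost comes from.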
How they connect
These nine concepts aren't independent. They form a set of connected tensions:
Availability vs Consistency — the CAP tradeoff at scale. You can chase high availability or strong consistency; you cannot maximise both in a distributed system under partition.
Reliability vs Availability — a system optimised purely for uptime may accept degraded responses rather than returning errors. That protects the availability metric while quietly destroying reliability.
Latency vs Consistency — the PACELC tradeoff. Serving reads from the nearest replica is fast. Serving them from a strongly consistent primary is correct. You're almost always trading one for the other.
Fault tolerance vs cost: the most resilient architecture is usually also the most expensive. Duplicating a component roughly doubles that part of your infrastructure bill, before counting the extra operational complexity. The engineering question is always: what level of resilience does this workload actually need?
Throughput vs latency — batching improves throughput but increases latency per item. Stream processing minimises latency but reduces throughput efficiency. Real systems often run both in parallel for different data paths.
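The batching tradeoff in the last item can be sketched with a simple cost model. It assumes each batch pays a fixed overhead (for example a network round trip or a commit) plus a per-item cost, and that an item's result is only available once its whole batch finishes. The constants are invented, and the time an item spends waiting for its batch to fill is ignored.

```python
def batch_stats(batch_size, overhead_s=0.010, per_item_s=0.001):
    """Return (latency in seconds per item, throughput in items per second)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch_time_s = overhead_s + batch_size * per_item_s
    latency_s = batch_time_s               # an item is done when its batch is done
    throughput = batch_size / batch_time_s
    return latency_s, throughput


for size in (1, 10, 100):
    latency_s, throughput = batch_stats(size)
    print(f"batch={size:>3}  latency={latency_s * 1000:.0f} ms  throughput={throughput:.0f} items/s")
```

With these numbers, going from a batch of 1 to a batch of 100 raises throughput about tenfold and raises per-item latency by the same factor.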
The skill isn't knowing each concept in isolation. It's knowing which tensions are active in your system, right now, and which end of each tradeoff your architecture is sitting on — by design or by accident.
The running example so far
In this pillar we've kept examples abstract. From Pillar 2 (Networking & Protocols) onwards, we'll anchor each concept to a concrete evolving system — a URL shortener that we'll build up, stress-test, and redesign as complexity grows. By the time we reach Distributed Systems (Pillar 8), that URL shortener will have encountered and solved most of the problems described in these nine posts.
What's next: Networking & Protocols
The next pillar is where your data actually travels. We'll cover:
The OSI model — the seven-layer stack that explains why HTTP, TCP, and IP all coexist
TCP vs UDP — reliability vs speed at the transport layer
HTTP vs HTTPS — what TLS actually does to a request
DNS — how a domain name becomes an IP address, and why it's harder than it sounds
CDN — how content gets closer to users
The Foundations pillar gave you the vocabulary for what a system should do. The Networking pillar explains the medium it does it through.
Start Pillar 2 → Networking & Protocols: Overview
Part of the System Design series. Tags: #systemdesign #distributedsystems #softwarearchitecture #backend #engineering




