System Design Foundations: The Core Concepts Every Engineer Needs


Series: System Design · Foundations (Pillar 1 of 8)

This pillar: 00 Overview · 01 Availability · 02 Reliability · 03 Latency vs Throughput vs Bandwidth · 04 ACID vs BASE · 05 CAP Theorem · 06 PACELC Theorem · 07 Consistency Models · 08 Single Point of Failure · 09 High Availability vs Fault Tolerance · 10 Wrap-up


The scenario

You're in a system design interview. The interviewer says: "Design Twitter." You open your mouth and reach for load balancers, microservices, a caching layer, a CDN — all the things you've read about. Twenty minutes in, the interviewer stops you: "How does your system behave when the database is slow?"

You pause. You know the words. You're less sure about the reasoning.

This pillar is about the reasoning. Before you can make good decisions about databases, caching, networking, or architecture patterns, you need a shared vocabulary — a set of concepts that let you describe what a system is doing, why it's failing, and what tradeoffs you're making when you fix it.

Nine concepts. They won't tell you which database to pick or how to structure your API. What they'll do is give you the mental model that makes every other decision legible.

TL;DR: Availability, reliability, latency, throughput, ACID, BASE, CAP, consistency, and fault tolerance aren't buzzwords to drop in interviews. They're the axes along which every real system design decision gets made. Know them cold and the rest of system design stops being a collection of patterns to memorise — it becomes a set of engineering tradeoffs to reason about.


What this pillar covers

Availability and reliability — is it up, and does it work?

These two are often used interchangeably. They mean different things. Availability is whether your system responds at all. Reliability is whether it responds correctly. A system can be highly available (always responding) while being deeply unreliable (responding with wrong data). A system can be reliable in what it returns but frequently unavailable (crashes often, recovers cleanly).

Both matter. They fail in different ways and are fixed with different tools.

Best mental model: availability is a hospital that's open 24 hours. Reliability is whether the hospital actually diagnoses you correctly when you walk in.
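
To make the distinction measurable, here is a minimal sketch that scores the two properties separately over a request log. The `Request` record and both sample logs are hypothetical, and treating reliability as "share of answered requests that were correct" is one simple choice among several.

```python
from dataclasses import dataclass

@dataclass
class Request:
    responded: bool  # did the system answer at all?
    correct: bool    # was the answer right? (only meaningful if it responded)

def availability(requests):
    """Fraction of requests that received any response."""
    return sum(r.responded for r in requests) / len(requests)

def reliability(requests):
    """Fraction of answered requests whose response was correct."""
    answered = [r for r in requests if r.responded]
    return sum(r.correct for r in answered) / len(answered)

# Hypothetical log: always answers, but one answer in five is wrong.
always_up_often_wrong = [Request(True, i % 5 != 0) for i in range(100)]
# Hypothetical log: drops one request in five, but never answers wrongly.
often_down_never_wrong = [Request(i % 5 != 0, True) for i in range(100)]

for name, log in [("always up, often wrong", always_up_often_wrong),
                  ("often down, never wrong", often_down_never_wrong)]:
    print(f"{name}: availability={availability(log):.2f} reliability={reliability(log):.2f}")
```

The first log scores perfectly on availability and poorly on reliability; the second is the reverse. An uptime dashboard alone would rank them the wrong way round.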


Latency, throughput, and bandwidth — how fast, how much, how wide?

These three numbers describe system performance, and teams constantly confuse them. Latency is how long a single request takes from start to finish. Throughput is how many requests the system completes per second. Bandwidth is the maximum rate at which the network link can carry data.

Optimising for one doesn't automatically improve the others — and sometimes actively hurts them. A system designed for extreme throughput (handling millions of requests per second) often accepts higher per-request latency. A system that minimises latency (sub-millisecond responses) often sacrifices throughput under load.

Best mental model: a motorway. Bandwidth is the number of lanes. Throughput is how many cars pass a point per hour. Latency is how long it takes one car to get from junction 1 to junction 10.
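
A toy model shows how the three numbers pull apart. The sketch below assumes a single worker that pays a fixed overhead per batch plus a cost per item, and a network fetch that costs one round trip plus time on the wire. All the figures (10 ms overhead, 1 ms per item, 50 ms round trip, 1 Gbit/s link) are invented for illustration.

```python
def batch_stats(batch_size, overhead_ms=10.0, per_item_ms=1.0):
    """Return (latency_ms, throughput_per_s) for one worker processing batches."""
    batch_time_ms = overhead_ms + batch_size * per_item_ms
    latency_ms = batch_time_ms  # every request waits for its whole batch to finish
    throughput_per_s = batch_size / batch_time_ms * 1000
    return latency_ms, throughput_per_s

def transfer_time_s(size_bytes, bandwidth_bits_per_s, rtt_s):
    """Rough time to fetch a payload: one round trip plus time on the wire."""
    return rtt_s + size_bytes * 8 / bandwidth_bits_per_s

# Bigger batches raise throughput and raise per-request latency at the same time.
for n in (1, 10, 100):
    latency, throughput = batch_stats(n)
    print(f"batch={n:>3}  latency={latency:6.1f} ms  throughput={throughput:6.1f} req/s")

# A small payload is dominated by latency, a large one by bandwidth.
for size in (1_000, 1_000_000_000):
    seconds = transfer_time_s(size, bandwidth_bits_per_s=1e9, rtt_s=0.05)
    print(f"{size:>13} bytes over 1 Gbit/s with 50 ms RTT: {seconds:.6f} s")
```

In this model, moving from a batch of 1 to a batch of 100 multiplies throughput by ten and latency by ten as well. Adding bandwidth would barely change the 1 KB fetch, because almost all of its time is the round trip.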


ACID vs BASE — two philosophies for data under pressure

When your database is under stress (a network partition, high write load, a node failure), it has to choose what to prioritise. ACID databases (most traditional relational databases) choose correctness: every transaction is atomic, consistent, isolated, and durable. BASE systems (many NoSQL databases) choose availability: be basically available, keep a soft state, and aim for eventual consistency.

Neither is better. They're different bets about what failure looks like and what recovery requires.

Best mental model: ACID is a bank vault — everything that goes in is logged, verified, and permanent. BASE is a whiteboard in a busy office — fast, collaborative, and eventually someone tidies it up.
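
The "atomic" part of ACID is easy to see in practice. This sketch uses Python's built-in `sqlite3` module with an in-memory database; the `accounts` table, the names, and the balances are made up for the example. The transfer deliberately credits the recipient first, so a failed debit has something to undo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            # Violates the CHECK constraint if src would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

print(transfer(conn, "alice", "bob", 150), balances(conn))  # overdraft: nothing changes
print(transfer(conn, "alice", "bob", 60), balances(conn))   # valid: both rows change
```

The failed transfer leaves Bob's balance untouched even though his credit ran before the error. A BASE-style store would typically accept the writes independently and leave reconciliation to the application.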


CAP Theorem — the impossibility result

In a distributed system, when a network partition occurs (nodes can't talk to each other), you have to choose between consistency (every read reflects the most recent write, whichever node serves it) and availability (every node that is still running keeps answering requests). You cannot have both.

This is not a limitation of current technology. It's a formally proven result. Understanding it changes how you read every distributed system's documentation: every "we sacrifice X for Y" design decision is, at some level, a CAP decision.

Best mental model: two cashiers in different branches of the same bank, unable to call each other. If a customer withdraws £500 from branch A, does branch B let them withdraw another £500 before they sync? Choose consistency (branch B waits) or availability (branch B allows it and reconciles later).
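
The two-branch analogy can be played out in a few lines. This is a deliberately simplified sketch: the `Branch` class, the `"CP"`/`"AP"` mode flag, and the reconciliation rule (replay missed withdrawals after the partition heals) are all assumptions for illustration. A real CP system would usually let the side holding a majority keep serving, which a two-node setup cannot show.

```python
class Unavailable(Exception):
    """Raised when a branch refuses to answer rather than risk inconsistency."""

class InsufficientFunds(Exception):
    pass

class Branch:
    def __init__(self, name, balance, mode):
        self.name = name
        self.balance = balance    # this branch's local view of the account
        self.mode = mode          # "CP" or "AP"
        self.peer = None
        self.partitioned = False
        self.pending = []         # withdrawals the peer has not seen yet

    def withdraw(self, amount):
        if self.partitioned and self.mode == "CP":
            raise Unavailable(f"{self.name} cannot reach its peer")
        if amount > self.balance:
            raise InsufficientFunds()
        self.balance -= amount
        if self.partitioned:
            self.pending.append(amount)   # reconcile later
        else:
            self.peer.balance -= amount   # replicate immediately

    def heal(self):
        """Partition over: replay the withdrawals the peer missed."""
        self.partitioned = False
        for amount in self.pending:
            self.peer.balance -= amount
        self.pending.clear()

def run_scenario(mode):
    a, b = Branch("A", 500, mode), Branch("B", 500, mode)
    a.peer, b.peer = b, a
    a.partitioned = b.partitioned = True
    outcomes = []
    for branch in (a, b):
        try:
            branch.withdraw(500)
            outcomes.append("ok")
        except Unavailable:
            outcomes.append("refused")
    a.heal()
    b.heal()
    return outcomes, a.balance, b.balance

print("AP:", run_scenario("AP"))
print("CP:", run_scenario("CP"))
```

In AP mode both withdrawals succeed and the account ends up overdrawn by £500 once the branches reconcile. In CP mode both branches refuse, the balance stays correct, and the customer walks away without cash.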


PACELC Theorem — what CAP doesn't tell you

CAP describes the tradeoff during a partition. PACELC extends it: if there is a Partition (P), choose between Availability (A) and Consistency (C); Else (E), when the network is healthy, choose between Latency (L) and Consistency (C). Serving a read from the nearest node is fast but might return stale data; waiting for all nodes to agree is consistent but slow.

PACELC is why "just use a strongly consistent database" isn't a free choice. Strong consistency has a latency cost even on a good day.

Best mental model: an extension of the bank analogy. Even when both branches can call each other, confirming a balance across both branches takes longer than one branch just answering from its own records.
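
The latency side of that tradeoff can be sketched with three replicas on a healthy network. The `Replica` record, the round-trip times, and the version numbers below are hypothetical; the point is only that the fast read and the fresh read are different reads.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    rtt_ms: float  # hypothetical round-trip time from the client
    version: int   # how recent this replica's copy is
    value: str

def read_nearest(replicas):
    """Answer from the closest replica: fast, but possibly stale."""
    nearest = min(replicas, key=lambda r: r.rtt_ms)
    return nearest.value, nearest.rtt_ms

def read_all(replicas):
    """Ask every replica in parallel and keep the newest value.
    The read is only as fast as the slowest reply."""
    newest = max(replicas, key=lambda r: r.version)
    return newest.value, max(r.rtt_ms for r in replicas)

replicas = [
    Replica("local", rtt_ms=2.0, version=6, value="balance=500"),  # lagging behind
    Replica("eu", rtt_ms=80.0, version=7, value="balance=0"),
    Replica("ap", rtt_ms=150.0, version=7, value="balance=0"),
]

print(read_nearest(replicas))  # stale value after 2 ms
print(read_all(replicas))      # current value after 150 ms
```

Real systems usually sit between these extremes by waiting for a quorum rather than for every replica, which lowers the latency cost without removing it.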


Consistency models — the spectrum

Consistency isn't binary. "Every read returns the most recent write" (strong consistency) and "reads might return old data" (eventual consistency) are the two ends of a spectrum. Roughly from strongest to weakest, it runs through linearisability, sequential consistency, and causal consistency, down to session guarantees such as monotonic reads and read-your-own-writes. Different systems guarantee different points on this spectrum.

Knowing the spectrum means you can have an honest conversation about what your system promises — not just say "it's eventually consistent" and hope for the best.

Best mental model: a team editing a shared document. Strong consistency is Google Docs — everyone sees every keystroke in real time. Eventual consistency is emailing Word files back and forth — everyone's version is right from their perspective; they'll converge when someone merges.
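
One point on the spectrum, read-your-own-writes, is simple enough to sketch. The version below assumes one primary, one lagging replica, and a session that remembers the version of its last write; the `Store`, `Session`, and `replicate` names are hypothetical, and real implementations track this in different ways.

```python
class Store:
    def __init__(self):
        self.version = 0
        self.data = {}

class Session:
    def __init__(self, primary, replica):
        self.primary, self.replica = primary, replica
        self.last_written = 0  # highest version this session has written

    def write(self, key, value):
        self.primary.version += 1
        self.primary.data[key] = value
        self.last_written = self.primary.version

    def read_eventual(self, key):
        """Always read the replica: may miss this session's own writes."""
        return self.replica.data.get(key)

    def read_your_writes(self, key):
        """Use the replica only if it has caught up with this session's writes."""
        caught_up = self.replica.version >= self.last_written
        source = self.replica if caught_up else self.primary
        return source.data.get(key)

def replicate(primary, replica):
    """Simulate replication lag ending: copy the primary's state across."""
    replica.data = dict(primary.data)
    replica.version = primary.version

primary, replica = Store(), Store()
session = Session(primary, replica)
session.write("bio", "hello")
print(session.read_eventual("bio"))     # None: the replica has not caught up
print(session.read_your_writes("bio"))  # hello: falls back to the primary
replicate(primary, replica)
print(session.read_eventual("bio"))     # hello: the replicas have converged
```

The guarantee is weaker than strong consistency, because other sessions can still see stale data, but it removes the most visible symptom: a user saving a change and then not seeing it.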


Single point of failure: the component everything depends on

A single point of failure (SPOF) is any component whose failure takes the whole system down. SPOFs are often invisible, and they are not only hardware: a single database primary, a single deployment pipeline, or a single on-call engineer who knows how a critical system works all qualify.

Identifying SPOFs is the first step in designing for resilience. Eliminating them requires redundancy, and redundancy has costs — which is why SPOFs persist even in mature systems.

Best mental model: a power strip with one socket feeding everything in an office. The strip failing takes everything with it. The fix (multiple circuits, a UPS) adds cost and complexity — so teams accept the risk until the day they regret it.
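
Finding SPOFs can be partly mechanical. As a minimal sketch, model the request path as a directed graph and flag every component whose removal leaves no route from the user to the data. The graph below is a hypothetical architecture, and the model assumes that any surviving path is good enough, which only holds when the alternatives are genuinely redundant.

```python
from collections import deque

def reachable(graph, start, goal, removed):
    """Breadth-first search from start to goal, pretending `removed` is down."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def single_points_of_failure(graph, start, goal):
    """Components whose failure alone cuts every path from start to goal."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    return sorted(
        n for n in nodes - {start, goal}
        if not reachable(graph, start, goal, removed=n)
    )

# Hypothetical architecture: redundant app servers, but one load balancer
# and one database primary.
graph = {
    "user": ["load-balancer"],
    "load-balancer": ["app-1", "app-2"],
    "app-1": ["db-primary"],
    "app-2": ["db-primary"],
    "db-primary": ["data"],
}

print(single_points_of_failure(graph, "user", "data"))
```

The app servers do not appear in the result because each covers for the other. The load balancer and the database primary do. A graph like this will not catch the human SPOFs, such as the one engineer who understands the deployment pipeline.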


High availability vs fault tolerance — similar goals, different strategies

Both aim for systems that stay up. High availability minimises downtime through rapid detection and recovery — failover is fast, but there's still a brief interruption. Fault tolerance means the system continues operating without interruption even when components fail — no failover, no recovery window, just seamless continuation.

Fault tolerance is significantly more expensive to build. It's the right choice for systems where even a two-second failover window is unacceptable: aircraft systems, financial clearing, emergency services infrastructure.

Best mental model: HA is a hospital with an emergency generator — when power fails, there's a two-second gap while it kicks in. Fault tolerance is a hospital with fully redundant power running in parallel at all times — the lights never flicker.
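
The difference shows up as soon as you simulate a failure tick by tick. This sketch assumes a primary that dies at a fixed tick, a standby that needs a fixed number of ticks to take over, and, for the fault-tolerant case, a second replica that never fails at the same moment. The `simulate` function and its parameters are illustrative only.

```python
def simulate(strategy, ticks=10, fail_at=3, failover_ticks=2):
    """Return one 'ok'/'fail' entry per tick for a primary that dies at fail_at."""
    results = []
    for t in range(ticks):
        primary_up = t < fail_at
        if strategy == "ha":
            # The standby only serves once the failure is detected and failover completes.
            standby_serving = t >= fail_at + failover_ticks
            served = primary_up or standby_serving
        elif strategy == "ft":
            # Both replicas handle every request all the time, so the survivor
            # answers immediately. Assumption: they never fail together.
            secondary_up = True
            served = primary_up or secondary_up
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        results.append("ok" if served else "fail")
    return results

print("HA:", simulate("ha"))
print("FT:", simulate("ft"))
```

The high-availability run drops requests during the two failover ticks and then recovers. The fault-tolerant run never drops one, at the price of running twice the capacity all the time.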


The decision framework

When you sit down to design a system — or to diagnose why an existing one is misbehaving — these nine concepts give you the axes to reason along:

Is the system responding at all?
  └─ No → Availability problem

Is it responding correctly?
  └─ No → Reliability problem

Is it too slow for individual requests?
  └─ Yes → Latency problem

Is it unable to handle enough load?
  └─ Yes → Throughput or bandwidth problem

Does data need to be perfectly correct, always?
  └─ Yes → ACID; be ready to pay an availability and latency cost
  └─ No → BASE may be appropriate; define what "eventually" means

Are nodes sometimes unable to communicate?
  └─ CAP applies: you must choose consistency or availability during partitions
  └─ PACELC applies even when they can: low latency or strong consistency?

What does your system promise about read freshness?
  └─ Define your consistency model explicitly

Where does your system fail if one component goes down?
  └─ Find your SPOFs; decide which ones are acceptable

How long can you tolerate being down?
  └─ HA if seconds are acceptable; fault tolerance if they're not

Common traps at this stage

Treating availability and reliability as the same metric. They need separate dashboards and separate engineering conversations. A team that conflates them ends up optimising uptime while their data silently corrupts.

Assuming CAP means you can ignore consistency. "We chose AP, so we're eventually consistent" is a design decision, not an excuse. Eventual consistency requires explicit reasoning about what happens when nodes diverge and how they reconcile.

Chasing five nines prematurely. Every additional nine cuts the permitted downtime by a factor of ten, and the engineering effort needed to stay inside that budget rises steeply. Most user-facing applications don't need five nines, and the effort to get there takes time away from features users actually notice.

Confusing latency and throughput when debugging performance. They have different root causes and different fixes. Throwing more servers at a latency problem often does nothing; improving a single slow query can fix it entirely.
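
On the five-nines trap, the arithmetic is worth seeing once. This sketch converts an availability target into a yearly downtime budget, assuming a 365-day year and counting every minute equally; the function name is made up for the example.

```python
def downtime_minutes_per_year(availability_pct):
    """Downtime budget, in minutes, for a 365-day year at the given availability."""
    minutes_per_year = 365 * 24 * 60  # 525,600
    return minutes_per_year * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target:>7}%  {minutes:9.2f} min/year  ({minutes / 60:6.2f} hours)")
```

Three nines leaves roughly 8.8 hours a year, enough to absorb a bad deploy and a slow rollback. Five nines leaves a little over five minutes, which is less time than most teams need to notice an incident and respond.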


Key takeaways

  • Availability is whether your system responds; reliability is whether it responds correctly. Both matter and they fail differently.
  • Latency, throughput, and bandwidth are three separate performance axes. Optimising one can hurt another.
  • ACID vs BASE is a choice about what to prioritise when things go wrong — correctness or availability.
  • CAP Theorem is a hard limit: during a network partition, choose consistency or availability.
  • PACELC extends CAP to the non-partition case: even on a good day, low latency and strong consistency are in tension.
  • Consistency models are a spectrum, not a binary. Know what your system actually promises.
  • SPOFs are often invisible until they fail. Finding them is the first step to resilience.
  • HA and fault tolerance pursue the same goal of staying up, at different points on the cost and complexity curve.

Up next

Part 1 → Availability: What the Nines Actually Mean

We start with availability — the most quoted metric in system design and the most commonly misunderstood. What does 99.9% uptime actually mean in minutes of downtime? And why does the way you measure availability matter as much as the number itself?


Part of the System Design series. Tags: #systemdesign #distributedsystems #softwarearchitecture #backend #engineering