Series: System Design · Distributed Systems — Pillar 8 of 8

Systems Design

#	Post	What it covers
00	Distributed Systems: What Happens When Machines Disagree	Twenty concepts covering network partitions, consensus, clocks, distributed transactions, CDC, erasure coding, and observability. The final pillar.
01	Network Partitions: The Failure Mode You Can't Design Away	Network partitions are inevitable. Learn what happens when nodes can't communicate, how systems choose between availability and consistency, and what that means in practice.
02	Split-Brain: When Two Nodes Both Think They're the Leader	Split-brain occurs when two nodes both believe they're the primary. Learn how it happens, why it causes data corruption, and how STONITH and fencing prevent it.
03	Heartbeats: How Nodes Know Their Peers Are Alive	Heartbeats let nodes detect peer failures. Learn how timeouts, phi accrual failure detectors, and the tradeoff between false positives and detection speed work.
04	Leader Election: Agreeing on Who's in Charge	Leader election coordinates which node acts as primary. Learn the bully algorithm, Raft-based election, and why exactly-one-leader guarantees are hard to achieve.
05	Consensus Algorithms: Agreeing on a Value Across Failures	Consensus lets distributed nodes agree on a value despite failures. Learn what FLP impossibility means, what Paxos and Raft provide, and where consensus is used.
06	Quorum: How Many Nodes Must Agree?	Quorum determines how many nodes must agree for an operation to succeed. Learn how R + W > N ensures consistency in distributed databases like Cassandra and DynamoDB.
07	Paxos: The Algorithm That Started It All ← you are here	Paxos is the foundational distributed consensus algorithm. Learn how its two phases work, why it's hard to implement, and what systems use it in production.
08	Raft: Consensus Made Understandable	Raft makes distributed consensus understandable. Learn how leader election, log replication, and safety work in the algorithm that powers etcd, CockroachDB, and TiKV.
09	Gossip Protocol: Decentralised Cluster Communication	Gossip protocols propagate information across a cluster without a central coordinator. Learn how epidemic spreading works and where it's used in production.
10	Logical Clocks: When Physical Time Isn't Enough	Physical clocks drift and can't establish event order in distributed systems. Logical clocks track causality instead. Learn why this matters and how it works.
11	Lamport Timestamps: Ordering Events Without a Global Clock	Lamport timestamps assign logical counters to events to establish causal order in distributed systems. Learn how they work and what they can and can't tell you.
12	Vector Clocks: Knowing When Events Are Truly Concurrent	Vector clocks detect causality and concurrency in distributed systems. Learn how they work, how Dynamo uses them for conflict detection, and their limitations.
13	Distributed Transactions: When One Machine Isn't Enough	Distributed transactions are hard. Learn why cross-service atomicity is expensive, when to use it, and when eventual consistency is the right alternative.
14	Two-Phase Commit: Coordinating a Distributed Decision	2PC ensures distributed atomicity through prepare and commit phases. Learn how it works, the coordinator failure problem, and why it's rarely used in modern systems.
15	Three-Phase Commit: Solving 2PC's Blocking Problem	3PC adds a pre-commit phase to eliminate 2PC's blocking problem. Learn how it works, what assumptions it requires, and why it's rarely used in production.
16	Delivery Semantics: What Does "Delivered" Actually Mean?	Message delivery guarantees define system reliability. Learn what at-most-once, at-least-once, and exactly-once mean, what they cost, and when each is appropriate.
17	Change Data Capture: Streaming Your Database in Real Time	CDC streams database changes in real time by reading the write-ahead log. Learn how Debezium works, what CDC enables, and when to use it.
18	Erasure Coding: Fault Tolerance Without Full Replication	Erasure coding stores data across nodes using math, not full replication. Learn how Reed-Solomon works, how S3 uses it, and when it beats 3x replication.
19	Merkle Trees: Efficiently Finding What's Different	Merkle trees efficiently detect which parts of a large dataset differ between nodes. Learn how Bitcoin, Cassandra, and Git use them for verification and anti-entropy.
20	Observability: Understanding Your System at Runtime	Logs, metrics, and distributed traces are how you understand a system at runtime. Learn what each pillar provides, the tools involved, and how they work together.
21	Distributed Systems: Wrap-Up	A recap of all 20 distributed systems concepts and the complete URL shortener architecture spanning all 8 pillars. The final post in the series.

Paxos: The Algorithm That Started It All

The problem

You need five distributed nodes to agree on a single value — which server should be the leader, what the value of a configuration key should be, or whether a transaction should commit. Any node can propose a value. Any node can fail. Messages can be delayed, lost, or reordered. No node can know for certain what other nodes have decided.

This is the consensus problem. Paxos, proposed by Leslie Lamport in 1989 (published 1998), was the first algorithm to prove this problem solvable in an asynchronous network with crash failures. Everything that followed — Multi-Paxos, Raft, Zab — is built on its foundation.

The core idea

Paxos runs in two phases. In Phase 1 (Prepare), a proposer asks a majority of acceptors to promise not to accept older proposals. In Phase 2 (Accept), the proposer sends its chosen value; acceptors accept it if they haven't made a newer promise. Once a majority of acceptors accept the same value, it is chosen — irrevocably.

The analogy: reserving a meeting room

You want to book a conference room. The booking system has three administrators (acceptors). To book, you:

Phase 1 (Prepare/Promise): Call all three admins: "I'm proposing booking ID #42. Will you promise not to accept any booking with an ID lower than 42?"

If they've already promised a higher number, they say "No (and by the way, the highest I've seen is #48)"
If #42 is the highest they've seen, they say "Yes, I promise"

If a majority promise: proceed to Phase 2. If not: your booking ID is too low — try again with a higher ID.

Phase 2 (Accept/Commit): Tell the same admins: "Please record this booking for ID #42: Meeting room B, 3pm Tuesday"

If they haven't promised anything higher than #42 since Phase 1: they accept
If they have (another proposer snuck in): they reject

If a majority accept: the booking is made. It can never be unmade. If not: start over with a higher proposal number.

How Paxos works in detail

Roles

Proposer: a node that initiates a consensus round by proposing a value.

Acceptor: a node that receives proposals and accepts or rejects them. Maintains the highest proposal number it has seen and the last value it accepted.

Learner: a node that learns the decided value (the chosen value) to act on it.

In practice, nodes often play all three roles.

Phase 1: Prepare

The proposer selects a globally unique, monotonically increasing proposal number n. It sends Prepare(n) to a majority of acceptors.

Each acceptor responds with a promise: it will not accept any proposal numbered less than n. If the acceptor has previously accepted a value, it includes that value in its response (so the proposer knows about any previously accepted values it must preserve).

Proposer → Prepare(n=5) → Acceptors A, B, C
A: hasn't accepted anything → Promise(n=5, accepted=null)
B: previously accepted (n=3, v="foo") → Promise(n=5, accepted=(3, "foo"))
C: doesn't respond (failed or delayed)

Proposer receives promises from A and B (majority of 3) → can proceed.

Phase 2: Accept

The proposer picks a value:

If any acceptor reported a previously accepted value, the proposer must use the value from the highest-numbered accepted proposal (v="foo" from n=3 in the example above)
If no acceptor reported a prior accepted value, the proposer uses its own proposed value

The proposer sends Accept(n, v) to a majority.

Each acceptor accepts if it hasn't promised a higher proposal number since Phase 1.

Proposer → Accept(n=5, v="foo") → Acceptors A, B, C
A: promised n=5, no higher → Accept(5, "foo") ✓
B: promised n=5, no higher → Accept(5, "foo") ✓

When a majority accept the same (n, v), the value v is chosen (committed). It will be the decision regardless of future rounds.

Why Paxos is safe

No two values can be chosen: only one proposal can achieve majority acceptance at any given proposal number (acceptors only accept one proposal per number). Any subsequent majority must include at least one node that accepted the current value, forcing the proposer to propagate it.

The value of a chosen value is preserved: if value v was chosen in round n, any higher-numbered round's Phase 1 will discover the previously accepted v and must propose it — never a different value.

Why Paxos is hard to implement correctly

Liveness is not guaranteed. If two proposers compete with increasing proposal numbers, they can livelock indefinitely: proposer A interrupts proposer B's Phase 1 with a higher number, B does the same to A, neither completes. Randomised backoff or a distinguished proposer (Multi-Paxos) solves this.

The algorithm doesn't specify how to handle leader crashes. What happens to an in-progress proposal when the proposer fails? Paxos proves the value is safe — no conflicting decision will be made — but recovering the current state requires a new proposal round.

Multi-Paxos for log replication. Single-Decree Paxos agrees on one value. For a replicated log, you need Multi-Paxos: a stable leader that skips Phase 1 for log entries (the leader's authority was established once during its election), running only Phase 2 for each entry. This is how Google Chubby, Spanner, and Zookeeper work.

Implementation requires careful handling of: network message deduplication, acceptor state persistence (must survive crashes), leader leases, read linearizability, and cluster membership changes. Chubby's implementation note (Burrows, 2006) lists dozens of non-obvious correctness considerations.

Where Paxos is used in production

Google Chubby: distributed lock service. Uses Multi-Paxos for its replicated log. Chubby powers leader election across most of Google's infrastructure.

Google Spanner: uses Paxos groups (one Paxos instance per data shard) for globally-consistent distributed transactions.

Apache Zookeeper: uses Zab (ZooKeeper Atomic Broadcast), a Paxos variant, for its replicated log. Zookeeper coordinates Kafka, HBase, and many other distributed systems.

CockroachDB: uses Multi-Raft (Raft, which is a more understandable reformulation of Multi-Paxos) for its replicated key-value store.

The one thing to remember

Paxos guarantees that a distributed system agrees on exactly one value by ensuring that any majority of acceptors that accepts a value has already promised not to accept conflicting values. The two-phase structure — prepare (gather promises) and accept (commit the value) — is the foundation of all practical consensus algorithms. Paxos is correct but notoriously difficult to implement; this is why Raft was created to express the same guarantees more clearly.

← Previous: Quorum — the minimum vote count that makes consensus safe, and how choosing the right quorum size balances consistency against availability.

→ Next: Raft — the consensus algorithm designed for understandability; the most important one to know in practice.

Paxos: The Algorithm That Started It All

Systems Design

Paxos: The Algorithm That Started It All

The problem

The core idea

The analogy: reserving a meeting room