Series: System Design · Distributed Systems — Pillar 8 of 8

Systems Design

#	Post	What it covers
00	Distributed Systems: What Happens When Machines Disagree	Twenty concepts covering network partitions, consensus, clocks, distributed transactions, CDC, erasure coding, and observability. The final pillar.
01	Network Partitions: The Failure Mode You Can't Design Away	Network partitions are inevitable. Learn what happens when nodes can't communicate, how systems choose between availability and consistency, and what that means in practice.
02	Split-Brain: When Two Nodes Both Think They're the Leader	Split-brain occurs when two nodes both believe they're the primary. Learn how it happens, why it causes data corruption, and how STONITH and fencing prevent it.
03	Heartbeats: How Nodes Know Their Peers Are Alive	Heartbeats let nodes detect peer failures. Learn how timeouts, phi accrual failure detectors, and the tradeoff between false positives and detection speed work.
04	Leader Election: Agreeing on Who's in Charge	Leader election coordinates which node acts as primary. Learn the bully algorithm, Raft-based election, and why exactly-one-leader guarantees are hard to achieve.
05	Consensus Algorithms: Agreeing on a Value Across Failures ← you are here	Consensus lets distributed nodes agree on a value despite failures. Learn what FLP impossibility means, what Paxos and Raft provide, and where consensus is used.
06	Quorum: How Many Nodes Must Agree?	Quorum determines how many nodes must agree for an operation to succeed. Learn how R + W > N ensures consistency in distributed databases like Cassandra and DynamoDB.
07	Paxos: The Algorithm That Started It All	Paxos is the foundational distributed consensus algorithm. Learn how its two phases work, why it's hard to implement, and what systems use it in production.
08	Raft: Consensus Made Understandable	Raft makes distributed consensus understandable. Learn how leader election, log replication, and safety work in the algorithm that powers etcd, CockroachDB, and TiKV.
09	Gossip Protocol: Decentralised Cluster Communication	Gossip protocols propagate information across a cluster without a central coordinator. Learn how epidemic spreading works and where it's used in production.
10	Logical Clocks: When Physical Time Isn't Enough	Physical clocks drift and can't establish event order in distributed systems. Logical clocks track causality instead. Learn why this matters and how it works.
11	Lamport Timestamps: Ordering Events Without a Global Clock	Lamport timestamps assign logical counters to events to establish causal order in distributed systems. Learn how they work and what they can and can't tell you.
12	Vector Clocks: Knowing When Events Are Truly Concurrent	Vector clocks detect causality and concurrency in distributed systems. Learn how they work, how Dynamo uses them for conflict detection, and their limitations.
13	Distributed Transactions: When One Machine Isn't Enough	Distributed transactions are hard. Learn why cross-service atomicity is expensive, when to use it, and when eventual consistency is the right alternative.
14	Two-Phase Commit: Coordinating a Distributed Decision	2PC ensures distributed atomicity through prepare and commit phases. Learn how it works, the coordinator failure problem, and why it's rarely used in modern systems.
15	Three-Phase Commit: Solving 2PC's Blocking Problem	3PC adds a pre-commit phase to eliminate 2PC's blocking problem. Learn how it works, what assumptions it requires, and why it's rarely used in production.
16	Delivery Semantics: What Does "Delivered" Actually Mean?	Message delivery guarantees define system reliability. Learn what at-most-once, at-least-once, and exactly-once mean, what they cost, and when each is appropriate.
17	Change Data Capture: Streaming Your Database in Real Time	CDC streams database changes in real time by reading the write-ahead log. Learn how Debezium works, what CDC enables, and when to use it.
18	Erasure Coding: Fault Tolerance Without Full Replication	Erasure coding stores data across nodes using math, not full replication. Learn how Reed-Solomon works, how S3 uses it, and when it beats 3x replication.
19	Merkle Trees: Efficiently Finding What's Different	Merkle trees efficiently detect which parts of a large dataset differ between nodes. Learn how Bitcoin, Cassandra, and Git use them for verification and anti-entropy.
20	Observability: Understanding Your System at Runtime	Logs, metrics, and distributed traces are how you understand a system at runtime. Learn what each pillar provides, the tools involved, and how they work together.
21	Distributed Systems: Wrap-Up	A recap of all 20 distributed systems concepts and the complete URL shortener architecture spanning all 8 pillars. The final post in the series.

Consensus Algorithms: Agreeing on a Value Across Failures

The problem

Distributed systems need to agree on things. Which value should a configuration key have? Which node is the current leader? In what order should a sequence of writes be applied? What is the committed state of a transaction?

In a single-machine system, "agree" means "whatever the program says." In a distributed system with multiple nodes that can fail, agreeing becomes a formal problem: how do N nodes reach agreement on a single value, given that some nodes may crash, messages may be delayed, and no node can know for certain what the others have decided?

This is the consensus problem, and it has a family of solutions — Paxos, Raft, Zab, Multi-Paxos — each trading clarity, performance, and operational complexity differently.

The core idea

A consensus algorithm allows a group of nodes to agree on a value such that: all non-faulty nodes eventually decide on the same value (agreement), the decided value was proposed by some node (validity), and every node that decides, decides exactly once (termination). This must hold even if up to ⌊(N-1)/2⌋ nodes fail.

The analogy: a committee voting on a resolution

A committee must pass a resolution. Rules: a resolution passes only if a majority votes for it; no one can change their vote once cast; any committee member who doesn't respond is assumed absent (failed). The chairperson (proposer) proposes a resolution, collects votes, and declares it passed if a majority agree.

The difficulty: some members may receive the proposal late, vote on a different proposal, or fail mid-vote. The consensus algorithm is the set of rules that guarantees the committee eventually reaches a decision that a majority agreed to, despite these complications.

What consensus algorithms provide

The replicated state machine

The most important application of consensus: building a replicated state machine (RSM). An RSM is a service that:

Maintains state (a key-value store, a configuration registry, a transaction log)
Receives commands from clients
Applies commands in a consistent order on all replicas
Provides strongly consistent reads and writes despite node failures

The consensus algorithm is how all replicas agree on the same order of commands. If replica A applies writes as [w1, w2, w3] and replica B applies them as [w2, w1, w3], they end up with different states. Consensus ensures all replicas apply writes in the same order.

Used by: etcd (Kubernetes configuration), ZooKeeper (Kafka coordination), CockroachDB, TiKV, Google Spanner's Paxos groups.

Properties

Safety: all nodes decide the same value. A consensus algorithm that allows different nodes to decide different values is broken. Safety is unconditional — it must hold even under network partitions.

Liveness: all non-faulty nodes eventually decide. A consensus algorithm that blocks forever is useless. Liveness requires a functioning majority.

The FLP impossibility result (Fischer, Lynch, Paterson, 1985) proves that no purely asynchronous distributed system can guarantee consensus in the presence of even one failed node. This is not a limitation of algorithm design — it's a mathematical proof. In practice, consensus algorithms work around this by assuming partially synchronous networks (timeouts are bounded) and using randomisation or leader-based approaches to break tie states.

The consensus algorithm family

Single-Decree Paxos

The original Paxos (Lamport, 1989) solves consensus for a single value: how do nodes agree on one value? It runs in two phases: prepare (a proposer asks acceptors to promise not to accept older proposals) and accept (the proposer sends the value; acceptors accept if they haven't promised a newer proposal).

Single-Decree Paxos is correct but impractical for logs — running a new Paxos round for every log entry is expensive.

Multi-Paxos

Extends single-decree Paxos to agree on a sequence of values (a log). A stable leader eliminates the prepare phase for subsequent entries: the leader appends log entries directly, running only the accept phase. When the leader fails, a new leader runs a full Paxos election to establish its authority, then continues appending.

Multi-Paxos is used by Google Chubby, Google Spanner, and is the theoretical basis for most production consensus implementations.

Raft

Raft (Ongaro and Ousterhout, 2014) was explicitly designed to be understandable — Paxos is notorious for being difficult to reason about. Raft decomposes consensus into leader election, log replication, and safety. Covered in depth in post 08.

Zab (ZooKeeper Atomic Broadcast)

Zab is ZooKeeper's consensus protocol. Similar to Raft in its leader-based approach. It provides total order broadcast: all delivered messages are delivered in the same order to all processes. Used exclusively by ZooKeeper.

When is consensus needed vs not needed?

Consensus is needed for:

Leader election (exactly one leader must be agreed upon)
Distributed configuration (all nodes must agree on the current config)
Ordered log replication (replicated databases, Kafka controller)
Distributed transactions (commit or abort must be agreed upon)

Consensus is NOT needed for:

Eventual consistency workloads (Cassandra, DynamoDB) — these accept different nodes seeing different values temporarily
Best-effort coordination (Gossip protocol membership)
Read-heavy workloads with relaxed consistency requirements

Consensus algorithms are expensive — they require multiple round trips between nodes and at least a majority of nodes to be available and reachable. They should be applied precisely where their guarantees are required, not everywhere.

Tradeoffs

Throughput vs latency vs fault tolerance. A consensus algorithm requires at least one round-trip (often two) for each committed value. For a 5ms cross-region RTT, each consensus round takes at minimum 5ms. For a 1ms intra-datacenter RTT, 1ms. This is the fundamental latency floor of consensus-based systems.

Fault tolerance requires more nodes. Tolerating f failures requires 2f+1 nodes. Tolerating 2 failures requires 5 nodes. More nodes means more messages per consensus round and higher coordination overhead.

Leader bottleneck. Leader-based consensus (Raft, Multi-Paxos) routes all writes through one node. Under high write load, the leader becomes the bottleneck. Systems like CockroachDB mitigate this by running multiple independent Raft groups (one per range), each with its own leader.

The one thing to remember

Consensus algorithms allow distributed nodes to agree on a value — or a sequence of values — despite a minority of failures. The key guarantee is that all non-faulty nodes decide the same thing (safety), and that a decision is eventually reached (liveness) as long as a majority is available. Raft is the most important consensus algorithm to understand in practice: it powers etcd, CockroachDB, TiKV, and most modern distributed databases. Consensus is expensive — use it where strong consistency is genuinely required, and accept eventual consistency everywhere else.

← Previous: Leader Election — when a leader fails, the cluster must agree on a new one; this post covers how that agreement is reached safely.

→ Next: Quorum — the minimum vote count that makes consensus safe, and how choosing the right quorum size balances consistency against availability.

Consensus Algorithms: Agreeing on a Value Across Failures

Systems Design

Consensus Algorithms: Agreeing on a Value Across Failures

The problem

The core idea

The analogy: a committee voting on a resolution