Merkle Trees: Efficiently Finding What's Different

Series: System Design · Distributed Systems — Pillar 8 of 8
Systems Design
| # | Post | What it covers |
|---|---|---|
| 00 | Distributed Systems: What Happens When Machines Disagree | Twenty concepts covering network partitions, consensus, clocks, distributed transactions, CDC, erasure coding, and observability. The final pillar. |
| 01 | Network Partitions: The Failure Mode You Can't Design Away | Network partitions are inevitable. Learn what happens when nodes can't communicate, how systems choose between availability and consistency, and what that means in practice. |
| 02 | Split-Brain: When Two Nodes Both Think They're the Leader | Split-brain occurs when two nodes both believe they're the primary. Learn how it happens, why it causes data corruption, and how STONITH and fencing prevent it. |
| 03 | Heartbeats: How Nodes Know Their Peers Are Alive | Heartbeats let nodes detect peer failures. Learn how timeouts, phi accrual failure detectors, and the tradeoff between false positives and detection speed work. |
| 04 | Leader Election: Agreeing on Who's in Charge | Leader election coordinates which node acts as primary. Learn the bully algorithm, Raft-based election, and why exactly-one-leader guarantees are hard to achieve. |
| 05 | Consensus Algorithms: Agreeing on a Value Across Failures | Consensus lets distributed nodes agree on a value despite failures. Learn what FLP impossibility means, what Paxos and Raft provide, and where consensus is used. |
| 06 | Quorum: How Many Nodes Must Agree? | Quorum determines how many nodes must agree for an operation to succeed. Learn how R + W > N ensures consistency in distributed databases like Cassandra and DynamoDB. |
| 07 | Paxos: The Algorithm That Started It All | Paxos is the foundational distributed consensus algorithm. Learn how its two phases work, why it's hard to implement, and what systems use it in production. |
| 08 | Raft: Consensus Made Understandable | Raft makes distributed consensus understandable. Learn how leader election, log replication, and safety work in the algorithm that powers etcd, CockroachDB, and TiKV. |
| 09 | Gossip Protocol: Decentralised Cluster Communication | Gossip protocols propagate information across a cluster without a central coordinator. Learn how epidemic spreading works and where it's used in production. |
| 10 | Logical Clocks: When Physical Time Isn't Enough | Physical clocks drift and can't establish event order in distributed systems. Logical clocks track causality instead. Learn why this matters and how it works. |
| 11 | Lamport Timestamps: Ordering Events Without a Global Clock | Lamport timestamps assign logical counters to events to establish causal order in distributed systems. Learn how they work and what they can and can't tell you. |
| 12 | Vector Clocks: Knowing When Events Are Truly Concurrent | Vector clocks detect causality and concurrency in distributed systems. Learn how they work, how Dynamo uses them for conflict detection, and their limitations. |
| 13 | Distributed Transactions: When One Machine Isn't Enough | Distributed transactions are hard. Learn why cross-service atomicity is expensive, when to use it, and when eventual consistency is the right alternative. |
| 14 | Two-Phase Commit: Coordinating a Distributed Decision | 2PC ensures distributed atomicity through prepare and commit phases. Learn how it works, the coordinator failure problem, and why it's rarely used in modern systems. |
| 15 | Three-Phase Commit: Solving 2PC's Blocking Problem | 3PC adds a pre-commit phase to eliminate 2PC's blocking problem. Learn how it works, what assumptions it requires, and why it's rarely used in production. |
| 16 | Delivery Semantics: What Does "Delivered" Actually Mean? | Message delivery guarantees define system reliability. Learn what at-most-once, at-least-once, and exactly-once mean, what they cost, and when each is appropriate. |
| 17 | Change Data Capture: Streaming Your Database in Real Time | CDC streams database changes in real time by reading the write-ahead log. Learn how Debezium works, what CDC enables, and when to use it. |
| 18 | Erasure Coding: Fault Tolerance Without Full Replication | Erasure coding stores data across nodes using math, not full replication. Learn how Reed-Solomon works, how S3 uses it, and when it beats 3x replication. |
| 19 | Merkle Trees: Efficiently Finding What's Different ← you are here | Merkle trees efficiently detect which parts of a large dataset differ between nodes. Learn how Bitcoin, Cassandra, and Git use them for verification and anti-entropy. |
| 20 | Observability: Understanding Your System at Runtime | Logs, metrics, and distributed traces are how you understand a system at runtime. Learn what each pillar provides, the tools involved, and how they work together. |
| 21 | Distributed Systems: Wrap-Up | A recap of all 20 distributed systems concepts and the complete URL shortener architecture spanning all 8 pillars. The final post in the series. |
Merkle Trees: Efficiently Finding What's Different
The problem
Your Cassandra cluster has two replicas of the same partition. After a network partition healed, you're not sure they're in sync — some writes may have gone to only one replica during the partition. You need to identify which records differ so you can repair the inconsistency.
The brute-force approach: send every row from Replica A to Replica B, compare each one. For a partition with 100 million rows, that's 100 million rows transferred over the network — enormous I/O just to find a handful of diverged records.
A smarter approach: compare summaries of the data rather than the data itself. If the summary of all 100 million rows is the same on both replicas, no comparison needed. If they differ, recursively drill into subsections until you find the specific rows that diverged.
This is the Merkle tree.
The core idea
A Merkle tree is a binary tree where each leaf node contains the hash of a data block, and each non-leaf node contains the hash of its children's hashes. The root hash is a single value that represents the entire dataset: if any data block changes, the hash propagates up through the tree, changing the root hash. Two datasets with the same root hash have identical content; different root hashes mean something differs, and the tree structure reveals exactly where.
The analogy: a library's section catalogue
A librarian checks whether two library branches have identical collections. Rather than comparing every book:
- Compare the hash of the entire "Science Fiction" section. Same? No need to check individual SF books.
- Compare the hash of "Classic SF". Different → drill deeper.
- Compare "Classic SF, A–L". Same. Compare "Classic SF, M–Z". Different → drill deeper.
- Identify the specific shelf, then the specific book that differs.
Each level narrows the search. Instead of comparing every book (O(n)), you do O(log n) comparisons to find each differing item.
How Merkle trees work
Construction
Data blocks: [D1, D2, D3, D4, D5, D6, D7, D8]
Leaf nodes (hash of each block):
H1=hash(D1), H2=hash(D2), H3=hash(D3), H4=hash(D4)
H5=hash(D5), H6=hash(D6), H7=hash(D7), H8=hash(D8)
Level 2:
H12=hash(H1+H2), H34=hash(H3+H4)
H56=hash(H5+H6), H78=hash(H7+H8)
Level 1:
H1234=hash(H12+H34), H5678=hash(H56+H78)
Root:
H_root=hash(H1234+H5678)
Two datasets with the same root hash are guaranteed identical (assuming collision-resistant hashes). Two datasets with different root hashes differ in at least one block.
Efficient comparison
To find which blocks differ between two Merkle trees:
- Compare root hashes. Same → done (all data identical). Different → proceed.
- Compare the two children of the root. Find which subtree(s) differ.
- Recursively compare the differing subtrees.
- Continue until leaf nodes — the specific differing blocks are identified.
In a balanced binary tree with n leaves, this takes O(k log n) comparisons, where k is the number of differing blocks. For a dataset with 100 million blocks and only 100 differing, this is dramatically fewer comparisons than comparing all 100 million.
Where Merkle trees are used
Cassandra anti-entropy repair
Cassandra's nodetool repair command uses Merkle trees to synchronise replicas. For a given token range:
- Each replica builds a Merkle tree of its data for that range
- Replicas exchange their root hashes (gossip)
- If hashes differ, replicas compare trees recursively to find differing rows
- Only the differing rows are transferred and reconciled
Without Merkle trees, repair would require comparing every row — too expensive for multi-terabyte partitions. With Merkle trees, only divergent rows are identified and repaired.
Git
Git's internal object model is a Merkle tree. Every file is stored as a "blob" object (content + hash). Every directory is a "tree" object containing hashes of its children (files and subdirectories). Every commit is a "commit" object containing the hash of the root tree.
If two commits have different root tree hashes, something in the codebase changed. Finding what changed means traversing the tree to find which subtrees (directories) differ — O(log n) comparisons instead of comparing every file.
Bitcoin blockchain
Each block in the Bitcoin blockchain contains a Merkle root of all transactions in that block. To verify that a specific transaction is included in a block, you need only log(n) hashes (the "Merkle proof") rather than all n transactions. Light clients (mobile wallets) use this to verify transactions without downloading the full blockchain.
DynamoDB/Cassandra replication synchronisation
Both use variants of Merkle trees (hash trees) for comparing replica state and identifying diverged segments for repair.
Tradeoffs
Tree construction cost. Building a Merkle tree over a large dataset requires hashing every data block. For a 1 TB partition, this is significant I/O and CPU — Cassandra's repair process is resource-intensive and should be run during low-traffic periods.
Freshness vs accuracy. The tree is a snapshot at construction time. If data is changing rapidly while the tree is being built, the tree may be immediately stale. Cassandra limits repair scope to control this.
Depth vs granularity. A deeper tree finds differences with more precision (smaller data blocks per leaf) but requires more comparison steps. The right tree depth depends on the ratio of total data to expected number of differences.
The one thing to remember
A Merkle tree makes comparing large datasets efficient: the root hash proves identity; differing subtrees narrow the search. Instead of comparing every record (O(n)), you compare log(n) hashes to identify exactly which data blocks differ. This is how Cassandra finds inconsistent replicas without transferring terabytes of data, how Git tracks repository changes, and how Bitcoin verifies transactions without a full blockchain download.
← Previous: Erasure Coding — storing data redundantly using mathematics rather than full replication.
→ Next: Observability — structured logs, metrics, and distributed tracing: the tools that make distributed systems understandable at runtime.



