Skip to main content

Command Palette

Search for a command to run...

LSM Trees: Why Some Databases Are Built for Writes

Updated
12 min read
LSM Trees: Why Some Databases Are Built for Writes

Series: System Design · Data & Storage — Pillar 4 of 8

Systems Design

# Post What it covers
00 Data & Storage: Where Everything Lives Where data lives shapes everything about a system. Nineteen concepts covering databases, indexing, sharding, replication, and the data structures underneath. (161 chars)
01 SQL vs NoSQL: Choosing the Right Database SQL vs NoSQL isn't a simple choice. Learn what each type optimises for, when to use relational databases, and when NoSQL is the right call.
02 Database Indexing: The Highest-Leverage Performance Tool Indexes are the highest-leverage database performance tool. Learn how they work, what they cost, and how to decide when to add one.
03 B-Trees & B+ Trees: The Data Structure Behind Database Indexes Almost every database index is built on a B-tree or B+ tree. Learn how they work, why they're fast, and what this means for your queries.
04 LSM Trees: Why Some Databases Are Built for Writes ← you are here LSM trees power Cassandra, RocksDB, and LevelDB. Learn how they achieve massive write throughput and what they trade off to get it.
05 Denormalisation: Trading Storage for Speed Denormalisation trades storage for read speed by pre-computing joins. Learn when it helps, when it hurts, and how to do it safely.
06 Database Sharding: Scaling Beyond a Single Node Sharding splits a database across multiple nodes. Learn how it works, the strategies available, and the significant tradeoffs it introduces.
07 Data Partitioning: Choosing How to Divide Your Data Range, hash, and list partitioning each make different tradeoffs. Learn how to divide data effectively for queries, maintenance, and scale.
08 Consistent Hashing: Minimising Resharding Pain Consistent hashing minimises data movement when nodes are added or removed. Learn how it works and why it's fundamental to distributed systems.
09 Replication & Read Replicas: Scaling Reads and Surviving Failures Replication copies data across nodes for fault tolerance and read scaling. Learn how primary-replica setups work and when to use them.
10 Object Storage: Unlimited Scale for Large Binary Data Object storage handles large binary files at unlimited scale. Learn how it works, why it replaced file servers, and when to use it.
11 Block vs File vs Object Storage: Three Models, Three Use Cases Three storage models, three different use cases. Learn what block, file, and object storage optimise for and how to choose between them.
12 Distributed File Systems: File Storage Across Many Machines Distributed file systems spread file storage across many machines. Learn how HDFS, Ceph, and GlusterFS work and when to use them.
13 Time Series Databases: Built for Metrics and Events Time series databases handle append-heavy metric data far better than SQL. Learn how they work and when to use InfluxDB, Prometheus, or TimescaleDB.
14 Vector Databases: Semantic Search and AI Memory Vector databases power semantic search, recommendations, and LLM memory. Learn how embeddings work, what ANN search is, and when to use one.
15 Full-Text Search Engines: Beyond SQL LIKE Full-text search needs more than SQL LIKE. Learn how inverted indexes, relevance ranking, and Elasticsearch make text search fast and powerful.
16 Materialized Views: Pre-Computing Expensive Queries Materialized views cache expensive query results as physical tables. Learn how they work, when to refresh them, and when to use them vs other approaches.
17 Query Optimisation: From Slow to Fast Slow queries aren't always fixed by adding indexes. Learn how to read EXPLAIN output, understand query plans, and systematically make queries fast.
18 Connection Pooling: Managing the Hidden Bottleneck Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and how to size your pool correctly.
19 Data & Storage: Wrap-Up A recap of all 19 data storage concepts: SQL, NoSQL, indexing, sharding, replication, specialised databases, and how they connect in a real system.

LSM Trees: Why Some Databases Are Built for Writes

The problem

Your URL shortener now records every click event: timestamp, IP, country, device, referer. At peak, that's 50,000 click events per second. Every event must be written to storage — immediately, durably, at sustained high volume.

PostgreSQL with B+ tree indexes handles 50,000 writes per second, but only just. The problem is that B+ tree writes aren't sequential — each insert navigates the tree to find the right position, then modifies a page somewhere on disk. Under high write load, this random I/O saturates disk throughput, write latency climbs, and the database falls behind.

You move the click events to Cassandra. Write latency drops by 80%. Cassandra can handle 500,000 writes per second on the same hardware. What is Cassandra doing differently?

The answer is the LSM tree — a data structure that turns random writes into sequential writes, at the cost of making reads somewhat more complex.


The core idea

An LSM (Log-Structured Merge) tree batches writes in memory, flushes them to disk as sorted, immutable files, and periodically merges those files in the background. Writes are always sequential (appends), never random updates — which is dramatically faster on both spinning disks and SSDs.


The analogy: a journal and a filing cabinet

Imagine an office where documents need to be filed. Two approaches:

The B+ tree way (filing cabinet): every new document is immediately filed in the correct alphabetical position in the cabinet. Fast to retrieve (it's always exactly where it should be), but slow to file — you have to open the right drawer, find the right position, shift other documents, insert the new one.

The LSM way (journal + periodic filing): every new document goes into a journal on your desk (fast — just write at the end). When the journal fills up, you sort it and file the batch alphabetically into the cabinet in one sweep. The cabinet accumulates multiple sorted batches. Retrieving something means checking the journal, then the most recent batch in the cabinet, then older batches — more work to read, but writing is always just appending to the journal.

Periodically, you merge the oldest batches in the cabinet together (compaction) — sorting and combining them into a single, cleaner filing. This keeps the number of batches manageable.


How it works

The three-layer structure

1. Memtable (in-memory, mutable)

Writes go here first. The memtable is an in-memory sorted data structure (usually a red-black tree or skiplist) that accepts writes and maintains keys in sorted order.

Write: click_event(link_id="x7Kp2", at="2025-06-01T12:00:00", country="AU")
→ Insert into memtable at the correct sorted position (in memory, fast)

The write is also appended to a Write-Ahead Log (WAL) on disk before the memtable is modified — this ensures durability if the process crashes before the memtable is flushed.

2. SSTables (on-disk, immutable)

When the memtable reaches a size threshold (typically 64MB–256MB), it's flushed to disk as a Sorted String Table (SSTable) — an immutable, sorted file containing all key-value pairs from the memtable.

Memtable flush:
(x7Kp2, 12:00:00) → (x7Kp2, 12:00:05) → (x7Kp2, 12:01:00) → (yz9A1, 12:00:01)
                  Written as a sorted, immutable file on disk

Once an SSTable is written, it is never modified. Updates and deletes are handled by writing new entries, not by modifying old ones.

  • Updates: a new entry with the same key and a higher timestamp. When reading, the most recent entry for a key wins.
  • Deletes: a "tombstone" entry — a marker indicating this key has been deleted. During compaction, tombstones remove both the tombstone and the original entry.

3. Compaction

Over time, many SSTables accumulate on disk. This has two problems: reads must check multiple SSTables to find a key's most recent value, and disk space is wasted by old versions and tombstones.

Compaction merges SSTables together. It reads multiple SSTables, merges them (like merge sort), resolves conflicts (latest timestamp wins), removes tombstones paired with their targets, and writes a new, smaller set of SSTables.

Before compaction: 8 SSTables, some containing outdated versions of the same keys
After compaction:  3 SSTables, each containing only the latest version of each key

Compaction is a background process. It uses CPU and I/O, which is why Cassandra's compaction strategy is configurable and why it's one of the main operational knobs to tune.

A write operation

  1. Append to WAL (sequential disk write — fast)
  2. Insert into memtable (in-memory — very fast)
  3. Acknowledge the write to the client

The disk write is sequential (appending to the WAL). There's no page lookup, no tree traversal, no random I/O. This is why LSM trees sustain dramatically higher write throughput than B+ trees.

A read operation

  1. Check the memtable (in-memory — very fast)
  2. Check the most recent SSTable (using a bloom filter to skip quickly if the key isn't present)
  3. Check older SSTables in reverse chronological order
  4. Return the most recent value found (or "not found" if only tombstones are found)

Reads are more complex than in a B+ tree. In the worst case, a read must check the memtable and every SSTable on disk. In practice, bloom filters eliminate most unnecessary SSTable checks (a bloom filter can tell you with certainty that a key is not in an SSTable, avoiding that I/O entirely), and compaction keeps the number of SSTables manageable.

Bloom filters: the read optimisation

A bloom filter is a probabilistic data structure that can answer "is key X definitely not in this SSTable?" in constant time. Each SSTable has a corresponding bloom filter. Before checking an SSTable, the read path checks its bloom filter:

  • Bloom filter says "definitely not here" → skip this SSTable entirely
  • Bloom filter says "might be here" → check the SSTable (there may be false positives, but never false negatives)

On a workload with high key selectivity, bloom filters eliminate 90%+ of unnecessary SSTable reads, making LSM reads much closer to B+ tree reads in practice.


Compaction strategies

Different workloads need different compaction approaches. Cassandra offers three main strategies:

Size-Tiered Compaction (STCS): merge SSTables of similar sizes together. Good for write-heavy workloads. Creates fewer, larger SSTables over time. Read amplification increases between compaction cycles.

Leveled Compaction (LCS): organise SSTables into levels, where each level has a fixed size limit and non-overlapping key ranges. Better read performance than STCS. Higher write amplification (more compaction work per byte written). Used when read performance matters more than write performance.

Time-Window Compaction (TWCS): designed for time-series data. Compacts SSTables within a time window together; doesn't merge across windows. When old data expires, entire SSTables can be dropped without reading and rewriting them. Ideal for the URL shortener's click events table.


The tradeoffs

Write amplification. Data is written once to the WAL, once to the memtable (in-memory), once to an SSTable, and possibly many times during compaction. Each key may be written to disk 5–30× before it settles into a stable level. This is called write amplification. B+ trees have less write amplification but more random I/O per write.

Read amplification. Finding a key may require checking the memtable and multiple SSTables. Bloom filters and compaction reduce this, but reads are still more complex than in a B+ tree index, which stores all data in a single sorted structure.

Space amplification. Multiple versions of a key exist across SSTables until compaction cleans them up. A key that's updated frequently occupies disk space proportional to the number of versions between compaction cycles.

Compaction I/O. Compaction is expensive — it reads SSTables from disk, merges them, and writes new SSTables. On a busy node, compaction can consume 30–50% of disk I/O bandwidth, affecting latency.

The LSM tree's fundamental trade: sequential writes (fast, high throughput, SSD-friendly) at the cost of read and space complexity managed by compaction.


When LSM trees are the right choice

High write throughput: 100,000+ writes/second sustained, where B+ tree random I/O becomes a bottleneck. Cassandra, RocksDB.

Time-series and event data: append-mostly workloads where old data is rarely updated or queried by individual key. LSM's time-window compaction strategy makes this especially efficient.

Write-optimised storage layers: RocksDB powers dozens of systems as a storage engine — Kafka for log segments, TiKV for distributed KV, CockroachDB's storage layer.

Not a good fit for:

  • Point-in-time reads of frequently-updated data — multiple SSTables must be checked, and compaction lag means stale versions persist longer
  • Low-latency reads with no write volume — B+ trees read faster when there's no write throughput to justify the LSM trade
  • Ad-hoc queries across arbitrary columns — LSM stores key-value or row data efficiently; complex queries still require a query layer on top

The one thing to remember

LSM trees make writes fast by making them always sequential: everything goes to the journal (memtable) first, then gets sorted and filed (SSTable) in the background. The cost is that reads must check multiple files until compaction consolidates them. If your workload is write-heavy and you can afford slightly more complex reads, an LSM-based store will outperform B+ tree storage by a large margin.


← Previous: B-Trees & B+ Trees: The Data Structure Behind Database Indexes — Almost every database index is built on a B-tree or B+ tree. Learn how they work, why they're fast, and what this mea...

→ Next: Denormalisation: Trading Storage for Speed — Denormalisation trades storage for read speed by pre-computing joins. Learn when it helps, when it hurts, and how to...

Systems Design

Part 47 of 50

Understanding these system design concepts is essential for architects, developers, and engineers to create scalable, reliable, and maintainable software systems that meet the needs of businesses.

Up next

Denormalisation: Trading Storage for Speed

Series: System Design · Data & Storage — Pillar 4 of 8 Systems Design # Post What it covers 00 Data & Storage: Where Everything Lives Where data lives shapes everything about a system. Nineteen

More from this blog

Cloud Tuned

751 posts

Your starting point for anything cloud: AWS, Azure, GCP, Serverless, Architecture, Hybrid Cloud, Systems Design and other Information Technology topics.