Series: System Design · Data & Storage — Pillar 4 of 8

Systems Design

#	Post	What it covers
00	Data & Storage: Where Everything Lives	Where data lives shapes everything about a system. Nineteen concepts covering databases, indexing, sharding, replication, and the data structures underneath. (161 chars)
01	SQL vs NoSQL: Choosing the Right Database	SQL vs NoSQL isn't a simple choice. Learn what each type optimises for, when to use relational databases, and when NoSQL is the right call.
02	Database Indexing: The Highest-Leverage Performance Tool ← you are here	Indexes are the highest-leverage database performance tool. Learn how they work, what they cost, and how to decide when to add one.
03	B-Trees & B+ Trees: The Data Structure Behind Database Indexes	Almost every database index is built on a B-tree or B+ tree. Learn how they work, why they're fast, and what this means for your queries.
04	LSM Trees: Why Some Databases Are Built for Writes	LSM trees power Cassandra, RocksDB, and LevelDB. Learn how they achieve massive write throughput and what they trade off to get it.
05	Denormalisation: Trading Storage for Speed	Denormalisation trades storage for read speed by pre-computing joins. Learn when it helps, when it hurts, and how to do it safely.
06	Database Sharding: Scaling Beyond a Single Node	Sharding splits a database across multiple nodes. Learn how it works, the strategies available, and the significant tradeoffs it introduces.
07	Data Partitioning: Choosing How to Divide Your Data	Range, hash, and list partitioning each make different tradeoffs. Learn how to divide data effectively for queries, maintenance, and scale.
08	Consistent Hashing: Minimising Resharding Pain	Consistent hashing minimises data movement when nodes are added or removed. Learn how it works and why it's fundamental to distributed systems.
09	Replication & Read Replicas: Scaling Reads and Surviving Failures	Replication copies data across nodes for fault tolerance and read scaling. Learn how primary-replica setups work and when to use them.
10	Object Storage: Unlimited Scale for Large Binary Data	Object storage handles large binary files at unlimited scale. Learn how it works, why it replaced file servers, and when to use it.
11	Block vs File vs Object Storage: Three Models, Three Use Cases	Three storage models, three different use cases. Learn what block, file, and object storage optimise for and how to choose between them.
12	Distributed File Systems: File Storage Across Many Machines	Distributed file systems spread file storage across many machines. Learn how HDFS, Ceph, and GlusterFS work and when to use them.
13	Time Series Databases: Built for Metrics and Events	Time series databases handle append-heavy metric data far better than SQL. Learn how they work and when to use InfluxDB, Prometheus, or TimescaleDB.
14	Vector Databases: Semantic Search and AI Memory	Vector databases power semantic search, recommendations, and LLM memory. Learn how embeddings work, what ANN search is, and when to use one.
15	Full-Text Search Engines: Beyond SQL LIKE	Full-text search needs more than SQL LIKE. Learn how inverted indexes, relevance ranking, and Elasticsearch make text search fast and powerful.
16	Materialized Views: Pre-Computing Expensive Queries	Materialized views cache expensive query results as physical tables. Learn how they work, when to refresh them, and when to use them vs other approaches.
17	Query Optimisation: From Slow to Fast	Slow queries aren't always fixed by adding indexes. Learn how to read EXPLAIN output, understand query plans, and systematically make queries fast.
18	Connection Pooling: Managing the Hidden Bottleneck	Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and how to size your pool correctly.
19	Data & Storage: Wrap-Up	A recap of all 19 data storage concepts: SQL, NoSQL, indexing, sharding, replication, specialised databases, and how they connect in a real system.

Database Indexing: The Highest-Leverage Performance Tool

The problem

Your URL shortener's redirect endpoint has been fast since day one. A user hits sho.rt/x7Kp2, your app looks up the long URL in PostgreSQL, and redirects in under 10ms.

Six months later, the links table has grown to 50 million rows. Suddenly your redirect latency is 3–4 seconds. Users are bouncing before the redirect completes. Your database is showing 100% CPU utilisation.

You haven't changed a line of application code. The query is identical. What changed is that PostgreSQL is now scanning 50 million rows to find one link — and doing it thousands of times per second.

You add a single index. Redirect latency drops back to under 10ms. CPU drops to 12%. The whole incident takes 15 minutes to fix.

This is the power of indexing. It is also a lesson in what happens when you don't have one.

The core idea

An index is a separate data structure — maintained by the database alongside your table — that enables fast lookup without scanning every row. It trades write overhead and storage space for dramatically faster reads.

Without an index, finding a row means scanning from the beginning of the table until a matching row is found (or the whole table is scanned). With an index on the column being queried, the database can jump directly to the right location.

The analogy: a book index vs reading cover to cover

A 500-page textbook contains one reference to "consistent hashing" somewhere. Finding it without the index at the back means reading every page — a full table scan. With the index at the back, you flip to "C", find "consistent hashing → page 312", and go directly there.

The index at the back of the book takes up space (a few pages). Creating it required work (someone had to index every term). Every time a new edition is printed, the index must be updated. But for anyone who reads the book, the index pays for itself many times over.

Database indexes work the same way. They take up storage, they slow down writes (because the index must be updated every time a row is inserted or modified), and they cost time to build initially. For the right queries, they're one of the best investments in your database schema.

How it works

The full table scan

Without an index, a query like:

SELECT * FROM links WHERE short_code = 'x7Kp2';

causes the database to read every row in the links table and check whether short_code equals 'x7Kp2'. On a table with 50 million rows, that's 50 million comparisons — potentially hundreds of megabytes of data read from disk.

This is called a sequential scan (or full table scan). It has O(n) time complexity: the larger the table, the longer the query.

The index scan

With an index on short_code:

CREATE INDEX idx_links_short_code ON links(short_code);

The database maintains a separate data structure (almost certainly a B+ tree — more on this in the next post) that maps each short_code value to the disk location of the corresponding row. The query now:

Traverses the B+ tree to find 'x7Kp2' — O(log n) comparisons
Reads the row at the location the index points to — one disk read

On a 50 million row table, O(log n) is about 25 comparisons instead of 50 million. The speedup is not 2x — it's closer to 2,000,000x for this query.

Types of indexes

Single-column index The simplest case: an index on one column.

CREATE INDEX idx_links_user_id ON links(user_id);
-- Fast: SELECT * FROM links WHERE user_id = 123
-- No benefit: SELECT * FROM links WHERE created_at > '2025-01-01'

Composite index (multi-column) An index on two or more columns, in order.

CREATE INDEX idx_links_user_created ON links(user_id, created_at);
-- Fast: WHERE user_id = 123
-- Fast: WHERE user_id = 123 AND created_at > '2025-01-01'
-- No benefit: WHERE created_at > '2025-01-01'  (leftmost column rule)

The leftmost column rule is crucial: a composite index on (A, B, C) is usable for queries on A, A + B, and A + B + C. It is not usable for queries on just B, just C, or B + C. The index is ordered first by A, then B within A, then C within B — a query that doesn't provide A has no way to navigate the index.

Unique index Enforces uniqueness while also enabling fast lookup.

CREATE UNIQUE INDEX idx_links_short_code ON links(short_code);
-- Prevents duplicate short codes AND enables fast lookup

Partial index An index on a subset of rows matching a condition.

CREATE INDEX idx_links_active ON links(user_id) WHERE deleted_at IS NULL;
-- Much smaller index: only indexes active links
-- Only helps queries that include WHERE deleted_at IS NULL

Partial indexes are powerful when a large fraction of rows are "inactive" (soft-deleted, archived, completed) and queries almost always target only the active subset.

Covering index An index that contains all the columns needed to answer a query — the database never needs to read the actual table row.

CREATE INDEX idx_links_covering ON links(user_id, short_code, created_at);
-- This query can be answered entirely from the index:
SELECT short_code, created_at FROM links WHERE user_id = 123;

When an index covers a query, the database reads only the index — which is typically much smaller than the table and more likely to be cached in memory.

What indexes cost

Write overhead. Every INSERT, UPDATE, and DELETE must also update every index on the table. A table with 10 indexes has 10 additional writes per row modification. On write-heavy tables, excessive indexes hurt throughput significantly.

Storage. Indexes take disk space — typically 10–30% of the table size per index, but can be more for wide tables with composite indexes.

Maintenance. PostgreSQL periodically vacuums dead index entries. Large tables with many indexes need more maintenance work, which can impact performance during maintenance windows.

Query planner overhead. With many indexes, the query planner spends more time evaluating which index (or combination) to use. Usually negligible, but worth knowing.

Reading the query plan

The database query planner decides whether to use an index based on statistics about the data. You can inspect its decision with EXPLAIN:

EXPLAIN SELECT * FROM links WHERE short_code = 'x7Kp2';

-- Without index:
Seq Scan on links  (cost=0.00..1250000.00 rows=1 width=200)
  Filter: (short_code = 'x7Kp2')

-- With index:
Index Scan using idx_links_short_code on links  (cost=0.43..8.45 rows=1 width=200)
  Index Cond: (short_code = 'x7Kp2')

The planner will sometimes choose a sequential scan even when an index exists — for example, if it estimates the query will return a large fraction of the table (say, 30%+ of rows), a sequential scan is actually faster than many individual index lookups. The planner uses table statistics to make this call.

The tradeoffs

Reads vs writes. Every index added to a table improves certain reads and slows all writes. The question is always: does the read gain outweigh the write cost? On read-heavy tables (user profiles, product catalogs), indexes are almost always worth it. On write-heavy tables (event logs, click streams), every unnecessary index is a tax on every write.

The right index for the query. A composite index on (user_id, created_at) doesn't help a query that filters only on created_at. A partial index on active rows doesn't help a query that needs all rows. Adding indexes without understanding your queries adds write overhead without adding read performance.

Index bloat. Over time, tables with heavy update workloads accumulate dead index entries (rows that have been updated or deleted but whose index entries haven't been reclaimed yet). PostgreSQL's autovacuum handles this, but large, heavily-updated tables need tuning.

Selectivity matters. An index on a boolean is_active column with 95% of rows being true is nearly useless for queries that filter WHERE is_active = true — the index points to 95% of the table, so a sequential scan is faster. Indexes are most valuable on high-selectivity columns where the query matches a small fraction of rows.

When to use an index

Add an index when:

A column appears in WHERE, JOIN ON, ORDER BY, or GROUP BY clauses in frequent queries
The column has high cardinality (many distinct values) and queries are selective
A query is running slowly and EXPLAIN shows a sequential scan on a large table

Don't add an index when:

The table is small (under ~10,000 rows — a sequential scan is fine)
The column has very low selectivity (booleans, status fields with 2–3 values)
The table has extremely high write volume and reads are rare
You're indexing "just in case" without a specific slow query driving the decision

Signs of over-indexing:

Write throughput dropping without an obvious cause
pg_stat_user_indexes showing indexes with zero or near-zero scans
Index size approaching or exceeding table size

The one thing to remember

An index is a contract: it makes specific reads fast at the cost of all writes. Add indexes for queries you know are slow, not queries you imagine might be slow someday. One well-chosen index on the right column can make a 10-second query into a 10-millisecond one. Ten poorly-chosen indexes on a write-heavy table can halve your write throughput without improving anything.

← Previous: SQL vs NoSQL: Choosing the Right Database — SQL vs NoSQL isn't a simple choice. Learn what each type optimises for, when to use relational databases, and when No...

→ Next: B-Trees & B+ Trees: The Data Structure Behind Database Indexes — Almost every database index is built on a B-tree or B+ tree. Learn how they work, why they're fast, and what this mea...

Database Indexing: The Highest-Leverage Performance Tool

Systems Design

Database Indexing: The Highest-Leverage Performance Tool

The problem

The core idea

The analogy: a book index vs reading cover to cover

How it works

The full table scan

The index scan

Types of indexes

What indexes cost

Reading the query plan

The tradeoffs

When to use an index

The one thing to remember

Comments

Systems Design

B-Trees & B+ Trees: The Data Structure Behind Database Indexes

More from this blog

Architecture Patterns: Wrap-Up

MapReduce: Processing Petabytes in Parallel

Batch vs Stream Processing: How Fresh Do Your Answers Need to Be?

ETL Pipelines: Moving Data from Operations to Analytics

Backend for Frontend: One API Per Client Type

Command Palette

Systems Design

Database Indexing: The Highest-Leverage Performance Tool

The problem

The core idea

The analogy: a book index vs reading cover to cover

How it works

The full table scan

The index scan

Types of indexes

What indexes cost

Reading the query plan

The tradeoffs

When to use an index

The one thing to remember

Comments

Systems Design

B-Trees & B+ Trees: The Data Structure Behind Database Indexes

More from this blog