Series: System Design · Data & Storage — Pillar 4 of 8

Systems Design

#	Post	What it covers
00	Data & Storage: Where Everything Lives	Where data lives shapes everything about a system. Nineteen concepts covering databases, indexing, sharding, replication, and the data structures underneath. (161 chars)
01	SQL vs NoSQL: Choosing the Right Database	SQL vs NoSQL isn't a simple choice. Learn what each type optimises for, when to use relational databases, and when NoSQL is the right call.
02	Database Indexing: The Highest-Leverage Performance Tool	Indexes are the highest-leverage database performance tool. Learn how they work, what they cost, and how to decide when to add one.
03	B-Trees & B+ Trees: The Data Structure Behind Database Indexes	Almost every database index is built on a B-tree or B+ tree. Learn how they work, why they're fast, and what this means for your queries.
04	LSM Trees: Why Some Databases Are Built for Writes	LSM trees power Cassandra, RocksDB, and LevelDB. Learn how they achieve massive write throughput and what they trade off to get it.
05	Denormalisation: Trading Storage for Speed	Denormalisation trades storage for read speed by pre-computing joins. Learn when it helps, when it hurts, and how to do it safely.
06	Database Sharding: Scaling Beyond a Single Node	Sharding splits a database across multiple nodes. Learn how it works, the strategies available, and the significant tradeoffs it introduces.
07	Data Partitioning: Choosing How to Divide Your Data	Range, hash, and list partitioning each make different tradeoffs. Learn how to divide data effectively for queries, maintenance, and scale.
08	Consistent Hashing: Minimising Resharding Pain	Consistent hashing minimises data movement when nodes are added or removed. Learn how it works and why it's fundamental to distributed systems.
09	Replication & Read Replicas: Scaling Reads and Surviving Failures	Replication copies data across nodes for fault tolerance and read scaling. Learn how primary-replica setups work and when to use them.
10	Object Storage: Unlimited Scale for Large Binary Data	Object storage handles large binary files at unlimited scale. Learn how it works, why it replaced file servers, and when to use it.
11	Block vs File vs Object Storage: Three Models, Three Use Cases	Three storage models, three different use cases. Learn what block, file, and object storage optimise for and how to choose between them.
12	Distributed File Systems: File Storage Across Many Machines	Distributed file systems spread file storage across many machines. Learn how HDFS, Ceph, and GlusterFS work and when to use them.
13	Time Series Databases: Built for Metrics and Events	Time series databases handle append-heavy metric data far better than SQL. Learn how they work and when to use InfluxDB, Prometheus, or TimescaleDB.
14	Vector Databases: Semantic Search and AI Memory	Vector databases power semantic search, recommendations, and LLM memory. Learn how embeddings work, what ANN search is, and when to use one.
15	Full-Text Search Engines: Beyond SQL LIKE	Full-text search needs more than SQL LIKE. Learn how inverted indexes, relevance ranking, and Elasticsearch make text search fast and powerful.
16	Materialized Views: Pre-Computing Expensive Queries	Materialized views cache expensive query results as physical tables. Learn how they work, when to refresh them, and when to use them vs other approaches.
17	Query Optimisation: From Slow to Fast ← you are here	Slow queries aren't always fixed by adding indexes. Learn how to read EXPLAIN output, understand query plans, and systematically make queries fast.
18	Connection Pooling: Managing the Hidden Bottleneck	Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and how to size your pool correctly.
19	Data & Storage: Wrap-Up	A recap of all 19 data storage concepts: SQL, NoSQL, indexing, sharding, replication, specialised databases, and how they connect in a real system.

Query Optimisation: From Slow to Fast

The problem

A query is taking 12 seconds. It's a straightforward join of two tables. You have indexes on the relevant columns. The tables aren't particularly large — 50 million rows in clicks, 2 million rows in links. This should be fast.

You add more indexes. Still 12 seconds. You increase the instance's RAM. Still 12 seconds. You ask a colleague and they rewrite the query slightly differently. It runs in 80 milliseconds.

The query itself was the problem — not the schema, not the hardware, not the indexes. A different query structure asking for identical results was 150x faster.

This is the uncomfortable truth about query performance: the SQL you write is not directly executed. The database query planner transforms it into an execution plan, and the quality of that plan — which indexes are used, how tables are joined, in what order — determines performance. Understanding how query planners work, and how to guide them, is what separates engineers who "add indexes and hope" from engineers who reliably make slow queries fast.

The core idea

A database query planner takes a SQL query and produces an execution plan — a tree of operations that will retrieve the requested data. The planner uses table statistics, available indexes, and cost estimation to choose among many possible plans. Understanding the planner's output and knowing how to influence its choices is the foundation of query optimisation.

The analogy: a GPS with traffic data

You need to drive from A to B. A basic GPS finds the shortest route. A smart GPS considers current traffic, construction, and road capacity — it might choose a longer road that's actually faster because the short route is congested.

The query planner is the smart GPS. It knows the road map (schema and indexes), the current conditions (table statistics — how many rows, how selective each column is), and it estimates the cost of each possible route before choosing one. Getting the statistics wrong — telling the GPS it's midnight traffic when it's actually rush hour — leads to bad route choices.

When you provide hints (or just better-structured queries), you're giving the GPS better information.

How it works

Reading EXPLAIN output

EXPLAIN shows the query plan without executing it. EXPLAIN (ANALYZE, BUFFERS) actually runs the query and shows both the estimated and actual costs, plus buffer hit/miss information.

EXPLAIN (ANALYZE, BUFFERS)
SELECT l.short_code, COUNT(c.id) as click_count
FROM links l
JOIN clicks c ON c.link_id = l.id
WHERE l.user_id = 123
  AND c.clicked_at >= NOW() - INTERVAL '30 days'
GROUP BY l.id, l.short_code;

Sample output:

HashAggregate  (cost=85432.00..85440.00 rows=50 width=32) (actual time=12341.203..12341.245 rows=47)
  Group Key: l.id
  ->  Hash Join  (cost=320.00..85200.00 rows=46400 width=24) (actual time=8.234..12280.445 rows=180000)
        Hash Cond: (c.link_id = l.id)
        Buffers: shared hit=4200 read=38000
        ->  Seq Scan on clicks c  (cost=0.00..75000.00 rows=2500000 width=16) (actual time=0.034..8900.234 rows=8000000)
              Filter: (clicked_at >= (now() - '30 days'::interval))
              Rows Removed by Filter: 42000000
              Buffers: shared hit=200 read=38000
        ->  Hash  (cost=200.00..200.00 rows=150 width=24) (actual time=2.134..2.134 rows=150)
              Buckets: 1024  Batches: 1
              ->  Index Scan using idx_links_user_id on links l  ...

Reading the plan:

The tree reads from bottom (inner operations) to top (outer result)
Seq Scan on clicks — scanning 50 million rows, removing 42 million by the filter. The big problem: no index is being used to filter clicked_at
rows=2500000 (estimated) vs rows=8000000 (actual) — the planner estimated 2.5M rows but got 8M. Stale statistics led to a bad cost estimate
Buffers: shared hit=200 read=38000 — 38,000 pages read from disk (slow), only 200 from cache (fast)

The plan reveals: clicks is being scanned in full and then filtered by date. An index on (clicked_at) — or better, (link_id, clicked_at) — would let the database go directly to the last 30 days' data.

The five operations to recognise

Seq Scan (sequential scan): reads every row in the table. Appropriate for small tables or queries returning a large fraction of rows. On large tables with selective filters, a sign that an index is missing or not being used.

Index Scan: navigates the B+ tree to find matching rows, then fetches each row from the heap (the main table storage). Fast for selective queries (small fraction of rows).

Index Only Scan: serves the query entirely from the index without touching the heap. Only possible with a covering index. Fastest possible index access.

Hash Join: loads one table into a hash map, scans the other, probes the hash map for matches. Good when one table is small enough to fit in memory (the "inner" side) and the join key has no index.

Merge Join: sorts both tables on the join key, then merges them. Best when both sides are already sorted (e.g., both have an index on the join key). Can be faster than hash join for large datasets.

Nested Loop Join: for each row in the outer table, scan the inner table. Excellent when the inner lookup uses an index and few rows match. Catastrophic when the inner lookup is a full scan — O(n × m) time.

Common slow query patterns and fixes

Pattern 1: Full scan instead of index scan

-- Slow: LIKE with leading wildcard can't use a standard B+ tree index
SELECT * FROM links WHERE destination_url LIKE '%example.com%';

-- Fix: full-text search, trigram index (pg_trgm), or restructure the query
CREATE EXTENSION pg_trgm;
CREATE INDEX idx_links_url_trgm ON links USING gin(destination_url gin_trgm_ops);

Pattern 2: Function on indexed column defeats the index

-- Slow: applying a function to an indexed column prevents index use
SELECT * FROM clicks WHERE DATE(clicked_at) = '2025-06-01';
-- DATE() wraps clicked_at — the index on clicked_at can't be used

-- Fix: use a range filter instead
SELECT * FROM clicks WHERE clicked_at >= '2025-06-01' AND clicked_at < '2025-06-02';

Pattern 3: Implicit type conversion

-- Slow: short_code is VARCHAR, comparing to integer forces a cast
SELECT * FROM links WHERE short_code = 12345;

-- Fix: use the correct type
SELECT * FROM links WHERE short_code = '12345';

Pattern 4: Selecting more columns than needed

-- Slow: SELECT * loads every column including large TEXT/JSONB fields
SELECT * FROM links WHERE user_id = 123;

-- Fix: select only what you need
SELECT id, short_code, link_name, created_at FROM links WHERE user_id = 123;
-- Smaller rows = more rows fit in a page = fewer disk reads

Pattern 5: OR across different columns

-- Slow: OR across different columns prevents efficient index use
SELECT * FROM links WHERE short_code = 'x7Kp2' OR destination_url = 'https://example.com';

-- Fix: UNION (each branch uses its own index)
SELECT * FROM links WHERE short_code = 'x7Kp2'
UNION
SELECT * FROM links WHERE destination_url = 'https://example.com';

Pattern 6: Missing composite index column order

-- Has index: (user_id, created_at)
-- Slow: created_at filter alone can't use the index (leftmost column rule)
SELECT * FROM links WHERE created_at > '2025-01-01';

-- Fast: user_id first
SELECT * FROM links WHERE user_id = 123 AND created_at > '2025-01-01';

Statistics and the planner

The planner's cost estimates depend on table statistics collected by ANALYZE. Statistics include:

Row count estimates
Column value distribution (histogram)
Most common values and their frequencies
Null fraction

Stale statistics cause bad plans. After a large data load, bulk delete, or significant schema change, run ANALYZE table_name to refresh statistics.

-- Check when statistics were last collected
SELECT schemaname, tablename, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE tablename = 'clicks';

PostgreSQL's autovacuum runs ANALYZE automatically, but may lag on rapidly-changing tables. For large batch loads, run ANALYZE manually after the load completes.

N+1 queries: the application-level anti-pattern

Query optimisation isn't only about individual queries. A common application-level issue: loading a list of links, then fetching the click count for each link in a separate query.

# N+1 pattern: 1 query for the list + N queries for click counts
links = db.execute("SELECT * FROM links WHERE user_id = 123")
for link in links:
    link['click_count'] = db.execute(
        "SELECT COUNT(*) FROM clicks WHERE link_id = ?", link['id']
    )
# 51 queries for 50 links

Fix: fetch everything in one query with a JOIN or subquery.

# 1 query
links = db.execute("""
    SELECT l.*, COUNT(c.id) AS click_count
    FROM links l
    LEFT JOIN clicks c ON c.link_id = l.id
    WHERE l.user_id = 123
    GROUP BY l.id
""")

N+1 queries are often the biggest performance problem in applications, and they're invisible in the database's slow query log (each individual query is fast; the problem is the count).

The optimisation workflow

Identify slow queries — use pg_stat_statements to find the queries consuming the most total time, not just the slowest individual execution
Run EXPLAIN (ANALYZE, BUFFERS) — understand the actual execution plan, row estimates vs actuals, and where disk reads are concentrated
Look for sequential scans on large tables — check if an index is missing, not being used, or if query structure prevents its use
Check statistics freshness — stale statistics cause bad plans
Rewrite the query — eliminate functions on indexed columns, use range filters, avoid leading wildcards
Add targeted indexes — only after confirming via EXPLAIN that an index would be used
Measure — run EXPLAIN (ANALYZE) again, confirm the plan changed and improved

The one thing to remember

The query planner chooses an execution plan based on cost estimates derived from table statistics. The fastest path to a slow query is to understand the plan EXPLAIN shows you, not to guess. A sequential scan on 50 million rows when you needed 50 rows is almost always fixable — with an index, a query rewrite, or a statistics refresh. Start with EXPLAIN, find the most expensive node in the plan, fix that, repeat.

← Previous: Materialized Views: Pre-Computing Expensive Queries — Materialized views cache expensive query results as physical tables. Learn how they work, when to refresh them, and w...

→ Next: Connection Pooling: Managing the Hidden Bottleneck — Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and...

Query Optimisation: From Slow to Fast

Systems Design

Query Optimisation: From Slow to Fast

The problem

The core idea

The analogy: a GPS with traffic data

How it works

Reading EXPLAIN output

The five operations to recognise

Common slow query patterns and fixes

Statistics and the planner

N+1 queries: the application-level anti-pattern

The optimisation workflow

The one thing to remember

Comments

Systems Design

More from this blog

Data & Storage: Wrap-Up

Connection Pooling: Managing the Hidden Bottleneck

Materialized Views: Pre-Computing Expensive Queries

Full-Text Search Engines: Beyond SQL LIKE

Command Palette

Systems Design

Query Optimisation: From Slow to Fast

The problem

The core idea

The analogy: a GPS with traffic data

How it works

Reading EXPLAIN output

The five operations to recognise

Common slow query patterns and fixes

Statistics and the planner

N+1 queries: the application-level anti-pattern

The optimisation workflow

The one thing to remember

Comments

Systems Design

More from this blog