Series: System Design · Data & Storage — Pillar 4 of 8

Systems Design

#	Post	What it covers
00	Data & Storage: Where Everything Lives	Where data lives shapes everything about a system. Nineteen concepts covering databases, indexing, sharding, replication, and the data structures underneath. (161 chars)
01	SQL vs NoSQL: Choosing the Right Database	SQL vs NoSQL isn't a simple choice. Learn what each type optimises for, when to use relational databases, and when NoSQL is the right call.
02	Database Indexing: The Highest-Leverage Performance Tool	Indexes are the highest-leverage database performance tool. Learn how they work, what they cost, and how to decide when to add one.
03	B-Trees & B+ Trees: The Data Structure Behind Database Indexes	Almost every database index is built on a B-tree or B+ tree. Learn how they work, why they're fast, and what this means for your queries.
04	LSM Trees: Why Some Databases Are Built for Writes	LSM trees power Cassandra, RocksDB, and LevelDB. Learn how they achieve massive write throughput and what they trade off to get it.
05	Denormalisation: Trading Storage for Speed ← you are here	Denormalisation trades storage for read speed by pre-computing joins. Learn when it helps, when it hurts, and how to do it safely.
06	Database Sharding: Scaling Beyond a Single Node	Sharding splits a database across multiple nodes. Learn how it works, the strategies available, and the significant tradeoffs it introduces.
07	Data Partitioning: Choosing How to Divide Your Data	Range, hash, and list partitioning each make different tradeoffs. Learn how to divide data effectively for queries, maintenance, and scale.
08	Consistent Hashing: Minimising Resharding Pain	Consistent hashing minimises data movement when nodes are added or removed. Learn how it works and why it's fundamental to distributed systems.
09	Replication & Read Replicas: Scaling Reads and Surviving Failures	Replication copies data across nodes for fault tolerance and read scaling. Learn how primary-replica setups work and when to use them.
10	Object Storage: Unlimited Scale for Large Binary Data	Object storage handles large binary files at unlimited scale. Learn how it works, why it replaced file servers, and when to use it.
11	Block vs File vs Object Storage: Three Models, Three Use Cases	Three storage models, three different use cases. Learn what block, file, and object storage optimise for and how to choose between them.
12	Distributed File Systems: File Storage Across Many Machines	Distributed file systems spread file storage across many machines. Learn how HDFS, Ceph, and GlusterFS work and when to use them.
13	Time Series Databases: Built for Metrics and Events	Time series databases handle append-heavy metric data far better than SQL. Learn how they work and when to use InfluxDB, Prometheus, or TimescaleDB.
14	Vector Databases: Semantic Search and AI Memory	Vector databases power semantic search, recommendations, and LLM memory. Learn how embeddings work, what ANN search is, and when to use one.
15	Full-Text Search Engines: Beyond SQL LIKE	Full-text search needs more than SQL LIKE. Learn how inverted indexes, relevance ranking, and Elasticsearch make text search fast and powerful.
16	Materialized Views: Pre-Computing Expensive Queries	Materialized views cache expensive query results as physical tables. Learn how they work, when to refresh them, and when to use them vs other approaches.
17	Query Optimisation: From Slow to Fast	Slow queries aren't always fixed by adding indexes. Learn how to read EXPLAIN output, understand query plans, and systematically make queries fast.
18	Connection Pooling: Managing the Hidden Bottleneck	Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and how to size your pool correctly.
19	Data & Storage: Wrap-Up	A recap of all 19 data storage concepts: SQL, NoSQL, indexing, sharding, replication, specialised databases, and how they connect in a real system.

Denormalisation: Trading Storage for Speed

The problem

Your URL shortener's dashboard shows users a table of their links: short code, destination URL, creation date, and total click count.

The normalised query looks like this:

SELECT
  l.short_code,
  l.destination_url,
  l.created_at,
  COUNT(c.id) AS click_count
FROM links l
LEFT JOIN clicks c ON c.link_id = l.id
WHERE l.user_id = 123
GROUP BY l.id, l.short_code, l.destination_url, l.created_at
ORDER BY l.created_at DESC
LIMIT 50;

This query is correct. It's also a join across two large tables, with a GROUP BY and aggregate on potentially millions of click rows. At 10 users, it's fine. At 10,000 concurrent users each loading their dashboards, it's 10,000 simultaneous aggregate queries on a clicks table with billions of rows. The database falls over.

You've already added indexes. You've tried materialized views. The fundamental problem is that the query requires assembling data from two tables at read time — and it's doing that assembly thousands of times per second for data that doesn't change that often.

The core idea

Denormalisation deliberately reintroduces redundancy into a database schema to eliminate expensive computation at read time. Instead of computing a join or aggregate when a user asks for data, you pre-compute it and store the result alongside the source data.

It violates the principles of third normal form. It does so deliberately, with clear-eyed understanding of the tradeoffs.

The analogy: photocopies vs the master file

Database normalisation is the policy of keeping one authoritative copy of every piece of information. If a customer's address changes, you update one row in the customers table and every system that queries customers gets the new address.

Denormalisation is the policy of making photocopies. You stamp the customer's address onto every invoice, every shipment record, every support ticket — anywhere you need it. Now reading an invoice doesn't require fetching the customer record. The address is right there.

The cost: when the customer moves, you must update the address on every invoice, every shipment, every support ticket. If you miss any, those records become stale — showing an old address that no longer reflects reality.

This is the core tension of denormalisation: reads become faster because the data is pre-assembled. Writes become more complex because every copy must be kept in sync.

How it works

Storing pre-computed aggregates

The simplest form: add a click_count column to the links table.

ALTER TABLE links ADD COLUMN click_count BIGINT DEFAULT 0;

Every time a click is recorded, increment the counter:

INSERT INTO clicks (link_id, clicked_at, country, device) VALUES (...);
UPDATE links SET click_count = click_count + 1 WHERE id = :link_id;

Now the dashboard query becomes:

SELECT short_code, destination_url, created_at, click_count
FROM links
WHERE user_id = 123
ORDER BY created_at DESC
LIMIT 50;

No join. No aggregate. One table scan on a well-indexed query. This query is orders of magnitude faster at high concurrency.

The tradeoff: click_count can drift out of sync if a click write succeeds but the counter update fails, or if click records are deleted/corrected later. You need application logic (or database triggers) to keep them in sync, and you need to decide what "eventually consistent click count" means for your users.

In document stores (MongoDB), denormalisation often means embedding related documents rather than referencing them by ID.

Normalised (references):

// links collection
{ "id": "x7Kp2", "user_id": "user_123", "url": "https://example.com" }

// users collection
{ "id": "user_123", "name": "Alex", "email": "alex@example.com" }

Loading a link's creator requires two queries: fetch the link, then fetch the user by user_id.

Denormalised (embedded):

// links collection
{
  "id": "x7Kp2",
  "url": "https://example.com",
  "creator": {
    "id": "user_123",
    "name": "Alex",
    "email": "alex@example.com"
  }
}

Loading the link now includes the creator's name and email — one query, no join. The cost: if Alex changes her email, it must be updated in every link document she created, not just in one users row.

Duplicating columns to eliminate joins

Rather than embedding full objects, sometimes you just copy one or two frequently-accessed columns.

-- Normalised: links table has user_id, must join users for email
SELECT l.short_code, u.email
FROM links l JOIN users u ON l.user_id = u.id
WHERE l.id = :id;

-- Denormalised: copy user_email into links table
SELECT short_code, user_email
FROM links
WHERE id = :id;

Now the frequently-accessed user_email is on the links row. Changing a user's email requires updating users and all their links — but email changes are rare, and dashboard reads happen millions of times.

Pre-computed join tables

For many-to-many relationships with complex aggregations, sometimes the right answer is a dedicated summary table.

-- Summary table: per-user, per-day click totals
CREATE TABLE user_daily_stats (
  user_id    BIGINT,
  date       DATE,
  link_count INT DEFAULT 0,
  click_count BIGINT DEFAULT 0,
  PRIMARY KEY (user_id, date)
);

Updated incrementally as links are created and clicks are recorded. The analytics dashboard reads from user_daily_stats rather than aggregating the full clicks table.

The tradeoffs

Write complexity. Every denormalised field adds logic to every write path. When you add click_count to links, every click handler must now also update links. Miss one write path (a bulk import, an admin correction, a backfill script) and the count drifts.

Consistency challenges. Normalised data has one source of truth. Denormalised data has multiple copies that must be kept in sync. Distributed systems make this harder — what happens if the counter increment and the click insert are in different transactions? What if one succeeds and one fails?

Read performance gains. Eliminating a join that previously merged two large tables can reduce query time by 10–100x. For read-heavy workloads with predictable access patterns, this gain is substantial and sustained.

Storage cost. Duplicated data uses more disk space. Usually acceptable — storage is cheap, latency is not.

Maintenance burden. Every denormalised field must be kept in sync across application updates. As the codebase grows, the places that must be aware of "also update this counter" multiply.

When to denormalise

Denormalise when:

A specific read query is a performance bottleneck, clearly caused by a join or aggregate
The data being denormalised changes infrequently relative to how often it's read
You can reliably keep the denormalised copy in sync (via transaction, trigger, or event)
The join or aggregate is done at high concurrency (dashboards, landing pages, API endpoints under heavy load)

Don't denormalise when:

The data changes frequently (the sync overhead eats the read gain)
You're doing it speculatively, before measuring a problem
The query is slow for a different reason (missing index, bad query plan, under-provisioned hardware)
The sync logic would be spread across too many code paths to maintain reliably

The normalise-first rule: start with a fully normalised schema. Add denormalisation only when a specific query is demonstrably too slow and cannot be fixed with indexing or query optimisation alone. Premature denormalisation adds write complexity to queries that might not need it.

The one thing to remember

Denormalisation is a conscious decision to maintain redundant copies of data in exchange for faster reads. The benefit is real — eliminating a join can turn a 5-second query into a 50-millisecond one under load. The cost is also real: every denormalised field adds complexity to every write path, and out-of-sync copies cause subtle, hard-to-find bugs. Measure first, denormalise second, and always know exactly which code paths are responsible for keeping every copy in sync.

← Previous: LSM Trees: Why Some Databases Are Built for Writes — LSM trees power Cassandra, RocksDB, and LevelDB. Learn how they achieve massive write throughput and what they trade...

→ Next: Database Sharding: Scaling Beyond a Single Node — Sharding splits a database across multiple nodes. Learn how it works, the strategies available, and the significant t...

Denormalisation: Trading Storage for Speed

Systems Design

Denormalisation: Trading Storage for Speed

The problem

The core idea

The analogy: photocopies vs the master file

How it works

Storing pre-computed aggregates

Duplicating columns to eliminate joins

Pre-computed join tables

The tradeoffs

When to denormalise

The one thing to remember

Comments

Systems Design

Database Sharding: Scaling Beyond a Single Node

More from this blog

Architecture Patterns: Wrap-Up

MapReduce: Processing Petabytes in Parallel

Batch vs Stream Processing: How Fresh Do Your Answers Need to Be?

ETL Pipelines: Moving Data from Operations to Analytics

Backend for Frontend: One API Per Client Type

Command Palette

Systems Design

Denormalisation: Trading Storage for Speed

The problem

The core idea

The analogy: photocopies vs the master file

How it works

Storing pre-computed aggregates

Embedding related data

Duplicating columns to eliminate joins

Pre-computed join tables

The tradeoffs

When to denormalise

The one thing to remember

Comments

Systems Design

Database Sharding: Scaling Beyond a Single Node

More from this blog