Series: System Design · Data & Storage — Pillar 4 of 8

Systems Design

#	Post	What it covers
00	Data & Storage: Where Everything Lives	Where data lives shapes everything about a system. Nineteen concepts covering databases, indexing, sharding, replication, and the data structures underneath. (161 chars)
01	SQL vs NoSQL: Choosing the Right Database	SQL vs NoSQL isn't a simple choice. Learn what each type optimises for, when to use relational databases, and when NoSQL is the right call.
02	Database Indexing: The Highest-Leverage Performance Tool	Indexes are the highest-leverage database performance tool. Learn how they work, what they cost, and how to decide when to add one.
03	B-Trees & B+ Trees: The Data Structure Behind Database Indexes	Almost every database index is built on a B-tree or B+ tree. Learn how they work, why they're fast, and what this means for your queries.
04	LSM Trees: Why Some Databases Are Built for Writes	LSM trees power Cassandra, RocksDB, and LevelDB. Learn how they achieve massive write throughput and what they trade off to get it.
05	Denormalisation: Trading Storage for Speed	Denormalisation trades storage for read speed by pre-computing joins. Learn when it helps, when it hurts, and how to do it safely.
06	Database Sharding: Scaling Beyond a Single Node	Sharding splits a database across multiple nodes. Learn how it works, the strategies available, and the significant tradeoffs it introduces.
07	Data Partitioning: Choosing How to Divide Your Data	Range, hash, and list partitioning each make different tradeoffs. Learn how to divide data effectively for queries, maintenance, and scale.
08	Consistent Hashing: Minimising Resharding Pain	Consistent hashing minimises data movement when nodes are added or removed. Learn how it works and why it's fundamental to distributed systems.
09	Replication & Read Replicas: Scaling Reads and Surviving Failures ← you are here	Replication copies data across nodes for fault tolerance and read scaling. Learn how primary-replica setups work and when to use them.
10	Object Storage: Unlimited Scale for Large Binary Data	Object storage handles large binary files at unlimited scale. Learn how it works, why it replaced file servers, and when to use it.
11	Block vs File vs Object Storage: Three Models, Three Use Cases	Three storage models, three different use cases. Learn what block, file, and object storage optimise for and how to choose between them.
12	Distributed File Systems: File Storage Across Many Machines	Distributed file systems spread file storage across many machines. Learn how HDFS, Ceph, and GlusterFS work and when to use them.
13	Time Series Databases: Built for Metrics and Events	Time series databases handle append-heavy metric data far better than SQL. Learn how they work and when to use InfluxDB, Prometheus, or TimescaleDB.
14	Vector Databases: Semantic Search and AI Memory	Vector databases power semantic search, recommendations, and LLM memory. Learn how embeddings work, what ANN search is, and when to use one.
15	Full-Text Search Engines: Beyond SQL LIKE	Full-text search needs more than SQL LIKE. Learn how inverted indexes, relevance ranking, and Elasticsearch make text search fast and powerful.
16	Materialized Views: Pre-Computing Expensive Queries	Materialized views cache expensive query results as physical tables. Learn how they work, when to refresh them, and when to use them vs other approaches.
17	Query Optimisation: From Slow to Fast	Slow queries aren't always fixed by adding indexes. Learn how to read EXPLAIN output, understand query plans, and systematically make queries fast.
18	Connection Pooling: Managing the Hidden Bottleneck	Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and how to size your pool correctly.
19	Data & Storage: Wrap-Up	A recap of all 19 data storage concepts: SQL, NoSQL, indexing, sharding, replication, specialised databases, and how they connect in a real system.

Replication & Read Replicas: Scaling Reads and Surviving Failures

The problem

Your URL shortener's PostgreSQL primary instance is under pressure, but not from write load — from reads. Your analytics dashboard, the public link preview pages, the team reporting features, the API serving mobile clients: every read hits the same single database.

Your database is showing 90% CPU utilisation, but write activity accounts for only 20% of it. The remaining 70% is read queries — analytics aggregates, dashboard page loads, link detail fetches. The primary is struggling not because it can't absorb writes but because it's serving too many concurrent reads.

At the same time, a nagging concern: the primary database is a single point of failure. If the primary goes down — hardware failure, corrupt data, a botched migration — every part of the platform that requires data is unavailable.

Both problems — read scaling and fault tolerance — have the same solution: replication.

The core idea

Database replication creates one or more copies of a database on separate nodes. The primary node accepts writes; replicas receive a copy of every write and can serve reads. Replication solves two problems simultaneously: read throughput (spread reads across multiple nodes) and availability (if the primary fails, promote a replica).

The analogy: a library with branch copies

A central library (the primary) holds the authoritative collection. When a new book arrives, a copy is made and sent to each branch library (the replicas). Most readers visit their local branch. The branches serve the same collection as the central library, reducing load there. If the central library catches fire, a branch holds everything needed to reconstruct the collection — and can temporarily take over as the new central library.

The branch copies might lag slightly behind the central library — a book acquired yesterday might take a few hours to appear at branches. This is replication lag, and it's the central tradeoff of replication.

How it works

Primary-replica (single-primary) replication

The most common architecture: one primary accepts all writes; one or more replicas receive those writes asynchronously and serve reads.

Application
├── All writes → Primary (read-write)
└── Read queries → Replica 1, Replica 2, Replica 3 (read-only)
                                    ↑
                         WAL stream from primary

Write path: the application sends writes to the primary. The primary writes to its WAL (Write-Ahead Log) and acknowledges the write.

Replication path: the primary streams its WAL to each replica. Each replica applies the WAL entries in order, maintaining an identical copy of the primary's data (with some lag).

Read path: read queries are distributed across replicas. The application (or a connection proxy like PgBouncer, ProxySQL, or RDS Proxy) routes read queries to replicas.

Synchronous vs asynchronous replication

Asynchronous replication (default in PostgreSQL):

The primary acknowledges the write to the client as soon as it's committed locally, without waiting for replicas to confirm receipt.

Pros: low write latency — the client doesn't wait for the replication network round-trip
Cons: replication lag — replicas trail the primary by milliseconds to seconds (or more under high load). A read from a replica immediately after a write may return stale data

Synchronous replication:

The primary waits for at least one replica to confirm receipt before acknowledging the write to the client.

-- PostgreSQL: require at least one replica to confirm
SET synchronous_commit = on;
ALTER SYSTEM SET synchronous_standby_names = 'replica1';

Pros: guaranteed durability — data is on at least two nodes before the write is acknowledged. No data loss even if the primary fails immediately after the write
Cons: write latency increases by the network round-trip to the replica (typically 1–10ms). If the synchronous replica is unavailable, writes block

Most production setups use asynchronous replication for performance, with the understanding that a primary failure may lose a small window of recent writes. Critical data paths (financial transactions, user account changes) sometimes use synchronous replication for the strongest durability guarantee.

Replication lag

Replication lag is the delay between a write being committed on the primary and it becoming visible on a replica. Under normal conditions, it's under 100ms. Under high write load, it can grow to seconds or minutes.

Lag matters in two scenarios:

Read-after-write consistency: a user creates a short link, is redirected to the link's detail page, but the detail page reads from a replica that hasn't yet received the new link. The page shows "link not found." This is confusing and wrong.

Stale analytics: dashboard queries on a replica may show click counts that are seconds or minutes behind the primary. Usually acceptable for analytics; not acceptable for balance-critical operations.

Strategies to handle lag:

Read-your-writes: after a write, route the user's next read to the primary (or to a replica you know has received the write)
Monotonic reads: once a user has seen data from a replica at lag offset X, always route their subsequent reads to a replica at the same or lower lag
Causal consistency: track which writes a user has seen and only serve reads from replicas that have applied at least those writes

Most applications handle this pragmatically: write-heavy operations read from the primary; most other reads go to replicas.

Failover

When the primary fails, a replica is promoted to become the new primary. This process is called failover.

Manual failover: an operator observes that the primary is down, manually promotes a replica, and updates the application's connection configuration. Reliable but slow — typically 5–15 minutes of downtime.

Automatic failover: a high-availability agent (Patroni for PostgreSQL, MHA for MySQL, AWS RDS Multi-AZ) automatically detects primary failure and promotes a replica. Typical automated failover time: 30–60 seconds. The application needs to handle reconnection gracefully.

Failover considerations:

Data loss window: with async replication, the promoted replica may not have the last few seconds of writes from the primary. These writes are lost.
Split-brain: two nodes simultaneously believe they are the primary. This can happen if the primary is slow but not completely dead — the HA agent promotes a replica, but the "dead" primary recovers and starts accepting writes again. Both nodes are now accepting writes to the same dataset. Fencing mechanisms (STONITH — "Shoot The Other Node In The Head") prevent this.
Connection routing: applications must be able to reconnect to the new primary. Connection poolers and DNS-based failover (updating a CNAME to point to the new primary) are common solutions.

Multi-primary replication

In a multi-primary (multi-master) setup, multiple nodes accept writes. Replicas from any primary must be merged.

Galera Cluster (MySQL): uses synchronous multi-primary replication with certification-based conflict detection. Writes go to any node; conflicts (two nodes trying to write the same row simultaneously) are detected and one is rolled back.

Amazon Aurora Multi-Master: allows writes to multiple availability zones simultaneously, with automatic conflict resolution.

CockroachDB / Google Spanner: distributed SQL databases that support multi-primary across regions using consensus protocols (Paxos/Raft).

Multi-primary is powerful for write availability (no single primary is a write SPOF) but introduces write conflict complexity. For most applications, single-primary with automated failover is operationally simpler and sufficient.

The tradeoffs

Read scalability vs consistency. Read replicas scale reads linearly — three replicas can serve three times the read throughput. The cost is eventual consistency: replicas may lag. Applications must be written with this in mind.

Durability vs write latency. Synchronous replication guarantees no data loss but adds latency. Asynchronous replication is fast but risks losing recent writes on primary failure.

Failover speed vs safety. Automated failover reduces downtime but risks split-brain. Manual failover is safer but slower.

Read replicas vs sharding. Read replicas scale reads; sharding scales writes and storage. If your bottleneck is read throughput, add read replicas (much simpler). If your bottleneck is write throughput or data volume, you need sharding. Many production systems use both.

When to use replication

Use read replicas when:

Read queries dominate your workload (analytics dashboards, reporting, high-traffic reads)
You need a hot standby for fast failover without downtime
You want to run analytics workloads without impacting production write performance

Use synchronous replication when:

Data loss on primary failure is unacceptable
You have a high-RPO (Recovery Point Objective) requirement

Consider multi-primary when:

Write availability across regions is required
You're building an active-active multi-region setup

Don't use read replicas as a substitute for sharding if your problem is write throughput or storage — replicas are still limited to one primary's write capacity.

The one thing to remember

Replication makes a copy of your data on multiple nodes, solving two problems at once: read throughput (spread reads across replicas) and availability (promote a replica if the primary fails). The fundamental tradeoff is between write latency and durability: async replication is fast but risks a small data loss window on failure; sync replication eliminates that window but adds latency to every write. Know which tradeoff your application can afford before choosing.

← Previous: Consistent Hashing: Minimising Resharding Pain — Consistent hashing minimises data movement when nodes are added or removed. Learn how it works and why it's fundament...

→ Next: Object Storage: Unlimited Scale for Large Binary Data — Object storage handles large binary files at unlimited scale. Learn how it works, why it replaced file servers, and w...

Replication & Read Replicas: Scaling Reads and Surviving Failures

Systems Design

Replication & Read Replicas: Scaling Reads and Surviving Failures

The problem

The core idea

The analogy: a library with branch copies

How it works