Replication & Read Replicas: Scaling Reads and Surviving Failures

Series: System Design · Data & Storage — Pillar 4 of 8
| # | Post | What it covers |
|---|---|---|
| 00 | Data & Storage: Where Everything Lives | Where data lives shapes everything about a system. Nineteen concepts covering databases, indexing, sharding, replication, and the data structures underneath. |
| 01 | SQL vs NoSQL: Choosing the Right Database | SQL vs NoSQL isn't a simple choice. Learn what each type optimises for, when to use relational databases, and when NoSQL is the right call. |
| 02 | Database Indexing: The Highest-Leverage Performance Tool | Indexes are the highest-leverage database performance tool. Learn how they work, what they cost, and how to decide when to add one. |
| 03 | B-Trees & B+ Trees: The Data Structure Behind Database Indexes | Almost every database index is built on a B-tree or B+ tree. Learn how they work, why they're fast, and what this means for your queries. |
| 04 | LSM Trees: Why Some Databases Are Built for Writes | LSM trees power Cassandra, RocksDB, and LevelDB. Learn how they achieve massive write throughput and what they trade off to get it. |
| 05 | Denormalisation: Trading Storage for Speed | Denormalisation trades storage for read speed by pre-computing joins. Learn when it helps, when it hurts, and how to do it safely. |
| 06 | Database Sharding: Scaling Beyond a Single Node | Sharding splits a database across multiple nodes. Learn how it works, the strategies available, and the significant tradeoffs it introduces. |
| 07 | Data Partitioning: Choosing How to Divide Your Data | Range, hash, and list partitioning each make different tradeoffs. Learn how to divide data effectively for queries, maintenance, and scale. |
| 08 | Consistent Hashing: Minimising Resharding Pain | Consistent hashing minimises data movement when nodes are added or removed. Learn how it works and why it's fundamental to distributed systems. |
| 09 | Replication & Read Replicas: Scaling Reads and Surviving Failures ← you are here | Replication copies data across nodes for fault tolerance and read scaling. Learn how primary-replica setups work and when to use them. |
| 10 | Object Storage: Unlimited Scale for Large Binary Data | Object storage handles large binary files at unlimited scale. Learn how it works, why it replaced file servers, and when to use it. |
| 11 | Block vs File vs Object Storage: Three Models, Three Use Cases | Three storage models, three different use cases. Learn what block, file, and object storage optimise for and how to choose between them. |
| 12 | Distributed File Systems: File Storage Across Many Machines | Distributed file systems spread file storage across many machines. Learn how HDFS, Ceph, and GlusterFS work and when to use them. |
| 13 | Time Series Databases: Built for Metrics and Events | Time series databases handle append-heavy metric data far better than SQL. Learn how they work and when to use InfluxDB, Prometheus, or TimescaleDB. |
| 14 | Vector Databases: Semantic Search and AI Memory | Vector databases power semantic search, recommendations, and LLM memory. Learn how embeddings work, what ANN search is, and when to use one. |
| 15 | Full-Text Search Engines: Beyond SQL LIKE | Full-text search needs more than SQL LIKE. Learn how inverted indexes, relevance ranking, and Elasticsearch make text search fast and powerful. |
| 16 | Materialized Views: Pre-Computing Expensive Queries | Materialized views cache expensive query results as physical tables. Learn how they work, when to refresh them, and when to use them vs other approaches. |
| 17 | Query Optimisation: From Slow to Fast | Slow queries aren't always fixed by adding indexes. Learn how to read EXPLAIN output, understand query plans, and systematically make queries fast. |
| 18 | Connection Pooling: Managing the Hidden Bottleneck | Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and how to size your pool correctly. |
| 19 | Data & Storage: Wrap-Up | A recap of all 19 data storage concepts: SQL, NoSQL, indexing, sharding, replication, specialised databases, and how they connect in a real system. |
The problem
Your URL shortener's PostgreSQL primary instance is under pressure, but not from write load — from reads. Your analytics dashboard, the public link preview pages, the team reporting features, the API serving mobile clients: every read hits the same single database.
Your database is showing 90% CPU utilisation, but write activity accounts for only 20 percentage points of that. The remaining 70 points come from read queries: analytics aggregates, dashboard page loads, link detail fetches. The primary is struggling not because it can't absorb writes but because it's serving too many concurrent reads.
At the same time, a nagging concern: the primary database is a single point of failure. If the primary goes down — hardware failure, corrupt data, a botched migration — every part of the platform that requires data is unavailable.
Both problems — read scaling and fault tolerance — have the same solution: replication.
The core idea
Database replication creates one or more copies of a database on separate nodes. The primary node accepts writes; replicas receive a copy of every write and can serve reads. Replication solves two problems simultaneously: read throughput (spread reads across multiple nodes) and availability (if the primary fails, promote a replica).
The analogy: a library with branch copies
A central library (the primary) holds the authoritative collection. When a new book arrives, a copy is made and sent to each branch library (the replicas). Most readers visit their local branch. The branches serve the same collection as the central library, reducing load there. If the central library catches fire, a branch holds everything needed to reconstruct the collection — and can temporarily take over as the new central library.
The branch copies might lag slightly behind the central library: a book acquired this morning might take a few hours to appear at the branches. This is replication lag, and it's the central tradeoff of replication.
How it works
Primary-replica (single-primary) replication
The most common architecture: one primary accepts all writes; one or more replicas receive those writes asynchronously and serve reads.
Application
├── All writes   → Primary (read-write)
└── Read queries → Replica 1, Replica 2, Replica 3 (read-only)
                       ↑
                   WAL stream from primary
Write path: the application sends writes to the primary. The primary writes to its WAL (Write-Ahead Log) and acknowledges the write.
Replication path: the primary streams its WAL to each replica. Each replica applies the WAL entries in order, maintaining an identical copy of the primary's data (with some lag).
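Read path: read queries are distributed across replicas. The application, or a query-routing proxy such as Pgpool-II (PostgreSQL) or ProxySQL (MySQL), sends read queries to replicas. Connection-pooling proxies such as PgBouncer or RDS Proxy do not inspect queries to split reads from writes; with them, the application typically connects to separate primary and replica endpoints.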
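The three paths above can be sketched at the application level. This is a minimal illustration, not a production router: the class name `ReadWriteRouter`, the node names, and the rule "anything starting with SELECT is a read" are all assumptions made for the example.

```python
import itertools


class ReadWriteRouter:
    """Route writes to the primary and spread reads across replicas (hypothetical helper)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary when no replicas are configured.
        self._read_targets = itertools.cycle(replicas or [primary])

    def target_for(self, sql):
        # Naive classification: only statements that start with SELECT count as reads.
        words = sql.split()
        is_read = bool(words) and words[0].upper() == "SELECT"
        return next(self._read_targets) if is_read else self.primary


router = ReadWriteRouter("primary", ["replica1", "replica2", "replica3"])
targets = [
    router.target_for("SELECT * FROM links WHERE code = 'abc'"),
    router.target_for("INSERT INTO links (code, url) VALUES ('abc', 'https://example.com')"),
    router.target_for("SELECT count(*) FROM clicks"),
]
print(targets)  # ['replica1', 'primary', 'replica2']
```

Round-robin is the simplest way to spread reads. A real router also has to treat statements such as `SELECT ... FOR UPDATE`, and any read inside a write transaction, as primary-only.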
Synchronous vs asynchronous replication
Asynchronous replication (default in PostgreSQL):
The primary acknowledges the write to the client as soon as it's committed locally, without waiting for replicas to confirm receipt.
- Pros: low write latency — the client doesn't wait for the replication network round-trip
- Cons: replication lag — replicas trail the primary by milliseconds to seconds (or more under high load). A read from a replica immediately after a write may return stale data
Synchronous replication:
The primary waits for at least one replica to confirm receipt before acknowledging the write to the client.
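-- PostgreSQL: require at least one replica to confirm each commit
-- 'replica1' must match the standby's application_name
ALTER SYSTEM SET synchronous_standby_names = 'replica1';
SELECT pg_reload_conf();  -- apply the change without a restart
SET synchronous_commit = on;  -- the default; can be relaxed per session or transaction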
- Pros: guaranteed durability — data is on at least two nodes before the write is acknowledged. No data loss even if the primary fails immediately after the write
- Cons: write latency increases by the network round-trip to the replica (typically 1–10ms). If the synchronous replica is unavailable, writes block
Most production setups use asynchronous replication for performance, with the understanding that a primary failure may lose a small window of recent writes. Critical data paths (financial transactions, user account changes) sometimes use synchronous replication for the strongest durability guarantee.
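To make the durability difference concrete, here is a toy model of the two modes. It is only a sketch: `Primary`, `Replica`, `ship_pending`, and `lost_on_failover` are made-up names, and "replication" is just a list append rather than a real WAL stream.

```python
class Replica:
    def __init__(self):
        self.log = []

    def apply(self, entry):
        self.log.append(entry)


class Primary:
    def __init__(self, replica, synchronous):
        self.log = []
        self.replica = replica
        self.synchronous = synchronous
        self.unsent = []  # committed locally, not yet shipped to the replica

    def write(self, entry):
        self.log.append(entry)  # local commit
        if self.synchronous:
            self.replica.apply(entry)  # wait for the replica before acknowledging
        else:
            self.unsent.append(entry)  # acknowledge now, ship later
        return "ack"

    def ship_pending(self):
        # Background replication catching up (asynchronous mode only).
        for entry in self.unsent:
            self.replica.apply(entry)
        self.unsent.clear()


def lost_on_failover(primary):
    """Acknowledged writes that the replica never received."""
    return [e for e in primary.log if e not in primary.replica.log]


async_primary = Primary(Replica(), synchronous=False)
async_primary.write("link-1")
async_primary.ship_pending()
async_primary.write("link-2")  # acknowledged, but the primary dies before shipping

sync_primary = Primary(Replica(), synchronous=True)
sync_primary.write("link-1")
sync_primary.write("link-2")

print(lost_on_failover(async_primary))  # ['link-2']
print(lost_on_failover(sync_primary))   # []
```

In the asynchronous case the client was told `link-2` succeeded, yet promoting the replica would lose it. The synchronous case pays for that safety on every `write` call.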
Replication lag
Replication lag is the delay between a write being committed on the primary and it becoming visible on a replica. Under normal conditions on a healthy network it is typically under 100ms. Under high write load, it can grow to seconds or minutes.
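Lag can also be measured in bytes of WAL rather than in time. PostgreSQL reports WAL positions (LSNs) as two hexadecimal numbers separated by a slash, for example from `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on a replica. The helper names below are hypothetical and the sample LSN values are invented for illustration.

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def lag_bytes(primary_lsn, replica_lsn):
    """WAL bytes the replica still has to replay (never negative)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn))


print(lag_bytes("0/3000148", "0/3000060"))  # 232
```

A byte count that keeps growing is usually a clearer alarm signal than a single time-based reading, which can look large simply because the primary has been idle.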
Lag matters in two scenarios:
Read-after-write consistency: a user creates a short link, is redirected to the link's detail page, but the detail page reads from a replica that hasn't yet received the new link. The page shows "link not found." This is confusing and wrong.
Stale analytics: dashboard queries on a replica may show click counts that are seconds or minutes behind the primary. Usually acceptable for analytics; not acceptable for balance-critical operations.
Strategies to handle lag:
- Read-your-writes: after a write, route the user's next read to the primary (or to a replica you know has received the write)
- Monotonic reads: once a user has read from a replica at a given replication position, route their subsequent reads only to replicas that are at least as caught up, so they never see data move backwards in time
- Causal consistency: track which writes a user has seen and only serve reads from replicas that have applied at least those writes
Most applications handle this pragmatically: reads that immediately follow a user's own write go to the primary; most other reads go to replicas.
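Read-your-writes routing can be sketched by remembering the log position of each user's latest write and only using replicas that have replayed at least that far. Everything here is an assumption for illustration: the `LagAwareRouter` class, integer log positions, and the idea that the router somehow knows each replica's current position.

```python
class LagAwareRouter:
    def __init__(self, primary, replica_positions):
        self.primary = primary
        # Hypothetical input: replica name -> last replayed log position.
        self.replica_positions = dict(replica_positions)
        self.last_write = {}  # user id -> log position of that user's latest write

    def record_write(self, user_id, position):
        self.last_write[user_id] = max(position, self.last_write.get(user_id, 0))

    def target_for_read(self, user_id):
        needed = self.last_write.get(user_id, 0)
        caught_up = [
            name
            for name, position in sorted(self.replica_positions.items())
            if position >= needed
        ]
        # No replica has the user's write yet: fall back to the primary.
        return caught_up[0] if caught_up else self.primary


lag_router = LagAwareRouter("primary", {"replica1": 100, "replica2": 120})
lag_router.record_write("alice", 130)
print(lag_router.target_for_read("alice"))  # primary
print(lag_router.target_for_read("bob"))    # replica1

lag_router.replica_positions["replica2"] = 135  # replica2 catches up
print(lag_router.target_for_read("alice"))  # replica2
```

The sketch always picks the first eligible replica to keep the output deterministic; a real router would balance load across all eligible replicas.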
Failover
When the primary fails, a replica is promoted to become the new primary. This process is called failover.
Manual failover: an operator observes that the primary is down, manually promotes a replica, and updates the application's connection configuration. Reliable but slow — typically 5–15 minutes of downtime.
Automatic failover: a high-availability agent (Patroni for PostgreSQL, MHA for MySQL, AWS RDS Multi-AZ) automatically detects primary failure and promotes a replica. Typical automated failover time: 30–60 seconds. The application needs to handle reconnection gracefully.
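The detection and promotion steps can be illustrated in a few lines. This is a general sketch, not how Patroni, MHA, or RDS implement it: the `FailureDetector` class, the threshold of three missed checks, and the integer log positions are assumptions.

```python
class FailureDetector:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def observe(self, healthy):
        """Record one health check; return True when failover should start."""
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


def choose_promotion_candidate(replica_positions):
    """Pick the replica that has replayed the most of the primary's log."""
    return max(sorted(replica_positions), key=lambda name: replica_positions[name])


detector = FailureDetector(threshold=3)
checks = [True, False, True, False, False, False]
decisions = [detector.observe(healthy) for healthy in checks]
print(decisions)  # [False, False, False, False, False, True]

candidate = choose_promotion_candidate({"replica1": 980, "replica2": 1000, "replica3": 1000})
print(candidate)  # replica2
```

Requiring several consecutive failures avoids failing over on a single dropped health check, at the cost of a longer detection time. Promoting the most caught-up replica keeps the data loss window as small as possible.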
Failover considerations:
- Data loss window: with async replication, the promoted replica may not have the last few seconds of writes from the primary. These writes are lost.
- Split-brain: two nodes simultaneously believe they are the primary. This can happen if the primary is slow but not completely dead — the HA agent promotes a replica, but the "dead" primary recovers and starts accepting writes again. Both nodes are now accepting writes to the same dataset. Fencing mechanisms (STONITH — "Shoot The Other Node In The Head") prevent this.
- Connection routing: applications must be able to reconnect to the new primary. Connection poolers and DNS-based failover (updating a CNAME to point to the new primary) are common solutions.
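STONITH fences the old primary by powering it off. A related, software-only idea is to give each promotion a strictly increasing epoch number and reject writes that carry an older one. The sketch below shows that epoch check, not STONITH itself; `SharedStore` and `StaleEpochError` are hypothetical names.

```python
class StaleEpochError(Exception):
    pass


class SharedStore:
    """Stands in for whatever the nodes write to; it remembers the newest epoch."""

    def __init__(self):
        self.current_epoch = 0
        self.rows = []

    def promote(self):
        # Each promotion hands out a strictly larger epoch.
        self.current_epoch += 1
        return self.current_epoch

    def write(self, epoch, row):
        if epoch < self.current_epoch:
            raise StaleEpochError(f"epoch {epoch} was superseded by {self.current_epoch}")
        self.rows.append(row)


store = SharedStore()
old_primary_epoch = store.promote()
store.write(old_primary_epoch, "link-1")

new_primary_epoch = store.promote()  # failover: a replica is promoted
try:
    store.write(old_primary_epoch, "link-2")  # the old primary wakes up and tries to write
except StaleEpochError as error:
    print("rejected:", error)  # rejected: epoch 1 was superseded by 2

store.write(new_primary_epoch, "link-3")
print(store.rows)  # ['link-1', 'link-3']
```

The old primary can still believe it is in charge, but its writes no longer land anywhere, which is the property fencing is meant to provide.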
Multi-primary replication
In a multi-primary (multi-master) setup, multiple nodes accept writes. Each primary's writes must be replicated to the others, and conflicting writes to the same data must be detected and resolved.
Galera Cluster (MySQL): uses synchronous multi-primary replication with certification-based conflict detection. Writes go to any node; conflicts (two nodes trying to write the same row simultaneously) are detected and one is rolled back.
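The "first committer wins" idea behind certification can be modelled in a few lines. This is a heavily simplified sketch of the concept, not Galera's implementation: the `Certifier` class, the string row keys, and the single global commit counter are all assumptions.

```python
class Certifier:
    def __init__(self):
        self.commit_seq = 0
        self.last_writer = {}  # row key -> commit sequence number of its last write

    def begin(self):
        """Return the snapshot position a new transaction starts from."""
        return self.commit_seq

    def certify(self, snapshot, write_set):
        """Commit if no row in write_set changed after snapshot; otherwise reject."""
        if any(self.last_writer.get(key, 0) > snapshot for key in write_set):
            return False  # conflict: the caller rolls back
        self.commit_seq += 1
        for key in write_set:
            self.last_writer[key] = self.commit_seq
        return True


certifier = Certifier()
txn_a = certifier.begin()  # started on node A
txn_b = certifier.begin()  # started on node B at the same time

print(certifier.certify(txn_a, {"links:abc"}))  # True
print(certifier.certify(txn_b, {"links:abc"}))  # False
print(certifier.certify(txn_b, {"links:xyz"}))  # True
```

Both transactions started from the same snapshot. Whichever certifies first commits; the second one touching the same row is rolled back, while a write to a different row still succeeds.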
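Amazon Aurora Multi-Master: an Aurora MySQL feature, since retired by AWS, that allowed writes through instances in multiple availability zones simultaneously. Conflicting writes were rejected and surfaced to the application as errors to retry, rather than merged automatically.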
CockroachDB / Google Spanner: distributed SQL databases that support multi-primary across regions using consensus protocols (Paxos/Raft).
Multi-primary is powerful for write availability (no single primary is a write SPOF) but introduces write conflict complexity. For most applications, single-primary with automated failover is operationally simpler and sufficient.
The tradeoffs
Read scalability vs consistency. Read replicas scale reads roughly linearly: three replicas can serve about three times the read throughput of one, minus the capacity each replica spends applying the primary's writes. The cost is eventual consistency: replicas may lag. Applications must be written with this in mind.
Durability vs write latency. Synchronous replication guarantees no data loss but adds latency. Asynchronous replication is fast but risks losing recent writes on primary failure.
Failover speed vs safety. Automated failover reduces downtime but risks split-brain. Manual failover is safer but slower.
Read replicas vs sharding. Read replicas scale reads; sharding scales writes and storage. If your bottleneck is read throughput, add read replicas (much simpler). If your bottleneck is write throughput or data volume, you need sharding. Many production systems use both.
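For the read-scaling tradeoff, a back-of-the-envelope sizing helps. The sketch below reuses the opening scenario (70 points of CPU on reads, 20 on writes) and adds two assumptions that are not from the scenario: replicas run on the same hardware as the primary, and replaying writes costs each replica as much CPU as the original writes cost the primary, which is deliberately pessimistic. The 60% target utilisation is also an arbitrary choice.

```python
import math


def replicas_needed(read_load, replay_load, target_utilisation):
    """Replicas required to carry read_load while each also replays writes."""
    headroom = target_utilisation - replay_load
    if headroom <= 0:
        raise ValueError("replaying writes alone exceeds the target utilisation")
    return math.ceil(read_load / headroom)


print(replicas_needed(read_load=70, replay_load=20, target_utilisation=60))  # 2
```

Each replica has 40 points of headroom for reads under these assumptions, so 70 points of read load needs two replicas. The same arithmetic shows why replicas stop helping as write load grows: once `replay_load` approaches the target, no number of replicas is enough and sharding becomes the next step.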
When to use replication
Use read replicas when:
- Read queries dominate your workload (analytics dashboards, reporting, high-traffic reads)
- You need a hot standby for fast failover with minimal downtime
- You want to run analytics workloads without impacting production write performance
Use synchronous replication when:
- Data loss on primary failure is unacceptable
- You have a strict RPO (Recovery Point Objective) requirement, at or near zero
Consider multi-primary when:
- Write availability across regions is required
- You're building an active-active multi-region setup
Don't use read replicas as a substitute for sharding if your problem is write throughput or storage — replicas are still limited to one primary's write capacity.
The one thing to remember
Replication makes a copy of your data on multiple nodes, solving two problems at once: read throughput (spread reads across replicas) and availability (promote a replica if the primary fails). The fundamental tradeoff is between write latency and durability: async replication is fast but risks a small data loss window on failure; sync replication eliminates that window but adds latency to every write. Know which tradeoff your application can afford before choosing.