Series: System Design · Data & Storage — Pillar 4 of 8

Systems Design

#	Post	What it covers
00	Data & Storage: Where Everything Lives	Where data lives shapes everything about a system. Nineteen concepts covering databases, indexing, sharding, replication, and the data structures underneath. (161 chars)
01	SQL vs NoSQL: Choosing the Right Database	SQL vs NoSQL isn't a simple choice. Learn what each type optimises for, when to use relational databases, and when NoSQL is the right call.
02	Database Indexing: The Highest-Leverage Performance Tool	Indexes are the highest-leverage database performance tool. Learn how they work, what they cost, and how to decide when to add one.
03	B-Trees & B+ Trees: The Data Structure Behind Database Indexes	Almost every database index is built on a B-tree or B+ tree. Learn how they work, why they're fast, and what this means for your queries.
04	LSM Trees: Why Some Databases Are Built for Writes	LSM trees power Cassandra, RocksDB, and LevelDB. Learn how they achieve massive write throughput and what they trade off to get it.
05	Denormalisation: Trading Storage for Speed	Denormalisation trades storage for read speed by pre-computing joins. Learn when it helps, when it hurts, and how to do it safely.
06	Database Sharding: Scaling Beyond a Single Node	Sharding splits a database across multiple nodes. Learn how it works, the strategies available, and the significant tradeoffs it introduces.
07	Data Partitioning: Choosing How to Divide Your Data	Range, hash, and list partitioning each make different tradeoffs. Learn how to divide data effectively for queries, maintenance, and scale.
08	Consistent Hashing: Minimising Resharding Pain	Consistent hashing minimises data movement when nodes are added or removed. Learn how it works and why it's fundamental to distributed systems.
09	Replication & Read Replicas: Scaling Reads and Surviving Failures	Replication copies data across nodes for fault tolerance and read scaling. Learn how primary-replica setups work and when to use them.
10	Object Storage: Unlimited Scale for Large Binary Data	Object storage handles large binary files at unlimited scale. Learn how it works, why it replaced file servers, and when to use it.
11	Block vs File vs Object Storage: Three Models, Three Use Cases	Three storage models, three different use cases. Learn what block, file, and object storage optimise for and how to choose between them.
12	Distributed File Systems: File Storage Across Many Machines	Distributed file systems spread file storage across many machines. Learn how HDFS, Ceph, and GlusterFS work and when to use them.
13	Time Series Databases: Built for Metrics and Events	Time series databases handle append-heavy metric data far better than SQL. Learn how they work and when to use InfluxDB, Prometheus, or TimescaleDB.
14	Vector Databases: Semantic Search and AI Memory	Vector databases power semantic search, recommendations, and LLM memory. Learn how embeddings work, what ANN search is, and when to use one.
15	Full-Text Search Engines: Beyond SQL LIKE	Full-text search needs more than SQL LIKE. Learn how inverted indexes, relevance ranking, and Elasticsearch make text search fast and powerful.
16	Materialized Views: Pre-Computing Expensive Queries	Materialized views cache expensive query results as physical tables. Learn how they work, when to refresh them, and when to use them vs other approaches.
17	Query Optimisation: From Slow to Fast	Slow queries aren't always fixed by adding indexes. Learn how to read EXPLAIN output, understand query plans, and systematically make queries fast.
18	Connection Pooling: Managing the Hidden Bottleneck	Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and how to size your pool correctly.
19	Data & Storage: Wrap-Up ← you are here	A recap of all 19 data storage concepts: SQL, NoSQL, indexing, sharding, replication, specialised databases, and how they connect in a real system.

Data & Storage: Wrap-Up

Nineteen concepts across five themes. This post ties them together — not as a list of definitions, but as a set of connected decisions you'll encounter when building and operating data-driven systems.

The one thing to remember from each post

SQL vs NoSQL — NoSQL is not one thing. Document, key-value, wide-column, and graph databases each optimise for a specific data model and access pattern. Choose based on shape and access, not trend.

Indexing — an index is a contract: specific reads become fast at the cost of every write. Add indexes for queries you know are slow, not queries you imagine might be slow someday.

B-trees & B+ trees — the index is a balanced tree sorted by key. The leftmost column rule exists because the tree is sorted left-to-right. Range queries are fast because leaf nodes are linked in sorted order.

LSM trees — writes are always appends (to the journal, then to immutable files). Reads check the journal and multiple files. Compaction merges files in the background. The write-throughput win is large; the read complexity is managed by bloom filters.

Denormalisation — deliberately redundant copies make reads fast but write complexity rises. Every denormalised field adds logic to every write path. Measure first, denormalise second, and know exactly who keeps every copy in sync.

Sharding — splits data across nodes for write throughput and storage capacity beyond one machine. The shard key is the most consequential design decision. Cross-shard queries, cross-shard transactions, and resharding are where complexity lives.

Data partitioning — range partitioning enables efficient time-range queries and retention (drop old partitions). Hash partitioning distributes writes evenly but loses range pruning. The partition key is a query-routing decision.

Consistent hashing — nodes and keys on a ring; adding or removing a node moves only ~1/n of keys. Virtual nodes ensure even distribution. The solution to the rehashing storm that modulo-based routing produces.

Replication — copies data to multiple nodes. Read replicas scale reads and provide failover. Async replication is fast but may lose recent writes on primary failure. Sync replication eliminates that gap at write latency cost.

Object storage — unlimited scale for large binary objects, accessed via HTTP. Immutable (can't update in-place). Eleven nines durability. The right home for user uploads, backups, and static assets.

Block vs file vs object storage — block storage is a disk (for databases and VMs). File storage is a shared filesystem (for multi-host POSIX access). Object storage is a key-value HTTP store (for large binary content). Using the wrong model doesn't just hurt performance — it breaks things.

Distributed file systems — file storage across multiple machines. Relevant for on-premises compute-storage co-location. On public cloud, S3 + modern compute connectors has largely displaced HDFS for new workloads.

Time series databases — purpose-built for append-only, time-ordered data. Delta-encoded columnar storage compresses 10–100x better than row-based formats. Retention policies and downsampling keep storage costs manageable. Use when write volume or query complexity exceeds what PostgreSQL can handle.

Vector databases — store high-dimensional embeddings and answer "what is most similar to this?" via approximate nearest neighbour search. The foundation of semantic search, recommendations, and LLM memory. Start with pgvector; move to a purpose-built vector DB at scale.

Full-text search engines — inverted indexes map terms to documents; BM25 scores by relevance; stemming and fuzzy matching handle linguistic variation. SQL LIKE is not search. Use Elasticsearch, OpenSearch, or Typesense when search is a user-facing feature.

Materialized views — pre-computed query results stored as physical tables. Expensive aggregates run once (on a schedule), fast reads serve the result. The tradeoff is freshness: data is as fresh as the last refresh.

Query optimisation — EXPLAIN (ANALYZE) is your primary tool. Find the most expensive node in the plan, fix it, repeat. Functions on indexed columns, leading wildcards, and OR across different columns are common plan-breakers. Statistics drive the planner — stale statistics produce bad plans.

Connection pooling — keeps a fixed number of pre-established connections and shares them across many application workers. Pool size should match database CPU capacity (~2× vCPUs), not application worker count. PgBouncer in transaction mode is the standard PostgreSQL solution for multi-server deployments.

How they connect

These concepts don't exist independently. Most real storage decisions involve several of them interacting:

Indexing + B+ trees + query optimisation form a cluster. You understand B+ trees to know why indexes work and why the leftmost column rule exists. You use EXPLAIN to verify that indexes are actually being used. You know when to add an index and when a query rewrite is the right fix instead.

Sharding + data partitioning + consistent hashing form the horizontal scaling cluster. Partitioning is the strategy (range, hash, list) for dividing data. Sharding is distributing those partitions across nodes. Consistent hashing makes cluster size changes non-destructive.

Replication + read replicas are the availability and read-scale cluster. Replication provides redundancy; read replicas offload the primary. They work alongside sharding — a sharded cluster usually has each shard replicated.

LSM trees + time series databases share a storage model — append-only writes, immutable files, background compaction. Time series databases take these properties and build a full query engine optimised for time-windowed aggregation on top.

Denormalisation + materialized views are the pre-computation cluster. Both trade freshness for read speed. Denormalisation happens at the schema level; materialized views happen at the query result level. Both add write complexity or require scheduled refresh.

Object storage + distributed file systems are the binary data cluster. Object storage is the cloud-native approach (simple, scalable, operationally minimal). Distributed file systems are the on-premises approach (POSIX semantics, data locality for co-located compute).

Vector databases + full-text search are the "finding things" cluster. Full-text search finds documents containing specific terms (lexical similarity). Vector search finds documents with similar meaning (semantic similarity). Modern search products often combine both: vector search for semantic recall, keyword search for precision.

End-to-end: the URL shortener's complete data architecture

The URL shortener started with a single PostgreSQL instance. By the end of this pillar, it looks like this:

PostgreSQL (primary + 2 read replicas)
  Tables: users, links, teams, billing
  Storage: EBS block storage, db.r5.8xlarge
  Purpose: core relational data, ACID transactions, complex joins
  
  Indexes:
  - links(short_code) UNIQUE  ← redirect hot path
  - links(user_id, created_at DESC)  ← dashboard queries
  - links(destination_url) gin trgm  ← URL fuzzy search

PgBouncer
  Pool size: 40 server connections
  Mode: transaction pooling
  Between: 80 application servers → PostgreSQL primary

Redis Cluster
  Purpose: redirect cache (top 1M links), session tokens, rate limit counters
  TTL: 24 hours for links, 30 minutes for sessions
  Consistent hashing: 6-node cluster

Cassandra Cluster (6 nodes)
  Keyspace: click_events
  Partition key: link_id (all clicks for a link on one shard)
  Clustering key: clicked_at DESC
  Compaction: TWCS (time-window, 7-day windows)
  Purpose: 500B+ click events, write-heavy, time-series access
  
InfluxDB
  Purpose: infrastructure metrics (CPU, latency, error rates)
  Retention: raw data 7 days, 1-min aggregates 90 days, 1-hour aggregates 3 years

Elasticsearch Cluster (5 nodes, 1 replica each)
  Index: links
  Populated from: PostgreSQL changes via Debezium CDC
  Purpose: full-text search over link names and destination URLs

Pinecone (or pgvector on the primary)
  Purpose: semantic similarity for link recommendations
  Embeddings: OpenAI text-embedding-3-small (1536 dimensions)
  Index: HNSW, cosine similarity

Amazon S3
  Buckets: user-uploads (QR codes, thumbnails), exports (analytics CSV), backups
  Lifecycle: exports → Glacier after 90 days
  CDN: CloudFront as origin shield

Materialized views (on PostgreSQL read replica):
  mv_user_link_analytics: per-user, per-link, per-day click totals
  Refreshed every 15 minutes via pg_cron
  
Data partitioning:
  PostgreSQL links table: range-partitioned by user_id range (future growth)
  Cassandra click_events: hash-partitioned by link_id

A redirect in this system:

Request arrives: GET sho.rt/x7Kp2
Application checks Redis: GET link:x7Kp2 — cache hit in <1ms → return destination URL
(On cache miss): PostgreSQL index scan on links(short_code) — ~2ms
Return 301 redirect to browser
Async click event: write to Cassandra via the click pipeline (non-blocking, fire-and-forget)
Async Redis counter: INCR stats:link:x7Kp2:clicks (for real-time counter display)
Async Kafka event → Elasticsearch indexer (keep search index current)

The dashboard for user 123:

Read from materialized view: SELECT * FROM mv_user_link_analytics WHERE user_id = 123 — fast, pre-built
Stale by up to 15 minutes: acceptable for an analytics dashboard

A recommendation:

Retrieve link x7Kp2's stored embedding from Pinecone
ANN search: 5 most similar embeddings in user 123's link collection
Return recommendations alongside the link detail page

What this pillar didn't cover

Distributed transactions across heterogeneous stores — if a write must atomically update PostgreSQL and Cassandra, you need saga patterns or two-phase commit. Covered in Pillar 8.

Change data capture — how changes in PostgreSQL flow to Elasticsearch and other downstream systems in real time. Covered in Pillar 8.

Caching layer in depth — Redis and caching strategies, cache invalidation, cache eviction policies, cache stampede. Covered in Pillar 5.

Query languages for specialised stores — PromQL, Flux, InfluxQL, Cassandra CQL, Elasticsearch DSL. These deserve deeper treatment than this pillar could give them.

The decision tree

Before adding a new data store, work through this:

Is the data relational with unknown query patterns?
  → PostgreSQL (default)

Is it large binary content?
  → Object storage (S3)

Is it metrics or time-ordered events at high volume?
  → Time series database (InfluxDB, TimescaleDB)

Do users need to search it by keywords with relevance ranking?
  → Full-text search engine (Elasticsearch, Typesense)

Do you need to find semantically similar items?
  → Vector database (Pinecone, pgvector)

Is it key-value lookups needing sub-millisecond latency?
  → Redis (covered in Pillar 5)

Does it need POSIX filesystem access across multiple servers?
  → File storage (EFS, GlusterFS)

Is your PostgreSQL read load the bottleneck?
  → Read replicas (before sharding)

Has your PostgreSQL write load or data volume exceeded one node?
  → Sharding with consistent hashing

Up next: Pillar 5 — Caching

The redirect critical path has one cache lookup on it: Redis. We've treated Redis as a black box throughout this pillar. Pillar 5 opens it up.

Caching is one of the most consequential performance decisions in a distributed system — and one of the most error-prone. Cache invalidation is famously one of the two hard problems in computer science. Cache stampedes can bring a database down when a cache layer fails or expires. Cache warming determines how quickly a new deployment recovers.

**← Previous: Connection Pooling: Managing the Hidden Bottleneck — Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and...*

Six concepts that determine whether your caching layer is an asset or a liability: Caching, Cache Invalidation, Distributed Cache, Cache Eviction Policies, Cache Stampede, and Cache Warming.

*← Previous: Connection Pooling: Managing the Hidden Bottleneck — Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and...*This is the end of the Data & Storage pillar. Continue to Pillar 5 — Caching →

Data & Storage: Wrap-Up

Systems Design

Data & Storage: Wrap-Up

The one thing to remember from each post

How they connect

End-to-end: the URL shortener's complete data architecture

What this pillar didn't cover

The decision tree

Up next: Pillar 5 — Caching

Comments

Systems Design

More from this blog

Architecture Patterns: Wrap-Up

MapReduce: Processing Petabytes in Parallel

Batch vs Stream Processing: How Fresh Do Your Answers Need to Be?

ETL Pipelines: Moving Data from Operations to Analytics

Backend for Frontend: One API Per Client Type

Command Palette

Systems Design

Data & Storage: Wrap-Up

The one thing to remember from each post

How they connect

End-to-end: the URL shortener's complete data architecture

What this pillar didn't cover

The decision tree

Up next: Pillar 5 — Caching

Comments

Systems Design

More from this blog