Series: System Design · Data & Storage — Pillar 4 of 8

Systems Design

#	Post	What it covers
00	Data & Storage: Where Everything Lives	Where data lives shapes everything about a system. Nineteen concepts covering databases, indexing, sharding, replication, and the data structures underneath. (161 chars)
01	SQL vs NoSQL: Choosing the Right Database ← you are here	SQL vs NoSQL isn't a simple choice. Learn what each type optimises for, when to use relational databases, and when NoSQL is the right call.
02	Database Indexing: The Highest-Leverage Performance Tool	Indexes are the highest-leverage database performance tool. Learn how they work, what they cost, and how to decide when to add one.
03	B-Trees & B+ Trees: The Data Structure Behind Database Indexes	Almost every database index is built on a B-tree or B+ tree. Learn how they work, why they're fast, and what this means for your queries.
04	LSM Trees: Why Some Databases Are Built for Writes	LSM trees power Cassandra, RocksDB, and LevelDB. Learn how they achieve massive write throughput and what they trade off to get it.
05	Denormalisation: Trading Storage for Speed	Denormalisation trades storage for read speed by pre-computing joins. Learn when it helps, when it hurts, and how to do it safely.
06	Database Sharding: Scaling Beyond a Single Node	Sharding splits a database across multiple nodes. Learn how it works, the strategies available, and the significant tradeoffs it introduces.
07	Data Partitioning: Choosing How to Divide Your Data	Range, hash, and list partitioning each make different tradeoffs. Learn how to divide data effectively for queries, maintenance, and scale.
08	Consistent Hashing: Minimising Resharding Pain	Consistent hashing minimises data movement when nodes are added or removed. Learn how it works and why it's fundamental to distributed systems.
09	Replication & Read Replicas: Scaling Reads and Surviving Failures	Replication copies data across nodes for fault tolerance and read scaling. Learn how primary-replica setups work and when to use them.
10	Object Storage: Unlimited Scale for Large Binary Data	Object storage handles large binary files at unlimited scale. Learn how it works, why it replaced file servers, and when to use it.
11	Block vs File vs Object Storage: Three Models, Three Use Cases	Three storage models, three different use cases. Learn what block, file, and object storage optimise for and how to choose between them.
12	Distributed File Systems: File Storage Across Many Machines	Distributed file systems spread file storage across many machines. Learn how HDFS, Ceph, and GlusterFS work and when to use them.
13	Time Series Databases: Built for Metrics and Events	Time series databases handle append-heavy metric data far better than SQL. Learn how they work and when to use InfluxDB, Prometheus, or TimescaleDB.
14	Vector Databases: Semantic Search and AI Memory	Vector databases power semantic search, recommendations, and LLM memory. Learn how embeddings work, what ANN search is, and when to use one.
15	Full-Text Search Engines: Beyond SQL LIKE	Full-text search needs more than SQL LIKE. Learn how inverted indexes, relevance ranking, and Elasticsearch make text search fast and powerful.
16	Materialized Views: Pre-Computing Expensive Queries	Materialized views cache expensive query results as physical tables. Learn how they work, when to refresh them, and when to use them vs other approaches.
17	Query Optimisation: From Slow to Fast	Slow queries aren't always fixed by adding indexes. Learn how to read EXPLAIN output, understand query plans, and systematically make queries fast.
18	Connection Pooling: Managing the Hidden Bottleneck	Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and how to size your pool correctly.
19	Data & Storage: Wrap-Up	A recap of all 19 data storage concepts: SQL, NoSQL, indexing, sharding, replication, specialised databases, and how they connect in a real system.

SQL vs NoSQL: Choosing the Right Database

The problem

Your URL shortener has been running on PostgreSQL for two years. It handles users, links, and click events. The relational model fits perfectly — users have many links, links have many clicks, foreign keys keep everything consistent.

Then three things happen in the same quarter. Your analytics team wants to run aggregate queries across twelve months of click data — hundreds of millions of rows, slow as death on PostgreSQL under concurrent write load. Your new recommendation feature needs to find links similar to what a user just visited — a query PostgreSQL simply can't answer efficiently. And your mobile team wants to store user preferences with a schema that changes every sprint, and they're tired of writing migrations.

Three problems. Three different data shapes. Three different access patterns. One relational database, struggling to be all things.

The question isn't "should I use SQL or NoSQL?" The question is: what does this specific data look like, and how will it be accessed?

The core idea

SQL (relational) databases store data in tables with fixed schemas and support joins, transactions, and complex queries across related data. NoSQL databases trade some of those capabilities for flexibility, scale, or performance in a specific access pattern — but "NoSQL" isn't a single thing; it's four distinct data models, each optimising for something different.

The right choice depends on your data's structure and how you read it. Not on what's fashionable, not on what scale you imagine reaching someday.

The analogy: a filing cabinet vs a specialised warehouse

A relational database is a filing cabinet with a meticulous cross-reference system. Every document follows the same form. Every drawer connects to every other via index cards. Want to find all invoices for customers in Sydney who ordered more than $1,000 last quarter? The cross-reference system handles it. The cost: every document must conform to the form, and adding a new field means updating every drawer.

NoSQL databases are specialised warehouses, each built for a different kind of cargo:

A key-value store is a parcel locker — give it a key, get back a package, nothing else
A document store is a filing cabinet where each folder can have a completely different internal structure
A wide-column store is a spreadsheet optimised for columns with millions of rows and sparse data
A graph database is a web of interconnected nodes where the connections are as important as the contents

The filing cabinet handles anything. Each specialised warehouse handles its specific cargo faster, at larger scale, and with less overhead — but only for that cargo.

How it works

Relational databases (SQL)

Relational databases organise data into tables (relations) with rows and columns. Every row in a table has the same set of columns. Relationships between tables are expressed via foreign keys.

The core strengths:

ACID transactions — atomicity, consistency, isolation, durability. Multiple related changes succeed or fail together. The bank transfer that debits one account and credits another either fully completes or fully rolls back.
Joins — query across related tables in a single statement. SELECT u.email, COUNT(l.id) FROM users u JOIN links l ON l.user_id = u.id GROUP BY u.email — one query, two tables, no application-level assembly.
Schema enforcement — the database refuses data that doesn't match the schema. Bad data is caught at write time.
Flexible querying — SQL's WHERE, GROUP BY, ORDER BY, and aggregate functions let you query data in ways you didn't anticipate at design time.

The constraints:

Schema rigidity — adding or changing columns requires migrations. On large tables, this can be expensive.
Horizontal scaling is hard — relational databases are designed to run on one node. Distributing them across multiple nodes (sharding) requires significant engineering and introduces tradeoffs.
Write throughput limits — a single PostgreSQL node handles tens of thousands of writes per second. For workloads with millions of writes per second, you need a different approach.

When to reach for SQL: structured data with clear relationships, complex queries that aren't known ahead of time, any situation where data correctness and transactional integrity matter.

The URL shortener uses PostgreSQL for: users, links, teams, billing — core relational data with foreign keys and transactions.

NoSQL: document stores

Document stores (MongoDB, DynamoDB in document mode, Firestore) store data as self-contained documents — typically JSON or BSON — with no fixed schema. Each document can have different fields.

// Link document — fields vary by link type
{
  "id": "x7Kp2",
  "url": "https://example.com/landing-page",
  "created_by": "user_123",
  "metadata": {
    "utm_source": "newsletter",
    "utm_campaign": "summer_2025",
    "custom_fields": {
      "internal_tag": "Q3-campaign",
      "product_line": "enterprise"
    }
  },
  "ab_test": {
    "variant": "B",
    "experiment_id": "exp_789"
  }
}

Strengths:

Schema flexibility — different documents in the same collection can have different shapes. No migrations when you add a field.
Embedded documents — related data lives together. One read fetches the whole object rather than joining across tables.
Horizontal scaling — most document stores are designed to shard natively.

Weaknesses:

No joins (or expensive application-level joins) — if data is related across collections, the application must assemble it
Weaker consistency guarantees in multi-document operations (MongoDB added multi-document ACID transactions, but they're slower and less ergonomic than SQL)
Duplication — embedding data means the same value appears in multiple documents and can drift out of sync

Best fit: flexible schemas, document-centric reads (retrieve the whole object at once), and write-heavy workloads where schema evolution is frequent.

NoSQL: key-value stores

Key-value stores (Redis, DynamoDB in KV mode, Memcached) are the simplest data model: a key maps to a value. Give the key, get the value. Nothing else is supported.

SET session:user_123  {"token": "abc...", "expires": 1735689600}
GET session:user_123  → {"token": "abc...", "expires": 1735689600}

Strengths:

Extreme speed — Redis delivers sub-millisecond lookups. Nothing else comes close.
Simple operations — get, set, delete, expire. No query planning, no parsing, no joins.
In-memory option — Redis keeps data in RAM, making it the natural choice for caching.

Weaknesses:

No querying — you must know the exact key. There's no equivalent of WHERE value > X.
Limited data relationships — you can store complex objects as values, but the store knows nothing about their structure.

Best fit: sessions, caches, rate limiting counters, real-time leaderboards, anything where you know the key ahead of time and need the value immediately.

The URL shortener uses Redis for: session tokens, rate limit counters, and caching the top 1,000 most-clicked links (the redirect hot path needs sub-millisecond lookup, not a PostgreSQL query).

NoSQL: wide-column stores

Wide-column stores (Cassandra, HBase, Google Bigtable) organise data as rows with a very large, sparse number of columns. Rows are indexed by a partition key; within a partition, rows are sorted by a clustering key. The data model is designed around specific query patterns — you design the table for the queries you'll run, not for normalisation.

-- Cassandra table for click events
CREATE TABLE click_events (
  link_id     TEXT,
  clicked_at  TIMESTAMP,
  country     TEXT,
  device      TEXT,
  PRIMARY KEY (link_id, clicked_at)
) WITH CLUSTERING ORDER BY (clicked_at DESC);

This table answers "give me the last N clicks for link X" in constant time — that's the query it was designed for. Answering "give me all clicks from Australia yesterday" requires a separate table designed for that access pattern.

Strengths:

Massive write throughput — Cassandra is designed to handle millions of writes per second across a cluster
Linear horizontal scaling — add nodes, get proportionally more capacity
Time-series friendly — the partition + clustering key model maps naturally to "give me data for entity X in time range Y"

Weaknesses:

Query inflexibility — you design tables around known access patterns. Ad-hoc queries are painful.
Eventual consistency — Cassandra defaults to eventual consistency; strong consistency requires quorum reads/writes at a performance cost
Operational complexity — Cassandra clusters are non-trivial to operate

Best fit: write-heavy, time-ordered data with well-known access patterns. Event logs, IoT sensor data, time series at massive scale.

NoSQL: graph databases

Graph databases (Neo4j, Amazon Neptune) store data as nodes (entities) and edges (relationships) with properties on both. The fundamental query is traversal: "starting from this node, follow these relationship types N hops deep."

// Neo4j: find links that users similar to user_123 have bookmarked
MATCH (u:User {id: 'user_123'})-[:SIMILAR_TO]->(other:User)-[:BOOKMARKED]->(link:Link)
WHERE NOT (u)-[:BOOKMARKED]->(link)
RETURN link.url LIMIT 10

Strengths:

Relationship traversal — queries that would require many self-joins in SQL are expressed naturally
Variable-depth traversal — "find all nodes within 3 hops" is trivial; in SQL it requires recursive CTEs
Relationship properties — edges can carry data ("user A followed user B on date X with confidence score Y")

Weaknesses:

Poor fit for non-graph workloads — if your primary access pattern isn't traversal, the overhead isn't worth it
Scaling challenges — graph databases are harder to shard than other NoSQL types because relationships cross partition boundaries

Best fit: social networks, recommendation engines, fraud detection, knowledge graphs — any domain where the relationships between entities are as important as the entities themselves.

The decision framework

Is your data relational with complex querying needs?
  Yes → SQL (PostgreSQL is the right default)

Is your data document-oriented with flexible schema?
  Yes → Document store (MongoDB, DynamoDB)

Do you need sub-millisecond lookup by known key?
  Yes → Key-value store (Redis, DynamoDB)

Is your workload write-heavy with millions of events per second?
  Yes → Wide-column store (Cassandra)

Is the relationship between entities the primary concern?
  Yes → Graph database (Neo4j, Neptune)

Do you need time-series aggregation at scale?
  → Time series database (covered in Post 13)

Do you need similarity search or semantic queries?
  → Vector database (covered in Post 14)

Do you need full-text search with relevance ranking?
  → Search engine (covered in Post 15)

The tradeoffs

Flexibility vs correctness. NoSQL's schema flexibility is genuinely useful — but it's a double-edged sword. A document store that accepts any shape means bad data slips in at write time and causes bugs at read time. Schema validation at the application layer is messier than schema enforcement at the database layer.

Horizontal scale vs transactional integrity. Most NoSQL databases sacrifice multi-document ACID transactions to achieve horizontal scale. For most workloads that's fine — you rarely need to atomically update five documents. For financial transactions, it's a serious problem.

Query flexibility vs access pattern alignment. SQL lets you query any column with any condition; you don't need to know your queries at schema design time. Wide-column stores require you to design tables around your queries. If your access patterns change, your data model may need to change too — a painful migration.

One database vs many. The URL shortener using PostgreSQL + Redis + Cassandra + Elasticsearch is the right architecture — but it means four databases to operate, monitor, back up, and reason about. The operational overhead of a polyglot persistence architecture is real. Start with one database and add others when there's a concrete need, not a speculative one.

When to use which

Use PostgreSQL as your default for anything with relationships, transactions, or unknown query patterns. It scales further than most teams need before sharding becomes necessary.

Add a key-value store (Redis) when you need caching, sessions, or sub-millisecond lookups by a known key.

Reach for a document store (MongoDB) when your schema changes frequently, your data is document-shaped, and you rarely need cross-document joins.

Use a wide-column store (Cassandra) when you have millions of writes per second with well-known time-series access patterns.

Consider a graph database (Neo4j) when relationship traversal is the primary query pattern, not an occasional join.

Don't use NoSQL because it sounds scalable. A poorly designed Cassandra cluster does not outperform a well-tuned PostgreSQL instance on most workloads. The fastest database is the one that matches your access patterns.

The one thing to remember

NoSQL is not a single thing. Document stores, key-value stores, wide-column stores, and graph databases have almost nothing in common except that they aren't SQL. Each optimises for a specific data model and access pattern. Choosing between SQL and NoSQL is the wrong question — choose the data model that fits your data's shape and your most important queries.

← Previous: Data & Storage: Where Everything Lives — Where data lives shapes everything about a system. Nineteen concepts covering databases, indexing, sharding, replicat...

→ Next: Database Indexing: The Highest-Leverage Performance Tool — Indexes are the highest-leverage database performance tool. Learn how they work, what they cost, and how to decide wh...

SQL vs NoSQL: Choosing the Right Database

Systems Design

SQL vs NoSQL: Choosing the Right Database

The problem

The core idea

The analogy: a filing cabinet vs a specialised warehouse

How it works

Relational databases (SQL)

NoSQL: document stores

NoSQL: key-value stores

NoSQL: wide-column stores

NoSQL: graph databases

The decision framework

The tradeoffs

When to use which

The one thing to remember

Comments

Systems Design

Database Indexing: The Highest-Leverage Performance Tool

More from this blog

Architecture Patterns: Wrap-Up

MapReduce: Processing Petabytes in Parallel

Batch vs Stream Processing: How Fresh Do Your Answers Need to Be?

ETL Pipelines: Moving Data from Operations to Analytics

Backend for Frontend: One API Per Client Type

Command Palette

Systems Design

SQL vs NoSQL: Choosing the Right Database

The problem

The core idea

The analogy: a filing cabinet vs a specialised warehouse

How it works

Relational databases (SQL)

NoSQL: document stores

NoSQL: key-value stores

NoSQL: wide-column stores

NoSQL: graph databases

The decision framework

The tradeoffs

When to use which

The one thing to remember

Comments

Systems Design

Database Indexing: The Highest-Leverage Performance Tool

More from this blog