Vector Databases: Semantic Search and AI Memory

Series: System Design · Data & Storage — Pillar 4 of 8
| # | Post | What it covers |
|---|---|---|
| 00 | Data & Storage: Where Everything Lives | Where data lives shapes everything about a system. Nineteen concepts covering databases, indexing, sharding, replication, and the data structures underneath. |
| 01 | SQL vs NoSQL: Choosing the Right Database | SQL vs NoSQL isn't a simple choice. Learn what each type optimises for, when to use relational databases, and when NoSQL is the right call. |
| 02 | Database Indexing: The Highest-Leverage Performance Tool | Indexes are the highest-leverage database performance tool. Learn how they work, what they cost, and how to decide when to add one. |
| 03 | B-Trees & B+ Trees: The Data Structure Behind Database Indexes | Almost every database index is built on a B-tree or B+ tree. Learn how they work, why they're fast, and what this means for your queries. |
| 04 | LSM Trees: Why Some Databases Are Built for Writes | LSM trees power Cassandra, RocksDB, and LevelDB. Learn how they achieve massive write throughput and what they trade off to get it. |
| 05 | Denormalisation: Trading Storage for Speed | Denormalisation trades storage for read speed by pre-computing joins. Learn when it helps, when it hurts, and how to do it safely. |
| 06 | Database Sharding: Scaling Beyond a Single Node | Sharding splits a database across multiple nodes. Learn how it works, the strategies available, and the significant tradeoffs it introduces. |
| 07 | Data Partitioning: Choosing How to Divide Your Data | Range, hash, and list partitioning each make different tradeoffs. Learn how to divide data effectively for queries, maintenance, and scale. |
| 08 | Consistent Hashing: Minimising Resharding Pain | Consistent hashing minimises data movement when nodes are added or removed. Learn how it works and why it's fundamental to distributed systems. |
| 09 | Replication & Read Replicas: Scaling Reads and Surviving Failures | Replication copies data across nodes for fault tolerance and read scaling. Learn how primary-replica setups work and when to use them. |
| 10 | Object Storage: Unlimited Scale for Large Binary Data | Object storage handles large binary files at unlimited scale. Learn how it works, why it replaced file servers, and when to use it. |
| 11 | Block vs File vs Object Storage: Three Models, Three Use Cases | Three storage models, three different use cases. Learn what block, file, and object storage optimise for and how to choose between them. |
| 12 | Distributed File Systems: File Storage Across Many Machines | Distributed file systems spread file storage across many machines. Learn how HDFS, Ceph, and GlusterFS work and when to use them. |
| 13 | Time Series Databases: Built for Metrics and Events | Time series databases handle append-heavy metric data far better than SQL. Learn how they work and when to use InfluxDB, Prometheus, or TimescaleDB. |
| 14 | Vector Databases: Semantic Search and AI Memory ← you are here | Vector databases power semantic search, recommendations, and LLM memory. Learn how embeddings work, what ANN search is, and when to use one. |
| 15 | Full-Text Search Engines: Beyond SQL LIKE | Full-text search needs more than SQL LIKE. Learn how inverted indexes, relevance ranking, and Elasticsearch make text search fast and powerful. |
| 16 | Materialized Views: Pre-Computing Expensive Queries | Materialized views cache expensive query results as physical tables. Learn how they work, when to refresh them, and when to use them vs other approaches. |
| 17 | Query Optimisation: From Slow to Fast | Slow queries aren't always fixed by adding indexes. Learn how to read EXPLAIN output, understand query plans, and systematically make queries fast. |
| 18 | Connection Pooling: Managing the Hidden Bottleneck | Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and how to size your pool correctly. |
| 19 | Data & Storage: Wrap-Up | A recap of all 19 data storage concepts: SQL, NoSQL, indexing, sharding, replication, specialised databases, and how they connect in a real system. |
The problem
Your URL shortener wants to add a recommendation feature: after a user creates a link to a blog post about distributed systems, suggest other links in their account that are topically related — even if they don't share any exact keywords.
A SQL query can find links with WHERE destination_url LIKE '%distributed-systems%' or match on tags = 'distributed-systems'. But what about a link to a Raft algorithm explainer — the same topic, completely different text? A link to a Netflix engineering blog post about consensus protocols? A YouTube video titled "Building CockroachDB"?
The semantic meaning is the same. The literal text is completely different. SQL can't help you here. Full-text search (which we'll cover next) can find keyword overlaps but still can't capture semantic similarity. You need a way to represent meaning numerically and find numerically similar representations.
This is what embeddings and vector databases were built for.
The core idea
An embedding model converts content (text, images, audio) into a high-dimensional numerical vector that represents its semantic meaning. Similar content produces similar vectors — measured by distance in high-dimensional space. A vector database stores these vectors and efficiently answers the query: "what are the N items most similar to this query vector?"
The analogy: a map of concepts
Imagine placing every concept in your dataset as a pin on an enormous map, where pins are placed closer together if the concepts are more related. "Distributed systems" and "consensus algorithms" are near each other. "Machine learning" is in a different cluster but not completely distant from "distributed systems" (there's some overlap in engineering topics). "French cuisine" is in a completely different region of the map.
Finding "things similar to this concept" means finding the pins nearest to it. The map is the vector space; the pins are the embeddings; the nearest-pin lookup is the similarity query. The challenge: this map has 768 or 1536 dimensions instead of 2, which makes "distance" non-intuitive but mathematically well-defined.
How it works
Embeddings
An embedding is a fixed-length numerical vector produced by a model trained to encode semantic meaning. Common embedding models:
- OpenAI text-embedding-3-small: 1536 dimensions
- Sentence Transformers (open source): typically 384–768 dimensions
- CLIP (images + text): 512 dimensions, shared embedding space
from openai import OpenAI

# Client reads OPENAI_API_KEY from the environment
openai_client = OpenAI()

# Convert a link's destination page title to a vector
embedding = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input="Understanding the Raft Consensus Algorithm",
)
# Returns a list of 1536 floats representing the semantic content
vector = embedding.data[0].embedding
# [0.023, -0.147, 0.089, 0.312, -0.056, ...] ← 1536 numbers
Two similar inputs produce similar vectors:
"Understanding Raft consensus algorithm" → [0.023, -0.147, 0.089, ...]
"How Paxos achieves distributed consensus" → [0.019, -0.152, 0.094, ...]
# Cosine similarity: 0.94 — very similar
"Understanding Raft consensus algorithm" → [0.023, -0.147, 0.089, ...]
"Classic French onion soup recipe" → [-0.312, 0.456, -0.201, ...]
# Cosine similarity: 0.03 — very different
Similarity metrics
Cosine similarity: measures the angle between two vectors. Values range from -1 (opposite directions) to 1 (same direction); magnitude is ignored. Commonly used for text embeddings.
Euclidean distance: measures the straight-line distance between two points in the vector space. Used when magnitude matters, not just direction.
Dot product: equals cosine similarity when the vectors are normalised to unit length, and is cheaper to compute because it skips the division by the magnitudes.
The choice of metric depends on how the embedding model was trained and what it optimises for.
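The three metrics above are short enough to write out by hand. This is a minimal sketch in plain Python using toy 3-dimensional vectors (the vectors and the helper names are illustrative assumptions, not part of any library); production systems would use a vectorised numeric library instead.

```python
import math

def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores magnitude."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance between a and b; sensitive to magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalise(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Toy vectors, chosen for illustration only
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]     # same direction as a, twice the length
c = [-1.0, -2.0, -3.0]  # opposite direction to a

print(cosine_similarity(a, b))   # close to 1: same direction
print(cosine_similarity(a, c))   # close to -1: opposite direction
print(euclidean_distance(a, b))  # non-zero: the magnitudes differ
# After normalising, the dot product matches cosine similarity
print(dot(normalise(a), normalise(b)))
```

Note how `a` and `b` are identical under cosine similarity but clearly apart under Euclidean distance: that difference is why the metric has to match how the embedding model was trained.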
The naive approach: brute-force k-NN
Given a query vector and a dataset of N stored vectors, find the K most similar. The brute-force approach:
- Compute the similarity between the query vector and every stored vector
- Sort by similarity, return the top K
At 1M vectors × 1536 dimensions, that is roughly 1.5 billion multiply-add operations per query. Even if well-optimised code brings this down to around 10ms per query, that is barely acceptable for light load. At 100M vectors, each query does 100x more work, which is impractical for interactive use.
Vector databases exist to make this fast at scale.
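For reference, the brute-force approach fits in a few lines. This is a sketch only, assuming an in-memory dict of toy 3-dimensional vectors (the `store` contents and function names are made up for illustration); the point is that every query touches every stored vector.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def brute_force_knn(query, vectors, k):
    """Return the k (id, similarity) pairs most similar to query.

    vectors maps an item id to its vector. Every stored vector is scored,
    so the cost grows linearly with the number of vectors and dimensions.
    """
    scored = (
        (cosine_similarity(query, vec), item_id)
        for item_id, vec in vectors.items()
    )
    top = heapq.nlargest(k, scored)  # keeps only the k best scores
    return [(item_id, score) for score, item_id in top]

# Toy data for illustration only
store = {
    "raft": [0.9, 0.1, 0.0],
    "paxos": [0.8, 0.2, 0.1],
    "soup": [0.0, 0.1, 0.9],
}
results = brute_force_knn([1.0, 0.0, 0.0], store, k=2)
print([item_id for item_id, _ in results])
```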
Approximate Nearest Neighbour (ANN) search
Exact nearest neighbour search scales poorly. ANN algorithms trade a small accuracy reduction for orders-of-magnitude faster search.
HNSW (Hierarchical Navigable Small World graphs): the dominant ANN algorithm in production vector databases.
The structure: a multi-layer graph where:
- Each node is a vector in the dataset
- Layer 0 contains all nodes with dense connections to neighbours
- Higher layers contain fewer nodes with longer-range connections (an "expressway" layer)
Search starts at the top layer (few nodes, fast navigation), descends through layers, and arrives at the right neighbourhood in the dense bottom layer with logarithmic complexity.
Query: "consensus algorithms"
Layer 2 (sparse): start at the fixed entry point, greedily move toward closest match
Layer 1 (medium): continue greedy descent in smaller neighbourhood
Layer 0 (dense): best-first search over a bounded candidate list in the final neighbourhood
HNSW search time grows roughly as O(log n), and well-tuned indexes commonly reach ~98–99% recall (on average, 98–99% of the returned results are true nearest neighbours). Recall is tunable: a larger candidate list raises recall at the cost of latency. The small recall gap is usually acceptable: a recommendation engine that suggests 49 perfect results and 1 slightly imperfect one is fine.
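The layered descent can be sketched on a tiny hand-built graph. This is a deliberately simplified illustration, not a full HNSW implementation: the two layers are written by hand rather than built probabilistically, and each layer keeps a single best candidate where a real index keeps a bounded candidate list. All names here are assumptions made for the example.

```python
import math

def greedy_search(graph, vectors, query, entry):
    """Walk one layer: keep moving to the neighbour closest to the query
    until no neighbour is closer than the current node."""
    current = entry
    while True:
        best = min(
            graph[current],
            key=lambda n: math.dist(vectors[n], query),
            default=current,
        )
        if math.dist(vectors[best], query) >= math.dist(vectors[current], query):
            return current
        current = best

def layered_search(layers, vectors, query, entry):
    """layers[0] is the dense bottom layer; layers[-1] is the sparsest.
    The result of each layer becomes the entry point of the layer below."""
    current = entry
    for graph in reversed(layers):
        current = greedy_search(graph, vectors, query, current)
    return current

# Eight toy points on a line: node i sits at (i, 0)
vectors = {i: (float(i), 0.0) for i in range(8)}
# Layer 0: every node, linked to its immediate neighbours
layer0 = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 7] for i in range(8)}
# Layer 1: a sparse "expressway" over nodes 0, 4 and 7
layer1 = {0: [4], 4: [0, 7], 7: [4]}
layers = [layer0, layer1]

nearest = layered_search(layers, vectors, (5.2, 0.0), entry=0)
print(nearest)
```

The top layer jumps from node 0 to node 4 in one hop; the bottom layer then only has to take a single step to reach the nearest node. That skipping of intermediate nodes is where the speedup comes from.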
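IVF (Inverted File Index): clusters vectors into groups (via k-means). A query first identifies the nearest clusters, then searches only within those clusters. Probing more clusters raises recall but slows the query. IVF indexes are typically faster to build and use less memory than HNSW; the cluster count must be chosen when the index is built.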
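A rough sketch of the IVF idea follows. It assumes the centroids have already been produced by k-means (here they are simply hard-coded), and the names `build_ivf`, `ivf_search`, and `nprobe` are illustrative choices rather than any particular library's API.

```python
import math

def build_ivf(vectors, centroids):
    """Assign every vector to its nearest centroid (the inverted lists)."""
    lists = {i: [] for i in range(len(centroids))}
    for item_id, vec in vectors.items():
        nearest = min(
            range(len(centroids)),
            key=lambda i: math.dist(vec, centroids[i]),
        )
        lists[nearest].append(item_id)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Search only the nprobe clusters whose centroids are closest to query."""
    probe = sorted(
        range(len(centroids)),
        key=lambda i: math.dist(query, centroids[i]),
    )[:nprobe]
    candidates = [item_id for i in probe for item_id in lists[i]]
    candidates.sort(key=lambda item_id: math.dist(query, vectors[item_id]))
    return candidates[:k]

# Toy 2-d data; the centroids stand in for the output of k-means
vectors = {
    "a": (0.0, 0.0), "b": (1.0, 0.0), "c": (4.0, 0.0),
    "d": (10.0, 0.0), "e": (11.0, 0.0),
}
centroids = [(0.5, 0.0), (4.0, 0.0), (10.5, 0.0)]
lists = build_ivf(vectors, centroids)

query = (2.4, 0.0)
narrow = ivf_search(query, vectors, centroids, lists, k=2, nprobe=1)
wide = ivf_search(query, vectors, centroids, lists, k=2, nprobe=2)
print(narrow, wide)
```

With `nprobe=1` the search misses `"b"`, the true nearest neighbour, because it lives in a cluster that was not probed. Raising `nprobe` to 2 recovers it at the cost of scanning more vectors, which is the recall-versus-speed dial in miniature.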
Vector databases vs pgvector
Purpose-built vector databases (Pinecone, Weaviate, Qdrant, Milvus):
- HNSW indexes optimised for ANN search
- Native support for metadata filtering alongside vector similarity
- Managed scaling, sharding, and replication of vector data
- APIs designed for embedding workflows
pgvector (PostgreSQL extension):
- Adds a vector data type and HNSW/IVF indexes to PostgreSQL
- Vector similarity queries alongside standard SQL
- Strong choice for applications already on PostgreSQL that need moderate-scale vector search
- Simpler operationally — one database instead of two
-- pgvector: find 5 links most similar to a query vector
-- $1 is the query embedding, bound as a parameter; <=> is cosine distance
SELECT id, destination_url, 1 - (embedding <=> $1) AS similarity
FROM links
WHERE user_id = 123
ORDER BY embedding <=> $1
LIMIT 5;
The URL shortener's vector database use
When a user creates a link, the platform:
- Fetches the destination page title (or uses the user-supplied name)
- Generates an embedding via the OpenAI API
- Stores the embedding in Pinecone (or pgvector) alongside the link's ID
When showing the link detail page:
- Retrieve the link's stored embedding
- Query the vector store for the 5 most similar embeddings in the user's collection
- Return those links as recommendations
The embedding encodes semantic meaning — similar topics produce similar vectors regardless of exact wording.
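The two flows above can be sketched end to end with an in-memory store. Everything here is a stand-in: `embed` is a hypothetical placeholder that looks up hand-written toy vectors instead of calling an embedding API, and `LinkRecommender` replaces Pinecone or pgvector with a plain dict so the example runs offline.

```python
import math

# Hypothetical stand-in for an embedding API call: hand-written 3-d vectors
TOY_EMBEDDINGS = {
    "Understanding the Raft Consensus Algorithm": [0.9, 0.1, 0.0],
    "How Paxos Achieves Distributed Consensus": [0.8, 0.2, 0.1],
    "Building CockroachDB": [0.7, 0.3, 0.1],
    "Classic French Onion Soup Recipe": [0.0, 0.1, 0.9],
}

def embed(title):
    return TOY_EMBEDDINGS[title]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class LinkRecommender:
    """In-memory stand-in for the vector store."""

    def __init__(self):
        self._rows = {}  # link_id -> (user_id, embedding)

    def add_link(self, link_id, user_id, title):
        # On link creation: embed the title and store it with the link's ID
        self._rows[link_id] = (user_id, embed(title))

    def recommend(self, link_id, k=5):
        # On the detail page: rank the same user's other links by similarity
        user_id, query = self._rows[link_id]
        scored = [
            (cosine_similarity(query, vec), other_id)
            for other_id, (owner, vec) in self._rows.items()
            if owner == user_id and other_id != link_id
        ]
        scored.sort(reverse=True)
        return [other_id for _, other_id in scored[:k]]

recommender = LinkRecommender()
recommender.add_link("raft", 123, "Understanding the Raft Consensus Algorithm")
recommender.add_link("paxos", 123, "How Paxos Achieves Distributed Consensus")
recommender.add_link("crdb", 123, "Building CockroachDB")
recommender.add_link("soup", 123, "Classic French Onion Soup Recipe")
recommender.add_link("other-paxos", 456, "How Paxos Achieves Distributed Consensus")

print(recommender.recommend("raft", k=2))
```

The link owned by user 456 never appears in user 123's recommendations, mirroring the `WHERE user_id = 123` filter in the pgvector query.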
The tradeoffs
Approximate, not exact. ANN search trades recall for speed. For most recommendation and search use cases, 98% recall is indistinguishable from 100% to users. For use cases requiring exact matches, use exact k-NN (slower) or a hybrid approach.
Embedding quality determines search quality. The vector database is only as good as the embedding model. A weak embedding model that doesn't capture semantic meaning well produces poor recommendations no matter how fast the ANN search is. Embedding model selection is a product decision, not just an infrastructure one.
High-dimensional data is expensive to store. 1M vectors × 1536 dimensions × 4 bytes (float32) ≈ 6GB just for the vectors, before indexing overhead. HNSW indexes can add a further 1.5–4x storage overhead, depending on graph parameters. At 100M vectors, that is roughly 600GB of raw vectors, and storage becomes a meaningful cost.
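The storage arithmetic is easy to reproduce for your own numbers. A small sketch, assuming 4-byte float32 values and counting raw vectors only (index overhead is excluded because it varies by index type and parameters):

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Bytes needed for the raw vectors alone, with no index overhead."""
    return num_vectors * dimensions * bytes_per_value

GB = 10**9  # decimal gigabytes

one_million = raw_vector_bytes(1_000_000, 1536)
hundred_million = raw_vector_bytes(100_000_000, 1536)

print(one_million / GB)      # raw size in GB at 1M vectors
print(hundred_million / GB)  # raw size in GB at 100M vectors
```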
Metadata filtering + vector search is complex. "Find the 10 links most similar to this query, but only from links created in the last 30 days" requires either filtering before the vector search (may miss good results if the time filter is aggressive) or filtering after (may require fetching many more than 10 results to get 10 that pass the filter). Vector databases handle this differently. pgvector expresses the filter as an ordinary SQL WHERE clause, but with an approximate index the filter is applied after the index scan, so an aggressive filter can still return fewer rows than requested.
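The shortfall that post-filtering can cause is easiest to see with a toy example. In this sketch an exact sort stands in for the ANN search, so it only illustrates how many results survive the filter, not ANN recall; the item data and function names are invented for the example.

```python
def top_by_similarity(items, n):
    """Stand-in for the vector search: the n most similar items."""
    return sorted(items, key=lambda it: it["similarity"], reverse=True)[:n]

def pre_filter_search(items, predicate, k):
    """Filter first, then rank only the items that pass."""
    return top_by_similarity([it for it in items if predicate(it)], k)

def post_filter_search(items, predicate, k, fetch):
    """Fetch the top `fetch` by similarity, then drop items that fail the filter."""
    return [it for it in top_by_similarity(items, fetch) if predicate(it)][:k]

# Toy items with precomputed similarity to some query
items = [
    {"id": "a", "similarity": 0.95, "age_days": 90},
    {"id": "b", "similarity": 0.93, "age_days": 5},
    {"id": "c", "similarity": 0.90, "age_days": 200},
    {"id": "d", "similarity": 0.85, "age_days": 12},
    {"id": "e", "similarity": 0.60, "age_days": 3},
]

def is_recent(item):
    return item["age_days"] <= 30

def ids(results):
    return [it["id"] for it in results]

print(ids(pre_filter_search(items, is_recent, k=2)))
print(ids(post_filter_search(items, is_recent, k=2, fetch=2)))  # comes up short
print(ids(post_filter_search(items, is_recent, k=2, fetch=4)))  # over-fetching fixes it
```

Fetching exactly `k` candidates leaves only one result after the filter; over-fetching recovers the second, at the cost of scoring more candidates per query.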
When to use a vector database
Use vector search when:
- Semantic similarity is the query (recommendation, semantic search, deduplication of similar content)
- You're building LLM-powered features (RAG — Retrieval Augmented Generation — uses vector search to find relevant context to feed to the LLM)
- Finding "things like this thing" is a core user feature
- Keyword matching fails because similar content uses different vocabulary
Start with pgvector when:
- You're already on PostgreSQL
- Scale is moderate (under ~10M vectors)
- Operational simplicity matters more than maximum ANN performance
Use a purpose-built vector database when:
- Tens of millions to billions of vectors
- Sub-10ms ANN search latency required
- Advanced filtering and hybrid search (vector + metadata) at scale
The one thing to remember
A vector database stores numerical representations of meaning (embeddings) and answers "what is most similar to this?" using approximate nearest neighbour search. The foundation — that similar things produce nearby vectors — is what enables semantic search, recommendations, and LLM memory. The ANN index (usually HNSW) makes this fast by accepting a small accuracy tradeoff to avoid computing similarity against every stored vector for every query.




