Vector Databases: Semantic Search and AI Memory


Series: System Design · Data & Storage — Pillar 4 of 8


# Post What it covers
00 Data & Storage: Where Everything Lives Where data lives shapes everything about a system. Nineteen concepts covering databases, indexing, sharding, replication, and the data structures underneath.
01 SQL vs NoSQL: Choosing the Right Database SQL vs NoSQL isn't a simple choice. Learn what each type optimises for, when to use relational databases, and when NoSQL is the right call.
02 Database Indexing: The Highest-Leverage Performance Tool Indexes are the highest-leverage database performance tool. Learn how they work, what they cost, and how to decide when to add one.
03 B-Trees & B+ Trees: The Data Structure Behind Database Indexes Almost every database index is built on a B-tree or B+ tree. Learn how they work, why they're fast, and what this means for your queries.
04 LSM Trees: Why Some Databases Are Built for Writes LSM trees power Cassandra, RocksDB, and LevelDB. Learn how they achieve massive write throughput and what they trade off to get it.
05 Denormalisation: Trading Storage for Speed Denormalisation trades storage for read speed by pre-computing joins. Learn when it helps, when it hurts, and how to do it safely.
06 Database Sharding: Scaling Beyond a Single Node Sharding splits a database across multiple nodes. Learn how it works, the strategies available, and the significant tradeoffs it introduces.
07 Data Partitioning: Choosing How to Divide Your Data Range, hash, and list partitioning each make different tradeoffs. Learn how to divide data effectively for queries, maintenance, and scale.
08 Consistent Hashing: Minimising Resharding Pain Consistent hashing minimises data movement when nodes are added or removed. Learn how it works and why it's fundamental to distributed systems.
09 Replication & Read Replicas: Scaling Reads and Surviving Failures Replication copies data across nodes for fault tolerance and read scaling. Learn how primary-replica setups work and when to use them.
10 Object Storage: Unlimited Scale for Large Binary Data Object storage handles large binary files at unlimited scale. Learn how it works, why it replaced file servers, and when to use it.
11 Block vs File vs Object Storage: Three Models, Three Use Cases Three storage models, three different use cases. Learn what block, file, and object storage optimise for and how to choose between them.
12 Distributed File Systems: File Storage Across Many Machines Distributed file systems spread file storage across many machines. Learn how HDFS, Ceph, and GlusterFS work and when to use them.
13 Time Series Databases: Built for Metrics and Events Time series databases handle append-heavy metric data far better than SQL. Learn how they work and when to use InfluxDB, Prometheus, or TimescaleDB.
14 Vector Databases: Semantic Search and AI Memory ← you are here Vector databases power semantic search, recommendations, and LLM memory. Learn how embeddings work, what ANN search is, and when to use one.
15 Full-Text Search Engines: Beyond SQL LIKE Full-text search needs more than SQL LIKE. Learn how inverted indexes, relevance ranking, and Elasticsearch make text search fast and powerful.
16 Materialized Views: Pre-Computing Expensive Queries Materialized views cache expensive query results as physical tables. Learn how they work, when to refresh them, and when to use them vs other approaches.
17 Query Optimisation: From Slow to Fast Slow queries aren't always fixed by adding indexes. Learn how to read EXPLAIN output, understand query plans, and systematically make queries fast.
18 Connection Pooling: Managing the Hidden Bottleneck Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and how to size your pool correctly.
19 Data & Storage: Wrap-Up A recap of all 19 data storage concepts: SQL, NoSQL, indexing, sharding, replication, specialised databases, and how they connect in a real system.


The problem

Your URL shortener wants to add a recommendation feature: after a user creates a link to a blog post about distributed systems, suggest other links in their account that are topically related — even if they don't share any exact keywords.

A SQL query can find links with WHERE destination_url LIKE '%distributed-systems%' or match on tags = 'distributed-systems'. But what about a link to a Raft algorithm explainer — the same topic, completely different text? A link to a Netflix engineering blog post about consensus protocols? A YouTube video titled "Building CockroachDB"?

The semantic meaning is the same. The literal text is completely different. SQL can't help you here. Full-text search (which we'll cover next) can find keyword overlaps but still can't capture semantic similarity. You need a way to represent meaning numerically and find numerically similar representations.

This is what embeddings and vector databases were built for.


The core idea

An embedding model converts content (text, images, audio) into a high-dimensional numerical vector that represents its semantic meaning. Similar content produces similar vectors — measured by distance in high-dimensional space. A vector database stores these vectors and efficiently answers the query: "what are the N items most similar to this query vector?"


The analogy: a map of concepts

Imagine placing every concept in your dataset as a pin on an enormous map, where pins are placed closer together if the concepts are more related. "Distributed systems" and "consensus algorithms" are near each other. "Machine learning" is in a different cluster but not completely distant from "distributed systems" (there's some overlap in engineering topics). "French cuisine" is in a completely different region of the map.

Finding "things similar to this concept" means finding the pins nearest to it. The map is the vector space; the pins are the embeddings; the nearest-pin lookup is the similarity query. The challenge: this map has 768 or 1536 dimensions instead of 2, which makes "distance" non-intuitive but mathematically well-defined.


How it works

Embeddings

An embedding is a fixed-length numerical vector produced by a model trained to encode semantic meaning. Common embedding models:

  • OpenAI text-embedding-3-small: 1536 dimensions
  • Sentence Transformers (open source): typically 384–768 dimensions
  • CLIP (images + text): 512 dimensions, shared embedding space

# Convert a link's destination page title to a vector
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input="Understanding the Raft Consensus Algorithm"
)
# Returns a list of 1536 floats representing the semantic content
vector = response.data[0].embedding
# [0.023, -0.147, 0.089, 0.312, -0.056, ...]  ← 1536 numbers

Two similar inputs produce similar vectors (the numbers below are illustrative, not real model output):

"Understanding Raft consensus algorithm"  → [0.023, -0.147, 0.089, ...]
"How Paxos achieves distributed consensus" → [0.019, -0.152, 0.094, ...]
# Cosine similarity: 0.94 (very similar)

"Understanding Raft consensus algorithm" → [0.023, -0.147, 0.089, ...]
"Classic French onion soup recipe"        → [-0.312, 0.456, -0.201, ...]
# Cosine similarity: 0.03 (very different)

Similarity metrics

Cosine similarity: measures the angle between two vectors. Values range from -1 (opposite directions) to 1 (same direction), regardless of magnitude. Commonly used for text embeddings.

Euclidean distance: measures the straight-line distance between two points in the vector space. Used when magnitude matters, not just direction.

Dot product: equals cosine similarity when vectors are normalised to unit length, and is cheaper to compute because it skips the division by the vector norms.

The choice of metric depends on how the embedding model was trained and what it optimises for.
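
The three metrics above are short enough to write out by hand. The following is a minimal sketch in plain Python, using toy 3-dimensional vectors rather than real embeddings; the function names are chosen for this example and do not come from any particular library.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores magnitude."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance between a and b; sensitive to magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalise(a):
    """Scale a to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]      # same direction as a, twice the magnitude
c = [-1.0, -2.0, -3.0]   # opposite direction to a

cos_ab = cosine_similarity(a, b)             # close to 1: same direction
cos_ac = cosine_similarity(a, c)             # close to -1: opposite direction
dist_ab = euclidean_distance(a, b)           # non-zero: magnitudes differ
dot_unit = dot(normalise(a), normalise(b))   # matches cos_ab after normalising
```

Note that `a` and `b` are a perfect match by cosine similarity but not by Euclidean distance, which is the practical difference between a metric that ignores magnitude and one that does not.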

The naive approach: brute-force k-NN

Given a query vector and a dataset of N stored vectors, find the K most similar. The brute-force approach:

  1. Compute the similarity between the query vector and every stored vector
  2. Sort by similarity, return the top K

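At 1M vectors × 1536 dimensions, that is roughly 1.5 billion multiply-add operations per query. Even if well-optimised hardware gets through that in around 10ms, it is barely acceptable for light load. Cost grows linearly with dataset size, so at 100M vectors each query is 100x slower and completely impractical.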
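
The two brute-force steps can be sketched as follows. This is an illustrative version in plain Python with made-up 3-dimensional vectors and hypothetical link IDs; a real implementation would use a vectorised library, but the linear scan over every stored vector is the same.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_knn(query, vectors, k):
    """Return the k (id, similarity) pairs most similar to query.

    vectors maps an item id to its embedding. Every stored vector is scored,
    so the cost grows linearly with dataset size and with dimensionality.
    """
    scored = ((item_id, cosine_similarity(query, vec)) for item_id, vec in vectors.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy vectors (hypothetical values, not real embeddings)
vectors = {
    "raft-explainer": [0.9, 0.1, 0.0],
    "paxos-paper": [0.8, 0.2, 0.1],
    "onion-soup": [0.0, 0.1, 0.9],
}
top = brute_force_knn([1.0, 0.0, 0.0], vectors, k=2)
top_ids = [item_id for item_id, _ in top]
```

`heapq.nlargest` avoids sorting the full list, but it does not avoid the expensive part: every vector still has to be scored once per query.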

Vector databases exist to make this fast at scale.

Exact nearest neighbour search scales poorly. Approximate nearest neighbour (ANN) algorithms trade a small accuracy reduction for orders-of-magnitude faster search.

HNSW (Hierarchical Navigable Small World graphs): the dominant ANN algorithm in production vector databases.

The structure: a multi-layer graph where:

  • Each node is a vector in the dataset
  • Layer 0 contains all nodes with dense connections to neighbours
  • Higher layers contain fewer nodes with longer-range connections (an "expressway" layer)

Search starts at the top layer (few nodes, fast navigation), descends through layers, and arrives at the right neighbourhood in the dense bottom layer with logarithmic complexity.

Query: "consensus algorithms"

Layer 2 (sparse): start at the index's entry point, greedily move toward closest match
Layer 1 (medium): continue the greedy search from the node found in the layer above
Layer 0 (dense):  run a best-first search that keeps a bounded list of the closest candidates
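
The layer-by-layer descent can be sketched in code. This is a simplified illustration of the search phase only, under loud assumptions: the three layers are hand-built here (a real HNSW index assigns nodes to layers randomly while inserting), the points are 1-dimensional for readability, and `ef` is the name this sketch uses for the size of the candidate list kept at the bottom layer.

```python
import heapq
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_search(query, start, graph, points):
    """Upper layers: move to the closest neighbour until none is closer."""
    current = start
    while True:
        closest = min(graph[current], key=lambda n: distance(points[n], query), default=current)
        if distance(points[closest], query) >= distance(points[current], query):
            return current
        current = closest

def search_layer0(query, start, graph, points, ef, k):
    """Bottom layer: best-first search keeping the ef closest nodes seen."""
    visited = {start}
    d0 = distance(points[start], query)
    candidates = [(d0, start)]   # min-heap: closest unexplored node first
    best = [(-d0, start)]        # max-heap: furthest of the kept nodes on top
    while candidates:
        dist, node = heapq.heappop(candidates)
        if dist > -best[0][0]:
            break                # no remaining candidate can improve the result
        for neighbour in graph[node]:
            if neighbour in visited:
                continue
            visited.add(neighbour)
            d = distance(points[neighbour], query)
            if len(best) < ef or d < -best[0][0]:
                heapq.heappush(candidates, (d, neighbour))
                heapq.heappush(best, (-d, neighbour))
                if len(best) > ef:
                    heapq.heappop(best)   # drop the furthest kept node
    ordered = sorted((-neg_d, n) for neg_d, n in best)
    return [n for _, n in ordered][:k]

def hnsw_search(query, layers, points, entry, k, ef=4):
    """layers[0] is the dense bottom graph; layers[-1] is the sparsest."""
    current = entry
    for graph in reversed(layers[1:]):
        current = greedy_search(query, current, graph, points)
    return search_layer0(query, current, layers[0], points, ef, k)

# Ten points on a line; each layer links progressively fewer of them.
points = {i: [float(i)] for i in range(10)}
layer0 = {i: [j for j in (i - 1, i + 1) if j in points] for i in points}
layer1 = {0: [3], 3: [0, 6], 6: [3, 9], 9: [6]}
layer2 = {0: [9], 9: [0]}
nearest = hnsw_search([7.2], [layer0, layer1, layer2], points, entry=0, k=3)
```

In this toy run the top layer jumps from node 0 to node 9, the middle layer steps back to node 6, and the bottom layer explores only the immediate neighbourhood to return nodes 7, 8, and 6, without visiting most of the dataset.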

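HNSW search time grows roughly logarithmically with dataset size, and well-tuned indexes commonly reach ~98–99% recall (of the true K nearest neighbours, 98–99% appear in the returned results). Recall is not fixed: it depends on build and search parameters, and higher recall costs more latency. The small recall gap is usually acceptable: a recommendation engine that suggests 49 perfect results and 1 slightly imperfect one is fine.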
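
Recall is straightforward to measure if you can afford an exact search on a sample of queries. A minimal sketch, with hypothetical result IDs standing in for real query output:

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbours that the ANN index returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)

exact_top5 = ["a", "b", "c", "d", "e"]   # from a brute-force search
ann_top5 = ["a", "b", "c", "e", "x"]     # from the approximate index
recall = recall_at_k(ann_top5, exact_top5)   # 4 of the 5 true neighbours found
```

Averaging this over a few hundred representative queries gives a recall figure you can track while tuning index parameters.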

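IVF (Inverted File Index): clusters vectors into groups (via k-means). A query first identifies the nearest clusters, then searches only within those clusters. It is typically faster to build and lighter on memory than HNSW, but the number of clusters must be chosen when the index is built, and results degrade if the true neighbours sit in clusters the query never probes.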
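
The cluster-then-probe idea can be sketched as follows. This example assumes the centroids have already been computed (the k-means training step is skipped), uses toy 2-dimensional vectors, and names the "how many clusters to scan" parameter `nprobe` for illustration.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign every vector to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for item_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda c: distance(vec, centroids[c]))
        lists[nearest].append(item_id)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe=1):
    """Scan only the nprobe clusters whose centroids are closest to query."""
    by_closeness = sorted(range(len(centroids)), key=lambda c: distance(query, centroids[c]))
    candidates = [item_id for c in by_closeness[:nprobe] for item_id in lists[c]]
    return sorted(candidates, key=lambda i: distance(query, vectors[i]))[:k]

# Centroids are given here; in practice they come from k-means training.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {
    "a": [0.5, 0.2], "b": [1.0, 1.5], "c": [4.0, 4.0],
    "d": [9.0, 9.5], "e": [10.5, 10.0],
}
lists = build_ivf(vectors, centroids)
hits = ivf_search([6.0, 6.0], vectors, centroids, lists, k=2, nprobe=1)
hits_wide = ivf_search([6.0, 6.0], vectors, centroids, lists, k=2, nprobe=2)
```

With `nprobe=1` the query scans only the second cluster and misses `c`, which is the true nearest neighbour but was assigned to the first cluster. Probing both clusters finds it, at the cost of scanning every vector. That is the recall-versus-speed dial in IVF.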

Vector databases vs pgvector

Purpose-built vector databases (Pinecone, Weaviate, Qdrant, Milvus):

  • HNSW indexes optimised for ANN search
  • Native support for metadata filtering alongside vector similarity
  • Managed scaling, sharding, and replication of vector data
  • APIs designed for embedding workflows

pgvector (PostgreSQL extension):

  • Adds a vector data type and HNSW/IVFFlat indexes to PostgreSQL
  • Vector similarity queries alongside standard SQL
  • Strong choice for applications already on PostgreSQL that need moderate-scale vector search
  • Simpler operationally — one database instead of two

-- pgvector: find the 5 links most similar to a query vector
-- $1 is the query embedding, passed as a bound parameter; <=> is cosine distance
SELECT id, destination_url, 1 - (embedding <=> $1::vector) AS similarity
FROM links
WHERE user_id = 123
ORDER BY embedding <=> $1::vector
LIMIT 5;
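
From application code, the query embedding has to reach PostgreSQL as a bound parameter. One way to do that, sketched below, is to send pgvector's text form (`[0.1,0.2,...]`) and cast it to `vector` in SQL. The helper names are hypothetical, and the `%s` placeholders assume a psycopg-style driver, where the statement would be run with something like `cursor.execute(SIMILAR_LINKS_SQL, params)`.

```python
def to_pgvector_literal(embedding):
    """Format a list of floats as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

# Same query as above, with driver placeholders instead of literals.
SIMILAR_LINKS_SQL = """
SELECT id, destination_url, 1 - (embedding <=> %s::vector) AS similarity
FROM links
WHERE user_id = %s
ORDER BY embedding <=> %s::vector
LIMIT %s
"""

def similar_links_params(embedding, user_id, limit=5):
    """Build the parameter tuple for SIMILAR_LINKS_SQL in placeholder order."""
    literal = to_pgvector_literal(embedding)
    return (literal, user_id, literal, limit)

params = similar_links_params([0.25, -0.5, 1.0], user_id=123)
```

Binding the vector as a parameter rather than formatting it into the SQL string keeps the query safe from injection and lets the driver reuse the statement.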

The URL shortener's vector database use

When a user creates a link, the platform:

  1. Fetches the destination page title (or uses the user-supplied name)
  2. Generates an embedding via the OpenAI API
  3. Stores the embedding in Pinecone (or pgvector) alongside the link's ID

When showing the link detail page:

  1. Retrieve the link's stored embedding
  2. Query the vector store for the 5 most similar embeddings in the user's collection
  3. Return those links as recommendations

The embedding encodes semantic meaning — similar topics produce similar vectors regardless of exact wording.
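
The write path and read path above can be sketched end to end. Everything here is a stand-in: `LinkRecommender` is a hypothetical in-memory class rather than a Pinecone or pgvector client, and `toy_embed` is a placeholder that counts topic words so the example runs without an API key. A real system would call an embedding model in its place.

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class LinkRecommender:
    """In-memory stand-in for the vector store, partitioned by user."""

    def __init__(self, embed):
        self.embed = embed   # callable: str -> list[float]
        self.store = {}      # user_id -> {link_id: embedding}

    def add_link(self, user_id, link_id, title):
        """Write path: embed the title and store it under the link's ID."""
        self.store.setdefault(user_id, {})[link_id] = self.embed(title)

    def recommend(self, user_id, link_id, n=5):
        """Read path: rank the user's other links by similarity to this one."""
        links = self.store[user_id]
        query = links[link_id]
        scored = [(other, _cosine(query, vec)) for other, vec in links.items() if other != link_id]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return [other for other, _ in scored[:n]]

# Placeholder embedding: one dimension per hand-picked topic (NOT a real model).
TOPICS = [
    {"raft", "paxos", "consensus", "distributed"},
    {"soup", "recipe", "onion", "french"},
]

def toy_embed(title):
    words = title.lower().split()
    return [float(sum(w in topic for w in words)) for topic in TOPICS]

recommender = LinkRecommender(toy_embed)
recommender.add_link("u1", "raft", "Understanding the Raft Consensus Algorithm")
recommender.add_link("u1", "paxos", "How Paxos achieves distributed consensus")
recommender.add_link("u1", "soup", "Classic French onion soup recipe")
recs = recommender.recommend("u1", "raft", n=2)
```

Passing the embedding function in as a parameter keeps the storage and ranking logic independent of the model, so swapping models later only means re-embedding the stored links.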


The tradeoffs

Approximate, not exact. ANN search trades recall for speed. For most recommendation and search use cases, 98% recall is indistinguishable from 100% to users. For use cases requiring exact matches, use exact k-NN (slower) or a hybrid approach.

Embedding quality determines search quality. The vector database is only as good as the embedding model. A weak embedding model that doesn't capture semantic meaning well produces poor recommendations no matter how fast the ANN search is. Embedding model selection is a product decision, not just an infrastructure one.

High-dimensional data is expensive to store. 1M vectors × 1536 dimensions × 4 bytes (float32) = ~6GB just for the vectors, before indexing overhead. HNSW indexes can add roughly 1.5–4x storage overhead on top, depending on index parameters. At 100M vectors the raw vectors alone are ~600GB, so vector storage is a meaningful cost.
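
The storage arithmetic is worth keeping as a small helper when sizing a deployment. A sketch, assuming 4-byte float32 components and counting only the raw vectors (no index, metadata, or replication):

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Storage for the vectors alone, assuming float32 (4-byte) components."""
    return num_vectors * dimensions * bytes_per_value

one_million = raw_vector_bytes(1_000_000, 1536)
hundred_million = raw_vector_bytes(100_000_000, 1536)

gb_1m = one_million / 1e9          # about 6.1 GB
gb_100m = hundred_million / 1e9    # about 614 GB
```

Halving the dimension count or storing 2-byte components halves the result, which is why lower-dimensional models and quantisation are common cost levers.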

Metadata filtering + vector search is complex. "Find the 10 links most similar to this query, but only from links created in the last 30 days" requires either filtering before the vector search (may miss good results if the time filter is aggressive) or filtering after (may require fetching many more than 10 results to get 10 that pass the filter). Vector databases handle this differently. pgvector expresses the filter as an ordinary SQL WHERE clause, but with an approximate index the filter is applied after the index scan, so the same trade-off applies there too.
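
The difference between the two strategies shows up even on four items. The sketch below uses exact ranking in place of an ANN index to keep it short; the item fields, the `overfetch` multiplier, and the 30-day predicate are all illustrative assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def pre_filter_search(query, items, predicate, k):
    """Filter first, then rank only the items that passed the filter."""
    allowed = [item for item in items if predicate(item)]
    allowed.sort(key=lambda item: cosine(query, item["vector"]), reverse=True)
    return [item["id"] for item in allowed[:k]]

def post_filter_search(query, items, predicate, k, overfetch=1):
    """Rank everything, take k * overfetch candidates, then apply the filter."""
    ranked = sorted(items, key=lambda item: cosine(query, item["vector"]), reverse=True)
    candidates = ranked[: k * overfetch]
    return [item["id"] for item in candidates if predicate(item)][:k]

items = [
    {"id": "a", "vector": [1.0, 0.0], "age_days": 90},
    {"id": "b", "vector": [0.9, 0.1], "age_days": 45},
    {"id": "c", "vector": [0.7, 0.3], "age_days": 10},
    {"id": "d", "vector": [0.2, 0.8], "age_days": 5},
]

def is_recent(item):
    return item["age_days"] <= 30

query = [1.0, 0.0]
pre = pre_filter_search(query, items, is_recent, k=2)
post_narrow = post_filter_search(query, items, is_recent, k=2, overfetch=1)
post_wide = post_filter_search(query, items, is_recent, k=2, overfetch=2)
```

Here the two most similar items are both too old, so post-filtering without over-fetching returns nothing at all, while doubling the candidate pool recovers the same answer as pre-filtering. How much to over-fetch depends on how selective the filter is, which is rarely known in advance.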


When to use a vector database

Use vector search when:

  • Semantic similarity is the query (recommendation, semantic search, deduplication of similar content)
  • You're building LLM-powered features (RAG — Retrieval Augmented Generation — uses vector search to find relevant context to feed to the LLM)
  • Finding "things like this thing" is a core user feature
  • Keyword matching fails because similar content uses different vocabulary

Start with pgvector when:

  • You're already on PostgreSQL
  • Scale is moderate (under ~10M vectors)
  • Operational simplicity matters more than maximum ANN performance

Use a purpose-built vector database when:

  • Tens of millions to billions of vectors
  • Sub-10ms ANN search latency required
  • Advanced filtering and hybrid search (vector + metadata) at scale

The one thing to remember

A vector database stores numerical representations of meaning (embeddings) and answers "what is most similar to this?" using approximate nearest neighbour search. The foundation — that similar things produce nearby vectors — is what enables semantic search, recommendations, and LLM memory. The ANN index (usually HNSW) makes this fast by accepting a small accuracy tradeoff to avoid computing similarity against every stored vector for every query.

