Full-Text Search Engines: Beyond SQL LIKE

Series: System Design · Data & Storage — Pillar 4 of 8
Systems Design
| # | Post | What it covers |
|---|---|---|
| 00 | Data & Storage: Where Everything Lives | Where data lives shapes everything about a system. Nineteen concepts covering databases, indexing, sharding, replication, and the data structures underneath. (161 chars) |
| 01 | SQL vs NoSQL: Choosing the Right Database | SQL vs NoSQL isn't a simple choice. Learn what each type optimises for, when to use relational databases, and when NoSQL is the right call. |
| 02 | Database Indexing: The Highest-Leverage Performance Tool | Indexes are the highest-leverage database performance tool. Learn how they work, what they cost, and how to decide when to add one. |
| 03 | B-Trees & B+ Trees: The Data Structure Behind Database Indexes | Almost every database index is built on a B-tree or B+ tree. Learn how they work, why they're fast, and what this means for your queries. |
| 04 | LSM Trees: Why Some Databases Are Built for Writes | LSM trees power Cassandra, RocksDB, and LevelDB. Learn how they achieve massive write throughput and what they trade off to get it. |
| 05 | Denormalisation: Trading Storage for Speed | Denormalisation trades storage for read speed by pre-computing joins. Learn when it helps, when it hurts, and how to do it safely. |
| 06 | Database Sharding: Scaling Beyond a Single Node | Sharding splits a database across multiple nodes. Learn how it works, the strategies available, and the significant tradeoffs it introduces. |
| 07 | Data Partitioning: Choosing How to Divide Your Data | Range, hash, and list partitioning each make different tradeoffs. Learn how to divide data effectively for queries, maintenance, and scale. |
| 08 | Consistent Hashing: Minimising Resharding Pain | Consistent hashing minimises data movement when nodes are added or removed. Learn how it works and why it's fundamental to distributed systems. |
| 09 | Replication & Read Replicas: Scaling Reads and Surviving Failures | Replication copies data across nodes for fault tolerance and read scaling. Learn how primary-replica setups work and when to use them. |
| 10 | Object Storage: Unlimited Scale for Large Binary Data | Object storage handles large binary files at unlimited scale. Learn how it works, why it replaced file servers, and when to use it. |
| 11 | Block vs File vs Object Storage: Three Models, Three Use Cases | Three storage models, three different use cases. Learn what block, file, and object storage optimise for and how to choose between them. |
| 12 | Distributed File Systems: File Storage Across Many Machines | Distributed file systems spread file storage across many machines. Learn how HDFS, Ceph, and GlusterFS work and when to use them. |
| 13 | Time Series Databases: Built for Metrics and Events | Time series databases handle append-heavy metric data far better than SQL. Learn how they work and when to use InfluxDB, Prometheus, or TimescaleDB. |
| 14 | Vector Databases: Semantic Search and AI Memory | Vector databases power semantic search, recommendations, and LLM memory. Learn how embeddings work, what ANN search is, and when to use one. |
| 15 | Full-Text Search Engines: Beyond SQL LIKE ← you are here | Full-text search needs more than SQL LIKE. Learn how inverted indexes, relevance ranking, and Elasticsearch make text search fast and powerful. |
| 16 | Materialized Views: Pre-Computing Expensive Queries | Materialized views cache expensive query results as physical tables. Learn how they work, when to refresh them, and when to use them vs other approaches. |
| 17 | Query Optimisation: From Slow to Fast | Slow queries aren't always fixed by adding indexes. Learn how to read EXPLAIN output, understand query plans, and systematically make queries fast. |
| 18 | Connection Pooling: Managing the Hidden Bottleneck | Opening a database connection per request doesn't scale. Learn how connection pooling works, what PgBouncer does, and how to size your pool correctly. |
| 19 | Data & Storage: Wrap-Up | A recap of all 19 data storage concepts: SQL, NoSQL, indexing, sharding, replication, specialised databases, and how they connect in a real system. |
Full-Text Search Engines: Beyond SQL LIKE
The problem
Your URL shortener lets users search their link library. The initial implementation:
SELECT * FROM links
WHERE destination_url ILIKE '%machine learning%'
OR link_name ILIKE '%machine learning%'
ORDER BY created_at DESC;
This works at 10,000 links. At 10 million links, it scans every row — the ILIKE with a leading wildcard can't use a B+ tree index. It's slow.
More importantly, it's not what users want. "machine learning" should find links titled "ML tutorial", "deep learning intro", and "neural network basics" — not just links containing the exact phrase. Results should be ranked by relevance, not just creation date. Typos should be tolerated. "machin lernig" should still find the right links.
PostgreSQL's full-text search (tsvector, tsquery) can handle some of this, but lacks relevance tuning, faceting, autocomplete, and the operability of a dedicated search platform.
Users want search. SQL LIKE is not search.
The core idea
A full-text search engine builds an inverted index over your documents — a data structure that maps each term (word) to the list of documents containing it. Given a query, the engine finds documents containing the query terms, scores them by relevance, and returns them ranked by score. This is fast, relevance-aware, and supports features (fuzzy matching, stemming, facets, autocomplete) that SQL simply doesn't provide.
The analogy: the back-of-book index
A book's back index is an inverted index. Instead of "which words are on page 42?", it answers "which pages contain the word 'consensus'?" — pages 12, 47, 89, 203, 311.
Building the index requires reading every page and noting which terms appear where. Querying the index is instant — you look up the term, get the page list. Searching for "consensus AND quorum" means finding the intersection of the two page lists.
A full-text search engine does this at scale: millions of documents, billions of terms, with ranking to determine which documents are most relevant to a given query.
How it works
The inverted index
For each document (a link record, a product listing, an article), the search engine:
- Tokenises the text: splits into individual terms
- Normalises terms: lowercase, remove punctuation
- Applies linguistic analysis: stemming (running → run, machines → machine), stop-word removal (the, a, is are excluded), synonyms
- Builds the posting list: for each term, a sorted list of document IDs containing it
Document 1: "Introduction to the Raft Consensus Algorithm"
Document 2: "Raft vs Paxos: Understanding Distributed Consensus"
Document 3: "Machine Learning Fundamentals"
Inverted index:
"raft" → [1, 2]
"consensus" → [1, 2]
"algorithm" → [1]
"paxos" → [2]
"distributed" → [2]
"machine" → [3]
"learning" → [3]
Query "raft consensus":
- Fetch posting list for "raft": [1, 2]
- Fetch posting list for "consensus": [1, 2]
- Intersection: [1, 2] — both documents match
- Rank by relevance score
Relevance scoring: TF-IDF and BM25
A search engine doesn't just return matching documents — it ranks them by relevance. The industry standard is BM25 (Best Match 25), an evolution of TF-IDF.
TF-IDF (Term Frequency × Inverse Document Frequency):
- TF (term frequency): a term that appears more often in a document is more relevant to that document
- IDF (inverse document frequency): a term that appears in many documents is less meaningful (common words carry less signal)
- Score = TF × IDF — high score means the term is frequent in this document but rare across the corpus
BM25 adds saturation (very high term frequency gives diminishing returns) and document length normalisation (a short document that mentions "consensus" three times is more relevant than a long document that mentions it three times).
Query: "distributed consensus"
Document 1 (short article, "consensus" appears 8 times): BM25 score = 4.2
Document 2 (long textbook, "consensus" appears 30 times): BM25 score = 3.1
Document 1 ranks higher — more dense signal, shorter document
Features SQL LIKE can't provide
Stemming: "running" matches documents containing "run", "runs", "runner". The search engine stores the stem; both the query and the document terms are normalised to their stems before indexing and query processing.
Fuzzy matching: "elsticsearch" (typo) matches "elasticsearch". Elasticsearch uses edit-distance-based fuzzy matching to find terms within N character edits of the query term.
Phrase queries: "machine learning" as a phrase only matches documents where those words appear adjacent, not just anywhere in the document.
Boosting: configure certain fields to matter more. A query term appearing in the document title counts more than the same term in the document body.
Faceting and aggregations: "show me the count of links by category alongside search results." Elasticsearch aggregations return facet counts alongside search results in a single query — not a separate COUNT query.
Autocomplete/suggest: as a user types, suggest completions based on terms in the index. Elasticsearch's completion suggester handles this efficiently.
Highlighting: return snippets from the matching document with the matched terms highlighted, showing users why a result was returned.
Elasticsearch architecture
Elasticsearch is the dominant open-source full-text search engine. Under the hood, each Elasticsearch node runs an instance of Apache Lucene — the core indexing and search library that underpins most modern search engines.
Index: a collection of documents with a common schema (mapping). Roughly equivalent to a database table. The URL shortener has a links index.
Shard: Elasticsearch distributes an index across shards. Each shard is a complete Lucene index. Shards distribute across nodes — a 5-shard index on a 5-node cluster puts one shard per node, distributing both storage and query load.
Replica: each shard can have one or more replicas. Replicas serve read queries, providing fault tolerance and read scaling.
links index: 5 primary shards × 1 replica = 10 total shards
Distributed across 5 nodes, each holding 1 primary + 1 replica from a different shard
Near real-time indexing: Elasticsearch uses an LSM-tree-inspired segment-based storage model internally. New documents are first written to an in-memory buffer (searchable within ~1 second by default), then periodically merged into larger immutable segments. This is why Elasticsearch is described as "near real-time" — a freshly indexed document may not be immediately searchable.
Dedicated search engine vs PostgreSQL full-text search
PostgreSQL includes tsvector/tsquery with GIN indexes for full-text search:
-- PostgreSQL FTS
SELECT *, ts_rank(search_vector, query) AS rank
FROM links,
plainto_tsquery('english', 'distributed consensus') query
WHERE search_vector @@ query
AND user_id = 123
ORDER BY rank DESC;
PostgreSQL FTS is appropriate when:
- You're already on PostgreSQL and want to avoid another system
- Search volume is moderate and query patterns are simple
- You don't need faceting, autocomplete, or advanced relevance tuning
Elasticsearch/OpenSearch/Typesense are appropriate when:
- Search is a core feature (user-facing, not internal)
- You need relevance tuning, faceting, autocomplete, highlighting
- Search must scale independently of your transactional database
- Write volume to the search index is high enough to impact the primary database
The tradeoffs
Eventual consistency with the source. Search indexes are typically populated asynchronously — a write to PostgreSQL triggers an event that updates the Elasticsearch index seconds later. Users may not see freshly created content in search results immediately.
Operational overhead. Running Elasticsearch alongside PostgreSQL is two systems to monitor, back up, tune, and upgrade. Managed services (Elastic Cloud, OpenSearch Service, Typesense Cloud) reduce this but add cost.
Schema (mapping) changes are expensive. Changing an Elasticsearch mapping often requires reindexing — re-processing every document through the new mapping. On large indexes, this takes hours and requires careful orchestration (build the new index while the old one serves traffic, swap when complete).
Resource appetite. Elasticsearch is memory-hungry — it caches the inverted index and segment data in the JVM heap. Under-provisioned Elasticsearch nodes cause slow queries and garbage collection pauses.
When to use a full-text search engine
Use Elasticsearch (or Typesense, or OpenSearch) when:
- Users need to search free-text content with relevance ranking
- Typo tolerance, stemming, and linguistic analysis are required
- Faceting alongside search results (filter by category, date range, author)
- Search is a prominent product feature, not just an admin utility
Stay with PostgreSQL FTS when:
- Search is a secondary feature, admin tooling, or low-traffic
- Operational simplicity outweighs the additional capabilities
- Query patterns are simple (keyword match, no facets, no autocomplete)
The one thing to remember
Full-text search engines are built around the inverted index — a structure that maps terms to the documents containing them, enabling fast lookup and relevance ranking. The key capabilities (stemming, fuzzy matching, BM25 ranking, facets, autocomplete) aren't possible with SQL
LIKEor even B+ tree indexes, because they require linguistic analysis at index time and relevance scoring at query time. If search is a user-facing feature that matters, a dedicated search engine will deliver a dramatically better experience than a relational database trying to simulate one.
← Previous: Vector Databases: Semantic Search and AI Memory — Vector databases power semantic search, recommendations, and LLM memory. Learn how embeddings work, what ANN search i...
→ Next: Materialized Views: Pre-Computing Expensive Queries — Materialized views cache expensive query results as physical tables. Learn how they work, when to refresh them, and w...




