
Semantic Search

Finding meaning, not just matching keywords.


Definition

Semantic search uses AI embeddings to find information based on meaning and intent rather than exact keyword matches. A semantic search system understands that 'heart attack treatment' and 'myocardial infarction therapy' are asking the same thing, and can return relevant results even when none of the query words appear in the document.

Keyword search vs semantic search

The fundamental difference: keyword search finds documents that contain your words; semantic search finds documents that mean what you mean.

| Dimension | Keyword / Lexical (BM25) | Semantic (Embedding-based) |
| --- | --- | --- |
| How it works | Term frequency × inverse document frequency scoring | Encode query + docs to vectors; cosine similarity |
| Handles synonyms? | No — "heart attack" misses "myocardial infarction" | Yes — same region in embedding space |
| Handles paraphrase? | No — different words = no match | Yes — meaning preserved in embedding |
| Handles typos? | Partially (fuzzy matching add-ons) | Yes — nearby spelling = similar embedding |
| Speed | Very fast — inverted index lookup | Fast — ANN search, ~1–10ms |
| Interpretable? | Yes — exact term matches visible | Less so — similarity score only |
| When it fails | Vocabulary mismatch, paraphrase, concept queries | Very specific technical terms, very short docs |
| Best for | Exact product names, codes, legal terms | General Q&A, intent search, FAQ matching |

Hybrid search usually wins

Production search systems (Google, Elasticsearch, Weaviate, pgvector) consistently find that hybrid search — BM25 score + semantic similarity score, combined with a cross-encoder re-ranker — outperforms either alone. The intuition: BM25 catches exact technical terms that embeddings blur; semantic search catches synonyms and intent that BM25 misses. Use Reciprocal Rank Fusion (RRF) to merge the two ranked lists.
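The RRF merge step can be sketched in a few lines. This is a minimal illustration, not any library's implementation; the function name `rrf_merge` is hypothetical, and `k=60` is the constant commonly used with RRF.

```python
def rrf_merge(bm25_ranked, semantic_ranked, k=60):
    """Reciprocal Rank Fusion: each document scores sum(1 / (k + rank))
    over every ranked list it appears in (ranks are 1-based). The k
    constant damps the influence of any single list's top ranks."""
    scores = {}
    for ranking in (bm25_ranked, semantic_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc mid-ranked by BOTH retrievers beats docs top-ranked by only one:
merged = rrf_merge(["d1", "d2", "d3"], ["d4", "d2", "d5"])
```

Note that RRF only needs the two rank orderings, not the raw scores — which is exactly why it works for merging BM25 scores and cosine similarities that live on incomparable scales.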

Bi-encoder vs cross-encoder architecture

The core tension in semantic search: accuracy vs speed. Bi-encoders are fast but less accurate; cross-encoders are slow but very accurate. The standard solution: use both in a two-stage pipeline.

| Property | Bi-encoder | Cross-encoder |
| --- | --- | --- |
| How it works | Encode query and doc separately → fixed vectors → cosine similarity | Concatenate query + doc → run full model → single relevance score |
| Token interaction | None — query and doc never attend to each other | Full cross-attention between every query and doc token |
| Indexing | Pre-compute doc embeddings offline; query embedding at runtime | Cannot pre-compute — must run at query time with each doc |
| Latency | ~1–10ms for ANN search over millions of docs | ~100–500ms per query for top-100 docs |
| Quality | Good — limited by single-vector bottleneck | Best — tokens attend directly to each other |
| Scale | Billions of docs — ANN index handles it | Only feasible for small candidate sets (~100–500 docs) |
| Example models | E5, BGE, Voyage, text-embedding-3-large | MS-MARCO cross-encoder, Cohere Rerank, bge-reranker |

The retrieve-then-rerank pipeline

Industry standard: (1) Bi-encoder retrieves top-100 candidates from millions of docs in ~10ms. (2) Cross-encoder re-ranks those 100 to produce final top-10 in ~200ms. Total latency ~210ms — the quality of a cross-encoder at the scale of a bi-encoder. Cohere Rerank, Jina Reranker, and BGE Reranker are popular choices for the reranking step.

Two-stage retrieval pipeline with SentenceTransformers

from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

# Stage 1: Bi-encoder retrieval
bi_encoder = SentenceTransformer('intfloat/e5-large-v2')
docs = ["heart attack symptoms", "myocardial infarction treatment", "chest pain causes", ...]

# Pre-compute document embeddings (done once, cached)
doc_embeddings = bi_encoder.encode(["passage: " + d for d in docs], normalize_embeddings=True)

query = "query: what causes heart attacks"
query_emb = bi_encoder.encode([query], normalize_embeddings=True)

# Retrieve top-100 by cosine similarity
scores = (query_emb @ doc_embeddings.T)[0]
top100_indices = np.argsort(scores)[::-1][:100]
top100_docs = [docs[i] for i in top100_indices]

# Stage 2: Cross-encoder reranking
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
pairs = [("what causes heart attacks", doc) for doc in top100_docs]
rerank_scores = cross_encoder.predict(pairs)

# Final top-10
top10 = sorted(zip(rerank_scores, top100_docs), reverse=True)[:10]
for score, doc in top10:
    print(f"{score:.3f}  {doc}")

Dense Passage Retrieval (DPR)

DPR (Karpukhin et al., Facebook AI, 2020) was the breakthrough paper showing learned dense embeddings can outperform BM25 for open-domain question answering — launching the modern semantic search era.

DPR contrastive loss: maximize similarity between (query, positive passage) pairs while minimizing similarity to negative passages. sim(q, p) = dot product of BERT encodings. In-batch negatives: other passages in the batch serve as easy negatives; BM25 hard negatives added for harder training signal.
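The in-batch objective can be sketched in numpy. This is a toy illustration with one-hot "embeddings" (real DPR uses BERT encodings); the function name `dpr_in_batch_loss` is hypothetical.

```python
import numpy as np

def dpr_in_batch_loss(q_embs, p_embs):
    """Contrastive loss with in-batch negatives: S[i, j] = q_i · p_j.
    Each query's gold passage sits on the diagonal; every other passage
    in the batch is a negative. Loss = mean cross-entropy of each
    softmax row against its diagonal entry."""
    sims = q_embs @ p_embs.T                       # (B, B) similarity matrix
    sims = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy one-hot embeddings (batch of 4, 4-dim) to show the loss behaves sensibly:
q = np.eye(4)
aligned = dpr_in_batch_loss(q, np.eye(4) * 5)                  # gold on diagonal
shuffled = dpr_in_batch_loss(q, np.roll(np.eye(4), 1, 0) * 5)  # gold misaligned
# aligned loss is near zero; shuffled loss is large
```

Minimizing this loss pushes each query toward its gold passage and away from every other passage in the batch — which is why larger batches give more (easy) negatives for free.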

| Aspect | Detail |
| --- | --- |
| Architecture | Two independent BERT encoders — one for queries, one for passages |
| Training data | Natural Questions + TriviaQA — Wikipedia passages, annotated with gold passages |
| Negative mining | In-batch negatives + BM25 hard negatives (passages that contain query terms but aren't the answer) |
| Result on NQ | Top-20 retrieval accuracy: DPR 79.4% vs BM25 59.1% — +20 points |
| Legacy | Training paradigm (contrastive, hard negatives) adopted by E5, BGE, GTE, Voyage, text-embedding-3 |

Hard negatives are the key

The biggest DPR insight: easy negatives (random passages) make the model lazy — it only needs to avoid obviously unrelated content. Hard negatives (passages that look relevant but aren't) force the encoder to develop a precise semantic understanding. Modern embedding models (E5, BGE, GTE) spend significant effort on hard negative mining strategies — using BM25, a weaker model, or mined adversarial examples to generate challenging training pairs.

Neural semantic search in production

A production semantic search system has three phases: offline indexing, online retrieval, and optional reranking. Each phase has distinct engineering tradeoffs.

End-to-end semantic search with pgvector (PostgreSQL)

import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/e5-large-v2')  # 1024-dim embeddings

# ── INDEXING PHASE (run once / on updates) ─────────────────────────
conn = psycopg2.connect("postgresql://user:pass@localhost/mydb")
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # teach psycopg2 to adapt numpy arrays to the vector type
cur.execute(
    "CREATE TABLE IF NOT EXISTS documents "
    "(id SERIAL PRIMARY KEY, content TEXT, embedding vector(1024))"
)
cur.execute(
    "CREATE INDEX IF NOT EXISTS doc_idx ON documents "
    "USING hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=64)"
)

docs = ["Introduction to neural networks...", "Attention mechanism explained..."]
embeddings = model.encode(["passage: " + d for d in docs], normalize_embeddings=True)
for doc, emb in zip(docs, embeddings):
    cur.execute("INSERT INTO documents (content, embedding) VALUES (%s, %s)",
                (doc, emb))
conn.commit()

# ── RETRIEVAL PHASE (per query) ─────────────────────────────────────
query = "query: how does self-attention work"
q_emb = model.encode([query], normalize_embeddings=True)[0]
cur.execute(
    "SELECT content, 1 - (embedding <=> %s::vector) AS sim "
    "FROM documents ORDER BY embedding <=> %s::vector LIMIT 10",
    (q_emb.tolist(), q_emb.tolist())
)
for text, sim in cur.fetchall():
    print(f"{sim:.3f}  {text[:80]}")

| Vector DB | Best for | Hosting | ANN algorithm |
| --- | --- | --- | --- |
| pgvector | Existing PostgreSQL infra; moderate scale (<50M vecs) | Self-hosted or Supabase | IVFFlat or HNSW |
| Pinecone | Managed, zero ops, horizontal scale | Fully managed SaaS | Proprietary (HNSW-based) |
| Weaviate | Hybrid search built-in; multi-modal | Self-hosted or managed | HNSW |
| Qdrant | High performance, Rust core, filtering-heavy workloads | Self-hosted or managed | HNSW |
| Chroma | Local dev, prototyping, embedded use | Self-hosted (embedded lib) | HNSW (via hnswlib) |

Chunking strategy matters most

The single biggest factor in semantic search quality is not the model — it's how you chunk documents. Too large: retrieved chunks contain mostly irrelevant content. Too small: a single concept is split across chunks, losing context. Rule of thumb: 256–512 tokens with 50-token overlap for general text. Use semantic chunking (split at paragraph/section boundaries) when document structure is available.
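The sliding-window rule of thumb can be sketched as follows — a minimal illustration where whitespace-split words stand in for real tokenizer tokens, and the function name `chunk_tokens` is hypothetical:

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Fixed-size sliding window: consecutive chunks share `overlap`
    tokens, so a sentence cut at one chunk boundary still appears
    whole in the neighbouring chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

# 1200 tokens -> chunks of up to 512 tokens, each overlapping the previous by 50
chunks = chunk_tokens(list(range(1200)))
```

In production you would count tokens with the embedding model's own tokenizer and prefer semantic boundaries (paragraphs, headings) over raw counts when the structure is available.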

ColBERT: late interaction for efficiency

ColBERT (Khattab & Zaharia, Stanford, 2020) bridges the gap between bi-encoders (fast, less accurate) and cross-encoders (slow, most accurate) via late interaction — token-level similarity without full cross-attention.

ColBERT MaxSim scoring: for each query token embedding, find the most similar document token embedding (MaxSim). Sum these per-query-token max scores to get the final relevance score. Documents are pre-encoded to token vectors and stored in compressed form.
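MaxSim scoring is only a few lines of numpy. A toy sketch with made-up 3-dim token embeddings (the function name `maxsim` is hypothetical; real ColBERT uses ~128-dim BERT token vectors):

```python
import numpy as np

def maxsim(query_toks, doc_toks):
    """ColBERT late interaction: cosine similarity of every query token
    against every doc token; take the best doc token per query token
    (MaxSim), then sum over query tokens."""
    q = query_toks / np.linalg.norm(query_toks, axis=1, keepdims=True)
    d = doc_toks / np.linalg.norm(doc_toks, axis=1, keepdims=True)
    sims = q @ d.T                  # (n_query_tokens, n_doc_tokens)
    return sims.max(axis=1).sum()   # best doc match per query token, summed

# A doc matching all 3 query tokens scores higher than one matching only 1:
query = np.eye(3)
full_match = maxsim(query, np.eye(3))
partial = maxsim(query, np.array([[0.0, 0.0, 1.0]]))
```

Because doc token vectors are fixed, they can be pre-encoded and indexed offline — the interaction is "late" in that only the cheap max/sum happens at query time.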

| Architecture | Storage | Query latency | Quality | Scale |
| --- | --- | --- | --- | --- |
| Bi-encoder (single vector) | 1 vector per doc | <5ms ANN | Good | Billions of docs |
| ColBERT (late interaction) | N token vectors per doc (~128 tokens) | 20–50ms | Near cross-encoder | 100M+ docs with PLAID |
| Cross-encoder (full attention) | None (compute at query time) | 100–500ms for top-100 | Best | Only reranking, not retrieval |

ColBERT v2 + PLAID

ColBERT v2 (2022) added residual compression — reducing storage by 6–10× with minimal quality loss. PLAID (2022) adds a fast candidate generation phase before full ColBERT scoring, enabling sub-100ms retrieval over 100M+ documents. RAGatouille (Python library) makes ColBERT v2 accessible without custom infrastructure — one-line indexing and search.

Practice questions

  1. What is the difference between semantic search and lexical search in terms of query handling? (Answer: Lexical search (BM25): tokenises query and document, computes term frequency and inverse document frequency scores. Handles: exact term matching, rare technical terms, product codes, proper nouns. Fails on: paraphrase, synonym, cross-language queries. Semantic search (bi-encoder): converts query and document to dense embeddings, retrieves by vector similarity. Handles: paraphrase ('cheap car' matches 'affordable vehicle'), intent ('how to fix' matches troubleshooting docs). Fails on: exact code/ID matching, very rare domain terms not in training data.)
  2. What is a bi-encoder vs cross-encoder for semantic search and when do you use each? (Answer: Bi-encoder: encode query and document INDEPENDENTLY → embeddings stored offline. Query at search time: encode query, ANN search. Very fast (O(1) per query). Good recall but not optimal precision. Cross-encoder: jointly encode (query, document) pair → single relevance score. Much more accurate (attends across both). Cannot pre-encode — must run for every (query, document) pair at query time. Too slow for first-stage retrieval. Architecture: bi-encoder for retrieval (recall), cross-encoder for re-ranking top-k results (precision).)
  3. What is BEIR (Benchmarking IR) and what has it revealed about semantic search generalisation? (Answer: BEIR (Thakur et al. 2021): 18 diverse information retrieval benchmarks (MSMARCO, TREC-COVID, NQ, ArguAna, etc.) covering different domains and query types. Key finding: models fine-tuned on one retrieval dataset (MSMARCO) significantly underperform on other domains — generalisation is poor. Dense retrieval often underperforms BM25 on out-of-domain data. Hybrid BM25 + dense retrieval consistently outperforms either alone. Conclusion: domain-specific fine-tuning or robust generalisation training is essential for production semantic search.)
  4. What is late interaction semantic search (ColBERT) and when is it preferred? (Answer: ColBERT: encode query and document separately into per-token embeddings (not pooled sentence vectors). Scoring: sum of max cosine similarities between each query token and its best-matching document token (MaxSim). More accurate than bi-encoder (richer interaction) while allowing offline document indexing (unlike cross-encoder). Storage cost: all document token embeddings stored (much larger than bi-encoder). Use when: bi-encoder recall is insufficient, cross-encoder is too slow, and storage cost is acceptable.)
  5. What is approximate nearest neighbour (ANN) indexing and which algorithm does FAISS use by default? (Answer: ANN trades exact nearest neighbour accuracy for massive speed gains. FAISS (Facebook AI Similarity Search) supports multiple index types: IndexFlatL2/IP (exact, brute force — baseline). IndexIVFFlat: inverted file index with cluster-based search — 10–100× faster, ~95% recall with nprobe=64. IndexHNSW: graph-based, excellent recall and speed balance, memory-intensive. IndexIVFPQ: inverted file + product quantisation for memory compression. Production recommendation: IVF_HNSW for high-recall/low-latency; IVFPQ for memory-constrained deployments.)
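The IVF idea in the last answer can be made concrete with a toy inverted-file index in pure numpy — purely illustrative, not the FAISS implementation; the names `build_ivf` and `ivf_search` are hypothetical:

```python
import numpy as np

def build_ivf(vectors, n_clusters=8, iters=5, seed=0):
    """Toy IVF index: run a few k-means steps, then store an inverted
    list mapping each centroid to the ids of vectors assigned to it."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)]
    for _ in range(iters):
        assign = ((vectors[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    assign = ((vectors[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
    lists = {c: np.flatnonzero(assign == c) for c in range(n_clusters)}
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k=5, nprobe=2):
    """Probe only the nprobe clusters whose centroids are closest to
    the query, then brute-force L2 over just those candidates — the
    recall-for-speed trade that ANN indexes make."""
    nearest = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([lists[c] for c in nearest])
    dists = ((vectors[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(dists)[:k]]

rng = np.random.default_rng(1)
vectors = rng.normal(size=(1000, 16))
centroids, lists = build_ivf(vectors)
hits = ivf_search(vectors[42], vectors, centroids, lists, nprobe=1)
```

Raising `nprobe` scans more clusters: recall climbs toward exact search while latency grows — the same knob FAISS exposes on its IVF indexes.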

