
Vector Database

A database built for storing and searching AI embeddings at scale.


Definition

A vector database is a specialised data store designed to efficiently store, index, and retrieve high-dimensional vector embeddings — numerical representations of text, images, audio, or other data produced by embedding models. Unlike traditional databases that retrieve rows matching exact conditions, vector databases retrieve items by semantic similarity: finding the vectors most similar to a query vector using approximate nearest neighbour (ANN) search algorithms. They are the core infrastructure of RAG systems, semantic search engines, recommendation systems, and long-term AI agent memory.

Why traditional databases can't do this

A text embedding is a vector of 768 to 3072 floating-point numbers — representing the semantic meaning of a passage. To find the passages most similar to a query, you need to compute the cosine similarity between the query vector and every stored vector, then return the top-k results. For a database of 10 million documents with 1536-dimensional embeddings, a naive brute-force search requires 10 million dot products per query. A PostgreSQL table can store these vectors, but SQL's query engine was not designed for this operation. Approximate nearest neighbour algorithms — HNSW, IVF-Flat, ScaNN — reduce this from O(n·d) to O(log n · d) with controllable accuracy tradeoffs, making billion-scale semantic search feasible.

Cosine similarity: measures the angle between two vectors regardless of magnitude. Returns 1.0 for identical direction (most similar), 0 for orthogonal (unrelated), -1 for opposite.
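Both ideas above can be sketched in a few lines of plain Python — a minimal brute-force retriever with toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions; the arithmetic is identical):

```python
import math

def cosine_similarity(a, b):
    # Angle-based similarity: 1.0 = same direction, 0 = orthogonal, -1 = opposite
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def brute_force_top_k(query, vectors, k=2):
    # O(n·d): one full similarity computation per stored vector —
    # exactly the cost ANN indexes like HNSW exist to avoid
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

vectors = [
    [0.9, 0.1, 0.0],   # doc 0
    [0.0, 1.0, 0.2],   # doc 1
    [0.8, 0.2, 0.1],   # doc 2
]
query = [1.0, 0.0, 0.0]
print(brute_force_top_k(query, vectors))  # → [0, 2]
```

Docs 0 and 2 point in nearly the same direction as the query, so they rank highest; doc 1 is orthogonal and scores 0.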

| Database | Type | Algorithm | Managed service | Best for |
|---|---|---|---|---|
| Pinecone | Purpose-built vector DB | HNSW + proprietary | Yes (cloud-only) | Production RAG apps; no infra management |
| Weaviate | Purpose-built vector DB | HNSW | Yes + self-host | Multi-tenancy; hybrid BM25+vector search |
| Chroma | Purpose-built vector DB | HNSW (via hnswlib) | No — local/self-host | Development, local testing, small-scale RAG |
| Qdrant | Purpose-built vector DB | HNSW | Yes + self-host | High-performance; advanced filtering |
| pgvector (PostgreSQL) | Extension to existing DB | IVF-Flat / HNSW | Via Supabase, Neon | Teams already on Postgres; simpler stack |
| FAISS (Meta) | Library (not a DB) | IVF-Flat, HNSW, PQ | No — library only | Research; custom applications; maximum control |

Building a RAG system with a vector database

Complete RAG pipeline: embed documents → store in Chroma → retrieve → generate with Claude

from anthropic import Anthropic
import chromadb
from chromadb.utils import embedding_functions

# ── 1. Set up Chroma vector database ──────────────────────────────────────
client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path="./chroma_db") for disk

# Use OpenAI embeddings (or swap for sentence-transformers for free local embed)
embed_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_OPENAI_KEY",
    model_name="text-embedding-3-small"  # 1536-dim, $0.02 per 1M tokens
)

collection = client.create_collection("lumichats_docs", embedding_function=embed_fn)

# ── 2. Ingest documents ────────────────────────────────────────────────────
documents = [
    "LumiChats charges ₹69 per active day. You only pay on days you use it.",
    "LumiChats Study Mode locks all AI answers to specific pages of your uploaded PDF.",
    "LumiChats supports 40+ models including Claude Sonnet 4.6, GPT-5.4, and Gemini 3 Pro.",
    "LumiChats Agent Mode enables multi-step autonomous task execution using frontier models.",
]
collection.add(
    documents=documents,
    ids=[f"doc_{i}" for i in range(len(documents))]
)

# ── 3. Retrieve relevant context for a query ──────────────────────────────
query = "How much does LumiChats cost?"
results = collection.query(query_texts=[query], n_results=2)
context = "\n".join(results["documents"][0])

# ── 4. Generate answer with Claude using retrieved context ─────────────────
anthropic = Anthropic()
response = anthropic.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system=f"""Answer questions using only the provided context.
Context:
{context}

If the answer isn't in the context, say so explicitly.""",
    messages=[{"role": "user", "content": query}]
)
print(response.content[0].text)
# → "LumiChats charges ₹69 per active day — you only pay on days you actually use it."

Choosing the right vector database

For development and small projects (under 100,000 vectors): use Chroma locally — free, no account needed, 5-minute setup. For production (100K–10M vectors): Pinecone Serverless or Qdrant Cloud — managed, scalable, reasonable pricing. For teams already on Supabase or Neon (PostgreSQL): use pgvector — eliminates a separate service. For billion-scale search with full control: FAISS + custom infrastructure. The most common mistake is over-engineering: most RAG applications serve well under 1M vectors and don't need Pinecone's scale.
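The scale thresholds above translate directly into memory arithmetic. A rough footprint estimate for float32 vectors at 1536 dimensions (raw vector storage only — index structures like HNSW add significant overhead on top, so treat these as lower bounds):

```python
def raw_vector_bytes(n_vectors, dims, bytes_per_float=4):
    # float32 storage for the vectors alone, before any index overhead
    return n_vectors * dims * bytes_per_float

# The three tiers discussed above, at 1536 dims (text-embedding-3-small)
for n in (100_000, 1_000_000, 10_000_000):
    gb = raw_vector_bytes(n, 1536) / 1e9
    print(f"{n:>11,} vectors -> {gb:.1f} GB raw")
# 100K vectors fit comfortably in memory on a laptop;
# 10M vectors (~61 GB raw) is where managed infrastructure starts to pay off.
```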

Practice questions

  1. What is approximate nearest neighbour (ANN) search and why is it used instead of exact nearest neighbour? (Answer: Exact nearest neighbour in 1536-dimensional embedding space requires comparing the query vector against every stored vector — O(n·d) time and impractical for millions of vectors. ANN algorithms (HNSW, IVF, LSH) sacrifice a small amount of accuracy (miss a few true nearest neighbours) for massive speed improvements — typically 100–1000× faster. Production vector databases use ANN: Pinecone, Weaviate, and Qdrant use HNSW as their primary index, while pgvector supports both IVF-Flat and HNSW.)
  2. What is HNSW (Hierarchical Navigable Small World) and why is it the dominant ANN algorithm? (Answer: HNSW builds a multi-layer graph where vectors are connected to their nearest neighbours. The top layers are sparse graphs with long-range connections (for fast approximate traversal). Bottom layers are dense with precise neighbourhood connections. Search: enter at the top layer, greedily navigate toward the query, descend to more detailed layers. Insert/search are both O(log n) — unlike tree structures which degrade in high dimensions. HNSW dominates because it achieves 95%+ recall at 100–1000× speedup over brute force.)
  3. In a RAG system, when would you use hybrid search (vector + keyword) instead of pure vector search? (Answer: Hybrid search combines dense (embedding) and sparse (BM25/TF-IDF) retrieval. Use hybrid when: (1) Queries include exact terms that must be matched (product IDs, proper nouns, technical terms). Pure vector search may retrieve semantically similar but wrong product. (2) Domain vocabulary is specialised — embeddings may not capture domain-specific term similarity. (3) Users mix broad conceptual queries with specific searches. Reciprocal Rank Fusion (RRF) combines the two ranking lists. Weaviate and Qdrant both support hybrid search natively.)
  4. What is the embedding dimensionality trade-off for vector databases? (Answer: Higher dimensions (e.g., 3072 for text-embedding-3-large): more nuanced semantic representation, higher search quality. Costs: more storage per vector (3072 × 4 bytes = 12KB vs 384 × 4 bytes = 1.5KB for small embeddings), slower indexing and search, higher memory usage. Lower dimensions (e.g., Matryoshka embeddings can be truncated to 256 dims): ~12× storage reduction, ~4× faster search, small accuracy loss. Matryoshka Representation Learning (MRL) trains models so early dimensions capture the most important information — enabling dimension selection at retrieval time.)
  5. What is the 'semantic gap' problem in vector search and how does query rewriting address it? (Answer: Semantic gap: user queries are often short, keyword-like, and expressed differently than the documents they target. A query 'Python list comprehension' may not retrieve a document titled 'Compact syntax for creating lists in Python' even if they are semantically equivalent — embedding similarity depends on training distribution. Query rewriting: use an LLM to expand or rephrase the query into multiple forms. HyDE (Hypothetical Document Embeddings): generate a hypothetical answer to the query and embed THAT — the answer embedding is closer in space to the actual answer document.)
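The Reciprocal Rank Fusion mentioned in question 3 is simple enough to sketch in full. Each ranked list contributes 1/(k + rank) per document; the constant k=60 comes from the original RRF paper, and the doc IDs here are purely illustrative:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    # Each list is doc IDs ordered best-first; a doc's fused score is
    # the sum of 1/(k + rank) over every list it appears in
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["doc_a", "doc_b", "doc_c"]   # dense (embedding) ranking
keyword_hits = ["doc_c", "doc_a", "doc_d"]   # sparse (BM25) ranking
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

doc_a and doc_c appear in both lists, so they outrank documents that only one retriever found — the core intuition behind hybrid search.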

On LumiChats

LumiChats Study Mode uses vector similarity search internally to retrieve the most relevant passages from your uploaded PDFs before generating answers — the same RAG architecture described here, built specifically for exam preparation with zero hallucination risk.

Try it free

Try LumiChats for ₹69

40+ AI models. Study Mode with page-locked answers. Agent Mode with code execution. Pay only on days you use it.

Get Started — ₹69/day
