Retrieval-Augmented Generation (RAG) is an AI architecture that enhances a language model's responses by first retrieving relevant information from a knowledge base, then generating a response grounded in that retrieved content. RAG mitigates LLMs' knowledge-cutoff and hallucination problems by giving models access to specific, up-to-date, or proprietary information at inference time.
The problem RAG solves
Standard LLMs answer from their parametric memory (weights trained on a fixed corpus with a cutoff date). They can't know about events after training, have no access to your private documents, and may confuse or misremember specific details.
Analogy
RAG is like allowing a student to reference their textbook during an exam, rather than relying purely on memory. The model still does the reasoning — but it's grounded in retrieved evidence, not confabulation.
- Knowledge cutoff: Base LLMs don't know events after their training cutoff.
- Private data: Your company's internal documents, your personal PDFs — none of this is in the training data.
- Hallucination: Without source grounding, models generate plausible-sounding but incorrect specific facts.
- Attribution: RAG lets the model cite exact sources (page numbers, documents).
The full RAG pipeline
RAG has two phases: offline indexing (one-time) and online query (per request):
Complete RAG pipeline — indexing phase
```python
from openai import OpenAI
import numpy as np
from typing import List, Dict

client = OpenAI()

# ══════════════════════════════════════════════════════════
# PHASE 1: INDEXING (runs once when you upload a document)
# ══════════════════════════════════════════════════════════

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> List[str]:
    """Split text into overlapping chunks."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for i in range(0, len(words), step):
        chunk = " ".join(words[i : i + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

def embed_texts(texts: List[str]) -> np.ndarray:
    """Embed a list of texts using OpenAI's embedding model."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    embeddings = [r.embedding for r in response.data]
    return np.array(embeddings)

# Your document (in practice, extracted from PDF)
document = """
Photosynthesis is the process by which plants convert sunlight into glucose.
It occurs in two stages: the light-dependent reactions in the thylakoids,
and the Calvin cycle in the stroma of the chloroplast.
The overall equation is: 6CO2 + 6H2O + light → C6H12O6 + 6O2.
...
"""

# Split into chunks and embed
chunks = chunk_text(document)
chunk_embeddings = embed_texts(chunks)  # shape: (n_chunks, 1536)

# Normalize for fast cosine similarity
norms = np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
chunk_embeddings_norm = chunk_embeddings / norms

# Store in a "vector database" (in-memory here; use pgvector/Pinecone in production)
index = {"chunks": chunks, "embeddings": chunk_embeddings_norm}
print(f"Indexed {len(chunks)} chunks")
```

Complete RAG pipeline — query phase
```python
# ══════════════════════════════════════════════════════════
# PHASE 2: QUERY (runs on every user question)
# ══════════════════════════════════════════════════════════

def retrieve(query: str, index: Dict, top_k: int = 3) -> List[str]:
    """Find the most relevant chunks for a query."""
    query_emb = embed_texts([query])[0]
    query_emb = query_emb / np.linalg.norm(query_emb)  # normalize
    # Cosine similarity = dot product with normalized vectors
    scores = index["embeddings"] @ query_emb  # (n_chunks,)
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [index["chunks"][i] for i in top_indices]

def rag_answer(question: str, index: Dict) -> str:
    """Retrieve relevant context, then generate a grounded answer."""
    # Step 1: Retrieve
    context_chunks = retrieve(question, index, top_k=3)
    context = "\n\n".join(f"[Chunk {i+1}]: {c}" for i, c in enumerate(context_chunks))

    # Step 2: Generate (grounded answer)
    system_prompt = """You are a precise AI assistant.
Answer ONLY from the provided context below.
If the answer is not in the context, say "I don't have that information in the provided documents."
Always cite which chunk your answer comes from."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=0  # deterministic for factual Q&A
    )
    return response.choices[0].message.content

answer = rag_answer("What is the equation for photosynthesis?", index)
print(answer)
# "According to Chunk 1, the equation for photosynthesis is:
#  6CO2 + 6H2O + light → C6H12O6 + 6O2"
```

Naive RAG vs Advanced RAG
Basic RAG (retrieve → generate) breaks down in real-world scenarios. Advanced RAG techniques address common failure modes:
| Problem | Technique | How it helps |
|---|---|---|
| Query ambiguous | Query rewriting / HyDE | Rewrite query or generate a hypothetical answer to embed, then retrieve |
| Multi-hop questions | Iterative retrieval | Retrieve → generate sub-answer → use sub-answer to retrieve more context |
| Irrelevant chunks retrieved | Re-ranking | Use a cross-encoder to re-rank top-k retrieved chunks by true relevance |
| Keyword terms missed | Hybrid search | Combine dense vector search + sparse BM25 keyword search |
| Large documents | Hierarchical indexing | Index summaries + full chunks; search summaries first for efficiency |
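The iterative-retrieval technique in the table can be sketched as a retrieve-generate loop in which each intermediate answer seeds the next search. This is a minimal sketch: the retriever and generator below are toy stand-ins (a keyword lookup and a string join), not a real vector store or LLM.

```python
from typing import Callable, List

def iterative_retrieve(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    max_hops: int = 3,
) -> List[str]:
    """Multi-hop retrieval: each intermediate answer seeds the next search."""
    context: List[str] = []
    query = question
    for _ in range(max_hops):
        new_chunks = [c for c in retrieve(query) if c not in context]
        if not new_chunks:
            break  # nothing new found, stop early
        context.extend(new_chunks)
        query = generate(question, context)  # sub-answer becomes next query
    return context

# Toy stand-ins for a vector-store retriever and an LLM
kb = {
    "capital of France": ["Paris is the capital of France."],
    "Paris": ["Paris hosted the 1924 Olympics."],
}

def toy_retrieve(query: str) -> List[str]:
    return [c for key, docs in kb.items() if key in query for c in docs]

def toy_generate(question: str, context: List[str]) -> str:
    return " ".join(context)  # a real LLM would write a sub-answer

collected = iterative_retrieve("What is the capital of France?", toy_retrieve, toy_generate)
```

Note how the second hop only succeeds because the first hop's answer mentions "Paris"; a single retrieval pass with the original question would never surface the Olympics chunk.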
Why RAG dramatically reduces hallucination
Most of the reduction comes from one key instruction in the system prompt:
System prompt that grounds the model in retrieved evidence
```
SYSTEM: You are a precise study assistant. Answer questions based ONLY on
the provided context passages below.

Rules:
1. If the answer is directly stated in the context, cite it with [Chunk N].
2. If the answer is implied but not directly stated, say "Based on the context..."
3. If the answer is NOT in the context, respond: "This isn't covered in the
   provided material. Try asking a more specific question or uploading a
   document that covers this topic."
4. NEVER use your general knowledge to fill gaps — only use the context.
```
This is critical for academic use: fabricating information could harm students.

RAG limitations
RAG is not foolproof. If the relevant chunk wasn't retrieved (retrieval failure), the model has no source to ground its answer and may hallucinate. This is why chunk size, overlap, and top-k tuning matter.
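One way to tune those knobs is to measure retrieval quality directly. A minimal recall@k sketch, run over a labeled query set after each configuration change; the chunk ids and gold labels below are made up for illustration:

```python
from typing import Dict, List, Set

def recall_at_k(
    retrieved: Dict[str, List[int]],  # query -> ranked chunk ids returned
    relevant: Dict[str, Set[int]],    # query -> gold relevant chunk ids
    k: int,
) -> float:
    """Fraction of queries whose top-k results contain at least one gold chunk."""
    hits = sum(1 for q, gold in relevant.items() if gold & set(retrieved[q][:k]))
    return hits / len(relevant)

# Hypothetical evaluation set: two queries with known relevant chunks
retrieved = {"q1": [3, 7, 1], "q2": [4, 2, 9]}
relevant = {"q1": {7}, "q2": {5}}

print(recall_at_k(retrieved, relevant, k=2))  # q1 hits (7 in top-2), q2 misses -> 0.5
```

If recall@k is low, no amount of prompt engineering downstream will fix the answers; adjust chunking or retrieval first.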
Practice questions
- What is the difference between naive RAG, advanced RAG, and modular RAG? (Answer: Naive RAG: query → embed query → retrieve top-k chunks → concatenate with prompt → generate. Simple but fragile. Advanced RAG: adds pre-retrieval (query rewriting, HyDE) and post-retrieval (re-ranking, compression) steps. Modular RAG: loosely coupled pipeline with interchangeable components — different retrievers, rankers, generators, memory modules can be swapped per use case. Modular RAG is the current production standard: enables A/B testing components, graceful degradation, and specialised modules for different query types.)
- What is HyDE (Hypothetical Document Embeddings) and when should you use it? (Answer: HyDE: instead of embedding the user query directly, prompt an LLM to generate a hypothetical answer to the query. Embed the hypothetical answer and use it as the search query. Rationale: the answer embedding is closer in embedding space to actual answer documents than the query embedding (questions and answers have different linguistic patterns). Use when: queries are short and ambiguous (the LLM expansion adds context), domain vocabulary differs between queries and documents, or retrieval quality is poor with direct query embedding.)
- What is the context window stuffing problem in RAG and how do re-ranking and compression address it? (Answer: Naive top-k retrieval may return: (1) Redundant chunks covering the same information. (2) Low-relevance chunks that merely contain query keywords. (3) More content than the context window can hold. Re-ranking (Cohere Rerank, BGE reranker): use a cross-encoder to score (query, chunk) relevance — more accurate than embedding similarity alone. Re-rank and select top-5 from top-50. Compression (LLMLingua, RECOMP): use an LLM to extract only the most relevant sentences from retrieved chunks — reducing token count by 2–5× before insertion.)
- What is the 'lost in the middle' problem for RAG and how do you mitigate it? (Answer: LLMs perform better when relevant context appears at the beginning or end of the prompt rather than the middle (Liu et al. 2023). For RAG with 10 retrieved chunks, the most relevant chunk should be first or last, not middle. Mitigation: (1) Re-rank by relevance and position most relevant chunk first. (2) Reverse order (most relevant last, just before the query). (3) Reduce number of retrieved chunks (fewer chunks = less middle). (4) Use models trained specifically for long-context RAG.)
- What is the difference between dense retrieval, sparse retrieval, and hybrid retrieval in RAG systems? (Answer: Sparse (BM25/TF-IDF): keyword matching, handles exact terms well, interpretable, fast. Fails on semantic synonyms. Dense (bi-encoder): embed query and documents, retrieve by cosine similarity. Handles semantic similarity but may miss exact matches. Hybrid (Reciprocal Rank Fusion): combine sparse and dense retrieval ranking lists. Example: BM25 rank + FAISS rank → RRF combined rank. Best of both: handles exact terms (BM25) AND semantic similarity (dense). Weaviate, Qdrant, OpenSearch all support hybrid search natively.)
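The modular-RAG idea from the first question can be sketched as swappable components behind small interfaces. The keyword retriever and echo generator here are illustrative stubs; production code would plug in a vector-store retriever and an LLM-backed generator.

```python
from typing import List, Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, top_k: int) -> List[str]: ...

class Generator(Protocol):
    def generate(self, question: str, context: List[str]) -> str: ...

class RAGPipeline:
    """Loosely coupled pipeline: any retriever/generator pair can be swapped in."""
    def __init__(self, retriever: Retriever, generator: Generator):
        self.retriever = retriever
        self.generator = generator

    def answer(self, question: str, top_k: int = 3) -> str:
        context = self.retriever.retrieve(question, top_k)
        return self.generator.generate(question, context)

class KeywordRetriever:
    """Stub: rank chunks by word overlap with the query."""
    def __init__(self, chunks: List[str]):
        self.chunks = chunks
    def retrieve(self, query: str, top_k: int) -> List[str]:
        words = set(query.lower().split())
        ranked = sorted(self.chunks, key=lambda c: -len(words & set(c.lower().split())))
        return ranked[:top_k]

class EchoGenerator:
    """Stub: echo the top chunk instead of calling an LLM."""
    def generate(self, question: str, context: List[str]) -> str:
        return f"Based on: {context[0]}"

pipeline = RAGPipeline(KeywordRetriever(["plants use sunlight", "cats sleep a lot"]), EchoGenerator())
```

Because the pipeline only depends on the two protocols, A/B testing a new retriever is a one-line change at construction time.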
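The HyDE flow can be sketched by injecting the LLM and embedder as callables. Both are toy stubs here (a canned hypothetical answer and a keyword-count embedder), so only the retrieval mechanics are real:

```python
from typing import Callable, List, Sequence
import numpy as np

def hyde_retrieve(
    query: str,
    generate_hypothetical: Callable[[str], str],  # an LLM call in practice
    embed: Callable[[str], np.ndarray],
    doc_embeddings: np.ndarray,  # (n_docs, dim), rows L2-normalized
    docs: Sequence[str],
    top_k: int = 3,
) -> List[str]:
    """HyDE: embed a hypothetical answer instead of the raw query."""
    hypothetical = generate_hypothetical(query)
    v = embed(hypothetical)
    v = v / np.linalg.norm(v)
    scores = doc_embeddings @ v
    return [docs[i] for i in np.argsort(scores)[::-1][:top_k]]

# Toy stand-ins: a canned "hypothetical answer" and a keyword-count embedder
def toy_generate(query: str) -> str:
    return "Photosynthesis is how plants make glucose from light."

def toy_embed(text: str) -> np.ndarray:
    t = text.lower()
    return np.array([float("photosynthesis" in t), float("cats" in t), 1.0])

docs = ["Photosynthesis converts light to glucose.", "Cats sleep all day."]
doc_matrix = np.stack([toy_embed(d) for d in docs])
doc_matrix = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)

top_docs = hyde_retrieve("how do plants make food?", toy_generate, toy_embed, doc_matrix, docs, top_k=1)
```

The query "how do plants make food?" shares no keywords with the target document, but the hypothetical answer does, which is exactly the gap HyDE closes.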
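The reordering mitigations for "lost in the middle" amount to a small permutation. One common sketch places the top-ranked chunks at both ends of the prompt and pushes the weakest into the middle:

```python
from typing import List

def order_for_long_context(chunks_by_relevance: List[str]) -> List[str]:
    """Place the most relevant chunks at the ends of the prompt, weakest in the middle.

    Input is sorted most-relevant-first; even ranks go to the front,
    odd ranks to the back (reversed), so ranks 1 and 2 land first and last.
    """
    front: List[str] = []
    back: List[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(order_for_long_context(["r1", "r2", "r3", "r4", "r5"]))
# -> ['r1', 'r3', 'r5', 'r4', 'r2']
```

Here the least relevant chunk (r5) ends up in the middle, where the model pays the least attention.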
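Reciprocal Rank Fusion, mentioned in the hybrid-retrieval answer, is itself only a few lines: each document scores 1/(k + rank) in every ranked list it appears in, with k = 60 the conventional default. The document ids below are made up for illustration:

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]   # sparse keyword results
dense_ranking = ["doc_c", "doc_a", "doc_d"]  # vector search results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

doc_a wins because it ranks highly in both lists, even though neither list puts it first by a large margin; that stability under disagreement is why RRF is the default fusion method in hybrid search.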
On LumiChats
LumiChats Study Mode is built on a production RAG pipeline. Documents are chunked, embedded with text-embedding-3-large, and stored in pgvector. Every answer in Study Mode is retrieved from your specific document — cited by page number, never hallucinated from training data.
Try it free