RAG (Retrieval-Augmented Generation)

Definition

Retrieval-Augmented Generation (RAG) is an AI architecture that enhances a language model's responses by first retrieving relevant information from a knowledge base, then generating a response grounded in that retrieved content. RAG solves LLMs' knowledge cutoff and hallucination problems by giving models access to specific, up-to-date, or proprietary information at inference time.

The problem RAG solves

Standard LLMs answer from their parametric memory (weights trained on a fixed corpus with a cutoff date). They can't know about events after training, have no access to your private documents, and may confuse or misremember specific details:

  • Knowledge cutoff: Base LLMs don't know about events after their training cutoff.
  • Private data: Your company's internal documents, your personal PDFs — none of this is in the training data.
  • Hallucination: Without source grounding, models generate plausible-sounding but incorrect specific facts.
  • Attribution: A standalone model can't point to where an answer came from; RAG lets it cite exact sources (page numbers, documents).

Analogy

RAG is like allowing a student to reference their textbook during an exam, rather than relying purely on memory. The model still does the reasoning — but it's grounded in retrieved evidence, not confabulation.

The full RAG pipeline

RAG has two phases: offline indexing (one-time) and online query (per request):

Complete RAG pipeline — indexing phase

from openai import OpenAI
import numpy as np
from typing import List, Dict

client = OpenAI()

# ══════════════════════════════════════════════════════════
#  PHASE 1: INDEXING (runs once when you upload a document)
# ══════════════════════════════════════════════════════════

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> List[str]:
    """Split text into overlapping chunks."""
    words = text.split()
    chunks = []
    step = max(1, chunk_size - overlap)   # guard against overlap >= chunk_size

    for i in range(0, len(words), step):
        chunk = " ".join(words[i : i + chunk_size])
        if chunk:
            chunks.append(chunk)

    return chunks


def embed_texts(texts: List[str]) -> np.ndarray:
    """Embed a list of texts using OpenAI's embedding model."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    embeddings = [r.embedding for r in response.data]
    return np.array(embeddings)


# Your document (in practice, extracted from PDF)
document = """
Photosynthesis is the process by which plants convert sunlight into glucose.
It occurs in two stages: the light-dependent reactions in the thylakoids,
and the Calvin cycle in the stroma of the chloroplast.
The overall equation is: 6CO2 + 6H2O + light → C6H12O6 + 6O2.
...
"""

# Split into chunks and embed
chunks = chunk_text(document)
chunk_embeddings = embed_texts(chunks)     # shape: (n_chunks, 1536)

# Normalize for fast cosine similarity
norms = np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
chunk_embeddings_norm = chunk_embeddings / norms

# Store in "vector database" (in-memory here; use pgvector/Pinecone in production)
index = {"chunks": chunks, "embeddings": chunk_embeddings_norm}

print(f"Indexed {len(chunks)} chunks")

Complete RAG pipeline — query phase

# ══════════════════════════════════════════════════════════
#  PHASE 2: QUERY (runs on every user question)
# ══════════════════════════════════════════════════════════

def retrieve(query: str, index: Dict, top_k: int = 3) -> List[str]:
    """Find the most relevant chunks for a query."""
    query_emb = embed_texts([query])[0]
    query_emb = query_emb / np.linalg.norm(query_emb)     # normalize

    # Cosine similarity = dot product with normalized vectors
    scores = index["embeddings"] @ query_emb               # (n_chunks,)
    top_indices = np.argsort(scores)[::-1][:top_k]

    return [index["chunks"][i] for i in top_indices]


def rag_answer(question: str, index: Dict) -> str:
    """Retrieve relevant context, then generate a grounded answer."""

    # Step 1: Retrieve
    context_chunks = retrieve(question, index, top_k=3)
    context = "\n\n".join(f"[Chunk {i+1}]: {c}" for i, c in enumerate(context_chunks))

    # Step 2: Generate (grounded answer)
    system_prompt = """You are a precise AI assistant.
Answer ONLY from the provided context below.
If the answer is not in the context, say "I don't have that information in the provided documents."
Always cite which chunk your answer comes from."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user",   "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=0    # minimize randomness for factual Q&A
    )

    return response.choices[0].message.content


answer = rag_answer("What is the equation for photosynthesis?", index)
print(answer)
# "According to Chunk 1, the equation for photosynthesis is:
#  6CO2 + 6H2O + light → C6H12O6 + 6O2"

Naive RAG vs Advanced RAG

Basic RAG (retrieve → generate) breaks down in real-world scenarios. Advanced RAG techniques address common failure modes:

| Problem | Technique | How it helps |
|---|---|---|
| Ambiguous query | Query rewriting / HyDE | Rewrite the query or generate a hypothetical answer to embed, then retrieve |
| Multi-hop questions | Iterative retrieval | Retrieve → generate sub-answer → use sub-answer to retrieve more context |
| Irrelevant chunks retrieved | Re-ranking | Use a cross-encoder to re-rank top-k retrieved chunks by true relevance |
| Keyword terms missed | Hybrid search | Combine dense vector search + sparse BM25 keyword search |
| Large documents | Hierarchical indexing | Index summaries + full chunks; search summaries first for efficiency |
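Hybrid search is simple enough to sketch inline. The dense and sparse ranked lists are typically merged with Reciprocal Rank Fusion (RRF). A minimal sketch with toy rankings (the chunk ids are illustrative; k=60 is the common default from the original RRF paper):

```python
def rrf_fuse(rankings, k=60):
    """Merge several ranked lists of chunk ids into one fused ranking.

    Each input list is best-first. RRF scores each id as
    sum(1 / (k + rank)) over the lists it appears in; the constant k
    damps the bonus for appearing at rank 1 in any single list.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_ranking = ["c3", "c1", "c7"]    # keyword (sparse) results, best first
dense_ranking = ["c1", "c5", "c3"]   # vector (dense) results, best first

fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)   # ['c1', 'c3', 'c5', 'c7']
```

Chunks appearing high in both lists (c1, c3) float to the top; chunks found by only one retriever still survive with a lower fused score.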

Why RAG dramatically reduces hallucination

The instruction that does most of the work in reducing hallucination in RAG systems:

System prompt that grounds the model in retrieved evidence

SYSTEM: You are a precise study assistant. Answer questions based ONLY on 
the provided context passages below. 

Rules:
1. If the answer is directly stated in the context, cite it with [Chunk N].  
2. If the answer is implied but not directly stated, say "Based on the context..."
3. If the answer is NOT in the context, respond: "This isn't covered in the 
   provided material. Try asking a more specific question or uploading a 
   document that covers this topic."
4. NEVER use your general knowledge to fill gaps — only use the context.

This is critical for academic use: fabricating information could harm students.

RAG limitations

RAG is not foolproof. If the relevant chunk wasn't retrieved (retrieval failure), the model has no source to ground its answer and may hallucinate. This is why chunk size, overlap, and top-k tuning matter.
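One mitigation for retrieval failure: check the best similarity score before generating, and refuse to answer when it is too low. A minimal sketch with toy 2-D normalized vectors (the 0.35 cutoff is an assumed value; in practice you tune it on held-out queries):

```python
import numpy as np


def retrieve_with_guard(query_emb, chunk_embs, chunks, top_k=3, min_score=0.35):
    """Return the top-k chunks, or None when even the best match is weak.

    Assumes query_emb and each row of chunk_embs are L2-normalized,
    so the dot product equals cosine similarity. min_score is a
    tunable cutoff, not a universal constant.
    """
    scores = chunk_embs @ query_emb
    order = np.argsort(scores)[::-1][:top_k]
    if scores[order[0]] < min_score:
        return None   # likely retrieval failure: answer "not in the documents"
    return [chunks[i] for i in order]


# Toy example: two orthogonal "chunk" embeddings
chunks = ["photosynthesis chunk", "unrelated chunk"]
chunk_embs = np.array([[1.0, 0.0], [0.0, 1.0]])

on_topic = retrieve_with_guard(np.array([0.8, 0.6]), chunk_embs, chunks, top_k=1)
off_topic = retrieve_with_guard(np.array([-0.6, -0.8]), chunk_embs, chunks)
print(on_topic, off_topic)   # ['photosynthesis chunk'] None
```

When the guard returns None you can skip generation entirely and return the "not covered in the provided material" message, rather than handing the model weak context it may embellish.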

Practice questions

  1. What is the difference between naive RAG, advanced RAG, and modular RAG? (Answer: Naive RAG: query → embed query → retrieve top-k chunks → concatenate with prompt → generate. Simple but fragile. Advanced RAG: adds pre-retrieval (query rewriting, HyDE) and post-retrieval (re-ranking, compression) steps. Modular RAG: loosely coupled pipeline with interchangeable components — different retrievers, rankers, generators, memory modules can be swapped per use case. Modular RAG is the current production standard: enables A/B testing components, graceful degradation, and specialized modules for different query types.)
  2. What is HyDE (Hypothetical Document Embeddings) and when should you use it? (Answer: HyDE: instead of embedding the user query directly, prompt an LLM to generate a hypothetical answer to the query. Embed the hypothetical answer and use it as the search query. Rationale: the answer embedding is closer in embedding space to actual answer documents than the query embedding (questions and answers have different linguistic patterns). Use when: queries are short and ambiguous (the LLM expansion adds context), domain vocabulary differs between queries and documents, or retrieval quality is poor with direct query embedding.)
  3. What is the context window stuffing problem in RAG and how do re-ranking and compression address it? (Answer: Naive top-k retrieval may return: (1) Redundant chunks covering the same information. (2) Low-relevance chunks that merely contain query keywords. (3) More content than the context window can hold. Re-ranking (Cohere Rerank, BGE reranker): use a cross-encoder to score (query, chunk) relevance — more accurate than embedding similarity alone. Re-rank and select top-5 from top-50. Compression (LLMLingua, RECOMP): use an LLM to extract only the most relevant sentences from retrieved chunks — reducing token count by 2–5× before insertion.)
  4. What is the 'lost in the middle' problem for RAG and how do you mitigate it? (Answer: LLMs perform better when relevant context appears at the beginning or end of the prompt rather than the middle (Liu et al. 2023). For RAG with 10 retrieved chunks, the most relevant chunk should be first or last, not middle. Mitigation: (1) Re-rank by relevance and position most relevant chunk first. (2) Reverse order (most relevant last, just before the query). (3) Reduce number of retrieved chunks (fewer chunks = less middle). (4) Use models trained specifically for long-context RAG.)
  5. What is the difference between dense retrieval, sparse retrieval, and hybrid retrieval in RAG systems? (Answer: Sparse (BM25/TF-IDF): keyword matching, handles exact terms well, interpretable, fast. Fails on semantic synonyms. Dense (bi-encoder): embed query and documents, retrieve by cosine similarity. Handles semantic similarity but may miss exact matches. Hybrid (Reciprocal Rank Fusion): combine sparse and dense retrieval ranking lists. Example: BM25 rank + FAISS rank → RRF combined rank. Best of both: handles exact terms (BM25) AND semantic similarity (dense). Weaviate, Qdrant, OpenSearch all support hybrid search natively.)
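The "lost in the middle" mitigation from question 4 can be sketched concretely: given chunks sorted best-first, alternate them between the front and back of the context so the weakest chunks land in the middle. A minimal sketch (the function name is mine; a similar transform ships in LangChain as LongContextReorder):

```python
def reorder_for_long_context(chunks_by_relevance):
    """Place the most relevant chunks at the ends of the list,
    pushing the least relevant toward the middle.

    Input is sorted best-first; output alternates ends:
    [1st, 3rd, 5th, ..., 6th, 4th, 2nd].
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)   # 1st, 3rd, 5th ... go to the front
        else:
            back.append(chunk)    # 2nd, 4th, 6th ... go to the back
    return front + back[::-1]     # reverse so the 2nd-best ends up last


print(reorder_for_long_context(["A", "B", "C", "D", "E"]))
# ['A', 'C', 'E', 'D', 'B']  -- best chunk first, second-best last
```

The weakest chunks sit where attention degrades most, and the two strongest occupy the positions the model attends to best.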

RAG vs fine-tuning: the definitive decision framework

The single most asked question in LLM deployment: should I use RAG or fine-tune my model? The answer depends on what kind of knowledge your application needs.

| Dimension | RAG | Fine-tuning |
|---|---|---|
| Knowledge type | Factual, document-grounded, frequently updated | Style, format, behavior, domain reasoning patterns |
| Update frequency | Update at any time — add docs to the vector DB | Requires a retraining run — expensive to update frequently |
| Cost to set up | Low–medium ($50–$500 for most applications) | Medium–high ($200–$5,000 depending on model and dataset size) |
| Hallucination risk | Low when retrieval works correctly | Same as base model — fine-tuning does not reduce hallucination |
| Source attribution | Native — cite exact chunks/pages | Impossible — knowledge baked into weights |
| Private document Q&A | ✅ Perfect fit | ❌ Wrong tool — you'd need to retrain for every doc update |
| Change model tone/persona | ⚠️ Possible via system prompt; less stable | ✅ Perfect fit — persists across all conversations |
| Domain-specific reasoning | ⚠️ Works if docs are good quality | ✅ Better for tasks requiring learned reasoning patterns |
| Recent events / news | ✅ Just add to the knowledge base | ❌ Stale after training cutoff |

The right answer for most teams

Start with RAG. Fine-tuning is almost never the right first step for document-grounded applications. RAG is faster to iterate, cheaper to update, and naturally auditable. Fine-tune only when you have a clear, stable behavioral requirement (consistent output format, specific domain reasoning style) that system prompting alone cannot reliably achieve after exhaustive iteration. When in doubt: RAG first, fine-tune later if needed.

| Scenario | Recommendation | Rationale |
|---|---|---|
| Internal knowledge base Q&A (Confluence, Notion, Docs) | RAG | Documents change; attribution needed; straightforward retrieval |
| Customer support bot that quotes policy | RAG | Policy updates frequently; accuracy requires sourcing |
| Medical/legal document Q&A | RAG + human review | Hallucination is unacceptable; source citation mandatory |
| Code completion in a specific style | Fine-tune (LoRA) | Style is behavioral, not factual; RAG doesn't help with style |
| SQL-to-English report generation | Fine-tune (LoRA) | Structured output format; consistent reasoning pattern |
| Multilingual customer support | RAG + fine-tune | Factual answers via RAG; tone/format consistency via fine-tuning |
| Real-time news summarization | RAG (with live search) | Knowledge changes hourly; fine-tuning can't keep up |

Production RAG: vector database comparison

| Database | Best for | Hosting | Hybrid search | Free tier | Notes |
|---|---|---|---|---|---|
| pgvector (PostgreSQL) | Existing Postgres users; <10M vectors | Self-hosted / Supabase / Neon | ✅ with pg_trgm | Yes (Supabase) | Simplest if you already use Postgres — no separate DB to manage |
| Pinecone | Managed, scalable, fast; teams who want zero infra | Managed cloud | ✅ | Yes (1 index) | Best developer experience; highest cost at scale |
| Qdrant | High performance; filtering + vectors combined | Self-hosted / cloud | ✅ (BM42) | Yes (cloud) | Best performance/cost for self-hosted; HNSW + payload filtering |
| Weaviate | Multi-modal; built-in vectorization; schema flexibility | Self-hosted / cloud | ✅ | Yes (cloud) | Native GraphQL; supports image + text vectors in same collection |
| Chroma | Local dev and prototyping | Embedded (local) / cloud | ⚠️ basic | Yes (embedded) | Easiest to get started; not recommended for production at scale |
| FAISS (Meta) | Maximum performance; total control | Self-hosted only | ❌ (pair with BM25) | Open-source | Industry standard for custom implementations; no managed offering |

On LumiChats

LumiChats Study Mode is built on a production RAG pipeline. Documents are chunked, embedded with text-embedding-3-large, and stored in pgvector. Every answer in Study Mode is retrieved from your specific document — cited by page number, never hallucinated from training data.
