Retrieval-Augmented Generation (RAG) is an AI architecture that enhances a language model's responses by first retrieving relevant information from a knowledge base, then generating a response grounded in that retrieved content. RAG mitigates LLMs' knowledge-cutoff and hallucination problems by giving models access to specific, up-to-date, or proprietary information at inference time.
The problem RAG solves
Standard LLMs answer from their parametric memory (weights trained on a fixed corpus with a cutoff date). They can't know about events after training, have no access to your private documents, and may confuse or misremember specific details.
Analogy
RAG is like allowing a student to reference their textbook during an exam, rather than relying purely on memory. The model still does the reasoning — but it's grounded in retrieved evidence, not confabulation.
- Knowledge cutoff: Base LLMs don't know events after their training cutoff.
- Private data: Your company's internal documents, your personal PDFs — none of this is in the training data.
- Hallucination: Without source grounding, models generate plausible-sounding but incorrect specific facts.
- Attribution: RAG lets the model cite exact sources (page numbers, documents).
The full RAG pipeline
RAG has two phases: offline indexing (one-time) and online query (per request):
Complete RAG pipeline — indexing phase
```python
from openai import OpenAI
import numpy as np
from typing import List, Dict

client = OpenAI()

# ══════════════════════════════════════════════════════════
# PHASE 1: INDEXING (runs once when you upload a document)
# ══════════════════════════════════════════════════════════

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> List[str]:
    """Split text into overlapping chunks."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for i in range(0, len(words), step):
        chunk = " ".join(words[i : i + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

def embed_texts(texts: List[str]) -> np.ndarray:
    """Embed a list of texts using OpenAI's embedding model."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    embeddings = [r.embedding for r in response.data]
    return np.array(embeddings)

# Your document (in practice, extracted from PDF)
document = """
Photosynthesis is the process by which plants convert sunlight into glucose.
It occurs in two stages: the light-dependent reactions in the thylakoids,
and the Calvin cycle in the stroma of the chloroplast.
The overall equation is: 6CO2 + 6H2O + light → C6H12O6 + 6O2.
...
"""

# Split into chunks and embed
chunks = chunk_text(document)
chunk_embeddings = embed_texts(chunks)  # shape: (n_chunks, 1536)

# Normalize for fast cosine similarity
norms = np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
chunk_embeddings_norm = chunk_embeddings / norms

# Store in a "vector database" (in-memory here; use pgvector/Pinecone in production)
index = {"chunks": chunks, "embeddings": chunk_embeddings_norm}
print(f"Indexed {len(chunks)} chunks")
```

Complete RAG pipeline — query phase
```python
# ══════════════════════════════════════════════════════════
# PHASE 2: QUERY (runs on every user question)
# ══════════════════════════════════════════════════════════

def retrieve(query: str, index: Dict, top_k: int = 3) -> List[str]:
    """Find the most relevant chunks for a query."""
    query_emb = embed_texts([query])[0]
    query_emb = query_emb / np.linalg.norm(query_emb)  # normalize
    # Cosine similarity = dot product with normalized vectors
    scores = index["embeddings"] @ query_emb  # (n_chunks,)
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [index["chunks"][i] for i in top_indices]

def rag_answer(question: str, index: Dict) -> str:
    """Retrieve relevant context, then generate a grounded answer."""
    # Step 1: Retrieve
    context_chunks = retrieve(question, index, top_k=3)
    context = "\n\n".join(f"[Chunk {i+1}]: {c}" for i, c in enumerate(context_chunks))

    # Step 2: Generate (grounded answer)
    system_prompt = """You are a precise AI assistant.
Answer ONLY from the provided context below.
If the answer is not in the context, say "I don't have that information in the provided documents."
Always cite which chunk your answer comes from."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=0  # deterministic for factual Q&A
    )
    return response.choices[0].message.content

answer = rag_answer("What is the equation for photosynthesis?", index)
print(answer)
# "According to Chunk 1, the equation for photosynthesis is:
#  6CO2 + 6H2O + light → C6H12O6 + 6O2"
```

Naive RAG vs Advanced RAG
Basic RAG (retrieve → generate) breaks down in real-world scenarios. Advanced RAG techniques address common failure modes:
| Problem | Technique | How it helps |
|---|---|---|
| Query ambiguous | Query rewriting / HyDE | Rewrite query or generate a hypothetical answer to embed, then retrieve |
| Multi-hop questions | Iterative retrieval | Retrieve → generate sub-answer → use sub-answer to retrieve more context |
| Irrelevant chunks retrieved | Re-ranking | Use a cross-encoder to re-rank top-k retrieved chunks by true relevance |
| Keyword terms missed | Hybrid search | Combine dense vector search + sparse BM25 keyword search |
| Large documents | Hierarchical indexing | Index summaries + full chunks; search summaries first for efficiency |
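The iterative-retrieval technique in the table can be sketched as a retrieve-generate loop in which each intermediate answer seeds the next search. This is a minimal sketch: the retriever and generator below are toy stand-ins (a keyword lookup and a string join), not a real vector store or LLM.

```python
from typing import Callable, List

def iterative_retrieve(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    max_hops: int = 3,
) -> List[str]:
    """Multi-hop retrieval: each intermediate answer seeds the next search."""
    context: List[str] = []
    query = question
    for _ in range(max_hops):
        new_chunks = [c for c in retrieve(query) if c not in context]
        if not new_chunks:
            break  # nothing new found, stop early
        context.extend(new_chunks)
        query = generate(question, context)  # sub-answer becomes next query
    return context

# Toy stand-ins for a vector-store retriever and an LLM
kb = {
    "capital of France": ["Paris is the capital of France."],
    "Paris": ["Paris hosted the 1924 Olympics."],
}

def toy_retrieve(query: str) -> List[str]:
    return [c for key, docs in kb.items() if key in query for c in docs]

def toy_generate(question: str, context: List[str]) -> str:
    return " ".join(context)  # a real LLM would write a sub-answer

collected = iterative_retrieve("What is the capital of France?", toy_retrieve, toy_generate)
```

Note how the second hop only succeeds because the first hop's answer mentions "Paris"; a single retrieval pass with the original question would never surface the Olympics chunk.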
Why RAG dramatically reduces hallucination
Most of the reduction comes from one key instruction in the system prompt:
System prompt that grounds the model in retrieved evidence
```
SYSTEM: You are a precise study assistant. Answer questions based ONLY on
the provided context passages below.

Rules:
1. If the answer is directly stated in the context, cite it with [Chunk N].
2. If the answer is implied but not directly stated, say "Based on the context..."
3. If the answer is NOT in the context, respond: "This isn't covered in the
   provided material. Try asking a more specific question or uploading a
   document that covers this topic."
4. NEVER use your general knowledge to fill gaps — only use the context.
```
This is critical for academic use: fabricating information could harm students.

RAG limitations
RAG is not foolproof. If the relevant chunk wasn't retrieved (retrieval failure), the model has no source to ground its answer and may hallucinate. This is why chunk size, overlap, and top-k tuning matter.
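One way to tune those knobs is to measure retrieval quality directly. A minimal recall@k sketch, run over a labeled query set after each configuration change; the chunk ids and gold labels below are made up for illustration:

```python
from typing import Dict, List, Set

def recall_at_k(
    retrieved: Dict[str, List[int]],  # query -> ranked chunk ids returned
    relevant: Dict[str, Set[int]],    # query -> gold relevant chunk ids
    k: int,
) -> float:
    """Fraction of queries whose top-k results contain at least one gold chunk."""
    hits = sum(1 for q, gold in relevant.items() if gold & set(retrieved[q][:k]))
    return hits / len(relevant)

# Hypothetical evaluation set: two queries with known relevant chunks
retrieved = {"q1": [3, 7, 1], "q2": [4, 2, 9]}
relevant = {"q1": {7}, "q2": {5}}

print(recall_at_k(retrieved, relevant, k=2))  # q1 hits (7 in top-2), q2 misses -> 0.5
```

If recall@k is low, no amount of prompt engineering downstream will fix the answers; adjust chunking or retrieval first.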
Practice questions
- What is the difference between naive RAG, advanced RAG, and modular RAG? (Answer: Naive RAG: query → embed query → retrieve top-k chunks → concatenate with prompt → generate. Simple but fragile. Advanced RAG: adds pre-retrieval (query rewriting, HyDE) and post-retrieval (re-ranking, compression) steps. Modular RAG: loosely coupled pipeline with interchangeable components — different retrievers, rankers, generators, memory modules can be swapped per use case. Modular RAG is the current production standard: enables A/B testing components, graceful degradation, and specialised modules for different query types.)
- What is HyDE (Hypothetical Document Embeddings) and when should you use it? (Answer: HyDE: instead of embedding the user query directly, prompt an LLM to generate a hypothetical answer to the query. Embed the hypothetical answer and use it as the search query. Rationale: the answer embedding is closer in embedding space to actual answer documents than the query embedding (questions and answers have different linguistic patterns). Use when: queries are short and ambiguous (the LLM expansion adds context), domain vocabulary differs between queries and documents, or retrieval quality is poor with direct query embedding.)
- What is the context window stuffing problem in RAG and how do re-ranking and compression address it? (Answer: Naive top-k retrieval may return: (1) Redundant chunks covering the same information. (2) Low-relevance chunks that merely contain query keywords. (3) More content than the context window can hold. Re-ranking (Cohere Rerank, BGE reranker): use a cross-encoder to score (query, chunk) relevance — more accurate than embedding similarity alone. Re-rank and select top-5 from top-50. Compression (LLMLingua, RECOMP): use an LLM to extract only the most relevant sentences from retrieved chunks — reducing token count by 2–5× before insertion.)
- What is the 'lost in the middle' problem for RAG and how do you mitigate it? (Answer: LLMs perform better when relevant context appears at the beginning or end of the prompt rather than the middle (Liu et al. 2023). For RAG with 10 retrieved chunks, the most relevant chunk should be first or last, not middle. Mitigation: (1) Re-rank by relevance and position most relevant chunk first. (2) Reverse order (most relevant last, just before the query). (3) Reduce number of retrieved chunks (fewer chunks = less middle). (4) Use models trained specifically for long-context RAG.)
- What is the difference between dense retrieval, sparse retrieval, and hybrid retrieval in RAG systems? (Answer: Sparse (BM25/TF-IDF): keyword matching, handles exact terms well, interpretable, fast. Fails on semantic synonyms. Dense (bi-encoder): embed query and documents, retrieve by cosine similarity. Handles semantic similarity but may miss exact matches. Hybrid (Reciprocal Rank Fusion): combine sparse and dense retrieval ranking lists. Example: BM25 rank + FAISS rank → RRF combined rank. Best of both: handles exact terms (BM25) AND semantic similarity (dense). Weaviate, Qdrant, OpenSearch all support hybrid search natively.)
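The modular-RAG idea from the first question can be sketched as swappable components behind small interfaces. The keyword retriever and echo generator here are illustrative stubs; production code would plug in a vector-store retriever and an LLM-backed generator.

```python
from typing import List, Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, top_k: int) -> List[str]: ...

class Generator(Protocol):
    def generate(self, question: str, context: List[str]) -> str: ...

class RAGPipeline:
    """Loosely coupled pipeline: any retriever/generator pair can be swapped in."""
    def __init__(self, retriever: Retriever, generator: Generator):
        self.retriever = retriever
        self.generator = generator

    def answer(self, question: str, top_k: int = 3) -> str:
        context = self.retriever.retrieve(question, top_k)
        return self.generator.generate(question, context)

class KeywordRetriever:
    """Stub: rank chunks by word overlap with the query."""
    def __init__(self, chunks: List[str]):
        self.chunks = chunks
    def retrieve(self, query: str, top_k: int) -> List[str]:
        words = set(query.lower().split())
        ranked = sorted(self.chunks, key=lambda c: -len(words & set(c.lower().split())))
        return ranked[:top_k]

class EchoGenerator:
    """Stub: echo the top chunk instead of calling an LLM."""
    def generate(self, question: str, context: List[str]) -> str:
        return f"Based on: {context[0]}"

pipeline = RAGPipeline(KeywordRetriever(["plants use sunlight", "cats sleep a lot"]), EchoGenerator())
```

Because the pipeline only depends on the two protocols, A/B testing a new retriever is a one-line change at construction time.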
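The HyDE flow can be sketched by injecting the LLM and embedder as callables. Both are toy stubs here (a canned hypothetical answer and a keyword-count embedder), so only the retrieval mechanics are real:

```python
from typing import Callable, List, Sequence
import numpy as np

def hyde_retrieve(
    query: str,
    generate_hypothetical: Callable[[str], str],  # an LLM call in practice
    embed: Callable[[str], np.ndarray],
    doc_embeddings: np.ndarray,  # (n_docs, dim), rows L2-normalized
    docs: Sequence[str],
    top_k: int = 3,
) -> List[str]:
    """HyDE: embed a hypothetical answer instead of the raw query."""
    hypothetical = generate_hypothetical(query)
    v = embed(hypothetical)
    v = v / np.linalg.norm(v)
    scores = doc_embeddings @ v
    return [docs[i] for i in np.argsort(scores)[::-1][:top_k]]

# Toy stand-ins: a canned "hypothetical answer" and a keyword-count embedder
def toy_generate(query: str) -> str:
    return "Photosynthesis is how plants make glucose from light."

def toy_embed(text: str) -> np.ndarray:
    t = text.lower()
    return np.array([float("photosynthesis" in t), float("cats" in t), 1.0])

docs = ["Photosynthesis converts light to glucose.", "Cats sleep all day."]
doc_matrix = np.stack([toy_embed(d) for d in docs])
doc_matrix = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)

top_docs = hyde_retrieve("how do plants make food?", toy_generate, toy_embed, doc_matrix, docs, top_k=1)
```

The query "how do plants make food?" shares no keywords with the target document, but the hypothetical answer does, which is exactly the gap HyDE closes.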
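The reordering mitigations for "lost in the middle" amount to a small permutation. One common sketch places the top-ranked chunks at both ends of the prompt and pushes the weakest into the middle:

```python
from typing import List

def order_for_long_context(chunks_by_relevance: List[str]) -> List[str]:
    """Place the most relevant chunks at the ends of the prompt, weakest in the middle.

    Input is sorted most-relevant-first; even ranks go to the front,
    odd ranks to the back (reversed), so ranks 1 and 2 land first and last.
    """
    front: List[str] = []
    back: List[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(order_for_long_context(["r1", "r2", "r3", "r4", "r5"]))
# -> ['r1', 'r3', 'r5', 'r4', 'r2']
```

Here the least relevant chunk (r5) ends up in the middle, where the model pays the least attention.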
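Reciprocal Rank Fusion, mentioned in the hybrid-retrieval answer, is itself only a few lines: each document scores 1/(k + rank) in every ranked list it appears in, with k = 60 the conventional default. The document ids below are made up for illustration:

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]   # sparse keyword results
dense_ranking = ["doc_c", "doc_a", "doc_d"]  # vector search results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

doc_a wins because it ranks highly in both lists, even though neither list puts it first by a large margin; that stability under disagreement is why RRF is the default fusion method in hybrid search.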
On LumiChats
LumiChats Study Mode is built on a production RAG pipeline. Documents are chunked, embedded with text-embedding-3-large, and stored in pgvector. Every answer in Study Mode is retrieved from your specific document — cited by page number, never hallucinated from training data.
Try it free