Question Answering (QA) is an NLP task where a system reads a context passage and produces a direct answer to a natural-language question. There are three main types: <strong>Extractive QA</strong> (span extraction from context — the answer is a substring of the passage), <strong>Generative QA</strong> (generates free-form answers), and <strong>Open-Domain QA</strong> (no given context — the system must retrieve relevant documents first, then answer). SQuAD (the Stanford Question Answering Dataset) is the benchmark that drove modern QA research, and RAG (Retrieval-Augmented Generation) is the modern production architecture.
Real-life analogy: The open-book vs closed-book exam
Extractive QA is like an open-book exam where you must find and quote the exact sentence from the textbook that answers the question. Generative QA is like explaining the answer in your own words. Open-domain QA is like a closed-book exam — you must recall (or retrieve) relevant knowledge first, then reason about it. LLMs like GPT-4 do a mix: they have knowledge memorised in weights, but RAG gives them an open book.
Extractive QA — span prediction with BERT
Extractive QA models predict two token positions in the context: the start and end of the answer span. BERT-based models fine-tuned on SQuAD achieve near-human F1 scores by leveraging bidirectional context.
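Under the hood, the model emits one start logit and one end logit per token, and decoding picks the valid span (start ≤ end, length capped) with the highest combined score. A minimal sketch of that decoding step with toy logits — the numbers are made up for illustration:

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=30):
    """Pick the highest-scoring valid span (start <= end, length capped)."""
    best_score, best = -np.inf, (0, 0)
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy logits over 6 context tokens: the model is most confident
# that the answer starts at token 2 and ends at token 3.
start_logits = np.array([0.1, 0.2, 5.0, 0.3, 0.1, 0.0])
end_logits   = np.array([0.0, 0.1, 0.2, 4.5, 0.3, 0.1])
print(best_span(start_logits, end_logits))  # (2, 3)
```

Production implementations vectorise this search and also compare the best span score against a "no answer" score (needed for SQuAD 2.0), but the core idea is exactly this argmax over start/end pairs.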
Extractive QA with Hugging Face
from transformers import pipeline
# RoBERTa fine-tuned on SQuAD 2.0
qa = pipeline("question-answering",
              model="deepset/roberta-base-squad2")
context = """
The transformer architecture was introduced in the paper "Attention Is All
You Need" by Vaswani et al. in 2017. It replaced recurrent neural networks
with a self-attention mechanism, enabling parallelisation and better
modelling of long-range dependencies. The encoder processes the input
sequence while the decoder generates the output sequence.
"""
questions = [
"Who introduced the transformer architecture?",
"What did transformers replace?",
"What year was the transformer introduced?",
"What does the encoder do?",
]
for q in questions:
    result = qa(question=q, context=context)
    print(f"Q: {q}")
    print(f"A: {result['answer']} (score: {result['score']:.2%})")
    print()
# Output:
# Q: Who introduced the transformer architecture?
# A: Vaswani et al. (score: 89.23%)
# Q: What year was the transformer introduced?
# A: 2017 (score: 96.41%)
# ...

Open-Domain QA and RAG
Open-domain QA requires retrieving relevant passages before answering — the system does not have a given context. The retrieval-augmented generation (RAG) pipeline:
- Query encoding: Convert the question to a dense vector using a bi-encoder (e.g., DPR — Dense Passage Retrieval).
- Retrieval: Search a vector database (FAISS, Pinecone, Chroma) for the top-k most similar document chunks using approximate nearest-neighbour search.
- Reading / Generation: Pass the retrieved chunks + question to a reader model (BERT for extractive, GPT/BART for generative) to produce the final answer.
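The three steps above can be sketched end-to-end with toy components. The `embed` function below is a hashed bag-of-words stand-in for a trained bi-encoder like DPR, and the final generation step is represented by the assembled prompt — a real pipeline would swap in a dense neural encoder, a vector database, and an LLM call:

```python
import numpy as np

def embed(text, dim=64):
    """Toy stand-in for a trained bi-encoder (e.g. DPR): hashes words
    into a normalised bag-of-words vector."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        word = word.strip(".,?!\"")
        vec[sum(ord(c) for c in word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# 1. Query encoding + indexing: embed each document chunk
#    (a vector DB like FAISS would store and search these)
chunks = [
    "The transformer was introduced in 2017 by Vaswani et al.",
    "Paris is the capital of France.",
    "FAISS is a library for efficient similarity search.",
]
index = np.stack([embed(c) for c in chunks])

# 2. Retrieval: top-k chunks by dot-product similarity to the query
query = "Who introduced the transformer?"
scores = index @ embed(query)
top_k = np.argsort(scores)[::-1][:2]

# 3. Reading / generation: pass retrieved chunks + question to the reader
prompt = "Answer using only the context below.\n"
prompt += "\n".join(chunks[i] for i in top_k)
prompt += f"\nQuestion: {query}"
print(chunks[top_k[0]])  # the most relevant chunk
```

The key design point is that retrieval and generation are decoupled: you can improve answers by improving the retriever (better embeddings, better chunking) without touching the generator at all.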
| QA type | Context given? | Retrieval needed? | Answer type | Model |
|---|---|---|---|---|
| Extractive | Yes | No | Span from context | BERT-SQuAD, RoBERTa |
| Abstractive | Yes | No | Free-form generated | T5, BART, GPT-4 |
| Open-Domain (RAG) | No (retrieved) | Yes | Free-form generated | DPR + GPT-4, Llama |
| Closed-Book | No | No (LLM memory) | Free-form (may hallucinate) | GPT-4, Claude, Gemini |
SQuAD and SQuAD 2.0
SQuAD (Stanford QA Dataset) has 100k+ Q&A pairs from Wikipedia. SQuAD 2.0 added 50k unanswerable questions (the answer is not in the passage) — models must also learn to say "I don't know" instead of always extracting a span. This tests reading comprehension more rigorously. EM (Exact Match) and F1 over answer tokens are the standard metrics.
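EM and F1 are simple to compute. A sketch of SQuAD-style scoring — the official evaluation script follows the same normalisation (lowercase, strip punctuation and articles) but differs in some details, such as averaging over multiple gold answers:

```python
import re
import string
from collections import Counter

def normalize(s):
    """SQuAD-style normalisation: lowercase, drop punctuation and articles."""
    s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred, gold):
    """1 if the normalised strings are identical, else 0 — no partial credit."""
    return int(normalize(pred) == normalize(gold))

def f1(pred, gold):
    """Token-level F1: rewards partial overlap between prediction and gold."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Vaswani et al.", "Vaswani et al"))  # 1 (normalisation)
print(round(f1("in the year 2017", "2017"), 2))            # 0.5 (partial credit)
```

Note how normalisation makes EM forgiving about articles and punctuation but nothing else, while F1 gives the verbose prediction half credit for containing the gold answer.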
Practice questions
- What are the two positions that an extractive QA model predicts? (Answer: Start token index and end token index of the answer span within the context passage.)
- Why does RAG reduce hallucination compared to closed-book LLM QA? (Answer: RAG grounds the answer in retrieved documents — the model is conditioned on actual retrieved text, not solely on memorised training weights that may be outdated or incorrect.)
- What does EM (Exact Match) measure in QA evaluation? (Answer: The percentage of predictions that exactly match the ground-truth answer string after normalisation (lowercasing, removing punctuation and articles). It is a strict metric with no partial credit; token-level F1 complements it by rewarding overlap.)
- DPR (Dense Passage Retrieval) uses a bi-encoder. What are the two encoders? (Answer: A question encoder and a passage encoder. Both trained so that relevant question-passage pairs have high dot-product similarity in embedding space.)
- What makes SQuAD 2.0 harder than SQuAD 1.1? (Answer: SQuAD 2.0 includes unanswerable questions. Models must detect when no answer exists in the context instead of always extracting a span — requires reasoning about absence of evidence.)
On LumiChats
LumiChats uses a RAG pipeline for document QA: paste a PDF or document, and the system retrieves the most relevant chunks and generates a grounded answer with citations. This is extractive + generative QA in production.
Try it free