
Question Answering Systems in NLP

Building systems that read a passage and answer questions about it.


Definition

Question Answering (QA) is an NLP task in which a system reads a context passage and produces a direct answer to a natural-language question. The main variants are Extractive QA (span extraction — the answer is a verbatim substring of the passage), Generative (abstractive) QA (the answer is generated free-form), and Open-Domain QA (no context is supplied — the system must first retrieve relevant documents, then answer). SQuAD (the Stanford Question Answering Dataset) is the benchmark that drove modern QA research, and RAG (Retrieval-Augmented Generation) is the dominant production architecture today.

Real-life analogy: The open-book vs closed-book exam

Extractive QA is like an open-book exam where you must find and quote the exact sentence from the textbook that answers the question. Generative QA is like explaining the answer in your own words. Open-domain QA is like a closed-book exam — you must recall (or retrieve) relevant knowledge first, then reason about it. LLMs like GPT-4 do a mix: they have knowledge memorised in weights, but RAG gives them an open book.

Extractive QA — span prediction with BERT

Extractive QA models predict two token positions in the context: the start and end of the answer span. BERT-based models fine-tuned on SQuAD achieve near-human F1 scores by leveraging bidirectional context.

Extractive QA with Hugging Face

from transformers import pipeline

# RoBERTa fine-tuned on SQuAD 2.0
qa = pipeline("question-answering",
    model="deepset/roberta-base-squad2")

context = """
The transformer architecture was introduced in the paper "Attention Is All
You Need" by Vaswani et al. in 2017. It replaced recurrent neural networks
with a self-attention mechanism, enabling parallelisation and better
modelling of long-range dependencies. The encoder processes the input
sequence while the decoder generates the output sequence.
"""

questions = [
    "Who introduced the transformer architecture?",
    "What did transformers replace?",
    "What year was the transformer introduced?",
    "What does the encoder do?",
]

for q in questions:
    result = qa(question=q, context=context)
    print(f"Q: {q}")
    print(f"A: {result['answer']} (score: {result['score']:.2%})")
    print()

# Sample output (abridged; exact scores vary by model version):
# Q: Who introduced the transformer architecture?
# A: Vaswani et al. (score: 89.23%)
# Q: What year was the transformer introduced?
# A: 2017 (score: 96.41%)

Open-Domain QA and RAG

Open-domain QA requires retrieving relevant passages before answering — the system does not have a given context. The retrieval-augmented generation (RAG) pipeline:

  1. Query encoding: Convert the question to a dense vector using a bi-encoder (e.g., DPR — Dense Passage Retrieval).
  2. Retrieval: Search a vector database (FAISS, Pinecone, Chroma) for the top-k most similar document chunks using approximate nearest-neighbour search.
  3. Reading / Generation: Pass the retrieved chunks + question to a reader model (BERT for extractive, GPT/BART for generative) to produce the final answer.
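The three steps above can be sketched end-to-end. This toy version stands in a bag-of-words counter for the bi-encoder and an exhaustive cosine-similarity scan for the vector index; the corpus and the `embed`/`retrieve` names are illustrative, not a real library API:

```python
from collections import Counter
import math

# Toy corpus standing in for a chunked document store.
CHUNKS = [
    "The transformer was introduced by Vaswani et al. in 2017.",
    "FAISS is a library for efficient similarity search over dense vectors.",
    "SQuAD contains question-answer pairs built from Wikipedia articles.",
]

def embed(text):
    """Step 1 stand-in: a bag-of-words 'embedding'.
    A real system would use a trained bi-encoder such as DPR."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, k=1):
    """Step 2 stand-in: exhaustive search; production systems use
    approximate nearest-neighbour indexes (FAISS, Pinecone, Chroma)."""
    q = embed(question)
    return sorted(CHUNKS, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

question = "Who introduced the transformer?"
context = retrieve(question)
# Step 3: in a real RAG pipeline, context + question go to a reader/generator LLM.
print(context[0])
```

Swapping in real components means replacing `embed` with a DPR-style encoder and `retrieve` with an ANN index lookup; the control flow stays the same.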
| QA type           | Context given? | Retrieval needed? | Answer type                 | Model                 |
|-------------------|----------------|-------------------|-----------------------------|-----------------------|
| Extractive        | Yes            | No                | Span from context           | BERT-SQuAD, RoBERTa   |
| Abstractive       | Yes            | No                | Free-form generated         | T5, BART, GPT-4       |
| Open-Domain (RAG) | No (retrieved) | Yes               | Free-form generated         | DPR + GPT-4, Llama    |
| Closed-Book       | No             | No (LLM memory)   | Free-form (may hallucinate) | GPT-4, Claude, Gemini |

SQuAD and SQuAD 2.0

SQuAD (the Stanford Question Answering Dataset) contains 100k+ question-answer pairs drawn from Wikipedia articles. SQuAD 2.0 added roughly 50k unanswerable questions (the answer is not in the passage), so models must also learn to say "I don't know" instead of always extracting a span — a more rigorous test of reading comprehension. The standard metrics are EM (Exact Match) and token-level F1 over the answer.
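Both metrics are simple to implement. The sketch below follows the usual SQuAD-style normalisation (lowercase, strip punctuation and articles, collapse whitespace); the function names are illustrative, not the official evaluation script:

```python
import re
import string
from collections import Counter

def normalize(text):
    """SQuAD-style answer normalisation."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop articles
    return " ".join(text.split())

def exact_match(prediction, truth):
    """1.0 if the normalised strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    """Token-level F1: harmonic mean of precision and recall over tokens."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Vaswani et al.", "vaswani et al"))           # 1.0
print(f1_score("the transformer model", "transformer architecture"))  # 0.5
```

Note how EM gives no partial credit, while F1 rewards the one-token overlap in the second example.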

Practice questions

  1. What are the two output tokens that an extractive QA model predicts? (Answer: Start token index and end token index of the answer span within the context passage.)
  2. Why does RAG reduce hallucination compared to closed-book LLM QA? (Answer: RAG grounds the answer in retrieved documents — the model is conditioned on actual retrieved text, not solely on memorised training weights that may be outdated or incorrect.)
  3. What does EM (Exact Match) measure in QA evaluation? (Answer: The percentage of predictions that exactly match the ground truth answer string after normalisation (lowercase, remove punctuation). Strict metric — partial credit is given by token-level F1.)
  4. DPR (Dense Passage Retrieval) uses a bi-encoder. What are the two encoders? (Answer: A question encoder and a passage encoder. Both trained so that relevant question-passage pairs have high dot-product similarity in embedding space.)
  5. What makes SQuAD 2.0 harder than SQuAD 1.1? (Answer: SQuAD 2.0 includes unanswerable questions. Models must detect when no answer exists in the context instead of always extracting a span — requires reasoning about absence of evidence.)
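Question 1 can be made concrete: an extractive reader emits a start logit and an end logit for every context token, and decoding picks the highest-scoring valid span (start before end, bounded length). A minimal sketch with toy logits (the values are invented for illustration):

```python
def best_span(start_logits, end_logits, max_len=10):
    """Pick (start, end) maximising start_logits[s] + end_logits[e]
    subject to s <= e < s + max_len."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["the", "paper", "appeared", "in", "2017", "."]
start_logits = [0.1, 0.2, 0.0, 0.3, 4.0, 0.1]  # toy scores
end_logits   = [0.0, 0.1, 0.2, 0.1, 3.5, 0.2]
s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # → 2017
```

The start/end constraint is why independent argmaxes over the two logit vectors are not enough: the best end can fall before the best start, so valid pairs must be scored jointly.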

On LumiChats

LumiChats uses a RAG pipeline for document QA: paste a PDF or document, and the system retrieves the most relevant chunks and generates a grounded answer with citations. This is extractive + generative QA in production.

