NLP pretraining trains a large model on massive unlabelled text (language modelling) to learn general language representations. Fine-tuning adapts the pretrained model to a specific task with labelled data — much faster and with far less data than training from scratch. This transfer learning paradigm — pretrain on billions of tokens, fine-tune on thousands — is the foundation of modern NLP. BERT, GPT, T5, and all modern LLMs follow this paradigm. Efficient fine-tuning methods (LoRA, Prefix Tuning, Adapters) update only a fraction of parameters, making large model adaptation practical.
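The "unlabelled" part is what makes pretraining scale: the language-modelling objective manufactures its own labels from raw text. A minimal sketch of the idea (plain Python with toy whitespace tokenisation — real models use subword tokenisers and neural networks, not lists):

```python
# Self-supervised next-token prediction: every position in raw text
# yields a (context, target) training pair — no human annotation needed.
text = "pretraining learns general language representations from raw text"
tokens = text.split()  # toy tokenisation; real models use subword vocabularies

pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs[:3]:
    print(context, "->", target)

# One sentence already yields len(tokens) - 1 supervised examples;
# billions of tokens yield billions of examples "for free".
```

This is why pretraining needs no labelled data: the text itself supplies the targets, and fine-tuning only has to add the small amount of task-specific supervision on top.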
Real-life analogy: The expert generalist
Pretraining is like a person spending 10 years reading everything — science, literature, history, law, medicine. They become a generalist expert with broad world knowledge and language mastery. Fine-tuning is like that person spending 2 weeks studying specifically for a medical board exam. They do not relearn language — they apply their existing knowledge to the specific domain. Without the 10-year foundation, 2 weeks would not be enough. Without the 2-week specialisation, broad knowledge alone is not focused enough to pass.
The pretraining → fine-tuning pipeline
Fine-tuning BERT for text classification
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
# ── STEP 1: Load pretrained BERT ──
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # add a classification head with 2 classes (positive/negative)
)
# BERT-base has 110M parameters
# Fine-tuning adds: 768 → 2 linear layer (1,538 new params)
# Total trainable: 110M (all parameters updated with small learning rate)
# ── STEP 2: Load and tokenise dataset ──
dataset = load_dataset('sst2')  # Stanford Sentiment Treebank (67k train, 872 validation)

def tokenise_batch(batch):
    return tokenizer(batch['sentence'], truncation=True,
                     padding='max_length', max_length=128)

dataset = dataset.map(tokenise_batch, batched=True)
# ── STEP 3: Training configuration ──
training_args = TrainingArguments(
    output_dir='./bert_sst2',
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,            # CRITICAL: small LR for fine-tuning (1e-5 to 5e-5)
    weight_decay=0.01,
    warmup_ratio=0.1,              # 10% of steps as LR warmup
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='eval_accuracy',
    report_to='none',
)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {'accuracy': accuracy_score(labels, preds),
            'f1': f1_score(labels, preds)}

trainer = Trainer(model=model, args=training_args,
                  train_dataset=dataset['train'],
                  eval_dataset=dataset['validation'],
                  compute_metrics=compute_metrics)
# Fine-tune: 3 epochs on 67k examples takes ~20 min on a single GPU
# Result: ~92-93% accuracy — a model trained from scratch would need far more labelled data to match this
print("Fine-tuning BERT on SST-2...")
# trainer.train() # Uncomment to actually train
# ── STEP 4: Inference ──
from transformers import pipeline
# After training, persist the model and tokenizer so the pipeline can load them:
# trainer.save_model('./bert_sst2'); tokenizer.save_pretrained('./bert_sst2')
clf = pipeline('text-classification', model='./bert_sst2')
result = clf("This movie was absolutely fantastic!")
print(f"Prediction: {result}")

Parameter-efficient fine-tuning (PEFT)
Full fine-tuning updates every parameter — for models of 7B to 175B parameters this is computationally expensive, and it produces a full model copy per task. PEFT methods instead update only a tiny fraction of parameters while freezing the rest.
LoRA fine-tuning — adapting a 7B model with 0.1% of parameters
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load the base model — a 1.3B-parameter model here for demo; the same recipe scales to 7B+
model_name = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)
# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # rank: smaller = fewer params, less capacity
    lora_alpha=32,                        # scaling factor: the update is scaled by alpha/r
    target_modules=["q_proj", "v_proj"],  # which projections get LoRA matrices
    lora_dropout=0.1,
    bias="none",
)
# Wrap model with LoRA — adds trainable low-rank matrices to attention layers
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
# Prints the trainable-parameter count — for this config only a fraction of a
# percent of the ~1.3B total is trainable; every other weight stays frozen.
# LoRA math: W' = W + BA where B∈R^(d×r), A∈R^(r×k), rank r << min(d,k)
# Original W: d×k matrix (frozen)
# LoRA adds: B×A decomposition with rank r=16 (much smaller)
# During training: only A and B receive gradients (W stays frozen)
# At inference: the update can be merged once — W' = W + (alpha/r)×BA — so there is no extra latency
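The shape and parameter arithmetic in the comments above can be checked numerically. A NumPy sketch with toy dimensions d = k = 768 and r = 16 (this illustrates the maths, not the PEFT library's internals):

```python
import numpy as np

d, k, r, alpha = 768, 768, 16, 32
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen pretrained weight, d x k
A = rng.standard_normal((r, k)) * 0.01   # trainable, r x k
B = np.zeros((d, r))                     # trainable, d x r (zero init => W' = W at start)

W_merged = W + (alpha / r) * (B @ A)     # merged weight used at inference

lora_params = A.size + B.size            # 12,288 + 12,288 = 24,576
full_params = W.size                     # 589,824
print(f"LoRA params: {lora_params:,} vs full: {full_params:,} "
      f"({full_params // lora_params}x fewer)")
```

Note the zero initialisation of B: it guarantees the model starts training exactly at the pretrained weights, since B @ A is then the zero matrix.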
# Comparison of PEFT methods
print("\nPEFT method comparison:")
methods = {
    'Full fine-tuning': ('100% params updated', 'Best quality', 'Highest cost, full model copy per task'),
    'LoRA (r=16)': ('~0.1-1% params', 'Near-full quality', 'Fast, cheap, merge back to original'),
    'Prefix Tuning': ('~0.1% params (prefix)', 'Good for generation', 'Adds soft prompts to each layer'),
    'Adapter': ('~1-3% params', 'Good quality', 'Adds bottleneck layers between existing'),
    'Prompt Tuning': ('<0.01% params', 'Sufficient', 'Only trains input prompt embeddings'),
}
for method, (params, quality, notes) in methods.items():
    print(f"  {method:<20}: {params:<25} | {quality:<25} | {notes}")

| Approach | Labelled data needed | Training time | Performance vs full FT |
|---|---|---|---|
| Full fine-tuning | 1k-100k examples | Hours-days on GPU | 100% (baseline) |
| LoRA fine-tuning | 100-10k examples | 1-4 hours on GPU | 95-99% |
| Adapter tuning | 100-10k examples | 1-4 hours on GPU | 93-98% |
| Few-shot prompting (0 FT) | 0-10 examples in prompt | 0 (inference only) | 70-90% on easy tasks |
| Zero-shot prompting | 0 examples | 0 (inference only) | 60-85% task-dependent |
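The two prompting rows in the table need no training at all: the "examples" live in the input string, and the model's weights never change. A minimal sketch of how a few-shot sentiment prompt is assembled (the prompt wording and example reviews are illustrative):

```python
# Few-shot prompting: the task is specified entirely in the prompt;
# a frozen LLM continues the pattern it sees.
examples = [
    ("This movie was absolutely fantastic!", "positive"),
    ("A dull, lifeless script.", "negative"),
]
query = "I could not stop smiling the whole time."

prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)
```

The assembled string is sent to the model as-is; the completion after the final "Sentiment:" is the prediction. This is why the table lists training time as "0 (inference only)".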
Practice questions
- Why use a small learning rate (2e-5) for fine-tuning BERT instead of a larger rate? (Answer: BERT's pretrained weights encode general language knowledge learned over billions of tokens. A large LR would catastrophically forget this knowledge (catastrophic forgetting). Small LR makes small adjustments to specialise the model — preserving general knowledge while adapting to the task.)
- LoRA adds matrices B and A to weight matrix W with rank r=16. How many parameters does this add for a 768×768 attention matrix? (Answer: A: r×k = 16×768 = 12,288. B: d×r = 768×16 = 12,288. Total LoRA params: 24,576 vs original 768×768 = 589,824. Reduction: ~24x fewer parameters for this layer while preserving most capacity.)
- What is catastrophic forgetting and how does fine-tuning handle it? (Answer: When a neural network is trained on task B, it "forgets" task A because weights are overwritten. Fine-tuning manages this with: (1) Small learning rate. (2) Short training (few epochs). (3) Regularisation towards original weights. PEFT methods inherently prevent it by freezing most original weights.)
- You have 500 labelled examples for medical entity extraction. Should you fine-tune BERT or train a BiLSTM-CRF from scratch? (Answer: Fine-tune BERT — with 500 examples, a BiLSTM from scratch would massively overfit (millions of params, few examples). BERT starts with rich language representations from pretraining; fine-tuning with 500 examples adapts the existing knowledge. Standard practice: fine-tuning with 100-1000 examples typically outperforms training from scratch with 10x more data.)
- What is the difference between task-specific fine-tuning and instruction fine-tuning? (Answer: Task-specific: fine-tune on labelled (input, output) examples for ONE task (e.g., sentiment classification). Produces a narrow specialist model. Instruction fine-tuning (FLAN, InstructGPT): fine-tune on hundreds of tasks described in natural language instructions — produces a general-purpose assistant that can follow arbitrary new instructions. ChatGPT and Claude use instruction fine-tuning + RLHF.)
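The third remedy in the catastrophic-forgetting answer — regularising towards the original weights — is just an extra penalty term added to the task loss. A toy NumPy sketch of the idea (toy 3-dimensional weights, an illustrative quadratic task loss, and an arbitrary λ — not any specific library's implementation):

```python
import numpy as np

theta0 = np.array([1.0, -2.0, 0.5])   # pretrained weights (frozen reference copy)
theta = theta0.copy()
lam, lr = 0.1, 0.05

def task_grad(theta):
    # gradient of a toy task loss 0.5 * ||theta - target||^2
    return theta - np.array([2.0, 0.0, 0.0])

for _ in range(200):
    # total gradient = task gradient + gradient of lam * ||theta - theta0||^2
    grad = task_grad(theta) + 2 * lam * (theta - theta0)
    theta -= lr * grad

print(theta)  # settles between the task optimum and the pretrained weights
```

The penalty pulls the solution back towards θ₀, so the model adapts to the new task without drifting arbitrarily far from its pretrained knowledge; larger λ means stronger retention and weaker adaptation.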
On LumiChats
LumiChats is built on a pretrained language model fine-tuned with RLHF on human feedback. Understanding pretraining vs fine-tuning explains why LumiChats knows general world knowledge (from pretraining on internet text) but also follows instructions and is helpful (from instruction fine-tuning and RLHF).