
Pretraining, Fine-Tuning & Transfer Learning in NLP

How general language knowledge becomes task-specific expertise.


Definition

NLP pretraining trains a large model on massive amounts of unlabelled text (via language modelling) to learn general language representations. Fine-tuning then adapts the pretrained model to a specific task with labelled data — much faster, and with far less data, than training from scratch. This transfer learning paradigm — pretrain on billions of tokens, fine-tune on thousands of examples — is the foundation of modern NLP: BERT, GPT, T5, and today's LLMs all follow it. Parameter-efficient fine-tuning methods (LoRA, Prefix Tuning, Adapters) update only a small fraction of parameters, making adaptation of large models practical.

Real-life analogy: The expert generalist

Pretraining is like a person spending 10 years reading everything — science, literature, history, law, medicine. They become a generalist expert with broad world knowledge and language mastery. Fine-tuning is like that person spending 2 weeks studying specifically for a medical board exam. They do not relearn language — they apply their existing knowledge to the specific domain. Without the 10-year foundation, 2 weeks would not be enough. Without the 2-week specialisation, broad knowledge alone is not focused enough to pass.

The pretraining → fine-tuning pipeline

Fine-tuning BERT for text classification

from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# ── STEP 1: Load pretrained BERT ──
model_name = 'bert-base-uncased'
tokenizer  = BertTokenizer.from_pretrained(model_name)
model      = BertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2   # Add classification head: 2 classes (positive/negative)
)

# BERT-base has 110M parameters
# Fine-tuning adds: 768 → 2 linear layer (1,538 new params)
# Total trainable: 110M (all parameters updated with small learning rate)

# ── STEP 2: Load and tokenise dataset ──
dataset = load_dataset('sst2')   # Stanford Sentiment Treebank (67k train, 872 val)

def tokenise_batch(batch):
    return tokenizer(batch['sentence'], truncation=True,
                     padding='max_length', max_length=128)
dataset = dataset.map(tokenise_batch, batched=True)

# ── STEP 3: Training configuration ──
training_args = TrainingArguments(
    output_dir='./bert_sst2',
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,       # CRITICAL: Small LR for fine-tuning (1e-5 to 5e-5)
    weight_decay=0.01,
    warmup_ratio=0.1,          # 10% warmup steps
    evaluation_strategy='epoch',   # renamed to eval_strategy in newer transformers
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='eval_accuracy',
    report_to='none',
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {'accuracy': accuracy_score(labels, preds),
            'f1': f1_score(labels, preds)}

trainer = Trainer(model=model, args=training_args,
                  train_dataset=dataset['train'],
                  eval_dataset=dataset['validation'],
                  compute_metrics=compute_metrics)

# Fine-tune: 3 epochs on 67k examples takes ~20 min on GPU
# Result: ~92-93% accuracy (training from scratch would need far more data to match)
print("Fine-tuning BERT on SST-2...")
# trainer.train()  # Uncomment to actually train

# ── STEP 4: Inference ──
from transformers import pipeline
clf = pipeline('text-classification', model='./bert_sst2')  # After training
result = clf("This movie was absolutely fantastic!")
print(f"Prediction: {result}")

Parameter-efficient fine-tuning (PEFT)

Full fine-tuning updates all parameters — for modern LLMs that means 7B-175B weights, which is computationally expensive and produces a full model copy per task. PEFT methods update only a tiny fraction of parameters.

LoRA fine-tuning — adapting a large model with under 1% of its parameters

from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load a base model (OPT-1.3B here for demo; the same recipe scales to 7B+)
model_name = "facebook/opt-1.3b"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,              # Rank: smaller = fewer params, less capacity
    lora_alpha=32,     # Scaling factor: alpha/r controls effective LR
    target_modules=["q_proj", "v_proj"],   # Which layers to add LoRA to
    lora_dropout=0.1,
    bias="none",
)

# Wrap model with LoRA — adds trainable low-rank matrices to attention layers
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
# Prints trainable vs total parameter counts — with r=16 on q_proj/v_proj,
# typically well under 1% of parameters are trainable; all others stay frozen!

# LoRA math: W' = W + BA where B∈R^(d×r), A∈R^(r×k), rank r << min(d,k)
# Original W: d×k matrix (frozen)
# LoRA adds: B×A decomposition with rank r=16 (much smaller)
# During training: only A and B are updated
# During inference: merge back: W' = W + (alpha/r)×BA

# Comparison of PEFT methods
print("\nPEFT method comparison:")
methods = {
    'Full fine-tuning':    ('100% params updated', 'Best quality', 'Highest cost, full model copy per task'),
    'LoRA (r=16)':         ('~0.1-1% params',      'Near-full quality', 'Fast, cheap, merge back to original'),
    'Prefix Tuning':       ('~0.1% params (prefix)', 'Good for generation', 'Adds soft prompts to each layer'),
    'Adapter':             ('~1-3% params',          'Good quality',        'Adds bottleneck layers between existing ones'),
    'Prompt Tuning':       ('<0.01% params',          'Sufficient',          'Only trains input prompt embeddings'),
}
for method, (params, quality, notes) in methods.items():
    print(f"  {method:<20}: {params:<25} | {quality:<25} | {notes}")
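The LoRA parameter maths in the comments above can be checked directly. This is an illustrative sketch (the helper name is our own, not part of the peft library) that counts the parameters a rank-r decomposition adds to one d×k weight matrix:

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Parameters added by a rank-r LoRA pair B (d x r) and A (r x k)."""
    return d * r + r * k

d = k = 768   # e.g. one BERT-base attention projection
r = 16
added = lora_param_count(d, k, r)
full = d * k
print(f"LoRA adds {added:,} params vs {full:,} in the frozen matrix "
      f"({full / added:.0f}x fewer)")
# → LoRA adds 24,576 params vs 589,824 in the frozen matrix (24x fewer)
```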
Approach                    | Labelled data needed      | Training time          | Performance vs full FT
Full fine-tuning            | 1k-100k examples          | Hours-days on GPU      | 100% (baseline)
LoRA fine-tuning            | 100-10k examples          | 1-4 hours on GPU       | 95-99%
Adapter tuning              | 100-10k examples          | 1-4 hours on GPU       | 93-98%
Few-shot prompting (no FT)  | 0-10 examples in prompt   | None (inference only)  | 70-90% on easy tasks
Zero-shot prompting         | 0 examples                | None (inference only)  | 60-85%, task-dependent
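A key property behind LoRA's "fast, cheap, merge back" entry above is that the low-rank correction can be folded into the original weight at inference time with no behaviour change. A minimal NumPy sketch (toy dimensions, random matrices — purely illustrative) verifies that the training-time path W·x + (alpha/r)·B(A·x) equals the merged path (W + (alpha/r)·BA)·x:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 64, 4, 8
scale = alpha / r

W = rng.normal(size=(d, k))         # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01  # trainable low-rank factors
B = rng.normal(size=(d, r)) * 0.01  # (in real LoRA, B starts at zero)

x = rng.normal(size=(k,))

# Training-time path: base output plus scaled low-rank correction
y_unmerged = W @ x + scale * (B @ (A @ x))

# Inference-time path: fold the correction into one merged matrix
y_merged = (W + scale * (B @ A)) @ x

print(float(np.abs(y_unmerged - y_merged).max()))  # numerically ~0
```

Because the two paths are identical, a deployed model pays zero extra latency after merging, and the original W can be restored by subtracting the same term.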

Practice questions

  1. Why use a small learning rate (2e-5) for fine-tuning BERT instead of a larger rate? (Answer: BERT's pretrained weights encode general language knowledge learned over billions of tokens. A large LR would catastrophically forget this knowledge (catastrophic forgetting). Small LR makes small adjustments to specialise the model — preserving general knowledge while adapting to the task.)
  2. LoRA adds matrices B and A to weight matrix W with rank r=16. How many parameters does this add for a 768×768 attention matrix? (Answer: A: r×k = 16×768 = 12,288. B: d×r = 768×16 = 12,288. Total LoRA params: 24,576 vs original 768×768 = 589,824. Reduction: ~24x fewer parameters for this layer while preserving most capacity.)
  3. What is catastrophic forgetting and how does fine-tuning handle it? (Answer: When a neural network is trained on task B, it "forgets" task A because weights are overwritten. Fine-tuning manages this with: (1) Small learning rate. (2) Short training (few epochs). (3) Regularisation towards original weights. PEFT methods inherently prevent it by freezing most original weights.)
  4. You have 500 labelled examples for medical entity extraction. Should you fine-tune BERT or train a BiLSTM-CRF from scratch? (Answer: Fine-tune BERT — with 500 examples, a BiLSTM from scratch would massively overfit (millions of params, few examples). BERT starts with rich language representations from pretraining; fine-tuning with 500 examples adapts the existing knowledge. Standard practice: fine-tuning with 100-1000 examples typically outperforms training from scratch with 10x more data.)
  5. What is the difference between task-specific fine-tuning and instruction fine-tuning? (Answer: Task-specific: fine-tune on labelled (input, output) examples for ONE task (e.g., sentiment classification). Produces a narrow specialist model. Instruction fine-tuning (FLAN, InstructGPT): fine-tune on hundreds of tasks described in natural language instructions — produces a general-purpose assistant that can follow arbitrary new instructions. ChatGPT and Claude use instruction fine-tuning + RLHF.)
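The "regularisation towards original weights" mentioned in question 3 can be made concrete with a toy example. This sketch (our own illustration, not from any library) adds a penalty lam·||w − w0||² to a quadratic task loss, so gradient descent settles between the pretrained weights and the new task optimum instead of forgetting the former entirely:

```python
import numpy as np

def step(w, grad_task, w0, lr=0.01, lam=0.1):
    """One SGD step on the task loss plus a pull-back penalty lam * ||w - w0||^2."""
    return w - lr * (grad_task + 2 * lam * (w - w0))

w0 = np.array([1.0, -2.0, 0.5])      # "pretrained" weights
target = np.array([3.0, 0.0, 0.0])   # optimum of the toy downstream task
w = w0.copy()
for _ in range(500):
    grad_task = 2 * (w - target)      # gradient of ||w - target||^2
    w = step(w, grad_task, w0)

# w converges to (target + lam * w0) / (1 + lam): a compromise between the
# task optimum and the pretrained weights; larger lam keeps it closer to w0
print(w)
```

The same intuition applies at scale: a small learning rate and few epochs act as an implicit version of this pull-back, and PEFT methods enforce it exactly by freezing W outright.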

On LumiChats

LumiChats is built on a pretrained language model fine-tuned with RLHF on human feedback. Understanding pretraining vs fine-tuning explains why LumiChats knows general world knowledge (from pretraining on internet text) but also follows instructions and is helpful (from instruction fine-tuning and RLHF).
