NLP pretraining trains a large model on massive unlabelled text (language modelling) to learn general language representations. Fine-tuning adapts the pretrained model to a specific task with labelled data — much faster and with far less data than training from scratch. This transfer learning paradigm — pretrain on billions of tokens, fine-tune on thousands — is the foundation of modern NLP. BERT, GPT, T5, and all modern LLMs follow this paradigm. Efficient fine-tuning methods (LoRA, Prefix Tuning, Adapters) update only a fraction of parameters, making large model adaptation practical.
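The "unlabelled" part is what makes pretraining scale: the language-modelling objective manufactures its own labels from raw text. A minimal sketch of the idea (plain Python with toy whitespace tokenisation — real models use subword tokenisers and neural networks, not lists):

```python
# Self-supervised next-token prediction: every position in raw text
# yields a (context, target) training pair — no human annotation needed.
text = "pretraining learns general language representations from raw text"
tokens = text.split()  # toy tokenisation; real models use subword vocabularies

pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs[:3]:
    print(context, "->", target)

# One sentence already yields len(tokens) - 1 supervised examples;
# billions of tokens yield billions of examples "for free".
```

This is why pretraining needs no labelled data: the text itself supplies the targets, and fine-tuning only has to add the small amount of task-specific supervision on top.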
Real-life analogy: The expert generalist
Pretraining is like a person spending 10 years reading everything — science, literature, history, law, medicine. They become a generalist expert with broad world knowledge and language mastery. Fine-tuning is like that person spending 2 weeks studying specifically for a medical board exam. They do not relearn language — they apply their existing knowledge to the specific domain. Without the 10-year foundation, 2 weeks would not be enough. Without the 2-week specialisation, broad knowledge alone is not focused enough to pass.
The pretraining → fine-tuning pipeline
Fine-tuning BERT for text classification
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
# ── STEP 1: Load pretrained BERT ──
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # add a classification head with 2 classes (positive/negative)
)
# BERT-base has 110M parameters
# Fine-tuning adds: 768 → 2 linear layer (1,538 new params)
# Total trainable: 110M (all parameters updated with small learning rate)
# ── STEP 2: Load and tokenise dataset ──
dataset = load_dataset('sst2')  # Stanford Sentiment Treebank (67k train, 872 validation)

def tokenise_batch(batch):
    return tokenizer(batch['sentence'], truncation=True,
                     padding='max_length', max_length=128)

dataset = dataset.map(tokenise_batch, batched=True)
# ── STEP 3: Training configuration ──
training_args = TrainingArguments(
    output_dir='./bert_sst2',
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,            # CRITICAL: small LR for fine-tuning (1e-5 to 5e-5)
    weight_decay=0.01,
    warmup_ratio=0.1,              # 10% of steps as LR warmup
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='eval_accuracy',
    report_to='none',
)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {'accuracy': accuracy_score(labels, preds),
            'f1': f1_score(labels, preds)}

trainer = Trainer(model=model, args=training_args,
                  train_dataset=dataset['train'],
                  eval_dataset=dataset['validation'],
                  compute_metrics=compute_metrics)
# Fine-tune: 3 epochs on 67k examples takes ~20 min on a single GPU
# Result: ~92-93% accuracy — a model trained from scratch would need far more labelled data to match this
print("Fine-tuning BERT on SST-2...")
# trainer.train() # Uncomment to actually train
# ── STEP 4: Inference ──
from transformers import pipeline
# After training, persist the model and tokenizer so the pipeline can load them:
# trainer.save_model('./bert_sst2'); tokenizer.save_pretrained('./bert_sst2')
clf = pipeline('text-classification', model='./bert_sst2')
result = clf("This movie was absolutely fantastic!")
print(f"Prediction: {result}")

Parameter-efficient fine-tuning (PEFT)
Full fine-tuning updates every parameter — for models of 7B to 175B parameters this is computationally expensive, and it produces a full model copy per task. PEFT methods instead update only a tiny fraction of parameters while freezing the rest.
LoRA fine-tuning — adapting a 7B model with 0.1% of parameters
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load the base model — a 1.3B-parameter model here for demo; the same recipe scales to 7B+
model_name = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)
# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # rank: smaller = fewer params, less capacity
    lora_alpha=32,                        # scaling factor: the update is scaled by alpha/r
    target_modules=["q_proj", "v_proj"],  # which projections get LoRA matrices
    lora_dropout=0.1,
    bias="none",
)
# Wrap model with LoRA — adds trainable low-rank matrices to attention layers
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
# Prints the trainable-parameter count — for this config only a fraction of a
# percent of the ~1.3B total is trainable; every other weight stays frozen.
# LoRA math: W' = W + BA where B∈R^(d×r), A∈R^(r×k), rank r << min(d,k)
# Original W: d×k matrix (frozen)
# LoRA adds: B×A decomposition with rank r=16 (much smaller)
# During training: only A and B receive gradients (W stays frozen)
# At inference: the update can be merged once — W' = W + (alpha/r)×BA — so there is no extra latency
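The shape and parameter arithmetic in the comments above can be checked numerically. A NumPy sketch with toy dimensions d = k = 768 and r = 16 (this illustrates the maths, not the PEFT library's internals):

```python
import numpy as np

d, k, r, alpha = 768, 768, 16, 32
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen pretrained weight, d x k
A = rng.standard_normal((r, k)) * 0.01   # trainable, r x k
B = np.zeros((d, r))                     # trainable, d x r (zero init => W' = W at start)

W_merged = W + (alpha / r) * (B @ A)     # merged weight used at inference

lora_params = A.size + B.size            # 12,288 + 12,288 = 24,576
full_params = W.size                     # 589,824
print(f"LoRA params: {lora_params:,} vs full: {full_params:,} "
      f"({full_params // lora_params}x fewer)")
```

Note the zero initialisation of B: it guarantees the model starts training exactly at the pretrained weights, since B @ A is then the zero matrix.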
# Comparison of PEFT methods
print("\nPEFT method comparison:")
methods = {
    'Full fine-tuning': ('100% params updated', 'Best quality', 'Highest cost, full model copy per task'),
    'LoRA (r=16)': ('~0.1-1% params', 'Near-full quality', 'Fast, cheap, merge back to original'),
    'Prefix Tuning': ('~0.1% params (prefix)', 'Good for generation', 'Adds soft prompts to each layer'),
    'Adapter': ('~1-3% params', 'Good quality', 'Adds bottleneck layers between existing'),
    'Prompt Tuning': ('<0.01% params', 'Sufficient', 'Only trains input prompt embeddings'),
}
for method, (params, quality, notes) in methods.items():
    print(f"  {method:<20}: {params:<25} | {quality:<25} | {notes}")

| Approach | Labelled data needed | Training time | Performance vs full FT |
|---|---|---|---|
| Full fine-tuning | 1k-100k examples | Hours-days on GPU | 100% (baseline) |
| LoRA fine-tuning | 100-10k examples | 1-4 hours on GPU | 95-99% |
| Adapter tuning | 100-10k examples | 1-4 hours on GPU | 93-98% |
| Few-shot prompting (0 FT) | 0-10 examples in prompt | 0 (inference only) | 70-90% on easy tasks |
| Zero-shot prompting | 0 examples | 0 (inference only) | 60-85% task-dependent |
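The two prompting rows in the table need no training at all: the "examples" live in the input string, and the model's weights never change. A minimal sketch of how a few-shot sentiment prompt is assembled (the prompt wording and example reviews are illustrative):

```python
# Few-shot prompting: the task is specified entirely in the prompt;
# a frozen LLM continues the pattern it sees.
examples = [
    ("This movie was absolutely fantastic!", "positive"),
    ("A dull, lifeless script.", "negative"),
]
query = "I could not stop smiling the whole time."

prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)
```

The assembled string is sent to the model as-is; the completion after the final "Sentiment:" is the prediction. This is why the table lists training time as "0 (inference only)".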
Practice questions
- Why use a small learning rate (2e-5) for fine-tuning BERT instead of a larger rate? (Answer: BERT's pretrained weights encode general language knowledge learned over billions of tokens. A large LR would catastrophically forget this knowledge (catastrophic forgetting). Small LR makes small adjustments to specialise the model — preserving general knowledge while adapting to the task.)
- LoRA adds matrices B and A to weight matrix W with rank r=16. How many parameters does this add for a 768×768 attention matrix? (Answer: A: r×k = 16×768 = 12,288. B: d×r = 768×16 = 12,288. Total LoRA params: 24,576 vs original 768×768 = 589,824. Reduction: ~24x fewer parameters for this layer while preserving most capacity.)
- What is catastrophic forgetting and how does fine-tuning handle it? (Answer: When a neural network is trained on task B, it "forgets" task A because weights are overwritten. Fine-tuning manages this with: (1) Small learning rate. (2) Short training (few epochs). (3) Regularisation towards original weights. PEFT methods inherently prevent it by freezing most original weights.)
- You have 500 labelled examples for medical entity extraction. Should you fine-tune BERT or train a BiLSTM-CRF from scratch? (Answer: Fine-tune BERT — with 500 examples, a BiLSTM from scratch would massively overfit (millions of params, few examples). BERT starts with rich language representations from pretraining; fine-tuning with 500 examples adapts the existing knowledge. Standard practice: fine-tuning with 100-1000 examples typically outperforms training from scratch with 10x more data.)
- What is the difference between task-specific fine-tuning and instruction fine-tuning? (Answer: Task-specific: fine-tune on labelled (input, output) examples for ONE task (e.g., sentiment classification). Produces a narrow specialist model. Instruction fine-tuning (FLAN, InstructGPT): fine-tune on hundreds of tasks described in natural language instructions — produces a general-purpose assistant that can follow arbitrary new instructions. ChatGPT and Claude use instruction fine-tuning + RLHF.)
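The third remedy in the catastrophic-forgetting answer — regularising towards the original weights — is just an extra penalty term added to the task loss. A toy NumPy sketch of the idea (toy 3-dimensional weights, an illustrative quadratic task loss, and an arbitrary λ — not any specific library's implementation):

```python
import numpy as np

theta0 = np.array([1.0, -2.0, 0.5])   # pretrained weights (frozen reference copy)
theta = theta0.copy()
lam, lr = 0.1, 0.05

def task_grad(theta):
    # gradient of a toy task loss 0.5 * ||theta - target||^2
    return theta - np.array([2.0, 0.0, 0.0])

for _ in range(200):
    # total gradient = task gradient + gradient of lam * ||theta - theta0||^2
    grad = task_grad(theta) + 2 * lam * (theta - theta0)
    theta -= lr * grad

print(theta)  # settles between the task optimum and the pretrained weights
```

The penalty pulls the solution back towards θ₀, so the model adapts to the new task without drifting arbitrarily far from its pretrained knowledge; larger λ means stronger retention and weaker adaptation.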
On LumiChats
LumiChats is built on a pretrained language model fine-tuned with RLHF on human feedback. Understanding pretraining vs fine-tuning explains why LumiChats knows general world knowledge (from pretraining on internet text) but also follows instructions and is helpful (from instruction fine-tuning and RLHF).