
Transfer Learning

Reusing what a model already knows.


Definition

Transfer learning is the technique of taking a model trained on one task or dataset and reusing it — or adapting it — for a different but related task. Instead of training from scratch, you leverage representations learned on large datasets (often with abundant data) and transfer that knowledge to problems with limited data. Transfer learning is the foundation of modern NLP and computer vision.

Why transfer learning works

Neural networks learn features in a hierarchy — lower layers capture generic, universally useful patterns; upper layers capture task-specific abstractions. This hierarchy means lower layers almost always transfer:

| Model type | Early layers learn | Middle layers learn | Late layers learn |
|---|---|---|---|
| CNN (vision) | Gabor-like edges, colour blobs | Textures, corners, curves | Object parts, semantic concepts (face, wheel) |
| LLM (text) | Token co-occurrence, positional patterns | Syntax, grammar, POS structure | Semantics, world knowledge, reasoning |
| Speech model | Mel-frequency features, phoneme boundaries | Phonemes, prosody | Word identity, speaker style |

The "frozen layers" principle

A ResNet-50 pretrained on ImageNet's 1.2M photos has learned visual features useful for medical imaging, satellite imagery, and artwork — domains with completely different content. Because these early features are universal, freezing them and only training a new head is often enough for high performance on small datasets.
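A minimal PyTorch sketch of this pattern, using a tiny stand-in backbone rather than an actual pretrained ResNet-50 (in practice you would load `torchvision.models.resnet50` with pretrained weights; the layer sizes here are illustrative):

```python
import torch.nn as nn

# Stand-in for a pretrained backbone (in practice: a real ResNet-50)
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

# Freeze every backbone parameter — they keep their pretrained values
for param in backbone.parameters():
    param.requires_grad = False

# Only the new task head receives gradient updates
head = nn.Linear(16, 5)            # 5 target classes
model = nn.Sequential(backbone, head)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total}")
```

With the backbone frozen, the optimizer only ever touches the head's parameters, so training is fast and the universal early features cannot be overwritten.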

Feature extraction vs fine-tuning

| Strategy | What trains | When to use | Risk |
|---|---|---|---|
| Feature extraction (frozen) | New head only | Very small dataset (<1K examples), source ≈ target domain | Underfit if domains differ significantly |
| Partial fine-tuning | Last 1–2 blocks + head | Medium dataset, moderate domain shift | Mild catastrophic forgetting |
| Full fine-tuning | All layers with small LR | Large dataset (10K+), significant domain shift | Catastrophic forgetting without regularization |
| LoRA fine-tuning | Low-rank adapters only (0.1–1% params) | LLMs where GPU memory is a constraint | Slightly lower ceiling than full FT |

Choosing the right strategy based on dataset size

from transformers import AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5)

# ── Strategy A: Feature extraction (< 500 labelled examples) ──────────────
for param in model.bert.parameters():
    param.requires_grad = False   # freeze ALL BERT weights
# Only model.classifier trains — fast, no GPU needed for small models

# ── Strategy B: Partial fine-tuning (500–5K examples) ─────────────────────
for param in model.bert.parameters():
    param.requires_grad = False
for param in model.bert.encoder.layer[-2:].parameters():
    param.requires_grad = True    # unfreeze last 2 transformer blocks
# Classifier + last 2 blocks train

# ── Strategy C: Full fine-tuning (5K+ examples) ───────────────────────────
for param in model.parameters():
    param.requires_grad = True
# Use discriminative learning rates: lower LR for early layers
optimizer = torch.optim.AdamW([
    {"params": model.bert.embeddings.parameters(), "lr": 1e-5},
    {"params": model.bert.encoder.layer[:6].parameters(), "lr": 2e-5},
    {"params": model.bert.encoder.layer[6:].parameters(), "lr": 3e-5},
    {"params": model.classifier.parameters(), "lr": 5e-5},
])
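Strategy D (LoRA) keeps every pretrained weight frozen and learns only a low-rank update. A minimal hand-rolled sketch of the idea for a single linear layer — in practice you would use a library such as Hugging Face `peft` rather than this hypothetical class:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    output = W x + (B A x) * (alpha / rank). Illustrative sketch only."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze W and b
            p.requires_grad = False
        # B starts at zero, so the adapter is a no-op at initialisation
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
```

For a 768-wide layer at rank 8, the adapter adds roughly 2% of the layer's parameters; across a full LLM, where embeddings and most layers carry no adapter, the trainable fraction drops to the sub-1% range quoted in the table.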

Pretrain-then-fine-tune in NLP

Transfer learning transformed NLP in 2018 and remains the dominant paradigm. The timeline shows how each step built on the last:

| Year | Model | Contribution | Impact |
|---|---|---|---|
| 2018 | ELMo | Context-dependent word embeddings from bidirectional LSTM | First strong contextual representations — replaced GloVe |
| 2018 | ULMFiT | LM pretraining + discriminative fine-tuning + gradual unfreezing | Proved pretrain→fine-tune works for NLP tasks |
| 2018 | BERT | Masked LM + bidirectional Transformer pretraining | SOTA on 11 NLP benchmarks with tiny fine-tuning data |
| 2020 | GPT-3 | 175B params, few-shot in-context learning without weight updates | Showed scale alone produces task generalization |
| 2022+ | ChatGPT / LLaMA | RLHF alignment on top of pretrained foundation models | Fine-tuned assistants outperform task-specific models |

Domain adaptation

When source (pretraining) and target domains differ significantly, standard fine-tuning leaves performance on the table. Domain adaptation bridges the gap:

| Technique | What it does | Data needed | Best for |
|---|---|---|---|
| Domain-adaptive pretraining (DAPT) | Continue MLM/LM pretraining on unlabelled domain text before fine-tuning | Large unlabelled domain corpus | Biomedical, legal, code — domains with distinct vocabulary |
| Task-adaptive pretraining (TAPT) | Continue pretraining on unlabelled task data specifically | Unlabelled examples of your task | When task data is plentiful but labels are scarce |
| LoRA domain adapters | Train low-rank adapters on domain text | Any size domain corpus | LLMs where full pretraining is too expensive |
| Mixture of domain data | Include domain data in final fine-tuning mix | Domain + general data | Prevents forgetting general capabilities while adapting |
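DAPT reuses the pretraining objective itself on in-domain text. As an illustration, a sketch of BERT-style masked-token selection, the core of the MLM objective that DAPT continues to optimise (the 15% rate and 80/10/10 split follow the original BERT recipe; the token IDs are illustrative):

```python
import random

MASK_ID = 103          # [MASK] token id in BERT's vocabulary
VOCAB_SIZE = 30522     # bert-base-uncased vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """BERT-style masking: select ~15% of positions; of those, 80% become
    [MASK], 10% a random token, 10% stay unchanged. Returns the corrupted
    input and labels (-100 = position not scored by the MLM loss)."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                  # model must predict the original
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)
            # else: leave the token unchanged
    return inputs, labels

corrupted, labels = mask_tokens([2023, 2003, 1037, 7953, 6251], seed=0)
```

Running exactly this corruption over a large unlabelled domain corpus, with the pretrained weights as the starting point, is what "continue pretraining" means in the DAPT row above.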

Domain pretraining evidence

PubMedBERT (pretrained entirely on 14M biomedical abstracts, never web text) outperforms BioBERT (general BERT continue-pretrained on biomedical text) on 6 of 7 biomedical NLP benchmarks — evidence that from-scratch domain pretraining can beat adaptation when the domain vocabulary is distinct. For most practical cases, though, DAPT (continuing pretraining on domain text) captures most of the benefit at a small fraction of the compute.

Zero-shot and few-shot transfer

The most powerful manifestation of transfer learning is performing entirely new tasks without any task-specific training at all:

| Transfer type | Training needed | How it works | Performance vs fine-tuning |
|---|---|---|---|
| Zero-shot | None — inference only | Model infers task from description in prompt | 70–85% of fine-tuned for frontier models |
| Few-shot (ICL) | None — inference only | 2–10 examples in context window at inference time | 80–90% of fine-tuned — scales with model size |
| Few-shot fine-tuning | 10–100 labelled examples, brief fine-tune | Weight updates from tiny labelled set | 90–95% of full fine-tuning quality |
| Full fine-tuning | 1K–1M labelled examples | Standard gradient descent on task data | 100% baseline |
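Few-shot in-context learning needs no weight updates at all: the "training set" is pasted directly into the prompt. A sketch of the prompt construction for a sentiment task (the template and labels are illustrative):

```python
def build_few_shot_prompt(examples, query, instruction):
    """Assemble a few-shot classification prompt: instruction, then
    labelled demonstrations, then the unlabelled query."""
    demos = "\n".join(f"Review: {text}\nSentiment: {label}"
                      for text, label in examples)
    return f"{instruction}\n\n{demos}\nReview: {query}\nSentiment:"

prompt = build_few_shot_prompt(
    examples=[("Great battery life.", "positive"),
              ("Broke after two days.", "negative")],
    query="Works exactly as described.",
    instruction="Classify each review as positive or negative.",
)
print(prompt)
```

The model completes the final `Sentiment:` line by pattern-matching against the demonstrations — the transfer happens entirely inside the forward pass, which is why the table lists no training at all for this row.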

When zero-shot is enough

For frontier models (GPT-4, Claude 3.5, Gemini 1.5 Pro), zero-shot is competitive for classification, summarization, extraction, translation, and code generation. Fine-tuning still wins for highly specialized domains (medical, legal), strict output-format requirements, latency-sensitive applications (a smaller fine-tuned model can beat a large zero-shot model on speed and cost), and tasks that require the model to know proprietary information.

Practice questions

  1. What is the feature extraction approach vs fine-tuning approach in transfer learning? (Answer: Feature extraction: freeze the pretrained model entirely, use it only as a feature extractor (extract embeddings), train a small classifier head on top. Fast, no GPU needed for backprop through large model, prevents forgetting. Best when: small dataset, similar domain to pretraining. Fine-tuning: continue training some or all layers of the pretrained model on the new task. Slower but higher accuracy. Best when: large enough dataset to avoid overfitting, different domain from pretraining. Gradual unfreezing (start with head, then unfreeze layers from top) is a common compromise.)
  2. What is negative transfer and when does it occur? (Answer: Negative transfer: pretraining on source domain hurts performance on target domain compared to training from scratch. Occurs when source and target domains are fundamentally incompatible — the features learned on source data are actively misleading for the target. Example: pretraining on natural images then fine-tuning on X-ray pathology — image features (textures, colours) from natural images don't correspond to radiological features. Mitigation: use pretrained models only as initialisation, aggressive fine-tuning, or train from scratch if domains are truly incompatible.)
  3. What is domain adaptation and how does it differ from fine-tuning? (Answer: Fine-tuning: you have labelled data in the target domain — standard supervised training. Domain adaptation: target domain has little or no labels, but you have unlabelled target domain data. Techniques: adversarial domain adaptation (train a domain discriminator that can't distinguish source from target features — forces domain-invariant representations), self-training on target data (pseudo-labels from confident predictions), CORAL (align feature covariances across domains). Relevant when: collecting target domain labels is expensive (medical imaging, legal text).)
  4. How does CLIP enable zero-shot classification without any fine-tuning? (Answer: CLIP trains image and text encoders jointly on 400M image-text pairs so image embeddings and text embeddings are in the same space. Zero-shot classification: for each class, create a text prompt ('a photo of a dog') and compute its embedding. For a query image, compute its embedding and find the text prompt with highest cosine similarity. No task-specific training needed — class names ARE the classifier. The shared embedding space is the transfer: text descriptions of novel classes immediately work for image classification.)
  5. What is the transformer's contribution to making transfer learning universal across modalities? (Answer: Transformers with self-attention operate on sequences of tokens regardless of modality — the same architecture processes text tokens, image patches, audio frames, or video chunks. This universality enables: training one model on multiple modalities (Gemini, GPT-4o), pretraining on one modality and fine-tuning on another (CLIP image encoder fine-tuned for medical images), and transfer across domains without architecture changes. Pre-transformer, CNNs could not be directly transferred to sequence data; RNNs could not handle images. Transformers made cross-modal transfer natural.)
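The CLIP mechanism from question 4 reduces to nearest-neighbour search in the shared embedding space. A sketch with placeholder numpy vectors standing in for the real image and text encoder outputs (the embeddings below are hypothetical, not CLIP's):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, class_names):
    """Pick the class whose text embedding has the highest cosine
    similarity with the image embedding — the class names ARE the
    classifier."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                     # cosine similarity per class
    return class_names[int(np.argmax(sims))]

# Hypothetical embeddings standing in for CLIP encoder outputs
classes = ["dog", "cat", "car"]
text_embs = np.array([[0.9, 0.1, 0.0],   # "a photo of a dog"
                      [0.1, 0.9, 0.0],   # "a photo of a cat"
                      [0.0, 0.1, 0.9]])  # "a photo of a car"
image_emb = np.array([0.8, 0.2, 0.1])    # an image of a dog

print(zero_shot_classify(image_emb, text_embs, classes))  # prints "dog"
```

Swapping in a new class is just adding a row of text embeddings — no retraining, which is exactly the sense in which the shared space is the transferred knowledge.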
