
Transfer Learning

Reusing what a model already knows.


Definition

Transfer learning is the technique of taking a model trained on one task or dataset and reusing it — or adapting it — for a different but related task. Instead of training from scratch, you leverage representations learned on large datasets (often with abundant data) and transfer that knowledge to problems with limited data. Transfer learning is the foundation of modern NLP and computer vision.

Why transfer learning works

Neural networks learn features in a hierarchy — lower layers capture generic, universally useful patterns; upper layers capture task-specific abstractions. This hierarchy means lower layers almost always transfer:

| Model type | Early layers learn | Middle layers learn | Late layers learn |
|---|---|---|---|
| CNN (vision) | Gabor-like edges, colour blobs | Textures, corners, curves | Object parts, semantic concepts (face, wheel) |
| LLM (text) | Token co-occurrence, positional patterns | Syntax, grammar, POS structure | Semantics, world knowledge, reasoning |
| Speech model | Mel-frequency features, phoneme boundaries | Phonemes, prosody | Word identity, speaker style |

The "frozen layers" principle

A ResNet-50 pretrained on ImageNet's 1.2M photos has learned visual features useful for medical imaging, satellite imagery, and artwork — domains with completely different content. Because these early features are universal, freezing them and only training a new head is often enough for high performance on small datasets.
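A minimal PyTorch sketch of this pattern, using a tiny stand-in backbone rather than an actual pretrained ResNet-50 (in practice you would load `torchvision.models.resnet50` with pretrained weights; the layer sizes here are illustrative):

```python
import torch.nn as nn

# Stand-in for a pretrained backbone (in practice: a real ResNet-50)
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

# Freeze every backbone parameter — they keep their pretrained values
for param in backbone.parameters():
    param.requires_grad = False

# Only the new task head receives gradient updates
head = nn.Linear(16, 5)            # 5 target classes
model = nn.Sequential(backbone, head)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total}")
```

With the backbone frozen, the optimizer only ever touches the head's parameters, so training is fast and the universal early features cannot be overwritten.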

Feature extraction vs fine-tuning

| Strategy | What trains | When to use | Risk |
|---|---|---|---|
| Feature extraction (frozen) | New head only | Very small dataset (<1K examples), source ≈ target domain | Underfit if domains differ significantly |
| Partial fine-tuning | Last 1–2 blocks + head | Medium dataset, moderate domain shift | Mild catastrophic forgetting |
| Full fine-tuning | All layers with small LR | Large dataset (10K+), significant domain shift | Catastrophic forgetting without regularization |
| LoRA fine-tuning | Low-rank adapters only (0.1–1% params) | LLMs where GPU memory is a constraint | Slightly lower ceiling than full FT |

Choosing the right strategy based on dataset size

from transformers import AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5)

# ── Strategy A: Feature extraction (< 500 labelled examples) ──────────────
for param in model.bert.parameters():
    param.requires_grad = False   # freeze ALL BERT weights
# Only model.classifier trains — fast, no GPU needed for small models

# ── Strategy B: Partial fine-tuning (500–5K examples) ─────────────────────
for param in model.bert.parameters():
    param.requires_grad = False
for param in model.bert.encoder.layer[-2:].parameters():
    param.requires_grad = True    # unfreeze last 2 transformer blocks
# Classifier + last 2 blocks train

# ── Strategy C: Full fine-tuning (5K+ examples) ───────────────────────────
for param in model.parameters():
    param.requires_grad = True
# Use discriminative learning rates: lower LR for early layers
optimizer = torch.optim.AdamW([
    {"params": model.bert.embeddings.parameters(), "lr": 1e-5},
    {"params": model.bert.encoder.layer[:6].parameters(), "lr": 2e-5},
    {"params": model.bert.encoder.layer[6:].parameters(), "lr": 3e-5},
    {"params": model.classifier.parameters(), "lr": 5e-5},
])
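Strategy D (LoRA) keeps every pretrained weight frozen and learns only a low-rank update. A minimal hand-rolled sketch of the idea for a single linear layer — in practice you would use a library such as Hugging Face `peft` rather than this hypothetical class:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    output = W x + (B A x) * (alpha / rank). Illustrative sketch only."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze W and b
            p.requires_grad = False
        # B starts at zero, so the adapter is a no-op at initialisation
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
```

For a 768-wide layer at rank 8, the adapter adds roughly 2% of the layer's parameters; across a full LLM, where embeddings and most layers carry no adapter, the trainable fraction drops to the sub-1% range quoted in the table.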

Pretrain-then-fine-tune in NLP

Transfer learning transformed NLP in 2018 and remains the dominant paradigm. The timeline shows how each step built on the last:

| Year | Model | Contribution | Impact |
|---|---|---|---|
| 2018 | ELMo | Context-dependent word embeddings from bidirectional LSTM | First strong contextual representations — replaced GloVe |
| 2018 | ULMFiT | LM pretraining + discriminative fine-tuning + gradual unfreezing | Proved pretrain→fine-tune works for NLP tasks |
| 2018 | BERT | Masked LM + bidirectional Transformer pretraining | SOTA on 11 NLP benchmarks with tiny fine-tuning data |
| 2020 | GPT-3 | 175B params, few-shot in-context learning without weight updates | Showed scale alone produces task generalization |
| 2022+ | ChatGPT / LLaMA | RLHF alignment on top of pretrained foundation models | Fine-tuned assistants outperform task-specific models |

Domain adaptation

When source (pretraining) and target domains differ significantly, standard fine-tuning leaves performance on the table. Domain adaptation bridges the gap:

| Technique | What it does | Data needed | Best for |
|---|---|---|---|
| Domain-adaptive pretraining (DAPT) | Continue MLM/LM pretraining on unlabelled domain text before fine-tuning | Large unlabelled domain corpus | Biomedical, legal, code — domains with distinct vocabulary |
| Task-adaptive pretraining (TAPT) | Continue pretraining on unlabelled task data specifically | Unlabelled examples of your task | When task data is plentiful but labels are scarce |
| LoRA domain adapters | Train low-rank adapters on domain text | Any size domain corpus | LLMs where full pretraining is too expensive |
| Mixture of domain data | Include domain data in final fine-tuning mix | Domain + general data | Prevents forgetting general capabilities while adapting |
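DAPT reuses the pretraining objective itself on in-domain text. As an illustration, a sketch of BERT-style masked-token selection, the core of the MLM objective that DAPT continues to optimise (the 15% rate and 80/10/10 split follow the original BERT recipe; the token IDs are illustrative):

```python
import random

MASK_ID = 103          # [MASK] token id in BERT's vocabulary
VOCAB_SIZE = 30522     # bert-base-uncased vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """BERT-style masking: select ~15% of positions; of those, 80% become
    [MASK], 10% a random token, 10% stay unchanged. Returns the corrupted
    input and labels (-100 = position not scored by the MLM loss)."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                  # model must predict the original
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)
            # else: leave the token unchanged
    return inputs, labels

corrupted, labels = mask_tokens([2023, 2003, 1037, 7953, 6251], seed=0)
```

Running exactly this corruption over a large unlabelled domain corpus, with the pretrained weights as the starting point, is what "continue pretraining" means in the DAPT row above.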

Domain pretraining evidence

PubMedBERT (pretrained entirely on 14M biomedical abstracts, never web text) outperforms BioBERT (general BERT continue-pretrained on biomedical text) on 6 of 7 biomedical NLP benchmarks — evidence that from-scratch domain pretraining can beat adaptation when the domain vocabulary is distinct. For most practical cases, though, DAPT (continuing pretraining on domain text) captures most of the benefit at a small fraction of the compute.

Zero-shot and few-shot transfer

The most powerful manifestation of transfer learning is performing entirely new tasks without any task-specific training at all:

| Transfer type | Training needed | How it works | Performance vs fine-tuning |
|---|---|---|---|
| Zero-shot | None — inference only | Model infers task from description in prompt | 70–85% of fine-tuned for frontier models |
| Few-shot (ICL) | None — inference only | 2–10 examples in context window at inference time | 80–90% of fine-tuned — scales with model size |
| Few-shot fine-tuning | 10–100 labelled examples, brief fine-tune | Weight updates from tiny labelled set | 90–95% of full fine-tuning quality |
| Full fine-tuning | 1K–1M labelled examples | Standard gradient descent on task data | 100% baseline |
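Few-shot in-context learning needs no weight updates at all: the "training set" is pasted directly into the prompt. A sketch of the prompt construction for a sentiment task (the template and labels are illustrative):

```python
def build_few_shot_prompt(examples, query, instruction):
    """Assemble a few-shot classification prompt: instruction, then
    labelled demonstrations, then the unlabelled query."""
    demos = "\n".join(f"Review: {text}\nSentiment: {label}"
                      for text, label in examples)
    return f"{instruction}\n\n{demos}\nReview: {query}\nSentiment:"

prompt = build_few_shot_prompt(
    examples=[("Great battery life.", "positive"),
              ("Broke after two days.", "negative")],
    query="Works exactly as described.",
    instruction="Classify each review as positive or negative.",
)
print(prompt)
```

The model completes the final `Sentiment:` line by pattern-matching against the demonstrations — the transfer happens entirely inside the forward pass, which is why the table lists no training at all for this row.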

When zero-shot is enough

For frontier models (GPT-4, Claude 3.5, Gemini 1.5 Pro), zero-shot is competitive for classification, summarization, extraction, translation, and code generation. Fine-tuning still wins for highly specialized domains (medical, legal), strict output-format requirements, latency-sensitive applications (a smaller fine-tuned model can beat a large zero-shot model on speed and cost), and tasks that require the model to know proprietary information.

Practice questions

  1. What is the feature extraction approach vs fine-tuning approach in transfer learning? (Answer: Feature extraction: freeze the pretrained model entirely, use it only as a feature extractor (extract embeddings), train a small classifier head on top. Fast, no GPU needed for backprop through large model, prevents forgetting. Best when: small dataset, similar domain to pretraining. Fine-tuning: continue training some or all layers of the pretrained model on the new task. Slower but higher accuracy. Best when: large enough dataset to avoid overfitting, different domain from pretraining. Gradual unfreezing (start with head, then unfreeze layers from top) is a common compromise.)
  2. What is negative transfer and when does it occur? (Answer: Negative transfer: pretraining on source domain hurts performance on target domain compared to training from scratch. Occurs when source and target domains are fundamentally incompatible — the features learned on source data are actively misleading for the target. Example: pretraining on natural images then fine-tuning on X-ray pathology — image features (textures, colours) from natural images don't correspond to radiological features. Mitigation: use pretrained models only as initialisation, aggressive fine-tuning, or train from scratch if domains are truly incompatible.)
  3. What is domain adaptation and how does it differ from fine-tuning? (Answer: Fine-tuning: you have labelled data in the target domain — standard supervised training. Domain adaptation: target domain has little or no labels, but you have unlabelled target domain data. Techniques: adversarial domain adaptation (train a domain discriminator that can't distinguish source from target features — forces domain-invariant representations), self-training on target data (pseudo-labels from confident predictions), CORAL (align feature covariances across domains). Relevant when: collecting target domain labels is expensive (medical imaging, legal text).)
  4. How does CLIP enable zero-shot classification without any fine-tuning? (Answer: CLIP trains image and text encoders jointly on 400M image-text pairs so image embeddings and text embeddings are in the same space. Zero-shot classification: for each class, create a text prompt ('a photo of a dog') and compute its embedding. For a query image, compute its embedding and find the text prompt with highest cosine similarity. No task-specific training needed — class names ARE the classifier. The shared embedding space is the transfer: text descriptions of novel classes immediately work for image classification.)
  5. What is the transformer's contribution to making transfer learning universal across modalities? (Answer: Transformers with self-attention operate on sequences of tokens regardless of modality — the same architecture processes text tokens, image patches, audio frames, or video chunks. This universality enables: training one model on multiple modalities (Gemini, GPT-4o), pretraining on one modality and fine-tuning on another (CLIP image encoder fine-tuned for medical images), and transfer across domains without architecture changes. Pre-transformer, CNNs could not be directly transferred to sequence data; RNNs could not handle images. Transformers made cross-modal transfer natural.)
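The CLIP mechanism from question 4 reduces to nearest-neighbour search in the shared embedding space. A sketch with placeholder numpy vectors standing in for the real image and text encoder outputs (the embeddings below are hypothetical, not CLIP's):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, class_names):
    """Pick the class whose text embedding has the highest cosine
    similarity with the image embedding — the class names ARE the
    classifier."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                     # cosine similarity per class
    return class_names[int(np.argmax(sims))]

# Hypothetical embeddings standing in for CLIP encoder outputs
classes = ["dog", "cat", "car"]
text_embs = np.array([[0.9, 0.1, 0.0],   # "a photo of a dog"
                      [0.1, 0.9, 0.0],   # "a photo of a cat"
                      [0.0, 0.1, 0.9]])  # "a photo of a car"
image_emb = np.array([0.8, 0.2, 0.1])    # an image of a dog

print(zero_shot_classify(image_emb, text_embs, classes))  # prints "dog"
```

Swapping in a new class is just adding a row of text embeddings — no retraining, which is exactly the sense in which the shared space is the transferred knowledge.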
