Parameter-Efficient Fine-Tuning (PEFT) is a family of techniques for adapting large pretrained models to specific tasks or domains by updating only a small subset of the model's parameters — typically 0.1%–1% of total weights — rather than the full model. This dramatically reduces GPU memory requirements and training time while achieving performance comparable to full fine-tuning. LoRA, QLoRA, Prefix Tuning, Prompt Tuning, IA³, and Adapters are all PEFT methods. PEFT has made fine-tuning frontier-scale models accessible on consumer hardware.
The PEFT method family
| Method | What it updates | Parameters updated | Best for |
|---|---|---|---|
| LoRA (Low-Rank Adaptation) | Adds low-rank decomposition matrices alongside attention weights | ~0.1–1% of total | Most fine-tuning tasks — the current default PEFT method |
| QLoRA (Quantised LoRA) | LoRA on a 4-bit quantised base model | ~0.1% (LoRA adapters only) | Fine-tuning large models on limited hardware — 8B models on a 24GB consumer GPU, Llama 3 70B on a single A100-80GB |
| Prefix Tuning | Prepends trainable "soft tokens" to each transformer layer | <0.1% | Few-shot tasks; style and tone adaptation |
| Prompt Tuning | Prepends trainable tokens only to input layer | <0.01% | Task switching with single model; very parameter-efficient |
| IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations) | Element-wise scaling of activations | ~0.01% | Few-shot; extreme parameter efficiency |
| Adapters | Small feed-forward networks inserted between transformer layers | ~1–5% | Multi-task learning; modular task adapters |
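As a concrete illustration of the table's first row, the low-rank update LoRA learns can be sketched in plain Python (toy dimensions, no deep-learning framework; every name here is illustrative):

```python
import random

def matmul(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

d, r = 8, 2       # model dimension 8, LoRA rank 2 (toy sizes)
random.seed(0)
W0 = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]  # frozen pretrained weight
A = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]   # trainable, r x d
B = [[0.0] * r for _ in range(d)]                                  # trainable, d x r, zero-initialised
alpha = 4         # LoRA scaling factor

x = [1.0] * d
# LoRA forward pass: h = W0 x + (alpha / r) * B (A x)
h = [base + (alpha / r) * delta
     for base, delta in zip(matmul(W0, x), matmul(B, matmul(A, x)))]

# Because B starts at zero, the adapted model initially reproduces the base model exactly
assert h == matmul(W0, x)
```

Only A and B are trained (2·r·d parameters against d² frozen ones per adapted matrix), which is where the table's headline 0.1–1% figures come from as d grows.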
QLoRA fine-tuning: train Llama 3 8B on a single GPU with 24GB VRAM
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit quantisation config — loads the 8B model in ~5GB instead of ~16GB
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 — best quality at 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # nested quantisation for extra memory savings
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Prepare the quantised model for k-bit training
model = prepare_model_for_kbit_training(model)

# LoRA configuration — only these adapter matrices will be trained
lora_config = LoraConfig(
    r=16,                                   # rank of the LoRA decomposition — higher = more parameters
    lora_alpha=32,                          # the LoRA update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],    # attention matrices to adapt; add k_proj/o_proj for more capacity
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → trainable params: 6,815,744 || all params: 8,037,224,448 || trainable%: 0.0848
```

When PEFT is the right choice
- Domain adaptation: Fine-tuning a general model on your domain's vocabulary, writing style, and knowledge patterns (medical records, legal documents, Indian customer service dialogue).
- Style and persona: Teaching a model to consistently write in a specific tone, length, or format without prompting for it every time.
- Task specialisation: Optimising a model for a specific structured output task (NER, classification, SQL generation) where prompt engineering alone gives variable results.
- Resource-constrained deployment: When you need a custom model but cannot afford full fine-tuning compute — QLoRA makes 7B–13B model fine-tuning possible on a single consumer GPU.
- Not the right choice: If your task is already handled well by the base model with good prompting; if your dataset is smaller than ~500–1000 examples (high overfitting risk); if you need to update the model's knowledge cutoff (fine-tuning doesn't reliably add new factual knowledge — use RAG instead).
Practice questions
- LoRA uses rank decomposition W = W₀ + BA. For a 4096×4096 weight matrix with rank r=16, how many parameters does LoRA add? (Answer: A: r×k = 16×4096 = 65,536. B: d×r = 4096×16 = 65,536. Total LoRA addition: 131,072 parameters vs the original 4096×4096 = 16,777,216. LoRA adds only 0.78% of the original layer's parameters. For a full 7B model, LoRA with r=16 on all attention and MLP matrices typically adds ~40M trainable parameters from 7B total — 0.57%.)
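These counts are easy to verify directly. The full-model figure below assumes Llama 3 8B's published shapes (32 layers, hidden size 4096, grouped-query attention with 1024-dim k/v projections) and LoRA on q_proj and v_proj only:

```python
d = k = 4096   # a square 4096 x 4096 attention weight matrix
r = 16         # LoRA rank

lora_params = r * k + d * r    # A is r x k, B is d x r
assert lora_params == 131_072
assert d * k == 16_777_216
print(f"{lora_params / (d * k):.2%} of the layer's parameters")
# → 0.78% of the layer's parameters

# Full-model count for Llama 3 8B, r=16 on q_proj and v_proj in every layer
# (q_proj maps 4096 -> 4096; v_proj maps 4096 -> 1024 under grouped-query attention)
layers = 32
per_layer = (r * 4096 + 4096 * r) + (r * 4096 + 1024 * r)
print(layers * per_layer)
# → 6815744, the trainable-parameter count printed in the QLoRA example above
```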
- What is the difference between LoRA, QLoRA, and LoRA+? (Answer: LoRA: adds low-rank matrices to frozen BF16/FP16 weights. QLoRA (Dettmers et al. 2023): base model loaded in 4-bit NF4 quantisation; LoRA adapters trained in BF16. Matches 16-bit LoRA quality while cutting the base model's weight memory roughly 4× — the original paper fine-tunes a 65B model on a single 48GB GPU. LoRA+: uses different learning rates for the A (smaller LR) and B (larger LR) matrices — A determines the subspace, B scales it. Empirically improves LoRA convergence on many tasks.)
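The LoRA+ idea mentioned in this answer (a larger learning rate for B than for A) can be illustrated on a one-dimensional toy regression. This is a sketch of the scheme, not the paper's implementation, and the learning rates are made up:

```python
# Fit y = 2x with a rank-1 "LoRA" update w = b * a on a frozen base weight w0 = 0
a, b = 0.5, 0.0          # the A-like and B-like scalars; b starts at zero, as in LoRA
lr_a, lr_b = 0.01, 0.16  # LoRA+: the learning rate for b is a large multiple of a's
x, y = 1.0, 2.0

for _ in range(200):
    pred = b * a * x          # frozen w0 contributes nothing, so the prediction is b*a*x
    grad = 2 * (pred - y)     # derivative of squared error w.r.t. the prediction
    grad_a, grad_b = grad * b * x, grad * a * x
    a -= lr_a * grad_a
    b -= lr_b * grad_b

assert abs(b * a * x - y) < 1e-3   # the product b*a has converged to 2
```

Note that b does all the early movement (its gradient is nonzero from step one, and its learning rate is larger), which is exactly the asymmetry LoRA+ exploits.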
- When should you use Adapter layers instead of LoRA for PEFT? (Answer: Adapters: insert small bottleneck MLP layers (down-project → activation → up-project) at specific positions in the transformer. More expressive per-parameter than LoRA for tasks requiring new feature dimensions not present in the base model. LoRA preferred for: maintaining inference speed (adapters add forward pass overhead unless merged). Adapters preferred for: continual learning (stack multiple adapters), multi-task learning (swap adapters per task at inference), when task requires genuinely new capabilities.)
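The bottleneck adapter this answer describes (down-project, nonlinearity, up-project, added residually) is small enough to sketch without a framework; the sizes and the zero-initialisation are illustrative choices:

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

d, bottleneck = 16, 4   # hidden size 16, adapter bottleneck 4 (toy sizes)
random.seed(0)
W_down = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(bottleneck)]
W_up = [[0.0] * bottleneck for _ in range(d)]   # zero init, so the adapter starts as a no-op

def adapter(h):
    """Residual bottleneck adapter: h + W_up(relu(W_down h))."""
    z = [max(0.0, v) for v in matvec(W_down, h)]              # down-project + ReLU
    return [hi + ui for hi, ui in zip(h, matvec(W_up, z))]    # up-project + residual

h = [1.0] * d
assert adapter(h) == h   # zero-initialised up-projection leaves the layer unchanged at first
# Trainable parameters: 2 * d * bottleneck = 128, versus d * d = 256 for a full d x d layer
```

Unlike a merged LoRA, this extra matvec sits in the forward path at every inference call, which is the latency point the answer makes.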
- What is catastrophic forgetting in PEFT and does LoRA prevent it? (Answer: Catastrophic forgetting: training on new data overwrites knowledge encoded in existing weights. Full fine-tuning is highly susceptible — weights from pretraining are overwritten. LoRA significantly reduces forgetting because most weights are frozen — only the small A,B matrices are updated. The frozen weights preserve the base model's language capabilities. Residual forgetting: the LoRA matrices can shift model behaviour away from original capabilities on unrelated tasks, but the effect is much smaller than full fine-tuning.)
- How does Prompt Tuning differ from LoRA and when would you prefer it? (Answer: Prompt tuning: prepend learnable continuous embeddings (soft prompts) to the input — only these ~100 tokens' worth of parameters are trained. Extremely parameter-efficient (0.01% of model). Works well for large models (T5-XXL, GPT-3-scale) where the model is already very capable. Fails for smaller models (<1B) that lack the capacity to be steered by soft prompts alone. LoRA preferred for tasks requiring large behavioural changes; Prompt Tuning preferred for tasks solvable with careful prompting, when minimum storage cost is critical.)
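Mechanically, the prompt tuning described in this answer reduces to concatenating trained embeddings in front of the input sequence. A minimal sketch with made-up dimensions and a stand-in embedding table:

```python
import random

d_model, num_virtual_tokens = 8, 4   # toy embedding width and soft-prompt length
random.seed(0)
# The ONLY trainable parameters: a num_virtual_tokens x d_model table of soft-prompt embeddings
soft_prompt = [[random.gauss(0, 0.02) for _ in range(d_model)]
               for _ in range(num_virtual_tokens)]

def embed(token_ids):
    """Stand-in for the frozen model's embedding lookup (illustrative)."""
    return [[float(t)] * d_model for t in token_ids]

input_ids = [101, 2054, 2003]
inputs_embeds = soft_prompt + embed(input_ids)   # prepend, then feed to the frozen model

assert len(inputs_embeds) == num_virtual_tokens + len(input_ids)
# Trainable parameters: 4 * 8 = 32, independent of model size; the model itself never changes
```

Because the trained artifact is just this tiny embedding table, swapping tasks means swapping a few kilobytes of soft prompt while serving one shared frozen model.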