
PEFT (Parameter-Efficient Fine-Tuning)

Adapting large models for specific tasks by updating only a tiny fraction of parameters.


Definition

Parameter-Efficient Fine-Tuning (PEFT) is a family of techniques for adapting large pretrained models to specific tasks or domains by updating only a small subset of the model's parameters — typically 0.1%–1% of total weights — rather than the full model. This dramatically reduces GPU memory requirements and training time while achieving performance comparable to full fine-tuning. LoRA, QLoRA, Prefix Tuning, Prompt Tuning, IA³, and Adapters are all PEFT methods. PEFT has made fine-tuning frontier-scale models accessible on consumer hardware.

The PEFT method family

  • LoRA (Low-Rank Adaptation): adds trainable low-rank decomposition matrices alongside frozen attention weights; updates ~0.1–1% of total parameters. Best for most fine-tuning tasks; the current default PEFT method.
  • QLoRA (Quantised LoRA): trains LoRA adapters on top of a 4-bit quantised base model; updates ~0.1% (the LoRA adapters only). Best for fine-tuning 70B-class models on a single GPU, e.g. Llama 3 70B on one A100.
  • Prefix Tuning: prepends trainable "soft tokens" to every transformer layer; updates <0.1%. Best for few-shot tasks and style or tone adaptation.
  • Prompt Tuning: prepends trainable tokens to the input layer only; updates <0.01%. Best for task switching with a single model; extremely parameter-efficient.
  • IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations): learns element-wise scaling vectors for activations; updates ~0.01%. Best for few-shot settings needing extreme parameter efficiency.
  • Adapters: inserts small feed-forward bottleneck networks between transformer layers; updates ~1–5%. Best for multi-task learning with modular, swappable task adapters.
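The parameter fractions above follow directly from the shapes of the two LoRA factor matrices. A quick sketch of the arithmetic (pure Python, no framework assumed; the function name is illustrative):

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Parameters added by a LoRA adapter on a d_out x d_in weight matrix.

    A has shape (r, d_in) and B has shape (d_out, r), so the adapter
    adds r*d_in + d_out*r parameters while the base matrix stays frozen.
    """
    return r * d_in + d_out * r

# A 4096x4096 attention projection with rank r=16:
full = 4096 * 4096                    # 16,777,216 frozen parameters
added = lora_param_count(4096, 4096, 16)
print(added)                          # 131072 adapter parameters
print(f"{added / full:.2%}")          # 0.78% of the layer
```

The same formula explains why rank is the main knob: doubling r doubles the adapter size, while the frozen base cost stays constant.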

QLoRA fine-tuning: train Llama 3 8B on a single GPU with 24GB VRAM

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit quantisation config — loads 8B model in ~5GB instead of 16GB
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 — best quality at 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,      # nested quantisation for extra memory savings
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

# LoRA configuration — only these adapter matrices will be trained
lora_config = LoraConfig(
    r=16,                      # rank of LoRA decomposition — higher = more parameters
    lora_alpha=32,             # scaling factor (lora_alpha/r = effective learning rate scaling)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # which attention matrices to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → trainable params on the order of ~13.6M of ~8.03B total (≈0.17%) with r=16 on all four attention projections
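Underneath get_peft_model, each adapted projection computes the base output plus a scaled low-rank correction, y = W₀x + (α/r)·BAx, with A and B the only trainable tensors. A framework-free NumPy sketch of that forward pass (toy dimensions, not the real model):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 64, 8, 16              # toy sizes; real models use d=4096+
W0 = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                 # trainable, zero init: adapter starts as a no-op

def lora_forward(x):
    # base path plus low-rank path; only A and B would receive gradients
    return x @ W0.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(4, d))
# With B initialised to zero, the adapted model exactly matches the base model,
# which is why LoRA training starts from the pretrained behaviour:
assert np.allclose(lora_forward(x), x @ W0.T)
```

The zero-initialised B is the standard LoRA trick: training begins at the identity update and gradually learns the task-specific correction.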

When PEFT is the right choice

  • Domain adaptation: Fine-tuning a general model on your domain's vocabulary, writing style, and knowledge patterns (medical records, legal documents, Indian customer service dialogue).
  • Style and persona: Teaching a model to consistently write in a specific tone, length, or format without prompting for it every time.
  • Task specialisation: Optimising a model for a specific structured output task (NER, classification, SQL generation) where prompt engineering alone gives variable results.
  • Resource-constrained deployment: When you need a custom model but cannot afford full fine-tuning compute — QLoRA makes 7B–13B model fine-tuning possible on a single consumer GPU.
  • Not the right choice: If your task is already handled well by the base model with good prompting; if your dataset is smaller than ~500–1000 examples (high overfitting risk); if you need to update the model's knowledge cutoff (fine-tuning doesn't reliably add new factual knowledge — use RAG instead).

Practice questions

  1. LoRA uses rank decomposition W = W₀ + BA. For a 4096×4096 weight matrix with rank r=16, how many parameters does LoRA add? (Answer: A: r×k = 16×4096 = 65,536. B: d×r = 4096×16 = 65,536. Total LoRA addition: 131,072 parameters vs the original 4096×4096 = 16,777,216. LoRA adds only 0.78% of the original layer's parameters. For a full 7B model, LoRA with r=16 on all attention and MLP matrices typically adds ~40M trainable parameters from 7B total — 0.57%.)
  2. What is the difference between LoRA, QLoRA, and LoRA+? (Answer: LoRA: adds low-rank matrices to frozen BF16/FP16 weights. QLoRA (Dettmers et al., 2023): base model loaded in 4-bit NF4 quantisation; LoRA adapters trained in BF16. Near-identical quality to 16-bit LoRA at a fraction of the VRAM; the original paper demonstrated 65B fine-tuning on a single 48GB GPU. LoRA+: uses different learning rates for the A (small LR) and B (large LR) matrices; the A matrix determines the subspace, B scales it. Empirically improves LoRA convergence on many tasks.)
  3. When should you use Adapter layers instead of LoRA for PEFT? (Answer: Adapters: insert small bottleneck MLP layers (down-project → activation → up-project) at specific positions in the transformer. More expressive per-parameter than LoRA for tasks requiring new feature dimensions not present in the base model. LoRA preferred for: maintaining inference speed (adapters add forward pass overhead unless merged). Adapters preferred for: continual learning (stack multiple adapters), multi-task learning (swap adapters per task at inference), when task requires genuinely new capabilities.)
  4. What is catastrophic forgetting in PEFT and does LoRA prevent it? (Answer: Catastrophic forgetting: training on new data overwrites knowledge encoded in existing weights. Full fine-tuning is highly susceptible — weights from pretraining are overwritten. LoRA significantly reduces forgetting because most weights are frozen — only the small A,B matrices are updated. The frozen weights preserve the base model's language capabilities. Residual forgetting: the LoRA matrices can shift model behaviour away from original capabilities on unrelated tasks, but the effect is much smaller than full fine-tuning.)
  5. How does Prompt Tuning differ from LoRA and when would you prefer it? (Answer: Prompt tuning: prepend learnable continuous embeddings (soft prompts) to the input — only these ~100 tokens' worth of parameters are trained. Extremely parameter-efficient (0.01% of model). Works well for large models (T5-XXL, GPT-3-scale) where the model is already very capable. Fails for smaller models (<1B) that lack the capacity to be steered by soft prompts alone. LoRA preferred for tasks requiring large behavioural changes; Prompt Tuning preferred for tasks solvable with careful prompting, when minimum storage cost is critical.)
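The mechanism behind Prompt Tuning in question 5 is just concatenation: learned continuous vectors are prepended to the frozen embedding output before the transformer runs. A minimal NumPy sketch with toy sizes (all names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

d_model, n_soft, seq_len = 32, 8, 10  # toy sizes; real soft prompts use ~20-100 tokens
soft_prompt = rng.normal(size=(n_soft, d_model))        # the ONLY trainable parameters
token_embeddings = rng.normal(size=(seq_len, d_model))  # frozen embedding lookup output

# Prompt tuning prepends the learned vectors to the input embeddings;
# the frozen transformer then processes the concatenated sequence.
model_input = np.concatenate([soft_prompt, token_embeddings], axis=0)
print(model_input.shape)  # (18, 32)
```

Swapping tasks at inference then means swapping an (n_soft, d_model) array, which is why storage cost per task is tiny compared with LoRA adapters.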
