Parameter-Efficient Fine-Tuning (PEFT) is a family of techniques for adapting large pretrained models to specific tasks or domains by updating only a small subset of the model's parameters — typically 0.1%–1% of total weights — rather than the full model. This dramatically reduces GPU memory requirements and training time while achieving performance comparable to full fine-tuning. LoRA, QLoRA, Prefix Tuning, Prompt Tuning, IA³, and Adapters are all PEFT methods. PEFT has made fine-tuning frontier-scale models accessible on consumer hardware.
The PEFT method family
| Method | What it updates | Parameters updated | Best for |
|---|---|---|---|
| LoRA (Low-Rank Adaptation) | Adds low-rank decomposition matrices alongside attention weights | ~0.1–1% of total | Most fine-tuning tasks — the current default PEFT method |
| QLoRA (Quantised LoRA) | LoRA on a 4-bit quantised base model | ~0.1% (LoRA adapters only) | Fine-tuning large models on limited hardware — 8B models on a 24GB consumer GPU, Llama 3 70B on a single A100-80GB |
| Prefix Tuning | Prepends trainable "soft tokens" to each transformer layer | <0.1% | Few-shot tasks; style and tone adaptation |
| Prompt Tuning | Prepends trainable tokens only to input layer | <0.01% | Task switching with single model; very parameter-efficient |
| IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations) | Element-wise scaling of activations | ~0.01% | Few-shot; extreme parameter efficiency |
| Adapters | Small feed-forward networks inserted between transformer layers | ~1–5% | Multi-task learning; modular task adapters |
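As a concrete illustration of the table's first row, the low-rank update LoRA learns can be sketched in plain Python (toy dimensions, no deep-learning framework; every name here is illustrative):

```python
import random

def matmul(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

d, r = 8, 2       # model dimension 8, LoRA rank 2 (toy sizes)
random.seed(0)
W0 = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]  # frozen pretrained weight
A = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]   # trainable, r x d
B = [[0.0] * r for _ in range(d)]                                  # trainable, d x r, zero-initialised
alpha = 4         # LoRA scaling factor

x = [1.0] * d
# LoRA forward pass: h = W0 x + (alpha / r) * B (A x)
h = [base + (alpha / r) * delta
     for base, delta in zip(matmul(W0, x), matmul(B, matmul(A, x)))]

# Because B starts at zero, the adapted model initially reproduces the base model exactly
assert h == matmul(W0, x)
```

Only A and B are trained (2·r·d parameters against d² frozen ones per adapted matrix), which is where the table's headline 0.1–1% figures come from as d grows.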
QLoRA fine-tuning: train Llama 3 8B on a single GPU with 24GB VRAM
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit quantisation config — loads the 8B model in ~5GB instead of ~16GB
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 — best quality at 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # nested quantisation for extra memory savings
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Prepare the quantised model for k-bit training
model = prepare_model_for_kbit_training(model)

# LoRA configuration — only these adapter matrices will be trained
lora_config = LoraConfig(
    r=16,                                   # rank of the LoRA decomposition — higher = more parameters
    lora_alpha=32,                          # the LoRA update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],    # attention matrices to adapt; add k_proj/o_proj for more capacity
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → trainable params: 6,815,744 || all params: 8,037,224,448 || trainable%: 0.0848
```

When PEFT is the right choice
- Domain adaptation: Fine-tuning a general model on your domain's vocabulary, writing style, and knowledge patterns (medical records, legal documents, Indian customer service dialogue).
- Style and persona: Teaching a model to consistently write in a specific tone, length, or format without prompting for it every time.
- Task specialisation: Optimising a model for a specific structured output task (NER, classification, SQL generation) where prompt engineering alone gives variable results.
- Resource-constrained deployment: When you need a custom model but cannot afford full fine-tuning compute — QLoRA makes 7B–13B model fine-tuning possible on a single consumer GPU.
- Not the right choice: If your task is already handled well by the base model with good prompting; if your dataset is smaller than ~500–1000 examples (high overfitting risk); if you need to update the model's knowledge cutoff (fine-tuning doesn't reliably add new factual knowledge — use RAG instead).
Practice questions
- LoRA uses rank decomposition W = W₀ + BA. For a 4096×4096 weight matrix with rank r=16, how many parameters does LoRA add? (Answer: A: r×k = 16×4096 = 65,536. B: d×r = 4096×16 = 65,536. Total LoRA addition: 131,072 parameters vs the original 4096×4096 = 16,777,216. LoRA adds only 0.78% of the original layer's parameters. For a full 7B model, LoRA with r=16 on all attention and MLP matrices typically adds ~40M trainable parameters from 7B total — 0.57%.)
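These counts are easy to verify directly. The full-model figure below assumes Llama 3 8B's published shapes (32 layers, hidden size 4096, grouped-query attention with 1024-dim k/v projections) and LoRA on q_proj and v_proj only:

```python
d = k = 4096   # a square 4096 x 4096 attention weight matrix
r = 16         # LoRA rank

lora_params = r * k + d * r    # A is r x k, B is d x r
assert lora_params == 131_072
assert d * k == 16_777_216
print(f"{lora_params / (d * k):.2%} of the layer's parameters")
# → 0.78% of the layer's parameters

# Full-model count for Llama 3 8B, r=16 on q_proj and v_proj in every layer
# (q_proj maps 4096 -> 4096; v_proj maps 4096 -> 1024 under grouped-query attention)
layers = 32
per_layer = (r * 4096 + 4096 * r) + (r * 4096 + 1024 * r)
print(layers * per_layer)
# → 6815744, the trainable-parameter count printed in the QLoRA example above
```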
- What is the difference between LoRA, QLoRA, and LoRA+? (Answer: LoRA: adds low-rank matrices to frozen BF16/FP16 weights. QLoRA (Dettmers et al. 2023): base model loaded in 4-bit NF4 quantisation; LoRA adapters trained in BF16. Matches 16-bit LoRA quality while cutting the base model's weight memory roughly 4× — the original paper fine-tunes a 65B model on a single 48GB GPU. LoRA+: uses different learning rates for the A (smaller LR) and B (larger LR) matrices — A determines the subspace, B scales it. Empirically improves LoRA convergence on many tasks.)
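The LoRA+ idea mentioned in this answer (a larger learning rate for B than for A) can be illustrated on a one-dimensional toy regression. This is a sketch of the scheme, not the paper's implementation, and the learning rates are made up:

```python
# Fit y = 2x with a rank-1 "LoRA" update w = b * a on a frozen base weight w0 = 0
a, b = 0.5, 0.0          # the A-like and B-like scalars; b starts at zero, as in LoRA
lr_a, lr_b = 0.01, 0.16  # LoRA+: the learning rate for b is a large multiple of a's
x, y = 1.0, 2.0

for _ in range(200):
    pred = b * a * x          # frozen w0 contributes nothing, so the prediction is b*a*x
    grad = 2 * (pred - y)     # derivative of squared error w.r.t. the prediction
    grad_a, grad_b = grad * b * x, grad * a * x
    a -= lr_a * grad_a
    b -= lr_b * grad_b

assert abs(b * a * x - y) < 1e-3   # the product b*a has converged to 2
```

Note that b does all the early movement (its gradient is nonzero from step one, and its learning rate is larger), which is exactly the asymmetry LoRA+ exploits.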
- When should you use Adapter layers instead of LoRA for PEFT? (Answer: Adapters: insert small bottleneck MLP layers (down-project → activation → up-project) at specific positions in the transformer. More expressive per-parameter than LoRA for tasks requiring new feature dimensions not present in the base model. LoRA preferred for: maintaining inference speed (adapters add forward pass overhead unless merged). Adapters preferred for: continual learning (stack multiple adapters), multi-task learning (swap adapters per task at inference), when task requires genuinely new capabilities.)
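The bottleneck adapter this answer describes (down-project, nonlinearity, up-project, added residually) is small enough to sketch without a framework; the sizes and the zero-initialisation are illustrative choices:

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

d, bottleneck = 16, 4   # hidden size 16, adapter bottleneck 4 (toy sizes)
random.seed(0)
W_down = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(bottleneck)]
W_up = [[0.0] * bottleneck for _ in range(d)]   # zero init, so the adapter starts as a no-op

def adapter(h):
    """Residual bottleneck adapter: h + W_up(relu(W_down h))."""
    z = [max(0.0, v) for v in matvec(W_down, h)]              # down-project + ReLU
    return [hi + ui for hi, ui in zip(h, matvec(W_up, z))]    # up-project + residual

h = [1.0] * d
assert adapter(h) == h   # zero-initialised up-projection leaves the layer unchanged at first
# Trainable parameters: 2 * d * bottleneck = 128, versus d * d = 256 for a full d x d layer
```

Unlike a merged LoRA, this extra matvec sits in the forward path at every inference call, which is the latency point the answer makes.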
- What is catastrophic forgetting in PEFT and does LoRA prevent it? (Answer: Catastrophic forgetting: training on new data overwrites knowledge encoded in existing weights. Full fine-tuning is highly susceptible — weights from pretraining are overwritten. LoRA significantly reduces forgetting because most weights are frozen — only the small A,B matrices are updated. The frozen weights preserve the base model's language capabilities. Residual forgetting: the LoRA matrices can shift model behaviour away from original capabilities on unrelated tasks, but the effect is much smaller than full fine-tuning.)
- How does Prompt Tuning differ from LoRA and when would you prefer it? (Answer: Prompt tuning: prepend learnable continuous embeddings (soft prompts) to the input — only these ~100 tokens' worth of parameters are trained. Extremely parameter-efficient (0.01% of model). Works well for large models (T5-XXL, GPT-3-scale) where the model is already very capable. Fails for smaller models (<1B) that lack the capacity to be steered by soft prompts alone. LoRA preferred for tasks requiring large behavioural changes; Prompt Tuning preferred for tasks solvable with careful prompting, when minimum storage cost is critical.)
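Mechanically, the prompt tuning described in this answer reduces to concatenating trained embeddings in front of the input sequence. A minimal sketch with made-up dimensions and a stand-in embedding table:

```python
import random

d_model, num_virtual_tokens = 8, 4   # toy embedding width and soft-prompt length
random.seed(0)
# The ONLY trainable parameters: a num_virtual_tokens x d_model table of soft-prompt embeddings
soft_prompt = [[random.gauss(0, 0.02) for _ in range(d_model)]
               for _ in range(num_virtual_tokens)]

def embed(token_ids):
    """Stand-in for the frozen model's embedding lookup (illustrative)."""
    return [[float(t)] * d_model for t in token_ids]

input_ids = [101, 2054, 2003]
inputs_embeds = soft_prompt + embed(input_ids)   # prepend, then feed to the frozen model

assert len(inputs_embeds) == num_virtual_tokens + len(input_ids)
# Trainable parameters: 4 * 8 = 32, independent of model size; the model itself never changes
```

Because the trained artifact is just this tiny embedding table, swapping tasks means swapping a few kilobytes of soft prompt while serving one shared frozen model.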