LoRA (Low-Rank Adaptation, Hu et al., 2021) is a parameter-efficient fine-tuning technique that adds small trainable low-rank matrices to frozen pretrained model layers. Instead of updating all billions of parameters, LoRA updates only 0.01-1% of parameters — enabling high-quality fine-tuning at a fraction of the compute and memory cost. QLoRA extends this with 4-bit quantization.
The LoRA mechanism
For each weight matrix W (d × k) in the model, LoRA inserts two small trainable matrices B (d × r) and A (r × k), where r ≪ min(d, k). Only A and B are trained; W stays frozen. The effective weight becomes W′ = W + BA (scaled in practice by α/r, covered below).
B is initialised to zero so the adapter output is zero at the start of training — the model begins as the original pretrained model. A is randomly initialised. After training, BA can be merged into W with zero inference overhead.
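The mechanism is small enough to sketch from scratch. A minimal NumPy illustration (dimensions and the α/r scale here are illustrative; real implementations attach the adapter inside each linear layer):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 8, 16

W = rng.normal(size=(d, k))               # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, k))   # trainable, random init
B = np.zeros((d, r))                      # trainable, zero init

def lora_forward(x, B, A, scale=alpha / r):
    # base path plus scaled low-rank path; W never receives gradients
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.normal(size=(5, k))

# At initialisation the adapter contributes nothing: output == base model
assert np.allclose(lora_forward(x, B, A), x @ W.T)

# After training, merging W' = W + (alpha/r) * B @ A removes the extra path
B_trained = rng.normal(scale=0.01, size=(d, r))   # stand-in for a trained B
W_merged = W + (alpha / r) * B_trained @ A
assert np.allclose(lora_forward(x, B_trained, A), x @ W_merged.T)
```

The second assertion is the zero-overhead merge: after adding BA into W, a single matrix multiply reproduces the adapted model exactly.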
LoRA with PEFT library — fine-tuning LLaMA 3 8B
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM
import torch
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
lora_config = LoraConfig(
    r=8,                    # rank — higher = more capacity, more params
    lora_alpha=16,          # scaling: effective update = (alpha/r) * BA = 2x
    target_modules=[        # apply LoRA to these projection matrices
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj",      # feedforward (MLP)
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~21M || all params: ~8.0B || trainable%: ~0.26 (r=8 over all seven projections)
# After training, merge adapter into base weights (zero inference overhead)
merged_model = model.merge_and_unload()
Why low-rank works: the weight change hypothesis
The core intuition: the weight changes needed for task adaptation occupy a low-dimensional subspace of the full parameter space. Three pieces of evidence:
| Evidence | Finding | Implication |
|---|---|---|
| Intrinsic dimensionality (Aghajanyan 2020) | Many NLP tasks can be learned by optimising only ~200 parameters in a transformed space | Task adaptation is inherently low-dimensional |
| LoRA paper (Hu 2021) | ΔW matrices during full fine-tuning have very low stable rank (singular value spectrum dominated by top-r values) | LoRA directly captures the low-rank structure — not much information is lost |
| Scaling rank r | Quality plateaus at r=8–16 for most tasks; larger r rarely helps | The intrinsic rank of task adaptation is typically ≤ 16 |
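The second finding can be reproduced on synthetic data: build an update that is a rank-r signal plus small full-rank noise, and the top-r singular values carry nearly all of the energy. A NumPy sketch (the matrix here is synthetic, standing in for a real ΔW):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 256, 256, 8

# Synthetic "fine-tuning update": rank-8 signal plus small full-rank noise,
# mimicking the dominated singular value spectrum Hu et al. report
delta_W = rng.normal(size=(d, r)) @ rng.normal(size=(r, k)) \
          + 0.05 * rng.normal(size=(d, k))

s = np.linalg.svd(delta_W, compute_uv=False)
top_r_energy = (s[:r] ** 2).sum() / (s ** 2).sum()
print(f"top-{r} singular values carry {top_r_energy:.1%} of the energy")
```

A rank-8 factorisation of this matrix discards almost nothing, which is exactly the bet LoRA makes about real ΔW.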
Practical implication
LoRA with r=8 on a 7B model trains roughly 20M parameters (~0.3%) yet typically reaches 95–98% of full fine-tuning quality. The adapter file is tens of megabytes — trivial to distribute, store, and switch between. One base model can host dozens of task-specific LoRA adapters, switched at runtime.
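Adapter switching is cheap precisely because each adapter is just a (B, A) pair over one shared frozen base. A toy NumPy sketch (task names and sizes are hypothetical; PEFT exposes the same idea via `load_adapter`/`set_adapter`):

```python
import numpy as np

rng = np.random.default_rng(1)
d = k = 4096
r, alpha = 8, 16

W_base = rng.normal(scale=0.02, size=(d, k)).astype(np.float32)  # one shared frozen base

# Hypothetical per-task adapters: each is only a small (B, A) pair
adapters = {
    "summarise": (rng.normal(scale=0.01, size=(d, r)), rng.normal(scale=0.01, size=(r, k))),
    "translate": (rng.normal(scale=0.01, size=(d, r)), rng.normal(scale=0.01, size=(r, k))),
}

def effective_weight(task):          # "switching" = picking which delta to apply
    B, A = adapters[task]
    return W_base + (alpha / r) * (B @ A).astype(np.float32)

# Per matrix, an adapter stores r*(d+k) floats vs d*k for the base weight
print(f"adapter/base size ratio: {r * (d + k) / (d * k):.2%}")  # → 0.39%
```

At d = k = 4096 each adapter matrix is under half a percent of its base matrix, which is why dozens of task adapters fit alongside one model.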
LoRA hyperparameters and best practices
| Hyperparameter | Typical values | Effect | Guidance |
|---|---|---|---|
| Rank r | 4, 8, 16, 64 | Capacity of adapter — higher rank = more parameters | Start with r=8; increase to 16–64 only for complex tasks |
| Alpha (α) | Same as r, or 2×r | Scaling: update magnitude = (α/r) × BA | α=r gives scale=1×; α=2r gives 2× — often better in practice |
| Target modules | All attention + FFN projections | Which matrices get adapters | Apply to all linear layers for best quality |
| Dropout | 0.0–0.1 | Regularisation on adapter matrices | Use 0.05 for small datasets; 0.0 for large datasets |
| Learning rate | 1e-4 to 5e-4 | Step size for adapter updates only | Higher LR than full FT is fine — adapters start from zero |
rsLoRA for high ranks
Standard LoRA scales by α/r — as r increases, the update magnitude decreases (instability at high r). rsLoRA (rank-stabilised LoRA) scales by α/√r instead, enabling stable training at r=128 or higher. Useful when a task genuinely needs higher capacity.
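The difference is easy to see numerically: under α/r the adapter's scale collapses as r grows, while α/√r decays far more slowly. A quick comparison with α = 16:

```python
import math

alpha = 16
for r in (8, 32, 128):
    standard = alpha / r           # standard LoRA: scale collapses as rank grows
    rs = alpha / math.sqrt(r)      # rsLoRA: decays much more slowly
    print(f"r={r:<4} standard={standard:.3f}  rsLoRA={rs:.3f}")
```

At r=128 the standard scale has shrunk 16x from r=8, while the rsLoRA scale has only shrunk 4x, keeping gradient magnitudes in a trainable range.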
QLoRA: fine-tuning 70B models on one GPU
QLoRA (Dettmers et al., 2023) combines LoRA with 4-bit quantization, making it possible to fine-tune enormous models on a single consumer GPU:
| Innovation | What it does | Memory saving |
|---|---|---|
| NF4 (Normal Float 4-bit) | Data type optimised for normally-distributed weights — preserves more precision at distribution centre | 4× smaller than FP16 weights |
| Double quantization | Quantise the quantisation constants themselves | Extra ~0.4 bits/parameter saved |
| Paged optimisers | Move optimiser states (Adam moments) to CPU RAM during peak GPU usage | Prevents OOM on memory spikes |
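Blockwise absmax quantisation, the core of the first row, can be sketched in a few lines. This simplified version uses uniform 4-bit levels rather than the actual NF4 code book (which spaces its 16 values to match a normal distribution), so it illustrates the mechanics, not the exact format:

```python
import numpy as np

def quantise_blockwise(w, block=64, half_levels=7):
    """Simplified blockwise absmax 4-bit quantisation. Uniform levels, not the
    real NF4 code book; double quantisation would further compress `absmax`."""
    w = w.reshape(-1, block)
    absmax = np.abs(w).max(axis=1, keepdims=True)   # one FP constant per block
    codes = np.round(w / absmax * half_levels)      # integers in [-7, 7]
    return codes.astype(np.int8), absmax

def dequantise(codes, absmax, half_levels=7):
    return codes / half_levels * absmax             # rebuild approximate weights

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
codes, absmax = quantise_blockwise(w)
w_hat = dequantise(codes, absmax).reshape(-1)
print("max reconstruction error:", np.abs(w - w_hat).max())
```

During a QLoRA forward pass the real implementation dequantises each block to BF16 on the fly, computes, and discards the result, so full-precision weights never sit in memory.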
QLoRA: fine-tune LLaMA 3 70B on a single 48GB GPU
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch
# 4-bit NF4 quantisation config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # Normal Float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in BF16
    bnb_4bit_use_double_quant=True,          # double quantisation
)
# Load 70B model quantised to 4-bit (~35GB instead of 140GB)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto",
)
# Add LoRA adapters in BF16 on top of frozen 4-bit base
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# Peak GPU memory: ~40GB — fits on a single 48GB GPU or, sharded, 4× 12GB consumer GPUs
LoRA vs full fine-tuning vs alternatives
| Method | Params updated | GPU for 7B | Quality vs full FT | Inference overhead |
|---|---|---|---|---|
| Full fine-tuning | 100% | ~80GB (FP16 + Adam) | 100% | None |
| LoRA (r=8) | ~0.5% | ~16GB | 95–98% | None (merge after training) |
| QLoRA (4-bit + LoRA) | ~0.5% | ~6GB | 92–96% | None (merge after training) |
| Adapter layers | ~1–3% | ~17GB | 93–96% | ⚠️ Small latency — can't be merged |
| Prefix tuning | <0.1% | ~14GB | 80–88% | None but wastes context tokens |
| Prompt tuning | <0.01% | ~14GB | 70–85% | None but only works at large scale |
| DoRA (Weight-Decomposed LoRA) | ~0.5% | ~16GB | 97–99% | None (merge after training) |
DoRA in 2025
DoRA (Liu et al., 2024) decomposes weights into magnitude and direction components, applying LoRA only to the direction. This consistently outperforms LoRA by 1–3% on most tasks with identical parameter count. DoRA is now the recommended default in most PEFT use cases where slightly better quality is worth the marginal implementation complexity.
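The decomposition itself is compact: keep a trainable per-column magnitude m, and let a LoRA pair steer only the normalised direction. A minimal NumPy sketch (the α/r scale is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 32, 8

W0 = rng.normal(size=(d, k))                    # frozen pretrained weight
m = np.linalg.norm(W0, axis=0, keepdims=True)   # trainable magnitude: per-column norms
B = np.zeros((d, r))                            # LoRA pair steers the direction only
A = rng.normal(scale=0.01, size=(r, k))

def dora_weight(W0, m, B, A):
    V = W0 + B @ A                                           # low-rank direction update
    V_hat = V / np.linalg.norm(V, axis=0, keepdims=True)     # unit-norm columns
    return m * V_hat                                         # rescale by learned magnitudes

# At init (B = 0) the decomposition reproduces the pretrained weight exactly
assert np.allclose(dora_weight(W0, m, B, A), W0)
```

Decoupling magnitude from direction lets training adjust each independently, which is the property the DoRA paper credits for its edge over plain LoRA.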
Practice questions
- What is the mathematical justification for LoRA's low-rank assumption? (Answer: The hypothesis: weight updates ΔW = W_finetuned - W_pretrained have low intrinsic rank. Empirical evidence (Aghajanyan et al. 2020): fine-tuned models can be represented in low-dimensional intrinsic subspaces. The pretrained model already captures most of the relevant structure; task-specific adaptation requires a low-rank modification. LoRA decomposes ΔW = BA where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, r << min(d,k). Hypothesis confirmed: r=4–16 achieves near full-rank fine-tuning quality on most NLP tasks.)
- How does LoRA merge adapters at inference time for zero overhead? (Answer: LoRA adds ΔW = BA to the frozen W₀. At inference: output = (W₀ + BA)x = W₀x + BAx. Pre-merging: must compute both W₀x and BAx, then add — two matrix multiplications. Post-merging: compute W_merged = W₀ + BA (one-time operation), then output = W_merged × x — single matrix multiplication, identical speed to original model. Merging is a pre-deployment step: load LoRA weights, add to base weights, save. The merged model is indistinguishable from full fine-tuning at serving time.)
- What is the effect of LoRA rank r on model capacity and training stability? (Answer: Low r (1–4): very few trainable parameters (a few million or less for a 7B model, depending on target modules). Fast, memory-efficient, less risk of overfitting. May underfit complex tasks. Medium r (8–32): standard range for most fine-tuning. Balances capacity and efficiency. High r (64–256): approaches full fine-tuning capacity. More parameters but still much cheaper than full FT. Stability: very high r can cause instability if learning rate is not reduced proportionally. Practical guideline: start with r=8–16, tune if underfitting/overfitting observed.)
- What is the difference between applying LoRA to attention weights only vs all linear layers? (Answer: Attention-only LoRA (q_proj, v_proj, k_proj, o_proj): targets the self-attention mechanism — where most task-specific information integration happens. Fewer parameters, faster. Original LoRA paper used q, v projections only. All-linear LoRA (attention + MLP + embeddings): more capacity to adapt. Usually better accuracy on complex tasks requiring deep factual changes. Memory cost: ~4× more LoRA parameters. For instruction following and style: attention-only sufficient. For domain knowledge adaptation (medical, legal): all-linear LoRA recommended.)
- What is QLoRA and how does it enable 65B model fine-tuning on a single GPU? (Answer: QLoRA (Dettmers et al. 2023): (1) Load base model in 4-bit NF4 (Normal Float 4) quantisation — 65B model: ~35GB instead of ~130GB. (2) Add LoRA adapters in BF16. (3) Train only LoRA adapters (base model frozen and quantised). (4) During forward/backward pass: dequantise base weights to BF16 on the fly for computation, then discard. Memory: 35GB (quantised base) + 4GB (LoRA + optimizer states) ≈ 39GB — fits on a single 48GB GPU, as in the original paper. Quantisation adds <1% performance loss vs FP16 LoRA on most tasks.)