Fine-tuning is the process of taking a pretrained model and continuing to train it on a smaller, task-specific dataset. It adjusts the model's parameters to improve performance on a specific task, domain, or style — building on the general knowledge already learned during pretraining rather than starting from scratch.
Pre-training vs fine-tuning
| Stage | Data | Cost | Goal | Who does it |
|---|---|---|---|---|
| Pretraining | Trillions of tokens of internet text, code, books | $1M–$100M+ in compute | Learn general world knowledge + language | AI labs (OpenAI, Meta, Anthropic, Google) |
| Instruction fine-tuning (SFT) | Thousands–millions of (instruction, response) pairs | $10–$10,000 on cloud GPUs | Teach model to follow instructions helpfully | Labs + companies building on top of base models |
| Alignment fine-tuning (RLHF/DPO) | Human or AI preference pairs | $1,000–$100,000 | Make model safe, helpful, honest | Primarily AI labs |
| Domain fine-tuning | Domain-specific documents + Q&A | $50–$5,000 | Specialize model for a vertical (medical, legal, code) | Companies, researchers, developers |
The LIMA insight
The LIMA paper (2023) demonstrated that just 1,000 carefully curated, high-quality instruction examples produced a model competitive with models trained on 52,000 pairs. Quality matters far more than quantity in SFT data — a finding that reshaped how fine-tuning datasets are built.
Instruction fine-tuning (SFT)
Supervised Fine-Tuning on instruction-following data transforms a base LLM (which just predicts the next token) into an assistant that follows instructions. The data format is simple (instruction, response) pairs:
SFT data format and training with HuggingFace TRL
```python
from datasets import Dataset
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# SFT training data: (instruction, response) pairs in chat format
data = [
    {
        "messages": [
            {"role": "user", "content": "Summarize this text in 2 sentences: [article text]"},
            {"role": "assistant", "content": "The article discusses..."},
        ]
    },
    # ... thousands more high-quality examples
]
dataset = Dataset.from_list(data)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # recent TRL versions; older ones take tokenizer=
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./llama-sft",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch = 16
        learning_rate=2e-5,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        bf16=True,  # bfloat16 — faster, same quality as fp16
        logging_steps=10,
        save_strategy="epoch",
    ),
)
trainer.train()
```

What makes good SFT data
- Diversity — cover the full range of tasks the model should handle.
- Quality over quantity — one human-expert response beats 100 AI-generated ones.
- Correct format — responses should model ideal assistant behaviour (helpful, clear, appropriately concise).
- No contamination — test-set benchmarks must not appear in the training data.
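The contamination point can be checked mechanically by looking for n-gram overlap between training examples and benchmark items. A minimal sketch — the word-level 8-gram threshold here is an illustrative choice, not a standard:

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in a string."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_example, benchmark_examples, n=8):
    """Flag a training example that shares any n-gram with a benchmark item."""
    train_grams = ngrams(train_example, n)
    return any(train_grams & ngrams(b, n) for b in benchmark_examples)

benchmark = ["What is the capital of France? The capital of France is Paris and it is known for the Eiffel Tower."]
clean = "Summarize the quarterly report in two sentences for the board meeting next week please."
leaked = benchmark[0]
```

In practice you would run this over the whole training set before launching a run; fuzzier checks (normalised text, embedding similarity) catch paraphrased leaks that exact n-grams miss.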
Catastrophic forgetting
Fine-tuning improves task performance but can erase general capabilities — the model 'forgets' what it learned during pretraining. This happens because gradient updates that lower loss on the new task are free to raise loss on the original data distribution; nothing in the fine-tuning objective protects old knowledge:
| Mitigation | How it works | Cost | Effectiveness |
|---|---|---|---|
| LoRA / PEFT | Only update small adapter — 99% of weights frozen | Low | ⭐⭐⭐⭐ — best default choice |
| Replay | Mix original pretraining data into fine-tuning batches | Medium (need original data) | ⭐⭐⭐⭐ |
| EWC (Elastic Weight Consolidation) | Penalize changes to weights important for old tasks (via Fisher info) | Medium | ⭐⭐⭐ |
| Low learning rate | Fine-tune with LR 10–100× smaller than pretraining | None | ⭐⭐ |
| Short fine-tuning | Stop early before forgetting accumulates | None | ⭐⭐ |
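The replay mitigation amounts to mixing a slice of the original distribution back into the fine-tuning data so gradients keep "seeing" it. A hedged sketch — the 10% replay ratio is an illustrative assumption, not a recommendation from the table:

```python
import random

def mix_with_replay(finetune_data, pretrain_data, replay_ratio=0.1, seed=0):
    """Add a fraction of original pretraining examples to the fine-tuning mix
    so updates for the new task are balanced against the old distribution."""
    rng = random.Random(seed)
    n_replay = int(len(finetune_data) * replay_ratio)
    replay = rng.sample(pretrain_data, n_replay)
    mixed = finetune_data + replay
    rng.shuffle(mixed)
    return mixed

finetune = [f"task-{i}" for i in range(100)]       # new-task examples
pretrain = [f"general-{i}" for i in range(1000)]   # original-distribution pool
mixed = mix_with_replay(finetune, pretrain)        # 100 task + 10 replayed
```

The main cost, as the table notes, is needing access to (a sample of) the original pretraining data, which is often unavailable for third-party base models.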
The alignment tax
RLHF and SFT can reduce raw capability on knowledge benchmarks (MMLU, HumanEval) even as they improve helpfulness and safety. This is the "alignment tax" — models become more pleasant to talk to but may perform worse on raw capability evals. Careful data curation and LoRA help minimise it.
Full fine-tuning vs PEFT
| Method | Params updated | GPU memory (7B model) | Relative quality | Use case |
|---|---|---|---|---|
| Full fine-tuning | 100% (7B) | ~80GB (FP16 + optimizer states) | 100% | When you have A100/H100 cluster and large dataset |
| LoRA (r=8) | ~0.1% (7M) | ~16GB | 95–98% | Standard choice — great quality/cost ratio |
| QLoRA (4-bit + LoRA) | ~0.1% | ~6GB | 92–95% | Consumer GPU or limited VRAM — democratizes fine-tuning |
| Prefix tuning | ~0.1% (soft tokens) | ~16GB | 85–90% | Rarely used — underperforms LoRA |
| Prompt tuning | <0.01% (prompt tokens) | ~14GB | 80–85% | Only competitive at very large model scale (>10B) |
| Adapter layers | ~0.5–3% | ~17GB | 93–96% | Works well but adds inference latency (can't be merged) |
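The LoRA parameter counts in the table fall out of the low-rank factorization itself: instead of updating a d×d weight W, LoRA trains two small matrices B (d×r) and A (r×d) and uses W + (α/r)·B·A, which can be merged back into W for inference. A from-scratch sketch of the bookkeeping — d=4096, r=8, α=16 are illustrative values, and no training loop is shown:

```python
import numpy as np

d, r, alpha = 4096, 8, 16          # hidden size (7B-class), LoRA rank, scale
W = np.zeros((d, d))               # frozen pretrained weight (stand-in values)
A = np.random.randn(r, d) * 0.01   # trainable, small random init
B = np.zeros((d, r))               # trainable, zero init -> update starts at 0

delta = (alpha / r) * (B @ A)      # effective weight update, mergeable into W
W_eff = W + delta                  # equals W before any training step

full_params = d * d                # params updated by full fine-tuning (per layer)
lora_params = r * d + d * r        # params updated by LoRA (per layer)
print(f"LoRA trains {lora_params / full_params:.2%} of this layer")
```

Because B starts at zero, the model's behaviour is unchanged at step 0, and because the delta is a plain matrix it can be merged into W after training — unlike adapter layers, which add permanent inference latency.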
Domain-specific fine-tuning: real examples
| Domain | Model | Training data | Key result |
|---|---|---|---|
| Code | DeepSeek-Coder, Codestral | The Stack (2.8TB code), GitHub | Outperform GPT-3.5 on HumanEval despite being smaller |
| Medicine | Med-PaLM 2 (Google) | Medical texts, USMLE Q&A, clinical notes | Expert-level performance on USMLE (passing score in all categories) |
| Law | Harvey AI (GPT-4 based) | Legal documents, case law, contracts | Used by Am Law 100 law firms for contract review |
| Finance | BloombergGPT | 363B tokens of financial text + general web | Outperforms general LLMs on financial NLP benchmarks |
| Math | DeepSeek-Math, Mammoth | MATH dataset + synthetic chain-of-thought | SOTA on MATH benchmark — rivals proprietary models |
| Science | Galactica (Meta) | 48M scientific papers, references, code | Excels at scientific question answering and formula generation |
The key insight
Targeted fine-tuning on high-quality domain data often outperforms much larger general models on domain-specific tasks. A 7B model fine-tuned on 50K curated medical Q&A examples can rival far larger general-purpose models on medical licensing exams — at a small fraction (on the order of 1/100th) of the inference cost.
Practice questions
- What is the difference between full fine-tuning, LoRA, and prompt tuning in terms of memory requirements? (Answer: Full fine-tuning: update ALL weights — needs model weights (14GB for 7B in BF16) + gradients (14GB) + Adam optimizer states m and v (28GB + 28GB in FP32) ≈ 84GB for 7B, consistent with the ~80GB in the table above. LoRA (r=16): freeze all weights, add small A,B matrices — needs frozen weights (14GB) + LoRA gradients (~200MB) + LoRA optimizer states (~400MB) ≈ 15GB for 7B. Prompt tuning: add soft prompt tokens and train only their embeddings — needs model weights (inference memory) + ~1MB of prompt parameters. LoRA is the practical sweet spot for single-GPU fine-tuning.)
- What is the learning rate recommendation for LoRA fine-tuning vs full fine-tuning? (Answer: Full fine-tuning: 1e-5 to 5e-5. Small LR needed because all weights (pretrained knowledge) are updated — too large destroys pretraining. LoRA: 1e-4 to 3e-4. Larger LR is appropriate because only the small A,B matrices (which start near zero) are updated — the frozen pretrained weights remain unchanged. LoRA matrices need to learn from scratch, requiring higher LR to converge in reasonable training time.)
- What is catastrophic forgetting and how does fine-tuning mitigate it? (Answer: Catastrophic forgetting: training on new data overwrites gradients from old data, causing the model to lose previously learned capabilities. Full fine-tuning on a narrow dataset (e.g., SQL generation only) can severely degrade general language capabilities. Mitigations: (1) Small learning rate: minimal updates to pretrained knowledge. (2) LoRA: frozen pretrained weights cannot forget. (3) Regularisation toward original weights (EWC). (4) Replay: include some general training examples in the fine-tuning mix. (5) Short training duration (1–3 epochs typically).)
- What is task-specific fine-tuning vs instruction fine-tuning, and which produces a more flexible model? (Answer: Task-specific: fine-tune on (input, output) pairs for ONE task (SQL generation, sentiment classification). Expert in that task but cannot generalise to new instructions. Instruction fine-tuning: train on thousands of diverse tasks expressed as natural language instructions. The model learns to follow novel instructions even for tasks not seen during training. Instruction-tuned models are more flexible deployments (single model handles many tasks) but may underperform narrow specialists on their specific domain.)
- Why do fine-tuned models sometimes perform worse than base models on the target task? (Answer: Common causes: (1) Too few fine-tuning examples (<100): overfitting to the small dataset. (2) Too many epochs: continued training on limited data causes memorisation. (3) LR too high: destructive updates to pretrained knowledge. (4) Training data format mismatch: the model expects a specific prompt format that doesn't match the fine-tuning examples. (5) Evaluation on wrong distribution: the fine-tuned model excels on the training distribution but test examples are formatted differently. Always validate with a held-out test set.)
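The memory arithmetic from the first practice answer can be packaged as a back-of-the-envelope estimator. This sketch assumes BF16 weights and gradients plus FP32 Adam m and v states, and deliberately ignores activations, master weights, and framework overhead, so real usage will be higher:

```python
def full_finetune_memory_gb(n_params_billion, weight_bytes=2, grad_bytes=2,
                            optim_bytes=8):
    """Rough GPU memory for full fine-tuning with Adam:
    BF16 weights (2 B/param) + BF16 grads (2 B/param)
    + FP32 m and v optimizer states (4 + 4 B/param).
    Billions of params x bytes/param gives GB directly."""
    return n_params_billion * (weight_bytes + grad_bytes + optim_bytes)

print(full_finetune_memory_gb(7))   # 7 * 12 bytes/param = 84 GB
print(full_finetune_memory_gb(70))  # same rule for a 70B model
```

Dropping the optimizer-state and gradient terms (freezing the weights, as LoRA does) is exactly why the table's LoRA column collapses to roughly weights-plus-a-sliver.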
On LumiChats
LumiChats' open-source models (LumiCoder-7B, LumiStudy-3B, LumiReason-13B) are all fine-tuned from base open-source models using LoRA on task-specific datasets.