Fine-tuning is the process of taking a pretrained model and continuing to train it on a smaller, task-specific dataset. It adjusts the model's parameters to improve performance on a specific task, domain, or style — building on the general knowledge already learned during pretraining rather than starting from scratch.
Pre-training vs fine-tuning
| Stage | Data | Cost | Goal | Who does it |
|---|---|---|---|---|
| Pretraining | Trillions of tokens of internet text, code, books | $1M–$100M+ in compute | Learn general world knowledge + language | AI labs (OpenAI, Meta, Anthropic, Google) |
| Instruction fine-tuning (SFT) | Thousands–millions of (instruction, response) pairs | $10–$10,000 on cloud GPUs | Teach model to follow instructions helpfully | Labs + companies building on top of base models |
| Alignment fine-tuning (RLHF/DPO) | Human or AI preference pairs | $1,000–$100,000 | Make model safe, helpful, honest | Primarily AI labs |
| Domain fine-tuning | Domain-specific documents + Q&A | $50–$5,000 | Specialize model for a vertical (medical, legal, code) | Companies, researchers, developers |
The LIMA insight
The LIMA paper (2023) demonstrated that just 1,000 carefully curated, high-quality instruction examples produced a model competitive with models trained on 52,000 pairs. Quality matters far more than quantity in SFT data — a finding that reshaped how fine-tuning datasets are built.
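The practical upshot is to invest in curation, not volume. A minimal sketch of the kind of filtering pass this implies — the heuristics, thresholds, and field names (`instruction`, `response`) here are illustrative assumptions, not LIMA's actual criteria:

```python
def curate_sft_examples(examples):
    """Filter an SFT dataset down to higher-quality examples.

    The heuristics below are illustrative assumptions, not LIMA's real filters.
    """
    seen_instructions = set()
    kept = []
    for ex in examples:
        instr, resp = ex["instruction"], ex["response"]
        # 1. Exact-duplicate instructions add nothing
        key = instr.strip().lower()
        if key in seen_instructions:
            continue
        # 2. Drop trivially short or truncated responses
        if len(resp.split()) < 10 or not resp.rstrip().endswith((".", "!", "?", "`")):
            continue
        # 3. Drop refusal-style boilerplate responses
        if resp.lstrip().lower().startswith(("i cannot", "as an ai")):
            continue
        seen_instructions.add(key)
        kept.append(ex)
    return kept

examples = [
    {"instruction": "Explain LoRA",
     "response": "LoRA adds low-rank adapter matrices to frozen weights, so only a tiny fraction of parameters is trained."},
    {"instruction": "explain lora",
     "response": "Duplicate instruction, dropped because the dedup key matches the first example."},
    {"instruction": "Write a haiku", "response": "Too short"},
]
print(len(curate_sft_examples(examples)))  # → 1
```

Real pipelines add model-based scoring and human review on top, but even cheap mechanical filters like these remove a surprising share of low-value examples.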
Instruction fine-tuning (SFT)
Supervised Fine-Tuning on instruction-following data transforms a base LLM (which just predicts the next token) into an assistant that follows instructions. The data format is simple (instruction, response) pairs:
SFT data format and training with HuggingFace TRL
```python
from datasets import Dataset
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# SFT training data: (instruction, response) pairs
data = [
    {
        "messages": [
            {"role": "user", "content": "Summarize this text in 2 sentences: [article text]"},
            {"role": "assistant", "content": "The article discusses..."},
        ]
    },
    # ... thousands more high-quality examples
]
dataset = Dataset.from_list(data)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./llama-sft",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch = 16
        learning_rate=2e-5,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        bf16=True,  # bfloat16 — faster, same quality as fp16
        logging_steps=10,
        save_strategy="epoch",
    ),
)
trainer.train()
```

What makes good SFT data

(1) Diversity — cover the full range of tasks the model should handle. (2) Quality over quantity — one human-expert response beats 100 AI-generated ones. (3) Correct format — responses should model ideal assistant behavior (helpful, clear, appropriately concise). (4) No contamination — test set benchmarks must not appear in training data.
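Point (4) can be checked mechanically. A rough sketch of a contamination check via n-gram overlap — the 8-gram window is a common but arbitrary choice, and real decontamination pipelines (normalization, fuzzy matching) are more thorough than this:

```python
def ngrams(text, n=8):
    """Set of whitespace-token n-grams, lowercased."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_text, benchmark_texts, n=8):
    """Flag a training example that shares any n-gram with a benchmark item."""
    train_grams = ngrams(train_text, n)
    return any(train_grams & ngrams(b, n) for b in benchmark_texts)

benchmark = ["What is the capital of France? The capital of France is Paris, a city on the Seine."]
leaked = "Q: What is the capital of France? The capital of France is Paris, a city on the Seine."
clean = "Summarize the following article about European geography in two sentences."
print(is_contaminated(leaked, benchmark))  # → True
print(is_contaminated(clean, benchmark))   # → False
```

Running this over the cross product of training examples and benchmark items before training is cheap insurance against inflated eval numbers.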
Catastrophic forgetting
Fine-tuning improves task performance but can erase general capabilities — the model 'forgets' what it knew during pretraining. This happens because gradient descent for the new task increases loss on the original data distribution:
| Mitigation | How it works | Cost | Effectiveness |
|---|---|---|---|
| LoRA / PEFT | Only update small adapter — 99% of weights frozen | Low | ⭐⭐⭐⭐ — best default choice |
| Replay | Mix original pretraining data into fine-tuning batches | Medium (need original data) | ⭐⭐⭐⭐ |
| EWC (Elastic Weight Consolidation) | Penalize changes to weights important for old tasks (via Fisher info) | Medium | ⭐⭐⭐ |
| Low learning rate | Fine-tune with LR 10–100× smaller than pretraining | None | ⭐⭐ |
| Short fine-tuning | Stop early before forgetting accumulates | None | ⭐⭐ |
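The replay row is simple to implement when a sample of the original data is available. A minimal sketch, assuming both datasets are lists of text examples; the 10% replay ratio is an illustrative choice, not a recommendation from any paper:

```python
import random

def mix_with_replay(finetune_data, pretrain_sample, replay_ratio=0.1, seed=0):
    """Build a training set where replay_ratio of the final mix is
    general-domain data drawn from the original distribution."""
    rng = random.Random(seed)
    n_replay = round(len(finetune_data) * replay_ratio / (1 - replay_ratio))
    replay = rng.choices(pretrain_sample, k=n_replay)
    mixed = finetune_data + replay
    rng.shuffle(mixed)
    return mixed

task_data = [f"sql example {i}" for i in range(900)]
general_data = [f"general text {i}" for i in range(10_000)]
mixed = mix_with_replay(task_data, general_data)
print(len(mixed))  # → 1000 (900 task examples + 100 replay examples)
```

The gradient updates on the replayed batch keep loss on the original distribution from drifting, which is exactly the failure mode described above.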
The alignment tax
RLHF and SFT can reduce raw capability on knowledge benchmarks (MMLU, HumanEval) even as they improve helpfulness and safety. This is the "alignment tax" — models become more pleasant to talk to but may perform worse on raw capability evals. Careful data curation and LoRA help minimize it.
Full fine-tuning vs PEFT
| Method | Params updated | GPU memory (7B model) | Relative quality | Use case |
|---|---|---|---|---|
| Full fine-tuning | 100% (7B) | ~80GB (FP16 + optimizer states) | 100% | When you have A100/H100 cluster and large dataset |
| LoRA (r=8) | ~0.1% (7M) | ~16GB | 95–98% | Standard choice — great quality/cost ratio |
| QLoRA (4-bit + LoRA) | ~0.1% | ~6GB | 92–95% | Consumer GPU or limited VRAM — democratizes fine-tuning |
| Prefix tuning | ~0.1% (soft tokens) | ~16GB | 85–90% | Rarely used — underperforms LoRA |
| Prompt tuning | <0.01% (prompt tokens) | ~14GB | 80–85% | Only competitive at very large model scale (>10B) |
| Adapter layers | ~0.5–3% | ~17GB | 93–96% | Works well but adds inference latency (can't be merged) |
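The parameter counts in the LoRA rows follow from the construction: a LoRA adapter freezes the d_out×d_in weight W and trains two low-rank factors B (d_out×r) and A (r×d_in), applied as W + (α/r)·BA. A back-of-the-envelope sketch of why the trainable fraction collapses (the 4096×4096 layer shape is an assumed, Llama-style example):

```python
def lora_param_counts(d_in, d_out, r):
    """Compare trainable parameters: full fine-tuning vs a LoRA adapter.

    Full FT updates the whole d_out x d_in matrix; LoRA trains only
    B (d_out x r) and A (r x d_in).
    """
    full = d_out * d_in
    lora = d_out * r + r * d_in
    return full, lora, lora / full

# A single 4096x4096 attention projection in a Llama-style 7B model:
full, lora, frac = lora_param_counts(4096, 4096, r=8)
print(full)           # full fine-tuning updates all 16,777,216 weights of this layer
print(lora)           # LoRA trains 65,536 parameters for the same layer
print(f"{frac:.4%}")  # ~0.39% of the layer's weights
```

Summed over every targeted projection in the model, this is where the "~0.1% of parameters" figure in the table comes from.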
Domain-specific fine-tuning: real examples
| Domain | Model | Training data | Key result |
|---|---|---|---|
| Code | DeepSeek-Coder, Codestral | The Stack (2.8TB code), GitHub | Outperform GPT-3.5 on HumanEval despite being smaller |
| Medicine | Med-PaLM 2 (Google) | Medical texts, USMLE Q&A, clinical notes | Expert-level performance on USMLE (passing score in all categories) |
| Law | Harvey AI (GPT-4 based) | Legal documents, case law, contracts | Used by Am Law 100 law firms for contract review |
| Finance | BloombergGPT | 363B tokens of financial text + general web | Outperforms general LLMs on financial NLP benchmarks |
| Math | DeepSeek-Math, Mammoth | MATH dataset + synthetic chain-of-thought | SOTA on MATH benchmark — rivals proprietary models |
| Science | Galactica (Meta) | 48M scientific papers, references, code | Excels at scientific question answering and formula generation |
The key insight
Targeted fine-tuning on high-quality domain data often outperforms much larger general models on domain-specific tasks. A 7B model fine-tuned on 50K medical Q&A examples can outperform GPT-4 on medical licensing exams — at 1/100th the inference cost.
Practice questions
- What is the difference between full fine-tuning, LoRA, and prompt tuning in terms of memory requirements? (Answer: Full fine-tuning: update ALL weights — needs: model weights (14GB for 7B BF16) + gradients (14GB) + Adam optimizer states (28GB for FP32) ≈ 56GB for 7B. LoRA (r=16): freeze all weights, add small A,B matrices — needs: frozen weights (14GB) + LoRA gradients (~200MB) + LoRA optimizer (~400MB) ≈ 15GB for 7B. Prompt tuning: add soft prompt tokens, train only their embeddings — needs: model weights (inference memory) + ~1MB for prompt parameters. LoRA is the practical sweet spot for single-GPU fine-tuning.)
- What is the learning rate recommendation for LoRA fine-tuning vs full fine-tuning? (Answer: Full fine-tuning: 1e-5 to 5e-5. Small LR needed because all weights (pretrained knowledge) are updated — too large destroys pretraining. LoRA: 1e-4 to 3e-4. Larger LR is appropriate because only the small A,B matrices (which start near zero) are updated — the frozen pretrained weights remain unchanged. LoRA matrices need to learn from scratch, requiring higher LR to converge in reasonable training time.)
- What is catastrophic forgetting and how does fine-tuning mitigate it? (Answer: Catastrophic forgetting: gradient updates for the new data overwrite weights learned from the old data, causing the model to lose previously learned capabilities. Full fine-tuning on a narrow dataset (e.g., SQL generation only) can severely degrade general language capabilities. Mitigations: (1) Small learning rate: minimal updates to pretrained knowledge. (2) LoRA: frozen pretrained weights cannot forget. (3) Regularization toward original weights (EWC). (4) Replay: include some general training examples in the fine-tuning mix. (5) Short training duration (1–3 epochs typically).)
- What is task-specific fine-tuning vs instruction fine-tuning, and which produces a more flexible model? (Answer: Task-specific: fine-tune on (input, output) pairs for ONE task (SQL generation, sentiment classification). Expert in that task but cannot generalize to new instructions. Instruction fine-tuning: train on thousands of diverse tasks expressed as natural language instructions. The model learns to follow novel instructions even for tasks not seen during training. Instruction-tuned models are more flexible to deploy (a single model handles many tasks) but may underperform narrow specialists on their specific domain.)
- Why do fine-tuned models sometimes perform worse than base models on the target task? (Answer: Common causes: (1) Too few fine-tuning examples (<100): overfitting to the small dataset. (2) Too many epochs: continued training on limited data causes memorization. (3) LR too high: destructive updates to pretrained knowledge. (4) Training data format mismatch: the model expects a specific prompt format that doesn't match the fine-tuning examples. (5) Evaluation on wrong distribution: the fine-tuned model excels on the training distribution but test examples are formatted differently. Always validate with a held-out test set.)
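The memory arithmetic in the first answer can be packaged as a reusable estimate. A rough sketch, assuming BF16 weights and gradients plus FP32 Adam moments (~12 bytes per trainable parameter); activations and framework overhead are ignored, so real figures run higher:

```python
def full_finetune_memory_gb(n_params_b):
    """Rough GPU memory for full fine-tuning, in GB, for a model
    with n_params_b billion parameters.

    BF16 weights (2 bytes) + BF16 gradients (2 bytes)
    + FP32 Adam m and v (4 + 4 bytes) = 12 bytes per parameter.
    """
    return n_params_b * 12

def lora_memory_gb(n_params_b, lora_frac=0.001):
    """Frozen BF16 weights, with gradients and optimizer states
    only for the LoRA parameters (assumed ~0.1% of the model)."""
    return n_params_b * 2 + n_params_b * lora_frac * 12

print(full_finetune_memory_gb(7))   # 84 GB — why full 7B fine-tuning needs A100-class hardware
print(round(lora_memory_gb(7), 1))  # ~14.1 GB — fits a single 24GB consumer GPU
```

Estimates vary with precision choices (FP32 gradients, paged optimizers, etc.), which is why quoted numbers for the same model range from ~56GB to ~80GB+.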
Practical fine-tuning in 2026: Unsloth and the fast path
The biggest shift in fine-tuning tooling since 2023: Unsloth — an open-source library that makes QLoRA training 2–5× faster with 60–80% less memory, via hand-written Triton GPU kernels. A 7B model that took 8 hours on a single A100 now takes under 2 hours.
Fine-tuning Llama 3.1 8B with Unsloth — the 2026 standard for fast QLoRA
```python
# pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# pip install --no-deps trl peft accelerate bitsandbytes
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

# ── 1. Load model (4-bit QLoRA via Unsloth) ──────────────────────────────────
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    dtype=None,         # auto-detect: bfloat16 on Ampere, float16 on older
    load_in_4bit=True,  # QLoRA — ~5GB VRAM vs ~16GB for FP16
)

# ── 2. Add LoRA adapters via Unsloth (optimized PEFT) ────────────────────────
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,  # Unsloth: 0 dropout is optimized for speed
    bias="none",
    use_gradient_checkpointing="unsloth",  # 30% more context length
    random_state=42,
)

# ── 3. Format dataset in Alpaca format ───────────────────────────────────────
dataset = load_dataset("yahma/alpaca-cleaned", split="train")
alpaca_prompt = """Below is an instruction. Write a response.
### Instruction: {}
### Response: {}"""

def format_prompts(examples):
    texts = []
    for instr, output in zip(examples["instruction"], examples["output"]):
        texts.append(alpaca_prompt.format(instr, output) + tokenizer.eos_token)
    return {"text": texts}

dataset = dataset.map(format_prompts, batched=True)

# ── 4. Train ─────────────────────────────────────────────────────────────────
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=SFTConfig(
        output_dir="outputs",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,  # full run: set num_train_epochs=3
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",  # Unsloth: 8-bit Adam — less memory, same quality
        lr_scheduler_type="linear",
    ),
)
trainer.train()

# ── 5. Save and merge ────────────────────────────────────────────────────────
model.save_pretrained("lora_model")  # saves LoRA weights only (~300MB)
model.save_pretrained_merged("merged_model", tokenizer,
                             save_method="merged_16bit")  # merge + save full model
```

Unsloth benchmarks vs standard HuggingFace QLoRA
Unsloth's speed gains: Llama 3.1 8B QLoRA — 2.2× faster training, 71% less VRAM. Llama 3.1 70B QLoRA — 1.8× faster, 63% less VRAM. This makes training a 70B model feasible on a single H100 80GB. No accuracy loss — final model quality is identical, because Unsloth optimizes only the GPU kernels, not the math. Google Colab free tier (T4 GPU) can fine-tune a 7B model in under 3 hours with Unsloth.
| Platform | GPU | Max model (QLoRA) | Cost/hour | Best for |
|---|---|---|---|---|
| Google Colab Free | T4 16GB | 7B models, short training | Free (~40hrs/month) | Learning, prototyping, small datasets |
| Google Colab Pro | A100 40GB | 13B–30B models | $0.45/hr | Serious experiments; most LoRA work |
| RunPod / Vast.ai | H100 80GB | 70B models (QLoRA) | $2–4/hr | Cost-effective production fine-tuning |
| AWS EC2 p4d.24xlarge | 8× A100 40GB | Full fine-tune 70B | $32/hr | Enterprise; multi-GPU distributed training |
| Lambda Labs | H100 80GB | 70B (QLoRA) / 13B (full) | $2.99/hr | Best price/performance for solo researchers |
| Hugging Face AutoTrain | Managed | Up to 70B | $0.60–3/hr | No-code fine-tuning; simplest setup |
OpenAI fine-tuning API: fine-tuning GPT-4o mini in 2026
For teams without ML infrastructure, OpenAI's fine-tuning API lets you fine-tune GPT-4o mini (and GPT-4o) directly — no GPU management, no training code. Best for: enforcing consistent output formats, teaching domain-specific terminology, or replicating a specific response style at scale.
OpenAI fine-tuning API — the no-infrastructure path to a custom GPT-4o mini
```python
from openai import OpenAI
import json

client = OpenAI()

# ── Step 1: Prepare JSONL training data ──────────────────────────────────────
# Each line: one complete conversation with the ideal response
training_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a concise SQL expert. Always return SQL in a fenced code block. Never explain unless asked."},
            {"role": "user", "content": "Get all users who signed up last month"},
            {"role": "assistant", "content": (
                "```sql\n"
                "SELECT * FROM users\n"
                "WHERE created_at >= date_trunc('month', now() - interval '1 month')\n"
                "  AND created_at < date_trunc('month', now());\n"
                "```"
            )},
        ]
    },
    # ... 50+ more high-quality examples (minimum 10, recommended 50–200)
]
with open("training_data.jsonl", "w") as f:
    for ex in training_examples:
        f.write(json.dumps(ex) + "\n")

# ── Step 2: Upload training file ─────────────────────────────────────────────
upload = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# ── Step 3: Create fine-tuning job ───────────────────────────────────────────
job = client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-4o-mini-2024-07-18",  # cheapest; also supports gpt-4o
    hyperparameters={
        "n_epochs": 3,  # 3 epochs default; increase for <50 examples
    }
)
print(f"Job ID: {job.id}")  # monitor at platform.openai.com/finetune

# ── Step 4: Use the fine-tuned model ─────────────────────────────────────────
# (after job status = "succeeded", typically 15–60 minutes)
response = client.chat.completions.create(
    model=job.fine_tuned_model,  # e.g. "ft:gpt-4o-mini:org:name:abc123"
    messages=[{"role": "user", "content": "Count users by country"}]
)
print(response.choices[0].message.content)

# Cost: ~$3 per 1M training tokens.
# A dataset of 100 examples × 500 tokens each × 3 epochs ≈ 150K tokens ≈ $0.45 total.
```

When OpenAI fine-tuning beats open-source LoRA
Use OpenAI fine-tuning when: (1) you have no ML engineering resources, (2) you need the reliability of GPT-4o quality without model serving infrastructure, (3) your dataset is small (<500 examples) and the task is format/style enforcement rather than deep domain knowledge. Use open-source LoRA (Llama 3, Mistral, Qwen) when: you need cost efficiency at high inference volume (self-hosted is much cheaper per call), data privacy requirements prevent sending training data to OpenAI, or you need a model you can run locally.
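The cost-efficiency point can be made concrete with a break-even sketch. All prices and throughput numbers below are illustrative assumptions, not quotes from any provider:

```python
def breakeven_tokens_per_month(api_price_per_m, gpu_hourly, tokens_per_sec):
    """Monthly token volume above which a dedicated GPU beats per-token API pricing.

    Assumes the GPU runs 24/7 at full utilization; returns the break-even
    volume, capped at what the GPU can actually serve in a month.
    All inputs are illustrative assumptions.
    """
    monthly_gpu_cost = gpu_hourly * 24 * 30
    monthly_capacity = tokens_per_sec * 3600 * 24 * 30
    breakeven = monthly_gpu_cost / api_price_per_m * 1_000_000
    return min(breakeven, monthly_capacity)

# Hypothetical: $0.60 per 1M API tokens vs a $2/hr rented GPU serving ~1,000 tok/s
tokens = breakeven_tokens_per_month(api_price_per_m=0.60, gpu_hourly=2.0, tokens_per_sec=1000)
print(f"{tokens / 1e9:.1f}B tokens/month")  # → 2.4B tokens/month
```

Below the break-even volume the API is cheaper (and carries no ops burden); above it, self-hosting the LoRA-tuned open model wins on cost, before even counting the privacy argument.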
On LumiChats
LumiChats' open-source models (LumiCoder-7B, LumiStudy-3B, LumiReason-13B) are all fine-tuned from base open-source models using LoRA on task-specific datasets.