Fine-tuning

Definition

Fine-tuning is the process of taking a pretrained model and continuing to train it on a smaller, task-specific dataset. It adjusts the model's parameters to improve performance on a specific task, domain, or style — building on the general knowledge already learned during pretraining rather than starting from scratch.

Pre-training vs fine-tuning

| Stage | Data | Cost | Goal | Who does it |
|---|---|---|---|---|
| Pretraining | Trillions of tokens of internet text, code, books | $1M–$100M+ in compute | Learn general world knowledge + language | AI labs (OpenAI, Meta, Anthropic, Google) |
| Instruction fine-tuning (SFT) | Thousands–millions of (instruction, response) pairs | $10–$10,000 on cloud GPUs | Teach model to follow instructions helpfully | Labs + companies building on top of base models |
| Alignment fine-tuning (RLHF/DPO) | Human or AI preference pairs | $1,000–$100,000 | Make model safe, helpful, honest | Primarily AI labs |
| Domain fine-tuning | Domain-specific documents + Q&A | $50–$5,000 | Specialize model for a vertical (medical, legal, code) | Companies, researchers, developers |

The LIMA insight

The LIMA paper (2023) demonstrated that just 1,000 carefully curated, high-quality instruction examples produced a model competitive with models trained on 52,000 pairs. Quality matters far more than quantity in SFT data — a finding that reshaped how fine-tuning datasets are built.

Instruction fine-tuning (SFT)

Supervised Fine-Tuning on instruction-following data transforms a base LLM (which just predicts the next token) into an assistant that follows instructions. The data format is simple: (instruction, response) pairs.

SFT data format and training with HuggingFace TRL

from datasets import Dataset
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# SFT training data: (instruction, response) pairs
data = [
    {
        "messages": [
            {"role": "user", "content": "Summarize this text in 2 sentences: [article text]"},
            {"role": "assistant", "content": "The article discusses..."}
        ]
    },
    # ... thousands more high-quality examples
]
dataset = Dataset.from_list(data)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./llama-sft",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,   # effective batch = 16
        learning_rate=2e-5,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        bf16=True,                        # bfloat16 — more stable than fp16 for training
        logging_steps=10,
        save_strategy="epoch",
    ),
)
trainer.train()

What makes good SFT data

(1) Diversity — cover the full range of tasks the model should handle. (2) Quality over quantity — one human-expert response beats 100 AI-generated ones. (3) Correct format — responses should model ideal assistant behavior (helpful, clear, appropriately concise). (4) No contamination — test set benchmarks must not appear in training data.
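
The "no contamination" requirement can be checked mechanically. A minimal sketch (the function names and the n-gram window size are illustrative, not from any specific library): flag training examples that share a long token span with a benchmark's test questions.

```python
def ngrams(text, n):
    """Set of all n-token spans in a whitespace-tokenized text."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def flag_contaminated(train_texts, benchmark_texts, n=13):
    """Indices of training examples sharing any n-gram with a benchmark question."""
    bench = set()
    for t in benchmark_texts:
        bench |= ngrams(t, n)
    return [i for i, t in enumerate(train_texts) if ngrams(t, n) & bench]

train = ["Explain the water cycle to a fifth grader", "What is 2 + 2?"]
bench = ["Explain the water cycle to a fifth grader in simple terms"]
print(flag_contaminated(train, bench, n=5))  # [0]: first example overlaps the benchmark
```

Production decontamination pipelines typically use 10–13-gram overlap on normalized text; the 5-gram window here is only to make the toy example fire.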

Catastrophic forgetting

Fine-tuning improves task performance but can erase general capabilities — the model 'forgets' what it knew during pretraining. This happens because gradient descent for the new task increases loss on the original data distribution:

| Mitigation | How it works | Cost | Effectiveness |
|---|---|---|---|
| LoRA / PEFT | Only update small adapters — 99% of weights frozen | Low | ⭐⭐⭐⭐ — best default choice |
| Replay | Mix original pretraining data into fine-tuning batches | Medium (need original data) | ⭐⭐⭐⭐ |
| EWC (Elastic Weight Consolidation) | Penalize changes to weights important for old tasks (via Fisher info) | Medium | ⭐⭐⭐ |
| Low learning rate | Fine-tune with LR 10–100× smaller than pretraining | None | ⭐⭐ |
| Short fine-tuning | Stop early, before forgetting accumulates | None | ⭐⭐ |
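
Of these, replay is the simplest to implement. A minimal sketch (function name and the 10% ratio are illustrative choices, not a standard API): mix a slice of general-purpose data into the fine-tuning set so gradients keep anchoring the original distribution.

```python
import random

def build_replay_mixture(task_examples, general_examples, replay_fraction=0.1, seed=0):
    """Task data plus a replay slice of general data, shuffled together.

    replay_fraction is relative to the task set: 0.1 means one general
    example per ten task examples.
    """
    rng = random.Random(seed)
    n_replay = min(int(len(task_examples) * replay_fraction), len(general_examples))
    mixed = list(task_examples) + rng.sample(general_examples, n_replay)
    rng.shuffle(mixed)
    return mixed

task = [("sql", i) for i in range(100)]
general = [("general", i) for i in range(1000)]
mixed = build_replay_mixture(task, general)
print(len(mixed))  # 110: 100 task examples + 10 replay examples
```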

The alignment tax

RLHF and SFT can reduce raw capability on knowledge benchmarks (MMLU, HumanEval) even as they improve helpfulness and safety. This is the "alignment tax" — models become more pleasant to talk to but may perform worse on raw capability evals. Careful data curation and LoRA help minimize it.
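
The tax is easy to quantify if you score the same held-out eval before and after tuning. A toy sketch with stubbed models (the answer dictionaries stand in for real model calls; the numbers are illustrative, not measured):

```python
def accuracy(answer_fn, eval_set):
    """Fraction of (question, gold) pairs the model answers exactly right."""
    return sum(answer_fn(q) == gold for q, gold in eval_set) / len(eval_set)

eval_set = [("2+2?", "4"), ("Capital of France?", "Paris")]

base_model  = lambda q: {"2+2?": "4", "Capital of France?": "Paris"}[q]  # stub
tuned_model = lambda q: {"2+2?": "4", "Capital of France?": "Lyon"}[q]   # stub

tax = accuracy(base_model, eval_set) - accuracy(tuned_model, eval_set)
print(f"alignment tax: {tax:.0%}")  # 50% in this toy example
```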

Full fine-tuning vs PEFT

| Method | Params updated | GPU memory (7B model) | Relative quality | Use case |
|---|---|---|---|---|
| Full fine-tuning | 100% (7B) | ~80GB (FP16 + optimizer states) | 100% | When you have an A100/H100 cluster and a large dataset |
| LoRA (r=8) | ~0.1% (7M) | ~16GB | 95–98% | Standard choice — great quality/cost ratio |
| QLoRA (4-bit + LoRA) | ~0.1% | ~6GB | 92–95% | Consumer GPU or limited VRAM — democratizes fine-tuning |
| Prefix tuning | ~0.1% (soft tokens) | ~16GB | 85–90% | Rarely used — underperforms LoRA |
| Prompt tuning | <0.01% (prompt tokens) | ~14GB | 80–85% | Only competitive at very large model scale (>10B) |
| Adapter layers | ~0.5–3% | ~17GB | 93–96% | Works well but adds inference latency (can't be merged) |
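
The parameter counts in the table follow directly from the LoRA construction W' = W + (α/r)·BA. A numpy sketch for one 4096×4096 projection matrix (shapes typical of a 7B model; whole-model fractions are lower still, since embeddings and norms stay frozen):

```python
import numpy as np

d, k, r, alpha = 4096, 4096, 8, 16

W = np.random.randn(d, k).astype(np.float32)        # frozen pretrained weight
A = np.random.randn(r, k).astype(np.float32) * 0.01  # small random init
B = np.zeros((d, r), dtype=np.float32)               # B starts at zero, so W' == W at step 0

W_eff = W + (alpha / r) * (B @ A)  # effective weight after merging the adapter

full_params = d * k
lora_params = d * r + r * k
print(f"{lora_params:,} trainable vs {full_params:,} ({lora_params/full_params:.2%})")
# 65,536 trainable vs 16,777,216 (0.39%)
```

Because B is initialized to zero, the adapted model is exactly the pretrained model at the start of training, which is why LoRA can safely use larger learning rates.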

Domain-specific fine-tuning: real examples

| Domain | Model | Training data | Key result |
|---|---|---|---|
| Code | DeepSeek-Coder, Codestral | The Stack (2.8TB code), GitHub | Outperform GPT-3.5 on HumanEval despite being smaller |
| Medicine | Med-PaLM 2 (Google) | Medical texts, USMLE Q&A, clinical notes | Expert-level performance on USMLE (passing score in all categories) |
| Law | Harvey AI (GPT-4 based) | Legal documents, case law, contracts | Used by Am Law 100 law firms for contract review |
| Finance | BloombergGPT | 363B tokens of financial text + general web | Outperforms general LLMs on financial NLP benchmarks |
| Math | DeepSeek-Math, Mammoth | MATH dataset + synthetic chain-of-thought | SOTA on MATH benchmark — rivals proprietary models |
| Science | Galactica (Meta) | 48M scientific papers, references, code | Excels at scientific question answering and formula generation |

The key insight

Targeted fine-tuning on high-quality domain data often outperforms much larger general models on domain-specific tasks. A 7B model fine-tuned on 50K medical Q&A examples can outperform GPT-4 on medical licensing exams — at 1/100th the inference cost.

Practice questions

  1. What is the difference between full fine-tuning, LoRA, and prompt tuning in terms of memory requirements? (Answer: Full fine-tuning: update ALL weights — needs model weights (14GB for 7B in BF16) + gradients (14GB) + Adam optimizer states (two FP32 states ≈ 56GB) ≈ 84GB for 7B, matching the ~80GB figure above. LoRA (r=16): freeze all weights, add small A,B matrices — needs frozen weights (14GB) + LoRA gradients (~200MB) + LoRA optimizer states (~400MB) ≈ 15GB for 7B. Prompt tuning: add soft prompt tokens and train only their embeddings — needs model weights (inference memory) + ~1MB for prompt parameters. LoRA is the practical sweet spot for single-GPU fine-tuning.)
  2. What is the learning rate recommendation for LoRA fine-tuning vs full fine-tuning? (Answer: Full fine-tuning: 1e-5 to 5e-5. Small LR needed because all weights (pretrained knowledge) are updated — too large destroys pretraining. LoRA: 1e-4 to 3e-4. Larger LR is appropriate because only the small A,B matrices (which start near zero) are updated — the frozen pretrained weights remain unchanged. LoRA matrices need to learn from scratch, requiring higher LR to converge in reasonable training time.)
  3. What is catastrophic forgetting and how does fine-tuning mitigate it? (Answer: Catastrophic forgetting: training on new data overwrites gradients from old data, causing the model to lose previously learned capabilities. Full fine-tuning on a narrow dataset (e.g., SQL generation only) can severely degrade general language capabilities. Mitigations: (1) Small learning rate: minimal updates to pretrained knowledge. (2) LoRA: frozen pretrained weights cannot forget. (3) Regularization toward original weights (EWC). (4) Replay: include some general training examples in the fine-tuning mix. (5) Short training duration (1–3 epochs typically).)
  4. What is task-specific fine-tuning vs instruction fine-tuning, and which produces a more flexible model? (Answer: Task-specific: fine-tune on (input, output) pairs for ONE task (SQL generation, sentiment classification). Expert in that task but cannot generalize to new instructions. Instruction fine-tuning: train on thousands of diverse tasks expressed as natural language instructions. The model learns to follow novel instructions even for tasks not seen during training. Instruction-tuned models are more flexible deployments (single model handles many tasks) but may underperform narrow specialists on their specific domain.)
  5. Why do fine-tuned models sometimes perform worse than base models on the target task? (Answer: Common causes: (1) Too few fine-tuning examples (<100): overfitting to the small dataset. (2) Too many epochs: continued training on limited data causes memorization. (3) LR too high: destructive updates to pretrained knowledge. (4) Training data format mismatch: the model expects a specific prompt format that doesn't match the fine-tuning examples. (5) Evaluation on wrong distribution: the fine-tuned model excels on the training distribution but test examples are formatted differently. Always validate with a held-out test set.)
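
The memory figures in question 1 can be reproduced with back-of-the-envelope arithmetic (activations and framework overhead excluded, so real usage runs higher):

```python
def full_ft_memory_gb(params_b):
    """bf16 weights + bf16 grads + two fp32 Adam states, in GB per billion params."""
    return params_b * 2 + params_b * 2 + params_b * 8

def lora_memory_gb(params_b, trainable_fraction=0.001):
    """Frozen bf16 weights; grads + Adam states only for the LoRA parameters."""
    t = params_b * trainable_fraction
    return params_b * 2 + t * 2 + t * 8

print(f"full FT, 7B: ~{full_ft_memory_gb(7):.0f} GB")  # ~84 GB
print(f"LoRA,    7B: ~{lora_memory_gb(7):.1f} GB")     # ~14.1 GB plus activations
```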

Practical fine-tuning in 2026: Unsloth and the fast path

The biggest shift in fine-tuning tooling since 2023: Unsloth — an open-source library that makes QLoRA training 2–5× faster with 60–80% less memory, by hand-optimizing CUDA kernels. A 7B model that took 8 hours on a single A100 now takes under 2 hours.

Fine-tuning Llama 3.1 8B with Unsloth — the 2026 standard for fast QLoRA

# pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# pip install --no-deps trl peft accelerate bitsandbytes

from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

# ── 1. Load model (4-bit QLoRA via Unsloth) ──────────────────────────────────
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    dtype=None,           # auto-detect: bfloat16 on Ampere, float16 on older
    load_in_4bit=True,    # QLoRA — ~5GB VRAM vs ~16GB for FP16
)

# ── 2. Add LoRA adapters via Unsloth (optimized PEFT) ────────────────────────
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                 # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,       # Unsloth: 0 dropout is optimized for speed
    bias="none",
    use_gradient_checkpointing="unsloth",   # 30% more context length
    random_state=42,
)

# ── 3. Format dataset in ChatML / Alpaca format ───────────────────────────────
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

alpaca_prompt = """Below is an instruction. Write a response.
### Instruction: {}
### Response: {}"""

def format_prompts(examples):
    texts = []
    for instr, output in zip(examples["instruction"], examples["output"]):
        texts.append(alpaca_prompt.format(instr, output) + tokenizer.eos_token)
    return {"text": texts}

dataset = dataset.map(format_prompts, batched=True)

# ── 4. Train ──────────────────────────────────────────────────────────────────
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=SFTConfig(
        output_dir="outputs",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,          # full run: set num_train_epochs=3
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",    # Unsloth: 8-bit Adam — less memory, same quality
        lr_scheduler_type="linear",
    ),
)

trainer.train()

# ── 5. Save and merge ─────────────────────────────────────────────────────────
model.save_pretrained("lora_model")         # saves LoRA weights only (~300MB)
model.save_pretrained_merged("merged_model", tokenizer,
                             save_method="merged_16bit")  # merge + save full model

Unsloth benchmarks vs standard HuggingFace QLoRA

Unsloth's speed gains: Llama 3.1 8B QLoRA — 2.2× faster training, 71% less VRAM. Llama 3.1 70B QLoRA — 1.8× faster, 63% less VRAM. This makes training a 70B model feasible on a single H100 80GB. No accuracy loss — identical final model quality because Unsloth only optimizes the CUDA kernels, not the math. Google Colab free tier (T4 GPU) can fine-tune a 7B model in under 3 hours with Unsloth.

| Platform | GPU | Max model (QLoRA) | Cost/hour | Best for |
|---|---|---|---|---|
| Google Colab Free | T4 16GB | 7B models, short training | Free (~40hrs/month) | Learning, prototyping, small datasets |
| Google Colab Pro | A100 40GB | 13B–30B models | $0.45/hr | Serious experiments; most LoRA work |
| RunPod / Vast.ai | H100 80GB | 70B models (QLoRA) | $2–4/hr | Cost-effective production fine-tuning |
| AWS EC2 p4d.24xlarge | 8× A100 40GB | Full fine-tune 70B | $32/hr | Enterprise; multi-GPU distributed training |
| Lambda Labs | H100 80GB | 70B (QLoRA) / 13B (full) | $2.99/hr | Best price/performance for solo researchers |
| Hugging Face AutoTrain | Managed | Up to 70B | $0.60–3/hr | No-code fine-tuning; simplest setup |
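
A rough cost estimate for any row in the table is just throughput × time × hourly rate. A sketch with assumed numbers (3,000 tokens/sec is a plausible QLoRA throughput for a 7B model on one H100, not a measured figure):

```python
def job_cost_usd(hourly_rate, total_tokens, tokens_per_sec):
    """Hours of training at a given throughput, priced at the hourly GPU rate."""
    hours = total_tokens / tokens_per_sec / 3600
    return hours * hourly_rate, hours

# Assumed: 100M training tokens at the $2.99/hr Lambda H100 rate
cost, hours = job_cost_usd(hourly_rate=2.99, total_tokens=100e6, tokens_per_sec=3000)
print(f"~{hours:.1f} h, ~${cost:.0f}")  # ~9.3 h, ~$28
```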

OpenAI fine-tuning API: fine-tuning GPT-4o mini in 2026

For teams without ML infrastructure, OpenAI's fine-tuning API lets you fine-tune GPT-4o mini (and GPT-4o) directly — no GPU management, no training code. Best for: enforcing consistent output formats, teaching domain-specific terminology, or replicating a specific response style at scale.

OpenAI fine-tuning API — the no-infrastructure path to a custom GPT-4o mini

from openai import OpenAI
import json

client = OpenAI()

# ── Step 1: Prepare JSONL training data ─────────────────────────────────────
# Each line: one complete conversation with the ideal response
training_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a concise SQL expert. Always return SQL in a fenced code block. Never explain unless asked."},
            {"role": "user", "content": "Get all users who signed up last month"},
            {"role": "assistant", "content": "```sql\nSELECT * FROM users\nWHERE created_at >= date_trunc('month', now() - interval '1 month')\n  AND created_at < date_trunc('month', now());\n```"}
        ]
    },
    # ... 50+ more high-quality examples (minimum 10, recommended 50–200)
]

with open("training_data.jsonl", "w") as f:
    for ex in training_examples:
        f.write(json.dumps(ex) + "\n")

# ── Step 2: Upload training file ──────────────────────────────────────────────
upload = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# ── Step 3: Create fine-tuning job ────────────────────────────────────────────
job = client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-4o-mini-2024-07-18",    # cheapest; also supports gpt-4o
    hyperparameters={
        "n_epochs": 3,                  # 3 epochs default; increase for <50 examples
    }
)

print(f"Job ID: {job.id}")  # monitor at platform.openai.com/finetune

# ── Step 4: Use the fine-tuned model ─────────────────────────────────────────
# (after job status = "succeeded", typically 15–60 minutes)
response = client.chat.completions.create(
    model=job.fine_tuned_model,   # e.g. "ft:gpt-4o-mini:org:name:abc123"
    messages=[{"role": "user", "content": "Count users by country"}]
)
print(response.choices[0].message.content)

# Cost: ~$3 per 1M training tokens.  
# A dataset of 100 examples × 500 tokens each × 3 epochs ≈ 150K tokens ≈ $0.45 total.

When OpenAI fine-tuning beats open-source LoRA

Use OpenAI fine-tuning when: (1) you have no ML engineering resources, (2) you need the reliability of GPT-4o quality without model serving infrastructure, (3) your dataset is small (<500 examples) and the task is format/style enforcement rather than deep domain knowledge. Use open-source LoRA (Llama 3, Mistral, Qwen) when: you need cost efficiency at high inference volume (self-hosted is much cheaper per call), data privacy requirements prevent sending training data to OpenAI, or you need a model you can run locally.

On LumiChats

LumiChats' open-source models (LumiCoder-7B, LumiStudy-3B, LumiReason-13B) are all fine-tuned from base open-source models using LoRA on task-specific datasets.
