Fine-tuning is the process of taking a pretrained model and continuing to train it on a smaller, task-specific dataset. It adjusts the model's parameters to improve performance on a specific task, domain, or style — building on the general knowledge already learned during pretraining rather than starting from scratch.
Pre-training vs fine-tuning
| Stage | Data | Cost | Goal | Who does it |
|---|---|---|---|---|
| Pretraining | Trillions of tokens of internet text, code, books | $1M–$100M+ in compute | Learn general world knowledge + language | AI labs (OpenAI, Meta, Anthropic, Google) |
| Instruction fine-tuning (SFT) | Thousands–millions of (instruction, response) pairs | $10–$10,000 on cloud GPUs | Teach model to follow instructions helpfully | Labs + companies building on top of base models |
| Alignment fine-tuning (RLHF/DPO) | Human or AI preference pairs | $1,000–$100,000 | Make model safe, helpful, honest | Primarily AI labs |
| Domain fine-tuning | Domain-specific documents + Q&A | $50–$5,000 | Specialize model for a vertical (medical, legal, code) | Companies, researchers, developers |
The LIMA insight
The LIMA paper (2023) demonstrated that just 1,000 carefully curated, high-quality instruction examples produced a model competitive with models trained on 52,000 pairs. Quality matters far more than quantity in SFT data — a finding that reshaped how fine-tuning datasets are built.
Instruction fine-tuning (SFT)
Supervised Fine-Tuning on instruction-following data transforms a base LLM (which just predicts the next token) into an assistant that follows instructions. The data format is simple (instruction, response) pairs:
SFT data format and training with HuggingFace TRL
```python
from datasets import Dataset
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# SFT training data: (instruction, response) pairs in chat format
data = [
    {
        "messages": [
            {"role": "user", "content": "Summarize this text in 2 sentences: [article text]"},
            {"role": "assistant", "content": "The article discusses..."},
        ]
    },
    # ... thousands more high-quality examples
]
dataset = Dataset.from_list(data)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # recent TRL versions; older ones take tokenizer=
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./llama-sft",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch = 16
        learning_rate=2e-5,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        bf16=True,  # bfloat16 — faster, same quality as fp16
        logging_steps=10,
        save_strategy="epoch",
    ),
)
trainer.train()
```

What makes good SFT data
- Diversity — cover the full range of tasks the model should handle.
- Quality over quantity — one human-expert response beats 100 AI-generated ones.
- Correct format — responses should model ideal assistant behaviour (helpful, clear, appropriately concise).
- No contamination — test-set benchmarks must not appear in the training data.
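The contamination point can be checked mechanically by looking for n-gram overlap between training examples and benchmark items. A minimal sketch — the word-level 8-gram threshold here is an illustrative choice, not a standard:

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in a string."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_example, benchmark_examples, n=8):
    """Flag a training example that shares any n-gram with a benchmark item."""
    train_grams = ngrams(train_example, n)
    return any(train_grams & ngrams(b, n) for b in benchmark_examples)

benchmark = ["What is the capital of France? The capital of France is Paris and it is known for the Eiffel Tower."]
clean = "Summarize the quarterly report in two sentences for the board meeting next week please."
leaked = benchmark[0]
```

In practice you would run this over the whole training set before launching a run; fuzzier checks (normalised text, embedding similarity) catch paraphrased leaks that exact n-grams miss.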
Catastrophic forgetting
Fine-tuning improves task performance but can erase general capabilities — the model 'forgets' what it learned during pretraining. This happens because gradient updates that lower loss on the new task are free to raise loss on the original data distribution; nothing in the fine-tuning objective protects old knowledge:
| Mitigation | How it works | Cost | Effectiveness |
|---|---|---|---|
| LoRA / PEFT | Only update small adapter — 99% of weights frozen | Low | ⭐⭐⭐⭐ — best default choice |
| Replay | Mix original pretraining data into fine-tuning batches | Medium (need original data) | ⭐⭐⭐⭐ |
| EWC (Elastic Weight Consolidation) | Penalize changes to weights important for old tasks (via Fisher info) | Medium | ⭐⭐⭐ |
| Low learning rate | Fine-tune with LR 10–100× smaller than pretraining | None | ⭐⭐ |
| Short fine-tuning | Stop early before forgetting accumulates | None | ⭐⭐ |
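The replay mitigation amounts to mixing a slice of the original distribution back into the fine-tuning data so gradients keep "seeing" it. A hedged sketch — the 10% replay ratio is an illustrative assumption, not a recommendation from the table:

```python
import random

def mix_with_replay(finetune_data, pretrain_data, replay_ratio=0.1, seed=0):
    """Add a fraction of original pretraining examples to the fine-tuning mix
    so updates for the new task are balanced against the old distribution."""
    rng = random.Random(seed)
    n_replay = int(len(finetune_data) * replay_ratio)
    replay = rng.sample(pretrain_data, n_replay)
    mixed = finetune_data + replay
    rng.shuffle(mixed)
    return mixed

finetune = [f"task-{i}" for i in range(100)]       # new-task examples
pretrain = [f"general-{i}" for i in range(1000)]   # original-distribution pool
mixed = mix_with_replay(finetune, pretrain)        # 100 task + 10 replayed
```

The main cost, as the table notes, is needing access to (a sample of) the original pretraining data, which is often unavailable for third-party base models.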
The alignment tax
RLHF and SFT can reduce raw capability on knowledge benchmarks (MMLU, HumanEval) even as they improve helpfulness and safety. This is the "alignment tax" — models become more pleasant to talk to but may perform worse on raw capability evals. Careful data curation and LoRA help minimise it.
Full fine-tuning vs PEFT
| Method | Params updated | GPU memory (7B model) | Relative quality | Use case |
|---|---|---|---|---|
| Full fine-tuning | 100% (7B) | ~80GB (FP16 + optimizer states) | 100% | When you have A100/H100 cluster and large dataset |
| LoRA (r=8) | ~0.1% (7M) | ~16GB | 95–98% | Standard choice — great quality/cost ratio |
| QLoRA (4-bit + LoRA) | ~0.1% | ~6GB | 92–95% | Consumer GPU or limited VRAM — democratizes fine-tuning |
| Prefix tuning | ~0.1% (soft tokens) | ~16GB | 85–90% | Rarely used — underperforms LoRA |
| Prompt tuning | <0.01% (prompt tokens) | ~14GB | 80–85% | Only competitive at very large model scale (>10B) |
| Adapter layers | ~0.5–3% | ~17GB | 93–96% | Works well but adds inference latency (can't be merged) |
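The LoRA parameter counts in the table fall out of the low-rank factorization itself: instead of updating a d×d weight W, LoRA trains two small matrices B (d×r) and A (r×d) and uses W + (α/r)·B·A, which can be merged back into W for inference. A from-scratch sketch of the bookkeeping — d=4096, r=8, α=16 are illustrative values, and no training loop is shown:

```python
import numpy as np

d, r, alpha = 4096, 8, 16          # hidden size (7B-class), LoRA rank, scale
W = np.zeros((d, d))               # frozen pretrained weight (stand-in values)
A = np.random.randn(r, d) * 0.01   # trainable, small random init
B = np.zeros((d, r))               # trainable, zero init -> update starts at 0

delta = (alpha / r) * (B @ A)      # effective weight update, mergeable into W
W_eff = W + delta                  # equals W before any training step

full_params = d * d                # params updated by full fine-tuning (per layer)
lora_params = r * d + d * r        # params updated by LoRA (per layer)
print(f"LoRA trains {lora_params / full_params:.2%} of this layer")
```

Because B starts at zero, the model's behaviour is unchanged at step 0, and because the delta is a plain matrix it can be merged into W after training — unlike adapter layers, which add permanent inference latency.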
Domain-specific fine-tuning: real examples
| Domain | Model | Training data | Key result |
|---|---|---|---|
| Code | DeepSeek-Coder, Codestral | The Stack (2.8TB code), GitHub | Outperform GPT-3.5 on HumanEval despite being smaller |
| Medicine | Med-PaLM 2 (Google) | Medical texts, USMLE Q&A, clinical notes | Expert-level performance on USMLE (passing score in all categories) |
| Law | Harvey AI (GPT-4 based) | Legal documents, case law, contracts | Used by Am Law 100 law firms for contract review |
| Finance | BloombergGPT | 363B tokens of financial text + general web | Outperforms general LLMs on financial NLP benchmarks |
| Math | DeepSeek-Math, Mammoth | MATH dataset + synthetic chain-of-thought | SOTA on MATH benchmark — rivals proprietary models |
| Science | Galactica (Meta) | 48M scientific papers, references, code | Excels at scientific question answering and formula generation |
The key insight
Targeted fine-tuning on high-quality domain data often outperforms much larger general models on domain-specific tasks. A 7B model fine-tuned on 50K curated medical Q&A examples can rival far larger general-purpose models on medical licensing exams — at a small fraction (on the order of 1/100th) of the inference cost.
Practice questions
- What is the difference between full fine-tuning, LoRA, and prompt tuning in terms of memory requirements? (Answer: Full fine-tuning: update ALL weights — needs model weights (14GB for 7B in BF16) + gradients (14GB) + Adam optimizer states m and v (28GB + 28GB in FP32) ≈ 84GB for 7B, consistent with the ~80GB in the table above. LoRA (r=16): freeze all weights, add small A,B matrices — needs frozen weights (14GB) + LoRA gradients (~200MB) + LoRA optimizer states (~400MB) ≈ 15GB for 7B. Prompt tuning: add soft prompt tokens and train only their embeddings — needs model weights (inference memory) + ~1MB of prompt parameters. LoRA is the practical sweet spot for single-GPU fine-tuning.)
- What is the learning rate recommendation for LoRA fine-tuning vs full fine-tuning? (Answer: Full fine-tuning: 1e-5 to 5e-5. Small LR needed because all weights (pretrained knowledge) are updated — too large destroys pretraining. LoRA: 1e-4 to 3e-4. Larger LR is appropriate because only the small A,B matrices (which start near zero) are updated — the frozen pretrained weights remain unchanged. LoRA matrices need to learn from scratch, requiring higher LR to converge in reasonable training time.)
- What is catastrophic forgetting and how does fine-tuning mitigate it? (Answer: Catastrophic forgetting: training on new data overwrites gradients from old data, causing the model to lose previously learned capabilities. Full fine-tuning on a narrow dataset (e.g., SQL generation only) can severely degrade general language capabilities. Mitigations: (1) Small learning rate: minimal updates to pretrained knowledge. (2) LoRA: frozen pretrained weights cannot forget. (3) Regularisation toward original weights (EWC). (4) Replay: include some general training examples in the fine-tuning mix. (5) Short training duration (1–3 epochs typically).)
- What is task-specific fine-tuning vs instruction fine-tuning, and which produces a more flexible model? (Answer: Task-specific: fine-tune on (input, output) pairs for ONE task (SQL generation, sentiment classification). Expert in that task but cannot generalise to new instructions. Instruction fine-tuning: train on thousands of diverse tasks expressed as natural language instructions. The model learns to follow novel instructions even for tasks not seen during training. Instruction-tuned models are more flexible deployments (single model handles many tasks) but may underperform narrow specialists on their specific domain.)
- Why do fine-tuned models sometimes perform worse than base models on the target task? (Answer: Common causes: (1) Too few fine-tuning examples (<100): overfitting to the small dataset. (2) Too many epochs: continued training on limited data causes memorisation. (3) LR too high: destructive updates to pretrained knowledge. (4) Training data format mismatch: the model expects a specific prompt format that doesn't match the fine-tuning examples. (5) Evaluation on wrong distribution: the fine-tuned model excels on the training distribution but test examples are formatted differently. Always validate with a held-out test set.)
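The memory arithmetic from the first practice answer can be packaged as a back-of-the-envelope estimator. This sketch assumes BF16 weights and gradients plus FP32 Adam m and v states, and deliberately ignores activations, master weights, and framework overhead, so real usage will be higher:

```python
def full_finetune_memory_gb(n_params_billion, weight_bytes=2, grad_bytes=2,
                            optim_bytes=8):
    """Rough GPU memory for full fine-tuning with Adam:
    BF16 weights (2 B/param) + BF16 grads (2 B/param)
    + FP32 m and v optimizer states (4 + 4 B/param).
    Billions of params x bytes/param gives GB directly."""
    return n_params_billion * (weight_bytes + grad_bytes + optim_bytes)

print(full_finetune_memory_gb(7))   # 7 * 12 bytes/param = 84 GB
print(full_finetune_memory_gb(70))  # same rule for a 70B model
```

Dropping the optimizer-state and gradient terms (freezing the weights, as LoRA does) is exactly why the table's LoRA column collapses to roughly weights-plus-a-sliver.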
On LumiChats
LumiChats' open-source models (LumiCoder-7B, LumiStudy-3B, LumiReason-13B) are all fine-tuned from base open-source models using LoRA on task-specific datasets.