Model Training & Optimization

Practical LLM Fine-Tuning — Unsloth, LoRA & the Modern Stack

Fine-tuning a 7B LLM on a free GPU in under an hour — the complete practical guide.


Definition

Modern LLM fine-tuning uses a streamlined stack: Unsloth (2-4× faster training, 60% less VRAM via custom CUDA kernels), LoRA/QLoRA (train only ~0.1-2% of parameters), and HuggingFace TRL (GRPO, PPO, SFT trainers). A 7B model that previously required 4× A100s (80GB each) can now be fine-tuned on a single 16GB consumer GPU. This democratisation means anyone can customise a state-of-the-art model for their specific domain — legal documents, medical Q&A, custom personas, or specialised code generation.
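To see why so few parameters are trained, the LoRA parameter count can be estimated directly: a rank-r adapter on a d_in × d_out weight adds only r × (d_in + d_out) parameters. The sketch below assumes Llama-2-7B-like dimensions (hidden 4096, intermediate 11008, 32 layers) and the rank-16, seven-target-module configuration used later in this guide; exact figures vary by architecture.

```python
# Sketch: estimate LoRA trainable parameters for a Llama-2-7B-like model.
# Assumed dims: hidden=4096, intermediate=11008, 32 layers, rank r=16.
r = 16
hidden, inter, layers = 4096, 11008, 32

# Each LoRA adapter on a (d_in x d_out) weight adds r * (d_in + d_out) params
attn = 4 * r * (hidden + hidden)                        # q/k/v/o projections
mlp  = 2 * r * (hidden + inter) + r * (inter + hidden)  # gate/up/down projections
lora_params = layers * (attn + mlp)

base_params = 7e9
print(f"LoRA params: {lora_params:,} ({lora_params / base_params:.2%} of base)")
# Roughly 40M adapter params, well under 1% of the 7B base model
```

With these assumed dimensions the adapters come to about 40M parameters, around 0.57% of the base model, which is why LoRA fits in consumer VRAM.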

The complete fine-tuning stack in 2025

Complete supervised fine-tuning with Unsloth (SFT pattern)

# ═══════════════════════════════════════════════════════════
# Complete SFT (Supervised Fine-Tuning) workflow with Unsloth
# Based on patterns from the Granite/Qwen/Llama notebooks
# Runs on free Google Colab T4 (16GB VRAM)
# ═══════════════════════════════════════════════════════════

# Step 0: Install
# pip install unsloth trl transformers datasets bitsandbytes

from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset, Dataset
import torch

# ── Step 1: Load model with automatic optimisations ──
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "unsloth/Llama-3.2-1B-Instruct",  # or Qwen3, Granite4 etc.
    max_seq_length = 2048,        # Max tokens per example
    dtype          = None,        # Auto-detect: BF16 on Ampere+, FP16 on older
    load_in_4bit   = True,        # 4-bit quantisation: 7B uses ~4GB VRAM instead of 14GB
)

# ── Step 2: Add LoRA adapters ──
model = FastLanguageModel.get_peft_model(
    model,
    r=16,               # LoRA rank. 8-64 common. Higher = more params = more capacity
    lora_alpha=16,      # Scaling factor = lora_alpha/r. Often set equal to r.
    lora_dropout=0.0,   # 0 works well for LoRA (unlike vanilla dropout)
    bias="none",        # Recommended: no bias in LoRA layers
    target_modules=[    # Which attention/MLP matrices to add LoRA to
        "q_proj", "k_proj", "v_proj", "o_proj",   # Attention matrices
        "gate_proj", "up_proj", "down_proj",       # MLP matrices
    ],
    use_gradient_checkpointing="unsloth",  # Saves VRAM at cost of slight speed
)

# Show trainable parameter count
model.print_trainable_parameters()
# trainable params: 27,262,976 / 1,235,814,400 = 2.21% (for 1B model)

# ── Step 3: Prepare dataset with chat template ──
# Use standard chat format that matches the model's instruction template
train_data = [
    {
        "messages": [
            {"role": "system",    "content": "You are a helpful SQL expert."},
            {"role": "user",      "content": "How do I get the top 10 customers by revenue?"},
            {"role": "assistant", "content": (
                "Use ORDER BY with LIMIT:\n\n"
                "SELECT customer_id, SUM(amount) AS revenue\n"
                "FROM orders\n"
                "GROUP BY customer_id\n"
                "ORDER BY revenue DESC\n"
                "LIMIT 10;"
            )},
        ]
    },
    # Add thousands more examples...
]
dataset = Dataset.from_list(train_data)

# Apply chat template to format as model expects
def format_chat(example):
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False
    )
    return {"text": text}

dataset = dataset.map(format_chat)

# ── Step 4: Configure and run SFT trainer ──
sft_config = SFTConfig(
    output_dir             = "./sft_output",
    dataset_text_field     = "text",
    max_seq_length         = 2048,
    num_train_epochs       = 3,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,   # Effective batch = 2 × 4 = 8
    warmup_steps           = 10,
    learning_rate          = 2e-4,     # Higher LR for LoRA vs full FT
    weight_decay           = 0.01,
    lr_scheduler_type      = "cosine",
    bf16                   = True,     # BF16 compute (auto on Ampere+)
    fp16                   = False,
    logging_steps          = 10,
    save_steps             = 100,
    report_to              = "none",   # "tensorboard" or "wandb" for tracking
)

trainer = SFTTrainer(
    model         = model,
    tokenizer     = tokenizer,
    train_dataset = dataset,
    args          = sft_config,
)

# ── Step 5: Train ──
trainer.train()

# ── Step 6: Save and use the fine-tuned model ──
# Save LoRA adapters only (tiny: ~50MB for 7B model)
model.save_pretrained("./lora_adapters")
tokenizer.save_pretrained("./lora_adapters")

# For GGUF/Ollama deployment (quantise for local inference)
model.save_pretrained_gguf("./gguf_model", tokenizer,
    quantization_method="q4_k_m")   # 4-bit quantisation for CPU inference

# ── Step 7: Inference with the fine-tuned model ──
FastLanguageModel.for_inference(model)   # Enable 2× faster inference mode
inputs = tokenizer([
    tokenizer.apply_chat_template([
        {"role": "user", "content": "How do I join two tables in SQL?"}
    ], tokenize=False, add_generation_prompt=True)
], return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Vision fine-tuning with Unsloth (from Qwen/PaddleOCR notebooks)

Vision-language model fine-tuning pattern

# Based on Qwen3_5__2B__Vision.ipynb and Qwen3_5__4B__Vision.ipynb notebooks
# Fine-tuning a vision-language model (VLM) on image+text tasks

from unsloth import FastVisionModel   # Vision-specific fast loading

model, tokenizer = FastVisionModel.from_pretrained(
    model_name   = "unsloth/Qwen2-VL-2B-Instruct",  # 2B vision-language model
    max_seq_length = 2048,
    dtype          = None,
    load_in_4bit   = True,
)

# Choose which components receive LoRA adapters
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers   = True,    # Adapt the vision encoder (False keeps it frozen)
    finetune_language_layers = True,    # Adapt the language decoder
    finetune_attention_modules = True,  # LoRA on attention projections
    finetune_mlp_modules     = True,    # LoRA on MLP projections
    r=16, lora_alpha=16,
)

# Vision training dataset format
vision_data = [
    {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": "path/to/invoice.jpg"},
                    {"type": "text",  "text": "Extract all line items and totals from this invoice."}
                ]
            },
            {
                "role": "assistant",
                "content": (
                    "Line Items:\n"
                    "1. Product A: $50.00\n"
                    "2. Service B: $120.00\n"
                    "Total: $170.00"
                )
            }
        ]
    }
]
# Use cases: document OCR, chart understanding, medical imaging, receipt extraction
Model                      Params  VRAM (4-bit)  Task              Notebook
Llama-3.2-1B-Instruct-FP8  1B      ~2GB          Reasoning (GRPO)  Llama_FP8_GRPO.ipynb
Qwen2-VL-2B-Instruct       2B      ~2GB          Vision+Text       Qwen3_5__2B__Vision.ipynb
Qwen2-VL-4B-Instruct       4B      ~4GB          Vision+Text       Qwen3_5__4B__Vision.ipynb
Granite-4.0-2B             2B      ~2GB          Code+Reasoning    Granite4_0.ipynb

Practice questions

  1. A 7B model with load_in_4bit=True uses how much VRAM? (Answer: ~4-5GB. Standard BF16 = 2 bytes × 7B = 14GB. 4-bit = 0.5 bytes × 7B = 3.5GB + overhead ≈ 4-5GB. LoRA adapters add ~200MB. Total fits in a 6-8GB consumer GPU (RTX 3060, 3070).)
  2. What does use_gradient_checkpointing="unsloth" do? (Answer: Instead of storing all intermediate activations in VRAM during the forward pass (needed for backprop), gradient checkpointing recomputes them during the backward pass. Trades compute for memory: ~30-40% more computation but ~60% less VRAM. "unsloth" mode is Unsloth's optimised implementation that saves more memory with less compute overhead.)
  3. Why is learning_rate=2e-4 for LoRA fine-tuning higher than 2e-5 for full fine-tuning? (Answer: LoRA only updates 0.1-2% of parameters. The small LoRA matrices (A and B) start at zero and need a larger learning rate to learn meaningful representations quickly. Full fine-tuning updates all parameters from a good starting point, requiring small LR to avoid catastrophic forgetting.)
  4. What is the difference between saving LoRA adapters vs saving the full merged model? (Answer: LoRA adapters: ~50-200MB (just the small A and B matrices). Load: requires base model + adapter. Merged model: full model with adapters mathematically merged back into W. Load: just one model file. Use adapters for: flexibility (swap adapters), storage efficiency. Use merged for: simple deployment, sharing.)
  5. save_pretrained_gguf with quantization_method="q4_k_m" — what does this produce? (Answer: GGUF format with Q4_K_M quantization (~4.5 bits per weight on average). Compatible with llama.cpp and Ollama for local CPU/GPU inference. A 7B model becomes ~4-5GB. Q4_K_M uses "K-quant" which preserves more precision for important weights. Good balance of size vs quality for local deployment.)
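The weight-memory arithmetic behind question 1 can be checked in a few lines. This is a rough sketch of the bytes-per-weight calculation only; real VRAM use adds activations, optimiser state, and framework overhead on top.

```python
# Rough VRAM math for a 7B-parameter model (weight storage only)
params = 7_000_000_000

bf16_gb = params * 2   / 1e9   # BF16: 2 bytes per weight
int4_gb = params * 0.5 / 1e9   # 4-bit: 0.5 bytes per weight
lora_gb = 0.2                  # LoRA adapters: ~200MB, per question 1

print(f"BF16 weights : {bf16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
print(f"4-bit + LoRA : {int4_gb + lora_gb:.1f} GB")
```

This reproduces the 14GB vs 3.5GB figures from the answer; with adapters and runtime overhead the practical footprint lands in the 4-5GB range quoted above.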

On LumiChats

The fine-tuning pattern described here — Unsloth + LoRA + TRL — is used by thousands of researchers and developers to create custom versions of LLMs. LumiChats Study Mode and domain-specific features are built on the same fine-tuning paradigm applied at production scale.
