
Knowledge Distillation

Teaching a small model to think like a large one — at a fraction of the cost.


Definition

Knowledge distillation is a model compression technique where a small 'student' model is trained to mimic the outputs, intermediate representations, or reasoning patterns of a much larger 'teacher' model. The result is a compact model that captures most of the teacher's capability while being dramatically cheaper to run. In 2026, distillation is the primary technique behind the small language model revolution — Phi-3, Llama 3.2, and DeepSeek's most efficient models are all heavily distilled from larger teachers.

The core idea: soft labels vs. hard labels

Standard supervised training gives a model hard labels: 'this is a cat, label=1.' Distillation uses soft labels — the full probability distribution the teacher assigns to every possible output. These soft distributions contain far more information than a single label: they reveal the teacher's uncertainty, secondary predictions, and the relationships between classes.

The Hinton et al. (2015) distillation loss is a weighted sum of (1) standard cross-entropy with the ground-truth label and (2) the KL divergence between the teacher and student output distributions at temperature T. Setting T > 1 softens both distributions, amplifying the information carried in the non-maximum logits, while a weight α balances the two objectives.
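This loss can be sketched in plain Python (framework-free for readability; the T and alpha defaults below are illustrative, and a real training loop would compute this over batched tensors):

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_index, T=3.0, alpha=0.5):
    """Hinton-style loss: alpha * T^2 * KL(teacher_T || student_T)
    plus (1 - alpha) * cross-entropy against the hard label at T=1."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft_term = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    hard_term = -math.log(softmax(student_logits)[true_index])
    return alpha * T**2 * soft_term + (1 - alpha) * hard_term
```

The T² factor keeps the soft-target gradients on the same scale as the hard-label gradients as T grows.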

| Training signal | Information content | Example |
| --- | --- | --- |
| Hard label (standard training) | Binary — right or wrong | "cat" = 1, everything else = 0 |
| Teacher soft label (distillation) | Rich — reveals relationships | "cat" = 0.92, "lynx" = 0.05, "dog" = 0.02, "car" = 0.001 |
| Intermediate features | Richer — match internal representations layer by layer | Student layer 4 output ≈ teacher layer 12 output (feature distillation) |
| Reasoning traces | Richest — match step-by-step thinking | Student generates the same chain-of-thought steps as the teacher (chain-of-thought distillation) |

Types of distillation used in modern LLMs

| Type | What is matched | How it works | Used in |
| --- | --- | --- | --- |
| Response distillation (black-box) | Final outputs only | Generate teacher outputs; train the student on them as supervised data | DeepSeek-R1 distilled models; most SLMs fine-tuned on GPT-4 outputs |
| Logit distillation (white-box) | Full output probability distributions | Requires access to teacher logits (not just text); uses a KL-divergence loss | Internal lab distillation pipelines; not possible with closed APIs |
| Feature distillation | Intermediate hidden states | Add an auxiliary loss: student layer i output ≈ teacher layer j output | TinyBERT; DistilBERT; efficient vision models |
| Chain-of-thought distillation | Reasoning traces / thinking steps | Fine-tune the student on the teacher's step-by-step reasoning, not just final answers | Key to DeepSeek-R1 distillation; creates small reasoning models |
| Speculative decoding | Functional: student drafts tokens, teacher verifies | Not classic distillation, but uses a small student as a draft model whose tokens the teacher verifies in batches | GPT-4 inference optimization; Llama production serving |
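The feature-distillation row can be made concrete with a small auxiliary loss: project the student's hidden state into the teacher's dimension, then penalize the mean squared difference. A minimal pure-Python sketch with toy dimensions (real implementations use a learned linear projection over batched tensors):

```python
def feature_distillation_loss(student_hidden, teacher_hidden, projection):
    """MSE between the student's hidden state, projected into the teacher's
    dimension, and the teacher's hidden state at a matched layer.
    projection is a (teacher_dim x student_dim) matrix of floats."""
    projected = [sum(w * s for w, s in zip(row, student_hidden))
                 for row in projection]
    return sum((p - t) ** 2
               for p, t in zip(projected, teacher_hidden)) / len(teacher_hidden)
```

In practice this term is added to the main distillation loss with its own weight, and the projection matrix is trained jointly with the student.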

Response distillation pipeline (the most practical form): generate teacher outputs, then fine-tune a small model on them. The sketch below uses trl's SFTTrainer; argument names have shifted across trl releases, so check your installed version's docs.

from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer
import anthropic
import json

# ── Step 1: Generate teacher outputs ─────────────────────────────────────
client = anthropic.Anthropic()

def get_teacher_response(prompt: str) -> str:
    """Get a high-quality response from Claude Sonnet (the teacher)."""
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )
    return msg.content[0].text

# Your domain-specific prompts (e.g. customer support, medical Q&A, legal)
student_prompts = [
    "What are the side effects of ibuprofen?",
    "Explain the difference between a debit card and a credit card.",
    # ... thousands more domain prompts
]

distillation_data = []
for prompt in student_prompts:
    response = get_teacher_response(prompt)
    distillation_data.append({
        "text": f"<|user|>\n{prompt}\n<|assistant|>\n{response}"
    })

# ── Step 2: Fine-tune a small student model on teacher outputs ────────────
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

dataset = Dataset.from_list(distillation_data)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    args=TrainingArguments(
        output_dir="./phi3-distilled-domain",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
        # Use LoRA via peft_config for efficiency — see LoRA article
    ),
)
trainer.train()
# Result: A 3.8B model with domain expertise from a 100B+ teacher.
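One practical detail: at inference time the student should receive prompts in the same chat template it was fine-tuned on. A tiny helper matching the template used in Step 1 above:

```python
def format_for_student(prompt: str, response: str = "") -> str:
    """Wrap a prompt in the same template used during distillation.
    Leave response empty at inference so the student completes it."""
    return f"<|user|>\n{prompt}\n<|assistant|>\n{response}"
```

Mismatched templates between training and inference are a common, silent source of quality loss in distilled models.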

Distillation vs. other compression techniques

| Technique | How it works | Compression ratio | Quality loss | Best for |
| --- | --- | --- | --- | --- |
| Distillation | Train a new, smaller model to mimic the teacher | 10–100× parameter reduction | Low — the student is specifically trained to be accurate | When you can afford the training compute; best overall quality-per-size |
| Quantization | Reduce parameter bit-width (FP32 → INT8 or INT4) | 4–8× memory reduction; same architecture | Minimal with careful calibration | Deploying existing models; no retraining needed; fastest to apply |
| Pruning | Remove individual weights or entire layers below a threshold | 2–10× parameter reduction | Moderate — requires fine-tuning after pruning to recover quality | Structured pruning of specific attention heads or FFN layers |
| Architecture search (NAS) | Automatically find the most efficient architecture for a target | Varies widely | Low — the model is designed to be efficient from scratch | Large-scale production; resource-intensive to run |
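To make the quantization row concrete, here is a minimal sketch of symmetric INT8 weight quantization (toy per-tensor scale; production quantizers calibrate per-channel and handle outliers):

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: a single scale maps floats onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:  # all-zero weights: any positive scale works
        scale = 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the INT8 values."""
    return [q * scale for q in quantized]
```

Each weight now costs 1 byte instead of 4 (FP32), at the price of a small rounding error bounded by half the scale.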

In practice: combine them

Phi-3 Mini was distilled from a larger teacher model and then quantized to INT4 for edge deployment. This stack — distillation for quality compression, then quantization for memory compression — is the standard recipe for deploying capable models on consumer hardware in 2026.

Practice questions

  1. What is 'dark knowledge' and why does it improve student model training? (Answer: Dark knowledge: the probability distribution a teacher model assigns to all classes (soft labels). Example: for an image of a cat, the teacher might output [cat: 0.7, tiger: 0.2, dog: 0.1]. The non-cat probabilities encode the teacher's knowledge about similarity relationships — tiger is more cat-like than dog. Training on hard labels [cat: 1, tiger: 0, dog: 0] loses this similarity structure. Soft labels carry much more information per example, enabling the student to learn better representations with less data.)
  2. What is temperature scaling in knowledge distillation and why is it important? (Answer: Temperature T in soft targets: p_i = exp(z_i/T) / Σ exp(z_j/T). High T flattens the distribution — makes the soft labels even softer, exposing more dark knowledge (tiny probabilities become more visible). Low T sharpens — approaches hard labels. Standard distillation uses T=3–5 for soft targets, T=1 for hard targets. The final loss combines soft loss (at T) + hard loss (at T=1): L = α × T² × KL(teacher_soft, student_soft) + (1-α) × CE(student_logits, hard_labels). T² rescales the soft gradient to match hard gradient magnitude.)
  3. What is the difference between offline distillation, online distillation, and self-distillation? (Answer: Offline: train teacher fully first, then train student on teacher outputs — classic approach. Teacher is fixed throughout student training. Online (mutual learning): teacher and student train simultaneously, sharing knowledge with each other. No pretrained teacher needed — multiple students teach each other. Self-distillation: a model distils knowledge to itself — deeper layers teach shallower layers, or later training epochs teach earlier epochs. Born-Again Networks: retrain same architecture using soft labels from a trained copy, consistently outperforming the original.)
  4. What is feature-based distillation vs response-based distillation? (Answer: Response-based: match only final outputs (logits/probabilities). Simplest but loses intermediate representation information. Feature-based (FitNets, CRD): match intermediate layer activations — student's hidden states should match teacher's hidden states at corresponding depths. Requires projection layers if student and teacher have different hidden dimensions. Feature-based distillation transfers more structural knowledge but is more complex to implement. Combined approaches (match both outputs and features) typically perform best.)
  5. Why is knowledge distillation particularly effective for BERT compression? (Answer: BERT is heavily over-parameterised for most downstream tasks — much of its capacity is not needed. DistilBERT retains 97% of BERT's GLUE performance with 40% fewer parameters and 60% faster inference by distilling from BERT. BERT's multi-head attention naturally produces soft, information-rich targets. Task-specific distillation (after fine-tuning) is more effective than general distillation. Further: TinyBERT distils both intermediate representations and attention matrices — achieving 97% performance with 7.5× compression.)
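The temperature effect described in question 2 is easy to verify numerically (illustrative logits; the commented values are approximate):

```python
import math

def softmax_at_temperature(logits, T):
    """Softmax with temperature T; larger T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [6.0, 2.0, 1.0]                     # a confident teacher prediction
sharp = softmax_at_temperature(logits, 1.0)  # roughly [0.976, 0.018, 0.007]
soft = softmax_at_temperature(logits, 5.0)   # roughly [0.550, 0.247, 0.202]
```

At T=5 the minority-class probabilities become large enough to contribute meaningful gradient signal, which is exactly the "dark knowledge" the student learns from.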

