Diffusion models are a class of generative models that learn to generate data (images, audio, video) by learning to reverse a gradual noise-adding process. The model is trained to iteratively denoise random noise into structured data, guided by a text prompt or other conditioning. Diffusion models now produce the highest-quality AI-generated images and power Stable Diffusion, DALL-E 3, Midjourney, and Sora.
The forward and reverse process
Diffusion models are built on a two-stage framework. The forward process gradually destroys a real image by adding Gaussian noise over T steps (typically T=1000). After T steps, the image is pure noise — all structure gone. This process is not learned, just defined mathematically. The reverse process trains a neural network to undo this noise step-by-step:
Forward process: each step adds a small amount of Gaussian noise scaled by β_t (the noise schedule): q(x_t | x_{t−1}) = N(√(1−β_t) · x_{t−1}, β_t · I). β_t increases from ~0.0001 to ~0.02 over T steps, so by step T the signal-to-noise ratio approaches 0.
The reparameterization trick
A key insight: you can sample x_t directly from x_0 without running t sequential steps. x_t = √ᾱ_t · x_0 + √(1−ᾱ_t) · ε, where ε ~ N(0,I), α_i = 1−β_i, and ᾱ_t = ∏_{i=1}^{t} α_i. This means training can target any noise level in a single step, which is crucial for efficient training.
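A minimal sketch of the noise schedule and the one-shot sampling formula above, using the linear β range quoted earlier (the variable names here are illustrative, not from any particular library):

```python
import torch

T = 1000
# Linear schedule from the DDPM paper: beta_t ramps from 1e-4 to 0.02
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)  # alpha_bar_t = prod_{i<=t} (1 - beta_i)

# Sample x_t at an arbitrary timestep in one shot — no sequential loop
x_0 = torch.randn(1, 3, 32, 32)           # stand-in for a real image
t = 500
eps = torch.randn_like(x_0)
x_t = alpha_bar[t].sqrt() * x_0 + (1 - alpha_bar[t]).sqrt() * eps

# Signal fraction collapses toward zero by step T
print(alpha_bar[0].item())    # ≈ 0.9999 — almost all signal
print(alpha_bar[-1].item())   # ≈ 4e-5  — essentially pure noise
```

Note that ᾱ_t multiplies the (1−β_i) factors, not the β_i themselves; the cumulative product is what makes the direct jump from x_0 to x_t possible.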
DDPM training objective
DDPM (Ho et al., 2020) simplified the diffusion training objective to a mean-squared error on predicted noise: L_simple = E_{t, x_0, ε} [ ‖ε − ε_θ(x_t, t)‖² ].
The model ε_θ takes a noisy image x_t and a timestep t, and predicts the noise ε that was added; the loss is the MSE against the true noise. Remarkably simple, yet it produces state-of-the-art image quality.
DDPM training loop skeleton (simplified)
```python
import torch
import torch.nn.functional as F

def ddpm_train_step(model, x_0, noise_schedule):
    """One training step for a diffusion model."""
    batch_size = x_0.shape[0]
    device = x_0.device

    # 1. Sample a random timestep for each image in the batch
    t = torch.randint(0, noise_schedule.T, (batch_size,), device=device)

    # 2. Sample noise
    eps = torch.randn_like(x_0)

    # 3. Create noisy image x_t via the closed-form reparameterization
    alpha_bar_t = noise_schedule.alpha_bar[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * eps

    # 4. Predict the noise (U-Net or DiT backbone)
    eps_pred = model(x_t, t)

    # 5. Simple MSE loss on predicted vs actual noise
    loss = F.mse_loss(eps_pred, eps)
    return loss
```

Latent diffusion: Stable Diffusion's architecture
Running diffusion in pixel space is expensive — a 512×512 image has 786K values. Latent Diffusion Models (LDM, Rombach et al., 2022) run the diffusion process in a compressed latent space using a pretrained VAE:
| Stage | Component | What it does |
|---|---|---|
| Encode | VAE Encoder | 512×512×3 image → 64×64×4 latent (48× compression) |
| Diffuse | U-Net / DiT (denoiser) | Adds/removes noise in 64×64×4 space — 48× cheaper than pixel diffusion |
| Condition | CLIP text encoder + cross-attention | Text prompt → 77 token embeddings injected via cross-attention |
| Decode | VAE Decoder | 64×64×4 denoised latent → 512×512×3 image |
Why latent diffusion dominates
The 48× smaller diffusion space makes training and inference radically cheaper. Stable Diffusion 1.5 (860M params) can run on consumer GPUs in seconds. The VAE quality determines the hard ceiling on detail recovery — SD3 and FLUX improved the VAE from 4 to 16 channels, dramatically improving fine details.
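The shape arithmetic behind that 48× figure can be checked with toy stand-ins for the VAE (the real model uses a trained KL-regularized VAE, not single conv layers; this sketch only demonstrates the dimensionality flow):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the VAE encoder/decoder — illustrative only.
# The point is the shape arithmetic: 512x512x3 pixels -> 64x64x4 latents
# (8x downsampling per side, 4 channels) = 48x fewer values to diffuse over.
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)    # 512 -> 64 per side
decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)

image = torch.randn(1, 3, 512, 512)
latent = encoder(image)
print(latent.shape)              # torch.Size([1, 4, 64, 64])

pixels = image.numel()           # 786,432
latents = latent.numel()         # 16,384
print(pixels / latents)          # 48.0

recon = decoder(latent)
print(recon.shape)               # torch.Size([1, 3, 512, 512])
```

The denoiser only ever sees the 64×64×4 tensor; the expensive pixel-space work is done once by the VAE at the start and end of generation.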
Classifier-free guidance (CFG)
CFG (Ho & Salimans, 2021) is the technique that makes text-to-image models actually follow prompts. The model is trained both with and without conditioning (random prompt dropout). At inference, the conditioned and unconditioned predictions are combined: ε̃ = ε_θ(x_t, ∅) + s · (ε_θ(x_t, c) − ε_θ(x_t, ∅)).
Here c = text condition, ∅ = null condition, s = guidance scale (typically 7–14). Higher s → stronger prompt adherence but less diversity and occasional artifacts ("oversaturation"). Lower s → more creative but may ignore the prompt.
Guidance scale tuning
Guidance scale 7–9: good balance of prompt adherence and natural diversity — best for most use cases. Scale 12–15: maximum prompt fidelity, useful for precise character/object specifications. Scale 1–3: near-unconditional sampling, interesting for artistic exploration. Values > 20 typically cause oversaturated, artifact-heavy outputs.
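The CFG combination itself is one line of arithmetic. A sketch (function name is illustrative; in practice the two predictions usually come from a single batched forward pass):

```python
import torch

def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by the guidance scale."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# scale=1 recovers the plain conditional prediction;
# scale>1 pushes the sample harder toward the prompt.
eps_u = torch.zeros(1, 4, 64, 64)
eps_c = torch.ones(1, 4, 64, 64)
print(cfg_combine(eps_u, eps_c, 1.0).mean().item())   # 1.0
print(cfg_combine(eps_u, eps_c, 7.5).mean().item())   # 7.5
```

Because the formula linearly amplifies the conditional/unconditional difference, large scales also amplify any error in that difference, which is where the oversaturation artifacts come from.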
Sampling speed: from 1000 steps to 4
| Sampler | Steps needed | Key idea | Quality |
|---|---|---|---|
| DDPM | 1000 | Original Markovian reverse process | High — but very slow |
| DDIM | 20–50 | Non-Markovian — deterministic sampling allows skipping steps | High, deterministic (same seed = same image) |
| DPM-Solver++ | 10–20 | High-order solvers for the diffusion ODE | Very high — standard in SD WebUI |
| LCM (Latent Consistency) | 4–8 | Distill multi-step into few-step model | Good — some quality loss vs 20-step |
| Flow Matching (FLUX, SD3) | 4–8 | Straight-line trajectories via optimal transport | State-of-the-art — used in FLUX.1 and SD3 |
What to use in 2025
For highest quality: FLUX.1-dev or Stable Diffusion 3.5 with Flow Matching (8 steps). For speed + quality: DPM-Solver++ with 20 steps on SDXL. For real-time (< 1 second): LCM-LoRA or Turbo distillation variants. Sora and other video models use Diffusion Transformers (DiT) instead of U-Nets — the same principles apply.
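The step-skipping trick behind DDIM (and the samplers built on it) can be sketched as a single deterministic update: recover the model's current estimate of the clean image, then re-noise it directly to the previous (possibly much earlier) timestep. A minimal sketch using the notation from earlier sections:

```python
import torch

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0).

    x0_hat inverts the forward reparameterization; re-noising it to
    alpha_bar_prev is what lets the sampler jump between distant
    timesteps instead of walking all 1000.
    """
    x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()
    return alpha_bar_prev.sqrt() * x0_hat + (1 - alpha_bar_prev).sqrt() * eps_pred

# Sanity check: if eps_pred is the *true* noise, the step lands exactly
# on x_0 re-noised to the earlier level.
x_0 = torch.randn(2, 3, 8, 8)
eps = torch.randn_like(x_0)
a_t, a_prev = torch.tensor(0.5), torch.tensor(0.9)
x_t = a_t.sqrt() * x_0 + (1 - a_t).sqrt() * eps
x_prev = ddim_step(x_t, eps, a_t, a_prev)
expected = a_prev.sqrt() * x_0 + (1 - a_prev).sqrt() * eps
print(torch.allclose(x_prev, expected, atol=1e-5))   # True
```

In real sampling eps_pred comes from the trained ε_θ, so each step is only approximately this accurate, which is why very aggressive step counts trade away some quality.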
Practice questions
- What is the difference between DDPM (Denoising Diffusion Probabilistic Models) and DDIM (Denoising Diffusion Implicit Models)? (Answer: DDPM: adds Gaussian noise over T=1000 steps; generation reverses this by predicting and removing noise at each step. Requires all T steps → slow (1000 forward passes). DDIM: reformulates the denoising as a deterministic ODE (no stochastic sampling). Enables skipping steps — generate in 20–50 steps instead of 1000 with minimal quality loss. Same noise model, different sampling strategy. DDIM is the standard for fast sampling in Stable Diffusion (--steps 20 uses DDIM scheduler).)
- What is the role of the text encoder in text-to-image diffusion models? (Answer: The text encoder (CLIP, T5, or similar) converts the text prompt into a sequence of embeddings. These embeddings are injected into the denoiser (U-Net or DiT) via cross-attention — each image patch attends to the text embeddings to guide denoising. The text encoder determines which semantic concepts the model can represent. Stable Diffusion uses CLIP; DALL-E 3/Imagen uses T5. Text encoder quality strongly determines prompt adherence — this is why jailbreaking SD often attacks the text encoding step.)
- What is the forward process in DDPM and why must each step be small? (Answer: Forward process: q(x_t|x_{t-1}) = N(√(1-β_t) x_{t-1}, β_t I). Each step adds a small amount of Gaussian noise (β_t is small, e.g., 0.0001 to 0.02). After T=1000 steps, x_T ≈ N(0, I) — pure Gaussian noise. Steps must be small so the reverse step (denoising) is also approximately Gaussian — making it learnable. If β_t is large (few big steps), the reverse distribution is not Gaussian and the learned denoiser cannot model it.)
- What is classifier-free guidance (CFG) scale and why does higher CFG improve prompt adherence but reduce diversity? (Answer: CFG output = uncond_pred + scale × (cond_pred - uncond_pred). Scale=1: pure conditional generation. Scale=7.5 (typical): conditional and unconditional predictions are extrapolated toward the conditional — stronger prompt adherence. High scale (15+): over-saturated, oversharped images, less natural variation. The CFG formula amplifies the difference between conditional and unconditional predictions — high scale means any deviation from the prompt is strongly penalised, reducing diversity. Images look 'more like the prompt' but less like natural photographs.)
- What is ControlNet and how does it add spatial control to diffusion models? (Answer: ControlNet (Zhang et al. 2023): adds a trainable copy of the U-Net encoder that processes a control signal (edge map, depth map, pose skeleton, segmentation map). The control encoder's features are added to the main U-Net's features at corresponding resolutions. The original U-Net weights are frozen; only the control encoder is trained. This preserves the original model's generation quality while adding spatial conditioning. Multiple ControlNets can be combined: e.g., pose control + depth control simultaneously for precise scene composition.)