Stable Diffusion is an open-source latent diffusion model for text-to-image generation, originally released by Stability AI in August 2022. It operates in a compressed latent space (rather than pixel space) for computational efficiency, processes noise through a U-Net or DiT denoiser conditioned on CLIP text embeddings, and decodes results through a variational autoencoder. As of 2026, Stable Diffusion XL, SD3-Medium, and the Flux architecture (from former Stability AI researchers) are the dominant open-source image generation options — all freely downloadable and runnable locally.
How latent diffusion works
Stable Diffusion's core insight is to perform the expensive diffusion process in latent space — a representation whose spatial dimensions are 8× smaller than pixel space (so roughly 64× fewer spatial positions) — rather than directly on image pixels. A Variational Autoencoder (VAE) compresses a 512×512 pixel image to a 64×64 latent representation. The diffusion process adds and removes noise in this small latent space, making each denoising step roughly 64× cheaper than pixel-space diffusion while preserving perceptual quality.
The VAE bottleneck: encoder ℰ compresses image x to latent z; decoder 𝒟 reconstructs x from z. Diffusion runs entirely on z.
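The compression factor is easy to verify with back-of-the-envelope arithmetic. This minimal sketch assumes the standard SD 1.x/SDXL VAE configuration (8× spatial downsampling, 4 latent channels), which is not spelled out above:

```python
# Pixel space: a 512x512 RGB image
pixel_elems = 512 * 512 * 3            # 786,432 values

# Latent space: the VAE downsamples each spatial dimension by 8x
# and produces 4 channels, so 512x512 pixels become a 64x64x4 latent.
latent_h, latent_w, latent_c = 512 // 8, 512 // 8, 4
latent_elems = latent_h * latent_w * latent_c   # 16,384 values

# Per-step denoising cost scales with the number of spatial positions:
spatial_ratio = (512 * 512) / (latent_h * latent_w)

print(latent_h, latent_w)            # 64 64
print(spatial_ratio)                 # 64.0 — each step sees 64x fewer positions
print(pixel_elems / latent_elems)    # 48.0 — total element reduction (3 vs 4 channels)
```

The "~64× cheaper" figure in the text refers to the spatial ratio; counting channels, the raw element reduction is 48×.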
| Component | Role | Architecture |
|---|---|---|
| VAE Encoder | Compress 512×512 image → 64×64 latent | CNN-based encoder with KL regularisation |
| CLIP Text Encoder | Convert prompt text → conditioning embedding | ViT-L/14 (SD1.x) or OpenCLIP ViT-H (SDXL) |
| U-Net / DiT Denoiser | Iteratively denoise latent conditioned on text | ResNet U-Net (SD1/2) or DiT (SD3/Flux) |
| VAE Decoder | Convert denoised latent → final pixel image | CNN-based decoder |
| Noise Scheduler | Define forward/reverse noise schedule | DDPM, DDIM, DPM-Solver variants |
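The scheduler row deserves a concrete illustration. The closed-form DDPM forward process lets you jump straight to any timestep t: x_t = √ᾱ_t·x₀ + √(1−ᾱ_t)·ε. A toy NumPy sketch, assuming the common linear beta schedule (1e-4 to 0.02 over 1000 steps, the usual DDPM default, not taken from any specific library):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention factor

rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64, 4))  # a "clean" latent (toy stand-in)

def add_noise(x0, t):
    """Closed-form forward process: noise x0 directly to timestep t."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

# Early timesteps keep almost all signal; late timesteps are nearly pure noise.
print(float(np.sqrt(alphas_bar[10])))   # close to 1.0 (mostly signal)
print(float(np.sqrt(alphas_bar[999])))  # close to 0.0 (mostly noise)
```

The reverse (denoising) process runs this schedule backwards, with the U-Net/DiT predicting ε at each step.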
Run Stable Diffusion XL locally with diffusers — complete working example
```python
from diffusers import StableDiffusionXLPipeline
import torch

# Load SDXL — downloads ~7GB on first run
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
pipe = pipe.to("cuda")  # requires NVIDIA GPU with 8GB+ VRAM

# Optionally enable memory savings (requires the xformers package)
pipe.enable_xformers_memory_efficient_attention()

image = pipe(
    prompt="A photorealistic portrait of a student studying with AI, warm library lighting, shallow depth of field",
    negative_prompt="blurry, distorted, low quality, watermark",
    height=1024,
    width=1024,
    num_inference_steps=30,  # more steps = higher quality, slower
    guidance_scale=7.5,  # how closely to follow the prompt (7–12 typical)
    generator=torch.Generator("cuda").manual_seed(42),  # reproducible seed
).images[0]
image.save("output.png")
```
The Flux architecture — why it matters in 2026
Flux, released in mid-2024 by Black Forest Labs (a company founded by former Stability AI researchers), replaced the U-Net backbone with a Diffusion Transformer (DiT) and redesigned the text-conditioning pipeline. Flux 1.1 Pro achieves Midjourney-competitive photorealism and significantly better prompt adherence than SDXL, and the Schnell variant is Apache 2.0-licensed for commercial use. For developers and researchers, Flux represents the current state of open-source image-generation quality.
| Model | License | Quality | Speed | VRAM requirement |
|---|---|---|---|---|
| SD 1.5 | CreativeML Open RAIL-M | Legacy — 2022 standard | Fast | 4GB minimum |
| SDXL 1.0 | CreativeML Open RAIL-M+ | Good — 2023 standard | Moderate | 8GB minimum |
| SD3-Medium | Stability AI non-commercial | Very good — DiT architecture | Moderate | 10GB minimum |
| Flux.1 Schnell | Apache 2.0 (commercial OK) | Excellent — Midjourney-competitive | Fast (4 steps) | 12GB minimum |
| Flux.1 Dev | Flux non-commercial research | Excellent | Moderate (25 steps) | 16GB minimum |
Why open-source matters for image AI
Running Stable Diffusion or Flux locally means: no per-image cost, no content moderation restrictions, complete privacy (your prompts never leave your hardware), and the ability to fine-tune on your own images. For researchers, artists, and developers building commercial products, local open-source generation is often the correct technical and economic choice over cloud APIs.
Practice questions
- What is the advantage of latent diffusion over pixel-space diffusion for image generation? (Answer: Latent diffusion operates in a compressed latent space (e.g., 64×64 latents for a 512×512 image via the VAE encoder — 64× fewer spatial positions to process). Diffusion steps run at this compressed resolution, dramatically reducing compute. The VAE decoder only runs once to expand to pixel space. Pixel-space diffusion (e.g., DALL-E 2's diffusion decoder) runs at full resolution, costing roughly 64× more compute per step. Stable Diffusion's efficiency comes almost entirely from this latent compression trick.)
- What is classifier-free guidance (CFG) in Stable Diffusion and what does the CFG scale control? (Answer: CFG trains the model both with and without text conditioning. At inference, the denoised output is interpolated: output = uncond_output + scale × (cond_output - uncond_output). High CFG scale (7–15): strong adherence to prompt, more saturated/artistic results, less diversity. Low CFG scale (1–3): more random/creative, weaker prompt following. CFG scale = 7.5 is the typical default. Scale > 20 often produces overexposed, artifacted images.)
- What is the role of the negative prompt in Stable Diffusion and how does it work mathematically? (Answer: The negative prompt replaces the 'unconditional' branch in CFG. Instead of guiding away from the empty-prompt prediction, the model guides away from the noise prediction conditioned on the negative prompt: output = noise_pred(neg) + scale × (noise_pred(pos) - noise_pred(neg)). Note the interpolation is between the two noise predictions, not between the text embeddings themselves. Common negatives like 'blurry, low quality, watermark, deformed hands' steer the denoising away from those regions of the training distribution. Technically, the negative prompt's embedding simply takes the place of the empty-string embedding in the CFG formula.)
- What architectural change does Flux (2024) make over Stable Diffusion XL? (Answer: Flux replaces the U-Net denoiser with a Diffusion Transformer (DiT) built from Multi-Modal Diffusion Transformer (MMDiT) blocks that perform self-attention jointly over text tokens and image patch tokens. This tight integration of text and image context (vs. SD's cross-attention between separate streams) gives Flux dramatically better text rendering and spatial accuracy. Flux also uses flow matching instead of DDPM denoising — simpler and faster to train.)
- Why do Stable Diffusion models sometimes generate images with 6 fingers or anatomically wrong hands? (Answer: Hands are statistically rare and complex in training data — they appear in many orientations and are often partially occluded, blurry, or cropped in photos. The model lacks strong priors about hand anatomy compared to faces. Additionally, pixel-level fine details are hardest to learn from denoising. Current SDXL and Flux models have improved hand generation significantly through curated training data, but complete reliability still requires ControlNet or post-processing for professional use.)
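The CFG and negative-prompt formulas in the answers above can be demonstrated with a toy NumPy sketch. The arrays here stand in for the denoiser's noise predictions at one timestep; names like `noise_uncond` are illustrative, not from any library:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the denoiser's two noise predictions at one timestep:
noise_cond = rng.standard_normal((64, 64, 4))    # conditioned on the positive prompt
noise_uncond = rng.standard_normal((64, 64, 4))  # empty prompt OR the negative prompt

def cfg(noise_uncond, noise_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one."""
    return noise_uncond + scale * (noise_cond - noise_uncond)

guided = cfg(noise_uncond, noise_cond, scale=7.5)

# Sanity checks: scale = 1.0 recovers the plain conditional prediction,
# scale = 0.0 ignores the prompt entirely.
assert np.allclose(cfg(noise_uncond, noise_cond, 1.0), noise_cond)
assert np.allclose(cfg(noise_uncond, noise_cond, 0.0), noise_uncond)
print(guided.shape)  # (64, 64, 4)
```

Scales above 1.0 push the prediction past the conditional output, which is why high guidance values produce oversaturated, prompt-locked images.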
On LumiChats
LumiChats gives you access to DALL-E 3 and Gemini image generation through the platform — use Claude Sonnet 4.6 to craft and refine your image prompts, then generate directly inside LumiChats without switching tools.
Try it free