Stable Diffusion is an open-source latent diffusion model for text-to-image generation, originally released by Stability AI in August 2022. It operates in a compressed latent space (rather than pixel space) for computational efficiency, processes noise through a U-Net or DiT denoiser conditioned on CLIP text embeddings, and decodes results through a variational autoencoder. As of 2026, Stable Diffusion XL, SD3-Medium, and the Flux architecture (from former Stability AI researchers) are the dominant open-source image generation options — all freely downloadable and runnable locally.
How latent diffusion works
Stable Diffusion's core insight is to perform the expensive diffusion process in latent space — a representation whose spatial dimensions are 8× smaller than pixel space (so roughly 64× fewer spatial positions) — rather than directly on image pixels. A Variational Autoencoder (VAE) compresses a 512×512 pixel image to a 64×64 latent representation. The diffusion process adds and removes noise in this small latent space, making each denoising step roughly 64× cheaper than pixel-space diffusion while preserving perceptual quality.
The VAE bottleneck: encoder ℰ compresses image x to latent z; decoder 𝒟 reconstructs x from z. Diffusion runs entirely on z.
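The compression factor is easy to verify with back-of-the-envelope arithmetic. This minimal sketch assumes the standard SD 1.x/SDXL VAE configuration (8× spatial downsampling, 4 latent channels), which is not spelled out above:

```python
# Pixel space: a 512x512 RGB image
pixel_elems = 512 * 512 * 3            # 786,432 values

# Latent space: the VAE downsamples each spatial dimension by 8x
# and produces 4 channels, so 512x512 pixels become a 64x64x4 latent.
latent_h, latent_w, latent_c = 512 // 8, 512 // 8, 4
latent_elems = latent_h * latent_w * latent_c   # 16,384 values

# Per-step denoising cost scales with the number of spatial positions:
spatial_ratio = (512 * 512) / (latent_h * latent_w)

print(latent_h, latent_w)            # 64 64
print(spatial_ratio)                 # 64.0 — each step sees 64x fewer positions
print(pixel_elems / latent_elems)    # 48.0 — total element reduction (3 vs 4 channels)
```

The "~64× cheaper" figure in the text refers to the spatial ratio; counting channels, the raw element reduction is 48×.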
| Component | Role | Architecture |
|---|---|---|
| VAE Encoder | Compress 512×512 image → 64×64 latent | CNN-based encoder with KL regularisation |
| CLIP Text Encoder | Convert prompt text → conditioning embedding | ViT-L/14 (SD1.x) or OpenCLIP ViT-H (SDXL) |
| U-Net / DiT Denoiser | Iteratively denoise latent conditioned on text | ResNet U-Net (SD1/2) or DiT (SD3/Flux) |
| VAE Decoder | Convert denoised latent → final pixel image | CNN-based decoder |
| Noise Scheduler | Define forward/reverse noise schedule | DDPM, DDIM, DPM-Solver variants |
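The scheduler row deserves a concrete illustration. The closed-form DDPM forward process lets you jump straight to any timestep t: x_t = √ᾱ_t·x₀ + √(1−ᾱ_t)·ε. A toy NumPy sketch, assuming the common linear beta schedule (1e-4 to 0.02 over 1000 steps, the usual DDPM default, not taken from any specific library):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention factor

rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64, 4))  # a "clean" latent (toy stand-in)

def add_noise(x0, t):
    """Closed-form forward process: noise x0 directly to timestep t."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

# Early timesteps keep almost all signal; late timesteps are nearly pure noise.
print(float(np.sqrt(alphas_bar[10])))   # close to 1.0 (mostly signal)
print(float(np.sqrt(alphas_bar[999])))  # close to 0.0 (mostly noise)
```

The reverse (denoising) process runs this schedule backwards, with the U-Net/DiT predicting ε at each step.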
Run Stable Diffusion XL locally with diffusers — complete working example
```python
from diffusers import StableDiffusionXLPipeline
import torch

# Load SDXL — downloads ~7GB on first run
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
pipe = pipe.to("cuda")  # requires NVIDIA GPU with 8GB+ VRAM

# Optionally enable memory savings (requires the xformers package)
pipe.enable_xformers_memory_efficient_attention()

image = pipe(
    prompt="A photorealistic portrait of a student studying with AI, warm library lighting, shallow depth of field",
    negative_prompt="blurry, distorted, low quality, watermark",
    height=1024,
    width=1024,
    num_inference_steps=30,  # more steps = higher quality, slower
    guidance_scale=7.5,  # how closely to follow the prompt (7–12 typical)
    generator=torch.Generator("cuda").manual_seed(42),  # reproducible seed
).images[0]
image.save("output.png")
```
The Flux architecture — why it matters in 2026
Flux, released in mid-2024 by Black Forest Labs (a company founded by former Stability AI researchers), replaced the U-Net backbone with a Diffusion Transformer (DiT) and redesigned the text-conditioning pipeline. Flux 1.1 Pro achieves Midjourney-competitive photorealism and significantly better prompt adherence than SDXL, and the Schnell variant is Apache 2.0-licensed for commercial use. For developers and researchers, Flux represents the current state of open-source image-generation quality.
| Model | License | Quality | Speed | VRAM requirement |
|---|---|---|---|---|
| SD 1.5 | CreativeML Open RAIL-M | Legacy — 2022 standard | Fast | 4GB minimum |
| SDXL 1.0 | CreativeML Open RAIL-M+ | Good — 2023 standard | Moderate | 8GB minimum |
| SD3-Medium | Stability AI non-commercial | Very good — DiT architecture | Moderate | 10GB minimum |
| Flux.1 Schnell | Apache 2.0 (commercial OK) | Excellent — Midjourney-competitive | Fast (4 steps) | 12GB minimum |
| Flux.1 Dev | Flux non-commercial research | Excellent | Moderate (25 steps) | 16GB minimum |
Why open-source matters for image AI
Running Stable Diffusion or Flux locally means: no per-image cost, no content moderation restrictions, complete privacy (your prompts never leave your hardware), and the ability to fine-tune on your own images. For researchers, artists, and developers building commercial products, local open-source generation is often the correct technical and economic choice over cloud APIs.
Practice questions
- What is the advantage of latent diffusion over pixel-space diffusion for image generation? (Answer: Latent diffusion operates in a compressed latent space (e.g., 64×64 latents for a 512×512 image via the VAE encoder — 64× fewer spatial positions to process). Diffusion steps run at this compressed resolution, dramatically reducing compute. The VAE decoder only runs once to expand to pixel space. Pixel-space diffusion (e.g., DALL-E 2's diffusion decoder) runs at full resolution, costing roughly 64× more compute per step. Stable Diffusion's efficiency comes almost entirely from this latent compression trick.)
- What is classifier-free guidance (CFG) in Stable Diffusion and what does the CFG scale control? (Answer: CFG trains the model both with and without text conditioning. At inference, the denoised output is interpolated: output = uncond_output + scale × (cond_output - uncond_output). High CFG scale (7–15): strong adherence to prompt, more saturated/artistic results, less diversity. Low CFG scale (1–3): more random/creative, weaker prompt following. CFG scale = 7.5 is the typical default. Scale > 20 often produces overexposed, artifacted images.)
- What is the role of the negative prompt in Stable Diffusion and how does it work mathematically? (Answer: The negative prompt replaces the 'unconditional' branch in CFG. Instead of guiding away from the empty-prompt prediction, the model guides away from the noise prediction conditioned on the negative prompt: output = noise_pred(neg) + scale × (noise_pred(pos) - noise_pred(neg)). Note the interpolation is between the two noise predictions, not between the text embeddings themselves. Common negatives like 'blurry, low quality, watermark, deformed hands' steer the denoising away from those regions of the training distribution. Technically, the negative prompt's embedding simply takes the place of the empty-string embedding in the CFG formula.)
- What architectural change does Flux (2024) make over Stable Diffusion XL? (Answer: Flux replaces the U-Net denoiser with a Diffusion Transformer (DiT) built from Multi-Modal Diffusion Transformer (MMDiT) blocks that perform self-attention jointly over text tokens and image patch tokens. This tight integration of text and image context (vs. SD's cross-attention between separate streams) gives Flux dramatically better text rendering and spatial accuracy. Flux also uses flow matching instead of DDPM denoising — simpler and faster to train.)
- Why do Stable Diffusion models sometimes generate images with 6 fingers or anatomically wrong hands? (Answer: Hands are statistically rare and complex in training data — they appear in many orientations and are often partially occluded, blurry, or cropped in photos. The model lacks strong priors about hand anatomy compared to faces. Additionally, pixel-level fine details are hardest to learn from denoising. Current SDXL and Flux models have improved hand generation significantly through curated training data, but complete reliability still requires ControlNet or post-processing for professional use.)
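The CFG and negative-prompt formulas in the answers above can be demonstrated with a toy NumPy sketch. The arrays here stand in for the denoiser's noise predictions at one timestep; names like `noise_uncond` are illustrative, not from any library:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the denoiser's two noise predictions at one timestep:
noise_cond = rng.standard_normal((64, 64, 4))    # conditioned on the positive prompt
noise_uncond = rng.standard_normal((64, 64, 4))  # empty prompt OR the negative prompt

def cfg(noise_uncond, noise_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one."""
    return noise_uncond + scale * (noise_cond - noise_uncond)

guided = cfg(noise_uncond, noise_cond, scale=7.5)

# Sanity checks: scale = 1.0 recovers the plain conditional prediction,
# scale = 0.0 ignores the prompt entirely.
assert np.allclose(cfg(noise_uncond, noise_cond, 1.0), noise_cond)
assert np.allclose(cfg(noise_uncond, noise_cond, 0.0), noise_uncond)
print(guided.shape)  # (64, 64, 4)
```

Scales above 1.0 push the prediction past the conditional output, which is why high guidance values produce oversaturated, prompt-locked images.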
On LumiChats
LumiChats gives you access to DALL-E 3 and Gemini image generation through the platform — use Claude Sonnet 4.6 to craft and refine your image prompts, then generate directly inside LumiChats without switching tools.
Try it free