A Generative Adversarial Network (GAN) is a generative model consisting of two neural networks trained in opposition: a Generator that creates synthetic data samples, and a Discriminator that distinguishes real from generated data. Through this adversarial game, the Generator learns to produce increasingly realistic outputs. GANs produced the first photorealistic AI-generated faces and drove the early generative AI revolution.
The adversarial training game
A GAN pits two networks against each other. The Generator G maps random noise z to fake data. The Discriminator D tries to tell real from fake. They play a minimax game:

min_G max_D V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 − D(G(z)))]

This is the original GAN objective (Goodfellow et al., 2014). D maximizes its ability to detect fakes; G minimizes D's success. At the Nash equilibrium, G produces samples indistinguishable from real data and D outputs 0.5 everywhere.
Nash equilibrium in practice
The theoretical optimum is never actually reached. GAN training is notoriously unstable because D and G must improve in balance: if D becomes too strong too fast, the gradients flowing to G vanish and it stops learning; if G improves too fast, D cannot keep up. Careful architecture design and loss function choice mitigate this.
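The alternating update scheme behind the minimax game can be sketched as a minimal PyTorch loop. This is an illustrative toy example, not a production recipe: the tiny MLP networks and the 1-D Gaussian "real" data are assumptions made for brevity, and the generator uses the common non-saturating loss (maximize log D(G(z))) precisely to avoid the vanishing-gradient problem described above.

```python
import torch
import torch.nn as nn

# Toy setup: learn to mimic 1-D samples from N(2, 0.5).
latent_dim = 8
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.LeakyReLU(0.2), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(200):
    real = torch.randn(64, 1) * 0.5 + 2.0
    z = torch.randn(64, latent_dim)
    fake = G(z)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    opt_d.zero_grad()
    d_loss = (bce(D(real), torch.ones(64, 1))
              + bce(D(fake.detach()), torch.zeros(64, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator step: non-saturating loss, maximize log D(G(z))
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(64, 1))
    g_loss.backward()
    opt_g.step()
```

Note the `detach()` in the discriminator step: it blocks gradients from flowing into G while D updates, which is what makes the two updates genuinely alternating.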
GAN loss functions and training stability
The original GAN loss suffers from vanishing gradients when the discriminator is too confident. Wasserstein GAN (WGAN) replaced it with the Earth Mover (Wasserstein-1) distance:

min_G max_{D ∈ 1-Lipschitz} E_{x~p_r}[D(x)] − E_{z~p_z}[D(G(z))]

Here D is a critic (not constrained to [0, 1]) that estimates the Wasserstein-1 distance between the real and generated distributions. The critic must be 1-Lipschitz, enforced via weight clipping (WGAN) or a gradient penalty (WGAN-GP).
| Variant | Key idea | Solves | Widely used |
|---|---|---|---|
| Vanilla GAN | Binary cross-entropy | Baseline | ❌ Unstable |
| WGAN | Earth Mover distance + weight clipping | Vanishing gradients | ⚠️ Clipping harms quality |
| WGAN-GP | Gradient penalty instead of clipping | Stable, meaningful loss metric | ✅ Standard baseline |
| StyleGAN 2/3 | R1 regularization + path length regularization | High-quality face synthesis | ✅ SOTA for faces |
| BigGAN | Large batch + class conditioning | High-res diverse image generation | ✅ ImageNet generation |
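The gradient penalty that makes WGAN-GP the standard baseline in the table above can be sketched as follows. This is an illustrative implementation under common conventions (the function name, the `lambda_gp=10` default, and the critic interface are assumptions, not from the original text): the critic's gradient norm is pushed toward 1 on random interpolates between real and fake batches.

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP penalty: push the critic's gradient norm toward 1
    on random interpolates between real and fake samples."""
    batch = real.size(0)
    # One interpolation coefficient per sample, broadcast over remaining dims.
    eps = torch.rand(batch, *([1] * (real.dim() - 1)))
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True)[0]
    grad_norm = grads.view(batch, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```

In training, this term is simply added to the critic loss each step, e.g. `d_loss = fake_scores.mean() - real_scores.mean() + gradient_penalty(critic, real, fake)`, replacing weight clipping entirely.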
Mode collapse and training instability
Mode collapse — the most common GAN failure — occurs when the Generator learns to produce only a narrow subset of the real distribution (e.g., only one face expression), because that's enough to fool the Discriminator:
Detecting mode collapse: monitor generator output diversity

```python
import torch
import torch.nn.functional as F

def check_mode_collapse(generator, latent_dim=128, n_samples=1000, threshold=0.85):
    """
    If generated samples have very high pairwise similarity → mode collapse.
    Real diverse data should have low average cosine similarity.
    """
    with torch.no_grad():
        z = torch.randn(n_samples, latent_dim)
        fake = generator(z)  # (n_samples, C, H, W)
    # Flatten and normalize
    flat = fake.view(n_samples, -1)
    flat = F.normalize(flat, dim=1)
    # Sample 200 random pairs for efficiency
    idx = torch.randint(0, n_samples, (200, 2))
    sims = (flat[idx[:, 0]] * flat[idx[:, 1]]).sum(dim=1)
    avg_sim = sims.mean().item()
    print(f"Average cosine similarity: {avg_sim:.3f}")
    if avg_sim > threshold:
        print("⚠️ Possible mode collapse detected!")
    else:
        print("✅ Generator output looks diverse")
    return avg_sim
```

Why mode collapse happens
The Generator finds a "local minimum" — a small set of convincing fakes that the Discriminator can't yet reject. Once D adapts, G might jump to another mode rather than spreading across all modes. Minibatch discrimination (showing D multiple samples at once so it can detect lack of diversity) and spectral normalization are the most reliable mitigations.
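Of the mitigations just mentioned, spectral normalization is the cheapest to adopt: in PyTorch it is a one-line wrapper around each discriminator layer. The small convolutional discriminator below is an illustrative assumption (layer sizes chosen only so the shapes work out for 32×32 RGB inputs).

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Wrapping each layer keeps its weight's spectral norm near 1, which
# bounds the discriminator's Lipschitz constant and stabilizes training.
D = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 4, stride=2, padding=1)),    # 32x32 -> 16x16
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, 4, stride=2, padding=1)),  # 16x16 -> 8x8
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    spectral_norm(nn.Linear(128 * 8 * 8, 1)),
)

out = D(torch.randn(2, 3, 32, 32))  # one logit per image
```

Because the normalization lives inside the layers, the rest of the training loop is unchanged, which is a large part of why it became a default choice.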
GAN applications
| Application | Architecture | Example |
|---|---|---|
| Photorealistic face synthesis | StyleGAN 3 | thispersondoesnotexist.com — 1024×1024 faces |
| Image-to-image translation | pix2pix (paired), CycleGAN (unpaired) | Sketch→photo, day→night, horse→zebra |
| Super-resolution | ESRGAN, Real-ESRGAN | 4×upscale with realistic textures |
| Medical image synthesis | DCGAN, StyleGAN | Generate rare pathology training data |
| Video prediction | VideoGAN, DVD-GAN | Short video sequence generation |
| Drug molecule generation | MolGAN, Graph GAN | Generate novel molecular structures with target properties |
| Data augmentation | Conditional GAN | Synthetic training data for underrepresented classes |
GANs vs Diffusion Models
| Dimension | GANs | Diffusion Models |
|---|---|---|
| Training stability | ❌ Notoriously unstable, mode collapse | ✅ Stable — standard supervised denoising loss |
| Sample diversity | ❌ Mode collapse risk | ✅ Excellent diversity |
| Sampling speed | ✅ Single forward pass (~milliseconds) | ❌ 20–1000 denoising steps |
| Text conditioning | ⚠️ Difficult — requires careful architecture | ✅ Natural via cross-attention (DALL-E 3, SD3) |
| Image quality (2025) | ✅ StyleGAN3 still top for faces | ✅ Diffusion dominates general image gen |
| Video generation | ⚠️ Limited progress | ✅ Sora, Kling, Gen-3 — all diffusion-based |
| Best use today | Real-time generation, face synthesis, low-latency | Text-to-image, editing, video, highest quality |
GANs are not dead
The adversarial training paradigm lives on in: (1) Adversarial examples — testing model robustness. (2) Adversarial training for robustness — training classifiers on adversarial examples. (3) Discriminator components in hybrid models. (4) Real-time edge applications where single-step inference is required. GAN-based face generators still produce more photorealistic identity-preserving results than diffusion for certain use cases.
Practice questions
- What is mode collapse in GAN training and how does it manifest? (Answer: Mode collapse: the generator learns to produce a small subset of the possible outputs that fool the discriminator — ignoring other modes of the real distribution. Example: a face GAN only generates blonde females even though training data has diverse faces. The generator found a local optimum: one type of face consistently fools the discriminator. The discriminator then over-fits to this mode, but the generator doesn't need to diversify. Mitigation: minibatch discrimination (encourage diverse outputs per batch), Wasserstein loss, spectral normalisation.)
- What is the Wasserstein distance (used in WGAN) and why is it more stable than JS divergence for GAN training? (Answer: Earth Mover's distance / Wasserstein-1: the minimum cost of transforming one distribution into another (minimum transport plan). Advantages over JS divergence: (1) Provides meaningful gradients even when distributions do not overlap — when generator is far from real data, JS divergence = constant log(2) but Wasserstein is proportional to distance. (2) Correlates better with sample quality — lower Wasserstein distance = better generated samples. WGAN with gradient penalty (WGAN-GP) is more stable to train than original GAN.)
- What is the discriminator's role during inference with a trained GAN? (Answer: The discriminator is discarded during inference. Only the generator is used: sample z ~ N(0,I), compute G(z) to generate a new sample. The discriminator served only as a training signal — an adversary that forced the generator to improve. At convergence (if achieved), the generator outputs samples indistinguishable from real data. The discriminator has no role in production image generation systems like StyleGAN or BigGAN.)
- How does StyleGAN control specific features (hair colour, age, facial expression) in generated faces? (Answer: StyleGAN uses Adaptive Instance Normalisation (AdaIN): a mapping network converts the latent z to a style vector w. At each resolution level, w modulates (via affine transform) the feature map mean and variance — directly controlling style at that level. Different levels control different aspects: coarse levels (4×4–8×8): pose, shape, face structure. Middle levels (16×16–32×32): facial features, hair style. Fine levels (64×64–1024×1024): colour, texture, fine details. Mixing styles from two latent codes produces faces with combined characteristics.)
- What is a conditional GAN (cGAN) and how does it enable class-conditioned generation? (Answer: cGAN: add class label conditioning to both generator and discriminator. Generator: G(z, c) where c is the class label (one-hot or embedding) concatenated to z or injected via FiLM conditioning. Discriminator: D(x, c) evaluates whether real/fake AND whether x matches class c. Training: generator must fool discriminator for the correct class — cannot generate a cat image and claim it's a dog. Enables controllable generation: 'generate class 42' or 'generate a cat'. BigGAN, class-conditional ImageNet generation uses large-scale cGAN.)
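The cGAN conditioning described in the last answer, concatenating a learned class embedding to the noise vector, can be sketched as a minimal generator. The layer sizes, embedding dimension, and flat 784-dimensional output (e.g. 28×28 images) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """G(z, c): concatenate a learned class embedding to the noise vector."""
    def __init__(self, latent_dim=64, n_classes=10, embed_dim=16, out_dim=784):
        super().__init__()
        self.embed = nn.Embedding(n_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(latent_dim + embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, out_dim),
            nn.Tanh(),  # outputs in [-1, 1], matching normalized image data
        )

    def forward(self, z, labels):
        c = self.embed(labels)                    # (B, embed_dim)
        return self.net(torch.cat([z, c], dim=1))

G = ConditionalGenerator()
z = torch.randn(4, 64)
labels = torch.tensor([3, 3, 7, 7])  # "generate class 3" / "generate class 7"
imgs = G(z, labels)                  # (4, 784)
```

The matching discriminator receives the same embedding alongside the image, so fooling it requires producing a plausible sample of the requested class, not just a plausible sample.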