Adversarial attacks are carefully crafted inputs designed to fool machine learning models into making incorrect predictions. In computer vision, pixel-level perturbations invisible to humans cause misclassification (stop sign recognised as speed limit). In NLP, adding or substituting words causes sentiment classifiers to flip predictions. Adversarial examples are a fundamental vulnerability of gradient-based models, not a fixable bug. Adversarial robustness — training models to resist such attacks — is a core AI safety challenge with implications for autonomous vehicles, medical imaging, security systems, and any safety-critical AI deployment.
Real-life analogy: The optical illusion
Optical illusions exploit weaknesses in human visual processing — adding specific patterns makes humans misperceive shapes, colours, and sizes. Adversarial examples are optical illusions for AI systems. A stop sign with yellow stickers at specific positions causes a self-driving car to classify it as a 45mph speed limit sign. The stickers are meaningless to a human but maximally confusing to the neural network. This is not a rare edge case — adversarial examples exist for virtually every neural network ever trained.
Types of adversarial attacks
FGSM adversarial attack and adversarial training defence
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# ── Fast Gradient Sign Method (FGSM) — the simplest adversarial attack ──
# Goodfellow et al. 2014: "Explaining and Harnessing Adversarial Examples"

def fgsm_attack(model, loss_fn, image, label, epsilon=0.03):
    """
    Generate an adversarial example by taking one step in the
    direction that MAXIMISES the loss.
    image:   (1, C, H, W) tensor with pixel values in [0, 1]
    epsilon: perturbation budget (max per-pixel change, in [0, 1] scale)
    """
    # Work on a fresh leaf tensor so gradients flow to the input
    image = image.clone().detach().requires_grad_(True)
    output = model(image)
    loss = loss_fn(output, label)
    model.zero_grad()
    loss.backward()
    # Perturb in the direction that increases loss:
    # sign(gradient) = +1 where the gradient is positive, -1 where negative
    perturbation = epsilon * image.grad.sign()
    adversarial = image + perturbation
    adversarial = torch.clamp(adversarial, 0, 1)  # keep pixels in valid range
    return adversarial.detach()
```
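A minimal, self-contained usage sketch, with a tiny random model and a random "image" as hypothetical stand-ins for a real classifier and input (the FGSM step is inlined with the same logic as `fgsm_attack`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical stand-ins for a trained classifier and a real image
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
image = torch.rand(1, 3, 8, 8)
label = torch.tensor([3])

# One FGSM step, inlined
x = image.clone().requires_grad_(True)
F.cross_entropy(model(x), label).backward()
adv = (x + 0.03 * x.grad.sign()).clamp(0, 1).detach()

# The perturbation respects the epsilon budget (clamping can only shrink it)
print(float((adv - image).abs().max()))
```

Note that every pixel moves by at most ε, which is what makes the attack imperceptible while still following the loss gradient.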
```python
# ── Projected Gradient Descent (PGD) — stronger multi-step attack ──

def pgd_attack(model, loss_fn, image, label,
               epsilon=0.03, alpha=0.01, n_steps=40):
    """
    Multi-step FGSM with a random start — much stronger than FGSM.
    Used for adversarial training (Madry et al. 2018).
    """
    # Random start within the epsilon ball
    delta = torch.empty_like(image).uniform_(-epsilon, epsilon)
    adversarial = (image + delta).clamp(0, 1).detach().requires_grad_(True)
    for step in range(n_steps):
        output = model(adversarial)
        loss = loss_fn(output, label)
        model.zero_grad()
        loss.backward()
        # Gradient ascent step
        adversarial = adversarial + alpha * adversarial.grad.sign()
        # Project back into the epsilon-ball around the original image
        delta = torch.clamp(adversarial - image, -epsilon, epsilon)
        adversarial = torch.clamp(image + delta, 0, 1).detach().requires_grad_(True)
    return adversarial.detach()
```
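The projection step is what distinguishes PGD from plain gradient ascent. A tiny numeric sketch of just that step, with pixel values chosen purely for illustration:

```python
import torch

x = torch.tensor([0.50, 0.90, 0.10])    # original pixels
adv = torch.tensor([0.60, 0.99, 0.02])  # after a few ascent steps
eps = 0.03

delta = (adv - x).clamp(-eps, eps)      # pull each pixel back into the l-inf ball
projected = (x + delta).clamp(0, 1)     # and back into the valid pixel range
print(projected)                        # tensor([0.5300, 0.9300, 0.0700])
```

However far the ascent steps wander, every pixel ends up within ε of the original, so the attack stays inside its stated budget.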
```python
# ── Adversarial training: train on adversarial examples to improve robustness ──

def adversarial_training_step(model, optimizer, loss_fn, X, y, epsilon=0.03):
    """
    Madry et al. adversarial training: for each batch, generate adversarial
    examples and train on them INSTEAD OF clean examples.
    """
    model.eval()  # generate adversarial examples without dropout
    X_adv = pgd_attack(model, loss_fn, X.clone(), y, epsilon=epsilon, n_steps=10)
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(X_adv), y)  # train on adversarial examples
    loss.backward()
    optimizer.step()
    return loss.item()
```
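A sketch of the surrounding training loop. To keep it self-contained and fast, a single FGSM step stands in for the PGD inner maximisation, and the tiny model and random data are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(2 * 4 * 4, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
X, y = torch.rand(16, 2, 4, 4), torch.randint(0, 2, (16,))

for epoch in range(5):
    # Inner maximisation: one FGSM step (a multi-step PGD attack is stronger)
    x = X.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    X_adv = (x + 0.03 * x.grad.sign()).clamp(0, 1).detach()
    # Outer minimisation: update the weights on the adversarial batch
    opt.zero_grad()
    loss = F.cross_entropy(model(X_adv), y)
    loss.backward()
    opt.step()
print(round(loss.item(), 4))
```

This min–max structure (attack inside, weight update outside) is the defining shape of adversarial training; only the strength of the inner attack varies.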
```python
# ── NLP adversarial attacks ──
# TextFooler-style attack: substitute words with synonyms to flip sentiment

def textfooler_attack(text: str, model, tokenizer) -> str:
    """Simplified word-substitution attack."""
    from transformers import pipeline
    clf = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
    original_pred = clf(text)[0]['label']
    words = text.split()
    # Try substituting each word with near-synonyms of weaker polarity
    synonyms = {
        'good': ['decent', 'adequate', 'satisfactory'],
        'great': ['acceptable', 'reasonable', 'moderate'],
        'amazing': ['okay', 'fine', 'average'],
        'bad': ['challenging', 'complex', 'difficult'],
        'terrible': ['suboptimal', 'imperfect', 'flawed'],
    }
    for i, word in enumerate(words):
        if word.lower() in synonyms:
            for synonym in synonyms[word.lower()]:
                candidate = words.copy()
                candidate[i] = synonym
                candidate_text = ' '.join(candidate)
                new_pred = clf(candidate_text)[0]['label']
                if new_pred != original_pred:
                    print(f"Attack succeeded: '{text}' → '{candidate_text}'")
                    return candidate_text
    return text  # attack failed
```
Defences and robustness certifications
| Defence | How it works | Against white-box? | Against black-box? | Accuracy cost |
|---|---|---|---|---|
| Adversarial training (PGD) | Train on adversarial examples | Partially | Yes | Moderate (-5% clean accuracy) |
| Input preprocessing | Blur, denoise, or smooth inputs | Weak | Partially | Low |
| Certified robustness | Provable guarantees within epsilon-ball | Yes (formal proof) | Yes | High (-15% accuracy) |
| Randomised smoothing | Add Gaussian noise, majority vote | Yes (certifiable) | Yes | Moderate |
| Feature squeezing | Reduce colour depth or smooth inputs | Weak | Partially | Very low |
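As one concrete defence from the table, here is a minimal randomised-smoothing predictor: a majority vote over Gaussian-noised copies of the input, in the spirit of Cohen et al. 2019. The tiny model is a hypothetical stand-in, and a real certified radius additionally requires statistics on the vote counts:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical stand-in for a noise-trained base classifier
base = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))

def smoothed_predict(model, x, sigma=0.25, n=200, n_classes=10):
    """Classify by majority vote over n Gaussian-noised copies of x."""
    with torch.no_grad():
        noisy = x.repeat(n, 1, 1, 1) + sigma * torch.randn(n, *x.shape[1:])
        votes = model(noisy).argmax(dim=1)
    return votes.bincount(minlength=n_classes).argmax().item()

x = torch.rand(1, 3, 8, 8)
print(smoothed_predict(base, x))  # a class index in 0..9
```

The vote makes the smoothed classifier's output stable under small input shifts, which is what the certificate formalises.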
Why adversarial robustness is hard to solve
Accuracy–robustness trade-off: in practice, improving adversarial robustness tends to reduce accuracy on clean inputs, and for certain data distributions this trade-off has been shown to be inherent (Tsipras et al. 2019). A widely held explanation is geometric: adversarial vulnerability follows from the high-dimensional geometry of neural-network decision boundaries. In high dimensions, most correctly classified points lie close to a decision boundary, so a small, well-aimed perturbation can push them across it.
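A quick way to see the dimension effect for a linear score w·x: a worst-case ℓ∞ perturbation of size ε shifts the score by ε·‖w‖₁, which grows roughly linearly with the dimension d (the random weights below are purely illustrative):

```python
import torch

torch.manual_seed(0)
eps = 0.03
for d in [10, 100, 1000, 10000]:
    w = torch.randn(d)                  # random linear "classifier" weights
    shift = eps * w.abs().sum().item()  # worst-case score change when ||delta||_inf <= eps
    print(f"d={d:>6}  max score shift ≈ {shift:.1f}")
```

At image-scale dimensions, even a visually invisible ε buys the attacker a large swing in the score, which is why high-dimensional inputs are so hard to defend.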
Practice questions
- FGSM perturbs an image by ε=0.03 in the gradient sign direction. Why is the perturbation imperceptible to humans? (Answer: Image pixels are in [0,1]. A change of 0.03 (3% of pixel range) is below the human visual threshold for detecting changes in uniform regions. Yet the perturbation is maximally damaging for the model because it is aligned with the loss gradient — the direction the model is most sensitive to.)
- White-box vs black-box adversarial attack — what is the difference? (Answer: White-box: attacker has full access to model weights and gradients — can compute the exact gradient to maximise loss. Much stronger. Black-box: attacker can only query the model (input → output) — no access to internals. Must estimate gradients via queries or use transferability (adversarial examples from a surrogate model often transfer to the target).)
- Why is adversarial training considered the most effective empirical defence? (Answer: It directly exposes the model to adversarial examples during training, so the model learns to classify them correctly. Standard training never sees adversarial inputs, leaving the model undefended against them. Among empirical defences, PGD-based adversarial training is one of the few that has survived rigorous evaluation against adaptive attackers; many other proposed defences were later broken.)
- Stop signs with adversarial stickers have been demonstrated to fool self-driving cars. What are the real-world safety implications? (Answer: Physical adversarial attacks are a genuine safety risk for autonomous systems. Malicious actors could place stickers on stop signs causing vehicles to ignore them. Medical imaging adversarial examples could cause misdiagnosis. Security cameras could be fooled by adversarial patterns on clothing. These are deployment-time attacks with catastrophic potential in safety-critical systems.)
- What does certified robustness mean? (Answer: A formal mathematical guarantee that the model's prediction does not change for any input within an ε-ball around a given point. Cannot be fooled by any adversarial attack within that radius. Achieved via randomised smoothing (majority vote over noisy copies) or interval bound propagation. Provides stronger guarantees than empirical defences but at significant accuracy cost.)
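To make the black-box answer above concrete: when gradients are unavailable, they can be estimated from queries alone via central finite differences. The quadratic `f` below is a hypothetical stand-in for a model's loss as a function of the input:

```python
import torch

def f(x):
    # Hypothetical stand-in for a black-box model's loss; true gradient is 2*x
    return (x ** 2).sum()

def estimate_gradient(f, x, h=1e-3):
    """Central finite differences: 2 queries per coordinate, no backprop."""
    g = torch.zeros_like(x)
    for i in range(x.numel()):
        e = torch.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = torch.tensor([1.0, -2.0, 0.5])
print(estimate_gradient(f, x))  # close to 2 * x = [2.0, -4.0, 1.0]
```

The cost is two queries per input coordinate, which is why practical black-box attacks rely on sampling tricks or on transferability from a surrogate model rather than exhaustive coordinate-wise estimation.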
On LumiChats
Adversarial attacks on text models (prompt injection, jailbreaking) are the NLP equivalent of image adversarial attacks — carefully crafted inputs that override intended model behaviour. LumiChats uses input filtering, output monitoring, and adversarial training to resist these attacks in production.