Adversarial attacks are carefully crafted inputs designed to fool machine learning models into making incorrect predictions. In computer vision, pixel-level perturbations invisible to humans cause misclassification (stop sign recognised as speed limit). In NLP, adding or substituting words causes sentiment classifiers to flip predictions. Adversarial examples are a fundamental vulnerability of gradient-based models, not a fixable bug. Adversarial robustness — training models to resist such attacks — is a core AI safety challenge with implications for autonomous vehicles, medical imaging, security systems, and any safety-critical AI deployment.
Real-life analogy: The optical illusion
Optical illusions exploit weaknesses in human visual processing — adding specific patterns makes humans misperceive shapes, colours, and sizes. Adversarial examples are optical illusions for AI systems. A stop sign with yellow stickers at specific positions causes a self-driving car to classify it as a 45mph speed limit sign. The stickers are meaningless to a human but maximally confusing to the neural network. This is not a rare edge case — adversarial examples exist for virtually every neural network ever trained.
Types of adversarial attacks
FGSM adversarial attack and adversarial training defence
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# ── Fast Gradient Sign Method (FGSM) — the simplest adversarial attack ──
# Goodfellow et al. 2014: "Explaining and Harnessing Adversarial Examples"

def fgsm_attack(model, loss_fn, image, label, epsilon=0.03):
    """
    Generate an adversarial example by taking one step in the
    direction that MAXIMISES the loss.
    image:   (1, C, H, W) tensor with pixel values in [0, 1]
    epsilon: perturbation budget (max per-pixel change, in [0, 1] scale)
    """
    # Work on a fresh leaf tensor so gradients flow to the input
    image = image.clone().detach().requires_grad_(True)
    output = model(image)
    loss = loss_fn(output, label)
    model.zero_grad()
    loss.backward()
    # Perturb in the direction that increases loss:
    # sign(gradient) = +1 where the gradient is positive, -1 where negative
    perturbation = epsilon * image.grad.sign()
    adversarial = image + perturbation
    adversarial = torch.clamp(adversarial, 0, 1)  # keep pixels in valid range
    return adversarial.detach()
```
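A minimal, self-contained usage sketch, with a tiny random model and a random "image" as hypothetical stand-ins for a real classifier and input (the FGSM step is inlined with the same logic as `fgsm_attack`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical stand-ins for a trained classifier and a real image
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
image = torch.rand(1, 3, 8, 8)
label = torch.tensor([3])

# One FGSM step, inlined
x = image.clone().requires_grad_(True)
F.cross_entropy(model(x), label).backward()
adv = (x + 0.03 * x.grad.sign()).clamp(0, 1).detach()

# The perturbation respects the epsilon budget (clamping can only shrink it)
print(float((adv - image).abs().max()))
```

Note that every pixel moves by at most ε, which is what makes the attack imperceptible while still following the loss gradient.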
```python
# ── Projected Gradient Descent (PGD) — stronger multi-step attack ──

def pgd_attack(model, loss_fn, image, label,
               epsilon=0.03, alpha=0.01, n_steps=40):
    """
    Multi-step FGSM with a random start — much stronger than FGSM.
    Used for adversarial training (Madry et al. 2018).
    """
    # Random start within the epsilon ball
    delta = torch.empty_like(image).uniform_(-epsilon, epsilon)
    adversarial = (image + delta).clamp(0, 1).detach().requires_grad_(True)
    for step in range(n_steps):
        output = model(adversarial)
        loss = loss_fn(output, label)
        model.zero_grad()
        loss.backward()
        # Gradient ascent step
        adversarial = adversarial + alpha * adversarial.grad.sign()
        # Project back into the epsilon-ball around the original image
        delta = torch.clamp(adversarial - image, -epsilon, epsilon)
        adversarial = torch.clamp(image + delta, 0, 1).detach().requires_grad_(True)
    return adversarial.detach()
```
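The projection step is what distinguishes PGD from plain gradient ascent. A tiny numeric sketch of just that step, with pixel values chosen purely for illustration:

```python
import torch

x = torch.tensor([0.50, 0.90, 0.10])    # original pixels
adv = torch.tensor([0.60, 0.99, 0.02])  # after a few ascent steps
eps = 0.03

delta = (adv - x).clamp(-eps, eps)      # pull each pixel back into the l-inf ball
projected = (x + delta).clamp(0, 1)     # and back into the valid pixel range
print(projected)                        # tensor([0.5300, 0.9300, 0.0700])
```

However far the ascent steps wander, every pixel ends up within ε of the original, so the attack stays inside its stated budget.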
```python
# ── Adversarial training: train on adversarial examples to improve robustness ──

def adversarial_training_step(model, optimizer, loss_fn, X, y, epsilon=0.03):
    """
    Madry et al. adversarial training: for each batch, generate adversarial
    examples and train on them INSTEAD OF clean examples.
    """
    model.eval()  # generate adversarial examples without dropout
    X_adv = pgd_attack(model, loss_fn, X.clone(), y, epsilon=epsilon, n_steps=10)
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(X_adv), y)  # train on adversarial examples
    loss.backward()
    optimizer.step()
    return loss.item()
```
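A sketch of the surrounding training loop. To keep it self-contained and fast, a single FGSM step stands in for the PGD inner maximisation, and the tiny model and random data are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(2 * 4 * 4, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
X, y = torch.rand(16, 2, 4, 4), torch.randint(0, 2, (16,))

for epoch in range(5):
    # Inner maximisation: one FGSM step (a multi-step PGD attack is stronger)
    x = X.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    X_adv = (x + 0.03 * x.grad.sign()).clamp(0, 1).detach()
    # Outer minimisation: update the weights on the adversarial batch
    opt.zero_grad()
    loss = F.cross_entropy(model(X_adv), y)
    loss.backward()
    opt.step()
print(round(loss.item(), 4))
```

This min–max structure (attack inside, weight update outside) is the defining shape of adversarial training; only the strength of the inner attack varies.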
```python
# ── NLP adversarial attacks ──
# TextFooler-style attack: substitute words with synonyms to flip sentiment

def textfooler_attack(text: str, model, tokenizer) -> str:
    """Simplified word-substitution attack."""
    from transformers import pipeline
    clf = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
    original_pred = clf(text)[0]['label']
    words = text.split()
    # Try substituting each word with near-synonyms of weaker polarity
    synonyms = {
        'good': ['decent', 'adequate', 'satisfactory'],
        'great': ['acceptable', 'reasonable', 'moderate'],
        'amazing': ['okay', 'fine', 'average'],
        'bad': ['challenging', 'complex', 'difficult'],
        'terrible': ['suboptimal', 'imperfect', 'flawed'],
    }
    for i, word in enumerate(words):
        if word.lower() in synonyms:
            for synonym in synonyms[word.lower()]:
                candidate = words.copy()
                candidate[i] = synonym
                candidate_text = ' '.join(candidate)
                new_pred = clf(candidate_text)[0]['label']
                if new_pred != original_pred:
                    print(f"Attack succeeded: '{text}' → '{candidate_text}'")
                    return candidate_text
    return text  # attack failed
```
Defences and robustness certifications
| Defence | How it works | Against white-box? | Against black-box? | Accuracy cost |
|---|---|---|---|---|
| Adversarial training (PGD) | Train on adversarial examples | Partially | Yes | Moderate (-5% clean accuracy) |
| Input preprocessing | Blur, denoise, or smooth inputs | Weak | Partially | Low |
| Certified robustness | Provable guarantees within epsilon-ball | Yes (formal proof) | Yes | High (-15% accuracy) |
| Randomised smoothing | Add Gaussian noise, majority vote | Yes (certifiable) | Yes | Moderate |
| Feature squeezing | Reduce colour depth or smooth inputs | Weak | Partially | Very low |
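As one concrete defence from the table, here is a minimal randomised-smoothing predictor: a majority vote over Gaussian-noised copies of the input, in the spirit of Cohen et al. 2019. The tiny model is a hypothetical stand-in, and a real certified radius additionally requires statistics on the vote counts:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical stand-in for a noise-trained base classifier
base = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))

def smoothed_predict(model, x, sigma=0.25, n=200, n_classes=10):
    """Classify by majority vote over n Gaussian-noised copies of x."""
    with torch.no_grad():
        noisy = x.repeat(n, 1, 1, 1) + sigma * torch.randn(n, *x.shape[1:])
        votes = model(noisy).argmax(dim=1)
    return votes.bincount(minlength=n_classes).argmax().item()

x = torch.rand(1, 3, 8, 8)
print(smoothed_predict(base, x))  # a class index in 0..9
```

The vote makes the smoothed classifier's output stable under small input shifts, which is what the certificate formalises.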
Why adversarial robustness is hard to solve
Accuracy–robustness trade-off: in practice, improving adversarial robustness tends to reduce accuracy on clean inputs, and for certain data distributions this trade-off has been shown to be inherent (Tsipras et al. 2019). A widely held explanation is geometric: adversarial vulnerability follows from the high-dimensional geometry of neural-network decision boundaries. In high dimensions, most correctly classified points lie close to a decision boundary, so a small, well-aimed perturbation can push them across it.
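A quick way to see the dimension effect for a linear score w·x: a worst-case ℓ∞ perturbation of size ε shifts the score by ε·‖w‖₁, which grows roughly linearly with the dimension d (the random weights below are purely illustrative):

```python
import torch

torch.manual_seed(0)
eps = 0.03
for d in [10, 100, 1000, 10000]:
    w = torch.randn(d)                  # random linear "classifier" weights
    shift = eps * w.abs().sum().item()  # worst-case score change when ||delta||_inf <= eps
    print(f"d={d:>6}  max score shift ≈ {shift:.1f}")
```

At image-scale dimensions, even a visually invisible ε buys the attacker a large swing in the score, which is why high-dimensional inputs are so hard to defend.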
Practice questions
- FGSM perturbs an image by ε=0.03 in the gradient sign direction. Why is the perturbation imperceptible to humans? (Answer: Image pixels are in [0,1]. A change of 0.03 (3% of pixel range) is below the human visual threshold for detecting changes in uniform regions. Yet the perturbation is maximally damaging for the model because it is aligned with the loss gradient — the direction the model is most sensitive to.)
- White-box vs black-box adversarial attack — what is the difference? (Answer: White-box: attacker has full access to model weights and gradients — can compute the exact gradient to maximise loss. Much stronger. Black-box: attacker can only query the model (input → output) — no access to internals. Must estimate gradients via queries or use transferability (adversarial examples from a surrogate model often transfer to the target).)
- Why is adversarial training considered the most effective empirical defence? (Answer: It directly exposes the model to adversarial examples during training, so the model learns to classify them correctly. Standard training never sees adversarial inputs, leaving the model undefended against them. Among empirical defences, PGD-based adversarial training is one of the few that has survived rigorous evaluation against adaptive attackers; many other proposed defences were later broken.)
- Stop signs with adversarial stickers have been demonstrated to fool self-driving cars. What are the real-world safety implications? (Answer: Physical adversarial attacks are a genuine safety risk for autonomous systems. Malicious actors could place stickers on stop signs causing vehicles to ignore them. Medical imaging adversarial examples could cause misdiagnosis. Security cameras could be fooled by adversarial patterns on clothing. These are deployment-time attacks with catastrophic potential in safety-critical systems.)
- What does certified robustness mean? (Answer: A formal mathematical guarantee that the model's prediction does not change for any input within an ε-ball around a given point. Cannot be fooled by any adversarial attack within that radius. Achieved via randomised smoothing (majority vote over noisy copies) or interval bound propagation. Provides stronger guarantees than empirical defences but at significant accuracy cost.)
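To make the black-box answer above concrete: when gradients are unavailable, they can be estimated from queries alone via central finite differences. The quadratic `f` below is a hypothetical stand-in for a model's loss as a function of the input:

```python
import torch

def f(x):
    # Hypothetical stand-in for a black-box model's loss; true gradient is 2*x
    return (x ** 2).sum()

def estimate_gradient(f, x, h=1e-3):
    """Central finite differences: 2 queries per coordinate, no backprop."""
    g = torch.zeros_like(x)
    for i in range(x.numel()):
        e = torch.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = torch.tensor([1.0, -2.0, 0.5])
print(estimate_gradient(f, x))  # close to 2 * x = [2.0, -4.0, 1.0]
```

The cost is two queries per input coordinate, which is why practical black-box attacks rely on sampling tricks or on transferability from a surrogate model rather than exhaustive coordinate-wise estimation.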
On LumiChats
Adversarial attacks on text models (prompt injection, jailbreaking) are the NLP equivalent of image adversarial attacks — carefully crafted inputs that override intended model behaviour. LumiChats uses input filtering, output monitoring, and adversarial training to resist these attacks in production.