Glossary/AI Red Teaming
AI Safety & Ethics

AI Red Teaming

Deliberately attacking AI systems to find safety failures before deployment.


Definition

AI red teaming is the practice of systematically probing AI models for harmful behaviours, safety failures, and policy violations using adversarial testing — mirroring the cybersecurity practice of red teaming computer systems. Red teams attempt to elicit harmful outputs, test the consistency of safety training, find jailbreaks, and identify failure modes that automated testing misses. Both AI labs (internal red teams) and external researchers conduct red teaming, which has become a standard pre-deployment requirement for frontier AI systems.

What AI red teams actually test

Risk categoryWhat testers probe forExample attack
Harmful content generationProducing violence, hate speech, CSAM, self-harm instructionsEscalating role-play scenarios, hypothetical framings
Dangerous capability elicitationCBRN weapons, bioweapon synthesis, critical infrastructure attacksScientific framing, academic paper personas, multi-step decomposition
Privacy violationsExtracting training data, inferring PII, repeating memorised contentPrefix injection to elicit memorised training examples
Deception and manipulationSycophancy, false confidence, social manipulation tacticsAsking model to persuade, mislead, or deceive in specified ways
Agentic safety failuresHarmful autonomous actions, prompt injection exploitation, scope creepMalicious tool results, adversarial environment states
Model-specific jailbreaksCharacter-specific bypass patterns that survived safety trainingModel-targeted adversarial prompts, many-shot examples

Red teaming methodology

  1. Scope definition: Define what capabilities and risk categories are in scope. A red team for a medical AI product focuses differently from one for a general-purpose chatbot.
  2. Attack surface mapping: Identify all inputs the model can receive — direct prompts, system prompts, tool results, retrieved documents, multimodal inputs — and test each as a potential injection vector.
  3. Manual adversarial prompting: Human red teamers craft targeted prompts based on their understanding of the model's safety training and likely failure modes. Human creativity is essential — automated testing systematically misses novel attack patterns.
  4. Automated fuzzing: Tools like Garak (open-source LLM vulnerability scanner) and commercial red-teaming platforms automatically generate thousands of adversarial prompts and record failure rates.
  5. Structured scoring: Rate each finding by severity (how harmful is the output), consistency (does the attack succeed reliably), and novelty (is this a known or new technique). Prioritise fixes by this risk matrix.
  6. Responsible disclosure: Document all findings in a structured red team report. Share with the AI lab before public disclosure. Follow the same responsible disclosure norms as vulnerability research in traditional cybersecurity.

Garak — open-source LLM vulnerability scanner for automated red teaming

# Install: pip install garak
# Garak tests LLM APIs against hundreds of known attack categories automatically

# Run from command line:
# python -m garak --model_type openai --model_name gpt-4o-mini --probes all

# Or use programmatically:
import garak.attempt
from garak.generators.openai import OpenAIGenerator

generator = OpenAIGenerator(name="gpt-4o-mini")

# Test for prompt injection vulnerabilities
from garak.probes.promptinject import HijackHateHumans

probe = HijackHateHumans()
results = probe.probe(generator)

# Each result is an Attempt object with:
# - prompt: what was sent
# - outputs: what the model said
# - passed: whether the safety check passed
for attempt in results:
    if not attempt.passed:
        print(f"FAILED: {attempt.prompt[:100]}...")
        print(f"OUTPUT: {attempt.outputs[0][:200]}...")

Government and regulatory requirements for red teaming

  • US Executive Order 14110 (2023): Required developers of frontier AI models (above defined compute thresholds) to share red team results with the US government before deployment. Reinforced by the Frontier AI Safety Commitments signed by major labs.
  • EU AI Act (2024): Requires 'high-risk' AI systems to undergo conformity assessments including adversarial testing. 'General-purpose AI models with systemic risk' (above 10²⁵ FLOP training compute) must conduct adversarial testing and report to the AI Office.
  • UK AI Safety Institute: Conducted pre-deployment evaluations of Claude 3, GPT-4o, and Gemini 1.5 Pro in 2024, focusing on CBRN capabilities and cyber offensive capabilities. Published findings inform regulatory guidance.
  • Voluntary commitments: Anthropic, OpenAI, Google DeepMind, Microsoft, Meta, and others have committed to pre-deployment third-party safety evaluations and sharing red team findings with governments before releasing frontier models.

Practice questions

  1. What is the difference between white-box and black-box red teaming for LLMs? (Answer: White-box: red teamers have access to model weights, architecture, and training details — can compute gradient-based adversarial inputs, analyse internal representations, and test specific safety mechanisms. Black-box: only API access — must probe through prompt manipulation, observing outputs. Most deployed model red teaming is black-box (models are proprietary). White-box enables much more systematic vulnerability analysis but requires model access.)
  2. What categories of harm do AI red teams typically test for? (Answer: Weapons of mass destruction assistance (bio, chem, nuclear, radiological); CSAM generation; violent extremism facilitation; cyberweapon creation; personal data extraction; identity theft facilitation; privacy violations; social engineering tools; disinformation generation at scale; self-harm facilitation; financial fraud assistance. The UK AISI red-teaming framework adds: persuasion capabilities (manipulation), and uplift (does the model provide meaningful additional capability beyond a search engine?).)
  3. What is the difference between offensive jailbreaking (finding safety bypasses) and constructive red teaming? (Answer: Offensive jailbreaking: adversarially find ways to extract harmful outputs, often to embarrass the company or for personal use. Goal: demonstrate vulnerability. Constructive red teaming: systematically probe for vulnerabilities to improve safety before deployment. Goal: enumerate and fix vulnerabilities. Constructive red teams write detailed reports about successful attacks, their reliability, and recommended mitigations — not just demonstrate that an attack works.)
  4. The UK AI Safety Institute required frontier model red teaming before release. What gap does this fill? (Answer: Internal red teaming has conflicts of interest (you may not want to find problems that delay release) and limited attack surface coverage (your team has the same blind spots). Third-party government red teaming by UKIASI provides independent assessment, broader attack surface coverage from diverse team backgrounds, authoritative public reporting, and accountability. It mirrors financial auditing — internal controls plus independent external verification.)
  5. What is a red team 'uplift evaluation' and why is it important for frontier AI governance? (Answer: Uplift evaluation measures whether a model provides meaningful capability enhancement (uplift) to someone trying to cause harm. Not just 'can it answer harmful questions' but 'does it provide information that substantially increases a malicious actor's capability beyond what Google provides?' For bioweapons: does the model provide synthesis routes that meaningfully advance a bad actor beyond public literature? High-uplift responses are a hard safety red line. Evaluating uplift requires domain experts (biosecurity, cybersecurity, nuclear experts) on the red team.)

Try LumiChats for ₹69

39+ AI models. Study Mode with page-locked answers. Agent Mode with code execution. Pay only on days you use it.

Get Started — ₹69/day

Related Terms

5 terms