AI red teaming is the practice of systematically probing AI models for harmful behaviours, safety failures, and policy violations using adversarial testing — mirroring the cybersecurity practice of red teaming computer systems. Red teams attempt to elicit harmful outputs, test the consistency of safety training, find jailbreaks, and identify failure modes that automated testing misses. Both AI labs (internal red teams) and external researchers conduct red teaming, which has become a standard pre-deployment requirement for frontier AI systems.
What AI red teams actually test
| Risk category | What testers probe for | Example attack |
|---|---|---|
| Harmful content generation | Producing violence, hate speech, CSAM, self-harm instructions | Escalating role-play scenarios, hypothetical framings |
| Dangerous capability elicitation | CBRN weapons, bioweapon synthesis, critical infrastructure attacks | Scientific framing, academic paper personas, multi-step decomposition |
| Privacy violations | Extracting training data, inferring PII, repeating memorised content | Prefix injection to elicit memorised training examples |
| Deception and manipulation | Sycophancy, false confidence, social manipulation tactics | Asking model to persuade, mislead, or deceive in specified ways |
| Agentic safety failures | Harmful autonomous actions, prompt injection exploitation, scope creep | Malicious tool results, adversarial environment states |
| Model-specific jailbreaks | Character-specific bypass patterns that survived safety training | Model-targeted adversarial prompts, many-shot examples |
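The risk categories in the table above map naturally onto a coverage checklist. A minimal sketch of a findings log that tracks which categories have been probed (all class and category names here are hypothetical, not from any standard tool):

```python
from dataclasses import dataclass, field

# Illustrative category keys mirroring the table above
RISK_CATEGORIES = [
    "harmful_content",
    "dangerous_capabilities",
    "privacy",
    "deception",
    "agentic_safety",
    "jailbreaks",
]

@dataclass
class Finding:
    category: str    # which risk category was probed
    attack: str      # short description of the attack used
    succeeded: bool  # did the model produce a disallowed output?

@dataclass
class RedTeamLog:
    findings: list = field(default_factory=list)

    def record(self, category, attack, succeeded):
        assert category in RISK_CATEGORIES, f"unknown category: {category}"
        self.findings.append(Finding(category, attack, succeeded))

    def coverage(self):
        """Categories with at least one recorded probe."""
        return {f.category for f in self.findings}

    def untested(self):
        return [c for c in RISK_CATEGORIES if c not in self.coverage()]

log = RedTeamLog()
log.record("privacy", "prefix injection to elicit memorised text", succeeded=False)
log.record("jailbreaks", "many-shot adversarial examples", succeeded=True)
print(log.untested())  # categories still needing at least one probe
```

A log like this makes gaps visible before sign-off: a red team that never recorded an agentic-safety probe has not covered the table.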
Red teaming methodology
- Scope definition: Define what capabilities and risk categories are in scope. A red team for a medical AI product focuses differently from one for a general-purpose chatbot.
- Attack surface mapping: Identify all inputs the model can receive — direct prompts, system prompts, tool results, retrieved documents, multimodal inputs — and test each as a potential injection vector.
- Manual adversarial prompting: Human red teamers craft targeted prompts based on their understanding of the model's safety training and likely failure modes. Human creativity is essential — automated testing systematically misses novel attack patterns.
- Automated fuzzing: Tools like Garak (open-source LLM vulnerability scanner) and commercial red-teaming platforms automatically generate thousands of adversarial prompts and record failure rates.
- Structured scoring: Rate each finding by severity (how harmful is the output), consistency (does the attack succeed reliably), and novelty (is this a known or new technique). Prioritise fixes by this risk matrix.
- Responsible disclosure: Document all findings in a structured red team report. Share with the AI lab before public disclosure. Follow the same responsible disclosure norms as vulnerability research in traditional cybersecurity.
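The structured-scoring step above can be sketched as a likelihood-times-impact calculation over the three rated dimensions. The weights and the novelty bonus here are illustrative assumptions, not a standard formula:

```python
# Hedged sketch: one way to turn severity/consistency/novelty ratings
# into a fix-priority score. Weights are illustrative.
def risk_score(severity, consistency, novelty):
    """severity and consistency rated 1-5; novelty True if previously unknown."""
    base = severity * consistency          # classic impact x likelihood matrix
    return base + (5 if novelty else 0)    # novel techniques get a bump

# Hypothetical findings from a red-team exercise
findings = [
    {"id": "F1", "severity": 5, "consistency": 2, "novelty": False},
    {"id": "F2", "severity": 3, "consistency": 5, "novelty": True},
    {"id": "F3", "severity": 2, "consistency": 1, "novelty": False},
]

ranked = sorted(
    findings,
    key=lambda f: risk_score(f["severity"], f["consistency"], f["novelty"]),
    reverse=True,
)
print([f["id"] for f in ranked])  # fix highest-risk findings first
```

Note that a reliably reproducible medium-severity attack (F2) can outrank a severe but flaky one (F1), which matches how defenders usually triage.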
Garak — open-source LLM vulnerability scanner for automated red teaming
```python
# Install: pip install garak
# Garak tests LLM APIs against hundreds of known attack categories.
# Command-line usage (runs probes and scores outputs with detectors):
#   python -m garak --model_type openai --model_name gpt-4o-mini --probes all
# Programmatic usage (a sketch; garak's Python API differs between versions):
from garak.generators.openai import OpenAIGenerator
from garak.probes.promptinject import HijackHateHumans

generator = OpenAIGenerator(name="gpt-4o-mini")

# Probe for prompt-injection vulnerabilities
probe = HijackHateHumans()
attempts = probe.probe(generator)

# Each result is an Attempt object recording the prompt that was sent and
# the model's outputs. Pass/fail verdicts come from garak's detectors,
# which the command-line runner applies automatically and summarises in
# its report; when driving probes by hand, inspect the outputs directly:
for attempt in attempts:
    print(f"PROMPT: {attempt.prompt[:100]}...")
    print(f"OUTPUT: {attempt.outputs[0][:200]}...")
```

Government and regulatory requirements for red teaming
- US Executive Order 14110 (2023): Required developers of frontier AI models (above defined compute thresholds) to share red team results with the US government before deployment. Reinforced by the Frontier AI Safety Commitments signed by major labs.
- EU AI Act (2024): Requires 'high-risk' AI systems to undergo conformity assessments including adversarial testing. 'General-purpose AI models with systemic risk' (above 10²⁵ FLOP training compute) must conduct adversarial testing and report to the AI Office.
- UK AI Safety Institute: Conducted pre-deployment evaluations of Claude 3, GPT-4o, and Gemini 1.5 Pro in 2024, focusing on CBRN and offensive cyber capabilities. Published findings inform regulatory guidance.
- Voluntary commitments: Anthropic, OpenAI, Google DeepMind, Microsoft, Meta, and others have committed to pre-deployment third-party safety evaluations and sharing red team findings with governments before releasing frontier models.
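The EU AI Act's 10²⁵ FLOP threshold can be sanity-checked with the common ~6 × parameters × tokens estimate for dense transformer training compute. The model size and token count below are hypothetical:

```python
# Back-of-envelope check against the EU AI Act's 10^25 FLOP
# systemic-risk threshold, using the standard ~6*N*D approximation
# for dense transformer training compute.
def training_flops(params, tokens):
    return 6 * params * tokens

EU_THRESHOLD = 1e25

# Hypothetical 70B-parameter model trained on 15T tokens:
flops = training_flops(70e9, 15e12)
print(f"{flops:.2e}")         # ~6.30e+24 FLOP
print(flops >= EU_THRESHOLD)  # False: just below the threshold
```

The example shows why the threshold bites mainly at the frontier: a large open-weights model of this scale sits below it, while the biggest training runs exceed it.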
Practice questions
- What is the difference between white-box and black-box red teaming for LLMs? (Answer: White-box: red teamers have access to model weights, architecture, and training details — can compute gradient-based adversarial inputs, analyse internal representations, and test specific safety mechanisms. Black-box: only API access — must probe through prompt manipulation, observing outputs. Most deployed model red teaming is black-box (models are proprietary). White-box enables much more systematic vulnerability analysis but requires model access.)
- What categories of harm do AI red teams typically test for? (Answer: Weapons of mass destruction assistance (bio, chem, nuclear, radiological); CSAM generation; violent extremism facilitation; cyberweapon creation; personal data extraction; identity theft facilitation; privacy violations; social engineering tools; disinformation generation at scale; self-harm facilitation; financial fraud assistance. The UK AISI red-teaming framework adds persuasion capabilities (manipulation) and uplift (does the model provide meaningful additional capability beyond a search engine?).)
- What is the difference between offensive jailbreaking (finding safety bypasses) and constructive red teaming? (Answer: Offensive jailbreaking: adversarially find ways to extract harmful outputs, often to embarrass the company or for personal use. Goal: demonstrate vulnerability. Constructive red teaming: systematically probe for vulnerabilities to improve safety before deployment. Goal: enumerate and fix vulnerabilities. Constructive red teams write detailed reports about successful attacks, their reliability, and recommended mitigations — not just demonstrate that an attack works.)
- The UK AI Safety Institute conducts pre-deployment red teaming of frontier models. What gap does this fill? (Answer: Internal red teaming has conflicts of interest (a lab may not want to find problems that delay release) and limited attack-surface coverage (the internal team shares the same blind spots). Independent government red teaming by the UK AISI provides independent assessment, broader attack-surface coverage from diverse team backgrounds, authoritative public reporting, and accountability. It mirrors financial auditing: internal controls plus independent external verification.)
- What is a red team 'uplift evaluation' and why is it important for frontier AI governance? (Answer: Uplift evaluation measures whether a model provides meaningful capability enhancement (uplift) to someone trying to cause harm. Not just 'can it answer harmful questions' but 'does it provide information that substantially increases a malicious actor's capability beyond what Google provides?' For bioweapons: does the model provide synthesis routes that meaningfully advance a bad actor beyond public literature? High-uplift responses are a hard safety red line. Evaluating uplift requires domain experts (biosecurity, cybersecurity, nuclear experts) on the red team.)
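The uplift idea in the last answer can be sketched as a paired comparison: domain experts score how much each answer advances a malicious actor, once for the model's responses and once for a search-engine baseline, and uplift is the difference. The scale, scores, and threshold below are invented for illustration:

```python
# Hedged sketch of an uplift-style evaluation. Experts rate each
# response on a 0-10 "capability advancement" scale (hypothetical);
# uplift is the mean model score minus the mean baseline score.
def uplift(model_scores, baseline_scores):
    return sum(model_scores) / len(model_scores) - sum(baseline_scores) / len(baseline_scores)

# Hypothetical expert ratings over three matched harmful queries:
model_scores = [6, 7, 5]     # what the model's answers provided
baseline_scores = [4, 5, 3]  # what a search engine surfaced
u = uplift(model_scores, baseline_scores)
print(u)        # 2.0
print(u > 1.0)  # True: above a hypothetical red-line threshold
```

The key design point is the baseline: scoring the model in isolation overstates risk, because much harmful information is already public. Only the marginal capability matters for governance.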