AI red teaming is the practice of systematically probing AI models for harmful behaviours, safety failures, and policy violations using adversarial testing — mirroring the cybersecurity practice of red teaming computer systems. Red teams attempt to elicit harmful outputs, test the consistency of safety training, find jailbreaks, and identify failure modes that automated testing misses. Both AI labs (internal red teams) and external researchers conduct red teaming, which has become a standard pre-deployment requirement for frontier AI systems.
What AI red teams actually test
| Risk category | What testers probe for | Example attack |
|---|---|---|
| Harmful content generation | Producing violence, hate speech, CSAM, self-harm instructions | Escalating role-play scenarios, hypothetical framings |
| Dangerous capability elicitation | CBRN weapons, bioweapon synthesis, critical infrastructure attacks | Scientific framing, academic paper personas, multi-step decomposition |
| Privacy violations | Extracting training data, inferring PII, repeating memorised content | Prefix injection to elicit memorised training examples |
| Deception and manipulation | Sycophancy, false confidence, social manipulation tactics | Asking model to persuade, mislead, or deceive in specified ways |
| Agentic safety failures | Harmful autonomous actions, prompt injection exploitation, scope creep | Malicious tool results, adversarial environment states |
| Model-specific jailbreaks | Character-specific bypass patterns that survived safety training | Model-targeted adversarial prompts, many-shot examples |
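The risk categories in the table above map naturally onto a coverage checklist. A minimal sketch of a findings log that tracks which categories have been probed (all class and category names here are hypothetical, not from any standard tool):

```python
from dataclasses import dataclass, field

# Illustrative category keys mirroring the table above
RISK_CATEGORIES = [
    "harmful_content",
    "dangerous_capabilities",
    "privacy",
    "deception",
    "agentic_safety",
    "jailbreaks",
]

@dataclass
class Finding:
    category: str    # which risk category was probed
    attack: str      # short description of the attack used
    succeeded: bool  # did the model produce a disallowed output?

@dataclass
class RedTeamLog:
    findings: list = field(default_factory=list)

    def record(self, category, attack, succeeded):
        assert category in RISK_CATEGORIES, f"unknown category: {category}"
        self.findings.append(Finding(category, attack, succeeded))

    def coverage(self):
        """Categories with at least one recorded probe."""
        return {f.category for f in self.findings}

    def untested(self):
        return [c for c in RISK_CATEGORIES if c not in self.coverage()]

log = RedTeamLog()
log.record("privacy", "prefix injection to elicit memorised text", succeeded=False)
log.record("jailbreaks", "many-shot adversarial examples", succeeded=True)
print(log.untested())  # categories still needing at least one probe
```

A log like this makes gaps visible before sign-off: a red team that never recorded an agentic-safety probe has not covered the table.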
Red teaming methodology
- Scope definition: Define what capabilities and risk categories are in scope. A red team for a medical AI product focuses differently from one for a general-purpose chatbot.
- Attack surface mapping: Identify all inputs the model can receive — direct prompts, system prompts, tool results, retrieved documents, multimodal inputs — and test each as a potential injection vector.
- Manual adversarial prompting: Human red teamers craft targeted prompts based on their understanding of the model's safety training and likely failure modes. Human creativity is essential — automated testing systematically misses novel attack patterns.
- Automated fuzzing: Tools like Garak (open-source LLM vulnerability scanner) and commercial red-teaming platforms automatically generate thousands of adversarial prompts and record failure rates.
- Structured scoring: Rate each finding by severity (how harmful is the output), consistency (does the attack succeed reliably), and novelty (is this a known or new technique). Prioritise fixes by this risk matrix.
- Responsible disclosure: Document all findings in a structured red team report. Share with the AI lab before public disclosure. Follow the same responsible disclosure norms as vulnerability research in traditional cybersecurity.
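The structured-scoring step above can be sketched as a likelihood-times-impact calculation over the three rated dimensions. The weights and the novelty bonus here are illustrative assumptions, not a standard formula:

```python
# Hedged sketch: one way to turn severity/consistency/novelty ratings
# into a fix-priority score. Weights are illustrative.
def risk_score(severity, consistency, novelty):
    """severity and consistency rated 1-5; novelty True if previously unknown."""
    base = severity * consistency          # classic impact x likelihood matrix
    return base + (5 if novelty else 0)    # novel techniques get a bump

# Hypothetical findings from a red-team exercise
findings = [
    {"id": "F1", "severity": 5, "consistency": 2, "novelty": False},
    {"id": "F2", "severity": 3, "consistency": 5, "novelty": True},
    {"id": "F3", "severity": 2, "consistency": 1, "novelty": False},
]

ranked = sorted(
    findings,
    key=lambda f: risk_score(f["severity"], f["consistency"], f["novelty"]),
    reverse=True,
)
print([f["id"] for f in ranked])  # fix highest-risk findings first
```

Note that a reliably reproducible medium-severity attack (F2) can outrank a severe but flaky one (F1), which matches how defenders usually triage.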
Garak — open-source LLM vulnerability scanner for automated red teaming
```python
# Install: pip install garak
# Garak tests LLM APIs against hundreds of known attack categories.
# Command-line usage (runs probes and scores outputs with detectors):
#   python -m garak --model_type openai --model_name gpt-4o-mini --probes all
# Programmatic usage (a sketch; garak's Python API differs between versions):
from garak.generators.openai import OpenAIGenerator
from garak.probes.promptinject import HijackHateHumans

generator = OpenAIGenerator(name="gpt-4o-mini")

# Probe for prompt-injection vulnerabilities
probe = HijackHateHumans()
attempts = probe.probe(generator)

# Each result is an Attempt object recording the prompt that was sent and
# the model's outputs. Pass/fail verdicts come from garak's detectors,
# which the command-line runner applies automatically and summarises in
# its report; when driving probes by hand, inspect the outputs directly:
for attempt in attempts:
    print(f"PROMPT: {attempt.prompt[:100]}...")
    print(f"OUTPUT: {attempt.outputs[0][:200]}...")
```

Government and regulatory requirements for red teaming
- US Executive Order 14110 (2023): Required developers of frontier AI models (above defined compute thresholds) to share red team results with the US government before deployment. Reinforced by the Frontier AI Safety Commitments signed by major labs.
- EU AI Act (2024): Requires 'high-risk' AI systems to undergo conformity assessments including adversarial testing. 'General-purpose AI models with systemic risk' (above 10²⁵ FLOP training compute) must conduct adversarial testing and report to the AI Office.
- UK AI Safety Institute: Conducted pre-deployment evaluations of Claude 3, GPT-4o, and Gemini 1.5 Pro in 2024, focusing on CBRN and offensive cyber capabilities. Published findings inform regulatory guidance.
- Voluntary commitments: Anthropic, OpenAI, Google DeepMind, Microsoft, Meta, and others have committed to pre-deployment third-party safety evaluations and sharing red team findings with governments before releasing frontier models.
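The EU AI Act's 10²⁵ FLOP threshold can be sanity-checked with the common ~6 × parameters × tokens estimate for dense transformer training compute. The model size and token count below are hypothetical:

```python
# Back-of-envelope check against the EU AI Act's 10^25 FLOP
# systemic-risk threshold, using the standard ~6*N*D approximation
# for dense transformer training compute.
def training_flops(params, tokens):
    return 6 * params * tokens

EU_THRESHOLD = 1e25

# Hypothetical 70B-parameter model trained on 15T tokens:
flops = training_flops(70e9, 15e12)
print(f"{flops:.2e}")         # ~6.30e+24 FLOP
print(flops >= EU_THRESHOLD)  # False: just below the threshold
```

The example shows why the threshold bites mainly at the frontier: a large open-weights model of this scale sits below it, while the biggest training runs exceed it.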
Practice questions
- What is the difference between white-box and black-box red teaming for LLMs? (Answer: White-box: red teamers have access to model weights, architecture, and training details — can compute gradient-based adversarial inputs, analyse internal representations, and test specific safety mechanisms. Black-box: only API access — must probe through prompt manipulation, observing outputs. Most deployed model red teaming is black-box (models are proprietary). White-box enables much more systematic vulnerability analysis but requires model access.)
- What categories of harm do AI red teams typically test for? (Answer: Weapons of mass destruction assistance (bio, chem, nuclear, radiological); CSAM generation; violent extremism facilitation; cyberweapon creation; personal data extraction; identity theft facilitation; privacy violations; social engineering tools; disinformation generation at scale; self-harm facilitation; financial fraud assistance. The UK AISI red-teaming framework adds persuasion capabilities (manipulation) and uplift (does the model provide meaningful additional capability beyond a search engine?).)
- What is the difference between offensive jailbreaking (finding safety bypasses) and constructive red teaming? (Answer: Offensive jailbreaking: adversarially find ways to extract harmful outputs, often to embarrass the company or for personal use. Goal: demonstrate vulnerability. Constructive red teaming: systematically probe for vulnerabilities to improve safety before deployment. Goal: enumerate and fix vulnerabilities. Constructive red teams write detailed reports about successful attacks, their reliability, and recommended mitigations — not just demonstrate that an attack works.)
- The UK AI Safety Institute conducts pre-deployment red teaming of frontier models. What gap does this fill? (Answer: Internal red teaming has conflicts of interest (a lab may not want to find problems that delay release) and limited attack-surface coverage (the internal team shares the same blind spots). Independent government red teaming by the UK AISI provides independent assessment, broader attack-surface coverage from diverse team backgrounds, authoritative public reporting, and accountability. It mirrors financial auditing: internal controls plus independent external verification.)
- What is a red team 'uplift evaluation' and why is it important for frontier AI governance? (Answer: Uplift evaluation measures whether a model provides meaningful capability enhancement (uplift) to someone trying to cause harm. Not just 'can it answer harmful questions' but 'does it provide information that substantially increases a malicious actor's capability beyond what Google provides?' For bioweapons: does the model provide synthesis routes that meaningfully advance a bad actor beyond public literature? High-uplift responses are a hard safety red line. Evaluating uplift requires domain experts (biosecurity, cybersecurity, nuclear experts) on the red team.)
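The uplift idea in the last answer can be sketched as a paired comparison: domain experts score how much each answer advances a malicious actor, once for the model's responses and once for a search-engine baseline, and uplift is the difference. The scale, scores, and threshold below are invented for illustration:

```python
# Hedged sketch of an uplift-style evaluation. Experts rate each
# response on a 0-10 "capability advancement" scale (hypothetical);
# uplift is the mean model score minus the mean baseline score.
def uplift(model_scores, baseline_scores):
    return sum(model_scores) / len(model_scores) - sum(baseline_scores) / len(baseline_scores)

# Hypothetical expert ratings over three matched harmful queries:
model_scores = [6, 7, 5]     # what the model's answers provided
baseline_scores = [4, 5, 3]  # what a search engine surfaced
u = uplift(model_scores, baseline_scores)
print(u)        # 2.0
print(u > 1.0)  # True: above a hypothetical red-line threshold
```

The key design point is the baseline: scoring the model in isolation overstates risk, because much harmful information is already public. Only the marginal capability matters for governance.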