Red Teaming AI Models: Methodology and Documentation Standards

Meta description: A practical guide to planning, executing, and documenting AI red‑team assessments that meet EU AI Act, NIST, and ISO standards while delivering clear, actionable findings.

AI red teaming is adversarial testing specifically designed for artificial intelligence systems. Expert testers probe AI models and their deployment infrastructure to find misalignment, security vulnerabilities, and safety failures before attackers or unintended outputs cause harm. It is not penetration testing for networks. It is a distinct discipline with its own methodologies, tooling, documentation standards, and increasingly, regulatory requirements.

The practice is no longer optional for organizations deploying AI in production. The EU AI Act (effective 2025) mandates adversarial testing as part of conformity assessments for high‑risk AI systems. NIST's AI Risk Management Framework explicitly recommends red teaming as part of its Measure function. Brazil's BCB Resolution 538 requires independent security testing for AI systems. And according to Adversa AI's 2025 security report, 35 % of real‑world AI security incidents involved attacks that could have been discovered through structured red teaming.

This guide covers the complete methodology for planning, executing, and documenting an AI red team assessment—standards‑aligned and applicable to LLMs, ML models, and autonomous agents.

Why Traditional Penetration Testing Falls Short

Conventional penetration testing targets network boundaries, application code, and infrastructure. AI systems introduce an entirely different attack surface: the model itself, its training data, its inference pipeline, and the human‑AI interaction loop.

An AI red team must assess whether the model will produce harmful outputs under adversarial conditions, whether an attacker can manipulate its behavior through crafted inputs, whether training data can be poisoned, whether the system can be used for purposes it wasn't designed for, and whether autonomous agentic actions can escalate beyond intended boundaries. Network penetration tests don't touch any of these surfaces.

The EU AI Act is explicit: conformity assessments for high‑risk systems require testing for “bias, discrimination, unintended outcomes, and robustness against adversarial conditions.” These aren't things a Nessus scan finds.

The AI Red Teaming Methodology: Five Phases

A structured methodology ensures comprehensive coverage across attack categories, reproducible testing that can be compared across model versions, and consistent severity classification for prioritization.

Phase 1 — Reconnaissance and Threat Modeling

Before testing begins, you map the attack surface. This means documenting:

System architecture: Model provider, version, system‑prompt structure, RAG (retrieval‑augmented generation) data sources, tool integrations, guardrail configurations, and deployment context (API, embedded in a product, autonomous agent with tool use).
Use‑case boundaries: What the system is designed to do, who uses it, and what harm would result from misuse.
Threat actors: Who might attack this system and why. State‑sponsored actors targeting model weights. Competitors seeking to extract training data. Internal users attempting to bypass safety controls. Autonomous agents acting outside intended parameters.

Threat modeling for AI systems benefits from the STRIDE model (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege) adapted for AI contexts. MITRE ATLAS provides a more targeted alternative—a knowledge base of adversarial tactics, techniques, and procedures specific to AI and ML systems, modeled after the MITRE ATT&CK framework. ATLAS catalogs 15 tactics and 66 techniques observed against real AI deployments, with an October 2025 update adding 14 new techniques covering AI agents and generative AI attack patterns.

The output of threat modeling is a prioritized list of attack scenarios ranked by likelihood and impact. This list drives the test‑case design in Phase 2.

Phase 2 — Test Case Design

Each test case must be specific, scoped, and tied to a measurable success criterion. Test cases should map to at least one of:

OWASP Top 10 for LLM Applications: The definitive taxonomy of LLM security risks—prompt injection, insecure output handling, training‑data poisoning, model denial‑of‑service, supply‑chain vulnerabilities, and more.
MITRE ATLAS tactics: Reconnaissance, resource development, initial access, model evasion, exfiltration, and impact.
NIST AI 100‑2 attack taxonomy: The technical classification of adversarial machine‑learning attacks, updated in 2025 to include LLM‑specific vectors like prompt injection and jailbreaking.

Test cases should cover:

Prompt injection: Can an attacker inject instructions into the model through user input or third‑party data that override system directives?
Jailbreaking: Can the model be induced to produce outputs that violate its stated guidelines or safety policies?
Data exfiltration: Can an attacker extract training data, private documents, or system‑prompt content through carefully crafted queries?
Model manipulation: Can the model’s behavior be altered through inputs in ways that persist or spread to other users?
Autonomous‑agent abuse: If the system uses tools or agents, can it be induced to take unintended actions, access unauthorized resources, or exceed its intended scope?
Output manipulation: Can the model be made to produce incorrect, misleading, or harmful outputs that appear credible?

Phase 3 — Adversarial Attack Execution

Execution combines automated scanning, structured manual testing, and expert human probing.

Automated tools provide coverage at scale. Garak by NVIDIA is the most comprehensive open‑source LLM vulnerability scanner, offering plugins for hallucination, prompt injection, data leakage, and safety bypass. PyRIT (Python Risk Identification Tool) from Microsoft integrates with Azure AI Foundry and includes an AI Red Teaming Agent (released April 2025) for automated testing against defined risk categories. IBM ART (Adversarial Robustness Toolbox) tests traditional ML model robustness against adversarial perturbations. HarmBench supplies standardized safety evaluation across defined harm categories, enabling reproducible comparison across model versions.

Manual expert testing catches what automation misses. Automated tools probe known vulnerability patterns; human testers find novel attack surfaces, context‑specific harms, and scenarios that require domain expertise. Anthropic’s published methodology employs domain‑specific experts—trust‑and‑safety specialists, national‑security professionals, and industry experts—because a generalist red teamer won’t identify the highest‑risk vectors in a medical AI or a financial model.

Hybrid approaches are the current best practice. Run automated scans first to establish a baseline coverage floor, then deploy expert testers against the highest‑risk scenarios identified in threat modeling.

Phase 4 — Impact Analysis and Risk Quantification

Every finding must be documented with three components:

Reproducible exploit demonstration: Exact input, model version, system‑prompt state, and conditions that triggered the vulnerability. AI systems produce variable outputs, so precise documentation is essential for developer verification and retesting.
Business impact assessment: What harm results if this vulnerability is exploited in production? Categories include user harm, data breach, regulatory exposure, operational disruption, or reputational damage.
Severity classification: Critical, high, medium, or low, mapped to OWASP LLM Top 10, MITRE ATLAS, or NIST AI 100‑2 categories.

Severity	Definition	Example	Response SLA
Critical	Immediate harm potential; active exploitation possible	Direct extraction of PII through prompt injection	Immediate mitigation
High	Significant harm; low barrier to exploitation	Successful jailbreak enabling harmful content generation	Within 72 hours
Medium	Moderate harm; requires specific conditions	Model manipulation through carefully crafted multi‑turn conversation	Within 2 weeks
Low	Minimal harm; high barrier to exploitation	Indirect information disclosure through indirect prompt injection	Next sprint

Phase 5 — Reporting and Remediation Verification

Reporting bridges technical findings and organizational action. An effective report has three layers:

Executive summary: Risk quantification, strategic recommendations, and audit‑ready narrative for leadership and board‑level governance. This is where you connect findings to the NIST AI RMF Govern and Manage functions.

Technical findings: Detailed documentation of each vulnerability with reproducible test cases, evidence (model output, logs, screenshots), severity classification, OWASP/MITRE/NIST mapping, and specific remediation guidance.

Remediation plan: Prioritized list of fixes with owner, timeline, and verification criteria. Establish a retest cadence to confirm mitigations hold.

Documentation Standards That Actually Matter

Documentation is not an afterthought. It is the deliverable. Poor documentation means findings get fixed and forgotten, retesting is impossible, and regulators asking for evidence of adversarial testing receive a vague description.

Reproducible Test Cases

For each finding, document:

Exact input text or parameters used
Model version and deployment configuration
System‑prompt state at time of test
RAG context or external data state (if applicable)
All intermediate outputs
Confirmation that the finding is reproducible

Without this, developers cannot verify the finding, auditors cannot assess the testing rigor, and the next red team cannot build on previous work.

Framework Mapping

Map every finding to at least one authoritative framework:

Framework	What It Covers	Finding Mapping
OWASP Top 10 LLM	10 priority attack categories for LLM applications	Directly maps prompt injection, SSRF, data poisoning, etc.
MITRE ATLAS	15 tactics, 66 techniques for AI/ML systems	Maps adversarial journey from reconnaissance to impact
NIST AI 100‑2	Technical taxonomy of adversarial ML attacks	Maps attack categories and mitigation taxonomy
NIST AI RMF	AI lifecycle risk management	Maps findings to Measure and Manage functions
EU AI Act	High‑risk AI conformity assessment requirements	Maps to Article 9 risk‑management and Article 15 performance requirements

Evidence Packaging for Compliance

Different audiences need different evidence formats:

EU AI Act conformity bodies: Map each test to Article 9 risk‑management requirements. Show that testing covers the system’s intended use and reasonably foreseeable misuse. Provide reproducible test cases and remediation evidence.
NIST AI RMF audit readiness: Document testing under the Measure function. Demonstrate that adversarial testing is part of a continuous evaluation cycle, not a one‑time event.
ISO 42001 certification: Show that adversarial testing is integrated into your AI management system’s risk‑treatment process (Clause 8.3) and incident‑management (Clause 8.10).

Integrating Red Teaming Into the Development Lifecycle

Red teaming is not a gate. It is a continuous practice. The field has moved past one‑time assessments before launch.

Pre‑deployment red teaming is the minimum. Conduct comprehensive adversarial testing before any AI system goes into production, especially for systems that make decisions affecting individuals, handle sensitive data, or operate autonomously.

Post‑update red teaming is mandatory for any model‑version change, system‑prompt modification, new tool integration, or data‑source addition. Treat each change as a mini‑release and run the full Phase 1‑5 cycle on a scaled‑down scope.

Continuous monitoring adds a feedback loop. Deploy lightweight “canary” probes that periodically issue benign test prompts to production endpoints. Log any deviation from expected safety responses and feed those incidents back into the threat‑modeling backlog.

By embedding red‑team activities into CI/CD pipelines, you turn security from a checkpoint into a habit. Tools like GitHub Actions can trigger automated Garak scans on every pull request, while a scheduled Azure Function can run PyRIT against staging environments nightly.

Key Takeaways

Red teaming is a regulatory requirement for high‑risk AI under the EU AI Act, NIST AI RMF, and emerging ISO standards.
Five‑phase methodology (Recon & Threat Modeling → Test Case Design → Attack Execution → Impact Analysis → Reporting) provides repeatable, auditable coverage.
Map every finding to OWASP, MITRE ATLAS, and NIST frameworks to satisfy compliance and give stakeholders a common language.
Document reproducible test cases with exact inputs, model version, and environment details; this is the only way to verify fixes and support audits.
Integrate testing into the CI/CD workflow and run post‑update assessments to keep risk posture current.
Report for three audiences: executives (risk summary), engineers (technical details), and auditors (framework mapping and evidence).

Conclusion

Red teaming AI models is no longer a nice‑to‑have security exercise—it’s a cornerstone of responsible AI deployment. By following the five‑phase methodology, aligning each step with recognized frameworks, and treating documentation as a first‑class deliverable, organizations can uncover hidden vulnerabilities before they become incidents, satisfy tightening regulations, and build trust with users and regulators alike.

Start today by assembling a cross‑functional threat‑modeling workshop, selecting the right mix of automated scanners and expert testers, and drafting a reporting template that speaks to both engineers and executives. Treat every model update as an invitation to re‑run the cycle, and you’ll turn red teaming from a one‑off audit into a continuous safeguard for your AI assets.