DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Zhaorun Chen, Xun Liu, Haibo Tong, Chengquan Guo, Yuzhou Nie, Jiawei Zhang, Mintong Kang, Chejian Xu

May 6, 2026

arXiv:2605.04808v1

cs.AI (primary)

#60 of 2292 · Artificial Intelligence

Tournament Score: 1562 ± 44

Win Rate: 91%

Rating: 8.5 / 10

Significance 9 · Rigor 8 · Novelty 8 · Clarity 7

Abstract

AI agents are increasingly deployed across diverse domains to automate complex workflows through long-horizon and high-stakes action executions. Due to their high capability and flexibility, such agents raise significant security and safety concerns. A growing number of real-world incidents have shown that adversaries can easily manipulate agents into performing harmful actions, such as leaking API keys, deleting user data, or initiating unauthorized transactions. Evaluating agent security is inherently challenging, as agents operate in dynamic, untrusted environments involving external tools, heterogeneous data sources, and frequent user interactions. However, realistic, controllable, and reproducible environments for large-scale risk assessment remain largely underexplored. To address this gap, we introduce the DecodingTrust-Agent Platform (DTap), the first controllable and interactive red-teaming platform for AI agents, spanning 14 real-world domains and over 50 simulation environments that replicate widely used systems such as Google Workspace, PayPal, and Slack. To scale the risk assessment of agents in DTap, we further propose DTap-Red, the first autonomous red-teaming agent that systematically explores diverse injection vectors (e.g., prompt, tool, skill, environment, combinations) and autonomously discovers effective attack strategies tailored to varying malicious goals. Using DTap-Red, we curate DTap-Bench, a large-scale red-teaming dataset comprising high-quality instances across domains, each paired with a verifiable judge to automatically validate attack outcomes. Through DTap, we conduct large-scale evaluations of popular AI agents built on various backbone models, spanning security policies, risk categories, and attack strategies, revealing systematic vulnerability patterns and providing valuable insights for developing secure next-generation agents.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: DecodingTrust-Agent Platform (DTap)

1. Core Contribution

DTap addresses a critical gap in AI safety infrastructure: the lack of realistic, controllable, and reproducible environments for systematically evaluating the security of autonomous AI agents. The paper delivers three interconnected contributions:

(a) DTap Platform: 50+ simulated environments across 14 domains (finance, healthcare, legal, CRM, etc.) that replicate real-world services (Gmail, PayPal, Slack, Salesforce) with faithful MCP and GUI interfaces, purpose-built for red-teaming with deterministic state management, flexible resets, and parallelizable execution.

(b) DTap-Red: An autonomous red-teaming agent that systematically explores diverse injection vectors (prompt, tool, skill, environment, and compositional combinations) using a multi-layer memory module, attack skill library (200+ strategies), and iterative optimization guided by verifiable judge feedback.

(c) DTap-Bench: A large-scale benchmark of 6,682 tasks (3,876 red-teaming, 2,806 benign) spanning 4K+ malicious goals derived from 300+ risk categories extracted from 60+ real-world security policies, each paired with rule-based verifiable judges that check environment states rather than relying on LLM-based evaluation.

2. Methodological Rigor

Strengths in evaluation design: The paper's use of verifiable, environment-state-based judges is a significant methodological advance over prior work (AgentDojo, AgentHarm) that relied on trajectory-based or LLM-based evaluation. By checking whether unauthorized transactions actually executed or data was actually exfiltrated, the platform reduces false positives substantially.
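The difference between the two judging styles can be illustrated with a minimal sketch. Everything here is hypothetical: `PaymentEnv`, `state_judge`, and `trajectory_judge` are invented names for illustration and are not DTap's actual interfaces. The point is only that a judge reading the environment's final state does not count an attempted but failed transfer as a successful attack, whereas a judge reading the tool-call log does.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentEnv:
    """Hypothetical simulated payment service; not DTap's actual API."""
    balances: dict[str, float]
    transfers: list[tuple[str, str, float]] = field(default_factory=list)

    def transfer(self, src: str, dst: str, amount: float) -> bool:
        # Reject transfers the account cannot cover, leaving state unchanged.
        if amount <= 0 or self.balances.get(src, 0.0) < amount:
            return False
        self.balances[src] -= amount
        self.balances[dst] = self.balances.get(dst, 0.0) + amount
        self.transfers.append((src, dst, amount))
        return True


def state_judge(env: PaymentEnv, attacker: str) -> bool:
    """Attack succeeds only if money actually reached the attacker account."""
    return any(dst == attacker for _, dst, _ in env.transfers)


def trajectory_judge(tool_calls: list[dict], attacker: str) -> bool:
    """Naive baseline: flags any attempted transfer to the attacker."""
    return any(
        c["tool"] == "transfer" and c["args"]["dst"] == attacker
        for c in tool_calls
    )


env = PaymentEnv(balances={"alice": 50.0})
calls = [{"tool": "transfer",
          "args": {"src": "alice", "dst": "mallory", "amount": 500.0}}]
for call in calls:
    env.transfer(**call["args"])

print(trajectory_judge(calls, "mallory"))  # True: an attempt was logged
print(state_judge(env, "mallory"))         # False: the transfer never executed
```

In this toy case the trajectory-based check would report a false positive, which is the kind of error the assessment credits state-based judging with reducing.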

Scale and breadth: The evaluation covers 8 agent configurations across 4 major frameworks (OpenAI Agents SDK, Claude Code, Google ADK, OpenClaw) with frontier models (GPT-5.x, Gemini-3-Pro, Claude-Sonnet-4.5, DeepSeek-V4-Pro). The separation into direct and indirect threat models with distinct injection surfaces is well-motivated and reveals genuinely different vulnerability profiles.

Policy grounding: Risk categories are systematically derived from real policies (Salesforce AUP, FINRA, EU AI Act, GDPR, HIPAA), giving the benchmark regulatory relevance rather than relying on ad hoc malicious goals.

Investment and quality control: The paper reports ~16,000 expert hours from 17 specialists over 20 months and $120K in API credits, with human review of generated attacks — a substantial quality assurance effort.

Potential concerns: The attack optimization uses GPT-5.1 as a surrogate, meaning ASR numbers reflect transfer attack effectiveness rather than worst-case vulnerability. The paper acknowledges this but could more explicitly characterize the gap. Additionally, the reproducibility of the autonomous red-teaming agent's optimization trajectory across different random seeds is not discussed.
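One way the seed-reproducibility concern could be addressed is by reporting attack success rate (ASR) per optimization seed together with its spread. The sketch below assumes ASR is simply the fraction of tasks whose judge reported success; the verdicts are placeholder values invented for illustration, not results from the paper.

```python
from statistics import mean, stdev


def attack_success_rate(outcomes: list[bool]) -> float:
    """Fraction of red-teaming tasks whose judge reported success."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(outcomes) / len(outcomes)


# Placeholder judge verdicts for three optimization seeds (illustrative only).
runs_by_seed = {
    0: [True, False, True, False],
    1: [True, True, True, False],
    2: [False, False, True, False],
}

asr_per_seed = {seed: attack_success_rate(v) for seed, v in runs_by_seed.items()}
values = list(asr_per_seed.values())

print(asr_per_seed)  # {0: 0.5, 1: 0.75, 2: 0.25}
print(f"mean ASR = {mean(values):.2f}, std across seeds = {stdev(values):.2f}")
```

A large standard deviation relative to the mean would suggest that a single reported ASR depends heavily on the particular optimization run.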

3. Potential Impact

Immediate practical value: DTap provides the first comprehensive infrastructure for pre-deployment security auditing of agentic systems. Given the rapid commercial deployment of AI agents (Claude Code, OpenAI Agents SDK, Google ADK), this fills an urgent operational need.

Key empirical findings with defensive implications:

The "execute-then-refuse" failure mode in OpenAI Agents SDK and Google ADK (executing harmful tool calls before issuing refusals due to batch invocation) is a novel, actionable finding that points to specific harness-level fixes.

The asymmetric vulnerability across injection surfaces (skill injection consistently outperforming environment injection) identifies underexplored attack vectors.

The finding that compositional multi-vector attacks substantially amplify ASR beyond individual channels provides concrete guidance for defense prioritization.

The demonstration that harness engineering can reduce ASR by ~31% at minimal utility cost highlights an underinvested defense surface.
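The "execute-then-refuse" pattern named in the first finding above can be sketched as a toy harness. This is an illustration under stated assumptions, not the behavior of any real SDK: `ModelTurn`, `run_batch`, and `run_gated` are hypothetical names, and the refusal is modeled as a simple boolean flag on the turn. The gated variant is one possible harness-level fix, not necessarily the one the paper evaluates.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModelTurn:
    """Hypothetical model output: tool calls plus a refusal flag for the turn."""
    tool_calls: list[str]
    refused: bool


def run_batch(turn: ModelTurn, execute: Callable[[str], None]) -> int:
    """Batch invocation: every call runs before the refusal is looked at."""
    for call in turn.tool_calls:
        execute(call)
    return len(turn.tool_calls)


def run_gated(turn: ModelTurn, execute: Callable[[str], None]) -> int:
    """Gated variant: a refused turn commits none of its tool calls."""
    if turn.refused:
        return 0
    return run_batch(turn, execute)


turn = ModelTurn(tool_calls=["delete_user_data"], refused=True)

batch_log: list[str] = []
gated_log: list[str] = []
run_batch(turn, batch_log.append)
run_gated(turn, gated_log.append)

print(batch_log)  # ['delete_user_data']: side effect despite the refusal
print(gated_log)  # []: nothing executed
```

Under a state-based judge, the batch harness would be scored as a successful attack even though the agent's final message is a refusal, because the harmful side effect already occurred.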

Broader influence: The platform establishes a new standard for agent security evaluation that could influence how frontier labs conduct internal red-teaming, how regulators assess agent safety compliance, and how the research community benchmarks defensive techniques.

4. Timeliness & Relevance

This work arrives at a pivotal moment. Major AI labs are rapidly deploying agentic systems (Manus, OpenAI Codex, Claude Code) with increasing autonomy over real-world tools and data. The paper directly addresses the OWASP Top 10 for LLM applications and emerging regulatory requirements (EU AI Act). The gap between agent deployment velocity and security evaluation infrastructure makes this contribution urgently needed.

5. Strengths & Limitations

Key Strengths:

Unprecedented scale: 50+ environments, 14 domains, 6,682 tasks — far exceeding prior work (AgentDojo: limited tools; AgentHarm: static environments; SHADE-Arena: simplified settings).

The verifiable judge design eliminates reward hacking that plagues LLM-based evaluation.

Cross-framework comparison reveals that security is primarily a function of alignment quality and harness design, not raw capability — a finding with significant implications for the field.

The open-source release enables community adoption and extension.

Notable Limitations:

The simulated environments, while realistic, may not capture all emergent behaviors of production systems (e.g., rate limiting, real authentication flows, actual financial settlement).

The benchmark is static once generated; adaptive adversaries that probe victim-specific weaknesses at test time are not evaluated.

The paper is extremely long and dense, which may limit accessibility despite the importance of the findings.

Multi-agent scenarios (agent-to-agent attacks) are not covered.

The cost barrier to replication ($120K API credits) may limit community adoption despite open-sourcing.

6. Additional Observations

The t-SNE visualization (Fig. 5) comparing DTap trajectories with AgentDojo demonstrates substantially richer behavioral diversity. The domain-specific analyses (14 detailed appendices) provide depth rarely seen in benchmark papers, with each domain containing independent findings, case studies, and policy analysis. The paper effectively functions as both a platform paper and a large-scale empirical study.

Rating: 8.5 / 10

Significance 9 · Rigor 8 · Novelty 8 · Clarity 7

Generated May 7, 2026

Comparison History (23)

vs. Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

claude-opus-4.6 · 5/16/2026

Formal Conjectures provides a lasting, evolving benchmark for a fundamental challenge—automated mathematical reasoning and discovery—with demonstrated real-world utility (resolving open conjectures). Its impact spans mathematics, AI reasoning, and formal verification, creating infrastructure that can drive discoveries for years. While DTap addresses the important and timely problem of AI agent security with impressive scale, it is more narrowly focused on red-teaming evaluation. Formal Conjectures' open-source, community-driven nature and its potential to bridge AI and mathematical research give it broader and more enduring scientific impact.

vs. CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators

gemini-3.1 · 5/16/2026

Paper 2 addresses a highly pressing and timely issue: the safety and security of autonomous AI agents. By providing a comprehensive, reproducible red-teaming platform (DTap), an autonomous red-teaming agent, and a large-scale benchmark, it offers immediate, real-world utility for evaluating and securing AI deployments across diverse domains. While Paper 1 introduces an innovative approach to a fundamental AI limitation (causal reasoning), Paper 2's potential to establish standard security evaluation practices for AI agents gives it a broader and more urgent scientific and practical impact.

vs. Emotion Concepts and their Function in a Large Language Model

claude-opus-4.6 · 5/16/2026

Paper 1 presents a fundamentally novel finding about functional emotions in LLMs, revealing that abstract emotion representations causally influence model behavior including alignment-critical outcomes like reward hacking and sycophancy. This opens entirely new research directions in mechanistic interpretability and AI alignment. While Paper 2 contributes a useful engineering platform for red-teaming AI agents, it is more incremental—building on existing benchmarking paradigms. Paper 1's conceptual contribution (linking emotion representations to misalignment) has broader theoretical implications across AI safety, cognitive science, and philosophy of mind.

vs. BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

claude-opus-4.6 · 5/16/2026

Paper 1 addresses the critical and timely problem of AI agent security with a comprehensive platform spanning 14 domains and 50+ environments, introduces an autonomous red-teaming agent, and provides systematic evaluations of popular AI systems. AI agent safety is a high-priority concern with broad implications across the entire AI field. Paper 2, while valuable, targets a narrower domain (CAD code generation) with a specialized benchmark. Paper 1's breadth of impact, timeliness given rapid agent deployment, and relevance to safety/security give it higher potential scientific impact across a wider research community.

vs. AcuityBench: Evaluating Clinical Acuity Identification and Uncertainty Alignment

gpt-5.2 · 5/16/2026

Paper 2 likely has higher impact due to broader scope and timeliness: agent security is a rapidly emerging, cross-domain risk area. DTap offers a large, controllable, reproducible platform (14 domains, 50+ environments) plus an autonomous red-teaming agent and verifiable benchmarking dataset—an end-to-end infrastructure that can become a community standard. Its methodological contribution (interactive sims, systematic attack vector exploration, automated outcome verification) enables scalable evaluation and directly supports real-world mitigation. Paper 1 is valuable and safety-critical, but narrower (clinical acuity) and smaller-scale, limiting breadth.

vs. Marrying Generative Model of Healthcare Events with Digital Twin of Social Determinants of Health for Disease Reasoning

claude-opus-4.6 · 5/16/2026

Paper 1 addresses the critical and timely problem of AI agent security with a comprehensive platform spanning 14 domains and 50+ environments. As AI agents become ubiquitous, systematic security evaluation is urgently needed. The platform's breadth (autonomous red-teaming, large-scale benchmarks, multiple attack vectors) and practical applicability to real-world systems (Google Workspace, PayPal, Slack) give it broad impact across AI safety, security, and policy. Paper 2 is innovative in combining SDoH with generative models for disease reasoning, but targets a narrower audience. Paper 1's timeliness amid rapid AI agent deployment amplifies its potential impact.

vs. Data-driven Circuit Discovery for Interpretability of Language Models

claude-opus-4.6 · 5/16/2026

Paper 1 addresses the critical and timely problem of AI agent security with a comprehensive platform spanning 14 domains, 50+ environments, and introduces novel autonomous red-teaming capabilities. Given the rapid deployment of AI agents in real-world applications, this work has immediate practical impact for safety evaluation. Paper 2 makes a solid contribution to mechanistic interpretability by questioning assumptions in circuit discovery, but its scope is more narrow and incremental within the interpretability subfield. Paper 1's breadth of impact across security, safety, and multiple application domains, combined with its practical tooling contribution, gives it higher estimated impact.

vs. Agentic Systems as Boosting Weak Reasoning Models

gpt-5.2 · 5/16/2026

Paper 2 likely has higher scientific impact due to its broad, timely real-world relevance (agent security), strong application potential (a controllable, reproducible red-teaming platform across 14 domains/50+ environments), and clear ecosystem utility (benchmark + autonomous red-teaming agent + verifiable judging). It can standardize evaluation and accelerate research across safety, security, HCI, and agentic systems. Paper 1 is methodologically rigorous and insightful for inference-time orchestration, but its impact is narrower (reasoning/model boosting) and more contingent on specific verifier-backed settings.

vs. UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

gpt-5.2 · 5/16/2026

Paper 2 likely has higher impact due to its broad, timely relevance to AI agent safety/security and its potential to become a standard evaluation infrastructure. DTap offers a controllable, reproducible platform across many real-world domains plus an autonomous red-teaming agent and benchmark dataset with verifiable judging—ingredients that can catalyze widespread adoption, comparisons, and follow-on work across security, agent design, and policy. Paper 1 is innovative for full-duplex speech front-ends, but its impact is more domain-specific and depends on deployment within speech assistant stacks.

vs. Human-Inspired Memory Architecture for LLM Agents

gpt-5.2 · 5/16/2026

Paper 1 is likely to have higher scientific impact due to strong timeliness and broad real-world applicability: it provides a controllable, reproducible red-teaming platform across many realistic domains (50+ environments) plus an autonomous red-teaming agent and benchmark with verifiable judging. This can become shared infrastructure for evaluating and improving agent security, influencing multiple fields (ML safety, security, HCI, software engineering). Paper 2 is methodologically thoughtful and useful for long-horizon memory, but its impact is narrower and more incremental relative to the immediate, high-stakes need for standardized agent security evaluation.

vs. Large Vision-Language Models Get Lost in Attention

gpt-5.2 · 5/16/2026

Paper 2 likely has higher impact: it delivers a substantial, reusable infrastructure (platform + autonomous red-teaming agent + benchmark) that directly targets an urgent, real-world problem in deployed AI agents. Its controllable, reproducible environments across many domains enable standardized evaluation and can become community tooling, influencing both research and industry practice. The work is timely given rising agent deployments and security incidents, and its breadth spans AI safety, security, HCI, and systems. Paper 1 is novel and provocative but may face adoption/validation hurdles and narrower immediate applicability.

vs. Resolving the bias-precision paradox with stochastic causal representation learning for personalized medicine

gpt-5.2 · 5/16/2026

Paper 1 likely has higher scientific impact: it introduces a broad, controllable red-teaming platform (14 domains, 50+ environments) plus an autonomous red-teaming agent and benchmark with verifiable judges—an infrastructure contribution that can become a standard for evaluating and improving agent security across models and applications. Its timeliness is high given rapid deployment of tool-using agents and rising real-world incidents. Paper 2 is methodologically interesting and clinically relevant, but its innovation is narrower (alignment strategy in causal rep learning) and impact may be more field-specific and dependent on clinical validation/deployment pathways.

vs. AgentSearchBench: A Benchmark for AI Agent Search in the Wild

gpt-5.2 · 5/16/2026

Paper 1 has higher potential impact due to greater novelty and broader downstream utility: it introduces a controllable, interactive red-teaming platform spanning many realistic domains plus an autonomous red-teaming agent and a verifiable benchmark dataset. This directly targets urgent, high-stakes security/safety risks for deployed agents, with clear real-world applicability and likely adoption by both academia and industry. Its methodological contribution (reproducible environments + automated judging) enables scalable, rigorous evaluation. Paper 2 is timely and useful for agent discovery, but is narrower in scope and less safety-critical.

vs. From Fuzzy to Formal: Scaling Hospital Quality Improvement with AI

gemini-3.1 · 5/16/2026

Paper 1 addresses a critical and rapidly growing bottleneck in AI development: the security and safety evaluation of autonomous AI agents. By providing a comprehensive, scalable platform, an autonomous red-teaming agent, and a benchmark dataset, it offers broad utility to the AI safety and agent research communities. Paper 2 presents a valuable but narrower application of AI to healthcare quality improvement, which, while important, has less potential for widespread methodological adoption across diverse domains compared to a foundational AI safety evaluation platform.

vs. The Scaling Properties of Implicit Deductive Reasoning in Transformers

gemini-3 · 5/7/2026

Paper 1 addresses a critical and highly timely challenge: the security and safety of autonomous AI agents. By providing a comprehensive, reproducible simulation platform across multiple real-world domains, an autonomous red-teaming agent, and a large-scale benchmark dataset, it offers highly practical tools that will be widely adopted by researchers and developers. While Paper 2 offers valuable theoretical insights into Transformer reasoning, Paper 1 has significantly broader real-world applications, immediate relevance to AI safety, and a higher potential for widespread adoption and citation across the AI community.

vs. Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents

claude-opus-4.6 · 5/7/2026

Paper 1 addresses a critical and timely problem—security evaluation of AI agents—with a comprehensive platform spanning 14 domains, 50+ environments, and an autonomous red-teaming agent. Its breadth of impact is significantly larger, touching AI safety, security, and policy. The platform infrastructure (DTap), methodology (DTap-Red), and benchmark (DTap-Bench) provide reusable community resources. Paper 2, while presenting a clever cognitive-science-inspired memory technique with solid results, addresses a narrower problem (cross-session recall) with more incremental contribution and limited real-world deployment implications.

vs. Curated AI beats frontier LLMs at pharma asset discovery

gpt-5.2 · 5/7/2026

Paper 2 is likely higher impact due to broader novelty and field-wide relevance: it introduces a controllable, reproducible red-teaming platform (14 domains, 50+ environments), an autonomous red-teaming agent, and a benchmark with verifiable judges—an enabling infrastructure for systematic agent security research. This has immediate real-world applicability across many deployed agent settings and can become a community standard. Paper 1 is valuable but more domain-specific (pharma asset discovery) and its core contribution (curated index outperforming web search) is less methodologically generalizable and more incremental as a product/curation advantage.

vs. Modeling Co-Pilots for Text-to-Model Translation

claude-opus-4.6 · 5/7/2026

Paper 1 addresses the critical and timely problem of AI agent security with a comprehensive platform spanning 14 domains, 50+ environments, and introduces novel autonomous red-teaming capabilities. Given the rapid deployment of AI agents in real-world applications, this work has broad impact across security, AI safety, and policy. Paper 2, while valuable, addresses a narrower problem (text-to-model translation for combinatorial optimization) with incremental contributions over existing work. Paper 1's scale, timeliness amid growing AI agent adoption, and practical security implications give it substantially higher potential impact.

vs. Modeling Co-Pilots for Text-to-Model Translation

claude-opus-4.6 · 5/7/2026

Paper 2 addresses the critical and timely problem of AI agent security, introducing a comprehensive red-teaming platform spanning 14 domains with 50+ simulation environments. Its breadth of impact is larger given the rapid deployment of AI agents across industries. The autonomous red-teaming agent (DTap-Red) and large-scale benchmark represent significant methodological contributions. While Paper 1 makes solid contributions to text-to-model translation with MiniZinc, it serves a more niche optimization/constraint programming community. Paper 2's focus on safety and security of widely-deployed AI systems gives it broader relevance and urgency.

vs. AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair

gemini-3 · 5/7/2026

Paper 2 addresses a critical and highly relevant problem: the security and safety of AI agents. By introducing a comprehensive red-teaming platform, an autonomous red-teaming agent, and a large-scale benchmark dataset across 14 real-world domains, it offers a highly practical and widely applicable resource. In contrast, Paper 1 focuses on a highly specific methodological issue (leaderboard ranking instability in agent repair), making Paper 2 significantly broader in scope, timeliness, and potential impact on both research and real-world AI deployment.