How Adversarial Environments Mislead Agentic AI?
Zhonghao Zhan, Huichi Zhou, Zhenhao Li, Peiyuan Jing, Krinos Li, Hamed Haddadi
Abstract
Tool-integrated agents are deployed on the premise that external tools ground their outputs in reality. Yet this very reliance creates a critical attack surface. Current evaluations benchmark capability in benign settings, asking "can the agent use tools correctly" but never "what if the tools lie". We identify this Trust Gap: agents are evaluated for performance, not for skepticism. We formalize this vulnerability as Adversarial Environmental Injection (AEI), a threat model where adversaries compromise tool outputs to deceive agents. AEI constitutes environmental deception: constructing a "fake world" of poisoned search results and fabricated reference networks around unsuspecting agents. We operationalize this via POTEMKIN, a Model Context Protocol (MCP)-compatible harness for plug-and-play robustness testing. We identify two orthogonal attack surfaces: The Illusion (breadth attacks) poison retrieval to induce epistemic drift toward false beliefs, while The Maze (depth attacks) exploit structural traps to cause policy collapse into infinite loops. Across 11,000+ runs on five frontier agents, we find a stark robustness gap: resistance to one attack often increases vulnerability to the other, demonstrating that epistemic and navigational robustness are distinct capabilities.
AI Impact Assessments
(3 models)Scientific Impact Assessment
Core Contribution
This paper formalizes a threat model called Adversarial Environmental Injection (AEI), distinguishing it from prompt injection by targeting the environmental feedback loop—the tool outputs agents rely on for grounding. The key conceptual contribution is the decomposition of this attack surface into two orthogonal dimensions: breadth attacks (epistemic drift via poisoned retrieval content) and depth attacks (policy collapse via structural navigational traps in citation graphs). The depth attack—"The Maze"—is genuinely novel: rather than corrupting what an agent *believes*, it traps what an agent *does* by injecting phantom nodes into information graphs that create cycles or dead-ends. This structural attack class has no close precedent in RAG security literature, which has focused almost exclusively on content poisoning.
The paper operationalizes this via POTEMKIN, an MCP-compatible evaluation harness acting as a Man-in-the-Tool proxy, enabling reproducible adversarial testing of tool-using agents. The framework is released as open-source with frozen datasets (9,878 real papers, 4,281 phantom papers, 450 adversarial claim variants).
Methodological Rigor
The experimental design is thorough: 11,000+ runs across five agents, with systematic variation of contamination rate, linguistic style, cycle length, and plausibility gradients. Several methodological choices strengthen the work:
1. Engagement-conditional reporting addresses a subtle but important confound—distinguishing genuine robustness from tool-engagement failure. The Llama-3 case (5.6% unconditional entry but 87.5% conditional vulnerability) is a compelling illustration that naïve metrics are misleading.
2. Minimal-pair causal design (Exp 1d) with McNemar's test isolates the effect of epistemic markers (hedging vs. boosting) from confounds, lending credibility to the "Punishment of Honesty" finding.
3. Cross-dimension transfer analysis using logistic regression and SHAP provides a principled test of the independence hypothesis. AUCs of 0.55–0.58 with <5% shared predictive variance convincingly demonstrate that breadth and depth attacks exploit distinct cognitive mechanisms.
4. Frontier-model validation (§4.4) on five post-experiment models (GPT-5.2, Claude Sonnet 4.6, etc.) strengthens generalizability claims considerably.
However, some limitations reduce confidence: the evaluation domain is restricted to academic citation graphs, which, while well-motivated for reproducibility, constrains claims about generalizability to web search, code tools, or database-backed agents. The defense analysis (Appendix A) is explicitly preliminary—only two lightweight filters are tested, and no training-time or architectural defenses are explored. The sample sizes for Exp 1d (minimal pairs) appear modest, and confidence intervals for some per-agent effects would be informative.
Potential Impact
The paper's impact operates at multiple levels:
Immediate practical value: POTEMKIN fills a real gap—no prior framework enables standardized adversarial robustness testing for tool-using agents. Its MCP compatibility and plug-and-play design lower adoption barriers. Organizations deploying agentic systems could use it for pre-deployment robustness checks.
Conceptual reframing: The "Robustness Schism"—that epistemic and navigational robustness are independent—has significant implications for defense research. It implies that hardening against RAG poisoning (the current focus) is insufficient; layered defenses addressing both belief formation and action selection are needed. This is a non-obvious finding that should redirect defensive efforts.
Broader security implications: The "Grounding Paradox" (tools that reduce hallucination increase adversarial vulnerability) identifies a fundamental tension in agent design. The "Punishment of Honesty" finding—agents penalize scientific hedging on true claims while gaining no advantage from confident language in detecting falsehoods—has troubling implications for scientific and medical AI deployment.
Adjacent field influence: The depth attack concept generalizes beyond citation graphs to any graph-structured navigation (knowledge graphs, web crawling, code dependency resolution). The Man-in-the-Tool adversary model connects to supply-chain security in software engineering.
Timeliness & Relevance
This work is exceptionally well-timed. The rapid deployment of agentic AI systems (with MCP adoption accelerating tool integration) creates an urgent need for adversarial evaluation that current benchmarks (AgentBench, GAIA, ToolBench) do not address. The paper correctly identifies that existing benchmarks ask "can the agent use tools correctly?" but never "what if the tools lie?" This gap is increasingly consequential as agents are deployed in high-stakes domains. The documented real-world harms from fabricated scholarly sources (cited via Dahl et al., 2024) ground the threat model in concrete harm.
Strengths
1. Novel attack class with strong empirical validation: Depth attacks are genuinely new and the 11,000+ run evaluation is comprehensive.
2. The Robustness Schism is the paper's strongest finding—demonstrating independence of two failure modes with statistical rigor has clear defensive implications.
3. Methodological sophistication: Engagement-conditional reporting, minimal-pair designs, and SHAP-based mechanism analysis elevate this above typical attack papers.
4. Practical artifact: POTEMKIN as an open-source, MCP-compatible framework has standalone value.
5. Frontier validation: Testing on 2026 models prevents obsolescence concerns.
Limitations
1. Domain specificity: Citation graphs, while well-justified, are a narrow testbed. Claims about generalization to web search, code, or medical tools remain untested.
2. Defense analysis is shallow: Only lightweight runtime filters are evaluated; the paper identifies the problem more thoroughly than it addresses solutions.
3. Adversary model assumptions: MitT feasibility varies significantly by deployment architecture; the paper acknowledges this but doesn't quantify realistic attack cost.
4. Scale of minimal-pair experiments: The Punishment of Honesty finding, while compelling, rests on relatively small per-agent samples (visible in Table 4 where some cells show 0.0% differences).
5. Missing agent diversity: All agents are general-purpose LLMs; specialized agents (e.g., domain-fine-tuned or tool-trained) may behave differently.
Overall Assessment
This is a strong, timely contribution that identifies a genuine gap in agentic AI security evaluation, introduces a novel attack class, and provides rigorous empirical evidence for a non-obvious finding (the Robustness Schism). The practical framework release amplifies impact. The main limitations—domain specificity and shallow defense analysis—are acknowledged and do not undermine the core contribution. The paper should influence both how the community evaluates agentic systems and how defenses are designed (layered rather than single-point).
Generated Apr 22, 2026
Comparison History (78)
Paper 2 is more novel and broadly impactful: it formalizes a general security vulnerability (Adversarial Environmental Injection) affecting tool-using agents across domains, and provides an MCP-compatible evaluation harness (POTEMKIN) with clear, reproducible attack taxonomies (Illusion/Maze) validated at scale (11k+ runs, multiple frontier agents). Its applications span AI safety, cybersecurity, agent evaluation, and deployment best practices, making it timely and widely relevant. Paper 1 is valuable for urban traffic control, but its impact is narrower and depends on realism/transferability of the unified simulator and LLM-agent control pipeline.
Paper 2 has higher potential scientific impact due to broader cross-domain relevance (any tool-using/agentic AI system), strong timeliness in AI safety/security, and a clear new threat model (AEI) with an operational, reusable evaluation harness (POTEMKIN) plus large-scale empirical evidence across multiple frontier agents. Its contributions can influence benchmarking standards and deployment practices beyond a single application domain. Paper 1 is innovative and practically important for urban mobility, but its impact is more domain-specific and depends heavily on simulation fidelity and transfer to real-world infrastructure.
Paper 2 likely has higher impact due to timeliness and breadth: it targets safety/robustness of tool-using agents, a rapidly expanding deployment setting, and introduces a concrete threat model (AEI) plus an evaluation harness (POTEMKIN) that others can adopt, reproduce, and extend. Its findings (orthogonal failure modes: epistemic drift vs navigational collapse) generalize across agents and inform both benchmarking and defenses. Paper 1 is technically solid and useful for efficient adaptation/compression, but the space is crowded and its impact is more incremental within model efficiency rather than cross-cutting AI reliability.
Paper 2 has higher potential impact due to timeliness and broad relevance: it introduces a concrete threat model (AEI) for tool-using agents, a domain of urgent real-world importance, and provides an evaluation harness (POTEMKIN) likely to be adopted across labs and products. Its framing (the “Trust Gap”) generalizes across agent architectures and tool ecosystems, with applications in security, alignment, and evaluation methodology. Paper 1 is technically strong but is a more incremental advance within established PEFT/compression research, with narrower cross-field implications.
Paper 1 introduces a timely and under-evaluated threat model (Adversarial Environmental Injection) for tool-using agents, plus a practical, MCP-compatible testing harness (POTEMKIN) and large-scale empirical evidence of a robustness tradeoff across attack classes. This directly targets real-world deployment risks (compromised search/tools, fake reference networks) and could influence evaluation standards, security practices, and agent design across many domains. Paper 2 is methodologically solid and useful for reliability, but its scope is narrower (reasoning intervention) and builds more incrementally on existing uncertainty/trajectory-editing ideas.
Paper 2 addresses a critical and highly timely vulnerability in agentic AI (tool manipulation), affecting the broader deployment of LLMs. Its introduction of a novel threat model and extensive benchmarking across frontier models will likely drive widespread research in AI safety and robustness. While Paper 1 offers a strong methodological improvement for knowledge graphs and hypothesis discovery, Paper 2's focus on foundational AI security grants it wider, cross-disciplinary potential impact.
Paper 2 identifies a novel and critical vulnerability ('Trust Gap') in agentic AI systems that has immediate security implications as AI agents are increasingly deployed in real-world settings. The formalization of Adversarial Environmental Injection, the POTEMKIN testing framework, and the finding that epistemic and navigational robustness are distinct capabilities represent highly timely contributions with broad impact across AI safety, security, and agent design. Paper 1, while solid, addresses a more incremental advance in time series reasoning with LLMs. Paper 2's timeliness given rapid agent deployment gives it higher impact potential.
Paper 1 likely has higher impact: it introduces a timely threat model (AEI) for tool-using agents, formalizes distinct attack classes (breadth vs depth), and provides an MCP-compatible evaluation harness (POTEMKIN) enabling reproducible robustness testing across many agents—broadly relevant to agent safety, security, HCI, and benchmarking. Its real-world applicability is immediate as tool-integrated agents proliferate. Paper 2 is a solid, incremental RLVR optimization improvement with narrower scope (math benchmarks) and more limited cross-field reach.
Paper 2 addresses a critical vulnerability in agentic AI—trusting compromised external tools. By formalizing Adversarial Environmental Injection and demonstrating severe robustness gaps in frontier models, it has broad implications for AI safety across all agent deployments. While Paper 1 offers a strong improvement in training data generation for web agents, Paper 2's focus on foundational security and the 'Trust Gap' gives it a higher potential for widespread scientific and real-world impact.
Paper 1 likely has higher scientific impact due to its novel threat model (AEI) targeting tool-grounded agent reliability, a timely and high-stakes issue as agentic systems deploy broadly. It introduces a reusable evaluation harness (POTEMKIN, MCP-compatible) and empirically demonstrates distinct robustness dimensions (epistemic vs navigational) with trade-offs across many runs and multiple frontier agents, enabling a new evaluation standard. Its implications span AI safety, security, HCI, and agent benchmarking. Paper 2 is useful and practical for efficiency, but the LLM/SLM distillation-and-guidance idea is more incremental and narrower in cross-field impact.
Paper 1 addresses a critical and timely security vulnerability in agentic AI systems—adversarial tool manipulation—which is highly relevant as AI agents are rapidly deployed in real-world settings. It introduces a novel threat model (AEI), a practical testing framework (POTEMKIN), and reveals a fundamental robustness tradeoff across 11,000+ experiments on frontier models. The breadth of impact spans AI safety, security, and policy. Paper 2, while innovative in connecting SAE features to knowledge graphs, addresses a more niche interpretability problem with narrower immediate applications and a single case study.
Paper 2 establishes a foundational theoretical framework for AI governance backed by mechanized proofs in Coq, offering a mathematically rigorous solution to structural failures. While Paper 1 provides valuable empirical insights into immediate vulnerabilities of agentic AI, Paper 2's application of fundamental computer science theory (Rice's theorem) to AI architecture promises a more profound, long-lasting impact on how secure AI systems are fundamentally designed and governed.
Paper 1 likely has higher impact due to broader, timely relevance: it targets tool-integrated agentic AI (a rapidly expanding deployment paradigm) and introduces a clear threat model (AEI) plus an MCP-compatible, plug-and-play evaluation harness (POTEMKIN) that can become a standard for robustness testing. Its identification of two orthogonal robustness dimensions (epistemic vs navigational) suggests a general conceptual advance applicable across many agent architectures and tool ecosystems. Paper 2 is rigorous and practical for NLP pipelines, but is narrower in scope and domain specificity.
Paper 1 addresses a foundational issue regarding the validity of AI-driven scientific discovery. Its large-scale evaluation and profound conclusion—that current AI agents fail at genuine scientific reasoning and ignore evidence—have massive implications across all scientific disciplines looking to leverage AI. While Paper 2 presents a strong contribution to AI security and robustness, Paper 1's epistemological critique challenges the core premise of 'AI scientists', granting it broader cross-field impact, timeliness, and philosophical significance for the future of research.
Paper 1 introduces a novel paradigm of unsupervised monitoring for AI agents, shifting the focus from predefined rules to behavioral anomaly detection. Its high impact is proven by the discovery of zero-day vulnerabilities in existing benchmarks, actively correcting the field's current evaluation metrics while significantly reducing human review effort.
Paper 2 makes a fundamental discovery about internal emotion representations in LLMs that causally influence outputs including misaligned behaviors. This has profound implications for AI alignment/safety, interpretability, and cognitive science. The finding that emotion-like representations mediate reward hacking, blackmail, and sycophancy opens new research directions for understanding and controlling AI behavior. While Paper 1 identifies an important security vulnerability in tool-using agents, Paper 2's contributions are more foundational, bridging mechanistic interpretability with alignment research and offering broader interdisciplinary impact across AI safety, cognitive science, and philosophy of mind.
Paper 1 exposes a critical security vulnerability in agentic AI by formalizing Adversarial Environmental Injection. Its discovery of the tradeoff between epistemic and navigational robustness offers profound theoretical insights. While Paper 2 presents a valuable benchmark for human-in-the-loop interaction, Paper 1's focus on exploitable security flaws will likely drive more urgent and widespread research in AI safety and robustness across multiple domains.
Paper 1 addresses a critical, emerging security vulnerability in tool-integrated AI agents. By formalizing Adversarial Environmental Injection (AEI) and demonstrating a stark robustness gap across 11,000+ runs, it establishes a foundational threat model for the rapidly growing agentic AI field. While Paper 2 offers a valuable methodological tool for mitigating prior bias, Paper 1's focus on systemic security vulnerabilities and its introduction of a scalable testing harness (POTEMKIN) position it to significantly influence both AI safety research and the practical deployment architectures of frontier agents.
Paper 2 has higher impact potential because it introduces a timely, security-critical evaluation paradigm for tool-using agents under adversarial conditions (AEI), plus an operational harness (POTEMKIN) that can be widely adopted across agents and tool stacks. Its breadth spans AI safety, cybersecurity, HCI, and evaluation methodology, with clear real-world relevance as agents are deployed in open, compromise-prone environments. The large-scale empirical study and disentangling of epistemic vs navigational robustness suggest durable conceptual contributions beyond a single benchmark.
The AAAI-26 AI Review Pilot represents a landmark large-scale field deployment (22,977 papers) of AI-assisted peer review at a major conference, addressing a critical infrastructure problem in science itself. Its findings that AI reviews were preferred over human reviews on key dimensions could reshape how scientific evaluation is conducted across all fields. Paper 1, while valuable in identifying adversarial vulnerabilities in tool-using agents, addresses a more specialized security concern. Paper 2's breadth of impact across all scientific disciplines, immediate real-world deployment, and timeliness give it substantially higher potential impact.