ADR: An Agentic Detection System for Enterprise Agentic AI Security

Chenning Li, Pan Hu, Justin Xu, Baris Ozbas, Olivia Liu, Caroline Van, Manxue Li, Wei Zhou

#441 of 2292 · Artificial Intelligence
Tournament Score: 1479±45 · Win Rate: 70% (16 wins, 7 losses, 23 matches)
Rating: 7/10

Abstract

We present the Agentic AI Detection and Response (ADR) system, the first large-scale, production-proven enterprise framework for securing AI agents operating through the Model Context Protocol (MCP). We identify three persistent challenges in this domain: (1) limited observability -- existing Endpoint Detection and Response (EDR) tools see file writes but not the agent reasoning, prompts, or causal chains linking intent to execution; (2) insufficient robustness -- static defenses constrained by pre-defined rules fail to generalize across diverse attack techniques and enterprise contexts; and (3) high detection costs -- LLM-based inference is prohibitively expensive at scale. ADR addresses these challenges via three components: the ADR Sensor for high-fidelity agentic telemetry, the ADR Explorer for systematic pre-deployment red teaming and hard-example generation, and the ADR Detector for scalable, two-tier online detection combining fast triage with context-aware reasoning. Deployed at Uber for over ten months, ADR has sustained reliable detection in production with growing adoption reaching over 7,200 unique hosts and processing over 10,000 agent sessions daily, uncovering hundreds of credential exposures across 26 categories and enabling a shift-left prevention layer (97.2% precision, 206 detected credentials). To validate the approach and enable community adoption, we introduce ADR-Bench (302 tasks, 17 techniques, 133 MCP servers), where ADR achieves zero false positives while detecting 67% of attacks -- outperforming three state-of-the-art baselines (ALRPHFS, GuardAgent, LlamaFirewall) by 2--4x in F1-score. On AgentDojo (public prompt injection benchmark), ADR detects all attacks with only three false alarms out of 93 tasks.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: ADR – An Agentic Detection System for Enterprise Agentic AI Security

1. Core Contribution

ADR presents an end-to-end enterprise security framework for AI agents operating through the Model Context Protocol (MCP). The system addresses a genuinely new attack surface: autonomous AI agents that can execute tools, access data, and make decisions via standardized interfaces. The three core components — the ADR Sensor (telemetry collection capturing the full causal chain from prompts to tool execution), the ADR Detector (two-tier online detection combining fast triage with deep contextual reasoning), and the ADR Explorer (offline evolutionary red-teaming) — together form a coherent security architecture.
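The two-tier detection idea described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `AgentEvent` schema, the `LOW_RISK_TOOLS` allowlist, and the keyword-based `keyword_deep_check` are all hypothetical stand-ins, and the review does not say how ADR's triage actually decides what to escalate.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AgentEvent:
    """One telemetry record: the prompt, the tool called, and its arguments (hypothetical schema)."""
    prompt: str
    tool: str
    arguments: str


# Hypothetical allowlist used by the cheap first tier.
LOW_RISK_TOOLS = {"read_docs", "search_code"}


def triage(event: AgentEvent) -> bool:
    """Tier 1: return True when the event needs deeper analysis."""
    return event.tool not in LOW_RISK_TOOLS


def keyword_deep_check(event: AgentEvent, history: List[AgentEvent]) -> bool:
    """Tier 2 stand-in: a real system would reason over `history` with an LLM."""
    return "credentials" in event.arguments


def detect(
    events: List[AgentEvent],
    deep_check: Callable[[AgentEvent, List[AgentEvent]], bool],
) -> List[AgentEvent]:
    """Run tier 1 on every event and tier 2 only on escalated ones."""
    alerts = []
    for index, event in enumerate(events):
        if not triage(event):
            continue  # resolved cheaply; the expensive tier never runs
        # The expensive tier sees the event plus everything that preceded it.
        if deep_check(event, events[:index]):
            alerts.append(event)
    return alerts


events = [
    AgentEvent("summarize the docs", "read_docs", "README.md"),
    AgentEvent("debug the deploy", "run_shell", "cat ~/.aws/credentials"),
]
print([e.tool for e in detect(events, keyword_deep_check)])
```

The point of the structure is cost: only events that survive triage incur the expensive context-aware check, which is how a design of this shape can keep LLM inference off the hot path for routine tool calls.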

The paper also introduces ADR-Bench (302 tasks, 17 techniques, 133 MCP servers), which is meaningfully broader than existing benchmarks in both threat coverage and MCP integration. The accompanying threat taxonomy (5 tactics, 17 techniques) synthesized from public incidents, frameworks, and operational telemetry is a useful contribution for the community.

The most compelling novelty is the *observability argument*: traditional EDR tools capture system-level events but miss the semantic layer (prompts, reasoning chains, tool invocations) that gives meaning to agent actions. This is a real gap, and the sensor architecture that reconstructs causal chains from local caches of agentic tools is a practical and elegant solution.

2. Methodological Rigor

The evaluation has notable strengths and weaknesses:

Strengths: The paper evaluates on both a public benchmark (AgentDojo, 93 tasks) and the authors' own benchmark (ADR-Bench, 302 tasks), comparing against three relevant baselines (LlamaFirewall, GuardAgent, ALRPHFS). The ablation studies are informative, demonstrating the value of individual MCP context providers and the triage layer. The production deployment at Uber (10 months, 7,200+ hosts, 10,000+ daily sessions) provides strong evidence of practical viability.

Weaknesses: The benchmark design raises concerns. ADR-Bench was constructed by the same team that built ADR, creating a potential overfitting risk: the system may be inadvertently optimized for the types of attacks and benign patterns included. The set of 42 malicious tasks (out of 302) is relatively small for drawing robust conclusions. The "zero false positives" claim, while impressive, should be interpreted cautiously given that the benign task distribution was curated.
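To make the sample-size concern concrete, the headline figures can be turned into counts. The sketch below assumes that "67% of attacks" corresponds to 28 of the 42 malicious tasks; that count is inferred from the percentages quoted here, not reported in this assessment.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard detection metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Assumed counts: 28 of 42 attacks detected (about 67%), zero false positives.
precision, recall, f1 = precision_recall_f1(tp=28, fp=0, fn=14)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# With only 42 malicious tasks, a single extra detection moves recall by 1/42.
print(f"recall step per task = {1 / 42:.3f}")
```

Under these assumed counts, one additional hit or miss shifts recall by roughly 2.4 percentage points, which is why conclusions drawn from a set of this size deserve wide error bars.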

The production deployment metrics are somewhat vague. The 49% false positive rate in deployment (versus zero on the benchmark) reveals a significant gap between benchmark and real-world performance, though the paper acknowledges this honestly. The credential detection results (97.2% precision, 206 TPs) are solid but represent a relatively narrow use case (regex-based pattern matching for secrets) rather than the full ADR detection pipeline.
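For readers unfamiliar with what regex-based secret detection looks like, here is a minimal sketch. The three patterns are common, publicly documented credential formats chosen for illustration; they are not the 26 categories the paper reports, and `scan` and `redact` are hypothetical helper names.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; a production rule set would be far larger.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def redact(secret: str) -> str:
    """Keep a short prefix so an alert is actionable without leaking the value."""
    return secret[:4] + "..."


def scan(text: str) -> List[Tuple[str, str]]:
    """Return (category, redacted match) pairs for every pattern hit in `text`."""
    findings = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((category, redact(match.group(0))))
    return findings


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
print(scan("export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE"))
```

Pattern matching of this kind is cheap and precise for well-structured secrets, which is consistent with the high precision reported, but it says little about how the LLM-based pipeline performs on the rest of the threat taxonomy.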

The comparison with baselines may not be entirely fair: ADR uses GPT-4o and Claude Sonnet 4 (state-of-the-art models) while LlamaFirewall uses Llama Guard 3-8B (a much smaller model). Cost comparisons partially address this, but the capability gap is substantial.

3. Potential Impact

The practical impact potential is high. MCP adoption is accelerating rapidly (16,800+ public servers by 2025), and enterprise security teams urgently need tools designed for this paradigm. ADR's architecture — sensor + hierarchical detector + offline red-teaming — provides a reasonable blueprint that others can adapt.

The benchmark contribution (ADR-Bench) fills a real gap. Prior benchmarks cover only 3-6 of the 17 techniques identified, and most lack native MCP support. The comprehensive threat taxonomy grounded in real incidents is independently valuable for the security community.

The open-source release of the sensor, detection framework, and benchmark significantly amplifies potential impact by enabling reproducibility and community extension.

However, the impact is somewhat bounded by the rapidly evolving nature of both MCP and the attack landscape. The sensor's approach of parsing local caches of specific tools (Cursor, Cline, Claude Code) requires continuous maintenance as these tools change. The threat taxonomy, while comprehensive today, will need frequent updates.

4. Timeliness & Relevance

This paper is extremely well-timed. MCP has become the de facto standard for AI agent-tool integration in under a year, and the security implications are just beginning to be understood. The paper arrives at a moment when enterprises are rapidly deploying MCP-based agents but lack mature security tooling.

The shift from "securing LLMs" to "securing AI agents" represents a genuine paradigm shift in AI security, and ADR is among the first systems to address this comprehensively at enterprise scale. The 10-month production deployment at Uber provides credibility that few academic or industry papers in this space can match.

5. Strengths & Limitations

Key Strengths:

  • *Production validation:* 10-month deployment at Uber with concrete operational metrics is the paper's strongest differentiator
  • *Comprehensive threat model:* The 5-tactic, 17-technique taxonomy is the most complete for MCP security to date
  • *Practical architecture:* The two-tier detection design elegantly balances cost and accuracy (40.7% of tasks handled cheaply by triage)
  • *Open-source commitment:* Releasing sensor, detector, and benchmark enables community adoption
  • *Honest reporting:* The paper acknowledges the 49% FP rate in deployment and the gap between benchmark and production performance

Notable Limitations:

  • *Recall ceiling:* 67% attack detection on ADR-Bench means one-third of attacks are missed — this is acknowledged but concerning for a security system
  • *Self-constructed benchmark:* ADR-Bench was built by the same team, risking circular validation
  • *Model dependency:* Heavy reliance on frontier LLMs (GPT-4o, Claude Sonnet 4) creates cost, latency, and vendor lock-in concerns that are not fully addressed
  • *Limited prevention:* The system is primarily detective; the prevention layer (Hooks) only handles credential patterns via regex
  • *Narrow production evidence:* Most concrete deployment results center on credential exposure detection rather than the full attack taxonomy
  • *Weak Permission Abuse coverage:* A detection rate of only 20% reveals significant gaps in certain threat categories

Additional Observations

    The evolutionary red-teaming approach (Explorer) is theoretically interesting but under-evaluated — the paper provides no standalone assessment of how many novel attack variants it discovers or how much it improves detection over time. The fitness function and convergence guarantees are stated but not empirically validated.

    The paper's framing as the "first" enterprise-proven framework is reasonable given the evidence, though the rapidly growing MCP security space means this distinction may be short-lived.

    Rating: 7/10
    Significance: 7.5 · Rigor: 6 · Novelty: 7 · Clarity: 7.5

    Generated May 19, 2026

    Comparison History (23)

    vs. Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains
    claude-opus-4.6 · 5/20/2026

    Paper 1 introduces a novel conceptual framework that bridges robotics control theory with foundation model safety, offering formally grounded behavioral guarantees for socially sensitive domains. This cross-disciplinary reframing—treating guardrails as runtime trajectory control rather than per-output filtering—is a genuinely new paradigm with broad applicability across AI safety research. Paper 2, while impressive as an engineering contribution with real-world deployment at Uber, is more of a systems/security paper solving a narrower enterprise problem. Paper 1's theoretical contribution has greater potential to influence multiple research communities and spawn new research directions.

    vs. Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
    gpt-5.2 · 5/20/2026

    Paper 1 is likely to have higher scientific impact: it introduces a novel, general training objective refinement (policy-aware reweighting for rubric-based RLVR) that can transfer across models, datasets, and future RLHF/RLVR settings, with clear methodological framing and controlled comparisons showing efficiency gains. Its contribution targets a core bottleneck in post-training—multi-criterion optimization with informative reward signals—relevant to many research groups. Paper 2 is highly impactful operationally, but is more systems/engineering- and deployment-specific (MCP/enterprise security), potentially narrowing breadth and long-term generalizability.

    vs. Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints
    claude-opus-4.6 · 5/20/2026

    Paper 1 makes fundamental theoretical contributions—formalizing interface-constrained SMDPs, providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability, and lifting the AIS framework to multi-agent SMDPs. These are novel, rigorous results with broad applicability beyond LLM pipelines to any multi-agent sequential decision-making setting. Paper 2 is a strong engineering/systems contribution with real-world deployment at Uber, but it is more narrowly focused on enterprise AI security and introduces less foundational methodology. Paper 1's theoretical novelty and breadth of impact across reinforcement learning, multi-agent systems, and workflow optimization give it higher long-term scientific impact.

    vs. CAREBench: Evaluating LLMs' Emotion Understanding by Assessing Cognitive Appraisal Reasoning
    gemini-3.1 · 5/19/2026

    Paper 1 demonstrates significantly higher impact due to its immediate relevance to the critical, rapidly growing field of AI agent security. Unlike Paper 2, which focuses on the theoretical aspects of LLM emotion understanding, Paper 1 presents a battle-tested, production-scale system deployed at a major enterprise (Uber). It addresses urgent real-world vulnerabilities, introduces a comprehensive framework (sensor, explorer, detector), and provides a valuable benchmark (ADR-Bench). Its combination of methodological rigor, proven real-world application, and timeliness in securing LLM agents guarantees broad utility and high impact in both industry and academia.

    vs. TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning
    gpt-5.2 · 5/19/2026

    Paper 1 likely has higher scientific impact due to its timeliness (enterprise agentic AI security), demonstrated large-scale real-world deployment, and rigorous empirical validation (production metrics plus two benchmarks). Its contributions (telemetry/sensing, red-teaming workflow, scalable two-tier detection) are broadly applicable across enterprises adopting agent protocols, potentially influencing security tooling and standards. Paper 2 is novel and valuable for embodied/household agents, but its impact may be narrower (domestic robotics/assistants) and currently more evaluation-driven than deployment-proven.

    vs. Shared Backbone PPO for Multi-UAV Communication Coverage with Connection Preservation
    claude-opus-4.6 · 5/19/2026

    Paper 2 presents a novel, production-proven system addressing the critical and timely problem of securing AI agents in enterprise environments. It introduces a comprehensive framework (ADR) deployed at scale at Uber for 10+ months, demonstrates strong real-world results, and provides a new benchmark (ADR-Bench). The problem of agentic AI security is highly relevant given rapid AI agent adoption. Paper 1 is an incremental improvement to PPO for multi-UAV coverage—a well-studied area with limited novelty (shared backbone is a known technique). Paper 2's breadth of impact, timeliness, and practical validation far exceed Paper 1's contributions.

    vs. Learning to Solve Compositional Geometry Routing Problems
    gpt-5.2 · 5/19/2026

    Paper 2 likely has higher scientific impact due to strong timeliness and broad real-world relevance (enterprise AI-agent security), plus unusually convincing empirical validation: long-running production deployment at scale with clear metrics, and a new benchmark (ADR-Bench) enabling reproducibility and follow-on work. Its contributions span systems, security, and applied ML, potentially influencing tooling and standards around agent observability and defense. Paper 1 is novel and methodologically solid in combinatorial optimization/representation learning, but its immediate cross-field and real-world impact is narrower than production-proven security infrastructure.

    vs. Understanding Annotator Safety Policy with Interpretability
    claude-opus-4.6 · 5/19/2026

    Paper 1 presents a novel, production-proven enterprise security framework (ADR) for AI agents, addressing a critical and timely gap in securing agentic AI systems. Its deployment at Uber for 10 months with concrete metrics demonstrates real-world impact at scale. It introduces a new benchmark (ADR-Bench) and significantly outperforms baselines. The breadth of impact spans AI security, enterprise systems, and the rapidly growing agentic AI ecosystem. While Paper 2 makes a solid methodological contribution to AI safety annotation, its scope and immediate practical impact are narrower compared to the pressing need for agentic AI security infrastructure.

    vs. Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction
    claude-opus-4.6 · 5/19/2026

    Paper 1 offers a deeper scientific contribution by identifying and formalizing 'Safety Geometry Collapse' as a fundamental representation-geometric phenomenon explaining why multimodal LLMs fail to transfer safety capabilities. It provides novel theoretical insights (refusal direction, modality drift, conditional refusal separability), causal validation through interventions, and a principled training-free method (ReGap). This mechanistic understanding has broad implications for the alignment and safety research community. Paper 2, while practically impactful as an engineering system deployed at scale, is more of a systems/engineering contribution with narrower scientific novelty, focused on a specific enterprise security application.

    vs. STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery
    gpt-5.2 · 5/19/2026

    Paper 2 likely has higher scientific impact due to its strong real-world validation (10+ months production deployment at scale), immediate applicability to a timely, high-stakes problem (enterprise agent security via MCP), and creation of a benchmark (ADR-Bench) that can shape follow-on research. Its methodological contribution spans telemetry, red-teaming, and scalable two-tier detection, with quantitative results against baselines and multiple benchmarks. Paper 1 is novel for LLM-based equation discovery and may impact scientific modeling, but its evidence is primarily benchmark-based and narrower in near-term deployment impact.

    vs. OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models
    gemini-3.1 · 5/19/2026

    Paper 2 addresses the highly timely and critical challenge of Agentic AI security. Its scientific impact is amplified by massive, production-proven real-world validation at Uber, the introduction of a novel benchmark (ADR-Bench), and superior performance over state-of-the-art baselines. While Paper 1 offers valuable advances in vision model interpretability, Paper 2's intersection of cybersecurity, LLM agents, and systems engineering combined with its unprecedented empirical scale promises broader, more immediate adoption and influence across both academia and industry.

    vs. Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models
    gemini-3.1 · 5/19/2026

    Paper 1 addresses the critical, emerging field of Agentic AI security with a massive, real-world deployment at Uber. Its introduction of a novel framework, combined with extensive production validation and a new open benchmark (ADR-Bench), offers immense practical utility and sets a foundational standard for enterprise AI security. While Paper 2 provides a valuable methodological improvement for diffusion models, Paper 1's scale, real-world application, and timeliness in addressing AI agent vulnerabilities give it a broader and more profound potential impact.

    vs. Rethinking Adapter Placement: A Dominant Adaptation Module Perspective
    gemini-3.1 · 5/19/2026

    Paper 1 fundamentally challenges and optimizes a ubiquitous technique (LoRA) used across the entire deep learning community. By demonstrating that a single adapter can match or exceed performance with a 99% reduction in parameters, it offers a profound methodological shift that will heavily influence both theoretical research in model adaptation and practical fine-tuning pipelines across multiple AI disciplines.

    vs. Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks
    claude-opus-4.6 · 5/19/2026

    Paper 2 (ADR) demonstrates higher scientific impact due to its immediate real-world deployment at scale (Uber, 10 months, 7,200+ hosts), addressing a timely and critical problem of AI agent security. It introduces a practical, production-proven framework with a new benchmark (ADR-Bench), strong empirical results outperforming baselines by 2-4x in F1, and addresses the rapidly growing enterprise AI agent ecosystem. Paper 1 explores an interesting ToM problem for MLLMs but is more niche, with modest baselines (42% accuracy) and less clear practical applicability. ADR's breadth of impact across AI security, enterprise systems, and the emerging MCP ecosystem gives it broader relevance.

    vs. Reinforcing VLAs in Task-Agnostic World Models
    claude-opus-4.6 · 5/19/2026

    Paper 2 (RAW-Dream) introduces a fundamentally novel paradigm for adapting Vision-Language-Action models using task-agnostic world models and zero-shot reinforcement learning, addressing a core scalability bottleneck in robot learning. Its contribution—decoupling world model training from task-specific data—has broad implications across robotics, embodied AI, and foundation model research. Paper 1 (ADR), while practically valuable and production-proven at Uber, is more narrowly focused on enterprise AI security for MCP-based agents, representing strong engineering but comparatively incremental scientific novelty. Paper 2's methodological innovation and cross-domain applicability suggest higher long-term scientific impact.

    vs. Generative AI and the Productivity Divide: Human-AI Complementarities in Education
    claude-opus-4.6 · 5/19/2026

    Paper 2 presents a novel, production-deployed security framework (ADR) addressing a critical emerging problem—securing enterprise AI agents—with strong empirical validation at scale (Uber, 10+ months, 7,200 hosts). It introduces a new benchmark (ADR-Bench), demonstrates significant performance improvements over baselines, and addresses a timely, high-stakes problem with broad cross-field implications (cybersecurity, AI safety, enterprise systems). Paper 1, while methodologically sound, addresses a narrower question about AI-mediated learning with more incremental findings about user heterogeneity and scaffolding interventions.

    vs. Voices in the Loop: Mapping Participatory AI
    gemini-3.1 · 5/19/2026

    Paper 1 addresses an urgent technical challenge: securing autonomous AI agents. Its introduction of the first large-scale, production-proven framework (deployed at Uber), combined with the release of a novel benchmark (ADR-Bench) and rigorous evaluation showing significant improvements over baselines, guarantees high technological impact. While Paper 2 provides a valuable repository for AI governance, Paper 1 offers a highly innovative, methodologically rigorous, and immediately actionable solution to a critical bottleneck in enterprise AI adoption, ensuring broader real-world application.

    vs. CORTEG: Foundation Models Enable Cross-Modality Representation Transfer from Scalp to Intracranial Brain Recordings
    claude-opus-4.6 · 5/19/2026

    CORTEG introduces a genuinely novel cross-modality transfer framework (scalp EEG to intracranial ECoG) that addresses a fundamental data limitation in brain-computer interfaces. It demonstrates that foundation models pretrained on abundant scalp EEG data can transfer to scarce intracranial recordings, enabling rapid patient calibration. This has broad implications for neuroscience, clinical BCI deployment, and foundation model transfer across modalities. Paper 1, while practically valuable as a deployed security system at Uber, is more of an engineering contribution to a narrower domain (agentic AI security) with less fundamental scientific novelty.

    vs. Responsible Agentic AI Requires Explicit Provenance
    gemini-3.1 · 5/19/2026

    Paper 1 demonstrates significantly higher scientific impact due to its rigorous, large-scale real-world deployment and empirical validation. While Paper 2 offers a valuable theoretical framework for AI provenance, Paper 1 presents a production-proven system deployed across thousands of enterprise hosts, addressing immediate, critical security vulnerabilities in Agentic AI. Furthermore, Paper 1 introduces a scalable architecture and a new benchmark (ADR-Bench) that directly enables future empirical research. Its combination of methodological rigor, demonstrated utility at enterprise scale, and tangible open-source contributions gives it a clear edge in actionable, near-term impact.

    vs. Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment
    gpt-5.2 · 5/19/2026

    Paper 1 has higher likely scientific impact due to a novel, end-to-end security system validated at large production scale, with concrete telemetry, red-teaming, and scalable detection design plus new benchmarks (ADR-Bench) enabling reproducibility and follow-on research. It demonstrates strong real-world applicability and methodological rigor (deployment metrics, comparative baselines, precision/FP rates) and is timely for enterprise agent security. Paper 2 offers a valuable conceptual framework, but as a position paper without empirical validation or implementation, its near-term impact and evidence base are less certain.