Policy-Invisible Violations in LLM-Based Agents

Jie Wu, Ming Gong

Apr 14, 2026

arXiv:2604.12177v1

cs.AI (primary) · cs.CL · cs.CR · cs.LG

#128 of 2292 · Artificial Intelligence

Tournament Score: 1535±26 (axis range 1050 to 1800)

Win Rate: 68%

Rating: 7.2/10 (Significance 7.5, Rigor 7, Novelty 8, Clarity 8)

Abstract

LLM-based agents can execute actions that are syntactically valid, user-sanctioned, and semantically appropriate, yet still violate organizational policy because the facts needed for correct policy judgment are hidden at decision time. We call this failure mode policy-invisible violations: cases in which compliance depends on entity attributes, contextual state, or session history absent from the agent's visible context. We present PhantomPolicy, a benchmark spanning eight violation categories with balanced violation and safe-control cases, in which all tool responses contain clean business data without policy metadata. We manually review all 600 model traces produced by five frontier models and evaluate them using human-reviewed trace labels. Manual review changes 32 labels (5.3%) relative to the original case-level annotations, confirming the need for trace-level human review. To demonstrate what world-state-grounded enforcement can achieve under favorable conditions, we introduce Sentinel, an enforcement framework based on counterfactual graph simulation. Sentinel treats every agent action as a proposed mutation to an organizational knowledge graph, performs speculative execution to materialize the post-action world state, and verifies graph-structural invariants to decide Allow/Block/Clarify. Against human-reviewed trace labels, Sentinel substantially outperforms a content-only DLP baseline (93.0% vs. 68.8% accuracy) while maintaining high precision, though it still leaves room for improvement on certain violation categories. These results demonstrate what becomes achievable once policy-relevant world state is made available to the enforcement layer.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: "Policy-Invisible Violations in LLM-Based Agents"

1. Core Contribution

This paper identifies and formalizes a specific failure mode for LLM-based agents: policy-invisible violations, where an agent executes syntactically valid, user-sanctioned actions that violate organizational policy because the facts needed for correct policy judgment are absent from the agent's visible context. This is distinct from jailbreaking, prompt injection, alignment failures, or standard access control issues — it targets the gap between what an agent *can see* and what it *needs to know* for compliance.

The paper makes four concrete contributions: (1) a formal problem definition with a taxonomy of eight violation categories; (2) PhantomPolicy, a 120-case diagnostic benchmark where tool responses contain no policy metadata; (3) empirical evaluation across five frontier models showing 90–98% violation rates on risky cases; and (4) Sentinel, a counterfactual graph simulation framework that achieves 93% accuracy by grounding enforcement in organizational world state.

The conceptual framing is the paper's strongest contribution. By cleanly separating the *knowledge problem* (missing policy-relevant facts) from the *reasoning problem* (model capability), it reframes compliance failures as a systems architecture issue rather than a model capability issue.

2. Methodological Rigor

Benchmark design is thoughtful. The key constraint — all tool responses contain clean business data without policy metadata — is well-motivated and consistently enforced. The inclusion of matched safe-control cases for each violation category is good practice. The three-annotator review protocol (with two-annotator agreement required) adds credibility. The manual review of all 600 traces (changing 32 labels, 5.3%) demonstrates commendable thoroughness and honestly acknowledges the gap between case-level and trace-level ground truth.

Sentinel's formalization is rigorous. The world state graph definition, mutation algebra, speculative execution framework, and three-valued invariant logic are cleanly specified. The complexity analysis (O(|M|) per verification, independent of graph size) and conditional soundness theorem are appropriate, though the conditional nature of Theorem 2 (requiring complete world model and correct invariants) means it's more of a design property than a practical guarantee.
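To make the Allow/Block/Clarify flow concrete, here is a minimal Python sketch of counterfactual verification as this assessment describes it. It is an illustration under stated assumptions, not the authors' implementation: the names WorldGraph, AddEdge, no_confidential_to_external, and the "sensitivity" and "external" attributes are all hypothetical, and the paper's mutation algebra is reduced here to a single edge-insertion operation.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    CLARIFY = "clarify"


@dataclass(frozen=True)
class AddEdge:
    """Hypothetical mutation type: add one labeled edge to the world graph."""
    src: str
    label: str
    dst: str


class WorldGraph:
    def __init__(self, attrs, edges=()):
        self.attrs = dict(attrs)   # node -> {attribute: value}
        self.edges = set(edges)    # (src, label, dst) triples

    def apply(self, mutations):
        """Speculative execution: build the post-action graph, leave self untouched."""
        post = WorldGraph(self.attrs, self.edges)
        for m in mutations:
            post.edges.add((m.src, m.label, m.dst))
        return post


def no_confidential_to_external(post, mutations):
    """Three-valued invariant: True (holds), False (violated), None (unknown)."""
    unknown = False
    for m in mutations:            # only nodes touched by the mutations are read
        if m.label != "shared_with":
            continue
        sensitivity = post.attrs.get(m.src, {}).get("sensitivity")
        external = post.attrs.get(m.dst, {}).get("external")
        if sensitivity is None or external is None:
            unknown = True         # the world model lacks a needed fact
        elif sensitivity == "confidential" and external:
            return False
    return None if unknown else True


def verify(world, mutations, invariants):
    post = world.apply(mutations)
    results = [inv(post, mutations) for inv in invariants]
    if any(r is False for r in results):
        return Verdict.BLOCK
    if any(r is None for r in results):
        return Verdict.CLARIFY
    return Verdict.ALLOW


# Hypothetical world state; none of these facts appear in the tool responses.
world = WorldGraph({
    "q3_forecast.xlsx": {"sensitivity": "confidential"},
    "alice@corp.example": {"external": False},
    "bob@vendor.example": {"external": True},
})
invariants = [no_confidential_to_external]
share_with = lambda who: [AddEdge("q3_forecast.xlsx", "shared_with", who)]
```

In this sketch the invariant reads only the nodes named in the proposed mutations, which is one way a check can cost O(|M|) regardless of graph size. A recipient missing from the world model yields the "unknown" value and therefore Clarify rather than a silent Allow.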

Limitations in rigor: The benchmark is small (120 cases, 60 risky/60 safe), which the authors acknowledge. While they frame it as a "diagnostic unit-test suite," some statistical claims are made on thin margins. The coverage degradation experiments (Table 10) are a valuable addition but use Monte Carlo random entity removal, which may not reflect realistic coverage gaps. The DLP baseline is explicitly described as interpretable rather than competitive, which is fair but limits the strength of comparative claims. The paper would benefit from comparison against more sophisticated baselines (e.g., LLM-as-judge with partial context, or retrieval-augmented policy checking).
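The coverage concern can be illustrated with a small Monte Carlo sketch. Everything here is an assumption made for illustration: the function names, the toy cases, and in particular the rule that a violation is detectable only when every entity it depends on is still in the world model. The paper's actual procedure for Table 10 may differ.

```python
import random


def coverage(world_entities, relevant_entities):
    """Fraction of policy-relevant entities that the world model still contains."""
    if not relevant_entities:
        return 1.0
    return len(world_entities & relevant_entities) / len(relevant_entities)


def degrade(relevant_entities, keep_fraction, rng):
    """Uniformly at random keep about keep_fraction of the entities."""
    k = round(keep_fraction * len(relevant_entities))
    return set(rng.sample(sorted(relevant_entities), k))


def detection_rate(cases, world_entities):
    """Assumed rule: a case is detectable only if all its entities are modeled."""
    detected = sum(1 for needed in cases if needed <= world_entities)
    return detected / len(cases)


def monte_carlo(cases, keep_fraction, trials=1000, seed=0):
    """Average detection rate over random removals at a fixed coverage level."""
    rng = random.Random(seed)
    relevant = set().union(*cases)
    rates = [
        detection_rate(cases, degrade(relevant, keep_fraction, rng))
        for _ in range(trials)
    ]
    return sum(rates) / trials


# Toy cases: each set lists the entities one violation depends on (hypothetical).
cases = [frozenset({"a"}), frozenset({"b"}), frozenset({"c", "d"})]
```

Uniform removal treats every entity as equally likely to be missing. Real coverage gaps are more likely to be correlated (for example, an entire department or vendor absent from the graph), which is why random removal may understate or overstate degradation in practice.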

3. Potential Impact

Practical relevance is high. As LLM agents move into enterprise deployments with tool access (email, file sharing, CRM), this exact failure mode will become increasingly consequential. The paper correctly identifies that compliance failures in cooperative, non-adversarial settings may be more practically damaging than adversarial attacks, simply because they occur during normal use.

Architectural insight: The principle that "policy enforcement should be architecturally separated from model reasoning" is actionable and aligns with security engineering principles (privilege separation, reference monitors). This could influence how agent frameworks are designed in practice.

Benchmark utility: PhantomPolicy fills a niche — most agent safety benchmarks focus on adversarial attacks or harmful content. A benchmark specifically for organizational policy compliance under information asymmetry is novel and useful, though its small scale limits direct adoption as a standard evaluation.

Cross-field influence: The work connects to contextual integrity theory (Nissenbaum), information flow control, and speculative execution concepts from computer architecture. The knowledge graph approach to policy enforcement could influence enterprise AI governance frameworks.

4. Timeliness & Relevance

This is highly timely. Enterprise adoption of LLM agents with tool access is accelerating, and the safety literature has disproportionately focused on adversarial attacks and harmful content generation. The cooperative, non-adversarial failure mode identified here is arguably more likely to cause real organizational harm at scale. The paper arrives at an inflection point where this class of failures is becoming practically relevant but has not yet been systematically studied.

5. Strengths & Limitations

Key Strengths:

Novel problem formulation: Cleanly identifies a failure mode that falls between existing categories (not jailbreaking, not alignment, not authorization). The taxonomy of eight violation categories is well-structured.

Honest evaluation: The authors are transparent about limitations — benchmark scale, conditional soundness, remaining Sentinel errors, prompt sensitivity. The manual review of all traces is commendable.

Systems-level thinking: Correctly frames the problem as architectural rather than purely about model capability, with the coverage degradation analysis (Table 10) quantifying exactly how enforcement degrades with incomplete world models.

Formal rigor: The graph-theoretic framework, mutation algebra, and composability theorem provide a principled foundation rather than ad-hoc rules.

Policy-in-prompt results: Showing that injecting rules without entity-level metadata only partially helps (40.7% violation rate, down from 95.3%) is an important finding that validates the problem framing.
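For scale, the policy-in-prompt figures quoted above work out as follows. This is plain arithmetic on the two reported rates; the function name is only illustrative.

```python
def drop(before_pct, after_pct):
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = before_pct - after_pct
    return absolute, absolute / before_pct


absolute, relative = drop(95.3, 40.7)
print(f"{absolute:.1f} points absolute, {relative:.1%} relative")
```

So injecting rules into the prompt removes a little over half of the violations in relative terms, while roughly two in five risky traces still end in a violation.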

Notable Weaknesses:

Benchmark scale and generalizability: 120 cases in a single domain (corporate email/file sharing) limits generalizability claims. The authors acknowledge this but the limitation is significant.

World model assumption: Sentinel's strong performance depends on Coverage(W,V) = 1.0. The paper correctly identifies world-model construction as the binding constraint but doesn't tackle it, making the practical path to deployment unclear.

Baseline comparisons: The DLP baseline is intentionally simple; comparing against more sophisticated approaches (e.g., RAG-augmented policy checking, tool-use guardrails from existing frameworks) would strengthen the contribution.

Single prompt regime: Results under one execution-oriented prompt. While policy-in-prompt is also tested, the space of possible system prompts and agent architectures is vast.

Sentinel's recall gap: 37 missed violations out of ~304 positive cases represents meaningful incompleteness. The error analysis attributes this to invariant edge-case coverage, but this somewhat undermines the conditional soundness claim in practice.
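The recall implied by the figures in the last item can be checked directly. The positive count is approximate in the text above ("~304"), so the result should be read as approximate too.

```python
missed = 37
positives = 304  # approximate, as stated in the assessment

recall = (positives - missed) / positives
miss_rate = missed / positives
print(f"recall ~ {recall:.1%}, miss rate ~ {miss_rate:.1%}")
```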

Additional Observations

The paper's use of models labeled as "GPT-5.4" and "Claude Opus 4.6" (with an April 2026 arXiv date) suggests evaluation on very recent frontier models, which adds credibility to the persistence-of-failure-mode claim. The composability theorem (Theorem 3) is a genuinely useful property for practical deployment — organizations could incrementally add invariants without regression risk. The connection to speculative execution in CPU architecture is an elegant analogy that aids understanding.
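One way to read the "no regression" property is as monotonicity: if the verdicts of individual invariants are combined by taking the strictest one, adding an invariant can never loosen the overall decision. The sketch below checks that reading exhaustively for small invariant sets. The strictest-verdict combination rule and all names are assumptions for illustration; the paper's statement of Theorem 3 is not reproduced in this assessment.

```python
from enum import IntEnum
from itertools import product


class Verdict(IntEnum):
    ALLOW = 0
    CLARIFY = 1
    BLOCK = 2


def combine(verdicts):
    """Assumed rule: the overall decision is the strictest individual verdict."""
    return max(verdicts, default=Verdict.ALLOW)


def never_loosens(max_existing=3):
    """Check every combination of up to max_existing verdicts plus one new one."""
    for n in range(max_existing + 1):
        for existing in product(Verdict, repeat=n):
            for new in Verdict:
                if combine(existing + (new,)) < combine(existing):
                    return False
    return True
```

Under this rule a previously blocked action stays blocked after a new invariant is added. The converse does not hold: a new invariant can turn a previous Allow into Block or Clarify, which is the intended effect of tightening policy.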

Rating: 7.2/10

Significance 7.5 · Rigor 7 · Novelty 8 · Clarity 8

Generated Apr 15, 2026

Comparison History (56)

vs. EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics

gemini-3 · 5/6/2026

Paper 2 tackles a fundamental bottleneck in AI scaling—the reliance on external supervision—by introducing a novel self-improvement mechanism. Achieving autonomous self-evolution in LLMs has profound implications for the entire foundation model ecosystem, pushing toward AGI. While Paper 1 addresses an important applied problem in agent safety and enterprise compliance, Paper 2's methodological innovation offers broader, paradigm-shifting impact across ML research by solving a core constraint in post-training alignment and reward modeling.

vs. EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale

gpt-5.2 · 5/5/2026

Paper 2 has higher likely impact: it identifies a clear, underexplored failure mode (policy-invisible violations), contributes a concrete benchmark (PhantomPolicy) with rigorous trace-level human review, and proposes an enforcement method (Sentinel) that operationalizes world-state grounding via counterfactual simulation, showing large gains over a realistic baseline. The work is timely for enterprise agent deployment, broadly relevant across security, compliance, HCI, and agent architectures, and provides actionable methodology and evaluation. Paper 1 is ambitious and benchmark-driven, but “agent framework + SOTA scores” is a crowded space and impact depends on long-term adoption.

vs. EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale

claude-opus-4.6 · 5/5/2026

EvoMaster presents a foundational framework for autonomous scientific discovery with broad applicability across multiple domains, achieving state-of-the-art results on four major benchmarks. Its self-evolving agent paradigm for scientific inquiry addresses a fundamental gap in AI-driven science and has potential to catalyze research across many fields. Paper 2, while identifying an important safety concern (policy-invisible violations) and introducing a useful benchmark, addresses a narrower problem in LLM agent deployment. EvoMaster's breadth of impact, novelty in continuous self-evolution for science, and demonstrated cross-domain performance give it higher potential scientific impact.

vs. CoSearch: Joint Training of Reasoning and Document Ranking via Reinforcement Learning for Agentic Search

claude-opus-4.6 · 5/5/2026

Paper 1 introduces a novel failure mode ('policy-invisible violations') in LLM-based agents that addresses a critical and underexplored safety/compliance gap. The concept of violations arising from hidden contextual state is highly original, practically important for enterprise AI deployment, and opens a new research direction. The benchmark (PhantomPolicy) and enforcement framework (Sentinel) with counterfactual graph simulation are methodologically innovative. Paper 2, while solid, represents an incremental improvement in agentic RAG by jointly training reasoning and retrieval—a natural extension of existing work. Paper 1's broader implications for AI safety, governance, and enterprise trust give it higher potential impact.

vs. Causal Foundations of Collective Agency

gemini-3 · 5/5/2026

Paper 1 establishes a foundational, theoretical framework for understanding collective agency using causal models. Its cross-disciplinary approach impacts multi-agent systems, AI safety, game theory, and biological sciences. While Paper 2 offers a valuable benchmark and practical engineering solution for current LLM deployments, Paper 1 addresses a more fundamental scientific question regarding emergent behavior, giving it broader and longer-term potential scientific impact.

vs. End-to-End Evaluation and Governance of an EHR-Embedded AI Agent for Clinicians

claude-opus-4.6 · 5/5/2026

Paper 1 introduces a novel conceptual framework ('policy-invisible violations') addressing a fundamental and broadly applicable problem in LLM-based agent safety. It provides a new benchmark (PhantomPolicy), systematic evaluation across five frontier models, and a novel enforcement approach (Sentinel) using counterfactual graph simulation. The problem it addresses—agents violating policies due to hidden context—is highly relevant as LLM agents proliferate across domains. Paper 2, while practically valuable, is more narrowly focused on governance of a specific clinical AI product (Hyperscribe), offering incremental engineering contributions rather than foundational insights applicable across the field.

vs. Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture

gpt-5.2 · 5/5/2026

Paper 1 is likelier to have higher scientific impact because it introduces a clearly defined, empirically grounded failure mode (policy-invisible violations), contributes a concrete benchmark (PhantomPolicy) with trace-level human evaluation, and demonstrates a working enforcement approach (Sentinel) with quantified gains over a baseline. This combination of taxonomy + dataset + reproducible evaluation + measurable system improvement supports methodological rigor, immediate applicability to enterprise agent deployments, and follow-on research. Paper 2 is conceptually ambitious, but many elements (crypto tokens, separation-of-powers, formal guarantees) may be harder to validate and adopt without comparable empirical evidence and benchmarks.

vs. METASYMBO: Multi-Agent Language-Guided Metamaterial Discovery via Symbolic Latent Evolution

claude-opus-4.6 · 5/5/2026

Paper 2 (METASYMBO) has higher potential scientific impact because it bridges natural language interfaces with metamaterial discovery—a novel intersection of LLMs, multi-agent systems, and materials science with broad real-world engineering applications. Its symbolic latent evolution approach is innovative, enabling programmable design beyond training data. The work addresses a genuine gap in early-stage materials exploration and demonstrates strong quantitative improvements. Paper 1, while addressing an important AI safety concern, is more narrowly focused on LLM policy compliance benchmarking with a specific enforcement framework, limiting its cross-disciplinary reach.

vs. METASYMBO: Multi-Agent Language-Guided Metamaterial Discovery via Symbolic Latent Evolution

claude-opus-4.6 · 5/5/2026

Paper 2 (METASYMBO) has higher potential scientific impact because it addresses a fundamental challenge in materials science—bridging natural language intent with physically valid metamaterial design—through a novel multi-agent framework with symbolic latent evolution. It combines LLMs with domain-specific geometric/physical reasoning in a way that enables early-stage scientific discovery, has clear real-world applications in engineering, and demonstrates strong quantitative improvements. Paper 1 addresses an important AI safety problem but is more narrowly focused on LLM agent policy compliance, with a smaller benchmark and incremental enforcement methodology.

vs. Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture

gpt-5.2 · 5/5/2026

Paper 2 has higher likely impact due to methodological rigor and timeliness: it introduces a clearly scoped, measurable failure mode (policy-invisible violations), provides a benchmark (PhantomPolicy) with human-reviewed trace-level labels, and demonstrates a concrete enforcement approach (Sentinel) with quantitative gains over a baseline. This makes it immediately useful for evaluation, replication, and product/security deployment across organizations. Paper 1 is ambitious and potentially transformative, but its broad architectural and formal claims are harder to validate without comparable empirical evaluation or widely adoptable artifacts.

vs. Mitigating Misalignment Contagion by Steering with Implicit Traits

gpt-5.2 · 5/5/2026

Paper 2 introduces a clearly defined, broadly applicable failure mode (policy-invisible violations), contributes a benchmark (PhantomPolicy) with careful human trace-level validation, and proposes a principled enforcement method (Sentinel) grounded in counterfactual world-state simulation with strong empirical gains over a baseline. This combination of taxonomy + dataset + system design is likely to generalize across enterprise agent deployments and influence both research and practice. Paper 1 is timely and interesting, but its mitigation (implicit trait steering) is narrower, more heuristic, and potentially less durable than stateful policy enforcement.

vs. Mitigating Misalignment Contagion by Steering with Implicit Traits

gpt-5.2 · 5/5/2026

Paper 2 introduces a clearly defined, broadly relevant failure mode (policy-invisible violations), a benchmark (PhantomPolicy) with rigorous human trace review, and a concrete enforcement approach (Sentinel) grounded in world-state simulation that substantially improves over baselines. This combination of taxonomy + dataset + validated method is likely to catalyze follow-on work across agent safety, governance, security, and enterprise deployment. Paper 1 is timely and useful for multi-agent alignment, but its mitigation (implicit trait steering) is narrower and more prompt-engineering-like, with potentially less generalizable, system-level impact.

vs. Long-Horizon Plan Execution in Large Tool Spaces through Entropy-Guided Branching

claude-opus-4.6 · 5/5/2026

Paper 1 identifies a novel and important failure mode ('policy-invisible violations') in LLM-based agents that has significant real-world implications for enterprise AI safety and compliance. The concept of violations arising from information absent from the agent's context is a fresh framing with broad relevance as organizations deploy agentic AI. The benchmark, human-reviewed evaluation methodology, and Sentinel enforcement framework address a critical gap. Paper 2 makes solid contributions with its benchmark and search algorithm for tool-augmented agents, but addresses a more incremental optimization problem (search efficiency in large tool spaces) with narrower impact scope.

vs. A Collective Variational Principle Unifying Bayesian Inference, Game Theory, and Thermodynamics

gemini-3 · 5/1/2026

Paper 2 proposes a foundational theoretical framework that unifies Bayesian inference, game theory, and thermodynamics. Its potential breadth of impact across physics, biology, and artificial intelligence is vast, offering fundamental insights into collective intelligence. In contrast, Paper 1, while highly relevant to current AI safety and enterprise applications, focuses on a much narrower, domain-specific problem regarding policy violations in LLM agents.

vs. Learning to Rotate: Temporal and Semantic Rotary Encoding for Sequential Modeling

gpt-5.2 · 4/28/2026

Paper 1 has higher impact potential: it identifies a practical, under-discussed safety/compliance failure mode (policy-invisible violations), contributes a focused benchmark (PhantomPolicy) with trace-level human evaluation, and proposes an enforceable systems approach (Sentinel) with strong gains over a realistic baseline. Its applications to enterprise agent deployment, governance, and security are immediate and broad across LLM agent frameworks. Paper 2 is novel and promising for sequential modeling, but evidence is limited to a proprietary production dataset, reducing reproducibility and breadth despite potential ML relevance.

vs. Learning to Rotate: Temporal and Semantic Rotary Encoding for Sequential Modeling

gpt-5.2 · 4/28/2026

Paper 2 likely has higher impact: it introduces a clearly defined, broadly relevant safety failure mode (policy-invisible violations), contributes a benchmark (PhantomPolicy) with careful human trace-level auditing, and proposes an enforcement framework (Sentinel) with strong empirical gains over a baseline. The problem is timely for real-world deployment of LLM agents in enterprises and intersects security, compliance, and agentic AI, giving it cross-field breadth and immediate applicability. Paper 1 is innovative but more incremental within positional encoding research and is validated mainly in a proprietary recommender setting, limiting generalizability.

vs. Detecting Data Contamination in Large Language Models

gpt-5.2 · 4/22/2026

Paper 1 introduces a distinct, practically important failure mode (policy-invisible violations), contributes a targeted benchmark (PhantomPolicy) with trace-level human review, and proposes a concrete enforcement architecture (Sentinel) that materially improves compliance via world-state-grounded, counterfactual graph simulation. This is timely for real-world LLM agents and broadly relevant to security, governance, and agent tooling. Paper 2 is valuable as a negative result and unified comparison, but its main finding (black-box MIAs ~ chance) is less actionable and less novel; impact may be narrower unless it motivates new threat models or methods.

vs. Detecting Data Contamination in Large Language Models

gpt-5.2 · 4/22/2026

Paper 1 is more likely to have higher impact due to clearer novelty (defining “policy-invisible violations,” introducing a dedicated benchmark, and a world-state-grounded enforcement approach via counterfactual graph simulation). It targets a timely, practical deployment problem for LLM agents—organizational compliance—offering an actionable framework and measurable gains over a baseline, with trace-level human evaluation adding rigor. Its ideas can generalize across security, governance, agent architectures, and knowledge-graph systems. Paper 2 is valuable but largely negative (black-box MIAs fail), and its methodological contribution (Familiarity Ranking) appears less transformative and less immediately actionable.

vs. Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

gpt-5.2 · 4/21/2026

Paper 2 likely has higher scientific impact: it identifies a new, broadly relevant failure mode (policy-invisible violations) for agentic LLMs, introduces a benchmark (PhantomPolicy) with careful human trace-level evaluation, and proposes a principled enforcement approach (Sentinel) that connects to knowledge graphs, counterfactual simulation, and compliance/security. The problem is timely for real-world deployment across many domains and the benchmark/enforcement framing can catalyze follow-on work. Paper 1 is strong and rigorous but is more incremental within forecasting methodology and narrower in cross-field impact.

vs. Context Over Content: Exposing Evaluation Faking in Automated Judges

gemini-3 · 4/17/2026

Paper 1 exposes a fundamental vulnerability in the widely adopted LLM-as-a-judge paradigm, demonstrating that automated evaluations can be implicitly manipulated without any trace in the model's reasoning. This finding has broad and immediate implications for AI safety, benchmarking, and automated evaluation methodologies across the entire field. While Paper 2 offers a strong framework for agent policy compliance, Paper 1's focus on the integrity of AI evaluation systems gives it a broader and more critical scientific impact.