Inverting the Shield: Systematically Generating Safety Tests from Policy Specifications

Xiaoyue Lu, Xianglin Yang, Haijun Liu, Jiahao Liu, Kuntai Cai, Yan Xiao, Jin Song Dong

May 24, 2026

arXiv:2605.24883v1

cs.AI (primary), cs.CR, cs.SE

#437 of 2682 · Artificial Intelligence

Tournament Score: 1487±44 (axis range 1050 to 1800)

Win Rate: 58%

Rating: 6.5/10

Significance 7 · Rigor 5.5 · Novelty 7.5 · Clarity 7

Abstract

The widespread integration of Large Language Models (LLMs) necessitates rigorous and systematic safety evaluation. Existing paradigms either rely on constructed benchmarks to assess safety from predefined perspectives, or employ dynamic red-teaming to probe potential vulnerabilities. While effective, these approaches face challenges, as they depend heavily on expert domain knowledge, offer limited systematic guarantees, and are vulnerable to rapid obsolescence. To address these limitations, we introduce a novel framework POLARIS that brings the rigor of specification-based software testing to AI safety. POLARIS first compiles unstructured natural-language policies into First-Order Logic (FOL) representations, establishing a traceable link between high-level rules and concrete test cases. This formalization enables the construction of a Semantic Policy Graph, where complex policy violation scenarios are encoded as traversable paths. By systematically exploring this graph, POLARIS uncovers compositional violation patterns, which are then instantiated into executable natural-language test queries, enabling coverage-driven and reproducible safety testing. Experiments demonstrate that POLARIS achieves higher policy coverage and attack success counts compared to established baselines. Crucially, by bridging formal methods and AI safety, POLARIS provides a principled, automated approach to ensuring LLMs adhere to safety-critical policies with verifiable traceability. We release our code at https://github.com/huac-lxy/POLARIS.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: POLARIS – Inverting the Shield

1. Core Contribution

POLARIS introduces a specification-based testing paradigm for LLM safety evaluation, bridging formal methods from software engineering with AI red-teaming. The core idea is compelling: rather than relying on ad hoc benchmark construction or heuristic red-teaming, safety policies themselves are formalized into First-Order Logic (FOL) axioms called Abstract Violation Templates (AVTs), which are then organized into a Semantic Policy Graph. Systematic graph traversal discovers compositional violation pathways that are instantiated into natural-language adversarial queries. This provides three theoretically valuable properties: traceability (each test case maps to a specific policy clause), coverage guarantees (systematic exploration of the policy space), and reproducibility (the process is deterministic given the graph structure).

The conceptual contribution—treating safety policies as formal specifications and inverting them into test cases—is genuinely novel in the LLM safety space. The analogy to specification-based testing and software fuzzing is well-drawn and provides a principled foundation that most red-teaming work lacks.
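The graph-traversal idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the node names, clause labels, edge list, and the functions `build_graph`, `enumerate_paths`, and `clause_coverage` are all hypothetical, and a plain depth-first search stands in for whatever exploration strategy POLARIS actually uses. It only shows how paths through a policy graph can be enumerated deterministically while each path stays traceable to the clauses it touches.

```python
from collections import defaultdict

# Hypothetical toy graph: nodes are abstract policy entities, and each edge
# is labeled with the policy clause it was derived from. None of these
# names come from the paper.
EDGES = [
    ("actor", "restricted_action", "clause_A"),
    ("restricted_action", "restricted_object", "clause_A"),
    ("actor", "evasion_action", "clause_B"),
    ("evasion_action", "safeguard", "clause_B"),
    ("restricted_object", "sensitive_context", "clause_C"),
]


def build_graph(edges):
    """Adjacency list mapping each node to (neighbor, clause) pairs."""
    graph = defaultdict(list)
    for src, dst, clause in edges:
        graph[src].append((dst, clause))
    return graph


def enumerate_paths(graph, start, max_hops):
    """Depth-first enumeration of simple paths with 1..max_hops edges."""
    paths = []

    def walk(node, visited, nodes, clauses):
        if clauses:  # record every non-empty path prefix
            paths.append((tuple(nodes), tuple(clauses)))
        if len(clauses) == max_hops:
            return
        for nxt, clause in graph.get(node, []):
            if nxt not in visited:  # keep paths simple (no repeated nodes)
                walk(nxt, visited | {nxt}, nodes + [nxt], clauses + [clause])

    walk(start, {start}, [start], [])
    return paths


def clause_coverage(paths, all_clauses):
    """Fraction of policy clauses touched by at least one path."""
    covered = {c for _, clauses in paths for c in clauses}
    return len(covered & all_clauses) / len(all_clauses)


graph = build_graph(EDGES)
paths = enumerate_paths(graph, "actor", max_hops=3)
all_clauses = {clause for _, _, clause in EDGES}
coverage = clause_coverage(paths, all_clauses)

# Paths that combine more than one clause are the "compositional" ones.
compositional = [p for p in paths if len(set(p[1])) > 1]
```

Because the edge list is ordered and the search is deterministic, rerunning the sketch yields the same paths in the same order, which is the sense of reproducibility the assessment refers to. In a real pipeline each path would then be handed to a generator that turns it into a natural-language query; that step is omitted here.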

2. Methodological Rigor

The methodology has both notable strengths and significant concerns:

Strengths: The three-stage pipeline (policy-to-logic compilation → semantic graph construction → query instantiation) is well-structured and clearly presented. The validation of intermediate components (Section 4.5) is a responsible addition—verifying that the FOL translation achieves 92.06% binary accuracy and 9.10/10 fine-grained score, and that entity extraction achieves 84.7% exact match and 90.1% semantic match, provides useful confidence bounds.

Concerns:

The FOL formalization is itself performed by an LLM, introducing a circularity: the system uses LLMs to formalize policies, then tests LLMs against those formalizations. The ~8% error rate in logical formalization and ~10-15% error in entity extraction could propagate through the pipeline. While the authors mention an "automated consistency filter," its design and effectiveness are not rigorously evaluated.

The "semantic densification" step—using LLM-driven link prediction to add edges to the graph—is essentially injecting LLM commonsense knowledge, which could introduce biases or hallucinated connections. The rigor of this step is unclear.

The evaluation metrics raise questions. The density-weighted coverage and novelty scores are creative, but the choice of thresholds (τ ∈ {0.4, 0.5, 0.6}) is somewhat arbitrary. The primary attack efficacy metric is "attack success count" rather than success rate, justified by analogy to fuzzing. While the authors explain this choice, it partially obscures whether individual queries are more or less effective than those of the baselines. The volume of queries generated (28,660) is substantially larger than that of most baselines, and while the authors claim they "strictly matched the query volume of dynamic baselines," this constraint appears to apply only to the curiosity-driven baseline, not to the static benchmarks.

The evaluator diversity (5 evaluators) is commendable, but the consistency across evaluators varies substantially in Table 9, raising questions about what "attack success" truly means across different judges.
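Two of the concerns above can be made concrete with a few lines of arithmetic. The sketch below is illustrative only. The accuracy figures are the ones quoted in this assessment, but treating the two stages as independent is an assumption made here, not something the paper states. The two test suites in the second half are entirely hypothetical and exist only to show how a success count and a success rate can rank methods differently.

```python
# Figures quoted in the assessment above.
FOL_ACCURACY = 0.9206      # binary accuracy of the FOL translation
ENTITY_EXACT = 0.847       # entity extraction, exact match
ENTITY_SEMANTIC = 0.901    # entity extraction, semantic match


def joint_success(p_formalization, p_extraction):
    """Probability that both stages are correct.

    ASSUMPTION: errors in the two stages are independent.
    """
    return p_formalization * p_extraction


low = joint_success(FOL_ACCURACY, ENTITY_EXACT)       # about 0.78
high = joint_success(FOL_ACCURACY, ENTITY_SEMANTIC)   # about 0.83


def success_rate(successes, queries):
    """Per-query attack success rate."""
    return successes / queries


# HYPOTHETICAL suites, not taken from the paper.
large_suite = {"queries": 10_000, "successes": 500}
small_suite = {"queries": 1_000, "successes": 100}

large_rate = success_rate(large_suite["successes"], large_suite["queries"])
small_rate = success_rate(small_suite["successes"], small_suite["queries"])
```

Under the independence assumption, roughly 78% to 83% of items would pass both intermediate stages without error. In the hypothetical comparison, the larger suite wins on success count (500 versus 100) while the smaller suite wins on success rate (10% versus 5%), which is why reporting counts alone leaves per-query effectiveness unclear.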

3. Potential Impact

The framework addresses a genuine need in the LLM safety ecosystem. As organizations deploy LLMs with specific usage policies, the ability to automatically generate compliance test suites from those policies has clear practical value. Several impact dimensions stand out:

Industry adoption potential: Companies maintaining safety policies could use POLARIS to automatically generate regression test suites when policies are updated, addressing the obsolescence problem of static benchmarks.

Regulatory compliance: With increasing AI regulation (the paper incorporates Chinese regulatory documents), automated policy-to-test translation could support compliance verification.

Scalability: At ~$0.94 per 1,000 queries for incremental generation (after one-time graph construction), the framework is cost-effective for large-scale testing.
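As a rough sense of scale, the quoted incremental rate can be applied to the query volume mentioned earlier in this assessment. Whether the ~$0.94 rate applies uniformly to the full 28,660-query suite is an assumption made here, and the one-time graph construction cost is not included because no figure for it is given.

```python
COST_PER_1000_QUERIES_USD = 0.94   # incremental rate quoted above
QUERY_COUNT = 28_660               # query volume quoted above


def incremental_cost(n_queries, cost_per_1000=COST_PER_1000_QUERIES_USD):
    """Incremental generation cost in USD, excluding graph construction."""
    return n_queries / 1000 * cost_per_1000


total_usd = incremental_cost(QUERY_COUNT)
print(f"Estimated incremental cost: ${total_usd:.2f}")
```

Under these assumptions the incremental generation cost for a suite of that size comes to about $27.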

However, the dual-use concern is significant—the paper essentially provides an automated system for generating harmful queries at scale. The authors briefly distinguish their work from "jailbreak" prompts but don't deeply engage with responsible disclosure considerations.

4. Timeliness & Relevance

The paper is highly timely. The gap between the proliferation of LLM safety policies and the ability to systematically verify compliance is a real and growing problem. The approach of treating policies as formal specifications addresses a recognized bottleneck. The incorporation of multiple corporate and regulatory policies (16 policies from 9 companies + 4 regulatory documents) demonstrates practical grounding.

5. Strengths & Limitations

Key Strengths:

Novel conceptual framing that imports specification-based testing to AI safety

Full traceability from generated queries back to specific policy clauses

100% policy clause coverage—a unique achievement among compared methods

Comprehensive experimental setup with 6 target models, 5 evaluators, and 10 baseline comparisons

Cost-effective and scalable pipeline with reusable graph infrastructure

Code release enhances reproducibility

Key Limitations:

Single-turn only—the limitation to static, single-turn interactions is significant given that many real-world safety violations emerge in multi-turn contexts

Policy quality dependency—the "garbage-in, garbage-out" problem is acknowledged but unaddressed

The FOL formalization, while conceptually appealing, may not capture the full nuance of natural language policies (ambiguity, context-dependence, cultural variation)

The comparison with the curiosity-driven red-teaming baseline appears somewhat unfair—POLARIS generates far more queries and benefits from extensive policy knowledge as input

Limited analysis of false positives—queries flagged as successful attacks may not represent genuine safety violations

The ablation study shows relatively modest gains from the semantic graph component (average novelty improvement from 24.80% to 28.00%), questioning whether the graph construction overhead is justified

No human evaluation of query quality, naturalness, or actual harmfulness
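The size of the ablation gain noted in the list above can be expressed both ways. The sketch assumes that 24.80% is the average novelty without the semantic graph and 28.00% is the average with it, which is how the sentence reads but is not spelled out further here.

```python
NOVELTY_WITHOUT_GRAPH = 24.80   # percent, as quoted above (assumed baseline)
NOVELTY_WITH_GRAPH = 28.00      # percent, as quoted above

# Absolute gain in percentage points.
absolute_gain_pp = NOVELTY_WITH_GRAPH - NOVELTY_WITHOUT_GRAPH

# Relative gain as a fraction of the baseline value.
relative_gain = absolute_gain_pp / NOVELTY_WITHOUT_GRAPH

print(f"Absolute gain: {absolute_gain_pp:.2f} percentage points")
print(f"Relative gain: {relative_gain:.1%}")
```

That is a gain of 3.2 percentage points, or roughly 13% relative to the baseline, which is the margin being weighed against the graph construction overhead.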

Summary

POLARIS makes a meaningful conceptual contribution by introducing specification-based testing to LLM safety evaluation, with a well-engineered pipeline and thorough experimental evaluation. The traceability and coverage properties are genuinely valuable. However, the reliance on LLMs for the formalization step introduces circularity concerns, the evaluation methodology has notable gaps (especially regarding false positive analysis and human evaluation), and the practical gains over simpler approaches appear moderate in some dimensions. The work opens an interesting research direction but leaves important questions about the fidelity and completeness of LLM-generated formal specifications unresolved.

Rating: 6.5/10

Significance 7 · Rigor 5.5 · Novelty 7.5 · Clarity 7

Generated May 26, 2026

Comparison History (19)

vs. UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

gemini-3.1 · 5/27/2026

Paper 1 introduces a general RL optimization framework for LLM-based multi-agent systems, addressing a critical bottleneck in the transition from manually prompted agents to optimized, trainable agentic workflows. By providing a reusable infrastructure for multi-agent RL, it has the potential to become a foundational tool for a rapidly expanding field, leading to widespread adoption and high citation impact. While Paper 2 offers a rigorous approach to AI safety, Paper 1's contribution as a structural framework for agent optimization offers broader utility across various domains.

vs. The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context

gpt-5.2 · 5/27/2026

Paper 2 introduces a broadly relevant and timely problem in retrieval-augmented generation—distinguishing parametric memory use from reliance on retrieved evidence—framed as the “attribution blind spot.” Its proposed CRM method targets internal-representation signals rather than output, and is evaluated across multiple model families with mechanistic interventions and generalization tests, suggesting stronger methodological rigor and wider cross-domain implications (RAG reliability, provenance, auditing, safety). Paper 1 is practically valuable for safety testing, but depends on policy-to-logic compilation accuracy and is more narrowly scoped to policy compliance evaluation.

vs. Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

gpt-5.2 · 5/27/2026

Paper 2 likely has higher impact: it introduces a general, formal-methods-inspired framework (POLARIS) that compiles natural-language policies into logic, enabling coverage-driven, reproducible safety testing with traceable guarantees. This is timely given regulatory and deployment pressure around LLM safety, and has broad applicability across domains requiring policy compliance (health, finance, enterprise, gov) and across fields (AI safety, formal methods, software testing). Paper 1 is novel and useful for agent reliability, but is narrower to long-lived agent memory/harness dynamics and benchmarking.

vs. Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis

gemini-3.1 · 5/26/2026

Paper 1 offers a highly timely and innovative integration of formal methods with AI safety. By translating natural-language policies into First-Order Logic, it transforms LLM safety evaluation from ad-hoc red-teaming into a rigorous, coverage-driven software testing paradigm. This methodological leap addresses a critical bottleneck in deploying safe AI systems across all domains. While Paper 2 presents a valuable neuro-symbolic approach for healthcare, Paper 1's framework is fundamentally more broadly applicable to the entire foundation model ecosystem and tackles the urgent, cross-disciplinary challenge of verifiable AI alignment.

vs. Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

gpt-5.2 · 5/26/2026

Paper 2 (Shepherd) has higher likely impact: it delivers a general-purpose runtime substrate with formalized semantics (Lean mechanization), efficient fork/replay execution traces, and demonstrated gains across multiple meta-agent applications (supervision, counterfactual optimization, RL training). This infrastructure can be reused broadly across agent systems, verification, debugging, and training workflows, giving it wider cross-field applicability and timeliness as agentic LLMs proliferate. Paper 1 is novel and valuable for safety testing, but its impact is narrower to policy-spec safety evaluation and depends on policy-to-logic compilation fidelity.

vs. Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

gemini-3.1 · 5/26/2026

While Paper 1 offers a highly novel formal approach to AI safety testing, Paper 2 proposes a fundamental architectural advancement in linear attention by decoupling erase and write gates. By outperforming state-of-the-art sub-quadratic models like Mamba-2 and Mamba-3 on long-context tasks, it addresses a critical bottleneck in foundation model efficiency and scaling, which will likely drive broader adoption and foundational impact across the deep learning community.

vs. Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact. It introduces a broadly novel bridge between formal methods (FOL, specification-based testing) and LLM safety evaluation, enabling systematic, coverage-driven, reproducible safety tests with traceability—an approach that can generalize across models, domains, and policy regimes. Its real-world applicability is immediate for compliance and safety auditing, and its impact spans AI safety, NLP, software testing, and formal verification. Paper 1 is valuable and timely for LVLM hallucinations, but is more incremental within a narrower subarea and may have less cross-field reach.

vs. Inference Time Context Sparsity: Illusion or Opportunity?

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a fundamental efficiency bottleneck in LLM inference with broad implications for architecture design, training, and deployment. Its extensive empirical study across 20 models, five families, and multiple task types, combined with theoretical grounding and demonstrated 10x hardware speedups, positions it to influence both systems research and model design. The finding that extreme context sparsity is principled rather than heuristic could reshape how the field approaches long-context LLM inference. Paper 2, while valuable for AI safety testing, addresses a narrower problem with a more incremental contribution combining existing techniques (FOL, graph traversal) for test generation.

vs. A Sober Look at Agentic Misalignment in Automated Workflows

gpt-5.2 · 5/26/2026

Paper 1 likely has higher scientific impact due to a more concrete, broadly applicable, and timely methodology: compiling natural-language policies into formal logic, constructing a semantic graph, and generating coverage-driven, reproducible safety tests with traceability—bridging formal methods and LLM safety evaluation. It offers clear real-world utility (standardized safety testing pipelines) and measurable gains over baselines, supporting rigor and adoption. Paper 2 addresses an important emerging area (multi-agent misalignment) with an appealing framing, but its proposed paradigm appears less operationalized/standardizable and may be harder to validate and generalize across settings.

vs. MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

claude-opus-4.6 · 5/26/2026

MemQ introduces a fundamentally novel theoretical framework (Exogenous-Context MDP, provenance DAGs with TD(λ) eligibility traces for memory credit assignment) that bridges reinforcement learning theory with LLM memory systems. This offers broader impact across multiple fields (RL, agent systems, memory architectures) and addresses a foundational limitation in how LLM agents manage episodic memory. While POLARIS is a solid engineering contribution applying existing formal methods to LLM safety testing, MemQ's theoretical novelty, formalization of a new MDP class, and demonstrated gains across six diverse benchmarks suggest deeper and more lasting scientific influence.

vs. Context-CoT: Enhancing Context Learning via High-Quality Reasoning Synthesis

claude-opus-4.6 · 5/26/2026

Paper 2 (POLARIS) introduces a novel, principled framework bridging formal methods and AI safety evaluation—a highly timely and broadly impactful contribution. Its systematic approach using First-Order Logic and Semantic Policy Graphs for coverage-driven safety testing offers strong methodological rigor, reproducibility, and practical applicability across diverse LLM deployments. While Paper 1 addresses an important gap in context learning, Paper 2's combination of novelty (specification-based testing for AI safety), broad cross-field relevance (formal methods + AI safety), and immediate real-world utility gives it higher estimated impact.

vs. Test-Time Deep Thinking to Explore Implicit Rules

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a fundamental and broadly applicable challenge—enabling LLM agents to discover and reason about implicit rules through interaction—introducing a novel RL training pipeline and demonstrating significant performance gains. This has wide applicability across embodied AI, game-playing, and autonomous agents. Paper 2 presents a valuable but more niche contribution to LLM safety testing through formal specification methods. While rigorous, safety testing frameworks tend to have narrower impact compared to foundational reasoning advances. Paper 1's approach to test-time reasoning and exploration represents a more transformative contribution to the field.

vs. SPACE: Unifying Symmetric and Asymmetric Routing Problems for Generalist Neural Solver

gemini-3.1 · 5/26/2026

Paper 1 addresses the critical and highly timely issue of LLM safety by innovatively bridging formal methods with AI testing. Its approach offers verifiable traceability and systematic guarantees, which are urgently needed as LLM integration expands across all sectors. While Paper 2 offers a strong contribution to optimization and routing, Paper 1's focus on AI safety grants it a much broader potential impact across multiple disciplines and real-world applications.

vs. When Can Human-AI Teams Outperform Individuals? Tight Bounds with Impossibility Guarantees

gemini-3.1 · 5/26/2026

Paper 2 provides fundamental theoretical bounds and impossibility guarantees for human-AI teaming, solving a pervasive empirical puzzle. Its mathematical rigor, high predictive accuracy on real datasets, and broad applicability across decision-making domains offer lasting scientific impact. In contrast, Paper 1 offers a highly timely and practical engineering framework for LLM safety, but its contribution is more specialized and applied compared to the foundational theoretical advancements of Paper 2.

vs. Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs

gpt-5.2 · 5/26/2026

Paper 2 (POLARIS) likely has higher scientific impact because it introduces a more broadly applicable, formally grounded methodology: compiling natural-language policies into logic, building a semantic graph, and enabling systematic, coverage-driven safety testing with traceability. This bridges formal methods and AI safety, offering reusable tooling for evaluation across models, domains, and evolving policies—high timeliness and cross-field relevance. Paper 1 is practically valuable for controlled safety relaxation, but its impact is narrower (deployment/authorization settings) and more tightly coupled to specific model adaptation techniques.

vs. Not all uncertainty is alike: volatility, stochasticity, and exploration

gemini-3.1 · 5/26/2026

Paper 1 addresses the critical and highly timely challenge of LLM safety. By integrating formal methods with AI safety, it offers a rigorous, automated, and reproducible framework for red-teaming, directly impacting the safe deployment of foundational models. While Paper 2 presents profound theoretical insights into decision-making and RL, Paper 1's immediate real-world applicability in the rapidly expanding AI sector gives it a higher potential for rapid, widespread scientific and industrial impact.

vs. From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

gpt-5.2 · 5/26/2026

Paper 1 likely has higher impact due to a more novel, methodologically rigorous bridge between formal methods (FOL compilation, traceable specification-based testing) and LLM safety, yielding automated, coverage-driven, reproducible evaluations with direct real-world governance/compliance applications. It proposes a concrete framework with systematic guarantees and code release, addressing a timely bottleneck in safety assessment. Paper 2 is valuable and broad as a lifecycle evaluation study with practical insights, but is primarily diagnostic/empirical; its core contribution is less fundamentally new than a formalized testing paradigm for safety-critical policy adherence.

vs. AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

gemini-3.1 · 5/26/2026

Paper 1 introduces a rigorous, formal methods-based approach to LLM safety testing, addressing a critical bottleneck in AI deployment. By bridging First-Order Logic with AI safety policies, it provides a traceable, systematic, and concrete methodological innovation with immediate real-world applications. While Paper 2 offers a valuable comprehensive survey of AI in scientific discovery, Paper 1's actionable framework and empirical guarantees offer a more direct and substantial technical impact on the rapidly growing field of AI safety.

vs. When Can We Trust Early Warnings? Leakage-Excluded Early Outcome Prediction from LMS Interaction Logs

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact due to stronger novelty and broader, timely applicability: it connects formal specification-based testing (FOL, coverage, traceability) to LLM safety evaluation, a rapidly evolving and high-stakes area with wide cross-domain relevance (ML, security, formal methods, policy/compliance). Its framework can generalize to many models and policy sets, enabling systematic, reproducible safety testing. Paper 1 is methodologically important and improves rigor in learning analytics, but its scope and cross-field reach are narrower and primarily incremental protocol refinement within an established application area.