Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning

Chen Linze, Cai Yufan, Hou Zhe, Dong Jin Song

May 26, 2026

arXiv:2605.26530v1

cs.AI (primary)

#642 of 2682 · Artificial Intelligence

Tournament Score: 1464 ± 42

Win Rate: 58%

Rating: 6.5/10

Significance: 7

Rigor: 5.5

Novelty: 7

Clarity: 6.5

Abstract

Legal reasoning requires distinguishing changes that matter from those that do not. Legal AI should remain stable under legally irrelevant perturbations, but should change when perturbations alter legally material points. We formulate this requirement as a legal-relevance-sensitive evaluation problem: LLMs should only be sensitive to the legally relevant change. We introduce a unified evaluation suite covering should-change and should-not-change evaluation across judicial fairness, robustness, and statute-confusion scenarios. Our evaluation shows that existing legal LLMs are systematically sensitive to legally irrelevant variations and often fail to distinguish related legal elements and statutory rules. To mitigate these failures, we present LexGuard, an adversarial multi-agent framework grounded in formal reasoning. LexGuard formalizes statutes into executable constraints, uses adversarial agents to extract competing fact-statute arguments, and invokes SMT solvers to verify legal satisfaction and logical consistency. Experiments show that LexGuard improves legal reasoning reliability by reducing vulnerability to manipulative framing, improving disambiguation among similar statutes, limiting the influence of legally irrelevant attributes, and increasing consistency under benign reformulations. We show that legal trustworthiness requires not only accuracy, but calibrated sensitivity to legally material changes.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper makes two interconnected contributions. First, it formulates the concept of legal-relevance-sensitive evaluation, a framework that unifies fairness, robustness, adversarial resistance, and statute-confusion testing under a single principled lens: legal AI should change predictions only when legally material facts change ("should-change") and remain invariant when irrelevant attributes are perturbed ("should-not-change"). Second, it proposes LexGuard, a multi-agent adversarial framework grounded in SMT (Satisfiability Modulo Theories) solving via Z3, which formalizes statutes into executable constraints, uses prosecutor/defense agents to extract competing fact-statute arguments, and invokes formal verification to check legal applicability.

The framing is conceptually clean and well-motivated. The distinction between should-change and should-not-change perturbations is intuitive yet rarely formalized in legal AI evaluation. Prior work (notably J&H) focused primarily on label-preserving robustness; this paper extends the evaluation paradigm to include sensitivity to legally material modifications—a meaningful and underexplored dimension.

2. Methodological Rigor

Evaluation Framework: The perturbation taxonomy (Table 1) is comprehensive, covering extra-legal factors, surface-form changes, major/minor premise attacks, conclusion-level adversarial perturbations, and statutory element/mental state/exception sensitivity. The framework is well-defined with clear metrics (invariance, change alignment, bias magnitude, attack success rate (ASR), etc.). The dual evaluation axes (should-change/should-not-change) are formally specified.
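To make the two evaluation axes concrete, the following is a minimal sketch of how invariance and change alignment could be computed over perturbed cases. The `PerturbedCase` record, the function names, and the exact metric definitions are assumptions made for illustration; the paper's own definitions may differ, and the four example cases are invented.

```python
from dataclasses import dataclass

@dataclass
class PerturbedCase:
    original_pred: str        # model output on the unmodified case
    perturbed_pred: str       # model output after the perturbation
    should_change: bool       # True if the perturbation is legally material
    expected_pred: str = ""   # expected output after a material change

def invariance(cases):
    """Share of should-not-change cases whose prediction stayed the same."""
    pool = [c for c in cases if not c.should_change]
    return sum(c.original_pred == c.perturbed_pred for c in pool) / len(pool)

def change_alignment(cases):
    """Share of should-change cases whose new prediction matches the expected one."""
    pool = [c for c in cases if c.should_change]
    return sum(c.perturbed_pred == c.expected_pred for c in pool) / len(pool)

# Invented examples, not data from the paper.
cases = [
    PerturbedCase("theft", "theft", False),              # irrelevant attribute changed, stable
    PerturbedCase("theft", "robbery", False),            # paraphrase wrongly flipped the charge
    PerturbedCase("theft", "robbery", True, "robbery"),  # material fact added, correct change
    PerturbedCase("theft", "theft", True, "robbery"),    # material fact added, prediction stuck
]

print(invariance(cases), change_alignment(cases))
```

A system that never changes its output would score perfectly on invariance and poorly on change alignment, which is why the two axes have to be reported together.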

LexGuard Architecture: The pipeline is methodologically sound in principle—autoformalization of statutes into SMT constraints, adversarial agent extraction, and solver-based adjudication. The formalization (Appendix B) is detailed, specifying typed predicates, article/clause guards, penalty encoding, and aggravating/mitigating factor handling. The three-stage validation (syntactic, semantic, case-level testing) adds credibility.
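The idea of treating statutes as executable constraints can be illustrated with a small sketch. This is not the paper's encoding and does not use Z3: it swaps in brute-force enumeration over Boolean facts as a stand-in for an SMT query, and the statutes, predicate names, and `check` helper are all hypothetical.

```python
from itertools import product

# Hypothetical statute: taking property, without consent, with intent to deprive.
def theft(f):
    return f["took_property"] and not f["owner_consented"] and f["intent_to_deprive"]

# Hypothetical related statute that adds one distinguishing element.
def robbery(f):
    return theft(f) and f["used_violence"]

def check(statute, facts):
    """Classify a statute against partially known facts (None = disputed)."""
    unknown = [k for k, v in facts.items() if v is None]
    outcomes = set()
    # Enumerate every way the disputed facts could be resolved.
    for values in product([True, False], repeat=len(unknown)):
        completed = {**facts, **dict(zip(unknown, values))}
        outcomes.add(bool(statute(completed)))
    if outcomes == {True}:
        return "applies"
    if outcomes == {False}:
        return "does not apply"
    return "depends on disputed facts"

facts = {
    "took_property": True,
    "owner_consented": False,
    "intent_to_deprive": True,
    "used_violence": None,  # the two sides disagree on this element
}

print(check(theft, facts))    # applies
print(check(robbery, facts))  # depends on disputed facts
```

The point of the sketch is that the two similar statutes are separated by a single explicit element, so the disagreement between the adversarial agents is localized to one predicate rather than left to free-form generation.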

Experimental Concerns: Several aspects weaken rigor:

The evaluation is conducted primarily on Chinese criminal law datasets (LeCaRDv2, LEEC), limiting generalizability claims. The controlled perturbation set of 8,000 cases uses relatively short fact descriptions (avg. 134.89 characters), raising questions about complexity representativeness.

The paper uses GPT-5.2 as the base LLM—a model not yet publicly available at the time of assessment, making reproducibility uncertain.

RQ4 results show that even LexGuard has relatively high ASR (~50%) and low invariance (~26%), suggesting the framework's robustness improvements, while meaningful relative to baselines, remain modest in absolute terms.

The ablation study (Table 4) is informative but limited to LeCaRDv2 and only three component removals.

Statistical significance tests are absent throughout.
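As an illustration of the kind of test the last point refers to, a paired comparison of two systems on the same cases could use an exact McNemar test. The sketch below is one possible choice, not something the paper reports; the function name and the discordant counts are invented for the example.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value for paired correctness outcomes.

    b: cases system A got right and system B got wrong
    c: cases system B got right and system A got wrong
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Binomial tail under the null hypothesis that each discordant pair is a coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Invented counts: 15 cases only system A solved, 5 cases only system B solved.
p = mcnemar_exact(15, 5)
print(round(p, 4))  # 0.0414
```

Only the discordant pairs enter the test, so two systems with similar headline accuracy can still differ significantly if they fail on different cases.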

3. Potential Impact

The paper addresses a genuine need in legal AI deployment. The relevance-sensitive evaluation framework could become a standard testing methodology for legal NLP systems. The insight that trustworthiness requires *calibrated sensitivity*—not just accuracy or stability—is valuable beyond legal AI and applies to any domain where some input variations should alter outputs while others should not.

LexGuard's neural-symbolic approach connects to broader trends in grounding LLM outputs with formal verification. The architecture could inspire similar frameworks in medical diagnosis, financial compliance, or regulatory reasoning where rule-based verification is essential.

However, practical impact faces constraints: the formalization pipeline currently handles only statutory rules (not case law or open-textured norms), the system requires significant domain-specific knowledge engineering for each jurisdiction, and the computational overhead (107 seconds, 10+ LLM calls per case) may limit scalability for high-volume applications.
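A rough back-of-the-envelope estimate shows what this overhead means at batch scale. The sketch uses the 107 seconds per case cited above and the $0.08 per case that the assessment attributes to Table 7; the batch size, the worker count, and the assumption of perfect parallelism with no rate limits are hypothetical.

```python
SECONDS_PER_CASE = 107     # per-case latency cited in the assessment
COST_PER_CASE_USD = 0.08   # per-case cost cited from Table 7

def batch_estimate(n_cases, workers=1):
    """Return (wall-clock hours, total USD), assuming ideal parallel scaling."""
    wall_hours = n_cases * SECONDS_PER_CASE / workers / 3600
    cost = n_cases * COST_PER_CASE_USD
    return wall_hours, cost

# Hypothetical batch of 8,000 cases, run sequentially and with 16 parallel workers.
for workers in (1, 16):
    hours, cost = batch_estimate(8000, workers)
    print(f"{workers:>2} workers: {hours:.1f} h, ${cost:.2f}")
```

Under these assumptions the monetary cost is small, while sequential latency runs to roughly ten days, so the scalability concern is mainly about time and call volume.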

4. Timeliness & Relevance

The paper is highly timely. As LLMs are increasingly deployed in legal settings—document review, case analysis, legal Q&A—the question of whether these systems are sensitive to the *right* things is critical. Recent high-profile cases of LLM hallucination in legal contexts (fabricated citations, etc.) underscore this need. The formalization of trustworthiness beyond accuracy aligns with growing regulatory demands for AI explainability and fairness in high-stakes domains.

The neural-symbolic approach is also timely, as the field moves toward combining LLM flexibility with formal guarantees. Using SMT solvers to provide verifiable reasoning chains addresses the transparency requirements emerging from AI governance frameworks.

5. Strengths & Limitations

Key Strengths:

The should-change/should-not-change evaluation dichotomy is a genuinely useful conceptual contribution that could influence evaluation practices beyond legal AI.

The comprehensive perturbation taxonomy provides a structured methodology for stress-testing legal AI.

The combination of adversarial multi-agent extraction with SMT verification is architecturally novel in the legal AI space.

Detailed formalization in Appendix B demonstrates serious engagement with legal structure.

RQ5 results on confusing-statute discrimination (88.71% vs. 58.57% positive exactness) demonstrate meaningful practical improvement.

Notable Limitations:

Jurisdiction specificity: exclusively Chinese criminal law, with no evidence of cross-jurisdictional transfer.

The autoformalization quality depends on LLM accuracy, creating a circular dependency the authors acknowledge but don't resolve.

The paper assumes deterministic rule parsing, which is fundamentally at odds with how many legal provisions operate (vague standards, discretionary elements, evolving interpretation).

Absolute robustness numbers remain concerning—50% ASR for LexGuard suggests adversarial vulnerability persists despite solver grounding.

Comparison baselines are somewhat limited: specialized legal LLMs (LexiLaw, DISC-LawLLM) perform poorly on many metrics, so part of LexGuard's improvement may be attributable to its stronger base model rather than to the framework itself.

The paper's use of GPT-5.2 as the primary model raises reproducibility concerns.

No human evaluation of output quality or legal expert validation of the formal knowledge base is reported.

6. Additional Observations

The paper is well-written but dense, with substantial material relegated to appendices. The motivating example (Appendix H) is thorough and demonstrates the pipeline's interpretability. The cost analysis (Table 7) showing $0.08 per case is a practical contribution. The work sits at an interesting intersection of legal informatics, formal methods, and LLM evaluation, though it would benefit from engagement with legal scholarship on rule formalization and open-texture challenges.

Rating: 6.5/10

Significance 7 · Rigor 5.5 · Novelty 7 · Clarity 6.5

Generated May 27, 2026

Comparison History (19)

vs. MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

gpt-5.2 · 5/28/2026

Paper 2 (MUSE) has higher likely impact due to broader applicability and timeliness: text-to-CAD for manufacturable, functional assemblies targets a large industrial/design ecosystem and multiple research communities (LLMs, CAD/graphics, manufacturing, HCI). Its benchmark, multi-stage evaluation protocol, public leaderboard, and scalable VLM-judge validation can become shared infrastructure, accelerating progress and enabling standardized comparison. Paper 1 offers a strong, novel trustworthiness framing for legal AI with solver-grounding, but its domain specificity (law + statutes formalization) and deployment constraints likely limit breadth relative to a widely reusable CAD benchmark.

vs. EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a fundamental problem in legal AI—calibrated sensitivity to legally relevant vs. irrelevant changes—which has broad implications for trustworthy AI deployment in high-stakes domains. Its formalization of relevance-sensitive evaluation and the LexGuard framework combining formal reasoning with SMT solvers represents a novel methodological contribution bridging AI and formal methods. Paper 1, while comprehensive as a benchmark, is more incremental in the crowded space of multimodal benchmarks. Paper 2's focus on trustworthiness and its cross-disciplinary impact (AI, law, formal verification) gives it higher potential scientific impact.

vs. Do Clinical Models Change Treatment Decisions?

gpt-5.2 · 5/28/2026

Paper 1 has higher estimated impact due to a more novel and broadly applicable contribution: it formalizes “relevance-sensitive” evaluation (should-change vs should-not-change) and proposes LexGuard, a solver-grounded, adversarial multi-agent framework that integrates formal statute constraints and SMT verification—strong methodological rigor and a clear path to trustworthy deployment in high-stakes legal settings. Its ideas generalize beyond law to robustness, invariance testing, and neuro-symbolic verification. Paper 2 introduces a valuable clinical benchmark and training signal, but is more incremental (benchmarking + supervision) and narrower in cross-field methodological innovation.

vs. Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

gemini-3.1 · 5/27/2026

Paper 2 identifies a fundamental flaw in how multi-hop reasoning is evaluated across LLMs and introduces a novel protocol to address it. Its findings impact the entire field of LLM evaluation and post-training, offering significantly broader scientific impact than Paper 1, which, while highly rigorous and practically valuable, is largely confined to the specific domain of legal AI.

vs. From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

gpt-5.2 · 5/27/2026

Paper 1 has higher likely scientific impact due to broader cross-domain relevance: it addresses a fundamental, general problem in multi-turn RL for dialogue—compounding context distribution shift—and proposes a unified framework (interactive RL + simulator alignment) with theoretical justification and empirical gains. The ideas can transfer beyond legal AI to any interactive LLM setting (assistants, tutoring, tool use, embodied agents), making applications wide and timely. Paper 2 is rigorous and valuable but more domain-specific (legal) and its solver-grounded framework may face adoption and scalability limits outside that domain.

vs. CODESKILL: Learning Self-Evolving Skills for Coding Agents

claude-opus-4.6 · 5/27/2026

Paper 1 addresses a fundamental and underexplored problem in legal AI—calibrated sensitivity to legally relevant vs. irrelevant changes—introducing both a novel evaluation framework and a solver-grounded reasoning system (LexGuard). It combines formal methods (SMT solvers) with LLMs in a principled way, addressing trustworthiness concerns critical for high-stakes AI deployment. Paper 2, while solid, addresses a more incremental improvement in coding agent skill learning. Paper 1's broader implications for AI safety, fairness, and formal verification in legal domains give it higher cross-disciplinary impact and timeliness.

vs. Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

gpt-5.2 · 5/27/2026

Paper 2 likely has higher scientific impact: it identifies a broadly applicable, structural vulnerability in the dominant alignment paradigm (RLHF) and empirically demonstrates amplification of diverse misaligned biases, making it timely for frontier-model safety and deployment. Its implications cut across NLP, AI safety, human-computer interaction, and ML training pipelines, with clear real-world consequences for many systems beyond a single domain. Paper 1 is novel and rigorous for legal AI trustworthiness, but its primary impact is narrower (legal reasoning) despite strong methodology and practical relevance there.

vs. Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs

gpt-5.2 · 5/27/2026

Paper 2 has higher likely scientific impact due to a clearer, broadly reusable problem formulation (calibrated sensitivity to relevant vs irrelevant perturbations), a unified evaluation suite that can become a community benchmark, and a solver-grounded mitigation method (LexGuard) that strengthens methodological rigor and trustworthiness. Its approach bridges NLP, legal informatics, robustness/fairness evaluation, and neuro-symbolic reasoning, increasing cross-field impact and timeliness amid rising regulatory focus on reliable legal AI. Paper 1 is practical for controlled safety relaxation, but is more application-specific and potentially constrained by deployment/policy considerations.

vs. Generating Robust Portfolios of Optimization Models using Large Language Models

claude-opus-4.6 · 5/27/2026

Paper 1 addresses a fundamental and timely problem in legal AI trustworthiness with a novel evaluation framework (relevance-sensitive evaluation) and a concrete solution (LexGuard) combining formal verification with LLMs. It introduces a new conceptual framing—calibrated sensitivity to legally material changes—that could reshape how legal AI systems are evaluated and deployed. The integration of SMT solvers with adversarial multi-agent reasoning is methodologically innovative. Paper 2, while solid, addresses a narrower problem (optimization model generation) with a more incremental contribution (portfolio generation with theoretical guarantees). Paper 1 has broader societal implications given the high-stakes nature of legal AI.

vs. Hypothesis Generation and Inductive Inference in Children and Language Models

gemini-3.1 · 5/27/2026

Paper 1 bridges cognitive science, developmental psychology, and AI to explore fundamental mechanisms of inductive reasoning. By comparing children and LLMs using Bayesian program synthesis, it offers profound theoretical insights into human cognition and AI behavior. Paper 2 presents a rigorous neuro-symbolic approach for Legal AI, but its focus is primarily domain-specific. Paper 1's foundational exploration of hypothesis generation grants it broader interdisciplinary appeal and a higher potential for widespread scientific impact across multiple fields.

vs. Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

gemini-3.1 · 5/27/2026

While Paper 1 presents a strong, rigorous approach to improving LLM trustworthiness in the legal domain via formal reasoning, Paper 2 tackles a more broadly applicable and urgent bottleneck: evaluating long-horizon AI agents. By identifying and mitigating 'artifact drift' in benchmark generation, Paper 2's Anchor pipeline offers a scalable, verifiable methodology for agent evaluation across various enterprise and real-world domains, likely leading to wider methodological adoption and impact across the broader AI agent research community.

vs. On the Detection of Commutative Factors in Factor Graphs: Necessary and Sufficient Conditions

gpt-5.2 · 5/27/2026

Paper 2 likely has higher impact: it targets a timely, high-stakes domain (legal AI) with broad relevance to trustworthy LLM evaluation, robustness, and alignment. It contributes a unified evaluation suite (reusable benchmark value) and a mitigation framework (LexGuard) combining adversarial multi-agent prompting with solver-grounded formal constraints, offering clear real-world applicability and potential cross-domain transfer to other regulated settings (healthcare, finance). Paper 1 is rigorous and important for lifted inference correctness, but is narrower in audience and application scope despite strong theoretical value.

vs. GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration

gemini-3.1 · 5/27/2026

Paper 1 offers higher scientific impact because it not only identifies a critical flaw in LLM reasoning (sensitivity to irrelevant changes) but also proposes a novel, neuro-symbolic methodological solution (LexGuard) integrating adversarial agents and SMT solvers. While Paper 2 provides a valuable and rigorous benchmark for a specific medical subdomain (dentistry), Paper 1 advances the fundamental architecture of trustworthy AI, offering a formal reasoning framework that can be adapted to broader domains beyond legal AI.

vs. ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

gemini-3.1 · 5/27/2026

Paper 1 addresses a critical and universal bottleneck in AI-driven research—verifiability and hallucinations—with broad implications across all scientific fields. By ensuring reproducible and hallucination-free autonomous research, it has a vastly wider potential impact than Paper 2, which focuses on the narrower, domain-specific challenge of robustness in Legal AI.

vs. A Sober Look at Agentic Misalignment in Automated Workflows

claude-opus-4.6 · 5/27/2026

Paper 1 addresses a critical and timely problem in legal AI trustworthiness with a novel evaluation framework (relevance-sensitive evaluation) and a concrete solution (LexGuard) grounded in formal methods (SMT solvers). It combines methodological rigor with clear real-world applications in legal AI, an area of growing importance. Paper 2 tackles agentic misalignment with a Bayesian framework and AEA, which is valuable but more incremental in the multi-agent alignment space. Paper 1's domain-specific formalization, practical evaluation suite, and integration of formal verification with LLMs offer broader and more distinctive contributions.

vs. HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

claude-opus-4.6 · 5/27/2026

HyperGuide introduces a novel and broadly applicable technique—using hyperbolic geometry to guide LLM multi-step reasoning—that addresses a fundamental efficiency-accuracy tradeoff across many domains. The geometric insight connecting reasoning tree structure to hyperbolic space is creative and theoretically motivated, with potential impact across all reasoning tasks. Paper 1, while rigorous and important for legal AI trustworthiness, is more domain-specific. HyperGuide's cross-domain applicability, methodological novelty, and open-source availability give it broader potential scientific impact.

vs. From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation

gemini-3.1 · 5/27/2026

Paper 2 addresses a fundamental challenge in LLMs—trustworthiness and sensitivity to irrelevant changes—by introducing a novel evaluation suite and a neuro-symbolic framework integrating adversarial agents with SMT solvers. This verifiable reasoning approach advances AI safety and methodology. Paper 1 presents a solid, but more narrowly applied RAG pipeline for a specific legal task, giving Paper 2 broader theoretical and cross-disciplinary impact.

vs. Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions

gpt-5.2 · 5/27/2026

Paper 1 has higher potential scientific impact due to a clearer novelty in evaluation (relevance-sensitive should-change/should-not-change paradigm) plus a solver-grounded mitigation (SMT-backed constraints with adversarial argument extraction) that strengthens methodological rigor and trustworthiness claims. It targets a high-stakes domain (law) with immediate real-world implications (robustness to manipulative framing, statutory disambiguation) and offers broadly transferable ideas for calibrated sensitivity and formal verification in LLM evaluation. Paper 2 is timely and useful for personalization, but memory-augmented long-term interaction and retrieval mechanisms are more incremental and crowded.

vs. Advancing Graph Few-Shot Learning via In-Context Learning

gemini-3.1 · 5/27/2026

Paper 1 tackles a critical challenge in a high-stakes domain (Legal AI) by combining LLMs with formal reasoning (SMT solvers). This neuro-symbolic approach not only addresses immediate real-world concerns regarding fairness, robustness, and trustworthiness in legal applications, but also introduces a novel methodology (LexGuard) that could inspire reliable AI systems in other sensitive fields. Paper 2 presents a solid algorithmic improvement for graph few-shot learning via in-context learning, but its impact is likely more confined to the graph ML community, whereas Paper 1 bridges AI, law, and formal methods.