Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law

Parisa Kordjamshidi, Samer Aslan, Madhavan Seshadri, Leslie Barrett, Enrico Santus

May 15, 2026

arXiv:2605.16052v1

cs.AI (primary) · cs.CL

#1501 of 2292 · Artificial Intelligence

Tournament Score: 1373 ± 37 (scale 1050 to 1800)

Win Rate: 42%

Rating: 6.8 / 10

Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 7.5

Abstract

Recent advances in large language models (LLMs) have significantly enhanced automated legal reasoning. Yet, it remains unclear whether their performance reflects genuine legal reasoning ability or artifacts of data contamination. We present a comprehensive empirical study of tax law reasoning approaches and implement a contamination detection protocol to rigorously assess LLM reliability. We show that performance can be inflated by contamination. Building on this analysis, we conduct a systematic evaluation, comparing monolithic LLMs with hybrid systems that translate statutory text into formal representations and delegate inference to symbolic solvers. We build a novel test suite designed to probe generalization to unseen documents via case and rule variations. Our findings indicate that legal reasoning is inherently compositional and that neuro-symbolic frameworks offer a more reliable and robust foundation for legal AI, as well as improved generalization to unobserved situations.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper addresses a critical question in legal AI: whether LLMs' strong performance on tax law reasoning benchmarks reflects genuine reasoning or memorization of contaminated training data. The authors make three interrelated contributions: (1) a contamination detection protocol adapted from Golchin & Surdeanu (2025) applied to the SARA benchmark, demonstrating that modern LLMs exhibit substantial contamination (up to ~90% for Gemini 3 Pro); (2) a novel test suite, SARA+, with systematically generated case and rule perturbations that preserve formal correctness while reducing contamination overlap; and (3) a comprehensive empirical comparison showing that neuro-symbolic (Prolog-based) pipelines are more robust to perturbations than direct LLM reasoning.

The central thesis—that LLMs should serve as *translators* to formal representations rather than as *reasoners*—is well-motivated and supported by the experimental evidence. The paper demonstrates that performance gaps between direct QA and Prolog-based approaches widen significantly under rule and case perturbations, with Prolog-based systems maintaining stable accuracy while direct QA degrades sharply.

2. Methodological Rigor

Contamination Analysis: The adaptation of the BDQ/BCQ protocol from Golchin & Surdeanu is methodologically sound. Using Cohen's Kappa to adjust for positional bias and reporting both minimum and maximum contamination estimates provides appropriate uncertainty quantification. The correlation between contamination levels and direct QA performance (Table 7 vs. Table 6) is compelling, though the authors correctly note this does not establish causality.
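To make the chance-correction idea concrete, the following is a minimal sketch of Cohen's Kappa between two label sequences. How the paper maps quiz responses onto these sequences is not described in this assessment, so the pairing used here (the option a model picks versus the position of the correct option) and the toy values are assumptions for illustration only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa: agreement between two label sequences, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of positions where the labels match.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the two sequences were independent,
    # based on each sequence's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1.0:
        # Both sequences use a single identical label; Kappa is undefined.
        return 0.0
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data (invented): option chosen by a model vs. position of the correct option.
model_picks = ["A", "A", "B", "D", "C", "A"]
correct_slots = ["A", "B", "B", "D", "C", "C"]
print(round(cohens_kappa(model_picks, correct_slots), 3))
```

A Kappa near zero means the observed agreement is about what label frequencies alone would produce, which is why a raw match rate can overstate contamination when a model favors one answer position.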

SARA+ Construction: The perturbation strategy is well-designed, with aligned modifications to both textual and Prolog representations ensuring ground truth correctness without additional human annotation. The taxonomy of splits (rule changes, case numerical changes, case paraphrasing, and combinations) enables systematic isolation of different generalization dimensions. However, restricting perturbations to numerical values only is a notable limitation—this controls confounds but limits ecological validity, as real-world legal changes often involve semantic/conceptual modifications.
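The idea of an aligned numerical perturbation can be sketched as follows. This is not the authors' generation code: the function name, the example sentence, and the Prolog fact (including its predicate name) are hypothetical, and the sketch assumes amounts are written as plain integers without thousands separators.

```python
import re


def perturb_amount(case_text, case_prolog, old, new):
    """Replace one integer amount consistently in the text and in the Prolog facts."""
    # Match the number only when it is not part of a longer digit run.
    pattern = re.compile(rf"(?<!\d){old}(?!\d)")
    new_text, hits_text = pattern.subn(str(new), case_text)
    new_prolog, hits_prolog = pattern.subn(str(new), case_prolog)
    # Refuse to emit a pair in which only one side changed, since the two
    # representations would then describe different cases.
    if hits_text == 0 or hits_prolog == 0:
        raise ValueError(f"amount {old} must appear in both representations")
    return new_text, new_prolog


# Hypothetical case pair, for illustration only.
text = "Alice was paid $33200 in 2017."
facts = "payment_amount(alice_payment, 33200)."
new_text, new_facts = perturb_amount(text, facts, 33200, 41850)
print(new_text)
print(new_facts)
```

In a setup like the one described, the gold answer for the perturbed case would then come from running the Prolog program rather than from a human annotator, which is what removes the need for extra annotation.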

Experimental Design: The paper evaluates 12+ LLMs across both proprietary and open-source families, providing broad coverage. The comparison between direct QA and Prolog-based reasoning uses identical LLM backbones, ensuring fair comparison. However, the Prolog-based approach relies on *human-coded* statute rules, which is a significant advantage that somewhat undermines the comparison's fairness—the neuro-symbolic system has access to expert-verified formal knowledge that the direct QA system must derive from text.

Weaknesses in rigor: Statistical significance tests are absent throughout. Given the relatively small test sets (100 entailment, 20 numerical in the original SARA), variance estimates would strengthen claims. The paper also lacks error analysis—what specific types of errors do LLMs make under perturbation versus what Prolog pipelines handle correctly?
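One simple way to attach uncertainty to accuracies on test sets of this size is a Wilson score interval for a binomial proportion. The sketch below is a generic illustration, not something the paper reports; the counts of correct answers are invented, and only the set sizes (100 and 20) come from the text above.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for an accuracy of correct/total (z=1.96 ~ 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Invented counts: the same 80% accuracy on 100 items and on 20 items.
for correct, total in [(80, 100), (16, 20)]:
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total}: [{low:.3f}, {high:.3f}]")
```

The interval is much wider at 20 items than at 100, so a difference of a few points between systems on the numerical subset may not be distinguishable from noise.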

3. Potential Impact

Legal AI: The findings have direct implications for deploying AI in legal practice. The demonstration that contamination inflates reported performance serves as a cautionary note for legal technology companies. The advocacy for neuro-symbolic pipelines with human-in-the-loop verification aligns with practical needs in high-stakes legal applications.

Benchmark Design: The SARA+ dataset and the contamination-aware evaluation protocol provide a template for more rigorous evaluation in legal NLP and potentially other domains with contamination risks. The publicly released dataset enhances reproducibility.

Broader NLP: The paper contributes to the growing literature questioning whether LLM benchmark performance reflects genuine capabilities. The domain-specific application to legal reasoning, where correctness is paramount, makes the contamination concern particularly consequential.

Limitations on impact: The scope is narrow—tax law reasoning on a single benchmark family (SARA). The reliance on human-coded Prolog rules for the statute knowledge base limits scalability. The paper acknowledges this but doesn't propose solutions for automating rule formalization, which remains the critical bottleneck for real-world deployment.

4. Timeliness & Relevance

The paper is highly timely. Data contamination in LLM evaluation is an active concern across NLP, and legal AI is experiencing rapid commercial deployment. The juxtaposition of contamination analysis with neuro-symbolic alternatives addresses two converging trends: skepticism about LLM evaluation validity and renewed interest in hybrid AI systems. The inclusion of very recent models (GPT-5.x, Gemini 3 Pro, Claude Opus 4) ensures the findings are relevant to the current state of the art.

5. Strengths & Limitations

Key Strengths:

The contamination-robustness connection is novel and well-demonstrated: Prolog-based systems show weak correlation with contamination levels while direct QA shows strong correlation.

Comprehensive model coverage across 12+ LLMs spanning multiple families and generations.

SARA+ is a practical, reusable contribution with automatically verified ground truth.

Figure 2 effectively summarizes the core finding—Prolog-based reasoning maintains stability across all perturbation types while direct QA degrades.

Notable Limitations:

The human-coded Prolog statute rules represent a strong prior that direct QA lacks, making the comparison somewhat asymmetric. The paper should have included an ablation where LLMs also generate statute rules.

The entailment task (SARA_Ee) shows that LLMs are already robust to paraphrasing, weakening claims about LLM fragility in simpler reasoning scenarios.

Small dataset sizes limit statistical power and generalizability claims.

The paper doesn't explore whether contamination effects can be mitigated through prompting strategies or fine-tuning, which would strengthen practical recommendations.

No cost/latency analysis of the Prolog pipeline versus direct QA, which matters for deployment.

Additional Observations

The paper's framing of "reasoners vs. translators" is compelling but slightly overstated—the Prolog-based system still requires LLMs to correctly translate cases into formal representations, and errors in this translation step (which Table 6 shows are non-trivial for entailment) represent a different failure mode. The ~80-88% accuracy of Prolog-based entailment suggests translation remains a significant challenge.

The finding that GPT-4o improves by ~72% on numerical reasoning when used as a translator rather than a reasoner is particularly striking and suggests that decomposing complex tasks into translation + verified execution is a promising paradigm beyond legal reasoning.

Rating: 6.8 / 10

Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 7.5

Generated May 18, 2026

Comparison History (31)

vs. Counterparty Modeling is Not Strategy: The Limits of LLM Negotiators

claude-opus-4.6 · 5/19/2026

Paper 1 addresses a fundamental methodological concern (data contamination) affecting all LLM benchmarking in legal AI, proposes a novel contamination detection protocol, and demonstrates the superiority of neuro-symbolic approaches for legal reasoning with a new test suite. It has broader impact across AI safety, legal tech, and formal methods. Paper 2, while insightful about LLM negotiation limitations, is more narrowly focused on diagnosing a specific failure mode without proposing a solution, limiting its constructive impact.

vs. CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

claude-opus-4.6 · 5/19/2026

Paper 1 introduces a novel dataset (CBT-Audio) addressing a significant gap in mental health AI research—the lack of audio-based evaluation for CBT. It bridges audio and text modalities, demonstrating that vocal cues add value beyond transcripts for distress estimation. This has broad implications for clinical AI, multimodal language models, and mental health applications. Paper 2 makes valuable contributions to legal AI by examining contamination and neuro-symbolic methods, but addresses a narrower domain (tax law) with less novelty in its core findings. Paper 1's new resource and cross-disciplinary relevance give it higher impact potential.

vs. CAREBench: Evaluating LLMs' Emotion Understanding by Assessing Cognitive Appraisal Reasoning

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact due to stronger methodological rigor (explicit contamination detection, generalization-focused test suite), clearer real-world stakes (tax law compliance and reliability), and broader relevance to trustworthy LLM evaluation and neuro-symbolic system design beyond law. Its contamination-aware framing is timely and addresses a widely recognized failure mode in LLM benchmarking. While Paper 1 is novel in process-level emotion appraisal evaluation and valuable for affective AI, its applications are less immediate and its impact may be more specialized to HCI/affective computing compared to the cross-domain implications of contamination-robust reasoning.

vs. Causely: A Causal Intelligence Layer for Enterprise AI A Benchmark Study on SRE and Reliability Workflows

claude-opus-4.6 · 5/19/2026

Paper 2 addresses fundamental questions about LLM reasoning vs. memorization with broader scientific implications. Its contamination-aware evaluation methodology and neuro-symbolic approach are applicable across many domains beyond tax law. It contributes to core AI research debates about genuine reasoning capabilities. Paper 1, while practically useful for SRE workflows, is more narrowly focused on an engineering application (causal grounding for incident diagnosis) with limited generalizability beyond DevOps/reliability engineering contexts.

vs. GraphMind: From Operational Traces to Self-Evolving Workflow Automation

gpt-5.2 · 5/19/2026

Paper 2 has higher estimated impact due to a more novel, end-to-end closed-loop paradigm (trace-to-graph extraction, multi-agent execution, and self-evolving reinforcement) with demonstrated production deployment across multiple services and strong empirical evaluation. Its applications span enterprise automation, incident response, and operations, giving broad cross-domain relevance and timeliness for LLM+systems research. Paper 1 is rigorous and important for legal AI evaluation/contamination awareness, but its domain specificity (tax law) and incremental nature relative to existing neuro-symbolic and contamination-discussion work likely narrows overall scientific reach.

vs. SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

gemini-3.1 · 5/19/2026

Paper 2 introduces a novel methodological advancement (SAPO) addressing a fundamental credit assignment problem in reinforcement learning for structured generation. While Paper 1 offers valuable empirical insights into LLM contamination and neuro-symbolic methods in the legal domain, Paper 2's approach to step-aligned policy optimization has broader implications for improving reasoning-based generative models across various domains, offering higher potential impact in the rapidly advancing field of RL-driven generative AI.

vs. Interactive Evaluation Requires a Design Science

gpt-5.2 · 5/19/2026

Paper 1 targets a broad, timely shift in how LLMs are deployed (interactive, tool-using agents) and proposes a general evaluation paradigm with taxonomy, design principles, and reporting standards. This can reshape evaluation methodology across many domains (agents, HCI, RL, safety, reliability), giving it wide cross-field impact. Paper 2 is methodologically rigorous and practically important for legal AI, but its scope is narrower (tax law) and its main contributions (contamination-aware testing, neuro-symbolic robustness) are more domain-specific, limiting breadth despite strong real-world relevance.

vs. Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification

gpt-5.2 · 5/19/2026

Paper 1 likely has higher scientific impact due to a novel, clinically grounded framework (structured intermediate reasoning for ECGs) plus a new optimization method (SSPO) that avoids annotated reasoning traces—advancing both interpretability and performance in a high-stakes, widely studied medical domain. Its real-world applicability to scalable ECG diagnosis and potential transfer to other physiological signal tasks broaden impact across healthcare AI. Paper 2 is timely and rigorous (contamination-aware evaluation; neuro-symbolic robustness), but is primarily evaluative within a narrower domain (tax law) and offers less broadly reusable methodological innovation.

vs. Dynamics of collective creativity in AI art competitions

claude-opus-4.6 · 5/19/2026

Paper 1 addresses a novel intersection of cultural evolution, collective creativity, and human-AI co-creation at scale, analyzing a unique large-scale dataset (130K+ images). It contributes fundamental insights about how creativity emerges in networked human-AI systems, with broad implications across cultural evolution, computational creativity, and social computing. Paper 2, while rigorous, addresses a more incremental question (contamination in LLM legal reasoning) within a narrower domain. Paper 1's findings about attractor dynamics and the paradox of novelty preferences are more likely to inspire cross-disciplinary research.

vs. RAGA: Reading-And-Graph-building-Agent for Autonomous Knowledge Graph Construction and Retrieval-Augmented Generation

gemini-3.1 · 5/19/2026

Paper 2 addresses the highly impactful intersection of LLM agents, Knowledge Graph construction, and Retrieval-Augmented Generation (RAG). Its framework offers broad, cross-domain applicability compared to Paper 1's narrower focus on tax law. By improving retrieval precision and interpretability through auditable provenance, Paper 2 provides significant methodological advancements for high-stakes domains, giving it a broader and more timely scientific impact.

vs. Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

gpt-5.2 · 5/19/2026

Paper 2 has higher potential impact due to its strong timeliness (contamination-aware evaluation), high real-world relevance (legal/tax compliance), and broader cross-field implications (evaluation methodology, robustness, and neuro-symbolic AI). Its contamination detection protocol and new generalization-focused test suite address a central reliability bottleneck for LLM deployment, and the neuro-symbolic comparison offers actionable design guidance beyond a single task. Paper 1 is methodologically interesting for capability-aware clustering and could aid evaluation/routing, but its impact is more niche within LLM benchmarking and depends on adoption of the ECC framework.

vs. Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment

claude-opus-4.6 · 5/19/2026

Paper 2 presents concrete empirical contributions—a contamination detection protocol, a novel test suite, and systematic comparison of monolithic vs. neuro-symbolic approaches for legal reasoning—with actionable findings about data contamination and compositional reasoning. Paper 1 is a position paper proposing a theoretical architecture without implementation or empirical validation. While Paper 1 raises important structural arguments about LLM safety, its impact is more speculative. Paper 2's methodological contributions (contamination-aware evaluation, neuro-symbolic benchmarking) are immediately applicable and address a timely, widespread concern about LLM evaluation integrity.

vs. Learning Lifted Action Models from Traces with Minimal Information About Actions and States

gemini-3.1 · 5/19/2026

Paper 1 addresses a highly relevant and timely issue—data contamination and reasoning capabilities in LLMs—within a critical real-world domain (tax law). By demonstrating the superiority of neuro-symbolic frameworks over monolithic LLMs for robust legal reasoning, it has broad implications for AI safety and application. Paper 2, while methodologically rigorous in classical planning (learning STRIPS models), appeals to a much narrower subfield and has less immediate cross-disciplinary or real-world impact compared to the widespread adoption of LLMs.

vs. AI for Auto-Research: Roadmap & User Guide

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact due to broader, cross-disciplinary relevance (affecting all of science and AI tooling), strong timeliness (through April 2026) amid rapid growth of autonomous/agentic research systems, and clear real-world applicability via taxonomy, benchmarks, design principles, and a practitioner playbook. Paper 1 is novel and methodologically valuable (contamination-aware evaluation plus neuro-symbolic robustness) but is narrower in domain scope (tax law) and thus likely to have more specialized impact.

vs. X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Human Attention

claude-opus-4.6 · 5/18/2026

Paper 1 addresses a fundamental and broadly relevant issue in AI—data contamination and genuine reasoning vs. memorization in LLMs—applied to legal reasoning. Its contamination detection protocol and systematic comparison of neuro-symbolic vs. monolithic approaches have broad methodological implications across AI and law. Paper 2 presents an interesting enterprise context synthesis framework but targets a narrower application domain (sales lead identification) with evaluation on a single task. Paper 1's contributions to understanding LLM reliability and compositional reasoning have wider scientific relevance and timeliness.

vs. LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

gemini-3.1 · 5/18/2026

Paper 1 addresses a highly generalizable and critical problem in AI deployment—optimizing performance and cost by dynamically routing multimodal queries to the best model. Its novel latent communication approach provides a scalable solution applicable across various AI domains. Paper 2, while offering rigorous evaluation and addressing data contamination, focuses primarily on the specific domain of tax law, making its immediate breadth of impact narrower compared to Paper 1's foundational infrastructure contribution.

vs. Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments

claude-opus-4.6 · 5/18/2026

Paper 1 addresses a fundamental question about whether LLMs truly reason or merely memorize, introducing contamination-aware evaluation and demonstrating neuro-symbolic approaches' superiority for legal reasoning. This has broader implications across AI and law, touching on critical issues of trustworthiness, generalization, and compositional reasoning. Paper 2 contributes a valuable psychometric framework for AI-inferred user states, but addresses a more niche problem. Paper 1's novelty in combining contamination detection with neuro-symbolic evaluation in a high-stakes domain gives it greater potential for cross-disciplinary impact.

vs. FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

gemini-3.1 · 5/18/2026

Paper 2 presents a novel, generalizable protocol (FORGE) for self-evolving agent memory without gradient updates, addressing the highly impactful area of autonomous agent optimization. Its population-based broadcast mechanism offers a scalable solution for improving LLM decision-making across diverse domains. While Paper 1 makes strong contributions to legal AI and LLM evaluation, Paper 2's methodological innovation in agentic workflows has broader implications for artificial general intelligence, multi-agent systems, and reinforcement learning, giving it a higher potential breadth of impact.

vs. Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning

claude-opus-4.6 · 5/18/2026

Paper 1 addresses fundamental questions about LLM reliability—whether performance reflects genuine reasoning or data contamination—with broad implications for AI trustworthiness in high-stakes legal domains. Its contamination detection protocol and demonstration that neuro-symbolic approaches provide more robust generalization have cross-disciplinary relevance (AI safety, legal AI, formal methods). Paper 2, while technically solid, presents an incremental improvement to context pruning for coding agents with narrower scope. Paper 1's findings about contamination artifacts and compositional reasoning are more likely to influence future research directions across multiple fields.

vs. Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems

gemini-3.1 · 5/18/2026

Paper 1 addresses a critical and highly generalizable problem—AI safety, compliance, and governance—by integrating formal methods with LLMs for real-time monitoring and intervention. This approach is highly novel and has broad implications across virtually all domains deploying advanced AI. Paper 2, while rigorous and valuable for legal AI and contamination analysis, has a narrower focus on tax law reasoning. Thus, Paper 1 demonstrates greater breadth of impact, timeliness, and potential for widespread real-world application.