Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

Jaechang Kim, Sunung Mun, Seungjoon Lee, Jaewoong Cho, Jungseul Ok

May 27, 2026

arXiv:2605.27879v1

cs.AI(primary)

#1485 of 2682 · Artificial Intelligence

Tournament Score: 1398 ± 50 (scale 1050 to 1800)

Win Rate: 50%

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Abstract

Explainable AI (XAI) helps users interpret model behavior and identify potential faults. Agentic XAI systems use Large Language Models (LLMs) to make explanations more accessible through natural-language interaction, but they can also produce plausible yet unfaithful explanations. This risk arises because unreliable XAI outputs for complex models can be amplified by LLMs and mislead users. We propose Faithful Agentic XAI (FAX), a framework that improves explanation faithfulness through explicit verification. FAX decomposes draft explanations into claims and cross-checks them against inherently faithful tools, filtering unsupported or contradictory claims before final generation. We also introduce CRAFTER-XAI-Bench, an open-world reinforcement learning benchmark with complex policies, diverse goals, and challenging scenarios for assessing model-specific faithfulness. On CRAFTER-XAI-Bench, FAX improves simulation faithfulness from 0.20 for the strongest baseline to 0.46 while maintaining high informativeness, relevance, and fluency. On three tabular benchmarks, FAX performs competitively with prior Agentic XAI baselines, but our analysis shows that these settings can conflate task accuracy with model-specific faithfulness. These findings show that explicit verification is essential for faithful Agentic XAI and that faithfulness benchmarks must be designed to test explanations against the behavior of the target model itself.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness"

1. Core Contribution

This paper addresses a genuine and underexplored problem: agentic XAI systems that use LLMs to orchestrate XAI tools can produce fluent but unfaithful explanations, because they naively trust potentially unreliable post-hoc XAI outputs and amplify errors through coherent verbalization. The authors make two contributions:

FAX (Faithful Agentic XAI): A verification-centric framework that decomposes draft explanations into atomic claims, classifies their evidence sources (faithful tools vs. noisy post-hoc tools vs. LLM priors), plans falsification tests using "inherently faithful" tools (State Editing, Counterfactuals), and filters unsupported claims before final generation. The key insight—treating explanation claims as hypotheses to be tested rather than facts to be verbalized—is conceptually clean and draws on Popperian falsification.

CRAFTER-XAI-Bench: A benchmark built on the Crafter RL environment with three policies (Diamond Seeker, Item Hoarder, Pacifist) having distinct reward shaping, 40 evaluation scenarios across four query categories (why, what-if, counterfactual, plan), and simulation-based faithfulness evaluation. The benchmark is designed so that generic environment knowledge is insufficient—explanations must capture policy-specific behavior.

2. Methodological Rigor

Strengths in design: The distinction between "inherently faithful" tools (whose outputs are verified by direct model execution) and noisy post-hoc tools (SHAP, HIGHLIGHTS) is well-motivated and clearly operationalized. The four-step verification mechanism (claim identification → evidence analysis → verification planning → faithful-tool verification) is systematic.
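The filtering idea behind the four steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Source`, `Claim`, `verify_claims`, and `toy_policy` are hypothetical, the claims are hand-written rather than extracted by an LLM, and the only faithful tool modeled is a state edit followed by direct execution of the policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List

State = Dict[str, int]


class Source(Enum):
    """Where a claim's evidence came from (hypothetical labels)."""
    FAITHFUL_TOOL = "faithful_tool"
    POST_HOC_TOOL = "post_hoc_tool"
    LLM_PRIOR = "llm_prior"


@dataclass
class Claim:
    text: str
    source: Source
    edit: State            # state edit that serves as the falsification test
    expected_action: str   # action the claim predicts after the edit


def verify_claims(claims: List[Claim],
                  policy: Callable[[State], str],
                  base_state: State) -> List[Claim]:
    """Keep claims that are already faithful or that survive a state-edit test."""
    kept = []
    for claim in claims:
        if claim.source is Source.FAITHFUL_TOOL:
            kept.append(claim)  # evidence came from direct model execution
            continue
        edited_state = {**base_state, **claim.edit}
        # Run the actual policy on the edited state; drop the claim if falsified.
        if policy(edited_state) == claim.expected_action:
            kept.append(claim)
    return kept


def toy_policy(state: State) -> str:
    """Stand-in for the target model: avoids zombies, otherwise mines."""
    return "flee" if state.get("zombie_near", 0) else "mine"


base = {"zombie_near": 0, "has_pickaxe": 1}
draft = [
    Claim("The agent flees when a zombie is adjacent.",
          Source.POST_HOC_TOOL, {"zombie_near": 1}, "flee"),
    Claim("The agent attacks nearby zombies.",
          Source.LLM_PRIOR, {"zombie_near": 1}, "attack"),
]
verified = verify_claims(draft, toy_policy, base)
print([c.text for c in verified])
```

In this toy run the second claim reflects generic game knowledge rather than the policy's behavior, so the state-edit test falsifies it and only the first claim reaches the final explanation.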

Evaluation concerns: The faithfulness metric itself relies on an LLM evaluator generating hypothetical states and predicting model actions, which introduces its own potential unreliability. The authors acknowledge this and provide a cross-model robustness check (Qwen3-32B vs. Qwen3-8B), showing moderate ICC of 0.46—which is not particularly strong. The absolute faithfulness numbers are low across all methods (FAX achieves 0.46, baselines around 0.14-0.20), raising questions about what the ceiling is and whether the metric is well-calibrated.
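One plausible reading of a simulation-based faithfulness score is an agreement rate: over a set of hypothetical states, how often does the action predicted from the explanation match the action the target model actually takes? The sketch below assumes that reading. The function name `simulation_faithfulness` and the toy policy are hypothetical, and the predictor here is a plain Python function standing in for the LLM evaluator described above.

```python
from typing import Callable, Dict, Sequence

State = Dict[str, int]


def simulation_faithfulness(states: Sequence[State],
                            simulate_from_explanation: Callable[[State], str],
                            policy: Callable[[State], str]) -> float:
    """Fraction of hypothetical states where the explanation-based
    prediction matches the target model's actual action."""
    if not states:
        raise ValueError("need at least one hypothetical state")
    matches = sum(1 for s in states if simulate_from_explanation(s) == policy(s))
    return matches / len(states)


def toy_policy(state: State) -> str:
    """Stand-in for the target model."""
    return "flee" if state["zombie_near"] else "mine"


def generic_explanation(state: State) -> str:
    """Prediction from a generic explanation that ignores the policy's quirks."""
    return "mine"


states = [{"zombie_near": 0}, {"zombie_near": 1},
          {"zombie_near": 1}, {"zombie_near": 0}]
score = simulation_faithfulness(states, generic_explanation, toy_policy)
print(score)  # 0.5
```

Because both the hypothetical states and the predictions come from an LLM in the evaluated setup, noise in either step lowers the score independently of explanation quality, which is one reason the ceiling is hard to interpret.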

Ablation design: The baselines are well-chosen and informative. The comparison between Structured Agentic XAI w/o Verification and FAX cleanly isolates the verification contribution. The Naive LLM baseline revealing high faithfulness on tabular tasks (0.74 average) powerfully demonstrates the prior-leakage problem in standard benchmarks.

Statistical concerns: Results are averaged over three random seeds, but no confidence intervals or significance tests are reported. With 40 scenarios and considerable variance visible across query categories, the statistical power of the evaluation is uncertain.
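A percentile bootstrap over per-scenario scores is one inexpensive way to attach an interval to a mean computed from 40 scenarios. The sketch below is illustrative only: `bootstrap_ci` is a hypothetical helper, and the scores are made-up placeholder values, not data from the paper.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_ci(scores: Sequence[float],
                 n_resamples: int = 2000,
                 alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample scenarios with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder per-scenario faithfulness scores (NOT the paper's results).
example_scores = [0.0, 0.5, 1.0, 0.5] * 10  # 40 scenarios
low, high = bootstrap_ci(example_scores)
print(f"mean={mean(example_scores):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

Resampling at the scenario level treats scenarios as the unit of variation; with three seeds per scenario, the seed-averaged score per scenario would be a natural input.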

3. Potential Impact

Practical relevance: As LLM-based agents increasingly mediate between complex ML systems and users, ensuring explanation faithfulness is a real and growing concern. FAX's approach—treating explanations as testable hypotheses—is a transferable design principle beyond the specific implementation.

Benchmark contribution: CRAFTER-XAI-Bench fills a genuine gap. The paper's most compelling finding is arguably the diagnostic one: that standard tabular benchmarks conflate task accuracy with model-specific faithfulness (demonstrated clearly by the Naive LLM's high performance on tabular tasks). This observation alone could influence how the community designs XAI evaluations.

Limitations on scope: The framework currently requires domain-specific tool design (state editing must be defined for each environment), limiting out-of-the-box applicability. The computational overhead (6,129 tokens/question vs. 486 for Naive LLM, roughly 12.6 times as many) is substantial. The benchmark is restricted to a single RL environment with relatively simple policies.

4. Timeliness & Relevance

The paper is timely given the rapid proliferation of LLM agents in various application domains and growing concerns about hallucination and unfaithful reasoning. The intersection of XAI and LLM agents is nascent, with only TalkToModel (Slack et al., 2023) and a few other systems as prior art. The paper's argument that verification must be explicit—not just implicit through better prompting—is a needed corrective as the field moves toward more autonomous AI systems.

5. Strengths & Limitations

Key Strengths:

The problem formulation is sharp and well-motivated: the danger of LLMs making unreliable XAI evidence sound authoritative

The diagnostic finding about tabular benchmark limitations is valuable and independently impactful

Clean ablation structure that isolates the verification contribution

The qualitative examples (Figures 5, 6) effectively demonstrate model-specific vs. generic explanations

The distinction between inherently faithful and potentially unfaithful tools provides a useful conceptual framework

Notable Weaknesses:

The absolute faithfulness improvement (0.20 → 0.46) is notable but the ceiling remains low, suggesting fundamental limitations not fully addressed

No human evaluation is conducted; all metrics are LLM-based, creating a circularity concern (using LLMs to evaluate LLM-generated explanations)

The benchmark uses only one environment (Crafter) with three simple reward-shaped policies; generalization to more complex domains is undemonstrated

The definition of "inherently faithful" tools is somewhat narrow and domain-dependent; the framework's applicability to domains where such tools are hard to construct is unclear

The paper uses a single backbone LLM (Qwen3-32B); sensitivity to LLM choice is not explored

The tabular evaluation, while diagnostic, shows FAX performing comparably or below some baselines in faithfulness, weakening the universal applicability claim

Overall Assessment: This is a well-motivated paper that identifies a real problem at the intersection of LLM agents and XAI, proposes a reasonable solution, and provides a useful benchmark. The diagnostic insights about existing evaluation practices may be the most lasting contribution. However, the empirical evaluation, while informative, lacks the depth (scale, human evaluation, statistical rigor) to fully validate the approach. The work is best understood as a strong conceptual contribution and proof-of-concept rather than a definitive solution.

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Generated May 28, 2026

Comparison History (14)

vs. Constrained Auto-Bidding via Generative Response Modeling

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a fundamental and broadly relevant problem—faithfulness of AI explanations—which spans multiple fields (XAI, LLMs, trustworthy AI, HCI). It introduces both a novel verification framework (FAX) and a new benchmark (CRAFTER-XAI-Bench), providing reusable community infrastructure. The problem of unfaithful LLM-generated explanations is timely given widespread LLM adoption. Paper 1, while technically rigorous, addresses a narrower domain (auto-bidding in ad auctions) with more limited cross-field impact. Paper 2's contributions to evaluation methodology and verification of agentic AI systems have broader implications for AI safety and reliability.

vs. Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

claude-opus-4.6 · 5/28/2026

Paper 1 offers a more novel and mechanistically grounded contribution: demonstrating that refusal behavior is linearly decodable from intermediate LLM activations and leveraging this for efficient adversarial attacks. This has broad implications for AI safety, interpretability, and red-teaming, with a concrete 72% speedup. Paper 2 addresses an important but more incremental problem in XAI faithfulness verification with a narrower scope. Paper 1's findings about structured safety signals in intermediate representations are more likely to inspire follow-up research across multiple subfields of AI safety and mechanistic interpretability.

vs. Automatic Layer Selection for Hallucination Detection

claude-opus-4.6 · 5/28/2026

Paper 1 addresses the broadly relevant problem of hallucination detection in LLMs with a training-free, principled method (FEPoID) that works across diverse architectures, scales, and tasks. Its practical utility, minimal computational overhead, and strong empirical results give it wide applicability. Paper 2 introduces a valuable framework for faithful agentic XAI but addresses a more niche intersection (agentic XAI faithfulness) with a benchmark limited to a specific RL environment. Paper 1's contributions are more likely to be widely adopted given the centrality of hallucination detection in the LLM community.

vs. Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

gpt-5.2 · 5/28/2026

Paper 2 has higher likely impact due to a clearer, broadly relevant problem (faithfulness in agentic XAI), a general framework (claim decomposition + tool-based verification) applicable across domains, and the introduction of a new open-world benchmark that can shape future evaluation practice. The reported gains are large and tied to a concrete faithfulness metric, improving methodological rigor and comparability. Paper 1 is novel for online calibration/ranking of LLM reasoning prefixes, but it is more specialized to math-trace reliability and incremental performance gains may limit breadth.

vs. A Query Engine for the Agents

gpt-5.2 · 5/28/2026

Paper 2 has higher likely scientific impact: it tackles a broadly recognized, timely problem (unfaithful LLM-mediated explanations), proposes a general framework (verification-based faithfulness) and contributes a new benchmark (CRAFTER-XAI-Bench) that can shape evaluation practice across XAI, LLM agents, and RL. The methodological contribution is more transferable and field-defining than Paper 1’s strong systems work, which is impactful but narrower (JS-native lakehouse querying + LLM-UDF execution) and more engineering/product-facing. Paper 2’s benchmark + metrics can drive follow-on research widely.

vs. Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact due to broader real-world applicability and timeliness: it provides a generalizable, rigorous pipeline (constraint optimization + solver-certified solutions + state-based verifiers) that directly addresses a key blocker for agent evaluation/training in enterprise settings—artifact drift. The resulting ERP-Bench is a sizable, production-grade, economically relevant benchmark with auditable optimality and controllable difficulty, enabling reliable progress measurement across models and institutions. Paper 1 is novel for faithfulness verification in Agentic XAI, but its impact is more specialized and benchmark gains are narrower.

vs. LACUNA: Safe Agents as Recursive Program Holes

gpt-5.2 · 5/28/2026

Paper 2 (LACUNA) introduces a novel, general programming model that unifies agent runtime and model-written code via typed “program holes” with pre-execution type-checking, offering a principled safety mechanism (reject/rollback) and clear extensibility to common agent patterns. Its potential real-world impact is broad (safer tool use, prompt-injection resilience, reliable automation) and spans PL, systems, and AI safety, making it timely. Paper 1 is valuable for XAI faithfulness and benchmarking, but its impact is more niche and depends on faithfulness tools/benchmarks; gains are moderate and domain-specific.

vs. VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

gpt-5.2 · 5/28/2026

Paper 1 pairs a new, general framework (verification-based filtering of explanation claims against faithful tools) with an open-world RL benchmark explicitly targeting model-specific faithfulness—addressing a timely, high-stakes failure mode of LLM-mediated XAI (plausible but unfaithful explanations). The methodological contribution is actionable and could influence both XAI system design and evaluation standards across domains where faithful explanations matter. Paper 2 provides a valuable benchmark for personalization/proactiveness, but is primarily evaluative; its impact hinges more on adoption. Overall, Paper 1 offers broader cross-field methodological leverage and a clearer path to changing practice.

vs. Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG

gpt-5.2 · 5/28/2026

Paper 2 has higher impact potential due to a broader and more actionable contribution: a general verification framework for agentic XAI plus a new open-world benchmark targeting model-specific faithfulness. This addresses a timely, high-stakes problem (LLM-amplified unfaithful explanations) with clear real-world applications across safety-critical domains, RL, and interpretable ML. It also provides measurable gains and highlights evaluation pitfalls in common tabular settings. Paper 1 is novel and valuable for RAG evaluation, but its scope is narrower and primarily impacts citation-grounded generation assessment.

vs. When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

gemini-3.1 · 5/28/2026

Paper 1 addresses a fundamental and pervasive flaw in current LLM alignment paradigms—brittle safety under context shifts. By demonstrating that standard action-level guardrails systematically fail in consequence-flip scenarios, it challenges current industry standards and motivates a shift towards state-aware safety architectures. This has profound implications for the safe deployment of LLMs across all applications, giving it a broader and more immediate scientific impact compared to the important, yet more specialized, focus on Agentic XAI verification in Paper 2.

vs. CaMBRAIN: Real-time, Continuous EEG Inference with Causal State Space Models

gpt-5.2 · 5/28/2026

Paper 1 likely has higher scientific impact due to broader relevance and timeliness: ensuring faithfulness in LLM-mediated agentic XAI is a pressing, cross-domain problem affecting many ML deployments. Its explicit verification framework is conceptually novel and generalizable, and the open-world benchmark could become a community standard for evaluating model-specific faithfulness. Paper 2 is strong and application-critical (real-time EEG), but its impact may be narrower to biosignal/time-series modeling and depends on robustness/clinical validation beyond benchmark SOTA.

vs. MIRA: A Bilingual Benchmark for Medical Information Response Audit

claude-opus-4.6 · 5/28/2026

Paper 1 (MIRA) addresses a critical and timely problem—health equity in LLM-generated medical information—with broad societal impact. It introduces a novel concept (Differential Information Dilution), a systematic bilingual benchmark, and demonstrates actionable mitigation strategies. The work bridges AI safety, health literacy, and multilingual NLP, affecting a larger user population. Paper 2 (FAX) makes solid contributions to XAI faithfulness verification but targets a narrower technical audience. MIRA's focus on real-world health disparities amplified by widely-deployed LLMs gives it higher potential for cross-disciplinary impact and policy relevance.

vs. Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

gemini-3.1 · 5/28/2026

Paper 1 addresses a critical challenge in modern AI—the faithfulness of LLM-driven explanations—by introducing a novel verification framework and an open-world benchmark. This contributes fundamentally to AI safety and interpretability, likely spurring significant follow-up research. While Paper 2 offers a highly practical and efficient engineering solution for embedding compression, it explicitly describes itself as a simple codec without new theoretical contributions, making its broader scientific (rather than practical) impact potentially more limited.

vs. BatteryMFormer: Multi-level Learning for Battery Degradation Trajectory Forecasting

claude-opus-4.6 · 5/28/2026

BatteryMFormer addresses a critical real-world problem in battery degradation forecasting with broad applications in electric vehicles, energy storage, and manufacturing. Its multi-level Transformer architecture with domain-specific innovations (aging-condition-aware decoding, SOC-localized analysis) offers methodological novelty with clear practical utility. It demonstrates consistent improvements across four domains. Paper 1 addresses an important but narrower problem in XAI faithfulness verification, and while novel, its impact is more confined to the AI interpretability community. Battery research has broader cross-disciplinary relevance spanning materials science, engineering, and sustainability.