Output Type Before Quality: A Standards-Derived XAI Admissibility Rubric for Autonomous-Driving Safety

Abhinaw Priyadershi, Mandar Pitale, Jelena Frtunikj, Maria Spence

Jun 3, 2026

arXiv:2606.05461v1

cs.AI (primary)

#1518 of 3404 · Artificial Intelligence

Tournament Score: 1415 ± 47 (scale 1050 to 1800)

Win Rate: 63%

Rating: 5.8 / 10

Significance 7 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Abstract

Safety standards for ML-based autonomous driving specify the kind of evidence an assurance case must contain (directed cause-and-effect chains, quantified interventional effects, named root-cause variables), yet the XAI literature is organised by output type and technique family (saliency maps, feature attribution, counterfactuals, causal graphs, language traces). SHAP, the most-recommended ADS XAI method, returns a ranked feature list that no implementation effort can convert into a directed chain (Fig. 1). We name this mismatch the evidence-type gap. From AMLAS, ISO 26262, ISO 21448, ISO/PAS 8800 we derive 19 testable evidentiary criteria across 7 lifecycle stages with representative clause-cited derivations and score six XAI method classes structurally. Causal XAI emerges as structurally required to satisfy the derived criteria at three stages: hazard identification (+62% rubric gap), incident investigation (+50%), and data management (+50%); the verdict set is stable across thresholds T ∈ (0%, 50%] and survives a worst-case single-cell flip down to T = 25%. At the remaining four stages, correlational or language-based methods are comparable or sufficient. The rubric identifies structural admissibility (necessary but not sufficient for compliance): an admissible method's specific output content may still be wrong, and validating that fidelity (the edges a fitted SCM produces, the cause a trace names) is the open assurance challenge. A single-VLA proof of concept on 1,996 real-world driving clips (79,840 rows, ten splits) is consistent with each method's observed output type matching its rubric prediction. XAI method selection for ADS safety assurance should be driven by lifecycle-stage evidence demand, not by method popularity.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper identifies and formalizes what it terms the "evidence-type gap" — a structural mismatch between the evidentiary demands of safety standards (ISO/PAS 8800, ISO 21448, ISO 26262, AMLAS) for autonomous driving systems and the output types produced by popular XAI methods. The central insight is compelling and well-articulated: SHAP produces ranked feature lists, but standards like ISO/PAS 8800 Cl. 6.7.1 demand directed cause-and-effect chains. These are categorically different objects, and no amount of engineering refinement can transform one into the other.

The authors derive 19 testable evidentiary criteria across 7 lifecycle stages, score 6 XAI method classes against them using a Satisfies/Partial/Fails rubric, and conclude that causal XAI (specifically SCMs) is structurally necessary at three stages: hazard identification, incident investigation, and data management. The theoretical grounding in Pearl's causal hierarchy (rung-1 methods cannot answer rung-2 questions) provides a clean, principled basis for the structural impossibility claims.
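The rung-1 versus rung-2 distinction can be made concrete with a toy simulation. The following is a minimal sketch, not taken from the paper: it assumes a hypothetical linear SCM with a confounder (Z -> X, Z -> Y, X -> Y) whose coefficients are invented for illustration. It compares the slope an associational method would see in observational data with the slope obtained when X is set by intervention.

```python
import random

TRUE_EFFECT = 1.0  # hypothetical causal effect of X on Y (assumption)


def sample(n, seed, intervene=False):
    """Draw n samples from the toy SCM; with intervene=True, X is set by do(X)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)  # unobserved confounder
        if intervene:
            x = rng.gauss(0.0, 1.0)  # do(X): X no longer depends on Z
        else:
            x = 2.0 * z + rng.gauss(0.0, 1.0)  # observational: Z drives X
        y = TRUE_EFFECT * x + 3.0 * z + rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
    return xs, ys


def ols_slope(xs, ys):
    """Least-squares slope of y on x (a purely associational quantity)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


obs_slope = ols_slope(*sample(20000, seed=0))
int_slope = ols_slope(*sample(20000, seed=1, intervene=True))
print(f"observational slope:  {obs_slope:.2f}")
print(f"interventional slope: {int_slope:.2f} (true effect {TRUE_EFFECT})")
```

In this toy model the observational slope converges to 11/5 = 2.2 while the interventional slope converges to the true effect of 1.0, so no refinement of the observational estimate alone recovers the rung-2 quantity without extra causal assumptions.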

Methodological Rigor

The paper has a clearly layered methodology: standards extraction → criteria formalization → structural scoring → robustness analysis → empirical proof-of-concept. Several aspects deserve scrutiny:

Strengths in rigor:

The rubric's robustness analysis is thorough: verdicts are stable across thresholds T ∈ (0%, 50%] and survive worst-case single-cell flips down to T = 25%.

The authors are admirably transparent about interpretive choices (§2.1), acknowledging that their reading of "causal" evidence as requiring Pearl's interventional semantics is one valid interpretation, and that alternative readings would produce different results.

The paper carefully distinguishes structural admissibility from compliance sufficiency.
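The threshold and single-cell-flip checks can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' procedure: the rubric cells are made up, the stage score is assumed to be (2S+P) normalised by its maximum, the "rubric gap" is assumed to be the causal score minus the best competing score, and a "flip" is assumed to mean changing any one cell to any other verdict.

```python
from fractions import Fraction

POINTS = {"S": 2, "P": 1, "F": 0}  # Satisfies / Partial / Fails


def stage_score(cells):
    """Normalised 2S+P score in [0, 1] for one method at one stage."""
    return Fraction(sum(POINTS[c] for c in cells), 2 * len(cells))


def gap(rubric, target="causal"):
    """Target method's score minus the best competing method's score."""
    best_other = max(stage_score(v) for k, v in rubric.items() if k != target)
    return stage_score(rubric[target]) - best_other


def worst_case_flip_gap(rubric, target="causal"):
    """Smallest gap reachable by changing exactly one cell to another verdict."""
    worst = gap(rubric, target)
    for method, cells in rubric.items():
        for i, old in enumerate(cells):
            for new in POINTS:
                if new == old:
                    continue
                flipped = dict(rubric)
                flipped[method] = cells[:i] + [new] + cells[i + 1:]
                worst = min(worst, gap(flipped, target))
    return worst


# Hypothetical cells for one lifecycle stage with five criteria.
rubric = {
    "causal": list("SSSPP"),
    "attribution": list("PPFFP"),
    "language": list("PPPPF"),
}

base_gap = gap(rubric)
flip_gap = worst_case_flip_gap(rubric)
for t in (Fraction(1, 10), Fraction(1, 4), Fraction(1, 2)):
    print(f"T={float(t):.2f}: base verdict {base_gap >= t}, "
          f"after worst flip {flip_gap >= t}")
```

With these invented cells the base gap is 40% and the worst single-cell flip reduces it to 20%, so the "causal required" verdict would survive a flip only for thresholds up to 20%.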

Weaknesses in rigor:

The rubric is self-scored by the authors, which is the most significant methodological vulnerability. While they propose a two-panel validation protocol for future work, the current verdicts rest entirely on the authors' interpretive judgments. The acknowledged plan for external rater validation with weighted κ/Krippendorff's α is appropriate but absent.

The empirical proof-of-concept is modest: single VLA, single dataset, only 3 of 6 methods empirically tested. The SCM fitted on this data fails to recover 4 of 7 perturbation types at α = 0.01, which somewhat undermines the practical case for causal XAI even while the structural argument remains valid.

The learned diagnosis from downstream signals performs at chance (~30%), which the authors acknowledge as the "central open challenge." This is a significant practical gap — structural admissibility is shown, but practical utility is not demonstrated.
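For the external rater validation mentioned above, agreement on an ordinal Satisfies/Partial/Fails scale could be summarised with a weighted Cohen's kappa. The sketch below is one standard formulation with linear disagreement weights; the choice of linear weights, the two-rater setup, and the example ratings are assumptions, since the review does not specify them.

```python
from collections import Counter

LEVELS = ["F", "P", "S"]  # ordinal: Fails < Partial < Satisfies


def weighted_kappa(rater_a, rater_b, levels=LEVELS):
    """Linear-weighted Cohen's kappa for two raters on an ordinal scale."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of cells")
    k = len(levels)
    idx = {lv: i for i, lv in enumerate(levels)}
    n = len(rater_a)
    # Disagreement weight grows linearly with ordinal distance.
    w = [[abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    observed = Counter((idx[a], idx[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(idx[a] for a in rater_a)
    marg_b = Counter(idx[b] for b in rater_b)
    d_obs = sum(w[i][j] * c for (i, j), c in observed.items()) / n
    d_exp = sum(w[i][j] * marg_a[i] * marg_b[j]
                for i in range(k) for j in range(k)) / (n * n)
    if d_exp == 0:
        raise ValueError("kappa undefined: both raters used one identical level")
    return 1.0 - d_obs / d_exp


# Hypothetical verdicts from two raters over eight rubric cells.
authors = ["S", "S", "P", "F", "P", "S", "F", "P"]
external = ["S", "P", "P", "F", "F", "S", "F", "S"]
print(f"weighted kappa: {weighted_kappa(authors, external):.2f}")
```

A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.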

Potential Impact

The paper addresses a genuinely important practical problem at the intersection of XAI research and safety engineering. Its potential impact operates at several levels:

1. Standards compliance guidance: Safety engineers selecting XAI methods for ADS assurance cases now have a principled framework rather than defaulting to method popularity. This is directly actionable.

2. Research prioritization: The finding that causal XAI is structurally necessary at three lifecycle stages but essentially absent from the ADS XAI literature (per surveys of 84+ papers) identifies a clear research gap that could redirect community effort.

3. Cross-domain generalizability: The authors claim the rubric construction procedure is domain-general, applicable to any standard and method catalogue. If validated, this could influence safety-critical AI beyond autonomous driving (medical devices, aerospace, industrial automation).

4. Conceptual framework: The "output type before quality" principle — that checking whether a method *can* produce the right kind of evidence should precede evaluating how well it does so — is a useful conceptual contribution that reframes XAI evaluation.

However, impact may be limited by the paper's heavy reliance on one particular interpretation of standards language. Safety standards are intentionally method-agnostic, and standards bodies or certification authorities may not agree with the authors' strict Pearlian reading of "causal."

Timeliness & Relevance

This is highly timely. ISO/PAS 8800:2024 was published recently, and the autonomous driving industry is actively grappling with how to build assurance cases for ML-based systems. The explosion of VLA/foundation model deployment in ADS makes the XAI evidence question urgent. The paper arrives at a moment when practitioners need this kind of guidance.

Strengths & Limitations

Key strengths:

The Figure 1 example (SHAP ranking the active perturbation 8th while the standard demands a directed chain) is an exceptionally clear illustration of the core problem.

The careful separation of structural admissibility from empirical fidelity avoids overclaiming.

Transparent reporting of negative results (4/7 missed SCM edges, chance-level diagnosis).

The robustness analysis across thresholds and cell-flips adds credibility to the verdicts.

Notable weaknesses:

Self-scored rubric without external validation is the primary threat to validity.

The empirical section is thin relative to the claims: single system, single dataset, 3/6 methods.

All authors are NVIDIA employees testing on NVIDIA data/systems, introducing potential bias despite the disclosure.

The criteria derivation involves significant interpretive judgment (acknowledged), and several criteria (H2, H3, I2, D2, D3) are clause-formalizations rather than direct requirements.

The paper doesn't engage with the practical barriers to deploying causal XAI (computational cost, expert knowledge for SCM specification, scalability to production systems).

The 2S+P scoring formula is somewhat arbitrary; alternative weightings could shift results.
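The sensitivity of a 2S+P-style score to its weights can be shown with a small example. This is a hedged sketch with invented verdict strings for two hypothetical methods; it only illustrates how changing the weight on "Satisfies" can reorder methods and makes no claim about the paper's actual cells.

```python
def rubric_score(cells, w_s=2, w_p=1):
    """Weighted count of verdicts in an S/P/F string; the default is 2S+P."""
    if set(cells) - set("SPF"):
        raise ValueError("cells may contain only S, P, or F")
    return w_s * cells.count("S") + w_p * cells.count("P")


# Hypothetical methods: one fully satisfied criterion vs. three partial ones.
method_a = "SFFF"
method_b = "PPPF"

for w_s in (2, 3, 4):
    a = rubric_score(method_a, w_s=w_s)
    b = rubric_score(method_b, w_s=w_s)
    leader = "A" if a > b else "B" if b > a else "tie"
    print(f"{w_s}S+P: A={a}, B={b}, leader={leader}")
```

Under 2S+P the method with three partials leads, at 3S+P the two tie, and at 4S+P the order reverses, which is the kind of shift the weakness above refers to.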

Overall Assessment

This is a conceptually valuable contribution that frames an important problem clearly and provides a structured approach to XAI method selection for ADS safety. The theoretical argument is sound — associational methods genuinely cannot produce interventional evidence. However, the work is early-stage: the rubric needs external validation, the empirical component is limited, and the practical implications of the structural necessity finding are unclear given that the authors' own SCM implementation struggles with basic recovery tasks. The paper is best understood as a well-articulated position paper with a preliminary analytical framework, rather than a fully validated methodology.

Rating: 5.8 / 10

Significance 7 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Generated Jun 5, 2026

Comparison History (16)

vs. The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

gemini-3.1 · 6/8/2026

Paper 1 sets a broad research agenda for foundation model agents by framing their deployment challenges as a classical MDP sim-to-real gap. This unified perspective has the potential to influence a vast cross-section of the AI and agentic communities. Paper 2, while highly rigorous and valuable for autonomous driving safety compliance, is more narrowly focused in its application domain.

vs. 2-Step Agent: A Framework for the Interaction of a Decision Maker with AI Decision Support

gpt-5.2 · 6/6/2026

Paper 1 is likely to have higher impact due to its direct alignment with timely, safety-critical autonomous-driving standards and its actionable rubric that translates normative requirements into testable XAI admissibility criteria across lifecycle stages. This creates immediate real-world utility for assurance cases, compliance, and tool selection, with potentially broad uptake in industry and regulation. Methodologically, it offers a structured standards-derived evaluation plus robustness checks and a proof-of-concept study. Paper 2 is novel and rigorous, but its impact may be more academic and context-dependent, with fewer near-term standardization levers.

vs. PieArena: Ranking and Profiling Language Agents in Realistic Negotiation Scenarios

gpt-5.2 · 6/6/2026

Paper 1 offers a more novel, standards-derived framework that directly connects XAI output types to concrete safety-assurance evidence requirements across the autonomous-driving lifecycle, producing testable criteria and robustness checks. Its potential real-world impact is high because it targets regulatory/assurance practice (ISO 26262, ISO 21448, etc.), where admissibility of evidence is a bottleneck for deployment, and it could influence both tooling and certification processes across safety-critical ML domains. Paper 2 is timely and useful as an evaluation benchmark, but negotiation leaderboards may have narrower cross-field and regulatory impact than safety-assurance admissibility criteria.

vs. Unsupervised Skill Discovery for Agentic Data Analysis

gemini-3.1 · 6/6/2026

Paper 2 demonstrates higher potential impact by addressing a critical bottleneck in life-critical systems: the misalignment between XAI outputs and autonomous driving safety standards. While Paper 1 offers a valuable framework for LLM agents, Paper 2 bridges machine learning, systems engineering, and regulatory compliance. By formally deriving a rubric to evaluate XAI admissibility for safety assurance, it provides a foundational framework with immediate, high-stakes real-world applications. Its structural approach to the 'evidence-type gap' will likely heavily influence both future XAI research directions and the practical deployment and regulation of autonomous vehicles.

vs. Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution

gemini-3.1 · 6/5/2026

Paper 2 uncovers a fundamental behavioral limitation of LLMs (convergence to attractor regions) in program evolution, which broadly impacts the highly active fields of LLM-based code generation, evolutionary algorithms, and open-ended exploration. Paper 1, while highly valuable for autonomous driving safety, is more applied and regulatory-focused, mapping existing XAI methods to safety standards rather than revealing a novel underlying scientific phenomenon.

vs. Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

gpt-5.2 · 6/5/2026

Paper 2 has higher potential impact: it introduces a standards-derived, testable rubric that directly connects XAI outputs to evidence requirements in autonomous-driving safety assurance, addressing a timely and high-stakes deployment bottleneck. The approach is novel in framing “admissibility” via explicit lifecycle-stage criteria grounded in ISO/AMLAS, offering broad applicability across safety-critical ML domains and influencing both research and regulatory/industrial practice. While Paper 1 is technically innovative for LLM agent memory and shows strong benchmark gains, its impact is more confined to agent architectures and may iterate on an active retrieval trend rather than reshape evaluation/selection norms across fields.

vs. Insurance of Agentic AI

gpt-5.2 · 6/5/2026

Paper 1 is more methodologically rigorous and technically novel: it derives a clause-cited, testable rubric from multiple safety standards, evaluates XAI classes against explicit evidentiary criteria across lifecycle stages, and includes robustness checks plus an empirical proof-of-concept. Its contributions can influence both XAI research directions (favoring causal XAI where structurally required) and safety assurance practice in autonomous driving, with potential spillover to other safety-critical ML domains. Paper 2 is timely and application-relevant but is largely conceptual/framework-based with less technical validation, making its scientific impact more uncertain.

vs. AIS-Based Vessel Trajectory Prediction Using Memory-Augmented Neural Networks

gemini-3.1 · 6/5/2026

Paper 1 bridges a critical gap between AI safety standards and XAI techniques in autonomous driving, addressing a highly relevant and urgent regulatory challenge. Its development of a standards-derived rubric offers broader implications for AI certification and policy. In contrast, Paper 2 presents an incremental application of existing neural network architectures to a niche domain (maritime trajectory prediction), which, while practically useful, lacks the transformative scientific and regulatory impact of Paper 1.

vs. ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents

claude-opus-4.6 · 6/5/2026

Paper 1 presents a concrete, actionable method (CMTF) with extensive empirical validation (2448 runs, 102 tasks, 4 LLM backends) that addresses a growing practical problem in LLM agent reliability. Its training-free approach, dramatic efficiency gains (~90% token reduction), and broad applicability to the rapidly expanding LLM agent ecosystem give it high near-term impact. Paper 2 contributes a useful analytical rubric but is more niche (ADS safety + XAI intersection), primarily taxonomic rather than methodological, and its empirical validation is limited to a single proof-of-concept. Paper 1's broader relevance and practical utility suggest greater impact.

vs. VeRO: A Harness for Agents to Optimize Agents

claude-opus-4.6 · 6/5/2026

Paper 1 addresses a rapidly growing and broadly applicable problem—using coding agents to optimize other agents—with a concrete benchmark and tooling (VeRO/VeRO-Bench) that enables systematic research in a high-demand area. The agent-optimizing-agent paradigm is timely and has wide applicability across AI development. Paper 2 addresses an important but narrower niche (XAI method selection for autonomous driving safety standards), producing a useful rubric but with limited breadth of impact beyond the ADS safety assurance community. Paper 1's infrastructure contribution is likely to catalyze more downstream research.

vs. A Motivational Architecture for Conversational AGI

gemini-3.1 · 6/5/2026

Paper 1 bridges a critical gap between XAI methods and established safety standards (ISO) in autonomous driving. Its methodological rigor, reliance on testable criteria, and immediate real-world applicability in a safety-critical domain provide a stronger foundation for near-term scientific and industrial impact compared to the more theoretical, conceptual framework proposed in Paper 2.

vs. Individual Gain, Collective Loss: Metacognitive Adaptation in AI-Assisted Creativity

claude-opus-4.6 · 6/5/2026

Paper 2 addresses a concrete, high-stakes regulatory gap in autonomous driving safety by creating a standards-derived admissibility rubric linking XAI methods to specific lifecycle-stage evidence requirements. It offers immediately actionable criteria (19 testable evidentiary criteria across 7 stages), empirical validation, and targets a rapidly growing industry with pressing regulatory needs. Paper 1 proposes an interesting theoretical framework about AI-assisted creativity with a useful taxonomy, but remains largely conceptual without empirical validation. Paper 2's methodological rigor, direct regulatory applicability, and timeliness in the booming AV/AI safety domain give it broader and more immediate impact.

vs. Optimal Transport-based Permutation-Invariant Bayesian Optimization of Offshore Wind Farm Layouts

gemini-3.1 · 6/5/2026

Paper 2 bridges a critical gap between XAI methodologies and regulatory safety standards in autonomous driving. Its framework for aligning XAI outputs with strict compliance requirements has profound implications for AI safety, certification, and deployment in high-stakes domains, offering significantly broader cross-disciplinary impact than Paper 1's domain-specific optimization method.

vs. A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR

gemini-3.1 · 6/5/2026

Paper 2 addresses a fundamental methodological flaw in the rapidly expanding field of AI alignment and reinforcement learning. By mathematically proving a systematic bias in a commonly used metric and releasing a reusable audit harness, it provides a crucial correction that could standardize evaluations across the broader AI community. While Paper 1 offers highly valuable, domain-specific regulatory insights for autonomous driving, Paper 2's theoretical rigor and broad applicability to foundation model training give it a higher potential for widespread scientific impact.

vs. Bidirectional Search for Longest Paths: Case for Front-to-Front Heuristics

claude-opus-4.6 · 6/5/2026

Paper 1 addresses a critical gap between XAI methods and safety standards for autonomous driving, a high-stakes domain with enormous real-world impact. It provides a novel, actionable rubric derived from established safety standards (ISO 26262, SOTIF, AMLAS) that can guide industry practice. The breadth of impact spans AI safety, autonomous vehicles, regulation, and XAI research. Paper 2, while technically sound, addresses a narrower algorithmic problem (bidirectional search for longest paths) with more limited applicability. Paper 1's timeliness—given growing regulatory demands for AI transparency—further amplifies its potential impact.

vs. Harnessing Generalist Agents for Contextualized Time Series

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact: it proposes a general-purpose, code-released framework that extends LLM agents to structured time-series reasoning with tools, memory, and reusable routines, validated across many domains and benchmarks—broad, timely, and readily adoptable. Paper 1 is novel and rigorous in aligning XAI outputs with safety-standards evidence needs, but its impact is more specialized (autonomous-driving assurance) and primarily provides a rubric/analysis rather than a widely reusable technical system.