Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation
Nicanor Mayumu, Xiaoheng Deng, Patrick Mukala
Abstract
We present the first systematic study of faithfulness in Vision-Language-Action (VLA) driving models, analyzing 300 Alpamayo-R1-10B inferences across 100 diverse PhysicalAI-AV scenarios. Our main finding is that the natural-language rationales these models output alongside trajectories may be significantly unfaithful: (i) overall reasoning fidelity is only 42.5%, with Chain-of-Causation matching scene reality less than half the time; (ii) 94 pedestrians are missed, in one-third of pedestrian-relevant scenes; (iii) trajectory fragility reaches 97.7% under mild visual perturbations; and (iv) mean reasoning-action consistency is only 48.3%, with 53.3% of inferences exhibiting low consistency, including 37.9% of stop-claimed cases where the model continues instead. We formalize faithfulness information-theoretically, define entity and action fidelity with verification criteria, and outline a four-component safety architecture aligned with these results.
AI Impact Assessments (1 model)
Scientific Impact Assessment
1. Core Contribution
This paper addresses a genuinely important question: whether the natural-language reasoning traces ("Chain-of-Causation") produced by Vision-Language-Action (VLA) driving models actually reflect their internal decision-making. The authors evaluate Alpamayo-R1-10B across 300 inferences on 100 PhysicalAI-AV scenarios and find alarming unfaithfulness: 42.5% overall fidelity, 94 missed pedestrians, 97.7% trajectory fragility under mild perturbations, and 14.3% "silent failures" in which the driving behavior changes but the stated reasoning does not.
The core novelty lies in transplanting the LLM faithfulness evaluation paradigm (from Lanham et al., Turpin et al.) into the safety-critical autonomous driving domain and formalizing it with driving-specific definitions. The paper defines entity fidelity (Jaccard overlap between mentioned and actual scene entities), action fidelity (kinematic consistency between stated and executed actions), and counterfactual faithfulness (coherent response to perturbations). This is a meaningful conceptual contribution that fills a gap — prior VLA work focused on trajectory accuracy, not on whether the reasoning is trustworthy.
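To make the entity fidelity definition concrete, here is a minimal Python sketch of a Jaccard overlap between the entities named in a reasoning trace and the entities actually present in the scene. It is an illustration only, not the authors' implementation: the function names `entity_fidelity` and `missed_entities`, the lowercase string matching, and the choice to score two empty sets as 1.0 are all assumptions made for this example.

```python
from typing import Iterable, List, Set


def _normalize(entities: Iterable[str]) -> Set[str]:
    """Lowercase and strip labels so 'Car' and 'car ' compare equal (assumed matching rule)."""
    return {e.strip().lower() for e in entities}


def entity_fidelity(mentioned: Iterable[str], actual: Iterable[str]) -> float:
    """Jaccard overlap |M & A| / |M | A| between mentioned and actual scene entities."""
    mentioned_set = _normalize(mentioned)
    actual_set = _normalize(actual)
    union = mentioned_set | actual_set
    if not union:
        # Assumption: nothing mentioned and nothing present counts as perfect fidelity.
        return 1.0
    return len(mentioned_set & actual_set) / len(union)


def missed_entities(mentioned: Iterable[str], actual: Iterable[str]) -> List[str]:
    """Entities present in the scene that the reasoning trace never mentions."""
    return sorted(_normalize(actual) - _normalize(mentioned))


# Hypothetical example: the trace omits a pedestrian that is in the scene.
trace_entities = ["Car", "traffic light"]
scene_entities = ["car", "pedestrian", "traffic light"]

score = entity_fidelity(trace_entities, scene_entities)   # 2 shared labels out of 3 in the union
missed = missed_entities(trace_entities, scene_entities)  # the unmentioned pedestrian
```

A set-based score like this treats every entity class equally; a safety-oriented variant might weight vulnerable road users more heavily, though the review does not say whether the paper does so.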
2. Methodological Rigor
Strengths:
Weaknesses:
3. Potential Impact
The paper's findings, if they hold across models, have significant implications for the autonomous driving industry. The key insight — that VLA reasoning traces can be dangerously misleading — directly challenges the narrative that Chain-of-Causation provides interpretability for regulators and users. The 14.3% silent failure rate is particularly compelling as a safety argument.
The proposed SafeDriveX architecture (independent VLM monitor + RLAIF + RSS safety floor) is only sketched at a high level and not implemented or evaluated. This limits the constructive impact of the paper.
The fidelity framework itself could become a standard evaluation protocol for VLA models if adopted. The entity/action fidelity definitions and counterfactual perturbation taxonomy are directly actionable for benchmarking.
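As one illustration of how a counterfactual check could be turned into a benchmark rule, the sketch below flags a "silent failure" when a perturbation changes the planned trajectory but leaves the reasoning text unchanged. Everything specific here is hypothetical: the function names, the 1.0 m displacement threshold, and the exact-match text comparison are assumptions for this example, not the paper's criteria.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def max_displacement(traj_a: List[Point], traj_b: List[Point]) -> float:
    """Largest pointwise Euclidean distance between two equal-length (x, y) trajectories."""
    if len(traj_a) != len(traj_b) or not traj_a:
        raise ValueError("trajectories must be non-empty and of equal length")
    return max(math.dist(p, q) for p, q in zip(traj_a, traj_b))


def _normalize_text(text: str) -> str:
    """Collapse case and whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def is_silent_failure(
    base_traj: List[Point],
    perturbed_traj: List[Point],
    base_reasoning: str,
    perturbed_reasoning: str,
    shift_threshold_m: float = 1.0,  # hypothetical threshold, not taken from the paper
) -> bool:
    """True when behavior changes under perturbation but the stated reasoning does not."""
    behavior_changed = max_displacement(base_traj, perturbed_traj) > shift_threshold_m
    reasoning_changed = _normalize_text(base_reasoning) != _normalize_text(perturbed_reasoning)
    return behavior_changed and not reasoning_changed


# Hypothetical example: the trajectory swerves by 2 m while the rationale stays the same.
base = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
perturbed = [(0.0, 0.0), (1.0, 0.5), (2.0, 2.0)]
flag = is_silent_failure(base, perturbed, "Keep lane, road is clear.", "keep lane,  road is clear.")
```

Exact text matching is the strictest possible notion of "reasoning unchanged"; a real protocol would more plausibly compare the extracted entities or stated actions, which this sketch does not attempt.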
4. Timeliness & Relevance
This paper is highly timely. VLA models for driving are proliferating rapidly (Alpamayo, EMMA, DriveVLM), and there is increasing regulatory pressure for interpretable AI in safety-critical systems. The question of whether explanations are trustworthy is central to regulatory acceptance. The paper positions itself well at the intersection of LLM faithfulness research and autonomous driving safety.
Its acceptance at the DriveX Workshop at CVPR 2026 is appropriate for the scope and depth of the work.
5. Strengths & Limitations Summary
Key Strengths:
Notable Weaknesses:
Overall Assessment
This is a well-motivated workshop paper that raises an important alarm about VLA reasoning faithfulness. The formal framework is a useful contribution, and the empirical findings are concerning enough to warrant attention. However, the single-model evaluation, modest statistical rigor, and absence of an implemented solution limit its scientific impact. It reads more as a compelling pilot study and call-to-action than as a definitive investigation. The findings need replication across models and more rigorous statistical treatment to support the strong claims made.
Generated May 19, 2026
Comparison History (20)
Paper 2 addresses critical safety vulnerabilities in autonomous driving (VLA models), revealing severe real-world risks like missed pedestrians and unfaithful reasoning. While Paper 1 offers a valuable methodological simplification for LLM training, Paper 2's focus on high-stakes physical AI safety and its rigorous information-theoretic formalization of faithfulness give it a higher potential for urgent, broad, and transformative scientific impact across both AI research and autonomous systems engineering.
Paper 1 addresses a critical and timely safety concern for VLA models in autonomous driving—a domain with immediate real-world consequences. It provides the first systematic faithfulness analysis of VLA reasoning, revealing alarming failure rates (42.5% fidelity, 94 missed pedestrians, 97.7% trajectory fragility). These findings have direct implications for safety-critical AI deployment and regulatory frameworks. While Paper 2 (MindLoom) makes solid contributions to reasoning data synthesis, it addresses a more incremental improvement in training methodology. Paper 1's novelty in formalizing faithfulness for embodied AI and its implications for autonomous vehicle safety give it broader cross-disciplinary impact and urgency.
Paper 1 (Search-E1) proposes a novel and elegant simplification of search-augmented reasoning training that achieves state-of-the-art results across seven benchmarks while eliminating complex auxiliary machinery. Its contribution—showing that self-distillation alone suffices—has broad methodological impact across the LLM training community. Paper 2 provides valuable empirical analysis of VLA faithfulness in autonomous driving, but is more narrowly scoped as a diagnostic study of a single model (Alpamayo-R1-10B) on one benchmark, with impact mainly limited to VLA safety. Paper 1's methodological contribution and demonstrated SOTA performance suggest wider and more lasting impact.
Paper 2 likely has higher scientific impact due to broader applicability and scalability: a general framework for controllable frontier-level reasoning data synthesis that can improve many LLMs across multiple STEM domains. It offers a compositional conceptual innovation (thought modes), a full pipeline (decompose→retrieve→compose→judge), multi-benchmark evaluation, ablations, and open-sourcing—supporting rigor and adoption. Paper 1 is timely and important for VLA safety/faithfulness, but is narrower in scope (driving scenarios, one primary model) and primarily diagnostic rather than enabling a widely reusable method.
Paper 1 exposes critical safety and faithfulness flaws in Vision-Language-Action driving models. Given the life-critical nature of autonomous vehicles, identifying that these models lack reasoning fidelity and are highly fragile to perturbations has profound and immediate implications for AI safety, robotics, and model interpretability. While Paper 2 provides a valuable framework for engineering design, Paper 1 addresses a more urgent, high-stakes problem with broader societal and regulatory impact.
Paper 2 likely has higher impact: it tackles a central, timely question in foundation-model training (whether code improves general reasoning) with controlled 10T-token pretraining and mechanistic routing evidence, yielding actionable data-composition guidance that can influence many models and domains. Its conclusions generalize across NLP/ML, math reasoning, and training methodology. Paper 1 is novel and important for AV safety/faithfulness in VLA models, but is narrower (one model, 300 inferences) and may have more limited breadth and immediate adoption compared to broadly applicable pretraining insights.
Paper 2 likely has higher scientific impact: it identifies a broadly relevant, previously under-isolated failure mode (library drift) in self-evolving LLM agent systems, provides a reproducible trigger, trace-level diagnostics, and a verified governance fix with large measured gains and multiple ablations—strong methodological rigor and clear actionable guidance. Its applications span many agent/tooling setups beyond a single domain, making breadth and timeliness high. Paper 1 is important for VLA driving safety and offers formalization plus empirical probing, but its scope is narrower (one model/scenario suite) and is more diagnostic than providing a demonstrated mitigation.
Paper 1 addresses a universal and critical bottleneck in the deployment of agentic AI systems across all domains. By introducing a formal framework for explicit provenance and a computable responsibility tensor, it offers a foundational, cross-disciplinary solution that spans software engineering, AI ethics, and law. While Paper 2 provides highly valuable empirical safety data for autonomous driving, Paper 1's conceptual framework has broader potential to shape the fundamental architecture and regulation of future multi-agent AI ecosystems.
Paper 2 is likely to have higher scientific impact due to strong timeliness and real-world stakes: it targets safety-critical Vision-Language-Action autonomy (driving) and quantifies concrete failure modes (missed pedestrians, action–rationale inconsistency, perturbation fragility) with clear safety implications. Its information-theoretic framing and verification criteria provide methodological rigor and a foundation for evaluation standards and safety architectures. While Paper 1 is novel and broader as a benchmark for interdisciplinary LLM reasoning, Paper 2’s direct applicability to PhysicalAI/AV safety and alignment makes it more immediately influential across industry, regulation, and ML safety research.
Paper 1 has higher likely scientific impact: it targets safety-critical VLA driving systems, provides the first systematic faithfulness evaluation with quantitative failure rates, formal definitions, and verification criteria—supporting methodological rigor and enabling follow-on benchmarking and safety architectures. Its findings are timely given rapid deployment of multimodal/agentic models and could influence AV evaluation, interpretability, and safety engineering across robotics and AI. Paper 2 is novel and useful for creative-writing alignment, but its applications are less safety-critical, evaluation is inherently subjective, and impact is narrower (primarily NLP/creative generation) despite relevance.
Paper 2 proposes a highly innovative, formal, and scalable architectural solution to AI safety and governance. By integrating formal verification, JIT compilation, and hardware security (TEEs), it addresses a critical bottleneck in deploying autonomous systems. While Paper 1 is a strong empirical study on VLA driving models, Paper 2 offers a broader, foundational methodology with provable guarantees, giving it higher potential for cross-disciplinary impact in AI safety, systems engineering, and governance.
Paper 2 has higher estimated scientific impact due to broader, timely relevance to safety-critical VLA/AV systems and general AI reasoning faithfulness. It introduces formal, information-theoretic definitions and verification criteria, providing reusable methodology across vision-language, robotics, and alignment/safety research. The findings (low fidelity, missed pedestrians, perturbation fragility, rationale-action inconsistency) are consequential and likely to influence evaluation standards and safety architectures. Paper 1 is practically valuable for SRE/enterprise observability, but its impact is narrower and more systems-engineering/benchmark-specific, with less generalizable theory.
Paper 1 introduces a novel paradigm (point-precise GUI control), a comprehensive benchmark, and an innovative agent architecture combining supervised tuning with precision-aligned RL. It fundamentally expands the capabilities of multimodal agents beyond forgiving region-tolerant tasks to precision-sensitive geometric constructions. This constructive contribution offers broader methodological advancements and applications in design and HCI compared to Paper 2, which, while crucial for autonomous driving safety, is primarily an empirical probing study of existing model flaws.
Paper 2 likely has higher scientific impact: it targets rapidly growing VLA/PhysicalAI driving systems with immediate safety implications, proposes formal information-theoretic faithfulness definitions and verification criteria, and provides a systematic evaluation across diverse scenarios with striking failure statistics (reasoning unfaithfulness, pedestrian misses, perturbation fragility, reasoning-action inconsistency). Its concepts generalize beyond driving to broader multimodal reasoning and AI safety evaluation. Paper 1 is impactful clinically but has smaller scale (9 participants) and narrower domain, limiting breadth and methodological generalizability.
Paper 1 addresses a critical safety issue in autonomous driving VLA models, revealing alarming unfaithfulness in reasoning (42.5% fidelity, 94 missed pedestrians, 97.7% trajectory fragility). This has immediate, high-stakes real-world implications for autonomous vehicle safety. The formalization of faithfulness information-theoretically and the proposed safety architecture contribute foundational methodology. Paper 2 introduces a useful benchmark for GUI agents but addresses a narrower, lower-stakes problem. Paper 1's findings could reshape how the field approaches VLA deployment and safety verification, giving it broader and more urgent impact.
Paper 1 addresses critical safety flaws in Vision-Language-Action models for autonomous driving, exposing severe real-world risks like missed pedestrians and unfaithful reasoning. Its findings have urgent, high-stakes implications for the deployment of physical AI. Paper 2 offers a clever, broadly applicable LLM efficiency module, but its incremental performance gains lack the profound safety and societal impact of Paper 1.
Paper 1 introduces a novel, generalizable framework (Persona Policies) addressing a fundamental limitation of LLM-based evaluation—simulator homogeneity—with strong empirical results across multiple domains. The evolutionary program search for persona generation is innovative and broadly applicable to any LLM agent evaluation setting. Paper 2 provides valuable safety analysis of VLA driving models but is narrower in scope (single model, single domain) and more diagnostic than constructive. Paper 1's methodology has broader cross-field applicability and offers actionable solutions for improving both evaluation and training of LLM agents.
Paper 2 establishes fundamental theoretical bridges between reward hacking and model exploitation in reinforcement learning, addressing core AI safety challenges. Its general theory and proofs are likely to impact a broad range of AI and RL research. Paper 1, while highly relevant for autonomous driving safety, is a narrower empirical study focused on a specific class of VLA models, resulting in a more localized scientific impact.
Paper 1 addresses a critical safety issue in autonomous driving (unfaithful reasoning and hallucinations in physical AI), which has profound real-world consequences for human life. While Paper 2 offers a strong commercial application in e-commerce, Paper 1's exposure of severe vulnerabilities in VLA models and its formalization of faithfulness metrics will likely have a broader and more urgent impact on AI safety, robotics, and regulatory discussions.
Paper 1 exposes critical safety vulnerabilities in Vision-Language-Action models for autonomous driving, a high-stakes and rapidly deploying technology. Identifying severe unfaithfulness and missed pedestrians has immediate, profound implications for AI safety, regulatory standards, and future model design, offering higher real-world and scientific urgency than the methodological improvements in EEG foundation models presented in Paper 2.