Jaineet Shah
When an LLM agent fails -- issues a refund it should not have, calls the wrong tool, leaks data -- existing tooling answers what happened (observability) or whether it passed (evaluation), but not which step caused the failure. The obvious heuristics are wrong: the step that executes the harmful action is usually not the step that decided on it, and LLM-judge attribution is correlational and unreliable (state-of-the-art step-level accuracy on the Who&When benchmark is about 14%). We present Causal Agent Replay (CAR), which answers the question by intervention: it models an agent run as a structural causal model, applies a do-operation to a step, and re-executes the trajectory forward under the same stochastic policy, measuring the shift in the outcome distribution. We define an intervention algebra over agent steps, a single-step contrastive estimator whose point-of-commitment rule resolves a confound specific to stochastic run-forward, and a budget-bounded Monte-Carlo Shapley estimator that splits credit across interacting steps. Every effect is reported with confidence intervals. We validate against synthetic structural causal models with planted ground truth: the contrastive estimator recovers the pivotal step, and Shapley recovers a two-step interaction (0.44, 0.45, ~0; efficiency sum 0.909 versus the analytic 0.91). CAR is open source and runs on hosted or free local models.
CAR addresses a genuine gap in the LLM agent debugging ecosystem: identifying *which step* in a multi-step agent trajectory caused a failure, rather than merely logging what happened or evaluating pass/fail. The paper formalizes an agent trajectory as a structural causal model (SCM), defines five `do(·)` intervention operations over agent steps, and provides two attribution estimators: a contrastive single-step estimator with a "point-of-commitment" rule, and a budget-bounded Monte-Carlo Shapley estimator for multi-step interactions.
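The intervene-and-replay idea can be illustrated with a minimal sketch. This is not CAR's implementation: the three-step toy agent, the function names `run_agent` and `effect_with_ci`, and every probability below are assumptions made up for illustration. The sketch forces one step to a fixed value (a `do` operation), re-runs the remaining steps under the same stochastic policy, and reports the shift in failure rate with a normal-approximation 95% confidence interval.

```python
import math
import random

def run_agent(rng, do=None):
    """Toy 3-step trajectory (hypothetical); returns 1 if the run fails, else 0.

    `do` is an optional (step_index, forced_value) intervention.
    """
    steps = []
    for k in range(3):
        if do is not None and do[0] == k:
            steps.append(do[1])  # do(step_k := forced_value)
        else:
            steps.append(1 if rng.random() < 0.5 else 0)  # stochastic policy
    # Assumed outcome mechanism: only step 1 drives the failure probability.
    p_fail = 0.8 if steps[1] == 1 else 0.1
    return 1 if rng.random() < p_fail else 0

def effect_with_ci(step, n=2000, seed=0):
    """Difference in failure rate between do(step=1) and do(step=0), with a 95% CI."""
    rng = random.Random(seed)
    p1 = sum(run_agent(rng, do=(step, 1)) for _ in range(n)) / n
    p0 = sum(run_agent(rng, do=(step, 0)) for _ in range(n)) / n
    diff = p1 - p0
    # Standard error of a difference between two independent proportions.
    se = math.sqrt(p1 * (1 - p1) / n + p0 * (1 - p0) / n)
    return diff, (diff - 1.96 * se, diff + 1.96 * se)

for k in range(3):
    diff, (lo, hi) = effect_with_ci(k)
    print(f"step {k}: effect={diff:+.3f}  95% CI=({lo:+.3f}, {hi:+.3f})")
```

In this toy, steps are independent, so the sketch does not reproduce the re-rolling confound that the point-of-commitment rule targets. It only shows the basic contrast between two intervened outcome distributions.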
The key intellectual contribution is the identification and resolution of a specific confound in stochastic run-forward attribution: resampling step *k* necessarily re-rolls all downstream stochastic decisions, making early irrelevant steps appear causal. The point-of-commitment rule (attributing to the *latest* step whose effect CI excludes zero) is a clean, principled solution to this problem.
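As described here, the point-of-commitment rule reduces to a simple selection over per-step confidence intervals. The sketch below assumes those intervals have already been estimated. The function name `point_of_commitment` and the example intervals are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Tuple

def point_of_commitment(intervals: List[Tuple[float, float]]) -> Optional[int]:
    """Return the index of the latest step whose effect CI excludes zero.

    `intervals[k]` is the (lower, upper) confidence interval for step k's
    effect on the outcome. Returns None if every interval contains zero.
    """
    for k in range(len(intervals) - 1, -1, -1):  # scan from the last step back
        lo, hi = intervals[k]
        if lo > 0 or hi < 0:
            return k
    return None

# Made-up intervals: step 0 looks causal only because resampling it re-rolls
# step 1; step 2 has no detectable effect.
cis = [(0.05, 0.30), (0.20, 0.55), (-0.08, 0.07)]
chosen = point_of_commitment(cis)
print(chosen)  # 1
```

Scanning from the end is what discounts the early steps: step 0 also has an interval that excludes zero, but the rule prefers the later step 1.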
On methodological rigor, the paper has both real strengths and significant weaknesses.
Strengths: The causal framing is principled and well-motivated. The paper correctly identifies that LLM-judge attribution is correlational (citing ~14% accuracy on Who&When), and that the step executing a harmful action is often not the step that decided on it. The distributional treatment—reporting confidence intervals rather than point estimates—is methodologically sound. The honesty about provider nondeterminism (reporting action-match rates rather than asserting reproducibility) is commendable.
Weaknesses: The validation is entirely on *synthetic* SCMs with planted ground truth. The paper argues this step is "non-optional," but it is clearly insufficient on its own. The Shapley recovery (0.909 vs. the analytic 0.91) on a two-step interaction is encouraging but close to trivial, since these are toy settings with known structure. There is no evaluation on real agent failures and no baseline comparison on the Who&When benchmark: the paper cites that benchmark's roughly 14% step-level accuracy as motivation but never runs CAR on it, so there is no evidence that CAR improves on that number.
The single qualitative example (Figure 1, a support-agent prompt injection) is illustrative but not evaluative. We don't know if the attribution is correct in any rigorous sense beyond visual plausibility.
The Shapley estimator's exponential worst case is acknowledged, and the budget-bounding is practical, but no empirical analysis of convergence rates or computational costs on realistic trajectory lengths is provided.
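To make the cost trade-off concrete, here is a minimal sketch of a budget-bounded Monte-Carlo Shapley estimator using permutation sampling. It is a generic textbook construction, not the paper's code: `mc_shapley`, `toy_value`, and the three-step game are assumptions for illustration. Exact Shapley values need the value of every coalition (2^n of them), whereas this version evaluates n + 1 coalitions per sampled permutation, so the budget caps total cost.

```python
import random
from typing import Callable, FrozenSet, List

def mc_shapley(value: Callable[[FrozenSet[int]], float],
               n_steps: int, budget: int, seed: int = 0) -> List[float]:
    """Estimate Shapley values from `budget` random permutations of the steps."""
    rng = random.Random(seed)
    phi = [0.0] * n_steps
    order = list(range(n_steps))
    for _ in range(budget):
        rng.shuffle(order)
        coalition: FrozenSet[int] = frozenset()
        prev = value(coalition)
        for k in order:
            coalition = coalition | {k}
            cur = value(coalition)
            phi[k] += cur - prev  # marginal contribution of step k
            prev = cur
    return [p / budget for p in phi]

def toy_value(s: FrozenSet[int]) -> float:
    """Hypothetical game: the failure needs steps 0 AND 1; step 2 is irrelevant."""
    return 1.0 if {0, 1} <= s else 0.0

phi = mc_shapley(toy_value, n_steps=3, budget=500)
print(phi, sum(phi))
```

Within each permutation the marginal contributions telescope to v(all steps) - v(no steps), so the estimates satisfy the efficiency property for any budget, while the individual values only approach the exact ones (0.5, 0.5, 0 for this toy game) as the budget grows. In a replay setting each call to `value` would presumably stand for a batch of re-executions, which is where the real cost lies.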
The problem being addressed is genuinely important and timely. As LLM agents are deployed in customer support, code generation, and autonomous workflows, failure attribution becomes critical for debugging, safety, and trust. The causal framing is the right conceptual move, and the open-source release lowers adoption barriers.
However, several practical limitations constrain near-term impact.
The conceptual framework could influence how the community thinks about agent debugging, even if the current implementation has practical gaps.
Highly timely. The paper addresses a 2025-2026 problem space (citing concurrent work from ICML 2025 and 2025 arXiv preprints). LLM agent deployment is accelerating, and failure attribution is an emerging bottleneck. The paper positions itself well in this nascent literature, differentiating from oracle-substitution (AgenTracer) and static-log approaches (Ma et al.).
CAR presents a well-motivated and conceptually clean framework for causal attribution of LLM agent failures. The intervention algebra, distributional treatment, and point-of-commitment rule are genuine contributions to the emerging agent-debugging literature. However, the paper reads more as a well-executed position paper with a proof-of-concept implementation than as a complete empirical contribution. The gap between the ambitious framing (production agent failures, prompt injections, data leaks) and the actual validation (two synthetic SCMs) is substantial. The work would be significantly strengthened by evaluation on real agent traces, comparison with baselines on established benchmarks, and scalability analysis.
The framework has the potential to become influential if extended with real-world validation, but in its current form, its impact is primarily conceptual and directional.
Generated Jun 9, 2026
Paper 1 has higher likely cross-field scientific impact: it introduces a generally applicable causal-intervention framework for attributing failures in LLM agents, a rapidly growing and broadly relevant area (AI safety, debugging, evaluation, reliable autonomy). The methodological core (SCM framing, do-operator replay, confound handling, Shapley credit with CIs) is comparatively rigorous and reusable across domains and agent architectures. Paper 2 appears highly impactful within manufacturing/materials, but its scope is narrower and validation seems centered on a specific testbed, making generalization and broad uptake less certain.
Paper 2 addresses a fundamental and widespread methodological concern—aperiodic spectral confounds—affecting deep learning across multiple physiological signal domains (EEG, ECG). Its findings impact a large community (clinical ML, neuroscience, cardiology) and propose a reusable audit framework that could become standard practice. The breadth of validation (six architectures, seven foundation models, two signal modalities) strengthens its generalizability. Paper 1 addresses the important but narrower problem of LLM-agent failure attribution with promising but primarily synthetic validation, limiting its demonstrated real-world impact at this stage.
Paper 2 is likely higher impact: it introduces a principled causal-intervention framework for attributing failures in LLM agents—an urgent, high-leverage problem for safety, reliability, and governance. The methodological contribution (SCM formalization, do-operator replay, contrastive estimator addressing stochastic confounding, Monte-Carlo Shapley with CIs) is broadly applicable across agent architectures and tool-use settings, with immediate real-world utility. Paper 1 is solid and efficient but is more incremental within neural field representation learning and likely narrower in downstream adoption compared to causal debugging for deployed LLM agents.
Paper 1 has higher likely impact due to a concrete, novel causal-intervention framework for diagnosing LLM-agent failures with confidence intervals, addressing an immediate, widely felt tooling gap in deployed agent systems. Its methodological contributions (SCM modeling, do-operator replay, contrastive estimator resolving stochastic confounding, Shapley credit assignment) are operationalizable and validated against ground-truth synthetic SCMs, and it is open-sourced—boosting adoption. Paper 2 is conceptually ambitious and potentially broad, but the abstract indicates more limited empirical grounding and unclear applicability beyond a constructed bandit setting, making near-term impact less certain.
Paper 1 addresses a critical computational bottleneck in RLVR training for LLMs—long-context rollout generation—with a principled framework (sparse-to-dense mismatch analysis) that achieves significant speedups (2-2.4x) across multiple model scales. Its practical impact on making RL-based LLM training more efficient is substantial given the field's trajectory. Paper 2 introduces an interesting causal attribution framework for LLM agent failures, but its validation is limited to synthetic settings, and the niche scope (debugging agent failures) limits breadth. Paper 1's methodological rigor, scalability evidence, and relevance to the booming RLVR paradigm give it higher impact potential.
Paper 1 introduces a highly novel, mathematically rigorous causal framework to solve a critical bottleneck in LLM agent deployment: failure attribution. By applying structural causal models and intervention algebra to agent trajectories, it significantly advances beyond current heuristic approaches. Paper 2 provides a valuable but more incremental extension of consistency training for alignment. Paper 1's strong methodological innovation and immediate applicability to the rapidly growing field of autonomous agents give it a higher potential for broad scientific and practical impact.
Paper 2 has higher likely impact due to timeliness and broad applicability: diagnosing and preventing LLM-agent failures is an urgent, rapidly growing need across industry and research. CAR introduces a clear intervention-based attribution framework (SCM + do-operations), addresses a specific confound in stochastic replay, provides uncertainty estimates, and offers scalable credit assignment via Monte-Carlo Shapley—methodologically well-scoped and easily deployable (open source, works with local/hosted models). Paper 1 is valuable but closer to incremental integration of discovery+time-varying inference and appears validated mainly via behavioral diagnostics on panel data, with narrower immediate adoption.
RREDCoT addresses a fundamental challenge in RL-based training of reasoning LLMs—credit assignment for chain-of-thought traces—which is central to the rapidly growing field of reasoning model development. Its method for reward redistribution during training has broad applicability to all GRPO-based training pipelines. Paper 1 (CAR) introduces a valuable causal debugging framework for LLM agents, but its impact is more niche (post-hoc failure analysis). Paper 2's contribution to improving training efficiency and effectiveness of reasoning models has higher potential to influence a larger body of follow-up research and practical deployments.
Paper 2 addresses a novel and timely problem—causal attribution of failures in LLM agents—which is increasingly critical as agentic AI systems are deployed. It introduces a principled causal inference framework (structural causal models, do-calculus, Shapley values) to a new domain, offering both theoretical rigor and practical utility. The breadth of impact is high, spanning AI safety, reliability engineering, and interpretability. Paper 1, while competent, applies existing mesh graph network architectures to a relatively incremental structural mechanics surrogate modeling task with limited novelty beyond the domain transfer.
Paper 1 (CAR) addresses a fundamental and underexplored problem—causal attribution of failures in LLM agents—with a rigorous, novel framework combining structural causal models, do-calculus, and Shapley values in the agent debugging context. This opens a new research direction at the intersection of causality and AI safety/reliability, with broad implications as LLM agents proliferate. Paper 2 proposes an incremental fine-tuning technique (weight interpolation with random initialization) that, while practically useful, is a more conventional optimization contribution with narrower conceptual novelty, building on well-known loss landscape smoothing ideas.