Why LLMs Fail at Causal Discovery and How Interventional Agents Escape
Amartya Roy, Sonali Parbhoo
Abstract
Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established. We prove the failure is fundamental: supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs generating similar observational data, and any attempt to do so requires the model's internal representations to grow unboundedly, violating the very conditions under which these methods work. We formalize this as a kernel obstruction theorem, establishing that the limitation is intrinsic to the learning paradigm, not any particular model or dataset. We propose Agentic Causal Bayesian Optimization (A-CBO), wherein a frozen language model serves as an interventional oracle answering targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds. Because the decision operates outside the space where the obstruction applies, A-CBO provably converges while the underlying model remains unchanged. On Corr2Cause, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark scaling to 24 variables with 18K test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, with the advantage growing
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper addresses a fundamental question: why do LLMs fail at causal discovery from correlational data? The answer comes in two parts. First, a kernel obstruction theorem (Theorem 1) proves that supervised fine-tuning (SFT), direct preference optimization (DPO), and in-context learning (ICL)—when operating as kernel-type predictors in the lazy/NTK regime—cannot distinguish between "near-miss" causal graphs whose observational signatures overlap. The achievable score margin decays as O(1/d) with graph complexity d. Second, the authors propose Agentic Causal Bayesian Optimization (A-CBO), which sidesteps this obstruction by using a frozen LLM only as an interventional oracle answering binary queries ("Does Vj change under do(Vi)?"), while an external Bayesian loop in the probability simplex Δ_{n-1} performs hypothesis discrimination. Additionally, the paper contributes Extended Corr2Cause, a new benchmark scaling to 24 variables with 18K test samples.
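The division of labor described above (a frozen oracle answering binary interventional queries, and an external Bayesian loop over candidate graphs) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names (descendants, make_oracle, run_acbo_sketch), the query-selection rule (pick the query whose predicted "yes" mass is closest to 1/2), and the simulated oracle standing in for the frozen LLM are all hypothetical. The sketch assumes that "Does Vj change under do(Vi)?" is answered "yes" exactly when Vj is a descendant of Vi in the true graph, and that the oracle errs independently with probability η.

```python
import random


def descendants(edges, n_vars):
    """Return reach[i] = set of nodes reachable from node i along directed edges."""
    reach = {i: set() for i in range(n_vars)}
    for i in range(n_vars):
        stack = [i]
        while stack:
            u = stack.pop()
            for a, b in edges:
                if a == u and b not in reach[i]:
                    reach[i].add(b)
                    stack.append(b)
    return reach


def make_oracle(true_edges, n_vars, eta, rng):
    """Simulated stand-in for the frozen LLM (an assumption of this sketch).

    Answers "does V_j change under do(V_i)?" and flips the correct
    answer with probability eta.
    """
    truth = descendants(true_edges, n_vars)

    def oracle(i, j):
        correct = j in truth[i]
        return (not correct) if rng.random() < eta else correct

    return oracle


def run_acbo_sketch(graphs, n_vars, oracle, eta, rounds):
    """Bayesian loop over candidate graphs driven by binary oracle answers."""
    reach = [descendants(g, n_vars) for g in graphs]
    posterior = [1.0 / len(graphs)] * len(graphs)  # uniform prior on the simplex
    queries = [(i, j) for i in range(n_vars) for j in range(n_vars) if i != j]

    for _ in range(rounds):
        # Posterior mass of graphs that predict "yes" for query (i, j).
        def yes_mass(q):
            return sum(p for p, r in zip(posterior, reach) if q[1] in r[q[0]])

        # Hypothetical selection rule: the most evenly splitting query.
        i, j = min(queries, key=lambda q: abs(yes_mass(q) - 0.5))
        answer = oracle(i, j)

        # Likelihood is 1 - eta if the graph agrees with the answer, else eta.
        weights = [
            p * ((1 - eta) if ((j in r[i]) == answer) else eta)
            for p, r in zip(posterior, reach)
        ]
        total = sum(weights)
        posterior = [w / total for w in weights]

    return posterior
```

Calling run_acbo_sketch with a list of candidate edge lists and an oracle built by make_oracle returns a posterior vector on the simplex. The language model's parameters play no role in the update, which is the structural point the review attributes to A-CBO.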
Methodological Rigor
The theoretical framework is mathematically clean. Theorem 1 follows from a straightforward Cauchy-Schwarz argument; the substantive content lies in establishing that the near-miss parameter δ = O(1/d²) vanishes as graph complexity grows (Lemma 1). The convergence guarantee for A-CBO (Theorem 2) provides a round complexity of O(log n / log((1-η)/η)), independent of δ—an elegant structural property.
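To get a feel for the stated round complexity, the expression log n / log((1-η)/η) can be evaluated directly. The snippet below is only an illustration: the function name acbo_round_scale is hypothetical, the constants hidden by the O(·) notation are ignored, and n is taken to be the number of candidate graphs.

```python
import math


def acbo_round_scale(n_graphs, eta):
    """Evaluate log(n) / log((1 - eta) / eta), ignoring constants hidden by O(.)."""
    if not 0.0 < eta < 0.5:
        raise ValueError("eta must lie in (0, 1/2) for the oracle to be informative")
    return math.log(n_graphs) / math.log((1.0 - eta) / eta)


# The scale grows as the oracle error rate eta approaches 1/2.
for eta in (0.05, 0.2, 0.4):
    print(eta, round(acbo_round_scale(1024, eta), 2))
```

The expression depends only on n and η, which matches the review's observation that the bound is independent of δ.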
However, several assumptions warrant scrutiny:
1. Lazy/NTK regime assumption: The entire obstruction relies on LLMs operating as kernel-type predictors. While this is justified for ICL and lightly fine-tuned models, heavily fine-tuned models (especially with LoRA or full fine-tuning) may leave the lazy regime. The paper acknowledges this but doesn't analyze the "rich regime" escape route beyond noting A-CBO offers "a cheaper escape."
2. Oracle reliability (Assumption 1): The assumption that the LLM answers interventional queries correctly with probability 1-η > 1/2 is strong. The paper's theoretical justification (Lemma 2: binary responses are kernel-separated) is necessary but potentially circular—it presumes the LLM can do local causal reasoning while proving it cannot do global causal reasoning. Why a model that fails at global discrimination should succeed at local interventional queries deserves deeper empirical validation.
3. Lemma 1's token decomposition: The claim that near-miss pairs differ in O(1) tokens while sharing O(d²) tokens assumes the distinguishing information is concentrated at the sequence's end. In practice, conditional independence statements encoding distinguishing structure may be scattered throughout the premise.
4. Experimental asymmetries: A-CBO with GLM-5.1★ (large model, thinking mode, up to 20×3×8 = 480 forward passes) is compared against RoBERTa-Large SFT (355M parameters). Computational cost comparisons are entirely absent. The Extended C2C benchmark uses all-negative labels, reducing the task to binary rejection—a simplification that may inflate apparent advantages.
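The oracle-reliability concern in item 2 can be made concrete with a small calculation. If each answer were wrong independently with probability η < 1/2, a majority vote over k repeated queries would be wrong with a binomial tail probability that shrinks as k grows. The sketch below computes that tail. It is an illustration only: the function name majority_error is hypothetical, the paper's own repetition scheme is not described in this review, and the independence assumption may well fail for an LLM whose errors are systematic rather than random.

```python
import math


def majority_error(eta, k):
    """Probability that a majority of k independent answers is wrong (k odd)."""
    if k % 2 == 0:
        raise ValueError("use an odd k so that ties cannot occur")
    # Sum the binomial probabilities of more than k/2 wrong answers.
    return sum(
        math.comb(k, m) * eta**m * (1 - eta) ** (k - m)
        for m in range(k // 2 + 1, k + 1)
    )
```

Under independence, majority_error(0.3, 3) is below 0.3 and the value keeps falling for larger odd k. If errors are correlated across repeats, voting gives no such guarantee, which is why empirical measurement of oracle fidelity matters.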
Potential Impact
The paper's strongest contribution is conceptual: it provides a principled framework for understanding why correlational-to-causal inference is fundamentally difficult for current training paradigms, and demonstrates that interventional querying is a theoretically grounded escape. This could influence:
However, practical impact may be limited by A-CBO's computational overhead and the synthetic nature of the evaluation. Real-world causal discovery involves noisy, incomplete textual evidence far removed from clean correlational premises.
Timeliness & Relevance
The paper is timely. Multiple concurrent works (Wu et al. 2025, Kadziolka & Salehkaleybar 2025, Sun et al. 2025b) address LLM causal reasoning from different angles, but none provide formal impossibility results. The kernel obstruction theorem offers the first theoretical explanation for widely observed empirical failures.
Strengths
Limitations
Overall Assessment
This is a theoretically motivated paper that provides genuine insight into a practically important problem. The kernel obstruction theorem, while dependent on the lazy regime assumption, offers the cleanest formal explanation to date for LLM failures at causal discovery. A-CBO is an elegant constructive response. The experimental evaluation, while somewhat asymmetric in model comparisons, demonstrates the framework's practical viability. The paper would be substantially strengthened by computational cost analysis, real-world evaluation, and empirical measurement of oracle fidelity across scales.
Generated May 28, 2026
Comparison History (26)
Paper 2 establishes a mathematically proven, fundamental limitation of current LLM training paradigms (SFT, DPO, ICL) for causal discovery, giving it profound theoretical significance. It bridges LLMs with Bayesian optimization to provide an innovative, provable solution. While Paper 1 addresses a critical and timely empirical issue in AI safety, Paper 2's theoretical rigor, fundamental insights into LLM capabilities, and broader implications for scientific reasoning across disciplines suggest a higher and more lasting scientific impact.
Paper 2 likely has higher scientific impact: it proposes a broadly applicable, decentralized multi-agent framework that demonstrates strong empirical gains across diverse, high-value domains (biomed ML, LLM training optimization, protein fitness) with clear real-world utility for accelerating computational experimentation. Its cross-domain benchmarks and sizable improvements suggest immediate adoption potential and wide spillover across fields. Paper 1 is conceptually novel with a compelling limitation theorem and a targeted workaround, but its demonstrated impact is narrower (causal discovery benchmarks) and may be less broadly deployable than an automated experimentation system.
Paper 1 makes a fundamental theoretical contribution by proving an impossibility result (kernel obstruction theorem) for LLMs performing causal discovery, explaining *why* they fail rather than just observing failure. It then proposes a principled solution (A-CBO) with provable convergence guarantees and validates it on new benchmarks. This combines novel theory, a new method, and a new benchmark, with broad implications across causal inference, AI reasoning, and scientific discovery. Paper 2 applies existing offline RL techniques to code generation—a useful engineering contribution but incremental in novelty and narrower in scope.
Paper 2 offers a fundamental theoretical proof (kernel obstruction theorem) explaining a core limitation of LLMs in causal discovery, which is a highly significant and debated topic in AI. It not only proves why current paradigms fail but also introduces a provably convergent agentic solution (A-CBO) and a large-scale benchmark. This combination of rigorous theoretical foundation and practical innovation provides broader and more foundational impact across AI and scientific reasoning than Paper 1, which, while important for AI ethics, relies on a relatively small dataset and less foundational methodology.
Paper 2 offers a fundamental theoretical contribution (kernel obstruction theorem) proving why LLMs inherently fail at causal discovery, which is a deep insight applicable across the field. It combines rigorous mathematical foundations with a practical solution (A-CBO) that provably overcomes the limitation. The work addresses a more foundational question in AI/ML—causal reasoning—with broader implications for scientific reasoning. Paper 1, while practically useful, represents an incremental engineering improvement in claim verification. Paper 2's combination of theoretical depth, novel formalism, and practical algorithm gives it significantly higher potential impact.
Paper 2 establishes a fundamental impossibility result (kernel obstruction theorem) showing why LLMs inherently cannot perform causal discovery from observational data, which is a deeper theoretical contribution with broad implications for AI and scientific reasoning. It then proposes a principled solution (A-CBO) that provably circumvents the limitation. While Paper 1 makes a solid practical contribution to hallucination detection with good theoretical backing, Paper 2's combination of a fundamental impossibility proof with a constructive workaround addresses a more foundational question about LLM capabilities, likely inspiring broader follow-up work across causal inference, AI safety, and scientific discovery.
Paper 2 offers a fundamental theoretical proof (kernel obstruction theorem) regarding LLM limitations in causal discovery, a core aspect of scientific reasoning, and provides a provably convergent solution. This deep theoretical contribution and its implications for AI and scientific discovery give it a broader and more profound impact compared to Paper 1, which primarily introduces a practical, yet more incremental, benchmarking framework for evaluating agent efficiency.
Paper 1 establishes a fundamental theoretical impossibility result (kernel obstruction theorem) explaining why LLMs inherently fail at causal discovery, which has broad implications across AI and causal inference. It also provides a provably convergent solution (A-CBO) and introduces a new benchmark. The combination of a deep theoretical contribution with practical algorithmic innovation and empirical validation across scales gives it exceptional breadth. Paper 2 makes a strong applied contribution to personalized medicine with clinical validation, but its impact is more domain-specific. Paper 1's theoretical foundations are likely to reshape how the field approaches LLM-based causal reasoning.
Paper 2 establishes a fundamental theoretical limitation (kernel obstruction theorem) regarding LLMs' ability to perform causal discovery, proving that standard paradigms intrinsically fail regardless of scale or data. Fundamental impossibility theorems combined with mathematically grounded solutions (A-CBO) typically yield broader, longer-lasting impact across machine learning and scientific reasoning than specific algorithmic improvements like the knowledge editing method proposed in Paper 1.
Paper 2 addresses a fundamental open problem in AI by mathematically proving the intrinsic limitations of LLMs in causal discovery and proposing a novel agentic solution. The intersection of LLMs and causal inference guarantees broad, cross-disciplinary impact. Furthermore, its theoretical rigor and introduction of a new benchmark make it highly foundational. In contrast, while Paper 1 provides a highly practical and much-needed standardization framework for Prognostics and Health Management (PHM), its scope is mostly limited to a specific applied engineering domain, resulting in lower overall scientific breadth and theoretical impact.
Paper 2 presents a massive-scale foundation model with extensive multi-cohort validation (1.5 million external ECGs) and clear, immediate clinical applications. Its ability to detect rare cardiovascular conditions using routine ECGs offers profound real-world public health impact. While Paper 1 provides important theoretical insights into LLM limitations for causal discovery, Paper 2's unprecedented scale, rigorous clinical validation, and direct potential to improve patient outcomes give it a broader and more immediate scientific and societal impact.
Paper 1 offers a more novel and general scientific contribution: a formal impossibility result (kernel obstruction theorem) explaining why standard LLM training paradigms fail at causal discovery, plus a provably convergent workaround (A-CBO) using interventions and Bayesian optimization. This combines theoretical rigor with a broadly relevant insight for ML, causality, and agentic AI, and introduces a scalable benchmark. Paper 2 is valuable as a systems/design contribution for finance decision support, but is more domain-specific, less theoretically grounded, and likely to have narrower cross-field impact.
Paper 2 provides a fundamental theoretical contribution (kernel obstruction theorem) proving why LLMs inherently fail at causal discovery, which is a deep insight applicable across the field. It combines rigorous mathematical formalization with a principled solution (A-CBO) that provably converges. The theoretical nature of the contribution—showing an intrinsic limitation of the learning paradigm rather than specific models—has broader implications for understanding LLM capabilities and limitations. Paper 1, while solid engineering work introducing a benchmark and agentic framework for audio-visual reasoning, is more incremental and application-specific.
Paper 2 addresses a fundamental, theoretical limitation of LLMs in causal discovery by providing a mathematical proof of failure and proposing a novel, domain-agnostic solution. Its contributions to AI theory, causal inference, and scientific reasoning offer a significantly broader and more profound scientific impact compared to Paper 1, which focuses on a narrower, applied system for industrial automation.
Paper 2 offers a fundamental theoretical proof (kernel obstruction theorem) explaining why LLMs inherently fail at causal discovery, moving beyond empirical observations to establish intrinsic limitations of current learning paradigms. Furthermore, it introduces a novel, scalable solution (A-CBO) that bypasses these limitations without retraining. While Paper 1 addresses an important, timely vulnerability in RLHF, Paper 2's combination of rigorous theoretical foundations, broad implications for causal reasoning in AI, and a mathematically proven algorithmic intervention gives it higher potential for long-term scientific impact.
Paper 1 has higher potential impact due to a strong theoretical contribution (a paradigm-level “kernel obstruction” explaining why common LLM training/inference schemes cannot do causal discovery from observational data) plus a provably convergent workaround (agentic interventional loop) that reframes how LLMs can be used for scientific reasoning. This is timely given current LLM evaluation debates and could influence benchmarking, theory, and agent design across ML, causal inference, and scientific automation. Paper 2 is promising and broadly applicable, but appears more incremental relative to existing planning/control paradigms and is more empirical/engineering-driven.
Paper 2 provides a fundamental theoretical contribution (kernel obstruction theorem) proving why LLMs inherently fail at causal discovery, which is a deep insight applicable across the field. It then proposes A-CBO, a principled solution with provable convergence guarantees. This combination of rigorous negative results with a constructive remedy has high impact potential. Paper 1, while practically useful as an evaluation framework with impressive scale (400K rollouts), is primarily an engineering contribution that standardizes benchmarking—important but incremental. Paper 2's theoretical foundations and novel paradigm for causal reasoning are more likely to reshape research directions.
Paper 1 tackles a foundational limitation of LLMs in causal discovery, providing theoretical impossibility proofs and a novel agentic solution. Its findings have broad, cross-disciplinary implications for AI and scientific reasoning. In contrast, Paper 2 focuses on a highly specific application (dental CT denoising) using existing plug-and-play methodologies, making its potential impact and novelty significantly narrower.
Paper 2 has higher potential impact because it establishes a fundamental impossibility result (kernel obstruction theorem) proving that standard LLM training paradigms cannot solve causal discovery, which is a deep theoretical contribution with broad implications across AI, causality, and scientific reasoning. It also provides a principled solution (A-CBO) that provably overcomes this limitation. The combination of a fundamental negative result with a constructive workaround has historically driven paradigm shifts. Paper 1, while solid, represents an incremental improvement to RL credit assignment for LLM reasoning—an important but more narrow contribution within an already active optimization space.
Paper 2 offers a mathematically rigorous proof of a fundamental limitation in current LLM training paradigms regarding causal discovery and provides an empirically validated, scalable solution. Its contribution to both theoretical understanding and practical AI methodology gives it higher potential for broad, measurable scientific impact across disciplines using AI for scientific reasoning, compared to Paper 1's primarily conceptual and philosophical contribution to AI ethics.