Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

Amartya Roy, Sonali Parbhoo

#38 of 2682 · Artificial Intelligence
Tournament Score: 1581±47 · Win Rate: 92% (24 wins, 2 losses, 26 matches)
Rating: 7/10

Abstract

Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established. We prove the failure is fundamental: supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs generating similar observational data, and any attempt to do so requires the model's internal representations to grow unboundedly, violating the very conditions under which these methods work. We formalize this as a kernel obstruction theorem, establishing that the limitation is intrinsic to the learning paradigm, not any particular model or dataset. We propose Agentic Causal Bayesian Optimization (A-CBO), wherein a frozen language model serves as an interventional oracle answering targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds. Because the decision operates outside the space where the obstruction applies, A-CBO provably converges while the underlying model remains unchanged. On Corr2Cause, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark scaling to 24 variables with 18K test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, with the advantage growing

AI Impact Assessments

(1 models)

Scientific Impact Assessment

Core Contribution

This paper addresses a fundamental question: why do LLMs fail at causal discovery from correlational data? The answer comes in two parts. First, a kernel obstruction theorem (Theorem 1) proves that supervised fine-tuning (SFT), direct preference optimization (DPO), and in-context learning (ICL)—when operating as kernel-type predictors in the lazy/NTK regime—cannot distinguish between "near-miss" causal graphs whose observational signatures overlap. The achievable score margin decays as O(1/d) with graph complexity d. Second, the authors propose Agentic Causal Bayesian Optimization (A-CBO), which sidesteps this obstruction by using a frozen LLM only as an interventional oracle answering binary queries ("Does Vj change under do(Vi)?"), while an external Bayesian loop in the probability simplex Δ_{n-1} performs hypothesis discrimination. Additionally, the paper contributes Extended Corr2Cause, a new benchmark scaling to 24 variables with 18K test samples.
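
The division of labor described above (a noisy yes/no oracle, with the decision made by an external Bayesian loop) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration rather than the paper's implementation: candidate graphs are sets of directed edges, "Vj changes under do(Vi)" is read as "j is reachable from i", the query rule picks the pair that splits posterior mass most evenly, and the names `predicts_change`, `bayes_update`, `pick_query`, and `run_loop` are hypothetical. The `oracle` callable stands in for the frozen LLM.

```python
def predicts_change(graph, i, j):
    """True if candidate `graph` (a set of directed edges) implies that V_j
    responds to do(V_i), read here as: j is reachable from i (an assumption)."""
    stack, seen = [i], {i}
    while stack:
        u = stack.pop()
        for (a, b) in graph:
            if a == u and b not in seen:
                if b == j:
                    return True
                seen.add(b)
                stack.append(b)
    return False


def bayes_update(belief, candidates, query, answer, eta):
    """Reweight each candidate by the likelihood of the oracle's answer:
    1 - eta if the candidate agrees with the answer, eta otherwise."""
    i, j = query
    weights = [
        p * ((1.0 - eta) if predicts_change(g, i, j) == answer else eta)
        for p, g in zip(belief, candidates)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def pick_query(belief, candidates, variables):
    """Hypothetical acquisition rule: choose the (i, j) whose posterior
    mass on the answer 'yes' is closest to 1/2."""
    best, best_gap = None, float("inf")
    for i in variables:
        for j in variables:
            if i == j:
                continue
            yes_mass = sum(
                p for p, g in zip(belief, candidates) if predicts_change(g, i, j)
            )
            gap = abs(yes_mass - 0.5)
            if gap < best_gap:
                best, best_gap = (i, j), gap
    return best


def run_loop(candidates, variables, oracle, eta, threshold=0.95, max_rounds=50):
    """Query the oracle until one candidate holds `threshold` posterior mass.
    `threshold` and `max_rounds` are illustrative defaults."""
    belief = [1.0 / len(candidates)] * len(candidates)
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        query = pick_query(belief, candidates, variables)
        belief = bayes_update(belief, candidates, query, oracle(query), eta)
        if max(belief) >= threshold:
            break
    return belief, rounds


if __name__ == "__main__":
    # Toy setup: four candidate graphs on three variables; the chain is true.
    truth = {(0, 1), (1, 2)}
    cands = [truth, {(1, 0), (1, 2)}, {(0, 1), (2, 1)}, {(2, 1), (1, 0)}]
    final_belief, used = run_loop(
        cands, [0, 1, 2], lambda q: predicts_change(truth, *q), eta=0.1
    )
    print(used, [round(p, 3) for p in final_belief])
```

The point of the sketch is structural: the language model is only ever asked local binary questions, and all discrimination between candidates happens in the belief vector, which lives in the probability simplex rather than in the model's representation space.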

Methodological Rigor

The theoretical framework is mathematically clean. Theorem 1 follows from a straightforward Cauchy-Schwarz argument; the substantive content lies in establishing that the near-miss parameter δ = O(1/d²) vanishes as graph complexity grows (Lemma 1). The convergence guarantee for A-CBO (Theorem 2) provides a round complexity of O(log n / log((1-η)/η)), independent of δ—an elegant structural property.
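
For readers wondering how a Cauchy-Schwarz step connects δ = O(1/d²) to an O(1/d) margin, one generic form such an argument can take is sketched below. This is a reconstruction under assumptions, not the paper's stated proof: the predictor is taken to be linear in a fixed feature map, f(x) = ⟨w, φ(x)⟩ (the lazy-regime picture), x and x' are the inputs for a near-miss pair, and δ is assumed to upper-bound the squared feature distance. The symbols f, w, φ, x, x' are illustrative.

```latex
\begin{align*}
|f(x) - f(x')|
  &= \bigl|\langle w,\; \phi(x) - \phi(x') \rangle\bigr| \\
  &\le \|w\| \,\|\phi(x) - \phi(x')\| \\
  &\le \|w\| \sqrt{\delta}.
\end{align*}
```

Under this reading, δ = O(1/d²) gives a margin of at most O(‖w‖/d), which matches the O(1/d) decay for bounded ‖w‖. Holding the margin fixed would instead force ‖w‖ to grow with d, consistent with the abstract's remark that discrimination requires representations to grow unboundedly.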

However, several assumptions warrant scrutiny:

1. Lazy/NTK regime assumption: The entire obstruction relies on LLMs operating as kernel-type predictors. While this is justified for ICL and lightly fine-tuned models, heavily fine-tuned models (especially with LoRA or full fine-tuning) may leave the lazy regime. The paper acknowledges this but doesn't analyze the "rich regime" escape route beyond noting A-CBO offers "a cheaper escape."

2. Oracle reliability (Assumption 1): The assumption that the LLM answers interventional queries correctly with probability 1-η > 1/2 is strong. The paper's theoretical justification (Lemma 2: binary responses are kernel-separated) is necessary but potentially circular—it presumes the LLM can do local causal reasoning while proving it cannot do global causal reasoning. Why a model that fails at global discrimination should succeed at local interventional queries deserves deeper empirical validation.

3. Lemma 1's token decomposition: The claim that near-miss pairs differ in O(1) tokens while sharing O(d²) tokens assumes the distinguishing information is concentrated at the sequence's end. In practice, conditional independence statements encoding distinguishing structure may be scattered throughout the premise.

4. Experimental asymmetries: A-CBO with GLM-5.1★ (large model, thinking mode, up to 20×3×8 = 480 forward passes) is compared against RoBERTa-Large SFT (355M parameters). Computational cost comparisons are entirely absent. The Extended C2C benchmark uses all-negative labels, reducing the task to binary rejection—a simplification that may inflate apparent advantages.
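
To make the stakes of Assumption 1 concrete, the quantity inside the O(·) of Theorem 2 can be evaluated for a few oracle error rates. This is a rough sketch: the constant hidden by the O(·) is not given in the review, so the numbers are a scale rather than a round count, and the function name `round_scale` and the chosen values of n and η are illustrative.

```python
import math


def round_scale(n_candidates: int, eta: float) -> float:
    """Evaluate log(n) / log((1 - eta) / eta), the expression inside the O(.)
    of Theorem 2 as quoted in the review. Not a round count: the hidden
    constant is unknown."""
    if n_candidates < 2:
        raise ValueError("need at least two candidate graphs")
    if not 0.0 < eta < 0.5:
        raise ValueError("eta must lie in (0, 0.5) for the oracle to beat chance")
    return math.log(n_candidates) / math.log((1.0 - eta) / eta)


if __name__ == "__main__":
    n = 1024  # hypothetical number of candidate graphs
    for eta in (0.05, 0.1, 0.2, 0.3, 0.4, 0.45):
        print(f"eta={eta:.2f}  scale={round_scale(n, eta):.2f}")
```

The scale grows slowly in n but blows up as η approaches 1/2, which is why the oracle-reliability assumption in item 2 carries so much of the weight of the convergence guarantee.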

Potential Impact

The paper's strongest contribution is conceptual: it provides a principled framework for understanding why correlational-to-causal inference is fundamentally difficult for current training paradigms, and demonstrates that interventional querying is a theoretically grounded escape. This could influence:

  • System design: Encouraging the community to build LLM-based causal systems where the LLM serves as an oracle rather than a judge, aligning with Wu et al. (2025)'s non-decisional role prescription.
  • Benchmark development: Extended Corr2Cause fills a gap in evaluating scalability of causal reasoning methods.
  • Theoretical foundations: The kernel obstruction framework could generalize to other domains where near-miss discrimination is required (e.g., legal reasoning, differential diagnosis).

However, practical impact may be limited by A-CBO's computational overhead and the synthetic nature of the evaluation. Real-world causal discovery involves noisy, incomplete textual evidence far removed from clean correlational premises.

Timeliness & Relevance

The paper is timely. Multiple concurrent works (Wu et al. 2025, Kadziolka & Salehkaleybar 2025, Sun et al. 2025b) address LLM causal reasoning from different angles, but none provide formal impossibility results. The kernel obstruction theorem offers the first theoretical explanation for widely observed empirical failures.

Strengths

  • Novel theoretical contribution: First formal proof that dominant LLM training paradigms face geometric impossibility for causal discrimination, not merely empirical difficulty.
  • Constructive escape: A-CBO is derived from the theory, not ad hoc—the Bayesian loop in Δ_{n-1} directly addresses the identified obstruction.
  • Strong ablation: Table 4 convincingly shows the loop, not model capability, drives performance (LLaMA-7B: 26.8→72.6 F1).
  • New benchmark: Extended Corr2Cause provides a structured stress test for causal reasoning at scale.

Limitations

  • NTK regime applicability: The lazy regime assumption is the theory's Achilles' heel; modern fine-tuning often escapes it.
  • Synthetic evaluation only: No real-world causal discovery tasks are tested.
  • Missing cost analysis: Inference cost of A-CBO's iterative querying vs. single-pass baselines is not reported.
  • Interventional query fidelity: No systematic measurement of actual oracle accuracy η across models and graph sizes.
  • All-negative Extended C2C: The evaluation task is artificially simplified; mixed-label evaluation would be more informative.
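
The oracle-fidelity gap listed above could in principle be closed with a measurement along the following lines: pose interventional queries whose answers are known from a ground-truth graph, count the oracle's errors, and report η with an interval. This is a sketch under assumptions. The Wilson score interval is a standard choice for a binomial proportion, not something the paper is reported to use, and the function name and the counts in the demo are hypothetical.

```python
import math


def wilson_interval(errors: int, trials: int, z: float = 1.96):
    """Wilson score interval for the oracle error rate eta = errors / trials.
    z = 1.96 corresponds to roughly 95% coverage."""
    if trials <= 0 or not 0 <= errors <= trials:
        raise ValueError("need trials > 0 and 0 <= errors <= trials")
    p = errors / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Hypothetical counts: 30 wrong answers out of 200 ground-truth queries.
    low, high = wilson_interval(30, 200)
    print(f"eta in [{low:.3f}, {high:.3f}]")
    # Assumption 1 needs eta < 1/2, so the upper end is the relevant check.
    print("consistent with Assumption 1:", high < 0.5)
```

Repeating such a measurement per model and per graph size would show whether η stays bounded away from 1/2 as the number of variables grows.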

Overall Assessment

This is a theoretically motivated paper that provides genuine insight into a practically important problem. The kernel obstruction theorem, while dependent on the lazy regime assumption, offers the cleanest formal explanation to date for LLM failures at causal discovery. A-CBO is an elegant constructive response. The experimental evaluation, while somewhat asymmetric in model comparisons, demonstrates the framework's practical viability. The paper would be substantially strengthened by computational cost analysis, real-world evaluation, and empirical measurement of oracle fidelity across scales.

Rating: 7/10
Significance 7.5 · Rigor 6.5 · Novelty 7.5 · Clarity 7.5

Generated May 28, 2026

Comparison History (26)

    vs. When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models
    gemini-3.1 · 5/28/2026

    Paper 2 establishes a mathematically proven, fundamental limitation of current LLM training paradigms (SFT, DPO, ICL) for causal discovery, giving it profound theoretical significance. It bridges LLMs with Bayesian optimization to provide an innovative, provable solution. While Paper 1 addresses a critical and timely empirical issue in AI safety, Paper 2's theoretical rigor, fundamental insights into LLM capabilities, and broader implications for scientific reasoning across disciplines suggest a higher and more lasting scientific impact.

    vs. AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation
    gpt-5.2 · 5/28/2026

    Paper 2 likely has higher scientific impact: it proposes a broadly applicable, decentralized multi-agent framework that demonstrates strong empirical gains across diverse, high-value domains (biomed ML, LLM training optimization, protein fitness) with clear real-world utility for accelerating computational experimentation. Its cross-domain benchmarks and sizable improvements suggest immediate adoption potential and wide spillover across fields. Paper 1 is conceptually novel with a compelling limitation theorem and a targeted workaround, but its demonstrated impact is narrower (causal discovery benchmarks) and may be less broadly deployable than an automated experimentation system.

    vs. Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning
    claude-opus-4.6 · 5/28/2026

    Paper 1 makes a fundamental theoretical contribution by proving an impossibility result (kernel obstruction theorem) for LLMs performing causal discovery, explaining *why* they fail rather than just observing failure. It then proposes a principled solution (A-CBO) with provable convergence guarantees and validates it on new benchmarks. This combines novel theory, a new method, and a new benchmark, with broad implications across causal inference, AI reasoning, and scientific discovery. Paper 2 applies existing offline RL techniques to code generation—a useful engineering contribution but incremental in novelty and narrower in scope.

    vs. Beyond Binary Moral Judgment: Modeling Ethical Pluralism in AI
    gemini-3.1 · 5/28/2026

    Paper 2 offers a fundamental theoretical proof (kernel obstruction theorem) explaining a core limitation of LLMs in causal discovery, which is a highly significant and debated topic in AI. It not only proves why current paradigms fail but also introduces a provably convergent agentic solution (A-CBO) and a large-scale benchmark. This combination of rigorous theoretical foundation and practical innovation provides broader and more foundational impact across AI and scientific reasoning than Paper 1, which, while important for AI ethics, relies on a relatively small dataset and less foundational methodology.

    vs. DeepSciVerify: Verifying Scientific Claim--Citation Alignment via LLM-Driven Evidence Escalation
    claude-opus-4.6 · 5/28/2026

    Paper 2 offers a fundamental theoretical contribution (kernel obstruction theorem) proving why LLMs inherently fail at causal discovery, which is a deep insight applicable across the field. It combines rigorous mathematical foundations with a practical solution (A-CBO) that provably overcomes the limitation. The work addresses a more foundational question in AI/ML—causal reasoning—with broader implications for scientific reasoning. Paper 1, while practically useful, represents an incremental engineering improvement in claim verification. Paper 2's combination of theoretical depth, novel formalism, and practical algorithm gives it significantly higher potential impact.

    vs. Entropy Distribution as a Fingerprint for Hallucinations in Generative Models
    claude-opus-4.6 · 5/28/2026

    Paper 2 establishes a fundamental impossibility result (kernel obstruction theorem) showing why LLMs inherently cannot perform causal discovery from observational data, which is a deeper theoretical contribution with broad implications for AI and scientific reasoning. It then proposes a principled solution (A-CBO) that provably circumvents the limitation. While Paper 1 makes a solid practical contribution to hallucination detection with good theoretical backing, Paper 2's combination of a fundamental impossibility proof with a constructive workaround addresses a more foundational question about LLM capabilities, likely inspiring broader follow-up work across causal inference, AI safety, and scientific discovery.

    vs. AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios
    gemini-3.1 · 5/28/2026

    Paper 2 offers a fundamental theoretical proof (kernel obstruction theorem) regarding LLM limitations in causal discovery, a core aspect of scientific reasoning, and provides a provably convergent solution. This deep theoretical contribution and its implications for AI and scientific discovery give it a broader and more profound impact compared to Paper 1, which primarily introduces a practical, yet more incremental, benchmarking framework for evaluating agent efficiency.

    vs. Resolving the bias-precision paradox with stochastic causal representation learning for personalized medicine
    claude-opus-4.6 · 5/28/2026

    Paper 1 establishes a fundamental theoretical impossibility result (kernel obstruction theorem) explaining why LLMs inherently fail at causal discovery, which has broad implications across AI and causal inference. It also provides a provably convergent solution (A-CBO) and introduces a new benchmark. The combination of a deep theoretical contribution with practical algorithmic innovation and empirical validation across scales gives it exceptional breadth. Paper 2 makes a strong applied contribution to personalized medicine with clinical validation, but its impact is more domain-specific. Paper 1's theoretical foundations are likely to reshape how the field approaches LLM-based causal reasoning.

    vs. From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation
    gemini-3.1 · 5/28/2026

    Paper 2 establishes a fundamental theoretical limitation (kernel obstruction theorem) regarding LLMs' ability to perform causal discovery, proving that standard paradigms intrinsically fail regardless of scale or data. Fundamental impossibility theorems combined with mathematically grounded solutions (A-CBO) typically yield broader, longer-lasting impact across machine learning and scientific reasoning than specific algorithmic improvements like the knowledge editing method proposed in Paper 1.

    vs. Picid: A Modular Evaluation Infrastructure for Reproducible PHM Across Tasks and Domains
    gemini-3.1 · 5/28/2026

    Paper 2 addresses a fundamental open problem in AI by mathematically proving the intrinsic limitations of LLMs in causal discovery and proposing a novel agentic solution. The intersection of LLMs and causal inference guarantees broad, cross-disciplinary impact. Furthermore, its theoretical rigor and introduction of a new benchmark make it highly foundational. In contrast, while Paper 1 provides a highly practical and much-needed standardization framework for Prognostics and Health Management (PHM), its scope is mostly limited to a specific applied engineering domain, resulting in lower overall scientific breadth and theoretical impact.

    vs. A Signal-Language Foundation Model for Broad-Spectrum Cardiovascular Assessment from Routine Electrocardiography
    gemini-3.1 · 5/28/2026

    Paper 2 presents a massive-scale foundation model with extensive multi-cohort validation (1.5 million external ECGs) and clear, immediate clinical applications. Its ability to detect rare cardiovascular conditions using routine ECGs offers profound real-world public health impact. While Paper 1 provides important theoretical insights into LLM limitations for causal discovery, Paper 2's unprecedented scale, rigorous clinical validation, and direct potential to improve patient outcomes give it a broader and more immediate scientific and societal impact.

    vs. FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research
    gpt-5.2 · 5/28/2026

    Paper 1 offers a more novel and general scientific contribution: a formal impossibility result (kernel obstruction theorem) explaining why standard LLM training paradigms fail at causal discovery, plus a provably convergent workaround (A-CBO) using interventions and Bayesian optimization. This combines theoretical rigor with a broadly relevant insight for ML, causality, and agentic AI, and introduces a scalable benchmark. Paper 2 is valuable as a systems/design contribution for finance decision support, but is more domain-specific, less theoretically grounded, and likely to have narrower cross-field impact.

    vs. Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning
    claude-opus-4.6 · 5/28/2026

    Paper 2 provides a fundamental theoretical contribution (kernel obstruction theorem) proving why LLMs inherently fail at causal discovery, which is a deep insight applicable across the field. It combines rigorous mathematical formalization with a principled solution (A-CBO) that provably converges. The theoretical nature of the contribution—showing an intrinsic limitation of the learning paradigm rather than specific models—has broader implications for understanding LLM capabilities and limitations. Paper 1, while solid engineering work introducing a benchmark and agentic framework for audio-visual reasoning, is more incremental and application-specific.

    vs. An LLM-Based Assistance System for Intuitive and Flexible Capability-Based Planning
    gemini-3.1 · 5/28/2026

    Paper 2 addresses a fundamental, theoretical limitation of LLMs in causal discovery by providing a mathematical proof of failure and proposing a novel, domain-agnostic solution. Its contributions to AI theory, causal inference, and scientific reasoning offer a significantly broader and more profound scientific impact compared to Paper 1, which focuses on a narrower, applied system for industrial automation.

    vs. Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
    gemini-3.1 · 5/28/2026

    Paper 2 offers a fundamental theoretical proof (kernel obstruction theorem) explaining why LLMs inherently fail at causal discovery, moving beyond empirical observations to establish intrinsic limitations of current learning paradigms. Furthermore, it introduces a novel, scalable solution (A-CBO) that bypasses these limitations without retraining. While Paper 1 addresses an important, timely vulnerability in RLHF, Paper 2's combination of rigorous theoretical foundations, broad implications for causal reasoning in AI, and a mathematically proven algorithmic intervention gives it higher potential for long-term scientific impact.

    vs. Neuro-Inspired Inverse Learning for Planning and Control
    gpt-5.2 · 5/28/2026

    Paper 1 has higher potential impact due to a strong theoretical contribution (a paradigm-level “kernel obstruction” explaining why common LLM training/inference schemes cannot do causal discovery from observational data) plus a provably convergent workaround (agentic interventional loop) that reframes how LLMs can be used for scientific reasoning. This is timely given current LLM evaluation debates and could influence benchmarking, theory, and agent design across ML, causal inference, and scientific automation. Paper 2 is promising and broadly applicable, but appears more incremental relative to existing planning/control paradigms and is more empirical/engineering-driven.

    vs. A Unified Framework for the Evaluation of LLM Agentic Capabilities
    claude-opus-4.6 · 5/28/2026

    Paper 2 provides a fundamental theoretical contribution (kernel obstruction theorem) proving why LLMs inherently fail at causal discovery, which is a deep insight applicable across the field. It then proposes A-CBO, a principled solution with provable convergence guarantees. This combination of rigorous negative results with a constructive remedy has high impact potential. Paper 1, while practically useful as an evaluation framework with impressive scale (400K rollouts), is primarily an engineering contribution that standardizes benchmarking—important but incremental. Paper 2's theoretical foundations and novel paradigm for causal reasoning are more likely to reshape research directions.

    vs. Gradient Step Plug-and-Play Model for Dental Cone-Beam CT Reconstruction
    gemini-3.1 · 5/28/2026

    Paper 1 tackles a foundational limitation of LLMs in causal discovery, providing theoretical impossibility proofs and a novel agentic solution. Its findings have broad, cross-disciplinary implications for AI and scientific reasoning. In contrast, Paper 2 focuses on a highly specific application (dental CT denoising) using existing plug-and-play methodologies, making its potential impact and novelty significantly narrower.

    vs. Credit Assignment with Resets in Language Model Reasoning
    claude-opus-4.6 · 5/28/2026

    Paper 2 has higher potential impact because it establishes a fundamental impossibility result (kernel obstruction theorem) proving that standard LLM training paradigms cannot solve causal discovery, which is a deep theoretical contribution with broad implications across AI, causality, and scientific reasoning. It also provides a principled solution (A-CBO) that provably overcomes this limitation. The combination of a fundamental negative result with a constructive workaround has historically driven paradigm shifts. Paper 1, while solid, represents an incremental improvement to RL credit assignment for LLM reasoning—an important but more narrow contribution within an already active optimization space.

    vs. The Illusion of Opting in AI-Mediated Consequential Decisions
    gemini-3.1 · 5/28/2026

    Paper 2 offers a mathematically rigorous proof of a fundamental limitation in current LLM training paradigms regarding causal discovery and provides an empirically validated, scalable solution. Its contribution to both theoretical understanding and practical AI methodology gives it higher potential for broad, measurable scientific impact across disciplines using AI for scientific reasoning, compared to Paper 1's primarily conceptual and philosophical contribution to AI ethics.