CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators

Nicolás Astorga, Anita Kriz, Mihaela van der Schaar

May 9, 2026

arXiv:2605.09079v1

cs.AI (primary)

#66 of 2292 · Artificial Intelligence

Tournament Score

1558±44

95%

Win Rate

Wins

Losses

Matches

Rating

6.5/10

Significance 7

Rigor 6.5

Novelty 7

Clarity 7.5

Abstract

Despite surpassing human performance across mathematics, coding, and other knowledge-intensive tasks, large language models (LLMs) continue to struggle with causal reasoning. A core obstacle is the target data itself: causal systems are complex and often expressed in non-executable forms, while ground-truth answers to causal queries are inherently scarce. We introduce CauSim, a framework that turns causal reasoning from a scarce-label problem into a scalable supervised one. CauSim constructs increasingly complex causal simulators: executable structural causal models (SCMs), incrementally built by LLMs, that scale to globally complex systems while maintaining verifiable answers to causal queries. CauSim operates across representations by formalizing non-executable causal knowledge into code, enabling data augmentation, and translating executable SCMs into natural language, enabling supervision in previously difficult-to-supervise representations. We structure our research into two parts: (1) how to construct increasingly complex causal simulators, and (2) a systematic study of what CauSim enables, demonstrating generalization across representations, consistent gains from curriculum scaling and data volume, LLM self-improvement through self-generated simulators, and data augmentation via formalization of existing domain knowledge.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators

1. Core Contribution

CauSim addresses a fundamental bottleneck in training LLMs for causal reasoning: the scarcity of ground-truth supervision for interventional and counterfactual queries. The key insight is to construct executable structural causal models (SCMs) as Python programs that can be incrementally grown by LLMs, enabling automatic generation of verified answers to any causal query type (deduction, abduction, intervention, counterfactual). The framework operates bidirectionally across representations—formalizing non-executable causal knowledge into code and informalizing executable SCMs into natural language—thereby bridging the gap between executable supervision and the non-executable forms in which causal knowledge typically exists.
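To make the four query types concrete, here is a minimal sketch of a deterministic, discrete SCM written as a Python program. It is not the paper's implementation: the `SCM` class, the `abduce` and `counterfactual` helpers, and the rain/sprinkler toy system are all hypothetical names chosen for illustration. Abduction is done by brute-force enumeration of exogenous settings, which only works for small discrete domains.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Set

# A mechanism reads the values computed so far and returns this node's value.
Mechanism = Callable[[Dict[str, int]], int]


class SCM:
    """Deterministic, discrete SCM (hypothetical sketch, not the paper's code)."""

    def __init__(self, order: List[str], mechanisms: Dict[str, Mechanism]):
        self.order = order            # endogenous nodes in topological order
        self.mechanisms = mechanisms  # one function per endogenous node

    def run(self, exo: Dict[str, int],
            do: Optional[Dict[str, int]] = None) -> Dict[str, int]:
        """Deduction when `do` is None; intervention when `do` pins nodes."""
        do = do or {}
        values = dict(exo)
        for name in self.order:
            values[name] = do[name] if name in do else self.mechanisms[name](values)
        return values


def abduce(scm: SCM, exo_domains: Dict[str, List[int]],
           observed: Dict[str, int]) -> List[Dict[str, int]]:
    """Abduction: every exogenous setting consistent with the observation."""
    names = list(exo_domains)
    settings = (dict(zip(names, combo)) for combo in product(*exo_domains.values()))
    return [exo for exo in settings
            if all(scm.run(exo)[k] == v for k, v in observed.items())]


def counterfactual(scm: SCM, exo_domains: Dict[str, List[int]],
                   observed: Dict[str, int], do: Dict[str, int],
                   target: str) -> Set[int]:
    """Abduct, intervene, predict. More than one element means the
    observation does not pin the answer down."""
    return {scm.run(exo, do)[target] for exo in abduce(scm, exo_domains, observed)}


# Toy system (assumed for illustration only).
toy = SCM(
    order=["rain", "sprinkler", "wet"],
    mechanisms={
        "rain": lambda v: v["u_rain"],
        "sprinkler": lambda v: int(v["u_spr"] and not v["rain"]),
        "wet": lambda v: int(v["rain"] or v["sprinkler"]),
    },
)
EXO = {"u_rain": [0, 1], "u_spr": [0, 1]}

deduction = toy.run({"u_rain": 1, "u_spr": 1})["wet"]
intervention = toy.run({"u_rain": 0, "u_spr": 0}, do={"sprinkler": 1})["wet"]
abduction = abduce(toy, EXO, {"wet": 0})
cf = counterfactual(toy, EXO, {"wet": 0}, {"rain": 1}, "wet")
```

Because every answer comes from executing the program, each query has a checkable label. Returning a set from `counterfactual` is a design choice of this sketch: it makes underdetermined counterfactuals visible instead of silently picking one answer.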

The incremental construction approach is the central technical innovation: rather than generating complex SCMs in one shot (which degrades rapidly with scale), CauSim grows SCMs node-by-node through a plan-execute-verify loop. The planner operates on a semantic view of the current SCM, while the executor sees only localized context (parent docstrings and child code), maintaining both global coherence and local executability.
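The loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's pipeline: `grow_scm`, `stub_plan`, `stub_execute`, and `verify` are hypothetical names, the planner and executor are deterministic stand-ins for LLM calls, and the executor's context is simplified to parent docstrings only. The verifier here merely checks that the generated code defines the node and returns an integer; real generated code should run in a sandbox rather than through a bare `exec`.

```python
from typing import Callable, Dict, List

Node = Dict[str, object]  # keys: "name", "parents", "doc", and later "code"


def verify(code: str, spec: Node) -> bool:
    """Run the candidate mechanism on dummy parent values (0) and check its type."""
    namespace: Dict[str, object] = {}
    try:
        exec(code, namespace)  # assumption: trusted code; sandbox in practice
        result = namespace[spec["name"]](**{p: 0 for p in spec["parents"]})
    except Exception:
        return False
    return isinstance(result, int)


def grow_scm(n_nodes: int,
             plan: Callable[[List[Node]], Node],
             execute: Callable[[Node, Dict[str, str]], str],
             check: Callable[[str, Node], bool],
             max_retries: int = 3) -> List[Node]:
    """Grow an SCM one node at a time with a plan-execute-verify loop."""
    nodes: List[Node] = []
    while len(nodes) < n_nodes:
        # Planner sees a semantic view: names, parents, docstrings, no code.
        view = [{k: n[k] for k in ("name", "parents", "doc")} for n in nodes]
        spec = plan(view)
        docs = {n["name"]: n["doc"] for n in nodes}
        parent_docs = {p: docs[p] for p in spec["parents"]}
        for _ in range(max_retries):
            # Executor sees only local context for the node being written.
            code = execute(spec, parent_docs)
            if check(code, spec):
                nodes.append({**spec, "code": code})
                break
        else:
            raise RuntimeError(f"could not build node {spec['name']}")
    return nodes


def stub_plan(view: List[Node]) -> Node:
    """Stand-in planner: chain each new node to the previous one."""
    k = len(view)
    parents = [view[-1]["name"]] if view else []
    return {"name": f"v{k}", "parents": parents,
            "doc": f"Node v{k}, depends on {parents or 'nothing'}."}


def stub_execute(spec: Node, parent_docs: Dict[str, str]) -> str:
    """Stand-in executor: emit a tiny mechanism as source code."""
    args = ", ".join(spec["parents"])
    body = f"({' + '.join(spec['parents'])} + 1) % 3" if spec["parents"] else "1"
    return f"def {spec['name']}({args}):\n    return {body}\n"


chain = grow_scm(3, stub_plan, stub_execute, verify)
```

Splitting the context this way keeps the planner's prompt proportional to the graph's description and the executor's prompt proportional to one node's neighborhood, which is the property the review credits for scaling past one-shot generation.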

2. Methodological Rigor

The experimental design is well-structured around four clearly articulated research questions, with controlled experiments isolating specific factors:

Strengths in rigor:

Training on "nonsense" (meaningless 4-letter) variables is a clever design choice that separates gains in causal reasoning from recall of domain knowledge, a persistent confounder in causal reasoning evaluations.

The incremental vs. one-shot comparison (Experiment 1) provides direct evidence for the scalability claim, tested across two models (GPT-5, Qwen2.5-7B) and two domains.

The curriculum and data scaling studies (§5.2) are systematic, comparing four curriculum strategies and four data volumes.

The self-improvement experiment (§5.3) controls for generator quality by comparing GPT-5, GPT-5-mini, and self-generated SCMs.
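As an illustration of the nonsense-variable control mentioned in the first item above, the sketch below renames the variables of an SCM's source code to random 4-letter strings. The functions `nonsense_names` and `anonymize` are hypothetical, and the paper may generate its names differently; random strings can also occasionally spell a real word, which this sketch does not filter out.

```python
import builtins
import keyword
import random
import re
import string
from typing import Dict, List

# Names that must never be produced, so the rewritten code stays valid Python.
RESERVED = set(keyword.kwlist) | set(dir(builtins))


def nonsense_names(names: List[str], seed: int = 0) -> Dict[str, str]:
    """Map each variable name to a distinct random 4-letter string."""
    rng = random.Random(seed)
    mapping: Dict[str, str] = {}
    taken = set(names) | RESERVED
    for name in names:
        while True:
            candidate = "".join(rng.choice(string.ascii_lowercase) for _ in range(4))
            if candidate not in taken:
                taken.add(candidate)
                mapping[name] = candidate
                break
    return mapping


def anonymize(code: str, mapping: Dict[str, str]) -> str:
    """Rename whole-word occurrences in a single pass (no chained renames)."""
    if not mapping:
        return code
    longest_first = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, longest_first)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], code)


original = "def wet(rain, sprinkler):\n    return int(rain or sprinkler)\n"
name_map = nonsense_names(["rain", "sprinkler", "wet"])
anonymized = anonymize(original, name_map)
```

The renamed function computes the same values as the original, so any accuracy a model achieves on it cannot come from knowing that rain makes things wet.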

Concerns about rigor:

Absolute performance numbers are quite low—pass@1 drops below 0.1 for most query types at higher complexity levels (7-10 nodes), even after training. While improvements are consistent, it's unclear whether the learned capabilities are practically useful at scale.

The evaluation is predominantly on the paper's own synthetic SCMs and a small set of external benchmarks (bnlearn graphs, ReImagine-derived GSM8K). The external benchmark results (Tables 23 and 25) show modest improvements, and the evaluation on truly natural causal reasoning tasks is limited.

The paper uses a relatively small base model (Qwen2.5-3B-Instruct), and it remains unclear how findings scale to larger models that may already possess stronger causal reasoning.

The deterministic, discrete setting is acknowledged as a limitation but is quite restrictive—real-world causal reasoning involves continuous variables, stochastic mechanisms, and probabilistic counterfactuals.
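For readers unfamiliar with the pass@1 figures cited in the first concern, the snippet below shows the standard unbiased pass@k estimator from the code-generation literature. Whether the paper computes pass@1 exactly this way is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this reduces to c / n, so a pass@1 below 0.1 means fewer than one in ten sampled answers matched the executable ground truth.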

3. Potential Impact

Direct impact on causal reasoning in LLMs: CauSim provides the first scalable framework for generating supervised causal reasoning data, addressing what the authors correctly identify as a data bottleneck rather than an architecture bottleneck. This could accelerate research on improving LLM causal capabilities.

Broader methodological impact: The paradigm of using LLM-generated executable environments for self-supervision (building on Absolute Zero) is extended to a new domain. The formalization/informalization pipeline could inspire similar approaches in other domains where executable models exist but training data is scarce (physics simulation, circuit design, etc., as suggested in Table 28).

Practical applications: If causal reasoning improvements generalize robustly, applications in clinical decision support, policy analysis, and scientific reasoning could benefit. However, the current performance levels suggest this is still early-stage.

4. Timeliness & Relevance

The paper is highly timely. Causal reasoning is increasingly recognized as a critical gap in LLM capabilities, and the community is actively searching for scalable approaches to improve reasoning. The work sits at the intersection of several active threads: LLM self-improvement, synthetic data generation for training, and executable verification for reasoning tasks. The connection to Absolute Zero and the broader RLVR (reinforcement learning with verifiable rewards) paradigm positions this work within a rapidly growing research direction.

5. Strengths & Limitations

Key Strengths:

Novel framing: Recasting causal reasoning as a scalable supervision problem rather than an evaluation problem is a meaningful conceptual contribution.

Principled design: The incremental construction with plan-execute-verify is well-motivated and empirically validated.

Comprehensive study: The four research questions cover generalization, scaling, self-improvement, and data augmentation—providing a thorough initial investigation.

Honest limitations discussion: The paper includes thoughtful appendices (F, G) acknowledging that pattern recognition vs. mechanism learning remains unresolved, and that self-improvement has inherent ceilings.

Cross-representation transfer: Demonstrating that training on code-based SCMs with nonsense variables improves performance on natural language medical SCMs is a compelling finding.

Notable Limitations:

Limited complexity scale: 10-node SCMs are relatively small. Real-world causal systems can involve dozens to hundreds of variables.

Deterministic restriction: The focus on deterministic settings with discrete variables limits ecological validity.

Performance ceiling concerns: Absolute accuracy remains low at higher complexities, raising questions about whether the approach hits fundamental limits quickly.

Single base model: Results on only Qwen2.5-3B (and 7B for self-improvement) leave generalizability to other architectures and scales uncertain.

No comparison with RL-based training: The paper uses rejection-sampling SFT but doesn't compare with reinforcement learning approaches that could potentially leverage the executable verifier more effectively.
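The rejection-sampling SFT mentioned in the last item can be pictured as the following filter: sample several responses per query and keep only those whose final answer matches the label computed by executing the SCM. This is a generic sketch, not the paper's training code; `rejection_sample`, `extract_answer`, the "Answer:" output convention, and the canned sampler are all assumptions made for illustration.

```python
import re
from itertools import cycle
from typing import Callable, Dict, List, Optional


def extract_answer(response: str) -> Optional[str]:
    """Pull the final answer out of a response (assumed 'Answer: X' format)."""
    match = re.search(r"Answer:\s*(\S+)", response)
    return match.group(1) if match else None


def rejection_sample(queries: List[Dict[str, str]],
                     sample: Callable[[str], str],
                     k: int = 4,
                     max_keep: int = 1) -> List[Dict[str, str]]:
    """Keep up to max_keep verified responses per query out of k samples."""
    dataset: List[Dict[str, str]] = []
    for query in queries:
        kept = 0
        for _ in range(k):
            response = sample(query["prompt"])
            # The label comes from running the SCM, so this check is exact.
            if extract_answer(response) == query["answer"]:
                dataset.append({"prompt": query["prompt"], "response": response})
                kept += 1
                if kept >= max_keep:
                    break
    return dataset


# Canned responses stand in for a model being sampled.
canned = cycle(["Rain is off, so... Answer: 0", "Rain is forced on. Answer: 1"])
queries = [
    {"prompt": "What is wet under do(rain=1)?", "answer": "1"},
    {"prompt": "What is wet when nothing is on?", "answer": "0"},
]
sft_data = rejection_sample(queries, lambda prompt: next(canned))
```

The filter uses the verifier only as a pass/fail gate on finished samples, which is the contrast the review draws with RL methods that would feed the same signal back as a reward.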

Overall Assessment

CauSim makes a solid conceptual and empirical contribution by establishing that executable SCMs can serve as scalable training environments for causal reasoning, with promising transfer properties. The incremental construction mechanism is technically sound and practically useful. However, the impact is tempered by the limited scale of demonstrated improvements, the restrictive deterministic setting, and the gap between synthetic evaluation and real-world causal reasoning tasks. This is a strong foundational paper that opens a promising research direction, though significant work remains to demonstrate practical utility.

Rating: 6.5/10

Significance 7 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Generated May 12, 2026

Comparison History (21)

vs. DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

gemini-3.1 · 5/16/2026

Paper 2 addresses a highly pressing and timely issue: the safety and security of autonomous AI agents. By providing a comprehensive, reproducible red-teaming platform (DTap), an autonomous red-teaming agent, and a large-scale benchmark, it offers immediate, real-world utility for evaluating and securing AI deployments across diverse domains. While Paper 1 introduces an innovative approach to a fundamental AI limitation (causal reasoning), Paper 2's potential to establish standard security evaluation practices for AI agents gives it a broader and more urgent scientific and practical impact.

vs. Fusion-fission forecasts when AI will shift to undesirable behavior

gpt-5.2 · 5/16/2026

Paper 1 likely has higher scientific impact: it proposes a broadly applicable, scalable framework (CauSim) to generate verifiable causal-reasoning supervision via executable SCMs, enabling curriculum/data scaling, cross-representation training, and self-improvement—advances that can influence causal inference, AI reasoning, data synthesis, and scientific modeling. Its methodology is grounded in established causal formalism with built-in answer verifiability. Paper 2 is timely and application-relevant for safety, but its strong universality/forecasting claims hinge on a specific dynamical theory that may generalize less broadly and will face heavier validation/scrutiny for rigor and reproducibility.

vs. Read the Paper, Write the Code: Agentic Reproduction of Social-Science Results

gemini-3.1 · 5/16/2026

Paper 1 addresses a fundamental and broad limitation in current AI models—causal reasoning—by introducing a highly novel, scalable framework for generating causal data via structural causal models. This methodological innovation has significant implications for advancing AI capabilities across numerous fields. In contrast, Paper 2, while valuable for addressing the reproducibility crisis in social sciences, is primarily an evaluative application of existing LLM agents to a narrower domain, resulting in a lower ceiling for widespread scientific impact.

vs. LLM4Branch: Large Language Model for Discovering Efficient Branching Policies of Integer Programs

gemini-3.1 · 5/16/2026

Paper 2 addresses a fundamental and pervasive weakness in current AI systems: causal reasoning. By providing a scalable framework to generate structural causal models and verifiable queries, CauSim has the potential to significantly advance LLM capabilities across numerous domains like medicine, science, and economics. While Paper 1 offers a valuable improvement to MILP solvers—which are important for operations research—its impact is more domain-specific. The breadth of application and fundamental nature of advancing causal reasoning in AI makes Paper 2 more scientifically impactful.

vs. Online Allocation with Unknown Shared Supply

claude-opus-4.6 · 5/16/2026

CauSim addresses a fundamental limitation of LLMs—causal reasoning—which is a broadly impactful problem across AI. It introduces a novel framework converting causal reasoning into a scalable supervised learning problem, with demonstrated self-improvement capabilities. The breadth of impact spans multiple fields relying on causal inference and LLMs. Paper 2, while rigorous with tight theoretical bounds for an online allocation problem, addresses a more niche operations research problem with narrower applicability. CauSim's timeliness given the LLM era and its potential to unlock causal reasoning at scale gives it higher impact potential.

vs. Quantifying the human visual exposome with vision language models

claude-opus-4.6 · 5/16/2026

CauSim addresses a fundamental limitation of LLMs—causal reasoning—with a scalable, generalizable framework that has broad implications across AI, science, and decision-making. Its methodological contribution (turning scarce-label causal problems into supervised ones via executable SCMs) is technically novel and applicable across many domains. Paper 1 is innovative in applying VLMs to visual exposomics for mental health, but its impact is narrower, primarily relevant to environmental health research. CauSim's potential to improve causal reasoning in LLMs represents a more transformative advance with wider cross-disciplinary impact.

vs. Unsteady Metrics and Benchmarking Cultures of AI Model Builders

claude-opus-4.6 · 5/16/2026

CauSim addresses a fundamental limitation of LLMs (causal reasoning) with a novel, scalable framework that generates training data from executable structural causal models. It offers concrete methodological contributions—curriculum scaling, self-improvement, cross-representation generalization—with broad applicability across AI and causal inference. Paper 2 provides valuable meta-analysis of benchmarking practices but is primarily observational and descriptive, with impact limited to AI evaluation methodology discourse. CauSim's technical contributions have greater potential to advance the field and inspire follow-up work across multiple research areas.

vs. SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

gemini-3.1 · 5/16/2026

Paper 2 addresses a fundamental weakness in current LLMs—causal reasoning—by introducing a novel, scalable framework for generating verifiable synthetic data. While Paper 1 provides a valuable algorithmic efficiency improvement for RLHF (SPPO), Paper 2's approach unlocks fundamentally new LLM capabilities. Enhancing causal reasoning has profound, broad-ranging implications for scientific discovery, medicine, and decision-making, granting it a higher potential for transformative cross-disciplinary impact.

vs. ASMR-Bench: Auditing for Sabotage in ML Research

claude-opus-4.6 · 5/16/2026

CauSim addresses a fundamental limitation of LLMs—causal reasoning—with a scalable framework that converts scarce-label problems into supervised ones. It introduces novel methodology (executable SCMs, curriculum scaling, self-improvement loops) with broad applicability across AI and causal inference. Paper 2 (ASMR-Bench) introduces a valuable but narrower benchmark for AI safety auditing, primarily documenting current limitations rather than providing solutions. CauSim's methodological contributions and potential to improve LLM reasoning capabilities across domains give it broader and deeper scientific impact.

vs. RewardHarness: Self-Evolving Agentic Post-Training

gpt-5.2 · 5/16/2026

Paper 2 (CauSim) likely has higher scientific impact due to broader, more foundational novelty: converting causal reasoning into a scalable supervised setting via executable SCM simulators, enabling verifiable labels, curriculum scaling, cross-representation training, and domain-knowledge formalization. This can generalize across many fields (medicine, econometrics, social science, policy, systems) where causal queries matter, and addresses a timely, widely recognized LLM limitation. Paper 1 is innovative for data-efficient reward modeling in image editing but is narrower in application domain and may depend on specific benchmark/task setups.

vs. Large Language Models Exhibit Normative Conformity

gpt-5.2 · 5/16/2026

Paper 1 likely has higher impact: it proposes a scalable framework (CauSim) that converts causal reasoning into supervised learning via executable SCM simulators, enabling curriculum scaling, data augmentation, cross-representation supervision, and self-improvement—broadly useful across causal inference, LLM training, and scientific domains with structured knowledge. The approach is more infrastructural and generative of new benchmarks/data, with clear methodological levers and downstream applications. Paper 2 is timely and relevant for safety and multi-agent systems, but is primarily empirical/diagnostic with narrower immediate tooling compared to CauSim’s platform-like contribution.

vs. QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning

gpt-5.2 · 5/16/2026

Paper 2 (CauSim) likely has higher impact due to broader cross-field relevance: scalable causal reasoning applies to medicine, economics, policy, and ML interpretability, not a single scientific subdomain. Its key innovation—LLM-constructed executable SCM simulators enabling verifiable labels, curriculum scaling, and cross-representation supervision—could become a general paradigm for generating ground-truth causal QA and training/evaluating models. Paper 1 is rigorous and useful for physics-aligned LLMs, but its dataset and verification stack are more domain-specific (quantum mechanics), limiting breadth despite strong methodological contributions.

vs. ASPECT: Analogical Semantic Policy Execution via Language Conditioned Transfer

claude-opus-4.6 · 5/12/2026

CauSim addresses a fundamental limitation of LLMs—causal reasoning—with a scalable framework that converts scarce-label problems into supervised ones via executable structural causal models. It demonstrates broad contributions including curriculum scaling, self-improvement, and cross-representation generalization. The problem is central to AI safety and reasoning, with wide applicability. Paper 2, while creative in using LLMs for semantic remapping in RL transfer, addresses a narrower problem domain and builds more incrementally on existing zero-shot transfer approaches. CauSim's systematic methodology and broader implications give it higher impact potential.

vs. CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents

claude-opus-4.6 · 5/12/2026

CauSim addresses a fundamental limitation of LLMs—causal reasoning—with a general-purpose framework that is representation-agnostic and scalable. It introduces a novel methodology for converting causal reasoning into a supervised learning problem, with demonstrated gains across multiple dimensions (curriculum scaling, self-improvement, data augmentation). Its breadth of impact spans any domain requiring causal inference. CodeClinic, while valuable, is more narrowly scoped as a domain-specific benchmark for clinical reasoning with incremental methodological contributions (autoformalization pipeline). CauSim's foundational nature gives it broader and deeper potential impact.

vs. U-CECE: A Universal Multi-Resolution Framework for Conceptual Counterfactual Explanations

gemini-3.1 · 5/12/2026

Paper 2 tackles a fundamental and timely bottleneck in artificial intelligence: the lack of robust causal reasoning in Large Language Models. By introducing a scalable framework to generate supervised training data via executable causal simulators, it offers a broad, highly applicable solution that could fundamentally improve LLM capabilities across numerous domains. While Paper 1 provides a valuable contribution to explainable AI, Paper 2's potential to enhance foundational model reasoning gives it a significantly wider scope and greater potential for transformative impact in the field.

vs. Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent

gpt-5.2 · 5/12/2026

Paper 2 (CauSim) has higher likely scientific impact due to its broader, more foundational contribution: a scalable framework for causal reasoning supervision via executable SCM simulators, enabling verifiable labels, curriculum scaling, cross-representation training, and self-improvement—capabilities relevant across many domains (science, policy, healthcare, economics) and multiple ML subfields (causal inference, data generation, evaluation). Paper 1 is innovative for enterprise BI, but is more domain-specific (retail/BI systems, proprietary DSL/environment), potentially limiting breadth and reproducibility compared with CauSim’s general methodological advance.

vs. Can We Formally Verify Neural PDE Surrogates? SMT Compilation of Small Fourier Neural Operators

claude-opus-4.6 · 5/12/2026

CauSim addresses a fundamental limitation of LLMs (causal reasoning) with a scalable framework that bridges multiple representations and enables self-improvement. Its breadth of impact is larger—spanning AI safety, causal inference, and LLM training methodology—and it tackles a timely, high-profile problem. Paper 1 is rigorous but narrowly focused on formal verification of small-scale FNOs, with significant scalability limitations (timing out on ReLU models, restricted to ~117 parameters). While Paper 1 makes an important conceptual contribution, Paper 2's practical scalability and relevance to the booming LLM ecosystem give it higher potential impact.

vs. Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards

gemini-3.1 · 5/12/2026

Paper 2 addresses a fundamental and critical limitation of current LLMs—causal reasoning—by introducing a highly novel, scalable framework using executable structural causal models. Improving causal reasoning has profound implications across numerous scientific and real-world domains. In contrast, Paper 1 focuses on improving the evaluation interface for LLM leaderboards. While useful for transparency and model selection, it represents an incremental improvement in HCI and evaluation methodology rather than a foundational advancement in AI capabilities.

vs. PathISE: Learning Informative Path Supervision for Knowledge Graph Question Answering

gpt-5.2 · 5/12/2026

Paper 2 (CauSim) has higher potential impact due to a more broadly applicable and timely agenda: scalable causal reasoning for LLMs via executable structural causal models, enabling verifiable supervision, curriculum scaling, self-improvement, and cross-representation transfer. Its approach could influence multiple areas (causal inference, program synthesis, agentic reasoning, scientific modeling) and provides a general method for generating reliable causal QA data where labels are scarce. Paper 1 is solid and practical for KGQA, but its contributions are more domain-specific and likely narrower in cross-field impact.

vs. Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing

gpt-5.2 · 5/12/2026

Paper 2 (CauSim) is more likely to have higher scientific impact due to greater methodological innovation and broader applicability: it proposes a scalable paradigm (LLM-built executable SCM simulators) that converts causal reasoning into a supervised learning problem, enabling curriculum scaling, verifiable labels, cross-representation supervision, and self-improvement. This can influence multiple areas (causal inference, scientific modeling, evaluation/training of LLM reasoning, simulation-based learning) with clear downstream applications. Paper 1 is timely and valuable for safety evaluation, but primarily contributes a benchmark/framework rather than a broadly enabling training methodology.