Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs
Clément Yvernes, Emilie Devijver, Marianne Clausel, Eric Gaussier
Abstract
The do-calculus defines a general system of inference for interventional queries, allowing causal quantities to be transformed through successive applications of its rules. This process induces a rich space of equivalent interventional expressions, but combining and ordering these rules remains challenging. In this work, we introduce derivation graphs, which represent how do-calculus rules are applied and combined, and characterize the full space of observational and interventional probabilities which are equivalent under the do-calculus. The structure of these graphs yields a simple procedure that uses at most four applications of do-calculus rules. Finally, we show how applying identification algorithms to equivalent causal queries produces multiple valid estimands for the same causal quantity, potentially yielding more efficient estimators.
AI Impact Assessments
Scientific Impact Assessment: "Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs"
1. Core Contribution
This paper introduces derivation graphs, a novel graph-theoretic representation that organizes the full space of interventional and observational expressions equivalent under Pearl's do-calculus. The key insight is that by treating each expression P(y|do(x),w) as a node and each valid single-rule application as an edge, one obtains a structured graph whose connected components precisely characterize equivalence classes of causal expressions.
The central theoretical result (Theorem 6) is that any two equivalent expressions can be connected by a canonical sequence of at most four do-calculus rule applications: R₂↑, R₃↑, R₃↓, R₂↓ — eliminating the need for arbitrary-length derivation chains. This is accompanied by an efficient graphical criterion (Corollary 1) requiring only four d-separation tests on mutilated graphs, providing a sound and complete test for expression equivalence.
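To make the idea of a d-separation test on a mutilated graph concrete, here is a minimal Python sketch for Pearl's standard Rule 2, which states that P(y | do(x), do(z), w) = P(y | do(x), z, w) whenever Y is d-separated from Z given X and W in the graph obtained by deleting edges into X and edges out of Z. This is not the paper's Corollary 1 criterion, whose exact tests are not reproduced here. The function names (`ancestors`, `d_separated`, `rule2_applies`) and the edge-list encoding of the DAG are assumptions made for illustration.

```python
from itertools import combinations

def ancestors(edges, nodes):
    """Return the given nodes plus all their ancestors; edges are (parent, child) pairs."""
    result = set(nodes)
    frontier = list(nodes)
    while frontier:
        node = frontier.pop()
        for parent, child in edges:
            if child == node and parent not in result:
                result.add(parent)
                frontier.append(parent)
    return result

def d_separated(edges, xs, ys, zs):
    """Test whether xs and ys are d-separated given zs, via the ancestral moral graph."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(edges, xs | ys | zs)
    sub = [(p, c) for p, c in edges if p in keep and c in keep]
    # Moralize: drop edge directions and connect parents that share a child.
    undirected = {frozenset(e) for e in sub}
    for node in keep:
        parents = [p for p, c in sub if c == node]
        for a, b in combinations(parents, 2):
            undirected.add(frozenset((a, b)))
    # Remove the conditioning set, then search for any remaining path from xs to ys.
    seen = xs - zs
    frontier = list(seen)
    while frontier:
        node = frontier.pop()
        if node in ys:
            return False
        for edge in undirected:
            if node in edge:
                (other,) = edge - {node}
                if other not in zs and other not in seen:
                    seen.add(other)
                    frontier.append(other)
    return True

def rule2_applies(edges, y, x, z, w):
    """Rule 2 check: can do(z) be exchanged with observing z in P(y | do(x), do(z), w)?"""
    x, z, w = set(x), set(z), set(w)
    # Mutilate the graph: delete edges into X and edges out of Z.
    mutilated = [(p, c) for p, c in edges if c not in x and p not in z]
    return d_separated(mutilated, y, z, x | w)

# Hypothetical confounded graph: U -> X, U -> Y, X -> Y.
confounded = [("U", "X"), ("U", "Y"), ("X", "Y")]
unconditional = rule2_applies(confounded, {"Y"}, set(), {"X"}, set())
given_u = rule2_applies(confounded, {"Y"}, set(), {"X"}, {"U"})
```

In this toy graph, `unconditional` is False because the path through U stays open, while `given_u` is True because conditioning on U blocks it. Each edge of a derivation graph would correspond to one such successful rule check.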
2. Methodological Rigor
The paper demonstrates strong theoretical rigor. The proofs are detailed and carefully structured across the appendix, which spans roughly 16 pages.
The proofs employ a consistent and clean methodology: contrapositive arguments using path-based d-separation reasoning and explicit SCM construction. The geometric interpretation on derivation graphs (triangles for R₁ redundancy, quadrilaterals for commutativity) adds intuitive clarity to the formal results.
One limitation is that the experimental validation is relatively modest — a synthetic linear Gaussian example and a single real dataset (Sachs protein signaling). The variance comparisons, while illustrative, lack formal statistical analysis (e.g., no confidence intervals on the variance estimates themselves, no asymptotic efficiency comparisons).
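As a rough illustration of what a variance comparison between two valid estimands can look like, the following Python sketch simulates a toy linear Gaussian model and compares two estimators of the same causal effect. It is not the paper's experiment: the graph (X -> Y <- Z), the coefficients, the sample sizes, and all function names are assumptions chosen for illustration. In this graph both the unadjusted regression of Y on X and the regression adjusted for Z target the effect of X on Y, but they differ in sampling variance.

```python
import random

def simulate(n, rng):
    """Draw n samples from a toy SCM X -> Y <- Z; the true effect of X on Y is 1."""
    data = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        z = rng.gauss(0, 1)
        y = 1.0 * x + 2.0 * z + rng.gauss(0, 1)
        data.append((x, z, y))
    return data

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Empirical covariance (population normalization)."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)

def slope_unadjusted(data):
    """OLS slope of Y on X alone (empty adjustment set)."""
    xs, _, ys = zip(*data)
    return cov(xs, ys) / cov(xs, xs)

def slope_adjusted(data):
    """OLS coefficient of X when regressing Y on X and Z (adjustment set {Z})."""
    xs, zs, ys = zip(*data)
    sxx, szz, sxz = cov(xs, xs), cov(zs, zs), cov(xs, zs)
    sxy, szy = cov(xs, ys), cov(zs, ys)
    return (sxy * szz - szy * sxz) / (sxx * szz - sxz ** 2)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

rng = random.Random(0)  # fixed seed so the comparison is repeatable
runs = [simulate(200, rng) for _ in range(300)]
est_unadjusted = [slope_unadjusted(d) for d in runs]
est_adjusted = [slope_adjusted(d) for d in runs]
var_unadjusted = variance(est_unadjusted)
var_adjusted = variance(est_adjusted)
```

Under these assumptions the adjusted estimator has the smaller Monte Carlo variance, since Z explains part of the noise in Y. A more formal analysis of the kind the assessment asks for would add confidence intervals around `var_unadjusted` and `var_adjusted`, for example by bootstrapping over `runs`.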
3. Potential Impact
Theoretical impact: The paper provides foundational structural understanding of the do-calculus that has been surprisingly absent from the literature. The do-calculus has been known to be complete since 2006, yet the combinatorial structure of its derivation space was poorly characterized. The normal form theorem (at most 4 steps) is an elegant simplification result with potential pedagogical and algorithmic value.
Practical impact: The framework has clear applications in:
However, the practical impact is constrained by the exponential growth of equivalence classes (up to 3^|V\Y| expressions), which may hinder scalability. The paper acknowledges this but does not provide pruning strategies or approximate methods for large graphs.
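One way to read the 3^|V\Y| bound, offered here as an assumption rather than the paper's own derivation, is that each variable outside Y takes one of three roles in an expression P(y | do(x), w): intervened on, conditioned on, or absent. The short Python sketch below enumerates the candidate expressions under that reading; `candidate_expressions` is a hypothetical helper name.

```python
from itertools import product

def candidate_expressions(variables, y):
    """Yield (do_set, conditioning_set) pairs for every expression P(y | do(x), w)."""
    others = [v for v in variables if v not in y]
    # Each remaining variable is independently absent, intervened on, or conditioned on.
    for roles in product(("absent", "do", "cond"), repeat=len(others)):
        do_set = frozenset(v for v, r in zip(others, roles) if r == "do")
        cond_set = frozenset(v for v, r in zip(others, roles) if r == "cond")
        yield do_set, cond_set

# With three variables besides Y there are 3**3 = 27 candidate expressions.
expressions = list(candidate_expressions(["A", "B", "C", "Y"], {"Y"}))
count = len(expressions)
```

The count triples with every added variable, which is why exhaustive enumeration of the candidate space quickly becomes impractical on larger graphs.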
4. Timeliness & Relevance
The paper addresses a genuine gap in the causal inference literature. While significant work has focused on identification algorithms (ID algorithm), adjustment criteria, and efficient estimation, the *structural organization* of do-calculus derivations has received little attention. With growing interest in optimal estimator selection (Rotnitzky & Smucler 2020; Henckel et al. 2022) and experimental design in causal settings, understanding the full space of equivalent expressions is timely.
The connection to the "napkin graph" — a problem recently studied by Guo et al. (2025) — demonstrates relevance to active research questions.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment
This is a solid theoretical contribution that provides new structural insight into the do-calculus. The normal form theorem and graphical criterion are clean, useful results. However, the practical implications, while promising, remain largely at the proof-of-concept stage. The paper would benefit from stronger connections to efficiency theory and scalable algorithms for exploiting the derivation graph structure.
Generated Jun 3, 2026
Comparison History (20)
Paper 2 offers a concrete, technically novel contribution—derivation graphs that characterize equivalence classes under do-calculus and yield a bounded (≤4 steps) reasoning procedure—potentially advancing automated causal identification and estimator construction. This is methodologically sharper than Paper 1’s position/agenda framing, which is timely and broadly relevant but less directly impactful without new theorems/algorithms. Paper 2’s results can influence causal inference, ML, epidemiology, econometrics, and tool-building for causal reasoning, with clearer near-term uptake and measurable downstream benefits (multiple estimands, efficiency gains).
While Paper 2 offers a profound theoretical contribution to causal inference, Paper 1 addresses a critical bottleneck in the rapidly expanding field of LLM-based multi-agent systems. By rigorously applying Shapley values to solve the notoriously difficult credit assignment problem in MARL, Paper 1 achieves substantial empirical improvements. Its high timeliness, direct applicability to real-world AI systems, and fusion of cooperative game theory with LLM training give it exceptionally high potential for immediate and widespread scientific impact.
Paper 2 addresses the timely and rapidly growing field of AI agent reliability, proposing a comprehensive evaluation framework with 12 metrics across 4 dimensions. Given the explosive deployment of AI agents, this work has immediate broad applicability and fills a critical gap between benchmark performance and real-world reliability. Paper 1 makes a solid theoretical contribution to causal inference by characterizing do-calculus derivation structure, but its impact is more narrowly scoped to the causal inference community. Paper 2's relevance to AI safety, its practical framework, and the breadth of its potential influence across AI development give it higher estimated impact.
Paper 2 likely has higher scientific impact due to strong timeliness and broad relevance: it targets LLM reasoning evaluation, a central open problem across AI, education, and scientific domains. Its formalized, calibrated, contamination-free probe framework suggests methodological rigor and immediate applicability for benchmarking, model development, and safety/reliability assessment, with potential to become a standard tool. Paper 1 is novel and rigorous within causal inference, offering structural insights and estimator improvements, but its impact is more specialized to causal identification/estimation communities and may diffuse more slowly across fields.
Paper 1 addresses highly critical and timely bottlenecks (inflexibility and scalability) in LLM-KG integration, a rapidly expanding area of AI research. By abstracting KG facts into executable code representations, it provides a highly scalable, practical solution with significant empirical gains. While Paper 2 offers a strong foundational contribution to causal inference, Paper 1's methodology is likely to see faster, broader adoption and immediate real-world applications across the pervasive LLM ecosystem.
Paper 2 has higher potential impact: it advances foundational causal inference theory by characterizing equivalence classes of interventional expressions via derivation graphs and providing a bounded (≤4 steps) procedure, with downstream consequences for identification and estimator efficiency. This is methodologically rigorous, broadly relevant across statistics, epidemiology, econometrics, and ML, and can influence both theory and practice in causal estimation. Paper 1 is timely and useful for LLM instruction-following, but appears more incremental/system-level and likely to age faster with model/training changes, with narrower cross-field reach.
Paper 2 is more likely to have higher scientific impact due to stronger novelty and breadth: it contributes to the theoretical foundations of causal inference by introducing derivation graphs to characterize do-calculus equivalence classes and providing a bounded (≤4 steps) reasoning procedure, with downstream implications for identification and estimation efficiency. This can influence multiple areas (statistics, epidemiology, economics, ML causality) and has long-term relevance. Paper 1 is timely and practically useful for LLM efficiency, but it appears more incremental within a fast-moving, model-specific optimization landscape and may generalize less broadly.
Paper 1 makes a fundamental theoretical contribution to causal inference by characterizing the complete structure of do-calculus reasoning through derivation graphs, proving a tight bound of four rule applications, and showing practical implications for efficient estimation. This advances core methodology with broad applicability across all causal inference applications. Paper 2 is a scoping review synthesizing existing AI work in dentistry—useful but incremental, domain-specific, and primarily organizational rather than introducing new methods or theory. Paper 1's contributions to foundational causal reasoning methodology have broader cross-disciplinary impact potential.
Paper 1 makes a fundamental theoretical contribution to causal inference by characterizing the complete structure of do-calculus reasoning, proving a tight bound of four rule applications, and demonstrating practical benefits through more efficient estimators. This advances a core problem in causal inference with broad implications across statistics, epidemiology, economics, and AI. Paper 2 addresses an important but more niche problem of refactoring LLM-generated formal proofs, with impact largely limited to the formal verification community. Paper 1's theoretical depth and breadth of applicability give it higher potential impact.
Paper 1 addresses a fundamental problem in causal inference—systematizing do-calculus reasoning—with direct practical implications for causal identification and estimation efficiency. The introduction of derivation graphs provides both theoretical insight (bounding rule applications to four) and practical benefits (multiple valid estimands yielding more efficient estimators). Causal inference impacts many fields (epidemiology, economics, ML). Paper 2 makes a solid but more niche contribution extending non-monotonic reasoning to a specific modal logic fragment, with narrower applicability primarily within the formal logic community.
Paper 1 addresses the timely and practically important problem of safety in long-horizon LLM agents, proposing a novel compression-based framework (TRACE) with strong empirical results across multiple benchmarks. Given the explosive growth of LLM agent deployments, this work has broad real-world applicability and addresses a critical gap in AI safety. Paper 2 makes a solid theoretical contribution to causal inference by formalizing do-calculus derivation graphs, but its impact is more niche, primarily benefiting the causal inference community. Paper 1's combination of novelty, practical relevance, and timeliness gives it higher estimated impact.
Paper 1 offers a concrete, novel technical contribution to causal inference: derivation graphs that characterize do-calculus equivalence classes, a bounded-step procedure, and a practical route to multiple estimands and potentially more efficient estimators. This is methodologically rigorous, immediately actionable for identification/estimation workflows, and likely to impact statistics, ML, epidemiology, and econometrics. Paper 2 is timely and broadly relevant but is primarily a conceptual position paper with less formal methodology and fewer directly testable/implementable results, which typically yields less near-term scientific uptake.
Paper 1 offers a fundamental theoretical advancement in causal inference by simplifying do-calculus rules and improving estimators. This mathematical rigor and foundational contribution will likely have a deep, enduring impact across multiple scientific disciplines. In contrast, Paper 2 presents a practical, engineering-focused application of LLMs for business forecasting, which, while highly relevant and timely for industry, has narrower long-term scientific implications compared to foundational theoretical breakthroughs.
Paper 1 addresses a highly timely and widely applicable problem in the booming field of LLM agents. By automating the induction of reasoning primitives, it demonstrates massive empirical improvements across diverse tasks. Its immediate practical utility and relevance to AI problem-solving give it a broader and more immediate potential scientific impact compared to the fundamental, but more niche, theoretical contributions to causal inference in Paper 2.
Paper 1 likely has higher near-term scientific impact due to strong timeliness (LLM agents + autonomous driving safety), clear real-world applicability (scenario generation for validation/training), and broad relevance across robotics, simulation, and AI safety. Its Pareto-based multi-objective evolutionary framework with simulator grounding is a concrete systems contribution likely to be adopted and extended. Paper 2 offers a valuable theoretical/algorithmic advance in causal inference (derivation graphs, bounded rule applications, multiple estimands), but its impact may be narrower and slower to translate into widely used tooling compared to the immediate engineering utility of Paper 1.
Paper 1 advances the fundamental theory of causal inference by structuring do-calculus reasoning, a foundational tool utilized across diverse scientific disciplines such as medicine, economics, and AI. Its ability to simplify causal identification and yield more efficient estimators provides profound, cross-disciplinary impact. In contrast, Paper 2 offers strong practical innovations for a specific application (short-video recommender systems). While commercially valuable, its scientific impact is narrower and more domain-specific compared to the foundational and broadly applicable theoretical contributions of Paper 1.
While Paper 1 offers a strong theoretical advancement in causal inference, Paper 2 tackles a highly timely and critical bottleneck in modern AI: autonomous LLM training and agentic reinforcement learning. By introducing a co-evolutionary framework for LLM policies and training harnesses, Paper 2 demonstrates immediate, state-of-the-art practical applications in high-impact domains like repository-level software engineering and mathematical reasoning. The explosive growth, broad applicability, and immense real-world utility of autonomous AI agents give Paper 2 a significantly higher potential for widespread scientific and industrial impact.
Paper 2 likely has higher scientific impact: it targets a timely, high-demand problem (autonomous post-training/alignment), proposes a system-level framework with concrete tooling, and reports state-of-the-art results plus open-source release, enabling rapid adoption and follow-on work. Its applications span agentic data generation, alignment, and automated ML pipelines, giving broad cross-field relevance. Paper 1 appears theoretically novel and potentially important for causal inference methodology, but its immediate applicability and breadth may be narrower and uptake slower compared to an empirically validated, deployable alignment/data-synthesis framework.
Paper 1 makes a fundamental theoretical contribution to causal inference by formalizing the structure of do-calculus reasoning through derivation graphs, proving that at most four rule applications suffice, and showing how equivalent queries yield more efficient estimators. This advances core causal inference methodology with broad implications across statistics, epidemiology, economics, and AI. Paper 2 offers an interesting but more incremental contribution about using visual graphs to assist LLM reasoning, which is timely but narrower in scope and dependent on the rapidly evolving LLM landscape.
Paper 2 likely has higher impact due to strong timeliness and broad real-world applicability in single-cell multi-omics, a rapidly growing field with many labs needing standardized evaluation. A comprehensive, open-source benchmark with datasets, metrics, and scenario analyses can become community infrastructure, shaping method development across genomics, ML, and bioinformatics. Paper 1 offers novel theoretical structure for do-calculus reasoning and could influence causal inference, but its audience is narrower and downstream adoption may be slower compared to a widely usable benchmark resource.