Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs

Clément Yvernes, Emilie Devijver, Marianne Clausel, Eric Gaussier

Jun 2, 2026

arXiv:2606.03719v1

cs.AI (primary)

#1816 of 3355 · Artificial Intelligence

Tournament Score: 1395 ± 45 (scale 1050 to 1800)

Win Rate: 50%

Rating: 6.8 / 10

Significance 7 · Rigor 8 · Novelty 7.5 · Clarity 7.5

Abstract

The do-calculus defines a general system of inference for interventional queries, allowing causal quantities to be transformed through successive applications of its rules. This process induces a rich space of equivalent interventional expressions, but combining and ordering these rules remains challenging. In this work, we introduce derivation graphs, which represent how do-calculus rules are applied and combined, and characterize the full space of observational and interventional probabilities which are equivalent under the do-calculus. The structure of these graphs yields a simple procedure that uses at most four applications of do-calculus rules. Finally, we show how applying identification algorithms to equivalent causal queries produces multiple valid estimands for the same causal quantity, eventually yielding more efficient estimators.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs"

1. Core Contribution

This paper introduces derivation graphs, a novel graph-theoretic representation that organizes the full space of interventional and observational expressions equivalent under Pearl's do-calculus. The key insight is that by treating each expression P(y|do(x),w) as a node and each valid single-rule application as an edge, one obtains a structured graph whose connected components precisely characterize equivalence classes of causal expressions.
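The node-and-edge construction described above can be sketched by brute force on a toy DAG. The following is a minimal illustration, not the authors' code: it assumes the chain Z → X → Y, restricts each edge to moving a single variable (the paper's rules may act on sets), uses only Pearl's Rules 2 and 3, and checks d-separation with a hand-written moral-graph test. All function names are hypothetical.

```python
from itertools import product

# Hypothetical toy DAG, stored as child -> set of parents: Z -> X -> Y
CHAIN = {"Z": set(), "X": {"Z"}, "Y": {"X"}}

def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, a, b, given):
    """d-separation test: moralize the ancestral graph, drop `given`, search."""
    keep = ancestors(parents, {a, b} | set(given))
    adj = {v: set() for v in keep}
    for v in keep:
        ps = list(parents[v])
        for p in ps:
            adj[v].add(p)
            adj[p].add(v)
            for q in ps:          # "marry" co-parents of v
                if q != p:
                    adj[p].add(q)
    seen, stack = {a}, [a]
    while stack:
        for u in adj[stack.pop()]:
            if u not in seen and u not in given:
                seen.add(u)
                stack.append(u)
    return b not in seen

def cut_incoming(parents, nodes):
    return {v: (set() if v in nodes else set(ps)) for v, ps in parents.items()}

def cut_outgoing(parents, nodes):
    return {v: set(ps) - set(nodes) for v, ps in parents.items()}

def rule2_ok(parents, y, v, do, cond):
    """P(y|do(do),do(v),cond) = P(y|do(do),v,cond)?"""
    g = cut_outgoing(cut_incoming(parents, do), {v})
    return d_separated(g, y, v, do | cond)

def rule3_ok(parents, y, v, do, cond):
    """P(y|do(do),do(v),cond) = P(y|do(do),cond)?"""
    g = cut_incoming(parents, do)
    if v not in ancestors(g, cond):   # v belongs to Z(W) in Pearl's notation
        g = cut_incoming(g, {v})
    return d_separated(g, y, v, do | cond)

def derivation_components(parents, y):
    """Group expressions P(y | do(D), C) into connected components."""
    others = sorted(set(parents) - {y})
    nodes = []
    for states in product(("absent", "do", "cond"), repeat=len(others)):
        do = frozenset(v for v, s in zip(others, states) if s == "do")
        cond = frozenset(v for v, s in zip(others, states) if s == "cond")
        nodes.append((do, cond))
    root = {n: n for n in nodes}      # simple union-find

    def find(n):
        while root[n] != n:
            n = root[n]
        return n

    for do, cond in nodes:
        for v in set(others) - do - cond:
            if rule2_ok(parents, y, v, do, cond):
                root[find((do | {v}, cond))] = find((do, cond | {v}))
            if rule3_ok(parents, y, v, do, cond):
                root[find((do | {v}, cond))] = find((do, cond))
    comps = {}
    for n in nodes:
        comps.setdefault(find(n), []).append(n)
    return list(comps.values())

if __name__ == "__main__":
    for comp in derivation_components(CHAIN, "Y"):
        print(len(comp), sorted((sorted(d), sorted(c)) for d, c in comp))
```

On this chain the sketch groups the nine expressions into three classes: the six that mention X in either role, the pair P(y|z) and P(y|do(z)), and P(y) alone. Because edges here move one variable at a time, the components could in principle be finer than the paper's on other graphs.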

The central theoretical result (Theorem 6) is that any two equivalent expressions can be connected by a canonical sequence of at most four do-calculus rule applications: R₂↑, R₃↑, R₃↓, R₂↓ — eliminating the need for arbitrary-length derivation chains. This is accompanied by an efficient graphical criterion (Corollary 1) requiring only four d-separation tests on mutilated graphs, providing a sound and complete test for expression equivalence.
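For reference, the rules named in this sequence are recalled below in Pearl's standard formulation. The block is a reminder of the textbook statements rather than a transcription from the paper; the up and down arrows attached to R₂ and R₃ are the paper's own notation and are not defined in this assessment, so they are not reproduced here.

```latex
% G_{\overline{X}}: G with all edges into X removed.
% G_{\underline{Z}}: G with all edges out of Z removed.
% Z(W): the nodes of Z that are not ancestors of any node of W in G_{\overline{X}}.
\begin{align*}
\text{(R}_1\text{)}\quad & P(y \mid do(x), z, w) = P(y \mid do(x), w)
  && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}} \\
\text{(R}_2\text{)}\quad & P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w)
  && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\underline{Z}} \\
\text{(R}_3\text{)}\quad & P(y \mid do(x), do(z), w) = P(y \mid do(x), w)
  && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\overline{Z(W)}}
\end{align*}
```

Each condition is a single d-separation statement on a mutilated graph, which is why an equivalence test built from a fixed number of rule applications reduces to a fixed number of d-separation checks.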

2. Methodological Rigor

The paper demonstrates strong theoretical rigor. The proofs are detailed and carefully structured across the appendix (spanning ~16 pages), covering:

Redundancy of Rule R₁ (Theorem 2): Formally establishing that R₁ is derivable from R₂ and R₃, extending earlier one-directional results from Huang & Valtorta (2006).

Commutativity analysis (Theorems 3-5): Systematically characterizing when rule applications commute, with careful treatment of asymmetric cases (e.g., R₃ insertion/deletion does not always commute).

Theorem A (shortcut validity): A novel and technically sophisticated result showing that if a valid multi-step derivation exists between two expressions, the corresponding single shortcut rule must also be valid. The proof constructs explicit linear Gaussian SCMs as counterexamples in the contrapositive direction.

The proofs employ a consistent and clean methodology: contrapositive arguments using path-based d-separation reasoning and explicit SCM construction. The geometric interpretation on derivation graphs (triangles for R₁ redundancy, quadrilaterals for commutativity) adds intuitive clarity to the formal results.

One limitation is that the experimental validation is relatively modest — a synthetic linear Gaussian example and a single real dataset (Sachs protein signaling). The variance comparisons, while illustrative, lack formal statistical analysis (e.g., no confidence intervals on the variance estimates themselves, no asymptotic efficiency comparisons).

3. Potential Impact

Theoretical impact: The paper provides foundational structural understanding of the do-calculus that has been surprisingly absent from the literature. The do-calculus has been known to be complete since 2006, yet the combinatorial structure of its derivation space was poorly characterized. The normal form theorem (at most 4 steps) is an elegant simplification result with potential pedagogical and algorithmic value.

Practical impact: The framework has clear applications in:

Estimator selection: Different equivalent identification formulae can yield estimators with drastically different variances (demonstrated empirically). The derivation graph makes this multiplicity explicit.

Experimental design: When interventions have different costs or feasibility, knowing the full equivalence class allows principled selection among experimental strategies.

Algorithmic improvements: The graphical criterion in Corollary 1 could be integrated into causal inference software to efficiently enumerate equivalent expressions.
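The estimator-selection point can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's experiment: it assumes a hypothetical linear Gaussian SCM with graph W → X → Y ← Z, in which P(y|do(x)) can be written with the empty adjustment set, with {Z}, or with {W}, and it compares the Monte Carlo variance of the resulting OLS estimates of the X coefficient. The coefficients, sample size, and function name are arbitrary choices.

```python
import numpy as np

def simulate_estimates(n=200, reps=500, beta=1.0, seed=0):
    """Return {adjustment set: (mean, variance)} of the OLS estimate of beta.

    Assumed SCM (illustrative only): W -> X -> Y <- Z, all noises N(0, 1).
    """
    rng = np.random.default_rng(seed)
    estimates = {"none": [], "Z": [], "W": []}
    for _ in range(reps):
        w = rng.normal(size=n)
        z = rng.normal(size=n)
        x = w + rng.normal(size=n)
        y = beta * x + 2.0 * z + rng.normal(size=n)
        for name, covariates in (("none", []), ("Z", [z]), ("W", [w])):
            # Regress Y on an intercept, X, and the chosen adjustment set.
            design = np.column_stack([np.ones(n), x, *covariates])
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            estimates[name].append(coef[1])
    return {k: (float(np.mean(v)), float(np.var(v)))
            for k, v in estimates.items()}

if __name__ == "__main__":
    for name, (mean, var) in simulate_estimates().items():
        print(f"adjust for {name:>4}: mean={mean:.3f}  variance={var:.5f}")
```

All three estimators target the same causal quantity, yet adjusting for Z (a cause of Y only) reduces variance while adjusting for W (a cause of X only) inflates it, which is the kind of gap that makes the choice among equivalent formulae matter.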

However, the practical impact is somewhat constrained by the exponential growth of equivalence classes (up to 3^|V\Y| expressions), which may hinder scalability. The paper acknowledges this but does not provide pruning strategies or approximate methods for large graphs.
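The 3^|V\Y| figure has a simple counting reading, sketched here on the assumption that each variable outside Y plays exactly one of three roles in an expression P(y | do(x), w): intervened on, conditioned on, or absent.

```latex
\[
  \#\{\,P(y \mid do(x), w) : X, W \subseteq V \setminus Y,\ X \cap W = \emptyset\,\}
  \;=\; 3^{\,|V \setminus Y|},
\]
\[
  3^{10} = 59\,049, \qquad 3^{20} = 3\,486\,784\,401 .
\]
```

Under that reading, a graph with 20 non-outcome variables already has billions of candidate expressions, so exhaustive enumeration is only realistic for small graphs.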

4. Timeliness & Relevance

The paper addresses a genuine gap in the causal inference literature. While significant work has focused on identification algorithms (ID algorithm), adjustment criteria, and efficient estimation, the *structural organization* of do-calculus derivations has received little attention. With growing interest in optimal estimator selection (Rotnitzky & Smucler 2020; Henckel et al. 2022) and experimental design in causal settings, understanding the full space of equivalent expressions is timely.

The connection to the "napkin graph" — a problem recently studied by Guo et al. (2025) — demonstrates relevance to active research questions.

5. Strengths & Limitations

Key Strengths:

Conceptual clarity: The derivation graph is an intuitive and powerful abstraction. The visual representations (Figures 2-4) effectively communicate the structure.

Completeness of analysis: The paper systematically covers all pairwise interactions between rules (R₂-R₂, R₃-R₃, R₂-R₃), including asymmetric cases.

The 4-step bound is a clean, surprising result — reducing an unbounded derivation problem to a constant-length one.

Reproducibility: Code is provided via a public repository.

Sound and complete graphical criterion: Corollary 1 provides a practical, polynomial-time test for equivalence.

Notable Limitations:

Limited empirical evaluation: The experiments serve as illustrations rather than comprehensive evaluations. No systematic comparison of estimator efficiency across graph classes is provided.

No formal efficiency analysis: While the paper shows variance differences, it does not characterize *which* equivalent formula yields the optimal estimator, nor does it connect to semiparametric efficiency theory.

Scalability concerns: The exponential size of equivalence classes is acknowledged but not addressed algorithmically.

Scope: The framework applies to expressions of the form P(y|do(x),w) but does not extend to the algebraic expressions produced by the ID algorithm (acknowledged as future work). This limits the practical utility for deriving optimal estimands.

The connection to existing work on optimal adjustment (Henckel et al., Rotnitzky & Smucler) could be made more explicit — when does the derivation graph recover known optimal adjustment sets?

Overall Assessment

This is a solid theoretical contribution that provides new structural insight into the do-calculus. The normal form theorem and graphical criterion are clean, useful results. However, the practical implications, while promising, remain largely at the proof-of-concept stage. The paper would benefit from stronger connections to efficiency theory and scalable algorithms for exploiting the derivation graph structure.

Rating: 6.8 / 10

Significance 7 · Rigor 8 · Novelty 7.5 · Clarity 7.5

Generated Jun 3, 2026

Comparison History (20)

vs. Position Paper: Post-Solve Robustness in Decision Engines: Feasible Regions and Smoothness Under Perturbations

gpt-5.2 · 6/5/2026

Paper 2 offers a concrete, technically novel contribution—derivation graphs that characterize equivalence classes under do-calculus and yield a bounded (≤4 steps) reasoning procedure—potentially advancing automated causal identification and estimator construction. This is methodologically sharper than Paper 1’s position/agenda framing, which is timely and broadly relevant but less directly impactful without new theorems/algorithms. Paper 2’s results can influence causal inference, ML, epidemiology, econometrics, and tool-building for causal reasoning, with clearer near-term uptake and measurable downstream benefits (multiple estimands, efficiency gains).

vs. Who Deserves the Reward? SHARP: Shapley Credit-based Optimization for Multi-Agent System

gemini-3.1 · 6/5/2026

While Paper 2 offers a profound theoretical contribution to causal inference, Paper 1 addresses a critical bottleneck in the rapidly expanding field of LLM-based multi-agent systems. By rigorously applying Shapley values to solve the notoriously difficult credit assignment problem in MARL, Paper 1 achieves substantial empirical improvements. Its high timeliness, direct applicability to real-world AI systems, and fusion of cooperative game theory with LLM training give it exceptionally high potential for immediate and widespread scientific impact.

vs. Towards a Science of AI Agent Reliability

claude-opus-4.6 · 6/5/2026

Paper 2 addresses the timely and rapidly growing field of AI agent reliability, proposing a comprehensive evaluation framework with 12 metrics across 4 dimensions. Given the explosive deployment of AI agents, this work has immediate broad applicability and fills a critical gap between benchmark performance and real-world reliability. Paper 1 makes a solid theoretical contribution to causal inference by characterizing do-calculus derivation structure, but its impact is more narrowly scoped to the causal inference community. Paper 2's relevance to AI safety, its practical framework, and the breadth of its potential influence across AI development give it higher estimated impact.

vs. X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes

gpt-5.2 · 6/5/2026

Paper 2 likely has higher scientific impact due to strong timeliness and broad relevance: it targets LLM reasoning evaluation, a central open problem across AI, education, and scientific domains. Its formalized, calibrated, contamination-free probe framework suggests methodological rigor and immediate applicability for benchmarking, model development, and safety/reliability assessment, with potential to become a standard tool. Paper 1 is novel and rigorous within causal inference, offering structural insights and estimator improvements, but its impact is more specialized to causal identification/estimation communities and may diffuse more slowly across fields.

vs. Code-on-Graph: Iterative Programmatic Reasoning via Large Language Models on Knowledge Graphs

gemini-3.1 · 6/3/2026

Paper 1 addresses highly critical and timely bottlenecks (inflexibility and scalability) in LLM-KG integration, a rapidly expanding area of AI research. By abstracting KG facts into executable code representations, it provides a highly scalable, practical solution with significant empirical gains. While Paper 2 offers a strong foundational contribution to causal inference, Paper 1's methodology is likely to see faster, broader adoption and immediate real-world applications across the pervasive LLM ecosystem.

vs. Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

gpt-5.2 · 6/3/2026

Paper 2 has higher potential impact: it advances foundational causal inference theory by characterizing equivalence classes of interventional expressions via derivation graphs and providing a bounded (≤4 steps) procedure, with downstream consequences for identification and estimator efficiency. This is methodologically rigorous, broadly relevant across statistics, epidemiology, econometrics, and ML, and can influence both theory and practice in causal estimation. Paper 1 is timely and useful for LLM instruction-following, but appears more incremental/system-level and likely to age faster with model/training changes, with narrower cross-field reach.

vs. ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

gpt-5.2 · 6/3/2026

Paper 2 is more likely to have higher scientific impact due to stronger novelty and breadth: it contributes to the theoretical foundations of causal inference by introducing derivation graphs to characterize do-calculus equivalence classes and providing a bounded (≤4 steps) reasoning procedure, with downstream implications for identification and estimation efficiency. This can influence multiple areas (statistics, epidemiology, economics, ML causality) and has long-term relevance. Paper 1 is timely and practically useful for LLM efficiency, but it appears more incremental within a fast-moving, model-specific optimization landscape and may generalize less broadly.

vs. Large AI Models in Dental Healthcare: From General-Purpose Systems to Domain-Specific Foundation Models

claude-opus-4.6 · 6/3/2026

Paper 1 makes a fundamental theoretical contribution to causal inference by characterizing the complete structure of do-calculus reasoning through derivation graphs, proving that at most four rule applications suffice, and showing practical implications for efficient estimation. This advances core methodology with broad applicability across all causal inference applications. Paper 2 is a scoping review synthesizing existing AI work in dentistry: useful but incremental, domain-specific, and primarily organizational rather than introducing new methods or theory. Paper 1's contributions to foundational causal reasoning methodology have broader cross-disciplinary impact potential.

vs. Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts

claude-opus-4.6 · 6/3/2026

Paper 1 makes a fundamental theoretical contribution to causal inference by characterizing the complete structure of do-calculus reasoning, proving that at most four rule applications suffice, and demonstrating practical benefits through more efficient estimators. This advances a core problem in causal inference with broad implications across statistics, epidemiology, economics, and AI. Paper 2 addresses an important but more niche problem of refactoring LLM-generated formal proofs, with impact largely limited to the formal verification community. Paper 1's theoretical depth and breadth of applicability give it higher potential impact.

vs. Towards Non-Monotonic Entailment in Propositional Defeasible Standpoint Logic

claude-opus-4.6 · 6/3/2026

Paper 1 addresses a fundamental problem in causal inference—systematizing do-calculus reasoning—with direct practical implications for causal identification and estimation efficiency. The introduction of derivation graphs provides both theoretical insight (bounding rule applications to four) and practical benefits (multiple valid estimands yielding more efficient estimators). Causal inference impacts many fields (epidemiology, economics, ML). Paper 2 makes a solid but more niche contribution extending non-monotonic reasoning to a specific modal logic fragment, with narrower applicability primarily within the formal logic community.

vs. TRACE: Trajectory Risk-Aware Compression for Long-Horizon Agent Safety

claude-opus-4.6 · 6/3/2026

Paper 1 addresses the timely and practically important problem of safety in long-horizon LLM agents, proposing a novel compression-based framework (TRACE) with strong empirical results across multiple benchmarks. Given the explosive growth of LLM agent deployments, this work has broad real-world applicability and addresses a critical gap in AI safety. Paper 2 makes a solid theoretical contribution to causal inference by formalizing do-calculus derivation graphs, but its impact is more niche, primarily benefiting the causal inference community. Paper 1's combination of novelty, practical relevance, and timeliness gives it higher estimated impact.

vs. Solipsistic Superintelligence is Unlikely to be Cooperative

gpt-5.2 · 6/3/2026

Paper 1 offers a concrete, novel technical contribution to causal inference: derivation graphs that characterize do-calculus equivalence classes, a bounded-step procedure, and a practical route to multiple estimands and potentially more efficient estimators. This is methodologically rigorous, immediately actionable for identification/estimation workflows, and likely to impact statistics, ML, epidemiology, and econometrics. Paper 2 is timely and broadly relevant but is primarily a conceptual position paper with less formal methodology and fewer directly testable/implementable results, which typically yields less near-term scientific uptake.

vs. Bridging the Last Mile of Time Series Forecasting with LLM Agents

gemini-3.1 · 6/3/2026

Paper 1 offers a fundamental theoretical advancement in causal inference by simplifying do-calculus rules and improving estimators. This mathematical rigor and foundational contribution will likely have a deep, enduring impact across multiple scientific disciplines. In contrast, Paper 2 presents a practical, engineering-focused application of LLMs for business forecasting, which, while highly relevant and timely for industry, has narrower long-term scientific implications compared to foundational theoretical breakthroughs.

vs. Inducing Reasoning Primitives from Agent Traces

gemini-3.1 · 6/3/2026

Paper 1 addresses a highly timely and widely applicable problem in the booming field of LLM agents. By automating the induction of reasoning primitives, it demonstrates massive empirical improvements across diverse tasks. Its immediate practical utility and relevance to AI problem-solving give it a broader and more immediate potential scientific impact compared to the fundamental, but more niche, theoretical contributions to causal inference in Paper 2.

vs. EvoDrive: Pareto Evolution for Safety-Critical Autonomous Driving via Self-Improving LLM Agents

gpt-5.2 · 6/3/2026

Paper 1 likely has higher near-term scientific impact due to strong timeliness (LLM agents + autonomous driving safety), clear real-world applicability (scenario generation for validation/training), and broad relevance across robotics, simulation, and AI safety. Its Pareto-based multi-objective evolutionary framework with simulator grounding is a concrete systems contribution likely to be adopted and extended. Paper 2 offers a valuable theoretical/algorithmic advance in causal inference (derivation graphs, bounded rule applications, multiple estimands), but its impact may be narrower and slower to translate into widely used tooling compared to the immediate engineering utility of Paper 1.

vs. FlowTime: Towards Continuous Generative Watch Time Prediction via Flow-based Personalized Priors

gemini-3.1 · 6/3/2026

Paper 1 advances the fundamental theory of causal inference by structuring do-calculus reasoning, a foundational tool utilized across diverse scientific disciplines such as medicine, economics, and AI. Its ability to simplify causal identification and yield more efficient estimators provides profound, cross-disciplinary impact. In contrast, Paper 2 offers strong practical innovations for a specific application (short-video recommender systems). While commercially valuable, its scientific impact is narrower and more domain-specific compared to the foundational and broadly applicable theoretical contributions of Paper 1.

vs. EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

gemini-3.1 · 6/3/2026

While Paper 1 offers a strong theoretical advancement in causal inference, Paper 2 tackles a highly timely and critical bottleneck in modern AI: autonomous LLM training and agentic reinforcement learning. By introducing a co-evolutionary framework for LLM policies and training harnesses, Paper 2 demonstrates immediate, state-of-the-art practical applications in high-impact domains like repository-level software engineering and mathematical reasoning. The explosive growth, broad applicability, and immense real-world utility of autonomous AI agents give Paper 2 a significantly higher potential for widespread scientific and industrial impact.

vs. ANDES: Agent Native Data Evolving Synthesis Tool for Autonomous Instruction Alignment

gpt-5.2 · 6/3/2026

Paper 2 likely has higher scientific impact: it targets a timely, high-demand problem (autonomous post-training/alignment), proposes a system-level framework with concrete tooling, and reports state-of-the-art results plus open-source release, enabling rapid adoption and follow-on work. Its applications span agentic data generation, alignment, and automated ML pipelines, giving broad cross-field relevance. Paper 1 appears theoretically novel and potentially important for causal inference methodology, but its immediate applicability and breadth may be narrower and uptake slower compared to an empirically validated, deployable alignment/data-synthesis framework.

vs. Visual Graph Scaffolds for Structural Reasoning in Large Language Models

claude-opus-4.6 · 6/3/2026

Paper 1 makes a fundamental theoretical contribution to causal inference by formalizing the structure of do-calculus reasoning through derivation graphs, proving that at most four rule applications suffice, and showing how equivalent queries yield more efficient estimators. This advances core causal inference methodology with broad implications across statistics, epidemiology, economics, and AI. Paper 2 offers an interesting but more incremental contribution about using visual graphs to assist LLM reasoning, which is timely but narrower in scope and dependent on the rapidly evolving LLM landscape.

vs. scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact due to strong timeliness and broad real-world applicability in single-cell multi-omics, a rapidly growing field with many labs needing standardized evaluation. A comprehensive, open-source benchmark with datasets, metrics, and scenario analyses can become community infrastructure, shaping method development across genomics, ML, and bioinformatics. Paper 1 offers novel theoretical structure for do-calculus reasoning and could influence causal inference, but its audience is narrower and downstream adoption may be slower compared to a widely usable benchmark resource.