Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection

Gautam Gare, John Galeotti, Michael Mozer, Deva Ramanan, Nan Rosemary Ke

Jun 2, 2026

arXiv:2606.03251v1

cs.AI (primary), cs.CV, cs.LG, eess.IV, stat.ML

#2357 of 3404 · Artificial Intelligence

Tournament Score: 1352±44 (scale 1050 to 1800)

Win Rate: 35%

Rating: 3.5/10

Significance 4 · Rigor 3 · Novelty 5 · Clarity 5.5

Abstract

In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experiments. For example, the COVID-19 pandemic was an intervention by the coronavirus on the sub-population infected with COVID. We ask, do natural experiments occur in existing real-world datasets? If yes, how should we treat them? To detect natural experiments in data, we use causal discovery to recover the underlying causal graph and perform feature selection based on causal links. If downstream performance improves by treating the data as interventional rather than observational, we argue that this suggests the dataset contains natural experiments. We first validate this hypothesis by simulating datasets with and without natural experiments using synthetic graphs. We then perform a systematic empirical evaluation on a large suite of real-world datasets. Our results indicate that real-world datasets do contain natural experiments and we can take advantage of those natural experiments to improve model performance using causal inference. Our work represents the initial foray into this area, offering a preliminary exploration within a limited scope.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

The paper poses an interesting conceptual question: do commonly used real-world ML datasets implicitly contain natural experiments (i.e., interventional data structures), and can recognizing this improve downstream predictive performance? The proposed detection mechanism is indirect: use the DCDI causal discovery algorithm to infer a causal graph under both observational and interventional (soft-known) configurations, extract Markov blanket features for classification, and if the interventional configuration yields better F1-scores, conclude that the dataset likely contains natural experiments. The authors validate on synthetic Sachs data and evaluate on 11 real-world tabular datasets, identifying 3 (Diabetes, Higgs-small, Credit-card-fraud) as potentially containing natural experiments.
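The feature-selection step described above keeps only the Markov blanket of the target node: its parents, its children, and the other parents of its children. The following is a minimal sketch of that extraction from a learned adjacency matrix. The function name `markov_blanket`, the edge convention, and the toy graph are illustrative assumptions, not the paper's code or DCDI's API.

```python
def markov_blanket(adj, target):
    """Return the Markov blanket of `target` in a directed acyclic graph.

    Assumed convention: adj[i][j] == 1 means there is an edge i -> j.
    The blanket is parents + children + other parents of the children.
    """
    n = len(adj)
    parents = {i for i in range(n) if adj[i][target]}
    children = {j for j in range(n) if adj[target][j]}
    spouses = {i for c in children for i in range(n) if adj[i][c]}
    return sorted((parents | children | spouses) - {target})


# Hypothetical 5-node graph: 0 -> 2, 1 -> 2, 2 -> 3, 4 -> 3.
adj = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
]

# Treat node 2 as the classification target; the returned indices are the
# features that would be passed to the downstream classifier.
blanket = markov_blanket(adj, target=2)
print(blanket)
```

Under this setup, the detection criterion reduces to running the selection twice, once on a graph learned in the observational configuration and once on a graph learned in the interventional configuration, and comparing the downstream F1-scores.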

2. Methodological Rigor

The methodology has several concerning aspects:

Circular reasoning in the detection criterion. The paper's core claim — that improved performance under interventional DCDI implies the presence of natural experiments — is a weak inferential link. Better F1-score under one algorithmic configuration versus another could arise from many factors (optimization landscape differences, hyperparameter sensitivity, regularization effects), not necessarily because the data genuinely contains interventions. The paper acknowledges DCDI is "highly sensitive to hyperparameter choices" (Limitations section), which undermines confidence that performance differences reflect genuine data structure rather than optimization artifacts.

Strong and potentially unjustified assumptions. The assumption that classification targets represent interventions (class-0 as observational, other classes as interventional) is acknowledged as "strong" but is the foundation of the entire approach. For many datasets (e.g., Higgs-small, where the label indicates particle physics events), this interpretation is strained. The COVID-19 and credit fraud examples are compelling, but the generalization to arbitrary classification labels is not well-justified.

Marginal performance differences. The improvements in Table 3 attributed to natural experiments are small: 81.00 vs. 80.26 (best non-causal) for Diabetes, 62.13 vs. 61.76 for Higgs-small, and 99.91 vs. 99.90 for Credit-card-fraud, i.e., gaps of 0.74, 0.37, and 0.01 points. Given the reported standard deviations, several of these differences may not be statistically significant, and the paper performs no significance tests comparing causal-sk against the baselines.
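To make the concern concrete, the sketch below computes the three gaps from the scores quoted above and shows one kind of test the paper could have reported: an exact paired sign-flip permutation test over per-seed scores. The choice of test and the helper name `paired_sign_flip_pvalue` are assumptions made for illustration, and the per-seed values are toy numbers, not results from the paper.

```python
from itertools import product

# Headline scores quoted in the review: (causal-sk, best competing baseline).
reported = {
    "Diabetes": (81.00, 80.26),
    "Higgs-small": (62.13, 61.76),
    "Credit-card-fraud": (99.91, 99.90),
}
gaps = {name: round(a - b, 2) for name, (a, b) in reported.items()}
print(gaps)


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided paired permutation test on the summed difference.

    Under the null hypothesis that both methods perform equally, the sign of
    each per-seed difference is arbitrary, so every sign pattern is enumerated.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    total = 0
    extreme = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total


# Toy per-seed scores (hypothetical, NOT taken from the paper).
toy_causal = [0.5, 0.75, 0.25, 1.0]
toy_baseline = [0.25, 0.5, 0.5, 0.5]
print(paired_sign_flip_pvalue(toy_causal, toy_baseline))
```

With only a handful of seeds, such a test has little power, which is itself a reason to report per-seed scores rather than only means and standard deviations.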

Synthetic validation is limited. The Sachs graph has only 11 nodes — extremely small compared to real causal systems. The synthetic experiments, while systematic across intervention types, don't convincingly demonstrate that the detection mechanism generalizes.

Fair comparison concerns. While the authors fix random seeds for classifier training, the DCDI hyperparameters are tuned per-dataset by optimizing validation F1-score, which means the causal discovery step itself is indirectly optimized for downstream performance — blurring the line between feature selection and hyperparameter search.

3. Potential Impact

The conceptual framing — questioning whether datasets are purely observational — is thought-provoking and could inspire more rigorous investigations. However, the practical impact is limited:

The method only works on tabular datasets with relatively few features (up to 50) due to DCDI's scalability constraints.

The improvements are marginal and inconsistent (only 3/11 datasets show improvement).

The approach requires substantial computational overhead (DCDI training, hyperparameter tuning) for modest gains.

The connection to the rich econometrics literature on natural experiments (instrumental variables, difference-in-differences, regression discontinuity) is superficial — the paper doesn't leverage any of these established frameworks.

4. Timeliness & Relevance

The paper touches on a timely intersection of causal inference and machine learning. The question of when and how to leverage causal structure in predictive modeling is genuinely important. However, the execution doesn't advance the state of the art in causal discovery, causal inference, or feature selection individually. The paper also doesn't engage deeply with related work on distribution shift, domain adaptation, or dataset shift, where interventional thinking has been more formally developed.

5. Strengths & Limitations

Strengths:

Novel and thought-provoking research question that bridges natural experiments (economics/social science) with ML practice.

Systematic experimental design covering multiple intervention types and data configurations on synthetic data.

Comprehensive comparison against both classical and causal feature selection baselines.

The Diabetes dataset case study (insulin feature exclusion) is a compelling qualitative finding, well-supported by the brute-force analysis and domain literature.

Honest framing as "preliminary exploration within a limited scope."

Limitations:

The detection criterion (performance improvement ↔ natural experiment) conflates algorithmic behavior with data properties, lacking formal justification.

No formal statistical testing of claimed improvements; effect sizes are very small.

The assumption that class labels represent interventions is too strong and domain-specific to generalize.

DCDI sensitivity to hyperparameters and frequent non-convergence/cyclic graph issues limit reproducibility and practical applicability.

No comparison with other interventional causal discovery methods beyond DCDI.

The paper doesn't provide a formal definition of what constitutes a natural experiment in this context, making the claim unfalsifiable.

Only 3/11 datasets support the hypothesis, which could also be consistent with random variation.

The connection to the Nobel Prize-winning work on natural experiments is invoked for motivation but not substantively leveraged.
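The point above that 3 of 11 datasets could be consistent with random variation can be checked with a binomial tail probability. The sketch below assumes, purely for illustration, that under the null hypothesis each dataset independently shows an apparent improvement with some fixed probability `p_null`. The paper gives no such rate, so several candidate values are tried.

```python
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Probability of seeing at least 3 "wins" out of 11 datasets by chance,
# for a few assumed per-dataset chance rates (all hypothetical).
for p_null in (0.1, 0.25, 0.5):
    print(p_null, round(binomial_tail(3, 11, p_null), 3))
```

How damaging the 3/11 count is depends entirely on the assumed chance rate, which is one more reason the paper would need a formal null model before claiming detection.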

Overall Assessment

This paper raises an interesting conceptual question but the evidence provided is insufficient to convincingly answer it. The indirect detection mechanism (improved F1 → natural experiment) lacks formal grounding, the empirical improvements are marginal and potentially within noise, and the strong assumptions limit generalizability. The work would benefit significantly from: (1) formal statistical tests, (2) datasets where ground-truth interventional structure is known, (3) engagement with formal causal inference frameworks for natural experiments, and (4) larger-scale validation.

Rating: 3.5/10

Significance 4 · Rigor 3 · Novelty 5 · Clarity 5.5

Generated Jun 3, 2026

Comparison History (20)

vs. Evaluating Agentic Configuration Repair for Computer Networks

gemini-3.1 · 6/6/2026

Paper 2 addresses a fundamental challenge in causal inference and data analysis, offering broad applicability across any scientific discipline that utilizes observational data. By bridging causal discovery with practical feature selection, it has the potential to fundamentally alter machine learning practices. While Paper 1 provides a valuable and highly practical application of LLM agents in networking, its impact is relatively confined to network management, making Paper 2's foundational contribution more widely impactful.

vs. REAL: Resolving Knowledge Conflicts in Knowledge-Intensive Visual Question Answering via Reasoning-Pivot Alignment

gpt-5.2 · 6/5/2026

Paper 2 likely has higher scientific impact: it introduces a concrete, novel framework (Reasoning-Pivot Alignment) plus a new dataset (REAL-VQA) and an integrated training/decoding pipeline with demonstrated gains across multiple KI-VQA benchmarks—strong real-world applicability for retrieval-augmented multimodal systems and timely given current LLM/VLM deployment. Paper 1 is conceptually interesting and broadly relevant to causal inference, but the proposed “performance improvement implies natural experiments” criterion may be less rigorous/identifiable and its impact depends heavily on causal discovery reliability on real datasets.

vs. An interpretable and trustworthy AI framework for large-scale longitudinal structure-pain association studies using data from the Osteoarthritis Initiative (OAI)

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact due to strong real-world clinical relevance (osteoarthritis imaging and pain trajectories), immediate applicability to large-scale longitudinal studies, and a rigorous, trustworthy methodology combining deep learning, conformal uncertainty quantification, and established longitudinal modeling. It uses a major public dataset (OAI) and yields concrete, interpretable clinical associations. Paper 1 is novel and broadly relevant conceptually, but its contribution is more exploratory/early-stage and hinges on indirect performance-based evidence for “natural experiments,” which may limit methodological strength and near-term adoption.

vs. Plan First, Judge Later, Run Better: A DMAIC-Inspired Agentic System for Industrial Anomaly Detection

gpt-5.2 · 6/5/2026

Paper 2 has higher potential scientific impact due to broader cross-domain relevance and conceptual novelty: it operationalizes the question of whether existing datasets embed implicit interventions (“natural experiments”) and proposes an empirical causal-feature-selection protocol to detect and exploit them. This can affect many fields using observational data (medicine, economics, social science, ML benchmarking) and is timely amid growing interest in causal inference and dataset auditing. Paper 1 is practically valuable for industrial anomaly detection, but its impact is narrower and more engineering/system-integration oriented, with novelty largely in workflow structuring and evaluation efficiency.

vs. HLL: Can Agents Cross Humanity's Last Line of Verification?

gpt-5.2 · 6/3/2026

Paper 2 is likely higher impact: it introduces a timely, concrete benchmark for multimodal agents in a high-stakes real-world boundary (CAPTCHA/anti-automation), with controlled stressors and trace-conditioned evaluation that can standardize progress across the field. Its applications span AI agents, HCI, web security, and deployment evaluation, and open-sourcing increases adoption. Paper 1 is interesting but more preliminary, relies on assumptions about causal discovery/feature selection as evidence of “natural experiments,” and may be harder to validate broadly; its impact may be narrower and methodologically less definitive.

vs. Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts

gpt-5.2 · 6/3/2026

Paper 2 has higher impact potential: it targets a timely, fast-growing area (LLM-generated formal proofs) and addresses a key bottleneck—turning compilable proofs into reusable library-quality artifacts—enabling practical downstream adoption in formalization workflows. Its process-guided, multi-phase agentic refactoring is novel relative to proxy-metric optimization and has clear real-world applicability in proof assistants and automated theorem proving. The evaluation on established Lean benchmarks and comparison to a strong baseline suggests reasonable rigor. Paper 1 is interesting but relies on indirect evidence for “natural experiments” and may have narrower, more method-sensitive impact.

vs. EvoBrain: Continual Learning of EEG Foundation Models Across Heterogeneous BCI Tasks

claude-opus-4.6 · 6/3/2026

EvoBrain addresses a significant and timely problem—building unified EEG foundation models with continual learning across heterogeneous BCI tasks. It introduces novel architectural components (NSN, RAD), demonstrates strong empirical results across six BCI tasks, and pioneers cross-task continual learning in EEG. This has clear real-world applications in brain-computer interfaces and broader impact connecting foundation models, continual learning, and neuroscience. Paper 1, while interesting, is more exploratory and preliminary in scope, as acknowledged by the authors, and the connection between natural experiments and causal feature selection, though novel, has narrower immediate impact.

vs. MobEvolve: An Agentic Self-Evolving Heuristic System for Interpretable Human Mobility Generation

gemini-3.1 · 6/3/2026

Paper 1 addresses a fundamental methodology in observational data analysis by leveraging causal discovery to identify natural experiments. Its findings have broad applicability across numerous fields that rely on observational data, such as healthcare, economics, and general machine learning. Paper 2 presents an innovative LLM-based framework, but its application is primarily restricted to the specific domain of human mobility and urban planning, giving it a narrower potential scientific impact.

vs. Repair Before Veto: Repair-Augmented Constraint Learning for Contextual Decisions

claude-opus-4.6 · 6/3/2026

Paper 2 addresses a fundamental question about the nature of real-world datasets and natural experiments, with broad implications across all empirical sciences. Its combination of causal discovery and feature selection to detect implicit interventions is novel and widely applicable. While Paper 1 (RACL) makes a solid contribution to constraint learning with practical applications, it addresses a narrower problem domain. Paper 2's potential to change how researchers treat observational data across multiple fields gives it broader scientific impact, despite Paper 1's stronger methodological specificity and impressive empirical results.

vs. SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

gemini-3.1 · 6/3/2026

Paper 2 addresses the highly timely and rapidly expanding field of multi-agent LLM ecosystems and self-improvement. It introduces a comprehensive evaluation framework applied across complex, relevant domains, offering nuanced insights into social versus self-evolution. In contrast, while Paper 1 presents an interesting application of causal inference, it explicitly notes that it is a preliminary exploration with limited scope, making its immediate breadth and magnitude of impact likely lower than Paper 2.

vs. AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification

gpt-5.2 · 6/3/2026

Paper 1 has higher likely scientific impact due to a more novel, concrete systems contribution: an executable symbolic environment tightly integrating taxonomy/XBRL graphs with deterministic verification tools plus a multi-agent audit workflow, yielding a large, ablation-supported gain and clear evidence that symbolic checking is essential. It targets a high-stakes, real-world domain (financial reporting compliance) with immediate applicability and potential industry adoption. Paper 2 poses an interesting question but offers a more indirect, weaker-impact criterion (“performance improves”) for detecting natural experiments and is framed as preliminary, making rigor and actionable novelty less compelling.

vs. Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks

gemini-3.1 · 6/3/2026

Paper 2 addresses a fundamental problem spanning multiple disciplines (machine learning, statistics, empirical sciences) by proposing a method to identify and leverage natural experiments in observational data. This has broader applicability and higher potential for transformative impact across all data-driven sciences compared to Paper 1, which, while highly timely and practical, is restricted to the specific niche of AI coding agents and software engineering workflows.

vs. LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

gemini-3.1 · 6/3/2026

Paper 2 presents a significant breakthrough in AI and formal mathematics, demonstrating state-of-the-art performance on highly challenging benchmarks (Putnam, IMO) and proving research-level utility. The advancement of LLMs in formal theorem proving is a major frontier with high timeliness and broad implications. In contrast, Paper 1 offers a preliminary exploration of natural experiments in datasets, which, while interesting, lacks the definitive, high-impact results and methodological breakthrough showcased in Paper 2.

vs. The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

gpt-5.2 · 6/3/2026

Paper 2 likely has higher scientific impact: it targets an urgent, widely relevant problem (LLM benchmark validity), identifies concrete failure modes (distribution shift, scale), and backs claims with broad, systematic evaluation (27 models, 335 runs) plus an open-source benchmark, enabling follow-on work. Its conclusions affect evaluation methodology across many LLM subfields and industry practice. Paper 1 is novel in probing “natural experiments” in datasets via causal feature selection, but appears more preliminary, with narrower immediate applications and heavier dependence on causal discovery assumptions that can limit robustness and adoption.

vs. The DeepSpeak-Agentic Dataset

gemini-3.1 · 6/3/2026

Paper 1 introduces a conceptually novel approach to identifying and leveraging natural experiments in standard observational datasets using causal discovery. This offers fundamental methodological advancements with broad impact across any discipline relying on machine learning and observational data. In contrast, while Paper 2 provides a timely and useful dataset for AI forensics and human-AI interaction, its scientific scope is narrower and primarily serves as a benchmark rather than proposing a foundational methodological shift. Thus, Paper 1 has higher potential for widespread, cross-disciplinary scientific impact.

vs. ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

gpt-5.2 · 6/3/2026

Paper 1 is more likely to have higher impact: it introduces a concrete, novel training framework (introspective redundancy identification + masked preference optimization) that directly targets a timely, widely observed LRM failure mode (overthinking/inefficient long CoTs) and demonstrates large efficiency gains (~56% token reduction) while preserving SOTA accuracy—high immediate practical value and broad relevance across LLM training and deployment. Paper 2 poses an interesting question but relies on stronger assumptions (causal discovery reliability) and frames results as preliminary/limited-scope, likely reducing near-term adoption and impact.

vs. Can AI Review Improve Paper Drafting? An Empirical Study on 20 Computer Architecture Submissions

claude-opus-4.6 · 6/3/2026

Paper 1 addresses a fundamental question in causal inference—whether real-world datasets contain natural experiments and how to exploit them—with rigorous methodology combining causal discovery, feature selection, and systematic evaluation on synthetic and real-world datasets. This has broad implications across many scientific fields that rely on observational data. Paper 2, while timely and practical, is a small-scale case study (20 papers) in a narrow domain with limited generalizability. Paper 1's contributions to causal inference methodology have deeper and broader potential scientific impact.

vs. EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

gemini-3.1 · 6/3/2026

Paper 2 addresses a critical bottleneck in the highly active field of LLM agentic reinforcement learning. By co-evolving policies and training harnesses, it presents a scalable, automated alternative to static human-engineered training recipes. Its strong empirical results on complex tasks like repository-level software engineering demonstrate immediate, high-impact real-world applications. In contrast, while Paper 1 presents a novel conceptual framework for causal feature selection, it self-identifies as a preliminary exploration with limited scope, making its near-term scientific impact likely lower than the timely advancements in LLM training offered by Paper 2.

vs. Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence

gemini-3.1 · 6/3/2026

Paper 2 presents a highly ambitious, mathematically rigorous category-theoretic framework for agentic AI in scientific discovery. While Paper 1 offers a broad and useful empirical approach to causal inference, Paper 2 tackles the foundational architecture of self-revising AI scientists. Given the explosive growth and transformative potential of AI-driven scientific discovery, establishing a robust formal framework for these systems yields a higher potential for deep, cross-disciplinary impact.

vs. GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

gpt-5.2 · 6/3/2026

Paper 2 has higher likely impact due to timeliness and broad relevance: rigorous, curriculum-grounded benchmarking of frontier LLMs addresses an urgent need across AI evaluation, education, and research governance. It contributes a reusable dataset/benchmark, multi-tier difficulty design, and careful human-vs-judge analysis revealing evaluation unreliability—actionable for the field. Paper 1 is conceptually interesting but relies on indirect evidence (performance gains) to infer natural experiments, which may conflate causal discovery errors and model/selection effects, limiting methodological rigor and near-term adoption.