D3-Gym: Constructing Real-World Verifiable Environments for Data-Driven Discovery

Hanane Nour Moussa, Yifei Li, Zhuoyang Li, Yankai Yang, Cheng Tang, Tianshu Zhang, Nesreen K. Ahmed, Ali Payani

#120 of 2292 · Artificial Intelligence
Tournament Score: 1535±35
Win Rate: 56% (24 wins, 19 losses, 43 matches)
Rating: 7.2 / 10

Abstract

Despite recent progress in language models and agents for scientific data-driven discovery, further advancing their capabilities is held back by the absence of verifiable environments representing real-world scientific tasks. To fill this gap, we introduce D3-Gym, the first automatically constructed dataset with verifiable environments for scientific Data-Driven Discovery. D3-Gym comprises (1) 565 tasks sourced from 239 real scientific repositories across four disciplines where (2) each task is equipped with a natural language instruction, an executable environment with pre-installed dependencies, input dataset and artifact previews, a reference code solution, and an automatically synthesized evaluation script. Rigorous evaluation of the quality of the verification signal in D3-Gym confirms that our evaluation scripts achieve 87.5% agreement with human-annotated gold standards and strong alignment in domain-specific evaluation logic, showing their scientific soundness. Further, training on trajectories sampled from D3-Gym yields consistent and substantial gains across Qwen3 models of varying sizes on ScienceAgentBench, boosting Qwen3-32B by 7.8 absolute points and substantially shrinking the gap with strong proprietary models. All D3-Gym artifacts (environments, creation workflow, trajectories, and models) can be found at https://github.com/OSU-NLP-Group/D3-Gym.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: D3-Gym

1. Core Contribution

D3-Gym addresses a genuine infrastructure gap: the lack of automatically constructed, executable, and verifiable environments for scientific data-driven discovery tasks. The paper's main contributions are threefold: (1) an automated pipeline that transforms real scientific GitHub repositories into verifiable task environments, (2) a dataset of 565 tasks across four scientific disciplines with automatically synthesized evaluation scripts, and (3) demonstration that training on trajectories sampled from these environments yields substantial improvements in open-weight models on ScienceAgentBench.

The key technical novelty lies in the evaluation script generation pipeline, which uses a two-phase planning-then-coding approach to produce domain-specific evaluation scripts without human intervention. This is meaningfully harder than test generation in software engineering (where unit tests often exist) or ML benchmarks (where standard metrics like F1/RMSE suffice), because scientific tasks require domain-aware metrics, artifact-specific checks, and scientifically justified acceptance thresholds.
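The two-phase idea can be sketched as below. This is a minimal illustration, not the authors' pipeline: the function name, the prompt wording, and the `llm` callable are all hypothetical, and the three inputs are only inferred from the components named in the ablation discussed in this review (planning, dataset previews, code output).

```python
from typing import Callable, Dict


def synthesize_eval_script(
    instruction: str,
    reference_output: str,
    data_preview: str,
    llm: Callable[[str], str],
) -> Dict[str, str]:
    """Hypothetical planning-then-coding sketch: plan first, then implement the plan."""
    # Phase 1 (planning): ask for metrics, artifact checks, and thresholds in prose.
    plan = llm(
        "PLAN\n"
        f"Task instruction:\n{instruction}\n"
        f"Reference solution output:\n{reference_output}\n"
        f"Dataset preview:\n{data_preview}\n"
        "Describe how a submitted solution should be evaluated."
    )
    # Phase 2 (coding): ask for a script that implements exactly that plan.
    script = llm(
        "CODE\n"
        f"Evaluation plan:\n{plan}\n"
        "Write a Python evaluation script that implements this plan."
    )
    return {"plan": plan, "script": script}


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if prompt.startswith("PLAN"):
        return "Compare the predicted CSV to the reference within a tolerance."
    return "print('evaluation script placeholder')"


result = synthesize_eval_script(
    instruction="Predict solubility for each molecule.",
    reference_output="pred.csv with 100 rows",
    data_preview="smiles,solubility",
    llm=stub_llm,
)
```

The point of the split is that the second call sees an explicit plan rather than having to decide metrics and thresholds while writing code.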

2. Methodological Rigor

The paper demonstrates strong methodological discipline. The multi-stage filtering pipeline (5,111 → 1,586 → 1,263 → 565 tasks) reflects genuine quality control rather than scale-at-all-costs. The validation of silver evaluation scripts against 50 human-annotated gold scripts (175 person-hours of annotation) is particularly well-designed, measuring both execution-based agreement (87.5% accuracy, 66.1% recall, 91.0% specificity) and evaluation logic alignment across three dimensions.
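For readers unfamiliar with the three execution-based agreement numbers, the sketch below shows how they are conventionally computed when the gold script's verdict is treated as ground truth and the silver script's verdict as the prediction. The verdict lists are toy values chosen for illustration, not data from the paper, and the function name is an assumption.

```python
from typing import Dict, List


def agreement_metrics(gold: List[bool], silver: List[bool]) -> Dict[str, float]:
    """Accuracy, recall, and specificity of silver verdicts against gold verdicts."""
    if len(gold) != len(silver) or not gold:
        raise ValueError("verdict lists must be non-empty and the same length")
    tp = sum(g and s for g, s in zip(gold, silver))            # both accept
    tn = sum((not g) and (not s) for g, s in zip(gold, silver))  # both reject
    fp = sum((not g) and s for g, s in zip(gold, silver))      # silver too lenient
    fn = sum(g and (not s) for g, s in zip(gold, silver))      # silver too strict
    return {
        "accuracy": (tp + tn) / len(gold),
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy verdicts (True = solution accepted); NOT the paper's data.
gold_verdicts = [True, True, True, False, False, False, False, False]
silver_verdicts = [True, True, False, False, False, False, False, True]
metrics = agreement_metrics(gold_verdicts, silver_verdicts)
```

Under this reading, low recall with high specificity means the silver scripts err toward rejecting solutions that the gold scripts would accept.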

The ablation study cleanly isolates the contribution of each pipeline component: removing planning drops recall to 3.4%, removing dataset previews degrades specificity substantially, and removing code output reduces recall to 0%. This provides convincing evidence that the design choices are well-motivated.

The training experiments are conducted across multiple model sizes (4B through 32B) with both RFT-Distill and RFT-Self variants, three independent runs per setting, and reporting of both average and best-of-3 metrics. The distribution alignment analysis (NLL/PPL comparison for self vs. teacher trajectories) provides a principled explanation for why RFT-Self outperforms RFT-Distill at larger scales.
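As a reminder of what an NLL/PPL comparison measures, the sketch below computes mean negative log-likelihood and perplexity from per-token log-probabilities. The log-probability values are made up for illustration, and how the paper obtains them is not specified in this review; lower perplexity under the student model indicates trajectories closer to its own distribution.

```python
import math
from typing import List


def mean_nll(token_logprobs: List[float]) -> float:
    """Average negative log-likelihood per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity is the exponential of the mean per-token NLL."""
    return math.exp(mean_nll(token_logprobs))


# Hypothetical log-probs the student model assigns to two trajectories.
self_logprobs = [-0.1, -0.2, -0.3]      # student's own sample
teacher_logprobs = [-1.0, -2.0, -3.0]   # sample from a different (teacher) model

ppl_self = perplexity(self_logprobs)
ppl_teacher = perplexity(teacher_logprobs)
```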

One methodological concern: the 66.1% recall means roughly one-third of correct solutions are rejected by silver scripts. While the authors frame this as "mild strictness," it means the training signal is biased — models trained on D3-Gym may learn to satisfy stricter-than-necessary criteria. The analysis of this bias and its downstream effects is somewhat limited.
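The mechanism behind this concern can be illustrated with a toy filter, assuming RFT here denotes rejection-sampling fine-tuning in which only verifier-accepted trajectories enter the training set. The numeric task, tolerances, and function names below are hypothetical; only the 66.1% recall figure comes from the text above.

```python
from typing import Callable, List


def rejection_filter(outputs: List[float], accepts: Callable[[float], bool]) -> List[float]:
    """Keep only the sampled outputs that the verifier accepts."""
    return [y for y in outputs if accepts(y)]


REFERENCE = 1.0  # hypothetical reference value for a toy numeric task


def strict_verifier(y: float) -> bool:
    # Stricter than necessary: demands near-exact agreement with the reference.
    return abs(y - REFERENCE) <= 1e-6


def lenient_verifier(y: float) -> bool:
    # A scientifically reasonable tolerance for this toy task (an assumption).
    return abs(y - REFERENCE) <= 1e-2


sampled = [1.0, 1.004, 0.5]  # the second value is valid but not exact
kept_strict = rejection_filter(sampled, strict_verifier)
kept_lenient = rejection_filter(sampled, lenient_verifier)

# With recall 0.661, out of every 100 correct trajectories the expected
# number discarded by the silver scripts is 100 * (1 - 0.661).
recall = 0.661
expected_rejected_per_100 = round(100 * (1 - recall), 1)
```

The strict verifier drops the valid-but-inexact sample, so the resulting training set over-represents solutions that match the reference closely.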

3. Potential Impact

Immediate impact: D3-Gym directly enables training open-weight models for scientific coding tasks, an area where proprietary models currently dominate. The 7.8-point SR@3 improvement for Qwen3-32B is substantial, bringing it within striking distance of o1-preview and Claude Sonnet 4.5.

Infrastructure value: The automated pipeline for constructing verifiable environments is arguably more impactful than the dataset itself. It provides a template for scaling scientific task environments as more repositories become available, and could be adapted to additional scientific disciplines beyond the four currently covered.

Training paradigm implications: The finding that RFT-Self matches or outperforms RFT-Distill at ≥14B scale, combined with the distribution alignment analysis, has implications beyond this specific benchmark — it suggests that for complex, domain-specific tasks, on-policy learning is particularly important.

Benchmark contribution: ScienceAgentBench-Verified, while secondary, addresses real issues in an important benchmark, improving evaluation fidelity for the community.

4. Timeliness & Relevance

This work is highly timely. The convergence of several trends — growing interest in AI for scientific discovery, the emergence of reasoning models, and the push to close the gap between open-weight and proprietary models — creates strong demand for exactly this type of infrastructure. The reference to Karpathy's Autoresearch (2026) highlights that the manual bottleneck in creating verifiable scientific environments is a widely recognized problem.

The focus on open-weight models is strategically important given growing concerns about reproducibility and transparency in scientific AI systems.

5. Strengths & Limitations

Key Strengths:

  • The planning-then-coding decomposition for evaluation script generation is elegant and well-validated through ablations
  • Genuine scientific diversity: domain-specific packages (rdkit, ase, pysam, geopandas), diverse data modalities, and multi-faceted task types
  • The detailed task examples (Appendix D.1) demonstrate authentic scientific complexity — these are not toy tasks
  • Comprehensive error analysis (Figure 5b) provides actionable insights about remaining failure modes
  • Full artifact release including environments, pipeline, trajectories, and models

Notable Limitations:

  • Scale remains modest (565 tasks) compared to SE environments like SWE-Gym
  • The silver script bias toward strictness means the verification signal is systematically conservative, potentially filtering out valid but unconventional solution approaches
  • Only four disciplines are covered; extension to experimental sciences, physics, or social sciences is untested
  • The reliance on Claude Sonnet 4.5 for evaluation script generation means the pipeline quality is coupled to a specific proprietary model
  • No RL experiments are conducted, despite the paper framing D3-Gym as suitable for RL training — this is acknowledged as future work but represents a missed opportunity to demonstrate the full potential of verifiable environments
  • The cost (~$1,700 for 565 tasks) is reasonable but not trivial, and scaling analysis of the pipeline cost is absent

Comparison to prior art: The positioning against AutoSDT, SWE-Gym, and MLE-Dojo is well-articulated. The key differentiator, that scientific tasks require bespoke evaluation logic rather than pre-existing tests or universal metrics, is convincingly argued and demonstrated.

    Overall Assessment

    D3-Gym makes a solid and timely contribution by solving a genuine infrastructure problem for scientific AI. The automated evaluation script generation pipeline is the most novel component and is well-validated. The downstream training results are convincing and practically meaningful. While scale and discipline coverage remain limitations, the open release of the full pipeline positions this as a foundation for community-driven expansion.

    Rating: 7.2 / 10
    Significance 7.5 · Rigor 7.5 · Novelty 6.8 · Clarity 8

    Generated May 5, 2026

    Comparison History (43)

    vs. Agentic-imodels: Evolving agentic interpretability tools via autoresearch
    gpt-5.2 · 5/6/2026

    Paper 1 has higher likely impact due to a broadly useful, verifiable benchmark infrastructure for real scientific data-driven discovery. Its main contribution (auto-constructed executable environments + evaluation scripts with demonstrated human agreement) directly enables rigorous, reproducible training/evaluation of scientific agents across disciplines, and is immediately reusable by the community. Methodological rigor is strong (agreement study, cross-model gains, open artifacts). Paper 2 is innovative (agent-facing interpretability, autoresearch loop) with promising results, but its LLM-graded metric and scope (tabular scikit-learn regressors) are narrower and may be more sensitive to model/metric drift.

    vs. OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
    claude-opus-4.6 · 5/6/2026

    D3-Gym introduces a novel, reusable benchmark infrastructure for scientific data-driven discovery with verifiable environments—a foundational contribution that can catalyze broad research across multiple scientific disciplines and AI for science. While OpenSeeker-v2 achieves impressive SOTA results for search agents via efficient SFT, its contribution is more incremental (data filtering/scaling improvements). D3-Gym addresses a more fundamental gap (lack of verifiable scientific task environments), has broader cross-disciplinary impact, and provides a generalizable framework that can benefit the entire AI-for-science community long-term.

    vs. Agentic-imodels: Evolving agentic interpretability tools via autoresearch
    gpt-5.2 · 5/6/2026

    Paper 1 likely has higher impact: it delivers a broadly useful, verifiable benchmark infrastructure for real scientific data-driven discovery with executable environments and auto-synthesized evaluators, addressing a key bottleneck (reliable evaluation) and enabling reproducible progress across many agent/LLM systems and disciplines. Its methodological rigor is supported by human-agreement validation of evaluation scripts and demonstrated training gains on external benchmarks. Paper 2 is novel and timely but hinges on an LLM-graded interpretability metric that may be less stable/standardizable, and its scope is narrower (tabular scikit-learn regressors) despite strong downstream gains.

    vs. OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
    claude-opus-4.6 · 5/6/2026

    D3-Gym addresses a fundamental infrastructure gap for scientific data-driven discovery by creating verifiable environments—a reusable benchmark that enables systematic evaluation and training of scientific AI agents. Its 565 tasks across 239 repositories and four disciplines provide broad, lasting utility. The demonstrated training gains on ScienceAgentBench show practical value. While OpenSeeker-v2 achieves impressive search agent results with efficient SFT, its contributions are more incremental (data filtering/scaling heuristics) and narrowly focused on web search agents. D3-Gym's benchmark-creation methodology and cross-disciplinary scope give it broader and more enduring scientific impact.

    vs. QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems
    gemini-3 · 5/5/2026

    Paper 1 introduces foundational infrastructure (565 verifiable environments across four disciplines) for training and evaluating AI scientists, addressing a critical bottleneck in data-driven discovery. While Paper 2 achieves a significant milestone in mathematics by solving open problems, Paper 1 has broader applicability across multiple scientific fields and provides a reusable benchmark that will likely catalyze widespread follow-up research and model development.

    vs. QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems
    gemini-3 · 5/5/2026

    Paper 1 demonstrates an AI system solving genuinely open, expert-contributed mathematical problems, representing a major milestone in AI-driven scientific discovery. While Paper 2 provides a valuable dataset and benchmark for training scientific agents, Paper 1's achievement of producing original, verified proofs for unsolved problems has profound and immediate implications for the future of mathematical research and AI capabilities.

    vs. Machine Collective Intelligence for Explainable Scientific Discovery
    gpt-5.2 · 5/5/2026

    Paper 2 has higher likely impact because it delivers a scalable, openly released benchmark+infrastructure (verifiable executable environments) that can standardize evaluation and accelerate progress across many labs and domains. Its methodological rigor is strengthened by human-agreement validation of evaluation scripts and demonstrated downstream training gains across model sizes. The breadth and timeliness are high given the current bottleneck in trustworthy agent evaluation for real scientific workflows. Paper 1 is potentially very high-impact if broadly validated, but from the abstract it appears more like a specific method whose generality and comparative rigor may be harder to establish.

    vs. Machine Collective Intelligence for Explainable Scientific Discovery
    gpt-5.2 · 5/5/2026

    Paper 2 likely has higher impact: it proposes a broadly applicable paradigm (multi-agent symbolic + metaheuristic equation discovery) targeting a core scientific bottleneck—recovering interpretable, extrapolatable governing laws—claiming dramatic extrapolation improvements and compression into human-interpretable forms. This is timely for explainable AI in science and could influence multiple fields (physics, biology, engineering). Paper 1 is methodologically solid and very useful infrastructure for benchmarking and training agents, but its impact is more indirect (enabling evaluation) and narrower than a general discovery mechanism if Paper 2’s claims hold.

    vs. MIMIC: A Generative Multimodal Foundation Model for Biomolecules
    claude-opus-4.6 · 5/5/2026

    MIMIC presents a fundamentally novel generative multimodal foundation model that unifies sequence, structure, regulation, evolution, and context for biomolecules—addressing a core challenge in computational biology. It demonstrates state-of-the-art results across multiple downstream tasks (splicing prediction, protein design, RNA editing) with direct therapeutic implications (HBB mutation correction, PD-L1 binding). Its breadth of impact spans genomics, transcriptomics, proteomics, and drug design. While D3-Gym is a valuable benchmark contribution for AI agents in scientific discovery, MIMIC's methodological innovation and potential for transformative real-world biological applications give it substantially higher scientific impact.

    vs. Simulating clinical interventions with a generative multimodal model of human physiology
    gemini-3 · 5/5/2026

    Paper 1 presents a foundation model for human physiology capable of simulating clinical interventions and predicting disease trajectories across multimodal data. Its potential to enable 'clinical digital twins' represents a transformative leap for personalized medicine and clinical trial design. While Paper 2 provides a valuable benchmark for AI agents, Paper 1's direct, life-saving applicability to human health, massive scale, and success in matching real-world RCT outcomes grant it vastly higher potential scientific and societal impact.

    vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
    claude-opus-4.6 · 5/5/2026

    IatroBench addresses a critical and timely issue—AI safety measures causing iatrogenic harm through identity-contingent knowledge withholding—with a rigorous pre-registered methodology across frontier models. It reveals a fundamental tension in AI alignment (safety vs. omission harm) with immediate real-world clinical implications affecting vulnerable populations. The finding that safety measures systematically harm those who need help most, and that standard evaluation pipelines share the same blind spot, challenges core assumptions in AI safety. While D3-Gym is a valuable benchmark contribution for scientific discovery agents, IatroBench's findings have broader societal implications across AI policy, healthcare, and alignment research.

    vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
    gemini-3 · 5/5/2026

    Paper 2 addresses a critical, potentially life-threatening flaw in current AI safety paradigms (omission harm in medical contexts). Its exposure of 'identity-contingent withholding' challenges standard alignment practices and has profound implications for AI ethics, medical AI, and policy, giving it broader societal and cross-disciplinary impact than the benchmarking tool presented in Paper 1.

    vs. AI scientists produce results without reasoning scientifically
    claude-opus-4.6 · 5/5/2026

    Paper 1 addresses a fundamental epistemological question about whether AI scientific agents truly reason scientifically, revealing critical failures (evidence ignored 68% of the time, rare refutation-driven revision) across 25,000+ runs. This has profound implications for the trustworthiness of AI-generated scientific knowledge and challenges the field's reliance on outcome-based evaluation. Its findings affect all downstream work using LLM agents for science. Paper 2 is a solid benchmark/dataset contribution, but Paper 1's insights about the fundamental limitations of current approaches have broader, more transformative impact on how the community builds and evaluates scientific AI systems.

    vs. AI scientists produce results without reasoning scientifically
    gpt-5.2 · 5/5/2026

    Paper 2 likely has higher scientific impact: it tackles a foundational, timely question about whether autonomous AI scientists follow epistemic norms, using large-scale evaluation (25k+ runs) across eight domains with both performance decomposition and behavioral/epistemic analysis. Its conclusions (outcome metrics miss failures; scaffold tweaks insufficient; reasoning must be trained) can reshape evaluation standards and research priorities across AI, scientific automation, and AI governance. Paper 1 is a strong enabling dataset/benchmark with practical utility, but its impact is more incremental and narrower to agent training/evaluation infrastructure.

    vs. MIMIC: A Generative Multimodal Foundation Model for Biomolecules
    gemini-3 · 5/5/2026

    MIMIC presents a multimodal foundation model for biomolecules, bridging sequence, structure, and cellular context. Its ability to perform state-of-the-art prediction, isoform-aware inference, and constrained biomolecular design holds transformative potential for computational biology, drug discovery, and genetic engineering. While Paper 1 provides a valuable benchmarking environment for AI agents, Paper 2 directly addresses fundamental biological challenges with broad, real-world applications in medicine and biotechnology, offering a much higher ceiling for paradigm-shifting scientific impact.

    vs. Simulating clinical interventions with a generative multimodal model of human physiology
    claude-opus-4.6 · 5/5/2026

    HealthFormer represents a paradigm-shifting contribution to precision medicine by creating a generative 'health world model' trained on deeply phenotyped longitudinal data from 15,000+ individuals across 667 measurements and seven physiological domains. Its ability to simulate clinical interventions in silico, validate against published RCTs (41/41 correct direction of effect), and outperform established clinical risk scores across 27/30 disease endpoints without task-specific training demonstrates extraordinary potential for clinical digital twins. While D3-Gym is a valuable benchmark for AI-driven scientific discovery, HealthFormer's direct medical applications, methodological novelty, and breadth of validated clinical utility give it substantially higher real-world impact.

    vs. Generative structure search for efficient and diverse discovery of molecular and crystal structures
    gemini-3 · 5/5/2026

    Generative Structure Search addresses a fundamental bottleneck in molecular and materials discovery. By dramatically accelerating the search for metastable structures, it has direct, profound real-world applications in drug design and materials science (e.g., batteries, catalysts). While Paper 1 provides a valuable benchmarking tool for AI agents, Paper 2 offers a foundational methodological breakthrough with immediate, transformative potential across chemistry and physics.

    vs. End-to-end autonomous scientific discovery on a real optical platform
    gpt-5.2 · 5/5/2026

    Paper 1 likely has higher scientific impact due to stronger novelty (end-to-end autonomous discovery on a real physical platform) and a concrete, experimentally validated new physical mechanism (optical bilinear interaction) with clear downstream applications in optical computing hardware. Its breadth spans AI agents, experimental optics, and hardware acceleration, and it is highly timely given interest in autonomous labs and post-Transformer compute. Paper 2 is rigorous and valuable infrastructure for benchmarking and training, but primarily advances evaluation/engineering rather than producing a new scientific phenomenon.

    vs. End-to-end autonomous scientific discovery on a real optical platform
    claude-opus-4.6 · 5/5/2026

    Paper 2 demonstrates a fundamentally more impactful contribution: the first end-to-end autonomous scientific discovery system operating on a real physical platform that identifies and experimentally validates a previously unreported physical mechanism (optical bilinear interaction). This represents a paradigm shift in how scientific research can be conducted. While Paper 1 provides a valuable benchmark/training dataset for data-driven discovery agents, Paper 2 achieves what Paper 1's benchmarks aspire to measure—actual autonomous discovery with real-world experimental validation. The novelty, real-world demonstration, and cross-disciplinary implications (AI + optics + computing hardware) give Paper 2 substantially higher impact potential.

    vs. Generative structure search for efficient and diverse discovery of molecular and crystal structures
    claude-opus-4.6 · 5/5/2026

    Paper 1 introduces a fundamentally novel framework (GSS) that unifies generative models with physics-based search for molecular and crystal structure prediction—a core challenge in materials science and chemistry. Its >10x efficiency gains and ability to discover diverse metastable structures outside training distributions represent a significant methodological advance with broad applications in drug discovery, materials design, and catalysis. Paper 2, while valuable as a benchmark for AI agents in scientific discovery, is more incremental—creating evaluation infrastructure rather than enabling new scientific capabilities. Paper 1's direct impact on physical sciences gives it higher long-term scientific significance.