D3-Gym: Constructing Real-World Verifiable Environments for Data-Driven Discovery
Hanane Nour Moussa, Yifei Li, Zhuoyang Li, Yankai Yang, Cheng Tang, Tianshu Zhang, Nesreen K. Ahmed, Ali Payani
Abstract
Despite recent progress in language models and agents for scientific data-driven discovery, further advancing their capabilities is held back by the absence of verifiable environments representing real-world scientific tasks. To fill this gap, we introduce D3-Gym, the first automatically constructed dataset with verifiable environments for scientific Data-Driven Discovery. D3-Gym comprises (1) 565 tasks sourced from 239 real scientific repositories across four disciplines where (2) each task is equipped with a natural language instruction, an executable environment with pre-installed dependencies, input dataset and artifact previews, a reference code solution, and an automatically synthesized evaluation script. Rigorous evaluation of the quality of the verification signal in D3-Gym confirms that our evaluation scripts achieve 87.5% agreement with human-annotated gold standards and strong alignment in domain-specific evaluation logic, showing their scientific soundness. Further, training on trajectories sampled from D3-Gym yields consistent and substantial gains across Qwen3 models of varying sizes on ScienceAgentBench, boosting Qwen3-32B by 7.8 absolute points and substantially shrinking the gap with strong proprietary models. All D3-Gym artifacts (environments, creation workflow, trajectories, and models) can be found at https://github.com/OSU-NLP-Group/D3-Gym.
AI Impact Assessments
Scientific Impact Assessment: D3-Gym
1. Core Contribution
D3-Gym addresses a genuine infrastructure gap: the lack of automatically constructed, executable, and verifiable environments for scientific data-driven discovery tasks. The paper's main contributions are threefold: (1) an automated pipeline that transforms real scientific GitHub repositories into verifiable task environments, (2) a dataset of 565 tasks across four scientific disciplines with automatically synthesized evaluation scripts, and (3) demonstration that training on trajectories sampled from these environments yields substantial improvements in open-weight models on ScienceAgentBench.
The key technical novelty lies in the evaluation script generation pipeline, which uses a two-phase planning-then-coding approach to produce domain-specific evaluation scripts without human intervention. This is meaningfully harder than test generation in software engineering (where unit tests often exist) or ML benchmarks (where standard metrics like F1/RMSE suffice), because scientific tasks require domain-aware metrics, artifact-specific checks, and scientifically justified acceptance thresholds.
2. Methodological Rigor
The paper demonstrates strong methodological discipline. The multi-stage filtering pipeline (5,111 → 1,586 → 1,263 → 565 tasks) reflects genuine quality control rather than scale-at-all-costs. The validation of silver evaluation scripts against 50 human-annotated gold scripts (175 person-hours of annotation) is particularly well-designed, measuring both execution-based agreement (87.5% accuracy, 66.1% recall, 91.0% specificity) and evaluation logic alignment across three dimensions.
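The three execution-based agreement figures above are standard confusion-matrix quantities, with the gold script's verdict treated as ground truth and the silver script's verdict as the prediction. The sketch below shows how they could be computed. The names `AgreementCounts`, `count_agreement`, and `agreement_metrics` are hypothetical, and the verdict lists are toy values rather than data from the paper.

```python
from dataclasses import dataclass


@dataclass
class AgreementCounts:
    tp: int  # gold accepts, silver accepts
    fn: int  # gold accepts, silver rejects (silver too strict)
    tn: int  # gold rejects, silver rejects
    fp: int  # gold rejects, silver accepts (silver too lenient)


def count_agreement(gold, silver):
    """Tally the confusion matrix from paired pass/fail verdicts."""
    if len(gold) != len(silver):
        raise ValueError("gold and silver must have the same length")
    tp = sum(1 for g, s in zip(gold, silver) if g and s)
    fn = sum(1 for g, s in zip(gold, silver) if g and not s)
    tn = sum(1 for g, s in zip(gold, silver) if not g and not s)
    fp = sum(1 for g, s in zip(gold, silver) if not g and s)
    return AgreementCounts(tp=tp, fn=fn, tn=tn, fp=fp)


def _ratio(numerator, denominator):
    # Return NaN rather than raising when a class is absent.
    return numerator / denominator if denominator else float("nan")


def agreement_metrics(c):
    """Accuracy, recall, and specificity of silver verdicts against gold."""
    total = c.tp + c.fn + c.tn + c.fp
    return {
        "accuracy": _ratio(c.tp + c.tn, total),
        "recall": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
    }


# Toy verdicts for eight candidate solutions (illustrative only).
gold_verdicts = [True, True, True, False, False, False, False, False]
silver_verdicts = [True, True, False, False, False, False, False, True]

counts = count_agreement(gold_verdicts, silver_verdicts)
metrics = agreement_metrics(counts)
print(metrics)
```

Under this framing, low recall paired with high specificity indicates that the silver scripts err toward rejecting correct solutions rather than accepting incorrect ones.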
The ablation study cleanly isolates the contribution of each pipeline component: removing planning drops recall to 3.4%, removing dataset previews degrades specificity substantially, and removing code output reduces recall to 0%. This provides convincing evidence that the design choices are well-motivated.
The training experiments are conducted across multiple model sizes (4B through 32B) with both RFT-Distill and RFT-Self variants, three independent runs per setting, and reporting of both average and best-of-3 metrics. The distribution alignment analysis (NLL/PPL comparison for self vs. teacher trajectories) provides a principled explanation for why RFT-Self outperforms RFT-Distill at larger scales.
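For readers unfamiliar with the NLL/PPL comparison, the sketch below shows the usual relationship between the two: perplexity is the exponential of the mean per-token negative log-likelihood. It assumes natural-log token probabilities scored by the student model. The function names and the toy log-probabilities are hypothetical and do not reproduce the paper's measurements.

```python
import math


def mean_nll(token_logprobs):
    """Average negative log-likelihood per token (natural log)."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return -sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs):
    """Perplexity is exp of the mean per-token NLL."""
    return math.exp(mean_nll(token_logprobs))


# Toy example: the student assigns probability 0.5 to every token of a
# self-generated trajectory and 0.25 to every token of a teacher trajectory.
self_logprobs = [math.log(0.5)] * 4
teacher_logprobs = [math.log(0.25)] * 4

self_ppl = perplexity(self_logprobs)
teacher_ppl = perplexity(teacher_logprobs)
print(f"self PPL={self_ppl:.2f}, teacher PPL={teacher_ppl:.2f}")
```

A lower perplexity on self-generated trajectories than on teacher trajectories would indicate that the former lie closer to the student's own output distribution, which is the kind of evidence the distribution alignment analysis relies on.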
One methodological concern: the 66.1% recall means that roughly one-third (33.9%) of correct solutions are rejected by silver scripts. While the authors frame this as "mild strictness," the training signal is therefore biased: models trained on D3-Gym may learn to satisfy stricter-than-necessary criteria. The paper's analysis of this bias and its downstream effects is limited.
3. Potential Impact
Immediate impact: D3-Gym directly enables training open-weight models for scientific coding tasks, an area where proprietary models currently dominate. The 7.8-point SR@3 improvement for Qwen3-32B is substantial, bringing it within striking distance of o1-preview and Claude Sonnet 4.5.
Infrastructure value: The automated pipeline for constructing verifiable environments is arguably more impactful than the dataset itself. It provides a template for scaling scientific task environments as more repositories become available, and could be adapted to additional scientific disciplines beyond the four currently covered.
Training paradigm implications: The finding that RFT-Self matches or outperforms RFT-Distill at ≥14B scale, combined with the distribution alignment analysis, has implications beyond this specific benchmark — it suggests that for complex, domain-specific tasks, on-policy learning is particularly important.
Benchmark contribution: ScienceAgentBench-Verified, while secondary, addresses real issues in an important benchmark, improving evaluation fidelity for the community.
4. Timeliness & Relevance
This work is highly timely. The convergence of several trends — growing interest in AI for scientific discovery, the emergence of reasoning models, and the push to close the gap between open-weight and proprietary models — creates strong demand for exactly this type of infrastructure. The reference to Karpathy's Autoresearch (2026) highlights that the manual bottleneck in creating verifiable scientific environments is a widely recognized problem.
The focus on open-weight models is strategically important given growing concerns about reproducibility and transparency in scientific AI systems.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Comparison to prior art: The positioning against AutoSDT, SWE-Gym, and MLE-Dojo is well-articulated. The key differentiator — that scientific tasks require bespoke evaluation logic rather than pre-existing tests or universal metrics — is convincingly argued and demonstrated.
Overall Assessment
D3-Gym makes a solid and timely contribution by solving a genuine infrastructure problem for scientific AI. The automated evaluation script generation pipeline is the most novel component and is well-validated. The downstream training results are convincing and practically meaningful. While scale and discipline coverage remain limitations, the open release of the full pipeline positions this as a foundation for community-driven expansion.
Generated May 5, 2026
Comparison History (43)
Paper 1 has higher likely impact due to a broadly useful, verifiable benchmark infrastructure for real scientific data-driven discovery. Its main contribution (auto-constructed executable environments + evaluation scripts with demonstrated human agreement) directly enables rigorous, reproducible training/evaluation of scientific agents across disciplines, and is immediately reusable by the community. Methodological rigor is strong (agreement study, cross-model gains, open artifacts). Paper 2 is innovative (agent-facing interpretability, autoresearch loop) with promising results, but its LLM-graded metric and scope (tabular scikit-learn regressors) are narrower and may be more sensitive to model/metric drift.
D3-Gym introduces a novel, reusable benchmark infrastructure for scientific data-driven discovery with verifiable environments—a foundational contribution that can catalyze broad research across multiple scientific disciplines and AI for science. While OpenSeeker-v2 achieves impressive SOTA results for search agents via efficient SFT, its contribution is more incremental (data filtering/scaling improvements). D3-Gym addresses a more fundamental gap (lack of verifiable scientific task environments), has broader cross-disciplinary impact, and provides a generalizable framework that can benefit the entire AI-for-science community long-term.
Paper 1 likely has higher impact: it delivers a broadly useful, verifiable benchmark infrastructure for real scientific data-driven discovery with executable environments and auto-synthesized evaluators, addressing a key bottleneck (reliable evaluation) and enabling reproducible progress across many agent/LLM systems and disciplines. Its methodological rigor is supported by human-agreement validation of evaluation scripts and demonstrated training gains on external benchmarks. Paper 2 is novel and timely but hinges on an LLM-graded interpretability metric that may be less stable/standardizable, and its scope is narrower (tabular scikit-learn regressors) despite strong downstream gains.
D3-Gym addresses a fundamental infrastructure gap for scientific data-driven discovery by creating verifiable environments—a reusable benchmark that enables systematic evaluation and training of scientific AI agents. Its 565 tasks across 239 repositories and four disciplines provide broad, lasting utility. The demonstrated training gains on ScienceAgentBench show practical value. While OpenSeeker-v2 achieves impressive search agent results with efficient SFT, its contributions are more incremental (data filtering/scaling heuristics) and narrowly focused on web search agents. D3-Gym's benchmark-creation methodology and cross-disciplinary scope give it broader and more enduring scientific impact.
Paper 1 introduces foundational infrastructure (565 verifiable environments across four disciplines) for training and evaluating AI scientists, addressing a critical bottleneck in data-driven discovery. While Paper 2 achieves a significant milestone in mathematics by solving open problems, Paper 1 has broader applicability across multiple scientific fields and provides a reusable benchmark that will likely catalyze widespread follow-up research and model development.
Paper 1 demonstrates an AI system solving genuinely open, expert-contributed mathematical problems, representing a major milestone in AI-driven scientific discovery. While Paper 2 provides a valuable dataset and benchmark for training scientific agents, Paper 1's achievement of producing original, verified proofs for unsolved problems has profound and immediate implications for the future of mathematical research and AI capabilities.
Paper 2 has higher likely impact because it delivers a scalable, openly released benchmark+infrastructure (verifiable executable environments) that can standardize evaluation and accelerate progress across many labs and domains. Its methodological rigor is strengthened by human-agreement validation of evaluation scripts and demonstrated downstream training gains across model sizes. The breadth and timeliness are high given the current bottleneck in trustworthy agent evaluation for real scientific workflows. Paper 1 is potentially very high-impact if broadly validated, but from the abstract it appears more like a specific method whose generality and comparative rigor may be harder to establish.
Paper 2 likely has higher impact: it proposes a broadly applicable paradigm (multi-agent symbolic + metaheuristic equation discovery) targeting a core scientific bottleneck—recovering interpretable, extrapolatable governing laws—claiming dramatic extrapolation improvements and compression into human-interpretable forms. This is timely for explainable AI in science and could influence multiple fields (physics, biology, engineering). Paper 1 is methodologically solid and very useful infrastructure for benchmarking and training agents, but its impact is more indirect (enabling evaluation) and narrower than a general discovery mechanism if Paper 2’s claims hold.
MIMIC presents a fundamentally novel generative multimodal foundation model that unifies sequence, structure, regulation, evolution, and context for biomolecules—addressing a core challenge in computational biology. It demonstrates state-of-the-art results across multiple downstream tasks (splicing prediction, protein design, RNA editing) with direct therapeutic implications (HBB mutation correction, PD-L1 binding). Its breadth of impact spans genomics, transcriptomics, proteomics, and drug design. While D3-Gym is a valuable benchmark contribution for AI agents in scientific discovery, MIMIC's methodological innovation and potential for transformative real-world biological applications give it substantially higher scientific impact.
Paper 1 presents a foundation model for human physiology capable of simulating clinical interventions and predicting disease trajectories across multimodal data. Its potential to enable 'clinical digital twins' represents a transformative leap for personalized medicine and clinical trial design. While Paper 2 provides a valuable benchmark for AI agents, Paper 1's direct, life-saving applicability to human health, massive scale, and success in matching real-world RCT outcomes grant it vastly higher potential scientific and societal impact.
IatroBench addresses a critical and timely issue—AI safety measures causing iatrogenic harm through identity-contingent knowledge withholding—with a rigorous pre-registered methodology across frontier models. It reveals a fundamental tension in AI alignment (safety vs. omission harm) with immediate real-world clinical implications affecting vulnerable populations. The finding that safety measures systematically harm those who need help most, and that standard evaluation pipelines share the same blind spot, challenges core assumptions in AI safety. While D3-Gym is a valuable benchmark contribution for scientific discovery agents, IatroBench's findings have broader societal implications across AI policy, healthcare, and alignment research.
Paper 2 addresses a critical, potentially life-threatening flaw in current AI safety paradigms (omission harm in medical contexts). Its exposure of 'identity-contingent withholding' challenges standard alignment practices and has profound implications for AI ethics, medical AI, and policy, giving it broader societal and cross-disciplinary impact than the benchmarking tool presented in Paper 1.
Paper 1 addresses a fundamental epistemological question about whether AI scientific agents truly reason scientifically, revealing critical failures (evidence ignored 68% of the time, rare refutation-driven revision) across 25,000+ runs. This has profound implications for the trustworthiness of AI-generated scientific knowledge and challenges the field's reliance on outcome-based evaluation. Its findings affect all downstream work using LLM agents for science. Paper 2 is a solid benchmark/dataset contribution, but Paper 1's insights about the fundamental limitations of current approaches have broader, more transformative impact on how the community builds and evaluates scientific AI systems.
Paper 2 likely has higher scientific impact: it tackles a foundational, timely question about whether autonomous AI scientists follow epistemic norms, using large-scale evaluation (25k+ runs) across eight domains with both performance decomposition and behavioral/epistemic analysis. Its conclusions (outcome metrics miss failures; scaffold tweaks insufficient; reasoning must be trained) can reshape evaluation standards and research priorities across AI, scientific automation, and AI governance. Paper 1 is a strong enabling dataset/benchmark with practical utility, but its impact is more incremental and narrower to agent training/evaluation infrastructure.
MIMIC presents a multimodal foundation model for biomolecules, bridging sequence, structure, and cellular context. Its ability to perform state-of-the-art prediction, isoform-aware inference, and constrained biomolecular design holds transformative potential for computational biology, drug discovery, and genetic engineering. While Paper 1 provides a valuable benchmarking environment for AI agents, Paper 2 directly addresses fundamental biological challenges with broad, real-world applications in medicine and biotechnology, offering a much higher ceiling for paradigm-shifting scientific impact.
HealthFormer represents a paradigm-shifting contribution to precision medicine by creating a generative 'health world model' trained on deeply phenotyped longitudinal data from 15,000+ individuals across 667 measurements and seven physiological domains. Its ability to simulate clinical interventions in silico, validate against published RCTs (41/41 correct direction of effect), and outperform established clinical risk scores across 27/30 disease endpoints without task-specific training demonstrates extraordinary potential for clinical digital twins. While D3-Gym is a valuable benchmark for AI-driven scientific discovery, HealthFormer's direct medical applications, methodological novelty, and breadth of validated clinical utility give it substantially higher real-world impact.
Generative Structure Search addresses a fundamental bottleneck in molecular and materials discovery. By dramatically accelerating the search for metastable structures, it has direct, profound real-world applications in drug design and materials science (e.g., batteries, catalysts). While Paper 1 provides a valuable benchmarking tool for AI agents, Paper 2 offers a foundational methodological breakthrough with immediate, transformative potential across chemistry and physics.
Paper 1 likely has higher scientific impact due to stronger novelty (end-to-end autonomous discovery on a real physical platform) and a concrete, experimentally validated new physical mechanism (optical bilinear interaction) with clear downstream applications in optical computing hardware. Its breadth spans AI agents, experimental optics, and hardware acceleration, and it is highly timely given interest in autonomous labs and post-Transformer compute. Paper 2 is rigorous and valuable infrastructure for benchmarking and training, but primarily advances evaluation/engineering rather than producing a new scientific phenomenon.
Paper 2 demonstrates a fundamentally more impactful contribution: the first end-to-end autonomous scientific discovery system operating on a real physical platform that identifies and experimentally validates a previously unreported physical mechanism (optical bilinear interaction). This represents a paradigm shift in how scientific research can be conducted. While Paper 1 provides a valuable benchmark/training dataset for data-driven discovery agents, Paper 2 achieves what Paper 1's benchmarks aspire to measure—actual autonomous discovery with real-world experimental validation. The novelty, real-world demonstration, and cross-disciplinary implications (AI + optics + computing hardware) give Paper 2 substantially higher impact potential.
Paper 1 introduces a fundamentally novel framework (GSS) that unifies generative models with physics-based search for molecular and crystal structure prediction—a core challenge in materials science and chemistry. Its >10x efficiency gains and ability to discover diverse metastable structures outside training distributions represent a significant methodological advance with broad applications in drug discovery, materials design, and catalysis. Paper 2, while valuable as a benchmark for AI agents in scientific discovery, is more incremental—creating evaluation infrastructure rather than enabling new scientific capabilities. Paper 1's direct impact on physical sciences gives it higher long-term scientific significance.