D3-Gym: Constructing Real-World Verifiable Environments for Data-Driven Discovery
Hanane Nour Moussa, Yifei Li, Zhuoyang Li, Yankai Yang, Cheng Tang, Tianshu Zhang, Nesreen K. Ahmed, Ali Payani
Abstract
Despite recent progress in language models and agents for scientific data-driven discovery, further advancing their capabilities is held back by the absence of verifiable environments representing real-world scientific tasks. To fill this gap, we introduce D3-Gym, the first automatically constructed dataset with verifiable environments for scientific Data-Driven Discovery. D3-Gym comprises (1) 565 tasks sourced from 239 real scientific repositories across four disciplines where (2) each task is equipped with a natural language instruction, an executable environment with pre-installed dependencies, input dataset and artifact previews, a reference code solution, and an automatically synthesized evaluation script. Rigorous evaluation of the quality of the verification signal in D3-Gym confirms that our evaluation scripts achieve 87.5% agreement with human-annotated gold standards and strong alignment in domain-specific evaluation logic, showing their scientific soundness. Further, training on trajectories sampled from D3-Gym yields consistent and substantial gains across Qwen3 models of varying sizes on ScienceAgentBench, boosting Qwen3-32B by 7.8 absolute points and substantially shrinking the gap with strong proprietary models. All D3-Gym artifacts (environments, creation workflow, trajectories, and models) can be found at https://github.com/OSU-NLP-Group/D3-Gym.
AI Impact Assessments (3 models)
Scientific Impact Assessment: D3-Gym
1. Core Contribution
D3-Gym addresses a genuine infrastructure gap in AI for scientific discovery: the lack of verifiable, executable environments for training and evaluating language models on real-world data-driven scientific tasks. The paper's core contributions are threefold: (1) an automated pipeline for constructing executable environments from scientific GitHub repositories, (2) a two-phase (planning-then-coding) approach for synthesizing domain-specific evaluation scripts, and (3) demonstration that training on trajectories sampled from these environments yields substantial improvements in open-weight models.
The key technical novelty lies in the evaluation script generation methodology. Unlike software engineering where unit tests exist naturally, scientific tasks require domain-specific evaluation logic—appropriate metrics, thresholds, and artifact inspection—that must be reasoned about from scratch. The decomposition into planning (reasoning about scientific validity) and coding (implementing the evaluation) is well-motivated and empirically validated through ablation studies.
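To make the three ingredients of domain-specific evaluation logic (metric, threshold, artifact inspection) concrete, here is a minimal sketch of what a synthesized evaluation script could look like. It is not taken from D3-Gym: the function names, the CSV layout, the `value` column, the choice of RMSE, and the 0.1 threshold are all assumptions made for illustration.

```python
import csv
import math
from pathlib import Path


def read_column(path, column):
    """Read one numeric column from a CSV file (hypothetical artifact layout)."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found in {path}")
        return [float(row[column]) for row in reader]


def evaluate(pred_path, ref_path, column="value", rmse_threshold=0.1):
    """Return (passed, message) for a predicted artifact against a reference.

    The metric (RMSE) and the threshold are illustrative assumptions; a real
    script would pick them per task from domain reasoning.
    """
    # Artifact inspection: the program must have produced the expected file.
    if not Path(pred_path).is_file():
        return False, "missing output artifact"
    try:
        pred = read_column(pred_path, column)
    except (KeyError, ValueError) as exc:
        return False, f"malformed output: {exc}"

    ref = read_column(ref_path, column)
    if not ref or len(pred) != len(ref):
        return False, "row count mismatch"

    # Metric choice: root-mean-square error between prediction and reference.
    rmse = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))

    # Threshold: pass only if the error is within the assumed tolerance.
    return rmse <= rmse_threshold, f"rmse={rmse:.4f}"
```

In this framing, the planning phase would settle on the metric, the tolerance, and which artifact to inspect, and the coding phase would turn that plan into a script along these lines.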
2. Methodological Rigor
The paper demonstrates strong methodological discipline in several ways:
Quality validation is thorough. The 87.5% agreement between silver and gold evaluation scripts, validated on 50 human-annotated tasks requiring 175 person-hours, is a credible assessment. The multidimensional evaluation—execution-based (accuracy, recall, specificity) and logic-based (metric choice, threshold, artifact)—provides a nuanced picture rather than a single aggregate number. The identification that silver scripts tend toward mild strictness (66.1% recall, 91.0% specificity) is honest and well-characterized.
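For readers less familiar with the execution-based metrics, the sketch below shows how accuracy, recall, and specificity can be computed when the gold (human-annotated) script's pass/fail verdict is treated as ground truth and the silver (synthesized) script's verdict as the prediction. The function name and the boolean-list input format are assumptions for illustration, not the paper's code.

```python
from typing import Dict, Sequence


def agreement_metrics(gold: Sequence[bool], silver: Sequence[bool]) -> Dict[str, float]:
    """Compare silver verdicts against gold verdicts (True means "pass")."""
    if len(gold) != len(silver):
        raise ValueError("gold and silver must have the same length")

    tp = sum(g and s for g, s in zip(gold, silver))              # both pass
    tn = sum((not g) and (not s) for g, s in zip(gold, silver))  # both fail
    fp = sum((not g) and s for g, s in zip(gold, silver))        # silver too lenient
    fn = sum(g and (not s) for g, s in zip(gold, silver))        # silver too strict

    total = tp + tn + fp + fn
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total if total else nan,
        # Recall: share of gold-passing outputs that silver also passes.
        "recall": tp / (tp + fn) if (tp + fn) else nan,
        # Specificity: share of gold-failing outputs that silver also fails.
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }
```

Under these definitions, a script that is stricter than its gold counterpart rejects some outputs the gold script accepts, which lowers recall while leaving specificity high. That is the pattern the review describes as mild strictness.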
Ablation studies are informative. Removing planning, dataset previews, or code outputs each degrades performance in interpretable ways, validating design choices. The finding that removing planning collapses recall to 3.4% is particularly compelling.
Training experiments are well-designed. The paper evaluates across four model sizes (4B–32B), two training paradigms (RFT-Distill, RFT-Self), and reports both average and best-of-3 metrics. The NLL/PPL analysis explaining why RFT-Self outperforms RFT-Distill at larger scales adds mechanistic insight.
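As background for the NLL/PPL analysis, the standard relationship is that perplexity equals the exponential of the mean per-token negative log-likelihood. The sketch below computes both from a list of per-token natural-log probabilities. It reflects the textbook definition only; how the paper obtains its log-probabilities is not described in this assessment, and the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def nll_and_perplexity(token_logprobs: Sequence[float]) -> Tuple[float, float]:
    """Return (mean NLL, perplexity) from natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood per token.
    nll = -sum(token_logprobs) / len(token_logprobs)
    # Perplexity is exp of the mean NLL.
    return nll, math.exp(nll)
```

Lower NLL and perplexity on a set of trajectories mean the model already assigns them high probability, so such trajectories lie closer to the model's own output distribution.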
Potential concerns: The validation set of 50 tasks is relatively small given the diversity of 565 tasks across four disciplines. Using LLMs as judges for output verification (GPT-5.2) and for evaluation logic assessment (Claude Sonnet 4.5) introduces potential circularity, though the human agreement checks (92.31% and 85% respectively) partially mitigate this. The paper also does not discuss potential data contamination, that is, whether evaluation benchmark tasks might overlap with training data at the concept level despite the exclusion of specific repositories.
3. Potential Impact
Training infrastructure for scientific AI. The most immediate impact is providing a scalable training resource. The 7.8 absolute point improvement for Qwen3-32B on ScienceAgentBench, approaching proprietary models like o1-preview and Claude Sonnet 4.5, demonstrates practical value. This could accelerate the development of open-weight scientific coding agents.
Methodology transfer. The evaluation script generation pipeline—planning-then-coding with domain-specific reasoning—could generalize to other domains where verification requires domain expertise (e.g., engineering simulation, clinical data analysis).
Benchmark contribution. ScienceAgentBench-Verified, while secondary, addresses real evaluation noise issues and demonstrates responsible benchmarking practices.
Limitations on broader impact: The set of 565 tasks, while carefully curated, remains relatively small. The four disciplines covered, while diverse, exclude major scientific areas (physics, materials science, ecology). The pipeline's reliance on GitHub repositories with specific structural properties may introduce systematic biases in task representation.
4. Timeliness & Relevance
This paper is highly timely. The convergence of several trends—the push toward RL/RFT training for reasoning models, the "Autoresearch" paradigm, and growing interest in open-weight scientific AI—creates strong demand for exactly this type of infrastructure. The paper explicitly positions itself against the manual effort bottleneck highlighted by Karpathy's Autoresearch, offering an automated alternative.
The focus on open-weight models addresses a real equity concern in scientific AI, where proprietary model dependence limits reproducibility and accessibility. The gap between open-weight and proprietary models on scientific tasks makes this contribution particularly relevant.
5. Strengths & Limitations
Key strengths:
Notable limitations:
Additional observations: The detailed task examples (Kriging interpolation, Madelung constants) effectively illustrate the scientific depth of D3-Gym tasks. The error analysis showing that D3-Gym training reduces data schema errors by 44.7% but increases logical/algorithmic errors (as more programs now execute successfully) provides nuanced insight into what the training actually teaches—primarily better code mechanics rather than deeper scientific reasoning.
The comparison with AutoSDT-5K demonstrating superior sample efficiency of verified environments over static instruction-solution pairs is an important result that validates the core thesis: verification signals matter for training data quality.
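One common way a verification signal enters training-data construction is rejection-style filtering: sample several trajectories per task and keep only those whose outputs pass the task's evaluation script. The sketch below illustrates that idea in general terms. Whether D3-Gym filters trajectories in exactly this way is not stated in this assessment, and the `Trajectory` fields and `verifier` callable are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Trajectory:
    task_id: str   # hypothetical task identifier
    program: str   # code the agent produced
    output: str    # artifact or stdout produced by running the program


def filter_verified(
    trajectories: Iterable[Trajectory],
    verifier: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    """Keep only trajectories that the verifier accepts."""
    return [t for t in trajectories if verifier(t)]
```

Static instruction-solution pairs offer no such filter, which is one plausible reading of why verified environments could be more sample-efficient.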
Generated May 1, 2026
Comparison History (47)
While Paper 1 provides a highly useful benchmarking environment, Paper 2 has greater potential impact because it challenges the fundamental validity of current AI-driven scientific discovery. By rigorously demonstrating across 25,000 runs that LLMs fail at basic epistemic reasoning—ignoring evidence and failing to self-correct—Paper 2 exposes a critical flaw in the heavily hyped 'AI Scientist' paradigm. This critical evaluation will likely catalyze a major shift in how AI models are trained and evaluated, forcing the field to focus on reasoning processes rather than just outcomes.
Paper 1 addresses a critical, universal bottleneck in autonomous agent deployment: knowing when to ask for help. While Paper 2 provides a valuable resource for the specific domain of scientific discovery, Paper 1 introduces a novel metric and evaluation paradigm for selective escalation that applies broadly across all agentic AI domains. Furthermore, demonstrating that this judgment is trainable via RL opens up foundational new pathways for developing reliable, human-aligned agents, giving it a broader and potentially more transformative impact across the entire AI field.
Paper 2 investigates a fundamental and novel question about internal emotion representations in LLMs and their causal influence on alignment-relevant behaviors like reward hacking and sycophancy. This has broad implications for AI safety, interpretability, and alignment—fields of immense current importance. The concept of 'functional emotions' is a novel theoretical contribution with cross-disciplinary appeal (AI, cognitive science, philosophy of mind). Paper 1, while useful as a benchmark/dataset contribution for scientific discovery agents, is more incremental and narrowly scoped to the evaluation infrastructure community. Paper 2's findings are more likely to reshape how the field thinks about LLM behavior and safety.
Paper 2 addresses a critical, universal bottleneck in scientific publishing—the peer review crisis—by demonstrating successful AI integration at an unprecedented scale (over 20,000 papers). Its findings that AI reviews are competitive with or preferred over human reviews could fundamentally transform scientific evaluation across all disciplines. While Paper 1 provides valuable infrastructure for AI agents, Paper 2's direct real-world application and systemic implications give it a broader and more immediate scientific impact.
Paper 2 identifies a fundamental, broadly applicable problem—LLMs silently blending memorized priors with data-driven inference—and provides a practical, generalizable protocol (epistemic blinding) applicable across diverse domains (biology, finance, etc.). This addresses a critical trust and auditability gap in all LLM-assisted analysis, with potential to reshape how LLMs are used in scientific and professional settings. Paper 1, while valuable as a benchmark/training resource for scientific discovery agents, is more incremental and narrower in scope, primarily benefiting the AI-for-science benchmarking community rather than transforming methodology across fields.
Paper 2 is likely to have higher scientific impact because it delivers a broadly useful, reusable infrastructure artifact: a large, automatically constructed, verifiable benchmark with executable environments across multiple scientific disciplines. This can become a community standard for training/evaluating agents in real-world data-driven discovery, enabling many follow-on methods and cross-field applications. Its methodological rigor is supported by human-agreement validation and demonstrated training gains. Paper 1 is novel and timely for AI safety, but its primary impact may be narrower (monitoring/benchmark integrity) and more tool-specific.
Paper 1 likely has higher scientific impact due to its broadly enabling contribution: a scalable, automatically constructed, executable-and-verifiable benchmark for real scientific data-driven discovery across multiple disciplines, with released environments, workflow, and demonstrated training gains. This can become infrastructure for agent evaluation/training, improving reproducibility and accelerating progress across ML-for-science and software agents. Paper 2 is timely and important for AI safety/medical governance, but its scope (60 scenarios, 6 models) is narrower and more domain-specific; impact may be strong in policy and evaluation but less broadly enabling than a general-purpose verifiable environment dataset.
MIMIC presents a fundamentally novel generative multimodal foundation model that unifies sequence, structure, regulatory, evolutionary, and contextual modalities for biomolecules—a significant conceptual advance. It demonstrates state-of-the-art results across diverse tasks (splicing prediction, protein design, RNA editing) with direct therapeutic applications (HBB mutation correction, PD-L1/hACE2 binder design). Its breadth of impact spans genomics, transcriptomics, proteomics, and drug design. While D3-Gym is a valuable benchmark contribution for AI-driven scientific discovery, MIMIC's methodological innovation and potential for transformative real-world biomedical applications give it substantially higher scientific impact.
Paper 2 identifies a critical, real-world vulnerability in current AI safety paradigms—where safety measures inadvertently cause medical omission harm. Its focus on life-or-death implications, rigorous pre-registered methodology, and profound implications for AI alignment and healthcare policy give it immense immediate and long-term scientific impact. While Paper 1 introduces a valuable benchmark for AI-driven science, Paper 2 challenges core assumptions about AI safety and has urgent, broad societal relevance.
Paper 1 introduces a novel paradigm shift from supervised to unsupervised monitoring for AI agents, addressing a critical bottleneck in AI safety and evaluation. By focusing on discovering unknown misbehaviors and demonstrating its efficacy through uncovering real benchmark vulnerabilities, it offers profound implications for the reliable deployment of autonomous systems. While Paper 2 provides a valuable dataset for AI-driven scientific discovery, Paper 1's methodological innovation in safety and its broad applicability across all agentic AI domains give it a higher potential for foundational scientific impact.
HiL-Bench addresses a fundamental and underexplored problem—whether AI agents can recognize the limits of their knowledge and ask for help—which has broad implications across all agentic AI applications. The concept of 'selective escalation' and the Ask-F1 metric are novel contributions that identify a critical failure mode invisible to existing benchmarks. The finding that judgment is trainable via RL is particularly impactful. While D3-Gym is a solid benchmark contribution for scientific discovery, HiL-Bench tackles a more foundational capability gap with wider applicability to human-AI collaboration across domains.
Paper 1 addresses a fundamental epistemological question about whether AI scientific agents truly reason scientifically, revealing critical failures (evidence ignored 68% of the time, rare belief revision) across 25,000+ runs. This has broad implications for AI safety, trustworthiness, and the foundations of AI-driven science. Its finding that scaffold engineering cannot fix reasoning deficits and that reasoning must become a training target reshapes the field's direction. Paper 2 is a valuable benchmark contribution, but Paper 1's insights are more transformative, challenging core assumptions about LLM-based scientific discovery and influencing policy, evaluation standards, and future training paradigms.
Paper 1 reports the first large-scale deployment of AI-assisted peer review at a major conference (AAAI-26, 22,977 papers), addressing a critical infrastructure problem in science. The finding that AI reviews were preferred over human reviews on key dimensions is paradigm-shifting for scientific publishing. Its breadth of impact spans all scientific fields that use peer review, not just AI. Paper 2, while a solid benchmark contribution for scientific discovery agents, is more incremental—creating another evaluation dataset in a crowded benchmarking space with more limited scope of impact.
Paper 2 likely has higher scientific impact: it introduces a broadly applicable generative “health world model” trained on a large longitudinal, multimodal cohort, shows strong transfer to independent cohorts, beats clinical risk scores on many endpoints, and demonstrates in silico intervention simulation aligned with RCT results—high real-world clinical relevance and timeliness for digital twins. Paper 1 is novel and useful for ML/scientific-agent benchmarking and reproducibility, but its impact is more enabling within AI evaluation/training, whereas Paper 2 directly targets high-stakes medicine with wide cross-domain implications.
Paper 2 likely has higher impact due to a broadly useful, verifiable benchmark/environment suite enabling reproducible evaluation and training of scientific agents across multiple disciplines. Its methodological rigor is strengthened by executable environments, synthesized evaluation scripts validated against human gold standards, and demonstrated downstream gains on external benchmarks. The dataset/workflow release supports immediate real-world adoption and follow-on research. Paper 1 is novel and timely for interpretability/alignment, but appears narrower (single model focus) and less directly enabling for the community than a widely reusable, verifiable task infrastructure.
Paper 1 presents a highly innovative 'health world model' capable of simulating clinical interventions and predicting disease trajectories across multiple domains. Its ability to exceed established clinical risk scores and accurately predict trial outcomes in silico offers profound real-world applications in personalized medicine and clinical trial design, likely resulting in broader and more transformative impact than the benchmarking environment presented in Paper 2.
Paper 2 likely has higher impact due to broader, more generalizable utility: a large, verifiable, executable benchmark/environment suite spanning four disciplines with reference solutions and automated evaluation. This directly enables rigorous, reproducible training/evaluation of scientific agents, addressing a widely recognized bottleneck. It demonstrates methodological rigor (human agreement study; model gains) and provides reusable infrastructure likely to be adopted across labs. Paper 1 introduces a valuable auditing protocol for LLM prior contamination with compelling case studies, but its impact is narrower (primarily agent prompting/audit) and less foundational than a community-scale verifiable environment dataset.
Paper 2 (MIMIC) has higher estimated impact due to a more transformative, broadly applicable methodological contribution: a generative multimodal foundation model that unifies conditioning, prediction, and design across biomolecular modalities (sequence/structure/regulation/evolution/context). Its demonstrated state-of-the-art results and concrete high-value applications (splicing, isoform-aware inference, clinically relevant RNA edits, protein binder design) suggest strong real-world relevance and cross-field reach (genomics, structural biology, drug discovery). Paper 1 is novel and useful infrastructure for agent evaluation, but its impact is more indirect and narrower.
Paper 2 likely has higher scientific impact: it introduces a new, openly released benchmark/dataset with executable, verifiable environments across multiple scientific disciplines, enabling standardized evaluation and training for agentic scientific discovery. This has broad applicability (LLM agents, ML for science, benchmarking, reproducibility), strong timeliness, and immediate real-world utility via artifacts and workflow. Paper 1 is a useful incremental improvement to LoRA compression (post-hoc rank pruning) with solid practicality, but its novelty and breadth are narrower and impact depends more on adoption in fine-tuning pipelines.
D3-Gym addresses a critical infrastructure gap for AI-driven scientific discovery by providing 565 verifiable tasks from real scientific repositories. It offers a reusable benchmark and training resource that can broadly accelerate research in scientific AI agents across multiple disciplines. The demonstrated training gains (7.8 points on ScienceAgentBench) show concrete utility. Paper 1, while interesting in studying visual priming effects on VLMs, is more narrowly scoped to a specific behavioral phenomenon with less transformative potential for the broader research community.