D3-Gym: Constructing Real-World Verifiable Environments for Data-Driven Discovery

Hanane Nour Moussa, Yifei Li, Zhuoyang Li, Yankai Yang, Cheng Tang, Tianshu Zhang, Nesreen K. Ahmed, Ali Payani

Frozen v1: this version was superseded on arXiv (latest: v2). Stats below reflect the state at freeze time and will not change.
Rank: #88 of 2292 · Artificial Intelligence
Tournament Score: 1547±29
Win Rate: 57% (27 wins, 20 losses, 47 matches)
Rating: 7.2/10

Abstract

Despite recent progress in language models and agents for scientific data-driven discovery, further advancing their capabilities is held back by the absence of verifiable environments representing real-world scientific tasks. To fill this gap, we introduce D3-Gym, the first automatically constructed dataset with verifiable environments for scientific Data-Driven Discovery. D3-Gym comprises (1) 565 tasks sourced from 239 real scientific repositories across four disciplines where (2) each task is equipped with a natural language instruction, an executable environment with pre-installed dependencies, input dataset and artifact previews, a reference code solution, and an automatically synthesized evaluation script. Rigorous evaluation of the quality of the verification signal in D3-Gym confirms that our evaluation scripts achieve 87.5% agreement with human-annotated gold standards and strong alignment in domain-specific evaluation logic, showing their scientific soundness. Further, training on trajectories sampled from D3-Gym yields consistent and substantial gains across Qwen3 models of varying sizes on ScienceAgentBench, boosting Qwen3-32B by 7.8 absolute points and substantially shrinking the gap with strong proprietary models. All D3-Gym artifacts (environments, creation workflow, trajectories, and models) can be found at https://github.com/OSU-NLP-Group/D3-Gym.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: D3-Gym

1. Core Contribution

D3-Gym addresses a genuine infrastructure gap in AI for scientific discovery: the lack of verifiable, executable environments for training and evaluating language models on real-world data-driven scientific tasks. The paper's core contributions are threefold: (1) an automated pipeline for constructing executable environments from scientific GitHub repositories, (2) a two-phase (planning-then-coding) approach for synthesizing domain-specific evaluation scripts, and (3) demonstration that training on trajectories sampled from these environments yields substantial improvements in open-weight models.

The key technical novelty lies in the evaluation script generation methodology. Unlike software engineering where unit tests exist naturally, scientific tasks require domain-specific evaluation logic—appropriate metrics, thresholds, and artifact inspection—that must be reasoned about from scratch. The decomposition into planning (reasoning about scientific validity) and coding (implementing the evaluation) is well-motivated and empirically validated through ablation studies.
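As a rough illustration of the planning-then-coding decomposition described above, the following is a minimal sketch and not the authors' pipeline. The function `synthesize_eval_script`, the prompt wording, the `LLM` type alias, and the `fake_llm` stand-in are all hypothetical names introduced here. The only details taken from the assessment are the two phases and the kinds of inputs it mentions (dataset previews and reference outputs).

```python
from typing import Callable, Dict

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


def synthesize_eval_script(
    task: str, preview: str, reference_output: str, llm: LLM
) -> Dict[str, str]:
    """Two-phase sketch: reason about the evaluation first, then implement it."""
    # Phase 1 (planning): decide metric, threshold, and artifact to inspect.
    plan_prompt = (
        "You are designing an evaluation for a scientific task.\n"
        f"Task: {task}\n"
        f"Data preview: {preview}\n"
        f"Reference output: {reference_output}\n"
        "Describe the metric, the pass threshold, and which artifact to inspect."
    )
    plan = llm(plan_prompt)

    # Phase 2 (coding): turn the plan into an executable script.
    code_prompt = (
        "Implement the following evaluation plan as a Python script "
        "that prints PASS or FAIL.\n"
        f"Plan: {plan}"
    )
    script = llm(code_prompt)
    return {"plan": plan, "script": script}


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real model; returns canned text (illustrative)."""
    if prompt.startswith("You are designing"):
        return "Metric: RMSE; threshold: within 5% of reference; artifact: pred.csv"
    return "print('PASS')"


result = synthesize_eval_script(
    "Interpolate rainfall", "lat,lon,mm", "pred.csv", fake_llm
)
print(result["plan"])
```

Keeping the plan as a separate artifact means the scientific choices (metric, threshold, artifact) can be inspected independently of the code that implements them.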

2. Methodological Rigor

The paper demonstrates strong methodological discipline in several ways:

Quality validation is thorough. The 87.5% agreement between silver and gold evaluation scripts, validated on 50 human-annotated tasks requiring 175 person-hours, is a credible assessment. The multidimensional evaluation—execution-based (accuracy, recall, specificity) and logic-based (metric choice, threshold, artifact)—provides a nuanced picture rather than a single aggregate number. The identification that silver scripts tend toward mild strictness (66.1% recall, 91.0% specificity) is honest and well-characterized.
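To make the execution-based metrics concrete, here is a minimal sketch of how accuracy (agreement), recall, and specificity could be computed when a silver script's pass/fail verdicts are compared against gold verdicts. The function name, the choice of "pass" as the positive class, and the toy verdict lists are assumptions for illustration, not code or data from the paper.

```python
from typing import Dict, Sequence


def verdict_agreement(gold: Sequence[bool], silver: Sequence[bool]) -> Dict[str, float]:
    """Compare silver pass/fail verdicts against gold ones.

    Assumption of this sketch: True means "pass" and is the positive class.
    """
    if len(gold) != len(silver):
        raise ValueError("gold and silver must have the same length")
    pairs = list(zip(gold, silver))
    tp = sum(g and s for g, s in pairs)              # both pass
    tn = sum((not g) and (not s) for g, s in pairs)  # both fail
    fp = sum((not g) and s for g, s in pairs)        # silver too lenient
    fn = sum(g and (not s) for g, s in pairs)        # silver too strict
    total = len(pairs)
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total if total else nan,
        "recall": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }


# Toy verdicts, made up for illustration only.
gold = [True, True, True, False, False, False, False, False]
silver = [True, True, False, False, False, False, False, True]
metrics = verdict_agreement(gold, silver)
print(metrics)
```

Under this convention, a silver script that is stricter than gold rejects some programs that gold accepts, which lowers recall while leaving specificity high. That is the pattern the assessment describes.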

Ablation studies are informative. Removing planning, dataset previews, or code outputs each degrades performance in interpretable ways, validating design choices. The finding that removing planning collapses recall to 3.4% is particularly compelling.

Training experiments are well-designed. The paper evaluates across four model sizes (4B–32B), two training paradigms (RFT-Distill, RFT-Self), and reports both average and best-of-3 metrics. The NLL/PPL analysis explaining why RFT-Self outperforms RFT-Distill at larger scales adds mechanistic insight.
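For readers unfamiliar with the quantities mentioned above, the sketch below shows the standard relationship between NLL and PPL: perplexity is the exponential of the mean per-token negative log-likelihood. It also shows one plausible reading of "average" versus "best-of-3" success rates, where a task counts as solved under best-of-3 if any of its runs succeeds. The helper names and toy numbers are assumptions for illustration, and the paper's exact protocol may differ.

```python
import math
from typing import Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """PPL = exp(mean NLL), where NLL = -log p(token) in nats."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def average_and_best_of_k(runs: Sequence[Sequence[bool]]) -> Tuple[float, float]:
    """runs[t][r] is True if run r solved task t.

    Returns (mean per-run success rate, fraction of tasks solved by any run).
    """
    n_tasks = len(runs)
    avg = sum(sum(r) / len(r) for r in runs) / n_tasks
    best = sum(any(r) for r in runs) / n_tasks
    return avg, best


# Toy inputs, made up for illustration only.
uniform = [math.log(0.25)] * 8  # every token assigned probability 0.25
ppl = perplexity(uniform)
runs = [
    [True, False, False],
    [False, False, False],
    [True, True, True],
    [False, True, False],
]
avg, best = average_and_best_of_k(runs)
print(ppl, avg, best)
```

A trajectory with lower perplexity under a given model is closer to what that model would generate on its own, which is why NLL/PPL is a natural lens for comparing self-generated and distilled training data.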

Potential concerns: The validation set of 50 tasks is relatively small given the diversity of 565 tasks across four disciplines. The LLM-as-judge for output verification (GPT-5.2) and evaluation logic assessment (Claude Sonnet 4.5) introduces potential circularity, though the human agreement checks (92.31% and 85% respectively) partially mitigate this. The paper also doesn't discuss potential data contamination—whether evaluation benchmark tasks might overlap with training data at the concept level despite excluding specific repositories.

3. Potential Impact

Training infrastructure for scientific AI. The most immediate impact is providing a scalable training resource. The 7.8 absolute point improvement for Qwen3-32B on ScienceAgentBench, approaching proprietary models like o1-preview and Claude Sonnet 4.5, demonstrates practical value. This could accelerate the development of open-weight scientific coding agents.

Methodology transfer. The evaluation script generation pipeline—planning-then-coding with domain-specific reasoning—could generalize to other domains where verification requires domain expertise (e.g., engineering simulation, clinical data analysis).

Benchmark contribution. ScienceAgentBench-Verified, while secondary, addresses real evaluation noise issues and demonstrates responsible benchmarking practices.

Limitations on broader impact: The 565 tasks, while carefully curated, remain relatively small. The four disciplines covered, while diverse, exclude major scientific areas (physics, materials science, ecology). The pipeline's reliance on GitHub repositories with specific structural properties may introduce systematic biases in task representation.

4. Timeliness & Relevance

This paper is highly timely. The convergence of several trends—the push toward RL/RFT training for reasoning models, the "Autoresearch" paradigm, and growing interest in open-weight scientific AI—creates strong demand for exactly this type of infrastructure. The paper explicitly positions itself against the manual effort bottleneck highlighted by Karpathy's Autoresearch, offering an automated alternative.

The focus on open-weight models addresses a real equity concern in scientific AI, where proprietary model dependence limits reproducibility and accessibility. The gap between open-weight and proprietary models on scientific tasks makes this contribution particularly relevant.

5. Strengths & Limitations

Key strengths:

  • End-to-end pipeline from repository mining to verified training environments, with each stage carefully validated
  • The planning-then-coding decomposition for evaluation scripts is elegant and well-ablated
  • Comprehensive training experiments across model scales with interpretable analysis (error taxonomy, per-category breakdowns, scaling curves)
  • Honest characterization of silver script limitations (mild strictness bias)
  • Full artifact release including environments, workflow, trajectories, and models

Notable limitations:

  • Scale remains modest (565 tasks vs. thousands in SE benchmarks like SWE-Gym)
  • Only RFT is explored; the "Gym" framing suggests RL potential that isn't demonstrated
  • The evaluation is primarily on ScienceAgentBench, a single benchmark from the same research group, raising questions about generalization
  • Cost of ~$1,700 for 565 environments is reasonable, but the per-environment cost of $3 may limit scaling to very large datasets
  • The paper mentions future reference dates (2026) for several citations, which is unusual and raises questions about the timeline
Additional observations: The detailed task examples (Kriging interpolation, Madelung constants) effectively illustrate the scientific depth of D3-Gym tasks. The error analysis showing that D3-Gym training reduces data schema errors by 44.7% but increases logical/algorithmic errors (as more programs now execute successfully) provides nuanced insight into what the training actually teaches: primarily better code mechanics rather than deeper scientific reasoning.

The comparison with AutoSDT-5K demonstrating superior sample efficiency of verified environments over static instruction-solution pairs is an important result that validates the core thesis: verification signals matter for training data quality.

Rating: 7.2/10
Significance 7.5 · Rigor 7.5 · Novelty 6.8 · Clarity 8

Generated May 1, 2026

    Comparison History (47)

    vs. AI scientists produce results without reasoning scientifically
    gemini-3 · 5/1/2026

    While Paper 1 provides a highly useful benchmarking environment, Paper 2 has greater potential impact because it challenges the fundamental validity of current AI-driven scientific discovery. By rigorously demonstrating across 25,000 runs that LLMs fail at basic epistemic reasoning—ignoring evidence and failing to self-correct—Paper 2 exposes a critical flaw in the heavily hyped 'AI Scientist' paradigm. This critical evaluation will likely catalyze a major shift in how AI models are trained and evaluated, forcing the field to focus on reasoning processes rather than just outcomes.

    vs. HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
    gemini-3 · 5/1/2026

    Paper 1 addresses a critical, universal bottleneck in autonomous agent deployment: knowing when to ask for help. While Paper 2 provides a valuable resource for the specific domain of scientific discovery, Paper 1 introduces a novel metric and evaluation paradigm for selective escalation that applies broadly across all agentic AI domains. Furthermore, demonstrating that this judgment is trainable via RL opens up foundational new pathways for developing reliable, human-aligned agents, giving it a broader and potentially more transformative impact across the entire AI field.

    vs. Emotion Concepts and their Function in a Large Language Model
    claude-opus-4.6 · 5/1/2026

    Paper 2 investigates a fundamental and novel question about internal emotion representations in LLMs and their causal influence on alignment-relevant behaviors like reward hacking and sycophancy. This has broad implications for AI safety, interpretability, and alignment—fields of immense current importance. The concept of 'functional emotions' is a novel theoretical contribution with cross-disciplinary appeal (AI, cognitive science, philosophy of mind). Paper 1, while useful as a benchmark/dataset contribution for scientific discovery agents, is more incremental and narrowly scoped to the evaluation infrastructure community. Paper 2's findings are more likely to reshape how the field thinks about LLM behavior and safety.

    vs. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
    gemini-3 · 5/1/2026

    Paper 2 addresses a critical, universal bottleneck in scientific publishing—the peer review crisis—by demonstrating successful AI integration at an unprecedented scale (over 20,000 papers). Its findings that AI reviews are competitive with or preferred over human reviews could fundamentally transform scientific evaluation across all disciplines. While Paper 1 provides valuable infrastructure for AI agents, Paper 2's direct real-world application and systemic implications give it a broader and more immediate scientific impact.

    vs. Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis
    claude-opus-4.6 · 5/1/2026

    Paper 2 identifies a fundamental, broadly applicable problem—LLMs silently blending memorized priors with data-driven inference—and provides a practical, generalizable protocol (epistemic blinding) applicable across diverse domains (biology, finance, etc.). This addresses a critical trust and auditability gap in all LLM-assisted analysis, with potential to reshape how LLMs are used in scientific and professional settings. Paper 1, while valuable as a benchmark/training resource for scientific discovery agents, is more incremental and narrower in scope, primarily benefiting the AI-for-science benchmarking community rather than transforming methodology across fields.

    vs. Hodoscope: Unsupervised Monitoring for AI Misbehaviors
    gpt-5.2 · 5/1/2026

    Paper 2 is likely to have higher scientific impact because it delivers a broadly useful, reusable infrastructure artifact: a large, automatically constructed, verifiable benchmark with executable environments across multiple scientific disciplines. This can become a community standard for training/evaluating agents in real-world data-driven discovery, enabling many follow-on methods and cross-field applications. Its methodological rigor is supported by human-agreement validation and demonstrated training gains. Paper 1 is novel and timely for AI safety, but its primary impact may be narrower (monitoring/benchmark integrity) and more tool-specific.

    vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
    gpt-5.2 · 5/1/2026

    Paper 1 likely has higher scientific impact due to its broadly enabling contribution: a scalable, automatically constructed, executable-and-verifiable benchmark for real scientific data-driven discovery across multiple disciplines, with released environments, workflow, and demonstrated training gains. This can become infrastructure for agent evaluation/training, improving reproducibility and accelerating progress across ML-for-science and software agents. Paper 2 is timely and important for AI safety/medical governance, but its scope (60 scenarios, 6 models) is narrower and more domain-specific; impact may be strong in policy and evaluation but less broadly enabling than a general-purpose verifiable environment dataset.

    vs. MIMIC: A Generative Multimodal Foundation Model for Biomolecules
    claude-opus-4.6 · 5/1/2026

    MIMIC presents a fundamentally novel generative multimodal foundation model that unifies sequence, structure, regulatory, evolutionary, and contextual modalities for biomolecules—a significant conceptual advance. It demonstrates state-of-the-art results across diverse tasks (splicing prediction, protein design, RNA editing) with direct therapeutic applications (HBB mutation correction, PD-L1/hACE2 binder design). Its breadth of impact spans genomics, transcriptomics, proteomics, and drug design. While D3-Gym is a valuable benchmark contribution for AI-driven scientific discovery, MIMIC's methodological innovation and potential for transformative real-world biomedical applications give it substantially higher scientific impact.

    vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
    gemini-3 · 5/1/2026

    Paper 2 identifies a critical, real-world vulnerability in current AI safety paradigms—where safety measures inadvertently cause medical omission harm. Its focus on life-or-death implications, rigorous pre-registered methodology, and profound implications for AI alignment and healthcare policy give it immense immediate and long-term scientific impact. While Paper 1 introduces a valuable benchmark for AI-driven science, Paper 2 challenges core assumptions about AI safety and has urgent, broad societal relevance.

    vs. Hodoscope: Unsupervised Monitoring for AI Misbehaviors
    gemini-3 · 5/1/2026

    Paper 1 introduces a novel paradigm shift from supervised to unsupervised monitoring for AI agents, addressing a critical bottleneck in AI safety and evaluation. By focusing on discovering unknown misbehaviors and demonstrating its efficacy through uncovering real benchmark vulnerabilities, it offers profound implications for the reliable deployment of autonomous systems. While Paper 2 provides a valuable dataset for AI-driven scientific discovery, Paper 1's methodological innovation in safety and its broad applicability across all agentic AI domains give it a higher potential for foundational scientific impact.

    vs. HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
    claude-opus-4.6 · 5/1/2026

    HiL-Bench addresses a fundamental and underexplored problem—whether AI agents can recognize the limits of their knowledge and ask for help—which has broad implications across all agentic AI applications. The concept of 'selective escalation' and the Ask-F1 metric are novel contributions that identify a critical failure mode invisible to existing benchmarks. The finding that judgment is trainable via RL is particularly impactful. While D3-Gym is a solid benchmark contribution for scientific discovery, HiL-Bench tackles a more foundational capability gap with wider applicability to human-AI collaboration across domains.

    vs. AI scientists produce results without reasoning scientifically
    claude-opus-4.6 · 5/1/2026

    Paper 1 addresses a fundamental epistemological question about whether AI scientific agents truly reason scientifically, revealing critical failures (evidence ignored 68% of the time, rare belief revision) across 25,000+ runs. This has broad implications for AI safety, trustworthiness, and the foundations of AI-driven science. Its finding that scaffold engineering cannot fix reasoning deficits and that reasoning must become a training target reshapes the field's direction. Paper 2 is a valuable benchmark contribution, but Paper 1's insights are more transformative, challenging core assumptions about LLM-based scientific discovery and influencing policy, evaluation standards, and future training paradigms.

    vs. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
    claude-opus-4.6 · 5/1/2026

    Paper 1 reports the first large-scale deployment of AI-assisted peer review at a major conference (AAAI-26, 22,977 papers), addressing a critical infrastructure problem in science. The finding that AI reviews were preferred over human reviews on key dimensions is paradigm-shifting for scientific publishing. Its breadth of impact spans all scientific fields that use peer review, not just AI. Paper 2, while a solid benchmark contribution for scientific discovery agents, is more incremental—creating another evaluation dataset in a crowded benchmarking space with more limited scope of impact.

    vs. Simulating clinical interventions with a generative multimodal model of human physiology
    gpt-5.2 · 5/1/2026

    Paper 2 likely has higher scientific impact: it introduces a broadly applicable generative “health world model” trained on a large longitudinal, multimodal cohort, shows strong transfer to independent cohorts, beats clinical risk scores on many endpoints, and demonstrates in silico intervention simulation aligned with RCT results—high real-world clinical relevance and timeliness for digital twins. Paper 1 is novel and useful for ML/scientific-agent benchmarking and reproducibility, but its impact is more enabling within AI evaluation/training, whereas Paper 2 directly targets high-stakes medicine with wide cross-domain implications.

    vs. Emotion Concepts and their Function in a Large Language Model
    gpt-5.2 · 5/1/2026

    Paper 2 likely has higher impact due to a broadly useful, verifiable benchmark/environment suite enabling reproducible evaluation and training of scientific agents across multiple disciplines. Its methodological rigor is strengthened by executable environments, synthesized evaluation scripts validated against human gold standards, and demonstrated downstream gains on external benchmarks. The dataset/workflow release supports immediate real-world adoption and follow-on research. Paper 1 is novel and timely for interpretability/alignment, but appears narrower (single model focus) and less directly enabling for the community than a widely reusable, verifiable task infrastructure.

    vs. Simulating clinical interventions with a generative multimodal model of human physiology
    gemini-3 · 5/1/2026

    Paper 1 presents a highly innovative 'health world model' capable of simulating clinical interventions and predicting disease trajectories across multiple domains. Its ability to exceed established clinical risk scores and accurately predict trial outcomes in silico offers profound real-world applications in personalized medicine and clinical trial design, likely resulting in broader and more transformative impact than the benchmarking environment presented in Paper 2.

    vs. Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis
    gpt-5.2 · 5/1/2026

    Paper 2 likely has higher impact due to broader, more generalizable utility: a large, verifiable, executable benchmark/environment suite spanning four disciplines with reference solutions and automated evaluation. This directly enables rigorous, reproducible training/evaluation of scientific agents, addressing a widely recognized bottleneck. It demonstrates methodological rigor (human agreement study; model gains) and provides reusable infrastructure likely to be adopted across labs. Paper 1 introduces a valuable auditing protocol for LLM prior contamination with compelling case studies, but its impact is narrower (primarily agent prompting/audit) and less foundational than a community-scale verifiable environment dataset.

    vs. MIMIC: A Generative Multimodal Foundation Model for Biomolecules
    gpt-5.2 · 5/1/2026

    Paper 2 (MIMIC) has higher estimated impact due to a more transformative, broadly applicable methodological contribution: a generative multimodal foundation model that unifies conditioning, prediction, and design across biomolecular modalities (sequence/structure/regulation/evolution/context). Its demonstrated state-of-the-art results and concrete high-value applications (splicing, isoform-aware inference, clinically relevant RNA edits, protein binder design) suggest strong real-world relevance and cross-field reach (genomics, structural biology, drug discovery). Paper 1 is novel and useful infrastructure for agent evaluation, but its impact is more indirect and narrower.

    vs. Post-Optimization Adaptive Rank Allocation for LoRA
    gpt-5.2 · 5/1/2026

    Paper 2 likely has higher scientific impact: it introduces a new, openly released benchmark/dataset with executable, verifiable environments across multiple scientific disciplines, enabling standardized evaluation and training for agentic scientific discovery. This has broad applicability (LLM agents, ML for science, benchmarking, reproducibility), strong timeliness, and immediate real-world utility via artifacts and workflow. Paper 1 is a useful incremental improvement to LoRA compression (post-hoc rank pruning) with solid practicality, but its novelty and breadth are narrower and impact depends more on adoption in fine-tuning pipelines.

    vs. The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models
    claude-opus-4.6 · 5/1/2026

    D3-Gym addresses a critical infrastructure gap for AI-driven scientific discovery by providing 565 verifiable tasks from real scientific repositories. It offers a reusable benchmark and training resource that can broadly accelerate research in scientific AI agents across multiple disciplines. The demonstrated training gains (7.8 points on ScienceAgentBench) show concrete utility. Paper 1, while interesting in studying visual priming effects on VLMs, is more narrowly scoped to a specific behavioral phenomenon with less transformative potential for the broader research community.