SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

Kevin Han, Renfei Zhang, Kathy Wei, Hamed Mahdavi, Niloofar Mireshghallah, Amir Farimani

#373 of 2292 · Artificial Intelligence
Tournament Score: 1490±50
Win Rate: 83% (20 wins, 4 losses, 24 matches)
Rating: 7.5/10

Abstract

LLM agents have incredible potential for scientific discovery applications. However, the performance of LLM agents on real-world, small molecule drug design (SMDD) tasks across diverse chemistries and targets is unclear. Current evaluation methods are either ad hoc, too simple for real-world discovery, limited in scale, or restricted to single-turn question answering. In an effort to standardize the evaluation of LLM agents on small molecule design, we introduce SMDD-Bench, a challenging, multi-turn, long-horizon agentic benchmark consisting of 502 guaranteed-solvable task instances spanning 5 task types: 2D Pharmacophore Identification, Interaction Point Discovery, Scaffold Hopping, Lead Optimization, and Fragment Assembly. SMDD-Bench tasks span a wide region of chemical space and involve 102 unique protein targets. Completely solving the benchmark would require strong chemical and biological reasoning and 3D intuition, an understanding of specialized tool use, and planning expertise over a limited number of oracle calls. We benchmark 7 frontier open- and closed-source LLMs and find that even the most performant LLM, GPT5.4, solves only 40.2% of tasks. We hope SMDD-Bench provides a standardized testbed to invigorate the field towards training and evaluating LLM agents for fully autonomous computational drug design. We host a public leaderboard at smddbench.com.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SMDD-Bench

1. Core Contribution

SMDD-Bench introduces the first standardized, multi-turn, agentic benchmark for evaluating LLMs on realistic small molecule drug design (SMDD) tasks. The benchmark comprises 502 task instances across five task types: 2D Pharmacophore Identification, Interaction Point Discovery, Scaffold Hopping, Lead Optimization, and Fragment Assembly. These span 102 unique protein targets and 855 unique small molecules.

The key methodological innovation is witness-aware task generation: a procedure that guarantees each task instance is solvable by constructing, alongside the task, a "witness" molecule that passes all evaluation criteria. This addresses the fundamental problem of creating verifiable drug design tasks without human expert curation, enabling scalable benchmark construction while ensuring that every released task has at least one known solution.
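The witness idea can be illustrated with a minimal sketch. Everything here is hypothetical (the `Task` type, `make_task`, and the crude `toy_oracle` are stand-ins invented for illustration, not the authors' pipeline, which uses Boltz2 and ADMET-AI): a task is released only when a candidate molecule is confirmed to pass every constraint under the same oracle that will later grade the agent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical types: an oracle maps a SMILES string to predicted properties,
# and a constraint is a predicate over those properties.
Properties = Dict[str, float]
Constraint = Callable[[Properties], bool]


@dataclass
class Task:
    constraints: Dict[str, Constraint]
    witness: str  # SMILES of a molecule known to satisfy every constraint


def make_task(candidate: str,
              oracle: Callable[[str], Properties],
              constraints: Dict[str, Constraint]) -> Optional[Task]:
    """Release a task only if `candidate` passes every constraint under the oracle."""
    props = oracle(candidate)
    if all(check(props) for check in constraints.values()):
        return Task(constraints=constraints, witness=candidate)
    return None  # no witness found, so the task is never released


def toy_oracle(smiles: str) -> Properties:
    """Crude stand-in oracle (an assumption): counts letters as a size proxy."""
    return {"size": float(sum(ch.isalpha() for ch in smiles))}
```

Because the witness is checked with the grading oracle itself, solvability holds by construction rather than by expert judgment.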

2. Methodological Rigor

The benchmark design demonstrates considerable rigor across several dimensions:

Task construction: Each task type has a detailed, principled generation pipeline. For example, Scaffold Hopping tasks use Boltz2 co-folding with 10 diffusion samples and consensus interaction fingerprints (60% threshold) to mitigate stochasticity — a thoughtful design choice. Lead Optimization tasks derive objectives from actual property differences between molecule pairs, grounding tasks in realistic chemistry.
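The consensus step can be sketched as a simple vote over samples. This is an illustrative assumption about how such a fingerprint might be computed, not the paper's code: the `Interaction` tuple format, the function name, and the use of "at least 60%" (rather than "more than 60%") are all hypothetical choices.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# Hypothetical representation: (residue, interaction type), e.g. ("ASP86", "hbond").
Interaction = Tuple[str, str]


def consensus_fingerprint(samples: Iterable[Set[Interaction]],
                          threshold: float = 0.6) -> Set[Interaction]:
    """Keep interactions present in at least `threshold` of the sampled poses."""
    sample_list = list(samples)
    if not sample_list:
        return set()
    # Each sample is a set, so an interaction is counted at most once per sample.
    counts = Counter(i for s in sample_list for i in s)
    n = len(sample_list)
    return {i for i, c in counts.items() if c / n >= threshold}
```

With 10 samples, an interaction seen in 6 poses is kept and one seen in 5 is dropped, which filters out contacts that appear only in a minority of stochastic predictions.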

Evaluation design: The multi-gate evaluation (validity, hard constraints, property objectives, binding probability) is well-structured. Using the same oracles (Boltz2, ADMET-AI) for both task generation and evaluation ensures internal consistency, though the authors appropriately acknowledge these are imperfect proxies for real-world assays.
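A gated evaluation of this kind can be sketched as an ordered list of checks. The sketch assumes the gates are applied in sequence and that all must pass; the gate names follow the review, while the `evaluate` function and the toy predicates are hypothetical placeholders for real validity checks and oracle calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Gate = Tuple[str, Callable[[str], bool]]  # (gate name, predicate over a SMILES string)


@dataclass
class GateResult:
    passed: bool
    failed_gate: str  # empty string when every gate passes


def evaluate(smiles: str, gates: List[Gate]) -> GateResult:
    """Run gates in order and stop at the first failure."""
    for name, check in gates:
        if not check(smiles):
            return GateResult(False, name)
    return GateResult(True, "")


# Toy predicates (assumptions); real gates would call a SMILES parser and the oracles.
toy_gates: List[Gate] = [
    ("validity", lambda s: len(s) > 0),
    ("hard_constraints", lambda s: "Br" not in s),
    ("property_objectives", lambda s: len(s) <= 20),
    ("binding_probability", lambda s: "N" in s),
]
```

Recording which gate failed, rather than only pass or fail, is what makes per-gate failure analysis possible.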

Experimental methodology: The authors benchmark 7 frontier LLMs using a deliberately minimalist ReAct agent to isolate model capability from harness engineering. The oracle budget calibration (sweeping Boltz and ADMET-AI call counts on a calibration subset) is principled. The obfuscation of PDB codes and target names to minimize memorization effects is a smart precaution.
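One way to enforce a fixed oracle budget inside an agent loop is a thin wrapper around each tool. The class and exception names below are hypothetical, and the paper's harness may implement budgeting differently; this is only a sketch of the constraint the agent plans under.

```python
from typing import Callable, Dict


class BudgetExceeded(RuntimeError):
    """Raised when the agent requests more oracle calls than it was allotted."""


class BudgetedOracle:
    """Wraps an oracle function and refuses calls once the budget is spent."""

    def __init__(self, fn: Callable[[str], Dict[str, float]], budget: int) -> None:
        self.fn = fn
        self.budget = budget
        self.calls = 0

    def __call__(self, smiles: str) -> Dict[str, float]:
        if self.calls >= self.budget:
            raise BudgetExceeded(f"budget of {self.budget} calls exhausted")
        self.calls += 1
        return self.fn(smiles)

    @property
    def remaining(self) -> int:
        return self.budget - self.calls
```

Exposing `remaining` lets the harness report the leftover budget to the agent at each turn, so planning over a limited number of calls is part of what is being measured.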

Potential weaknesses: The reliance on Boltz2 and ADMET-AI as ground-truth oracles is the most significant limitation. While the authors frame this as analogous to wet-lab assays, systematic biases in these models could reward agents that exploit oracle weaknesses rather than demonstrating genuine chemical reasoning. The benchmark also lacks validation against actual experimental data — no task instance has been verified through synthesis and biological testing.

The statistical rigor could be improved: no error bars or confidence intervals are reported for main results, and performance on stochastic Boltz2 predictions may vary across runs for the same agent-task pair.
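To give a sense of the missing uncertainty, a Wilson score interval can be computed for the headline number. The count of 202 solved tasks is an assumption (it is the integer that rounds to 40.2% of 502), and the interval reflects only task-sampling uncertainty, not run-to-run variation in Boltz2 predictions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed count: 202 of 502 tasks solved (about 40.2%).
low, high = wilson_interval(202, 502)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 36% to 45%, so differences of a few points between models on the full benchmark may not be distinguishable without repeated runs.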

3. Potential Impact

Near-term impact: SMDD-Bench fills a genuine gap. Prior LLM chemistry benchmarks (ChemBench, MolecularIQ, SmolInstruct) are restricted to single-turn QA and don't evaluate agentic reasoning. The public leaderboard, SMDD-Bench Lite (100 tasks for rapid iteration), and SMDD-Bench Diversity subset provide well-designed entry points for the research community.

Research directions enabled: The paper's analyses reveal specific, actionable failure modes — absence of cross-turn SAR synthesis, incoherent multi-turn planning, and the enumeration-vs-selection gap (Table 4 shows agents often enumerate passing molecules but fail to select them). These findings directly inform future agent architecture and training methodology research.
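The enumeration-vs-selection gap can be made concrete with a small metric. The trajectory format (`proposed` and `final` keys) and the definition of the gap are assumptions for illustration and may not match how Table 4 is computed in the paper.

```python
from typing import Callable, Dict, List


def enumeration_selection_gap(trajectories: List[Dict],
                              passes: Callable[[str], bool]) -> Dict[str, float]:
    """Compare how often a passing molecule was proposed vs. finally submitted.

    Each trajectory is a dict with "proposed" (list of SMILES seen during the
    run) and "final" (the submitted SMILES). Both keys are hypothetical.
    """
    n = len(trajectories)
    if n == 0:
        return {"enumerated_rate": 0.0, "selected_rate": 0.0, "gap": 0.0}
    enumerated = sum(any(passes(m) for m in t["proposed"]) for t in trajectories)
    selected = sum(passes(t["final"]) for t in trajectories)
    return {
        "enumerated_rate": enumerated / n,
        "selected_rate": selected / n,
        "gap": (enumerated - selected) / n,
    }
```

A large positive gap would indicate that the bottleneck is choosing among candidates the agent already generated, rather than generating them in the first place.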

Broader applications: The witness-aware task generation paradigm could generalize beyond drug design to other scientific discovery benchmarks where solution existence is hard to verify. The framework for studying output diversity (SMDD-Bench Diversity) addresses a practical need for parallel agent deployment.

Limitations on real-world translation: The disconnect between oracle-based evaluation and real wet-lab outcomes means performance on SMDD-Bench may not directly predict real-world drug design capability. The authors acknowledge this and propose the benchmark as a training testbed before deployment with real laboratories.

4. Timeliness & Relevance

This work arrives at a critical inflection point: LLM agents are being aggressively promoted for scientific discovery, yet rigorous evaluation frameworks for domain-specific applications lag behind. The drug discovery community increasingly explores AI-driven design, and the absence of standardized benchmarks has led to cherry-picked demonstrations and incomparable results across papers. SMDD-Bench directly addresses this bottleneck.

The finding that GPT-5.4 solves only 40.2% of tasks (predominantly Lead Optimization), while achieving near-zero performance on Interaction Point Discovery, Scaffold Hopping, and Fragment Assembly, suggests that current LLMs lack fundamental 3D chemical reasoning. This is an important empirical contribution that tempers overly optimistic narratives about LLM-driven drug discovery.

5. Strengths & Limitations

Key strengths:

  • Witness-aware generation is a genuinely novel and useful contribution ensuring task solvability
  • Comprehensive task diversity spanning five relevant drug design workflows
  • Rich diagnostic analyses (enumeration vs. selection, diversity metrics, failure mode taxonomy)
  • Well-designed benchmark variants (Lite, Diversity) for different use cases
  • Detailed appendix with full task specifications and agent trajectories

Notable limitations:

  • Oracle dependence: Boltz2 binding affinity predictions and ADMET-AI property predictions have known inaccuracies, and agents optimizing for these proxies may not produce genuinely useful drug candidates
  • The minimalist agent scaffold, while principled for isolating LLM capability, may understate what purpose-built agentic systems could achieve
  • No human expert baseline is provided — it would be valuable to know how medicinal chemists perform on these tasks under the same oracle constraints
  • The 0% success rate on Interaction Point Discovery (except 4% for Gemini) and the near-zero success rate on Fragment Assembly raise questions about whether the current task formulations are appropriately calibrated
  • Chemical novelty metrics (Table 2) showing 65-94% novelty are encouraging but should be interpreted cautiously given that structural similarity to known compounds could still be high

Overall Assessment

    SMDD-Bench makes a substantial contribution to the intersection of LLM agents and computational drug discovery. The benchmark is well-constructed, the witness-aware generation paradigm is innovative, and the empirical findings provide actionable insights for the field. While the reliance on learned oracles limits direct translational claims, the benchmark serves its stated purpose as a standardized testbed for developing and comparing LLM agent capabilities in drug design.

    Rating: 7.5/10
    Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 8

    Generated May 22, 2026

    Comparison History (24)

    vs. The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems
    gpt-5.2 · 5/22/2026

    Paper 1 is likely higher impact: it introduces a large, standardized, guaranteed-solvable benchmark (502 instances, 102 targets) for a high-value real-world domain (small-molecule drug design), directly enabling measurable progress and model comparison. Its methodological contribution (task design, multi-turn long-horizon setup, leaderboard, baseline results showing large headroom) is concrete and timely for LLM-for-science evaluation. Paper 2 is an interesting systems-architecture idea with broad applicability, but appears less empirically validated and closer to an engineering proposal, which may reduce near-term scientific uptake.

    vs. Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models
    claude-opus-4.6 · 5/22/2026

    SMDD-Bench introduces a comprehensive, standardized benchmark for evaluating LLM agents on real-world drug design tasks—a high-impact application domain. It covers 502 task instances, 102 protein targets, and 5 task types, benchmarking 7 frontier LLMs. This addresses a critical gap in AI for drug discovery with broad implications for computational chemistry and pharmaceutical research. Paper 2 is a narrow case study analyzing a single political speech with limited generalizability. SMDD-Bench's scale, practical relevance, public leaderboard, and potential to drive progress in autonomous drug design give it substantially higher impact.

    vs. A Subjective Logic-based method for runtime confidence updates in safety arguments
    gemini-3.1 · 5/22/2026

    Paper 1 addresses a highly timely and broadly impactful domain: evaluating LLMs for small molecule drug design. By providing a standardized, large-scale benchmark for a critical scientific application, it is likely to attract widespread attention and high citation counts from both the AI and computational biology communities. Paper 2, while methodologically sound and useful for safety engineering, targets a more specialized niche (runtime assurance using Subjective Logic), resulting in comparatively lower potential breadth and overall scientific impact.

    vs. Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
    gpt-5.2 · 5/22/2026

    Paper 1 identifies a counterintuitive inverse-scaling failure mode in LLM probabilistic forecasting under superlinear growth and regime-change tail risk, validates it across simulated and multiple real-world domains, and shows how common benchmark metrics can mask the issue—directly impacting evaluation practice and deployment safety in high-stakes settings (finance, epidemiology). Its methodological contributions (tail-focused decomposition, within-family scale vs post-training analysis, metric critique) are broadly applicable beyond forecasting. Paper 2 is valuable as a standardized drug-design agent benchmark, but its impact is more field-specific and primarily infrastructural.

    vs. Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings
    gemini-3.1 · 5/22/2026

    Paper 1 addresses the highly impactful field of small molecule drug design by providing a standardized benchmark for LLM agents. This has profound implications for accelerating autonomous scientific discovery and real-world pharmaceutical development. Paper 2, while interesting for computational gastronomy, represents a much more niche application with limited broader scientific impact compared to life-saving drug discovery.

    vs. Towards a General Intelligence and Interface for Wearable Health Data
    gpt-5.2 · 5/22/2026

    Paper 2 likely has higher impact due to its scale (trillion-minute pretraining; 5M participants), broad applicability across 35 diverse health tasks, and strong real-world relevance given widespread wearables. It contributes a general-purpose foundation model plus an interface layer (LLM agents for head search, Personal Health Agent) with clinician validation, increasing translational potential. Paper 1 is novel and useful as a standardized benchmark for LLM-driven drug design, but benchmarks typically yield narrower direct impact than large-scale models that can immediately improve many downstream health applications.

    vs. ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling
    gemini-3.1 · 5/22/2026

    SMDD-Bench introduces a much-needed, standardized benchmark for a critical scientific domain (drug design). By addressing the lack of rigorous evaluation in LLM-driven chemistry and biology, it directly facilitates breakthroughs in real-world medicine. This foundational contribution to AI for Science has a higher potential for transformative real-world impact than the incremental algorithmic improvements in general reasoning offered by Paper 2.

    vs. Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance
    gemini-3.1 · 5/22/2026

    While Paper 1 offers a novel architectural approach to AI safety, Paper 2 introduces a comprehensive benchmark bridging LLMs and small molecule drug design. By establishing a standardized evaluation for a high-value, interdisciplinary application (computational drug discovery), Paper 2 has a broader potential impact across fields, directly catalyzing real-world scientific breakthroughs and driving future development of specialized LLM agents.

    vs. Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents
    gemini-3.1 · 5/22/2026

    Paper 1 bridges AI and computational biology by introducing a comprehensive benchmark for a high-stakes, real-world scientific problem (drug design). Its interdisciplinary nature, rigorous standardization of complex tasks, and potential to catalyze breakthroughs in life sciences give it a higher potential for broad and profound scientific impact compared to the methodological software-engineering focus of Paper 2.

    vs. Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support
    gpt-5.2 · 5/22/2026

    Paper 2 has higher impact potential due to stronger novelty and broader applicability: a large, guaranteed-solvable, multi-turn, long-horizon benchmark spanning 5 realistic drug-design task types, 102 targets, and wide chemical space, with a public leaderboard enabling community adoption and sustained progress. Its applications (autonomous computational drug design) are high-value and cross-cut chemistry, biology, and AI/tool-use agent research. Paper 1 is timely and rigorous for clinical LLM evaluation, but is narrower in domain scope and primarily diagnostic/benchmarking rather than directly enabling discovery pipelines.

    vs. SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
    claude-opus-4.6 · 5/22/2026

    SD-Search introduces a novel, self-contained method (on-policy hindsight self-distillation) that addresses a fundamental credit assignment problem in search-augmented reasoning without requiring external teachers or annotations. This is a broadly applicable methodological contribution relevant to RL-based LLM training across many domains. Paper 1 (SMDD-Bench) is a well-constructed benchmark for a specific application domain (drug design), but benchmarks typically have narrower methodological impact. SD-Search's innovation in step-level credit assignment has wider applicability and advances core RL+LLM training methodology.

    vs. Meta-Learning for Rapid Adaptation in Reference Tracking of Uncertain Nonlinear Systems
    gpt-5.2 · 5/22/2026

    Paper 2 likely has higher scientific impact: it introduces a standardized, large-scale, multi-turn benchmark with broad community utility, immediate relevance to fast-moving LLM-agent research, and high real-world stakes in drug discovery. The public leaderboard and guaranteed-solvable task design can catalyze reproducible progress across academia/industry and multiple subfields (ML, cheminformatics, bioinformatics, agent tooling, evaluation). Paper 1 is innovative and validated (including hardware), but its impact is narrower to meta-learning control and depends more on adoption within a specialized community.

    vs. Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost
    claude-opus-4.6 · 5/22/2026

    Paper 2 addresses a fundamental architectural question in AI agent design—whether agentic workflows can be compiled into model weights instead of relying on external orchestration—with broad applicability across all LLM agent applications. It directly challenges the dominant paradigm used by frameworks with 290K+ GitHub stars, offering practical cost reductions (two orders of magnitude) while maintaining quality. Paper 1, while valuable as a benchmark for drug design, serves a narrower community. Paper 2's findings about subterranean agents could reshape how the entire industry builds and deploys AI agents, giving it broader cross-field impact.

    vs. HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools
    claude-opus-4.6 · 5/22/2026

    SMDD-Bench addresses a critical gap in evaluating LLM agents for drug discovery—a high-impact application domain. It introduces a standardized, large-scale benchmark (502 tasks, 102 protein targets, 5 task types) that can drive progress in autonomous computational drug design. The finding that even GPT-5.4 solves only 40.2% of tasks highlights significant room for improvement, motivating future research. Paper 1 (HarnessAPI) is a useful engineering contribution reducing API boilerplate, but it is narrower in scope—a developer productivity tool rather than a scientific advance with broad cross-disciplinary impact.

    vs. TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization
    claude-opus-4.6 · 5/22/2026

    SMDD-Bench addresses a higher-impact domain (drug discovery) with broader implications for healthcare and pharmaceutical development. It provides a standardized, large-scale benchmark (502 tasks, 102 protein targets, 7 LLMs evaluated) that can accelerate the entire field of AI-driven drug design. Its systematic evaluation of frontier LLMs reveals significant capability gaps (best model solves only 40.2%), establishing clear research directions. The public leaderboard further amplifies community impact. While TO-Agents is innovative in connecting NLP to topology optimization, its scope is narrower, focusing on structural design with limited evaluation scale (two case studies, ten replicates).

    vs. What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct
    gpt-5.2 · 5/22/2026

    Paper 2 likely has higher impact: it introduces a large, standardized, multi-turn benchmark (502 tasks, 102 targets) for evaluating LLM agents on real-world small-molecule drug design—an application area with major scientific and commercial relevance. The benchmark + public leaderboard can catalyze measurable progress, enable cross-model comparisons, and influence both ML and cheminformatics workflows. Paper 1 is valuable for conceptual clarification and governance (taxonomy + expert survey), but its contributions are primarily definitional/organizational and less likely to directly drive technical advances or broad downstream adoption than a widely used drug-design benchmark.

    vs. Towards a compositional semantics for quantitative confidence assessment in assurance arguments
    gemini-3.1 · 5/22/2026

    Paper 2 addresses a highly timely and rapidly expanding field: the application of LLMs to real-world drug design. By providing a comprehensive, multi-turn benchmark (SMDD-Bench), it sets a standard for future research at the intersection of AI and pharmacology. Its potential real-world applications in accelerating drug discovery give it a significantly broader and more profound impact compared to the niche systems engineering focus of Paper 1.

    vs. TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization
    gpt-5.2 · 5/22/2026

    Paper 1 likely has higher impact due to stronger novelty and broader relevance: a large, standardized, multi-turn benchmark (502 solvable tasks, 102 targets) addresses a major gap in evaluating LLM agents for real-world drug design, enabling reproducible comparison and driving community progress via a public leaderboard. Its applications (autonomous computational drug design) have high societal and cross-disciplinary stakes, and the scale/rigor of benchmarking multiple frontier models enhances credibility. Paper 2 is innovative and applied, but is demonstrated on limited case studies and may generalize less broadly than a field-wide benchmark.

    vs. What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct
    gemini-3.1 · 5/22/2026

    Paper 1 introduces a benchmark for LLMs in small molecule drug design, directly bridging AI with a high-impact scientific domain (pharmacology/medicine). Its potential real-world applications in accelerating drug discovery offer profound societal and economic value. While Paper 2 provides valuable insights for AI alignment through its taxonomy of sycophancy, Paper 1's contribution is more likely to drive tangible scientific breakthroughs and tool development across the broader scientific community.

    vs. Towards a compositional semantics for quantitative confidence assessment in assurance arguments
    gpt-5.2 · 5/22/2026

    Paper 1 likely has higher impact: it introduces a large, standardized, multi-turn benchmark (502 instances, 102 targets) for evaluating LLM agents on realistic small-molecule drug design, with a public leaderboard—highly timely and broadly useful across AI4Science, cheminformatics, and agent evaluation. Its clear performance gap (best model ~40%) creates an actionable research target and can catalyze method development and fair comparison. Paper 2 offers a rigorous semantic framework for confidence in assurance arguments, valuable but more niche, with narrower cross-field uptake and less immediate large-scale empirical leverage.