SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?
Kevin Han, Renfei Zhang, Kathy Wei, Hamed Mahdavi, Niloofar Mireshghallah, Amir Farimani
Abstract
LLM agents have incredible potential for scientific discovery applications. However, the performance of LLM agents on real-world, small molecule drug design (SMDD) tasks across diverse chemistries and targets is unclear. Current evaluation methods are either ad hoc, too simple for real-world discovery, limited in scale, or restricted to single-turn question answering. In an effort to standardize the evaluation of LLM agents on small molecule design, we introduce SMDD-Bench, a challenging, multi-turn, long-horizon agentic benchmark consisting of 502 guaranteed-solvable task instances spanning 5 task types: 2D Pharmacophore Identification, Interaction Point Discovery, Scaffold Hopping, Lead Optimization, and Fragment Assembly. SMDD-Bench tasks span a wide region of chemical space and involve 102 unique protein targets. Completely solving the benchmark would require strong chemical and biological reasoning and 3D intuition, an understanding of specialized tool use, and planning expertise over a limited number of oracle calls. We benchmark 7 frontier open- and closed-source LLMs and find that even the most performant LLM, GPT5.4, solves only 40.2% of tasks. We hope SMDD-Bench provides a standardized testbed to invigorate the field towards training and evaluating LLM agents for fully autonomous computational drug design. We host a public leaderboard at smddbench.com.
AI Impact Assessments
Scientific Impact Assessment: SMDD-Bench
1. Core Contribution
SMDD-Bench introduces the first standardized, multi-turn, agentic benchmark for evaluating LLMs on realistic small molecule drug design (SMDD) tasks. The benchmark comprises 502 task instances across five task types: 2D Pharmacophore Identification, Interaction Point Discovery, Scaffold Hopping, Lead Optimization, and Fragment Assembly. These span 102 unique protein targets and 855 unique small molecules.
The key methodological innovation is witness-aware task generation — a procedure that guarantees each task instance is solvable by simultaneously constructing a "witness" molecule that passes all evaluation criteria. This elegantly solves the fundamental problem of creating verifiable drug design tasks without human expert curation, enabling scalable benchmark construction while ensuring no task is trivially unsolvable.
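The idea of emitting a task only together with a molecule that passes it can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Task` type, the `make_task` helper, and the string-based toy checks are all hypothetical stand-ins for the real constraint and oracle machinery.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Check = Callable[[str], bool]  # a predicate over a SMILES string


@dataclass
class Task:
    constraints: Dict[str, Check]  # constraint name -> check
    witness: str                   # molecule known to pass every check


def make_task(candidates: List[str], constraints: Dict[str, Check]) -> Optional[Task]:
    """Emit a task only if some candidate passes all checks.

    The passing candidate is stored as the witness, so every emitted
    task is solvable by construction. Returns None otherwise.
    """
    for smiles in candidates:
        if all(check(smiles) for check in constraints.values()):
            return Task(constraints=constraints, witness=smiles)
    return None


# Toy checks (assumptions for illustration, not real chemistry oracles).
toy_constraints = {
    "has_nitrogen": lambda s: "N" in s,
    "short_enough": lambda s: len(s) <= 5,
}

task = make_task(["CCO", "CCN"], toy_constraints)
unsolvable = make_task(["CCO"], toy_constraints)
```

In this sketch a task specification that no candidate satisfies is simply discarded, which is one way to obtain the solvability guarantee described above.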
2. Methodological Rigor
The benchmark design demonstrates considerable rigor across several dimensions:
Task construction: Each task type has a detailed, principled generation pipeline. For example, Scaffold Hopping tasks use Boltz2 co-folding with 10 diffusion samples and consensus interaction fingerprints (60% threshold) to mitigate stochasticity — a thoughtful design choice. Lead Optimization tasks derive objectives from actual property differences between molecule pairs, grounding tasks in realistic chemistry.
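A consensus fingerprint over repeated stochastic samples can be sketched as follows. The function name, the string encoding of interactions, and the residue labels are hypothetical, and treating the 60% threshold as inclusive ("at least 60%") is an assumption, since the review does not say.

```python
from collections import Counter
from typing import FrozenSet, List, Set


def consensus_interactions(samples: List[Set[str]], threshold: float = 0.6) -> FrozenSet[str]:
    """Keep interactions seen in at least `threshold` of the samples."""
    if not samples:
        return frozenset()
    counts = Counter(interaction for sample in samples for interaction in sample)
    n = len(samples)
    return frozenset(i for i, c in counts.items() if c / n >= threshold)


# Ten hypothetical co-folding samples: one interaction appears in 8,
# one in exactly 6, and one in only 3.
samples = (
    [{"HBond:ASP86", "PiStack:PHE80", "Salt:LYS33"}] * 3
    + [{"HBond:ASP86", "PiStack:PHE80"}] * 3
    + [{"HBond:ASP86"}] * 2
    + [set()] * 2
)

consensus = consensus_interactions(samples)
```

Here the interaction seen in only 3 of 10 samples is dropped, which is the sense in which a consensus threshold dampens sample-to-sample noise.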
Evaluation design: The multi-gate evaluation (validity, hard constraints, property objectives, binding probability) is well-structured. Using the same oracles (Boltz2, ADMET-AI) for both task generation and evaluation ensures internal consistency, though the authors appropriately acknowledge these are imperfect proxies for real-world assays.
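The four gates named above can be pictured as an ordered chain of checks. The sketch below assumes the gates run sequentially and stop at the first failure, which the review does not confirm; the lambdas are toy stand-ins and do not call Boltz2 or ADMET-AI.

```python
from typing import Callable, List, Tuple

Gate = Tuple[str, Callable[[str], bool]]


def evaluate(smiles: str, gates: List[Gate]) -> Tuple[bool, str]:
    """Run gates in order; report the first gate that fails, or "all" on success."""
    for name, check in gates:
        if not check(smiles):
            return False, name
    return True, "all"


def toy_binding(smiles: str) -> float:
    # Hypothetical stand-in for a binding-probability oracle.
    return 0.7 if "O" in smiles else 0.2


gates: List[Gate] = [
    ("validity", lambda s: len(s) > 0),
    ("hard_constraints", lambda s: "N" in s),
    ("property_objectives", lambda s: len(s) <= 10),
    ("binding_probability", lambda s: toy_binding(s) >= 0.5),
]

result_pass = evaluate("NCCO", gates)
result_constraint = evaluate("CCO", gates)
result_binding = evaluate("NCC", gates)
```

Returning the name of the failing gate makes it possible to tabulate where submissions fail, not only whether they fail.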
Experimental methodology: The authors benchmark 7 frontier LLMs using a deliberately minimalist ReAct agent to isolate model capability from harness engineering. The oracle budget calibration (sweeping Boltz and ADMET-AI call counts on a calibration subset) is principled. The obfuscation of PDB codes and target names to minimize memorization effects is a smart precaution.
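A fixed oracle budget can be enforced with a thin wrapper around the scoring function. This is a sketch under stated assumptions: `BudgetedOracle`, `BudgetExhausted`, and the toy scorer are hypothetical names, and the benchmark's actual harness may account for calls differently.

```python
from typing import Callable


class BudgetExhausted(RuntimeError):
    """Raised when more oracle calls are requested than the budget allows."""


class BudgetedOracle:
    def __init__(self, fn: Callable[[str], float], budget: int) -> None:
        self.fn = fn
        self.budget = budget
        self.calls = 0

    @property
    def remaining(self) -> int:
        return self.budget - self.calls

    def __call__(self, smiles: str) -> float:
        # Refuse the call before spending any compute once the budget is used up.
        if self.calls >= self.budget:
            raise BudgetExhausted(f"budget of {self.budget} calls used up")
        self.calls += 1
        return self.fn(smiles)


def toy_scorer(smiles: str) -> float:
    # Hypothetical stand-in for an expensive oracle such as a co-folding model.
    return len(smiles) / 10


oracle = BudgetedOracle(toy_scorer, budget=2)
scores = [oracle("CCO"), oracle("CCN")]
```

An agent loop would check `oracle.remaining` before each tool call, which is where planning over a limited number of oracle calls becomes part of the task.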
Potential weaknesses: The reliance on Boltz2 and ADMET-AI as ground-truth oracles is the most significant limitation. While the authors frame this as analogous to wet-lab assays, systematic biases in these models could reward agents that exploit oracle weaknesses rather than demonstrating genuine chemical reasoning. The benchmark also lacks validation against actual experimental data — no task instance has been verified through synthesis and biological testing.
The statistical rigor could be improved: no error bars or confidence intervals are reported for main results, and performance on stochastic Boltz2 predictions may vary across runs for the same agent-task pair.
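As an illustration of the kind of interval that could accompany a headline solve rate, the sketch below computes a Wilson score interval. The count of 202 solved tasks is an assumption back-derived from 40.2% of 502, not a figure reported in the paper, and the interval covers sampling uncertainty over tasks only, not run-to-run oracle stochasticity.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumed count: 202 of 502 tasks solved (about 40.2%).
low, high = wilson_interval(202, 502)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 36% to 45%, so differences of a few percentage points between models on this benchmark size would be hard to distinguish without repeated runs.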
3. Potential Impact
Near-term impact: SMDD-Bench fills a genuine gap. Prior LLM chemistry benchmarks (ChemBench, MolecularIQ, SmolInstruct) are restricted to single-turn QA and don't evaluate agentic reasoning. The public leaderboard, SMDD-Bench Lite (100 tasks for rapid iteration), and SMDD-Bench Diversity subset provide well-designed entry points for the research community.
Research directions enabled: The paper's analyses reveal specific, actionable failure modes — absence of cross-turn SAR synthesis, incoherent multi-turn planning, and the enumeration-vs-selection gap (Table 4 shows agents often enumerate passing molecules but fail to select them). These findings directly inform future agent architecture and training methodology research.
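The enumeration-vs-selection gap can be made concrete as two rates computed over the same runs. The record layout below (`enumerated` flags per proposed molecule, `selected` for the final answer) is a hypothetical format, and the four runs are made-up data, not results from Table 4.

```python
from typing import Dict, List, Tuple


def enumeration_selection_gap(runs: List[Dict]) -> Tuple[float, float]:
    """Return (rate of runs with any passing enumerated molecule,
    rate of runs whose final selected molecule passes)."""
    if not runs:
        raise ValueError("runs must be non-empty")
    n = len(runs)
    enumerated = sum(1 for r in runs if any(r["enumerated"]))
    selected = sum(1 for r in runs if r["selected"])
    return enumerated / n, selected / n


# Made-up example: three runs proposed a passing molecule at some point,
# but only one submitted a passing molecule as its final answer.
runs = [
    {"enumerated": [False, True, False], "selected": True},
    {"enumerated": [True, False], "selected": False},
    {"enumerated": [False, False, True], "selected": False},
    {"enumerated": [False, False], "selected": False},
]

enum_rate, select_rate = enumeration_selection_gap(runs)
```

The difference between the two rates isolates selection failures from generation failures, which is the distinction this failure mode draws.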
Broader applications: The witness-aware task generation paradigm could generalize beyond drug design to other scientific discovery benchmarks where solution existence is hard to verify. The framework for studying output diversity (SMDD-Bench Diversity) addresses a practical need for parallel agent deployment.
Limitations on real-world translation: The disconnect between oracle-based evaluation and real wet-lab outcomes means performance on SMDD-Bench may not directly predict real-world drug design capability. The authors acknowledge this and propose the benchmark as a training testbed before deployment with real laboratories.
4. Timeliness & Relevance
This work arrives at a critical inflection point: LLM agents are being aggressively promoted for scientific discovery, yet rigorous evaluation frameworks for domain-specific applications lag behind. The drug discovery community increasingly explores AI-driven design, and the absence of standardized benchmarks has led to cherry-picked demonstrations and incomparable results across papers. SMDD-Bench directly addresses this bottleneck.
The finding that GPT-5.4 solves only 40.2% of tasks (predominantly Lead Optimization) while achieving near-zero performance on Interaction Point Discovery, Scaffold Hopping, and Fragment Assembly suggests that current LLMs lack fundamental 3D chemical reasoning. This is an important empirical contribution that tempers overly optimistic narratives about LLM-driven drug discovery.
5. Strengths & Limitations
Key strengths:
Notable limitations:
Overall Assessment
SMDD-Bench makes a substantial contribution to the intersection of LLM agents and computational drug discovery. The benchmark is well-constructed, the witness-aware generation paradigm is innovative, and the empirical findings provide actionable insights for the field. While the reliance on learned oracles limits direct translational claims, the benchmark serves its stated purpose as a standardized testbed for developing and comparing LLM agent capabilities in drug design.
Generated May 22, 2026
Comparison History (24)
Paper 1 is likely higher impact: it introduces a large, standardized, guaranteed-solvable benchmark (502 instances, 102 targets) for a high-value real-world domain (small-molecule drug design), directly enabling measurable progress and model comparison. Its methodological contribution (task design, multi-turn long-horizon setup, leaderboard, baseline results showing large headroom) is concrete and timely for LLM-for-science evaluation. Paper 2 is an interesting systems-architecture idea with broad applicability, but appears less empirically validated and closer to an engineering proposal, which may reduce near-term scientific uptake.
SMDD-Bench introduces a comprehensive, standardized benchmark for evaluating LLM agents on real-world drug design tasks—a high-impact application domain. It covers 502 task instances, 102 protein targets, and 5 task types, benchmarking 7 frontier LLMs. This addresses a critical gap in AI for drug discovery with broad implications for computational chemistry and pharmaceutical research. Paper 2 is a narrow case study analyzing a single political speech with limited generalizability. SMDD-Bench's scale, practical relevance, public leaderboard, and potential to drive progress in autonomous drug design give it substantially higher impact.
Paper 1 addresses a highly timely and broadly impactful domain: evaluating LLMs for small molecule drug design. By providing a standardized, large-scale benchmark for a critical scientific application, it is likely to attract widespread attention and high citation counts from both the AI and computational biology communities. Paper 2, while methodologically sound and useful for safety engineering, targets a more specialized niche (runtime assurance using Subjective Logic), resulting in comparatively lower potential breadth and overall scientific impact.
Paper 1 identifies a counterintuitive inverse-scaling failure mode in LLM probabilistic forecasting under superlinear growth and regime-change tail risk, validates it across simulated and multiple real-world domains, and shows how common benchmark metrics can mask the issue—directly impacting evaluation practice and deployment safety in high-stakes settings (finance, epidemiology). Its methodological contributions (tail-focused decomposition, within-family scale vs post-training analysis, metric critique) are broadly applicable beyond forecasting. Paper 2 is valuable as a standardized drug-design agent benchmark, but its impact is more field-specific and primarily infrastructural.
Paper 1 addresses the highly impactful field of small molecule drug design by providing a standardized benchmark for LLM agents. This has profound implications for accelerating autonomous scientific discovery and real-world pharmaceutical development. Paper 2, while interesting for computational gastronomy, represents a much more niche application with limited broader scientific impact compared to life-saving drug discovery.
Paper 2 likely has higher impact due to its scale (trillion-minute pretraining; 5M participants), broad applicability across 35 diverse health tasks, and strong real-world relevance given widespread wearables. It contributes a general-purpose foundation model plus an interface layer (LLM agents for head search, Personal Health Agent) with clinician validation, increasing translational potential. Paper 1 is novel and useful as a standardized benchmark for LLM-driven drug design, but benchmarks typically yield narrower direct impact than large-scale models that can immediately improve many downstream health applications.
SMDD-Bench introduces a much-needed, standardized benchmark for a critical scientific domain (drug design). By addressing the lack of rigorous evaluation in LLM-driven chemistry and biology, it directly facilitates breakthroughs in real-world medicine. This foundational contribution to AI for Science has a higher potential for transformative real-world impact than the incremental algorithmic improvements in general reasoning offered by Paper 2.
While Paper 1 offers a novel architectural approach to AI safety, Paper 2 introduces a comprehensive benchmark bridging LLMs and small molecule drug design. By establishing a standardized evaluation for a high-value, interdisciplinary application (computational drug discovery), Paper 2 has a broader potential impact across fields, directly catalyzing real-world scientific breakthroughs and driving future development of specialized LLM agents.
Paper 1 bridges AI and computational biology by introducing a comprehensive benchmark for a high-stakes, real-world scientific problem (drug design). Its interdisciplinary nature, rigorous standardization of complex tasks, and potential to catalyze breakthroughs in life sciences give it a higher potential for broad and profound scientific impact compared to the methodological software-engineering focus of Paper 2.
Paper 2 has higher impact potential due to stronger novelty and broader applicability: a large, guaranteed-solvable, multi-turn, long-horizon benchmark spanning 5 realistic drug-design task types, 102 targets, and wide chemical space, with a public leaderboard enabling community adoption and sustained progress. Its applications (autonomous computational drug design) are high-value and cross-cut chemistry, biology, and AI/tool-use agent research. Paper 1 is timely and rigorous for clinical LLM evaluation, but is narrower in domain scope and primarily diagnostic/benchmarking rather than directly enabling discovery pipelines.
SD-Search introduces a novel, self-contained method (on-policy hindsight self-distillation) that addresses a fundamental credit assignment problem in search-augmented reasoning without requiring external teachers or annotations. This is a broadly applicable methodological contribution relevant to RL-based LLM training across many domains. Paper 1 (SMDD-Bench) is a well-constructed benchmark for a specific application domain (drug design), but benchmarks typically have narrower methodological impact. SD-Search's innovation in step-level credit assignment has wider applicability and advances core RL+LLM training methodology.
Paper 2 likely has higher scientific impact: it introduces a standardized, large-scale, multi-turn benchmark with broad community utility, immediate relevance to fast-moving LLM-agent research, and high real-world stakes in drug discovery. The public leaderboard and guaranteed-solvable task design can catalyze reproducible progress across academia/industry and multiple subfields (ML, cheminformatics, bioinformatics, agent tooling, evaluation). Paper 1 is innovative and validated (including hardware), but its impact is narrower to meta-learning control and depends more on adoption within a specialized community.
Paper 2 addresses a fundamental architectural question in AI agent design—whether agentic workflows can be compiled into model weights instead of relying on external orchestration—with broad applicability across all LLM agent applications. It directly challenges the dominant paradigm used by frameworks with 290K+ GitHub stars, offering practical cost reductions (two orders of magnitude) while maintaining quality. Paper 1, while valuable as a benchmark for drug design, serves a narrower community. Paper 2's findings about subterranean agents could reshape how the entire industry builds and deploys AI agents, giving it broader cross-field impact.
SMDD-Bench addresses a critical gap in evaluating LLM agents for drug discovery—a high-impact application domain. It introduces a standardized, large-scale benchmark (502 tasks, 102 protein targets, 5 task types) that can drive progress in autonomous computational drug design. The finding that even GPT-5.4 solves only 40.2% of tasks highlights significant room for improvement, motivating future research. Paper 1 (HarnessAPI) is a useful engineering contribution reducing API boilerplate, but it is narrower in scope—a developer productivity tool rather than a scientific advance with broad cross-disciplinary impact.
SMDD-Bench addresses a higher-impact domain (drug discovery) with broader implications for healthcare and pharmaceutical development. It provides a standardized, large-scale benchmark (502 tasks, 102 protein targets, 7 LLMs evaluated) that can accelerate the entire field of AI-driven drug design. Its systematic evaluation of frontier LLMs reveals significant capability gaps (best model solves only 40.2%), establishing clear research directions. The public leaderboard further amplifies community impact. While TO-Agents is innovative in connecting NLP to topology optimization, its scope is narrower, focusing on structural design with limited evaluation scale (two case studies, ten replicates).
Paper 2 likely has higher impact: it introduces a large, standardized, multi-turn benchmark (502 tasks, 102 targets) for evaluating LLM agents on real-world small-molecule drug design—an application area with major scientific and commercial relevance. The benchmark + public leaderboard can catalyze measurable progress, enable cross-model comparisons, and influence both ML and cheminformatics workflows. Paper 1 is valuable for conceptual clarification and governance (taxonomy + expert survey), but its contributions are primarily definitional/organizational and less likely to directly drive technical advances or broad downstream adoption than a widely used drug-design benchmark.
Paper 2 addresses a highly timely and rapidly expanding field: the application of LLMs to real-world drug design. By providing a comprehensive, multi-turn benchmark (SMDD-Bench), it sets a standard for future research at the intersection of AI and pharmacology. Its potential real-world applications in accelerating drug discovery give it a significantly broader and more profound impact compared to the niche systems engineering focus of Paper 1.
Paper 1 likely has higher impact due to stronger novelty and broader relevance: a large, standardized, multi-turn benchmark (502 solvable tasks, 102 targets) addresses a major gap in evaluating LLM agents for real-world drug design, enabling reproducible comparison and driving community progress via a public leaderboard. Its applications (autonomous computational drug design) have high societal and cross-disciplinary stakes, and the scale/rigor of benchmarking multiple frontier models enhances credibility. Paper 2 is innovative and applied, but is demonstrated on limited case studies and may generalize less broadly than a field-wide benchmark.
Paper 1 introduces a benchmark for LLMs in small molecule drug design, directly bridging AI with a high-impact scientific domain (pharmacology/medicine). Its potential real-world applications in accelerating drug discovery offer profound societal and economic value. While Paper 2 provides valuable insights for AI alignment through its taxonomy of sycophancy, Paper 1's contribution is more likely to drive tangible scientific breakthroughs and tool development across the broader scientific community.
Paper 1 likely has higher impact: it introduces a large, standardized, multi-turn benchmark (502 instances, 102 targets) for evaluating LLM agents on realistic small-molecule drug design, with a public leaderboard—highly timely and broadly useful across AI4Science, cheminformatics, and agent evaluation. Its clear performance gap (best model ~40%) creates an actionable research target and can catalyze method development and fair comparison. Paper 2 offers a rigorous semantic framework for confidence in assurance arguments, valuable but more niche, with narrower cross-field uptake and less immediate large-scale empirical leverage.