MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents
Thao Nguyen, Heng Ji
Abstract
We present MolLingo, a multi-agent system that emulates the reasoning process of a chemist to automate molecular design. Existing LLM-based approaches either operate as standalone generative models without access to external tools or lack the multi-agent coordination and shared memory needed for iterative, evidence-driven reasoning across the molecular design pipeline. MolLingo addresses this by coordinating a Literature Agent, a Chemist Agent, and an Orchestrator through a shared memory module, with each agent equipped with domain-specific tools. To enable effective molecular reasoning, we introduce BRICS-based Fragment Enumeration (BFE), a synthesis-aware molecular fragmentation method that decomposes molecules into chemically meaningful building blocks represented as block-based SMILES paired with common chemical names. This representation bridges molecular structure and LLM semantic space, enabling block-level reasoning and editing that is difficult with raw SMILES alone. As a case study in early-stage therapeutic design, MolLingo further grounds the Chemist Agent's reasoning in binding site geometry and residue-level protein context derived from molecular docking to optimize molecules for stronger target binding. Across four benchmarks, MolLingo consistently outperforms frontier LLMs and specialized baselines, including a fourfold docking score improvement over GPT-5.4 despite using the same underlying model, consistent drug property optimization gains across multiple LLM backbones, and state-of-the-art results on TOMG-Bench, surpassing both frontier LLMs and the RL-based optimization method RePO. Our results suggest that LLMs are already capable molecular design assistants when guided through chemically meaningful representations and biologically grounded structural context. Code is available at: https://anonymous.4open.science/status/MolLingo-7450.
AI Impact Assessments
Scientific Impact Assessment: MolLingo
1. Core Contribution
MolLingo makes two primary contributions: (1) a multi-agent architecture coordinating Literature, Chemist, and Orchestrator agents through shared memory for iterative drug design, and (2) BRICS-based Fragment Enumeration (BFE), a synthesis-aware molecular fragmentation method that represents molecules as sequences of named chemical building blocks paired with block-level SMILES. The central thesis is that LLMs already possess sufficient chemical knowledge for molecular design but are bottlenecked by representation — specifically, that raw SMILES strings are semantically opaque to transformers, while named functional blocks (e.g., "pyridine," "benzamide") align with LLM pretraining distributions.
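To make the representation idea concrete, here is a minimal sketch of a molecule held as a sequence of named blocks, each paired with a block-level SMILES string. The `Block` class, the `render_blocks` serialization, and the `replace_block` helper are hypothetical names chosen for illustration; the paper's actual format (for example, how attachment points are marked) is not described in this assessment, so the SMILES below are plain standalone fragments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str    # common chemical name, e.g. "pyridine"
    smiles: str  # block-level SMILES for that fragment


def render_blocks(blocks):
    """Serialize blocks as 'name (SMILES)' items joined by ' | ' (assumed format)."""
    return " | ".join(f"{b.name} ({b.smiles})" for b in blocks)


def replace_block(blocks, old_name, new_block):
    """Return a new block list with every block named old_name swapped out."""
    return [new_block if b.name == old_name else b for b in blocks]


# Illustrative molecule description built from the two example names in the text.
blocks = [
    Block("pyridine", "c1ccncc1"),
    Block("benzamide", "NC(=O)c1ccccc1"),
]
print(render_blocks(blocks))

# A block-level edit: swap pyridine for pyrimidine without touching the rest.
edited = replace_block(blocks, "pyridine", Block("pyrimidine", "c1ncncc1"))
print(render_blocks(edited))
```

The point of the sketch is that an edit targets a named unit rather than a character range inside one long SMILES string, which is the kind of block-level editing the assessment attributes to BFE.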
The BFE approach is algorithmically well-motivated: it achieves O(|C|·n̄³) vocabulary construction versus O((V-n₀)·|C|·n̄⁴) for iterative graph BPE, with formal proofs of correctness and speedup. The fragmentation selection criterion — minimizing frequency standard deviation among blocks after prioritizing fewer, longer blocks — is a principled heuristic that balances informativeness with vocabulary coverage.
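The selection criterion described above can be sketched as a lexicographic ranking: prefer candidate fragmentations with fewer blocks, then break ties by the lowest standard deviation of block frequencies. This is one reading of the criterion, not the paper's implementation. The function name `select_fragmentation`, the toy block labels, and the toy frequency counts are all assumptions made for illustration.

```python
from statistics import pstdev


def select_fragmentation(candidates, freq):
    """Pick one fragmentation from a list of candidates.

    candidates: list of non-empty lists of block identifiers.
    freq: dict mapping block identifier -> corpus frequency (0 if unseen).
    Ranking (assumed): fewer blocks first, then lower frequency std. dev.
    """
    def key(blocks):
        counts = [freq.get(b, 0) for b in blocks]
        return (len(blocks), pstdev(counts))

    return min(candidates, key=key)


# Toy data: three ways to split the same hypothetical molecule.
candidates = [
    ["A", "B", "C"],  # three short blocks
    ["AB", "C"],      # two blocks with similar frequencies
    ["A", "BC"],      # two blocks with very different frequencies
]
freq = {"A": 90, "B": 70, "C": 40, "AB": 50, "BC": 5}

print(select_fragmentation(candidates, freq))
```

With these toy counts, both two-block candidates beat the three-block one, and `["AB", "C"]` wins the tie because its frequencies (50 and 40) are more balanced than 90 and 5.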
2. Methodological Rigor
Strengths in evaluation design: The paper evaluates across four distinct benchmarks covering different pipeline stages (ADMET optimization, hit-to-lead, TOMG-Bench, hit identification), providing breadth. The DILI/hERG benchmarks use clinically withdrawn drugs from Withdrawn 2.0, grounding evaluation in real-world failure modes. The controlled comparison between MolLingo and its base LLM (GPT-5.4) on the same underlying model isolates the representation contribution convincingly — the fourfold docking improvement and sign-flip from negative to positive ADMET improvement are striking.
Concerns about rigor: Several methodological issues warrant scrutiny:
Attention probing analysis (Fig. 5): The demonstration that block-based representation produces localized attention between functional blocks and their associated biological functions is compelling but limited to a single molecule (imatinib) and a single model (Qwen2-7B). A systematic analysis across multiple molecules and models would substantially strengthen this claim.
3. Potential Impact
Near-term practical impact: MolLingo addresses a genuine bottleneck in applying LLMs to chemistry — the representation gap. The finding that the same LLM backbone (GPT-5.4) goes from negative to positive ADMET improvement simply by changing the molecular representation is practically important and immediately actionable for the growing community building LLM-based chemistry tools.
Broader influence: The block-based representation idea could generalize beyond drug design to materials science, catalysis, and polymer design. The insight that LLMs reason better over named chemical fragments than raw SMILES has implications for how the field approaches molecular language modeling more broadly.
Limitations on impact: The system currently operates entirely in silico, with no experimental validation. Drug discovery ultimately requires wet-lab confirmation, and computational improvements in docking scores or predicted ADMET properties may not translate to real-world efficacy. The paper appropriately acknowledges this, but the gap still significantly limits the demonstrated impact.
4. Timeliness & Relevance
The paper is highly timely, sitting at the intersection of two rapidly growing fields: LLM agents and AI-driven drug discovery. The multi-agent architecture with shared memory reflects current best practices in agentic AI, while the molecular representation problem is a recognized bottleneck. The comparison against GPT-5.4, Claude-4.6-Sonnet, and Gemini-3-Pro (apparently very recent models) suggests currency, though the version numbers are unusual and may indicate yet-unreleased models, raising reproducibility questions.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
Missing ablations: The BFE vs. BRICS ablation (Table 4) is valuable, but ablations separating the contribution of common chemical names from block-level SMILES, and the multi-agent coordination from the representation, would clarify the source of improvements.
Summary
MolLingo presents a well-motivated system that makes a convincing case for the importance of molecular representation in LLM-guided chemistry. The BFE representation is the paper's strongest contribution — simple, theoretically grounded, and empirically effective. The multi-agent architecture is competent but less novel. The primary limitation is the absence of experimental validation and the reliance on computational proxies whose correlation with real-world drug properties remains uncertain.
Generated May 28, 2026
Comparison History (17)
Paper 2 has higher likely impact: it identifies a broadly relevant failure mode in cited RAG (citation laundering) and operationalizes it with a clear, reusable benchmark (FORCEBENCH) and metrics that can become standard across evaluation and safety work. The method is comparatively rigorous and generalizable across domains beyond NLP (any system producing claims with evidence). Paper 1 is innovative and application-driven for molecular design, but its impact is narrower to cheminformatics/drug discovery and depends more on tooling/bench choices and domain constraints.
Paper 1 exposes a critical flaw in Chain-of-Thought distillation, revealing that improved accuracy can mask degrading reasoning quality. This challenges fundamental assumptions in LLM evaluation and AI safety, extending its impact far beyond the medical domain. While Paper 2 offers an excellent domain-specific tool for drug discovery, Paper 1 addresses a core, highly timely methodological issue in foundation model training. Its exposure of the accuracy versus reasoning divergence will likely force a paradigm shift in how the broader AI community evaluates and distills reasoning models, giving it higher widespread scientific impact.
Paper 1 presents an empirically validated system with immediate, transformative applications in drug discovery. Its novel molecule-native representation bridges a critical gap in LLM reasoning, demonstrating state-of-the-art results. While Paper 2 offers a valuable theoretical framework for AI safety, Paper 1's concrete methodology, strong empirical performance across multiple benchmarks, and open-source availability position it for immediate, widespread adoption and high citation impact in both artificial intelligence and computational chemistry.
Paper 2 has higher potential impact due to greater novelty (molecule-native block representation + coordinated multi-agent pipeline), clearer and larger real-world applications (therapeutic molecular design), and broader cross-field reach (LLMs, cheminformatics, drug discovery, docking/structural biology). It reports strong benchmark gains and practical grounding via synthesis-aware fragmentation and protein-context docking. Paper 1 is timely and useful for agent evaluation/cost control, but is primarily methodological/diagnostic with narrower application scope and likely smaller downstream societal impact than advances in molecular design automation.
Paper 1 presents a highly innovative, domain-grounded approach to molecular design, a field with immense real-world impact (drug discovery). By introducing chemically meaningful representations and multi-agent coordination grounded in structural biology, it significantly advances AI for Science. While Paper 2 offers a valuable LLM benchmark, Paper 1's tangible applications in therapeutics and its novel integration of domain-specific tools give it a higher potential for transformative scientific impact.
MolLingo presents a significantly more novel and impactful contribution: a multi-agent LLM framework for molecular design with a new molecular representation (BFE), demonstrated across four benchmarks with strong results surpassing frontier LLMs and specialized baselines. It addresses a high-impact problem (drug discovery/molecular design) with broad interdisciplinary relevance spanning AI, chemistry, and biology. Paper 2 applies existing plug-and-play reconstruction methods to dental CBCT denoising — a useful but incremental contribution with narrower scope, limited to synthetic data experiments and qualitative real-image evaluation.
MolLingo demonstrates broader scientific impact through its practical multi-agent framework for molecular design, addressing a high-value real-world application (drug discovery). It introduces novel contributions (BRICS-based Fragment Enumeration, multi-agent coordination with shared memory) and shows strong empirical results across four benchmarks, outperforming frontier LLMs and specialized baselines. Paper 2 presents an interesting theoretical framework for improving recursive model inference, but its scope is narrower (structured reasoning puzzles like Sudoku/Mazes) with less immediate real-world applicability. MolLingo's potential to accelerate therapeutic design gives it substantially greater impact breadth.
MolLingo demonstrates higher potential scientific impact due to its direct real-world applications in drug discovery and molecular design, a field with enormous economic and health implications. It introduces a novel synthesis-aware molecular representation (BFE) that bridges chemistry and LLMs, shows substantial quantitative improvements (4x docking score improvement over GPT-5.4), and achieves state-of-the-art on established benchmarks. The methodological contribution of grounding LLM reasoning in chemically meaningful fragments and protein structural context is broadly applicable. Paper 2, while solid, primarily introduces a benchmark and an agentic framework for audio-visual reasoning with more incremental contributions to multimodal AI.
Paper 2 (MolLingo) has higher impact potential due to stronger domain novelty (molecule-native BRICS-based fragment representation bridging chemistry and LLM semantics), clear real-world applicability to therapeutic molecular design, and tighter grounding via tools (docking, residue context) with multi-agent coordination. The reported benchmark improvements against strong baselines suggest meaningful practical gains. While Paper 1 is broadly relevant to agentic LLM engineering, its contributions are more incremental (skill lifecycle management) and primarily validated on an agent benchmark, making downstream scientific/industrial impact less immediate.
Paper 2 has higher impact potential due to stronger novelty (molecule-native BRICS-based Fragment Enumeration plus multi-agent, tool-using orchestration with shared memory), clear high-value real-world applications (drug discovery/molecular design), and broader cross-field relevance (LLMs, agentic systems, cheminformatics, structural biology). It reports benchmarked performance gains against strong baselines and releases code, supporting methodological rigor and adoption. Paper 1 is timely and valuable for governance in high-stakes AI, but its contributions are more conceptual/framework-oriented with narrower empirical validation, which may limit immediate scientific uptake compared to an open, benchmarked system.
MolLingo demonstrates higher scientific impact potential through its novel multi-agent framework for molecular design with concrete, measurable improvements (4x docking score improvement over GPT-5.4, SOTA on TOMG-Bench). It addresses a high-value application domain (drug discovery/therapeutic design), introduces a novel molecular representation (BFE), and shows broad applicability across multiple LLM backbones. Paper 2 provides valuable insights about explainability-accuracy tradeoffs in MLLMs but is more diagnostic/evaluative in nature with narrower practical implications. MolLingo's combination of methodological innovation, practical utility, and strong empirical results suggests broader and deeper impact.
Paper 1 has higher likely scientific impact: it introduces a technically novel molecule-native representation (BFE) plus a coordinated multi-agent, tool-using architecture grounded in docking/structural biology, and reports strong benchmark gains over both frontier LLM baselines and specialized methods. The work is timely for automated drug discovery and could translate into real-world molecular design workflows, impacting cheminformatics, ML, and computational biology. Paper 2 is conceptually valuable for AI ethics, but its contribution is more incremental (classification/benchmark within predefined theories) with narrower methodological and application leverage than a deployable molecular design agent pipeline.
Paper 2 has higher likely scientific impact due to a concrete, technically novel system (multi-agent molecular design with shared memory plus synthesis-aware BFE representations), clear methodological evaluation across multiple benchmarks, open code, and direct real-world applicability in drug discovery. Its contributions are timely and broadly relevant to ML-for-science, cheminformatics, and agentic LLM tooling, with measurable performance gains over strong baselines. Paper 1 offers an important conceptual-ethical framing, but its impact is more diffuse, harder to operationalize and validate empirically, and less likely to drive near-term cross-disciplinary uptake compared to an evaluated, deployable method.
Paper 2 (MUSE) is likely to have higher scientific impact because it introduces a broadly useful benchmark and evaluation protocol for Text-to-CAD assemblies, targeting manufacturability/functionality/assemblability—key unmet needs for industrial relevance. Benchmarks often become community infrastructure, influencing many subsequent methods across CAD, robotics, manufacturing, and multimodal LLM evaluation. Its methodology (multi-stage checks, rubric-based VLM judge validated by humans, public leaderboard) supports rigor and adoption. Paper 1 is strong and application-relevant, but is more domain-specific (molecular design) and its gains may depend on engineering choices around agents/representations.
While Paper 1 presents a strong, domain-specific application of LLMs to molecular design with significant real-world potential in drug discovery, Paper 2 offers a fundamental advancement in LLM architecture and multi-turn context management. The ZipRL framework's ability to efficiently compress context impacts a much broader range of fields by enhancing general AI agent capabilities, scalability, and token efficiency, leading to a wider overall scientific and technological impact.
Paper 2 likely has higher scientific impact due to broader applicability and timeliness: REFT is a simple, low-overhead modification to RLVR training that can transfer across many reasoning domains and model sizes, potentially affecting a wide swath of LLM post-training practice. It targets a central RLVR bottleneck (rollout diversity) with a clearly testable intervention and shows consistent gains across baselines/models. Paper 1 is strong and application-relevant for drug design, but its impact is more domain-specific and depends on integration complexity, tool availability, and benchmark realism/generalization.
MolLingo addresses the highly impactful field of molecular design and drug discovery using a timely LLM multi-agent framework. Its novel molecule-native representation bridges chemical structures with LLM semantics, offering broad applicability across computational chemistry. While BatteryMFormer provides valuable advancements in battery forecasting, MolLingo's intersection of generative AI, biological context, and chemistry promises a wider transformative impact on automated scientific discovery and the pharmaceutical industry.