Tianyi Ma, Yijun Ma, Zehong Wang, Weixiang Sun, Ziming Li, Connor R. Schmidt, Chuxu Zhang, Matthew J. Webber
Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per candidate pair. Although LLMs have emerged as a fast alternative with strong performance on molecular binding tasks, no existing benchmark systematically evaluates LLMs for host-guest reasoning across fundamental supramolecular chemistry tasks, e.g., binding affinity prediction. To this end, we collaborate with domain experts to release the first Supramolecular Benchmark, called SupraBench, to evaluate LLMs in chemistry reasoning. Specifically, we design four fundamental tasks, i.e., binding affinity prediction, top-binder selection, solvent identification, and host-guest description, plus an auxiliary vision-based task for molecular identification. We also release SupraPMC, a curated 16M-token corpus of supramolecular chemistry articles distilled from Europe PMC, to support adaptation to the supramolecular domain. We benchmark a broad range of open and proprietary LLMs and find that LLMs leave substantial headroom across all tasks. Domain adaptation pretraining over SupraPMC transfers cleanly to in-distribution regression but trades off against strict letter-format output. Moreover, the difficulty profile differs sharply across task families, revealing distinct failure modes that indicate specific gaps in current supramolecular chemistry reasoning. Our source code and benchmark datasets are available at https://github.com/Tianyi-Billy-Ma/SupraBench.
SupraBench introduces the first systematic benchmark for evaluating LLMs on supramolecular host-guest chemistry reasoning. The benchmark comprises four fundamental tasks—binding affinity prediction (regression), top-binder selection (MCQ), solvent identification (classification), and host-guest description (open-ended generation)—plus an auxiliary vision-based molecular identification task. Alongside the benchmark, the authors release SupraPMC, a curated 16M-token corpus of supramolecular chemistry articles from Europe PMC, intended to support domain adaptation.
The paper addresses a genuine gap: while LLM benchmarks exist for small-molecule chemistry (MoleculeNet, ChemLLMBench, ChemBench), none target the multi-molecular, non-covalent interaction reasoning required in supramolecular chemistry. This is a meaningful niche, as host-guest design is industrially relevant (sugammadex being the canonical example) and computationally expensive via traditional methods like DFT/MD.
Data construction is reasonably thorough. The six-step cleaning pipeline (numeric parsing, organic-solvent filtering, default-condition imputation, van't Hoff temperature correction, per-pair averaging, outlier removal) addresses real heterogeneity in experimentally reported binding data. The source data comes from SupraBank, a public repository, which aids reproducibility.
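To make two of the cleaning steps named above concrete, the following is a minimal sketch of a van't Hoff temperature correction followed by per-pair averaging. It is not the authors' implementation: the function names, the 298.15 K reference temperature, the assumption of a temperature-independent binding enthalpy, and the choice to average in log10 space are all illustrative assumptions, and the host and guest labels in the usage lines are placeholders.

```python
import math
from collections import defaultdict
from statistics import mean

R_J = 8.314462618  # molar gas constant in J/(mol*K)


def vant_hoff_log10_ka(log10_ka, temp_k, delta_h_j_per_mol, ref_temp_k=298.15):
    """Shift log10(Ka) measured at temp_k to ref_temp_k.

    Uses the integrated van't Hoff relation
        ln K_ref = ln K - (dH / R) * (1/T_ref - 1/T),
    which assumes the binding enthalpy dH is constant over the interval
    (an assumption of this sketch, not a detail taken from the paper).
    """
    ln_ka = log10_ka * math.log(10)
    ln_ka_ref = ln_ka - (delta_h_j_per_mol / R_J) * (1.0 / ref_temp_k - 1.0 / temp_k)
    return ln_ka_ref / math.log(10)


def average_per_pair(records):
    """Average log10(Ka) over repeated measurements of the same (host, guest) pair.

    `records` is an iterable of (host, guest, log10_ka) tuples; averaging in
    log space is an illustrative choice.
    """
    grouped = defaultdict(list)
    for host, guest, log10_ka in records:
        grouped[(host, guest)].append(log10_ka)
    return {pair: mean(values) for pair, values in grouped.items()}


# Usage with made-up numbers: an exothermic binding event measured at 308.15 K.
corrected = vant_hoff_log10_ka(4.0, temp_k=308.15, delta_h_j_per_mol=-20000.0)
averaged = average_per_pair([("H1", "G1", 4.0), ("H1", "G1", 5.0), ("H2", "G1", 3.0)])
```

For an exothermic event, correcting a measurement taken above the reference temperature raises the estimated affinity, which is the behavior the sketch reproduces.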
However, several methodological concerns arise:
Positive aspects: The benchmark fills a clear gap and could catalyze research at the intersection of LLMs and supramolecular chemistry. The SupraPMC corpus is a tangible community resource. The finding that CoT amplifies errors when domain knowledge is lacking (Section 4.5) is a genuinely useful insight for practitioners.
Limitations on impact: The benchmark's utility depends heavily on whether the community adopts it. The tasks, while well-motivated, are relatively straightforward reformulations of standard ML task types (regression, MCQ, classification, generation) applied to a new domain. The paper does not benchmark any chemistry-specific models (e.g., molecular property prediction GNNs, physics-based methods) for comparison, which would have been more informative about whether LLMs offer genuine advantages over existing approaches.
The practical impact for supramolecular chemists is unclear: the best MAE of 1.25 log units corresponds to a typical error factor of about 18 in Ka (10^1.25, assuming base-10 logarithms), somewhat more than one order of magnitude, which may be too imprecise for practical screening. The paper does not discuss whether this level of accuracy is useful relative to existing computational methods.
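The arithmetic behind this concern can be checked with a short sketch. It assumes the reported MAE is in log10(Ka) units and uses 298.15 K as the temperature; both are assumptions of this example rather than details confirmed by the review, and the function names are illustrative.

```python
import math

R_KJ = 8.314462618e-3  # molar gas constant in kJ/(mol*K)


def fold_error(mae_log10):
    """Multiplicative error in Ka implied by an absolute error in log10(Ka)."""
    return 10.0 ** mae_log10


def free_energy_error_kj(mae_log10, temp_k=298.15):
    """Equivalent error in binding free energy, from dG = -R * T * ln(Ka)."""
    return R_KJ * temp_k * math.log(10) * mae_log10


fold = fold_error(1.25)                # about 17.8-fold in Ka
dg_err = free_energy_error_kj(1.25)    # about 7.1 kJ/mol at 298.15 K
```

Under these assumptions, an MAE of 1.25 log units is an error of roughly 18-fold in Ka, or about 7 kJ/mol in binding free energy.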
The timing is appropriate given the rapid expansion of LLM applications in chemistry and the growing interest in AI-driven molecular design. Supramolecular chemistry is indeed underserved by existing benchmarks. However, the paper arrives in a crowded benchmark landscape, and the relatively narrow domain focus may limit broad adoption.
SupraBench makes a reasonable contribution by establishing the first LLM-focused benchmark for supramolecular chemistry. The task design is well-motivated and the data processing pipeline is sound. However, the paper would benefit significantly from non-LLM baselines, contamination analysis, larger datasets (especially for HGD), and deeper analysis of what performance levels are practically useful. The insights, while valid, are largely confirmatory of known LLM limitations rather than revealing fundamentally new phenomena.
Generated Jun 12, 2026
Paper 1 offers a fundamental methodological advancement for diffusion models, a highly influential and widely used class of generative AI. By introducing a principled framework (Jeffrey guidance) that improves sample quality and enables fairness interventions, its algorithmic contributions are highly likely to see broad adoption across diverse domains including computer vision, audio, and even scientific generation, yielding a wider and more immediate scientific impact than the domain-specific benchmark presented in Paper 2.
SupraBench introduces the first systematic benchmark for evaluating LLMs in supramolecular chemistry, filling a clear gap at the intersection of AI and chemistry. It provides reusable resources (benchmark + 16M-token corpus) that can catalyze future research across both ML and chemistry communities. Its breadth of impact spans multiple fields (NLP, chemistry, materials science), and benchmarks are historically high-impact contributions. Harpoon, while technically solid, addresses a narrower problem (conditional tabular diffusion) with incremental theoretical extensions to manifold guidance, limiting its broader impact.
Paper 1 is more novel and broadly impactful: it offers mechanistic, causal analysis of objective interference in RLHF reward models (neuron identification/ablation), directly addressing a central, timely bottleneck in AI alignment. The methodological contribution and interpretability insights can influence reward modeling, multi-objective optimization, safety, and model editing across many LLM systems. Paper 2 is valuable infrastructure (benchmark+corpus) for a narrower subfield; its impact depends on community adoption and is primarily evaluative rather than providing new mechanistic understanding.
Paper 2 presents a fundamental algorithmic advance in generative AI with strong theoretical guarantees for any-length discrete diffusion. Its unified framework for reward-guided fine-tuning has broad applicability across multiple domains requiring sequence generation, such as NLP and computational biology. While Paper 1 provides a valuable dataset and benchmark for a specific subfield of chemistry, Paper 2's methodological innovation offers a wider breadth of impact and a foundational contribution to machine learning.
Paper 1 likely has higher impact: it introduces the first systematic benchmark and large curated corpus for supramolecular host–guest reasoning with LLMs, addressing a clear bottleneck in a high-value scientific domain (molecular design). Benchmarks and datasets tend to become community standards, enabling broad, sustained downstream research across AI-for-chemistry, scientific NLP, and materials discovery. Paper 2 is methodologically solid and practically useful for ensemble compression/calibration, but it is more incremental within a mature area and likely to have narrower cross-field adoption than a new domain benchmark resource.
Paper 2 introduces a foundational mathematical framework that solves a fundamental topological mismatch in VAEs. Its theoretical rigor and broad applicability across any discipline dealing with non-Euclidean data give it immense potential impact. While Paper 1 provides a valuable benchmark for a specific chemistry subfield, Paper 2's methodological innovation in representation learning transcends specific applications and advances the core capabilities of generative models.
Paper 2 provides a novel theoretical framework connecting geometric concepts (projection caustics) to phase transitions in diffusion/flow-matching models, which are among the most actively researched generative AI methods. It offers both theoretical insight and practical tools (CBD), with broad applicability across generative modeling. Paper 1, while useful as a benchmark for supramolecular chemistry with LLMs, addresses a narrower domain and primarily evaluates existing models rather than introducing fundamentally new concepts. Paper 2's geometric perspective has potential to influence how researchers understand and control diffusion models across many applications.
Paper 2 introduces a novel world-model-inspired approach to tensor program optimization that achieves significant practical speedups (up to 4.61× over PyTorch) with dramatically fewer measurements. This has immediate, broad impact across all ML systems requiring efficient compilation. Paper 1, while valuable as a benchmark for supramolecular chemistry LLM evaluation, serves a narrower community and primarily documents that LLMs underperform on these tasks rather than proposing a transformative solution. Paper 2's methodological innovation—modeling schedule evaluation as latent dynamics—is more broadly applicable and offers concrete performance gains.
SupraBench introduces the first systematic benchmark for evaluating LLMs in supramolecular chemistry, bridging AI and a critical chemistry subdomain with broad applications in drug delivery, materials science, and catalysis. It provides a curated corpus (SupraPMC), multiple task types, and reveals specific LLM failure modes, establishing a foundation for future research. Paper 2, while valuable for power systems forecasting, addresses a more niche application. SupraBench's cross-disciplinary novelty (AI + chemistry), the growing interest in LLMs for scientific reasoning, and its potential to accelerate supramolecular design give it broader and more timely impact.
Paper 1 offers a fundamental methodological advancement in data-driven scientific discovery, enabling the extraction of governing equations from noisy, high-dimensional data. With theoretical guarantees and empirical validation, this approach has broad, cross-disciplinary applicability across physics, neuroscience, and engineering. Paper 2 provides a valuable but more niche domain-specific benchmark for evaluating LLMs in supramolecular chemistry. The foundational nature and broader applicability of Paper 1 give it a significantly higher potential for transformative scientific impact.