SupraBench: A Benchmark for Supramolecular Chemistry

Tianyi Ma, Yijun Ma, Zehong Wang, Weixiang Sun, Ziming Li, Connor R. Schmidt, Chuxu Zhang, Matthew J. Webber

Jun 11, 2026 · arXiv:2606.13477v1

cs.LG · cs.AI · cs.CL

#3195 of 5669 · cs.LG

Tournament Score: 1386 ± 47 (axis range 1050 to 1750)

Win Rate: 47%

Rating: 5/10

Significance 5.5 · Rigor 4.5 · Novelty 5 · Clarity 6.5

Abstract

Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per candidate pair. Although LLMs have emerged as a fast alternative with strong performance on molecular binding tasks, no benchmark currently systematically evaluates LLMs for host-guest reasoning across fundamental supramolecular chemistry tasks, e.g., binding affinity prediction. To this end, we collaborate with domain experts to release the first Supramolecular Benchmark, called SupraBench, to evaluate LLMs in chemistry reasoning. Specifically, we design four fundamental tasks, i.e., binding affinity prediction, top-binder selection, solvent identification, and host-guest description, plus an auxiliary vision-based task for molecular identification. We also release SupraPMC, a curated 16M-token corpus of Supramolecular chemistry articles distilled from Europe PMC, to support the adaptation to the supramolecular domain. We benchmark a broad range of open and proprietary LLMs and find that LLMs leave substantial headroom across all tasks. Domain adaptation pretraining over SupraPMC transfers cleanly to in-distribution regression but trades off against strict letter-format output. Moreover, the difficulty profile differs sharply across task families, revealing distinct failure modes that indicate specific gaps in current supramolecular chemistry reasoning. Our source codes and benchmark datasets are available at https://github.com/Tianyi-Billy-Ma/SupraBench.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SupraBench: A Benchmark for Supramolecular Chemistry

1. Core Contribution

SupraBench introduces the first systematic benchmark for evaluating LLMs on supramolecular host-guest chemistry reasoning. The benchmark comprises four fundamental tasks—binding affinity prediction (regression), top-binder selection (MCQ), solvent identification (classification), and host-guest description (open-ended generation)—plus an auxiliary vision-based molecular identification task. Alongside the benchmark, the authors release SupraPMC, a curated 16M-token corpus of supramolecular chemistry articles from Europe PMC, intended to support domain adaptation.

The paper addresses a genuine gap: while LLM benchmarks exist for small-molecule chemistry (MoleculeNet, ChemLLMBench, ChemBench), none target the multi-molecular, non-covalent interaction reasoning required in supramolecular chemistry. This is a meaningful niche, as host-guest design is industrially relevant (sugammadex being the canonical example) and computationally expensive via traditional methods like DFT/MD.

2. Methodological Rigor

Data construction is reasonably thorough. The six-step cleaning pipeline (numeric parsing, organic-solvent filtering, default-condition imputation, van't Hoff temperature correction, per-pair averaging, outlier removal) addresses real heterogeneity in experimentally reported binding data. The source data comes from SupraBank, a public repository, which aids reproducibility.

However, several methodological concerns arise:

Dataset size is modest: 2,609 samples for BAP, 2,264 for TBS, 2,172 for SID, and only 135 for HGD. The HGD set is particularly small, making conclusions about that task somewhat tenuous.

Data leakage risk is acknowledged but unaddressed: The authors note that frontier LLMs may have seen SupraBank data during pretraining but perform no temporal split or novel-compound analysis to quantify contamination. This is a significant weakness for a benchmark paper.

Evaluation protocol limitations: Using OpenRouter without pinned model versions introduces reproducibility concerns. The authors acknowledge this but offer no mitigation beyond recording request dates.

DAPT analysis is shallow: Only two small models (8-9B) are adapted, using a single recipe. The finding that DAPT hurts MCQ format compliance is interesting but could be an artifact of the specific LoRA configuration rather than a fundamental insight.

Van't Hoff correction with assumed ΔH° values: Using literature-averaged enthalpy values for temperature correction introduces systematic bias, particularly for atypical host-guest pairs.
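To make the van't Hoff concern concrete, here is a minimal sketch of how an assumed ΔH° propagates into a temperature-corrected binding constant. It uses the integrated van't Hoff relation with a temperature-independent ΔH°. The reference temperature of 298.15 K, the function name, and the example enthalpies are illustrative assumptions, not values or code taken from the paper.

```python
import math

R = 8.314462618  # gas constant, J/(mol*K)


def correct_log_ka(log_ka_meas, t_meas_k, delta_h_j_mol, t_ref_k=298.15):
    """Shift log10(Ka) from t_meas_k to t_ref_k (hypothetical helper).

    Integrated van't Hoff equation, assuming delta_h_j_mol (standard
    binding enthalpy, J/mol) is constant over the temperature interval:
        ln K_ref - ln K_meas = -(dH/R) * (1/T_ref - 1/T_meas)
    """
    shift = -(delta_h_j_mol / (R * math.log(10))) * (1.0 / t_ref_k - 1.0 / t_meas_k)
    return log_ka_meas + shift


if __name__ == "__main__":
    # Hypothetical measurement: log10(Ka) = 4.0 at 308.15 K,
    # corrected with two different assumed enthalpies.
    for dh in (-20_000.0, -40_000.0):
        print(f"assumed dH = {dh / 1000:.0f} kJ/mol -> "
              f"log Ka(298.15 K) = {correct_log_ka(4.0, 308.15, dh):.3f}")
```

Under these assumptions, a 10 K correction with an enthalpy that is off by 20 kJ/mol shifts the corrected value by about 0.11 log units. That is small next to the reported model errors, but it is a systematic shift rather than random noise, which is the bias the review points to.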

3. Potential Impact

Positive aspects: The benchmark fills a clear gap and could catalyze research at the intersection of LLMs and supramolecular chemistry. The SupraPMC corpus is a tangible community resource. The finding that CoT amplifies errors when domain knowledge is lacking (Section 4.5) is a genuinely useful insight for practitioners.

Limitations on impact: The benchmark's utility depends heavily on whether the community adopts it. The tasks, while well-motivated, are relatively straightforward reformulations of standard ML task types (regression, MCQ, classification, generation) applied to a new domain. The paper does not benchmark any chemistry-specific models (e.g., molecular property prediction GNNs, physics-based methods) for comparison, which would have been more informative about whether LLMs offer genuine advantages over existing approaches.

The practical impact for supramolecular chemists is unclear: the best MAE of 1.25 log units corresponds to a typical error factor of about 18 in Ka (10^1.25), more than an order of magnitude, which may be too imprecise for practical screening. The paper does not discuss whether this level of accuracy is useful relative to existing computational methods.
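The conversion behind this point can be checked with a few lines of arithmetic. The sketch below assumes the MAE is expressed in log10(Ka) units and uses 298.15 K for the free-energy conversion. Both are assumptions for illustration, and the function names are hypothetical.

```python
import math

R = 8.314462618  # gas constant, J/(mol*K)


def fold_error(log10_error):
    """Multiplicative error in Ka implied by an error in log10(Ka)."""
    return 10.0 ** log10_error


def free_energy_error_kj_mol(log10_error, temperature_k=298.15):
    """Equivalent error in binding free energy.

    Since dG = -RT ln Ka, an error in log10(Ka) scales by RT ln(10).
    """
    return R * temperature_k * math.log(10) * log10_error / 1000.0


if __name__ == "__main__":
    mae = 1.25  # best MAE quoted in the review, in log units
    print(f"fold error in Ka: {fold_error(mae):.1f}")
    print(f"free-energy error: {free_energy_error_kj_mol(mae):.2f} kJ/mol")
```

Read this way, an MAE of 1.25 corresponds to a typical factor of roughly 18 in Ka, or about 7 kJ/mol in binding free energy at room temperature. This is the scale a reader would need in order to compare the result with physics-based methods.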

4. Timeliness & Relevance

The timing is appropriate given the rapid expansion of LLM applications in chemistry and the growing interest in AI-driven molecular design. Supramolecular chemistry is indeed underserved by existing benchmarks. However, the paper arrives in a crowded benchmark landscape, and the relatively narrow domain focus may limit broad adoption.

5. Strengths & Limitations

Key Strengths:

First-of-its-kind benchmark for supramolecular chemistry LLM evaluation

Expert-validated task design with clear practical motivation

Comprehensive evaluation across 8 models, 3 prompting strategies, and 5 tasks

The CoT failure analysis (Section 4.5) provides genuinely actionable insight

Release of SupraPMC corpus as a community resource

Clean data processing pipeline with documented steps

Notable Weaknesses:

No comparison with non-LLM baselines (GNNs, classical ML, physics-based methods), making it hard to contextualize LLM performance

Small dataset sizes, especially for HGD (135 samples)

Data contamination risk is acknowledged but not quantified

The DAPT analysis uses only two models with a single recipe, limiting generalizability of conclusions

No train/test split strategy to ensure novel compound generalization

Missing analysis of chemical diversity in the benchmark (how representative is the host/guest coverage?)

The paper does not establish whether the benchmark difficulty is calibrated appropriately—is 51.3% top-binder accuracy meaningful, or is it close to random (25%)?

ROUGE-1 F1 below 0.6 for HGD is reported as "substantial headroom" but may partly reflect evaluation metric inadequacy for free-text chemistry answers
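The metric concern in the last item of the list above can be illustrated with a simplified unigram-overlap score. The sketch below is a minimal ROUGE-1 F1 computed from clipped unigram counts, with no stemming or stopword handling, so it is not the exact scorer used in the paper. The tokenizer and the two example sentences are hypothetical and not drawn from the benchmark.

```python
import re
from collections import Counter


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: clipped unigram overlap, lowercase alphanumeric tokens."""
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    overlap = sum((ref & cand).values())  # per-token minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical reference and a paraphrase with a similar chemical meaning.
    reference = "cucurbit[7]uril encapsulates the adamantyl group in its hydrophobic cavity"
    paraphrase = "the adamantane moiety sits inside the CB7 host pocket"
    print(f"ROUGE-1 F1 for the paraphrase: {rouge1_f1(reference, paraphrase):.2f}")
```

In this example the paraphrase shares only the word "the" with the reference, so it scores very low even though it describes a similar binding picture. Some of the gap below 0.6 could therefore come from wording rather than chemistry.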

Additional Observations:

The paper's framing occasionally overstates the contribution ("first Supramolecular Benchmark" when SAMPL challenges have existed for years, albeit targeting different methods)

The insight that "no single prompting strategy is universally helpful" is not novel—this has been documented across many domains

Model versions cited (GPT-5.4, Qwen3.5, Gemini-3) suggest this paper may reference future/hypothetical models, raising questions about verifiability

Overall Assessment

SupraBench makes a reasonable contribution by establishing the first LLM-focused benchmark for supramolecular chemistry. The task design is well-motivated and the data processing pipeline is sound. However, the paper would benefit significantly from non-LLM baselines, contamination analysis, larger datasets (especially for HGD), and deeper analysis of what performance levels are practically useful. The insights, while valid, are largely confirmatory of known LLM limitations rather than revealing fundamentally new phenomena.

Rating: 5/10

Significance 5.5 · Rigor 4.5 · Novelty 5 · Clarity 6.5

Generated Jun 12, 2026

Comparison History (17)

Lost vs. Towards More General Control of Diffusion Models Using Jeffrey Guidance

Paper 1 offers a fundamental methodological advancement for diffusion models, a highly influential and widely used class of generative AI. By introducing a principled framework (Jeffrey guidance) that improves sample quality and enables fairness interventions, its algorithmic contributions are highly likely to see broad adoption across diverse domains including computer vision, audio, and even scientific generation, yielding a wider and more immediate scientific impact than the domain-specific benchmark presented in Paper 2.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. Harpoon: Generalised Manifold Guidance for Conditional Tabular Diffusion

SupraBench introduces the first systematic benchmark for evaluating LLMs in supramolecular chemistry, filling a clear gap at the intersection of AI and chemistry. It provides reusable resources (benchmark + 16M-token corpus) that can catalyze future research across both ML and chemistry communities. Its breadth of impact spans multiple fields (NLP, chemistry, materials science), and benchmarks are historically high-impact contributions. Harpoon, while technically solid, addresses a narrower problem (conditional tabular diffusion) with incremental theoretical extensions to manifold guidance, limiting its broader impact.

claude-opus-4-6 · Jun 12, 2026

Lost vs. Understanding helpfulness and harmless tension in reward models

Paper 1 is more novel and broadly impactful: it offers mechanistic, causal analysis of objective interference in RLHF reward models (neuron identification/ablation), directly addressing a central, timely bottleneck in AI alignment. The methodological contribution and interpretability insights can influence reward modeling, multi-objective optimization, safety, and model editing across many LLM systems. Paper 2 is valuable infrastructure (benchmark+corpus) for a narrower subfield; its impact depends on community adoption and is primarily evaluative rather than providing new mechanistic understanding.

gpt-5.2 · Jun 12, 2026

Lost vs. A2D2: Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding

Paper 2 presents a fundamental algorithmic advance in generative AI with strong theoretical guarantees for any-length discrete diffusion. Its unified framework for reward-guided fine-tuning has broad applicability across multiple domains requiring sequence generation, such as NLP and computational biology. While Paper 1 provides a valuable dataset and benchmark for a specific subfield of chemistry, Paper 2's methodological innovation offers a wider breadth of impact and a foundational contribution to machine learning.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

Paper 1 likely has higher impact: it introduces the first systematic benchmark and large curated corpus for supramolecular host–guest reasoning with LLMs, addressing a clear bottleneck in a high-value scientific domain (molecular design). Benchmarks and datasets tend to become community standards, enabling broad, sustained downstream research across AI-for-chemistry, scientific NLP, and materials discovery. Paper 2 is methodologically solid and practically useful for ensemble compression/calibration, but it is more incremental within a mature area and likely to have narrower cross-field adoption than a new domain benchmark resource.

gpt-5.2 · Jun 12, 2026

Lost vs. Constructing VAE Latent Spaces with Prescribed Topology

Paper 2 introduces a foundational mathematical framework that solves a fundamental topological mismatch in VAEs. Its theoretical rigor and broad applicability across any discipline dealing with non-Euclidean data give it immense potential impact. While Paper 1 provides a valuable benchmark for a specific chemistry subfield, Paper 2's methodological innovation in representation learning transcends specific applications and advances the core capabilities of generative models.

gemini-3.1-pro-preview · Jun 12, 2026

Lost vs. The Geometry of Phase Transitions in Generative Dynamics via Projection Caustics

Paper 2 provides a novel theoretical framework connecting geometric concepts (projection caustics) to phase transitions in diffusion/flow-matching models, which are among the most actively researched generative AI methods. It offers both theoretical insight and practical tools (CBD), with broad applicability across generative modeling. Paper 1, while useful as a benchmark for supramolecular chemistry with LLMs, addresses a narrower domain and primarily evaluates existing models rather than introducing fundamentally new concepts. Paper 2's geometric perspective has potential to influence how researchers understand and control diffusion models across many applications.

claude-opus-4-6 · Jun 12, 2026

Lost vs. Toward Compiler World Models: Learning Latent Dynamics for Efficient Tensor Program Search

Paper 2 introduces a novel world-model-inspired approach to tensor program optimization that achieves significant practical speedups (up to 4.61× over PyTorch) with dramatically fewer measurements. This has immediate, broad impact across all ML systems requiring efficient compilation. Paper 1, while valuable as a benchmark for supramolecular chemistry LLM evaluation, serves a narrower community and primarily documents that LLMs underperform on these tasks rather than proposing a transformative solution. Paper 2's methodological innovation—modeling schedule evaluation as latent dynamics—is more broadly applicable and offers concrete performance gains.

claude-opus-4-6 · Jun 12, 2026

Won vs. Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios

SupraBench introduces the first systematic benchmark for evaluating LLMs in supramolecular chemistry, bridging AI and a critical chemistry subdomain with broad applications in drug delivery, materials science, and catalysis. It provides a curated corpus (SupraPMC), multiple task types, and reveals specific LLM failure modes, establishing a foundation for future research. Paper 2, while valuable for power systems forecasting, addresses a more niche application. SupraBench's cross-disciplinary novelty (AI + chemistry), the growing interest in LLMs for scientific reasoning, and its potential to accelerate supramolecular design give it broader and more timely impact.

claude-opus-4-6 · Jun 12, 2026

Lost vs. Extracting Governing Equations from Latent Dynamics via Multi-View Contrastive Learning

Paper 1 offers a fundamental methodological advancement in data-driven scientific discovery, enabling the extraction of governing equations from noisy, high-dimensional data. With theoretical guarantees and empirical validation, this approach has broad, cross-disciplinary applicability across physics, neuroscience, and engineering. Paper 2 provides a valuable but more niche domain-specific benchmark for evaluating LLMs in supramolecular chemistry. The foundational nature and broader applicability of Paper 1 give it a significantly higher potential for transformative scientific impact.

gemini-3.1-pro-preview · Jun 12, 2026
