Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

Xinzhe Yuan, Zhuo Chen, Jianshu Zhang, Huan Xiong, Nanyang Ye, Yuqiang Li, Qinying Gu

May 18, 2026

arXiv:2605.17976v1

cs.AI (primary), math.OC

#73 of 2292 · Artificial Intelligence

Tournament Score: 1553 ± 45 (scale 1050 to 1800)

Win Rate: 85%

Rating: 7.2 / 10

Significance 7.5 · Rigor 6.8 · Novelty 7 · Clarity 7.8

Abstract

Scientific discovery is increasingly constrained by costly experiments and limited resources, underscoring the need for efficient optimization in AI for science. Bayesian Optimization (BO), though widely adopted for balancing exploration and exploitation, often exhibits slow cold-start performance and poor scalability in high-dimensional settings, limiting its applicability in real-world scientific problems. To overcome these challenges, we propose LLM-Guided Bayesian Optimization (LGBO), the first LLM preference-guided BO framework that continuously integrates the semantic reasoning of large language models (LLMs) into the optimization loop. Unlike prior works that use LLMs only for warm-start initialization or candidate generation, LGBO introduces a region-lifted preference mechanism that embeds LLM-driven preferences into every iteration, shifting the surrogate mean in a stable and controllable way. Theoretically, we prove that LGBO does not perform significantly worse than standard BO in the worst case, while achieving significantly faster convergence when preferences align with the objective. Empirically, LGBO consistently outperforms existing methods across diverse dry benchmarks in physics, chemistry, biology, and materials science. Most notably, in a new wet-lab optimization of Fe-Cr battery electrolytes, LGBO attains 90% of the best observed value within 6 iterations, whereas standard BO and existing LLM-augmented baselines require more than 10. Together, these results suggest that LGBO offers a promising direction for integrating LLMs into scientific optimization workflows.

AI Impact Assessments (1 model)

Scientific Impact Assessment: LGBO – Unleashing LLMs in Bayesian Optimization

1. Core Contribution

LGBO introduces a region-lifted preference mechanism that continuously integrates LLM-derived semantic guidance into the Bayesian optimization (BO) loop by shifting the GP surrogate's mean function at every iteration, while leaving the covariance structure intact. This is a meaningful departure from prior LLM-BO integration strategies (LLAMBO, LLINBO, ADO-LLM) that either use LLMs only for warm-start initialization or for candidate proposal that is subsequently filtered by the acquisition function. The key insight is elegant: translating coarse LLM suggestions (regions or points with confidence scores) into an exponential tilt on the GP prior, which—via the Cameron-Martin theorem—reduces to a simple, closed-form mean shift (Proposition 1). This makes the integration mathematically clean and computationally cheap.
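The mean-shift claim can be checked on a finite grid. The sketch below is an illustration, not the paper's code: it assumes a zero-mean GP prior with an RBF kernel, a hypothetical preferred region (grid points 14 to 16), uniform weights over that region, and λ = 1. For a Gaussian N(m, K), tilting the density by exp(λ·aᵀf_G) gives N(m + λ·K[:, G]·a, K), so only the mean moves.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=0.2):
    """Squared-exponential kernel between two sets of points."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def tilted_mean(mean, K, region_idx, weights, lam):
    """Mean of N(mean, K) after tilting by exp(lam * weights @ f[region_idx]).

    The covariance K is unchanged by the tilt; only the mean is shifted.
    """
    return mean + lam * K[:, region_idx] @ weights

# Illustrative setup (all values are assumptions, not taken from the paper).
X = np.linspace(0.0, 1.0, 21)[:, None]          # 1-D grid on [0, 1]
K = rbf_kernel(X, X) + 1e-6 * np.eye(len(X))    # prior covariance with jitter
prior_mean = np.zeros(len(X))
region_idx = np.array([14, 15, 16])             # hypothetical preferred region, x in [0.70, 0.80]
weights = np.full(3, 1.0 / 3.0)                 # uniform weights over the region
lam = 1.0                                       # lift strength

shifted = tilted_mean(prior_mean, K, region_idx, weights, lam)
print("peak of shifted mean at x =", float(X[np.argmax(shifted), 0]))
```

The shifted mean peaks inside the preferred region while K is never modified, which matches the description of a lift that leaves the covariance structure intact.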

The framework addresses two genuine pain points in BO for science: cold-start inefficiency (few initial observations) and poor scalability in moderate-to-high dimensions. By injecting domain knowledge through LLM preferences at every iteration rather than just initialization, LGBO maintains a persistent informational advantage.

2. Methodological Rigor

Theoretical analysis. The paper provides regret bounds (Theorem 1/3) showing that, under a fixed-lift abstraction, LGBO's worst-case regret degrades only by a constant additive factor (B₀ + λ‖g‖) relative to standard GP-UCB, while the aligned case yields a strictly tighter bound (B₀√(1−c²)). This is a clean and reassuring result: the framework is "provably safe." However, the analysis covers only a frozen lift (a single preference direction held fixed throughout), which is a significant simplification of the actual algorithm, where the LLM updates its preference at every iteration. The authors acknowledge this gap and offer a reasonable justification (LLM suggestions tend to be structurally coherent), but the disconnect between theory and practice is notable. An analysis of the adaptive case, even under restrictive assumptions, would have strengthened the contribution.
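For a sense of scale, the two factors quoted above can be evaluated for illustrative numbers. This is plain arithmetic on the expressions as they appear in this assessment, not a reproduction of the paper's theorem. Reading B₀ as a norm bound on the objective, c as an alignment level in [0, 1], and λ‖g‖ as the size of the lift is an assumption, and the input values are made up.

```python
import math

def worst_case_factor(b0, lam, g_norm):
    """Factor B0 + lam * ||g|| quoted for the worst (misaligned) case."""
    return b0 + lam * g_norm

def aligned_factor(b0, c):
    """Factor B0 * sqrt(1 - c^2) quoted for the aligned case, with c in [0, 1]."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("alignment c must lie in [0, 1]")
    return b0 * math.sqrt(1.0 - c * c)

# Hypothetical values chosen only to illustrate the two regimes.
b0, lam, g_norm, c = 2.0, 0.5, 1.0, 0.6
print("worst case:", worst_case_factor(b0, lam, g_norm))
print("aligned   :", aligned_factor(b0, c))
```

With these inputs the aligned factor falls below B₀ and the worst-case factor exceeds it only by the additive term λ‖g‖, which is the asymmetry the review calls "provably safe."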

Experimental design. The evaluation spans four dry benchmarks (LNP3, Cross-barrel, Concrete, HPLC) across physics, chemistry, biology, and materials science, plus a genuinely novel wet-lab experiment on Fe-Cr redox flow battery electrolytes. The wet-lab component is particularly valuable—it demonstrates real-world applicability where the optimum is unknown and measurement noise is substantial. The headline result (90% of best observed value in 6 iterations vs. >10 for baselines) is compelling for practitioners.
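The headline metric (iterations needed to reach 90% of the best observed value) can be computed from an optimization trace as sketched below. This is one plausible reading of the metric, not the authors' evaluation script: it assumes a maximization problem with a positive best value, and the trace is invented for illustration.

```python
def iterations_to_fraction(values, fraction=0.9):
    """Return the 1-based iteration at which the running best first reaches
    `fraction` of the best value observed over the whole run.

    Assumes maximization and a positive best value; returns None if the
    target is never reached (possible when the best value is not positive).
    """
    target = fraction * max(values)
    running_best = float("-inf")
    for iteration, value in enumerate(values, start=1):
        running_best = max(running_best, value)
        if running_best >= target:
            return iteration
    return None

# Made-up trace of observed objective values, one per iteration.
trace = [0.20, 0.50, 0.40, 0.70, 0.80, 0.93, 0.90, 1.00]
print(iterations_to_fraction(trace))
```

Because the target is defined relative to the best value seen in the same run, the metric is only comparable across methods when they share a common reference value, for example the best value observed across all methods.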

Weaknesses in experimental setup:

Only two baselines (GPBO and LLAMBO) in the main experiments, though the appendix adds ColaLLM, BOPRO, and CAKE comparisons.

Five random seeds is relatively few for statistical confidence, especially given the variance in some tasks.

The LLM backbone (Intern-S1-241B) is a scientific-domain-pretrained model, which may inflate the apparent benefit of LLM guidance. The ablation with other backbones partially addresses this but is limited to HPLC only.

The "dry" benchmarks use interpolated oracles from finite datasets, which may not fully capture the complexity of real black-box functions.
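To illustrate the concern about five seeds, the sketch below computes a 95% confidence interval for a mean from five runs. The scores are hypothetical, and the critical value 2.776 is the two-sided 95% Student-t quantile for 4 degrees of freedom (versus about 1.96 for large samples).

```python
import math
import statistics

def mean_with_ci(samples, t_crit):
    """Return (mean, half_width) of a t-based confidence interval."""
    mean = statistics.mean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, t_crit * std_err

# Hypothetical final scores from five random seeds.
scores = [0.80, 0.85, 0.78, 0.90, 0.82]
T_CRIT_DF4 = 2.776  # two-sided 95% quantile, 4 degrees of freedom

mean, half_width = mean_with_ci(scores, T_CRIT_DF4)
print(f"mean = {mean:.3f} +/- {half_width:.3f}")
```

With only five runs the interval is roughly 40% wider than a normal approximation would suggest, so modest gaps between methods can fall inside the noise.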

3. Potential Impact

The practical impact potential is high for experimental science workflows. The framework is modular—it works with any GP-based BO pipeline and any LLM backbone—making adoption relatively straightforward. The wet-lab validation on battery electrolytes provides a credible proof-of-concept for self-driving laboratory applications.

Broader implications:

The region-lifted preference mechanism could be adapted for other sources of expert knowledge beyond LLMs (e.g., physics-informed priors, simulation-derived hints).

The structured prompt protocol (point/region + confidence) provides a reusable template for LLM-BO integration.

The calibration scheme (λ = c/√(a⊤Σ_GG a)) is principled and avoids hyperparameter tuning, which is important for practical deployment.
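The calibration rule quoted above can be sketched as follows. The interpretation is an assumption, since the symbols are not defined in this assessment: c is taken to be the LLM's stated confidence, a a weight vector over the points of the preferred region G, and Σ_GG the GP covariance restricted to those points. The numbers are illustrative.

```python
import numpy as np

def calibrate_lambda(confidence, weights, cov_region):
    """Lift strength lam = c / sqrt(a^T Sigma_GG a).

    confidence : scalar c (assumed to be the LLM's self-reported confidence)
    weights    : vector a over the points in the preferred region
    cov_region : covariance Sigma_GG of the GP on those points
    """
    variance = float(weights @ cov_region @ weights)
    if variance <= 0.0:
        raise ValueError("a^T Sigma_GG a must be positive")
    return confidence / np.sqrt(variance)

# Hypothetical two-point region with correlated GP values.
cov_region = np.array([[1.0, 0.5],
                       [0.5, 1.0]])
weights = np.array([0.5, 0.5])
confidence = 0.8

lam = calibrate_lambda(confidence, weights, cov_region)
print("lambda =", round(float(lam), 4))
```

Under the mean-shift reading, the weighted region value aᵀf_G moves by λ·aᵀΣ_GG·a = c·√(aᵀΣ_GG·a), that is, by c prior standard deviations. This would explain why the review says no separate hyperparameter needs tuning.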

Limitations on impact:

The approach inherits LLM limitations: hallucination, inconsistency across prompts, and sensitivity to prompt engineering. The paper's elaborate prompt design (Appendix B) suggests non-trivial engineering effort for new domains.

Scalability to truly high-dimensional problems (d >> 14) remains untested. The COF benchmark (d=14) is the highest tested.

Cost of LLM inference per iteration is not discussed—for a 241B parameter model, this could be non-negligible.

4. Timeliness & Relevance

This paper sits at an active intersection of LLMs and scientific optimization. The timing is excellent: the community is actively exploring how to leverage foundation models for experimental design, and this work provides one of the more principled approaches. The ICLR 2026 acceptance validates its relevance. The wet-lab component aligns with growing interest in autonomous/self-driving laboratories.

5. Strengths & Limitations

Key Strengths:

Clean mathematical formulation: exponential lift → mean shift is elegant and practically implementable

Theoretical safety guarantee (worst-case bounded degradation)

Wet-lab validation—rare in this literature

Modular design: backbone-agnostic, acquisition-function-agnostic

Thoughtful prompt engineering with anti-collapse mechanisms

Notable Weaknesses:

Theory-practice gap: frozen-lift analysis doesn't cover the adaptive setting

Limited baseline comparisons in main text (expanded in appendix)

Single LLM backbone for main experiments; generalization across LLMs only tested on one task

No discussion of computational overhead or failure modes

The confidence calibration relies on the LLM self-reporting reliability, which is known to be poorly calibrated in LLMs

Relatively small-scale benchmarks; unclear how this scales to large combinatorial or very high-dimensional spaces

Overall Assessment

LGBO makes a solid contribution by providing a principled, theoretically grounded mechanism for continuously integrating LLM preferences into BO. The region-lifted preference is the paper's strongest intellectual contribution—simple, elegant, and practically useful. The wet-lab validation elevates this beyond a purely algorithmic contribution. However, the gap between theoretical analysis (frozen lift) and actual algorithm (adaptive lift), limited baseline diversity in main experiments, and unclear scalability to higher dimensions temper the impact somewhat. This is a well-executed paper that advances the LLM-for-science agenda meaningfully, though incremental in the broader BO literature.

Rating: 7.2 / 10

Significance 7.5 · Rigor 6.8 · Novelty 7 · Clarity 7.8

Generated May 19, 2026

Comparison History (20)

vs. Echo: Learning from Experience Data via User-Driven Refinement

claude-opus-4.6 · 5/22/2026

Paper 1 (LGBO) demonstrates higher scientific impact potential due to: (1) Novel theoretical framework integrating LLMs into Bayesian Optimization with formal convergence guarantees, (2) Broad applicability across multiple scientific domains (physics, chemistry, biology, materials science), (3) Validated in both dry benchmarks and wet-lab experiments showing significant efficiency gains, (4) Addresses fundamental challenges in scientific discovery (costly experiments, cold-start, high dimensionality). Paper 2 (Echo) addresses an important but narrower problem in code completion agent training. While practically valuable with strong production results, its impact is more domain-specific compared to LGBO's cross-disciplinary scientific optimization framework.

vs. Echo: Learning from Experience Data via User-Driven Refinement

claude-opus-4.6 · 5/22/2026

Paper 2 introduces a novel framework (LGBO) that integrates LLM reasoning into Bayesian Optimization with theoretical guarantees and demonstrates impact across multiple scientific domains including wet-lab validation. Its breadth of impact spans physics, chemistry, biology, and materials science, addressing a fundamental challenge in scientific discovery. Paper 1, while practically valuable with strong production results in code completion, is more narrowly focused on a single application domain. Paper 2's theoretical contributions, cross-disciplinary applicability, and real experimental validation suggest broader and deeper scientific impact.

vs. Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

claude-opus-4.6 · 5/20/2026

Paper 2 (LGBO) has broader scientific impact potential due to its cross-disciplinary applicability (physics, chemistry, biology, materials science), validated wet-lab experiments demonstrating real-world utility, and novel integration of LLM reasoning into Bayesian optimization with theoretical guarantees. It addresses a fundamental bottleneck in scientific discovery—costly experiments—with a generalizable framework. Paper 1, while technically solid in improving rubric-based RLVR training, addresses a more niche problem within the LLM post-training pipeline with narrower applicability beyond AI alignment/training.

vs. Large Vision-Language Models Get Lost in Attention

gemini-3.1 · 5/19/2026

Paper 1 offers a highly practical and scalable framework (LGBO) that directly accelerates experimental workflows across multiple scientific disciplines (chemistry, biology, materials science). Its successful real-world validation in a wet-lab setting demonstrates immediate and broad applicability for accelerating scientific discovery, giving it a higher multidisciplinary scientific impact compared to the theoretical and domain-specific architectural insights of Paper 2.

vs. Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework

gemini-3.1 · 5/19/2026

Paper 2 demonstrates significant breadth of impact and potential for real-world applications by accelerating scientific discovery across multiple disciplines (physics, chemistry, biology, materials science). Its inclusion of theoretical guarantees alongside a real-world wet-lab experiment (battery electrolytes) showcases strong methodological rigor. While Paper 1 addresses an important AI safety vulnerability, Paper 2's potential to broadly enhance empirical research optimization gives it a higher overall scientific impact.

vs. PRISMat: Policy-Driven, Permutation-Invariant Autoregressive Material Generation

gpt-5.2 · 5/19/2026

Paper 1 likely has higher impact due to broader cross-domain applicability: an LLM-preference-guided Bayesian Optimization framework can generalize to many expensive experimental/simulation settings beyond materials (chemistry, biology, physics, engineering). It offers a novel integration of LLM “semantic” preferences at every BO iteration with theoretical guarantees and a compelling wet-lab validation showing large iteration-efficiency gains—highly relevant and timely for AI-for-science automation. Paper 2 is strong and practical for materials generation, but its scope is more domain-specific and lacks comparable theoretical breadth and real-world experimental demonstration.

vs. State Contamination in Memory-Augmented LLM Agents

gpt-5.2 · 5/19/2026

Paper 2 has higher estimated impact due to strong timeliness and broad relevance: memory-augmented LLM agents are rapidly deploying, and “state contamination/memory laundering” targets a concrete, under-studied safety failure mode with implications across agent design, alignment, and security. It contributes a clear measurement (SPG), a rigorous counterfactual rollout methodology, and actionable mitigation insights about intervention placement. Paper 1 is innovative and valuable for AI-for-science optimization, but its impact is more domain-specific and depends on adoption in experimental pipelines; Paper 2’s findings generalize across many LLM-agent applications.

vs. EGI: A Multimodal Emotional AI Framework for Enhancing Scrum Master Real-time Self-Awareness

gemini-3.1 · 5/19/2026

Paper 2 presents a foundational methodological advancement by integrating LLMs into Bayesian Optimization, demonstrating rigorous theoretical guarantees and broad applicability across multiple scientific domains (physics, chemistry, biology). Its validation in a real wet-lab setting highlights immense potential to accelerate scientific discovery. In contrast, Paper 1 offers a highly niche application of existing AI tools tailored specifically to agile software management, limiting its overall scientific and cross-disciplinary impact.

vs. Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment

gemini-3.1 · 5/19/2026

Paper 2 presents a concrete, methodologically rigorous framework with both theoretical guarantees and strong empirical validation across multiple scientific domains, including real-world wet-lab experiments. While Paper 1 addresses the critical issue of LLM safety, it is a position paper sketching an architecture. Paper 2's broad applicability to accelerate scientific discovery across physics, chemistry, biology, and materials science gives it a significantly wider and more immediate scientific impact.

vs. Towards Human-Level Book-Writing Capability

gemini-3.1 · 5/19/2026

Paper 1 presents a framework that directly accelerates scientific discovery across multiple domains (physics, chemistry, biology) by integrating LLMs into Bayesian Optimization. Its theoretical guarantees and successful wet-lab application demonstrate significant methodological rigor and immediate, broad real-world utility. In contrast, while Paper 2 advances generative AI for creative writing, its impact is largely confined to NLP and entertainment, making Paper 1's contribution to broader scientific advancement much more profound.

vs. The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

gemini-3.1 · 5/19/2026

Paper 2 presents a broader potential impact across multiple scientific disciplines (physics, chemistry, biology) by accelerating experimental optimization. Its combination of theoretical guarantees and real-world wet-lab validation (battery electrolytes) demonstrates immediate and significant practical utility. While Paper 1 offers valuable insights into LLM training methodologies, Paper 2's direct contribution to accelerating scientific discovery gives it a wider and more profound scientific impact.

vs. CrystalReasoner: Reasoning and RL for Property-Conditioned Crystal Structure Generation

claude-opus-4.6 · 5/19/2026

LGBO offers broader scientific impact by addressing a fundamental challenge (efficient Bayesian optimization) across multiple scientific domains with both theoretical guarantees and wet-lab validation. Its framework is domain-agnostic, applicable to physics, chemistry, biology, and materials science, and demonstrates practical real-world utility in battery electrolyte optimization. While CrystalReasoner is innovative for crystal structure generation, it targets a narrower domain. LGBO's combination of theoretical rigor, broad applicability, and experimental validation in real lab settings gives it higher potential impact.

vs. Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

gpt-5.2 · 5/19/2026

Paper 1 likely has higher impact due to stronger cross-domain applicability and real-world validation. LGBO introduces a novel, iterative preference-integration mechanism (region-lifted preferences) that addresses key BO pain points (cold start, high-dimensional scalability), provides theoretical guarantees, and demonstrates broad benchmarks plus a compelling wet-lab electrolyte optimization result. This combination of methodological contribution, theory, and demonstrated scientific utility can influence AI-for-science workflows across chemistry, materials, biology, and physics. Paper 2 is timely and useful for LLM post-training, but its impact is more concentrated within RLVR/reasoning benchmarks.

vs. RAG-based EEG-to-Text Translation Using Deep Learning and LLMs

gpt-5.2 · 5/19/2026

Paper 2 has higher potential impact: it introduces a broadly applicable LLM-preference integration into Bayesian Optimization with theoretical guarantees and strong empirical results, including a real wet-lab electrolyte optimization demonstrating tangible resource savings. Its applications span many AI-for-science domains (chemistry, materials, biology, physics) and address timely limitations of BO (cold start, high dimensionality). Paper 1 is novel within EEG-to-text BCIs but shows modest gains on a single dataset/setting and is likely narrower in real-world readiness and cross-field breadth.

vs. Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design

gemini-3.1 · 5/19/2026

While Paper 1 offers a strong, theoretically grounded approach for physical scientific discovery, Paper 2 demonstrates a highly impactful step toward recursive self-improvement in AI. By enabling autonomous agents to design foundation models that outperform state-of-the-art baselines like Llama 3.2, Paper 2 accelerates the fundamental engine of AI research. This could yield compounding advancements across all fields reliant on machine learning, granting it a broader and more transformative long-term scientific impact.

vs. From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction

claude-opus-4.6 · 5/19/2026

Paper 2 presents a novel, concrete method (LGBO) with theoretical guarantees and empirical validation including wet-lab experiments, addressing a practical bottleneck in scientific discovery. It introduces a new mechanism (region-lifted preferences) integrating LLMs into Bayesian optimization with broad applicability across physics, chemistry, biology, and materials science. While Paper 1 provides a valuable unified review framework for clinical trajectory modeling, it is primarily a synthesis/review rather than introducing a new method. Paper 2's combination of theoretical novelty, cross-domain empirical results, and timeliness (LLM integration) gives it higher near-term impact potential.

vs. Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact: it introduces a timely, broadly applicable evaluation paradigm for longitudinal safety in memory-equipped LLM agents (a rapidly emerging deployment setting). The trigger-probe protocol, NullMemory counterfactual, cross-architecture experiments, and diagnostic monitoring suggest strong methodological rigor and immediate real-world relevance for safety governance across many domains. Paper 1 is innovative and includes wet-lab results, but its impact is narrower (optimization workflows in AI-for-science) and may depend on how reliably LLM “preferences” generalize and remain controllable across tasks and objectives.

vs. When Context Hurts: The Crossover Effect of Knowledge Transfer on Multi-Agent Design Exploration

claude-opus-4.6 · 5/19/2026

Paper 1 (LGBO) demonstrates higher scientific impact potential due to: (1) it addresses a fundamental bottleneck in scientific discovery—sample-efficient optimization—with broad applicability across physics, chemistry, biology, and materials science; (2) it provides both theoretical guarantees and empirical validation including wet-lab experiments, a rare and compelling combination; (3) it pioneers a novel framework for continuously integrating LLM reasoning into Bayesian optimization beyond warm-starting; (4) the practical implications for reducing experimental costs are immediately actionable. Paper 2 offers valuable insights on context injection in multi-agent systems but has narrower scope limited to software design exploration.

vs. Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities

claude-opus-4.6 · 5/19/2026

Paper 2 introduces a novel framework (LGBO) that integrates LLMs into Bayesian optimization for scientific discovery, with broad applicability across physics, chemistry, biology, and materials science. It provides both theoretical guarantees and wet-lab validation, demonstrating real-world impact. Its cross-disciplinary scope, timeliness (leveraging the LLM revolution), and practical significance (reducing experimental iterations) give it higher potential impact than Paper 1, which addresses a more specialized problem (brain tumor segmentation with missing modalities) within a single domain using incremental methodological improvements.

vs. π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

claude-opus-4.6 · 5/19/2026

Paper 2 presents a novel framework (LGBO) that integrates LLM reasoning into Bayesian Optimization with theoretical guarantees and demonstrates real wet-lab validation in battery electrolyte optimization. It addresses a fundamental challenge in scientific discovery—costly experiments with limited budgets—with broad applicability across physics, chemistry, biology, and materials science. The combination of theoretical analysis, diverse benchmarks, and wet-lab validation provides strong methodological rigor. Paper 1, while addressing an important evaluation gap for proactive AI assistants, is primarily a benchmark contribution with narrower scope and no novel methodology beyond the evaluation framework itself.