Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery
Xinzhe Yuan, Zhuo Chen, Jianshu Zhang, Huan Xiong, Nanyang Ye, Yuqiang Li, Qinying Gu
Abstract
Scientific discovery is increasingly constrained by costly experiments and limited resources, underscoring the need for efficient optimization in AI for science. Bayesian Optimization (BO), though widely adopted for balancing exploration and exploitation, often exhibits slow cold-start performance and poor scalability in high-dimensional settings, limiting its applicability in real-world scientific problems. To overcome these challenges, we propose LLM-Guided Bayesian Optimization (LGBO), the first LLM preference-guided BO framework that continuously integrates the semantic reasoning of large language models (LLMs) into the optimization loop. Unlike prior works that use LLMs only for warm-start initialization or candidate generation, LGBO introduces a region-lifted preference mechanism that embeds LLM-driven preferences into every iteration, shifting the surrogate mean in a stable and controllable way. Theoretically, we prove that LGBO does not perform significantly worse than standard BO in the worst case, while achieving significantly faster convergence when preferences align with the objective. Empirically, LGBO consistently outperforms existing methods across diverse dry benchmarks in physics, chemistry, biology, and materials science. Most notably, in a new wet-lab optimization of Fe-Cr battery electrolytes, LGBO attains 90% of the best observed value within 6 iterations, whereas standard BO and existing LLM-augmented baselines require more than 10. Together, these results suggest that LGBO offers a promising direction for integrating LLMs into scientific optimization workflows.
AI Impact Assessments
Scientific Impact Assessment: LGBO – Unleashing LLMs in Bayesian Optimization
1. Core Contribution
LGBO introduces a region-lifted preference mechanism that continuously integrates LLM-derived semantic guidance into the Bayesian optimization (BO) loop by shifting the Gaussian process (GP) surrogate's mean function at every iteration, while leaving the covariance structure intact. This is a meaningful departure from prior LLM-BO integration strategies (LLAMBO, LLINBO, ADO-LLM), which use LLMs either only for warm-start initialization or for candidate proposals that are subsequently filtered by the acquisition function. The key insight is elegant: coarse LLM suggestions (regions or points with confidence scores) are translated into an exponential tilt on the GP prior, which, via the Cameron-Martin theorem, reduces to a simple, closed-form mean shift (Proposition 1). This makes the integration mathematically clean and computationally cheap.
The framework addresses two genuine pain points in BO for science: cold-start inefficiency (few initial observations) and poor scalability in moderate-to-high dimensions. By injecting domain knowledge through LLM preferences at every iteration rather than just initialization, LGBO maintains a persistent informational advantage.
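The mean-shift idea can be illustrated in finite dimensions. The sketch below is not the paper's implementation; it only checks the standard Gaussian identity that an exponential tilt exp(λ gᵀf) of N(m, K) is again Gaussian with mean m + λKg and the same covariance K. The grid `x`, the RBF kernel, the region weights `g`, and the strength `lam` are all hypothetical choices made for illustration, and how LGBO actually constructs its preference direction is not specified here.

```python
import numpy as np

def rbf_kernel(x, y, lengthscale=0.2):
    """Squared-exponential kernel on 1-D inputs (an assumed kernel choice)."""
    d = x[:, None] - y[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def tilted_mean(mean, cov, g, lam):
    """Mean of N(mean, cov) after an exponential tilt exp(lam * g @ f)."""
    return mean + lam * cov @ g

def log_gauss_unnorm(f, mean, cov_inv):
    """Unnormalized Gaussian log-density."""
    r = f - mean
    return -0.5 * r @ cov_inv @ r

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)                   # hypothetical design grid
K = rbf_kernel(x, x) + 1e-4 * np.eye(len(x))   # jitter for numerical stability
K_inv = np.linalg.inv(K)
m = np.zeros(len(x))                           # prior mean

# Hypothetical preference: the region [0.6, 0.9] with confidence 0.8.
g = ((x >= 0.6) & (x <= 0.9)).astype(float) * 0.8
lam = 0.5                                      # hypothetical tilt strength

m_shift = tilted_mean(m, K, g, lam)

# The tilted log-density and the shifted-mean log-density should differ
# only by a constant that does not depend on f.
gaps = []
for _ in range(5):
    f = rng.normal(size=len(x))
    tilted = log_gauss_unnorm(f, m, K_inv) + lam * g @ f
    shifted = log_gauss_unnorm(f, m_shift, K_inv)
    gaps.append(tilted - shifted)

print(np.ptp(gaps))  # spread of the gaps; close to zero up to round-off
```

Because the covariance `K` never changes, a tilt of this kind moves where the surrogate expects high values without altering its uncertainty estimates, which matches the review's description of leaving the covariance structure intact.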
2. Methodological Rigor
Theoretical analysis. The paper provides regret bounds (Theorem 1/3) showing that under a fixed-lift abstraction, LGBO's worst-case regret degrades only by a constant additive factor (B₀ + λ‖g‖) relative to standard GP-UCB, while the aligned case yields a strictly tighter bound (B₀√(1−c²)). This is a clean and reassuring result: the framework is "provably safe." However, the analysis covers only a frozen lift (a single fixed preference direction throughout), a significant simplification of the actual algorithm, in which the LLM updates its preference at every iteration. The authors acknowledge this gap and offer a reasonable justification (LLM suggestions tend to be structurally coherent), but the disconnect between theory and practice is notable. An analysis of the adaptive case, even under restrictive assumptions, would have strengthened the contribution.
Experimental design. The evaluation spans four dry benchmarks (LNP3, Cross-barrel, Concrete, HPLC) across physics, chemistry, biology, and materials science, plus a genuinely novel wet-lab experiment on Fe-Cr redox flow battery electrolytes. The wet-lab component is particularly valuable—it demonstrates real-world applicability where the optimum is unknown and measurement noise is substantial. The headline result (90% of best observed value in 6 iterations vs. >10 for baselines) is compelling for practitioners.
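A metric of the form "iterations needed to reach 90% of the best observed value" can be computed from any optimization trace. The sketch below is one plausible reading of that metric, not the paper's evaluation code: it assumes a maximization problem with a positive best value, and it takes the target as a fraction of the best value seen over the whole run. The function name `iterations_to_fraction` and the trace values are invented for illustration and are not results from the paper.

```python
from itertools import accumulate

def iterations_to_fraction(values, fraction=0.9):
    """Return the 1-based iteration at which the running best first reaches
    fraction * (best value observed over the whole run).

    Assumes maximization and a positive best value."""
    if not values:
        raise ValueError("values must be non-empty")
    target = fraction * max(values)
    # accumulate(values, max) yields the best value seen so far at each step.
    for i, best in enumerate(accumulate(values, max), start=1):
        if best >= target:
            return i

# Synthetic trace for illustration only (not data from the paper).
trace = [0.30, 0.55, 0.50, 0.92, 0.80, 1.00]
print(iterations_to_fraction(trace))
```

Note that when the true optimum is unknown, as in the wet-lab setting, the reference value is only the best observation so far, so the metric measures relative speed between methods rather than distance to the global optimum.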
Weaknesses in experimental setup:
3. Potential Impact
The practical impact potential is high for experimental science workflows. The framework is modular—it works with any GP-based BO pipeline and any LLM backbone—making adoption relatively straightforward. The wet-lab validation on battery electrolytes provides a credible proof-of-concept for self-driving laboratory applications.
Broader implications:
Limitations on impact:
4. Timeliness & Relevance
This paper sits at an active intersection of LLMs and scientific optimization. The timing is excellent: the community is actively exploring how to leverage foundation models for experimental design, and this work provides one of the more principled approaches. The ICLR 2026 acceptance validates its relevance. The wet-lab component aligns with growing interest in autonomous/self-driving laboratories.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Overall Assessment
LGBO makes a solid contribution by providing a principled, theoretically grounded mechanism for continuously integrating LLM preferences into BO. The region-lifted preference is the paper's strongest intellectual contribution: simple, elegant, and practically useful. The wet-lab validation elevates this beyond a purely algorithmic contribution. However, the gap between the theoretical analysis (frozen lift) and the actual algorithm (adaptive lift), the limited baseline diversity in the main experiments, and the unclear scalability to higher dimensions temper the impact somewhat. This is a well-executed paper that meaningfully advances the LLM-for-science agenda, though its contribution is incremental within the broader BO literature.
Generated May 19, 2026
Comparison History (20)
Paper 1 (LGBO) demonstrates higher scientific impact potential due to: (1) Novel theoretical framework integrating LLMs into Bayesian Optimization with formal convergence guarantees, (2) Broad applicability across multiple scientific domains (physics, chemistry, biology, materials science), (3) Validated in both dry benchmarks and wet-lab experiments showing significant efficiency gains, (4) Addresses fundamental challenges in scientific discovery (costly experiments, cold-start, high dimensionality). Paper 2 (Echo) addresses an important but narrower problem in code completion agent training. While practically valuable with strong production results, its impact is more domain-specific compared to LGBO's cross-disciplinary scientific optimization framework.
Paper 2 introduces a novel framework (LGBO) that integrates LLM reasoning into Bayesian Optimization with theoretical guarantees and demonstrates impact across multiple scientific domains including wet-lab validation. Its breadth of impact spans physics, chemistry, biology, and materials science, addressing a fundamental challenge in scientific discovery. Paper 1, while practically valuable with strong production results in code completion, is more narrowly focused on a single application domain. Paper 2's theoretical contributions, cross-disciplinary applicability, and real experimental validation suggest broader and deeper scientific impact.
Paper 2 (LGBO) has broader scientific impact potential due to its cross-disciplinary applicability (physics, chemistry, biology, materials science), validated wet-lab experiments demonstrating real-world utility, and novel integration of LLM reasoning into Bayesian optimization with theoretical guarantees. It addresses a fundamental bottleneck in scientific discovery—costly experiments—with a generalizable framework. Paper 1, while technically solid in improving rubric-based RLVR training, addresses a more niche problem within the LLM post-training pipeline with narrower applicability beyond AI alignment/training.
Paper 1 offers a highly practical and scalable framework (LGBO) that directly accelerates experimental workflows across multiple scientific disciplines (chemistry, biology, materials science). Its successful real-world validation in a wet-lab setting demonstrates immediate and broad applicability for accelerating scientific discovery, giving it a higher multidisciplinary scientific impact compared to the theoretical and domain-specific architectural insights of Paper 2.
Paper 2 demonstrates significant breadth of impact and potential for real-world applications by accelerating scientific discovery across multiple disciplines (physics, chemistry, biology, materials science). Its inclusion of theoretical guarantees alongside a real-world wet-lab experiment (battery electrolytes) showcases strong methodological rigor. While Paper 1 addresses an important AI safety vulnerability, Paper 2's potential to broadly enhance empirical research optimization gives it a higher overall scientific impact.
Paper 1 likely has higher impact due to broader cross-domain applicability: an LLM-preference-guided Bayesian Optimization framework can generalize to many expensive experimental/simulation settings beyond materials (chemistry, biology, physics, engineering). It offers a novel integration of LLM “semantic” preferences at every BO iteration with theoretical guarantees and a compelling wet-lab validation showing large iteration-efficiency gains—highly relevant and timely for AI-for-science automation. Paper 2 is strong and practical for materials generation, but its scope is more domain-specific and lacks comparable theoretical breadth and real-world experimental demonstration.
Paper 2 has higher estimated impact due to strong timeliness and broad relevance: memory-augmented LLM agents are rapidly deploying, and “state contamination/memory laundering” targets a concrete, under-studied safety failure mode with implications across agent design, alignment, and security. It contributes a clear measurement (SPG), a rigorous counterfactual rollout methodology, and actionable mitigation insights about intervention placement. Paper 1 is innovative and valuable for AI-for-science optimization, but its impact is more domain-specific and depends on adoption in experimental pipelines; Paper 2’s findings generalize across many LLM-agent applications.
Paper 2 presents a foundational methodological advancement by integrating LLMs into Bayesian Optimization, demonstrating rigorous theoretical guarantees and broad applicability across multiple scientific domains (physics, chemistry, biology). Its validation in a real wet-lab setting highlights immense potential to accelerate scientific discovery. In contrast, Paper 1 offers a highly niche application of existing AI tools tailored specifically to agile software management, limiting its overall scientific and cross-disciplinary impact.
Paper 2 presents a concrete, methodologically rigorous framework with both theoretical guarantees and strong empirical validation across multiple scientific domains, including real-world wet-lab experiments. While Paper 1 addresses the critical issue of LLM safety, it is a position paper sketching an architecture. Paper 2's broad applicability to accelerate scientific discovery across physics, chemistry, biology, and materials science gives it a significantly wider and more immediate scientific impact.
Paper 1 presents a framework that directly accelerates scientific discovery across multiple domains (physics, chemistry, biology) by integrating LLMs into Bayesian Optimization. Its theoretical guarantees and successful wet-lab application demonstrate significant methodological rigor and immediate, broad real-world utility. In contrast, while Paper 2 advances generative AI for creative writing, its impact is largely confined to NLP and entertainment, making Paper 1's contribution to broader scientific advancement much more profound.
Paper 2 presents a broader potential impact across multiple scientific disciplines (physics, chemistry, biology) by accelerating experimental optimization. Its combination of theoretical guarantees and real-world wet-lab validation (battery electrolytes) demonstrates immediate and significant practical utility. While Paper 1 offers valuable insights into LLM training methodologies, Paper 2's direct contribution to accelerating scientific discovery gives it a wider and more profound scientific impact.
LGBO offers broader scientific impact by addressing a fundamental challenge (efficient Bayesian optimization) across multiple scientific domains with both theoretical guarantees and wet-lab validation. Its framework is domain-agnostic, applicable to physics, chemistry, biology, and materials science, and demonstrates practical real-world utility in battery electrolyte optimization. While CrystalReasoner is innovative for crystal structure generation, it targets a narrower domain. LGBO's combination of theoretical rigor, broad applicability, and experimental validation in real lab settings gives it higher potential impact.
Paper 1 likely has higher impact due to stronger cross-domain applicability and real-world validation. LGBO introduces a novel, iterative preference-integration mechanism (region-lifted preferences) that addresses key BO pain points (cold start, high-dimensional scalability), provides theoretical guarantees, and demonstrates broad benchmarks plus a compelling wet-lab electrolyte optimization result. This combination of methodological contribution, theory, and demonstrated scientific utility can influence AI-for-science workflows across chemistry, materials, biology, and physics. Paper 2 is timely and useful for LLM post-training, but its impact is more concentrated within RLVR/reasoning benchmarks.
Paper 2 has higher potential impact: it introduces a broadly applicable LLM-preference integration into Bayesian Optimization with theoretical guarantees and strong empirical results, including a real wet-lab electrolyte optimization demonstrating tangible resource savings. Its applications span many AI-for-science domains (chemistry, materials, biology, physics) and address timely limitations of BO (cold start, high dimensionality). Paper 1 is novel within EEG-to-text BCIs but shows modest gains on a single dataset/setting and is likely narrower in real-world readiness and cross-field breadth.
While Paper 1 offers a strong, theoretically grounded approach for physical scientific discovery, Paper 2 demonstrates a highly impactful step toward recursive self-improvement in AI. By enabling autonomous agents to design foundation models that outperform state-of-the-art baselines like Llama 3.2, Paper 2 accelerates the fundamental engine of AI research. This could yield compounding advancements across all fields reliant on machine learning, granting it a broader and more transformative long-term scientific impact.
Paper 2 presents a novel, concrete method (LGBO) with theoretical guarantees and empirical validation including wet-lab experiments, addressing a practical bottleneck in scientific discovery. It introduces a new mechanism (region-lifted preferences) integrating LLMs into Bayesian optimization with broad applicability across physics, chemistry, biology, and materials science. While Paper 1 provides a valuable unified review framework for clinical trajectory modeling, it is primarily a synthesis/review rather than introducing a new method. Paper 2's combination of theoretical novelty, cross-domain empirical results, and timeliness (LLM integration) gives it higher near-term impact potential.
Paper 2 likely has higher impact: it introduces a timely, broadly applicable evaluation paradigm for longitudinal safety in memory-equipped LLM agents (a rapidly emerging deployment setting). The trigger-probe protocol, NullMemory counterfactual, cross-architecture experiments, and diagnostic monitoring suggest strong methodological rigor and immediate real-world relevance for safety governance across many domains. Paper 1 is innovative and includes wet-lab results, but its impact is narrower (optimization workflows in AI-for-science) and may depend on how reliably LLM “preferences” generalize and remain controllable across tasks and objectives.
Paper 1 (LGBO) demonstrates higher scientific impact potential due to: (1) it addresses a fundamental bottleneck in scientific discovery—sample-efficient optimization—with broad applicability across physics, chemistry, biology, and materials science; (2) it provides both theoretical guarantees and empirical validation including wet-lab experiments, a rare and compelling combination; (3) it pioneers a novel framework for continuously integrating LLM reasoning into Bayesian optimization beyond warm-starting; (4) the practical implications for reducing experimental costs are immediately actionable. Paper 2 offers valuable insights on context injection in multi-agent systems but has narrower scope limited to software design exploration.
Paper 2 introduces a novel framework (LGBO) that integrates LLMs into Bayesian optimization for scientific discovery, with broad applicability across physics, chemistry, biology, and materials science. It provides both theoretical guarantees and wet-lab validation, demonstrating real-world impact. Its cross-disciplinary scope, timeliness (leveraging the LLM revolution), and practical significance (reducing experimental iterations) give it higher potential impact than Paper 1, which addresses a more specialized problem (brain tumor segmentation with missing modalities) within a single domain using incremental methodological improvements.
Paper 2 presents a novel framework (LGBO) that integrates LLM reasoning into Bayesian Optimization with theoretical guarantees and demonstrates real wet-lab validation in battery electrolyte optimization. It addresses a fundamental challenge in scientific discovery—costly experiments with limited budgets—with broad applicability across physics, chemistry, biology, and materials science. The combination of theoretical analysis, diverse benchmarks, and wet-lab validation provides strong methodological rigor. Paper 1, while addressing an important evaluation gap for proactive AI assistants, is primarily a benchmark contribution with narrower scope and no novel methodology beyond the evaluation framework itself.