In-Context Learning for Latent Space Bayesian Optimization

Tuan A. Vu, Harri Lähdesmäki, Julien Martinelli

Jun 8, 2026 · arXiv:2606.09664v1

cs.LG · stat.ML

#3512 of 5669 · cs.LG

Tournament Score: 1373 ± 44

Win Rate: 43%

Rating: 4.5 / 10

Significance 4.5 · Rigor 4 · Novelty 5 · Clarity 7

Abstract

Bayesian optimization (BO) is a central tool for sample-efficient design, and latent-space Bayesian optimization (LSBO) extends it to structured objects such as molecules and proteins. In parallel, tabular foundation models such as TabPFN and TabICL now achieve state-of-the-art regression performance and are increasingly used as BO surrogates. Because their Bayesian behavior is induced by large synthetic pretraining collections, the composition of this pretraining distribution is crucial. LSBO creates a distinctive mismatch: the induced map from latent code to objective value differs markedly from the regression tasks used to train current in-context models. We address this mismatch by complementing the pretraining stage of tabular foundation model surrogates with synthetic optimization tasks defined on the latent space of a molecular VAE. The continued-pretraining objective features a regularizer that anchors the model to the original checkpoint, preserving its broad regression prior while avoiding overspecialization to the adaptation tasks. On held-out molecular optimization benchmarks, the resulting model achieves strong performance, supporting the relevance of LSBO-specific adaptation for in-context surrogates.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "In-Context Learning for Latent Space Bayesian Optimization"

1. Core Contribution

This paper identifies a specific distribution mismatch problem: tabular foundation models (TFMs) like TabPFN-3 are pretrained on generic synthetic regression tasks, but when deployed as surrogates in latent-space Bayesian optimization (LSBO), they encounter a fundamentally different data distribution — one shaped by VAE latent codes, molecular property landscapes, and value-biased BO histories. The proposed solution, LILBO, performs continued pretraining of TabPFN-3 using synthetically generated molecular LSBO episodes. These episodes are constructed by combining molecular base tasks (logP, QED, similarity scores) through random linear, MLP, or formula-tree combiners, with Boltzmann-sampled contexts that mimic the value-biased nature of BO queries. An L2-SP regularizer anchors the adapted model to the original checkpoint to prevent catastrophic forgetting.
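The episode construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names `linear_combiner` and `boltzmann_context`, the temperature value, and the random stand-in scores are all assumptions, and only the linear combiner family is shown.

```python
import math
import random

def linear_combiner(base_values, weights):
    """Combine per-candidate base-task scores into one synthetic objective value each."""
    return [sum(w * v for w, v in zip(weights, row)) for row in base_values]

def boltzmann_context(objective, k, temperature, rng):
    """Draw k distinct indices, each draw weighted by exp(y / temperature)."""
    y_max = max(objective)  # subtract the max so the exponentials cannot overflow
    weights = [math.exp((y - y_max) / temperature) for y in objective]
    pool = list(range(len(objective)))
    chosen = []
    for _ in range(k):
        # Weighted draw over the remaining candidates, then remove the winner.
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(idx))
        weights.pop(idx)
    return chosen

rng = random.Random(0)
# Hypothetical base-task scores for 200 candidates (stand-ins for logP, QED, similarity).
base_values = [[rng.gauss(0, 1), rng.random(), rng.random()] for _ in range(200)]
weights = [rng.gauss(0, 1) for _ in range(3)]  # one random linear combiner
objective = linear_combiner(base_values, weights)
context = boltzmann_context(objective, k=20, temperature=0.5, rng=rng)
```

Lower temperatures concentrate the context on high-value candidates, which is how such a sampler would imitate the value-biased history of a BO run; a very high temperature approaches uniform sampling.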

The contribution is conceptually clean: bridge the gap between generic tabular pretraining and domain-specific LSBO deployment through lightweight continued pretraining. This is a reasonable and well-motivated idea.

2. Methodological Rigor

Strengths in methodology:

The synthetic task generation is thoughtfully designed, combining molecular descriptors through multiple combiner families (linear, MLP, formula trees), creating diverse objectives grounded in molecular properties.

The Boltzmann sampling mechanism for context generation is well-motivated for BO settings where contexts are naturally value-biased.

The L2-SP regularization is a principled choice for continued pretraining, and the paper correctly identifies the tension between specialization and preserving broad regression capabilities.

The pretraining diagnostics (latent-space coverage and objective-space coverage via UMAP) are a useful addition for understanding the synthetic distribution.
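For reference, the L2-SP penalty mentioned above is commonly written as a squared distance to the starting checkpoint. The form below is the standard one; the symbols $\theta_0$ (original checkpoint), $\lambda$ (regularization strength), and $\mathcal{L}_{\text{task}}$ (loss on the synthetic LSBO episodes) are notation assumed here, and the paper's exact weighting is not stated in this review.

```latex
\mathcal{L}(\theta) = \mathcal{L}_{\text{task}}(\theta) + \frac{\lambda}{2}\,\lVert \theta - \theta_0 \rVert_2^2,
\qquad
\nabla_\theta \mathcal{L}(\theta) = \nabla_\theta \mathcal{L}_{\text{task}}(\theta) + \lambda\,(\theta - \theta_0)
```

The gradient term $\lambda(\theta - \theta_0)$ pulls the weights back toward the checkpoint rather than toward zero, which is what distinguishes L2-SP from ordinary weight decay.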

Weaknesses in methodology:

The empirical improvement over vanilla TabPFN-3 is modest (average rank 2.64 vs. 2.94), and the standard deviations on ranks overlap considerably (±1.56 vs. ±1.64). This makes it difficult to conclude that the continued pretraining is reliably beneficial.

The paper lacks statistical significance tests. Because the rank standard deviations overlap so heavily, it is unclear whether the improvement is meaningful or within noise.

No ablation study is provided, so the individual contributions of (a) value-biased sampling, (b) the specific combiner families, (c) the regularization strength, and (d) the number of continued pretraining steps are unknown. This makes it hard to identify what actually matters.

The authors acknowledge that VAE retraining during optimization creates latent drift, which fundamentally undermines the premise that pretraining on a fixed latent space transfers to deployment. This is not just a limitation — it's a core tension in the approach.

Regression performance metrics (calibration, NLL, RMSE) on held-out molecular tasks are not reported, which would help disentangle surrogate quality from downstream optimization performance.
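One way to address the missing significance test is a paired comparison of per-task ranks. The sketch below uses an exact sign-flip permutation test; the choice of test, the function name `paired_sign_flip_pvalue`, and the rank values are assumptions made for illustration, and the numbers are not taken from the paper.

```python
import itertools

def paired_sign_flip_pvalue(a, b):
    """Two-sided exact sign-flip test on the sum of paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis each difference is equally likely to have either sign,
    # so enumerate all 2**n sign assignments and count those at least as extreme.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for floating-point ties
            extreme += 1
    return extreme / total

# Hypothetical per-task ranks on 8 tasks (lower is better); NOT the paper's results.
ranks_adapted = [2, 3, 1, 4, 2, 3, 2, 4]
ranks_baseline = [3, 3, 2, 4, 3, 2, 3, 4]
p_value = paired_sign_flip_pvalue(ranks_adapted, ranks_baseline)
```

With 8 tasks the enumeration covers only 256 sign patterns, so the test is exact and cheap; averaging ranks over seeds before pairing would be one reasonable way to use all 10 seeds.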

3. Potential Impact

The paper operates at the intersection of two active research areas — tabular foundation models and latent-space Bayesian optimization — making it timely. The idea that TFM surrogates should be adapted to their deployment domain is broadly applicable beyond molecular design (proteins, materials, etc.).

However, the practical impact is limited by several factors:

The improvement over off-the-shelf TabPFN-3 is marginal, weakening the case for domain-specific adaptation.

The approach requires access to a library of domain-specific base tasks and a pretrained VAE, which limits out-of-the-box applicability.

The method doesn't uniformly dominate specialized LSBO methods like InvBO or NF-BO, which are designed with complementary mechanisms (latent geometry correction, normalizing flows).

The more impactful finding may actually be the secondary one: that TabPFN-3, as a plug-in surrogate without any adaptation, is already competitive with specialized LSBO methods. This observation alone could influence how practitioners approach LSBO.

4. Timeliness & Relevance

The paper is well-timed. TFMs are rapidly being adopted as BO surrogates (PFNs4BO, GIT-BO), and LSBO remains a workhorse for molecular optimization. The question of whether generic pretraining distributions are sufficient for specialized BO domains is genuinely important and underexplored. The paper also connects to the broader trend of continued pretraining / domain adaptation for foundation models.

5. Strengths & Limitations

Key Strengths:

Clear problem formulation with a well-articulated mismatch hypothesis

The diagnostic tools (Figures 1-2) provide interpretable evidence about the synthetic distribution's coverage

Practical design: LILBO is a drop-in surrogate replacement requiring no changes to the BO loop

Honest presentation of limitations, including latent drift and modest gains

Comprehensive benchmark suite with 8 tasks and 10 seeds

Key Limitations:

Marginal empirical gains that may not be statistically significant

No ablation studies to identify which components matter

The latent drift problem during VAE retraining is acknowledged but unresolved, and it undermines the core assumption

Limited to a single VAE architecture (SELFIES VAE) and a single TFM (TabPFN-3)

The paper is a workshop submission (6 pages + appendix), which naturally limits depth

No computational cost analysis — continued pretraining on 500k episodes is non-trivial

Additional Observations

The paper would benefit substantially from: (1) a direct comparison of surrogate prediction quality (not just optimization performance), (2) ablations isolating the effect of each design choice, (3) experiments on at least one non-molecular domain to demonstrate generality, and (4) analysis of how performance degrades as VAE retraining introduces latent drift.
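For suggestion (1), surrogate prediction quality could be reported with standard regression metrics on held-out latent codes. The sketch below assumes the surrogate's predictive distribution is summarized by a mean and standard deviation per point and scored as a Gaussian, which is a simplification; the function names are illustrative.

```python
import math

def rmse(y_true, mean):
    """Root mean squared error of the predictive means."""
    return math.sqrt(sum((y - m) ** 2 for y, m in zip(y_true, mean)) / len(y_true))

def gaussian_nll(y_true, mean, std):
    """Average negative log-likelihood under independent Gaussian predictions."""
    total = 0.0
    for y, m, s in zip(y_true, mean, std):
        total += 0.5 * math.log(2 * math.pi * s * s) + (y - m) ** 2 / (2 * s * s)
    return total / len(y_true)
```

RMSE measures point accuracy only, while the NLL also penalizes over- or under-confident uncertainty estimates, which matters because acquisition functions in BO depend on the predicted variance.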

The framing as a workshop paper is appropriate given the current level of evidence. The idea is sound and worth investigating further, but the empirical case is not yet compelling enough for a top venue.

Rating: 4.5 / 10

Significance 4.5 · Rigor 4 · Novelty 5 · Clarity 7

Generated Jun 9, 2026

Comparison History (21)

Won vs. BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation

Paper 2 has higher impact potential: it targets Bayesian optimization for molecules/proteins, a high-value application area, and addresses a timely, broadly relevant problem—how to adapt tabular foundation models/ICL surrogates when task priors mismatch (LSBO). The proposed continued-pretraining with latent-space synthetic optimization tasks plus anchoring regularizer is a transferable recipe for aligning foundation-model priors to downstream decision-making tasks, likely influencing BO, foundation models, and scientific discovery workflows. Paper 1 is useful for HDLSS tabular synthesis (e.g., omics) but is more niche and incremental in combining block structure with existing deep generative priors.

gpt-5.2 · Jun 9, 2026

Won vs. Operator learning for solving Fokker-Planck equations with various initial conditions

Paper 2 has higher estimated impact due to stronger timeliness and broader applicability: it connects foundation-model in-context learning with Bayesian optimization for molecule/protein design, a highly active area with clear real-world relevance. The proposed LSBO-specific continued pretraining and anchoring regularizer is a pragmatic, general recipe that could transfer to many latent-design settings and spur follow-on work across ML, optimization, and computational chemistry. Paper 1 is technically solid and novel for Fokker–Planck operator learning, but its impact is more specialized to stochastic PDE/SDE communities and narrower in immediate application scope.

gpt-5.2 · Jun 9, 2026

Won vs. Learning Dynamics Reveal a Hierarchy of Weight-Induced Layerwise Gram Metrics

Paper 2 demonstrates higher potential scientific impact due to its direct applicability to critical real-world problems like molecular design and protein engineering. By bridging the highly timely fields of in-context learning foundation models and latent-space Bayesian optimization, it offers a practical tool for scientific discovery across chemistry and biology. While Paper 1 provides rigorous theoretical insights into neural network learning dynamics, Paper 2's methodological innovation solves a practical distribution mismatch problem, paving the way for immediate, broad impact in applied sciences and AI-driven drug discovery.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. On Choosing the $μ$ Parameter in Gaussian Differential Privacy

Paper 2 addresses a more impactful problem at the intersection of foundation models and Bayesian optimization for molecular/protein design, which is a rapidly growing field with significant real-world applications in drug discovery and materials science. It introduces a novel continued-pretraining strategy to address distribution mismatch in latent-space BO, combining multiple trending research areas (foundation models, in-context learning, molecular optimization). Paper 1 provides useful but incremental technical guidance on parameter conversion between privacy frameworks, with narrower scope and more limited applicability.

claude-opus-4-6 · Jun 9, 2026

Won vs. RETROSPECT: RETROsynthesis via Sequential Prediction, and Chemically Transformed-ranking

Paper 1 offers higher potential scientific impact due to its methodological innovation and broader applicability. By adapting tabular foundation models for latent-space Bayesian optimization, it addresses a fundamental distribution shift problem, advancing the frontiers of in-context learning for sample-efficient design. This approach can be applied across diverse domains like drug discovery and materials science. In contrast, Paper 2 provides a highly optimized, domain-specific engineering pipeline for chemical retrosynthesis. While valuable, it largely relies on combining established techniques (Transformers, LambdaMART), offering narrower theoretical and cross-disciplinary impact.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. Disentanglement with Holographic Reduced Representations

Paper 1 introduces a fundamentally novel approach to disentanglement using holographic reduced representations, combining symbolic AI concepts with neural networks in an innovative way. It provides both empirical results and rigorous information-theoretic analysis, including capacity bounds. The work bridges multiple fields (representation learning, cognitive science, information theory) and addresses a long-standing challenge. Paper 2 makes a more incremental contribution—adapting pretraining distributions for latent-space BO surrogates—which, while useful, is narrower in scope and more application-specific with less foundational theoretical insight.

claude-opus-4-6 · Jun 9, 2026

Won vs. ERBench: A Benchmark and Testsuite for Equation Discovery Algorithms

Paper 2 addresses a timely intersection of foundation models, Bayesian optimization, and molecular design—areas of high current interest. Its novel approach of adapting tabular foundation model surrogates for latent-space BO through continued pretraining with domain-specific synthetic tasks is innovative and has direct applications in drug discovery and materials science. Paper 1, while useful as a benchmark contribution for symbolic regression, is more incremental—it improves evaluation methodology rather than introducing new algorithmic capabilities. Benchmarks have impact but typically less than methodological advances with clear real-world applications in high-impact domains like molecular optimization.

claude-opus-4-6 · Jun 9, 2026

Won vs. Revisiting Prototype Rehearsal for Exemplar-Free Continual Learning: Manifold-Aware Boundary Sampling with Adaptive Class-Balanced Loss

Paper 2 addresses a novel intersection of two rapidly growing fields—foundation models and Bayesian optimization—with broad applications in molecular and protein design. Its contribution of adapting tabular foundation model surrogates for latent-space BO via continued pretraining is timely and innovative, with clear real-world applications in drug discovery and materials science. Paper 1, while solid, offers incremental improvements to prototype rehearsal in exemplar-free continual learning, a narrower subfield. Paper 2's cross-disciplinary relevance (ML, chemistry, biology) and connection to the foundation model paradigm give it higher potential impact.

claude-opus-4-6 · Jun 9, 2026

Lost vs. Trajectory Geometry of Transformer Representations Across Layers

Paper 1 introduces a novel geometric framework for understanding transformer representations that is model-agnostic, probe-free, and applicable across architectures. Its findings (attractor dynamics, curvature encoding complexity, trajectory bifurcation, universal three-phase structure) are broadly relevant to the entire mechanistic interpretability community and potentially to all transformer-based AI systems. Paper 2 makes a solid but narrower contribution—adapting tabular foundation models for latent-space Bayesian optimization in molecular design. While useful, it addresses a more specialized problem with incremental methodological innovation (continued pretraining with regularization). Paper 1's breadth of impact across interpretability, neuroscience-inspired AI analysis, and its open-source toolkit give it higher potential impact.

claude-opus-4-6 · Jun 9, 2026

Lost vs. LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models

Paper 2 has higher estimated impact due to broader applicability and timeliness: robust drift detection/diagnosis is central to deploying continual learning systems in real-world streams (robotics, surveillance, autonomous driving, data-centric MLOps). Leveraging frozen large pretrained models for decoupled, zero-shot monitoring is a novel, generally reusable system concept that can plug into many TFCL methods and modalities, likely influencing multiple subfields (continual learning, distribution shift, foundation-model tooling). Paper 1 is solid and innovative but more specialized to LSBO/tabular ICL surrogates and molecular latent spaces, limiting breadth.

gpt-5.2 · Jun 9, 2026
