Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

Zhiyuan Jerry Lin, Benjamin Letham, Samuel Dooley, Maximilian Balandat, Eytan Bakshy

#967 of 2292 · Artificial Intelligence
Tournament Score: 1430±38
Win Rate: 50% (11 wins, 11 losses, 22 matches)
Rating: 6.5/10

Abstract

System prompts are a central control mechanism in modern AI systems, shaping behavior across conversations, tasks, and user populations. Yet they are difficult to tune when feedback is available only as aggregate metrics rather than per-example labels, failures, or critiques. We study this aggregate feedback setting as sample-constrained black-box optimization over discrete, variable-length text. We introduce ReElicit, a Bayesian optimization framework based on embedding by elicitation. Given a task description, previously evaluated prompts, and scalar scores, an LLM elicits a compact, interpretable feature space and maps prompts into it. Leveraging a probabilistic Gaussian process surrogate, an acquisition function then selects target feature vectors, which the LLM realizes and refines into deployable system prompts. Re-eliciting the feature space as new evaluations arrive lets the representation adapt to the observed prompt-score history. We evaluate the setting using offline benchmark accuracy as a controlled aggregate proxy: the optimizer observes one scalar score per prompt and no per-example labels, errors, or critiques. Across ten system prompt optimization tasks with a total budget of 30 evaluations, ReElicit achieves the strongest aggregate performance profile among representative aggregate-only prompt-optimization baselines. These results suggest that LLMs can serve as adaptive semantic representation builders, not only prompt generators, for Bayesian optimization over natural-language artifacts.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts"

1. Core Contribution

The paper introduces ReElicit, a Bayesian optimization (BO) framework that addresses a genuine gap in prompt optimization: tuning system prompts when only aggregate scalar feedback (e.g., task-completion rate, user satisfaction) is available, rather than per-example labels or error traces. The key conceptual innovation is embedding by elicitation — using an LLM to construct a compact, interpretable, low-dimensional feature space from prompt-score history, then performing BO in that space. The LLM serves dual roles: as a representation builder (defining and extracting semantic features) and as an inverse mapper (realizing BO-selected feature targets back into deployable text). The dynamic re-elicitation of the feature space as new data arrives is a notable design choice that distinguishes this from static embedding approaches.
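The optimization step described above (fit a surrogate on elicited feature vectors, then pick a target feature vector for the LLM to realize) can be illustrated with a minimal sketch. This is not the authors' implementation: the RBF kernel, the upper-confidence-bound acquisition, the unit-box feature range, and all names and data below are assumptions chosen for brevity, and the LLM elicitation and realization steps are left out entirely.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.3):
    """Squared-exponential kernel between rows of A and rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(Z, y, Z_query, noise=1e-3):
    """Posterior mean and std of a constant-mean GP at the rows of Z_query."""
    y_mean = y.mean()
    # `noise` is an observation-noise variance added to the diagonal.
    K = rbf_kernel(Z, Z) + noise * np.eye(len(Z))
    K_s = rbf_kernel(Z, Z_query)
    mean = y_mean + K_s.T @ np.linalg.solve(K, y - y_mean)
    # Prior variance of the RBF kernel is 1 at every point.
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))

def propose_target(Z, y, n_candidates=2000, beta=2.0, seed=0):
    """Return the random candidate in [0, 1]^d with the highest UCB value."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(0.0, 1.0, size=(n_candidates, Z.shape[1]))
    mean, std = gp_posterior(Z, y, candidates)
    return candidates[np.argmax(mean + beta * std)]

# Hypothetical history: 5 evaluated prompts, each mapped to 2 elicited features.
Z = np.array([[0.1, 0.9], [0.4, 0.5], [0.8, 0.2], [0.6, 0.7], [0.3, 0.3]])
y = np.array([0.52, 0.61, 0.48, 0.66, 0.55])  # one aggregate score per prompt
target = propose_target(Z, y)  # feature vector an LLM would then turn into text
```

In the pipeline the review describes, `target` would be handed back to the LLM, which writes a prompt intended to land near that point in feature space.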

The problem formulation is well-motivated. System prompts in production settings often receive only aggregate metric feedback, making instance-level APO methods inapplicable. Framing this as sample-constrained black-box optimization over discrete variable-length text is appropriate, and the setting itself is underserved by existing work.

2. Methodological Rigor

Strengths in design: The algorithm cleanly separates information access — feature *definition* sees scores (to identify performance-relevant axes), while feature *extraction* does not (preventing outcome leakage). The cross-validation selection among candidate feature sets and the incumbent retention mechanism are sensible engineering choices. The feature-gap refinement loop for translating continuous BO targets back to text is a practical solution to the inverse mapping problem.
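The cross-validation selection among candidate feature sets can be sketched as follows. This is a simplified illustration, not the paper's procedure: it swaps in a ridge-regression surrogate for the Gaussian process, uses leave-one-out error as the selection criterion, and runs on synthetic data. The function names and the `candidates` dictionary are hypothetical. Note that the feature matrices are built without reference to `y`, mirroring the separation between score-aware definition and score-blind extraction.

```python
import numpy as np

def loo_cv_error(Z, y, ridge=1e-2):
    """Mean squared leave-one-out error of a ridge-regression surrogate."""
    n = len(y)
    errors = []
    for i in range(n):
        keep = np.arange(n) != i
        # Design matrix with an intercept column, fitted without point i.
        X_train = np.column_stack([np.ones(keep.sum()), Z[keep]])
        A = X_train.T @ X_train + ridge * np.eye(X_train.shape[1])
        w = np.linalg.solve(A, X_train.T @ y[keep])
        pred = np.concatenate([[1.0], Z[i]]) @ w
        errors.append((pred - y[i]) ** 2)
    return float(np.mean(errors))

def select_feature_set(candidates, y):
    """Return the name of the candidate embedding with the lowest LOO error."""
    return min(candidates, key=lambda name: loo_cv_error(candidates[name], y))

# Synthetic example: one embedding explains the scores, the other is noise.
rng = np.random.default_rng(0)
informative = rng.uniform(size=(12, 2))
y = 0.5 + 0.3 * informative[:, 0] - 0.2 * informative[:, 1]
candidates = {
    "informative": informative,
    "uninformative": rng.uniform(size=(12, 2)),
}
best = select_feature_set(candidates, y)
```

Selecting by held-out error rather than in-sample fit matters here because, with so few evaluations, a feature set can fit the observed scores well by chance.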

Theoretical analysis: The reachability theorem (Theorem 1) provides a bound connecting near-optimality in the elicited embedding to true prompt quality, with the gap decomposing into optimization error δ and representation error 2BLη_t. While this is a reasonable formalization, it rests on strong assumptions (oracle smooth semantic embedding, bounded RKHS norm, Lipschitz feature map) that are difficult to verify in practice. The analysis is more illustrative than prescriptive — it explains *why* reducing representation error matters but doesn't provide actionable guarantees about when elicited embeddings will be close to oracle.
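Based only on the description above, the bound appears to have the following shape. The notation is an assumption: $p^{\star}$ denotes an optimal prompt and $\hat{p}_t$ the prompt returned at round $t$; $B$ is read here as the RKHS norm bound, $L$ as the Lipschitz constant of the feature map, and $\eta_t$ as the distance between the elicited and oracle embeddings. The exact statement and conditions of Theorem 1 in the paper may differ.

```latex
f(p^{\star}) - f(\hat{p}_t)
  \;\le\;
  \underbrace{\delta}_{\text{optimization error}}
  \;+\;
  \underbrace{2 B L \eta_t}_{\text{representation error}}
```

Read this way, the second term vanishes as $\eta_t \to 0$, which is the sense in which the analysis explains why reducing representation error matters.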

Experimental evaluation: Ten tasks across GSM8K, MMLU, and BBH with 30 seeds provide reasonable statistical power. The 30-evaluation budget constraint is realistic for the motivating setting. However, several concerns arise:

  • The baselines are *adaptations* of existing methods to the aggregate-only setting, not purpose-built competitors. The authors acknowledge this but it weakens comparative claims.
  • Confidence intervals overlap substantially on several tasks (Table 1), and margins are often small. The pairwise win-or-tie analysis (Table 2) is more convincing for aggregate consistency.
  • The ablation study (Figure 3, Table 3) is informative: refinement and BO contribute most, while dynamic re-elicitation and joint extraction have smaller effects. This honesty about component contributions is appreciated.
  • Diagnostics are thorough: Feature stability analysis (Figure 1), surrogate fit trajectories (Figure 2a), gap-improvement correlation (Figure 2b), and the feature evolution case study (Table 4) provide evidence that the representation has desired properties.
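A pairwise win-or-tie analysis of the kind mentioned above can be computed as in the following sketch. The tie rule (a plain `>=` comparison with an optional tolerance), the method names, and the scores are all hypothetical; the paper may define ties differently, for example via overlapping confidence intervals.

```python
import numpy as np

def win_or_tie_rate(scores, method, baseline, tol=0.0):
    """Fraction of tasks on which `method` scores at least `baseline` minus tol."""
    a = np.asarray(scores[method], dtype=float)
    b = np.asarray(scores[baseline], dtype=float)
    return float(np.mean(a >= b - tol))

# Hypothetical per-task mean accuracies for two methods on four tasks.
scores = {
    "method_a": [0.80, 0.62, 0.55, 0.71],
    "method_b": [0.78, 0.62, 0.57, 0.70],
}
rate = win_or_tie_rate(scores, "method_a", "method_b")  # 3 of 4 tasks -> 0.75
```

A count like this ignores the size of each margin, which is why it complements rather than replaces per-task confidence intervals.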

3. Potential Impact

Direct applications: The aggregate-feedback framing is genuinely relevant for production AI systems where A/B testing yields only top-line metrics. System prompt optimization for chatbots, coding assistants, and customer-facing AI fits this paradigm.

Broader conceptual contribution: The idea that LLMs can serve as *adaptive semantic representation builders* for optimization, not just as generators, is the paper's most transferable insight. This "embedding by elicitation" pattern could extend to other optimization-over-text problems: evaluation rubric design, tool-use instructions, or policy specifications.

Limitations on impact: The experiments use offline benchmark accuracy as a proxy for the motivating deployment metrics (user satisfaction, safety-incident rates, retention). This gap between claimed motivation and actual evaluation is the paper's most significant weakness. Real deployment metrics are noisy, delayed, non-stationary, and potentially multi-objective, and none of these conditions is tested. The method's reliance on additional optimizer-side LLM calls (for elicitation, extraction, realization, refinement) also limits practical adoption in cost-sensitive settings, though the authors correctly argue that when f is expensive (e.g., large-scale A/B tests), optimizer overhead is negligible.

4. Timeliness & Relevance

The paper addresses a timely problem. System prompts are increasingly central to production AI systems, and the gap between academic prompt optimization (with rich per-example feedback) and practical deployment needs (with aggregate metrics) is real. The connection to Bayesian optimization, a mature tool for expensive black-box evaluation, is natural and well-executed. The work also connects to the growing interest in using LLMs as components in optimization loops rather than just targets of optimization.

5. Strengths & Limitations

Key strengths:

  • Clean problem formulation that fills a genuine gap between per-example APO and deployment reality
  • Elegant architecture that leverages LLMs for representation construction, not just candidate generation
  • Principled information separation between feature definition and extraction
  • Comprehensive diagnostics beyond just final performance numbers
  • Honest ablation revealing that some components (dynamic re-elicitation) contribute less than others

Notable weaknesses:

  • Evaluation-motivation gap: benchmarks don't test the claimed deployment scenarios
  • Relatively modest absolute improvements over baselines on many tasks
  • Theoretical analysis, while structurally sound, relies on unverifiable assumptions
  • Single optimizer LLM (Llama 3.3 70B) and single target LLM (Llama 3.1 8B) — generalization across model families untested
  • The 2-3 dimensional feature spaces may be insufficient for more complex prompt optimization landscapes
  • No comparison against BO methods with fixed dense embeddings (e.g., sentence transformers + PCA), which would test whether LLM-elicited features are better than generic semantic embeddings
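The missing fixed-embedding baseline named in the last weakness could look roughly like the sketch below: project dense prompt embeddings onto a few principal components and run BO in that fixed space. The random matrix is a stand-in for real sentence-encoder outputs, and the 384-dimensional width and the function name are assumptions made for illustration only.

```python
import numpy as np

def pca_project(embeddings, n_components=3):
    """Project rows of `embeddings` onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:n_components].T

# Stand-in for dense embeddings of 20 prompts (hypothetical 384-dim encoder).
rng = np.random.default_rng(0)
dense = rng.normal(size=(20, 384))
Z_fixed = pca_project(dense, n_components=3)  # fixed low-dimensional BO inputs
```

Unlike elicited features, these coordinates are chosen without seeing any scores and carry no readable meaning, which is exactly the contrast such a baseline would test.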

Summary

This is a well-executed paper with a clear and useful conceptual contribution. The embedding-by-elicitation idea is novel, practically motivated, and supported by reasonable (though not overwhelming) empirical evidence. The main limitations are the gap between stated motivation and experimental evaluation, modest margins over adapted baselines, and untested generalization. It advances the conversation about how to optimize over natural-language artifacts under realistic feedback constraints.

Rating: 6.5/10
Significance 6.5 · Rigor 6.5 · Novelty 7.5 · Clarity 8

    Generated May 20, 2026

    Comparison History (22)

    vs. IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents
    gpt-5.2 · 5/22/2026

    Paper 2 is more novel and broadly impactful: it introduces an adaptive “embedding by elicitation” representation for Bayesian optimization of variable-length text under aggregate-only feedback, a common real-world constraint in deployed AI systems. This creates a general framework that can transfer to many optimization problems over natural-language artifacts (system prompts, policies, rubrics), bridging LLMs, BO, and representation learning. Paper 1 is useful and timely but is a more incremental systems-level inference improvement (idle-time speculative planning) with narrower conceptual reach. Both seem empirically validated, but Paper 2’s framing and applicability suggest higher impact.

    vs. LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems
    gpt-5.2 · 5/22/2026

    Paper 2 has higher estimated impact due to stronger timeliness and broad applicability: securing latent KV-cache communication is a rapidly emerging need as multi-agent LLM systems adopt KV sharing for efficiency. LCGuard introduces a clear threat model (reconstruction-based leakage) and an adversarial training approach that can generalize across model families and agent settings, with direct real-world implications for privacy and safety. Paper 1 is novel and useful for prompt optimization under aggregate feedback, but its scope is narrower and closer to existing Bayesian optimization/prompt-tuning lines.

    vs. Cross-domain benchmarks reveal when coordinated AI agents improve scientific inference from partial evidence
    gpt-5.2 · 5/22/2026

    Paper 2 likely has higher scientific impact due to broader cross-domain relevance and clearer real-world scientific workflows. It introduces an auditable benchmark spanning multiple scientific inference settings, with explicit baselines, controls, and regime characterization of when multi-agent coordination helps—findings that can generalize across fields and guide deployment decisions. The methodological emphasis on frozen panels, scoring protocols, and provenance also strengthens rigor and reproducibility. Paper 1 is novel for prompt optimization under aggregate feedback, but its applications are narrower (LLM system prompt tuning) and impact may be more confined to LLM ops/optimization research.

    vs. IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents
    gemini-3.1 · 5/22/2026

    Paper 1 introduces a highly novel conceptual framework by bridging Bayesian optimization with LLMs, utilizing them as dynamic semantic representation builders rather than just text generators. This methodological innovation for optimizing discrete, variable-length text under sample-constrained aggregate feedback addresses a major challenge in AI alignment and deployment. While Paper 2 offers a practical systems-level optimization (speculative planning) for agents, Paper 1's core algorithmic contribution has broader implications for combining traditional probabilistic ML with modern LLMs.

    vs. AMEL: Accumulated Message Effects on LLM Judgments
    gpt-5.2 · 5/22/2026

    Paper 2 likely has higher scientific impact due to broad, immediate relevance: it identifies and quantifies a systematic bias in LLM-as-judge workflows across many models/providers with a large-scale experimental design (75,898 calls), clear effect sizes, and actionable mitigation guidance. This impacts evaluation, moderation, benchmarking, and agent pipelines across fields. Paper 1 is novel and useful for prompt optimization, but is narrower in scope (aggregate-only Bayesian optimization for system prompts) and its evidence is more task/budget-specific, making generalization and downstream impact less certain.

    vs. AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters
    claude-opus-4.6 · 5/22/2026

    AtelierEval addresses a significant gap in T2I evaluation by creating the first benchmark for prompting proficiency rather than model quality, with a comprehensive framework spanning 360 tasks, 8 MLLMs, 48 humans, and 4 backends. Its breadth of impact spans HCI, generative AI evaluation, and cognitive science. Paper 2, while technically interesting in combining Bayesian optimization with LLM-elicited embeddings for prompt tuning, addresses a narrower optimization problem with more incremental contributions. AtelierEval's benchmark and agentic evaluator have stronger potential for community-wide adoption and downstream research impact.

    vs. Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
    gemini-3.1 · 5/22/2026

    Paper 1 addresses a critical bottleneck in modern AI—token inefficiency in long-horizon reasoning—by introducing a novel architectural decomposition (reactive, simulative, and self-regulatory systems). Its ability to match the performance of massive models with significantly fewer parameters and reasoning tokens represents a major leap in agentic LLM design. Paper 2 offers a valuable prompt optimization technique, but its scope and potential for broad, paradigm-shifting impact are narrower compared to the fundamental efficiency and architectural advancements proposed in Paper 1.

    vs. Probabilistic Tiny Recursive Model
    gpt-5.2 · 5/20/2026

    Paper 2 is likely to have higher scientific impact because it tackles a broadly important and timely problem—system prompt optimization under aggregate-only feedback—directly relevant to real-world deployment and governance of LLM systems. Its “embedding by elicitation” concept is a novel, general-purpose mechanism for turning discrete text optimization into sample-efficient Bayesian optimization with adaptive representations, potentially transferable to other natural-language artifact design problems. Paper 1 shows strong results for a niche but valuable class of recursive reasoners; however, its core contribution (stochastic noise + trajectory selection) is a more incremental test-time exploration technique with narrower cross-field reach.

    vs. Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs
    gpt-5.2 · 5/20/2026

    Paper 1 is more likely to have higher near-term scientific and real-world impact: it tackles a timely, widely encountered industry problem (optimizing system prompts under aggregate-only metrics) and proposes a novel LLM-in-the-loop Bayesian optimization pipeline with adaptive, interpretable representations, which is broadly applicable across LLM deployment, HCI, and optimization. Paper 2 is methodologically very rigorous and theoretically valuable (minimax-optimal regret with variance-aware constants), but its impact is narrower to specialized RL theory for MNL MDPs and may translate more slowly to practice.

    vs. OpenComputer: Verifiable Software Worlds for Computer-Use Agents
    gpt-5.2 · 5/20/2026

    Paper 2 likely has higher impact: it introduces an infrastructure-level, verifier-grounded framework (verifiable software worlds, task generation, evaluation harness) spanning 33 real apps and 1,000 tasks, enabling reproducible, auditable evaluation and training signals for computer-use agents—an area of high current relevance. Its methodological rigor and broad applicability (agent evaluation, reward design, benchmarking, safety/verification) support cross-field adoption. Paper 1 is novel for aggregate-only prompt optimization, but is narrower in scope and closer to incremental advances in prompt/BO tooling compared to a new benchmark+verification ecosystem.

    vs. AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation
    gpt-5.2 · 5/20/2026

    Paper 2 is likely to have higher scientific impact due to broader applicability and timeliness: it proposes a general framework for aggregate-feedback optimization of system prompts, relevant to many deployed LLM systems where only scalar metrics are available. The “embedding by elicitation” idea (LLM-built, dynamically re-elicited feature spaces coupled with GP Bayesian optimization) is novel and could transfer to optimizing other discrete natural-language artifacts. Paper 1 is methodologically interesting and clinically relevant but is narrower (radiology report generation) and depends on domain-specific resources, limiting cross-field breadth.

    vs. RAG-based EEG-to-Text Translation Using Deep Learning and LLMs
    gpt-5.2 · 5/20/2026

    Paper 2 is likely higher impact: it introduces a generally applicable optimization framework (ReElicit) for tuning system prompts under aggregate-only feedback, a common real-world constraint in deployed AI. The combination of adaptive, LLM-elicited feature representations with Gaussian-process Bayesian optimization is methodologically grounded and broadly reusable across tasks, products, and research areas (prompting, HCI, black-box optimization). It is timely given widespread prompt-based control and evaluation-budget limits. Paper 1 is novel but its impact may be narrower due to modest gains on a single EEG dataset and the inherent limitations/noise of EEG-to-text decoding.

    vs. What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
    gemini-3.1 · 5/20/2026

    Paper 1 addresses the fundamental challenge of long-horizon credit assignment in multi-turn LLM agents, a critical bottleneck for autonomous AI development. Its systematic approach to environment-reweighted learning (SERL) bridges RL and LLM distillation, offering deep methodological insights that are likely to broadly impact agentic AI research. While Paper 2 offers a clever and practical Bayesian optimization approach for prompt tuning, Paper 1's focus on autonomous agent capability advances a more complex and transformative area of AI.

    vs. EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design
    claude-opus-4.6 · 5/20/2026

    Paper 1 introduces a novel conceptual contribution—'embedding by elicitation'—where LLMs serve as adaptive semantic representation builders for Bayesian optimization over natural language. This is a genuinely new idea that bridges LLM capabilities with principled optimization, with broad applicability beyond prompt tuning to any text-based optimization problem. Paper 2, while solid engineering work, is primarily a benchmark and reference implementation for multi-agent engineering design—a more incremental contribution in an increasingly crowded space of LLM agent frameworks. Paper 1's methodological innovation has greater potential to influence future research directions across multiple fields.

    vs. AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
    claude-opus-4.6 · 5/20/2026

    AutoResearchClaw addresses the high-profile challenge of autonomous scientific discovery with a comprehensive multi-agent system featuring novel mechanisms (self-healing execution, cross-run evolution, structured debate, human-in-the-loop collaboration). It demonstrates strong empirical results (54.7% improvement over AI Scientist v2) and has broader impact potential across all scientific fields. The finding that targeted human collaboration outperforms both full autonomy and exhaustive oversight is particularly impactful for the future of human-AI collaboration. ReElicit, while methodologically sound, addresses a narrower problem of system prompt optimization with more incremental contributions.

    vs. Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints
    gpt-5.2 · 5/20/2026

    Paper 1 has higher potential impact due to a stronger theoretical contribution: a first finite-sample guarantee for neural Q-learning under decentralized partial observability in an interface-constrained multi-agent SMDP, plus new AIS extensions and Markovian-noise control under random durations. This is methodologically rigorous and broadly relevant to multi-agent RL, decentralized control, and cross-boundary LLM-agent pipelines. Paper 2 is timely and practical for prompt optimization, but is more application/engineering-focused with weaker general theoretical novelty and potentially narrower scientific spillover.

    vs. Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs
    claude-opus-4.6 · 5/20/2026

    Paper 1 introduces a novel framework (GVG) that bridges EEG signals with visual representations through generative models, addressing a fundamental challenge in brain-computer interfaces and neural decoding. Its cross-disciplinary impact spans neuroscience, computer vision, and clinical applications. The approach of using hallucinated proxy images to ground non-visual EEG signals in MLLMs is highly innovative and opens new research directions for brain foundation models. Paper 2, while methodologically sound, addresses the narrower problem of system prompt optimization with incremental improvements over existing Bayesian optimization approaches, limiting its broader scientific impact.

    vs. AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees
    gemini-3.1 · 5/20/2026

    Paper 1 addresses a universally critical challenge in AI (optimizing system prompts) and introduces a highly novel intersection of Bayesian Optimization and LLM-elicited dynamic representations. Its methodology has broader applicability across various natural language tasks compared to Paper 2, which focuses on a more specific, albeit important, efficiency optimization for GUI agents. The conceptual innovation of using LLMs to dynamically define feature spaces for BO gives Paper 1 a higher potential for broad scientific impact.

    vs. Streamlined Constraint Reasoning via CNN Pattern Recognition on Enumerated Solutions
    claude-opus-4.6 · 5/20/2026

    Paper 1 introduces a novel conceptual framework ('embedding by elicitation') that addresses a broadly relevant problem—optimizing system prompts with only aggregate feedback—applicable across the rapidly growing ecosystem of LLM-based applications. Its contribution of using LLMs as adaptive semantic representation builders for Bayesian optimization is a transferable idea with wide applicability. Paper 2, while achieving impressive speedups on specific constraint programming benchmarks, addresses a narrower community (CP practitioners) and combines existing techniques (CNNs, LLMs, enumeration) in a pipeline that, though effective, has more limited cross-field impact potential.

    vs. Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
    gpt-5.2 · 5/20/2026

    Paper 1 is more novel methodologically: it introduces a new optimization framework (LLM-elicited adaptive embeddings + GP Bayesian optimization) for aggregate-only prompt tuning, a hard and practically important setting. This can generalize beyond prompt optimization to black-box optimization over discrete artifacts, influencing optimization, HCI, and LLM alignment/tooling. Paper 2 is timely and useful as an evaluation benchmark, but benchmarks tend to have narrower scientific reach and shorter half-life as models/harnesses change, unless they become a dominant standard. Paper 1’s algorithmic contribution is likelier to inspire follow-on research.