Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement
Jyotirmoy Nath, Neeraj Kumar, Brejesh Lall
Abstract
Automatic prompt optimization (APO) has driven significant gains in LLM-based agentic workflows. However, existing methods treat each task's prompt as a monolithic, instance-blind string optimized through global edits, producing brittle updates and preventing the reuse of learned sub-behaviors. We propose Prompt Codebooks (PCO), a novel compositional prompt optimization framework that recasts APO as discrete learning over a finite vocabulary of natural-language instincts: atomic, reusable instruction units. PCO organizes prompt-construction knowledge in a discrete codebook and routes each input to a small subset of entries via an LLM-based encoder; a generator composes them into a prompt for the frozen target model; a critic emits a structured verdict that decomposes by attribution into per-variable textual gradients, jointly training the encoder, generator, and codebook under a language-valued min-max objective. The resulting routing is per-instance: different inputs in the same task receive different instinct compositions, a regime structurally inexpressible under instance-blind methods. Across six benchmarks on Qwen3-8B and LLaMA-3.1-8B, PCO improves over zero-shot by up to +30.36 points, surpasses the strongest prior baseline (GEPA) by +3.34 on HotpotQA and +1.11 in aggregate, and reduces deployed prompt length by up to 14.1× versus MIPROv2 and 3.0× versus GEPA using only K=16 instincts.
AI Impact Assessments
Scientific Impact Assessment: Prompt Codebooks (PCO)
1. Core Contribution
PCO reformulates automatic prompt optimization (APO) from editing a single monolithic prompt string into composing prompts from a discrete codebook of natural-language "instincts": atomic, reusable instruction units. The key architectural innovation is a four-component system: an LLM-based encoder that semantically routes each input to a subset of codebook entries, a generator that composes the selected instincts into a prompt, a frozen target LLM that executes the prompt, and a critic that provides structured natural-language feedback decomposed into per-variable textual gradients. This enables per-instance adaptive prompting: different inputs receive different instinct compositions, something structurally impossible under instance-blind methods.
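The four-component flow described above can be sketched as a single forward pass. This is a minimal illustration, not the authors' implementation: `run_instance`, `Verdict`, the toy codebook, and all stand-in functions are hypothetical names, and the LLM-based encoder is replaced by a simple word-overlap ranker so the example runs without any model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class Verdict:
    """Hypothetical critic output: a success flag plus per-variable feedback."""
    success: bool
    feedback: Dict[str, str]


def run_instance(
    x: str,
    codebook: Sequence[str],
    encoder: Callable[[str, Sequence[str]], List[int]],
    generator: Callable[[str, List[str]], str],
    target: Callable[[str], str],
    critic: Callable[[str, str, str], Verdict],
) -> Tuple[List[int], str, str, Verdict]:
    """One forward pass: route, compose, execute, critique."""
    idx = encoder(x, codebook)               # per-instance routing
    instincts = [codebook[i] for i in idx]   # selected codebook entries
    prompt = generator(x, instincts)         # compose the prompt
    answer = target(prompt)                  # frozen target model
    verdict = critic(x, prompt, answer)      # structured feedback
    return idx, prompt, answer, verdict


# Toy stand-ins (assumptions for illustration; the paper uses LLM calls here).

def keyword_encoder(x: str, codebook: Sequence[str], k: int = 2) -> List[int]:
    """Rank entries by word overlap with the input and keep the top k."""
    words = set(x.lower().split())
    scores = [len(words & set(entry.lower().split())) for entry in codebook]
    ranked = sorted(range(len(codebook)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])


def template_generator(x: str, instincts: List[str]) -> str:
    """Concatenate the selected instincts above the input."""
    return "\n".join(instincts) + "\n\nInput: " + x


def echo_target(prompt: str) -> str:
    """Placeholder for the frozen target model."""
    return "stub answer for: " + prompt.splitlines()[-1]


def stub_critic(x: str, prompt: str, answer: str) -> Verdict:
    """Placeholder critic with one empty feedback slot per trainable variable."""
    slots = {"encoder": "", "generator": "", "codebook": ""}
    return Verdict(success=bool(answer), feedback=slots)


CODEBOOK = [
    "show each arithmetic step",
    "cite the supporting passage",
    "answer in one sentence",
]


def route(x: str) -> List[int]:
    """Return only the routing decision for an input."""
    idx, _, _, _ = run_instance(
        x, CODEBOOK, keyword_encoder, template_generator, echo_target, stub_critic
    )
    return idx


print(route("compute the arithmetic sum step by step"))  # [0, 1]
print(route("which passage supports the answer"))        # [1, 2]
```

The two inputs receive different instinct subsets from the same codebook, which is the per-instance behavior the assessment highlights. In training, the `feedback` slots would carry the critic's textual gradients for each component.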
The conceptual bridge between VQ-VAE-style discrete latent representations and prompt optimization is genuinely novel. While DSPy structures prompts as compositional programs, its components remain instance-blind. According to the authors, PCO is the first method that constructs prompts compositionally from a shared discrete latent vocabulary optimized end-to-end.
2. Methodological Rigor
Strengths in formalization: The language-valued min-max objective (Eq. 6) and its additive decomposition into generator faithfulness, codebook refinement, and routing consistency losses (Eq. 7) provide a clean theoretical framework. The analogy to GANs is functional rather than literal—the authors appropriately note they adopt the structure for attribution rather than distributional-distance interpretation.
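The additive structure mentioned above might be written schematically as follows. The symbols are placeholders chosen for this sketch (E for the encoder, G for the generator, C for the codebook) and are not the paper's notation; any weighting coefficients the paper may use are omitted.

```latex
\mathcal{L}_{\text{total}}(E, G, \mathcal{C})
  = \underbrace{\mathcal{L}_{\text{faith}}(G)}_{\text{generator faithfulness}}
  + \underbrace{\mathcal{L}_{\text{refine}}(\mathcal{C})}_{\text{codebook refinement}}
  + \underbrace{\mathcal{L}_{\text{route}}(E)}_{\text{routing consistency}}
```

Because each term is attributed to one component, the critic's feedback can be delivered to that component alone, which is what makes localized credit assignment possible.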
Concerns about rigor:
Experimental coverage: Six benchmarks across reasoning, math, and instruction-following provide reasonable breadth. However, all experiments use 8B models only, and the authors acknowledge they do not evaluate on proprietary models. The comparison against GEPA uses official results for Qwen3-8B but reproduced results for LLaMA-3.1-8B, introducing potential inconsistency.
3. Potential Impact
Immediate applications: Per-instance prompt routing could benefit production LLM systems where heterogeneous inputs require different prompting strategies. The 14.1× prompt length reduction is practically significant for latency-sensitive and cost-sensitive deployments.
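To put the reported reduction factors in token terms, a reduction factor r corresponds to a fraction 1 - 1/r of prompt tokens removed. The helper below is a small illustrative calculation; `fraction_saved` is a hypothetical name, and the only inputs are the 14.1× and 3.0× factors quoted in the abstract.

```python
def fraction_saved(reduction_factor: float) -> float:
    """Fraction of prompt tokens removed for a given length-reduction factor."""
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    return 1.0 - 1.0 / reduction_factor


for name, factor in [("MIPROv2", 14.1), ("GEPA", 3.0)]:
    print(f"vs {name}: {fraction_saved(factor):.1%} fewer prompt tokens")
```

Under this reading, the best case against MIPROv2 removes roughly 93% of prompt tokens, and the best case against GEPA removes about two thirds. Both are upper bounds, since the abstract reports them as "up to" figures.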
Broader implications: The discrete codebook abstraction opens several research directions: cross-task transfer of learned instincts, interpretability through codebook inspection, and integration into multi-agent pipelines where different agents could share instinct vocabularies. The emergent specialization phenomenon (Table 5) is particularly interesting—specialized, rarely-used instincts achieving high success rates suggests the framework discovers meaningful functional decompositions.
Limitations on impact: The computational overhead of training (approximately 24,000 LLM calls per benchmark) may limit adoption. The framework currently uses the same LLM for all roles (encoder, generator, critic, executor), which may not scale efficiently. The per-instance routing adds inference overhead (encoder + generator calls), though this appears modest.
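A rough way to weigh these overheads is to amortize the training calls over the number of deployed queries. The sketch below assumes one encoder call, one generator call, and one target call per query at inference time; that three-call figure is an assumption for illustration, not a number from the paper, and `amortized_calls` is a hypothetical helper.

```python
def amortized_calls(
    training_calls: int,
    n_queries: int,
    inference_calls_per_query: int = 3,  # assumed: encoder + generator + target
) -> float:
    """Average LLM calls per deployed query, including amortized training."""
    if n_queries <= 0:
        raise ValueError("n_queries must be positive")
    return training_calls / n_queries + inference_calls_per_query


# With the roughly 24,000 training calls cited above:
for n in (1_000, 24_000, 1_000_000):
    print(n, round(amortized_calls(24_000, n), 3))
```

The training cost dominates only for small deployments; at high query volume the per-query figure approaches the inference-time call count.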
4. Timeliness & Relevance
The paper addresses a genuine bottleneck in the rapidly growing APO field. As LLM-based agentic systems become more prevalent, the limitations of monolithic prompt optimization—brittleness, interference between updates, and inability to reuse learned behaviors—become increasingly problematic. The timing is appropriate given the concurrent maturation of both APO methods (GEPA, MIPROv2, TextGrad) and discrete representation learning.
The connection to VQ-VAE and codebook learning is intellectually stimulating, but the analogy has limits: in VQ-VAE, codebook entries are vectors trained with a codebook loss, with gradients passed to the encoder by straight-through estimation and well-understood dynamics, whereas PCO's "optimization" is LLM-based text rewriting with no formal convergence guarantees.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Additional observations: The paper is well-written with clear figures and thorough appendices. The qualitative prompt examples (Appendix A.7) effectively demonstrate the framework's behavior. However, the routing entropy metric, while informative about codebook utilization, doesn't directly measure prompt quality or diversity in a way that isolates the per-instance adaptation benefit.
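The concern about routing entropy can be made concrete with a small example. The sketch assumes the metric is the Shannon entropy of aggregate instinct usage; the paper's exact definition may differ, and `routing_entropy` and the two routing logs are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Sequence


def routing_entropy(routes: Iterable[Sequence[int]]) -> float:
    """Shannon entropy (bits) of how often each instinct index is selected."""
    counts = Counter(i for route in routes for i in route)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Every input gets the same two instincts: no per-instance adaptation.
instance_blind = [[0, 1], [0, 1]]
# Each input gets its own instinct: fully per-instance routing.
per_instance = [[0], [1]]

print(routing_entropy(instance_blind))  # 1.0
print(routing_entropy(per_instance))    # 1.0
```

Both logs score exactly one bit, although only the second varies by input. An aggregate usage entropy of this kind therefore measures codebook utilization but cannot, on its own, separate instance-blind from per-instance behavior.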
Summary
PCO introduces an architecturally novel and conceptually well-motivated approach to prompt optimization. The discrete codebook abstraction, per-instance routing, and localized credit assignment represent genuine advances over monolithic APO. However, the empirical gains over the strongest baseline are modest and potentially within noise, the theoretical foundations rest on informal analogies rather than formal guarantees, and scalability remains undemonstrated. The prompt compression results are the most compelling practical contribution.
Generated May 28, 2026
Comparison History (16)
Paper 1 (Prompt Codebooks) offers a more clearly novel and broadly applicable formulation: reframing prompt optimization as discrete, compositional, per-instance routing over reusable “instinct” units, enabling transfer and modularity that instance-blind methods cannot express. It reports concrete, multi-benchmark gains on widely used open LLMs and adds efficiency benefits (prompt length reduction), strengthening real-world deployability. Paper 2 is timely for agent systems, but the abstract provides fewer methodological specifics and less concrete comparative evidence, making impact harder to assess and likely narrower to agent skill libraries.
Paper 1 introduces a highly novel compositional approach to LLM prompt optimization, addressing a critical bottleneck in agentic workflows. Its ability to create reusable, instance-specific instruction units offers broad applicability, performance improvements, and significant efficiency gains. Paper 2, while addressing important operational and auditability challenges in AI, presents a more specific framework that is likely to have a narrower impact compared to the fundamental advancements in LLM interaction proposed in Paper 1.
Paper 2 has higher likely impact due to broader applicability and timeliness: compositional, per-instance prompt optimization can benefit many LLM tasks and agentic systems beyond a single domain, enabling reusable instruction “instincts,” shorter prompts, and measurable gains on multiple benchmarks and models. Its framing (discrete codebooks + routing + critic with attributable textual gradients) is more generally innovative and could influence prompt/behavior engineering, efficient deployment, and modular alignment. Paper 1 is methodologically neat but targets a narrower subfield (linguistic steganalysis), limiting cross-field reach.
Paper 1 introduces a foundational shift in prompt optimization by moving from monolithic prompts to a compositional, instance-aware approach. This has broad applicability across nearly all LLM workflows, improving both performance and efficiency. Paper 2 is highly relevant for AI safety and mechanistic interpretability, but its immediate impact is narrower, primarily benefiting red-teaming and alignment research. Therefore, Paper 1 has a higher potential for widespread adoption and cross-disciplinary impact.
Paper 2 presents a novel, empirical breakthrough in LLM prompt optimization (Prompt Codebooks) that addresses current limitations of monolithic, instance-blind methods. Its strong quantitative results (surpassing baselines and drastically reducing prompt lengths) offer immediate, broad applicability across AI workflows. While Paper 1 addresses a highly important societal issue, it is primarily a literature synthesis and theoretical framework. Paper 2's methodological innovation and direct impact on the rapidly moving field of generative AI give it a higher potential for immediate scientific citation and application.
Paper 2 likely has higher impact: it introduces a novel, reusable compositional framework for prompt optimization (discrete “instinct” codebooks with per-instance routing) with clear methodological structure and strong benchmark gains across multiple LLMs, suggesting broad applicability to agentic systems and practical deployment (performance + prompt-length reductions). Paper 1 is valuable as a large-scale empirical audit of an A2A ecosystem, but its contribution is primarily diagnostic/characterization of one network and design pitfalls, with narrower generalizability and less direct algorithmic advancement.
Paper 1 likely has higher impact due to a more novel formulation: discrete, compositional prompt optimization with reusable “instinct” units and per-instance routing, addressing brittleness and reuse—core limitations in prompt optimization. It introduces an end-to-end framework (encoder/generator/critic with structured textual gradients) and shows strong gains plus substantial prompt-length reduction across multiple benchmarks/models, suggesting broad applicability to agentic workflows and efficient deployment. Paper 2 is timely and practical for LLM routing and adds a benchmark, but the historical matching framing is a more incremental extension of existing routing paradigms.
Paper 1 addresses a fundamental scientific question about the relationship between language model representations and human perceptual organization, revealing that perceptual geometry emerges transiently in intermediate layers despite no perceptual training. This offers deep insights into both AI and cognitive science, with broad interdisciplinary impact. Paper 2, while technically solid with strong empirical results, is primarily an engineering contribution to prompt optimization—an area with many competing methods and rapid obsolescence. Paper 1's findings about emergent perceptual structure have longer-lasting scientific significance and wider relevance across fields.
Paper 2 likely has higher scientific impact: it introduces a timely new benchmark (SHDF) targeting an emerging, underexplored failure mode (singing) in audio-visual deepfake detection, and proposes a cross-scenario framework (T-AVFD) with clear societal and security applications. The dataset plus robustness claims can catalyze broad follow-on work across multimedia forensics, security, and generative media evaluation. Paper 1 is novel for prompt optimization and could matter for LLM tooling, but its impact is narrower, more engineering-centric, and more sensitive to fast-moving baselines and model changes.
Paper 1 likely has higher scientific impact: it introduces an open-vocabulary, part-controllable 3D generation capability aligned with practical downstream requirements (animation/physics/scripts), plus a scalable pipeline for a new part-labeled 3D dataset—assets that can broadly enable research and industry workflows in graphics, simulation, robotics, and games. Methodologically, explicit part-structured generation at inference-time addresses a core limitation of current 3D generators. Paper 2 is timely and useful for LLM prompting, but prompt-optimization frameworks often have faster turnover and narrower durability than a new controllable 3D generation paradigm and dataset.
Paper 1 introduces a fundamentally new framework (Prompt Codebooks) for prompt optimization that addresses core limitations of existing methods—monolithic, instance-blind optimization—with a compositional, per-instance approach. The discrete codebook abstraction, language-valued min-max objective, and per-instance routing represent significant methodological innovation with broad applicability across LLM tasks. It demonstrates strong empirical gains across six benchmarks and two model families. Paper 2, while useful, presents an incremental engineering contribution (two-stage verification pipeline) with narrower scope, evaluated on a single benchmark, and the escalation idea is relatively straightforward.
Paper 2 is likely higher impact: it introduces a broadly applicable, reusable compositional framework for prompt optimization (discrete codebooks + per-instance routing) that can directly improve many LLM/agent workflows, showing sizable benchmark gains and prompt-length reductions across models/tasks. Its method is an actionable optimization paradigm with potential downstream adoption in production and research. Paper 1 is novel and valuable for safety/policy auditing, but is narrower in scope (diagnostic within prompt policies) and its impact depends more on adoption by policy-heavy deployments and on evaluation assumptions (judgeability, candidate selection).
Paper 1 introduces a fundamental, domain-agnostic methodological breakthrough in prompt optimization. By enabling instance-level, compositional prompt generation through discrete codebooks, it solves brittleness in existing APO methods and significantly reduces prompt length. Its broad applicability across LLM workflows gives it a wider potential impact across multiple fields compared to Paper 2, which presents a valuable but domain-specific data generation pipeline for medical LLMs.
Paper 1 likely has higher impact: it introduces a novel discrete, compositional prompt-optimization paradigm (codebooks of reusable “instincts” with per-instance routing) that addresses brittleness and reuse—core limitations in current APO. Applications are broad across LLM agent workflows, prompt engineering, and deployment efficiency (large prompt-length reductions) with strong benchmark gains on widely used open models, making it timely and immediately actionable. Paper 2 is methodologically solid and relevant to equilibrium computation in RL/game theory, but its scope is narrower (two-player zero-sum PSRO variants) and impact may be more specialized.
Paper 1 introduces a novel benchmark and alignment methodology for creative physical reasoning in LMMs—a largely untested but fundamental aspect of intelligence. It addresses a deeper capability gap (affordance-grounded creative problem-solving) with broad implications for embodied AI, robotics, and cognitive science. Paper 2 presents a clever engineering contribution to prompt optimization with solid empirical gains, but operates in a more incremental, narrower space. Paper 1's contribution to understanding and improving creative physical intelligence in multimodal models has broader cross-disciplinary impact and higher long-term significance.
Paper 1 introduces a novel, compositional framework for automatic prompt optimization that addresses fundamental limitations in current monolithic prompting methods. By offering a reusable, instance-specific codebook of prompts, it demonstrates significant performance gains and drastic reductions in prompt length (up to 14.1x). This presents massive practical applications for LLM inference efficiency and agentic workflows. While Paper 2 provides a valuable, rigorous statistical critique of a specific benchmark, Paper 1's introduction of a highly applicable, broadly impactful methodology gives it a significantly higher potential for widespread adoption and downstream scientific impact.