Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement

Jyotirmoy Nath, Neeraj Kumar, Brejesh Lall

May 27, 2026

arXiv:2605.28360v1

cs.AI (primary)

#1176 of 2682 · Artificial Intelligence

Tournament Score: 1423 ± 48 (displayed on a 1050 to 1800 scale)

Win Rate: 75%

Rating: 5.8 / 10

Significance 6 · Rigor 5 · Novelty 7 · Clarity 7

Abstract

Automatic prompt optimization (APO) has driven significant gains in LLM-based agentic workflows. However, existing methods treat each task's prompt as a monolithic, instance-blind string optimized through global edits, producing brittle updates and preventing the reuse of learned sub-behaviors. We propose Prompt Codebooks (PCO), a novel compositional prompt optimization framework that recasts APO as discrete learning over a finite vocabulary of natural-language instincts - atomic, reusable instruction units. PCO organizes prompt-construction knowledge in a discrete codebook and routes each input to a small subset of entries via an LLM-based encoder; a generator composes them into a prompt for the frozen target model; a critic emits a structured verdict that decomposes by attribution into per-variable textual gradients, jointly training the encoder, generator, and codebook under a language-valued min-max objective. The resulting routing is per-instance: different inputs in the same task receive different instinct compositions, a regime structurally inexpressible under instance-blind methods. Across six benchmarks on Qwen3-8B and LLaMA-3.1-8B, PCO improves over zero-shot by up to +30.36 points, surpasses the strongest prior baseline (GEPA) by +3.34 on HotpotQA and +1.11 in aggregate, and reduces deployed prompt length by up to 14.1x versus MIPROv2 and 3.0x versus GEPA using only K=16 instincts.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Prompt Codebooks (PCO)

1. Core Contribution

PCO reformulates automatic prompt optimization (APO) from editing a single monolithic prompt string into composing prompts from a discrete codebook of natural-language "instincts"—atomic, reusable instruction units. The key architectural innovation is a four-component system: an LLM-based encoder that routes each input to a subset of codebook entries via semantic routing, a generator that composes selected instincts into a prompt, a frozen target LLM, and a critic that provides structured natural-language feedback decomposed into per-variable textual gradients. This enables per-instance adaptive prompting—different inputs receive different instinct compositions—which is structurally impossible under instance-blind methods.
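The data flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `route`, `compose`, `run_instance`, and `Verdict` are hypothetical, the encoder is replaced by a word-overlap heuristic (the paper uses an LLM-based encoder), and the critic is reduced to an exact-match check that returns a short textual verdict.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class Verdict:
    """Simplified critic output: a pass/fail flag plus textual feedback."""
    success: bool
    feedback: str


def route(x: str, codebook: List[str], s: int) -> List[int]:
    """Stand-in encoder: choose the s instincts sharing the most words with x.

    Assumption: word overlap replaces the LLM-based semantic router.
    """
    words = set(x.lower().split())
    scores = [len(words & set(entry.lower().split())) for entry in codebook]
    ranked = sorted(range(len(codebook)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:s])


def compose(x: str, instincts: List[str]) -> str:
    """Stand-in generator: join the selected instincts into one prompt."""
    rules = "\n".join(f"- {text}" for text in instincts)
    return f"Follow these instructions:\n{rules}\n\nInput: {x}"


def run_instance(
    x: str, gold: str, codebook: List[str], target: LLM, s: int = 2
) -> Tuple[List[int], str, Verdict]:
    """One forward pass: route, compose, call the frozen target, then critique."""
    idx = route(x, codebook, s)
    prompt = compose(x, [codebook[i] for i in idx])
    answer = target(prompt).strip()
    ok = answer == gold
    feedback = "correct" if ok else f"expected {gold!r}, got {answer!r}"
    return idx, prompt, Verdict(ok, feedback)
```

Because `route` depends on the input, two inputs from the same task can receive different instinct subsets, which is the per-instance behavior the assessment highlights. In the actual method the critic's feedback would additionally be attributed back to the encoder, generator, and codebook; that step is omitted here.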

The conceptual bridge between VQ-VAE-style discrete latent representations and prompt optimization is genuinely novel. While DSPy structures prompts as compositional programs, its components remain instance-blind. According to the authors, PCO is the first method that constructs prompts compositionally from a shared discrete latent vocabulary optimized end-to-end.

2. Methodological Rigor

Strengths in formalization: The language-valued min-max objective (Eq. 6) and its additive decomposition into generator faithfulness, codebook refinement, and routing consistency losses (Eq. 7) provide a clean theoretical framework. The analogy to GANs is functional rather than literal—the authors appropriately note they adopt the structure for attribution rather than distributional-distance interpretation.

Concerns about rigor:

The "textual gradients" are LLM-generated critiques attributed to variables via another LLM call. The entire optimization loop depends on the quality and consistency of these LLM-based attribution operators, yet no analysis of attribution reliability or failure modes is provided.

The claim of a "min-max objective" is somewhat misleading: the critic ψ is fixed throughout training, so there is no adversarial training dynamic. The critic is a frozen prompted LLM, making this closer to a fixed reward model than a GAN-style adversary.

Statistical reporting is limited: variance is reported for PCO (3-5 seeds) but not for baselines. The aggregate improvement of +1.11 over GEPA is modest and may not be statistically significant given the reported standard deviations (±1.3 to ±3.7 across benchmarks).

The ablation study in Table 2 mixes full-dataset and subset evaluations, making cross-comparison difficult. Sensitivity analyses on K and S use only N=30 examples, which may not generalize.
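The statistical-reporting concern above can be made concrete with a rough back-of-the-envelope check. The sketch below expresses the +1.11 aggregate gain in units of the standard error of a mean over 3 or 5 seeds, using the two ends of the reported standard-deviation range. It rests on simplifying assumptions that are not in the paper: the baseline is treated as noise-free (its variance is not reported), and a per-benchmark standard deviation is applied to an aggregate gain. The helper name `gain_in_standard_errors` is hypothetical.

```python
import math


def gain_in_standard_errors(gain: float, std: float, n_seeds: int) -> float:
    """Return gain divided by the standard error of a mean over n_seeds runs."""
    return gain / (std / math.sqrt(n_seeds))


if __name__ == "__main__":
    gain = 1.11  # aggregate improvement over GEPA, from the abstract
    for std in (1.3, 3.7):  # ends of the reported range
        for n_seeds in (3, 5):  # reported number of seeds
            ratio = gain_in_standard_errors(gain, std, n_seeds)
            print(f"std={std}, seeds={n_seeds}: {ratio:.2f} standard errors")
```

Under these assumptions the gain stays below two standard errors in every case, and accounting for baseline variance would only lower the ratio. This is consistent with the assessment that the aggregate improvement may not be statistically significant, though a proper test would need per-seed results for both methods.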

Experimental coverage: Six benchmarks across reasoning, math, and instruction-following provide reasonable breadth. However, all experiments use 8B models only, and the authors acknowledge they do not evaluate on proprietary models. The comparison against GEPA uses official results for Qwen3-8B but reproduced results for LLaMA-3.1-8B, introducing potential inconsistency.

3. Potential Impact

Immediate applications: Per-instance prompt routing could benefit production LLM systems where heterogeneous inputs require different prompting strategies. The 14.1× prompt length reduction is practically significant for latency-sensitive and cost-sensitive deployments.

Broader implications: The discrete codebook abstraction opens several research directions: cross-task transfer of learned instincts, interpretability through codebook inspection, and integration into multi-agent pipelines where different agents could share instinct vocabularies. The emergent specialization phenomenon (Table 5) is particularly interesting—specialized, rarely-used instincts achieving high success rates suggests the framework discovers meaningful functional decompositions.

Limitations on impact: The computational overhead of training (approximately 24,000 LLM calls per benchmark) may limit adoption. The framework currently uses the same LLM for all roles (encoder, generator, critic, executor), which may not scale efficiently. The per-instance routing adds inference overhead (encoder + generator calls), though this appears modest.

4. Timeliness & Relevance

The paper addresses a genuine bottleneck in the rapidly growing APO field. As LLM-based agentic systems become more prevalent, the limitations of monolithic prompt optimization—brittleness, interference between updates, and inability to reuse learned behaviors—become increasingly problematic. The timing is appropriate given the concurrent maturation of both APO methods (GEPA, MIPROv2, TextGrad) and discrete representation learning.

The connection to VQ-VAE and codebook learning is intellectually stimulating but the analogy has limits: in VQ-VAE, codebook entries are vectors optimized via straight-through estimation with well-understood gradient dynamics, whereas PCO's "optimization" is LLM-based text rewriting with no formal convergence guarantees.

5. Strengths & Limitations

Key Strengths:

Novel architectural contribution with clear conceptual motivation

Per-instance routing addresses a genuine structural limitation of prior work

Strong prompt compression results (up to 14.1×) with competitive performance

Thoughtful exploration mechanisms (ε-greedy with success-weighted sampling) that prevent codebook collapse

Rich qualitative analysis showing emergent specialization and instinct evolution
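The exploration mechanism named in the strengths above (ε-greedy with success-weighted sampling) could look roughly like the following. This is a minimal sketch under assumptions, not the paper's procedure: the function name `select_instincts`, the add-one smoothing of success rates, and sampling without replacement are all illustrative choices.

```python
import random
from typing import List, Sequence


def select_instincts(
    ranked: Sequence[int],
    successes: Sequence[int],
    uses: Sequence[int],
    s: int,
    epsilon: float,
    rng: random.Random,
) -> List[int]:
    """Pick s codebook indices for one input.

    ranked    -- the encoder's preference order over codebook indices
    successes -- per-entry count of successful uses so far
    uses      -- per-entry count of total uses so far
    """
    # Exploit: with probability 1 - epsilon, trust the encoder's top-s choice.
    if rng.random() >= epsilon:
        return list(ranked[:s])

    # Explore: sample s distinct entries, weighted by smoothed success rate.
    # Assumption: add-one smoothing keeps unused entries at a nonzero weight,
    # so rarely selected instincts are still tried occasionally.
    pool = list(ranked)
    weights = [(successes[i] + 1) / (uses[i] + 2) for i in pool]
    chosen: List[int] = []
    for _ in range(min(s, len(pool))):
        pick = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(pick))
        weights.pop(pick)
    return chosen
```

The smoothing is what guards against collapse in this sketch: an entry that has never been used gets weight 1/2 rather than 0, so exploration can still reach it.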

Notable Weaknesses:

Marginal aggregate improvements (+1.11) over GEPA with unclear statistical significance

PCO underperforms GEPA+Merge on several individual benchmarks (HoVer for LLaMA, PUPA for Qwen3)

No formal convergence analysis or theoretical guarantees for the optimization procedure

Fixed K and S with limited guidance on selection beyond grid search on small subsets

The "min-max" framing overstates the adversarial nature of what is essentially a fixed-critic feedback loop

Missing computational cost comparison (wall-clock training time vs. baselines)

All evaluations limited to 8B models; scalability claims are aspirational

Additional observations: The paper is well-written with clear figures and thorough appendices. The qualitative prompt examples (Appendix A.7) effectively demonstrate the framework's behavior. However, the routing entropy metric, while informative about codebook utilization, doesn't directly measure prompt quality or diversity in a way that isolates the per-instance adaptation benefit.
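To make the point about routing entropy concrete, here is one plausible way such a metric could be computed. The paper's exact definition is not given in this assessment, so the sketch assumes Shannon entropy over how often each codebook entry is selected, normalized by log2(K); the function name `routing_entropy` is hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Sequence


def routing_entropy(routes: Iterable[Sequence[int]], k: int) -> float:
    """Normalized Shannon entropy of codebook usage, in the range [0, 1].

    routes -- one sequence of selected codebook indices per input
    k      -- codebook size K
    Assumption: usage is pooled over all inputs before computing entropy.
    """
    counts = Counter(i for selected in routes for i in selected)
    total = sum(counts.values())
    # No usage, a single used entry, or a one-entry codebook: zero entropy.
    if total == 0 or len(counts) <= 1 or k <= 1:
        return 0.0
    bits = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return bits / math.log2(k)
```

A value near 1 says the entries are used about equally often and a value near 0 says routing has collapsed onto a few entries. A router that ignored its input and picked entries uniformly at random would also score near 1, which is why a usage-level entropy cannot by itself show that per-instance adaptation is what drives the gains.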

Summary

PCO introduces an architecturally novel and conceptually well-motivated approach to prompt optimization. The discrete codebook abstraction, per-instance routing, and localized credit assignment represent genuine advances over monolithic APO. However, the empirical gains over the strongest baseline are modest and potentially within noise, the theoretical foundations rest on informal analogies rather than formal guarantees, and scalability remains undemonstrated. The prompt compression results are the most compelling practical contribution.

Rating: 5.8 / 10

Significance 6 · Rigor 5 · Novelty 7 · Clarity 7

Generated May 28, 2026

Comparison History (16)

vs. You Live More Than Once: Towards Hierarchical Skill Meta-Evolving

gpt-5.2 · 5/28/2026

Paper 1 (Prompt Codebooks) offers a more clearly novel and broadly applicable formulation: reframing prompt optimization as discrete, compositional, per-instance routing over reusable “instinct” units, enabling transfer and modularity that instance-blind methods cannot express. It reports concrete, multi-benchmark gains on widely used open LLMs and adds efficiency benefits (prompt length reduction), strengthening real-world deployability. Paper 2 is timely for agent systems, but the abstract provides fewer methodological specifics and less concrete comparative evidence, making impact harder to assess and likely narrower to agent skill libraries.

vs. Auditable Decision Models with Learned Abstention and Real-Time Steering

gemini-3.1 · 5/28/2026

Paper 1 introduces a highly novel compositional approach to LLM prompt optimization, addressing a critical bottleneck in agentic workflows. Its ability to create reusable, instance-specific instruction units offers broad applicability, performance improvements, and significant efficiency gains. Paper 2, while addressing important operational and auditability challenges in AI, presents a more specific framework that is likely to have a narrower impact compared to the fundamental advancements in LLM interaction proposed in Paper 1.

vs. REED: Post-Training Representation Editing for Cross-Domain Linguistic Steganalysis

gpt-5.2 · 5/28/2026

Paper 2 has higher likely impact due to broader applicability and timeliness: compositional, per-instance prompt optimization can benefit many LLM tasks and agentic systems beyond a single domain, enabling reusable instruction “instincts,” shorter prompts, and measurable gains on multiple benchmarks and models. Its framing (discrete codebooks + routing + critic with attributable textual gradients) is more generally innovative and could influence prompt/behavior engineering, efficient deployment, and modular alignment. Paper 1 is methodologically neat but targets a narrower subfield (linguistic steganalysis), limiting cross-field reach.

vs. Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

gemini-3.1 · 5/28/2026

Paper 1 introduces a foundational shift in prompt optimization by moving from monolithic prompts to a compositional, instance-aware approach. This has broad applicability across nearly all LLM workflows, improving both performance and efficiency. Paper 2 is highly relevant for AI safety and mechanistic interpretability, but its immediate impact is narrower, primarily benefiting red-teaming and alignment research. Therefore, Paper 1 has a higher potential for widespread adoption and cross-disciplinary impact.

vs. Cyberbullying Governance on Social Media: A Unified Framework from Content Identification to Intervention

gemini-3.1 · 5/28/2026

Paper 2 presents a novel, empirical breakthrough in LLM prompt optimization (Prompt Codebooks) that addresses current limitations of monolithic, instance-blind methods. Its strong quantitative results (surpassing baselines and drastically reducing prompt lengths) offer immediate, broad applicability across AI workflows. While Paper 1 addresses a highly important societal issue, it is primarily a literature synthesis and theoretical framework. Paper 2's methodological innovation and direct impact on the rapidly moving field of generative AI give it a higher potential for immediate scientific citation and application.

vs. Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact: it introduces a novel, reusable compositional framework for prompt optimization (discrete “instinct” codebooks with per-instance routing) with clear methodological structure and strong benchmark gains across multiple LLMs, suggesting broad applicability to agentic systems and practical deployment (performance + prompt-length reductions). Paper 1 is valuable as a large-scale empirical audit of an A2A ecosystem, but its contribution is primarily diagnostic/characterization of one network and design pitfalls, with narrower generalizability and less direct algorithmic advancement.

vs. Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching

gpt-5.2 · 5/28/2026

Paper 1 likely has higher impact due to a more novel formulation: discrete, compositional prompt optimization with reusable “instinct” units and per-instance routing, addressing brittleness and reuse—core limitations in prompt optimization. It introduces an end-to-end framework (encoder/generator/critic with structured textual gradients) and shows strong gains plus substantial prompt-length reduction across multiple benchmarks/models, suggesting broad applicability to agentic workflows and efficient deployment. Paper 2 is timely and practical for LLM routing and adds a benchmark, but the historical matching framing is a more incremental extension of existing routing paradigms.

vs. Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations

claude-opus-4.6 · 5/28/2026

Paper 1 addresses a fundamental scientific question about the relationship between language model representations and human perceptual organization, revealing that perceptual geometry emerges transiently in intermediate layers despite no perceptual training. This offers deep insights into both AI and cognitive science, with broad interdisciplinary impact. Paper 2, while technically solid with strong empirical results, is primarily an engineering contribution to prompt optimization—an area with many competing methods and rapid obsolescence. Paper 1's findings about emergent perceptual structure have longer-lasting scientific significance and wider relevance across fields.

vs. From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection

gpt-5.2 · 5/28/2026

Paper 2 likely has higher scientific impact: it introduces a timely new benchmark (SHDF) targeting an emerging, underexplored failure mode (singing) in audio-visual deepfake detection, and proposes a cross-scenario framework (T-AVFD) with clear societal and security applications. The dataset plus robustness claims can catalyze broad follow-on work across multimedia forensics, security, and generative media evaluation. Paper 1 is novel for prompt optimization and could matter for LLM tooling, but its impact is narrower, more engineering-centric, and more sensitive to fast-moving baselines and model changes.

vs. CubePart: An Open-Vocabulary Part-Controllable 3D Generator

gpt-5.2 · 5/28/2026

Paper 1 likely has higher scientific impact: it introduces an open-vocabulary, part-controllable 3D generation capability aligned with practical downstream requirements (animation/physics/scripts), plus a scalable pipeline for a new part-labeled 3D dataset—assets that can broadly enable research and industry workflows in graphics, simulation, robotics, and games. Methodologically, explicit part-structured generation at inference-time addresses a core limitation of current 3D generators. Paper 2 is timely and useful for LLM prompting, but prompt-optimization frameworks often have faster turnover and narrower durability than a new controllable 3D generation paradigm and dataset.

vs. DeepSciVerify: Verifying Scientific Claim--Citation Alignment via LLM-Driven Evidence Escalation

claude-opus-4.6 · 5/28/2026

Paper 1 introduces a fundamentally new framework (Prompt Codebooks) for prompt optimization that addresses core limitations of existing methods—monolithic, instance-blind optimization—with a compositional, per-instance approach. The discrete codebook abstraction, language-valued min-max objective, and per-instance routing represent significant methodological innovation with broad applicability across LLM tasks. It demonstrates strong empirical gains across six benchmarks and two model families. Paper 2, while useful, presents an incremental engineering contribution (two-stage verification pipeline) with narrower scope, evaluated on a single benchmark, and the escalation idea is relatively straightforward.

vs. Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles

gpt-5.2 · 5/28/2026

Paper 2 is likely higher impact: it introduces a broadly applicable, reusable compositional framework for prompt optimization (discrete codebooks + per-instance routing) that can directly improve many LLM/agent workflows, showing sizable benchmark gains and prompt-length reductions across models/tasks. Its method is an actionable optimization paradigm with potential downstream adoption in production and research. Paper 1 is novel and valuable for safety/policy auditing, but is narrower in scope (diagnostic within prompt policies) and its impact depends more on adoption by policy-heavy deployments and on evaluation assumptions (judgeability, candidate selection).

vs. MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

gemini-3.1 · 5/28/2026

Paper 1 introduces a fundamental, domain-agnostic methodological breakthrough in prompt optimization. By enabling instance-level, compositional prompt generation through discrete codebooks, it solves brittleness in existing APO methods and significantly reduces prompt length. Its broad applicability across LLM workflows gives it a wider potential impact across multiple fields compared to Paper 2, which presents a valuable but domain-specific data generation pipeline for medical LLMs.

vs. Global Policy-Space Response Oracles for Two-Player Zero-Sum Games

gpt-5.2 · 5/28/2026

Paper 1 likely has higher impact: it introduces a novel discrete, compositional prompt-optimization paradigm (codebooks of reusable “instincts” with per-instance routing) that addresses brittleness and reuse—core limitations in current APO. Applications are broad across LLM agent workflows, prompt engineering, and deployment efficiency (large prompt-length reductions) with strong benchmark gains on widely used open models, making it timely and immediately actionable. Paper 2 is methodologically solid and relevant to equilibrium computation in RL/game theory, but its scope is narrower (two-player zero-sum PSRO variants) and impact may be more specialized.

vs. Advancing Creative Physical Intelligence in Large Multimodal Models

claude-opus-4.6 · 5/28/2026

Paper 1 introduces a novel benchmark and alignment methodology for creative physical reasoning in LMMs—a largely untested but fundamental aspect of intelligence. It addresses a deeper capability gap (affordance-grounded creative problem-solving) with broad implications for embodied AI, robotics, and cognitive science. Paper 2 presents a clever engineering contribution to prompt optimization with solid empirical gains, but operates in a more incremental, narrower space. Paper 1's contribution to understanding and improving creative physical intelligence in multimodal models has broader cross-disciplinary impact and higher long-term significance.

vs. The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

gemini-3.1 · 5/28/2026

Paper 1 introduces a novel, compositional framework for automatic prompt optimization that addresses fundamental limitations in current monolithic prompting methods. By offering a reusable, instance-specific codebook of prompts, it demonstrates significant performance gains and drastic reductions in prompt length (up to 14.1x). This presents massive practical applications for LLM inference efficiency and agentic workflows. While Paper 2 provides a valuable, rigorous statistical critique of a specific benchmark, Paper 1's introduction of a highly applicable, broadly impactful methodology gives it a significantly higher potential for widespread adoption and downstream scientific impact.