MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

Md Mehrab Tanjim, Jayakumar Subramanian, Xiang Chen, Branislav Kveton, Subhojyoti Mukherjee, Anlan Zhang, Sungchul Kim, Somdeb Sarkhel

May 19, 2026

arXiv:2605.19330v1

cs.AI (primary) · cs.LG · cs.SE

#1243 of 2292 · Artificial Intelligence

Tournament Score: 1403 ± 42

Win Rate: 48%

Rating: 5.8 / 10

Significance 5.5 · Rigor 6.5 · Novelty 5 · Clarity 7.5

Abstract

LLM agents organize behavior through skills - structured natural-language specifications governing how an agent reasons, retrieves, and responds. Unlike monolithic prompts, skills are multi-field artifacts subject to hard platform constraints: description fields are truncated for routing, instruction bodies are compacted via progressive disclosure, and co-resident skills compete for limited context windows. These constraints make skill optimization inherently multi-objective: a skill must simultaneously maximize task performance and satisfy platform limits. Yet existing prompt optimizers either ignore these trade-offs or collapse them into a weighted sum, missing Pareto-optimal variants in non-convex objective regions. We introduce MOCHA (Multi-Objective Chebyshev Annealing), which replaces single-objective selection with Chebyshev scalarization - covering the full Pareto front, including non-convex regions - combined with exponential annealing that transitions from exploration to exploitation. In our experiments across six diverse agent skills - where all methods share the same multi-objective mutation operator and baselines receive identical per-objective textual feedback - existing optimizers fail to improve the seed skill on 4 of 6 tasks: 1000 rollouts yield zero progress. MOCHA breaks through on every task, achieving 7.5% relative improvement in mean correctness over the strongest baseline (up to 14.9% on FEVER and 10.4% on TheoremQA) while discovering twice as many Pareto-optimal skill variants.
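The abstract's claim that a weighted sum misses Pareto-optimal points in non-convex regions can be illustrated with a toy example. The sketch below uses the standard textbook form of Chebyshev scalarization (minimize the largest weighted gap to an ideal point), which may differ from the exact formula in the paper. The three candidates and the ideal point are made-up values, not results from the paper.

```python
# Toy illustration (hypothetical values): two objectives, both maximized.
# C is Pareto-optimal but lies below the line joining A and B, so it sits
# in a non-convex region of the front.
candidates = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.4, 0.4)}
ideal = (1.0, 1.0)  # assumed best achievable value per objective


def weighted_sum(point, weights):
    """Linear scalarization: higher is better."""
    return sum(w * f for w, f in zip(weights, point))


def chebyshev(point, weights, ideal_point):
    """Textbook Chebyshev scalarization: lower is better."""
    return max(w * (z - f) for w, f, z in zip(weights, point, ideal_point))


def best_by_weighted_sum(weights):
    return max(candidates, key=lambda k: weighted_sum(candidates[k], weights))


def best_by_chebyshev(weights):
    return min(candidates, key=lambda k: chebyshev(candidates[k], weights, ideal))


# Sweep a grid of weight vectors and record which candidate each rule picks.
weight_grid = [(i / 10, 1 - i / 10) for i in range(11)]
ws_winners = {best_by_weighted_sum(w) for w in weight_grid}
cheb_winners = {best_by_chebyshev(w) for w in weight_grid}

print(sorted(ws_winners))    # ['A', 'B']
print(sorted(cheb_winners))  # ['A', 'B', 'C']
```

No weight vector makes the weighted sum prefer C, because C scores 0.4 under any normalized weights while the better of A and B scores at least 0.5. Chebyshev scalarization selects C for balanced weights.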

AI Impact Assessments

(1 model)

Scientific Impact Assessment: MOCHA

1. Core Contribution

MOCHA addresses a genuine but narrowly scoped problem: optimizing LLM agent "skills" — structured, multi-field natural-language specifications — under competing objectives of task correctness and platform compliance (character limits on description and instruction body fields). The key insight is that existing prompt optimizers (TextGrad, ProTeGi, GEPA) use single-objective selection strategies that cannot escape seed skills when correctness improvements require violating compliance constraints. MOCHA introduces Chebyshev scalarization (guaranteeing coverage of non-convex Pareto regions) with an annealed transition from hypervolume-contribution-based exploration to Chebyshev-based exploitation.
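To make "hypervolume contribution" concrete, here is a minimal sketch for the two-objective case, with both objectives maximized and a reference point at the origin. The paper's setting has three objectives and its exact HVC computation is not described in this assessment, so the function names and the example scores are illustrative assumptions only.

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by `points` and bounded below by `ref` (maximization)."""
    # Only points that strictly beat the reference point contribute area.
    pts = sorted(
        (p for p in points if p[0] > ref[0] and p[1] > ref[1]),
        key=lambda p: -p[0],
    )
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        # Sweeping from largest x to smallest, a point adds a new horizontal
        # strip only if it raises the best second-objective value seen so far.
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def hypervolume_contribution(candidate, archive, ref=(0.0, 0.0)):
    """Hypervolume gained by adding `candidate` to `archive`."""
    return hypervolume_2d(list(archive) + [candidate], ref) - hypervolume_2d(archive, ref)


# Hypothetical scores: (correctness, compliance).
archive = [(0.6, 1.0)]
trade_off = (0.7, 0.8)   # more correct, less compliant
dominated = (0.5, 0.9)   # worse on both objectives

hvc_trade_off = hypervolume_contribution(trade_off, archive)
hvc_dominated = hypervolume_contribution(dominated, archive)
```

A candidate that trades compliance for correctness earns a positive contribution (about 0.08 here), while a dominated candidate earns zero. This is why an HVC-based acceptance rule can keep variants that a single-objective rule would reject.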

The method is algorithmically straightforward: draw random weight vectors from Dirichlet(1), select parents via Chebyshev scalarization, accept candidates based on HVC early and Chebyshev improvement later, with exponential decay controlling the transition. The novelty lies in combining well-known MOO machinery (Chebyshev scalarization, hypervolume indicators) for the first time in the discrete, sample-expensive NL skill optimization setting.
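The steps named above can be sketched roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: the annealing form `exp(-decay * step / total_steps)` is a guess at "exponential decay" (the role of τ₀ is not specified in this assessment), a non-dominance check stands in for a positive hypervolume contribution, and all function names and scores are hypothetical.

```python
import math
import random


def sample_dirichlet_uniform(k, rng):
    """Draw weights from Dirichlet(1, ..., 1) by normalizing Gamma(1, 1) draws."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def chebyshev_cost(scores, weights, ideal):
    """Textbook Chebyshev scalarization for maximized objectives (lower is better)."""
    return max(w * (z - s) for w, s, z in zip(weights, scores, ideal))


def weakly_dominated(candidate, archive):
    """True if some archive member is at least as good on every objective."""
    return any(all(a >= c for a, c in zip(other, candidate)) for other in archive)


def explore_probability(step, total_steps, decay=10.0):
    """ASSUMED schedule: probability of using the exploratory acceptance rule."""
    return math.exp(-decay * step / total_steps)


def select_parent(archive, weights, ideal):
    """Pick the archive member with the lowest Chebyshev cost for these weights."""
    return min(archive, key=lambda s: chebyshev_cost(s, weights, ideal))


def accept(candidate, parent, archive, weights, ideal, step, total_steps, rng):
    if rng.random() < explore_probability(step, total_steps):
        # Early phase: stand-in for "positive hypervolume contribution".
        return not weakly_dominated(candidate, archive)
    # Late phase: require a strict Chebyshev improvement over the parent.
    return chebyshev_cost(candidate, weights, ideal) < chebyshev_cost(parent, weights, ideal)


rng = random.Random(0)
ideal = (1.0, 1.0, 1.0)
# Hypothetical scores: (correctness, description compliance, body compliance).
archive = [(0.60, 1.0, 1.0)]
candidate = (0.70, 1.0, 0.8)

weights = sample_dirichlet_uniform(3, rng)
parent = select_parent(archive, weights, ideal)
accepted_early = accept(candidate, parent, archive, weights, ideal, 0, 1000, rng)
```

At step 0 the exploratory rule always applies in this sketch, so a candidate that gives up some body compliance for higher correctness is accepted as long as no archive member dominates it.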

2. Methodological Rigor

Strengths in experimental design: The paper makes a commendable effort at fair comparison. All methods share an identical mutation interface (SkillMdProposer), receive the same per-objective textual feedback, use the same backbone LLMs (Claude Haiku 4.5 for execution, Claude Opus 4.6 for mutation), and operate under matched 1000-rollout budgets. This isolates the selection strategy as the sole independent variable. Results are reported over 5 random seeds with standard deviations.

Concerns:

Scale of experiments: 100 train / 100 val / 100 test examples per benchmark is quite small. With only 1000 rollouts total, the optimization landscape is underexplored for all methods, which may partly explain why baselines get stuck.

The "baselines get stuck" finding is suspicious. On 4 of 6 tasks, all three baselines return the seed skill *unchanged*. This is a dramatic failure mode that warrants deeper investigation. The paper attributes this to single-objective selection rejecting candidates that trade compliance for correctness, but the degree of failure (literally zero accepted candidates in 1000 rollouts) suggests the baselines may be poorly configured for this particular problem formulation rather than fundamentally limited. The compliance constraints may be creating an artificially steep barrier that these methods weren't designed to handle.

Three-objective simplicity: Two of the three objectives (description compliance, body compliance) are trivially computable character-count thresholds. This is a very simple compliance landscape. The "non-convex Pareto front" argument, while theoretically valid, may overstate the complexity of the actual trade-off surface encountered.

HotpotQA exception: The one task where a baseline outperforms MOCHA (ProTeGi at .622 vs. MOCHA at .600) is dismissed as having "mild objective conflict," but this undermines the generality claim.
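The point about compliance objectives being simple character-count thresholds can be seen in a few lines. The sketch below is an assumption about how such an objective might be scored: the graded penalty for over-length text, the function names, and the limits passed in the example are all hypothetical, since the assessment does not state the actual limits or scoring rule.

```python
def length_compliance(text: str, max_chars: int) -> float:
    """1.0 if the text fits the limit, otherwise the fraction that would fit."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    n = len(text)
    return 1.0 if n <= max_chars else max_chars / n


def skill_objectives(correctness, description, body, desc_limit, body_limit):
    """Bundle the three objectives: correctness plus two length checks."""
    return (
        correctness,
        length_compliance(description, desc_limit),
        length_compliance(body, body_limit),
    )


# Hypothetical skill: the description fits, the body is twice the allowed length.
scores = skill_objectives(
    correctness=0.7,
    description="d" * 10,
    body="b" * 200,
    desc_limit=10,
    body_limit=100,
)
print(scores)  # (0.7, 1.0, 0.5)
```

Because both compliance terms depend only on string length, any edit that adds reasoning steps to the body raises the risk of lowering the third objective, which is the conflict the review describes.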

3. Potential Impact

Narrow but practical niche. The paper targets a real engineering constraint in agent platforms (Anthropic's SKILL.md specification with character limits). As LLM agent frameworks mature, structured skill optimization with platform constraints will become increasingly relevant. However, the specificity to one platform's constraints limits immediate generalizability.

Broader applicability is plausible but undemonstrated. The framework could potentially extend to other structured NL artifacts with competing constraints (API documentation, system prompts with multiple sections, multi-agent configurations). The paper mentions meta-harness optimization as a natural extension but does not explore it.

Limited downstream validation. The paper evaluates on academic benchmarks rather than deployed agent systems. Real-world impact would require demonstrating that Pareto-optimal skill variants discovered by MOCHA translate to meaningful user-facing improvements.

4. Timeliness & Relevance

The paper is well-timed. LLM agent frameworks (Claude Code, GPT Actions, etc.) are rapidly proliferating, and structured skill/plugin definitions are becoming standard. The observation that skill optimization is multi-objective — balancing performance against platform constraints — is genuinely useful framing that the community hasn't adequately addressed. The shift from monolithic prompt optimization to structured multi-field artifact optimization is a natural and timely evolution.

5. Strengths & Limitations

Key Strengths:

Clean problem formulation that identifies a genuine gap (multi-objective skill optimization)

Rigorous fair-comparison framework isolating selection strategy

Thorough ablation study revealing a clean exploration-exploitation spectrum

Excellent qualitative analysis (Figures 9-13) showing what MOCHA actually discovers

Complete reproduction materials (prompts, hyperparameters)

Key Limitations:

The compliance objectives are simplistic (character counts), making the "non-convex Pareto front" claim hard to verify empirically

The dramatic baseline failure (4/6 tasks stuck) may reflect problem-specific pathology rather than fundamental limitations of single-objective methods

Fixed annealing schedule with manually tuned hyperparameters (τ₀=0.1, λ=10) — the paper acknowledges this but doesn't address sensitivity

Small evaluation scale (100 examples per split, 6 tasks)

The 7.5% mean improvement, while consistent, is modest in absolute terms

Platform-specific compliance metrics (Anthropic's SKILL.md) limit generalizability

No statistical significance tests beyond reporting standard deviations
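As one illustration of the last limitation, a paired comparison over five seeds could use an exact sign-flip (permutation) test. The sketch below is only an example of such a test; the per-seed scores are placeholder numbers chosen for illustration and are not taken from the paper.

```python
from itertools import product


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip test on paired per-seed differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis each difference is equally likely to be
    # positive or negative, so enumerate every sign assignment.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# PLACEHOLDER per-seed correctness values, not results from the paper.
method_scores = [0.62, 0.60, 0.63, 0.61, 0.64]
baseline_scores = [0.58, 0.57, 0.59, 0.58, 0.60]

p_value = paired_sign_flip_pvalue(method_scores, baseline_scores)
print(p_value)  # 0.0625
```

With five paired seeds there are only 32 sign assignments, so the smallest two-sided p-value this test can produce is 2/32 = 0.0625, even when one method wins on every seed. Reporting significance at the usual 0.05 level would therefore need more seeds or a different test.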

Additional Observations:

The paper's strongest contribution may be the *problem formulation* rather than the algorithm itself. Recognizing that agent skill optimization is multi-objective and requires Pareto front navigation is a useful conceptual contribution. The algorithmic components (Chebyshev scalarization, HVC, exponential annealing) are all well-established; the novelty is in their application to this domain. The qualitative examples showing MOCHA discovering structured reasoning protocols (FEVER rules, GPQA verification steps) while baselines return unchanged seeds are compelling evidence that the multi-objective framing unlocks meaningful optimization.

Rating: 5.8 / 10

Significance 5.5 · Rigor 6.5 · Novelty 5 · Clarity 7.5

Generated May 20, 2026

Comparison History (21)

vs. WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

gemini-3.1 · 5/22/2026

Paper 1 presents a novel algorithmic contribution (MOCHA) for multi-objective prompt optimization, addressing fundamental challenges in LLM agent design applicable across broad domains. Paper 2, while valuable for industry applications, offers a domain-specific benchmark (finance spreadsheets) which has a narrower scope. The fundamental methodological innovation and broader applicability of Paper 1 give it higher potential for widespread scientific impact.

vs. WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

claude-opus-4.6 · 5/22/2026

MOCHA introduces a novel multi-objective optimization framework (Chebyshev scalarization with annealing) for LLM agent skill optimization, addressing a fundamental gap in prompt optimization. It offers methodological innovation applicable beyond its immediate domain, with rigorous experimental validation showing clear improvements. WorkstreamBench is a valuable benchmark contribution for spreadsheet-based financial tasks, but benchmarks typically have narrower impact than new optimization methods. MOCHA's technique could generalize broadly across multi-objective LLM optimization problems, giving it higher potential impact.

vs. OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

gemini-3.1 · 5/21/2026

Paper 1 offers a broadly applicable methodological advance by addressing a fundamental bottleneck in LLM agent deployment: optimizing skills under strict platform constraints. By introducing a mathematically grounded multi-objective optimization technique (Chebyshev scalarization), it provides a robust solution that generalizes across any agentic framework. While Paper 2 shows impressive benchmark improvements in a specific cognitive niche (Theory of Mind), Paper 1's approach to prompt/skill optimization has greater potential to influence the foundational engineering and practical deployment of varied LLM agents across the entire field.

vs. COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space

gemini-3.1 · 5/21/2026

Paper 2 addresses the highly timely and rapidly expanding field of LLM agent optimization. By introducing a multi-objective approach to optimize complex, multi-field agent skills under real-world platform constraints, it offers broader immediate applicability across AI research and industry than Paper 1's domain-specific focus on vehicle routing. Furthermore, its ability to succeed where existing prompt optimizers completely fail suggests a significant methodological breakthrough that could broadly influence how autonomous agents are developed and tuned.

vs. Generative Auto-Bidding with Unified Modeling and Exploration

gemini-3.1 · 5/20/2026

Paper 2 addresses the highly timely and broadly applicable problem of LLM agent skill optimization. Its use of multi-objective Chebyshev annealing offers a mathematically grounded approach to managing platform constraints, a critical bottleneck in agent deployment. While Paper 1 shows impressive large-scale industry results in ad bidding, Paper 2's methodology has greater potential for widespread adoption and multidisciplinary impact across the rapidly expanding field of autonomous AI agents.

vs. Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models

gpt-5.2 · 5/20/2026

Paper 2 likely has higher scientific impact due to strong timeliness (trustworthy VLMs for robotics), broad applicability across vision-language reasoning tasks, and clear real-world safety implications. Its modular pseudocode library plus difficulty-aware strategy selection (DFV) offers an interpretable framework that can transfer across models and domains, and the reported SOTA improvements on established benchmarks suggest practical effectiveness. Paper 1 is novel for multi-objective skill/prompt optimization, but its impact is narrower (agent “skills” under platform constraints) and more tooling-specific, with smaller, task-specific gains.

vs. Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses

claude-opus-4.6 · 5/20/2026

MOCHA addresses a fundamental gap in LLM agent optimization by introducing a principled multi-objective framework (Chebyshev scalarization with annealing) that demonstrably outperforms existing methods across diverse tasks. Its contribution is methodologically novel, broadly applicable to the rapidly growing field of LLM agents and prompt optimization, and solves a well-defined technical problem with strong empirical results. Paper 2, while thorough and practically useful for survey methodology, is more application-specific to disaster preparedness surveys, with incremental improvements over existing imputation methods and narrower cross-field impact potential.

vs. PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

gemini-3.1 · 5/20/2026

Paper 1 introduces a novel architectural paradigm (context maps) to solve a critical bottleneck in LLM agents: efficiently handling long, recurring contexts. By reducing costs up to 5.8x and significantly improving accuracy, it offers high real-world applicability for scaling agentic systems. Paper 2 presents a rigorous optimization method for agent skills, but Paper 1's contribution to agent memory and context management addresses a more fundamental and widely pressing challenge in the field, likely leading to broader adoption and architectural impact.

vs. Interference-Aware Multi-Task Unlearning

gpt-5.2 · 5/20/2026

Paper 1 likely has higher scientific impact due to stronger novelty and broader, timely relevance: it extends machine unlearning to realistic multi-task shared-backbone settings and formalizes/mitigates interference via principled gradient projection/orthogonalization, which can generalize beyond vision to many multi-task architectures. It also targets an important regulatory/trustworthy-ML need (data deletion), with clearer implications for compliance and deployment. Paper 2 is timely for LLM agents and offers a useful multi-objective optimizer, but the methodological advance is more incremental (scalarization + annealing) and its impact may be narrower and more system/prompt-engineering dependent.

vs. Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On

gpt-5.2 · 5/20/2026

Paper 1 presents a concrete, novel algorithm (Chebyshev scalarization + annealing) for multi-objective skill optimization under realistic platform constraints, with controlled experiments and measurable gains across multiple tasks, suggesting actionable real-world deployment and methodological rigor. Its multi-objective optimization framing is broadly relevant to agent prompt/skill engineering and could transfer to other constrained LLM optimization settings. Paper 2 is timely and potentially broad, but as a vision/conceptual framework without demonstrated methods or empirical validation, its near-term scientific impact and rigor are less certain.

vs. Generative Recursive Reasoning

gpt-5.2 · 5/20/2026

Paper 2 (GRAM) introduces a broadly applicable probabilistic framework for recursive latent-state reasoning with multi-trajectory computation, enabling hypothesis diversity, inference-time scaling, and both conditional reasoning and unconditional generation. This is novel at the modeling level, potentially impacting core sequence modeling, reasoning, and generative modeling across domains, and aligns with timely interest in test-time compute and structured reasoning. Paper 1 is a strong, practical contribution for LLM-agent prompt/skill optimization, but is more domain-specific and likely narrower in cross-field impact despite solid empirical gains.

vs. Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

gemini-3.1 · 5/20/2026

Paper 2 offers a significant theoretical breakthrough by providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability. By formalizing a novel framework (IC-SMDP) and bridging deep reinforcement learning theory with modern multi-agent LLM pipelines, it demonstrates exceptional methodological rigor. While Paper 1 presents a practical and effective algorithm for prompt optimization, Paper 2's foundational theoretical contributions are likely to have a broader, more enduring impact across reinforcement learning, multi-agent systems, and AI safety/coordination.

vs. Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

gemini-3.1 · 5/20/2026

Paper 1 addresses a critical bottleneck in the rapidly growing field of LLM agents: optimizing multi-field skills under real-world constraints. Its introduction of Chebyshev scalarization to navigate non-convex multi-objective trade-offs is methodologically rigorous and solves a distinct failure mode (stagnation) in existing optimizers. While Paper 2 offers strong improvements for evaluation and routing, Paper 1's potential to directly enhance autonomous agent performance gives it a broader and more timely scientific impact.

vs. DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

gemini-3.1 · 5/20/2026

Paper 1 introduces a comprehensive benchmark for an emerging and critical area (agentic delegation), which often dictates the trajectory of future research. By exposing a significant 15-31% unrealized performance headroom, it sets a clear target for the community, likely driving high citation volume and broader structural impact compared to the specific algorithmic optimization proposed in Paper 2.

vs. GeoX: Mastering Geospatial Reasoning Through Self-Play and Verifiable Rewards

gpt-5.2 · 5/20/2026

GeoX is more novel and broadly impactful: it introduces a self-play RL framework with executable, verifiable rewards to learn geospatial reasoning from imagery, and releases a benchmark—advancing both method and data for an important, under-annotated domain. Its real-world applications (remote sensing, mapping, disaster response, urban planning) are immediate and high-value, and the verifiable-program setup suggests methodological rigor and transferability to other grounded reasoning tasks. MOCHA is timely and useful for LLM-agent prompt/skill optimization, but is narrower in scope and impact, and improvements are incremental within an established multi-objective optimization paradigm.

vs. Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

gemini-3.1 · 5/20/2026

Paper 1 introduces a novel algorithmic solution (MOCHA) to a complex, multi-objective problem in LLM agent optimization, demonstrating concrete performance improvements over baselines. In contrast, Paper 2 is primarily an empirical measurement study. While highly relevant for systems design, Paper 1's direct methodological innovation and immediate applicability to the rapidly growing field of autonomous agents give it a higher potential for broad, immediate scientific impact.

vs. Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

gemini-3.1 · 5/20/2026

Paper 1 introduces a foundational paradigm shift for LLM agents by replacing informal, prompt-based procedures with executable, stateful abstractions (Formal Skills). This structural innovation addresses critical bottlenecks in agent reliability and token efficiency, offering a broader architectural impact on next-generation agent frameworks compared to Paper 2's algorithmic improvements for optimizing existing natural language skills.

vs. OpenComputer: Verifiable Software Worlds for Computer-Use Agents

gpt-5.2 · 5/20/2026

Paper 2 is likely higher impact: it introduces a broadly usable, verifier-grounded evaluation and task framework for computer-use agents across 33 real applications and 1,000 tasks, addressing a central bottleneck (reliable, auditable measurement) with clear real-world relevance and potential to become a community benchmark/infrastructure. The methodology emphasizes verifiability, trajectory logging, and partial-credit rewards, improving rigor over LLM-as-judge. Paper 1 is novel for multi-objective skill/prompt optimization, but its scope is narrower and impact depends on adoption within agent-skill tuning workflows.

vs. TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning

gpt-5.2 · 5/20/2026

Paper 2 is likely to have higher scientific impact due to its broader real-world applicability (household/robotic assistants under privacy and compute constraints), introduction of a new problem formalization (full-scene household reasoning), and release of a sizable human-validated benchmark (FullHome) that can drive follow-on research. The training-free, model-agnostic Ground-Infer-Execute framework is timely and deployable, and its gains across proprietary and open models suggest wide adoption. Paper 1 is novel in multi-objective prompt/skill optimization, but its impact is narrower to prompt/skill engineering and platform-specific constraints.

vs. Generative AI and the Productivity Divide: Human-AI Complementarities in Education

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a fundamental question about how GenAI affects productivity inequality—a topic with enormous breadth of impact across economics, education, management, and policy. Its RCT methodology is rigorous, and the novel construct of AI Interaction Competence (AIC) as a predictor of differential gains is highly citable and actionable. The finding that scaffolding reduces variance has immediate practical implications for firms and educators. Paper 1, while technically sound, addresses a narrower optimization problem within LLM agent engineering with more limited cross-disciplinary relevance.