MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization
Md Mehrab Tanjim, Jayakumar Subramanian, Xiang Chen, Branislav Kveton, Subhojyoti Mukherjee, Anlan Zhang, Sungchul Kim, Somdeb Sarkhel
Abstract
LLM agents organize behavior through skills - structured natural-language specifications governing how an agent reasons, retrieves, and responds. Unlike monolithic prompts, skills are multi-field artifacts subject to hard platform constraints: description fields are truncated for routing, instruction bodies are compacted via progressive disclosure, and co-resident skills compete for limited context windows. These constraints make skill optimization inherently multi-objective: a skill must simultaneously maximize task performance and satisfy platform limits. Yet existing prompt optimizers either ignore these trade-offs or collapse them into a weighted sum, missing Pareto-optimal variants in non-convex objective regions. We introduce MOCHA (Multi-Objective Chebyshev Annealing), which replaces single-objective selection with Chebyshev scalarization - covering the full Pareto front, including non-convex regions - combined with exponential annealing that transitions from exploration to exploitation. In our experiments across six diverse agent skills - where all methods share the same multi-objective mutation operator and baselines receive identical per-objective textual feedback - existing optimizers fail to improve the seed skill on 4 of 6 tasks: 1000 rollouts yield zero progress. MOCHA breaks through on every task, achieving 7.5% relative improvement in mean correctness over the strongest baseline (up to 14.9% on FEVER and 10.4% on TheoremQA) while discovering twice as many Pareto-optimal skill variants.
AI Impact Assessments
Scientific Impact Assessment: MOCHA
1. Core Contribution
MOCHA addresses a genuine but narrowly scoped problem: optimizing LLM agent "skills" — structured, multi-field natural-language specifications — under competing objectives of task correctness and platform compliance (character limits on description and instruction body fields). The key insight is that existing prompt optimizers (TextGrad, ProTeGi, GEPA) use single-objective selection strategies that cannot escape seed skills when correctness improvements require violating compliance constraints. MOCHA introduces Chebyshev scalarization (guaranteeing coverage of non-convex Pareto regions) with an annealed transition from hypervolume-contribution-based exploration to Chebyshev-based exploitation.
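The non-convexity argument can be checked on a toy example. The sketch below is illustrative only: the three score vectors, the ideal point (1, 1), and the standard weighted Chebyshev form max_i w_i (z*_i - f_i) are assumptions, not values or formulas taken from the paper. Point C is Pareto-optimal but lies in a concave dent of the front, so no weighted sum ever ranks it first, while Chebyshev scalarization does for balanced weights.

```python
# Toy comparison of weighted-sum and Chebyshev scalarization (both objectives maximized).
# All numbers below are made up for illustration.

def weighted_sum(scores, weights):
    """Linear scalarization: higher is better."""
    return sum(w * s for w, s in zip(weights, scores))

def chebyshev(scores, weights, ideal):
    """Weighted Chebyshev distance to the ideal point: lower is better."""
    return max(w * (z - s) for w, s, z in zip(weights, scores, ideal))

# A and B are the extremes; C is Pareto-optimal but sits below the segment A-B.
front = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.4, 0.4)}
ideal = (1.0, 1.0)  # assumed ideal point

def sweep(n=101):
    """Collect which points each scalarization can select over a grid of weights."""
    ws_winners, ch_winners = set(), set()
    for k in range(n):
        w1 = k / (n - 1)
        w = (w1, 1.0 - w1)
        ws_winners.add(max(front, key=lambda name: weighted_sum(front[name], w)))
        ch_winners.add(min(front, key=lambda name: chebyshev(front[name], w, ideal)))
    return ws_winners, ch_winners

if __name__ == "__main__":
    ws, ch = sweep()
    print("weighted sum selects:", sorted(ws))
    print("Chebyshev selects:   ", sorted(ch))
```

For any weight pair the weighted sum gives C a score of 0.4 while the better of A and B scores at least 0.5, which is why C is unreachable under linear scalarization.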
The method is algorithmically straightforward: draw random weight vectors from Dirichlet(1), select parents via Chebyshev scalarization, accept candidates based on HVC early and Chebyshev improvement later, with exponential decay controlling the transition. The novelty lies in combining well-known MOO machinery (Chebyshev scalarization, hypervolume indicators) for the first time in the discrete, sample-expensive NL skill optimization setting.
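The loop described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the decay rate, the rule that exploration is chosen with probability exp(-rate * step / total_steps), and the toy mutation acting directly on score vectors are all hypothetical. In the real setting the mutation would be an LLM call on a skill and scoring would require rollouts. The exploration rule accepts a child when it adds hypervolume, which for points above the reference point is equivalent to not being weakly dominated by an archive member.

```python
import math
import random

def sample_dirichlet_uniform(k, rng):
    """Draw from Dirichlet(1, ..., 1) by normalizing k Exponential(1) samples."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def chebyshev(scores, weights, ideal):
    """Weighted Chebyshev distance to the ideal point (lower is better)."""
    return max(w * (z - s) for w, s, z in zip(weights, scores, ideal))

def weakly_dominated(candidate, archive):
    """True if some archive member is at least as good on every objective."""
    return any(all(a >= c for a, c in zip(member, candidate)) for member in archive)

def explore_probability(step, total_steps, rate=5.0):
    """Assumed annealing schedule: exponentially decaying chance of exploring."""
    return math.exp(-rate * step / total_steps)

def optimize(seed_scores, mutate, total_steps, ideal, rng):
    """Toy loop over score vectors; `mutate` stands in for an LLM-based proposer."""
    archive = [tuple(seed_scores)]
    for step in range(total_steps):
        weights = sample_dirichlet_uniform(len(ideal), rng)
        parent = min(archive, key=lambda s: chebyshev(s, weights, ideal))
        child = tuple(mutate(parent, rng))
        if rng.random() < explore_probability(step, total_steps):
            # Exploration: accept anything that would add hypervolume.
            accept = not weakly_dominated(child, archive)
        else:
            # Exploitation: accept only a Chebyshev improvement over the parent.
            accept = chebyshev(child, weights, ideal) < chebyshev(parent, weights, ideal)
        if accept:
            archive.append(child)
    return archive

def toy_mutate(scores, rng):
    """Hypothetical mutation: jitter each score and clip to [0, 1]."""
    return [min(1.0, max(0.0, s + rng.uniform(-0.1, 0.1))) for s in scores]

if __name__ == "__main__":
    result = optimize((0.5, 0.5), toy_mutate, 200, (1.0, 1.0), random.Random(0))
    print(len(result), "accepted variants")
```

Drawing fresh weights each step is what spreads parent selection across the front; the annealing schedule only decides which acceptance test is applied.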
2. Methodological Rigor
Strengths in experimental design: The paper makes a commendable effort at fair comparison. All methods share an identical mutation interface (SkillMdProposer), receive the same per-objective textual feedback, use the same backbone LLMs (Claude Haiku 4.5 for execution, Claude Opus 4.6 for mutation), and operate under matched 1000-rollout budgets. This isolates the selection strategy as the sole independent variable. Results are reported over 5 random seeds with standard deviations.
Concerns:
3. Potential Impact
Narrow but practical niche. The paper targets a real engineering constraint in agent platforms (Anthropic's SKILL.md specification with character limits). As LLM agent frameworks mature, structured skill optimization with platform constraints will become increasingly relevant. However, the specificity to one platform's constraints limits immediate generalizability.
Broader applicability is plausible but undemonstrated. The framework could potentially extend to other structured NL artifacts with competing constraints (API documentation, system prompts with multiple sections, multi-agent configurations). The paper mentions meta-harness optimization as a natural extension but does not explore it.
Limited downstream validation. The paper evaluates on academic benchmarks rather than deployed agent systems. Real-world impact would require demonstrating that Pareto-optimal skill variants discovered by MOCHA translate to meaningful user-facing improvements.
4. Timeliness & Relevance
The paper is well-timed. LLM agent frameworks (Claude Code, GPT Actions, etc.) are rapidly proliferating, and structured skill/plugin definitions are becoming standard. The observation that skill optimization is multi-objective, balancing performance against platform constraints, is a genuinely useful framing that the community hasn't adequately addressed. The shift from monolithic prompt optimization to structured multi-field artifact optimization is a natural and timely evolution.
5. Strengths & Limitations
Key Strengths:
Key Limitations:
Additional Observations:
The paper's strongest contribution may be the *problem formulation* rather than the algorithm itself. Recognizing that agent skill optimization is multi-objective and requires Pareto front navigation is a useful conceptual contribution. The algorithmic components (Chebyshev scalarization, HVC, exponential annealing) are all well-established; the novelty is in their application to this domain. The qualitative examples showing MOCHA discovering structured reasoning protocols (FEVER rules, GPQA verification steps) while baselines return unchanged seeds are compelling evidence that the multi-objective framing unlocks meaningful optimization.
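For readers unfamiliar with hypervolume contribution (HVC), a minimal two-objective sketch follows. It assumes both objectives are maximized and measured against a reference point; the archive, candidates, and reference point are invented for illustration, and the paper's actual objective count and reference choice are not specified here.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by `ref` (maximization)."""
    # Keep only points strictly better than the reference, sweep from largest x.
    pts = sorted(
        (p for p in points if p[0] > ref[0] and p[1] > ref[1]),
        key=lambda p: p[0],
        reverse=True,
    )
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:
            # Only the strip above the best y seen so far is newly covered.
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area

def hv_contribution(candidate, archive, ref):
    """Hypervolume gained by adding `candidate` to `archive`."""
    return hypervolume_2d(list(archive) + [candidate], ref) - hypervolume_2d(archive, ref)

if __name__ == "__main__":
    archive = [(0.9, 0.2), (0.2, 0.9)]  # illustrative (correctness, compliance) pairs
    ref = (0.0, 0.0)
    print(hv_contribution((0.5, 0.5), archive, ref))  # fills the gap between extremes
    print(hv_contribution((0.1, 0.1), archive, ref))  # dominated, adds nothing
```

A dominated candidate contributes zero area, so an HVC-based acceptance rule rewards variants that fill gaps in the front rather than those that merely match existing trade-offs.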
Generated May 20, 2026
Comparison History (21)
Paper 1 presents a novel algorithmic contribution (MOCHA) for multi-objective prompt optimization, addressing fundamental challenges in LLM agent design applicable across broad domains. Paper 2, while valuable for industry applications, offers a domain-specific benchmark (finance spreadsheets) which has a narrower scope. The fundamental methodological innovation and broader applicability of Paper 1 give it higher potential for widespread scientific impact.
MOCHA introduces a novel multi-objective optimization framework (Chebyshev scalarization with annealing) for LLM agent skill optimization, addressing a fundamental gap in prompt optimization. It offers methodological innovation applicable beyond its immediate domain, with rigorous experimental validation showing clear improvements. WorkstreamBench is a valuable benchmark contribution for spreadsheet-based financial tasks, but benchmarks typically have narrower impact than new optimization methods. MOCHA's technique could generalize broadly across multi-objective LLM optimization problems, giving it higher potential impact.
Paper 1 offers a broadly applicable methodological advancement by addressing a fundamental bottleneck in LLM agent deployment: optimizing skills under strict platform constraints. By introducing a mathematically grounded multi-objective optimization technique (Chebyshev scalarization), it provides a robust solution that generalizes across any agentic framework. While Paper 2 shows impressive benchmark improvements in a specific cognitive niche (Theory of Mind), Paper 1's approach to prompt/skill optimization has greater potential to influence the foundational engineering and practical deployment of varied LLM agents across the entire field.
Paper 2 addresses the highly timely and rapidly expanding field of LLM agent optimization. By introducing a multi-objective approach to optimize complex, multi-field agent skills under real-world platform constraints, it offers broader immediate applicability across AI research and industry than Paper 1's domain-specific focus on vehicle routing. Furthermore, its ability to succeed where existing prompt optimizers completely fail suggests a significant methodological breakthrough that could broadly influence how autonomous agents are developed and tuned.
Paper 2 addresses the highly timely and broadly applicable problem of LLM agent skill optimization. Its use of multi-objective Chebyshev annealing offers a mathematically grounded approach to managing platform constraints, a critical bottleneck in agent deployment. While Paper 1 shows impressive large-scale industry results in ad bidding, Paper 2's methodology has greater potential for widespread adoption and multidisciplinary impact across the rapidly expanding field of autonomous AI agents.
Paper 2 likely has higher scientific impact due to strong timeliness (trustworthy VLMs for robotics), broad applicability across vision-language reasoning tasks, and clear real-world safety implications. Its modular pseudocode library plus difficulty-aware strategy selection (DFV) offers an interpretable framework that can transfer across models and domains, and the reported SOTA improvements on established benchmarks suggest practical effectiveness. Paper 1 is novel for multi-objective skill/prompt optimization, but its impact is narrower (agent “skills” under platform constraints) and more tooling-specific, with smaller, task-specific gains.
MOCHA addresses a fundamental gap in LLM agent optimization by introducing a principled multi-objective framework (Chebyshev scalarization with annealing) that demonstrably outperforms existing methods across diverse tasks. Its contribution is methodologically novel, broadly applicable to the rapidly growing field of LLM agents and prompt optimization, and solves a well-defined technical problem with strong empirical results. Paper 2, while thorough and practically useful for survey methodology, is more application-specific to disaster preparedness surveys, with incremental improvements over existing imputation methods and narrower cross-field impact potential.
Paper 1 introduces a novel architectural paradigm (context maps) to solve a critical bottleneck in LLM agents: efficiently handling long, recurring contexts. By reducing costs up to 5.8x and significantly improving accuracy, it offers high real-world applicability for scaling agentic systems. Paper 2 presents a rigorous optimization method for agent skills, but Paper 1's contribution to agent memory and context management addresses a more fundamental and widely pressing challenge in the field, likely leading to broader adoption and architectural impact.
Paper 1 likely has higher scientific impact due to stronger novelty and broader, timely relevance: it extends machine unlearning to realistic multi-task shared-backbone settings and formalizes/mitigates interference via principled gradient projection/orthogonalization, which can generalize beyond vision to many multi-task architectures. It also targets an important regulatory/trustworthy-ML need (data deletion), with clearer implications for compliance and deployment. Paper 2 is timely for LLM agents and offers a useful multi-objective optimizer, but the methodological advance is more incremental (scalarization + annealing) and its impact may be narrower and more system/prompt-engineering dependent.
Paper 1 presents a concrete, novel algorithm (Chebyshev scalarization + annealing) for multi-objective skill optimization under realistic platform constraints, with controlled experiments and measurable gains across multiple tasks, suggesting actionable real-world deployment and methodological rigor. Its multi-objective optimization framing is broadly relevant to agent prompt/skill engineering and could transfer to other constrained LLM optimization settings. Paper 2 is timely and potentially broad, but as a vision/conceptual framework without demonstrated methods or empirical validation, its near-term scientific impact and rigor are less certain.
Paper 2 (GRAM) introduces a broadly applicable probabilistic framework for recursive latent-state reasoning with multi-trajectory computation, enabling hypothesis diversity, inference-time scaling, and both conditional reasoning and unconditional generation. This is novel at the modeling level, potentially impacting core sequence modeling, reasoning, and generative modeling across domains, and aligns with timely interest in test-time compute and structured reasoning. Paper 1 is a strong, practical contribution for LLM-agent prompt/skill optimization, but is more domain-specific and likely narrower in cross-field impact despite solid empirical gains.
Paper 2 offers a significant theoretical breakthrough by providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability. By formalizing a novel framework (IC-SMDP) and bridging deep reinforcement learning theory with modern multi-agent LLM pipelines, it demonstrates exceptional methodological rigor. While Paper 1 presents a practical and effective algorithm for prompt optimization, Paper 2's foundational theoretical contributions are likely to have a broader, more enduring impact across reinforcement learning, multi-agent systems, and AI safety/coordination.
Paper 1 addresses a critical bottleneck in the rapidly growing field of LLM agents: optimizing multi-field skills under real-world constraints. Its introduction of Chebyshev scalarization to navigate non-convex multi-objective trade-offs is methodologically rigorous and solves a distinct failure mode (stagnation) in existing optimizers. While Paper 2 offers strong improvements for evaluation and routing, Paper 1's potential to directly enhance autonomous agent performance gives it a broader and more timely scientific impact.
Paper 1 introduces a comprehensive benchmark for an emerging and critical area (agentic delegation), which often dictates the trajectory of future research. By exposing a significant 15-31% unrealized performance headroom, it sets a clear target for the community, likely driving high citation volume and broader structural impact compared to the specific algorithmic optimization proposed in Paper 2.
GeoX is more novel and broadly impactful: it introduces a self-play RL framework with executable, verifiable rewards to learn geospatial reasoning from imagery, and releases a benchmark—advancing both method and data for an important, under-annotated domain. Its real-world applications (remote sensing, mapping, disaster response, urban planning) are immediate and high-value, and the verifiable-program setup suggests methodological rigor and transferability to other grounded reasoning tasks. MOCHA is timely and useful for LLM-agent prompt/skill optimization, but is narrower in scope and impact, and improvements are incremental within an established multi-objective optimization paradigm.
Paper 1 introduces a novel algorithmic solution (MOCHA) to a complex, multi-objective problem in LLM agent optimization, demonstrating concrete performance improvements over baselines. In contrast, Paper 2 is primarily an empirical measurement study. While highly relevant for systems design, Paper 1's direct methodological innovation and immediate applicability to the rapidly growing field of autonomous agents give it a higher potential for broad, immediate scientific impact.
Paper 1 introduces a foundational paradigm shift for LLM agents by replacing informal, prompt-based procedures with executable, stateful abstractions (Formal Skills). This structural innovation addresses critical bottlenecks in agent reliability and token efficiency, offering a broader architectural impact on next-generation agent frameworks compared to Paper 2's algorithmic improvements for optimizing existing natural language skills.
Paper 2 is likely higher impact: it introduces a broadly usable, verifier-grounded evaluation and task framework for computer-use agents across 33 real applications and 1,000 tasks, addressing a central bottleneck (reliable, auditable measurement) with clear real-world relevance and potential to become a community benchmark/infrastructure. The methodology emphasizes verifiability, trajectory logging, and partial-credit rewards, improving rigor over LLM-as-judge. Paper 1 is novel for multi-objective skill/prompt optimization, but its scope is narrower and impact depends on adoption within agent-skill tuning workflows.
Paper 2 is likely to have higher scientific impact due to its broader real-world applicability (household/robotic assistants under privacy and compute constraints), introduction of a new problem formalization (full-scene household reasoning), and release of a sizable human-validated benchmark (FullHome) that can drive follow-on research. The training-free, model-agnostic Ground-Infer-Execute framework is timely and deployable, and its gains across proprietary and open models suggest wide adoption. Paper 1 is novel in multi-objective prompt/skill optimization, but its impact is narrower to prompt/skill engineering and platform-specific constraints.
Paper 2 addresses a fundamental question about how GenAI affects productivity inequality—a topic with enormous breadth of impact across economics, education, management, and policy. Its RCT methodology is rigorous, and the novel construct of AI Interaction Competence (AIC) as a predictor of differential gains is highly citable and actionable. The finding that scaffolding reduces variance has immediate practical implications for firms and educators. Paper 1, while technically sound, addresses a narrower optimization problem within LLM agent engineering with more limited cross-disciplinary relevance.