SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang, Ziwei Zhou, Zisu Huang, Yan Li, Xuemei Gao

May 22, 2026

arXiv:2605.23904v2


cs.AI (primary), cs.CL


#91 of 2525 · Artificial Intelligence

Tournament Score: 1548 ± 46 (scale: 1050 to 1800)

Win Rate: 88%

Rating: 7.5 / 10

Significance 7.5 · Rigor 7.5 · Novelty 7 · Clarity 8


Abstract

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization. Code: https://aka.ms/skillopt

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SkillOpt

1. Core Contribution

SkillOpt introduces a systematic framework for optimizing natural-language "skill documents" that serve as procedural instructions for frozen LLM agents. The key conceptual move is treating the skill document as an optimizable external state analogous to model weights, with explicit parallels to deep learning training: rollout batches as data, bounded add/delete/replace edits as gradient steps, a textual learning-rate budget controlling step size, validation gating for acceptance, rejected-edit buffers as negative feedback, and epoch-wise slow/meta updates as momentum. The deployed output is a compact markdown file (300–2,000 tokens) that requires zero additional inference-time model calls.
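The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the names `Edit`, `apply_edits`, `train_skill`, `toy_propose`, and `toy_validate` are hypothetical, the proposer is a hard-coded stand-in for the optimizer model that reads scored rollouts, the validator is a stand-in for the held-out validation score, and the epoch-wise slow/meta update is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Edit:
    op: str          # "add", "delete", or "replace"
    index: int       # line position the edit targets
    text: str = ""   # new line for add/replace; unused for delete


def apply_edits(skill: List[str], edits: List[Edit]) -> List[str]:
    """Return a new skill document with the edits applied in order."""
    lines = list(skill)
    for e in edits:
        if e.op == "add":
            lines.insert(min(e.index, len(lines)), e.text)
        elif e.op == "delete" and 0 <= e.index < len(lines):
            del lines[e.index]
        elif e.op == "replace" and 0 <= e.index < len(lines):
            lines[e.index] = e.text
    return lines


def train_skill(
    skill: List[str],
    propose: Callable[[List[str], List[List[Edit]]], List[Edit]],
    validate: Callable[[List[str]], float],
    steps: int,
    edit_budget: int,
) -> Tuple[List[str], float, List[List[Edit]]]:
    """Validation-gated training: keep an edit batch only on strict improvement."""
    best_score = validate(skill)
    rejected: List[List[Edit]] = []
    for _ in range(steps):
        # The "textual learning rate": at most edit_budget edits per step.
        edits = propose(skill, rejected)[:edit_budget]
        if not edits:
            break
        candidate = apply_edits(skill, edits)
        score = validate(candidate)
        if score > best_score:          # strict improvement required
            skill, best_score = candidate, score
        else:
            rejected.append(edits)      # negative feedback for later proposals
    return skill, best_score, rejected


# Toy stand-ins (assumptions): a fixed rule list and a coverage-based score.
REQUIRED = {"check units", "cite sources", "verify output"}


def toy_validate(skill: List[str]) -> float:
    return len(REQUIRED & set(skill)) / len(REQUIRED)


def toy_propose(skill: List[str], rejected: List[List[Edit]]) -> List[Edit]:
    candidates = ["check units", "be verbose", "cite sources", "verify output"]
    tried = {e.text for batch in rejected for e in batch}
    return [Edit("add", len(skill), c)
            for c in candidates if c not in skill and c not in tried]


final_skill, final_score, rejected_log = train_skill(
    ["answer the question"], toy_propose, toy_validate, steps=5, edit_budget=1
)
```

In this toy run the unhelpful "be verbose" edit is proposed once, fails the gate, and lands in the rejected buffer so it is not proposed again; the three useful rules are accepted one at a time.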

The paper solves a genuine problem: current approaches to agent skill construction are either one-shot (no iterative improvement), hand-crafted (no scalability), or loosely controlled self-revision (no convergence guarantees). SkillOpt provides a disciplined middle ground where skills are iteratively refined with explicit stability controls.

2. Methodological Rigor

The experimental design is notably thorough. The evaluation spans 52 (model, benchmark, harness) cells across six diverse benchmarks (QA, spreadsheets, documents, math, embodied tasks), seven target models (GPT-5.x family and Qwen variants), and three execution harnesses (direct chat, Codex, Claude Code). The use of deterministic train/selection/test splits with the selection split used only for gating—and all reported numbers on held-out test—is methodologically sound.
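One common way to obtain deterministic train/selection/test splits is to hash a stable example identifier. The sketch below is an assumption for illustration only: the assessment does not say how the paper builds its splits, and the function name `assign_split` and the 60/20/20 fractions are hypothetical.

```python
import hashlib


def assign_split(example_id: str,
                 train_frac: float = 0.6,
                 selection_frac: float = 0.2) -> str:
    """Map an example id to a split; the same id always gets the same split."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    # Turn the first 32 bits of the hash into a number in [0, 1).
    u = int(digest[:8], 16) / 0x100000000
    if u < train_frac:
        return "train"
    if u < train_frac + selection_frac:
        return "selection"   # used only for gating edits
    return "test"            # used only for reported numbers


splits = [assign_split(f"example-{i}") for i in range(1000)]
```

Because the assignment depends only on the identifier, rerunning the pipeline or adding new examples never moves an existing example between splits.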

The ablation study is comprehensive: it isolates training set size, minibatch size, batch size, learning rate, schedule, slow-update samples, and individual components (learning-rate form, rejected buffer, slow/meta update). The ablations reveal meaningful insights—for instance, removing both meta skill and slow update drops SpreadsheetBench by 22.5 points, while SearchQA (near ceiling) is stable across most settings.

However, there are methodological concerns. First, the optimizer model (GPT-5.5) is a frontier model, meaning the "no weight update" claim is somewhat misleading—the system uses a powerful frontier model to generate edits, just not to execute tasks. Table 5 partially addresses this by showing target-matched optimizers recover 56–74% of the gain, but the headline numbers all use the strongest available optimizer. Second, the paper evaluates on models released in 2026 (GPT-5.x, Qwen3.x), making independent verification difficult at the time of assessment. Third, while the claim of "best or tied on all 52 cells" is strong, some margins are small and statistical significance tests are absent.

3. Potential Impact

Practical impact: SkillOpt addresses a real deployment need. Organizations using frozen API-based models (which cannot be fine-tuned) need lightweight adaptation mechanisms. A reusable, inspectable, transferable text artifact that improves performance by 15–25 points on average is genuinely useful. The cross-harness transfer results (e.g., Codex→Claude Code with +59.7 on SpreadsheetBench) are particularly compelling for real-world deployment where execution environments change.

Broader influence: The conceptual framing—treating text artifacts as optimizable states with training-style controls—could influence how the community thinks about prompt engineering, agent configuration, and procedural memory. The analogy to deep learning optimization (learning rates, schedules, validation, momentum) provides a vocabulary and framework that could be adopted broadly.

Adjacent fields: The work connects to prompt optimization (TextGrad, DSPy), agent memory systems, and automated ML configuration. The skill-as-trainable-state paradigm could extend to other external knowledge artifacts (retrieval indices, tool descriptions, workflow specifications).

4. Timeliness & Relevance

This paper arrives at a moment when agent systems are becoming the dominant deployment paradigm for LLMs, and the gap between model capability and task-specific performance is increasingly addressed through external scaffolding rather than fine-tuning. The inability to fine-tune closed frontier models makes text-space adaptation directly relevant. The inclusion of Codex and Claude Code harnesses—real commercial agent platforms—demonstrates awareness of current deployment patterns.

The benchmark diversity (including SpreadsheetBench's multi-turn code generation and ALFWorld's embodied interaction) shows the approach addresses needs across the agent spectrum, not just simple prompt optimization for QA tasks.

5. Strengths & Limitations

Key Strengths:

Comprehensive evaluation: 52 cells, 7 models, 6 benchmarks, 3 harnesses is among the most thorough evaluations in the agent optimization literature.

Transfer experiments: Cross-model, cross-harness, and cross-benchmark transfer provide evidence that the artifacts encode genuinely useful procedural knowledge rather than overfitting.

Interpretability: The learned skills are compact, readable, and procedural (Figure 4), which is rare for automated optimization outputs.

Edit economy: Gains from only 1–4 accepted edits demonstrate the validation gate's effectiveness and the method's efficiency.

Complete system design: The paper provides full optimizer prompt contracts (Appendix C), enhancing reproducibility.

Notable Limitations:

Asymmetric compute: Using GPT-5.5 as an optimizer for smaller target models conflates "no weight update" with "no additional model." The headline gains partially reflect the optimizer model's knowledge being distilled into the skill.

Scorer dependency: The method requires reliable automatic scoring, limiting applicability to open-ended creative or subjective tasks (acknowledged in limitations).

No statistical significance: With no confidence intervals or significance tests across the 52 cells, some "best or tied" claims may not be robust.

Single skill limitation: One skill per domain may be insufficient for heterogeneous task distributions, though the paper acknowledges this.

Reproducibility concerns: Dependence on specific model versions (GPT-5.4, GPT-5.5) that may be deprecated complicates long-term reproducibility.

Baseline fairness: Some baselines (TextGrad, GEPA) were designed for prompt optimization, not skill optimization, potentially disadvantaging them in this framing.
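The missing-significance limitation noted above could be addressed per cell with a paired bootstrap over test examples. The following is a minimal sketch, assuming per-example 0/1 correctness is available for both the no-skill baseline and the optimized skill on the same test items; the function name `paired_bootstrap_ci` and its defaults are hypothetical and not taken from the paper.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(base: List[int],
                        skill: List[int],
                        n_resamples: int = 2000,
                        alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for mean(skill) - mean(base) on paired items."""
    if not base or len(base) != len(skill):
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(base)
    diffs = []
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping each pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(skill[i] - base[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

A cell would count as a clear win only if the interval excludes zero; cells where it straddles zero would be reported as ties rather than as best.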

Additional Observations

The cost analysis (Table 6) is valuable but reveals wide variance: 0.6M–46.4M tokens per test-point gain. The paper could better discuss when SkillOpt is cost-effective versus alternatives like few-shot prompting or fine-tuning.
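The "tokens per test-point gain" metric quoted above is a simple ratio. A small sketch follows, with a hypothetical function name and made-up input numbers that are not taken from Table 6:

```python
def tokens_per_point(optimization_tokens: float,
                     baseline_acc: float,
                     optimized_acc: float) -> float:
    """Optimization tokens spent per percentage point of test accuracy gained."""
    gain = optimized_acc - baseline_acc
    if gain <= 0:
        raise ValueError("metric is undefined when there is no gain")
    return optimization_tokens / gain


# Illustrative numbers only: 30M tokens spent to move from 40.0 to 52.0 points.
cost = tokens_per_point(30_000_000, 40.0, 52.0)  # 2.5M tokens per point
```

The ratio grows quickly as a benchmark nears its ceiling, since the denominator shrinks while the optimization cost does not, which is one plausible reason for the wide spread the assessment points to.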

The deep-learning analogy, while useful pedagogically, risks being overextended. Text-space optimization operates in a fundamentally different landscape than weight-space optimization—the "learning rate" is a discrete edit count, not a continuous scalar, and the optimization landscape has no gradient structure.
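The contrast can be written out side by side. The notation here is an illustrative assumption rather than the paper's own: $\theta$ denotes model weights, $\eta$ a continuous learning rate, $s_t$ the skill document at step $t$, $E_t$ a proposed set of add/delete/replace edits, $\oplus$ the application of those edits, $k$ the integer edit budget, and $V$ the held-out validation score.

```latex
% Weight-space step: continuous step size, direction given by a gradient.
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t),
\qquad \eta \in \mathbb{R}_{>0}

% Text-space step: discrete edit budget, no gradient, accept only on strict improvement.
s_{t+1} =
\begin{cases}
  s_t \oplus E_t & \text{if } V(s_t \oplus E_t) > V(s_t), \\
  s_t            & \text{otherwise,}
\end{cases}
\qquad |E_t| \le k, \; k \in \mathbb{N}
```

Written this way, the budget $k$ bounds how far a step can move but, unlike $\eta$, says nothing about direction; direction comes entirely from the optimizer model's proposal and the validation gate.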

Overall, this is a well-executed systems paper that introduces a principled framework for an important practical problem, backed by unusually broad empirical evidence. The main scientific contribution is the formalization and demonstration that training-style discipline (bounded updates, validation gating, momentum) yields reliable improvements in text-space skill optimization.

Rating: 7.5 / 10

Significance 7.5 · Rigor 7.5 · Novelty 7 · Clarity 8

Generated May 26, 2026

Comparison History (24)

vs. Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling

claude-opus-4.6 · 5/26/2026

SkillOpt introduces a fundamentally new paradigm, treating agent skills as optimizable external state with disciplined text-space optimization, which has broader impact across the rapidly growing agent ecosystem. Its comprehensive evaluation across 52 cells (6 benchmarks, 7 models, 3 harnesses) with consistent improvements and demonstrated transferability suggests high practical adoption. Paper 2 offers a solid incremental improvement to test-time scaling via stochastic backtracking, but addresses a narrower problem (PRM-guided search efficiency) with less transformative potential. SkillOpt's cross-model, cross-harness generalizability and the skill-as-trainable-artifact framing open more new research directions.

vs. Neuro-Inspired Inverse Learning for Planning and Control

gemini-3.1 · 5/26/2026

Paper 1 offers higher scientific impact by formalizing a fundamental new paradigm (Inverse Learning) that bridges Reinforcement Learning and Optimal Control. Its neuro-inspired architecture not only advances embodied AI but also demonstrates profound cross-disciplinary utility by accelerating quantum gate synthesis by 1000x. While Paper 2 provides a highly rigorous, timely framework for LLM skill optimization, Paper 1 introduces foundational theoretical concepts with physics-based and hardware-level applications, suggesting broader, longer-lasting implications across machine learning, robotics, and quantum computing.

vs. Inference Time Context Sparsity: Illusion or Opportunity?

claude-opus-4.6 · 5/26/2026

SkillOpt introduces a novel and systematic framework for optimizing agent skills as text-space parameters with optimizer discipline (learning rates, validation, epochs), achieving strong empirical results across 52 evaluation cells, 7 models, and 3 harnesses. It addresses a timely problem in agentic AI with a principled methodology and demonstrates transferability. Paper 2 provides valuable empirical observations about context sparsity robustness and practical speedups, but is more of a position/empirical study consolidating known intuitions rather than introducing a fundamentally new method. SkillOpt's broader applicability to the rapidly growing agent ecosystem gives it higher impact potential.

vs. GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration

gemini-3.1 · 5/26/2026

While Paper 1 provides a valuable domain-specific benchmark and safety analysis for dental AI, Paper 2 introduces a fundamental, domain-agnostic methodology for optimizing AI agent skills. By framing skill evolution as a systematic, controllable text-space optimization process, Paper 2 offers a broad, scalable advancement that significantly improves agent performance across multiple models and environments. This fundamental methodological innovation gives it a much wider potential impact across the entire AI research and application landscape.

vs. Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

claude-opus-4.6 · 5/26/2026

SkillOpt demonstrates broader immediate practical impact with consistent improvements across 52 evaluation cells, seven models, three execution harnesses, and six benchmarks. Its transferability results and deployment-ready design (zero inference-time overhead) make it highly applicable. However, Paper 2 offers deeper theoretical contributions (first finite-sample guarantee for neural Q-learning under decentralized partial observability). Paper 1 wins on breadth of impact, real-world applicability, and timeliness given the current explosion of LLM agent deployments, though Paper 2's theoretical rigor is arguably stronger.

vs. How Far Are We From True Auto-Research?

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact: it proposes a novel, optimizer-like, controllable method for training agent “skills” with validation-gated edits, addressing reproducibility and stability in text-space optimization. It demonstrates broad, rigorous empirical results (6 benchmarks, 7 models, 3 harnesses, 52 cells) with strong gains and transfer, plus released code—supporting real-world adoption in agent systems. Paper 1 is timely and valuable as an evaluation/diagnostic study of auto-research quality, but it is more observational and its contributions are narrower than a general, deployable optimization framework.

vs. Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations

gemini-3.1 · 5/26/2026

Paper 1 introduces a foundational methodology for agent skill optimization, treating text-space skills with the rigor of deep-learning weight optimization. Its high methodological novelty, broad applicability across various models, and strong empirical results suggest widespread adoption across the entire AI agent ecosystem. While Paper 2 offers a highly valuable, socially impactful benchmark for disaster response, Paper 1 provides a fundamental algorithmic advancement that will likely improve agentic workflows across a much broader range of fields and applications.

vs. Insuring Every Action: An Authority Frontier Framework for Runtime Actuarial Control of Autonomous AI Agents

claude-opus-4.6 · 5/26/2026

SkillOpt demonstrates higher scientific impact through comprehensive empirical validation across 52 evaluation cells, showing consistent improvements over multiple strong baselines. It introduces a principled, systematic framework for skill optimization that bridges text-space and weight-space optimization paradigms—a novel and broadly applicable contribution. The demonstrated transferability across models, environments, and tasks amplifies its practical impact. Paper 2, while addressing an important safety/governance concern with an interesting actuarial framing, is more niche, lacks the breadth of empirical validation, and its real-world adoption path is less clear. SkillOpt's immediate applicability to improving agent performance across diverse settings gives it broader impact.

vs. Credit Assignment with Resets in Language Model Reasoning

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a fundamental limitation in RLVR for LLM reasoning—uniform credit assignment—with a principled approach grounded in Conservative Policy Iteration theory. The credit assignment problem is central to RL and broadly applicable. The self-reset mechanism (SRPO) is novel, theoretically motivated, and requires no external supervision. Paper 2, while impressive empirically across many benchmarks, is more engineering-focused (text-space skill optimization) with narrower theoretical contributions. Paper 1's theoretical framework and insights into credit assignment have broader potential to influence future RL-for-reasoning research.

vs. When Can We Trust Early Warnings? Leakage-Excluded Early Outcome Prediction from LMS Interaction Logs

gpt-5.2 · 5/26/2026

Paper 2 (SkillOpt) has higher potential impact due to stronger novelty (a controlled, validation-gated optimizer for persistent agent “skill” artifacts), broad applicability across agentic LLM systems, and timely relevance to self-improving agents. It reports extensive empirical coverage (multiple benchmarks/models/harnesses) and practical deployment advantages (no extra inference-time calls, transferable skill documents). Paper 1 addresses an important methodological issue (temporal leakage) and could improve rigor in learning analytics, but its scope and cross-field reach are narrower and mainly corrective rather than enabling new capabilities.

vs. When Can Human-AI Teams Outperform Individuals? Tight Bounds with Impossibility Guarantees

gpt-5.2 · 5/26/2026

Paper 2 offers a general theoretical framework with tight bounds and impossibility guarantees for when human–AI complementarity is achievable, validated on multiple datasets and extended to multiclass settings. Its results are broadly applicable across HCI, ML evaluation, decision theory, and AI-assisted work, and provide durable design criteria (correlation threshold) that can shape future studies and systems. Paper 1 is highly practical and timely for agent skill optimization, but is more system/benchmark-specific and may be overtaken by shifting model/tooling paradigms, whereas Paper 2’s theory is likely to remain foundational.

vs. Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction

claude-opus-4.6 · 5/26/2026

SkillOpt introduces a novel and systematic framework for optimizing agent skills as text-space parameters with optimizer-like discipline, demonstrating strong empirical results across 52 evaluation cells, 7 models, and 3 execution harnesses. Its broad applicability, transferability across models and environments, and practical deployment advantages (zero inference-time overhead) give it wider impact potential. While PVD offers a useful selective prediction protocol grounded in proof theory, its contributions are more incremental—combining existing ideas (prover-verifier games, selective prediction) with empirical characterization on limited benchmarks. SkillOpt's paradigm of treating skills as trainable artifacts is more foundational and broadly applicable.

vs. Summoning the Oracle to Slay It: Mitigating Look-Ahead Bias in Financial Backtesting with Large Language Models

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact: it introduces a broadly applicable, optimizer-like framework for training reusable “skill” artifacts with validation-gated, bounded edits—analogous to disciplined weight-space optimization—showing strong, consistent gains across many models, benchmarks, and execution harnesses plus transferability. This suggests wide real-world utility for agent deployment and a methodology that could influence multiple subfields (agents, prompt/skill learning, evaluation). Paper 1 addresses an important but narrower finance-backtesting validity issue; while novel, its domain scope and downstream adoption surface are more limited.

vs. GRAIL: AI translation for scientists application workflow on satellite data

claude-opus-4.6 · 5/26/2026

SkillOpt introduces a novel and systematic framework for optimizing agent skills as text-space parameters, drawing a compelling analogy to deep learning optimization. It demonstrates broad impact across 52 evaluation cells, 7 models, 3 execution harnesses, and 6 benchmarks with significant accuracy improvements. Its transferability results and reproducible methodology suggest high generalizability. GRAIL, while useful, addresses a narrower problem (translating geospatial Python to Spark) with a more incremental engineering contribution. SkillOpt's foundational contribution to agentic AI skill optimization has far broader implications for the rapidly growing LLM agent field.

vs. DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs

gemini-3.1 · 5/26/2026

Paper 2 introduces a highly novel paradigm by treating textual agent skills as optimizable states, akin to weight-space optimization in deep learning. Its rigorous methodology, extensive evaluation across multiple models and harnesses, and demonstration of strong transferability suggest a broader foundational impact on agent development than Paper 1's communication-reduction strategy, which is impactful but more narrowly focused on multi-agent efficiency.

vs. Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis

gemini-3.1 · 5/26/2026

Paper 1 introduces a highly novel, generalizable text-space optimizer for AI agents, drawing a strong parallel to deep learning weight optimization. Its extensive methodological rigor, demonstrated across numerous benchmarks and state-of-the-art models, ensures broad applicability and impact across the entire field of autonomous agents. While Paper 2 offers valuable contributions to medical AI interpretability, Paper 1's foundational approach to agent self-evolution will likely influence a wider array of disciplines and future AI architectures.

vs. L2IR: Revealing Latent Intent in Graph Fraud Detection

gpt-5.2 · 5/26/2026

Paper 1 is likely higher impact due to a more novel, generalizable framework: a controllable, validation-gated “optimizer” for agent skill text with stability mechanisms and zero deployment overhead. It shows broad, consistent gains across many models, benchmarks, and execution harnesses, plus transfer across environments—suggesting wide applicability for LLM agents and tooling. Paper 2 targets an important but narrower domain (graph fraud) with moderate gains on two datasets; its LLM-intent idea is useful but less broadly transformative and may face deployment/labeling constraints. Paper 1 is also more timely given rapid agent adoption.

vs. Uncertainty Decomposition via Cyclical SG-MCMC and Soft-label Learning for Subjective NLP

gemini-3.1 · 5/26/2026

Paper 1 addresses the highly impactful and timely field of LLM agents, introducing a novel text-space optimizer inspired by deep learning principles. Its extensive evaluation across multiple leading models (e.g., GPT-5.5, Claude) and execution harnesses demonstrates significant performance gains and robust transferability. In contrast, Paper 2 focuses on a more niche problem (uncertainty in subjective NLP), making Paper 1 far more likely to have a broad and immediate impact on AI development and real-world applications.

vs. A Deep Dive into Axiomatic Design -- Part I: Problem Formulation

claude-opus-4.6 · 5/26/2026

SkillOpt presents a novel, systematic framework for optimizing agent skills as text-space parameters with rigorous optimizer discipline. It demonstrates strong empirical results across 52 evaluation cells, multiple models, and benchmarks, with significant accuracy improvements. The work addresses a timely problem in LLM agent development with broad applicability. Paper 2 is a pedagogical clarification of existing axiomatic design theory with limited novelty—it primarily revisits established concepts from Suh's books. SkillOpt's methodological innovation, comprehensive evaluation, and practical impact on the rapidly growing AI agents field give it substantially higher scientific impact potential.

vs. HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

gemini-3.1 · 5/26/2026

Paper 1 introduces a highly practical and systematic framework for optimizing agent skills in text-space, bridging the gap between deep-learning optimization rigor and LLM prompting. Its extensive empirical validation across multiple models (including advanced systems like GPT-5.5 and Claude Code) and massive performance gains demonstrate significant real-world applicability and broad impact. While Paper 2 offers an elegant theoretical approach to reasoning, Paper 1's immediate relevance to the rapidly growing field of autonomous agents and its strong transferability results give it a higher potential for widespread scientific and practical impact.