SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang, Ziwei Zhou, Zisu Huang, Yan Li, Xuemei Gao

May 22, 2026

arXiv:2605.23904v1

cs.AI (primary), cs.CL

#94 of 2320 · Artificial Intelligence

Tournament Score: 1546 ± 48 (axis range 1050 to 1800)

Win Rate: 82%

Rating: 7.8 / 10

Significance 8 · Rigor 7 · Novelty 7.5 · Clarity 8.5


Abstract

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SkillOpt

1. Core Contribution

SkillOpt frames agent skill optimization as a controlled text-space training process, drawing an explicit operational analogy to deep-learning weight optimization. The key insight is that a natural-language "skill document" prepended to a frozen LLM agent can be treated as an optimizable external state, with a separate optimizer model proposing bounded add/delete/replace edits gated by held-out validation. The system introduces several mechanisms mapped from weight-space optimization: a textual learning-rate budget (edit count cap per step), cosine/linear schedules, minibatch reflection over rollout trajectories, a rejected-edit buffer (analogous to negative gradient memory), and an epoch-wise slow/meta update (analogous to momentum). The deployed artifact is a compact markdown file (~300–2,000 tokens) requiring zero additional inference-time model calls.

The problem being solved is genuine: current approaches to adapting frozen LLM agents (hand-crafted prompts, one-shot skill generation, or loosely controlled evolutionary self-revision) lack the stability guarantees and reproducibility of principled optimization. SkillOpt fills this gap with a disciplined, validation-gated editing loop that, by construction, never accepts a skill scoring below the starting point on the held-out validation split (a direct consequence of the strict acceptance criterion).
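The loop described in this section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`cosine_budget`, `apply_edit`, `optimize_skill`, `demo`), the line-level edit representation, and the toy scorer and proposer are all hypothetical stand-ins. In the real system the proposer would be the optimizer model reflecting on scored rollouts, and the scorer would be a held-out benchmark evaluation.

```python
import math
import random
from typing import Callable, List, Set, Tuple

# Hypothetical edit format: (operation, line index, new text).
Edit = Tuple[str, int, str]


def cosine_budget(step: int, total_steps: int, max_edits: int, min_edits: int = 1) -> int:
    """Assumed 'textual learning rate': cap on edits per step, decayed by a cosine."""
    frac = step / max(1, total_steps - 1)
    value = min_edits + 0.5 * (max_edits - min_edits) * (1 + math.cos(math.pi * frac))
    return max(min_edits, round(value))


def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Apply one bounded add/delete/replace edit to a copy of the skill lines."""
    op, idx, text = edit
    out = list(skill)
    if op == "add":
        out.insert(min(idx, len(out)), text)
    elif op == "delete" and 0 <= idx < len(out):
        del out[idx]
    elif op == "replace" and 0 <= idx < len(out):
        out[idx] = text
    return out


def optimize_skill(
    skill: List[str],
    propose: Callable[[List[str], Set[Edit]], List[Edit]],
    validate: Callable[[List[str]], float],
    total_steps: int,
    max_edits: int,
) -> Tuple[List[str], List[float]]:
    """Accept a candidate only if it strictly improves the validation score."""
    best_score = validate(skill)
    rejected: Set[Edit] = set()  # rejected-edit buffer
    history = [best_score]
    for step in range(total_steps):
        budget = cosine_budget(step, total_steps, max_edits)
        edits = [e for e in propose(skill, rejected) if e not in rejected][:budget]
        candidate = skill
        for e in edits:
            candidate = apply_edit(candidate, e)
        score = validate(candidate)
        if score > best_score:        # strict improvement gate
            skill, best_score = candidate, score
        else:
            rejected.update(edits)    # remember what did not help
        history.append(best_score)
    return skill, history


def demo(seed: int = 0) -> Tuple[List[str], List[float]]:
    """Toy run: the scorer rewards useful lines and penalizes noisy ones."""
    rng = random.Random(seed)
    useful = ["check units", "cite the source cell", "verify the final answer"]
    noise = ["be verbose", "guess when unsure"]

    def validate(skill: List[str]) -> float:
        # Toy stand-in for a held-out validation scorer.
        return sum(l in skill for l in useful) - sum(l in skill for l in noise)

    def propose(skill: List[str], rejected: Set[Edit]) -> List[Edit]:
        # Toy stand-in for the optimizer model: random adds and deletes.
        pool: List[Edit] = [("add", len(skill), t) for t in useful + noise if t not in skill]
        pool += [("delete", i, "") for i in range(len(skill))]
        rng.shuffle(pool)
        return pool

    return optimize_skill(["be verbose"], propose, validate, total_steps=12, max_edits=3)
```

Calling `demo()` returns the final skill lines and the per-step history of accepted validation scores. Because a candidate replaces the current skill only on strict improvement, that history cannot decrease, whatever the proposer suggests.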

2. Methodological Rigor

The experimental design is impressively thorough. The evaluation spans 52 (model, benchmark, harness) cells across 6 benchmarks (SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMathematicianBench, ALFWorld), 7 target models (GPT-5.5 through Qwen3.5-4B), and 3 execution harnesses (direct chat, Codex, Claude Code). Deterministic train/selection/test splits with a fixed seed ensure reproducibility, and all reported scores are on held-out test data, not validation.

The baseline comparison is comprehensive: no-skill, human-written, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill. The ablation study (Tables 2–3) systematically isolates the contribution of each component—learning rate bounds, rejected-edit buffer, slow/meta update, batch sizes, and schedules. The finding that removing both meta skill and slow update drops SpreadsheetBench by 22.5 points is compelling evidence for these components' necessity.

However, several methodological concerns warrant attention:

The validation gate introduces a monotonicity guarantee by construction, so the "52/52 best or tied" claim is less surprising than it first appears: SkillOpt can never fall below its starting point on the validation split, while baselines like TextGrad can. This is a design advantage, but it also means the comparison is not entirely apples-to-apples.

Optimizer model asymmetry: SkillOpt uses GPT-5.5 as the optimizer, which is the strongest model in the evaluation. While Table 5 shows target-matched optimizers still work, the main results benefit from this asymmetry. Baselines like TextGrad and GEPA may not have been given equivalent computational budgets.

Training cost: The method requires significant rollout computation (millions of training tokens). The cost-per-point analysis (Table 6) is helpful but doesn't directly compare with the computational overhead of baselines.

Statistical significance: No error bars or confidence intervals are reported across any of the 52 cells, which is a notable omission for a paper making such strong claims.
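The monotonicity concern raised above can be made precise with a short derivation. The notation is assumed here for illustration and does not come from the paper: S_t is the skill after step t, \tilde{S}_t the candidate produced by that step's proposed edits, V the held-out validation score, T the test score, and K the number of steps.

```latex
% Strict acceptance rule, as described in the abstract:
\[
S_{t+1} =
\begin{cases}
\tilde{S}_t & \text{if } V(\tilde{S}_t) > V(S_t),\\
S_t & \text{otherwise,}
\end{cases}
\qquad\Longrightarrow\qquad
V(S_{t+1}) \ge V(S_t)\ \ \forall t
\quad\text{and hence}\quad
V(S_K) \ge V(S_0).
\]
% No analogous inequality follows for the test score:
% T(S_K) - T(S_0) may be negative if accepted edits overfit the selection split.
```

The guarantee therefore holds for the validation score only. Whether it carries over to the reported test scores is an empirical question, which is one reason confidence intervals across the 52 cells would be informative.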

3. Potential Impact

The practical implications are substantial:

No weight updates required: This is critical for closed-source frontier models where fine-tuning is impossible. A compact text artifact becomes the sole adaptation mechanism.

Cross-harness and cross-model transfer: The demonstration that a Codex-trained skill transfers to Claude Code (+59.7 on SpreadsheetBench) is a strong deployment signal—optimize once, deploy across environments.

Interpretability: The learned skills are human-readable and auditable (Figure 4), which is rare for automated optimization methods.

Cost amortization: The skill is trained offline and adds zero inference-time cost, making it deployable at scale.

The broader influence could extend to: prompt engineering automation, domain adaptation for API-only models, agent configuration search, and potentially as a complementary technique to RLHF/DPO for behavioral steering through external text.

4. Timeliness & Relevance

This work is highly timely. The proliferation of agentic frameworks (Codex, Claude Code, SWE-Agent) and the increasing use of frozen frontier models through APIs create an urgent need for systematic adaptation methods that don't require weight access. The "skills as trainable objects" framing addresses a genuine gap between prompt engineering (too manual) and fine-tuning (often unavailable). The concurrent explosion of skill-related papers (Trace2Skill, EvoSkill, SkillsBench, SkillForge) from early 2026 confirms this is an active research frontier where SkillOpt appears to be setting the methodological standard.

5. Strengths & Limitations

Key Strengths:

Remarkably comprehensive evaluation (52 cells, 7 models, 3 harnesses, 6 baselines)

The optimization framework analogy is not merely decorative—each component has a functional role supported by ablations

Transfer experiments across models, harnesses, and benchmarks demonstrate genuine generalization

The deployed artifact is compact, interpretable, and zero-cost at inference

Edit economy (1–4 accepted edits for large gains) is striking evidence of the validation gate's effectiveness

Notable Limitations:

The strict validation gate guarantees that the skill never regresses on the validation split, making the "best on all 52 cells" claim less surprising than it appears

Reliance on automatic scorers limits applicability to open-ended or subjective tasks (acknowledged in limitations)

No statistical significance testing across any results

The deep-learning analogy, while useful, may overstate the theoretical grounding—there's no convergence theory or formal analysis of the optimization landscape

Single skill per domain is a clear scaling limitation for heterogeneous domains

The paper uses future model names (GPT-5.5, dated May 2026), which raises questions about reproducibility for the current research community

Overall Assessment: SkillOpt presents a well-engineered and thoroughly evaluated system that establishes a new paradigm for agent skill optimization. The breadth of evaluation is exceptional, the transfer results are practically valuable, and the framework is conceptually clean. The main weaknesses are the lack of statistical rigor in reporting and the inherent advantage of a monotonic-by-design system compared to baselines without such guarantees. Nevertheless, the magnitude of gains (especially on procedural benchmarks) and the consistency across settings make this a high-impact contribution.

Rating: 7.8 / 10

Significance 8 · Rigor 7 · Novelty 7.5 · Clarity 8.5

Generated May 25, 2026

Comparison History (17)

vs. AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

gpt-5.2 · 5/25/2026

Paper 1 has higher impact potential: it proposes a concrete, novel, and rigorously controlled optimization framework (text-space “optimizer” with held-out validation gating, stability mechanisms, and zero inference-time overhead) and reports broad empirical wins across many models/benchmarks/harnesses with transfer evidence—suggesting actionable, reproducible improvements for agent reliability and deployment. Paper 2 is a valuable, timely survey that may shape terminology and evaluation norms, but it contributes less direct methodological innovation and lacks empirical validation of a new system, typically yielding lower immediate scientific/engineering impact than a strong, generalizable technique.

vs. Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

gpt-5.2 · 5/25/2026

Paper 2 has higher potential impact due to a more clearly novel and generalizable contribution: a controllable, validation-gated “text-space optimizer” for agent skills that behaves like an optimizer with stability mechanisms (budgets, rejected-edit buffer, slow/meta updates). It demonstrates broad, rigorous evaluation (6 benchmarks, 7 models, 3 harnesses, 52 cells) plus transfer across models and environments, supporting robustness and real-world deployability with zero added inference cost. Paper 1 is timely and useful for efficiency, but is closer to re-factoring planning/self-regulation within CoT and may be less broadly transformative.

vs. EVE-Agent: Evidence-Verifiable Self-Evolving Agents

claude-opus-4.6 · 5/25/2026

SkillOpt demonstrates broader empirical impact across 52 evaluation cells spanning six benchmarks, seven models, and three execution harnesses, consistently outperforming all competitors. Its framing of text-space skill optimization as analogous to weight-space optimization introduces a novel paradigm with wide applicability to any LLM agent. The transferability results across models and environments further strengthen its practical impact. While EVE-Agent addresses an important trustworthiness problem in self-evolving search agents, its scope is narrower (search agents only), and its contribution is more incremental—adding an evidence verification step to existing proposer-solver frameworks.

vs. Rollout Cards: A Reproducibility Standard for Agent Research

gpt-5.2 · 5/25/2026

Paper 1 has higher potential impact because it targets a foundational, cross-cutting bottleneck—reproducibility and evaluation integrity in agent research—affecting most subfields using agent rollouts. The rollout-card standard plus reference implementation can become infrastructure adopted broadly, enabling reanalysis, audits, and fairer comparisons, with immediate relevance as agent benchmarks proliferate. Its empirical audit of repositories and demonstrated score/ranking sensitivity to reporting rules strengthens methodological rigor and urgency. Paper 2 is promising and practically useful, but its impact is narrower (skill optimization technique) and more contingent on specific agent setups.

vs. CIVeX: Causal Intervention Verification for Language Agents

gemini-3.1 · 5/25/2026

Paper 1 introduces causal inference principles to language agent tool use, addressing a fundamental limitation in how agents learn from observational data. Bridging causality and LLM agents opens a highly novel and rigorous research direction for agent safety and reliability. While Paper 2 offers excellent empirical results and a strong methodological approach to skill optimization, Paper 1's conceptual leap of treating tool calls as causal interventions provides a deeper theoretical contribution with broader long-term scientific impact across AI and decision-making.

vs. Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

gpt-5.2 · 5/25/2026

Paper 2 (SkillOpt) likely has higher impact due to broader applicability and stronger empirical evidence. It proposes a reproducible, optimizer-like framework for improving agent “skills” with controlled edits and strict validation, yielding large, consistent gains across many models, benchmarks, and execution harnesses, plus demonstrated transfer. This suggests immediate real-world deployment value for agent reliability and performance. Paper 1 is novel conceptually (environment synthesis for RLVR) but shows modest gains on a narrower setting and may be harder to generalize; its impact is promising but less immediately broad.

vs. Advancing Mathematics Research with AI-Driven Formal Proof Search

claude-opus-4.6 · 5/25/2026

Paper 2 demonstrates AI solving genuinely open mathematical problems (9 Erdős problems, 44 OEIS conjectures) with formal verification, representing a landmark achievement in AI-for-mathematics. This has transformative implications for mathematical research methodology and is already being deployed across multiple mathematical subfields. While Paper 1 presents a solid engineering contribution for optimizing agent skills with impressive benchmarks, it is ultimately an incremental advance in prompt/skill optimization. Paper 2's direct scientific discoveries and paradigm-shifting potential for mathematics research give it substantially higher impact.

vs. Process Matters more than Output for Distinguishing Humans from Machines

gpt-5.2 · 5/25/2026

Paper 2 likely has higher scientific impact due to a broadly applicable, optimizer-like framework for improving agent skills with strong empirical breadth (six benchmarks, seven models, three harnesses, 52 cells), clear methodological controls (held-out validation acceptance, stability mechanisms), and immediate real-world applicability to agent performance without added inference cost. Its “text-space optimization” paradigm can influence LLM/agent training, tool-use, and deployment workflows across domains. Paper 1 is novel and timely for human–machine discrimination, but its impact may be narrower (CAPTCHA-style evaluation and process-feature specification bottlenecks limit generalization).

vs. Foundation Protocol: A Coordination Layer for Agentic Society

claude-opus-4.6 · 5/25/2026

SkillOpt presents a concrete, empirically validated method with extensive benchmarking across 52 evaluation cells, demonstrating consistent improvements. It introduces a novel framing of skill optimization analogous to weight-space training, with rigorous methodology including validation-based acceptance criteria and ablations. Paper 2 (Foundation Protocol) proposes a coordination framework but is primarily architectural/conceptual without empirical validation of its claims. While Paper 2 addresses an important emerging problem, its lack of concrete experimental results and its broad, position-paper-like nature limits near-term scientific impact compared to SkillOpt's reproducible, measurable contributions.

vs. Design and Report Benchmarks for Knowledge Work

gpt-5.2 · 5/25/2026

Paper 2 likely has higher scientific impact: it proposes a concrete, optimizer-like framework (SkillOpt) for reproducible, controllable skill improvement with validation-gated edits, and reports extensive empirical gains across many models/benchmarks/harnesses plus transfer. This combination of methodological specificity, measurable performance improvements, and near-term deployability makes it broadly useful for agent training and tooling. Paper 1 offers important conceptual guidance for benchmark design and interpretation, but is less directly actionable as a new method and may yield slower, more diffuse downstream adoption compared to a strong, general-purpose optimization technique.

vs. Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

gpt-5.2 · 5/25/2026

Paper 2 has higher estimated impact due to broader applicability and timeliness: a systematic, controllable “text-space optimizer” for agent skills can improve many existing and future agentic systems without changing base model weights, and transfers across models/harnesses. Its methodology (held-out validation gating, bounded edits, learning-rate budget, stability mechanisms) resembles reproducible optimization and is likely to influence practice across AI agents, software engineering, and RL-style evaluation. Paper 1 is a strong, rigorous architectural advance for linear attention/sequence models, but its impact is narrower to efficient long-context model design.

vs. GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

claude-opus-4.6 · 5/25/2026

SkillOpt introduces a novel and practically impactful framework: treating agent skills as trainable text-space parameters with optimizer discipline (learning rate, validation, edit buffers). It demonstrates broad empirical impact across 52 evaluation cells, multiple models, and execution harnesses, with large accuracy gains (+19 to +25 points). Its transferability across models and environments enhances real-world applicability. While GENSTRAT contributes a useful benchmarking methodology for strategic reasoning, SkillOpt addresses a more fundamental problem in AI agent development with a generalizable optimization paradigm that could influence how agent skills are developed across the field.

vs. Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

claude-opus-4.6 · 5/25/2026

SkillOpt demonstrates broader impact with its systematic text-space optimizer for agent skills, showing improvements across 52 evaluation cells spanning six benchmarks, seven models, and three execution harnesses. Its conceptual contribution—treating skills as optimizable external state with learning-rate budgets and validation-based acceptance—introduces a novel paradigm analogous to weight-space optimization. The transfer results across models, environments, and tasks further strengthen its generalizability. Co-ReAct contributes meaningful step-level rubric guidance for ReAct agents but addresses a narrower problem scope with fewer benchmarks and less demonstrated generalization.

vs. Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

gemini-3.1 · 5/25/2026

Paper 1 introduces a highly innovative framework that applies deep learning optimization principles to text-space agent skills. Its extensive evaluation across multiple benchmarks, models, and execution environments demonstrates remarkable robustness and significant performance gains. This systematic approach to skill optimization has higher potential for widespread adoption and broad methodological impact compared to Paper 2's narrower focus on library hygiene and lifecycle management.

vs. From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

gemini-3.1 · 5/25/2026

Paper 1 introduces a highly novel, text-space optimization algorithm for agent skills inspired by deep learning optimizers. Its systematic approach yields massive performance gains across diverse benchmarks and execution harnesses. By proving that text-based skills can be rigorously optimized and transferred across model scales, it fundamentally advances agent training methodology, promising broad and immediate real-world applications.

vs. Toward AI VIS Co-Scientists: A General and End-to-End Agent Harness for Solving Complex Data Visualization Tasks

claude-opus-4.6 · 5/25/2026

SkillOpt introduces a fundamentally novel paradigm, treating agent skills as optimizable external state analogous to weight-space optimization, with rigorous methodology (52 evaluation cells, 7 models, 3 harnesses, ablations, transfer experiments). Its broad applicability across any LLM agent system, strong empirical gains (+19 to +25 points), and transferability across models and environments suggest wide adoption potential. Paper 1, while impressive as an engineering contribution to visualization, is more domain-specific and represents incremental progress in applying agentic AI to a narrower application area rather than introducing a general-purpose methodology.

vs. Beyond Binary Edits Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

claude-opus-4.6 · 5/25/2026

SkillOpt introduces a fundamentally new paradigm—treating agent skills as trainable text-space parameters with optimizer-like discipline—which is highly novel and broadly applicable. Its comprehensive evaluation across 52 cells (6 benchmarks, 7 models, 3 harnesses) with consistent improvements and strong transfer results demonstrates exceptional methodological rigor and generalizability. The practical impact is substantial: zero inference-time overhead and compatibility with multiple execution environments. Paper 2 addresses an important but narrower problem (multimodal knowledge editing generalization) with incremental advances. SkillOpt's breadth of impact across the rapidly growing AI agents field gives it significantly higher potential impact.