SkillOpt: Executive Strategy for Self-Evolving Agent Skills
Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang, Ziwei Zhou, Zisu Huang, Yan Li, Xuemei Gao
Abstract
Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.
AI Impact Assessments
Scientific Impact Assessment: SkillOpt
1. Core Contribution
SkillOpt frames agent skill optimization as a controlled text-space training process, drawing an explicit operational analogy to deep-learning weight optimization. The key insight is that a natural-language "skill document" prepended to a frozen LLM agent can be treated as an optimizable external state, with a separate optimizer model proposing bounded add/delete/replace edits gated by held-out validation. The system introduces several mechanisms mapped from weight-space optimization: a textual learning-rate budget (edit count cap per step), cosine/linear schedules, minibatch reflection over rollout trajectories, a rejected-edit buffer (analogous to negative gradient memory), and an epoch-wise slow/meta update (analogous to momentum). The deployed artifact is a compact markdown file (~300–2,000 tokens) requiring zero additional inference-time model calls.
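The edit budget and bounded edits described above can be illustrated with a small sketch. This is not the authors' implementation: the `Edit` type, the line-level granularity, the function names, and the exact shape of the cosine schedule are all assumptions made for illustration. It only shows how a per-step cap on add/delete/replace edits could decay over training.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Edit:
    """Hypothetical line-level edit; the paper's edit format may differ."""
    op: str                     # "add", "delete", or "replace"
    index: int                  # line position the edit targets
    text: Optional[str] = None  # new line content for add/replace


def cosine_edit_budget(step: int, total_steps: int,
                       max_edits: int, min_edits: int = 1) -> int:
    """Cap on edits at `step`, decaying from max_edits to min_edits (assumed schedule)."""
    progress = step / max(1, total_steps - 1)
    scale = 0.5 * (1.0 + math.cos(math.pi * progress))
    return max(min_edits, round(min_edits + (max_edits - min_edits) * scale))


def apply_edits(skill_lines: List[str], edits: List[Edit], budget: int) -> List[str]:
    """Apply at most `budget` edits, in order, to a copy of the skill document."""
    lines = list(skill_lines)
    for edit in edits[:budget]:
        if edit.op == "add":
            position = max(0, min(edit.index, len(lines)))
            lines.insert(position, edit.text or "")
        elif edit.op == "delete":
            if 0 <= edit.index < len(lines):
                del lines[edit.index]
        elif edit.op == "replace":
            if 0 <= edit.index < len(lines):
                lines[edit.index] = edit.text or ""
        else:
            raise ValueError(f"unknown edit op: {edit.op}")
    return lines


# Usage: the third proposed edit is dropped when the budget is 2.
skill = ["a", "b", "c"]
proposed = [Edit("replace", 1, "B"), Edit("add", 3, "d"), Edit("delete", 0)]
print(apply_edits(skill, proposed, budget=2))  # ['a', 'B', 'c', 'd']
```

Truncating the proposal list is the simplest way to enforce the cap; an optimizer model could equally be asked to rank its edits first, which this sketch does not attempt.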
The problem being solved is genuine: current approaches to adapting frozen LLM agents (hand-crafted prompts, one-shot skill generation, or loosely controlled evolutionary self-revision) lack the stability guarantees and reproducibility of principled optimization. SkillOpt fills this gap with a disciplined, validation-gated editing loop. Because an edit is accepted only when it strictly improves the held-out validation score, that score can never fall below its starting value; the guarantee is by construction and applies to the validation split, not to test performance.
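A minimal sketch of such a validation-gated loop follows, under stated assumptions. The `propose` and `validate` callables, the toy scoring rule, and the choice to store whole rejected candidates in the buffer are hypothetical stand-ins; in the paper the proposer is a separate optimizer model and the score comes from rollouts on a held-out split. The sketch shows only why strict acceptance makes the validation score non-decreasing.

```python
from typing import Callable, List, Tuple

Skill = str


def optimize_skill(
    skill: Skill,
    propose: Callable[[Skill, List[Skill]], Skill],
    validate: Callable[[Skill], float],
    steps: int,
) -> Tuple[Skill, List[float]]:
    """Keep a candidate only if it strictly beats the best validation score."""
    best_score = validate(skill)
    history = [best_score]
    rejected: List[Skill] = []  # assumed buffer format: rejected candidate documents
    for _ in range(steps):
        candidate = propose(skill, rejected)
        score = validate(candidate)
        if score > best_score:           # strict improvement required
            skill, best_score = candidate, score
        else:
            rejected.append(candidate)   # remembered so it is not proposed again
        history.append(best_score)       # non-decreasing by construction
    return skill, history


# Toy stand-ins (hypothetical): useful rules score +1, any other line costs 0.5.
TARGET = {"check units", "cite the source"}
POOL = ["be verbose", "check units", "cite the source"]


def toy_validate(skill: Skill) -> float:
    lines = set(skill.splitlines())
    return len(lines & TARGET) - 0.5 * len(lines - TARGET)


def toy_propose(skill: Skill, rejected: List[Skill]) -> Skill:
    """Append the first pool line that is new and not already rejected."""
    for line in POOL:
        candidate = (skill + "\n" + line).strip()
        if line not in skill.splitlines() and candidate not in rejected:
            return candidate
    return skill


final_skill, scores = optimize_skill("", toy_propose, toy_validate, steps=4)
print(scores)  # [0.0, 0.0, 1.0, 1.0, 2.0]
```

The score trace never decreases because rejected candidates leave the current skill untouched. Nothing in the loop constrains the score on a separate test split, which is why held-out test reporting still matters.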
2. Methodological Rigor
The experimental design is impressively thorough. The evaluation spans 52 (model, benchmark, harness) cells across 6 benchmarks (SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMathematicianBench, ALFWorld), 7 target models (GPT-5.5 through Qwen3.5-4B), and 3 execution harnesses (direct chat, Codex, Claude Code). Deterministic train/selection/test splits with a fixed seed ensure reproducibility, and all reported scores are on held-out test data, not validation.
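For readers unfamiliar with the setup, a deterministic three-way split with a fixed seed might look like the sketch below. The split fractions, the seed value, and the function name are assumptions for illustration; the assessment does not state the proportions the paper actually uses.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    items: Sequence[T],
    seed: int = 0,
    train_frac: float = 0.6,       # assumed proportion
    selection_frac: float = 0.2,   # assumed proportion; the remainder is test
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle indices with a private seeded RNG, then cut into three disjoint parts."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)  # local RNG: global random state is untouched
    n_train = int(len(items) * train_frac)
    n_selection = int(len(items) * selection_frac)
    train = [items[i] for i in order[:n_train]]
    selection = [items[i] for i in order[n_train:n_train + n_selection]]
    test = [items[i] for i in order[n_train + n_selection:]]
    return train, selection, test


# Usage: the same seed always yields the same partition.
train, selection, test = split_dataset(list(range(10)), seed=42)
print(len(train), len(selection), len(test))  # 6 2 2
```

In this arrangement rollouts for proposing edits would come from the training part, accept/reject decisions from the selection part, and reported scores from the test part only.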
The baseline comparison is comprehensive: no-skill, human-written, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill. The ablation study (Tables 2–3) systematically isolates the contribution of each component—learning rate bounds, rejected-edit buffer, slow/meta update, batch sizes, and schedules. The finding that removing both meta skill and slow update drops SpreadsheetBench by 22.5 points is compelling evidence for these components' necessity.
However, several methodological concerns warrant attention:
3. Potential Impact
The practical implications are substantial:
The broader influence could extend to prompt engineering automation, domain adaptation for API-only models, agent configuration search, and potentially use as a complementary technique to RLHF/DPO for behavioral steering through external text.
4. Timeliness & Relevance
This work is highly timely. The proliferation of agentic frameworks (Codex, Claude Code, SWE-Agent) and the increasing use of frozen frontier models through APIs create an urgent need for systematic adaptation methods that don't require weight access. The "skills as trainable objects" framing addresses a genuine gap between prompt engineering (too manual) and fine-tuning (often unavailable). The concurrent explosion of skill-related papers (Trace2Skill, EvoSkill, SkillsBench, SkillForge) from early 2026 confirms this is an active research frontier where SkillOpt appears to be setting the methodological standard.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment: SkillOpt presents a well-engineered and thoroughly evaluated system that establishes a new paradigm for agent skill optimization. The breadth of evaluation is exceptional, the transfer results are practically valuable, and the framework is conceptually clean. The main weaknesses are the lack of statistical rigor in reporting and the inherent advantage of a monotonic-by-design system compared to baselines without such guarantees. Nevertheless, the magnitude of gains (especially on procedural benchmarks) and the consistency across settings make this a high-impact contribution.
Generated May 25, 2026
Comparison History (17)
Paper 1 has higher impact potential: it proposes a concrete, novel, and rigorously controlled optimization framework (text-space “optimizer” with held-out validation gating, stability mechanisms, and zero inference-time overhead) and reports broad empirical wins across many models/benchmarks/harnesses with transfer evidence—suggesting actionable, reproducible improvements for agent reliability and deployment. Paper 2 is a valuable, timely survey that may shape terminology and evaluation norms, but it contributes less direct methodological innovation and lacks empirical validation of a new system, typically yielding lower immediate scientific/engineering impact than a strong, generalizable technique.
Paper 2 has higher potential impact due to a more clearly novel and generalizable contribution: a controllable, validation-gated “text-space optimizer” for agent skills that behaves like an optimizer with stability mechanisms (budgets, rejected-edit buffer, slow/meta updates). It demonstrates broad, rigorous evaluation (6 benchmarks, 7 models, 3 harnesses, 52 cells) plus transfer across models and environments, supporting robustness and real-world deployability with zero added inference cost. Paper 1 is timely and useful for efficiency, but is closer to re-factoring planning/self-regulation within CoT and may be less broadly transformative.
SkillOpt demonstrates broader empirical impact across 52 evaluation cells spanning six benchmarks, seven models, and three execution harnesses, consistently outperforming all competitors. Its framing of text-space skill optimization as analogous to weight-space optimization introduces a novel paradigm with wide applicability to any LLM agent. The transferability results across models and environments further strengthen its practical impact. While EVE-Agent addresses an important trustworthiness problem in self-evolving search agents, its scope is narrower (search agents only), and its contribution is more incremental—adding an evidence verification step to existing proposer-solver frameworks.
Paper 1 has higher potential impact because it targets a foundational, cross-cutting bottleneck—reproducibility and evaluation integrity in agent research—affecting most subfields using agent rollouts. The rollout-card standard plus reference implementation can become infrastructure adopted broadly, enabling reanalysis, audits, and fairer comparisons, with immediate relevance as agent benchmarks proliferate. Its empirical audit of repositories and demonstrated score/ranking sensitivity to reporting rules strengthens methodological rigor and urgency. Paper 2 is promising and practically useful, but its impact is narrower (skill optimization technique) and more contingent on specific agent setups.
Paper 1 introduces causal inference principles to language agent tool use, addressing a fundamental limitation in how agents learn from observational data. Bridging causality and LLM agents opens a highly novel and rigorous research direction for agent safety and reliability. While Paper 2 offers excellent empirical results and a strong methodological approach to skill optimization, Paper 1's conceptual leap of treating tool calls as causal interventions provides a deeper theoretical contribution with broader long-term scientific impact across AI and decision-making.
Paper 2 (SkillOpt) likely has higher impact due to broader applicability and stronger empirical evidence. It proposes a reproducible, optimizer-like framework for improving agent “skills” with controlled edits and strict validation, yielding large, consistent gains across many models, benchmarks, and execution harnesses, plus demonstrated transfer. This suggests immediate real-world deployment value for agent reliability and performance. Paper 1 is novel conceptually (environment synthesis for RLVR) but shows modest gains on a narrower setting and may be harder to generalize; its impact is promising but less immediately broad.
Paper 2 demonstrates AI solving genuinely open mathematical problems (9 Erdős problems, 44 OEIS conjectures) with formal verification, representing a landmark achievement in AI-for-mathematics. This has transformative implications for mathematical research methodology and is already being deployed across multiple mathematical subfields. While Paper 1 presents a solid engineering contribution for optimizing agent skills with impressive benchmarks, it is ultimately an incremental advance in prompt/skill optimization. Paper 2's direct scientific discoveries and paradigm-shifting potential for mathematics research give it substantially higher impact.
Paper 2 likely has higher scientific impact due to a broadly applicable, optimizer-like framework for improving agent skills with strong empirical breadth (six benchmarks, seven models, three harnesses, 52 cells), clear methodological controls (held-out validation acceptance, stability mechanisms), and immediate real-world applicability to agent performance without added inference cost. Its “text-space optimization” paradigm can influence LLM/agent training, tool-use, and deployment workflows across domains. Paper 1 is novel and timely for human–machine discrimination, but its impact may be narrower (CAPTCHA-style evaluation and process-feature specification bottlenecks limit generalization).
SkillOpt presents a concrete, empirically validated method with extensive benchmarking across 52 evaluation cells, demonstrating consistent improvements. It introduces a novel framing of skill optimization analogous to weight-space training, with rigorous methodology including validation-based acceptance criteria and ablations. Paper 2 (Foundation Protocol) proposes a coordination framework but is primarily architectural/conceptual without empirical validation of its claims. While Paper 2 addresses an important emerging problem, its lack of concrete experimental results and its broad, position-paper-like nature limits near-term scientific impact compared to SkillOpt's reproducible, measurable contributions.
Paper 2 likely has higher scientific impact: it proposes a concrete, optimizer-like framework (SkillOpt) for reproducible, controllable skill improvement with validation-gated edits, and reports extensive empirical gains across many models/benchmarks/harnesses plus transfer. This combination of methodological specificity, measurable performance improvements, and near-term deployability makes it broadly useful for agent training and tooling. Paper 1 offers important conceptual guidance for benchmark design and interpretation, but is less directly actionable as a new method and may yield slower, more diffuse downstream adoption compared to a strong, general-purpose optimization technique.
Paper 2 has higher estimated impact due to broader applicability and timeliness: a systematic, controllable “text-space optimizer” for agent skills can improve many existing and future agentic systems without changing base model weights, and transfers across models/harnesses. Its methodology (held-out validation gating, bounded edits, learning-rate budget, stability mechanisms) resembles reproducible optimization and is likely to influence practice across AI agents, software engineering, and RL-style evaluation. Paper 1 is a strong, rigorous architectural advance for linear attention/sequence models, but its impact is narrower to efficient long-context model design.
SkillOpt introduces a novel and practically impactful framework—treating agent skills as trainable text-space parameters with optimizer discipline (learning rate, validation, edit buffers). It demonstrates broad empirical impact across 52 evaluation cells, multiple models, and execution harnesses, with large accuracy gains (+19-25 points). Its transferability across models and environments enhances real-world applicability. While GENSTRAT contributes a useful benchmarking methodology for strategic reasoning, SkillOpt addresses a more fundamental problem in AI agent development with a generalizable optimization paradigm that could influence how agent skills are developed across the field.
SkillOpt demonstrates broader impact with its systematic text-space optimizer for agent skills, showing improvements across 52 evaluation cells spanning six benchmarks, seven models, and three execution harnesses. Its conceptual contribution—treating skills as optimizable external state with learning-rate budgets and validation-based acceptance—introduces a novel paradigm analogous to weight-space optimization. The transfer results across models, environments, and tasks further strengthen its generalizability. Co-ReAct contributes meaningful step-level rubric guidance for ReAct agents but addresses a narrower problem scope with fewer benchmarks and less demonstrated generalization.
Paper 1 introduces a highly innovative framework that applies deep learning optimization principles to text-space agent skills. Its extensive evaluation across multiple benchmarks, models, and execution environments demonstrates remarkable robustness and significant performance gains. This systematic approach to skill optimization has higher potential for widespread adoption and broad methodological impact compared to Paper 2's narrower focus on library hygiene and lifecycle management.
Paper 1 introduces a highly novel, text-space optimization algorithm for agent skills inspired by deep learning optimizers. Its systematic approach yields massive performance gains across diverse benchmarks and execution harnesses. By proving that text-based skills can be rigorously optimized and transferred across model scales, it fundamentally advances agent training methodology, promising broad and immediate real-world applications.
SkillOpt introduces a fundamentally novel paradigm—treating agent skills as optimizable external state analogous to weight-space optimization—with rigorous methodology (52 evaluation cells, 7 models, 3 harnesses, ablations, transfer experiments). Its broad applicability across any LLM agent system, strong empirical gains (+19-25 points), and transferability across models and environments suggest wide adoption potential. Paper 1, while impressive as an engineering contribution to visualization, is more domain-specific and represents incremental progress in applying agentic AI to a narrower application area rather than introducing a general-purpose methodology.
SkillOpt introduces a fundamentally new paradigm—treating agent skills as trainable text-space parameters with optimizer-like discipline—which is highly novel and broadly applicable. Its comprehensive evaluation across 52 cells (6 benchmarks, 7 models, 3 harnesses) with consistent improvements and strong transfer results demonstrates exceptional methodological rigor and generalizability. The practical impact is substantial: zero inference-time overhead and compatibility with multiple execution environments. Paper 2 addresses an important but narrower problem (multimodal knowledge editing generalization) with incremental advances. SkillOpt's breadth of impact across the rapidly growing AI agents field gives it significantly higher potential impact.