SkillGrad: Optimizing Agent Skills Like Gradient Descent
Hanyu Wang, Yifan Lan, Bochuan Cao, Lu Lin, Jinghui Chen
Abstract
Agent skills provide a lightweight way to adapt LLM agents to specialized domains by storing reusable procedural knowledge in structured files. However, whether downloaded from third parties or self-generated, these skills are often unreliable, incomplete, or outdated. Existing skill-evolution methods often address these deficiencies through heuristic reflections without an explicit optimization formulation. In this paper, we propose SkillGrad, a gradient-descent-inspired framework for optimizing agent skills. SkillGrad treats the skill package as a structured parameter to optimize in a gradient descent fashion: task executions provide trajectory-level loss evidence, and automatic diagnoses then provide text-based gradients that indicate the correction directions. To stabilize optimization across iterations, a momentum agent accumulates recurring diagnostic patterns into a persistent memory overlay. Finally, an LLM-based patcher executes the parameter update by applying layer-aware edits to the skill package. Evaluated on SpreadsheetBench Verified and WikiTableQuestions, SkillGrad consistently outperforms training-based skill evolution baselines across two backbone LLMs, improving over the strongest training-based baseline by percentage points on average. Ablations further show that momentum and contrastive diagnosis both contribute to the final skill quality.
AI Impact Assessments
Scientific Impact Assessment: SkillGrad
1. Core Contribution
SkillGrad proposes a gradient-descent-inspired framework for iteratively optimizing structured "agent skills" — persistent file packages that encode domain-specific procedural knowledge for LLM agents. The key conceptual move is mapping components of gradient descent (parameters, loss, gradients, momentum, updates) onto textual skill optimization: the skill package is the parameter, task execution outcomes serve as loss evidence, LLM-generated diagnoses act as gradient signals, a momentum agent accumulates recurring patterns across iterations, and a layer-aware patcher applies structured updates.
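The mapping described above can be made concrete with a small toy loop. The following is only an illustrative sketch, not the authors' implementation: the LLM-based agent, diagnoser, and patcher are replaced by deterministic stand-ins, and every name here (SkillPackage, run_task, diagnose, optimize, the "core" and "reference" layers, recur_threshold) is a hypothetical choice made for this example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SkillPackage:
    """The 'parameter': named layers of procedural rules (layer names are hypothetical)."""
    layers: dict = field(default_factory=lambda: {"core": [], "reference": []})


def run_task(skill, task):
    """Stand-in for agent execution: succeeds if the skill already holds the needed rule."""
    known = {rule for rules in skill.layers.values() for rule in rules}
    return task["needs"] in known


def diagnose(task):
    """Stand-in for the LLM diagnoser: the 'text gradient' names the missing rule."""
    return task["needs"]


def optimize(skill, tasks, steps=3, recur_threshold=2):
    """Toy diagnose -> accumulate -> patch loop over a fixed task set."""
    momentum = Counter()  # persistent memory of how often each diagnosis recurs
    for _ in range(steps):
        # Failed executions play the role of loss evidence.
        failures = [t for t in tasks if not run_task(skill, t)]
        if not failures:
            break
        gradients = [diagnose(t) for t in failures]
        momentum.update(gradients)
        for g in sorted(set(gradients)):
            # Layer-aware patch (assumed policy): recurring issues are promoted
            # to the core layer, one-off issues go to the reference layer.
            layer = "core" if momentum[g] >= recur_threshold else "reference"
            if g not in skill.layers[layer]:
                skill.layers[layer].append(g)
    return skill, momentum


tasks = [
    {"needs": "check header row"},
    {"needs": "check header row"},
    {"needs": "preserve number format"},
]
skill, momentum = optimize(SkillPackage(), tasks)
print(skill.layers)
```

In this toy setting the rule diagnosed twice lands in the core layer and the rule diagnosed once lands in the reference layer. A real system would replace each stand-in with an LLM call and would need a validation step before accepting a patch.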
The paper addresses a genuine practical problem: automatically generated or third-party agent skills are often unreliable and can even degrade agent performance relative to using no skill at all (the paper reports drops of 7-8 percentage points). SkillGrad provides a principled iterative refinement loop rather than one-shot generation or ad hoc reflection.
2. Methodological Rigor
Strengths in experimental design:
Weaknesses:
3. Potential Impact
Practical applications: The framework addresses a real need in enterprise AI deployment where agents must be customized for domain-specific workflows. The skill-as-optimizable-artifact paradigm could reduce the cost of manual skill engineering.
Broader influence: The structured optimization loop (diagnose → accumulate patterns → patch) could generalize beyond spreadsheet skills to other agent domains. The layer-aware update concept (deciding what knowledge goes where in a hierarchical skill structure) is a useful design principle.
Limitations on impact: The framework is computationally expensive (full agent execution per training example), limiting scalability. The reliance on frontier LLMs for all sub-agents (diagnoser, momentum agent, patcher) makes the approach costly and dependent on API access. The domain specificity of the evaluation makes it hard to assess transferability.
4. Timeliness & Relevance
This paper is well-timed. The agent skills paradigm is gaining traction (SkillsBench, EvoSkill, Trace2Skill, SkillX, etc., all from 2026), and the need for systematic skill improvement is widely recognized. The paper positions itself clearly within this emerging ecosystem and provides a structured alternative to heuristic reflection approaches. The growing deployment of LLM agents in enterprise settings makes reliable skill optimization increasingly important.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Additional Observations
The paper's extensive appendix (detailed prompts, qualitative examples, training dynamics) is commendable for transparency. The cost analysis showing USD 6.40 per run is practical information. However, the framework's complexity (four specialized agents with elaborate prompts totaling thousands of words) raises questions about whether simpler approaches could achieve comparable results. The 6.7 pp average improvement over baselines is meaningful but not transformative, particularly given the complexity overhead.
The contribution is best understood as a well-engineered system paper with a useful conceptual framework, rather than a fundamental methodological advance. It makes a solid contribution to the emerging agent skills literature but would need broader domain evaluation and stronger baselines to establish lasting impact.
Generated May 28, 2026
Comparison History (20)
SkillGrad introduces a more novel conceptual framework—treating agent skills as optimizable parameters with gradient-descent-inspired updates, momentum, and contrastive diagnosis. This metaphor bridges optimization theory and LLM agent adaptation in a creative way with broader applicability across domains. While RoRo addresses the important but narrower problem of stepwise model routing with process rewards, SkillGrad's framework for skill optimization is more generalizable, addresses a widely relevant problem (adapting LLM agents to new domains), and offers a paradigm that could influence future work on agent self-improvement more broadly.
Paper 1 likely has higher scientific impact due to a clear, concrete systems contribution with immediate real-world applicability: a 195M-parameter generalist image-editing diffusion model enabling fast (290ms) privacy-preserving on-device inference. This addresses timely deployment constraints (latency, memory, privacy) and can influence both mobile ML optimization and practical consumer products. Paper 2 is conceptually interesting but relies on LLM-based “text gradients,” which may be less methodologically grounded and more sensitive to prompt/LLM choices; its impact may be narrower and harder to generalize beyond specific benchmarks.
Paper 1 is more novel methodologically, reframing skill refinement for LLM agents as an explicit optimization loop with diagnosed “text gradients,” momentum-style memory, and structured patching—an approach likely reusable across many agent/skill frameworks and tasks. Its applications span tool-using agents, workflow automation, and domain adaptation, with demonstrated gains on established benchmarks and clear ablations supporting rigor. Paper 2 addresses an important, timely application (explainable AI-text detection) but builds on a narrower problem area with potentially faster shifting baselines and policy constraints; impact may be more domain-specific.
HRBench addresses a broader and more fundamental problem—benchmarking and understanding reasoning strategies across hybrid-reasoning LLMs—providing a unified evaluation framework spanning multiple strategy families, training regimes, models, and tasks. Its systematic organization of a rapidly growing design space, with 12+ reimplemented methods and reproducible infrastructure, is likely to serve as a widely-adopted community resource. SkillGrad proposes a clever optimization analogy for agent skills but addresses a narrower problem with evaluation on only two benchmarks. HRBench's breadth, timeliness given the explosion of reasoning LLMs, and infrastructure contribution give it higher potential impact.
Paper 1 addresses multiple critical challenges simultaneously—privacy preservation, bandwidth efficiency, and many-to-many multilingual translation—with a practical edge-cloud architecture achieving state-of-the-art results across 45 languages (1980 directions). Its broader real-world applicability to privacy-sensitive speech translation deployment, combined with substantial technical contributions (10× bandwidth reduction, voiceprint protection) and released code/models, gives it higher potential impact across NLP, systems, and privacy communities. Paper 2, while novel in its gradient-descent analogy for skill optimization, addresses a narrower problem with more incremental improvements on limited benchmarks.
Paper 1 offers a concrete, novel optimization framework for improving LLM agent skills with demonstrated empirical gains on established benchmarks, suggesting near-term applicability and methodological testability. Its gradient-descent-inspired formulation (diagnostic “text gradients,” momentum memory, structured patching) is an actionable contribution likely to be reused in agent/tooling research and practice. Paper 2 is timely and potentially broad, but as described it is more conceptual and depends heavily on adoption and the eventual quality/validity of proposed tasks and protocols; immediate rigor and measurable impact are less certain from the abstract alone.
Paper 1 presents a rigorous, well-evaluated framework (SkillGrad) for optimizing LLM agent skills with clear benchmarks, ablations, and quantitative improvements. It addresses a practical problem in agent development with a novel gradient-descent analogy. Paper 2, while provocative, relies on auto-ethnographic methodology with a single sustained interaction, co-authored by the AI itself, raising serious concerns about scientific rigor, reproducibility, and anthropomorphization. Its claims about AI phenomenology and 'training strata' lack the empirical grounding and methodological controls expected for high-impact work, limiting its influence in mainstream ML research.
Paper 2 (RULER) has higher impact potential due to its novel shift from output-level to representation-level verification for machine unlearning, addressing a critical safety/privacy gap where existing metrics can be gamed. It offers broadly applicable metrics (including an oracle-free option) across multiple modalities and domains (tabular, images, clinical text, face ID), increasing cross-field relevance and real-world applicability for compliance and trustworthy ML. The methodology is relatively rigorous (mixed-effects modeling, effect sizes, multi-method comparisons) and highly timely given regulatory and deployment pressure for provable data removal.
Paper 1 addresses a fundamental cognitive capability in LLMs (Theory of Mind) by shifting evaluation from outcome-based QA to explicit belief modeling. This provides deeper insights into model reasoning and representation bottlenecks. Its rigorous benchmark approach has broad implications for AI safety, alignment, and cognitive science, likely driving more fundamental research than the practical, albeit innovative, agent optimization framework presented in Paper 2.
Paper 2 translates the highly successful continuous optimization paradigm (gradient descent, momentum) into a structured, text-based framework for discrete agent skill evolution. This conceptual innovation provides a powerful and extensible foundation for self-improving LLM agents, a highly active research area. While Paper 1 offers a rigorous and novel approach to prompt debugging using SAT solvers, Paper 2's methodology has broader applicability and stronger potential to inspire follow-up research adapting traditional machine learning optimization techniques to agentic workflows.
Paper 1 addresses the highly active and broadly applicable field of LLM agents. Its novel gradient-descent-inspired approach for text-based skill optimization offers a creative and practical solution to agent adaptability. While Paper 2 presents a strong methodological improvement in multi-agent reinforcement learning and game theory, Paper 1 is likely to have a wider near-term impact due to the widespread integration of LLM agents in various real-world tasks and the high demand for reliable procedural knowledge execution.
Paper 1 introduces a novel and important security threat model ('Sleeper Attack') for LLM agents that reveals vulnerabilities persisting across interactions—a largely unexplored attack surface. Its comprehensive benchmark (1,896 instances, 7 LLMs, multiple attack strategies) and formalization of cross-interaction adversarial persistence address a critical and timely safety concern as LLM agents are widely deployed. Paper 2, while technically solid, offers an incremental optimization framework for agent skills with narrower scope. The security implications of Paper 1 have broader impact across the AI safety community and are more likely to influence future research and deployment practices.
Paper 2 addresses a fundamental limitation of current inference-only LLM deployments, proposing a paradigm shift toward continuous, weight-based personalization. Its substantial improvements in knowledge retention and its valuable methodological insight regarding cross-entropy metrics offer broader, more systemic implications for LLM architecture and deployment than Paper 1's agent-specific optimization framework.
SkillGrad introduces a novel and broadly applicable framework for optimizing LLM agent skills using a gradient-descent analogy, with clear empirical improvements over baselines. Its contributions are more generalizable across domains and LLM applications. Paper 1, while rigorous in its statistical analysis, addresses a narrower problem (Lean-as-judge for math reasoning) and ultimately shows limited practical utility—self-consistency alone achieves 91% accuracy, and the formal verification signal is sparse and often unfaithful, limiting real-world adoption. SkillGrad's broader applicability to the growing LLM agent ecosystem gives it higher impact potential.
Paper 2 presents a generalizable, gradient-descent-inspired optimization framework for improving LLM agent skills, which can be applied across various domains and tasks. In contrast, Paper 1, while addressing important challenges in autonomous agents, focuses primarily on a specific domain benchmark (travel planning). The broader applicability and novel methodological conceptualization of text-based gradients in Paper 2 suggest a higher potential for widespread adoption and foundational impact across the field of AI agents.
SkillGrad proposes a novel, principled optimization framework (gradient-descent-inspired skill evolution) for LLM agents with strong empirical results on established benchmarks, broad applicability across agent systems, and clear methodological contributions (momentum, contrastive diagnosis). Paper 2 provides interesting empirical observations about harness sensitivity but uses a synthetic benchmark, tests only single models per tier (limiting generalizability as the authors acknowledge), and offers more incremental practical guidelines rather than a new methodology. SkillGrad's framework has broader potential adoption and influence across the rapidly growing LLM agent ecosystem.
Paper 1 addresses a critical and timely gap in AI safety evaluation—privacy risks in multi-agent social systems—which has broad implications for deployed AI systems. The finding that social context amplifies privacy violations (from ~20% to ~45%) and that leakage is socially contagious reveals a fundamental blind spot in current safety benchmarks. This has immediate policy and deployment implications across the AI safety community. Paper 2, while technically solid, proposes an incremental optimization framework for agent skills with narrower scope and more limited cross-field impact.
Paper 2 addresses a critical, highly timely challenge: tracking the provenance and lineage of AI-generated content. By proposing 'steganographic inheritance,' it introduces a novel, biologically-inspired framework with broad implications for AI safety, copyright, and combating misinformation. While Paper 1 offers a solid technical optimization for LLM agents, Paper 2's foundational approach to synthetic information provenance offers much broader societal and cross-disciplinary impact.
Paper 2 addresses a fundamental architectural gap in the serving infrastructure for multi-agent LLM systems, proposing a new runtime layer between agent frameworks and serving engines. This has broader impact because: (1) it identifies a systematic architectural problem affecting all multi-agent deployments rather than a specific optimization task; (2) it provides a general abstraction (four primitives) that unifies nine distinct policies; (3) the practical improvements (13-37pp cache hit-rate, 12-29% lower latency) directly reduce serving costs at scale; (4) as multi-agent systems become dominant production workloads, infrastructure-level contributions have multiplicative impact. Paper 1, while solid, addresses the narrower problem of skill optimization with incremental improvements over baselines.
Paper 2 presents a novel, generalized framework (SkillGrad) for optimizing LLM agent skills using a gradient-descent analogy. This approach has broad applicability across various specialized domains in AI agent research. In contrast, Paper 1 offers a highly useful but more niche tool for generating scientific diagrams. The fundamental advancement in agent optimization methodologies in Paper 2 gives it a higher potential for wide-ranging scientific impact across multiple fields relying on autonomous agents.