From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

Zisu Huang, Jingwen Xu, Yifan Yang, Ziyang Gong, Qihao Yang, Muzhao Tian, Xiaohua Wang, Changze Lv

#433 of 2682 · Artificial Intelligence
Tournament Score: 1487±44
Win Rate: 72% (18 wins, 7 losses, 25 matches)
Rating: 7/10

Abstract

Language agents increasingly improve by reusing skills: structured procedural artifacts distilled from past experience. In particular, domain-level and model-generated skills are especially promising. They offer fast adaptation within a domain by encoding domain-specific recurring procedures, and they scale beyond labor-intensive hand-crafting. However, while extraction methods continue to proliferate, understanding remains limited, with no comprehensive study spanning the full skill lifecycle (experience generation, skill extraction, and skill consumption) to ask whether such skills actually work, when they work, and what makes them succeed or fail. To close this gap, we build a utility-grounded evaluation framework that provides systematic experimental results across extractors and target agents, covering five diverse agentic task domains. We find that model-generated skills are beneficial on average but exhibit non-trivial negative transfer, and that neither extractors nor targets behave uniformly. A model can be a strong extractor yet a weak consumer, or vice versa, with skill utility independent of model scale or baseline task strength. To explain these patterns, we then dissect each lifecycle stage in depth, analyzing how experience composition shapes skill quality, what properties characterize useful skills, and how the same skill transfers across different consumers. Finally, we translate these findings into a concrete meta-skill that guides skill extraction toward the features tied to actual utility, which consistently improves skill quality across domains and substantially reduces negative transfer.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper addresses a genuine gap in the agent skills literature: while numerous methods exist for extracting reusable procedural skills from agent trajectories, no prior work has systematically studied the full skill lifecycle—from experience generation through extraction to consumption—across multiple domains, extractors, and target models. The authors build a utility-grounded evaluation framework spanning five diverse agentic domains (embodied planning, spreadsheet manipulation, software engineering, web search, tool calling), six target models, and five extractor models, producing a comprehensive 30-cell evaluation matrix per domain.

The key contributions are: (1) a lifecycle-spanning evaluation framework with two complementary metrics (Extraction Efficacy and Target Evolvability); (2) empirical findings that model-generated skills help on average but exhibit ~25% negative transfer, with skill utility decoupled from model scale and baseline task strength; (3) stage-by-stage analysis revealing that experience composition, not surface form, drives skill quality; and (4) a validated meta-skill rubric that, when injected into the extraction prompt, consistently improves skill quality across all nine evaluated cells.
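To make the two metrics and the negative-transfer rate concrete, here is a minimal Python sketch. It assumes definitions that this review does not spell out: Extraction Efficacy is taken as the mean utility delta an extractor's skill yields across targets, and Target Evolvability as the mean delta a target gains across extractors. The function names, model names, and numbers are all hypothetical placeholders, not the paper's definitions or results.

```python
from statistics import mean

# Hypothetical utility deltas in percentage points: delta[target][extractor]
# = (target score with that extractor's skill) - (target score with no skill).
# Names and values are placeholders, not results from the paper.
delta = {
    "target_a": {"extractor_x": 4.0, "extractor_y": -1.0},
    "target_b": {"extractor_x": 2.0, "extractor_y": 3.0},
}

def extraction_efficacy(delta, extractor):
    """Assumed definition: mean delta an extractor's skill yields across targets."""
    return mean(row[extractor] for row in delta.values())

def target_evolvability(delta, target):
    """Assumed definition: mean delta a target gains across all extractors' skills."""
    return mean(delta[target].values())

def negative_transfer_rate(delta):
    """Fraction of (target, extractor) cells where the skill lowered performance."""
    cells = [d for row in delta.values() for d in row.values()]
    return sum(d < 0 for d in cells) / len(cells)
```

Under these assumed definitions, a single negative cell in a four-cell matrix gives a negative-transfer rate of 0.25, even though every row and column average is positive. That is the sense in which skills can help on average while still harming specific extractor-target pairings.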

Methodological Rigor

The experimental design is generally sound. The authors use held-out evaluation splits, three independent runs per condition, and systematically vary extractors and targets to disentangle extraction capability from consumption capability. The minimal extraction framework design—deliberately stripping away domain-specific heuristics—is a methodologically appropriate choice that ensures observed differences reflect model capability rather than pipeline engineering.

Several design choices strengthen credibility: the format normalization experiment (Section 5.2) uses Friedman tests to rule out surface-form effects; the pairwise evaluation protocol uses 151 high-gap pairs with 9 independent votes per pair; and the experience composition experiment (Section 5.1) controls for pool size while varying success ratios.
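The pairwise protocol can be illustrated with a short Python sketch. It assumes that the 9 votes per pair are aggregated by simple majority and that accuracy means agreement between the majority vote and the skill with higher measured utility; the review does not state the exact aggregation rule, so both are assumptions, and the data below are placeholders.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label chosen by most judges. With two labels, an odd
    number of votes (such as 9) rules out ties."""
    return Counter(votes).most_common(1)[0][0]

def judge_accuracy(pairs):
    """pairs: list of (votes, better_by_utility) tuples, where votes is a list
    of 'A'/'B' judgments and better_by_utility names the skill with higher
    measured utility. Returns the fraction of pairs where the majority agrees."""
    hits = sum(majority_vote(votes) == truth for votes, truth in pairs)
    return hits / len(pairs)

# Placeholder data with 9 votes per pair (not taken from the paper).
pairs = [
    (["A"] * 7 + ["B"] * 2, "B"),  # judges prefer A, but B is more useful
    (["B"] * 5 + ["A"] * 4, "B"),  # majority agrees with measured utility
]
```

An accuracy well below 0.5 on such pairs would mean the judges systematically prefer the less useful skill, which is stronger evidence of misalignment than chance-level agreement.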

However, there are methodological limitations. The consolidation into a single domain-level skill per extraction is a significant simplification, since real deployments may use skill libraries with retrieval. The authors acknowledge this, but it limits ecological validity. Additionally, the meta-skill validation (Section 6) is tested on only 9 cells (3 domains × 3 targets), which is relatively thin for claiming consistent improvement. The rubric discovery pipeline uses GPT-5.4 to both generate and validate dimensions, introducing potential circularity. The three-run averaging may be insufficient to establish statistical significance for some of the smaller deltas reported.
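The concern about three-run averaging can be made concrete with a small Python sketch. It compares a mean delta against the unpooled standard error of a difference of means; this is one simple way to gauge noise and is not the paper's analysis. The function name and the run scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def delta_with_noise(baseline_runs, skill_runs):
    """Return (mean delta, standard error of the delta) for two sets of
    independent runs, using the unpooled standard error of a difference
    of two means."""
    d = mean(skill_runs) - mean(baseline_runs)
    se = sqrt(stdev(baseline_runs) ** 2 / len(baseline_runs)
              + stdev(skill_runs) ** 2 / len(skill_runs))
    return d, se

# Placeholder scores in percentage points for three runs per condition.
baseline = [50.0, 52.0, 54.0]
with_skill = [51.5, 53.5, 55.5]
d, se = delta_with_noise(baseline, with_skill)
```

With these placeholder numbers the delta is 1.5pp while the standard error is about 1.63pp, so the gain is smaller than its own uncertainty. With only three runs per condition, a delta of this size is hard to separate from run-to-run variation.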

Potential Impact

Practical impact: The finding that textual plausibility inversely correlates with skill utility on high-gap pairs (15.8% accuracy for δ ≥ 5pp) is striking and practically important: it directly warns practitioners against using LLM-as-judge for skill selection. The meta-skill rubric provides a concrete, drop-in improvement applicable to any extraction pipeline.

Conceptual impact: The decoupling of extraction efficacy from task-solving capability (e.g., Gemini-3.1-Flash-Lite as strongest extractor on SpreadsheetBench despite not being the strongest executor) is a genuinely novel insight that reframes how the community should think about skill extraction—as a distinct capability requiring its own evaluation.

Field influence: This work could influence both the agent skills community (by establishing utility-grounded evaluation standards) and the broader agent systems community (by formalizing the lifecycle perspective). The three validated rubric dimensions (Failure Mechanism Encoding, Actionable Specificity, High-Risk Action Blacklist) provide concrete design guidance.

Timeliness & Relevance

This paper is exceptionally timely. Agent skills are becoming a standard component in commercial platforms (e.g., Claude Skills), and the proliferation of extraction methods without systematic understanding creates real deployment risks. The 25% negative transfer rate identified here is practically consequential—organizations deploying skill-augmented agents need exactly this kind of analysis to understand when skills help versus harm.

The paper also arrives at a moment when the community is transitioning from "can we extract skills?" to "should we trust extracted skills?"—making the utility-grounded perspective particularly valuable.

Strengths

1. Comprehensive scope: Five domains × six targets × five extractors is one of the most thorough cross-model evaluations in the agent skills literature.

2. Counter-intuitive findings: The plausibility-utility inversion and the decoupling of extraction from execution capability are genuinely surprising and well-supported.

3. Closed-loop validation: The progression from diagnosis (Sections 4-5) through rubric discovery to validated improvement (Section 6) demonstrates that analytical findings translate into practical gains.

4. Disentangled metrics: EE and TE provide a principled vocabulary for discussing skill utility that separates supply-side from demand-side effects.

5. Reproducibility: Code is released, extraction prompts are fully documented in appendices, and the evaluation protocol is clearly specified.

Limitations

1. Single-skill constraint: Consolidating all experience into one skill per domain is a significant simplification that may not reflect real deployment patterns with skill libraries.

2. Limited meta-skill validation scale: Nine cells across three domains is thin; the improvement magnitudes (+1.55pp average) are modest and close to noise margins.

3. Temporal snapshot: All models are from a narrow release window (2025-2026); findings about model-specific extraction/consumption capabilities may not generalize as models evolve.

4. No smaller/open-model extractors: Excluding Qwen3.5-9B as extractor due to protocol-following failures limits conclusions about scale effects.

5. Domain selection bias: All five domains, including embodied planning, use text-based interaction; multimodal domains and embodied settings with richer state spaces are underrepresented.

6. Experience pool heterogeneity: Pool sizes vary across domains, making cross-domain comparisons of absolute effect sizes difficult to interpret.

Overall Assessment

This is a well-executed empirical study that fills a clear gap in the literature. Its primary value lies in establishing robust empirical facts about model-generated skills—particularly the non-obvious findings about negative transfer, plausibility-utility misalignment, and extraction-execution decoupling—rather than in novel algorithmic contributions. The meta-skill intervention, while effective, is modest in scale. The paper's lasting impact will likely be in shifting community norms toward utility-grounded skill evaluation and in providing the EE/TE framework for future work.

Rating: 7/10
Significance 7.5 · Rigor 7 · Novelty 6.5 · Clarity 8

Generated May 25, 2026

Comparison History (25)

vs. A governance horizon for ethical-use constraints in open-weight AI models
gemini-3.1 · 5/26/2026

Paper 2 addresses the critical and highly timely issue of AI governance and open-weight model proliferation. By conducting a massive empirical audit of over 2.1 million models, it introduces quantifiable metrics like the 'governance horizon' and provides actionable insights into policy design. Its interdisciplinary breadth, impacting AI regulation, platform engineering, and machine learning safety, gives it a higher potential for widespread scientific and real-world policy impact compared to the narrower, domain-specific agent evaluation study in Paper 1.

vs. Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs
claude-opus-4.6 · 5/26/2026

Paper 1 addresses a fundamental and broadly applicable problem in the rapidly growing field of language agents—understanding the full lifecycle of model-generated skills. Its comprehensive evaluation framework spanning five domains, systematic analysis of when and why skills succeed or fail, and actionable meta-skill contribution provide substantial methodological and empirical contributions. The findings about negative transfer and the independence of skill utility from model scale are novel insights with broad implications. Paper 2, while practically useful, addresses a narrower problem (selective safety relaxation) with a more incremental technical contribution (modular parameter merging for safety control), limiting its breadth of impact.

vs. HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models
gpt-5.2 · 5/26/2026

Paper 2 has higher likely impact: it provides a comprehensive, utility-grounded evaluation framework spanning the full lifecycle of agent skill reuse across five domains, identifies failure modes (negative transfer) and non-obvious factors (extractor/consumer mismatch, weak correlation with scale), and distills actionable guidance via a meta-skill that improves outcomes. This breadth, methodological rigor, and timeliness for agentic systems make it broadly useful to multiple subfields (LLM agents, RL, evaluation, tool/skill learning). Paper 1 is novel but more specialized and may generalize less beyond multi-step reasoning.

vs. Inverting the Shield: Systematically Generating Safety Tests from Policy Specifications
gpt-5.2 · 5/26/2026

Paper 1 likely has higher impact due to a more novel, methodologically rigorous bridge between formal methods (FOL compilation, traceable specification-based testing) and LLM safety, yielding automated, coverage-driven, reproducible evaluations with direct real-world governance/compliance applications. It proposes a concrete framework with systematic guarantees and code release, addressing a timely bottleneck in safety assessment. Paper 2 is valuable and broad as a lifecycle evaluation study with practical insights, but is primarily diagnostic/empirical; its core contribution is less fundamentally new than a formalized testing paradigm for safety-critical policy adherence.

vs. Retrying vs Resampling in AI Control
gpt-5.2 · 5/26/2026

Paper 1 is more novel and timely for AI control: it directly analyzes a widely deployed safety mechanism (retrying) under adversarial assumptions, identifies an information-leak failure mode, and provides a concrete alternative (resampling) with actionable design guidance and empirical results that overturn prior conclusions. This has immediate real-world applicability to agent safety pipelines and broader implications for oversight, monitoring, and alignment. Paper 2 is rigorous and broadly useful as a systematic evaluation of skill lifecycles, but its primary contribution is integrative/diagnostic rather than a new safety-critical mechanism, likely yielding slower or more incremental downstream impact.

vs. SPACE: Unifying Symmetric and Asymmetric Routing Problems for Generalist Neural Solver
gemini-3.1 · 5/26/2026

Paper 1 addresses a critical gap in the rapidly expanding field of language agents by providing a comprehensive, systematic study of model-generated skills. Its foundational insights into skill extraction, negative transfer, and the proposed evaluation framework have broad applicability across AI. While Paper 2 offers excellent methodological innovation for routing problems, Paper 1's focus on autonomous agent architectures is likely to have a wider and more immediate impact on current general AI research trajectories.

vs. When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems
gemini-3.1 · 5/25/2026

Paper 1 provides a comprehensive, systematic study of the entire skill lifecycle for language agents. By introducing a utility-grounded evaluation framework, diagnosing failure modes like negative transfer, and proposing a generalizable meta-skill solution, it establishes a foundational baseline for a rapidly growing subfield. Such broad empirical studies typically achieve higher scientific impact and citation counts than Paper 2, which focuses on solving a more specific, albeit important, planning failure mode in multi-agent systems.

vs. The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems
gemini-3.1 · 5/25/2026

Paper 1 offers foundational mathematical proofs establishing hard boundaries on transformer reasoning capabilities (a 'Deterministic Horizon'). By proving that reasoning depth is strictly bounded by architecture regardless of training scale, it directly challenges current AI scaling paradigms. Its broad scope spans information theory, circuit complexity, and multi-stage pipelines, offering computable limits prior to deployment. In contrast, Paper 2 is a valuable but narrower empirical study on language agent skill reuse. Paper 1's theoretical rigor and potential to fundamentally alter AI architecture design give it significantly higher potential scientific impact.

vs. Parallel Context Compaction for Long-Horizon LLM Agent Serving
claude-opus-4.6 · 5/25/2026

Paper 1 presents a comprehensive, first-of-its-kind evaluation framework spanning the full lifecycle of model-generated agent skills, revealing fundamental insights about skill transfer, negative transfer, and the disconnect between extraction and consumption abilities. Its findings (e.g., skill utility being independent of model scale) and the actionable meta-skill contribution have broad implications for the rapidly growing field of language agents. Paper 2 addresses a practical engineering problem (context compaction) with useful but more incremental contributions—parallel summarization with better throughput—that have narrower impact scope and less conceptual novelty.

vs. Latent-space Attacks for Refusal Evasion in Language Models
gpt-5.2 · 5/25/2026

Paper 2 has higher likely scientific impact: it introduces a comprehensive, utility-grounded framework covering the full lifecycle of agent skill generation, extraction, and consumption across five domains, yielding broadly applicable findings (negative transfer, extractor/consumer mismatch) and a practical meta-skill that improves outcomes. This is methodologically systematic and relevant to a growing area (agentic LMs), with clear implications for building robust, reusable agent skills across models and tasks. Paper 1 is novel and strong but is narrower (jailbreak/refusal evasion) and more security-specific.

vs. Agentic Proving for Program Verification
gemini-3.1 · 5/25/2026

Paper 1 provides a foundational, systematic framework for understanding the entire lifecycle of model-generated agent skills across diverse domains, culminating in a novel 'meta-skill' to improve extraction. Its broader theoretical insights and methodological rigor offer wider applicability to agent design. In contrast, Paper 2 primarily offers an empirical evaluation of a specific model (Claude Code) on a specific benchmark, which, while valuable for identifying benchmark saturation, has a narrower scope and less generalized theoretical impact.

vs. ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization
claude-opus-4.6 · 5/25/2026

Paper 1 addresses a fundamental and broadly applicable question about skill reuse in language agents across diverse domains, providing a comprehensive evaluation framework and actionable insights (meta-skills) that reduce negative transfer. Its breadth of impact spans the entire LLM agent community, which is rapidly growing. Paper 2, while technically rigorous and novel in its neurosymbolic approach to proof optimization, targets a narrower audience (formal mathematics/theorem proving). Paper 1's findings about skill lifecycle, transferability, and the disconnect between extraction and consumption abilities offer widely applicable design principles for the broader AI agent ecosystem.

vs. Foundation Protocol: A Coordination Layer for Agentic Society
gemini-3.1 · 5/25/2026

Paper 1 offers a rigorous, empirical study on a highly relevant technical problem (agent skill learning), providing actionable insights and concrete methodological improvements. Its strong empirical grounding and immediate applicability to current AI research give it a higher potential for direct scientific impact compared to the broader, more conceptual, and less empirically grounded protocol proposed in Paper 2.

vs. CP or DP? Why Not Both: A Case Study in the Partial Shop Scheduling Problem
claude-opus-4.6 · 5/25/2026

Paper 2 addresses a broadly impactful topic—improving language agents through systematic skill reuse—which is highly timely given the rapid growth of LLM-based agents. It provides a comprehensive evaluation framework spanning the full skill lifecycle, offers actionable insights (meta-skills to reduce negative transfer), and spans five diverse domains, giving it broad applicability. Paper 1, while presenting a clean hybrid CP/DP integration, explicitly acknowledges it is not competitive with state-of-the-art solvers, and its contribution is primarily a proof-of-concept for a niche scheduling problem, limiting its broader impact.

vs. Design and Report Benchmarks for Knowledge Work
claude-opus-4.6 · 5/25/2026

Paper 1 provides a comprehensive empirical framework for understanding model-generated agent skills across the full lifecycle, with actionable findings (e.g., meta-skills reducing negative transfer) backed by experiments across five domains. It addresses a core challenge in the rapidly growing field of language agents with concrete, reproducible contributions. Paper 2 offers valuable conceptual guidance for benchmark design in knowledge work but is primarily a methodological/position paper with case analyses rather than novel empirical results, limiting its immediate technical impact despite its relevance.

vs. Beyond Binary Edits Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment
gemini-3.1 · 5/25/2026

Paper 2 presents a foundational, systematic study on language agent skill learning, a rapidly expanding field. By providing a comprehensive evaluation framework and uncovering insights like negative transfer across the entire skill lifecycle, it offers broader theoretical and practical impact for agentic AI than Paper 1's highly specialized, though rigorous, focus on multimodal knowledge editing.

vs. Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning
claude-opus-4.6 · 5/25/2026

Paper 1 presents a comprehensive evaluation framework for the full lifecycle of model-generated agent skills, addressing a fundamental gap in understanding skill extraction and consumption. Its systematic study across five domains, identification of non-trivial negative transfer patterns, and actionable meta-skill contribution have broad applicability across the growing field of language agents. Paper 2, while achieving strong benchmark results with an elegant simplification (self-distillation via GRPO+OFSD), is more incremental—optimizing a specific training pipeline for search-augmented reasoning. Paper 1's breadth of analysis and generalizable insights give it higher potential to influence diverse research directions.

vs. Redrawing the AI Map: A Theory of Accountability Boundaries in Agentic Ecosystems
claude-opus-4.6 · 5/25/2026

Paper 2 addresses a concrete, empirically grounded problem in AI agent development with systematic experiments, actionable findings (meta-skill framework), and reproducible methodology across five domains. Its contributions—identifying negative transfer, decoupling extractor/consumer roles, and providing practical improvements—have immediate applicability to the rapidly growing LLM agent community. Paper 1, while intellectually rich, is a theoretical/conceptual contribution to IS governance literature with narrower audience appeal and less empirical validation, relying on structured illustrations rather than rigorous experiments.

vs. One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents
claude-opus-4.6 · 5/25/2026

Paper 1 addresses a fundamental and broadly applicable problem in the rapidly growing field of language agents—understanding and improving skill reuse across the full lifecycle. Its systematic framework spanning five domains, identification of negative transfer patterns, and actionable meta-skill contribution have broad implications for the entire LLM agent community. Paper 2, while technically strong and novel in NPC persona conditioning, targets a narrower application domain (game NPCs). Paper 1's findings about skill extraction, consumption, and transferability are more likely to influence diverse downstream research areas.

vs. MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis
claude-opus-4.6 · 5/25/2026

MindLoom addresses a critical bottleneck in LLM development—generating high-quality frontier-level reasoning training data—with a novel compositional framework (thought modes) that is rigorously evaluated across 9 benchmarks, 5 STEM disciplines, and multiple model families. Its open-sourced implementation and direct applicability to improving reasoning capabilities of frontier models gives it broad, immediate impact. Paper 2 provides valuable systematic analysis of agent skills but is more of an empirical study with incremental contributions (meta-skill). MindLoom's novelty in decomposing reasoning difficulty and its practical utility for data synthesis positions it for higher impact.