Anything2Skill: Compiling External Knowledge into Reusable Skills for Agents

Qianjun Pan, Yutao Yang, Junsong Li, Jie Zhou, Kai Chen, Xin Li, Qin Chen, Liang He

Jun 8, 2026 · arXiv:2606.09316v1

cs.AI

#1294 of 3489 · Artificial Intelligence

Tournament Score: 1429 ± 44 (scale 1050 to 1800)

Win Rate: 36%

Rating: 4.5 / 10

Significance 5.5 · Rigor 3.5 · Novelty 6 · Clarity 6.5

Abstract

Retrieval-augmented generation (RAG) enables agents to access external knowledge at inference time, but it primarily retrieves fragmented declarative evidence, leaving agents to repeatedly infer task procedures from passages, manuals, examples, logs, or trajectories. This raises a fundamental question: can skills extracted from external knowledge bases be installed into an agent, enabling it to rapidly approximate domain expertise? In this paper, we propose Anything2Skill, a taxonomy-guided framework that compiles heterogeneous external knowledge into reusable, retrievable, and executable skills for agents. Given a corpus of knowledge records, Anything2Skill first decomposes each record into evidence windows and performs plan-and-expand skill extraction under a skill-tree prior. The extracted candidates are then converted into structured skill contracts that specify invocation conditions, contraindications, action moves, workflow steps, constraints, output specifications, supporting evidence, and confidence scores. To construct a deployable procedural memory, Anything2Skill manages the extracted skills in a persistent SkillBank through taxonomy-aware compilation, registry-level reconciliation, lifecycle tracking, versioned updates, and visible skill-tree projection. At inference time, agents retrieve both task-specific passages from the original knowledge base and relevant procedural skills from the SkillBank, allowing RAG to provide declarative evidence while compiled skills provide reusable procedural guidance. Experiments on qsv and GitHub-CLI show that Anything2Skill combined with RAG achieves 98.85% and 94.10% success rates, respectively, substantially outperforming RAG-only agents. These results suggest that compiling latent procedural knowledge into explicit skills is an effective way to extend retrieval-augmented agents from knowledge access toward capability reuse.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Anything2Skill

1. Core Contribution

Anything2Skill proposes a framework that bridges two previously separated paradigms in LLM-based agents: retrieval-augmented generation (RAG) for declarative knowledge access and skill-based agents for procedural capability reuse. The central idea is that external knowledge sources (documents, manuals, logs, trajectories) contain latent procedural knowledge that can be "compiled" into structured, reusable skill contracts before inference time. This contrasts with standard RAG, which retrieves raw passages that the agent must interpret anew each time.

The framework introduces several interconnected mechanisms: (1) a skill taxonomy serving as a structural prior over extraction, (2) a plan-and-expand extraction strategy that decomposes records into evidence windows and extracts skills in a two-stage process, (3) structured skill contracts with rich metadata (invocation conditions, contraindications, workflow steps, constraints, etc.), and (4) a SkillBank management system with lifecycle tracking, versioning, reconciliation, and hierarchy projection.
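To make the notion of a structured skill contract concrete, the following is a minimal sketch of how such a record might be represented. The field list follows the abstract; the class name, field names, the example skill, and the substring-based applicability check are all illustrative assumptions, not the paper's schema or matching logic.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SkillContract:
    """Illustrative container; field names are assumptions, not the paper's schema."""
    name: str
    invocation_conditions: List[str]
    contraindications: List[str] = field(default_factory=list)
    action_moves: List[str] = field(default_factory=list)
    workflow_steps: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)
    output_spec: str = ""
    supporting_evidence: List[str] = field(default_factory=list)
    confidence: float = 0.0

    def is_applicable(self, task: str) -> bool:
        """Toy check: some invocation condition appears in the task text
        and no contraindication does (simple substring matching)."""
        text = task.lower()
        triggered = any(c.lower() in text for c in self.invocation_conditions)
        blocked = any(c.lower() in text for c in self.contraindications)
        return triggered and not blocked


# Hypothetical skill used only for illustration.
skill = SkillContract(
    name="dedupe_csv",
    invocation_conditions=["duplicate rows"],
    contraindications=["keep duplicates"],
    workflow_steps=[
        "inspect the header",
        "choose key columns",
        "run deduplication",
        "verify the row count",
    ],
    output_spec="path to the deduplicated file",
    supporting_evidence=["doc-window-12"],
    confidence=0.8,
)
```

A real system would presumably match conditions semantically and not by substring; the point here is only that contraindications act as a veto on invocation.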

The conceptual contribution—that RAG should be augmented with compiled procedural memory—is intuitive and well-motivated. The distinction between declarative evidence retrieval and procedural skill retrieval is a meaningful conceptual advance that reframes how we think about knowledge augmentation for agents.

2. Methodological Rigor

Strengths in formalization: The paper provides extensive formal notation for every component, from taxonomy structure to evidence windows, compilation objectives, reconciliation scoring, and lifecycle management. The skill contract schema (Eq. 2) is comprehensive, and the hybrid retrieval scoring function (Eq. 23) combining dense, sparse, and taxonomy-aware components is well-specified.
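For readers unfamiliar with hybrid retrieval, the sketch below shows one common way to combine dense, sparse, and taxonomy-aware signals into a single score. The exact form of Eq. 23 is not reproduced in this assessment, so the linear combination, the default weights, the Jaccard stand-in for the sparse component, and the prefix-overlap taxonomy measure are all assumptions made for illustration.

```python
import math
from typing import Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Dense component: cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Sparse component (stand-in): token-set overlap between query and skill text."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def taxonomy_match(query_path: Sequence[str], skill_path: Sequence[str]) -> float:
    """Taxonomy component (assumed): shared leading nodes over the longer path length."""
    shared = 0
    for q, s in zip(query_path, skill_path):
        if q != s:
            break
        shared += 1
    longest = max(len(query_path), len(skill_path))
    return shared / longest if longest else 0.0


def hybrid_score(
    dense: float,
    sparse: float,
    taxonomy: float,
    weights: Tuple[float, float, float] = (0.5, 0.3, 0.2),  # hypothetical weights
) -> float:
    """Assumed linear combination of the three component scores."""
    w_dense, w_sparse, w_tax = weights
    return w_dense * dense + w_sparse * sparse + w_tax * taxonomy
```

An ablation of the kind the review asks for could be run by zeroing one weight at a time and re-measuring task success.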

Weaknesses in evaluation: This is where the paper falls significantly short. The experimental evaluation is thin relative to the complexity of the proposed system:

Only two benchmarks (qsv and GitHub-CLI), both in the narrow domain of command-line tool usage. This severely limits generalizability claims about "heterogeneous external knowledge" including dialogues, logs, and trajectories.

No ablation studies. The framework has numerous components (taxonomy prior, plan-and-expand extraction, compilation, reconciliation, versioning, tree projection), but no experiment isolates their individual contributions.

No analysis of skill quality. There is no evaluation of whether extracted skills are correct, complete, or faithfully grounded in evidence. The paper claims evidence grounding but never validates it.

Single model (GPT-5.4). No experiments with other LLMs, making it impossible to assess model sensitivity.

No baselines beyond RAG. Relevant comparisons with other skill-based or procedural-memory agents (e.g., Voyager-style skill libraries, Reflexion-style memory) are absent.

No error analysis explaining what types of failures remain at 98.85% and 94.10%.

No cost analysis of the compilation process (LLM calls, latency, token usage).

Statistical significance is not reported—no confidence intervals, no multiple runs.

The gap between the system's complexity and the evaluation's simplicity is concerning. A framework claiming to handle "anything" (documents, manuals, dialogues, logs, trajectories) is tested only on CLI documentation.
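As an illustration of the missing uncertainty reporting, the sketch below computes a Wilson score interval for a success rate. The benchmark sizes are not stated in this assessment, so the counts in the usage line are hypothetical and do not correspond to the paper's results.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: 188 successes out of 200 tasks.
low, high = wilson_interval(188, 200)
```

With a few hundred tasks the interval spans several percentage points, which is why differences between configurations are hard to interpret without it.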

3. Potential Impact

The core idea—pre-compiling procedural knowledge from knowledge bases into reusable skills—has genuine potential. If validated more broadly, it could influence:

Enterprise RAG systems where operational procedures, SOPs, and troubleshooting guides need to be actionable rather than merely retrievable.

Agent architectures by establishing procedural memory as a first-class component alongside episodic and semantic memory.

Knowledge management by providing a framework for converting tacit/implicit organizational knowledge into structured, executable procedures.

However, the current instantiation is narrowly scoped to CLI agents, and the practical applicability to truly heterogeneous knowledge (medical records, legal documents, engineering logs) remains undemonstrated.

4. Timeliness & Relevance

The paper addresses a genuinely timely problem. As LLM agents become more prevalent, the gap between knowledge retrieval and procedural execution is widely recognized. The framing of "declarative vs. procedural knowledge" in RAG systems is relevant to ongoing discussions in the community. The work positions itself well within the current wave of agent-augmented systems and procedural memory research.

However, the cited related work (e.g., Yang et al. 2026) and the model used (GPT-5.4) are both very recent, which makes independent verification difficult.

5. Strengths & Limitations

Key Strengths:

Clean conceptual framing: the declarative-procedural knowledge distinction is well-articulated and likely to resonate broadly.

Comprehensive system design with principled taxonomy, lifecycle management, and versioning.

The structured skill contract format is well-designed and practically useful.

Strong empirical results on the tested benchmarks (98.85% and 94.10% when combined with RAG).

The complementarity finding (skills + RAG > either alone) is practically valuable.

Notable Weaknesses:

Evaluation breadth: Two CLI-focused benchmarks cannot support claims about "heterogeneous external knowledge" or "anything" compilation.

No ablations: With such a complex pipeline, understanding which components matter is essential.

Over-engineering risk: The system introduces extensive machinery (7 lifecycle states, 7 management actions, multi-component retrieval scoring) without demonstrating that simpler approaches wouldn't suffice.

Reproducibility concerns: Heavy reliance on LLM-based compilation and management means results depend critically on prompt engineering and model capabilities, neither of which are fully specified.

No skill quality evaluation: The paper evaluates downstream task success but never validates the intermediate artifacts (skill contracts).

Scalability questions: How does the framework perform with thousands of documents? What is the compilation cost?

The "Base Agent" baseline at 81.6%/64.7% already performs reasonably, raising questions about whether the improvements come from the sophisticated skill framework or simply from providing more context.

Additional Observations

The paper reads more as a system description than an empirical research contribution. The methodology section is detailed and well-formalized, but the experiments occupy only ~1.5 pages and provide minimal analytical depth. The mismatch between the ambitious claims ("Anything2Skill") and the narrow evaluation (two CLI benchmarks) weakens the paper's credibility. A stronger version would include diverse knowledge types, ablation studies, skill quality analysis, and comparison with alternative approaches to procedural knowledge extraction.

Rating: 4.5 / 10

Significance 5.5 · Rigor 3.5 · Novelty 6 · Clarity 6.5

Generated Jun 9, 2026

Comparison History (28)

Lost vs. Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

Paper 1 tackles a fundamental challenge in foundational LLM training: distinguishing genuine reasoning from memorization during reinforcement learning. By introducing a direction-aware exploration framework (DiRL) to steer RL updates, it addresses a core bottleneck in developing next-generation reasoning models. While Paper 2 offers a highly practical systems-level advancement for RAG-based agents, Paper 1 has higher potential scientific impact because its methodological improvements to LLM optimization can broadly influence the underlying training paradigms of future foundation models across the entire AI field.

gemini-3.1-pro-preview·Jun 10, 2026

Won vs. Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

Paper 1 introduces a paradigm shift from standard declarative RAG to procedural RAG by compiling external knowledge into executable skills. This addresses a critical bottleneck in agent capabilities, enabling direct capability reuse rather than repeated inference. Its high potential for real-world applications in enterprise and coding agents, combined with impressive empirical results, suggests a broader and more immediate scientific impact compared to the memory retention optimizations in Paper 2.

gemini-3.1-pro-preview·Jun 10, 2026

Lost vs. Capacity, Not Format: Rethinking Structured Reasoning Failures

Paper 1 offers a more fundamental and broadly applicable scientific contribution. It rigorously disentangles the relationship between model capacity and structured output formatting, a question relevant to essentially all LLM applications. The controlled experimental design across multiple models and benchmarks, with clear mechanistic explanations (truncation vs. capacity competition), provides actionable insights for the entire field. Paper 2, while practically useful, presents a more incremental engineering framework for skill extraction in narrow agent domains (CLI tools), with less generalizable scientific insight and limited benchmark diversity.

claude-opus-4-6·Jun 9, 2026

Lost vs. Bilevel Autoresearch: Meta-Autoresearching Itself

Paper 1 introduces a framework for 'autoresearching itself,' touching on the foundational goal of recursive self-improvement in AI. The ability for an AI to optimize its own search mechanisms has profound implications for AGI and could fundamentally shift how automated research is conducted. While Paper 2 offers a highly practical and rigorous enhancement to RAG by adding procedural skills, Paper 1's conceptual leap toward recursive bootstrapping offers a higher ceiling for transformative scientific impact.

gemini-3.1-pro-preview·Jun 9, 2026

Lost vs. Who Deserves the Reward? SHARP: Shapley Credit-based Optimization for Multi-Agent System

Paper 2 addresses a fundamental and broadly applicable challenge—credit assignment in multi-agent reinforcement learning with LLMs—using a principled game-theoretic approach (Shapley values). This has wider applicability across diverse multi-agent systems beyond specific tool domains. Paper 1, while showing strong empirical results, is more narrowly focused on compiling external knowledge into skills for specific CLI tools. Paper 2's theoretical grounding, significant performance improvements (23.66% and 14.05%), and relevance to the rapidly growing multi-agent LLM ecosystem give it broader potential impact across AI research.

claude-opus-4-6·Jun 9, 2026

Lost vs. SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance

SIFT addresses a fundamental computational bottleneck in RAG systems (prefill latency) with novel theoretical insights about attention invariance patterns. Its contributions—local-attention invariance and cross-attention consistency—are generalizable principles that could influence broader transformer optimization research. The 24,000x storage reduction and 1.71x TTFT improvement with minimal accuracy loss demonstrate strong practical impact. While Anything2Skill presents a useful engineering framework for skill compilation, its contributions are more incremental (combining existing ideas like skill extraction, taxonomy management, and RAG). SIFT's architectural-level insights have broader applicability across the rapidly growing RAG ecosystem.

claude-opus-4-6·Jun 9, 2026

Lost vs. Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Paper 1 proposes a fundamental paradigm shift by using images as a standalone reasoning medium instead of text, challenging traditional Chain-of-Thought methods. This high novelty could spark an entirely new direction in multimodal model architecture and reasoning efficiency. Paper 2, while highly practical and effective for agentic workflows, represents a more incremental architectural enhancement over existing RAG systems.

gemini-3.1-pro-preview·Jun 9, 2026

Lost vs. PRISM: Recovering Instruction Sets from Language Model Activations

While Paper 1 offers a strong architectural improvement for RAG-based agents, Paper 2 tackles a fundamental challenge in LLM interpretability and AI safety. By recovering active instructions directly from model activations, PRISM addresses critical security concerns like hidden objectives and prompt injections. This mechanistic interpretability approach has a broader and more profound scientific impact, as it provides a pathway to safely monitor and audit autonomous agents, a top priority in current AI research.

gemini-3.1-pro-preview·Jun 9, 2026

Won vs. Emergent alignment and the projectability of ethical personas

Paper 2 presents a highly practical and scalable framework for compiling unstructured knowledge into reusable procedural skills for agents, directly addressing a major bottleneck in current RAG systems. Its potential for real-world application in autonomous AI agents is immense, promising broad cross-domain impact. While Paper 1 offers valuable insights into AI alignment, Paper 2's methodological innovation in agentic workflows and strong empirical success rates suggest a more immediate and widespread influence on AI system design.

gemini-3.1-pro-preview·Jun 9, 2026

Won vs. Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

Paper 2 addresses a fundamental bottleneck in LLM agents by introducing a framework to convert declarative knowledge into executable procedural skills. This approach significantly enhances Retrieval-Augmented Generation (RAG) and has broad applicability across various AI agent domains. While Paper 1 offers valuable methodological improvements for biomedical embeddings and hardware optimization, Paper 2's focus on autonomous agent skill acquisition aligns with highly active, high-impact trends in artificial intelligence, promising wider real-world applications.

gemini-3.1-pro-preview·Jun 9, 2026
