Qianjun Pan, Yutao Yang, Junsong Li, Jie Zhou, Kai Chen, Xin Li, Qin Chen, Liang He
Retrieval-augmented generation (RAG) enables agents to access external knowledge at inference time, but it primarily retrieves fragmented declarative evidence, leaving agents to repeatedly infer task procedures from passages, manuals, examples, logs, or trajectories. This raises a fundamental question: can skills extracted from external knowledge bases be installed into an agent, enabling it to rapidly approximate domain expertise? In this paper, we propose Anything2Skill, a taxonomy-guided framework that compiles heterogeneous external knowledge into reusable, retrievable, and executable skills for agents. Given a corpus of knowledge records, Anything2Skill first decomposes each record into evidence windows and performs plan-and-expand skill extraction under a skill-tree prior. The extracted candidates are then converted into structured skill contracts that specify invocation conditions, contraindications, action moves, workflow steps, constraints, output specifications, supporting evidence, and confidence scores. To construct a deployable procedural memory, Anything2Skill manages the extracted skills in a persistent SkillBank through taxonomy-aware compilation, registry-level reconciliation, lifecycle tracking, versioned updates, and visible skill-tree projection. At inference time, agents retrieve both task-specific passages from the original knowledge base and relevant procedural skills from the SkillBank, allowing RAG to provide declarative evidence while compiled skills provide reusable procedural guidance. Experiments on qsv and GitHub-CLI show that Anything2Skill combined with RAG achieves 98.85% and 94.10% success rates, respectively, substantially outperforming RAG-only agents. These results suggest that compiling latent procedural knowledge into explicit skills is an effective way to extend retrieval-augmented agents from knowledge access toward capability reuse.
Anything2Skill proposes a framework that bridges two previously separated paradigms in LLM-based agents: retrieval-augmented generation (RAG) for declarative knowledge access and skill-based agents for procedural capability reuse. The central idea is that external knowledge sources (documents, manuals, logs, trajectories) contain latent procedural knowledge that can be "compiled" into structured, reusable skill contracts before inference time. This contrasts with standard RAG, which retrieves raw passages that the agent must interpret anew each time.
The framework introduces several interconnected mechanisms: (1) a skill taxonomy serving as a structural prior over extraction, (2) a plan-and-expand extraction strategy that decomposes records into evidence windows and extracts skills in a two-stage process, (3) structured skill contracts with rich metadata (invocation conditions, contraindications, workflow steps, constraints, etc.), and (4) a SkillBank management system with lifecycle tracking, versioning, reconciliation, and hierarchy projection.
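To make the notion of a skill contract concrete, the following is a minimal Python sketch of a record holding the fields the abstract lists. It is an illustration only: the class name `SkillContract`, the `taxonomy_path` encoding, the `revised` helper, and the [0, 1] confidence range are assumptions made here, not the paper's actual schema (Eq. 2) or its versioning mechanism.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass(frozen=True)
class SkillContract:
    """Hypothetical container mirroring the contract fields named in the abstract."""

    name: str
    taxonomy_path: List[str]            # assumed encoding of the skill-tree position
    invocation_conditions: List[str]    # when the skill should be used
    contraindications: List[str]        # when the skill should not be used
    action_moves: List[str]
    workflow_steps: List[str]
    constraints: List[str] = field(default_factory=list)
    output_spec: str = ""
    supporting_evidence: List[str] = field(default_factory=list)
    confidence: float = 0.0             # assumed to lie in [0, 1]
    version: int = 1

    def __post_init__(self) -> None:
        # Reject confidence values outside the assumed range.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")

    def revised(self, **changes) -> "SkillContract":
        """Return a copy with the given fields changed and the version incremented."""
        return replace(self, version=self.version + 1, **changes)


# Usage: an invented contract, then a versioned update that raises its confidence.
contract = SkillContract(
    name="filter-rows-by-column",
    taxonomy_path=["cli", "tabular", "filter"],
    invocation_conditions=["task asks for rows matching a column predicate"],
    contraindications=["input is not tabular"],
    action_moves=["inspect header", "apply filter"],
    workflow_steps=["read header", "build predicate", "run filter", "check output"],
    confidence=0.8,
)
updated = contract.revised(confidence=0.9)
```

Keeping the record immutable and returning a new object on each revision is one simple way to model versioned updates; the paper's lifecycle tracking and reconciliation are presumably richer than this.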
The conceptual contribution—that RAG should be augmented with compiled procedural memory—is intuitive and well-motivated. The distinction between declarative evidence retrieval and procedural skill retrieval is a meaningful conceptual advance that reframes how we think about knowledge augmentation for agents.
Strengths in formalization: The paper provides extensive formal notation for every component, from taxonomy structure to evidence windows, compilation objectives, reconciliation scoring, and lifecycle management. The skill contract schema (Eq. 2) is comprehensive, and the hybrid retrieval scoring function (Eq. 23) combining dense, sparse, and taxonomy-aware components is well-specified.
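For readers unfamiliar with hybrid retrieval, the sketch below shows one way dense, sparse, and taxonomy-aware signals could be combined into a single ranking score. The exact form of Eq. 23 is not reproduced in this review, so the weighted linear sum, the default weights, the shared-prefix taxonomy measure, and all function names here are assumptions chosen for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def taxonomy_overlap(query_path: Sequence[str], skill_path: Sequence[str]) -> float:
    """Length of the shared path prefix divided by the longer path (assumed measure)."""
    shared = 0
    for q, s in zip(query_path, skill_path):
        if q != s:
            break
        shared += 1
    longest = max(len(query_path), len(skill_path))
    return shared / longest if longest else 0.0


def hybrid_score(
    dense: float,
    sparse: float,
    taxonomy: float,
    weights: Tuple[float, float, float] = (0.5, 0.3, 0.2),
) -> float:
    """Weighted linear combination of the three signals (weights are illustrative)."""
    w_dense, w_sparse, w_tax = weights
    return w_dense * dense + w_sparse * sparse + w_tax * taxonomy


def rank_skills(
    query_path: Sequence[str],
    candidates: List[Dict],
    weights: Tuple[float, float, float] = (0.5, 0.3, 0.2),
) -> List[Tuple[str, float]]:
    """Score each candidate skill and return (name, score) pairs, best first."""
    scored = [
        (
            c["name"],
            hybrid_score(
                c["dense"],
                c["sparse"],
                taxonomy_overlap(query_path, c["path"]),
                weights,
            ),
        )
        for c in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Usage with invented similarity values, each assumed to be pre-normalized to [0, 1].
candidates = [
    {"name": "skill_a", "dense": 0.9, "sparse": 0.1, "path": ["cli", "git"]},
    {"name": "skill_b", "dense": 0.6, "sparse": 0.6, "path": ["cli", "csv"]},
]
ranking = rank_skills(["cli", "csv"], candidates)
```

In this toy example the taxonomy term lets `skill_b` overtake `skill_a` despite a lower dense similarity, which is the kind of effect a taxonomy-aware component is presumably meant to have.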
Weaknesses in evaluation: This is where the paper falls significantly short. The experimental evaluation is thin relative to the complexity of the proposed system.
The gap between the system's complexity and the evaluation's simplicity is concerning. A framework claiming to handle "anything" (documents, manuals, dialogues, logs, trajectories) is tested only on CLI documentation.
The core idea—pre-compiling procedural knowledge from knowledge bases into reusable skills—has genuine potential. If validated more broadly, it could influence:
However, the current instantiation is narrowly scoped to CLI agents, and the practical applicability to truly heterogeneous knowledge (medical records, legal documents, engineering logs) remains undemonstrated.
The paper addresses a genuinely timely problem. As LLM agents become more prevalent, the gap between knowledge retrieval and procedural execution is widely recognized. The framing of "declarative vs. procedural knowledge" in RAG systems is relevant to ongoing discussions in the community. The work positions itself well within the current wave of agent-augmented systems and procedural memory research.
However, some of the paper's references (a Yang et al. 2026 citation, mentions of a GPT-5.4 model) suggest this is a very recent or forward-looking paper, which makes independent verification difficult.
The paper reads more as a system description than an empirical research contribution. The methodology section is detailed and well-formalized, but the experiments occupy only ~1.5 pages and provide minimal analytical depth. The mismatch between the ambitious claims ("Anything2Skill") and the narrow evaluation (two CLI benchmarks) weakens the paper's credibility. A stronger version would include diverse knowledge types, ablation studies, skill quality analysis, and comparison with alternative approaches to procedural knowledge extraction.
Generated Jun 9, 2026
Paper 1 tackles a fundamental challenge in foundational LLM training: distinguishing genuine reasoning from memorization during reinforcement learning. By introducing a direction-aware exploration framework (DiRL) to steer RL updates, it addresses a core bottleneck in developing next-generation reasoning models. While Paper 2 offers a highly practical systems-level advancement for RAG-based agents, Paper 1 has higher potential scientific impact because its methodological improvements to LLM optimization can broadly influence the underlying training paradigms of future foundation models across the entire AI field.
Paper 1 introduces a paradigm shift from standard declarative RAG to procedural RAG by compiling external knowledge into executable skills. This addresses a critical bottleneck in agent capabilities, enabling direct capability reuse rather than repeated inference. Its high potential for real-world applications in enterprise and coding agents, combined with impressive empirical results, suggests a broader and more immediate scientific impact compared to the memory retention optimizations in Paper 2.
Paper 1 offers a more fundamental and broadly applicable scientific contribution. It rigorously disentangles the relationship between model capacity and structured output formatting, a question relevant to essentially all LLM applications. The controlled experimental design across multiple models and benchmarks, with clear mechanistic explanations (truncation vs. capacity competition), provides actionable insights for the entire field. Paper 2, while practically useful, presents a more incremental engineering framework for skill extraction in narrow agent domains (CLI tools), with less generalizable scientific insight and limited benchmark diversity.
Paper 1 introduces a framework for 'autoresearching itself,' touching on the foundational goal of recursive self-improvement in AI. The ability for an AI to optimize its own search mechanisms has profound implications for AGI and could fundamentally shift how automated research is conducted. While Paper 2 offers a highly practical and rigorous enhancement to RAG by adding procedural skills, Paper 1's conceptual leap toward recursive bootstrapping offers a higher ceiling for transformative scientific impact.
Paper 2 addresses a fundamental and broadly applicable challenge—credit assignment in multi-agent reinforcement learning with LLMs—using a principled game-theoretic approach (Shapley values). This has wider applicability across diverse multi-agent systems beyond specific tool domains. Paper 1, while showing strong empirical results, is more narrowly focused on compiling external knowledge into skills for specific CLI tools. Paper 2's theoretical grounding, significant performance improvements (23.66% and 14.05%), and relevance to the rapidly growing multi-agent LLM ecosystem give it broader potential impact across AI research.
SIFT addresses a fundamental computational bottleneck in RAG systems (prefill latency) with novel theoretical insights about attention invariance patterns. Its contributions—local-attention invariance and cross-attention consistency—are generalizable principles that could influence broader transformer optimization research. The 24,000x storage reduction and 1.71x TTFT improvement with minimal accuracy loss demonstrate strong practical impact. While Anything2Skill presents a useful engineering framework for skill compilation, its contributions are more incremental (combining existing ideas like skill extraction, taxonomy management, and RAG). SIFT's architectural-level insights have broader applicability across the rapidly growing RAG ecosystem.
Paper 1 proposes a fundamental paradigm shift by using images as a standalone reasoning medium instead of text, challenging traditional Chain-of-Thought methods. This high novelty could spark an entirely new direction in multimodal model architecture and reasoning efficiency. Paper 2, while highly practical and effective for agentic workflows, represents a more incremental architectural enhancement over existing RAG systems.
While Paper 1 offers a strong architectural improvement for RAG-based agents, Paper 2 tackles a fundamental challenge in LLM interpretability and AI safety. By recovering active instructions directly from model activations, PRISM addresses critical security concerns like hidden objectives and prompt injections. This mechanistic interpretability approach has a broader and more profound scientific impact, as it provides a pathway to safely monitor and audit autonomous agents, a top priority in current AI research.
Paper 2 presents a highly practical and scalable framework for compiling unstructured knowledge into reusable procedural skills for agents, directly addressing a major bottleneck in current RAG systems. Its potential for real-world application in autonomous AI agents is immense, promising broad cross-domain impact. While Paper 1 offers valuable insights into AI alignment, Paper 2's methodological innovation in agentic workflows and strong empirical success rates suggest a more immediate and widespread influence on AI system design.
Paper 2 addresses a fundamental bottleneck in LLM agents by introducing a framework to convert declarative knowledge into executable procedural skills. This approach significantly enhances Retrieval-Augmented Generation (RAG) and has broad applicability across various AI agent domains. While Paper 1 offers valuable methodological improvements for biomedical embeddings and hardware optimization, Paper 2's focus on autonomous agent skill acquisition aligns with highly active, high-impact trends in artificial intelligence, promising wider real-world applications.