SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale
Tong Bai, Zhenglin Wan, Pengfei Zhou, Xingrui Yu, Wangbo Zhao, Yang You, Ivor W. Tsang
Abstract
As LLM agents adopt large skill libraries, selecting the right subset becomes a structural problem rather than a similarity-matching one: skills depend on, conflict with, specialize, or duplicate one another, a structure invisible to both full enumeration and embedding similarity. We present SkillDAG, which models inter-skill relationships as a typed directed graph and exposes it to an LLM agent as an inference-time, agent-callable structural retrieval interface, queried and evolved during execution rather than baked into a fixed retrieval pipeline: each search returns vector matches, typed-edge neighbors, and conflict signals, and a propose-then-commit protocol lets the agent register execution-backed edges so the graph accumulates structure across episodes. On ALFWorld and SkillsBench with MiniMax-M2.7, SkillDAG reaches 67.1% success and 27.3% reward, exceeding the strongest reported Graph-of-Skills baseline by +12.8 and +8.6 points; the advantage ports to gpt-5.2-codex, and intrinsic SkillsBench Ret@K rises from 65.5 to 78.2 under matched queries. These gains trace to isolable mechanisms: candidate ranking that stays robust as the pool grows 10x where a fixed seeding-diffusion pipeline degrades, and set-monotone online edits that enlarge ground-truth recall without evicting prior hits.
AI Impact Assessments
Scientific Impact Assessment: SkillDAG
Core Contribution
SkillDAG addresses the problem of skill selection in LLM agents when skill libraries grow large. The key insight is that relationships between skills (dependencies, conflicts, specializations, redundancies, compositions) are structurally important but invisible to both full-library prompting and flat embedding-based retrieval. The paper's central design inversion is exposing the typed skill graph as an agent-callable interface rather than using it as a substrate for a fixed retrieval algorithm. The LLM agent can query the graph via `search/show/propose-edge/edit-edge` calls, reason over three separate evidence channels (matches, typed neighbors, conflicts), and evolve the graph through execution-backed edits.
This is a meaningful architectural contribution. Prior work like Graph of Skills (GoS) builds a typed graph but collapses it into an opaque bundle via PageRank diffusion. SkillDAG makes the graph structure a first-class object in the agent's decision loop, preserving the semantic distinctions between edge types rather than flattening them into a single ranking score.
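The three-channel search described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `SkillGraph` class, its method names, the example skills, and the word-overlap score (standing in for vector similarity) are all assumptions made for this sketch. Only the five edge-type names and the idea of returning matches, typed neighbors, and conflicts separately come from the text.

```python
from dataclasses import dataclass, field

# Edge-type names follow the review; everything else here is hypothetical.
EDGE_TYPES = {"depends_on", "specializes", "composes_with",
              "similar_to", "conflicts_with"}


@dataclass
class SkillGraph:
    descriptions: dict = field(default_factory=dict)  # skill name -> text
    edges: set = field(default_factory=set)           # (src, type, dst)

    def add_skill(self, name, description):
        self.descriptions[name] = description

    def add_edge(self, src, edge_type, dst):
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        self.edges.add((src, edge_type, dst))

    def _score(self, query, name):
        # Jaccard word overlap: a toy stand-in for embedding similarity.
        q = set(query.lower().split())
        d = set(self.descriptions[name].lower().split())
        return len(q & d) / len(q | d) if q | d else 0.0

    def search(self, query, k=2):
        ranked = sorted(self.descriptions,
                        key=lambda n: (-self._score(query, n), n))
        matches = [n for n in ranked[:k] if self._score(query, n) > 0]
        neighbors, conflicts = [], []
        for src, etype, dst in sorted(self.edges):
            if src not in matches and dst not in matches:
                continue
            if etype == "conflicts_with":
                conflicts.append((src, dst))         # kept in its own channel
            else:
                neighbors.append((src, etype, dst))  # edge type is preserved
        return {"matches": matches, "neighbors": neighbors,
                "conflicts": conflicts}


g = SkillGraph()
g.add_skill("parse_csv", "parse csv file into rows")
g.add_skill("detect_encoding", "detect text encoding of a file")
g.add_skill("parse_csv_fast", "parse csv quickly with streaming")
g.add_skill("plot_chart", "draw chart from rows")
g.add_edge("parse_csv", "depends_on", "detect_encoding")
g.add_edge("parse_csv", "conflicts_with", "parse_csv_fast")

result = g.search("parse csv file", k=1)
print(result)
```

Here `detect_encoding` never appears among the matches, yet it reaches the agent through the `depends_on` neighbor channel, and the conflict with `parse_csv_fast` is reported separately instead of being folded into a single ranking score.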
Methodological Rigor
The methodology has several well-designed components:
Graph formalization: Five edge types are operationally defined with clear semantics and counterfactual tests (Appendix C). The separation into a directed acyclic backbone (depends_on, specializes) and symmetric relations (composes_with, similar_to, conflicts_with) is principled.
Cold-start construction: The two-view embedding strategy (e_self for what a skill does, e_needs for what it requires) is a clever solution to the cross-functional gap problem where self-description similarity misses prerequisite relationships. This is well-motivated with concrete examples.
Online evolution protocol: The three structural invariants (acyclicity, non-contradiction, append-only reversibility) are minimal but sufficient guardrails. The propose-then-commit pattern with dry-run preview is sensible engineering.
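The invariants and the propose-then-commit pattern described above can be sketched as follows. This is an illustrative reading under stated assumptions, not the authors' code: the `EdgeLog` class and its methods are hypothetical, and non-contradiction is interpreted here as "a pair of skills may not carry `conflicts_with` alongside any other edge type", which may differ from the paper's exact definition.

```python
from dataclasses import dataclass, field

BACKBONE = {"depends_on", "specializes"}  # directed, must stay acyclic
SYMMETRIC = {"composes_with", "similar_to", "conflicts_with"}


@dataclass
class EdgeLog:
    # Append-only history of ("add" | "retract", (src, type, dst)) records.
    log: list = field(default_factory=list)

    def active(self):
        edges = set()
        for op, edge in self.log:
            if op == "add":
                edges.add(edge)
            else:
                edges.discard(edge)
        return edges

    def _reaches(self, start, goal):
        # Depth-first search along backbone edges only.
        edges = [(s, d) for s, t, d in self.active() if t in BACKBONE]
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(d for s, d in edges if s == node)
        return False

    def propose(self, src, etype, dst):
        """Dry run: list the violations an edge would cause; change nothing."""
        problems = []
        if etype not in BACKBONE | SYMMETRIC:
            problems.append("unknown edge type")
        if etype in BACKBONE and self._reaches(dst, src):
            problems.append("would create a cycle")
        for a, t, b in self.active():
            same_pair = {a, b} == {src, dst}
            if same_pair and t != etype and "conflicts_with" in (t, etype):
                problems.append(f"contradicts existing {t} edge")
        return problems

    def commit(self, src, etype, dst):
        problems = self.propose(src, etype, dst)
        if not problems:
            self.log.append(("add", (src, etype, dst)))
        return problems

    def retract(self, src, etype, dst):
        # Reversal is recorded as a new entry; history is never rewritten.
        self.log.append(("retract", (src, etype, dst)))


g = EdgeLog()
g.commit("b", "depends_on", "a")
g.commit("c", "depends_on", "b")
cycle = g.propose("a", "depends_on", "c")      # a -> c -> b -> a
clash = g.commit("a", "conflicts_with", "b")   # rejected, nothing logged
g.retract("c", "depends_on", "b")
after_retract = g.propose("a", "depends_on", "c")
print(cycle, clash, after_retract, len(g.log))
```

Because `propose` has no side effects, an agent can preview an edit before committing it, and because retraction is logged rather than applied destructively, every committed edge remains auditable.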
Experimental concerns: Several methodological limitations weaken confidence in the results:
1. No seed-matched reruns of baselines: The Vanilla/Vector/GoS reward numbers are quoted from Liu et al. [2026] rather than reproduced. While the authors acknowledge this, it makes direct comparison less controlled—differences in API versions, sampling temperature, or minor implementation details could contribute to gaps.
2. Limited benchmark diversity: Only two benchmarks (ALFWorld with 37 skills and SkillsBench with up to 2000 skills) are evaluated. ALFWorld is a relatively simple embodied environment, and the skill libraries are still modest by real-world standards.
3. Single-observation commits: The paper acknowledges that gating graph edits on single episodes without statistical confidence thresholds is an open question. Only 27 online edits were made on SkillsBench (Table 2), too few to judge whether the evolution mechanism is truly impactful or merely cosmetic.
4. Set-monotone recall claim: The property that online edits can only enlarge the retrieved set is presented as a strength, but this is also a limitation—the graph can only grow, potentially accumulating noise over many episodes. The paper acknowledges this is "a recall property, not a context-budget claim."
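The trade-off in item 4 can be made concrete with a toy example. The sketch below assumes that the vector matches stay fixed and that edits only add edges; the `retrieve` and `recall` helpers and the skill names are hypothetical and chosen purely for illustration.

```python
def retrieve(matches, edges):
    """Return the matches plus every skill one edge away from a match."""
    retrieved = set(matches)
    for src, dst in edges:
        if src in matches:
            retrieved.add(dst)
        if dst in matches:
            retrieved.add(src)
    return retrieved


def recall(retrieved, gold):
    return len(retrieved & gold) / len(gold)


matches = {"parse_csv"}
gold = {"parse_csv", "detect_encoding", "plot_chart"}

edges_before = {("parse_csv", "detect_encoding")}
# Two online edits: one useful edge, one edge to an irrelevant skill.
edges_after = edges_before | {("plot_chart", "parse_csv"),
                              ("parse_csv", "legacy_parser")}

before = retrieve(matches, edges_before)
after = retrieve(matches, edges_after)

print(recall(before, gold), recall(after, gold))
print(len(before), len(after))
```

Under these assumptions the earlier retrieved set is always a subset of the later one, so recall cannot drop. The same run also shows the cost: the retrieved set doubles in size and now includes `legacy_parser`, which is not a ground-truth skill.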
Potential Impact
The paper addresses a real and growing problem. As LLM agent frameworks (AutoGPT, CrewAI, etc.) mature, skill/tool libraries will inevitably grow, and structured retrieval will become critical. The idea of making relational structure an explicit part of the agent's reasoning context, rather than hiding it in a retrieval pipeline, is broadly applicable.
However, the current evaluation scope limits confidence in generalization. The 10x scaling experiment (200→2000 skills) shows promising robustness, but real-world tool libraries can be orders of magnitude larger.
Timeliness & Relevance
This work is highly timely. The "retrieval over growing capability surface as a failure mode" framing resonates with current challenges in agentic AI. The references include very recent work (2026 papers on SkillNet, CUA-Skill, XSkill), positioning this squarely at the frontier. The specific problem of skill selection degrading with library size is documented by Shi et al. [2025] and is a recognized bottleneck.
The choice to evaluate with both MiniMax-M2.7 and gpt-5.2-codex provides useful cross-model validation. The observation that SkillDAG's advantage is largest with the weaker backbone (where it can't reason past poor retrieval) is insightful and suggests the approach is most valuable precisely where it's most needed.
Strengths
1. Clean architectural principle: The inversion from "graph as retrieval substrate" to "graph as agent-callable interface" is simple, well-articulated, and likely durable.
2. Three-channel search decomposition: Keeping matches, neighbors, and conflicts separate preserves information that ranked-list approaches destroy.
3. Two-view cold-start: The e_needs embedding view addresses a specific, well-motivated failure mode of single-view similarity.
4. Scale robustness: The 10x expansion experiment with only 3.5-point Ret@5 degradation vs. GoS's 4.6-point drop is a meaningful practical advantage.
5. Concrete failure analysis (Appendix B) adds transparency.
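The failure mode behind strength 3 can be illustrated with a toy comparison of the two views. This is a sketch under loud assumptions: a bag-of-words cosine stands in for a real embedding model, and the skill texts, the `embed` and `cosine` helpers, and the dictionary layout are hypothetical. Only the view names e_self and e_needs come from the text.

```python
import math
from collections import Counter


def embed(text):
    # Toy "embedding": word counts. A real system would use a learned model.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


skills = {
    "train_model": {
        "self": "fit a classifier on labelled examples",
        "needs": "a cleaned table with missing values removed",
    },
    "clean_table": {
        "self": "produce a cleaned table with missing values removed",
        "needs": "raw csv rows",
    },
}

e_self = {name: embed(s["self"]) for name, s in skills.items()}
e_needs = {name: embed(s["needs"]) for name, s in skills.items()}

# Single-view similarity: what each skill does.
self_to_self = cosine(e_self["train_model"], e_self["clean_table"])
# Two-view similarity: what train_model needs vs. what clean_table does.
needs_to_self = cosine(e_needs["train_model"], e_self["clean_table"])

print(round(self_to_self, 3), round(needs_to_self, 3))
```

The two self-descriptions share almost no vocabulary, so a single-view comparison would rank the pair low, while the needs-to-self comparison surfaces it as a prerequisite candidate. In the pipeline the review describes, such candidate pairs would presumably then go to the LLM for pair classification.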
Limitations
1. Modest scale: 37 skills (ALFWorld) and 2000 skills (SkillsBench max) are far from the "massive" libraries the introduction motivates.
2. Baseline fairness: Quoted rather than reproduced baselines weaken comparative claims.
3. Limited graph evolution evidence: 27 online edits across 65 tasks is insufficient to validate the self-evolution mechanism at scale.
4. No statistical significance reporting: No confidence intervals, standard deviations, or significance tests across runs.
5. LLM-dependent construction: Both cold-start and online edits depend on LLM quality; errors in pair classification propagate.
6. gpt-5.2-codex results are mixed: The advantage vanishes on ALFWorld (tied at 93.6%) and is modest on SkillsBench (+2.4), raising questions about whether the contribution is primarily compensating for weaker models.
Overall Assessment
SkillDAG presents a well-motivated and cleanly designed approach to a real problem. The core architectural insight—exposing typed graph structure as an agent-callable interface—is sound and likely to influence future work on tool/skill management. However, the empirical evaluation is somewhat limited in scale and rigor, and the self-evolution mechanism remains undervalidated. The work makes a solid contribution but would benefit from larger-scale experiments, reproduced baselines, and deeper analysis of the graph evolution dynamics.
Generated Jun 3, 2026
Comparison History (25)
Paper 2 (SkillDAG) presents a constructive, novel framework with clear empirical improvements (+12.8 and +8.6 points over baselines) on established benchmarks, addressing the scalable skill selection problem for LLM agents—a growing area with broad applicability. Paper 1 provides valuable negative/diagnostic results showing that intervention timing is a low-reliability construct, which is important but primarily cautionary. While Paper 1's findings about human inter-rater disagreement and saturation traps are insightful, negative results typically have lower citation impact. Paper 2's actionable method with demonstrated gains is more likely to be adopted and built upon by the community.
Paper 2 has higher impact potential: it introduces a broadly applicable, deployable mechanism (typed, self-evolving skill graphs) that addresses a growing real-world bottleneck in agent systems—skill selection at scale—and demonstrates sizable, portable gains on established benchmarks plus scalability (10x pool) and ablations isolating mechanisms. Its approach could influence tool/skill retrieval, agent architectures, and continual learning across domains. Paper 1 is valuable for evaluation rigor and incentives, but benchmarks often have narrower downstream impact than methods that directly improve agent capability and can be integrated into production systems.
SkillDAG addresses a fundamental and increasingly important problem in LLM agent systems—scalable skill selection via structured graph representations—with strong novelty in its typed DAG approach, self-evolving propose-then-commit protocol, and robust scaling properties. The work has broad applicability across all LLM agent frameworks and demonstrates significant empirical gains (+12.8 points) over strong baselines. PATRA, while solid, addresses a narrower domain (time series QA) with more incremental contributions (pattern-aware alignment, balanced rewards). SkillDAG's architectural contribution is more likely to influence the rapidly growing LLM agent ecosystem.
Paper 2 addresses a fundamental bottleneck in LLM agent scalability: selecting skills from massive libraries where inter-skill relationships (dependencies, conflicts) matter more than simple semantic similarity. By introducing a self-evolving, typed DAG structure for skill retrieval, it offers a more scalable and structurally aware approach than Paper 1's state-grounded dynamic retrieval. This structural innovation is likely to have broader applicability across complex, large-scale agentic systems.
Paper 2 (FeynmanBench) is likely higher impact: it introduces a broadly useful, rigorously generated and verified benchmark exposing a clear capability gap in multimodal LLMs on topology-sensitive scientific diagram reasoning. Benchmarks often catalyze cross-field progress (ML, vision, scientific computing) by standardizing evaluation and guiding model design, and the reported near-zero performance on full derivations makes the result timely and actionable. Paper 1 is a strong systems contribution for agent skill selection, but its impact may be narrower to tool-augmented LLM agent architectures and depends more on ecosystem adoption of its graph interface.
SkillDAG presents a novel, concrete method (typed skill graphs with self-evolving structure) that demonstrates significant quantitative improvements over strong baselines on established benchmarks. It addresses a fundamental scaling problem in LLM agent skill selection with a principled graph-theoretic approach and shows portability across models. Paper 2 introduces a benchmark (EvoEnv) addressing important challenges in dynamic environments, but benchmarks typically have narrower methodological impact unless widely adopted. SkillDAG's actionable framework for structured skill retrieval and its demonstrated performance gains give it broader potential for real-world adoption and follow-on research.
Paper 2 has higher potential scientific impact because it resolves a longstanding open theoretical problem: establishing strongly polynomial-time complexity for policy iteration in (s,a)-rectangular L∞ robust MDPs with fixed discount. This is a fundamental algorithmic guarantee with broad implications across optimization, control, RL, and game theory, and is likely to be durable and widely citable. Paper 1 is timely and practically useful for LLM agents, but its impact may be more contingent on specific benchmarks, model ecosystems, and engineering choices, and is less foundational than a complexity-theoretic breakthrough.
While both papers address skill evolution in LLM agents, Paper 2 tackles the critical challenge of scale. By modeling complex inter-skill relationships (dependencies, conflicts) as a typed DAG and demonstrating robustness when the skill pool grows 10x, SkillDAG offers a more rigorous and scalable methodology for real-world applications with massive tool libraries.
Paper 1 addresses a fundamental bottleneck in LLM agent scalability (skill selection from large libraries) with a novel, self-evolving graph-based retrieval approach. This methodological innovation has broad applicability across various agentic workflows and demonstrates strong empirical gains. Paper 2, while addressing an important real-world application (safety analysis), relies on more established multi-agent dialogue techniques. Therefore, Paper 1 is likely to have a wider and more fundamental scientific impact in the fast-growing field of autonomous AI agents.
Paper 1 addresses a critical bottleneck in the rapidly growing field of LLM agents: scaling and managing large skill libraries. By introducing a self-evolving, typed skill graph that dynamically adapts during execution, it offers a highly scalable and broadly applicable solution. Paper 2's neurosymbolic approach to VQA is innovative and improves interpretability, but its focus on Answer-Set Programming is more niche. Paper 1 has a higher potential for broad adoption and impact across general AI agent architectures.
Paper 2 is more novel and broadly applicable: it introduces a self-evolving, typed skill-relationship graph and an inference-time structural retrieval interface for scalable tool/skill selection—an increasingly central bottleneck for LLM agents. Its mechanisms (typed edges, conflict signals, propose-then-commit online edits) generalize across domains and models, and it reports sizable gains on established agent benchmarks plus robustness as skill pools grow 10×. Paper 1 is valuable but mainly contributes a benchmark and incremental tool-call training improvements for math, with narrower cross-field impact.
ThoughtFold addresses a highly critical and timely bottleneck in Large Reasoning Models: inference inefficiency and over-thinking. Reducing token usage by 56% without sacrificing accuracy offers massive real-world cost savings and scalability benefits. While SkillDAG introduces an innovative structural approach for agent skill selection, improving general reasoning efficiency has a broader, more immediate impact across the entire LLM ecosystem.
Paper 1 addresses a fundamental and broadly relevant problem in LRMs—harmful overthinking—that affects all reasoning models across modalities. It introduces a novel evaluation protocol, demonstrates surprising findings (stopping early improves accuracy by up to 21%), and reveals that current efficiency strategies fail to address the core issue. The breadth of impact is high since it applies to both multimodal and language-only settings, affecting the entire LRM paradigm. Paper 2, while technically solid, addresses a more niche problem of skill selection in LLM agents with narrower applicability and incremental improvements over existing baselines.
Paper 2 (SkillDAG) has higher likely scientific impact: it targets a broadly shared bottleneck—skill selection in large tool/skill libraries—relevant to agents, software engineering, robotics, and enterprise automation. The typed, self-evolving graph with execution-backed online edits is a more general, reusable abstraction than math-specific context stratification, and it directly addresses scaling behavior (10× pool robustness). Reported gains across multiple benchmarks and transfer to another model suggest wider applicability and timeliness as agent ecosystems grow. Paper 1 is strong but narrower to math reasoning and prompt-management.
SkillDAG addresses a fundamental scaling problem in LLM agent skill selection with a novel typed directed graph approach that self-evolves during execution. It demonstrates strong empirical gains (+12.8 and +8.6 points over baselines) across multiple benchmarks and models, with clear ablation of mechanisms. Paper 2 offers interesting insights about visual graph scaffolds for reasoning but is more exploratory, revealing a modality gap rather than providing a definitive solution. SkillDAG's practical applicability to growing skill libraries and its self-evolving mechanism give it broader real-world impact potential.
Paper 1 provides a rigorous, generalizable theoretical framework (debate benefit condition) explaining when multi-agent debate helps vs. hurts, validated across 6,000+ task-condition pairs and 19 published comparisons with zero false positives. Its identification of critique-induced confusion and the adversarial separation principle offers broad methodological insights applicable across many multi-agent LLM systems beyond data cleaning. Paper 2, while technically solid, addresses a more niche problem (skill selection in LLM agents) with narrower generalizability and incremental advances over existing baselines.
Paper 1 addresses a fundamental, paradigm-level challenge in AI safety and alignment—arguing that the dominant solipsistic AI design paradigm is structurally incompatible with cooperation, and calling for a new research paradigm centered on interdependence. This has broad implications across AI safety, multi-agent systems, institutional design, and policy. Its timeliness is exceptional given rapid AI capability advances. Paper 2, while technically solid with strong empirical results on skill selection for LLM agents, addresses a narrower engineering problem with incremental improvements on specific benchmarks, limiting its broader scientific impact.
Paper 2 addresses a fundamental limitation in RLVR for visual reasoning—a rapidly growing area at the intersection of LLMs and multimodal AI. The insight that token-level entropy alone is insufficient for visual reasoning, and the principled multiplicative coupling of visual sensitivity with entropy, offers a broadly applicable contribution. It impacts the large community working on multimodal LLMs and RL-based training. Paper 1, while technically solid, addresses a more niche problem (skill library management for LLM agents) with narrower applicability. Paper 2's findings are more likely to influence future training paradigms for vision-language models.
Paper 2 (EvoDS) presents higher potential impact due to its combination of theoretical rigor (mathematical proofs on tool-selection error and information bottlenecks) and substantial empirical gains (28.9% improvement across four benchmarks). While Paper 1 offers a highly novel structural approach to skill retrieval, Paper 2 tackles two fundamental bottlenecks in long-horizon agentic systems simultaneously: dynamic skill acquisition and active context management. Additionally, EvoDS provides open-source code and data, which significantly accelerates community adoption, reproducibility, and follow-up research.
Paper 2 (Hedge-Bench) has higher likely scientific impact due to a broadly useful, high-signal evaluation artifact: a realistic, expert-grounded benchmark with deterministic grading, addressing a major gap in agent evaluation (open-ended reasoning without model-judged circularity). It is timely for measuring and driving progress in agentic reasoning and can influence both academia and industry across ML evaluation, NLP/agents, and finance/FinTech. Paper 1 is technically novel and strong, but its impact is more specialized to skill-library orchestration and depends on adoption of its graph interface and continual-update protocol.