CODESKILL: Learning Self-Evolving Skills for Coding Agents
Yanzhou Li, Yiran Zhang, Xiaoyu Zhang, Xiaoxia Liu, Yang Liu
Abstract
Coding agents produce rich trajectories while solving software-engineering tasks. To enable agent self-evolution, these trajectories can be distilled into reusable procedural skills that compactly encode experience to guide future behavior. However, existing skill construction and maintenance methods often rely on fixed prompts and heuristic update rules, leaving it unclear how knowledge should be selected, abstracted, and maintained to best serve downstream agents. We propose CODESKILL, an LLM-based framework that reformulates skill extraction and skill-bank maintenance as a learnable management policy. CODESKILL extracts multi-granularity procedural skills from coding-agent trajectories, evolves skills with new experience, and maintains a compact skill bank for future task solving. We train CODESKILL with reinforcement learning, using a hybrid reward that combines dense rubric-based skill-quality feedback with sparse verifiable execution feedback from the frozen downstream agent. Experiments on EnvBench, SWE-Bench Verified, and Terminal-Bench 2 show that CODESKILL improves average pass rate by 9.69 points over the no-skill baseline and by 4.01 points over the strongest prompt-based or memory baseline, while maintaining the skill bank at a stable size during iterative construction.
AI Impact Assessments
(1 model)
Scientific Impact Assessment: CODESKILL: Learning Self-Evolving Skills for Coding Agents
1. Core Contribution
CODESKILL reformulates the problem of skill extraction and skill-bank maintenance for coding agents as a learnable management policy rather than relying on fixed prompts and heuristic rules. The framework performs three operations: (1) multi-granularity skill extraction (task-level and event-driven) from agent trajectories, (2) skill evolution using new evidence, and (3) skill-bank maintenance (add/merge/drop). The key novelty is training a small LLM (Qwen3.5-4B) via GRPO reinforcement learning with a hybrid reward combining dense rubric-based quality judgments and sparse verifiable execution feedback from a frozen downstream coding agent. This is a meaningful conceptual advance: rather than treating skill distillation as a static prompting problem, CODESKILL treats it as a policy optimization problem grounded in downstream utility.
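The training signal described above can be illustrated with a minimal sketch. It assumes the hybrid reward is a weighted sum of a rubric score in [0, 1] and a binary execution outcome, and that advantages are normalized within each sampled group, as is standard in GRPO. The function names, the weights `w_rubric` and `w_exec`, and the sample values are hypothetical; the paper's actual reward formulation may differ.

```python
from statistics import mean, pstdev


def hybrid_reward(rubric_score: float, exec_passed: bool,
                  w_rubric: float = 0.5, w_exec: float = 0.5) -> float:
    """Combine a dense rubric score with a sparse pass/fail execution signal.

    The weights are illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= rubric_score <= 1.0:
        raise ValueError("rubric_score must lie in [0, 1]")
    return w_rubric * rubric_score + w_exec * (1.0 if exec_passed else 0.0)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: center and scale each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of candidate skills sampled for the same trajectory:
# (rubric score from a judge, did the frozen downstream agent pass with this skill?)
group = [(0.9, True), (0.6, False), (0.3, False), (0.8, True)]
rewards = [hybrid_reward(score, passed) for score, passed in group]
advantages = group_relative_advantages(rewards)

for (score, passed), r, a in zip(group, rewards, advantages):
    print(f"rubric={score:.1f} passed={passed!s:5} reward={r:.2f} advantage={a:+.2f}")
```

In this sketch the rubric term separates candidates even when the execution outcome is identical, which is the usual motivation for pairing a dense signal with a sparse one.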
2. Methodological Rigor
Strengths in methodology:
Concerns:
3. Potential Impact
The paper addresses a genuine practical need: as coding agents become more capable and widely deployed, the ability to systematically accumulate and maintain reusable procedural knowledge becomes increasingly valuable. The demonstrated improvements in average pass rate (9.69 points over the no-skill baseline and 4.01 points over the strongest baseline) are substantial.
Real-world applications:
Broader influence:
4. Timeliness & Relevance
This paper is highly timely. The coding agent space is rapidly evolving (SWE-bench, Devin, etc.), and the question of how agents can learn from accumulated experience is a current bottleneck. The paper directly addresses the limitation of static prompt-based skill extraction methods that have emerged in recent months (SkillRL, AutoSkill, etc.). The use of RL for meta-level skill management rather than just task solving represents a forward-looking research direction.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Additional Observations
The paper's framing of the "attribution problem" in execution rewards, where downstream success may not be caused by the injected skill, is an important insight that applies broadly to any system trying to learn from indirect feedback signals. The alignment reward as a solution is pragmatic but imperfect, as it still relies on LLM judgment. The three-stage curriculum is well designed, but the paper provides limited analysis of what happens without it (e.g., training all operations jointly from the start).
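One way to picture the attribution problem is a sketch in which execution credit is scaled by an alignment score. It assumes an LLM judge returns a value in [0, 1] estimating how closely the downstream trajectory followed the injected skill. The function `credited_execution_reward` and the multiplicative gating are hypothetical illustrations, not the paper's stated formulation.

```python
def credited_execution_reward(exec_passed: bool, alignment: float) -> float:
    """Scale the sparse execution outcome by how much the skill was actually used.

    `alignment` is an assumed judge score in [0, 1]; multiplicative gating is
    one possible design, chosen here only for illustration.
    """
    if not 0.0 <= alignment <= 1.0:
        raise ValueError("alignment must lie in [0, 1]")
    return alignment * (1.0 if exec_passed else 0.0)


cases = {
    "pass, skill followed": credited_execution_reward(True, 0.9),
    "pass, skill ignored": credited_execution_reward(True, 0.1),
    "fail, skill followed": credited_execution_reward(False, 0.9),
}

for label, reward in cases.items():
    print(f"{label}: {reward:.1f}")
```

Under this gating, a task the agent solved while ignoring the skill earns little credit, which reduces spurious reinforcement. The score still comes from an LLM judge, so the weakness noted above remains.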
Generated May 26, 2026
Comparison History (18)
Paper 2 likely has higher scientific impact: it reframes AI safety around “controllability” (a timely, field-wide concern for agentic systems), proposes a clear definition, introduces a benchmark (ControlBench) to operationalize the concept, and offers architectural principles that can influence research across alignment, security, HCI, and systems. While Paper 1 is methodologically rigorous and practically useful for coding agents, its impact is narrower (software-engineering agents and skill/memory management) and more incremental relative to fast-moving agent-optimization work.
Paper 1 addresses a fundamental and underexplored problem in legal AI—calibrated sensitivity to legally relevant vs. irrelevant changes—introducing both a novel evaluation framework and a solver-grounded reasoning system (LexGuard). It combines formal methods (SMT solvers) with LLMs in a principled way, addressing trustworthiness concerns critical for high-stakes AI deployment. Paper 2, while solid, addresses a more incremental improvement in coding agent skill learning. Paper 1's broader implications for AI safety, fairness, and formal verification in legal domains give it higher cross-disciplinary impact and timeliness.
Paper 1 addresses a critical bottleneck in autonomous agent development (continuous self-evolution and memory management) with a novel RL-based framework. Its demonstrated success on highly competitive benchmarks like SWE-Bench suggests immediate and broad applicability in AI-driven software engineering, a rapidly accelerating field, giving it a higher potential for widespread scientific impact compared to the narrower focus of operations research in Paper 2.
CODESKILL introduces a novel learnable framework for skill extraction and maintenance in coding agents using reinforcement learning, demonstrating significant performance improvements across multiple benchmarks. Its contribution—reformulating skill management as a learnable policy rather than relying on heuristics—is more technically innovative and has broader practical impact for the rapidly growing field of AI coding agents. While MemFail provides a useful diagnostic benchmark for LLM memory systems, it is primarily an evaluation tool rather than a methodological advance, limiting its transformative potential compared to CODESKILL's self-evolving agent framework.
Paper 2 likely has higher impact due to a more novel, broadly applicable contribution: a learnable (RL-trained) policy for skill extraction and skill-bank maintenance from agent trajectories, validated on widely used, high-signal coding benchmarks with verifiable execution rewards. This targets a timely and fast-moving area (agentic LLMs, self-improvement, long-horizon software tasks) with clear real-world applicability. Paper 1 is valuable for trustworthy medical AI, but similar neuro-symbolic/LLM+logic directions are already crowded and clinical deployment faces higher barriers; reported gains are mainly interpretability at comparable accuracy.
Paper 2 (CODESKILL) likely has higher scientific impact due to broader applicability and timeliness: self-improving coding agents and reusable skill abstractions affect agentic AI, software engineering automation, and continual learning. Its learnable skill management policy trained with RL plus execution-verified feedback is a more general algorithmic contribution than a systems optimization for Tree-of-Thoughts KV caching. CODESKILL is evaluated on widely used, high-stakes benchmarks (SWE-Bench Verified, Terminal-Bench), suggesting immediate real-world relevance and cross-field uptake. ArborKV is valuable but narrower to ToT-style inference memory management.
Paper 2 is likely higher impact due to broader, more general significance: it introduces a principled bridge between classical optimal search (A*) and LLM post-training for reasoning, applicable beyond any single domain (e.g., theorem proving, planning, verification, NLI). The framing of proofs as search with A*-guided supervision/RL is novel and timely for improving reasoning efficiency and correctness, with strong evidence (small models surpassing much larger ones) suggesting practical value and scalability. Paper 1 is impactful for coding agents, but is more domain-specific.
Paper 2 likely has higher impact: it introduces a novel, learnable RL-based policy for extracting and maintaining reusable procedural skills for coding agents, with clear methodological contributions and strong benchmark gains on widely used, timely tasks (SWE-Bench Verified, etc.). It has direct real-world applicability to software engineering automation and can generalize to other agent domains (skill/memory management), broadening cross-field relevance. Paper 1 is innovative and rigorous in large-scale empirical analysis of human–AI cultural dynamics, but its applications are more interpretive and domain-specific, with less immediate translational leverage than improvements to coding agents.
Paper 1 is likely higher impact due to stronger novelty and broader applicability: it introduces a learnable, RL-trained policy for skill extraction and skill-bank maintenance from agent trajectories, validated on major coding benchmarks with measurable execution-verified gains. This advances general agent self-improvement and memory/skill management, relevant across LLM agents, software engineering automation, and continual learning. Paper 2 is timely and useful for LLM-assisted qualitative analysis, but its scope is narrower (QDA-specific) and evaluation via similarity to human codes may limit perceived rigor and generality compared to execution-based benchmarking.
Paper 2 addresses a highly timely and rapidly expanding field (LLM-based autonomous coding agents) and introduces a learnable, self-evolving skill framework trained via reinforcement learning. Its potential impact spans software engineering, AI agent design, and continual learning. Paper 1 offers a solid methodological improvement for traffic forecasting, but its scope is narrower and less likely to drive broad, cross-disciplinary breakthroughs compared to the advancements in autonomous reasoning and tool use presented in Paper 2.
Paper 1 (CODESKILL) likely has higher impact due to stronger real-world applicability and clearer empirical validation on widely used software-engineering benchmarks (SWE-Bench Verified, EnvBench, Terminal-Bench 2), showing sizable pass-rate gains and a practical mechanism for continual skill-bank maintenance. Its RL-based learnable policy for skill extraction/evolution addresses a concrete gap in agent self-improvement and could transfer broadly to other tool-using agents. Paper 2 is conceptually novel (hyperbolic guidance) but appears more specialized and may face adoption friction without demonstrated large-scale downstream integration.
Paper 2 has higher potential scientific impact because it challenges a fundamental and universally used metric (mean cross-entropy) in language model training. By demonstrating scenarios where median CE better correlates with task performance and explaining the underlying distributional shifts, its findings are broadly applicable across all LLM research, evaluation, and distillation tasks. In contrast, Paper 1 presents a valuable but more narrowly focused framework specific to procedural skill learning for coding agents.
Paper 2 is more novel methodologically: it frames skill extraction and skill-bank maintenance for coding agents as a learnable RL policy with hybrid dense/sparse, verifiable feedback, addressing a clear gap in how agent experience is selected and updated. Its applications (software engineering agents) are immediate and widely useful, and results are shown on established, practical benchmarks (SWE-Bench Verified, Terminal-Bench 2) with sustained gains and stable memory growth. Paper 1 is impactful but partly incremental (safety-oriented FM + orchestration/cost reductions) and harder to generalize beyond enterprise LLM deployment.
Paper 2 addresses a fundamental challenge in neuroscience and brain-computer interfaces by reframing anatomical variation as an inductive prior rather than noise. Its ability to drastically reduce training time (from 600 to 10 epochs) while achieving state-of-the-art performance on surface decoders suggests a profound methodological leap. While Paper 1 offers a strong engineering advancement for LLM coding agents, Paper 2 has a broader scientific impact by bridging deep learning efficiency with neuroanatomical fidelity, potentially accelerating discoveries in cognitive neuroscience and medical diagnostics.
Paper 2 likely has higher scientific impact: it proposes a learnable, reinforcement-learning-based skill extraction/maintenance framework for coding agents and demonstrates substantial gains on widely used benchmarks (SWE-Bench Verified, Terminal-Bench 2, EnvBench), making it timely and broadly relevant to agentic LLM research and software engineering. The methodology includes verifiable execution feedback and iterative skill-bank stabilization, which may generalize across agent settings. Paper 1 is rigorous and useful for compliance engineering, but its impact may be narrower and more domain-specific despite solid implementation and open-source release.
CODESKILL addresses the high-impact area of AI coding agents with a novel learnable skill management framework using RL, demonstrating significant improvements across multiple benchmarks. Its approach to self-evolving skills for coding agents has broad practical applications in software engineering and connects to the rapidly growing field of LLM-based agents. Paper 2 makes a solid methodological contribution to subjective NLP uncertainty estimation but addresses a narrower problem domain with more incremental advances (combining existing techniques like cSG-MCMC with soft labels). Paper 1's timeliness, broader applicability, and stronger practical implications give it higher impact potential.
Paper 2 addresses the highly active and impactful field of autonomous LLM coding agents. By proposing a learnable, RL-driven policy for skill extraction and maintenance, it overcomes limitations of heuristic-based memory systems. Its strong empirical results on premier benchmarks like SWE-Bench Verified demonstrate significant, immediate real-world applicability in software engineering automation. While Paper 1 presents a novel theoretical intersection of causality and argumentation for XAI, Paper 2 aligns better with current AI trends and offers broader, more scalable technological impact.
Paper 2 has higher likely scientific impact because it targets the evaluation foundations of knowledge-work agents across domains, offering a general framework (activity/setting/product) and an 18-activity taxonomy grounded in O*NET, plus concrete benchmark case analyses. This can reshape how the field designs, reports, and interprets benchmarks—affecting many subareas (coding, research, healthcare, office work) and improving external validity. Paper 1 is technically novel and practically useful for coding agents, but its impact is narrower (primarily software-engineering agents and skill-memory methods) and depends on adoption within that slice.