CODESKILL: Learning Self-Evolving Skills for Coding Agents

Yanzhou Li, Yiran Zhang, Xiaoyu Zhang, Xiaoxia Liu, Yang Liu

May 25, 2026

arXiv:2605.25430v1

cs.AI (primary)

#915 of 2682 · Artificial Intelligence

Tournament Score: 1444 ± 42 (scale 1050 to 1800)

Win Rate: 67%

Rating: 6.8 / 10

Significance 7 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Abstract

Coding agents produce rich trajectories while solving software-engineering tasks. To enable agent self-evolution, these trajectories can be distilled into reusable procedural skills that compactly encode experience to guide future behavior. However, existing skill construction and maintenance methods often rely on fixed prompts and heuristic update rules, leaving it unclear how knowledge should be selected, abstracted, and maintained to best serve downstream agents. We propose CODESKILL, an LLM-based framework that reformulates skill extraction and skill-bank maintenance as a learnable management policy. CODESKILL extracts multi-granularity procedural skills from coding-agent trajectories, evolves skills with new experience, and maintains a compact skill bank for future task solving. We train CODESKILL with reinforcement learning, using a hybrid reward that combines dense rubric-based skill-quality feedback with sparse verifiable execution feedback from the frozen downstream agent. Experiments on EnvBench, SWE-Bench Verified, and Terminal-Bench 2 show that CODESKILL improves average pass rate by 9.69 over the no-skill baseline and by 4.01 over the strongest prompt-based or memory baseline, while maintaining the skill bank at a stable size during iterative construction.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: CODESKILL — Learning Self-Evolving Skills for Coding Agents

1. Core Contribution

CODESKILL reformulates the problem of skill extraction and skill-bank maintenance for coding agents as a learnable management policy rather than relying on fixed prompts and heuristic rules. The framework performs three operations: (1) multi-granularity skill extraction (task-level and event-driven) from agent trajectories, (2) skill evolution using new evidence, and (3) skill-bank maintenance (add/merge/drop). The key novelty is training a small LLM (Qwen3.5-4B) via GRPO reinforcement learning with a hybrid reward combining dense rubric-based quality judgments and sparse verifiable execution feedback from a frozen downstream coding agent. This is a meaningful conceptual advance: rather than treating skill distillation as a static prompting problem, CODESKILL treats it as a policy optimization problem grounded in downstream utility.
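As a rough illustration of the three maintenance operations named above (add, merge, drop), the following minimal Python sketch applies one decision at a time to a skill bank. The `Skill` fields, the `apply_decision` helper, and the decision format are hypothetical and chosen only for illustration. In CODESKILL the decision comes from the trained policy, and the paper's actual schema may differ. "Drop" is assumed here to mean discarding the candidate.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Skill:
    """A natural-language procedural skill (hypothetical schema)."""
    name: str
    instructions: str


def apply_decision(bank: Dict[str, Skill], candidate: Skill,
                   action: str, target: Optional[str] = None) -> Dict[str, Skill]:
    """Apply one maintenance action for one candidate skill.

    action is "add", "merge", or "drop"; target names the existing skill
    to merge into. Returns a new bank and leaves the input untouched.
    """
    updated = dict(bank)
    if action == "add":
        updated[candidate.name] = candidate
    elif action == "merge":
        if target not in updated:
            raise KeyError(f"merge target {target!r} not in bank")
        old = updated[target]
        # Naive merge: concatenate instructions. A learned policy would
        # instead rewrite the merged skill text.
        updated[target] = Skill(old.name, old.instructions + "\n" + candidate.instructions)
    elif action == "drop":
        pass  # assumption: discard the candidate, bank unchanged
    else:
        raise ValueError(f"unknown action: {action}")
    return updated


bank = {"pin-deps": Skill("pin-deps", "Pin dependency versions before installing.")}
bank = apply_decision(bank, Skill("venv-first", "Create a virtualenv first."), "add")
bank = apply_decision(bank, Skill("pin-deps-2", "Prefer lock files when present."),
                      "merge", target="pin-deps")
bank = apply_decision(bank, Skill("noise", "Try again."), "drop")
print(sorted(bank))
```

Because only "add" grows the bank, a policy that often chooses merge or drop keeps the bank compact, which is the behavior the assessment credits to the maintenance stage.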

2. Methodological Rigor

Strengths in methodology:

The hybrid reward design is well-motivated. The authors recognize that execution reward alone suffers from attribution problems (the agent may succeed/fail for reasons unrelated to the skill), and introduce an alignment factor to improve credit assignment. This is a thoughtful design choice.

The three-stage curriculum (extraction → evolution → maintenance) naturally follows the skill lifecycle and enables data reuse across stages, addressing the challenge of sparse verifiable feedback.

The SFT warmup before RL is appropriate, ensuring the model learns the operation schema before optimization.
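The exact reward formula is not given in this assessment, so the sketch below is only one plausible reading of the hybrid reward: a dense quality score in [0, 1] plus an execution term (pass rate with the skill minus the cached no-skill pass rate) scaled by an alignment factor in [0, 1]. The weights, function names, and numbers are assumptions. The second function is the standard group-relative advantage normalization used by GRPO.

```python
from statistics import mean, pstdev
from typing import List


def hybrid_reward(quality: float, alignment: float,
                  pass_with_skill: float, pass_no_skill: float,
                  w_quality: float = 0.5, w_exec: float = 0.5) -> float:
    """Dense quality term plus alignment-scaled execution gain (assumed form)."""
    exec_gain = pass_with_skill - pass_no_skill
    return w_quality * quality + w_exec * alignment * exec_gain


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidate skills sampled for the same trajectory (made-up numbers).
group = [
    hybrid_reward(quality=0.9, alignment=1.0, pass_with_skill=0.75, pass_no_skill=0.25),
    hybrid_reward(quality=0.9, alignment=0.0, pass_with_skill=0.75, pass_no_skill=0.25),
    hybrid_reward(quality=0.4, alignment=1.0, pass_with_skill=0.25, pass_no_skill=0.25),
    hybrid_reward(quality=0.4, alignment=0.5, pass_with_skill=0.00, pass_no_skill=0.25),
]
advantages = group_relative_advantages(group)
print([round(r, 4) for r in group])
print([round(a, 2) for a in advantages])
```

The first two candidates differ only in alignment: when the judge decides the downstream success was unrelated to the skill (alignment 0), the execution gain contributes nothing, which is the credit-assignment effect described above.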

Concerns:

The training pipeline relies heavily on GPT-5.4-mini as both the teacher model for SFT data and the LLM-as-judge for quality/alignment rewards. This creates a potential circularity: the quality of extracted skills is partially judged by the same family of models that generated the supervised targets.

The cached no-skill baseline scores come from only 4 runs per instance, which may add variance to the execution reward signal.

The reverse-retrieval mechanism for finding evaluation instances during RL is clever but introduces a selection bias: skills are tested on tasks they are likely to match, potentially inflating execution rewards during training.

The paper lacks statistical significance tests or confidence intervals on the main results, despite the inherent stochasticity of both skill generation and agent rollouts.
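To make the variance concern about 4-run baselines concrete: if an instance's true no-skill pass probability is p and the baseline is estimated from n independent runs, the estimate has standard error sqrt(p(1-p)/n). The sketch below assumes independent Bernoulli runs, which is a simplification, and the helper names are illustrative.

```python
import math
from typing import List


def pass_rate_standard_error(p: float, n_runs: int) -> float:
    """Standard error of a pass rate estimated from n independent runs."""
    return math.sqrt(p * (1.0 - p) / n_runs)


def possible_estimates(n_runs: int) -> List[float]:
    """Every value the n-run pass-rate estimate can take."""
    return [k / n_runs for k in range(n_runs + 1)]


for p in (0.1, 0.5, 0.9):
    print(p, round(pass_rate_standard_error(p, 4), 3))
print(possible_estimates(4))
```

With 4 runs, the baseline for a single instance can only take the values 0, 0.25, 0.5, 0.75, or 1, and at p = 0.5 its standard error is 0.25, so small per-instance execution gains are hard to separate from noise.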

3. Potential Impact

The paper addresses a genuine practical need: as coding agents become more capable and widely deployed, the ability to systematically accumulate and maintain reusable procedural knowledge becomes increasingly valuable. The demonstrated improvements (9.69 points average pass rate over no-skill, 4.01 over strongest baseline) are substantial.

Real-world applications:

Enterprise deployments of coding agents could use CODESKILL to build organizational knowledge banks that improve over time.

The framework could be adapted to other long-horizon agent domains (DevOps, system administration, data engineering).

The convergence of the skill bank to a stable size under maintenance is practically important for production deployments.

Broader influence:

The idea of treating memory/skill management as a learnable policy optimized with downstream feedback could influence the broader agent memory and self-improvement literature.

The multi-granularity skill design (task-level vs. event-driven) provides a useful taxonomy for procedural knowledge in agent systems.

4. Timeliness & Relevance

This paper is highly timely. The coding agent space is rapidly evolving (SWE-bench, Devin, etc.), and the question of how agents can learn from accumulated experience is a current bottleneck. The paper directly addresses the limitation of static prompt-based skill extraction methods that have emerged in recent months (SkillRL, AutoSkill, etc.). The use of RL for meta-level skill management rather than just task solving represents a forward-looking research direction.

5. Strengths & Limitations

Key Strengths:

Principled formulation: Casting skill management as a learnable policy with downstream-grounded optimization is a clean and powerful abstraction.

Comprehensive evaluation: Three benchmarks including an out-of-distribution test (Terminal-Bench 2), two different downstream policies (Qwen3.5-35B-A3B and GPT-5.4-mini), and thorough ablations.

Generalization evidence: Skills learned with one downstream policy transfer to another, and improvements hold on OOD benchmarks. This suggests the framework learns genuinely transferable skill management behavior.

Practical scalability: The skill bank converges to stable sizes (~676 skills), demonstrating that unbounded growth is avoided.

Efficiency gains: Beyond accuracy, CODESKILL reduces average reasoning steps from 44.12 to 35.15 on solved instances (about 20% fewer).

Notable Weaknesses:

Limited skill representations: Skills are restricted to natural-language instructions. The authors acknowledge this, but it limits applicability in settings where executable scripts, tool definitions, or code templates would be more effective.

Single-operation limitation: Processing one candidate skill at a time prevents more complex maintenance decisions (splitting, batch reorganization).

Evaluation scale: SWE-Bench Verified evaluation uses only 150 of 500 instances, and EnvBench uses 150 of 994 repositories. While this is understandable given computational constraints, it limits statistical power.

Dependency on strong judge models: The quality and alignment rewards require GPT-5.4-mini calls, adding cost and introducing model-dependent biases into training.

Reproducibility concerns: The paper uses GPT-5.4-mini (a very recent model at time of writing) for multiple critical components, and some trajectory data comes from proprietary sources. The RL training requires ~210 hours on 4×H100 GPUs, which is non-trivial.

Ablation gap: The full lifecycle slightly reduces performance compared to extraction+evolution while halving the skill bank. Whether this tradeoff is worthwhile depends on deployment constraints, and the paper doesn't thoroughly analyze when maintenance helps versus hurts.
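The statistical-power point about 150-instance subsets can be quantified with a binomial confidence interval. The sketch below computes a Wilson score interval. The 75-of-150 count is a hypothetical example, not a result from the paper, and it treats a single run on one benchmark as independent Bernoulli trials.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return center - half, center + half


# Hypothetical example: 75 of 150 instances solved in a single run.
low, high = wilson_interval(75, 150)
print(round(low, 3), round(high, 3))
```

Under these assumptions the interval spans roughly 8 percentage points on either side of a 50% pass rate. Averaging over benchmarks or repeated runs would narrow it, which is why reported intervals would make the 4.01-point gain over the strongest baseline easier to interpret.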

Additional Observations

The paper's framing of the "attribution problem" in execution rewards — where downstream success may not be caused by the injected skill — is an important insight that applies broadly to any system trying to learn from indirect feedback signals. The alignment reward as a solution is pragmatic but imperfect, as it still relies on LLM judgment. The three-stage curriculum is well-designed but the paper provides limited analysis of what happens without it (e.g., training all operations jointly from the start).

Rating: 6.8 / 10

Significance 7 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Generated May 26, 2026

Comparison History (18)

vs. Position: AI Safety Requires Effective Controllability

gpt-5.2 · 5/27/2026

Paper 2 likely has higher scientific impact: it reframes AI safety around “controllability” (a timely, field-wide concern for agentic systems), proposes a clear definition, introduces a benchmark (ControlBench) to operationalize the concept, and offers architectural principles that can influence research across alignment, security, HCI, and systems. While Paper 1 is methodologically rigorous and practically useful for coding agents, its impact is narrower (software-engineering agents and skill/memory management) and more incremental relative to fast-moving agent-optimization work.

vs. Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning

claude-opus-4.6 · 5/27/2026

Paper 1 addresses a fundamental and underexplored problem in legal AI—calibrated sensitivity to legally relevant vs. irrelevant changes—introducing both a novel evaluation framework and a solver-grounded reasoning system (LexGuard). It combines formal methods (SMT solvers) with LLMs in a principled way, addressing trustworthiness concerns critical for high-stakes AI deployment. Paper 2, while solid, addresses a more incremental improvement in coding agent skill learning. Paper 1's broader implications for AI safety, fairness, and formal verification in legal domains give it higher cross-disciplinary impact and timeliness.

vs. Generating Robust Portfolios of Optimization Models using Large Language Models

gemini-3.1 · 5/27/2026

Paper 1 addresses a critical bottleneck in autonomous agent development (continuous self-evolution and memory management) with a novel RL-based framework. Its demonstrated success on highly competitive benchmarks like SWE-Bench suggests immediate and broad applicability in AI-driven software engineering, a rapidly accelerating field, giving it a higher potential for widespread scientific impact compared to the narrower focus of operations research in Paper 2.

vs. MemFail: Stress-Testing Failure Modes of LLM Memory Systems

claude-opus-4.6 · 5/27/2026

CODESKILL introduces a novel learnable framework for skill extraction and maintenance in coding agents using reinforcement learning, demonstrating significant performance improvements across multiple benchmarks. Its contribution—reformulating skill management as a learnable policy rather than relying on heuristics—is more technically innovative and has broader practical impact for the rapidly growing field of AI coding agents. While MemFail provides a useful diagnostic benchmark for LLM memory systems, it is primarily an evaluation tool rather than a methodological advance, limiting its transformative potential compared to CODESKILL's self-evolving agent framework.

vs. Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact due to a more novel, broadly applicable contribution: a learnable (RL-trained) policy for skill extraction and skill-bank maintenance from agent trajectories, validated on widely used, high-signal coding benchmarks with verifiable execution rewards. This targets a timely and fast-moving area (agentic LLMs, self-improvement, long-horizon software tasks) with clear real-world applicability. Paper 1 is valuable for trustworthy medical AI, but similar neuro-symbolic/LLM+logic directions are already crowded and clinical deployment faces higher barriers; reported gains are mainly interpretability at comparable accuracy.

vs. ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning

gpt-5.2 · 5/26/2026

Paper 2 (CODESKILL) likely has higher scientific impact due to broader applicability and timeliness: self-improving coding agents and reusable skill abstractions affect agentic AI, software engineering automation, and continual learning. Its learnable skill management policy trained with RL plus execution-verified feedback is a more general algorithmic contribution than a systems optimization for Tree-of-Thoughts KV caching. CODESKILL is evaluated on widely used, high-stakes benchmarks (SWE-Bench Verified, Terminal-Bench), suggesting immediate real-world relevance and cross-field uptake. ArborKV is valuable but narrower to ToT-style inference memory management.

vs. Learning to Reason Efficiently with A* Post-Training

gpt-5.2 · 5/26/2026

Paper 2 is likely higher impact due to broader, more general significance: it introduces a principled bridge between classical optimal search (A*) and LLM post-training for reasoning, applicable beyond any single domain (e.g., theorem proving, planning, verification, NLI). The framing of proofs as search with A*-guided supervision/RL is novel and timely for improving reasoning efficiency and correctness, with strong evidence (small models surpassing much larger ones) suggesting practical value and scalability. Paper 1 is impactful for coding agents, but is more domain-specific.

vs. Dynamics of collective creativity in AI art competitions

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact: it introduces a novel, learnable RL-based policy for extracting and maintaining reusable procedural skills for coding agents, with clear methodological contributions and strong benchmark gains on widely used, timely tasks (SWE-Bench Verified, etc.). It has direct real-world applicability to software engineering automation and can generalize to other agent domains (skill/memory management), broadening cross-field relevance. Paper 1 is innovative and rigorous in large-scale empirical analysis of human–AI cultural dynamics, but its applications are more interpretive and domain-specific, with less immediate translational leverage than improvements to coding agents.

vs. Agent-as-Peer-Debriefer: A Multi-Agent Framework with Perspective-Based Refinement for Qualitative Analysis

gpt-5.2 · 5/26/2026

Paper 1 is likely higher impact due to stronger novelty and broader applicability: it introduces a learnable, RL-trained policy for skill extraction and skill-bank maintenance from agent trajectories, validated on major coding benchmarks with measurable execution-verified gains. This advances general agent self-improvement and memory/skill management, relevant across LLM agents, software engineering automation, and continual learning. Paper 2 is timely and useful for LLM-assisted qualitative analysis, but its scope is narrower (QDA-specific) and evaluation via similarity to human codes may limit perceived rigor and generality compared to execution-based benchmarking.

vs. PHGNet: Prototype-Guided Hypergraph Construction for Heterogeneous Spatiotemporal Forecasting

gemini-3.1 · 5/26/2026

Paper 2 addresses a highly timely and rapidly expanding field (LLM-based autonomous coding agents) and introduces a learnable, self-evolving skill framework trained via reinforcement learning. Its potential impact spans software engineering, AI agent design, and continual learning. Paper 1 offers a solid methodological improvement for traffic forecasting, but its scope is narrower and less likely to drive broad, cross-disciplinary breakthroughs compared to the advancements in autonomous reasoning and tool use presented in Paper 2.

vs. HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

gpt-5.2 · 5/26/2026

Paper 1 (CODESKILL) likely has higher impact due to stronger real-world applicability and clearer empirical validation on widely used software-engineering benchmarks (SWE-Bench Verified, EnvBench, Terminal-Bench 2), showing sizable pass-rate gains and a practical mechanism for continual skill-bank maintenance. Its RL-based learnable policy for skill extraction/evolution addresses a concrete gap in agent self-improvement and could transfer broadly to other tool-using agents. Paper 2 is conceptually novel (hyperbolic guidance) but appears more specialized and may face adoption friction without demonstrated large-scale downstream integration.

vs. When Mean CE Fails: Median CE Can Better Track Language Model Quality

gemini-3.1 · 5/26/2026

Paper 2 has higher potential scientific impact because it challenges a fundamental and universally used metric (mean cross-entropy) in language model training. By demonstrating scenarios where median CE better correlates with task performance and explaining the underlying distributional shifts, its findings are broadly applicable across all LLM research, evaluation, and distillation tasks. In contrast, Paper 1 presents a valuable but more narrowly focused framework specific to procedural skill learning for coding agents.

vs. JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data

gpt-5.2 · 5/26/2026

Paper 2 is more novel methodologically: it frames skill extraction and skill-bank maintenance for coding agents as a learnable RL policy with hybrid dense/sparse, verifiable feedback, addressing a clear gap in how agent experience is selected and updated. Its applications (software engineering agents) are immediate and widely useful, and results are shown on established, practical benchmarks (SWE-Bench Verified, Terminal-Bench 2) with sustained gains and stable memory growth. Paper 1 is impactful but partly incremental (safety-oriented FM + orchestration/cost reductions) and harder to generalize beyond enterprise LLM deployment.

vs. NeurIPS: Neuro-anatomical Inductive Priors for Sphere-based Brain Decoding

gemini-3.1 · 5/26/2026

Paper 2 addresses a fundamental challenge in neuroscience and brain-computer interfaces by reframing anatomical variation as an inductive prior rather than noise. Its ability to drastically reduce training time (from 600 to 10 epochs) while achieving state-of-the-art performance on surface decoders suggests a profound methodological leap. While Paper 1 offers a strong engineering advancement for LLM coding agents, Paper 2 has a broader scientific impact by bridging deep learning efficiency with neuroanatomical fidelity, potentially accelerating discoveries in cognitive neuroscience and medical diagnostics.

vs. Ontological Knowledge Blocks: Executable Compliance and Profile-Based Validation for Trustworthy AI Systems

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact: it proposes a learnable, reinforcement-learning-based skill extraction/maintenance framework for coding agents and demonstrates substantial gains on widely used benchmarks (SWE-Bench Verified, Terminal-Bench 2, EnvBench), making it timely and broadly relevant to agentic LLM research and software engineering. The methodology includes verifiable execution feedback and iterative skill-bank stabilization, which may generalize across agent settings. Paper 1 is rigorous and useful for compliance engineering, but its impact may be narrower and more domain-specific despite solid implementation and open-source release.

vs. Uncertainty Decomposition via Cyclical SG-MCMC and Soft-label Learning for Subjective NLP

claude-opus-4.6 · 5/26/2026

CODESKILL addresses the high-impact area of AI coding agents with a novel learnable skill management framework using RL, demonstrating significant improvements across multiple benchmarks. Its approach to self-evolving skills for coding agents has broad practical applications in software engineering and connects to the rapidly growing field of LLM-based agents. Paper 2 makes a solid methodological contribution to subjective NLP uncertainty estimation but addresses a narrower problem domain with more incremental advances (combining existing techniques like cSG-MCMC with soft labels). Paper 1's timeliness, broader applicability, and stronger practical implications give it higher impact potential.

vs. A Causal Argumentation Method for Explainability of Machine Learning Models

gemini-3.1 · 5/26/2026

Paper 2 addresses the highly active and impactful field of autonomous LLM coding agents. By proposing a learnable, RL-driven policy for skill extraction and maintenance, it overcomes limitations of heuristic-based memory systems. Its strong empirical results on premier benchmarks like SWE-Bench Verified demonstrate significant, immediate real-world applicability in software engineering automation. While Paper 1 presents a novel theoretical intersection of causality and argumentation for XAI, Paper 2 aligns better with current AI trends and offers broader, more scalable technological impact.

vs. Design and Report Benchmarks for Knowledge Work

gpt-5.2 · 5/26/2026

Paper 2 has higher likely scientific impact because it targets the evaluation foundations of knowledge-work agents across domains, offering a general framework (activity/setting/product) and an 18-activity taxonomy grounded in O*NET, plus concrete benchmark case analyses. This can reshape how the field designs, reports, and interprets benchmarks—affecting many subareas (coding, research, healthcare, office work) and improving external validity. Paper 1 is technically novel and practically useful for coding agents, but its impact is narrower (primarily software-engineering agents and skill-memory methods) and depends on adoption within that slice.