Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He

May 21, 2026

arXiv:2605.22148v1

cs.AI (primary) · cs.CL

#655 of 2292 · Artificial Intelligence

Tournament Score: 1456±49 (scale 1050–1800)

Win Rate: 79%

Rating: 6.5/10

Significance 7 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Abstract

Self-evolving skill libraries, pioneered by Voyager, let frozen LLM agents accumulate reusable knowledge without weight updates, yet recent evaluation shows that LLM-authored skills deliver $+0.0$ pp over no-skill baselines while human-curated ones deliver $+16.2$ pp: the bottleneck is not skill authoring but lifecycle management. We introduce Ratchet, a single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural-language skills. Ratchet integrates four candidate hygiene mechanisms: outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. On MBPP+ hard-100 with Claude Opus 4.7, Ratchet lifts held-out pass@1 from a $0.258 \pm 0.047$ baseline to a late-window rolling mean of $0.584$ (peak $0.658 \pm 0.042$) across 100 rounds and 3 seeds, a $+0.328 \pm 0.018$ rolling-mean gain where the no-skill control drifts at $+0.002 \pm 0.005$; the same recipe transfers to an agentic solver on SWE-bench Verified ($+0.22$ peak lift over 20 rounds). Eight ablations (A1–A8) reveal that the minimal working recipe is smaller than our design suggests: retirement and the meta-skill authoring prior are load-bearing, while explicit deduplication (canonicalisation, cover-guard) is subsumed by the meta-skill itself. A non-divergence proposition shows that bounded cap and retirement threshold together prevent expected performance from drifting below the no-skills floor.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Ratchet — A Minimal Hygiene Recipe for Self-Evolving LLM Agents

1. Core Contribution

Ratchet reframes the problem of self-evolving LLM skill libraries: the bottleneck is not skill *authoring* but skill *lifecycle management* (the "librarian" problem). Building on the striking SkillsBench finding that LLM-authored skills deliver +0.0pp while human-curated ones deliver +16.2pp, the paper proposes a single-agent loop with four governance mechanisms: outcome-driven retirement, a bounded active-cap, a meta-skill authoring prior, and pattern canonicalization. The key insight — that managing the library matters more than generating its contents — is clean and actionable.

The system achieves a +0.328 rolling-mean gain on MBPP+ hard-100 (held-out pass@1 rising from a 0.258 baseline to a 0.584 late-window rolling mean) with a frozen Claude Opus 4.7, and a +0.22 peak lift on SWE-bench Verified. The authors claim this is the first LLM-self-authored library that closes the gap to human-curated performance.
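Two of the governance mechanisms named in this section, outcome-driven retirement and the bounded active-cap, can be illustrated with a minimal sketch. This is not the paper's implementation: the class and field names (`Skill`, `SkillLibrary`, `active_cap`, `min_uses`, `retire_below`), the threshold values, and the smoothed score used for cap eviction are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    text: str
    uses: int = 0          # rounds in which the skill was retrieved
    wins: int = 0          # of those, rounds where the task passed
    retired: bool = False

    @property
    def score(self) -> float:
        # Laplace-smoothed win rate, so an untested skill starts at 0.5
        # instead of 0.0 (an assumption, not taken from the paper).
        return (self.wins + 1) / (self.uses + 2)


@dataclass
class SkillLibrary:
    active_cap: int = 8        # hypothetical bound on active skills
    min_uses: int = 5          # hypothetical evidence needed before retiring
    retire_below: float = 0.3  # hypothetical raw win-rate threshold
    skills: dict = field(default_factory=dict)

    def active(self):
        return [s for s in self.skills.values() if not s.retired]

    def add(self, skill: Skill) -> None:
        self.skills[skill.name] = skill
        # Bounded active-cap: retire the lowest-scoring active skills
        # until at most `active_cap` remain.
        live = sorted(self.active(), key=lambda s: s.score)
        for s in live[: max(0, len(live) - self.active_cap)]:
            s.retired = True

    def record(self, name: str, passed: bool) -> None:
        # Outcome-driven retirement: once a skill has enough uses,
        # retire it if its observed win rate is below the threshold.
        s = self.skills[name]
        s.uses += 1
        s.wins += int(passed)
        if s.uses >= self.min_uses and s.wins / s.uses < self.retire_below:
            s.retired = True
```

In this sketch a skill is only ever marked retired, never deleted, so its outcome history stays available for inspection. Whether Ratchet keeps retired skills around is not stated in the text above.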

2. Methodological Rigor

Strengths in experimental design:

The 8 ablations (A1–A8) are well-structured, each modifying exactly one knob, enabling clean attribution of which components are load-bearing. The finding that retirement and the meta-skill prior are essential while explicit deduplication is subsumed is a genuinely useful design insight.

The hard-100 subset construction (filtering out tasks solved on all 5 probe seeds) is methodologically sound — it focuses evaluation on tasks where skill libraries could plausibly help.

Reporting rolling gain (late-window minus early-window) rather than just peak performance provides a more honest view of learning dynamics.
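The hard-subset filter and the rolling-gain metric mentioned above can be sketched as follows. The function names, the input layout, and the default window length are assumptions for illustration; the text above does not give the paper's exact window sizes.

```python
from statistics import mean


def hard_subset(probe_results):
    """Drop tasks solved on every probe seed; keep the rest.

    probe_results maps a task id to a list of booleans, one per probe seed.
    """
    return [task for task, passes in probe_results.items() if not all(passes)]


def rolling_gain(pass_at_1, window=10):
    """Late-window mean minus early-window mean of a per-round pass@1 curve.

    `window` is a hypothetical choice, not the paper's setting.
    """
    if len(pass_at_1) < 2 * window:
        raise ValueError("curve too short for two non-overlapping windows")
    return mean(pass_at_1[-window:]) - mean(pass_at_1[:window])
```

Because the gain compares two windowed averages, a single lucky round moves it far less than it would move a peak-based metric, which is the property the review credits.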

Concerns:

Only 3 seeds per condition on 40 eval tasks is statistically thin. With n=40 binary outcomes, standard deviations are inherently large (~0.07 per run at p≈0.25), and 3 seeds provide limited statistical power. Several comparisons (e.g., A5/A6 vs. Default) fall within noise; the authors acknowledge this but could have addressed it with more seeds.

The SWE-bench evaluation is preliminary: only 20 rounds, and the baseline of 0.65 on a "hard" subset is surprisingly high, raising questions about subset difficulty calibration. The authors explain this (tasks the agent solves "sometimes but not reliably"), but the filter design means the skill library is mainly converting unreliable successes into reliable ones rather than enabling genuinely new capabilities.

The non-divergence proposition (Prop. 1) is mathematically straightforward — essentially a union bound over Hoeffding concentration inequalities. The resulting bound (E[p0] - 0.35) is very loose and doesn't bind in practice. While it formalizes an intuition, calling it a "theoretical contribution" overstates its novelty; it's more of a sanity check.

Single model (Claude Opus 4.7) and provider limits generalizability claims. Cross-model evaluation would strengthen the contribution.
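Two of the statistical points in the concerns above can be checked numerically. The sketch below computes the binomial standard error behind the "~0.07 per run" figure, and a generic Hoeffding deviation with a union bound of the kind the review attributes to Prop. 1. It is not the paper's proposition: the function names and the values of `delta` and `k` are illustrative assumptions.

```python
import math


def binomial_se(p, n):
    """Standard error of a pass rate estimated from n independent pass/fail tasks."""
    return math.sqrt(p * (1.0 - p) / n)


def hoeffding_radius(n, delta, k=1):
    """Two-sided Hoeffding deviation for a mean of n values in [0, 1].

    With k > 1 the bound holds for k estimates simultaneously via a
    union bound, spending delta / k of the failure probability on each.
    """
    return math.sqrt(math.log(2.0 * k / delta) / (2.0 * n))


# 40 eval tasks at a pass rate near 0.25, as in the review's estimate.
print(round(binomial_se(0.25, 40), 4))  # 0.0685

# Illustrative only: 8 simultaneous estimates at 95% overall confidence.
print(round(hoeffding_radius(n=40, delta=0.05, k=8), 4))
```

The first value matches the review's ~0.07. The second shows why a Hoeffding-style guarantee at this sample size tends to be loose: the radius is several times the binomial standard error.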

3. Potential Impact

Practical implications: The "librarian over author" thesis is immediately actionable for practitioners building agentic systems with skill libraries. The finding that a meta-skill authoring prior substitutes for explicit deduplication is a useful engineering insight that could save implementation complexity.

Broader influence: The paper contributes to the growing literature on inference-time scaling and external memory for LLMs. The framework of separating skill creation from lifecycle management provides a conceptual vocabulary that could influence how future systems are designed. The experience compression spectrum framing (from the authors' own prior survey) positions this work well within the field.

Limitations on impact: The system is an "amplifier, not discoverer" — it cannot teach genuinely new knowledge, only redirect existing capabilities. This fundamentally bounds its utility. The natural-language skill format, while flexible, may not scale to domains requiring precise procedural knowledge.

4. Timeliness & Relevance

The paper is highly timely. Self-evolving agents are a major research direction in 2025-2026, with numerous concurrent systems (CASCADE, AutoSkill, EvolveR, etc.). The SkillsBench null result is fresh and provocative, and Ratchet directly addresses it. The comparison table (Table 1) effectively positions the work against 14 systems across 8 design axes, showing that no prior system combines all of Ratchet's governance features.

The focus on frozen-LLM systems (no weight updates) is practically relevant: most deployed LLM agents cannot be fine-tuned, so inference-time adaptation through skill libraries addresses a real deployment constraint.

5. Strengths & Limitations

Key strengths:

Clean, falsifiable thesis with empirical support

Thorough ablation study yielding actionable design principles (retirement and meta-skill are load-bearing; dedup is subsumed)

Surprising results that challenge design assumptions (A4 hurting, A5/A6 slightly helping)

Comprehensive appendices enabling reproducibility (hyperparameters, prompts, per-seed data)

Transfers to a qualitatively different setting (agentic solver on SWE-bench)

Notable weaknesses:

Statistical power is limited (3 seeds, 40 eval tasks)

The "hard subset" methodology introduces selection bias that inflates apparent gains — reporting on the full benchmark would provide context

The non-divergence bound is trivial in practice and doesn't constitute meaningful theoretical insight

No comparison against any existing skill-library system under identical conditions — all comparisons are indirect via the design-axis table

Retirement measures correlation, not causation (acknowledged but not addressed)

The paper is from 2026 and references several 2026 preprints, some potentially from the same group, creating a somewhat self-referential citation ecosystem

Missing elements:

No analysis of what the learned skills actually look like or qualitative examples of successful/failed skill application

No measurement of pass@k to validate the "amplifier not discoverer" hypothesis

No cost-benefit analysis beyond wall-time (API costs, token counts)
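For the missing pass@k measurement, one option is the unbiased estimator commonly used for code-generation benchmarks, sketched below. The paper is not stated to use this estimator; the function name and the example numbers in the note that follows are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement from the n are all failures.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k failures, so any k samples include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

As a usage example, `pass_at_k(10, 3, 1)` gives 0.3 for a task solved in 3 of 10 samples. Comparing no-skill pass@k at larger k against with-skill pass@1 on the same tasks would be one way to test whether the library only makes existing solutions more reliable.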

Summary

Ratchet makes a clear and well-supported argument that lifecycle management is the bottleneck in self-evolving skill libraries, and provides a practical recipe with empirical validation. The ablation study is the strongest contribution, yielding non-obvious design insights. However, the statistical foundations are thin, the theoretical contribution is minor, and the evaluation scope (single model, limited benchmarks) constrains generalizability claims. The work is a solid empirical contribution to an active research area but falls short of being definitive.

Rating: 6.5/10

Significance 7 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Generated May 22, 2026

Comparison History (14)

vs. Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

gemini-3.1 · 5/22/2026

Paper 1 (Ratchet) addresses a critical bottleneck in self-evolving LLM agents (skill lifecycle management) and demonstrates significant performance gains on rigorous benchmarks like SWE-bench. Its technical contributions to agent architecture have broad applicability in advancing autonomous AI systems. Paper 2 (QuestBench) offers valuable pedagogical insights and a new benchmark, but its impact is narrower, focusing primarily on AI education and evaluation rather than fundamental capability advancement.

vs. MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

gpt-5.2 · 5/22/2026

Paper 2 (Ratchet) has higher likely scientific impact: it targets a broadly relevant, timely bottleneck (skill lifecycle management) with a minimal, portable recipe; demonstrates substantial gains on widely used benchmarks (MBPP+, SWE-bench Verified) with multiple seeds, many rounds, and ablations; and adds a formal non-divergence proposition. This combination of generality, rigor, and reproducibility makes it more likely to influence practice across agent and LLM-tooling communities. Paper 1 (MOSS) is novel (source-level self-rewriting) and practically appealing, but evidence is narrower (single environment, single cycle) and the approach may face higher safety/engineering barriers to widespread adoption.

vs. AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

claude-opus-4.6 · 5/22/2026

Ratchet addresses a fundamental problem in LLM agent self-improvement—lifecycle management of skill libraries—with a minimal, principled recipe that shows dramatic gains (+32.8pp on MBPP+, transfers to SWE-bench). Its findings (retirement and meta-skill priors are load-bearing; deduplication is subsumed) provide broadly applicable insights for the rapidly growing field of autonomous LLM agents. AutoRubric-T2I is solid but more narrowly focused on T2I reward modeling. Ratchet's simplicity, transferability across benchmarks, and relevance to the agentic AI paradigm give it broader and more timely impact.

vs. Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

gemini-3.1 · 5/22/2026

Paper 1 (Self-Regulated Simulative Planning) proposes a fundamental architectural decomposition for agentic reasoning (System I, II, III) that addresses the critical bottleneck of token-inefficient planning in LLMs. By matching the performance of much larger models while using up to 95% fewer reasoning tokens, it offers immense practical efficiency gains and broader theoretical implications for agent design compared to the specific skill-library management refinements in Paper 2 (Ratchet).

vs. What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct

gpt-5.2 · 5/22/2026

Paper 2 (Ratchet) likely has higher impact due to a concrete, novel algorithmic “hygiene” loop for self-evolving LLM agents, strong quantitative gains on established benchmarks (MBPP+, SWE-bench Verified), multi-seed runs, extensive ablations, and a supporting non-divergence proposition. It is timely for agentic systems and could transfer broadly to tool/skill-learning, continual learning without finetuning, and autonomous software engineering. Paper 1 (the AI sycophancy taxonomy) provides valuable conceptual clarification (taxonomy + expert survey) for evaluation/governance, but is less likely to directly shift performance capabilities or spawn immediate downstream methods.

vs. Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents

gpt-5.2 · 5/22/2026

Paper 2 (Episodic-Semantic Memory Architecture) likely has higher scientific impact due to broader applicability and timeliness: scalable long-horizon memory is a central bottleneck for deploying LLM agents in real scientific workflows. It proposes an architecture with clear systems-level contributions (episodic/semantic separation, consolidation handling contradictions), evaluates at large scale (15k messages, 1,440 queries) across six models, and reports practical efficiency/latency gains beyond context limits. Paper 1 (Ratchet) is strong and novel for self-evolving skill hygiene, but its impact is narrower (code/agent skill libraries) and more model/task-specific.

vs. Meta-Learning for Rapid Adaptation in Reference Tracking of Uncertain Nonlinear Systems

gpt-5.2 · 5/22/2026

Paper 2 (Ratchet) likely has higher impact: it targets a timely, widely relevant bottleneck in LLM agents (skill lifecycle management), proposes a minimal, transferable “hygiene” recipe, and reports large gains on prominent benchmarks (MBPP+ and SWE-bench Verified) with multiple seeds and ablations plus a safety-style non-divergence guarantee. Its ideas can generalize across agent frameworks and domains, affecting software engineering, autonomous agents, and evaluation. Paper 1 (meta-learning for reference tracking) is solid and application-relevant for adaptive control, but its impact is narrower to nonlinear control/meta-learning and likely less broadly adoptable.

vs. Deep Reinforcement Learning for Flexible Job Shop Scheduling with Random Job Arrivals

gpt-5.2 · 5/22/2026

Paper 2 (Ratchet) is more novel and timely, addressing a key bottleneck in self-evolving LLM agents (skill lifecycle management) with a minimal, general recipe and theoretical non-divergence guarantee. It shows large, well-quantified gains with multi-seed runs, ablations, and transfer across benchmarks (MBPP+ and SWE-bench), suggesting broad applicability to agentic software engineering and tool-using systems. Paper 1 (deep RL for flexible job shop scheduling) is solid and practically relevant for manufacturing scheduling, but constrains action selection to dispatching rules and applies standard PPO, making it more incremental with narrower cross-field impact.

vs. Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

gpt-5.2 · 5/22/2026

Paper 1 (Ratchet) has higher estimated impact due to a clearly novel, minimal “hygiene” recipe for self-evolving skill libraries with strong empirical gains on widely used, high-salience benchmarks (MBPP+ hard-100, SWE-bench Verified), extensive ablations, and a formal non-divergence guarantee. Its contributions are broadly applicable to many agent frameworks (coding, tool-use, autonomous loops) and timely given rapid adoption of skill/memory augmentation. Paper 2 (causal intervention-based memory selection) is promising and adds a useful benchmark, but its causal-intervention memory selection may be harder to scale and its impact depends on broader adoption of the new dataset.

vs. Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

gemini-3.1 · 5/22/2026

Paper 1 (Ratchet) addresses a critical bottleneck in self-evolving LLM agents (skill lifecycle management), demonstrating massive empirical gains on rigorous, real-world benchmarks like SWE-bench and MBPP+. Its combination of strong practical results, thorough ablation studies, and theoretical non-divergence bounds suggests broader immediate applicability and foundational impact compared to the reliance of Paper 2 (causal intervention-based memory selection) on a synthetic benchmark and potentially compute-heavy causal memory selection.

vs. OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

gpt-5.2 · 5/22/2026

Paper 1 (Ratchet) likely has higher impact: it targets a central bottleneck for LLM agents (skill lifecycle management), proposes a minimal, general “hygiene” recipe with clear ablations, provides sizable gains on widely relevant coding/agent benchmarks (MBPP+, SWE-bench Verified), and includes a stability/non-divergence argument, supporting methodological rigor and transferability. Its contributions apply broadly to autonomous agents, continual improvement without finetuning, and tool/skill libraries across domains. Paper 2 (OSCToM) is novel for ToM benchmarking/data generation, but its impact is narrower to social reasoning tasks and relies more on synthetic data/RL specifics.

vs. OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

gpt-5.2 · 5/22/2026

Paper 2 (Ratchet) likely has higher impact: it targets a central, timely problem in agentic LLMs (maintaining self-evolving skill libraries) and shows large, reproducible gains with ablations, multi-seed evaluation, and a stability proposition. The “minimal hygiene recipe” is broadly applicable across domains (coding, software engineering agents) and can be adopted without model finetuning, increasing real-world utility. Paper 1 (OSCToM) is novel in ToM benchmark/data generation and shows striking benchmark gains, but its impact is narrower (ToM-specific) and more tied to synthetic benchmark design than to general agent performance.

vs. Interference-Aware Multi-Task Unlearning

gpt-5.2 · 5/22/2026

Paper 2 (Interference-Aware Multi-Task Unlearning) likely has higher scientific impact: it tackles machine unlearning in realistic multi-task settings, a timely problem with clear safety, privacy, and compliance applications. The interference analysis (task- and instance-level coupling) and the proposed gradient-projection/orthogonalization framework are broadly applicable to many shared-backbone models beyond vision, potentially influencing unlearning, continual learning, and multi-objective optimization. Paper 1 (Ratchet) shows strong empirical gains for LLM agent skill hygiene, but it is more domain-specific, relies on a particular agent loop, and may be more sensitive to model/tooling choices, limiting breadth relative to unlearning.

vs. ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

gemini-3.1 · 5/22/2026

Paper 1 (Ratchet) demonstrates higher potential scientific impact due to its broad applicability and timely contribution to the rapidly expanding field of autonomous LLM agents. By solving the critical skill lifecycle management bottleneck, Ratchet achieves massive performance gains on rigorous benchmarks like SWE-bench and MBPP+. Its detailed ablations and theoretical non-divergence guarantees provide foundational insights for self-improving AI. While Paper 2 (ScenePilot) offers significant safety advancements for autonomous driving, Paper 1's methodology impacts a wider range of general-purpose AI domains, positioning it as a foundational recipe for next-generation self-evolving agents.