MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

Qianshu Cai, Yonggang Zhang, Xianzhang Jia, Wei Xue, Jun Song, Xinmei Tian, Yike Guo

May 21, 2026

arXiv:2605.22794v1

cs.AI (primary), cs.LG

#384 of 2292 · Artificial Intelligence

Tournament Score: 1487±41

Win Rate: 67%

Rating: 4.8 / 10

Significance 6 · Rigor 3.5 · Novelty 5.5 · Clarity 7

Abstract

Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human-driven update ships a fix. Self-evolving agents have emerged in response, but all confine evolution to text-mutable artifacts -- skill files, prompt configurations, memory schemas, workflow graphs -- and leave the agent harness untouched. Since routing, hook ordering, state invariants, and dispatch live in code rather than in any text artifact, an entire class of structural failure is physically unreachable from the text layer. We argue that source-level adaptation is a fundamentally more general medium: it is Turing-complete, a strict superset of every text-mutable scope, takes effect deterministically rather than through base-model compliance, and does not erode under long-context drift. We present MOSS, a system that performs self-rewriting at the source level on production agentic substrates. Each evolution is anchored to an automatically curated batch of production-failure evidence and proceeds through a deterministic multi-stage pipeline; code modification is delegated to a pluggable external coding-agent CLI while MOSS retains stage ordering and verdicts. Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user-consent-gated, in-place container swap with health-probe-gated rollback. On OpenClaw, MOSS lifts a four-task mean grader score from 0.25 to 0.61 in a single cycle without human intervention.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: MOSS – Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

1. Core Contribution

MOSS introduces source-level self-rewriting for production-grade agentic systems, extending the scope of self-evolution beyond text-mutable artifacts (prompts, skills, memory schemas) to the agent harness itself—routing logic, hook ordering, state management, and dispatch. The paper argues this is a fundamentally more general evolution medium along four axes: Turing-completeness, strict superset of text-mutable scope, deterministic effect, and resistance to long-context drift. The system is instantiated on OpenClaw, a production agentic substrate, and demonstrated on a four-task compliance-audit benchmark.

The key novelty is the claim and demonstration that modifying the harness code itself—not just the prompts and skills the harness orchestrates—is both necessary (some failure classes are unreachable from the text layer) and feasible (with appropriate engineering scaffolding). This fills a gap between academic source-level self-rewriting on minimal scaffolds (SICA, Darwin Gödel Machine) and production self-evolving agents that only touch text-mutable layers (Hermes Agent, SkillClaw, EvoAgentX).

2. Methodological Rigor

The system design is well-articulated: a five-component architecture (main container, CLI control surface, pluggable coding-agent, host-daemon, ephemeral trial workers) with clear separation of concerns. The seven-stage pipeline (Locate → Plan → Plan-Review → Implement → Code-Review → Task-Evaluate → Verdict) with multi-round review gates and plateau guards is a sensible engineering decomposition.
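The staged pipeline with review gates and a plateau guard can be sketched roughly as follows. This is a minimal illustration, not MOSS's implementation: the callbacks `act`, `review`, and `evaluate`, the choice of which stage a rejected review sends control back to, and the parameters `max_review_rounds`, `patience`, and `min_gain` are all assumptions made for the example. Only the stage names and their order come from the paper's description.

```python
from typing import Callable, List, Optional, Tuple

# Stage names and order as listed in the review of the paper.
STAGES = ["Locate", "Plan", "Plan-Review", "Implement",
          "Code-Review", "Task-Evaluate", "Verdict"]

# Assumption: a rejected review sends control back to the stage it reviews.
REDO_ON_REJECT = {"Plan-Review": "Plan", "Code-Review": "Implement"}


def run_iteration(act: Callable[[str], None],
                  review: Callable[[str], bool],
                  evaluate: Callable[[], float],
                  max_review_rounds: int = 3) -> Tuple[List[str], Optional[float]]:
    """Run the stages in fixed order; return (trace, score or None if aborted)."""
    trace: List[str] = []
    score: Optional[float] = None
    for stage in STAGES:
        if stage in REDO_ON_REJECT:
            rounds = 1
            trace.append(stage)
            while not review(stage):
                if rounds >= max_review_rounds:
                    return trace, None  # review gate exhausted: abort
                redo = REDO_ON_REJECT[stage]
                act(redo)               # redo the reviewed stage
                trace.append(redo)
                trace.append(stage)     # then review it again
                rounds += 1
        elif stage == "Task-Evaluate":
            score = evaluate()
            trace.append(stage)
        else:
            act(stage)
            trace.append(stage)
    return trace, score


def evolve(iterate: Callable[[], Optional[float]], baseline: float,
           max_iters: int = 5, patience: int = 2, min_gain: float = 0.01) -> float:
    """Plateau guard: stop after `patience` consecutive iterations without a gain."""
    best, stale = baseline, 0
    for _ in range(max_iters):
        score = iterate()
        if score is not None and score > best + min_gain:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best
```

Keeping stage order in plain code rather than in a prompt is what makes the ordering deterministic: the model-driven callbacks can fail or be rejected, but they cannot reorder or skip a gate.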

However, the empirical evaluation is thin. The case study uses only four tasks from a single category (compliance-audit) in a single evolution cycle. The baseline-to-candidate comparison scores the system on the same tasks used as the evolution batch, so it measures in-sample improvement rather than generalization. The authors acknowledge this ("Using a benchmark batch here trades realism for reproducibility"), but it severely limits what can be concluded. There is no measurement of whether the fix regresses performance on other tasks, no ablation study comparing source-level vs. text-level fixes for the same failures, no comparison against any baseline self-evolving system, and no statistical significance analysis beyond mean-of-3-trials.
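The distinction between in-sample gain and held-out regression can be made concrete with a small scoring helper. This is a sketch under stated assumptions: the function names, the `tolerance` threshold, and the data layout (a dict from task name to a list of per-trial grader scores, with the same task names in both runs) are hypothetical and not taken from the paper.

```python
from statistics import mean
from typing import Dict, List, Optional, Set


def task_means(trials: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the per-trial grader scores for each task."""
    return {task: mean(scores) for task, scores in trials.items()}


def compare(baseline: Dict[str, List[float]],
            candidate: Dict[str, List[float]],
            batch_tasks: Set[str],
            tolerance: float = 0.05) -> Dict[str, object]:
    """Report gains separately for evolution-batch tasks and held-out tasks.

    Assumes both runs cover the same task names.
    """
    b, c = task_means(baseline), task_means(candidate)
    in_sample = [t for t in b if t in batch_tasks]
    held_out = [t for t in b if t not in batch_tasks]

    in_gain: Optional[float] = (
        mean([c[t] - b[t] for t in in_sample]) if in_sample else None)
    out_gain: Optional[float] = (
        mean([c[t] - b[t] for t in held_out]) if held_out else None)
    # A held-out task regresses if it drops by more than `tolerance`.
    regressions = sorted(t for t in held_out if c[t] < b[t] - tolerance)
    return {"in_sample_gain": in_gain,
            "held_out_gain": out_gain,
            "regressions": regressions}
```

With only batch tasks available, `held_out_gain` is `None` and `regressions` is empty by construction, which is the situation the review describes: the reported improvement says nothing about tasks outside the batch.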

The claim that source-level evolution is "fundamentally more general" is argued conceptually but not empirically validated in a comparative setting. The Turing-completeness argument is theoretically sound but practically trivial—the question is not whether code *can* express more than prompts, but whether an LLM-driven coding agent can reliably make the right code changes at scale.

3. Potential Impact

The core insight—that production agent failures often originate in harness code unreachable by prompt/skill editing—is valuable and likely correct as agentic systems grow more complex. If MOSS's approach proves robust at scale, it could shift how production agent systems are maintained, reducing the human engineering bottleneck for harness-level fixes.

The pluggable coding-agent architecture (supporting Claude Code, OpenAI Codex, DeepSeek-TUI, OpenCode) and the substrate-agnostic control surface design suggest reasonable engineering for adoption. The user-consent-gated deployment with health-probe rollback addresses real production safety concerns.
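A consent-gated swap with health-probe rollback might look like the sketch below. Everything here is illustrative: the `promote` function, its callback parameters (`user_consents`, `swap_to`, `healthy`), and the `probes` count are assumptions for the example, and a real deployment would talk to a container runtime rather than to plain Python callables.

```python
from typing import Callable


def promote(current_image: str,
            candidate_image: str,
            user_consents: Callable[[str], bool],
            swap_to: Callable[[str], None],
            healthy: Callable[[], bool],
            probes: int = 3) -> str:
    """Return the image that ends up running after the promotion attempt."""
    # Gate 1: nothing is swapped without explicit user consent.
    if not user_consents(candidate_image):
        return current_image

    swap_to(candidate_image)

    # Gate 2: the candidate must pass every health probe after the swap.
    for _ in range(probes):
        if not healthy():
            swap_to(current_image)  # roll back to the previous image
            return current_image
    return candidate_image
```

The two gates are independent: consent protects against unwanted changes, while the probes protect against changes that were wanted but broke the running system.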

However, the practical impact is constrained by several factors: (a) the demonstration is on a single substrate with minimal task diversity; (b) the safety implications of autonomous source-level self-modification in production systems are not seriously addressed; (c) scalability to larger codebases and more complex failure patterns is untested.

4. Timeliness & Relevance

The paper addresses a genuine and timely gap. The proliferation of production agentic systems (OpenClaw, Hermes Agent, Claude assistants) creates a real need for autonomous improvement, and the observation that text-mutable evolution hits a ceiling as harness complexity grows is well-motivated. The paper positions itself clearly in the emerging landscape of self-evolving agents (Table 1 is useful).

The timing aligns with rapid advances in coding agents (Claude Code, Codex CLI) that make source-level modification by LLMs increasingly feasible. MOSS wisely delegates code editing to these external tools rather than building its own.

5. Strengths & Limitations

Strengths:

Clear and well-motivated problem framing: the text-mutable ceiling is a real limitation that will become more pressing as agentic systems mature.

Principled system architecture with clean separation between evolution orchestration (MOSS) and code editing (pluggable coding agent).

Production-oriented design: ephemeral trial workers, user-consent gates, health-probe rollback, user-state preservation across swaps.

The directed evolution approach (anchored to specific failure batches rather than benchmark optimization) is pragmatically sound.

Good exposition of the four-layer nesting (baseline → iteration loop → stage pipeline → stage-internal retry).

Limitations:

Extremely narrow evaluation: 4 tasks, 1 task category, 1 substrate, 1 evolution cycle, no generalization measurement, no regression testing on non-batch tasks.

No comparative baselines: No comparison against text-mutable-only evolution on the same failures, or against human-authored fixes, or against any prior self-evolving system.

In-sample evaluation: Measuring improvement on the same tasks used to drive evolution is circular; the system is designed to fix exactly these tasks.

Safety and alignment concerns barely addressed: Autonomous source-level self-modification raises significant safety questions (adversarial exploitation, unintended behavioral changes, cascading modifications) that receive no treatment beyond the rollback mechanism.

Scalability undemonstrated: A 177-line, 3-file change in one iteration is modest; behavior on larger, cross-cutting failures is unknown.

Reproducibility concerns: The system depends on proprietary coding agents (Claude Code), specific substrate configurations, and complex container orchestration that may be difficult to replicate.

The four theoretical arguments (Turing-complete, strict superset, deterministic, no drift) are asserted but the practical advantages over text-mutable evolution are not empirically isolated.

Additional Observations

The paper reads more as a systems/engineering contribution than a scientific one. The architecture is thoughtfully designed, but the absence of rigorous evaluation—ablations, comparisons, generalization tests, failure case analysis—significantly weakens the scientific claims. The 0.25→0.61 result, while notable, is on in-sample tasks and a single evolution cycle; it tells us the system works in one favorable setting but little about reliability or generality.

The paper would benefit enormously from: (1) running the evolved agent on held-out tasks to measure generalization/regression; (2) comparing against a text-mutable-only baseline fixing the same failures; (3) multiple evolution cycles showing iterative improvement; (4) failure analysis of cases where source-level evolution breaks or produces harmful changes.

Rating: 4.8 / 10

Significance 6 · Rigor 3.5 · Novelty 5.5 · Clarity 7

Generated May 22, 2026

Comparison History (18)

vs. Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

gpt-5.2 · 5/22/2026

Paper 1 is more novel and potentially higher-impact: it moves agent adaptation from prompt/text artifacts to source-level self-rewriting with verification, promotion, and rollback—addressing structural failures unreachable by existing self-improvement methods. This could materially change how production agents are maintained, enabling continuous, deterministic evolution with safety gates. While Paper 2 is timely and useful for scalable diagnostics, corpus-level trace analysis is a more incremental extension of existing observability/analysis paradigms and depends on humans/agents to apply insights, whereas Paper 1 directly closes the loop to autonomous, deployable fixes.

vs. Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

gpt-5.2 · 5/22/2026

Paper 1 is more novel by enabling autonomous agents to self-evolve via verified source-code rewriting, expanding adaptation beyond prompt/workflow layers to the actual harness (routing, invariants, hooks). It proposes a rigorous, safety-aware pipeline (evidence batching, deterministic stages, replay-based verification, gated rollout/rollback) with concrete performance gains, and has broad implications for reliable long-lived agent systems and software maintenance. Paper 2 is timely and practical, but “compiling workflows into weights” is a more established direction (several cited predecessors), making its incremental novelty and cross-field impact comparatively lower.

vs. Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

gpt-5.2 · 5/22/2026

Paper 1 offers a clearer, more novel scientific contribution: systematic runtime harness adaptation (without weight updates or environment changes) with strong, broad empirical validation across 7 benchmarks, 18 model backbones, and extensive transfer, suggesting generalizable environment-side structure. This methodological rigor and breadth make it likely to influence agent evaluation/training paradigms and interface design across many deterministic tool-use settings. Paper 2 is compelling and timely for production engineering, but evidence is narrower (single benchmark, few tasks) and the contribution leans more toward systems integration/DevOps than broadly validated scientific insight.

vs. Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

gpt-5.2 · 5/22/2026

Paper 2 has higher likely scientific impact: it targets a broadly relevant, timely bottleneck (skill lifecycle management) with a minimal, portable recipe; demonstrates substantial gains on widely used benchmarks (MBPP+, SWE-bench Verified) with multiple seeds, many rounds, and ablations; and adds a formal non-divergence proposition. This combination of generality, rigor, and reproducibility makes it more likely to influence practice across agent and LLM-tooling communities. Paper 1 is novel (source-level self-rewriting) and practically appealing, but evidence is narrower (single environment, single cycle) and the approach may face higher safety/engineering barriers to widespread adoption.

vs. Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

gemini-3.1 · 5/22/2026

Paper 1 proposes a paradigm-shifting approach by enabling autonomous agents to rewrite their own source code, achieving self-evolution at a Turing-complete level rather than just modifying text artifacts or prompts. This fundamentally expands the horizon of self-improving systems and addresses structural failures unreachable by previous methods. While Paper 2 offers a rigorous and highly transferable interface adaptation method, Paper 1's bold conceptual leap toward true self-modifying code presents a deeper theoretical and practical impact for the future of AGI.

vs. Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support

gpt-5.2 · 5/22/2026

Paper 2 is more novel and broadly impactful: it advances autonomous agents from prompt/memory tweaking to verified source-level self-rewriting with a concrete, deployable pipeline (evidence batching, deterministic stages, replay-based verification, safe promotion/rollback). This could generalize across many agent systems and application domains, with clear real-world operational benefits. Paper 1 offers timely, rigorous evaluation for clinical LLMs and important safety insights, but it is primarily a benchmarking/diagnostic analysis contribution with narrower domain impact compared to a general mechanism for self-improving production agents.

vs. Cross-domain benchmarks reveal when coordinated AI agents improve scientific inference from partial evidence

gpt-5.2 · 5/22/2026

Paper 2 is more likely to have broad scientific impact because it delivers a cross-domain benchmark with explicit baselines, ablations/null controls, scoring protocols, and stated limitations—assets that can be reused by many communities to evaluate coordinated-agent workflows. Its framing (when coordination helps vs. not) is timely and generalizable across scientific inference settings, and the reported strong results plus negative findings improve credibility and adoption. Paper 1 is novel and practically important for agent reliability, but is narrower (agent engineering), higher-risk (self-modifying code), and its evaluation appears less general and less rigorously benchmarked.

vs. PRISMat: Policy-Driven, Permutation-Invariant Autoregressive Material Generation

claude-opus-4.6 · 5/22/2026

MOSS introduces a fundamentally new paradigm for autonomous agent systems—source-level self-rewriting—that addresses a structural limitation (static deployment) affecting the entire AI agent ecosystem. Its Turing-complete self-evolution framework is broadly applicable across all agentic systems, not just a single domain. While PRISMat offers solid incremental improvements in materials discovery with a more efficient architecture, MOSS's contribution is more novel and potentially transformative, enabling agents to autonomously fix structural failures without human intervention. The breadth of impact across AI agent development gives MOSS higher potential scientific impact.

vs. TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering

claude-opus-4.6 · 5/22/2026

TACT introduces a novel, mechanistically grounded approach to understanding and mitigating agent drift via activation steering—a technique with broad applicability across models and tasks. Its identification of linearly separable 'drift axes' in the residual stream is a fundamental insight into LLM agent behavior, bridging interpretability and practical performance. The method is lightweight, model-agnostic, and validated across multiple benchmarks and architectures. MOSS, while ambitious in proposing source-level self-rewriting, is evaluated on a single benchmark and raises significant safety/reliability concerns that may limit adoption. TACT's methodological rigor and broader applicability give it higher impact potential.

vs. TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering

gemini-3.1 · 5/22/2026

Paper 2 proposes a highly ambitious, paradigm-shifting approach by allowing autonomous agents to rewrite their own source code. This Turing-complete, source-level self-evolution transcends the limitations of modifying text-based artifacts (like prompts or skills), addressing structural failures that current systems cannot reach. While Paper 1 offers a rigorous and practical technical solution (activation steering) for specific agent drift issues, Paper 2 introduces a fundamentally more general and transformative concept with broader implications for the future of general AI agents and autonomous self-improvement.

vs. Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts

gemini-3.1 · 5/22/2026

Paper 2 introduces a foundational technical innovation (source-level self-rewriting for autonomous agents) that fundamentally advances the capabilities and architecture of AI systems. While Paper 1 addresses an important socio-technical issue in AI safety, Paper 2's approach to self-evolving agents offers broader methodological implications, pushing the boundaries of autonomous system design and potentially impacting a wider range of technical fields.

vs. Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a critical and timely gap at the intersection of AI safety and conflict/humanitarian contexts, introducing the first evaluation framework for assessing LLM alignment in conflict-affected societies. Its breadth of impact spans AI ethics, policy, journalism, and humanitarian work, affecting real-world safety. Paper 2, while technically novel in proposing source-level self-rewriting for autonomous agents, addresses a narrower systems engineering problem with limited evaluation (one benchmark, one cycle). Paper 1's policy relevance and cross-disciplinary implications give it broader potential impact.

vs. Skill Weaving: Efficient LLM Improvement via Modular Skillpacks

claude-opus-4.6 · 5/22/2026

MOSS introduces a fundamentally new paradigm—source-level self-rewriting of autonomous agent systems—that addresses a previously unrecognized limitation of all existing self-evolving agent approaches (confinement to text-mutable artifacts). This represents a more novel conceptual contribution with broader implications for autonomous systems, software engineering, and AI safety. While SkillWeave offers practical engineering value with modular LLM specialization, it is more incremental, building on well-established ideas (LoRA-style adapters, model compression). MOSS's Turing-complete self-evolution framework opens a new research direction with deeper theoretical and practical ramifications.

vs. Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability

gemini-3.1 · 5/22/2026

Paper 1 introduces a paradigm-shifting concept of autonomous agents rewriting their own source code, enabling self-evolution beyond static text artifacts. This Turing-complete adaptation offers highly novel capabilities and broader potential impact for autonomous AI systems compared to Paper 2, which, while methodologically rigorous and important for AI safety, represents a more domain-specific evaluation metric improvement.

vs. TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning

claude-opus-4.6 · 5/22/2026

MOSS introduces a fundamentally novel paradigm—source-level self-rewriting of autonomous agent systems—that addresses a previously unrecognized gap (structural failures unreachable from text-layer evolution). This is a more general and theoretically significant contribution with broader implications across all agentic AI systems. While TaskGround is a solid engineering contribution for household robotics with strong empirical results, its scope is narrower (household task grounding) and its techniques (structured prompting pipelines) are more incremental. MOSS's Turing-complete self-evolution framework opens new research directions in self-improving AI systems with wider cross-field impact.

vs. AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions

gemini-3.1 · 5/22/2026

Paper 2 proposes a paradigm shift in autonomous agents by enabling source-level self-evolution, moving beyond the limitations of text-mutable artifacts (like prompts or schemas). This Turing-complete self-rewriting approach has profound implications for AGI and self-improving systems across multiple domains. Paper 1, while highly practical and effective for reducing costs in GUI automation, offers a more incremental optimization within a narrower scope.

vs. Deep Reinforcement Learning for Flexible Job Shop Scheduling with Random Job Arrivals

gemini-3.1 · 5/22/2026

Paper 1 proposes a highly novel paradigm in autonomous agents: self-evolution through source-level rewriting, bypassing the limitations of text-based prompts or memory. This pushes the boundaries toward self-improving AI systems, offering immense potential impact and relevance in the rapidly growing field of LLM agents. Paper 2, while solid, applies standard deep reinforcement learning techniques to a well-known scheduling problem, representing an incremental improvement rather than a paradigm shift, leading to a narrower breadth of impact.

vs. Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

gpt-5.2 · 5/22/2026

Paper 1 offers a concrete algorithmic advance in sequence modeling (decoupled erase/write gating for delta-rule linear attention) with derived efficient training/inference machinery and strong large-scale empirical validation (1.3B params, 100B tokens) across diverse benchmarks, especially long-context retrieval—an area of broad, timely interest. Its methodological rigor and potential to influence architectures across NLP and beyond are high. Paper 2 is innovative for autonomous agents and has clear practical relevance, but impact may be narrower (systems/engineering-specific), with less evidence of generalizable scientific principles or extensive evaluation.