Self-Programmed Execution for Language-Model Agents

Luke J. O'Connor

May 7, 2026

arXiv:2605.06898v1

cs.AI (primary)

#188 of 2292 · Artificial Intelligence

Tournament Score: 1522 ± 47 (scale 1050 to 1800)

Win Rate: 89%

Rating: 7/10

Significance 7 · Rigor 7.5 · Novelty 7.5 · Clarity 8

Abstract

At the heart of existing language model agents is a fixed orchestrator program responsible for the state transition between consecutive turns. This paper introduces self-programmed execution (SPE), an agent architecture in which the model completion is itself the orchestrator program, and the harness evaluates this program but does not impose its own orchestration policy. I formalize this idea using agentic machines: an SPE state is one from which a model completion can load any state of an embedded copy of the machine, meaning that it is subject to no fixed turn-to-turn orchestration policy. Realizing SPE in practice is nontrivial because the same data is both model context and executable program. I therefore introduce Spell, a Lisp-based language in which programs can edit and re-evaluate themselves, and effectful expressions like model invocations are structured such that re-evaluating an edited program does not replay its side effects. Experiments with existing models, not trained for SPE or Spell, show that frontier models can operate in this regime and accomplish challenging agentic tasks. These results demonstrate how an LM can act as an agent without any fixed orchestration policy, and they raise the question of what self-orchestration strategies might be learned by a model trained for self-programmed execution. Code is available at https://github.com/lukejoconnor/spell .

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Self-Programmed Execution for Language-Model Agents

1. Core Contribution

This paper introduces Self-Programmed Execution (SPE), an agent architecture that eliminates fixed orchestration policy from the harness entirely, making the model completion itself the orchestrator program. The key insight is that the transition between consecutive agent turns—what context to carry forward, what tools to invoke, when to delegate—need not be prescribed by the harness but can be expressed by the model as executable code. The paper formalizes this via "agentic machines," proving that a seed program `let y = lm q in eval(y)` reaches an SPE state from which the model can access any successor state. To realize SPE practically, the author introduces Spell, a Lisp-based language addressing three challenges: context persistence as code, replay-safe effects, and turn-boundary interference.
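The seed program `let y = lm q in eval(y)` can be illustrated with a toy Python analogue. This is a minimal sketch under stated assumptions, not Spell and not the author's implementation: `stub_lm`, `run_seed`, and `env` are hypothetical names, and the "model" is a hard-coded stub. The point it tries to show is that the harness only calls the model and evaluates the completion; whether another model call happens is decided by the completion itself.

```python
from typing import Any, Callable, Dict, List

calls: List[str] = []  # records every prompt the stub model receives


def stub_lm(prompt: str) -> str:
    """Hypothetical stand-in for a language model; returns program text."""
    calls.append(prompt)
    if prompt == "start":
        # The completion itself decides to invoke the model again.
        return "run_seed(lm, 'second turn', env)"
    # The completion decides to stop and return a value.
    return "'done'"


def run_seed(lm: Callable[[str], str], query: str, env: Dict[str, Any]) -> Any:
    """Toy analogue of the seed program: let y = lm q in eval(y)."""
    completion = lm(query)        # y = lm q
    return eval(completion, env)  # eval(y): the harness adds no loop or policy


# Assumed environment: the names an evaluated completion may refer to.
env: Dict[str, Any] = {"lm": stub_lm, "run_seed": run_seed}
env["env"] = env  # lets an evaluated program pass the environment onward

result = run_seed(stub_lm, "start", env)
print(result, calls)
```

In this sketch the two-turn behavior is not written anywhere in `run_seed`; it exists only because the first completion chose to call `run_seed` again.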

The conceptual contribution is clean: SPE is the logical endpoint of a trend toward partial self-orchestration (from ReAct → RLMs → SPE), and the paper makes this precise through formal definitions of embeddings, completion-generation, and SPE states.

2. Methodological Rigor

Formal framework. The agentic machine abstraction is well-defined, and the proofs (Theorems A.4, A.17, Corollaries A.19, A.21) are rigorous. The embedding-based formalization is appropriate—it draws on verification theory (simulation/refinement) while being tailored to the LM agent setting. The universality corollary is honest: it establishes expressiveness but explicitly notes it "cuts both ways" and provides no performance guarantee.

Language design. Spell is carefully engineered. The trailing-expression pattern for replay-safe effects is elegant—it structurally prevents re-execution of old side effects without requiring stateful tracking. The quine form for self-reference and fresh local environments for turn-boundary isolation are well-motivated by the design principles.
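The general idea of replay-safe effects can be sketched in Python. The review does not spell out Spell's actual semantics, so everything below is an assumption made for illustration: the `Program` class, the `evaluate` function, and the rule that only a single trailing expression may perform an effect are hypothetical. The sketch shows one way that re-evaluating an edited program can avoid repeating earlier side effects without a separate "already ran" flag: past effects survive only as recorded data.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

effect_log: List[str] = []  # stands in for the outside world


@dataclass
class Program:
    # Results of earlier effects, kept as plain data (never re-run).
    recorded: List[str] = field(default_factory=list)
    # The only expression allowed to perform an effect.
    trailing: Optional[Callable[[], str]] = None


def evaluate(program: Program) -> Program:
    """Run the trailing effect once and fold its result into recorded data."""
    if program.trailing is None:
        return program
    result = program.trailing()
    return Program(recorded=program.recorded + [result], trailing=None)


def send_email() -> str:
    effect_log.append("sent")
    return "email-ok"


def query_db() -> str:
    effect_log.append("queried")
    return "rows"


# First evaluation performs the email effect.
p1 = evaluate(Program(trailing=send_email))

# "Edit" the program: keep the recorded data, append a new trailing effect.
edited = Program(recorded=p1.recorded, trailing=query_db)

# Re-evaluating the edited program runs only the new trailing effect.
p2 = evaluate(edited)
print(p2.recorded, effect_log)
```

Because `send_email` is no longer present as an expression in the edited program, only its recorded result is, re-evaluation has nothing to replay.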

Empirical evaluation. The experiments are thorough but have important caveats. The comparison against Codex CLI (a mature, post-training-aligned harness) using models not trained for Spell is a deliberately high bar, and the results are credibly presented. GPT-5.4 achieves zero fatal Spell errors, and accuracy is competitive on coding benchmarks (43/80 vs 43/80 on Terminal-Bench with coding prompt; 171/300 vs 172/300 on SWE-bench Lite at medium effort). However, the AppWorld gap (42.1% vs 63.2%) is substantial and not well explained. The orchestration games are modest but demonstrate capability beyond what models spontaneously choose. Statistical rigor is adequate—pairwise binomial tests are reported, and budget-cap sensitivity is analyzed.
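The reported counts can be turned into percentages with a few lines of Python. This is only arithmetic on the figures quoted above; the assumption that the first number in each pair is Spell and the second is Codex CLI is inferred from the AppWorld framing, and the labels `spell` and `codex_cli` are introduced here for illustration.

```python
from typing import Dict, Tuple

# (solved, total) pairs as quoted in the review; ordering is an assumption.
counts: Dict[str, Dict[str, Tuple[int, int]]] = {
    "Terminal-Bench (coding prompt)": {"spell": (43, 80), "codex_cli": (43, 80)},
    "SWE-bench Lite (medium effort)": {"spell": (171, 300), "codex_cli": (172, 300)},
}


def pct(solved: int, total: int) -> float:
    """Accuracy as a percentage."""
    return 100.0 * solved / total


accuracy = {
    bench: {system: pct(*pair) for system, pair in systems.items()}
    for bench, systems in counts.items()
}

# AppWorld is reported directly in percent.
appworld_gap = round(63.2 - 42.1, 1)

for bench, systems in accuracy.items():
    print(bench, {name: round(value, 2) for name, value in systems.items()})
print("AppWorld gap (percentage points):", appworld_gap)
```

On these numbers the coding benchmarks differ by at most one task out of 300, while the AppWorld gap is about 21 percentage points, which matches the figure cited in the limitations.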

3. Potential Impact

Architectural paradigm. SPE represents a conceptually important limiting case. If models can be trained natively for self-orchestration (as suggested by the RLM, FoldGRPO, and Conductor precedents), the implications could be significant: orchestration strategies could be learned end-to-end rather than hand-engineered. The paper correctly frames this as an open question rather than a demonstrated benefit.

Practical tooling. Spell's context management features (!peek, prune, persist) demonstrably reduce token usage (~4× fewer total tokens than Codex CLI) at competitive accuracy. This addresses a genuine cost bottleneck in production agent systems. The code is released.

Training signal. The most impactful downstream possibility is training models natively for Spell-like self-orchestration. The paper establishes the substrate but leaves this critical experiment to future work.

Limitations on immediate impact. Current models primarily use Spell for simple policies (context pruning, tool batching) rather than sophisticated multi-agent orchestration. No model successfully used multi-agent delegation in the coding benchmarks. This suggests that without Spell-native training, the architecture's expressive power is underutilized.

4. Timeliness & Relevance

The paper is highly timely. Context engineering has been explicitly identified as a bottleneck by practitioners (Anthropic, OpenAI). Commercial coding agents (Codex CLI, Claude Code) represent significant engineering investment in orchestration policy. The trend toward partial self-orchestration (MemGPT, RLMs, Conductor, FoldGRPO) makes SPE a natural next question. The formal treatment provides theoretical grounding that the field currently lacks.

5. Strengths & Limitations

Strengths:

Conceptual clarity: the progression from fixed orchestration to SPE is precisely defined and well-motivated

The formalization is genuinely useful—it gives the community shared vocabulary for reasoning about orchestration

Spell's design is principled and addresses real technical challenges; the trailing-expression pattern is a genuine contribution to language design for self-modifying programs

Empirical comparisons are fair and transparently reported, including failures (AppWorld, multi-agent non-use)

Extensive appendices provide full prompt surfaces, enabling reproducibility

Token efficiency gains are substantial and practically relevant

Limitations:

No Spell-native training: the strongest argument for SPE is that orchestration could be learned end-to-end, but this is entirely hypothetical

Models don't spontaneously use the full expressiveness of Spell—the gap between capability (orchestration games) and behavior (coding benchmarks) is acknowledged but unresolved

AppWorld underperformance (-21 percentage points) is significant and poorly understood

The universality result, while mathematically clean, is somewhat vacuous practically—any computable orchestration can be installed externally too

Single-author work with limited model diversity for the main comparisons (primarily GPT-5.4)

The prompting surface is substantial (a system prompt of roughly 3k words plus namespace docs); whether the observed benefits stem from SPE principles or from effective prompting is hard to disentangle

Overall assessment: This is a thoughtful, well-executed paper that formalizes an important conceptual endpoint in agent architecture design and provides a practical implementation. Its main limitation is that the strongest claims about SPE's value require training experiments that haven't been conducted. The current empirical results demonstrate feasibility and competitive performance but not clear superiority over existing approaches.

Rating: 7/10

Significance 7 · Rigor 7.5 · Novelty 7.5 · Clarity 8

Generated May 11, 2026

Comparison History (19)

vs. State Contamination in Memory-Augmented LLM Agents

gemini-3.1 · 5/19/2026

Paper 1 introduces a fundamentally novel architecture that removes the need for fixed orchestrator programs in LM agents, potentially revolutionizing agent design. While Paper 2 addresses a critical safety vulnerability, Paper 1's creation of a new self-programming paradigm and execution language offers a broader methodological shift that could spawn entirely new directions in agentic AI research.

vs. Sequence Search: Automated Sequence Design using Neural Architecture Search

gpt-5.2 · 5/16/2026

Paper 2 is likely to have higher scientific impact due to broader cross-field relevance and timeliness: it proposes a general agent architecture (self-programmed execution) that challenges the dominant fixed-orchestrator paradigm, introduces a concrete self-editing language/runtime (Spell) addressing side-effect re-execution, and demonstrates feasibility on frontier models with available code, enabling rapid follow-on work. Paper 1 is novel and valuable for MRI, but its impact is more domain-specific and constrained by physics/simulator fidelity and deployment hurdles, whereas Paper 2’s ideas can influence many areas of AI, programming languages, and agent safety/tooling.

vs. Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training

claude-opus-4.6 · 5/16/2026

Paper 1 introduces a fundamentally new agent architecture (SPE) where the LM itself serves as the orchestrator program, formalized through agentic machines and realized via a novel self-editing Lisp-based language (Spell). This represents a paradigm shift in how LM agents are designed, moving beyond fixed orchestration policies. The theoretical framework, novel programming language, and broad applicability across agentic tasks give it high novelty and breadth of impact. Paper 2 presents a useful but more incremental contribution—applying cycle-consistency (an established idea) to search agent training—with narrower scope limited to information retrieval.

vs. Adaptive auditing of AI systems with anytime-valid guarantees

gemini-3.1 · 5/16/2026

Paper 1 introduces a fundamentally novel architecture for LLM agents by eliminating fixed orchestrators and allowing models to self-program their execution. Given the explosive growth and interest in autonomous AI agents, this paradigm shift, along with the proposed Spell language, has massive potential to influence how future agentic systems are built. While Paper 2 offers rigorous statistical methods for AI auditing, Paper 1's architectural innovation is likely to spark a broader wave of follow-up research and practical applications in agent design.

vs. Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

claude-opus-4.6 · 5/16/2026

Paper 1 introduces OPT-BENCH, a comprehensive framework addressing a clear gap in LLM evaluation—optimality beyond correctness for NP-hard problems. It demonstrates strong empirical results with quality-aware RLVR, showing significant improvements over GPT-4o and positive transfer to diverse tasks. The practical impact is broad: optimization problems are ubiquitous in industry. Paper 2 presents an intellectually interesting but more speculative contribution (self-programmed execution with Spell), tested only on existing models not trained for the paradigm. Its impact depends on future training advances that remain undemonstrated, making Paper 1's contributions more immediately impactful and actionable.

vs. CODESTRUCT: Code Agents over Structured Action Spaces

gpt-5.2 · 5/16/2026

Paper 1 has higher estimated scientific impact due to its more fundamental architectural shift: removing the fixed orchestrator and making the model completion itself the executable control policy (SPE), formalized via agentic machines and instantiated with a self-editing/effect-safe language (Spell). This changes how agent autonomy and learning of orchestration strategies can be conceived, with implications across agent design, programming languages, and safety. Paper 2 is rigorous and highly practical for code agents, but is a narrower interface improvement within existing agent paradigms.

vs. How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

claude-opus-4.6 · 5/16/2026

Paper 2 introduces a fundamentally novel architectural paradigm (SPE) where language models self-program their own orchestration, eliminating fixed turn-to-turn policies. This represents a deeper conceptual contribution with broad implications for agent design, programming languages, and AI autonomy. The formalization via agentic machines and the Spell language are innovative contributions that could influence multiple research directions. Paper 1, while thorough in benchmarking LMMs for spatial navigation, is primarily an evaluation/benchmark contribution with more incremental insights. Paper 2's paradigm shift in agent architecture has greater potential to reshape how future agents are built.

vs. Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

gemini-3.1 · 5/16/2026

Paper 1 introduces a fundamental paradigm shift in LLM agent architecture by replacing fixed orchestrators with self-programmed execution. This foundational innovation has broad applicability across all agentic AI research. While Paper 2 presents a highly rigorous and effective system for the specific domain of forecasting, Paper 1's conceptual novelty and potential to redefine how autonomous agents are structured give it a higher ceiling for broad scientific impact.

vs. When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

gemini-3.1 · 5/16/2026

Paper 1 offers higher scientific impact by addressing a fundamental, ubiquitous limitation of LLMs (multi-turn context degradation) through rigorous mechanistic interpretability. By introducing the Goal Accessibility Ratio and uncovering how attention channels transition to residual streams, it provides actionable, causal insights into architectural flaws. This will directly influence foundational model design and evaluation. While Paper 2 presents a highly innovative agent architecture, it relies on a specific Lisp-based paradigm. Paper 1's generalizable diagnostics and mechanistic explanations of a critical failure mode ensure broader and more immediate relevance across the AI field.

vs. Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

gpt-5.2 · 5/16/2026

Paper 2 is more novel and broadly impactful: it reframes agent architectures by removing the fixed orchestrator and making the model output the executable control program, with a formalization (“agentic machines”) and a purpose-built language (Spell) addressing hard issues like self-modification and side-effect-safe re-evaluation. This is a general paradigm shift applicable across agentic systems, tooling, and training research. Paper 1 is rigorous and practical for multi-hop RAG, but is more domain-specific and incremental relative to prior tool/program-based RAG and code-as-reasoning trends. Paper 2’s breadth and conceptual novelty suggest higher scientific impact.

vs. SPARD: Self-Paced Curriculum for RL Alignment via Integrating Reward Dynamics and Data Utility

gemini-3.1 · 5/11/2026

Paper 1 proposes a fundamental paradigm shift in agent architecture by removing fixed orchestrators in favor of self-programmed execution. This high novelty, combined with the introduction of a new language (Spell) to solve the context/program duality, opens an entirely new research direction for autonomous agents. Paper 2 offers a valuable but more incremental improvement to RL alignment via dynamic curriculum learning. The foundational nature and potential to rethink how LLM agents are designed give Paper 1 a higher potential scientific impact.

vs. SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems

gpt-5.2 · 5/11/2026

Paper 1 is more novel: it proposes a new agent architecture (self-programmed execution) where the LM output is the orchestrator, plus a practical language/runtime (Spell) addressing tricky side-effect re-execution issues. This could reshape how agents are built and trained, with broad impact across agent design, programming languages, and alignment/safety. While Paper 2 is timely and useful as a benchmark for embodied social reasoning, benchmarks typically have narrower conceptual novelty and impact unless they become dominant standards; its main contribution is evaluation infrastructure rather than a new paradigm.

vs. Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations

gemini-3.1 · 5/11/2026

Paper 1 proposes a fundamental architectural shift for LLM agents by eliminating fixed orchestrators in favor of self-programmed execution. This theoretical and methodological innovation introduces a novel computing paradigm with broad applicability across the entire AI agent field. In contrast, Paper 2 presents a highly effective but domain-specific framework for system operations. While Paper 2 shows impressive real-world industrial impact, Paper 1's foundational novelty is likely to inspire more widespread scientific follow-up and cross-disciplinary research.

vs. Towards Scalable Lightweight GUI Agents via Multi-role Orchestration

gemini-3.1 · 5/11/2026

Paper 2 introduces a foundational paradigm shift in agent architecture by eliminating the fixed orchestrator and allowing models to self-program their execution loops. This conceptual innovation, backed by a novel self-modifying programming language (Spell), offers much broader theoretical and practical implications for all LLM agents. In contrast, Paper 1 presents a valuable but more incremental engineering approach focused specifically on optimizing lightweight GUI agents.

vs. Towards Security-Auditable LLM Agents: A Unified Graph Representation

gpt-5.2 · 5/11/2026

Paper 2 is more novel: it redefines agent execution by removing the fixed orchestrator and making the model output an executable self-editing program (SPE), introducing a concrete language/runtime (Spell) to make this feasible and safe w.r.t. side effects. This is a foundational architectural shift that could influence agent design, training objectives, and programming-language interfaces across many domains. Paper 1 is timely and practically useful for security auditing, but it is more of a unifying representation/engineering framework with narrower impact primarily in LLM security and governance.

vs. RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

claude-opus-4.6 · 5/11/2026

Paper 1 introduces a fundamentally new agent architecture (SPE) where LLMs self-program their own orchestration, supported by novel theoretical formalization (agentic machines) and a practical language (Spell). This challenges existing assumptions about fixed orchestration in LM agents and opens a new research direction with broad implications for AI agent design and self-modifying systems. Paper 2, while methodologically sound, is a benchmark contribution for a narrower domain (VL content moderation) with more incremental impact. SPE's novelty, generality, and potential to reshape agent architectures give it substantially higher impact potential.

vs. GAMMA-Net: Adaptive Long-Horizon Traffic Spatio-Temporal Forecasting Model based on Interleaved Graph Attention and Multi-Axis Mamba

claude-opus-4.6 · 5/11/2026

Paper 1 introduces a fundamentally novel agent architecture (SPE) where LLMs self-program their own orchestration, formalized with agentic machines and a novel self-editing Lisp language (Spell). This addresses a core limitation in LM agent design with broad implications for AI agent architectures, autonomous systems, and programming language design. Paper 2, while solid, is an incremental combination of existing components (GAT + Mamba) for traffic forecasting—a well-explored domain. Paper 1's conceptual novelty, cross-disciplinary impact (PL + AI agents), and potential to reshape agent design give it substantially higher impact potential.

vs. Leading Across the Spectrum of Human-AI Relationships: A Conceptual Framework for Increasingly Heterogeneous Teams

gpt-5.2 · 5/11/2026

Paper 1 is more likely to have higher scientific impact: it proposes a novel agent architecture (self-programmed execution) with a concrete implementation (Spell) and empirical evidence that existing frontier LMs can operate under this regime. This is methodologically stronger and more directly extensible by researchers, with immediate applications to agent design, tool use, and training objectives. Its ideas can influence multiple technical areas (program synthesis, agent orchestration, safety/interpretability). Paper 2 is timely and potentially influential in practice, but it is primarily conceptual with less empirical/technical rigor, making its scientific impact less certain.

vs. Finite-Time Analysis of MCTS in Continuous POMDP Planning

claude-opus-4.6 · 5/11/2026

Paper 1 introduces a fundamentally novel agent architecture (SPE) where language models self-orchestrate by writing and editing their own execution programs, eliminating fixed orchestration policies. This is highly innovative, timely given the explosion of LM agents, and has broad implications for AI agent design, autonomy, and future training paradigms. Paper 2 provides valuable theoretical contributions (finite-time MCTS guarantees for continuous POMDPs), but addresses a more incremental, narrower problem within established planning frameworks. Paper 1's paradigm-shifting potential and relevance to the rapidly growing LM agent field give it higher impact potential.