Self-Programmed Execution for Language-Model Agents
Luke J. O'Connor
Abstract
At the heart of existing language model agents is a fixed orchestrator program responsible for the state transition between consecutive turns. This paper introduces self-programmed execution (SPE), an agent architecture in which the model completion is itself the orchestrator program, and the harness evaluates this program but does not impose its own orchestration policy. I formalize this idea using agentic machines: an SPE state is one from which a model completion can load any state of an embedded copy of the machine, meaning that it is subject to no fixed turn-to-turn orchestration policy. Realizing SPE in practice is nontrivial because the same data is both model context and executable program. I therefore introduce Spell, a Lisp-based language in which programs can edit and re-evaluate themselves, and effectful expressions like model invocations are structured such that re-evaluating an edited program does not replay its side effects. Experiments with existing models, not trained for SPE or Spell, show that frontier models can operate in this regime and accomplish challenging agentic tasks. These results demonstrate how an LM can act as an agent without any fixed orchestration policy, and they raise the question of what self-orchestration strategies might be learned by a model trained for self-programmed execution. Code is available at https://github.com/lukejoconnor/spell.
AI Impact Assessments
Scientific Impact Assessment: Self-Programmed Execution for Language-Model Agents
1. Core Contribution
This paper introduces Self-Programmed Execution (SPE), an agent architecture that eliminates fixed orchestration policy from the harness entirely, making the model completion itself the orchestrator program. The key insight is that the transition between consecutive agent turns—what context to carry forward, what tools to invoke, when to delegate—need not be prescribed by the harness but can be expressed by the model as executable code. The paper formalizes this via "agentic machines," proving that a seed program `let y = lm q in eval(y)` reaches an SPE state from which the model can access any successor state. To realize SPE practically, the author introduces Spell, a Lisp-based language addressing three challenges: context persistence as code, replay-safe effects, and turn-boundary interference.
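The seed program `let y = lm q in eval(y)` can be illustrated with a toy Python sketch. This is not Spell and not the paper's formalism: `make_harness`, `run`, and `scripted_lm` are hypothetical names, and the model is replaced by a scripted stub that returns Python source. The sketch is meant to show only the control-flow idea, namely that the harness evaluates whatever the completion says, while the completion itself decides whether and with what context to call the model again.

```python
def make_harness(lm):
    """Build a harness that evaluates completions without adding its own policy."""
    def run(q):
        y = lm(q)                       # let y = lm q
        env = {"lm": lm, "run": run}    # names the completion is allowed to use
        return eval(y, env)             # in eval(y)
    return run


def scripted_lm(prompt):
    """Hypothetical stand-in for a model: its completion is Python source text."""
    if prompt.startswith("task:"):
        # The completion picks the next context itself and re-enters the seed.
        return 'run("summary: compute 6 * 7")'
    # Second turn: the completion is a plain expression, so execution ends here.
    return '"answer=" + str(6 * 7)'


run = make_harness(scripted_lm)
result = run("task: multiply six by seven")
print(result)
```

In this toy the first completion discards the original prompt and carries forward only a summary, which is the kind of turn-to-turn decision a fixed orchestrator would otherwise make. Evaluating real model output with `eval` would need sandboxing, which the sketch omits.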
The conceptual contribution is clean: SPE is the logical endpoint of a trend toward partial self-orchestration (from ReAct → RLMs → SPE), and the paper makes this precise through formal definitions of embeddings, completion-generation, and SPE states.
2. Methodological Rigor
Formal framework. The agentic machine abstraction is well-defined, and the proofs (Theorems A.4, A.17, Corollaries A.19, A.21) are rigorous. The embedding-based formalization is appropriate—it draws on verification theory (simulation/refinement) while being tailored to the LM agent setting. The universality corollary is honest: it establishes expressiveness but explicitly notes it "cuts both ways" and provides no performance guarantee.
Language design. Spell is carefully engineered. The trailing-expression pattern for replay-safe effects is elegant—it structurally prevents re-execution of old side effects without requiring stateful tracking. The quine form for self-reference and fresh local environments for turn-boundary isolation are well-motivated by the design principles.
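One possible reading of the trailing-expression idea can be sketched in Python. This is an assumption-laden toy and may not match Spell's actual mechanism: `Program`, `evaluate`, `send_message`, and `write_file` are hypothetical names. The assumption is that results of past effects are frozen into the program as literal data, and only the single expression in trailing position is allowed to perform an effect, so re-evaluating the program has nothing old to replay.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Program:
    # Literal results of effects that already ran; re-evaluation only reads them.
    results: List[str] = field(default_factory=list)
    # The single pending effect in trailing position (None if nothing is pending).
    trailing: Optional[Callable[[List[str]], str]] = None


def evaluate(program: Program) -> Program:
    """Evaluate once: run only the trailing effect, then freeze its result as data."""
    if program.trailing is None:
        return program  # nothing pending, so no effect is executed again
    value = program.trailing(program.results)
    return Program(results=program.results + [value], trailing=None)


effect_log: List[str] = []  # records every real side effect for the demo


def send_message(results: List[str]) -> str:
    effect_log.append("send")
    return "sent"


def write_file(results: List[str]) -> str:
    effect_log.append("write")
    return "written after " + results[-1]


p = evaluate(Program(trailing=send_message))   # first effect runs once
p = evaluate(p)                                # re-evaluation replays nothing
p = evaluate(Program(results=p.results, trailing=write_file))  # edited program
print(effect_log, p.results)
```

The point of the design, under this reading, is that no separate "already executed" bookkeeping is needed: the program's shape alone determines which effect can run.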
Empirical evaluation. The experiments are thorough but have important caveats. The comparison against Codex CLI (a mature, post-training-aligned harness) using models not trained for Spell is a deliberately high bar, and the results are credibly presented. GPT-5.4 achieves zero fatal Spell errors, and accuracy is competitive on coding benchmarks (43/80 vs 43/80 on Terminal-Bench with coding prompt; 171/300 vs 172/300 on SWE-bench Lite at medium effort). However, the AppWorld gap (42.1% vs 63.2%) is substantial and not well explained. The orchestration games are modest but demonstrate capability beyond what models spontaneously choose. Statistical rigor is adequate—pairwise binomial tests are reported, and budget-cap sensitivity is analyzed.
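For scale, the reported counts can be converted to percentages with a few lines of Python. The helper name `percent` is hypothetical, and the pairs are kept in the order the assessment lists them, since the text does not state explicitly which system comes first in each pair.

```python
def percent(successes: int, trials: int) -> float:
    """Success rate as a percentage, rounded to two decimals."""
    return round(100 * successes / trials, 2)


# Counts as reported in the assessment, in the order they are listed there.
terminal_bench = (percent(43, 80), percent(43, 80))
swe_bench_lite = (percent(171, 300), percent(172, 300))

# AppWorld is reported directly in percent; the gap is in percentage points.
appworld_gap = round(63.2 - 42.1, 1)

print(terminal_bench, swe_bench_lite, appworld_gap)
```

On these figures the two coding benchmarks differ by at most one task out of 300, while the AppWorld difference is about 21 percentage points, which is why the assessment singles it out.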
3. Potential Impact
Architectural paradigm. SPE represents a conceptually important limiting case. If models can be trained natively for self-orchestration (as suggested by the RLM, FoldGRPO, and Conductor precedents), the implications could be significant: orchestration strategies could be learned end-to-end rather than hand-engineered. The paper correctly frames this as an open question rather than a demonstrated benefit.
Practical tooling. Spell's context management features (!peek, prune, persist) demonstrably reduce token usage (~4× fewer total tokens than Codex CLI) at competitive accuracy. This addresses a genuine cost bottleneck in production agent systems. The code is released.
Training signal. The most impactful downstream possibility is training models natively for Spell-like self-orchestration. The paper establishes the substrate but leaves this critical experiment to future work.
Limitations on immediate impact. Current models primarily use Spell for simple policies (context pruning, tool batching) rather than sophisticated multi-agent orchestration. No model successfully used multi-agent delegation in the coding benchmarks. This suggests that without Spell-native training, the architecture's expressive power is underutilized.
4. Timeliness & Relevance
The paper is highly timely. Context engineering has been explicitly identified as a bottleneck by practitioners (Anthropic, OpenAI). Commercial coding agents (Codex CLI, Claude Code) represent significant engineering investment in orchestration policy. The trend toward partial self-orchestration (MemGPT, RLMs, Conductor, FoldGRPO) makes SPE a natural next question. The formal treatment provides theoretical grounding that the field currently lacks.
5. Strengths & Limitations
Strengths:
Limitations:
Overall assessment: This is a thoughtful, well-executed paper that formalizes an important conceptual endpoint in agent architecture design and provides a practical implementation. Its main limitation is that the strongest claims about SPE's value require training experiments that haven't been conducted. The current empirical results demonstrate feasibility and competitive performance but not clear superiority over existing approaches.
Generated May 11, 2026
Comparison History (19)
Paper 1 introduces a fundamentally novel architecture that removes the need for fixed orchestrator programs in LM agents, potentially revolutionizing agent design. While Paper 2 addresses a critical safety vulnerability, Paper 1's creation of a new self-programming paradigm and execution language offers a broader methodological shift that could spawn entirely new directions in agentic AI research.
Paper 2 is likely to have higher scientific impact due to broader cross-field relevance and timeliness: it proposes a general agent architecture (self-programmed execution) that challenges the dominant fixed-orchestrator paradigm, introduces a concrete self-editing language/runtime (Spell) addressing side-effect re-execution, and demonstrates feasibility on frontier models with available code, enabling rapid follow-on work. Paper 1 is novel and valuable for MRI, but its impact is more domain-specific and constrained by physics/simulator fidelity and deployment hurdles, whereas Paper 2’s ideas can influence many areas of AI, programming languages, and agent safety/tooling.
Paper 1 introduces a fundamentally new agent architecture (SPE) where the LM itself serves as the orchestrator program, formalized through agentic machines and realized via a novel self-editing Lisp-based language (Spell). This represents a paradigm shift in how LM agents are designed, moving beyond fixed orchestration policies. The theoretical framework, novel programming language, and broad applicability across agentic tasks give it high novelty and breadth of impact. Paper 2 presents a useful but more incremental contribution—applying cycle-consistency (an established idea) to search agent training—with narrower scope limited to information retrieval.
Paper 1 introduces a fundamentally novel architecture for LLM agents by eliminating fixed orchestrators and allowing models to self-program their execution. Given the explosive growth and interest in autonomous AI agents, this paradigm shift, along with the proposed Spell language, has massive potential to influence how future agentic systems are built. While Paper 2 offers rigorous statistical methods for AI auditing, Paper 1's architectural innovation is likely to spark a broader wave of follow-up research and practical applications in agent design.
Paper 1 introduces OPT-BENCH, a comprehensive framework addressing a clear gap in LLM evaluation—optimality beyond correctness for NP-hard problems. It demonstrates strong empirical results with quality-aware RLVR, showing significant improvements over GPT-4o and positive transfer to diverse tasks. The practical impact is broad: optimization problems are ubiquitous in industry. Paper 2 presents an intellectually interesting but more speculative contribution (self-programmed execution with Spell), tested only on existing models not trained for the paradigm. Its impact depends on future training advances that remain undemonstrated, making Paper 1's contributions more immediately impactful and actionable.
Paper 1 has higher estimated scientific impact due to its more fundamental architectural shift: removing the fixed orchestrator and making the model completion itself the executable control policy (SPE), formalized via agentic machines and instantiated with a self-editing/effect-safe language (Spell). This changes how agent autonomy and learning of orchestration strategies can be conceived, with implications across agent design, programming languages, and safety. Paper 2 is rigorous and highly practical for code agents, but is a narrower interface improvement within existing agent paradigms.
Paper 2 introduces a fundamentally novel architectural paradigm (SPE) where language models self-program their own orchestration, eliminating fixed turn-to-turn policies. This represents a deeper conceptual contribution with broad implications for agent design, programming languages, and AI autonomy. The formalization via agentic machines and the Spell language are innovative contributions that could influence multiple research directions. Paper 1, while thorough in benchmarking LMMs for spatial navigation, is primarily an evaluation/benchmark contribution with more incremental insights. Paper 2's paradigm shift in agent architecture has greater potential to reshape how future agents are built.
Paper 1 introduces a fundamental paradigm shift in LLM agent architecture by replacing fixed orchestrators with self-programmed execution. This foundational innovation has broad applicability across all agentic AI research. While Paper 2 presents a highly rigorous and effective system for the specific domain of forecasting, Paper 1's conceptual novelty and potential to redefine how autonomous agents are structured give it a higher ceiling for broad scientific impact.
Paper 1 offers higher scientific impact by addressing a fundamental, ubiquitous limitation of LLMs (multi-turn context degradation) through rigorous mechanistic interpretability. By introducing the Goal Accessibility Ratio and uncovering how attention channels transition to residual streams, it provides actionable, causal insights into architectural flaws. This will directly influence foundational model design and evaluation. While Paper 2 presents a highly innovative agent architecture, it relies on a specific Lisp-based paradigm. Paper 1's generalizable diagnostics and mechanistic explanations of a critical failure mode ensure broader and more immediate relevance across the AI field.
Paper 2 is more novel and broadly impactful: it reframes agent architectures by removing the fixed orchestrator and making the model output the executable control program, with a formalization (“agentic machines”) and a purpose-built language (Spell) addressing hard issues like self-modification and side-effect-safe re-evaluation. This is a general paradigm shift applicable across agentic systems, tooling, and training research. Paper 1 is rigorous and practical for multi-hop RAG, but is more domain-specific and incremental relative to prior tool/program-based RAG and code-as-reasoning trends. Paper 2’s breadth and conceptual novelty suggest higher scientific impact.
Paper 1 proposes a fundamental paradigm shift in agent architecture by removing fixed orchestrators in favor of self-programmed execution. This high novelty, combined with the introduction of a new language (Spell) to solve the context/program duality, opens an entirely new research direction for autonomous agents. Paper 2 offers a valuable but more incremental improvement to RL alignment via dynamic curriculum learning. The foundational nature and potential to rethink how LLM agents are designed give Paper 1 a higher potential scientific impact.
Paper 1 is more novel: it proposes a new agent architecture (self-programmed execution) where the LM output is the orchestrator, plus a practical language/runtime (Spell) addressing tricky side-effect re-execution issues. This could reshape how agents are built and trained, with broad impact across agent design, programming languages, and alignment/safety. While Paper 2 is timely and useful as a benchmark for embodied social reasoning, benchmarks typically have narrower conceptual novelty and impact unless they become dominant standards; its main contribution is evaluation infrastructure rather than a new paradigm.
Paper 1 proposes a fundamental architectural shift for LLM agents by eliminating fixed orchestrators in favor of self-programmed execution. This theoretical and methodological innovation introduces a novel computing paradigm with broad applicability across the entire AI agent field. In contrast, Paper 2 presents a highly effective but domain-specific framework for system operations. While Paper 2 shows impressive real-world industrial impact, Paper 1's foundational novelty is likely to inspire more widespread scientific follow-up and cross-disciplinary research.
Paper 2 introduces a foundational paradigm shift in agent architecture by eliminating the fixed orchestrator and allowing models to self-program their execution loops. This conceptual innovation, backed by a novel self-modifying programming language (Spell), offers much broader theoretical and practical implications for all LLM agents. In contrast, Paper 1 presents a valuable but more incremental engineering approach focused specifically on optimizing lightweight GUI agents.
Paper 2 is more novel: it redefines agent execution by removing the fixed orchestrator and making the model output an executable self-editing program (SPE), introducing a concrete language/runtime (Spell) to make this feasible and safe w.r.t. side effects. This is a foundational architectural shift that could influence agent design, training objectives, and programming-language interfaces across many domains. Paper 1 is timely and practically useful for security auditing, but it is more of a unifying representation/engineering framework with narrower impact primarily in LLM security and governance.
Paper 1 introduces a fundamentally new agent architecture (SPE) where LLMs self-program their own orchestration, supported by novel theoretical formalization (agentic machines) and a practical language (Spell). This challenges existing assumptions about fixed orchestration in LM agents and opens a new research direction with broad implications for AI agent design and self-modifying systems. Paper 2, while methodologically sound, is a benchmark contribution for a narrower domain (VL content moderation) with more incremental impact. SPE's novelty, generality, and potential to reshape agent architectures give it substantially higher impact potential.
Paper 1 introduces a fundamentally novel agent architecture (SPE) where LLMs self-program their own orchestration, formalized with agentic machines and a novel self-editing Lisp language (Spell). This addresses a core limitation in LM agent design with broad implications for AI agent architectures, autonomous systems, and programming language design. Paper 2, while solid, is an incremental combination of existing components (GAT + Mamba) for traffic forecasting—a well-explored domain. Paper 1's conceptual novelty, cross-disciplinary impact (PL + AI agents), and potential to reshape agent design give it substantially higher impact potential.
Paper 1 is more likely to have higher scientific impact: it proposes a novel agent architecture (self-programmed execution) with a concrete implementation (Spell) and empirical evidence that existing frontier LMs can operate under this regime. This is methodologically stronger and more directly extensible by researchers, with immediate applications to agent design, tool use, and training objectives. Its ideas can influence multiple technical areas (program synthesis, agent orchestration, safety/interpretability). Paper 2 is timely and potentially influential in practice, but it is primarily conceptual with less empirical/technical rigor, making its scientific impact less certain.
Paper 1 introduces a fundamentally novel agent architecture (SPE) where language models self-orchestrate by writing and editing their own execution programs, eliminating fixed orchestration policies. This is highly innovative, timely given the explosion of LM agents, and has broad implications for AI agent design, autonomy, and future training paradigms. Paper 2 provides valuable theoretical contributions (finite-time MCTS guarantees for continuous POMDPs), but addresses a more incremental, narrower problem within established planning frameworks. Paper 1's paradigm-shifting potential and relevance to the rapidly growing LM agent field give it higher impact potential.