LACUNA: Safe Agents as Recursive Program Holes
Yaoyu Zhao, Yichen Xu, Oliver Bračevac, Cao Nguyen Pham, Frank Zhengqing Wu, Martin Odersky
Abstract
LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime owns the loop, context, and control flow, and the model has little say over any of them. Letting model-written code shape the runtime itself would make agents more expressive, but it would also sharpen safety problems. A model can be diverted by a prompt injection, call the wrong tool, or fail partway and leave an inconsistent state, and each such failure reaches further when the code shapes the runtime than when it expresses a single action. We present LACUNA, a programming model for agents that closes this split while preserving safety. Each agent action is a typed call that the LLM fills with code when execution reaches it, and the code is type-checked against the surrounding program before it runs. Because each action is accepted or rejected as a whole, a rejected one leaves the environment untouched, and its compiler diagnostics drive a retry. The same check also bounds which tools and data an action may use and how they flow. Our primitive expresses ReAct loops, sub-agents, skills, parallel decomposition, and multi-model planning as ordinary control flow. We evaluate LACUNA on a collection of test cases, BrowseComp-Plus, and τ²-bench. On BrowseComp-Plus, 8.6% of generations are rejected before execution, with 0.7 retries per query on average, and the agent reaches 27.1% accuracy. On τ²-bench, LACUNA solves of tasks across four domains with a capable model, on par with the baseline agent.
AI Impact Assessments
Scientific Impact Assessment: LACUNA: Safe Agents as Recursive Program Holes
1. Core Contribution
LACUNA introduces a programming model that unifies the agent runtime and the model-generated code through a single primitive: `agent[T](task)`. When execution reaches this call, the LLM generates Scala code that is compiled and type-checked against the expected return type `T` in the surrounding lexical scope before any execution occurs. This "typed hole" approach provides atomic accept/reject semantics—if the generated code fails type-checking, nothing executes, preventing partial state corruption.
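The accept-or-reject cycle described above can be sketched roughly as follows. The sketch is in Python for brevity, not in the Scala that LACUNA uses, and every name in it is an assumption made for illustration: `check`, `agent_sketch`, and `stub_model` are hypothetical, the stub stands in for the LLM, and a simple scope check over the parsed expression stands in for the Scala type checker. It shows only the control flow: a candidate is checked as a whole before anything runs, and the diagnostics of a rejected candidate feed the next attempt.

```python
import ast


def check(source, allowed):
    """Return diagnostics for a candidate; an empty list means it is accepted.

    Stand-in for a type check: parse the expression and flag any name that is
    not in the surrounding scope. This is not a sandbox.
    """
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError as err:
        return [f"syntax error: {err.msg}"]
    return [
        f"name not in scope: {node.id}"
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and node.id not in allowed
    ]


def agent_sketch(task, generate, scope, max_attempts=3):
    """Ask `generate` for code, check it, and run it only if the check passes."""
    diagnostics = []
    for attempt in range(1, max_attempts + 1):
        source = generate(task, diagnostics)
        diagnostics = check(source, set(scope))
        if not diagnostics:
            # Accepted as a whole: only now does anything execute.
            return eval(source, {"__builtins__": {}}, dict(scope)), attempt
        # Rejected as a whole: nothing ran, so the environment is untouched.
    raise RuntimeError(f"no candidate accepted: {diagnostics}")


def stub_model(task, diagnostics):
    # Hypothetical LLM stand-in: a bad first candidate, then a corrected one.
    if not diagnostics:
        return "delete_all(docs)"
    return "search(docs, 'typed')"


scope = {
    "docs": ["typed holes", "prompt glue"],
    "search": lambda ds, q: [d for d in ds if q in d],
}
result, attempts = agent_sketch("find typed docs", stub_model, scope)
print(result, attempts)  # prints ['typed holes'] 2
```

The first candidate refers to `delete_all`, which is not in scope, so it is rejected without running and its diagnostic is handed back to the stub. The second candidate passes the check and is the only code that executes.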
The key insight is treating model-generated agent actions as typed program holes (borrowing from programming language theory) that are recursively composable. The generated code can itself contain `agent` calls, enabling sub-agents, parallel decomposition, and ReAct-style loops to emerge as ordinary control flow rather than framework-imposed patterns. This is a genuine conceptual advance over existing code-as-action frameworks where the runtime owns the control flow and the model merely fills in individual actions.
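To illustrate what "recursively composable" means here, the following is a toy Python sketch, not LACUNA's Scala API. The names `agent` and `stub_model` are hypothetical, and the stub returns hard-coded callables where a real system would have the model generate code. It shows one thing only: code produced for one `agent` call can itself call `agent`, so decomposition and parallelism are plain language constructs, not framework features.

```python
from concurrent.futures import ThreadPoolExecutor


def agent(task):
    """Hypothetical primitive: obtain code for the task, run it with `agent` in scope."""
    return stub_model(task)(agent)


def stub_model(task):
    # Hypothetical LLM stand-in. A compound task yields code that spawns
    # sub-agents; a simple task yields a leaf action with no further calls.
    parts = task.split(" and ")
    if len(parts) == 1:
        return lambda agent: len(task)  # leaf action: a toy "result"

    def decomposed(agent):
        # Parallel decomposition written as ordinary control flow.
        with ThreadPoolExecutor() as pool:
            return sum(pool.map(agent, parts))

    return decomposed


total = agent("read docs and write summary and cite")
print(total)  # prints 26
```

The compound task splits into three sub-tasks whose toy results (9, 13, and 4) are summed by the parent. In the system the review describes, each nested call would also be type-checked before running; the sketch leaves that out.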
2. Methodological Rigor
The paper is methodologically sound in several respects. The formal safety argument rests on the host compiler's soundness rather than a bespoke verification layer—a principled design choice. The threat model is clearly articulated, distinguishing between honest-but-fallible models and adversarial ones (prompt injection), with capability tracking via Scala 3's capture checking addressing the latter.
The evaluation covers three complementary axes: ~400 unit tests verifying type-system guarantees, BrowseComp-Plus for complex tool-using tasks, and τ²-bench for multi-turn conversational agents. The AgentDojo prompt-injection evaluation (Table 3) demonstrates capability confinement under adversarial conditions, with near-zero successful attacks across domains.
However, there are notable gaps. The BrowseComp-Plus accuracy (27.1%) is modest, though the authors correctly note this reflects retriever/model quality rather than the framework. The τ²-bench comparison shows LACUNA is "on par" with baseline tool-calling agents but doesn't surpass them, making the expressiveness argument somewhat theoretical. The paper runs single trials on τ²-bench, limiting statistical confidence. The compilation overhead is acknowledged but not rigorously measured—latency and cost comparisons with standard tool-calling are absent.
3. Potential Impact
Programming languages × AI agents intersection: This paper sits at a genuinely productive intersection. By grounding agent safety in established PL concepts (types, capabilities, capture checking), it provides a more principled foundation than ad-hoc sandboxing or policy-based approaches. This could influence how future agent frameworks think about safety guarantees.
Practical limitations temper impact: The Scala 3 dependency is a significant adoption barrier. The authors acknowledge portability (Section 6.5) but the specific features needed—in-process recompilation with live context exposure and capture checking—are rare in mainstream languages. Python, where most agent development occurs, cannot provide these guarantees. This constrains near-term practical adoption.
Safety paradigm: The atomic rejection property (nothing runs if anything fails to type-check) is a genuinely useful safety primitive that other systems lack. The demonstration that 8.6% of generations on BrowseComp-Plus are caught pre-execution, with cheap retries, shows the overhead is manageable.
Capability-based information flow: The `Classified[T]` pattern for preventing sensitive data leakage to untrusted models (Section 4.3) is a compelling application, particularly relevant as agents handle increasingly sensitive data across trust boundaries.
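The review does not say how `Classified[T]` is defined, so the following Python sketch is only one hypothetical reading of the idea. The method `map` and the helpers `build_prompt` and `trusted_reveal` are invented for illustration. Python enforces the boundary by convention at runtime, whereas the paper, according to the review, relies on Scala 3 capture checking to enforce it statically.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class Classified(Generic[T]):
    """Hypothetical wrapper: str() and repr() never show the payload."""

    def __init__(self, value: T) -> None:
        self._value = value

    def map(self, f: Callable[[T], U]) -> "Classified[U]":
        # Values derived from classified data stay classified.
        return Classified(f(self._value))

    def __repr__(self) -> str:
        return "Classified(<redacted>)"


def build_prompt(task: str, context: object) -> str:
    # Hypothetical prompt builder for an untrusted model. It only formats
    # values, so a Classified payload shows up redacted.
    return f"{task}: {context}"


def trusted_reveal(secret: Classified[T]) -> T:
    # Hypothetical trusted sink; by convention the only place that unwraps.
    return secret._value


ssn = Classified("123-45-6789")
last4 = ssn.map(lambda s: s[-4:])
prompt = build_prompt("Verify the customer", last4)
print(prompt)  # prints Verify the customer: Classified(<redacted>)
```

Nothing in Python stops the function passed to `map` from copying the payload elsewhere, which is the kind of leak a static capability check is meant to rule out.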
4. Timeliness & Relevance
The paper addresses a genuine and growing concern. As LLM agents gain more autonomy and tool access (MCP, code execution, multi-step planning), the gap between expressiveness and safety widens. The timing is excellent—agent safety is a bottleneck, and most current approaches are probabilistic (output filtering, prompt hardening) rather than structural. LACUNA offers deterministic, pre-execution guarantees, which is a qualitative improvement.
The framing against recursive language models (RLM) is well-positioned, as RLM's lack of pre-execution checking is a real limitation that LACUNA directly addresses.
5. Strengths & Limitations
Strengths:
Limitations:
Overall Assessment
LACUNA is a well-conceived contribution at the intersection of programming languages and AI agents. Its core insight—treating agent actions as typed program holes with pre-execution checking—is both theoretically elegant and practically relevant. The safety guarantees are real and structurally superior to existing approaches. However, the Scala-specific realization limits near-term adoption, the empirical evaluation shows parity rather than improvement over baselines, and the gap between type safety and semantic correctness remains the elephant in the room. The paper's greatest impact may be conceptual: demonstrating that PL-theoretic tools can provide deterministic safety guarantees for LLM agents, potentially inspiring similar approaches in more widely adopted languages.
Generated May 28, 2026
Comparison History (24)
Paper 2 (LACUNA) has higher impact potential due to a novel, general programming model that unifies agent runtimes with model-written code via typed “program holes,” offering a concrete safety mechanism (type-checking, atomic accept/reject, bounded tool/data flow) with broad applicability to real agent systems. It is methodologically stronger: formal interface + implementation + evaluations on multiple benchmarks. Its contributions can influence programming languages, agent architectures, and safety. Paper 1 is timely and important for multi-agent safety, but is primarily an empirical finding in specific games; it offers fewer generalizable mechanisms or deployable mitigations.
LACUNA addresses a fundamental challenge in LLM agent safety and expressiveness—bridging the runtime/model-code split with type-checked program holes. This sits at the intersection of programming languages and AI agents, two rapidly growing fields. Its novel programming model (agents as typed recursive holes) with formal safety guarantees has broad implications for how future AI agents are built. Paper 2, while valuable for PHM reproducibility, addresses a narrower domain with an infrastructure/benchmarking contribution rather than a conceptual advance. The timeliness and breadth of Paper 1's impact on the booming LLM agent ecosystem gives it higher potential.
Paper 2 (LACUNA) proposes a novel, general-purpose programming model for LLM agents that addresses critical issues in agent safety, expressiveness, and control flow. Its approach of treating agent actions as typed, compiler-checked program holes has broad applicability across all AI agent domains. While Paper 1 offers a rigorous and valuable benchmark for financial LLMs, Paper 2's fundamental contribution to agent architecture and safety gives it a higher potential for widespread scientific impact across the broader AI and software engineering communities.
LACUNA introduces a fundamentally novel programming model that addresses a core architectural problem in LLM agents—unifying the runtime and model-written code while preserving safety through typed program holes. This has broader impact across the entire agent ecosystem, touching programming languages, safety, and agent design. AIBuildAI-2, while achieving strong benchmark results, is a more incremental contribution (knowledge-enhanced agent for AutoML) building on established retrieval-augmented generation patterns. LACUNA's theoretical framework and safety guarantees have wider applicability and deeper foundational significance.
Paper 1 (LACUNA) introduces a novel programming model for LLM agents that elegantly merges runtime control with model-generated code while ensuring safety through type-checking. This addresses a highly relevant and pressing challenge in agentic AI. In contrast, Paper 2 proposes an incremental improvement to Masked Language Modeling, a mature area that currently receives less focus compared to generative agents. Thus, Paper 1 promises broader and more timely scientific impact.
LACUNA introduces a novel programming model that addresses a fundamental tension between expressiveness and safety in LLM agents—a broadly applicable architectural contribution. Its typed, compiler-verified approach to agent actions is conceptually innovative and generalizable across many agent frameworks. OpenComputer, while valuable as a benchmarking framework for computer-use agents with 33 applications, is more incremental—primarily an evaluation infrastructure contribution. LACUNA's ideas around treating agent actions as typed program holes with safety guarantees have broader theoretical and practical implications for the rapidly growing field of agentic AI safety.
Paper 1 investigates foundational post-training methodologies (distillation) for LLMs, uncovering core failure mechanisms and providing concrete fixes. Because distillation is central to aligning and improving state-of-the-art LLMs, these insights offer immediate, widespread impact on core machine learning research and model development. Paper 2 (LACUNA) presents an innovative and practical framework for agent programming and safety, but its scope is more constrained to software engineering and agent system design rather than fundamental model capabilities.
Paper 2 offers higher scientific impact by addressing a fundamental structural flaw in Knowledge Editing: epistemic dissonance. While Paper 1 (LACUNA) provides a valuable systems-level programming model for agent safety, Paper 2 proposes a paradigm shift in how foundational models internalize updates, moving from discrete fact overwriting to causal knowledge evolution. Its dramatic reduction of self-refutation rates (from 95.6% to 1.8%) demonstrates deep methodological rigor and solves a critical bottleneck in maintaining the factual accuracy and logical coherence of LLMs, ensuring broad applicability across all domains relying on dynamically updated models.
Paper 2 (LACUNA) introduces a novel, general programming model that unifies agent runtime and model-written code via typed “program holes” with pre-execution type-checking, offering a principled safety mechanism (reject/rollback) and clear extensibility to common agent patterns. Its potential real-world impact is broad (safer tool use, prompt-injection resilience, reliable automation) and spans PL, systems, and AI safety, making it timely. Paper 1 is valuable for XAI faithfulness and benchmarking, but its impact is more niche and depends on faithfulness tools/benchmarks; gains are moderate and domain-specific.
Paper 1 addresses a critical flaw in current machine unlearning evaluation by exposing that models retain forgotten data in intermediate representations despite passing output-level tests. Given the rising legal and ethical pressures around data privacy and copyright (e.g., GDPR), a rigorous, representation-level verification method has profound, immediate implications across the entire machine learning community. Paper 2 (LACUNA) presents a novel and useful programming model for LLM agents, but Paper 1's fundamental challenge to the efficacy of existing unlearning techniques promises broader and more disruptive scientific impact.
LACUNA addresses a fundamental challenge in LLM agent safety and expressiveness—bridging the gap between agent runtime and model-written code through a novel typed programming model. This has broad impact across AI safety, software engineering, and agent frameworks. The approach is conceptually novel (agents as typed program holes with compiler-enforced safety), timely given the rapid growth of code-generating agents, and applicable across many domains. Paper 2, while rigorous, addresses a narrower problem (pedestrian-vehicle interaction modeling) with more incremental methodological contributions (combining existing RL and Mamba architectures).
Paper 1 (LACUNA) introduces a foundational programming paradigm for LLM agents, addressing critical safety and expressivity bottlenecks in agentic workflows. Its approach to unifying runtime and model-generated code via type-checked recursive holes offers broad implications for AI agent design. While Paper 2 presents a strong, practical architecture for speech translation, Paper 1's methodological innovation in agent safety and control has a wider potential impact across the rapidly expanding field of autonomous AI systems.
DenoiseRL addresses a fundamental scalability bottleneck in RL-based reasoning for LLMs—dependence on stronger teacher models and curated data—with a novel recovery-oriented learning framework. This has broad applicability across all reasoning tasks and model scales. While LACUNA presents an interesting programming model for safe LLM agents with type-checked code generation, it addresses a more niche problem (agent safety via typed holes) with moderate empirical results. DenoiseRL's potential to change how reasoning models are trained at scale, combined with strong empirical results across multiple benchmarks, suggests broader and deeper scientific impact.
LACUNA addresses a fundamental challenge in LLM agent safety and expressiveness—bridging the gap between agent runtimes and model-written code through a novel programming model with type-checked safety guarantees. This has broad implications across the entire LLM agent ecosystem, affecting safety, reliability, and expressiveness of code-generating agents. MACReD, while solid, addresses a narrower domain (chemical reaction diagram parsing) with incremental improvements on a specific benchmark. LACUNA's conceptual contribution (agents as typed program holes) is more novel and has wider cross-field applicability.
Paper 1 (LACUNA) introduces a foundational programming paradigm for LLM agents by bridging programming language concepts (type-checking, compiler diagnostics) with agentic execution. This system-level approach to agent safety and control flow has a broader potential impact on how future AI agent frameworks are built compared to Paper 2, which offers a valuable but narrower alignment technique for improving model abstention.
LACUNA addresses a fundamental challenge in LLM agent safety and expressiveness with a novel programming model (typed program holes) that has rigorous theoretical grounding and quantitative evaluation on established benchmarks. It bridges programming languages and AI safety—two high-impact fields—with a principled approach. Paper 2 presents an engineering architecture for analytics using existing components (Kafka, Flink, LLMs) with only use-case demonstrations rather than rigorous evaluation. LACUNA's contribution to safe agentic code generation is more novel, methodologically rigorous, and broadly applicable across the rapidly growing agent ecosystem.
LACUNA addresses a fundamental and timely problem in LLM agent safety and expressiveness by introducing a novel programming model that unifies agent runtime and model-written code through typed, compiler-checked program holes. This has broad implications for the rapidly growing field of AI agents, with practical safety guarantees and a principled design. Paper 2 proposes incremental improvements (conflict-aware penalty and statistical loss) for multimodal sentiment analysis, tested on a single dataset (CMU-MOSI), representing a narrower contribution with more limited generalizability and impact potential.
LACUNA addresses a fundamental architectural problem in LLM agent safety—bridging the gap between agent runtime and model-written code through a novel programming model with type-checked actions. This has broader impact across the rapidly growing field of LLM agents, touching safety, programming languages, and AI systems design. STAB, while solid and well-evaluated, addresses a narrower problem (algorithmic bottleneck testing) with incremental improvements. LACUNA's conceptual contribution—agents as recursive program holes with safety guarantees—is more likely to influence future agent architectures and spawn follow-up research across multiple communities.
Paper 1 (LACUNA) introduces a novel programming model (typed, runtime-integrated “program holes” for agent actions) that directly addresses a core limitation in agent architectures (control-flow/runtime separation) while adding safety via compile-time type checks and transactional rejection/retry. This is broadly applicable across domains where LLM agents write/execute code, potentially influencing agent frameworks, programming languages, and safety tooling. Paper 2 is a rigorous and timely benchmark with clear value for spatial biology, but its impact is narrower (primarily evaluation within a specific scientific area) and less likely to reshape general agent design.
Paper 1 (LACUNA) introduces a novel programming model that tightly integrates LLM code generation with a typed, safety-preserving runtime via compile-time rejection and retries, an approach likely to influence how agents are built, verified, and deployed across many domains. It has clear real-world applicability (safer tool use, controlled side effects, composable agent control flow) and broader cross-field impact (programming languages, software engineering, AI safety, agent systems). Paper 2 is a valuable benchmark/taxonomy for a narrower subarea (cinematic multi-talker A/V), impactful but more domain-specific and less methodologically transformative.