LACUNA: Safe Agents as Recursive Program Holes

Yaoyu Zhao, Yichen Xu, Oliver Bračevac, Cao Nguyen Pham, Frank Zhengqing Wu, Martin Odersky

May 27, 2026

arXiv:2605.28617v1

cs.AI (primary), cs.PL

#297 of 2682 · Artificial Intelligence

Tournament Score: 1508 ± 47 (scale 1050 to 1800)

Win Rate: 83%

Rating: 6.8 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7.5 · Clarity 8

Abstract

LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime owns the loop, context, and control flow, and the model has little say over any of them. Letting model-written code shape the runtime itself would make agents more expressive, but it would also sharpen safety problems. A model can be diverted by a prompt injection, call the wrong tool, or fail partway and leave an inconsistent state, and each such failure reaches further when the code shapes the runtime than when it expresses a single action. We present LACUNA, a programming model for agents that closes this split while preserving safety. Each agent action is a typed call $\texttt{agent[T](task)}$ that the LLM fills with code when execution reaches it, and the code is type-checked against the surrounding program before it runs. Because each action is accepted or rejected as a whole, a rejected one leaves the environment untouched, and its compiler diagnostics drive a retry. The same check also bounds which tools and data an action may use and how they flow. Our primitive expresses ReAct loops, sub-agents, skills, parallel decomposition, and multi-model planning as ordinary control flow. We evaluate LACUNA on a collection of test cases, BrowseComp-Plus, and $\tau^{2}$-bench. On BrowseComp-Plus, $8.6\%$ of generations are rejected before execution, with 0.7 retries per query on average, and the agent reaches $27.1\%$ accuracy. On $\tau^{2}$-bench, LACUNA solves $76.0\%$ of $392$ tasks across four domains with a capable model, on par with the baseline agent.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: LACUNA: Safe Agents as Recursive Program Holes

1. Core Contribution

LACUNA introduces a programming model that unifies the agent runtime and the model-generated code through a single primitive: `agent[T](task)`. When execution reaches this call, the LLM generates Scala code that is compiled and type-checked against the expected return type `T` in the surrounding lexical scope before any execution occurs. This "typed hole" approach provides atomic accept/reject semantics—if the generated code fails type-checking, nothing executes, preventing partial state corruption.
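The accept-or-reject behaviour described above can be approximated in a few lines. The sketch below is a Python analogy, not LACUNA's Scala implementation: `check`, `fill_hole`, and `stub_model` are hypothetical names, and a syntax check plus a name allow-list stands in for the Scala type checker. It shows that a rejected candidate never runs and that its diagnostics are handed to the next generation attempt.

```python
import ast


def check(source, allowed):
    """Statically inspect a candidate; an empty list means it is accepted."""
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    # Stand-in for type checking: every name must come from the given scope.
    return [
        f"name not in scope: {node.id}"
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and node.id not in allowed
    ]


def fill_hole(generate, scope, max_attempts=3):
    """Ask `generate` for code until a candidate passes `check`, then run it."""
    diagnostics = []
    for attempt in range(max_attempts):
        source = generate(diagnostics)  # hypothetical stand-in for the LLM
        diagnostics = check(source, set(scope))
        if not diagnostics:
            code = compile(source, "<hole>", "eval")
            return eval(code, {"__builtins__": {}}, scope), attempt
    raise RuntimeError(f"no candidate accepted: {diagnostics}")


calls = []


def record(x):
    calls.append(x)  # observable side effect
    return x


candidates = iter(["record(1) + wipe()", "record(2) + 40"])
seen_diagnostics = []


def stub_model(diagnostics):
    seen_diagnostics.append(list(diagnostics))
    return next(candidates)


result, retries = fill_hole(stub_model, {"record": record})
```

The first candidate is rejected for the unknown name `wipe`, so `record(1)` never runs and `calls` only ever sees the accepted candidate. A name allow-list is not a sandbox (attribute access can still escape it), which is why this is only an illustration of the control flow.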

The key insight is treating model-generated agent actions as typed program holes (borrowing from programming language theory) that are recursively composable. The generated code can itself contain `agent` calls, enabling sub-agents, parallel decomposition, and ReAct-style loops to emerge as ordinary control flow rather than framework-imposed patterns. This is a genuine conceptual advance over existing code-as-action frameworks where the runtime owns the control flow and the model merely fills in individual actions.
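Recursive composition can be pictured as generated code that calls `agent` from inside an `agent` call. The following Python sketch is only an analogy under stated assumptions: `make_agent` and the canned `answers` table are hypothetical stand-ins for the model, and the pre-execution check is left out to keep the recursion visible.

```python
def make_agent(model, tools):
    """Build an `agent` function whose generated code may call `agent` again."""

    def agent(task):
        source = model(task)               # stand-in for an LLM call
        scope = {**tools, "agent": agent}  # the hole itself is in scope
        return eval(source, {"__builtins__": {}}, scope)

    return agent


# Canned "generations", keyed by task text (purely illustrative).
answers = {
    "total length": "agent('length of alpha') + agent('length of beta')",
    "length of alpha": "length('alpha')",
    "length of beta": "length('beta')",
}

agent = make_agent(answers.__getitem__, {"length": len})
total = agent("total length")
```

The outer task is decomposed into two nested calls by the generated code itself, so the sub-agent structure comes from ordinary expression evaluation rather than from a framework loop.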

2. Methodological Rigor

The paper is methodologically sound in several respects. The formal safety argument rests on the host compiler's soundness rather than a bespoke verification layer—a principled design choice. The threat model is clearly articulated, distinguishing between honest-but-fallible models and adversarial ones (prompt injection), with capability tracking via Scala 3's capture checking addressing the latter.

The evaluation covers three complementary axes: ~400 unit tests verifying type-system guarantees, BrowseComp-Plus for complex tool-using tasks, and τ²-bench for multi-turn conversational agents. The AgentDojo prompt-injection evaluation (Table 3) demonstrates capability confinement under adversarial conditions, with near-zero successful attacks across domains.

However, there are notable gaps. The BrowseComp-Plus accuracy (27.1%) is modest, though the authors correctly note this reflects retriever/model quality rather than the framework. The τ²-bench comparison shows LACUNA is "on par" with baseline tool-calling agents but doesn't surpass them, making the expressiveness argument somewhat theoretical. The paper runs single trials on τ²-bench, limiting statistical confidence. The compilation overhead is acknowledged but not rigorously measured—latency and cost comparisons with standard tool-calling are absent.

3. Potential Impact

Programming languages × AI agents intersection: This paper sits at a genuinely productive intersection. By grounding agent safety in established PL concepts (types, capabilities, capture checking), it provides a more principled foundation than ad-hoc sandboxing or policy-based approaches. This could influence how future agent frameworks think about safety guarantees.

Practical limitations temper impact: The Scala 3 dependency is a significant adoption barrier. The authors discuss portability (Section 6.5), but the specific features needed (in-process recompilation with live context exposure, and capture checking) are rare in mainstream languages. Python, where most agent development occurs, cannot provide these guarantees. This constrains near-term practical adoption.

Safety paradigm: The atomic rejection property (nothing runs if anything fails to type-check) is a genuinely useful safety primitive that other systems lack. The demonstration that 8.6% of generations on BrowseComp-Plus are caught pre-execution, with cheap retries, shows the overhead is manageable.

Capability-based information flow: The `Classified[T]` pattern for preventing sensitive data leakage to untrusted models (Section 4.3) is a compelling application, particularly relevant as agents handle increasingly sensitive data across trust boundaries.
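One way to picture the idea behind `Classified[T]` is a wrapper that lets code compute on a secret without letting the secret reach text sent to a model. The Python sketch below is an analogy under assumptions: the `map` method, the redacted `repr`, and `build_prompt` are hypothetical, and Python enforces none of this statically, whereas the review attributes LACUNA's guarantee to the type system.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class Classified(Generic[T]):
    """Holds a sensitive value; printing or formatting it shows a placeholder."""

    def __init__(self, value: T) -> None:
        self._value = value

    def map(self, fn: Callable[[T], U]) -> "Classified[U]":
        # Results derived from a secret stay classified.
        return Classified(fn(self._value))

    def __repr__(self) -> str:
        return "Classified(<redacted>)"

    __str__ = __repr__


def build_prompt(task: str, data: object) -> str:
    """Hypothetical prompt builder for an untrusted model."""
    return f"{task}: {data}"


secret = Classified("hunter2")
prompt = build_prompt("Summarise the record", secret.map(str.upper))
```

Here the derived value stays wrapped, so the prompt contains only the placeholder. Nothing in Python stops a caller from reading `_value` directly, which is the gap a static check is meant to close.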

4. Timeliness & Relevance

The paper addresses a genuine and growing concern. As LLM agents gain more autonomy and tool access (MCP, code execution, multi-step planning), the gap between expressiveness and safety widens. The timing is excellent—agent safety is a bottleneck, and most current approaches are probabilistic (output filtering, prompt hardening) rather than structural. LACUNA offers deterministic, pre-execution guarantees, which is a qualitative improvement.

The framing against recursive language models (RLM) is well-positioned, as RLM's lack of pre-execution checking is a real limitation that LACUNA directly addresses.

5. Strengths & Limitations

Strengths:

*Elegant unification*: A single primitive (`agent[T]`) subsumes ReAct, sub-agents, skills, parallel decomposition, and multi-model planning. The compositional expressiveness is impressive.

*Safety by construction*: Rather than bolting safety onto an existing framework, safety emerges from the host language's type system—a fundamentally more robust approach.

*Atomic failure semantics*: The all-or-nothing execution property prevents the partial-state corruption that plagues Python `exec`-based approaches.

*Recursive composability*: Nested agent calls with progressively richer context is a natural and powerful abstraction.

*Honest limitations section*: The paper is unusually transparent about its constraints (well-typed ≠ correct, capability scope must be developer-supplied, model coding ability dependency).

Limitations:

*Scala dependency*: The strongest guarantees require Scala 3's experimental capture checking. The paper's safety story doesn't transfer to Python ecosystems where agents are actually built.

*Performance not demonstrated as superior*: On both benchmarks, LACUNA matches but doesn't exceed baselines, making the practical value proposition primarily about safety rather than capability.

*Semantic correctness gap*: The authors acknowledge that type-correct code can still be semantically wrong—the most common failure mode in practice. The refinement types direction is interesting but entirely future work.

*Model coding ability as bottleneck*: The 89% rejection rate for gemini-flash-lite on telecom tasks reveals that the approach's utility is heavily model-dependent, and requiring Scala competence raises the bar further.

*Limited scale of evaluation*: Single-trial τ²-bench runs, and no wall-clock or cost comparisons with conventional approaches.

*Escape hatches*: Reflection and raw process execution bypass all guarantees unless safe mode is enabled—an opt-in defense that deployments may neglect.

Overall Assessment

LACUNA is a well-conceived contribution at the intersection of programming languages and AI agents. Its core insight—treating agent actions as typed program holes with pre-execution checking—is both theoretically elegant and practically relevant. The safety guarantees are real and structurally superior to existing approaches. However, the Scala-specific realization limits near-term adoption, the empirical evaluation shows parity rather than improvement over baselines, and the gap between type safety and semantic correctness remains the elephant in the room. The paper's greatest impact may be conceptual: demonstrating that PL-theoretic tools can provide deterministic safety guarantees for LLM agents, potentially inspiring similar approaches in more widely adopted languages.

Rating: 6.8 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7.5 · Clarity 8

Generated May 28, 2026

Comparison History (24)

vs. Voluntary Collusion with Secret Tools in Competing LLM Agents

gpt-5.2 · 5/28/2026

Paper 2 has higher impact potential due to a novel, general programming model that unifies agent runtimes with model-written code via typed “program holes,” offering a concrete safety mechanism (type-checking, atomic accept/reject, bounded tool/data flow) with broad applicability to real agent systems. It is methodologically stronger: formal interface + implementation + evaluations on multiple benchmarks. Its contributions can influence programming languages, agent architectures, and safety. Paper 1 is timely and important for multi-agent safety, but is primarily an empirical finding in specific games; it offers fewer generalizable mechanisms or deployable mitigations.

vs. Picid: A Modular Evaluation Infrastructure for Reproducible PHM Across Tasks and Domains

claude-opus-4.6 · 5/28/2026

LACUNA addresses a fundamental challenge in LLM agent safety and expressiveness—bridging the runtime/model-code split with type-checked program holes. This sits at the intersection of programming languages and AI agents, two rapidly growing fields. Its novel programming model (agents as typed recursive holes) with formal safety guarantees has broad implications for how future AI agents are built. Paper 2, while valuable for PHM reproducibility, addresses a narrower domain with an infrastructure/benchmarking contribution rather than a conceptual advance. The timeliness and breadth of Paper 1's impact on the booming LLM agent ecosystem gives it higher potential.

vs. From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets

gemini-3.1 · 5/28/2026

Paper 2 proposes a novel, general-purpose programming model for LLM agents that addresses critical issues in agent safety, expressiveness, and control flow. Its approach of treating agent actions as typed, compiler-checked program holes has broad applicability across all AI agent domains. While Paper 1 offers a rigorous and valuable benchmark for financial LLMs, Paper 2's fundamental contribution to agent architecture and safety gives it a higher potential for widespread scientific impact across the broader AI and software engineering communities.

vs. AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models

claude-opus-4.6 · 5/28/2026

LACUNA introduces a fundamentally novel programming model that addresses a core architectural problem in LLM agents—unifying the runtime and model-written code while preserving safety through typed program holes. This has broader impact across the entire agent ecosystem, touching programming languages, safety, and agent design. AIBuildAI-2, while achieving strong benchmark results, is a more incremental contribution (knowledge-enhanced agent for AutoML) building on established retrieval-augmented generation patterns. LACUNA's theoretical framework and safety guarantees have wider applicability and deeper foundational significance.

vs. Entropy-aware Masking for Masked Language Modeling

gemini-3.1 · 5/28/2026

Paper 1 introduces a novel programming model for LLM agents that elegantly merges runtime control with model-generated code while ensuring safety through type-checking. This addresses a highly relevant and pressing challenge in agentic AI. In contrast, Paper 2 proposes an incremental improvement to Masked Language Modeling, a mature area that currently receives less focus compared to generative agents. Thus, Paper 1 promises broader and more timely scientific impact.

vs. OpenComputer: Verifiable Software Worlds for Computer-Use Agents

claude-opus-4.6 · 5/28/2026

LACUNA introduces a novel programming model that addresses a fundamental tension between expressiveness and safety in LLM agents—a broadly applicable architectural contribution. Its typed, compiler-verified approach to agent actions is conceptually innovative and generalizable across many agent frameworks. OpenComputer, while valuable as a benchmarking framework for computer-use agents with 33 applications, is more incremental—primarily an evaluation infrastructure contribution. LACUNA's ideas around treating agent actions as typed program holes with safety guarantees have broader theoretical and practical implications for the rapidly growing field of agentic AI safety.

vs. The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

gemini-3.1 · 5/28/2026

Paper 1 investigates foundational post-training methodologies (distillation) for LLMs, uncovering core failure mechanisms and providing concrete fixes. Because distillation is central to aligning and improving state-of-the-art LLMs, these insights offer immediate, widespread impact on core machine learning research and model development. Paper 2 presents an innovative and practical framework for agent programming and safety, but its scope is more constrained to software engineering and agent system design rather than fundamental model capabilities.

vs. From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation

gemini-3.1 · 5/28/2026

Paper 2 offers higher scientific impact by addressing a fundamental structural flaw in Knowledge Editing—epistemic dissonance. While Paper 1 provides a valuable systems-level programming model for agent safety, Paper 2 proposes a paradigm shift in how foundational models internalize updates, moving from discrete fact overwriting to causal knowledge evolution. Its dramatic reduction of self-refutation rates (from 95.6% to 1.8%) demonstrates deep methodological rigor and solves a critical bottleneck in maintaining the factual accuracy and logical coherence of LLMs, ensuring broad applicability across all domains relying on dynamically updated models.

vs. Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

gpt-5.2 · 5/28/2026

Paper 2 (LACUNA) introduces a novel, general programming model that unifies agent runtime and model-written code via typed “program holes” with pre-execution type-checking, offering a principled safety mechanism (reject/rollback) and clear extensibility to common agent patterns. Its potential real-world impact is broad (safer tool use, prompt-injection resilience, reliable automation) and spans PL, systems, and AI safety, making it timely. Paper 1 is valuable for XAI faithfulness and benchmarking, but its impact is more niche and depends on faithfulness tools/benchmarks; gains are moderate and domain-specific.

vs. RULER: Representation-Level Verification of Machine Unlearning

gemini-3.1 · 5/28/2026

Paper 1 addresses a critical flaw in current machine unlearning evaluation by exposing that models retain forgotten data in intermediate representations despite passing output-level tests. Given the rising legal and ethical pressures around data privacy and copyright (e.g., GDPR), a rigorous, representation-level verification method has profound, immediate implications across the entire machine learning community. Paper 2 presents a novel and useful programming model for LLM agents, but Paper 1's fundamental challenge to the efficacy of existing unlearning techniques promises broader and more disruptive scientific impact.

vs. Modeling Vehicle-Type-Specific Pedestrian Crash Avoidance Behavior in Safety-Critical Interactions Using Smooth-Mamba Deep Reinforcement Learning

claude-opus-4.6 · 5/28/2026

LACUNA addresses a fundamental challenge in LLM agent safety and expressiveness—bridging the gap between agent runtime and model-written code through a novel typed programming model. This has broad impact across AI safety, software engineering, and agent frameworks. The approach is conceptually novel (agents as typed program holes with compiler-enforced safety), timely given the rapid growth of code-generating agents, and applicable across many domains. Paper 2, while rigorous, addresses a narrower problem (pedestrian-vehicle interaction modeling) with more incremental methodological contributions (combining existing RL and Mamba architectures).

vs. Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation

gemini-3.1 · 5/28/2026

Paper 1 introduces a foundational programming paradigm for LLM agents, addressing critical safety and expressivity bottlenecks in agentic workflows. Its approach to unifying runtime and model-generated code via type-checked recursive holes offers broad implications for AI agent design. While Paper 2 presents a strong, practical architecture for speech translation, Paper 1's methodological innovation in agent safety and control has a wider potential impact across the rapidly expanding field of autonomous AI systems.

vs. DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

claude-opus-4.6 · 5/28/2026

DenoiseRL addresses a fundamental scalability bottleneck in RL-based reasoning for LLMs—dependence on stronger teacher models and curated data—with a novel recovery-oriented learning framework. This has broad applicability across all reasoning tasks and model scales. While LACUNA presents an interesting programming model for safe LLM agents with type-checked code generation, it addresses a more niche problem (agent safety via typed holes) with moderate empirical results. DenoiseRL's potential to change how reasoning models are trained at scale, combined with strong empirical results across multiple benchmarks, suggests broader and deeper scientific impact.

vs. MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing

claude-opus-4.6 · 5/28/2026

LACUNA addresses a fundamental challenge in LLM agent safety and expressiveness—bridging the gap between agent runtimes and model-written code through a novel programming model with type-checked safety guarantees. This has broad implications across the entire LLM agent ecosystem, affecting safety, reliability, and expressiveness of code-generating agents. MACReD, while solid, addresses a narrower domain (chemical reaction diagram parsing) with incremental improvements on a specific benchmark. LACUNA's conceptual contribution (agents as typed program holes) is more novel and has wider cross-field applicability.

vs. Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

gemini-3.1 · 5/28/2026

Paper 1 introduces a foundational programming paradigm for LLM agents by bridging programming language concepts (type-checking, compiler diagnostics) with agentic execution. This system-level approach to agent safety and control flow has a broader potential impact on how future AI agent frameworks are built compared to Paper 2, which offers a valuable but narrower alignment technique for improving model abstention.

vs. Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

claude-opus-4.6 · 5/28/2026

LACUNA addresses a fundamental challenge in LLM agent safety and expressiveness with a novel programming model (typed program holes) that has rigorous theoretical grounding and quantitative evaluation on established benchmarks. It bridges programming languages and AI safety—two high-impact fields—with a principled approach. Paper 2 presents an engineering architecture for analytics using existing components (Kafka, Flink, LLMs) with only use-case demonstrations rather than rigorous evaluation. LACUNA's contribution to safe agentic code generation is more novel, methodologically rigorous, and broadly applicable across the rapidly growing agent ecosystem.

vs. A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis

claude-opus-4.6 · 5/28/2026

LACUNA addresses a fundamental and timely problem in LLM agent safety and expressiveness by introducing a novel programming model that unifies agent runtime and model-written code through typed, compiler-checked program holes. This has broad implications for the rapidly growing field of AI agents, with practical safety guarantees and a principled design. Paper 2 proposes incremental improvements (conflict-aware penalty and statistical loss) for multimodal sentiment analysis, tested on a single dataset (CMU-MOSI), representing a narrower contribution with more limited generalizability and impact potential.

vs. STAB: Specification-driven Testing for Algorithmic Bottlenecks

claude-opus-4.6 · 5/28/2026

LACUNA addresses a fundamental architectural problem in LLM agent safety—bridging the gap between agent runtime and model-written code through a novel programming model with type-checked actions. This has broader impact across the rapidly growing field of LLM agents, touching safety, programming languages, and AI systems design. STAB, while solid and well-evaluated, addresses a narrower problem (algorithmic bottleneck testing) with incremental improvements. LACUNA's conceptual contribution—agents as recursive program holes with safety guarantees—is more likely to influence future agent architectures and spawn follow-up research across multiple communities.

vs. Verifiable Benchmarking of Long-Horizon Spatial Biology

gpt-5.2 · 5/28/2026

Paper 1 introduces a novel programming model (typed, runtime-integrated “program holes” for agent actions) that directly addresses a core limitation in agent architectures—control-flow/runtime separation—while adding safety via compile-time type checks and transactional rejection/retry. This is broadly applicable across domains where LLM agents write/execute code, potentially influencing agent frameworks, programming languages, and safety tooling. Paper 2 is a rigorous and timely benchmark with clear value for spatial biology, but its impact is narrower (primarily evaluation within a specific scientific area) and less likely to reshape general agent design.

vs. MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation

gpt-5.2 · 5/28/2026

Paper 1 introduces a novel programming model that tightly integrates LLM code generation with a typed, safety-preserving runtime via compile-time rejection and retries—an approach likely to influence how agents are built, verified, and deployed across many domains. It has clear real-world applicability (safer tool use, controlled side effects, composable agent control flow) and broader cross-field impact (programming languages, software engineering, AI safety, agent systems). Paper 2 is a valuable benchmark/taxonomy for a narrower subarea (cinematic multi-talker A/V), impactful but more domain-specific and less methodologically transformative.