Agentic Proving for Program Verification

Alessandro Sosso, Akhil Arora, Bas Spitters

cs.AI(primary)cs.LOcs.PLcs.SE
#1426 of 2682 · Artificial Intelligence
Share
Tournament Score
1402±38
10501800
58%
Win Rate
18
Wins
13
Losses
31
Matches
Rating
5.5/ 10
Significance
Rigor
Novelty
Clarity

Abstract

Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. Our results show that Claude generates arguably valid specifications for 98.8% of problems (with 81.3% also accepted by CLEVER's isomorphism-based scoring on the correct portion of the benchmark), certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% success rate on the end-to-end program generation and verification pipeline over entries with self-consistent premises. Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.

AI Impact Assessments

(1 models)

Scientific Impact Assessment

Core Contribution

This paper evaluates Claude Code (Opus 4.6) in an agentic, compiler-in-the-loop framework on CLEVER, a Lean 4 benchmark for end-to-end verifiable code generation. The main finding is near-saturation performance: 98.8% of problems receive arguably valid specifications, 87.5% of implementations are certified against correct ground-truth specs, and 98.1% of entries with self-consistent premises are solved end-to-end. The paper's secondary and arguably more lasting contribution is a systematic audit of the CLEVER benchmark itself—identifying bugs in ~50% of ground-truth specifications—and a methodological critique of isomorphism-based scoring for evaluating autoformalization.

Methodological Rigor

The experimental design is reasonably thorough, employing multiple prompting strategies (simple, guided, multi-specification) and tracking results across four pipeline stages. The use of status flags (OK, FAIL, MISMATCH, ISSUE) with manual verification of Claude's self-classifications adds credibility. The authors cross-checked agent-reported flags against the benchmark harness, finding only two discrepancies, both explainable.

However, several methodological concerns arise:

1. Single-model evaluation: Only Claude Opus 4.6 is tested in the main experiments. While preliminary experiments (Appendix A) compare multiple provers, the core paper lacks systematic ablations or comparisons with other frontier models on the same setup, making it difficult to disentangle the contribution of the agentic framework from the underlying model capability.

2. Benchmark emendation: The authors used a custom fork with fixes to formatting, typos, and test cases. While stated to not touch ground-truth specifications, this complicates reproducibility and direct comparison with prior results. The line between "fixing benchmark bugs" and "adapting the benchmark to the method" can be blurry.

3. Pass@1 with generous timeout: A 3600-second timeout per attempt is exceptionally generous (CLEVER's original protocol uses 600s). Combined with expensive API calls ($2-10 per proving attempt), this is a significantly different resource regime than prior work's pass@600s.

4. Self-evaluation circularity: The claim that Claude provides "high-quality feedback" is validated by manual review, but the manual review is conducted by the authors themselves without an independent annotation protocol, inter-rater reliability measures, or blinding.

5. The 98.8% specification validity claim is arguably the weakest metric: it counts any "defensible interpretation" of the NL description as valid, which is a subjective judgment. The harder metric—isomorphism with ground truth on correct entries—yields 81.3%, a more informative but less headline-worthy figure.

Potential Impact

The paper's practical impact operates on two levels:

Benchmark critique: The identification of pervasive bugs in CLEVER (80/161 ground-truth specifications flagged) is a genuinely valuable contribution. The detailed taxonomy of error types (Tables 1, 6-8) provides actionable guidance for future benchmark designers. The observation that isomorphism-based scoring breaks down when NL descriptions are ambiguous is important for the field. Independent corroboration exists: concurrent work [10, 16] also identifies spec-quality issues, though this paper's coverage is more comprehensive.

Agentic proving paradigm: The empirical evidence that compiler-in-the-loop agentic systems dramatically outperform non-agentic approaches for program verification is useful, though not entirely surprising given similar findings in mathematical theorem proving. The gap between Claude Code's performance and the original benchmark ceiling (from ~0.62% to ~98%) is striking, though it reflects both model improvements (Claude 3.7 → Opus 4.6) and the agentic scaffold.

Limitations of broader impact: The paper does not propose new methods—it applies an existing commercial tool (Claude Code) with existing open-source plugins to an existing benchmark. The actionable insight for practitioners is essentially "use Claude Code with lean4-skills and lean-lsp-mcp," which is more of an engineering recommendation than a scientific advance.

Timeliness & Relevance

The paper is highly timely, arriving as agentic AI systems are rapidly advancing in formal reasoning tasks. The finding that existing program verification benchmarks are nearly saturated by frontier agents is an important signal for the community, motivating harder benchmarks. The paper also connects to the broader autoformalization challenge and the "intent formalization" problem [15].

However, the speed of frontier model development means results may become dated quickly. The paper is essentially a snapshot of what Claude Opus 4.6 can do on a specific benchmark in May 2026.

Strengths

  • Comprehensive benchmark audit: The detailed error taxonomy across 161 problems is a significant empirical contribution that will benefit future benchmark design.
  • Multi-stage evaluation: Testing all four pipeline stages with multiple prompting strategies provides a nuanced picture.
  • Transparency: Detailed appendices with per-problem breakdowns, cost/runtime analysis, and the Problem 32 case study demonstrate thoroughness.
  • Practical methodology: The setup runs on a laptop, demonstrating that competitive program verification no longer requires specialized compute.
  • Limitations

  • No methodological novelty: The paper applies existing tools to an existing benchmark. The agentic framework is not a contribution of this work.
  • Conflation of model and method: It's unclear how much performance comes from Claude Opus 4.6's raw capability versus the agentic scaffold.
  • Benchmark-specific findings: Results on a single benchmark of 161 HumanEval-derived problems may not generalize to industrial-scale verification tasks.
  • Subjective evaluation: The "arguably valid" qualification in the 98.8% claim relies on the authors' judgment, and the manual review lacks formal protocols.
  • Missing baselines: No comparison with other agentic frameworks (e.g., using GPT-5 or Gemini in the same scaffold) in the main experiments.
  • Overall Assessment

    This paper is primarily an empirical benchmark evaluation with a valuable secondary contribution in benchmark auditing. It demonstrates impressive capabilities of a commercial AI system on program verification but offers limited methodological insight. The benchmark critique and error taxonomy are the most durable contributions. The paper would be strengthened by systematic ablations, comparison across models, and evaluation on more challenging verification targets.

    Rating:5.5/ 10
    Significance 5.5Rigor 5Novelty 3.5Clarity 7.5

    Generated May 25, 2026

    Comparison History (31)

    vs. Insuring Every Action: An Authority Frontier Framework for Runtime Actuarial Control of Autonomous AI Agents
    gemini-3.15/26/2026

    Paper 2 offers higher scientific impact due to its highly novel integration of actuarial science and AI safety. While Paper 1 provides valuable empirical results on AI program verification, it represents an incremental application of existing models to current benchmarks. Paper 2 introduces a fundamentally new, cross-disciplinary framework to quantify, price, and constrain AI agent risks at runtime. This addresses a massive bottleneck in enterprise AI deployment—managing side-effect liabilities—giving it broader multidisciplinary relevance and exceptional potential for immediate real-world application in autonomous systems governance.

    vs. Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network
    gpt-5.25/26/2026

    Paper 2 has higher likely impact: it provides the first large-scale empirical characterization of a real, deployed agent-to-agent ecosystem (1.5M assets, 128K agents), uncovering systemic incentive and evaluation failures with clear, broadly applicable design implications for decentralized AI collaboration platforms. Its findings generalize across multi-agent systems, marketplace/incentive design, security/adversarial robustness, and auditing/verification. Paper 1 is timely and strong but is primarily an evaluation on a specific benchmark/model setup; its core impact is narrower to program verification benchmarking and agentic theorem proving methodology.

    vs. MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning
    claude-opus-4.65/26/2026

    Paper 1 demonstrates that agentic AI systems can achieve near-complete success on program verification benchmarks, fundamentally challenging the field's evaluation methodology and establishing a new paradigm (compiler-in-the-loop agentic proving) for formal verification. This has broad implications for software engineering, formal methods, and AI safety. Paper 2, while technically solid, presents an incremental improvement in model compression for VLMs—a narrower contribution in a crowded space. Paper 1's finding that benchmarks are already saturated by current AI capabilities is a significant wake-up call with wider reverberations across multiple research communities.

    vs. LECTOR: Joint Optimization of Scientific Reasoning Graphs and Introduction Generation
    gpt-5.25/26/2026

    Paper 2 likely has higher impact: it targets program verification with agentic theorem-proving in Lean, a high-stakes, broadly relevant area (software correctness, compilers, security) and provides strong empirical evidence that current benchmarks are becoming insufficient, motivating new evaluation methodology. This combination of impressive performance, critique of benchmark validity, and implications for rigorous verification pipelines is timely and can influence both research directions and tooling. Paper 1 is novel for grounded introduction generation, but its applications are narrower and more incremental within AI-assisted scientific writing.

    vs. SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent
    gemini-3.15/26/2026

    Paper 2 proposes a novel, generalizable framework (SAM) for long-horizon reasoning in LLM agents, a highly active and broad area of AI research. By addressing fundamental memory management challenges, its methodology can be applied across various backbones and tasks. In contrast, Paper 1 is primarily an evaluation of a proprietary model on a specific program verification benchmark, which, while valuable for formal methods, offers narrower methodological innovation and broader impact compared to introducing a new architectural paradigm.

    vs. VeriTrace: Evolving Mental Models for Deep Research Agents
    gpt-5.25/26/2026

    Paper 2 likely has higher impact: it provides strong empirical evidence that agentic theorem-proving paradigms transfer to program verification, achieving very high end-to-end success on a formal Lean 4 benchmark while also uncovering benchmark weaknesses and proposing needed evaluation reforms. This is timely and broadly relevant to formal methods, PL, and AI safety/verification, with clear real-world applications (certified code, compilers-in-the-loop workflows). Paper 1 is novel in agent regulation but appears more incremental within LLM-agent frameworks and depends on specific benchmarks/backbones.

    vs. Safety-Oriented Routing Analysis of Mixtral MoE Under Benign and Harmful Prompts
    gpt-5.25/26/2026

    Paper 2 likely has higher impact due to stronger novelty and broader applicability: it demonstrates an agentic, compiler-in-the-loop paradigm achieving very high end-to-end success on a real program-verification benchmark, and it exposes benchmark/evaluation weaknesses (e.g., isomorphism-based scoring), motivating new standards and methodologies. This can influence both formal methods and AI agent research with clear real-world implications for software correctness. Paper 1 is a solid mechanistic/safety analysis of MoE routing but is narrower (one model, subtle effects) and offers more incremental insights relative to fast-moving interpretability/safety work.

    vs. Adaptive Human-AI Coordination via Hierarchical Action Disentanglement
    claude-opus-4.65/26/2026

    Paper 1 demonstrates near-saturating performance (98.1%) on a program verification benchmark using agentic AI, revealing that existing benchmarks are insufficient for modern AI capabilities. This has significant implications for formal verification, software engineering, and AI-assisted programming—fields with massive real-world impact. It also identifies methodological issues with current evaluation approaches, potentially catalyzing new benchmark development. Paper 2 makes a solid contribution to human-AI coordination via hierarchical RL, but operates in a narrower domain (Overcooked-AI) with more incremental improvements over baselines. Paper 1's findings are more timely given the rapid advancement of LLM agents and more broadly impactful across software verification and formal methods.

    vs. Distilling Game Code World Model Generation into Lightweight Large Language Models
    gemini-3.15/26/2026

    Paper 1 addresses fundamental program verification and formal mathematics, domains with critical implications for software security and reliability. By demonstrating that current agentic provers outpace existing benchmarks and exposing flaws in current evaluation methodologies, it sets a new direction for the field of formal methods. Paper 2's focus on distilling game environment generation is valuable for RL agents, but Paper 1's contributions to verifiable code generation and compiler-in-the-loop paradigms have broader, more profound real-world applications and methodological impact across software engineering.

    vs. Retrying vs Resampling in AI Control
    gemini-3.15/26/2026

    Paper 1 offers fundamental methodological contributions to AI safety by formally comparing retrying and resampling in adversarial AI control. It challenges previous findings and provides actionable, empirical safety improvements. In contrast, Paper 2 is primarily an empirical evaluation showing that a specific model (Claude Code) saturates an existing benchmark, which, while useful for tracking progress, offers less novel methodological innovation.

    vs. MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
    claude-opus-4.65/26/2026

    Paper 1 demonstrates that agentic LLM systems can achieve near-perfect performance on program verification benchmarks, a result with profound implications for formal methods and software engineering. Its finding that existing benchmarks are now inadequate challenges the field to develop harder evaluations, and the tight compiler-in-the-loop paradigm it validates could transform how verified software is produced. Paper 2, while a solid engineering contribution for mobile GUI agent training with good sim-to-real transfer, addresses a narrower problem (mobile agent simulation) with more incremental impact compared to the fundamental shift Paper 1 signals in formal verification capabilities.

    vs. Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World
    gpt-5.25/26/2026

    Paper 2 likely has higher impact: it introduces a broadly applicable benchmark and scalable data-generation pipeline for always-on assistants spanning long-horizon context, multiple services, and GUI/CLI across devices—closely aligned with real-world deployment. Its results expose a substantial capability gap and provide infrastructure (2,000 environments, measurable training gains) that many groups can reuse, driving follow-on work across agent evaluation, HCI, security/privacy, and multimodal interaction. Paper 1 is strong and timely but is narrower (Lean/CLEVER) and partly highlights benchmark shortcomings rather than delivering a widely reusable new standard.

    vs. When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems
    gpt-5.25/25/2026

    Paper 2 likely has higher impact due to stronger methodological rigor (formal Lean 4 verification, benchmarked pipeline), clearer and broader real-world applicability (reliable program verification and code generation), and timely relevance as agentic theorem proving rapidly advances. It also identifies benchmark/evaluation weaknesses (isomorphism-based scoring, dataset bugs) and calls for improved methodologies, influencing future evaluation standards across program verification and AI-assisted software engineering. Paper 1 offers a useful calibration concept and moderate gains, but its contribution is more incremental and narrower to multi-agent LLM planning behaviors.

    vs. Ontological Knowledge Blocks: Executable Compliance and Profile-Based Validation for Trustworthy AI Systems
    gemini-3.15/25/2026

    Paper 1 addresses automated program verification using LLM agents, a highly impactful area with the potential to revolutionize software reliability. Its demonstration of near-saturation on current benchmarks provides critical direction for future AI and formal verification research. Paper 2's semantic web approach to AI compliance, while practically relevant, has a narrower scope and relies on more traditional rule-based validation, making its potential scientific impact less transformative than the foundational advances in automated reasoning presented in Paper 1.

    vs. Foundation Protocol: A Coordination Layer for Agentic Society
    claude-opus-4.65/25/2026

    Paper 2 presents concrete empirical results demonstrating near-saturating performance on program verification benchmarks using agentic proving, with specific quantitative findings (98.8% specification validity, 87.5% certification rate). This has immediate practical implications for formal verification and software engineering, identifies benchmark limitations, and contributes to the growing body of work on AI-assisted theorem proving. Paper 1 proposes a conceptual coordination protocol for multi-agent systems but lacks empirical validation, making it more speculative. Paper 2's methodological rigor, reproducible findings, and actionable insights give it higher near-term scientific impact.

    vs. From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills
    gemini-3.15/25/2026

    Paper 1 provides a foundational, systematic framework for understanding the entire lifecycle of model-generated agent skills across diverse domains, culminating in a novel 'meta-skill' to improve extraction. Its broader theoretical insights and methodological rigor offer wider applicability to agent design. In contrast, Paper 2 primarily offers an empirical evaluation of a specific model (Claude Code) on a specific benchmark, which, while valuable for identifying benchmark saturation, has a narrower scope and less generalized theoretical impact.

    vs. EVE-Agent: Evidence-Verifiable Self-Evolving Agents
    gemini-3.15/25/2026

    Paper 2 introduces a novel and scalable methodology for evidence-verifiable self-evolution in agents, addressing the critical issue of hallucination and opaque training signals in LLMs. This foundational contribution has broader applicability and potential for real-world impact across various domains compared to Paper 1, which primarily serves as an empirical evaluation of an existing model on a specific program verification benchmark.

    vs. WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
    claude-opus-4.65/25/2026

    Paper 2 demonstrates a more significant scientific contribution by showing that agentic systems can achieve near-human-level performance on program verification (98.1% success rate), revealing that existing benchmarks are becoming insufficient. This has broader impact across formal verification, software engineering, and AI safety. It introduces a novel paradigm (compiler-in-the-loop agentic proving) with implications for reliable software development. Paper 1, while valuable for finance applications, is primarily a benchmark evaluation showing current limitations rather than advancing capabilities, and has a narrower domain focus.

    vs. CP or DP? Why Not Both: A Case Study in the Partial Shop Scheduling Problem
    gemini-3.15/25/2026

    Paper 2 addresses a highly timely and impactful area—agentic AI for formal program verification—demonstrating state-of-the-art performance while critically evaluating existing benchmarks. Its implications for software engineering and AI-driven theorem proving give it a broader and more significant potential impact. In contrast, Paper 1 presents a niche hybrid approach for scheduling that admittedly does not outperform existing state-of-the-art solvers, limiting its immediate practical impact.

    vs. A Camera-Cooperative ISAC Framework for Multimodal Non-Cooperative UAVs Sensing
    gemini-3.15/25/2026

    Paper 1 addresses the rapidly advancing field of AI-driven formal verification, demonstrating the near-saturation of current benchmarks by agentic systems. This has profound implications for automated reasoning, AI safety, and software engineering, likely sparking broad interest and new evaluation paradigms across computer science. Paper 2 offers a solid engineering contribution to 6G and UAV tracking, but its impact is more narrowly confined to telecommunications and signal processing compared to the foundational AI advancements in Paper 1.