LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

Po-Nien Kung, Linfeng Song, Dawsen Hwang, Jinsung Yoon, Chun-Liang Li, Simone Severini, Mirek Olšák, Edward Lockhart

Jun 2, 2026

arXiv:2606.03303v1 PDF

cs.AI(primary)

#38of 3355·Artificial Intelligence

Silver · Week 23, 2026 Share

Tournament Score

1572±46

10501800

93%

Win Rate

Wins

Losses

Matches

Rating

7.8/ 10

Significance8

Rigor7

Novelty7.5

Clarity8

Tournament Score

1572±46

10501800

93%

Win Rate

Wins

Losses

Matches

Rating

7.8/ 10

Significance

Rigor

Novelty

Clarity

Abstract

Large Language Models (LLMs) exhibit strong informal mathematical reasoning but struggle to generate mechanically verifiable proofs in formal languages like Lean. We present LEAP, an agentic framework that enables general-purpose foundation models to achieve state-of-the-art performance on automated formal theorem proving. LEAP leverages foundation model capabilities, such as informal reasoning, instruction following, and iterative self-refinement. By decomposing complex problems into smaller units, the system bridges formal proof construction with informal blueprints through continuous interaction with the Lean compiler. To provide a rigorous evaluation beyond increasingly saturated benchmarks, we introduce Lean-IMO-Bench, a benchmark of IMO-style problems formalized in Lean, with short statements yet highly non-routine and multi-step proofs across a wide range of difficulty levels. Empirically, on the latest 2025 Putnam Competition, an annual mathematics competition for undergraduate students in North America, LEAP solves all 12 problems, matching recent breakthroughs by frontier formal mathematical models. On Lean-IMO-Bench, LEAP boosts the one-shot formal solve rate of general-purpose LLMs from below 10% to 70%, notably surpassing the 48% benchmark set by a specialized, gold-medal-caliber IMO system. Furthermore, we demonstrate LEAP's research-level utility by autonomously formalizing complex proofs for open combinatorial challenges, including a verified proof for a key subproblem in Knuth's Hamiltonian decomposition of even-order Cayley graphs.

AI Impact Assessments

(1 models)

Scientific Impact Assessment: LEAP – Supercharging LLMs for Formal Mathematics with Agentic Frameworks

1. Core Contribution

LEAP introduces an agentic framework that enables general-purpose LLMs (specifically Gemini-3.1-Pro) to perform state-of-the-art formal theorem proving in Lean without any specialized fine-tuning. The key insight is that the bottleneck for general LLMs in formal mathematics is not language comprehension or mathematical reasoning per se, but rather the inability to produce long, correct formal proofs in a single pass. LEAP addresses this through three interlinked mechanisms: (1) an AND-OR DAG for hierarchical proof decomposition with memoization, (2) interleaved informal-formal planning where natural-language blueprints guide formal proof generation, and (3) verification-guided search using both compiler feedback and LLM-based decomposition review.

The paper also introduces Lean-IMO-Bench, a benchmark of 60 IMO-level problems formalized in Lean, designed to address saturation concerns with existing benchmarks like MiniF2F and PutnamBench.

2. Methodological Rigor

Strengths in experimental design: The paper evaluates against meaningful baselines—both direct formalization (Gemini-3.1-Pro, Goedel-Prover-V2-32B at Pass@128), agentic systems (Hilbert), and state-of-the-art specialized systems (Aristotle). The comparison is informative, though evaluation settings differ (Pass@128 vs. rollout=2), which somewhat complicates direct comparison. The ablation studies are well-designed: the DAG vs. tree comparison (Table 6), iterative vs. one-shot formalization (Table 5), and the LLM reviewer ablation (Section 5.3) each isolate specific architectural contributions.

Concerns: The evaluation uses only rollout=2 for LEAP, which is computationally intensive (up to 3,000 LLM calls for a single problem like Putnam A5). The paper does not provide total compute costs in a way that enables fair efficiency comparisons. Aristotle is closed-source, and the comparison relies on reproducing its results. The claim of solving "all 12 Putnam 2025 problems" is impressive but should be contextualized—this was achieved with a very powerful foundation model (Gemini-3.1-Pro), and it's unclear how much of the success is attributable to the framework versus the base model's capabilities. A key missing experiment would be testing LEAP with different backbone LLMs to isolate framework contribution from model strength.

The benchmark contribution (Lean-IMO-Bench) is valuable but modest in size (60 problems). The manual formalization by Lean experts ensures quality, though the paper acknowledges geometry problems remain nearly unsolvable across all methods, raising questions about whether the benchmark adequately tests the claimed generality.

3. Potential Impact

Immediate impact: LEAP demonstrates that expensive specialized training may not be strictly necessary for competitive formal theorem proving, potentially democratizing access to formal verification capabilities. This is a significant finding for the formal methods and AI-for-math communities.

Broader implications: The case studies on open combinatorial problems (Knuth's Hamiltonian decomposition, Erdős Problem 457) showcase genuine research utility—producing 5,000+ lines of verified Lean code for a non-trivial subproblem is notable. If reproducible, this suggests LEAP could accelerate the formalization of existing mathematical results and support active research programs.

Tool ecosystem: The blueprint-inspired DAG approach could influence the design of future human-AI collaborative proving tools. The interpretability afforded by the DAG structure—showing which subgoals remain open and where progress is blocked—is practically valuable.

4. Timeliness & Relevance

This paper arrives at a critical juncture. Formal theorem proving has seen rapid progress (AlphaProof, DeepSeek-Prover-V2, Seed Prover), but most advances require extensive specialized training. The community is actively debating whether specialized fine-tuning or agentic scaffolding is the more scalable path. LEAP provides strong evidence for the latter, addressing a genuine current bottleneck.

The introduction of Lean-IMO-Bench is also timely, as MiniF2F is increasingly saturated and the field needs harder, more discriminating benchmarks. The focus on IMO-level problems with "short statements but highly non-routine proofs" fills a specific evaluation gap.

5. Strengths & Limitations

Key Strengths:

Paradigm challenge: Convincingly demonstrates that general LLMs + agentic framework can match or exceed specialized prover models, which is a surprising and impactful finding.

Principled architecture: The AND-OR DAG with hierarchical memoization is well-motivated by human mathematical practice (Lean Blueprint tool) and provides clear structural advantages over flat or tree-based approaches.

Strong empirical results: 100% on Putnam 2025 and 70% overall on Lean-IMO-Bench are compelling, especially the large margin over baselines.

Research-level demonstration: The Knuth cycles formalization adds practical credibility beyond competition benchmarks.

Good ablation coverage: Each component's contribution is independently validated.

Notable Limitations:

Single backbone model: All results use Gemini-3.1-Pro; it remains unclear how much performance depends on this specific (extremely capable) model versus the framework design.

Compute costs: The system requires up to 3,000 LLM calls per problem. Scalability to larger problem sets or less capable models is unaddressed.

Geometry weakness: Near-zero solve rates in geometry across all methods reveals a fundamental limitation that the paper acknowledges but doesn't address.

Reproducibility concerns: While code is released, the dependence on Gemini-3.1-Pro (a proprietary model) limits full reproducibility.

Benchmark size: 60 problems is relatively small for drawing robust statistical conclusions about category-level performance.

Comparison fairness: Different evaluation protocols (Pass@128 vs. rollout=2) and the proprietary nature of some baselines (Aristotle, Axiom, Numina) complicate direct comparisons.

Additional Observations

The paper's framing of "zero to hero" for general LLMs in formal math is compelling but slightly overstated—the "zero" baseline uses Pass@128 without any agentic scaffolding, while the "hero" result uses an extremely powerful model with extensive computational resources. The LLM reviewer ablation (Section 5.3) revealing circular decomposition patterns is a genuinely insightful finding about the failure modes of purely verification-guided search, suggesting promising directions for LLM-guided proof search heuristics.

Rating:7.8/ 10

Significance 8Rigor 7Novelty 7.5Clarity 8

Generated Jun 3, 2026

Comparison History (27)

vs. Success Conditioning as Policy Improvement: The Optimization Problem Solved by Imitating Success

gemini-3.16/6/2026

Paper 1 demonstrates breakthrough capabilities in automated formal theorem proving, solving all 12 problems in the 2025 Putnam Competition and formalizing open combinatorial challenges. Its introduction of a new benchmark and state-of-the-art agentic framework for LLMs directly advances a highly visible and active area of AI research. While Paper 2 provides elegant theoretical grounding for a widely used reinforcement learning technique, Paper 1's empirical leaps in complex reasoning and real-world mathematical utility suggest a broader and more immediate scientific impact across both AI and mathematics.

vs. AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification

gpt-5.26/3/2026

Paper 2 (LEAP) has higher estimated scientific impact due to broader cross-field relevance and stronger novelty: it advances agentic methods for mechanically verified formal reasoning, introduces a harder benchmark (Lean-IMO-Bench), and demonstrates top-tier results (solving Putnam 2025; large gains on IMO-style formal proofs) plus research-level utility on open problems. Its applications extend beyond math to verification, programming languages, and trustworthy AI. Paper 1 is rigorous and valuable but is more domain-specific (US-GAAP/XBRL auditing) with narrower generalization and impact breadth.

vs. AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

claude-opus-4.66/3/2026

LEAP demonstrates remarkable concrete results—solving all 12 Putnam 2025 problems, achieving 70% on IMO-style problems (surpassing specialized systems), and formalizing research-level proofs. It introduces both a novel agentic framework and a new benchmark (Lean-IMO-Bench). The work bridges informal and formal mathematical reasoning with immediate practical applications in automated theorem proving and mathematical research. AgentCL makes valuable contributions to continual learning evaluation methodology, but its impact is more incremental—primarily diagnostic rather than achieving breakthrough performance. LEAP's results are more transformative with broader cross-disciplinary implications.

vs. Beyond One-shot: AI Agents for Learning in Field Experiments

gpt-5.26/3/2026

Paper 2 likely has higher scientific impact: it advances a core scientific capability (mechanically verifiable formal reasoning) with a broadly reusable agentic framework, introduces a timely new benchmark (Lean-IMO-Bench), and demonstrates strong, rigorous evaluations plus research-level contributions (verified result tied to an open combinatorics challenge). Its applications span mathematics, formal methods, software/hardware verification, and AI safety. Paper 1 is novel and highly applied with impressive-scale field evidence, but its impact is more domain-specific (healthcare messaging/experimentation) and may generalize less broadly across scientific fields.

vs. Evaluating Transformer and LSTM Frameworks for Prediction in Ungauged Basins

claude-opus-4.66/3/2026

LEAP demonstrates significantly higher scientific impact potential. It addresses a fundamental challenge in AI—bridging informal mathematical reasoning with formal verification—achieving state-of-the-art results (solving all 12 Putnam 2025 problems, 70% on IMO-bench). It introduces a novel agentic framework applicable to general-purpose LLMs, a new benchmark (Lean-IMO-Bench), and demonstrates research-level utility on open mathematical problems. Its breadth of impact spans AI, formal verification, and mathematics. Paper 1, while methodologically sound, offers incremental architectural comparison findings in hydrology with limited novelty.

vs. AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design

gpt-5.26/3/2026

Paper 1 likely has higher scientific impact due to stronger real-world applicability and cross-disciplinary reach: integrating PLMs with structural/energetic/docking oracles targets therapeutics and enzyme engineering with direct experimental and industrial pathways. The RAD+CAPO framework is a novel, generalizable agentic paradigm for biology that could influence protein design workflows broadly. Paper 2 is timely and rigorous, with major benchmark gains in formal math, but its near-term practical impact is narrower (primarily theorem proving/formalization) and may be more benchmark-driven. Overall, AgentPLM’s translational potential and breadth favor higher impact.

vs. Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models

gpt-5.26/3/2026

Paper 2 (LEAP) likely has higher scientific impact due to a major capability jump in a high-value domain: mechanically verified formal mathematics. It introduces an agentic framework with strong empirical gains (e.g., large improvement on Lean-IMO-Bench and solving all 2025 Putnam problems) and demonstrates research-level utility on open combinatorics, suggesting real-world applicability to proof automation and verification. It is timely amid rapid progress in formal theorem proving and could influence both AI systems research and mathematics/verification. Paper 1 is novel and important for reliability, but its impact is narrower and more diagnostic than enabling.

vs. SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

gemini-3.16/3/2026

Paper 1 represents a significant breakthrough in formal mathematical reasoning, a major frontier for AI. By successfully solving the 2025 Putnam Competition and autonomously formalizing proofs for open mathematical challenges, it demonstrates immediate, high-impact real-world utility in automated theorem proving. Paper 2 provides valuable insights into agent ecosystems, but Paper 1's demonstrable state-of-the-art results on extremely difficult, objective benchmarks suggest a more profound and immediate impact on AI capabilities and mathematical research.

vs. ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

gpt-5.26/3/2026

Paper 1 likely has higher impact: it introduces an agentic framework that materially advances automated formal theorem proving, plus a new challenging benchmark (Lean-IMO-Bench) and demonstrates success on high-profile, time-relevant tasks (Putnam 2025, IMO-style problems) and research-grade formalization of open combinatorial challenges. This combines novelty, strong methodological grounding via compiler-verified proofs, broad cross-field implications (AI, formal methods, mathematics), and clear real-world applications in verification and mathematical discovery. Paper 2 is valuable but more incremental/optimization-focused on efficiency for existing RLVR CoT training.

vs. Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents

gpt-5.26/3/2026

Paper 2 (LEAP) has higher potential impact due to stronger novelty and broader implications: it advances general-purpose LLMs to state-of-the-art formal theorem proving via an agentic, compiler-in-the-loop framework and introduces a timely new benchmark (Lean-IMO-Bench) addressing benchmark saturation. The reported gains (to 70% on hard IMO-style formal problems, solving all Putnam 2025) suggest substantial practical and research utility, including verified progress on open combinatorics problems. This can affect ML, formal methods, programming languages, and mathematics more broadly than Paper 1’s mobile-agent efficiency improvements.

vs. DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

gemini-3.16/3/2026

Paper 1 demonstrates a major breakthrough in automated formal theorem proving, achieving state-of-the-art results on extremely difficult mathematical benchmarks (IMO, Putnam 2025) and contributing to open research problems. While Paper 2 introduces a valuable benchmark for evaluating GUI agents, Paper 1's methodological advancements in mathematical reasoning and its immediate utility in advancing mathematical research give it a significantly higher potential for broad scientific impact and fundamental AI capability enhancement.

vs. The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary

claude-opus-4.66/3/2026

LEAP demonstrates remarkable practical impact by solving all 12 Putnam 2025 problems and achieving 70% on IMO-style problems, surpassing specialized systems. It introduces a reusable agentic framework bridging informal and formal mathematics with clear real-world utility for mathematical verification and research-level formalization. While Paper 1 provides interesting theoretical analysis of reasoning limitations with the Deterministic Horizon concept, Paper 2's contributions are more immediately transformative—enabling general-purpose LLMs to produce mechanically verified proofs at competition and research levels, with broader implications for AI-assisted mathematics.

vs. Toward a Modular Architecture for Embedded AI Agent Systems at the Edge

gpt-5.26/3/2026

Paper 2 has higher likely impact: it introduces a concrete agentic method (LEAP) with strong empirical gains and a new challenging benchmark (Lean-IMO-Bench), demonstrating state-of-the-art formal theorem proving and novel research-level results (verified proof for a Knuth-related subproblem). Its methodological rigor and measurable progress on timely, high-profile problems in AI+formal methods suggest broad downstream influence (verification, math, programming languages, AI reliability). Paper 1 is a useful architectural proposal for edge agents but is less novel and lacks empirical validation, limiting near-term scientific impact.

vs. Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts

gemini-3.16/3/2026

Paper 2 presents a breakthrough in automated theorem proving, solving highly complex benchmark problems (Putnam 2025, IMO-style problems) and addressing open combinatorial challenges. Its ability to significantly boost the formal solve rate of LLMs and contribute to actual mathematical research gives it a substantially higher scientific impact than Paper 1, which focuses primarily on the post-generation refactoring and readability of proofs.

vs. Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency

gpt-5.26/3/2026

Paper 1 likely has higher scientific impact due to major technical novelty (agentic formal-proof generation with compiler-in-the-loop), strong methodological contribution (new IMO-style Lean benchmark), and broad downstream applicability to formal verification, mathematics, and trustworthy software/hardware. Its reported performance gains (sub-10% to 70%) and demonstrated utility on research-level formalization suggest a step-change in automated reasoning. Paper 2 is timely and important for AI safety/health equity, but is narrower in scope (single symptom template, limited models) and primarily diagnostic of bias rather than offering a generalizable technical solution.

vs. Forget Attention: Importance-Aware Attention Is All You Need

gpt-5.26/3/2026

Paper 2 likely has higher impact: it delivers a substantial capability jump (general-purpose LLMs reaching strong formal-proof performance) via an agentic framework applicable across models, introduces a timely new benchmark (Lean-IMO-Bench), and demonstrates broader real-world/research utility (autonomous formalization and a verified result on an open combinatorics challenge). Its applications span automated reasoning, software verification, and mathematics. Paper 1 is technically novel and efficient, but its impact is more specialized to LM architecture design and depends on broader adoption and scaling validation.

vs. Large AI Models in Dental Healthcare: From General-Purpose Systems to Domain-Specific Foundation Models

gemini-3.16/3/2026

Paper 2 introduces a novel agentic framework that significantly advances LLM reasoning in formal mathematics, a critical frontier for AI development. By solving Putnam problems and demonstrating utility in open combinatorial challenges, it showcases breakthrough capabilities. In contrast, Paper 1 is a systematic review that, while clinically relevant to dentistry, synthesizes existing knowledge rather than introducing a fundamental, state-of-the-art methodology with broad implications for artificial general intelligence.

vs. Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

claude-opus-4.66/3/2026

LEAP demonstrates breakthrough results in formal theorem proving—solving all 12 Putnam 2025 problems and achieving 70% on IMO-style problems, far surpassing specialized systems. It introduces a novel agentic framework bridging informal and formal mathematical reasoning, with demonstrated utility on research-level open problems. This has broad impact across AI, mathematics, and formal verification. Paper 2 addresses a narrower engineering optimization (token cost reduction for multilingual coding prompts) with incremental improvements, limited novelty (prompt rewriting/translation), and much smaller potential impact scope.

vs. An Exploration of Collision-based Enemy Morphology Generation

gpt-5.26/3/2026

Paper 1 likely has higher scientific impact due to major novelty and demonstrated performance gains in a rapidly advancing, high-interest area (LLMs + formal verification). It introduces an agentic framework with strong empirical results on challenging, newly proposed benchmarks and shows research-level utility (autonomous formalization and a verified result on an open combinatorics subproblem). This has broad implications for AI, theorem proving, software/hardware verification, and mathematics. Paper 2 is novel within a narrower PCG niche and appears less rigorous/transformative, with more limited cross-field reach.

vs. The Violation Situation Pattern: A Knowledge-Graph Pattern for Compliance Violations

gemini-3.16/3/2026

Paper 1 addresses a fundamental challenge in AI—automated formal mathematical reasoning—achieving breakthrough results on prestigious benchmarks like the Putnam Competition and IMO. Its advancements in agentic LLM frameworks and formal verification have broad, transformative implications for AI, mathematics, and software engineering. Paper 2, while offering a useful knowledge-graph pattern for compliance tracking, has a much narrower scope and represents an incremental data engineering improvement rather than a major scientific breakthrough.