Advancing Mathematics Research with AI-Driven Formal Proof Search
George Tsoukalas, Anton Kovsharov, Sergey Shirobokov, Anja Surina, Moritz Firsching, Gergely Bérczi, Francisco J. R. Ruiz, Arun Suggala
Abstract
Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability limits their utility in mathematics research. A mitigation is using LLMs to generate formal proofs in languages like Lean. We perform the first large-scale evaluation of this method's ability to solve open problems. Our most capable agent autonomously resolved 9 of 353 open Erdős problems at the per-problem cost of a few hundred dollars, proved 44/492 OEIS conjectures, and is being deployed in combinatorics, optimization, graph theory, algebraic geometry, and quantum optics research. A basic agent alternating LLM-based generation with Lean-based verification replicated the Erdős successes but proved costlier on the hardest problems. These findings demonstrate the power of AI-aided formal proof search and shed light on the agent designs that enable it.
AI Impact Assessments
(1 models)Scientific Impact Assessment: "Advancing Mathematics Research with AI-Driven Formal Proof Search"
1. Core Contribution
This paper presents AlphaProof Nexus, a framework for LLM-aided formal proof generation in Lean, and demonstrates its ability to autonomously solve open research-level mathematical problems. The headline results are striking: 9 of 353 open Erdős problems resolved (including problems open for 56 years), 44/492 OEIS conjectures proved, a 15-year-old question in algebraic geometry settled, and an improved convergence bound in convex optimization discovered. The key conceptual contribution is shifting formal AI theorem proving from competition mathematics and human-aided formalization toward autonomous discovery on genuine research problems.
The system architecture spans four agent configurations: (A) a basic Ralph-loop agent with Lean compiler feedback, (B) adding AlphaProof as a subgoal solver, (C) adding evolutionary population-based search with Elo-rated sketches, and (D) the full-featured combination. The evolutionary component is notable for addressing the mismatch between binary proof verification and the graduated fitness landscapes evolutionary algorithms typically require, solved via LLM-based relative sketch ranking aggregated into Elo scores.
2. Methodological Rigor
The evaluation methodology has several strengths. The Erdős problem set was determined by what the community had formalized (353 problems), avoiding cherry-picking. Expert validation confirmed formalizations faithfully captured original conjectures. All Lean proofs are publicly released, providing reproducibility at the verification level.
The agent comparison study is well-designed: 100 independent runs for simpler agents allow statistical analysis via chunking, while cost-normalized comparisons across architectures provide meaningful head-to-head evaluation. The use of USD as a common cost currency is pragmatic and informative.
However, some limitations exist. The post-hoc nature of the architecture comparison is acknowledged but important — the basic agent's success was tested only on the 9 problems the full agent solved, not on the full 353. The OEIS evaluation involved Gemini-based problem selection and autoformalization, introducing potential selection bias toward problems amenable to the system. The paper does not report confidence intervals for Agent (D) results due to computational constraints. AlphaProof's cost (~$60/problem) is excluded from reported costs, which slightly complicates interpretation.
3. Potential Impact
Mathematics: This work demonstrates a viable workflow where AI generates formally verified proofs of open problems, with mathematicians providing problem formulations and validating that formalizations match informal statements. The detection of misformalizations (Erdős #125, #741) is a particularly valuable secondary capability. The proof sketches, even when incomplete, helped collaborators understand problems — suggesting utility beyond just solving.
AI for Science: The finding that the basic agent could replicate all 9 Erdős successes (though at higher cost on harder problems) is perhaps the most consequential architectural insight. It suggests that as frontier LLMs improve, simpler agentic loops may suffice, potentially democratizing this capability. This aligns with broader trends in AI where simple approaches with strong base models outperform complex engineered systems.
Formal Verification: The work validates the paradigm of using formal languages as a medium for mathematical discovery (not just verification), distinguishing it from natural-language proof systems that require expert review.
4. Timeliness & Relevance
This paper arrives at a critical juncture. Recent work (Aletheia, OpenAI's systems) has shown LLMs solving Erdős problems informally, but the reliability question remained. Formal verification addresses this directly. The paper explicitly positions itself against the concurrent informal approaches and makes a compelling case that formal proof search filters out hallucinations that plague natural-language systems — a point vividly illustrated by the failure analysis showing agents hallucinating "established results."
The timing also captures an interesting snapshot of a moving target: the basic agent's surprising competitiveness reflects rapid LLM improvement, and the paper honestly acknowledges that the full-featured agent's advantages may diminish over time.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations:
The deformalized proofs in the supplementary material are themselves interesting mathematical contributions. The Erdős #12 construction using CRT and 3-AP-avoiding sets, and the Hilbert function log-concavity proof, demonstrate genuine mathematical sophistication. The optimization result, where the agent simultaneously discovered a novel parameter schedule and its proof, points toward a particularly promising use case.
The paper represents a significant milestone in AI-assisted mathematics, though it is important to note that the problems solved, while genuinely open, are not among the hardest open problems in mathematics. The concentration in combinatorics and number theory reflects both Mathlib coverage and the nature of problems amenable to constructive proof strategies.
Generated May 22, 2026
Comparison History (21)
Paper 1 introduces a foundation model at an unprecedented scale (5 million participants) for wearable health data. Its impact on personalized medicine, predictive health, and direct societal well-being is massive. While Paper 2 demonstrates a profound breakthrough in AI reasoning for pure mathematics, Paper 1's real-world applications across numerous medical domains, combined with clinician-validated agents, offer a broader and more immediate translational scientific impact.
Paper 1 likely has higher scientific impact due to greater novelty and cross-field breadth: it demonstrates AI-driven formal proof search solving previously open mathematical problems, a qualitatively new capability with implications for mathematics, verification, AI alignment/robustness, and scientific discovery workflows. Its methodology (formal verification in Lean, large-scale evaluation on open-problem sets) supports rigor and reproducibility. Paper 2 is highly impactful industrially and timely for recommender systems, but the ideas (multimodal codes, ID-free ranking, warmup training) are more incremental within applied ML and are narrower in academic breadth despite strong scale and deployment results.
Paper 2 demonstrates a groundbreaking methodological leap by using AI to autonomously solve open mathematical problems, including 9 Erdős conjectures. Its use of formal verification (Lean) ensures rigorous correctness, solving a major challenge in AI reasoning. Furthermore, its immediate deployment across diverse fields like algebraic geometry and quantum optics indicates profound and broad scientific impact. In contrast, while Paper 1 offers timely qualitative insights into organizational behavior, its small sample size (24 interviews) and restricted scope limit its transformative scientific potential compared to Paper 2's concrete contributions to fundamental mathematics.
Paper 2 demonstrates a groundbreaking application of LLMs to solve open mathematical problems (including 9 Erdős problems), representing a concrete, verifiable advance in mathematical research. Its impact spans multiple mathematical subfields and establishes a new paradigm for AI-assisted theorem proving. While Paper 1 provides a useful taxonomy and survey of AI sycophancy—an important conceptual contribution—Paper 2's demonstrated ability to solve previously unsolved problems represents a more transformative scientific contribution with immediate, measurable real-world impact on mathematics research.
Paper 1 has higher potential impact due to stronger novelty (large-scale evaluation of LLM+formal verification on open math problems), clear methodological rigor via Lean-verified proofs, and broad cross-field implications (mathematics, AI, formal methods, and downstream scientific domains). Solving open Erdős problems and OEIS conjectures provides high-visibility milestones and strong timeliness amid rapid progress in LLMs. Paper 2 is applied and useful, but its approach (PPO over dispatching-rule selection) is incremental within established DRL scheduling literature and likely narrower in breadth and scientific reach.
Paper 1 likely has higher scientific impact due to a broadly applicable, engineered knowledge infrastructure that measurably improves agentic use of process-based simulators, with strong real-world relevance to climate risk and resource management. Its demonstrated generalization across 119 models and 14 Earth-science domains suggests wide cross-disciplinary utility and infrastructure-like adoption potential. Paper 2 is highly novel and timely for AI+math, but its direct applicability is narrower (specialist formalization workflows) and impact may be constrained by tooling and domain formalization overhead, despite impressive open-problem results.
Paper 1 has higher potential scientific impact due to its novelty (large-scale evaluation of LLM-driven formal proof search on open problems), methodological rigor (formal verification in Lean), and broad cross-field relevance (mathematics, CS/AI, theorem proving, and multiple applied domains). Demonstrating solutions to open Erdős problems and OEIS conjectures is a strong signal of advancing research capability, not just evaluation. Paper 2 is timely and practically valuable for enterprise agent evaluation, but its contribution is narrower (benchmark/taxonomy in finance spreadsheets) and less likely to drive foundational advances across disciplines.
Paper 1 likely has higher scientific impact: it demonstrates an AI+formal-verification system solving nontrivial open problems (Erdős, OEIS), a concrete and rare benchmark for advancing mathematical research. The approach is novel (large-scale evaluation on open problems), broadly applicable across mathematics and adjacent sciences, and has immediate real-world research utility. While Paper 2 is timely and methodologically strong for AI security, its impact is more specialized to multi-agent LLM safety. Paper 1’s potential to change mathematical workflow and accelerate discovery yields wider cross-field scientific leverage.
Paper 1 demonstrates a groundbreaking application of AI to solving open mathematical problems, including 9 previously unsolved Erdős problems and 44 OEIS conjectures. This represents a paradigm shift in mathematical research methodology with broad implications across multiple fields. The scale, rigor, and novelty far exceed Paper 2, which is a small-scale case study (single speech, 51 segments) comparing emotion analysis modalities with limited generalizability. Paper 1's results are already being deployed in active research across multiple mathematical disciplines, indicating immediate and substantial real-world impact.
Paper 2 likely has higher impact due to strong novelty (large-scale, real-world evaluation of LLM+Lean agents on open problems), broad cross-field relevance (mathematics, AI, formal methods, automated reasoning), and timely significance amid rapid progress in AI for science. It demonstrates concrete successes on recognized open-problem sets (Erdős, OEIS) with actionable cost/agent-design insights and ongoing deployments, suggesting immediate research acceleration potential. Paper 1 is valuable for safety assurance engineering, but its scope is narrower and the methodological choice to prioritize responsiveness over Bayesian correctness may limit perceived generality.
Paper 2 likely has higher scientific impact due to a clearer, immediate real-world application (AI-assisted discovery of new formal proofs) and broad cross-field relevance to mathematics, automated reasoning, and AI tooling. Solving open Erdős problems and OEIS conjectures provides concrete, high-salience milestones, and deployment across multiple research areas suggests scalable impact. Paper 1 is novel and methodologically strong, but its impact is more conceptual/theoretical and may diffuse across cognitive science/ML without comparable headline breakthroughs or direct utility.
Paper 1 has higher potential impact: it demonstrates AI-driven formal proof search solving genuine open problems (Erdős, OEIS) with verified correctness via Lean, a strong rigor signal and a notable novelty milestone (large-scale open-problem evaluation). Its real-world applicability to active mathematical research across multiple domains suggests broad cross-field influence and high timeliness as formal methods + LLMs converge. Paper 2 is a solid, timely efficiency method for RL post-training with useful applications, but its contributions are more incremental within model training and likely narrower in downstream scientific disruption than verified progress on open mathematics.
Paper 2 demonstrates a groundbreaking application of AI to solve open problems in mathematics, showcasing broad impact across multiple fields like combinatorics and algebraic geometry. Its timely integration of LLMs with formal verification addresses crucial reliability issues, offering an innovative and rigorous tool for mathematical research. In contrast, Paper 1 presents a valuable but much more niche methodological improvement in the specific domain of system assurance arguments.
Paper 2 likely has higher impact: it demonstrates a large-scale, first-of-its-kind evaluation on genuinely open math problems with concrete successes and active deployment in multiple research domains, indicating immediate real-world applicability and timeliness. Its methodology is rigorous due to formal verification in Lean and clear quantitative benchmarks (Erdős, OEIS). The breadth spans mathematics, computer science, and AI tooling for scientific discovery. Paper 1 is novel in probabilistic recursive reasoning and may influence ML architectures, but its current evidence is mainly on benchmark reasoning/constraint tasks, with less direct near-term transformative application than automated formal proof search solving open problems.
Paper 2 demonstrates significant real-world impact by autonomously solving previously open mathematical problems (Erdős problems and OEIS conjectures) and being actively deployed across multiple research fields. While Paper 1 introduces a novel and rigorous theoretical framework for latent reasoning, Paper 2's concrete achievements in advancing actual mathematics research showcase a much broader and more immediate scientific breakthrough.
Paper 2 demonstrates unprecedented real-world scientific impact by autonomously solving 9 open Erdős problems—longstanding challenges in mathematics—using AI-driven formal proof search. This represents a concrete, verifiable advance in mathematical research, not just benchmark improvements. The approach combines LLMs with formal verification (Lean), addressing the critical reliability problem. Its deployment across multiple mathematical subfields (combinatorics, optimization, algebraic geometry, quantum optics) shows broad applicability. Solving open problems is qualitatively different from workflow optimization, representing a paradigm shift in how AI contributes to fundamental research.
Paper 2 demonstrates a landmark achievement by using AI to autonomously solve open, historically significant mathematical problems (Erdős problems and OEIS conjectures). While Paper 1 presents a valuable framework for multi-agent workflows, Paper 2's direct contribution to new mathematical knowledge and its deployment across multiple theoretical disciplines signify a broader and more transformative scientific impact.
Paper 2 demonstrates a breakthrough in automated theorem proving by autonomously solving previously open Erdős problems and OEIS conjectures using AI. This represents a major milestone in AI for mathematics, offering immediate, broad applicability across multiple rigorous scientific disciplines like algebraic geometry and quantum optics. While Paper 1 provides valuable theoretical insights into the sim-to-real gap in reinforcement learning, Paper 2's concrete resolution of open mathematical problems gives it a significantly higher potential for immediate and transformative scientific impact.
Paper 2 demonstrates a breakthrough in AI-assisted mathematics research by solving 9 open Erdős problems and proving 44 OEIS conjectures—genuinely advancing mathematical knowledge. This has profound implications across multiple mathematical subfields and establishes AI as a tool for resolving longstanding open problems. Its impact spans mathematics, formal verification, and AI research. Paper 1, while practically useful for spreadsheet automation, represents incremental engineering improvements on a narrower application domain with less fundamental scientific significance.
Paper 1 is more likely to have higher scientific impact due to stronger novelty and broader cross-field implications: demonstrating AI-assisted formal proof search solving multiple open Erdős problems and many OEIS conjectures is a rare, high-signal advance with direct consequences for mathematics and adjacent scientific domains. Its results suggest a new research paradigm (LLM+formal verification) with immediate real-world application in producing trustworthy proofs. Paper 2 is timely and useful for agent evaluation, but benchmarks typically yield narrower, tooling-focused impact than a demonstrated capability to resolve open scientific problems.