Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

Moritz Firsching, Paul Lezeau, Salvatore Mercuri, Miklós Z. Horváth, Yaël Dillies, Calle Sönne, Eric Wieser, Fred Zhang

May 13, 2026

arXiv:2605.13171v1 PDF

cs.AI(primary)

#18of 2292·Artificial Intelligence

Gold · Week 20, 2026 Share

Tournament Score

1592±34

10501800

74%

Win Rate

Wins

Losses

Matches

Rating

7/ 10

Significance7.5

Rigor6.5

Novelty6.5

Clarity8

Tournament Score

1592±34

10501800

74%

Win Rate

Wins

Losses

Matches

Rating

7/ 10

Significance

Rigor

Novelty

Clarity

Abstract

As automated reasoning systems advance rapidly, there is a growing need for research-level formal mathematical problems to accurately evaluate their capabilities. To address this, we present Formal Conjectures, an evolving benchmark of currently 2615 mathematical problem statements formalized in Lean 4. Sourced from areas of active mathematical research, the dataset features 1029 open research conjectures providing a zero-contamination benchmark for mathematical proof discovery, and 836 solved problems for proof autoformalization. Notably, the repository provides a structured interface connecting mathematicians who formalize and clarify problems with the AI systems and humans attempting to solve them. Demonstrating its immediate utility, the benchmark has already been leveraged to make new mathematical discoveries, including the resolution of open research conjectures. We describe our approach to ensuring the correctness of these formalizations in a collaborative open-source project where contributions stem from an active community. In this framework, AI-generated proofs and disproofs serve as a valuable auditing mechanism to iteratively improve the fidelity of the benchmark. Finally, we provide a standardized evaluation setup and report baseline results on frozen evaluation subsets, demonstrating a climbable signal that measures the current frontier of automated reasoning on research-level mathematics.

AI Impact Assessments

(1 models)

Scientific Impact Assessment: Formal Conjectures

1. Core Contribution

Formal Conjectures addresses a genuine and pressing gap in AI evaluation: the lack of research-level, formally verified mathematical benchmarks that resist contamination and saturation. The core contribution is a living benchmark of 2615 Lean 4 statements, including 1029 open research conjectures (providing zero-contamination guarantees by definition, since no solutions exist) and 836 solved problems for auto-formalization evaluation. The key insight is elegant: open conjectures inherently resist data leakage because their solutions don't exist in any training corpus. This sidesteps the contamination-versus-secrecy dilemma that plagues benchmarks like FrontierMath and Humanity's Last Exam. Additionally, using Lean 4's kernel for verification provides objectively binary correctness criteria, avoiding the ambiguity of numerical answer checking or informal proof grading.

The paper also introduces the `answer(sorry)` mechanism, which cleanly separates value discovery from proof verification—a practical design choice that elegantly handles problems where the answer itself is unknown.

2. Methodological Rigor

The paper demonstrates strong methodological awareness. The misformalization taxonomy (Translation, Underspecified, Source, with six subtypes) is a genuinely useful contribution to the formalization community. The analysis of 291 fixed misformalizations provides concrete, quantified evidence of the challenges in formal benchmarking, with misrepresentation (48%) and semantic (35%) errors dominating—valuable empirical data for future formalization efforts.

The versioning scheme (bench-vN-lean4.X.Y) and frozen evaluation subsets (FC100OpenSet1, FC100SolvedSet1) show careful attention to reproducibility in a living benchmark. However, the experimental evaluation is thin: only three configurations of two proprietary systems (AlphaProof and a DeepMind prover agent) are tested, both from the authors' organization. No open-source baselines or third-party systems are evaluated, which limits the evaluation's immediate usefulness for the broader community. The cost of $100 per problem for the prover agent raises accessibility concerns.

The baseline results (45%→50%→66% on FC100SolvedSet1) do demonstrate "climbable signal," but with only two systems and three data points, the scaling analysis is preliminary.

3. Potential Impact

Immediate utility: The benchmark has already demonstrated real-world value—Boris Alexeev used it to prove Erdős Problem 124, and the DeepMind prover agent solved several open problems. This is a rare and compelling demonstration of a benchmark directly enabling mathematical discovery.

Community infrastructure: The repository functions as a "unified API" connecting mathematicians with AI solvers, which could be transformative. The bidirectional feedback loop—where AI proofs/disproofs audit formalizations, and formalization efforts clarify informal mathematics—creates a genuinely novel research dynamic. The upstream contributions to Mathlib (88 ForMathlib files) further amplify impact.

Field-shaping potential: As formal theorem proving matures, this benchmark could become a standard evaluation target, similar to how MiniF2F served the field previously. The open-source nature (Apache 2.0) and growing community (30+ contributors listed) support adoption.

Limitations on impact: The benchmark is Lean 4-specific, though the authors acknowledge this. The concentration in number theory and combinatorics (>50% of statements) limits breadth. Dependence on Mathlib's coverage creates a selection bias toward areas where formalization infrastructure exists.

4. Timeliness & Relevance

The timing is excellent. MiniF2F is effectively saturated (>99%), PutnamBench is being rapidly solved, and there is acute need for harder benchmarks. The paper correctly identifies this saturation crisis and provides a principled response. The focus on open conjectures as a perpetually unsaturated evaluation target is strategically sound—though individual problems may be solved, the benchmark naturally adjusts difficulty by adding new problems and reclassifying solved ones.

The paper also arrives at a moment when formal verification is becoming mainstream in AI reasoning evaluation, with AlphaProof's IMO results catalyzing broad interest.

5. Strengths & Limitations

Key Strengths:

Zero-contamination by construction for open problems—an elegant solution to a fundamental problem

Demonstrated real-world mathematical discoveries already enabled by the benchmark

Thoughtful misformalization taxonomy with empirical quantification

The iterative auditing loop where AI proofs improve formalization quality is a genuinely novel methodological contribution

Strong versioning and reproducibility infrastructure for a living benchmark

Open-source with active community

Notable Weaknesses:

Experimental evaluation relies exclusively on proprietary systems from the authors' organization; no community baselines

FC100OpenSet1 trivially has 0% baseline with 0% achieved—while this is by design, it means the "primary goal" currently provides no discriminative evaluation signal

The paper is more of a resource/infrastructure contribution than a technical or theoretical advance

Subject distribution is heavily skewed; many mathematical areas are underrepresented

The 17.3% formal proof coverage for solved problems is low, limiting the auto-formalization track's utility as a "climbable" benchmark

No systematic difficulty estimation or stratification beyond the open/solved binary

Additional Observations:

The `answer(sorry)` elaborator is a clean engineering contribution with broader applicability

The paper's framing as addressing four specific limitations (leakage, secrecy, oversimplified criteria, saturation) is effective and well-argued

The connection to informal sources (Erdős problems, Kourovka Notebook) and upstream feedback to those sources creates a valuable scholarly infrastructure

Overall Assessment

This is primarily a high-quality benchmark and infrastructure contribution rather than a methodological or theoretical advance. Its greatest strength is the zero-contamination design through open conjectures and the demonstrated real-world utility for mathematical discovery. The misformalization analysis provides genuine scientific insight. However, the thin experimental evaluation and reliance on proprietary systems somewhat weaken the immediate impact for the broader community. The paper's long-term significance will depend heavily on adoption and community growth.

Rating:7/ 10

Significance 7.5Rigor 6.5Novelty 6.5Clarity 8

Generated May 14, 2026

Comparison History (34)

vs. Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

gemini-3.15/16/2026

Paper 1 presents a massive-scale foundation model trained on 200 million patients, demonstrating highly impactful real-world applications in disease prediction, epidemiology, and healthcare expenditure forecasting. Its unprecedented scale and significant performance improvements offer broader societal and interdisciplinary impact compared to Paper 2, which, while valuable for AI development, primarily provides a benchmark for the more niche intersection of automated reasoning and formal mathematics.

vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

claude-opus-4.65/16/2026

IatroBench addresses a critical and timely problem—AI safety measures causing measurable harm through identity-contingent knowledge withholding—with rigorous pre-registered methodology across frontier models. Its findings have immediate, high-stakes real-world implications for AI deployment in healthcare, AI safety policy, and model evaluation practices. The discovery that safety filters systematically harm vulnerable users who have exhausted standard referrals challenges fundamental assumptions in AI alignment. While Formal Conjectures is a valuable benchmark for mathematical reasoning, IatroBench's cross-disciplinary impact (AI safety, healthcare, policy, evaluation methodology) and its challenge to prevailing safety paradigms give it broader and more urgent scientific significance.

vs. A Collective Variational Principle Unifying Bayesian Inference, Game Theory, and Thermodynamics

gemini-3.15/16/2026

Paper 2 proposes a fundamental theoretical unification across multiple major disciplines (Bayesian inference, game theory, and thermodynamics), offering a profound foundational framework for understanding collective intelligence. While Paper 1 provides a highly valuable and timely benchmark for AI in mathematics, Paper 2's potential to reshape foundational theories across physics, biology, and artificial intelligence gives it significantly higher potential for broad and lasting scientific impact.

vs. Generative structure search for efficient and diverse discovery of molecular and crystal structures

gpt-5.25/16/2026

Paper 2 has higher likely scientific impact: it introduces a unified, physically grounded framework (GSS) that bridges diffusion generation and random structure search, yielding large practical gains (10× lower sampling cost) and demonstrated out-of-distribution effectiveness across molecules and crystals. This targets a core bottleneck in materials/pharma discovery with clear downstream applications and broad relevance to ML for science, computational chemistry, and materials engineering. Paper 1 is timely and valuable infrastructure for formal math/AI evaluation, but its impact is more domain-specific and indirect compared to a method that can immediately accelerate real-world discovery pipelines.

vs. Simulating clinical interventions with a generative multimodal model of human physiology

gpt-5.25/16/2026

Paper 2 likely has higher scientific impact due to its strong real-world applicability (risk prediction and intervention simulation), broad relevance across medicine, epidemiology, and ML, and timely alignment with “digital twin”/foundation-model trends. It reports substantial empirical validation: large multimodal longitudinal cohort, transfer to independent cohorts, improvements across many endpoints, and comparisons to clinical risk scores plus intervention-effect checks against RCT literature. Paper 1 is novel and valuable for automated theorem proving and benchmarking, with potential long-term impact, but its immediate cross-domain societal impact and user base are narrower than clinical modeling.

vs. MIMIC: A Generative Multimodal Foundation Model for Biomolecules

gemini-3.15/16/2026

While Paper 1 offers a valuable benchmark for AI in mathematics, Paper 2 presents a multimodal foundation model for biology with profound implications for drug discovery, genetics, and therapeutics. Its ability to unify representation learning, conditional prediction, and constrained biomolecular design addresses critical challenges in life sciences, offering wider real-world applications and transformative potential in molecular biology and medicine.

vs. AI scientists produce results without reasoning scientifically

claude-opus-4.65/16/2026

Paper 2 addresses a fundamental question about the epistemic reliability of LLM-based scientific agents—a topic of immense and growing importance as AI is increasingly deployed in research. Its findings (evidence ignored in 68% of traces, scaffold contributing only 1.5% of variance) provide concrete, actionable insights with broad implications across all fields using AI for science. Paper 1 is a valuable benchmark contribution, but Paper 2's critical evaluation of AI reasoning limitations has broader cross-disciplinary impact and is more timely given the rush to deploy autonomous AI scientists.

vs. Conditional Attribute Estimation with Autoregressive Sequence Models

gemini-3.15/16/2026

Paper 1 bridges AI and advanced mathematics by providing a benchmark of formalized open conjectures. Its demonstrated ability to facilitate actual mathematical discoveries and resolve open problems indicates profound, immediate scientific impact. While Paper 2 offers valuable methodological advancements for sequence models, Paper 1's role in advancing verifiable AI-driven scientific discovery and providing a zero-contamination evaluation framework gives it higher transformative potential across disciplines.

vs. Machine Collective Intelligence for Explainable Scientific Discovery

gpt-5.25/16/2026

Paper 2 has higher likely impact due to creating shared infrastructure: a large, evolving, zero-contamination Lean 4 benchmark with open research conjectures and standardized evaluations. Benchmarks often catalyze rapid, broad progress across automated reasoning, formal methods, and mathematics, and the community/open-source workflow increases adoption and longevity. It is timely given recent advances in theorem-proving LLMs and already reports enabling new mathematical discoveries. Paper 1 is novel and potentially powerful for scientific modeling, but impact depends more on generalization and uptake beyond demonstrated systems.

vs. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

gemini-3.15/16/2026

While Paper 1 presents a highly innovative benchmark for AI reasoning in mathematics, Paper 2 has a significantly broader potential impact across all scientific disciplines. By demonstrating that AI-generated peer reviews are technically sound and preferred by authors at a massive real-world scale (over 22,000 papers), Paper 2 addresses a critical and universal bottleneck in the scientific process. This could fundamentally transform how scientific research is evaluated globally.

vs. End-to-end autonomous scientific discovery on a real optical platform

claude-opus-4.65/16/2026

Paper 2 demonstrates end-to-end autonomous scientific discovery on a real physical system, including the first AI-driven identification and experimental validation of a previously unreported physical mechanism (optical bilinear interaction). This represents a paradigm shift in how science is conducted, bridging AI reasoning with real-world experimentation. While Paper 1 provides a valuable benchmark for mathematical reasoning, Paper 2's demonstration of fully autonomous discovery with tangible physical results—potentially enabling new optical computing hardware—has broader cross-disciplinary impact and represents a more transformative milestone for AI-driven science.

vs. Containment Verification: AI Safety Guarantees Independent of Alignment

gpt-5.25/16/2026

Paper 2 is likely to have higher scientific impact due to broader, immediate utility: a large, open, evolving Lean 4 benchmark with open conjectures enables standardized evaluation, drives community contributions, and directly supports verified mathematical discovery. Its applications span automated reasoning, theorem proving, benchmarking, and AI evaluation, with demonstrated real-world impact via resolved conjectures. Paper 1 is highly novel and rigorous (deductive verification for agentic frameworks under worst-case semantics) with important AI safety implications, but its impact may be narrower near-term (boundary-enforceable properties, specific framework assumptions) and adoption depends on integration into widely used agent stacks.

vs. DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

claude-opus-4.65/16/2026

Formal Conjectures provides a lasting, evolving benchmark for a fundamental challenge—automated mathematical reasoning and discovery—with demonstrated real-world utility (resolving open conjectures). Its impact spans mathematics, AI reasoning, and formal verification, creating infrastructure that can drive discoveries for years. While DTap addresses the important and timely problem of AI agent security with impressive scale, it is more narrowly focused on red-teaming evaluation. Formal Conjectures' open-source, community-driven nature and its potential to bridge AI and mathematical research give it broader and more enduring scientific impact.

vs. Data-driven Circuit Discovery for Interpretability of Language Models

claude-opus-4.65/16/2026

Formal Conjectures provides a large-scale, evolving benchmark for evaluating AI mathematical reasoning at the research frontier, with demonstrated utility in resolving open conjectures. It bridges mathematicians and AI systems in a structured way, addresses contamination concerns, and serves the rapidly growing automated theorem proving community. Its breadth of impact spans mathematics and AI, and its open-source, community-driven nature ensures lasting relevance. Paper 2 makes valuable contributions to mechanistic interpretability but addresses a narrower methodological issue within circuit discovery, with more incremental impact on the interpretability subfield.

vs. SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory

gemini-3.15/16/2026

Paper 2 introduces a zero-contamination benchmark of formalized open mathematical conjectures, directly addressing the critical issue of LLM evaluation contamination. Its ability to facilitate actual mathematical discoveries demonstrates profound interdisciplinary impact. While Paper 1 offers valuable architectural improvements for agent memory, Paper 2 establishes a foundational resource for the rapidly advancing field of automated theorem proving, promising broader and longer-lasting scientific influence.

vs. Evaluating Explainability in Safety-Critical ATR Systems: Limitations of Post-Hoc Methods and Paths Toward Robust XAI

claude-opus-4.65/16/2026

Formal Conjectures presents a novel, large-scale benchmark (2615 problems in Lean 4) for evaluating automated reasoning on research-level mathematics, with demonstrated utility in resolving open conjectures. It bridges AI and mathematics communities through an evolving open-source framework, addressing the critical need for contamination-free evaluation as LLM-based theorem provers advance. Paper 2, while relevant, is primarily a review/taxonomy of XAI methods for ATR systems without introducing new methods or empirical results. Paper 1's broader cross-disciplinary impact, concrete artifact, and timeliness in the rapidly growing automated reasoning field give it significantly higher potential impact.

vs. MDGYM: Benchmarking AI Agents on Molecular Simulations

gpt-5.25/16/2026

Paper 1 likely has higher impact: it introduces a large, evolving, zero-contamination benchmark of research-level conjectures in a formal proof assistant (Lean 4), directly enabling and measuring verified mathematical discovery. Its interface between mathematicians and AI, community-driven rigor, and demonstrated use in resolving open conjectures give strong novelty, methodological robustness (formal verification), and broad relevance across automated reasoning, formal methods, and mathematics. Paper 2 is timely and useful for scientific-agent evaluation, but is narrower in scope (169 MD tasks) and less foundational than a verified, research-grade mathematics benchmark.

vs. From Holo Pockets to Electron Density: GPT-style Drug Design with Density

gpt-5.25/16/2026

Paper 1 likely has higher scientific impact due to its high novelty (a large, zero-contamination benchmark of open conjectures in Lean 4), methodological rigor via formal verification, and broad, cross-field relevance to automated reasoning, formal methods, and AI evaluation. Its open, evolving infrastructure can become a community standard, amplifying long-term impact, and it has already enabled new mathematical discoveries. Paper 2 is timely and applied with clear drug-design potential, but its impact may be narrower to SBDD and dependent on experimental ED availability and downstream validation.

vs. Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations

claude-opus-4.65/16/2026

Formal Conjectures has higher potential impact due to its broader scope across mathematics and AI, its demonstrated utility in making new mathematical discoveries (resolving open conjectures), and its role as an evolving benchmark for a rapidly advancing field (automated theorem proving). It addresses a fundamental bottleneck—research-level evaluation of reasoning systems—with immediate, verified results. While DORA is a well-constructed domain-specific benchmark for disaster response, Formal Conjectures serves a larger research community, connects to foundational questions in AI reasoning, and has already produced tangible scientific discoveries beyond benchmarking.

vs. Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

claude-opus-4.65/16/2026

Formal Conjectures provides a foundational, evolving benchmark of 2615 formalized research-level math problems in Lean 4, including open conjectures that have already led to new mathematical discoveries. Its impact spans both AI and mathematics communities, establishing lasting infrastructure for evaluating and advancing automated reasoning. While Seirênes presents a clever adversarial self-play method with solid empirical gains, it represents an incremental improvement in LLM robustness training. Formal Conjectures' broader utility as a community-driven, zero-contamination benchmark with real mathematical impact gives it higher long-term scientific significance.