TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

QuocViet Pham, Elvir Karimov, Andrey Galichin, Ivan Oseledets

Jun 8, 2026arXiv:2606.09450v1

cs.AI

#1598of 3489·Artificial Intelligence

#1598 of 3489 · Artificial Intelligence

Tournament Score

1410±44

10501800

55%

Win Rate

Wins

Losses

Matches

Rating

4.8/ 10

Significance5

Rigor4.5

Novelty5.5

Clarity6

Abstract

LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce TheoremBench, a Lean4 benchmark designed to evaluate theorem provers beyond contest settings. The benchmark is built from nearly one hundred classical theorems and is released in two complementary forms: a plain main version containing one target theorem per instance, and a premised version that expands each theorem into a structured family of related proving tasks consisting of the main theorem together with automatically extracted supporting subtheorems. This design enables evaluation of not only whether the final theorem was proved from scratch, but also of partial progress through the internal proof structure of a theorem. Our experiments show that explicit premises substantially improve performance for Lean4-capable prover models. To provide a comprehensive evaluation, we introduce theorem-level coverage and token-efficiency metrics that expose qualitative differences in proof behavior. The results show that current provers remain strongly biased toward easy subtheorems and often solve theorems through long and inefficient tactic traces rather than compact proof plans. TheoremBench therefore provides a more fine-grained view of formal reasoning ability and highlights the importance of structural benchmark design for evaluating Lean4 theorem provers.

AI Impact Assessments

(1 models)

Scientific Impact Assessment: TheoremBench

Core Contribution

TheoremBench introduces a Lean4 benchmark for evaluating theorem provers on classical mathematical theorems that go beyond competition-style problems. The key innovation is twofold: (1) a premised representation that converts supporting lemmas into explicit premise binders, creating self-contained proving instances with visible dependency structure; and (2) structured evaluation metrics (theorem-level coverage, token-efficiency) that go beyond pass@k to expose qualitative differences in proof behavior.

The benchmark contains 1,142 Lean4 instances derived from ~83 parent theorems inspired by Wiedijk's "100 theorems" list. It ships in two forms: a plain-main version (one theorem per instance) and a premised version (main theorem plus extracted subtheorems with explicit dependencies).

Methodological Rigor

Strengths in methodology:

The construction pipeline is well-defined: source file parsing → theorem group extraction → premise binder transformation → Lean4 verification → benchmark inclusion. Only compilable instances are retained, ensuring correctness.

The evaluation covers four models (DeepSeek-Prover-V2-7B, Goedel-Prover-V2-8B, Kimina-Prover-Distill-8B, Goedel-Prover-SFT) with 64 sampling attempts, providing reasonable statistical coverage.

The token-efficiency metric is thoughtfully designed—removing non-Lean prose before tokenization ensures fair comparison.

Weaknesses in methodology:

The experimental scope is narrow: only four models, all at the 7-8B parameter scale. No larger models (e.g., 70B+), no API-based models (GPT-4, Claude), and no retrieval-augmented or search-based systems are tested. This severely limits the generalizability of empirical conclusions.

The premise extraction relies on syntactic dependency analysis, which the authors acknowledge may not recover mathematically minimal dependency sets. No quantitative analysis of extraction quality is provided.

The difficulty classification uses "LLM-based classification and Lean4-level structural features" but provides no details on how this was validated or calibrated.

The plain-main pass@64 results are suspiciously uniform—three of four models achieve exactly 0.053, which seems coincidental and warrants explanation.

Ground-truth proofs are from existing formalizations, which may themselves be non-optimal, complicating the interpretation of token-efficiency metrics.

Potential Impact

The paper addresses a genuine gap: most formal theorem proving benchmarks focus on isolated competition problems rather than structured mathematical developments. The premised representation is a practically useful idea that could influence how future benchmarks are designed.

However, the impact is constrained by several factors:

Scale: 83 parent theorems / 1,142 instances is modest. FormalMATH offers 5,560 instances; Lean Workbook is much larger. The benchmark may be too small to serve as a reliable discriminator as models improve.

Scope of evaluation: Testing only small open-source models means we don't know how frontier models perform, limiting the benchmark's diagnostic utility for the broader community.

Novelty of findings: The main empirical conclusions—that models are better at easy problems, that explicit premises help capable models, and that generated proofs are verbose—are largely expected and don't provide deep new insights.

The most impactful element is likely the premised dataset design pattern itself, which could be adopted by other benchmark designers to create dependency-aware evaluations.

Timeliness & Relevance

The paper is timely: LLM-based theorem proving is advancing rapidly (DeepSeek-Prover-V2, Kimina-Prover, Goedel-Prover all appeared in early-mid 2025), and the community needs benchmarks that test more than isolated problem-solving. The shift toward evaluating structured mathematical reasoning—partial progress, proof chains, premise utilization—is well-motivated.

However, the paper arrives in a crowded benchmark landscape. MiniF2F, PutnamBench, ProverBench, CombiBench, FormalMATH, and ProofNet already exist. The paper's Table 1 positions TheoremBench alongside these, but the comparison is somewhat superficial—the 54.91% success rate is difficult to compare directly given different model configurations and sampling budgets.

Strengths

1. Well-motivated design: The premised representation is a genuine contribution to benchmark methodology, making proof dependencies explicit and enabling structured evaluation.

2. Novel metrics: Theorem-level coverage survival curves and token-efficiency ratios provide genuinely new evaluation dimensions beyond pass@k.

3. Clear empirical signal: The differential impact of premises across models (large for DeepSeek, none for Goedel-SFT) is a clean and informative result that reveals meaningful model differences.

4. Reproducibility: The pipeline is well-documented, all instances are Lean4-verified, and the benchmark structure enables future extensions.

Limitations

1. Narrow experimental scope: Four small models, no frontier models, no search-augmented systems. The benchmark's ability to discriminate among strong provers is untested.

2. Modest scale: 83 parent theorems may be insufficient for robust statistical comparisons, especially as models improve.

3. Limited analysis depth: The paper reports metrics but offers little mechanistic analysis of *why* models fail on specific theorems or *what* proof strategies they employ.

4. Premise extraction quality: No evaluation of whether the extracted premises are complete, minimal, or representative of what a human prover would need.

5. Missing baselines: No comparison with ATP tools (e.g., Sledgehammer-style approaches), which would contextualize LLM performance.

6. Writing quality: The paper is competent but somewhat repetitive—key findings are stated multiple times across sections without added insight.

Overall Assessment

TheoremBench makes a reasonable contribution to the formal theorem proving evaluation landscape by introducing dependency-aware benchmarking and structured metrics. The premised representation is a useful methodological idea. However, the paper's impact is limited by its modest scale, narrow experimental evaluation, and largely expected empirical findings. It represents a solid incremental contribution to benchmark design rather than a transformative advance in understanding or capability.

Rating:4.8/ 10

Significance 5Rigor 4.5Novelty 5.5Clarity 6

Generated Jun 9, 2026

Comparison History (20)

Lostvs. TouchThinker: Scaling Tactile Commonsense Reasoning to the Open World with Large-scale Data and Action-aware Representation

Paper 2 introduces a million-scale dataset and novel framework for tactile commonsense reasoning, addressing a critical bottleneck in embodied AI and robotics. The massive scale and multi-sensor approach provide a foundation for real-world physical interaction. In contrast, while Paper 1 offers a valuable benchmark for formal mathematics, its scope is more constrained to the specialized sub-field of automated theorem proving, making Paper 2's potential real-world applications and breadth of impact across robotics and multimodal AI significantly higher.

gemini-3.1-pro-preview·Jun 11, 2026

Lostvs. Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning

Paper 2 has significantly higher potential for real-world impact due to its direct application in healthcare. While Paper 1 provides a valuable benchmarking tool for formal mathematics, Paper 2 addresses a critical medical need (pulmonary diagnostic reasoning) by combining large language models with a novel knowledge graph (LungKG). The ability to improve electronic medical record-based diagnostics has profound implications for patient care, giving it broader relevance and higher immediate scientific and societal impact.

gemini-3.1-pro-preview·Jun 11, 2026

Lostvs. AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning

AliyunConsoleAgent presents a novel, end-to-end framework addressing a concrete real-world problem (cloud documentation verification at scale) with a practical two-stage training paradigm combining distillation and RL in live environments. Its contributions span web agents, RL in real-world settings, and enterprise automation, with demonstrated cost savings (92% lower inference cost) and near-frontier performance. TheoremBench, while valuable as a benchmark contribution for formal theorem proving evaluation, is more incremental—extending existing benchmark methodology to classical theorems with structural decomposition. Paper 2's broader applicability, methodological innovations (dual-channel reward, Terraform-based rollout), and immediate practical impact give it higher potential.

claude-opus-4-6·Jun 9, 2026

Wonvs. From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

Paper 2 likely has higher scientific impact: it introduces a new benchmark (TheoremBench) and metrics that can become a shared evaluation standard for formal-math LLMs, affecting many future papers and enabling more rigorous, comparable progress. Its design (premised structure, theorem-level coverage, token-efficiency) broadens evaluation beyond contest-style tasks and is timely given rapid growth in Lean4-based provers. Paper 1 is useful and practical for long-context inference, but is more incremental within a crowded optimization space and may have narrower cross-field influence than a widely adopted benchmark.

gpt-5.2·Jun 9, 2026

Lostvs. Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

Paper 1 introduces PRIME, a novel framework for detecting and understanding reward hacking before it manifests visibly—a critical AI safety problem. It offers mechanistic insights into how models internalize proxy rewards, provides early-warning signals for alignment failures, and demonstrates generalization across settings. This has broad implications for AI alignment and safety, an increasingly urgent field. Paper 2, while a solid benchmark contribution for formal theorem proving evaluation, is more incremental—extending existing evaluation methodology rather than revealing new phenomena. Paper 1's novelty, timeliness, and cross-field relevance to AI safety give it higher potential impact.

claude-opus-4-6·Jun 9, 2026

Wonvs. Agent Economics: An Entropy-Controlled Pluralistic Alignment Framework for Preventing Artificial Hivemind in Autonomous Agents

Paper 1 introduces a concrete, completed benchmark with novel metrics for evaluating LLMs on formal theorem proving (Lean4), a highly active and timely area. Benchmarks typically drive significant subsequent research, leading to high citation rates. In contrast, Paper 2 reads as a conceptual proposal, as its simulation environment and results are described in the future tense ('will be developed', 'anticipated results'). Paper 1's completed empirical evaluation and immediate utility to the AI reasoning community grant it a higher potential for concrete scientific impact.

gemini-3.1-pro-preview·Jun 9, 2026

Wonvs. Frequency-based Constrained Sampling for Interval Patterns

Paper 2 addresses a highly timely and rapidly expanding intersection of Large Language Models and formal mathematics. By introducing a structured benchmark (TheoremBench) for complex, dependency-rich proofs rather than isolated contest problems, it solves a major evaluation bottleneck in AI reasoning. While Paper 1 provides a rigorous and useful algorithmic advancement in data mining, Paper 2 has broader cross-disciplinary appeal (AI, formal verification, mathematics) and aligns with intense current research efforts in automated theorem proving, leading to higher potential scientific impact.

gemini-3.1-pro-preview·Jun 9, 2026

Lostvs. IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

Paper 1 addresses a critical bottleneck in the deployment of real-world AI agents by benchmarking multi-turn interleaved image-text dialogues. Its focus on unified multimodal models and exposure bias offers broader immediate applications and cross-disciplinary impact. While Paper 2 provides rigorous methodological advancements in formal theorem proving, its impact is concentrated within a more specialized subfield of mathematical reasoning.

gemini-3.1-pro-preview·Jun 9, 2026

Wonvs. QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

Paper 2 is likely to have higher scientific impact because it introduces a new benchmark and evaluation methodology for formal theorem proving that can become a community standard, influencing many subsequent model papers. Its contributions (premised task families, coverage and token-efficiency metrics) are broadly applicable across LLM reasoning, formal methods, and benchmark design, and are timely given rapid progress in Lean4 provers. Paper 1 is strong but more engineering-focused and narrower (RAG serving optimization in specific stacks), with impact concentrated in systems/serving rather than across multiple fields.

gpt-5.2·Jun 9, 2026

Wonvs. Capability-Aligned Hierarchical Learning for Tool-Augmented LLMs

Paper 2 likely has higher impact because it introduces a broadly useful, durable evaluation benchmark (TheoremBench) for Lean4 that addresses a clear gap: long, dependency-rich formal developments beyond contest-style problems. Benchmarks and metrics (coverage, token-efficiency) can standardize progress, enable fair comparison across models, and influence both ML and formal methods communities. Paper 1 proposes a method (CAHL) for planner–executor alignment in tool-augmented LLMs, but it is closer to incremental optimization within an active area and its demonstrated scope is narrower and more benchmark-dependent.

gpt-5.2·Jun 9, 2026

#1598of 3489·Artificial Intelligence

#1598 of 3489 · Artificial Intelligence

Tournament Score

1410±44

10501800

55%

Win Rate

Wins

Losses

Matches

Rating

4.8/ 10

Significance5

Rigor4.5

Novelty5.5

Clarity6