QuocViet Pham, Elvir Karimov, Andrey Galichin, Ivan Oseledets
LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce TheoremBench, a Lean4 benchmark designed to evaluate theorem provers beyond contest settings. The benchmark is built from nearly one hundred classical theorems and is released in two complementary forms: a plain main version containing one target theorem per instance, and a premised version that expands each theorem into a structured family of related proving tasks consisting of the main theorem together with automatically extracted supporting subtheorems. This design enables evaluation of not only whether the final theorem was proved from scratch, but also of partial progress through the internal proof structure of a theorem. Our experiments show that explicit premises substantially improve performance for Lean4-capable prover models. To provide a comprehensive evaluation, we introduce theorem-level coverage and token-efficiency metrics that expose qualitative differences in proof behavior. The results show that current provers remain strongly biased toward easy subtheorems and often solve theorems through long and inefficient tactic traces rather than compact proof plans. TheoremBench therefore provides a more fine-grained view of formal reasoning ability and highlights the importance of structural benchmark design for evaluating Lean4 theorem provers.
TheoremBench introduces a Lean4 benchmark for evaluating theorem provers on classical mathematical theorems that go beyond competition-style problems. The key innovation is twofold: (1) a premised representation that converts supporting lemmas into explicit premise binders, creating self-contained proving instances with visible dependency structure; and (2) structured evaluation metrics (theorem-level coverage, token-efficiency) that go beyond pass@k to expose qualitative differences in proof behavior.
The benchmark contains 1,142 Lean4 instances derived from ~83 parent theorems inspired by Wiedijk's "100 theorems" list. It ships in two forms: a plain-main version (one theorem per instance) and a premised version (main theorem plus extracted subtheorems with explicit dependencies).
The paper addresses a genuine gap: most formal theorem proving benchmarks focus on isolated competition problems rather than structured mathematical developments. The premised representation is a practically useful idea that could influence how future benchmarks are designed.
However, the impact is constrained by several factors:
The most impactful element is likely the premised dataset design pattern itself, which could be adopted by other benchmark designers to create dependency-aware evaluations.
The paper is timely: LLM-based theorem proving is advancing rapidly (DeepSeek-Prover-V2, Kimina-Prover, Goedel-Prover all appeared in early-mid 2025), and the community needs benchmarks that test more than isolated problem-solving. The shift toward evaluating structured mathematical reasoning—partial progress, proof chains, premise utilization—is well-motivated.
However, the paper arrives in a crowded benchmark landscape. MiniF2F, PutnamBench, ProverBench, CombiBench, FormalMATH, and ProofNet already exist. The paper's Table 1 positions TheoremBench alongside these, but the comparison is somewhat superficial—the 54.91% success rate is difficult to compare directly given different model configurations and sampling budgets.
1. Well-motivated design: The premised representation is a genuine contribution to benchmark methodology, making proof dependencies explicit and enabling structured evaluation.
2. Novel metrics: Theorem-level coverage survival curves and token-efficiency ratios provide genuinely new evaluation dimensions beyond pass@k.
3. Clear empirical signal: The differential impact of premises across models (large for DeepSeek, none for Goedel-SFT) is a clean and informative result that reveals meaningful model differences.
4. Reproducibility: The pipeline is well-documented, all instances are Lean4-verified, and the benchmark structure enables future extensions.
1. Narrow experimental scope: Four small models, no frontier models, no search-augmented systems. The benchmark's ability to discriminate among strong provers is untested.
2. Modest scale: 83 parent theorems may be insufficient for robust statistical comparisons, especially as models improve.
3. Limited analysis depth: The paper reports metrics but offers little mechanistic analysis of *why* models fail on specific theorems or *what* proof strategies they employ.
4. Premise extraction quality: No evaluation of whether the extracted premises are complete, minimal, or representative of what a human prover would need.
5. Missing baselines: No comparison with ATP tools (e.g., Sledgehammer-style approaches), which would contextualize LLM performance.
6. Writing quality: The paper is competent but somewhat repetitive—key findings are stated multiple times across sections without added insight.
TheoremBench makes a reasonable contribution to the formal theorem proving evaluation landscape by introducing dependency-aware benchmarking and structured metrics. The premised representation is a useful methodological idea. However, the paper's impact is limited by its modest scale, narrow experimental evaluation, and largely expected empirical findings. It represents a solid incremental contribution to benchmark design rather than a transformative advance in understanding or capability.
Generated Jun 9, 2026
Paper 2 introduces a million-scale dataset and novel framework for tactile commonsense reasoning, addressing a critical bottleneck in embodied AI and robotics. The massive scale and multi-sensor approach provide a foundation for real-world physical interaction. In contrast, while Paper 1 offers a valuable benchmark for formal mathematics, its scope is more constrained to the specialized sub-field of automated theorem proving, making Paper 2's potential real-world applications and breadth of impact across robotics and multimodal AI significantly higher.
Paper 2 has significantly higher potential for real-world impact due to its direct application in healthcare. While Paper 1 provides a valuable benchmarking tool for formal mathematics, Paper 2 addresses a critical medical need (pulmonary diagnostic reasoning) by combining large language models with a novel knowledge graph (LungKG). The ability to improve electronic medical record-based diagnostics has profound implications for patient care, giving it broader relevance and higher immediate scientific and societal impact.
AliyunConsoleAgent presents a novel, end-to-end framework addressing a concrete real-world problem (cloud documentation verification at scale) with a practical two-stage training paradigm combining distillation and RL in live environments. Its contributions span web agents, RL in real-world settings, and enterprise automation, with demonstrated cost savings (92% lower inference cost) and near-frontier performance. TheoremBench, while valuable as a benchmark contribution for formal theorem proving evaluation, is more incremental—extending existing benchmark methodology to classical theorems with structural decomposition. Paper 2's broader applicability, methodological innovations (dual-channel reward, Terraform-based rollout), and immediate practical impact give it higher potential.
Paper 2 likely has higher scientific impact: it introduces a new benchmark (TheoremBench) and metrics that can become a shared evaluation standard for formal-math LLMs, affecting many future papers and enabling more rigorous, comparable progress. Its design (premised structure, theorem-level coverage, token-efficiency) broadens evaluation beyond contest-style tasks and is timely given rapid growth in Lean4-based provers. Paper 1 is useful and practical for long-context inference, but is more incremental within a crowded optimization space and may have narrower cross-field influence than a widely adopted benchmark.
Paper 1 introduces PRIME, a novel framework for detecting and understanding reward hacking before it manifests visibly—a critical AI safety problem. It offers mechanistic insights into how models internalize proxy rewards, provides early-warning signals for alignment failures, and demonstrates generalization across settings. This has broad implications for AI alignment and safety, an increasingly urgent field. Paper 2, while a solid benchmark contribution for formal theorem proving evaluation, is more incremental—extending existing evaluation methodology rather than revealing new phenomena. Paper 1's novelty, timeliness, and cross-field relevance to AI safety give it higher potential impact.
Paper 1 introduces a concrete, completed benchmark with novel metrics for evaluating LLMs on formal theorem proving (Lean4), a highly active and timely area. Benchmarks typically drive significant subsequent research, leading to high citation rates. In contrast, Paper 2 reads as a conceptual proposal, as its simulation environment and results are described in the future tense ('will be developed', 'anticipated results'). Paper 1's completed empirical evaluation and immediate utility to the AI reasoning community grant it a higher potential for concrete scientific impact.
Paper 2 addresses a highly timely and rapidly expanding intersection of Large Language Models and formal mathematics. By introducing a structured benchmark (TheoremBench) for complex, dependency-rich proofs rather than isolated contest problems, it solves a major evaluation bottleneck in AI reasoning. While Paper 1 provides a rigorous and useful algorithmic advancement in data mining, Paper 2 has broader cross-disciplinary appeal (AI, formal verification, mathematics) and aligns with intense current research efforts in automated theorem proving, leading to higher potential scientific impact.
Paper 1 addresses a critical bottleneck in the deployment of real-world AI agents by benchmarking multi-turn interleaved image-text dialogues. Its focus on unified multimodal models and exposure bias offers broader immediate applications and cross-disciplinary impact. While Paper 2 provides rigorous methodological advancements in formal theorem proving, its impact is concentrated within a more specialized subfield of mathematical reasoning.
Paper 2 is likely to have higher scientific impact because it introduces a new benchmark and evaluation methodology for formal theorem proving that can become a community standard, influencing many subsequent model papers. Its contributions (premised task families, coverage and token-efficiency metrics) are broadly applicable across LLM reasoning, formal methods, and benchmark design, and are timely given rapid progress in Lean4 provers. Paper 1 is strong but more engineering-focused and narrower (RAG serving optimization in specific stacks), with impact concentrated in systems/serving rather than across multiple fields.
Paper 2 likely has higher impact because it introduces a broadly useful, durable evaluation benchmark (TheoremBench) for Lean4 that addresses a clear gap: long, dependency-rich formal developments beyond contest-style problems. Benchmarks and metrics (coverage, token-efficiency) can standardize progress, enable fair comparison across models, and influence both ML and formal methods communities. Paper 1 proposes a method (CAHL) for planner–executor alignment in tool-augmented LLMs, but it is closer to incremental optimization within an active area and its demonstrated scope is narrower and more benchmark-dependent.