CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean

Wentao Long, Yunfei Zhang, Chenyi Li, Li Zhou, Chumin Sun, Zaiwen Wen

May 17, 2026

arXiv:2605.17255v1 PDF

cs.AI(primary)math.OC

#1349of 2292·Artificial Intelligence

#1349 of 2292 · Artificial Intelligence

Tournament Score

1391±40

10501800

57%

Win Rate

Wins

Losses

Matches

Rating

6.5/ 10

Significance7

Rigor6

Novelty6.5

Clarity6

Tournament Score

1391±40

10501800

57%

Win Rate

Wins

Losses

Matches

Rating

6.5/ 10

Significance

Rigor

Novelty

Clarity

Abstract

Formal theorem-proving benchmarks enable mechanically verifiable evaluation of mathematical reasoning in large language models. However, existing benchmarks mainly focus on Olympiad-style problems and algebraic domains, leaving computational and applied mathematics underrepresented. We introduce CAM-Bench, a Lean 4 theorem-proving benchmark of 1,000 Lean proof targets in computational and applied mathematics, with coverage spanning optimization, numerical linear algebra, and numerical analysis. These problems are adapted from textbook exercises and often depend on locally introduced definitions, notation, algorithms, and elementary results. To construct CAM-Bench, we develop a dependency-recovery pipeline that reconstructs the local textbook context needed to state each problem faithfully. It then normalizes each problem into a standalone informal theorem and translates it into a Lean target. We validate the resulting formal problems through Lean compilation and semantic review, checking both formal correctness and semantic alignment with the original exercises. For each problem, we release the raw exercise, recovered context, normalized informal theorem, and final Lean target. CAM-Bench complements existing formal mathematics benchmarks by targeting applied mathematics problems that rely on textbook concepts and elementary theorems, many of which are not directly available as standard Mathlib4 lemmas. We evaluate widely used large language models and formalization agents on CAM-Bench, and analyze common failure modes in tracking local assumptions, applying elementary results, decomposing proofs, and maintaining long-horizon control in Lean.

AI Impact Assessments

(1 models)

Scientific Impact Assessment: CAM-Bench

1. Core Contribution

CAM-Bench addresses a genuine gap in formal mathematics benchmarking: the underrepresentation of computational and applied mathematics. While existing benchmarks (MiniF2F, PutnamBench, ProofNet, FormalMATH) focus primarily on Olympiad-style problems or algebraic domains, CAM-Bench targets textbook exercises in optimization, numerical linear algebra, and numerical analysis. The key novelty is twofold: (a) the benchmark itself—1,000 Lean 4 proof targets derived from 778 textbook exercises with full provenance chains, and (b) a dependency-recovery pipeline that reconstructs the local textbook context (definitions, notation, algorithms, assumptions) needed to faithfully state each problem as a self-contained Lean target.

The dependency-recovery pipeline is the more technically interesting contribution. Textbook exercises in applied mathematics are rarely self-contained—they reference earlier definitions, algorithms, and conventions. The recursive dependency expansion (Algorithm 1) with source-grounded enrichment, followed by modularization into semantic blocks and frozen-context incremental formalization, represents a systematic approach to a genuinely difficult problem in benchmark construction.

2. Methodological Rigor

The construction pipeline is carefully designed with multiple validation layers: source-level semantic review (accept/revise/hold classification), Lean compilation checking, and Lean-level semantic validation. The conservative funnel approach—only 54.3% of candidates survive source-level review, and 92.5% of Lean declarations pass semantic validation—suggests quality control was taken seriously.

However, several concerns arise. The semantic review relies heavily on "LLM-as-a-judge," which introduces potential systematic biases that are difficult to characterize. While the authors acknowledge this as a "diagnostic signal" rather than a "sole correctness criterion," the pipeline's integrity still depends substantially on LLM judgment at multiple stages. The paper does not report any human expert validation of a random sample of final Lean targets, which would significantly strengthen confidence in benchmark quality.

The experimental evaluation is reasonable but not exhaustive. The pass@32 protocol for non-agentic baselines and the 8-hour budget for agentic systems provide meaningful comparisons. The external agent evaluation (M2F and Aristotle) on 200 sampled instances is a useful addition, though the sample size limits statistical power for fine-grained analysis.

3. Potential Impact

Benchmark impact: CAM-Bench fills a real niche. Applied mathematics formalization is fundamentally different from competition mathematics—it requires tracking inherited assumptions, working with domain-specific infrastructure (matrices, spectra, convexity, convergence), and managing longer proof chains with more auxiliary definitions. The benchmark should motivate development of better Mathlib infrastructure for applied mathematics and better handling of context-dependent formalization.

Pipeline contribution: The dependency-recovery pipeline could be adapted beyond this specific benchmark to other textbook-to-formal translation tasks. The semantic block decomposition and frozen-context construction approach are general techniques.

Diagnostic value: The error analysis identifying three failure categories (Library/API grounding, formal representation/type discipline, proof decomposition/long-horizon control) provides actionable insights. The finding that 68-69% of agentic system failures involve proof decomposition and long-horizon control is particularly informative.

Limitations on impact: The benchmark is heavily optimization-centric despite claiming to cover "computational and applied mathematics" broadly. PDEs, control theory, scientific computing, and other major applied mathematics areas are absent. The practical barrier to adoption may be high since many problems require infrastructure not yet in Mathlib4.

4. Timeliness & Relevance

The paper is well-timed. Formal theorem proving with LLMs is advancing rapidly (DeepSeek-Prover, SEED-Prover, various agentic systems), and there is increasing recognition that Olympiad-focused benchmarks do not capture the full spectrum of mathematical reasoning capabilities. The informal-formal gap documented here (70-84% informal accuracy vs. 2-20% verified Lean accuracy for non-agentic baselines) powerfully demonstrates that current models can reason about applied mathematics but cannot ground those arguments in formal infrastructure—a critical finding for the field.

The evaluation includes very recent models (GPT-5.4, Claude-sonnet-4.6, Deepseek-v4-pro, Gemini-3.1-pro-preview, Kimi-k2.6) and state-of-the-art agentic systems (M2F, Aristotle), making the results immediately relevant.

5. Strengths & Limitations

Strengths:

Fills a genuine gap in formal mathematics benchmarking

Comprehensive construction pipeline with multiple validation stages

Full provenance chain (raw exercise → context → informal theorem → Lean target) enables auditing

Strong experimental evaluation with diverse models and meaningful error analysis

The 71.5% solve rate for the best agentic system (M2F) shows the benchmark is challenging but not intractable—a good calibration point

Limitations:

Despite the "computational and applied mathematics" framing, coverage is narrowly optimization-centric

Heavy reliance on LLM-as-judge for semantic validation without reported human expert verification of a statistically meaningful sample

The pipeline is complex and potentially difficult to reproduce, despite code release

Some circuitousness: 778 exercises yielding 1,000 targets through decomposition means the effective benchmark size for unique mathematical content is smaller than it appears

The paper is very long (44+ pages with appendices) but the core novelty could be communicated more concisely

No analysis of how well the benchmark discriminates between models at different capability levels (beyond raw accuracy numbers)

The construction pipeline's dependence on frontier LLMs raises questions about cost and accessibility for benchmark extension

Additional Observations

The dataset release with full intermediate artifacts (raw exercise, recovered context, normalized theorem, final Lean target) is commendable and enables community auditing. The strict validation protocol for agentic systems (checking for sorry, admit, axiom, etc.) is well-designed. The observation that agentic repair increases GPT-5.4 from 11.7% to 44.0% at ~3x token cost provides useful efficiency analysis.

Rating:6.5/ 10

Significance 7Rigor 6Novelty 6.5Clarity 6

Generated May 19, 2026

Comparison History (23)

vs. Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents

gpt-5.25/22/2026

Paper 1 likely has higher scientific impact due to its durable, broadly reusable infrastructure contribution: a 1,000-problem Lean 4 benchmark filling a clear gap (computational/applied math) and enabling mechanically verifiable evaluation of LLM mathematical reasoning. Its dependency-recovery pipeline and artifact release (exercise→context→informal theorem→Lean target) support reproducibility and future research across formal methods, AI evaluation, and mathematical libraries. Paper 2 is timely and application-relevant, but dataset/task + prompting framework for ToM persuasion is more domain-specific, potentially more sensitive to annotation subjectivity, and may age faster with model advances.

vs. Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents

gemini-3.15/22/2026

Paper 1 introduces a novel benchmark for formal theorem proving in Lean 4, addressing a critical bottleneck in AI reasoning by expanding beyond pure mathematics into applied domains. High-quality formal benchmarks are highly foundational, widely adopted by the AI community to evaluate reasoning limits, and typically garner broad scientific impact. Paper 2 offers valuable contributions to Theory of Mind and conversational agents, but targets a narrower subfield of persuasive dialogue compared to the broad utility of evaluating and advancing general logical reasoning.

vs. The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems

gemini-3.15/22/2026

Paper 1 proposes a fundamental paradigm shift in the architecture of AI agents, moving from LLM-centric loops to event-sourced reactive graphs. This addresses critical challenges in agent observability, determinism, and state management, offering broad applicability across AI safety, self-improving systems, and enterprise software. While Paper 2 provides a valuable formal mathematics benchmark, it represents an incremental contribution to a specific sub-field. Paper 1's architectural innovation has a much higher ceiling for widespread adoption and transformative impact across the broader AI and software engineering communities.

vs. The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems

claude-opus-4.65/22/2026

CAM-Bench addresses a clear gap in formal theorem-proving benchmarks by targeting computational and applied mathematics, an underrepresented domain. It provides a rigorous, large-scale benchmark (1,000 problems) with a validated pipeline, enabling reproducible evaluation of LLMs on applied math reasoning. This fills a concrete need in the AI-for-mathematics community and will likely drive research progress. Paper 2 describes an architectural proposal (ActiveGraph) for agent systems with interesting properties (replay, forking, lineage), but it is more of a position/design paper without demonstrated empirical results, limiting its immediate scientific impact.

vs. Evidential Information Fusion on Possibilistic Structure

gemini-3.15/19/2026

Paper 2 addresses a highly timely and rapidly expanding field: evaluating Large Language Models' mathematical reasoning using formal verification (Lean 4). By providing a novel benchmark for computational and applied mathematics, it serves a critical need in AI development and is likely to be widely adopted and cited. Paper 1, while methodologically sound, operates in a much narrower and established subfield of information fusion and belief functions, limiting its broader scientific impact compared to the AI/LLM benchmark.

vs. Enhancing Metacognitive AI: Knowledge-Graph Population with Graph-Theoretic LLM Enrichment

gpt-5.25/19/2026

Paper 2 likely has higher scientific impact: it contributes a large, reusable Lean 4 benchmark (1,000 targets) addressing a clear gap (computational/applied math) with rigorous validation (compilation + semantic review) and rich artifact release, enabling standardized, reproducible evaluation across many future models and agents. Its breadth spans formal methods, ML, and applied mathematics, and it is timely given rapid growth in LLM theorem proving. Paper 1 is innovative but more system-integration/proof-of-concept and depends on specific tools/models, which may limit reproducibility and long-term generality.

vs. SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

gpt-5.25/19/2026

CAM-Bench likely has higher scientific impact because it provides a broadly reusable community resource (1,000 Lean 4 targets) that enables standardized, mechanically verifiable evaluation of LLM mathematical reasoning in undercovered applied/computational domains. Its dependency-recovery/normalization pipeline and released artifacts can catalyze follow-on work in formalization, benchmarking, curriculum learning, and tool development across ML, PL, and mathematics. SAPO is a solid methodological improvement for RL credit assignment in generative recommendation, with clear practical value, but its impact is narrower to a specific task/setup and may be more incremental.

vs. DARE-EEG: A Foundation Model for Mining Dual-Aligned Representation of EEG

gemini-3.15/19/2026

DARE-EEG addresses a critical challenge in Brain-Computer Interfaces by developing a highly transferable foundation model for EEG data. Its potential real-world applications in healthcare, neurology, and cognitive science offer broader scientific impact than CAM-Bench, which, while valuable for AI-driven formal theorem proving, serves a more niche community. Foundation models for physiological signals represent a highly impactful and rapidly growing cross-disciplinary field.

vs. A Conflict-aware Evidential Framework for Reliable Sleep Stage Classification

claude-opus-4.65/19/2026

CAM-Bench addresses a significant gap in formal theorem-proving benchmarks by targeting computational and applied mathematics, an underrepresented area. It provides a reusable community resource (1,000 Lean proof targets) that can drive progress in AI mathematical reasoning, a rapidly growing field. Its breadth of impact spans formal verification, AI/LLM evaluation, and applied mathematics. Paper 1, while solid, addresses a more niche problem (sleep stage classification) with incremental methodological contributions. CAM-Bench's timeliness—given the surge in LLM reasoning research—and its potential to become a widely-used benchmark give it higher impact potential.

vs. Scientific Logicality Enriched Methodology for LLM Reasoning: A Practice in Physics

gemini-3.15/19/2026

Paper 1 introduces a mechanically verifiable benchmark for LLMs in an underrepresented domain (applied and computational mathematics) using Lean 4. Its reliance on formal theorem proving provides a rigorous, hallucination-free evaluation metric, which is critical for the reliable advancement of AI in mathematics. While Paper 2 offers a useful methodology for improving logicality in physics QA, Paper 1's contribution to formal verification addresses a more fundamental bottleneck in AI scientific reasoning, likely yielding broader, longer-lasting impact across AI and applied mathematics.

vs. RAGA: Reading-And-Graph-building-Agent for Autonomous Knowledge Graph Construction and Retrieval-Augmented Generation

gpt-5.25/19/2026

Paper 1 is more likely to have higher scientific impact due to its clear novelty (a 1,000-problem Lean 4 benchmark focused on underrepresented computational/applied math), strong methodological rigor (dependency-recovery pipeline, compilation + semantic review, multi-stage released artifacts), and broad relevance to formal methods, ML evaluation, and mathematical reasoning. Benchmarks often become durable community infrastructure and enable standardized, reproducible progress tracking. Paper 2 is timely and useful, but appears more incremental (agentic KG+RAG design) with only preliminary experiments on a subset of QASPER, limiting evidential strength and likely long-term impact.

vs. OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models

claude-opus-4.65/19/2026

OCCAM addresses the widely important problem of explainability in black-box vision models with a novel framework combining open-set concept discovery, causal interventions, and ontology induction. This has broad applicability across AI safety, trustworthy AI, and any domain using deep vision classifiers. While CAM-Bench fills a gap in formal theorem-proving benchmarks for applied mathematics, benchmarks tend to have more incremental impact unless they catalyze entirely new research directions. OCCAM's methodological contributions—causal concept explanations and structured ontology induction—offer richer potential for follow-on research and real-world deployment in high-stakes applications.

vs. Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning

gpt-5.25/19/2026

Paper 1 proposes concrete algorithmic advances (holistic lookahead encoding + Abstracted IW(1)) that improve scalability and achieve new SOTA on IPC 2023, with clear real-world relevance to planning/robotics and strong cross-field links to GNNs and search. Its methodological contribution is substantive and likely to influence both learning-for-planning and classical planning communities. Paper 2 is valuable and timely infrastructure for LLM/formalization evaluation, but benchmarks tend to have narrower direct scientific impact unless they become a dominant standard; its innovation is more in dataset/pipeline than in new mathematical or algorithmic methods.

vs. ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation

gemini-3.15/19/2026

Paper 2 addresses a critical gap in predictive healthcare by integrating physiological ODE priors with latent diffusion models to simulate clinical interventions safely. Its direct real-world applications in clinical decision support, pharmacology, and patient safety offer profound societal and scientific impact, outweighing Paper 1's valuable but narrower contribution to benchmarking LLM mathematical reasoning.

vs. $π$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

gemini-3.15/19/2026

Paper 1 addresses a critical bottleneck in the broadly applicable field of LLM agents: proactive assistance in long-horizon interactions. Its focus on hidden intents and cross-session continuity has wider real-world applications across various HCI and AI domains compared to Paper 2, which, while highly rigorous, targets the more specialized niche of formal theorem proving in Lean.

vs. Evaluating Cognitive Age Alignment in Interactive AI Agents

gemini-3.15/19/2026

While Paper 1 offers a novel cognitive science perspective on AI evaluation, Paper 2 provides a highly rigorous, actionable benchmark for formal theorem proving in Lean. By expanding automated reasoning beyond pure mathematics into computational and applied mathematics, Paper 2 directly addresses a critical bottleneck in AI-driven scientific discovery and verifiable computation, offering higher immediate utility and methodological rigor for advancing complex AI reasoning.

vs. From Prompts to Protocols: An AI Agent for Laboratory Automation

claude-opus-4.65/19/2026

Paper 2 addresses the critical bottleneck of laboratory automation accessibility by enabling natural-language interaction with complex lab systems. Its 97% success rate across chemistry, biology, and materials science demonstrates immediate practical utility and broad cross-disciplinary impact. While Paper 1 makes a valuable contribution to formal verification benchmarks for applied mathematics, its impact is more niche—primarily serving the intersection of formal methods and LLM evaluation communities. Paper 2's potential to democratize autonomous experimentation and accelerate scientific discovery across multiple fields gives it higher estimated real-world impact.

vs. DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

claude-opus-4.65/19/2026

CAM-Bench addresses a significant gap in formal theorem-proving benchmarks by targeting computational and applied mathematics, an underrepresented area. It provides a substantial resource (1,000 Lean proof targets) with a novel dependency-recovery pipeline, enabling systematic evaluation of LLMs on applied math reasoning. Its breadth of impact spans formal verification, AI for mathematics, and applied math education. DRS-GUI, while showing solid improvements in GUI grounding, represents a more incremental contribution to a narrower subfield. CAM-Bench's benchmark nature gives it lasting community-wide utility.

vs. PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI

claude-opus-4.65/19/2026

CAM-Bench introduces a novel, reusable benchmark resource addressing a clear gap in formal mathematics evaluation—computational and applied mathematics are underrepresented in existing Lean benchmarks. It has broader impact across AI reasoning, formal verification, and applied mathematics communities, and will likely serve as a standard evaluation tool. PRISM, while practically valuable for enterprise AI deployment, is more narrowly focused on a specific commercial platform (Yellow.ai) and addresses an engineering problem (prompt reliability) with less generalizable scientific contribution.

vs. Petri Net Induced Heuristic Search for Resource Constrained Scheduling

gpt-5.25/19/2026

Paper 2 likely has higher impact: it introduces a sizable new benchmark and dataset plus a context-recovery/normalization pipeline that can be reused broadly, directly serving a fast-moving, highly relevant area (LLMs + formal theorem proving). Its applications span evaluation, training, and diagnosis of formalization agents across math/CS/AI, enabling standardized comparisons and accelerating progress. Paper 1 is methodologically solid and valuable for RCPSP, but its impact is narrower to scheduling/OR and may be incremental relative to existing exact/heuristic methods. Benchmark contributions tend to catalyze wider follow-on work.