MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Jiacheng Chen, Xinyu Zhang, Shunkai Zhang, Yanmohan Wang, Lin Li, Tiancheng Qin, Qin Wang, Zhengmao Zhu

Jun 11, 2026 · arXiv:2606.13473v1

cs.LG · cs.AI · cs.CL

#957 of 5669 · cs.LG

Tournament Score: 1475±45 (scale 1050 to 1750)

Win Rate: 69%

Rating: 7.2 / 10

Significance 7.5 · Rigor 6.5 · Novelty 6.8 · Clarity 8

Abstract

We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities -- proof generation, proof verification, and critique-conditioned proof repair -- using a defense-in-depth generative verifier engineered for low false-positive rate. These capabilities are merged into a single released M3 model. At test time, MaxProof treats the model as a generator, verifier, refiner, and ranker, searches over a population of candidate proofs, and returns one final proof through tournament selection. With MaxProof test-time scaling, the M3 model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: MaxProof

1. Core Contribution

MaxProof presents an integrated framework for competition-level mathematical proof generation that operates at two levels: (a) a training pipeline that builds three complementary capabilities—proof generation, verification, and repair—into a single model using RL with a defense-in-depth generative verifier, and (b) a population-level test-time scaling framework that orchestrates these capabilities through evolutionary search and tournament selection.

The central novelty lies in the *compositional architecture*: rather than treating proof generation as a monolithic task, the system decomposes it into generate-verify-fix loops that share a consistent notion of correctness through a common pessimistic verifier. The defense-in-depth verifier design (four-layer pipeline with pessimistic min-aggregation) is well-motivated by concrete reward-hacking failure modes documented from the M2 cycle.
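To make "pessimistic min-aggregation" concrete, the sketch below contrasts taking the minimum over per-layer scores with averaging them. The four scores are made up for illustration, and the function name is hypothetical; the assessment does not specify what each of the four layers checks, so the layers are left unnamed here.

```python
from statistics import mean
from typing import Sequence


def pessimistic_min(layer_scores: Sequence[float]) -> float:
    """Aggregate per-layer verifier scores by taking the worst one."""
    if not layer_scores:
        raise ValueError("need at least one layer score")
    return min(layer_scores)


# Hypothetical scores from four verifier layers on a 0-7 scale.
layer_scores = [7.0, 7.0, 6.0, 1.0]

print(pessimistic_min(layer_scores))  # 1.0: a single objecting layer vetoes the proof
print(mean(layer_scores))             # 5.25: averaging would mask the objection
```

The design trade-off is that min-aggregation lowers the false-positive rate at the cost of rejecting some correct proofs whenever any one layer is overly strict.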

2. Methodological Rigor

Strengths in rigor:

The reward-hacking case study (Section 2.5, Appendix C) is exceptionally well-documented, with four concrete failure patterns (length bias, format hacking, semantic shortcut, judge-specific preference) illustrated by actual rollouts with both training-verifier and independent expert-judge scores. This is rare transparency.

The per-problem analysis (Table 3, Figure 8) provides granular diagnostics, including selection loss analysis and per-round oracle trajectories, which are more informative than aggregate scores alone.

The USAMO P2 failure (self-pick 2/7 vs oracle 6/7) is honestly reported and analyzed.
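Selection loss, as used above, can be read as the gap between the best candidate in the final pool (the oracle pick) and the candidate the system actually returned. The sketch below assumes that definition. Only the 2/7 self-pick and the 6/7 oracle score come from the text; the other pool entries and the function name are hypothetical.

```python
from typing import Sequence


def selection_loss(candidate_scores: Sequence[int], picked_index: int) -> int:
    """Oracle score minus the score of the candidate that was selected."""
    return max(candidate_scores) - candidate_scores[picked_index]


# Hypothetical final pool for USAMO P2: index 0 is the self-picked proof (2/7),
# index 1 is the best proof in the pool (6/7); the remaining entries are invented.
pool_scores = [2, 6, 0, 3]

print(selection_loss(pool_scores, picked_index=0))  # 4
```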

Concerns:

The evaluation protocol relies on MathArena-style 0–7 scoring by frontier LLMs (GPT-5.4 as judge), which introduces its own biases. Human expert verification is mentioned but only briefly.

The standalone benchmark comparison (Table 1) shows M3 significantly behind GPT-5.5 and Gemini 3.1 Pro on IMOProofBench (67.40 vs 90.85/75.71), making it unclear how much of the final contest performance derives from the base model versus MaxProof scaling.

The one-shot baseline (27/42 on IMO 2025) is not broken down by problem, making it harder to isolate where MaxProof's refinement vs. simple sampling contributes.

No formal ablation isolates the contribution of each verifier layer or the relative importance of PATCH vs. REWRITE refinement.

3. Potential Impact

Direct impact: The framework demonstrates that a model substantially behind frontier closed-source systems on static benchmarks can close the gap through structured test-time computation. This is a significant engineering insight—system design partially substitutes for model scale.

Broader implications:

The defense-in-depth verifier paradigm (pessimistic aggregation, multi-judge scoring) addresses a fundamental challenge in RL for reasoning: reward hacking in domains without executable ground truth. This applies beyond mathematics to any domain where verification is soft.

The population-level search framework is model-agnostic and could be applied to other long-form reasoning tasks (formal verification, legal analysis, scientific reasoning).

The detailed reward-hacking taxonomy (Appendix C) serves as a practical reference for anyone training reasoning models with generative verifiers.

Limitations to impact:

Compute cost is substantial: 32 initial candidates × 4 verifier samples × 10 refinement rounds × dual PATCH/REWRITE = thousands of inference calls per problem. This limits practical applicability.

The framework's effectiveness depends heavily on the base model having non-trivial best@K capability, limiting its value for weaker models.
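The compute-cost estimate in the first limitation can be checked with a quick multiplication. This is a rough sketch that takes the four factors at face value and multiplies them as the assessment does; the variable names are illustrative, and the real call count depends on details not given here, such as how many candidates survive each round and how many tournament comparisons are run.

```python
# Factors as listed in the assessment's compute-cost estimate.
initial_candidates = 32
verifier_samples = 4
refinement_rounds = 10
refine_modes = 2  # PATCH and REWRITE

calls_per_problem = (
    initial_candidates * verifier_samples * refinement_rounds * refine_modes
)
print(calls_per_problem)  # 2560
```

At 2,560 calls per problem under this reading, a six-problem contest would take on the order of fifteen thousand inference calls before counting initial generation and final ranking.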

4. Timeliness & Relevance

This work arrives at a critical juncture. Competition-level mathematical proof is the current frontier for reasoning models, with IMO 2025 and USAMO 2026 serving as flagship benchmarks. The paper directly addresses two bottlenecks: (a) how to train with noisy generative verifiers without reward hacking, and (b) how to convert best@K potential into reliable pass@1 performance. Both problems are acutely felt across the reasoning-model community.

The evolutionary/population-based framing of test-time scaling is timely, as the field moves from simple majority voting to more structured inference-time computation.

5. Strengths & Limitations

Key Strengths:

*Transparency*: The M2 failure analysis and reward-hacking case studies are unusually candid and educational.

*Complete pipeline*: End-to-end coverage from data curation through training to test-time scaling, with all components sharing a consistent verifier.

*Concrete results*: 35/42 on IMO 2025 and 36/42 on USAMO 2026 are state-of-the-art for open-weight models.

*Full outputs*: The appendix provides full model outputs for all 12 problems, enabling independent verification.

*Self-aware limitations*: The paper explicitly identifies where the model fails (IMO P6, conservative argumentation style, dependence on multi-round search).

Notable Weaknesses:

*Limited ablation*: No systematic study of how performance scales with population size, number of refinement rounds, or verifier configuration.

*Reproducibility concerns*: The base M3 model training details are sparse, and the verifier pipeline depends on "strong external" models whose identity is not specified.

*Selection loss*: The USAMO P2 failure (4-point selection loss) reveals a fundamental weakness in tournament-based selection that is acknowledged but not resolved.

*No comparison with simpler baselines*: How does MaxProof compare to simple best-of-N with the same compute budget? The claimed "far less than 10 points" for sampling baselines is not empirically verified.

*Gap to frontier systems*: The gap remains large on static benchmarks (67.40 vs 90.85 on IMOProofBench), suggesting the test-time scaling is compensating for base model weakness rather than demonstrating a generalizable advantage.

6. Additional Observations

The paper's honest positioning—"we are still followers chasing the frontier"—is refreshing. The contribution is best understood as an engineering systems paper that demonstrates how to orchestrate multiple capabilities for a difficult reasoning task, rather than a fundamental algorithmic advance. The reward-hacking documentation alone has independent pedagogical value.

Rating: 7.2 / 10

Significance 7.5 · Rigor 6.5 · Novelty 6.8 · Clarity 8

Generated Jun 12, 2026

Comparison History (16)

Won vs. The Geometry of Phase Transitions in Generative Dynamics via Projection Caustics

MaxProof demonstrates a breakthrough in automated mathematical theorem proving, achieving gold-medal level performance on IMO 2025 and USAMO 2026 — surpassing human gold medalists. This represents a landmark result in AI for mathematics with enormous implications for automated reasoning, formal verification, and mathematical discovery. While Paper 1 offers elegant geometric insights into diffusion model dynamics with practical diagnostics, its impact is more incremental and confined to understanding generative models. Paper 2's concrete, record-setting results on prestigious competitions will attract far broader attention and inspire significant follow-up work across AI and mathematics.

claude-opus-4-6 · Jun 12, 2026

Lost vs. Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models

Paper 1 provides fundamental mechanistic insights into Chain-of-Thought reasoning, discovering the 'commitment boundary' and showing that many CoT steps are epiphenomenal. This challenges existing assumptions about LLM reasoning and offers a broadly applicable method to reduce inference compute by up to 55% across diverse tasks. Paper 2, while demonstrating impressive state-of-the-art results on math benchmarks via test-time scaling, is primarily an engineering achievement in a specific domain, making Paper 1's foundational discoveries more broadly impactful across AI research.

gemini-3.1-pro-preview · Jun 12, 2026

Lost vs. Extracting Governing Equations from Latent Dynamics via Multi-View Contrastive Learning

DYSCO addresses a fundamental problem in scientific discovery—extracting governing equations from noisy high-dimensional data—with theoretical identifiability guarantees and broad applicability across scientific domains (neuroscience, physics, biology). Its contributions span representation learning, system identification, and symbolic regression, offering lasting methodological impact. Paper 2, while impressive in achieving gold-medal-level math competition performance, is more narrowly focused on benchmark achievement through engineering (test-time scaling, tournament selection) rather than introducing fundamentally new scientific principles. Paper 1's theoretical framework and cross-disciplinary relevance give it broader long-term impact.

claude-opus-4-6 · Jun 12, 2026

Lost vs. A2D2: Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding

Paper 2 (A2D2) likely has higher scientific impact because it contributes a broadly applicable, theoretically grounded framework for reward-guided fine-tuning and decoding in any-length discrete diffusion models, including a Radon–Nikodym derivation and convergence guarantees. This can influence multiple areas of sequence generation (NLP, code, biological sequences) and provides reusable principles/losses (AJD) beyond a single benchmark. Paper 1 is impressive and timely for automated theorem proving, but appears more system/engineering- and benchmark-driven with narrower cross-field methodological generality.

gpt-5.2 · Jun 12, 2026

Lost vs. Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

Paper 2 likely has higher scientific impact: it advances a core, widely reusable component (optimization) for large-scale pretraining, with clear methodological insight (spike-and-flat spectrum), a principled hybrid of Kronecker preconditioning and orthogonalization, and demonstrated gains in loss, memory, and wallclock across multiple model scales. This can broadly affect many LLM trainings and potentially other deep learning domains. Paper 1 is impressive and timely, but is more benchmark/task-specific and may depend heavily on engineering choices, limiting breadth and reproducibility relative to an optimizer contribution.

gpt-5.2 · Jun 12, 2026

Won vs. Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization

Paper 2 demonstrates a major breakthrough in AI mathematical reasoning, achieving gold-medal thresholds on prestigious competitions like IMO. Its focus on test-time scaling and generative-verifier RL aligns with cutting-edge trends in reasoning, offering profound implications for AGI and automated theorem proving. Paper 1 is methodologically rigorous but its impact is narrower, primarily concerning transformer optimization configurations.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update

MaxProof demonstrates a breakthrough in automated mathematical theorem proving, achieving super-human (gold-medal level) performance on IMO 2025 and USAMO 2026 — a landmark result in AI. This represents a fundamental milestone comparable to AlphaGo or AlphaFold, with enormous implications for mathematics, formal verification, and AI reasoning research. Its novelty in combining generative-verifier RL with population-level test-time scaling at competition level is highly impactful. While PolyFlow offers a solid contribution to constrained generative modeling with clear practical value, its incremental nature and narrower scope limit its comparative impact.

claude-opus-4-6 · Jun 12, 2026

Won vs. Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

MaxProof demonstrates a breakthrough in automated mathematical theorem proving, achieving gold-medal-level performance on IMO 2025 and USAMO 2026 — competitions at the frontier of human mathematical ability. This represents a landmark capability milestone with broad implications for AI reasoning, formal verification, and mathematics itself. While AuthorityBench addresses an important and timely question about LLM epistemic vulnerabilities with rigorous experimental design, its findings (citation presence increases hallucination) are somewhat expected and incremental. MaxProof's achievement is more transformative, likely to attract far greater attention and inspire substantial follow-up research.

claude-opus-4-6 · Jun 12, 2026

Won vs. Loss-Shift Transfer via Bayes Quotients

Paper 1 represents a major breakthrough in AI reasoning, achieving gold-medal performance on high-profile benchmarks like the IMO and USAMO. The introduction of population-level test-time scaling addresses a critical bottleneck in LLM reasoning capabilities. While Paper 2 offers a solid theoretical contribution to transfer learning, Paper 1 solves a highly visible grand challenge in artificial intelligence, virtually guaranteeing broader immediate attention, extensive follow-up research, and significant real-world impact in automated theorem proving and advanced reasoning systems.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. Beyond representational alignment with brain-guided language models for robust reasoning

MaxProof demonstrates unprecedented performance on elite mathematical competitions (IMO 2025, USAMO 2026), exceeding human gold-medal thresholds. This represents a landmark achievement in AI mathematical reasoning with immediate, verifiable real-world impact. The framework combining generative-verifier RL with population-level test-time scaling introduces practical innovations with broad applicability. While Paper 1 presents an interesting neuroscience-AI bridge concept, its gains (up to 13% accuracy) are more incremental, and the brain-guided approach faces scalability limitations. Paper 2's results are more transformative for the field and will likely attract significantly more attention and follow-up work.

claude-opus-4-6 · Jun 12, 2026
