Jiacheng Chen, Xinyu Zhang, Shunkai Zhang, Yanmohan Wang, Lin Li, Tiancheng Qin, Qin Wang, Zhengmao Zhu
We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities -- proof generation, proof verification, and critique-conditioned proof repair -- using a defense-in-depth generative verifier engineered for low false-positive rate. These capabilities are merged into a single released M3 model. At test time, MaxProof treats the model as a generator, verifier, refiner, and ranker, searches over a population of candidate proofs, and returns one final proof through tournament selection. With MaxProof test-time scaling, the M3 model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both.
MaxProof presents an integrated framework for competition-level mathematical proof generation that operates at two levels: (a) a training pipeline that uses reinforcement learning (RL) with a defense-in-depth generative verifier to build three complementary capabilities (proof generation, verification, and repair) into a single model, and (b) a population-level test-time scaling framework that orchestrates these capabilities through evolutionary search and tournament selection.
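The test-time loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names (`population_search`, `tournament_select`), the parameters `pop_size`, `rounds`, and `keep`, and the single-elimination bracket are all assumptions, and the toy callables at the bottom stand in for calls to the model acting as generator, verifier, refiner, and ranker.

```python
import random
from typing import Callable, List, TypeVar

P = TypeVar("P")  # a candidate proof; its representation is left open here


def tournament_select(candidates: List[P],
                      compare: Callable[[P, P], P],
                      rng: random.Random) -> P:
    """Single-elimination tournament (assumed format): the winner of each pair advances."""
    pool = list(candidates)
    rng.shuffle(pool)
    while len(pool) > 1:
        next_round = [compare(pool[i], pool[i + 1])
                      for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # an unpaired candidate gets a bye
        pool = next_round
    return pool[0]


def population_search(generate: Callable[[], P],
                      verify: Callable[[P], float],
                      refine: Callable[[P], P],
                      compare: Callable[[P, P], P],
                      pop_size: int, rounds: int, keep: int,
                      rng: random.Random) -> P:
    """Generate a population, repeatedly repair the weaker members, then rank."""
    population = [generate() for _ in range(pop_size)]
    for _ in range(rounds):
        ranked = sorted(population, key=verify, reverse=True)
        survivors = ranked[:keep]                      # best-verified proofs carry over
        repaired = [refine(p) for p in ranked[keep:]]  # the rest go through repair
        population = survivors + repaired
    return tournament_select(population, compare, rng)


# Toy stand-ins (hypothetical): a "proof" is an integer quality level from 0 to 10.
_counter = iter(range(4))
result = population_search(
    generate=lambda: next(_counter),   # yields 0, 1, 2, 3
    verify=lambda p: p / 10,           # verifier score in [0, 1]
    refine=lambda p: min(p + 1, 10),   # repair improves quality by one step
    compare=max,                       # ranker prefers the higher-quality proof
    pop_size=4, rounds=2, keep=2,
    rng=random.Random(0),
)
```

In this toy setting the verifier and the ranker agree perfectly, so the tournament simply returns the best member of the final population. In a real system both are noisy model judgments, which is the reason for spending extra computation on a separate selection stage.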
The central novelty lies in the *compositional architecture*: rather than treating proof generation as a monolithic task, the system decomposes it into generate-verify-fix loops that share a consistent notion of correctness through a common pessimistic verifier. The defense-in-depth verifier design (four-layer pipeline with pessimistic min-aggregation) is well-motivated by concrete reward-hacking failure modes documented from the M2 cycle.
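Pessimistic min-aggregation can be illustrated with a short sketch. The layer interface, the score range of [0, 1], and the four constant stand-in layers are assumptions made for illustration; the actual contents of the four layers are not described here.

```python
from typing import Callable, Sequence

# Assumed interface: a layer maps (problem, proof) to a score in [0, 1].
Layer = Callable[[str, str], float]


def pessimistic_score(problem: str, proof: str, layers: Sequence[Layer]) -> float:
    """Return the minimum layer score, so a single failing layer sinks the proof."""
    if not layers:
        raise ValueError("at least one verifier layer is required")
    return min(layer(problem, proof) for layer in layers)


def make_constant_layer(score: float) -> Layer:
    """Hypothetical stand-in for a real verifier layer; always returns `score`."""
    return lambda problem, proof: score


# Four stand-in layers; the third one objects strongly to the proof.
layers = [make_constant_layer(s) for s in (0.9, 0.95, 0.2, 0.85)]
score = pessimistic_score("toy problem", "toy proof", layers)
mean_score = sum(layer("toy problem", "toy proof") for layer in layers) / len(layers)
```

With these numbers the averaged score is 0.725 while the pessimistic score is 0.2. Under averaging, a proof that fools three layers would still collect most of the reward, whereas the minimum lets any one layer veto it, which is how this design keeps the false-positive rate low.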
Direct impact: The framework demonstrates that a model substantially behind frontier closed-source systems on static benchmarks can close the gap through structured test-time computation. This is a significant engineering insight—system design partially substitutes for model scale.
This work arrives at a critical juncture. Competition-level mathematical proof is the current frontier for reasoning models, with IMO 2025 and USAMO 2026 serving as flagship benchmarks. The paper directly addresses two bottlenecks: (a) how to train with noisy generative verifiers without reward hacking, and (b) how to convert best@K potential into reliable pass@1 performance. Both problems are acutely felt across the reasoning-model community.
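The gap between best@K and pass@1 can be made concrete with a small calculation. The definitions below follow the common reading of the two terms (best@K: at least one of K samples is correct; pass@1: the expected correctness of a single sample) and are assumed rather than taken from the paper; the sample outcomes are invented toy data.

```python
from typing import List


def pass_at_1(results: List[List[bool]]) -> float:
    """Average, over problems, of the fraction of samples that are correct."""
    return sum(sum(samples) / len(samples) for samples in results) / len(results)


def best_at_k(results: List[List[bool]]) -> float:
    """Fraction of problems for which at least one of the K samples is correct."""
    return sum(any(samples) for samples in results) / len(results)


# Toy data: 3 problems, K = 4 sampled proofs each (True = judged correct).
toy_results = [
    [True, False, False, False],
    [False, False, False, False],
    [True, True, False, False],
]
p1 = pass_at_1(toy_results)   # (0.25 + 0.0 + 0.5) / 3
bk = best_at_k(toy_results)   # 2 of the 3 problems have a correct sample
```

Here pass@1 is 0.25 while best@4 is about 0.67. A selection procedure that reliably picks the correct sample whenever one exists would raise the single-answer score toward the best@K figure, which is the conversion the second bottleneck refers to.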
The evolutionary/population-based framing of test-time scaling is timely, as the field moves from simple majority voting to more structured inference-time computation.
The paper's honest positioning—"we are still followers chasing the frontier"—is refreshing. The contribution is best understood as an engineering systems paper that demonstrates how to orchestrate multiple capabilities for a difficult reasoning task, rather than a fundamental algorithmic advance. The reward-hacking documentation alone has independent pedagogical value.
Generated Jun 12, 2026
MaxProof demonstrates a breakthrough in automated mathematical theorem proving, achieving gold-medal-level performance on IMO 2025 and USAMO 2026 by exceeding the human gold-medal threshold on both. This represents a landmark result in AI for mathematics with enormous implications for automated reasoning, formal verification, and mathematical discovery. While Paper 1 offers elegant geometric insights into diffusion model dynamics with practical diagnostics, its impact is more incremental and confined to understanding generative models. Paper 2's concrete, record-setting results on prestigious competitions will attract far broader attention and inspire significant follow-up work across AI and mathematics.
Paper 1 provides fundamental mechanistic insights into Chain-of-Thought reasoning, discovering the 'commitment boundary' and showing that many CoT steps are epiphenomenal. This challenges existing assumptions about LLM reasoning and offers a broadly applicable method to reduce inference compute by up to 55% across diverse tasks. Paper 2, while demonstrating impressive state-of-the-art results on math benchmarks via test-time scaling, is primarily an engineering achievement in a specific domain, making Paper 1's foundational discoveries more broadly impactful across AI research.
DYSCO addresses a fundamental problem in scientific discovery—extracting governing equations from noisy high-dimensional data—with theoretical identifiability guarantees and broad applicability across scientific domains (neuroscience, physics, biology). Its contributions span representation learning, system identification, and symbolic regression, offering lasting methodological impact. Paper 2, while impressive in achieving gold-medal-level math competition performance, is more narrowly focused on benchmark achievement through engineering (test-time scaling, tournament selection) rather than introducing fundamentally new scientific principles. Paper 1's theoretical framework and cross-disciplinary relevance give it broader long-term impact.
Paper 2 (A2D2) likely has higher scientific impact because it contributes a broadly applicable, theoretically grounded framework for reward-guided fine-tuning and decoding in any-length discrete diffusion models, including a Radon–Nikodym derivation and convergence guarantees. This can influence multiple areas of sequence generation (NLP, code, biological sequences) and provides reusable principles/losses (AJD) beyond a single benchmark. Paper 1 is impressive and timely for automated theorem proving, but appears more system/engineering- and benchmark-driven with narrower cross-field methodological generality.
Paper 2 likely has higher scientific impact: it advances a core, widely reusable component (optimization) for large-scale pretraining, with clear methodological insight (spike-and-flat spectrum), a principled hybrid of Kronecker preconditioning and orthogonalization, and demonstrated gains in loss, memory, and wallclock across multiple model scales. This can broadly affect many LLM trainings and potentially other deep learning domains. Paper 1 is impressive and timely, but is more benchmark/task-specific and may depend heavily on engineering choices, limiting breadth and reproducibility relative to an optimizer contribution.
Paper 2 demonstrates a major breakthrough in AI mathematical reasoning, exceeding the human gold-medal thresholds on IMO 2025 and USAMO 2026. Its focus on test-time scaling and generative-verifier RL aligns with cutting-edge trends in reasoning, offering profound implications for AGI and automated theorem proving. Paper 1 is methodologically rigorous, but its impact is narrower, primarily concerning transformer optimization configurations.
MaxProof demonstrates a breakthrough in automated mathematical theorem proving, achieving gold-medal-level performance on IMO 2025 and USAMO 2026, a landmark result in AI. This represents a fundamental milestone comparable to AlphaGo or AlphaFold, with enormous implications for mathematics, formal verification, and AI reasoning research. Its combination of generative-verifier RL with population-level test-time scaling at competition level is novel and highly impactful. While PolyFlow offers a solid contribution to constrained generative modeling with clear practical value, its incremental nature and narrower scope limit its comparative impact.
MaxProof demonstrates a breakthrough in automated mathematical theorem proving, achieving gold-medal-level performance on IMO 2025 and USAMO 2026 — competitions at the frontier of human mathematical ability. This represents a landmark capability milestone with broad implications for AI reasoning, formal verification, and mathematics itself. While AuthorityBench addresses an important and timely question about LLM epistemic vulnerabilities with rigorous experimental design, its findings (citation presence increases hallucination) are somewhat expected and incremental. MaxProof's achievement is more transformative, likely to attract far greater attention and inspire substantial follow-up research.
Paper 1 represents a major breakthrough in AI reasoning, achieving gold-medal performance on high-profile benchmarks like the IMO and USAMO. The introduction of population-level test-time scaling addresses a critical bottleneck in LLM reasoning capabilities. While Paper 2 offers a solid theoretical contribution to transfer learning, Paper 1 solves a highly visible grand challenge in artificial intelligence, virtually guaranteeing broader immediate attention, extensive follow-up research, and significant real-world impact in automated theorem proving and advanced reasoning systems.
MaxProof demonstrates unprecedented performance on elite mathematical competitions (IMO 2025, USAMO 2026), exceeding human gold-medal thresholds. This represents a landmark achievement in AI mathematical reasoning with immediate, verifiable real-world impact. The framework combining generative-verifier RL with population-level test-time scaling introduces practical innovations with broad applicability. While Paper 1 presents an interesting neuroscience-AI bridge concept, its gains (up to 13% accuracy) are more incremental, and the brain-guided approach faces scalability limitations. Paper 2's results are more transformative for the field and will likely attract significantly more attention and follow-up work.