AI Co-Mathematician: Accelerating Mathematicians with Agentic AI

Daniel Zheng, Ingrid von Glehn, Yori Zwols, Iuliya Beloshapka, Lars Buesing, Daniel M. Roy, Martin Wattenberg, Bogdan Georgiev

May 7, 2026

arXiv:2605.06651v1

cs.AI (primary)

#119 of 2292 · Artificial Intelligence

Tournament Score: 1536 ± 46 (scale shown: 1050 to 1800)

Win Rate: 95%

Rating: 7.2 / 10

Significance 7.5 · Rigor 5.5 · Novelty 7 · Clarity 8

Abstract

We introduce the AI co-mathematician, a workbench for mathematicians to interactively leverage AI agents to pursue open-ended research. The AI co-mathematician is optimized to provide holistic support for the exploratory and iterative reality of mathematical workflows, including ideation, literature search, computational exploration, theorem proving and theory building. By providing an asynchronous, stateful workspace that manages uncertainty, refines user intent, tracks failed hypotheses, and outputs native mathematical artifacts, the system mirrors human collaborative workflows. In early tests, the AI co-mathematician helped researchers solve open problems, identify new research directions, and uncover overlooked literature references. Besides demonstrating a highly interactive paradigm for AI-assisted mathematical discovery, the AI co-mathematician also achieves state of the art results on hard problem-solving benchmarks, including scoring 48% on FrontierMath Tier 4, a new high score among all AI systems evaluated.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: AI Co-Mathematician

1. Core Contribution

The paper introduces the AI co-mathematician, an agentic AI workbench designed to support the full spectrum of mathematical research workflows—not just theorem proving, but ideation, literature search, computational exploration, and iterative refinement. The key insight is that existing AI tools for mathematics (chatbots, formal provers, evolutionary search) operate in isolation, leaving the researcher as manual "connective tissue." The system addresses this by providing a stateful, asynchronous, hierarchically organized multi-agent workspace with a project coordinator, workstream coordinators, and specialized sub-agents, all communicating through an internal messaging system and writing to a shared filesystem.
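The delegation pattern described above can be pictured with a minimal sketch. It is an illustration only, not the authors' implementation: the class names, the round-robin assignment, the in-memory dictionary standing in for the shared filesystem, and the placeholder in place of a model call are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Shared state: a message log plus an in-memory stand-in for the filesystem."""
    messages: list = field(default_factory=list)  # (sender, recipient, body) tuples
    files: dict = field(default_factory=dict)     # path -> contents

    def send(self, sender: str, recipient: str, body: str) -> None:
        self.messages.append((sender, recipient, body))


@dataclass
class SubAgent:
    name: str

    def run(self, task: str, ws: Workspace) -> str:
        # Hypothetical placeholder for a model call: just write a note to the workspace.
        path = f"{self.name}/{task}.md"
        ws.files[path] = f"notes on {task}"
        return path


@dataclass
class WorkstreamCoordinator:
    name: str
    sub_agents: list

    def run(self, tasks: list, ws: Workspace) -> list:
        paths = []
        for i, task in enumerate(tasks):
            agent = self.sub_agents[i % len(self.sub_agents)]  # round-robin (assumed policy)
            ws.send(self.name, agent.name, task)
            paths.append(agent.run(task, ws))
        return paths


@dataclass
class ProjectCoordinator:
    workstreams: dict

    def run(self, plan: dict, ws: Workspace) -> list:
        paths = []
        for stream, tasks in plan.items():
            ws.send("project", stream, f"{len(tasks)} task(s)")
            paths.extend(self.workstreams[stream].run(tasks, ws))
        return paths


ws = Workspace()
project = ProjectCoordinator({
    "literature": WorkstreamCoordinator("literature", [SubAgent("searcher")]),
    "proof": WorkstreamCoordinator("proof", [SubAgent("prover"), SubAgent("reviewer")]),
})
artifacts = project.run({"literature": ["survey"], "proof": ["lemma1", "lemma2"]}, ws)
print(artifacts)
```

The point of the sketch is the shape: every hand-off leaves a message in the log and every result lands in shared storage, so state persists across agents rather than living in one chat transcript.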

The analogy to AI-powered software engineering environments (Codex, Claude Code, Google Antigravity) is well-drawn: the authors argue that software engineering has pre-existing paradigms (specs, CI/CD, version control) that enable effective agentic AI, while mathematics lacks these, necessitating purpose-built infrastructure.

2. Methodological Rigor

The paper is primarily a systems paper with qualitative case studies and benchmark evaluations. The methodological rigor is mixed:

Strengths in evaluation:

The FrontierMath Tier 4 evaluation was conducted blind by Epoch AI, with 48% accuracy (23/48 non-public problems), establishing a new SOTA. This is a credible external evaluation.

An internal benchmark of 100 research-level problems shows clear improvements over base Gemini 3.1 Pro and Deep Think models.

Three detailed case studies with named professional mathematicians (Lackenby, Bérczi, Rezchikov) provide concrete evidence of utility.
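To put the FrontierMath figure in perspective, the short calculation below reproduces the 23/48 proportion and attaches a 95% Wilson score interval to it. The interval is not reported in the paper or in this assessment. It is an added illustration of how much sampling uncertainty a 48-problem benchmark carries, under the simplifying assumption that problems are independent trials.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half


low, high = wilson_interval(23, 48)
print(f"accuracy {23 / 48:.1%}, 95% Wilson interval {low:.1%} to {high:.1%}")
```

With only 48 problems the interval is wide, roughly 34% to 62%, which is worth keeping in mind when comparing systems whose scores differ by a few problems.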

Weaknesses:

The comparison to base models is confounded by dramatically different compute budgets. The system runs for up to 48 hours with multiple parallel agents, while base models get single calls. The authors acknowledge this but don't quantify the compute differential.

The qualitative case studies, while compelling, are cherry-picked successes. The authors note "there has been a range of feedback and satisfaction with the tool" but don't systematically characterize failure modes or success rates in interactive use.

No controlled user study compares mathematicians using the co-mathematician with mathematicians using standard tools (e.g., ChatGPT + code) on matched tasks.

The seven design principles, while thoughtfully articulated, lack formal validation beyond anecdotal evidence.

3. Potential Impact

High potential impact areas:

Mathematical research practice: If broadly deployed, this could fundamentally change how mathematicians work, analogous to how coding assistants have transformed software engineering. The Lackenby case (solving a Kourovka Notebook problem) is particularly striking—the system generated a clever but flawed proof strategy that the expert could complete.

AI systems design: The architectural patterns (hierarchical agent delegation, adversarial review loops, progressive disclosure, persistent failure tracking) are transferable to other scientific domains beyond mathematics.

Benchmark performance: 48% on FrontierMath Tier 4 is notable and demonstrates that agentic orchestration adds real problem-solving capability on top of base models.

Broader implications:

The paper thoughtfully discusses systemic risks: signal-to-noise degradation in mathematical literature, burden on peer review, and the "reviewer-pleasing bias" where agents learn to satisfy AI reviewers without actually fixing errors. These are prescient concerns.

4. Timeliness & Relevance

This paper is extremely timely. It arrives at the intersection of three trends: (1) rapid improvement in LLM mathematical reasoning, (2) the explosion of agentic AI systems in software engineering, and (3) growing interest from the mathematical community in AI tools (as evidenced by projects like Aletheia, AlphaProof, and community engagement from figures like Terence Tao). The paper correctly identifies that the bottleneck has shifted from raw problem-solving to workflow orchestration—a claim supported by the saturation of simpler benchmarks like MATH and GSM8K.

The argument that AI-for-mathematics evaluation should move beyond static benchmarks toward measuring collaborative efficacy is important and likely to influence the evaluation methodology of future systems.

5. Strengths & Limitations

Key Strengths:

Design philosophy is well-grounded: Drawing on Pólya, Lakatos, Thurston, and Putnam situates the work in genuine understanding of mathematical practice, not just engineering convenience.

Honest treatment of failure modes: The discussion of "reviewer-pleasing bias," death spirals, and the semantic mismatch between polished LaTeX and actual rigor is unusually candid for a systems paper from a major lab.

Concrete case studies with real mathematicians: The Kourovka problem resolution is a genuinely notable achievement—an open problem solved through human-AI collaboration where neither party could have succeeded alone.

Principled uncertainty management: The explicit tracking, management, and communication of uncertainty through margin annotations and version history is a well-conceived approach to the hallucination problem.

Notable Limitations:

Limited access and reproducibility: The system is proprietary, subject to "limited initial release," built on commercial Gemini models, and not reproducible. This severely limits independent verification and community adoption.

No formal ablation study: It's unclear which components (review loops, parallel workstreams, literature search, progressive disclosure) contribute most to performance gains. The gap between the system's 48% and the base model's 19% could partially stem from simply using more compute/sampling.

Small-scale qualitative evaluation: Three case studies with sympathetic early users are insufficient to draw robust conclusions about utility. Selection bias is evident.

Missing comparison to simpler baselines: How does the co-mathematician compare to, say, a skilled researcher using ChatGPT plus Jupyter notebooks for 48 hours? Or to majority voting over many independent model runs?

The walkthrough is hypothetical: Section 3's sofa problem walkthrough appears to be a constructed scenario rather than a documented real session, reducing its evidentiary value.
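The majority-voting baseline raised in the list above is cheap to specify, which is part of why its absence stands out. A minimal sketch follows, assuming each independent model run returns a final answer as a string. The sample answers are invented for illustration and do not come from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and its share of the votes.

    Ties resolve to whichever tied answer appeared first in the input.
    """
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical final answers from five independent runs on one problem.
samples = ["12", "15", "12", "12", "7"]
answer, share = majority_vote(samples)
print(answer, share)
```

Running such a baseline at a compute budget matched to the agentic system would help separate the contribution of orchestration from the contribution of extra sampling.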

Additional Observations

The paper represents an important conceptual contribution—articulating and demonstrating that the orchestration layer for AI-assisted mathematics is a distinct and valuable research direction. However, the empirical evidence is thin relative to the claims. The FrontierMath result is the strongest quantitative evidence but is complicated by compute incomparability. The qualitative case studies are suggestive but not systematic.

The paper's lasting impact may be more in establishing a paradigm (interactive, stateful, uncertainty-aware AI workbenches for mathematics) than in the specific system described, particularly given its proprietary nature.

Rating: 7.2 / 10

Significance 7.5 · Rigor 5.5 · Novelty 7 · Clarity 8

Generated May 8, 2026

Comparison History (20)

vs. Generative Recursive Reasoning

claude-opus-4.6 · 5/21/2026

The AI Co-Mathematician demonstrates immediate, tangible scientific impact by helping mathematicians solve open problems and achieving state-of-the-art results on FrontierMath (48% on Tier 4). It addresses a broad, practical need across all of mathematics with a deployed interactive system. While GRAM introduces a theoretically interesting framework for probabilistic recursive reasoning, it remains more preliminary, tested on structured reasoning benchmarks rather than demonstrating real-world breakthroughs. The Co-Mathematician's combination of practical utility, benchmark performance, and direct impact on mathematical discovery gives it higher near-term and broad scientific impact.

vs. Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

claude-opus-4.6 · 5/18/2026

The AI Co-Mathematician represents a paradigm shift in AI-assisted mathematical research, demonstrating practical utility in solving open problems and achieving SOTA on FrontierMath Tier 4 (48%). Its potential to accelerate mathematical discovery across all fields gives it extraordinary breadth of impact. While Fully Open Meditron makes important contributions to transparency and reproducibility in clinical LLMs—addressing a critical need—its improvements are incremental (e.g., +6.6 points) and focused on a narrower domain. The co-mathematician's novel agentic workflow paradigm and demonstrated ability to solve open problems suggest transformative potential.

vs. LLM4Branch: Large Language Model for Discovering Efficient Branching Policies of Integer Programs

gemini-3.1 · 5/16/2026

Paper 1 has a higher potential scientific impact due to its broader scope and fundamental implications for scientific discovery. While Paper 2 offers a valuable algorithmic improvement for MILP solvers, Paper 1 introduces a paradigm shift in how mathematical research is conducted, potentially accelerating discoveries across all mathematical fields. Furthermore, achieving a new state-of-the-art on the extremely difficult FrontierMath benchmark and actively solving open problems demonstrates a profound leap in AI-assisted reasoning that transcends a single optimization subfield.

vs. Large Vision-Language Models Get Lost in Attention

claude-opus-4.6 · 5/16/2026

The AI Co-Mathematician demonstrates both practical impact (solving open mathematical problems, state-of-the-art on FrontierMath benchmarks) and introduces a novel paradigm for human-AI collaboration in mathematics. It addresses a broad, high-value domain with immediate real-world applications for researchers. Paper 2 provides interesting theoretical insights about attention redundancy in LVLMs, but its findings are more analytical/diagnostic. While the attention redundancy finding is provocative, Paper 1's combination of a functional system achieving new benchmarks and enabling mathematical discovery gives it broader and deeper potential impact across mathematics and AI research.

vs. Environmental Footprint of GenAI Research: Insights from the Moshi Foundation Model

gpt-5.2 · 5/16/2026

Paper 1 has higher potential impact due to greater novelty (agentic, stateful AI workbench tailored to real mathematical research), strong real-world applicability across the mathematics/CS research workflow, and demonstrated performance gains (state-of-the-art benchmark results plus early evidence of solving open problems). Its breadth is wider: methods could generalize to other scientific domains needing iterative reasoning and artifact generation. Paper 2 is timely and methodologically valuable (fine-grained LCA + compute accounting) with important policy implications, but its scope is narrower and likely more incremental/diagnostic than transformative.

vs. How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

claude-opus-4.6 · 5/11/2026

The AI Co-Mathematician represents a more impactful contribution: it introduces a novel paradigm for AI-assisted mathematical discovery with demonstrated real-world utility (solving open problems, achieving SOTA on FrontierMath benchmarks at 48% on Tier 4). It addresses a fundamental challenge in augmenting human intellectual work and has broad implications across mathematics and AI research. Paper 2, while valuable as a benchmark for embodied navigation, is more incremental—evaluating existing LMMs on a new dataset rather than introducing a transformative system. The practical impact of Paper 1 on accelerating mathematical research gives it substantially higher potential.

vs. BEAM: Bi-level Memory-adaptive Algorithmic Evolution for LLM-Powered Heuristic Design

gemini-3.1 · 5/11/2026

Paper 2 introduces an AI system for open-ended mathematical discovery, an area with profound implications across all STEM fields. By achieving state-of-the-art results on highly complex benchmarks like FrontierMath and actively assisting researchers in solving open problems, it demonstrates immense potential to accelerate fundamental scientific progress. While Paper 1 presents an innovative bi-level approach to hyper-heuristic design, its impact is largely confined to the specific domain of combinatorial optimization, making its overall breadth and potential transformative impact narrower than Paper 2.

vs. The Context Gathering Decision Process: A POMDP Framework for Agentic Search

gemini-3.1 · 5/11/2026

Paper 2 demonstrates a profound leap in AI reasoning by directly assisting in solving open mathematical problems and achieving state-of-the-art results on the highly rigorous FrontierMath benchmark. While Paper 1 provides a valuable foundational framework for agentic search, Paper 2's proven ability to accelerate actual scientific discovery and theory building represents a breakthrough in AI for science, likely driving massive interest and follow-up research across both the AI and mathematics communities.

vs. VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection

gpt-5.2 · 5/11/2026

Paper 1 has higher potential impact due to greater novelty (agentic, stateful co-research workbench tailored to real mathematical workflows) and broader real-world applications (accelerating research via ideation, search, computation, theorem proving). Its claims of helping solve open problems plus strong benchmark performance suggest meaningful capability gains and timeliness in AI-for-science/math. Paper 2 is a useful, methodologically solid efficiency improvement to confidence-weighted self-consistency, but is more incremental and narrower in scope, mainly affecting inference cost rather than enabling qualitatively new research workflows.

vs. BALAR: A Bayesian Agentic Loop for Active Reasoning

claude-opus-4.6 · 5/11/2026

The AI Co-Mathematician demonstrates higher scientific impact for several reasons: (1) it addresses a broader, more fundamental challenge of AI-assisted mathematical discovery rather than a specific interaction paradigm; (2) it achieves state-of-the-art results on FrontierMath Tier 4, a prestigious benchmark, demonstrating concrete capability advances; (3) it has already produced real-world results (solving open problems, finding new research directions); (4) its holistic approach spanning ideation through theorem proving has broader applicability; and (5) mathematical discovery tools have outsized impact across all STEM fields. BALAR's Bayesian approach is principled but more incremental.

vs. Beyond the Black Box: Interpretability of Agentic AI Tool Use

gemini-3.1 · 5/11/2026

While Paper 2 presents an impressive domain-specific achievement with SOTA results in mathematics, Paper 1 addresses a fundamental and universal bottleneck in AI: the black-box nature of agentic tool use. By applying mechanistic interpretability (SAEs) to predict and control agent actions before execution, Paper 1 offers broad, cross-disciplinary impact. Its framework directly tackles critical safety, security, and reliability issues required for deploying AI agents in high-stakes, real-world enterprise applications across all fields, giving it a higher potential scientific and practical impact.

vs. Reward Design for Physical Reasoning in Vision-Language Models

gemini-3.1 · 5/8/2026

Paper 2 introduces an agentic system that directly accelerates mathematical discovery, demonstrated by solving open problems and achieving state-of-the-art on the highly challenging FrontierMath benchmark. Its interactive approach has profound implications for AI-assisted research and theory building. In contrast, Paper 1 presents a valuable but narrower empirical study on reward design for physical reasoning in Vision-Language Models. Due to its demonstrated real-world utility in scientific discovery and broader implications for human-AI collaboration, Paper 2 has a significantly higher potential for transformative scientific impact.

vs. Reward Design for Physical Reasoning in Vision-Language Models

gemini-3.1 · 5/8/2026

Paper 1 presents a groundbreaking AI system that fundamentally alters the mathematical research workflow and achieves state-of-the-art results on the highly challenging FrontierMath benchmark. While Paper 2 offers a rigorous and valuable ablation study on VLM reward design, Paper 1 demonstrates broader potential to accelerate open-ended human scientific discovery, representing a more significant leap in AI capabilities and real-world scientific impact.

vs. On Emotion-Sensitive Decision Making of Small Language Model Agents

gpt-5.2 · 5/8/2026

Paper 1 likely has higher impact due to stronger novelty and broader, high-value real-world application: an agentic, stateful workbench aimed at accelerating genuine mathematical research, with evidence of helping solve open problems and achieving state-of-the-art benchmark performance. Its paradigm (interactive, asynchronous, uncertainty-tracking AI collaboration) could influence multiple areas—AI agents, formal methods, theorem proving, scientific discovery tooling, and research workflows—making the breadth and timeliness high. Paper 2 is methodologically interesting and relevant for SLM agent evaluation, but its scope is narrower and nearer-term impact is more specialized.

vs. On Emotion-Sensitive Decision Making of Small Language Model Agents

gpt-5.2 · 5/8/2026

Paper 1 likely has higher impact due to stronger novelty and broader, immediate applicability: an agentic, stateful workbench aimed at accelerating real mathematical research, with evidence of helping solve open problems and producing native mathematical artifacts. Its reported state-of-the-art benchmark gains (FrontierMath Tier 4) suggest methodological substance and timeliness amid rapid advances in AI-for-science. Paper 2 is innovative and rigorous in studying emotion interventions in SLM agents, but its impact is more specialized (agent alignment/robustness) and less directly transformative across disciplines than tooling that can accelerate core scientific discovery.

vs. Best Arm Identification in Generalized Linear Bandits via Hybrid Feedback

gpt-5.2 · 5/8/2026

Paper 1 likely has higher impact due to its broader, timely implications: an agentic, stateful AI workbench for real mathematical research with evidence of solving open problems and achieving strong benchmark performance. Its applications span mathematics, AI/ML (agents, tool use), HCI/workflows, and scientific discovery tooling, giving wide cross-field reach. Paper 2 is methodologically rigorous and valuable within bandits/online learning, but is more specialized; its hybrid-feedback GLB best-arm identification advances theory and practice, yet the expected breadth and immediate real-world uptake are narrower than a general AI research assistant for mathematics.

vs. Best Arm Identification in Generalized Linear Bandits via Hybrid Feedback

gpt-5.2 · 5/8/2026

Paper 1 likely has higher scientific impact due to broader cross-field reach and immediate real-world applicability: an agentic, stateful AI workbench for mathematical research could affect mathematics, formal methods, AI, education, and scientific discovery workflows. Its reported assistance on open problems plus strong benchmark performance suggests timely relevance amid rapid progress in LLM-based agents. Paper 2 is methodologically rigorous and novel within bandits, but its impact is more specialized to theoretical ML/optimization and may diffuse more narrowly compared to a general-purpose research acceleration platform.

vs. How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study

gemini-3.1 · 5/8/2026

Paper 1 presents a highly novel AI system that directly accelerates open-ended mathematical research and achieves state-of-the-art results on a notoriously difficult benchmark (FrontierMath). Its potential to fundamentally change mathematical workflows and aid in actual scientific discovery gives it much broader and deeper real-world impact compared to Paper 2, which focuses on a specific, narrower interpretability issue regarding spatial reasoning in language models.

vs. MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems

gemini-3.1 · 5/8/2026

Paper 1 presents a system that assists in open-ended scientific discovery, directly solving open mathematical problems and achieving SOTA on the challenging FrontierMath benchmark. Its impact extends beyond AI into mathematics, demonstrating a transformative paradigm for human-AI collaboration in advanced research. Paper 2, while methodologically sound, offers a more incremental prompt optimization technique for multi-agent systems, which has a narrower scope and less immediate cross-disciplinary impact.

vs. GlazyBench: A Benchmark for Ceramic Glaze Property Prediction and Image Generation

gpt-5.2 · 5/8/2026

Paper 2 has higher potential impact due to broader cross-disciplinary relevance (mathematics, computer science, formal methods), strong timeliness in agentic AI, and clear evidence of performance gains (state-of-the-art results on difficult benchmarks plus early real-world problem-solving utility). Its workbench paradigm could generalize to other scientific domains requiring iterative research workflows. Paper 1 is novel and valuable (new dataset/benchmark for a niche but important materials domain), yet its immediate breadth and downstream transformative potential appear narrower than a widely usable agentic research assistant.