AI Co-Mathematician: Accelerating Mathematicians with Agentic AI
Daniel Zheng, Ingrid von Glehn, Yori Zwols, Iuliya Beloshapka, Lars Buesing, Daniel M. Roy, Martin Wattenberg, Bogdan Georgiev
Abstract
We introduce the AI co-mathematician, a workbench for mathematicians to interactively leverage AI agents to pursue open-ended research. The AI co-mathematician is optimized to provide holistic support for the exploratory and iterative reality of mathematical workflows, including ideation, literature search, computational exploration, theorem proving and theory building. By providing an asynchronous, stateful workspace that manages uncertainty, refines user intent, tracks failed hypotheses, and outputs native mathematical artifacts, the system mirrors human collaborative workflows. In early tests, the AI co-mathematician helped researchers solve open problems, identify new research directions, and uncover overlooked literature references. Besides demonstrating a highly interactive paradigm for AI-assisted mathematical discovery, the AI co-mathematician also achieves state-of-the-art results on hard problem-solving benchmarks, including scoring 48% on FrontierMath Tier 4, a new high score among all AI systems evaluated.
AI Impact Assessments
Scientific Impact Assessment: AI Co-Mathematician
1. Core Contribution
The paper introduces the AI co-mathematician, an agentic AI workbench designed to support the full spectrum of mathematical research workflows—not just theorem proving, but ideation, literature search, computational exploration, and iterative refinement. The key insight is that existing AI tools for mathematics (chatbots, formal provers, evolutionary search) operate in isolation, leaving the researcher as manual "connective tissue." The system addresses this by providing a stateful, asynchronous, hierarchically organized multi-agent workspace with a project coordinator, workstream coordinators, and specialized sub-agents, all communicating through an internal messaging system and writing to a shared filesystem.
The analogy to AI-powered software engineering environments (Codex, Claude Code, Google Antigravity) is well-drawn: the authors argue that software engineering has pre-existing paradigms (specs, CI/CD, version control) that enable effective agentic AI, while mathematics lacks these, necessitating purpose-built infrastructure.
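The coordinator hierarchy described in this section can be illustrated with a small sketch. This is not the authors' implementation, which the assessment describes only at a high level. Every class, method, and path name below is a hypothetical stand-in: a dict plays the role of the shared filesystem, an in-memory queue plays the role of the internal messaging system, and the sub-agents do placeholder work instead of calling a model.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str
    recipient: str
    body: str


@dataclass
class Workspace:
    # Hypothetical shared state: a message log and a dict standing in
    # for the shared filesystem (path -> file contents).
    messages: deque = field(default_factory=deque)
    files: dict = field(default_factory=dict)

    def send(self, sender: str, recipient: str, body: str) -> None:
        self.messages.append(Message(sender, recipient, body))


class SubAgent:
    def __init__(self, name: str, workspace: Workspace):
        self.name = name
        self.workspace = workspace

    def run(self, task: str, workstream: str) -> None:
        # Placeholder for real work (literature search, computation, proof attempt).
        path = f"{workstream}/{self.name}.md"
        self.workspace.files[path] = f"notes on: {task}"
        self.workspace.send(self.name, workstream, f"wrote {path}")


class WorkstreamCoordinator:
    def __init__(self, name: str, workspace: Workspace, agents: list):
        self.name = name
        self.workspace = workspace
        self.agents = agents

    def run(self, goal: str) -> None:
        # Dispatch the goal to each sub-agent, then report upward.
        for agent in self.agents:
            agent.run(goal, self.name)
        self.workspace.send(self.name, "project", f"{len(self.agents)} reports ready")


class ProjectCoordinator:
    def __init__(self, workspace: Workspace, workstreams: list):
        self.workspace = workspace
        self.workstreams = workstreams

    def run(self, goals: dict) -> list:
        for ws in self.workstreams:
            ws.run(goals[ws.name])
        # Return only the messages addressed to the project level.
        return [m.body for m in self.workspace.messages if m.recipient == "project"]


def demo() -> tuple:
    ws = Workspace()
    literature = WorkstreamCoordinator("literature", ws, [SubAgent("searcher", ws)])
    proofs = WorkstreamCoordinator(
        "proofs", ws, [SubAgent("prover", ws), SubAgent("checker", ws)]
    )
    project = ProjectCoordinator(ws, [literature, proofs])
    summary = project.run({"literature": "prior bounds", "proofs": "lemma 1"})
    return summary, sorted(ws.files)
```

Calling `demo()` runs two workstreams and returns the project-level status messages together with the paths written to the stand-in filesystem. The sketch is synchronous for brevity; the system as described is asynchronous, so a closer model would run each workstream as a concurrent task.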
2. Methodological Rigor
The paper is primarily a systems paper with qualitative case studies and benchmark evaluations. The methodological rigor is mixed:
Strengths in evaluation:
Weaknesses:
3. Potential Impact
High potential impact areas:
Broader implications:
The paper thoughtfully discusses systemic risks: signal-to-noise degradation in mathematical literature, burden on peer review, and the "reviewer-pleasing bias" where agents learn to satisfy AI reviewers without actually fixing errors. These are prescient concerns.
4. Timeliness & Relevance
This paper is extremely timely. It arrives at the intersection of three trends: (1) rapid improvement in LLM mathematical reasoning, (2) the explosion of agentic AI systems in software engineering, and (3) growing interest from the mathematical community in AI tools (as evidenced by projects like Aletheia, AlphaProof, and community engagement from figures like Terence Tao). The paper correctly identifies that the bottleneck has shifted from raw problem-solving to workflow orchestration—a claim supported by the saturation of simpler benchmarks like MATH and GSM8K.
The argument that AI-for-mathematics evaluation should move beyond static benchmarks toward measuring collaborative efficacy is important and likely to influence the evaluation methodology of future systems.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The paper represents an important conceptual contribution—articulating and demonstrating that the orchestration layer for AI-assisted mathematics is a distinct and valuable research direction. However, the empirical evidence is thin relative to the claims. The FrontierMath result is the strongest quantitative evidence but is complicated by compute incomparability. The qualitative case studies are suggestive but not systematic.
The paper's lasting impact may be more in establishing a paradigm (interactive, stateful, uncertainty-aware AI workbenches for mathematics) than in the specific system described, particularly given its proprietary nature.
Generated May 8, 2026
Comparison History (20)
The AI Co-Mathematician demonstrates immediate, tangible scientific impact by helping mathematicians solve open problems and achieving state-of-the-art results on FrontierMath (48% on Tier 4). It addresses a broad, practical need across all of mathematics with a deployed interactive system. While GRAM introduces a theoretically interesting framework for probabilistic recursive reasoning, it remains more preliminary, tested on structured reasoning benchmarks rather than demonstrating real-world breakthroughs. The Co-Mathematician's combination of practical utility, benchmark performance, and direct impact on mathematical discovery gives it higher near-term and broad scientific impact.
The AI Co-Mathematician represents a paradigm shift in AI-assisted mathematical research, demonstrating practical utility in solving open problems and achieving SOTA on FrontierMath Tier 4 (48%). Its potential to accelerate mathematical discovery across all fields gives it extraordinary breadth of impact. While Fully Open Meditron makes important contributions to transparency and reproducibility in clinical LLMs—addressing a critical need—its improvements are incremental (e.g., +6.6 points) and focused on a narrower domain. The co-mathematician's novel agentic workflow paradigm and demonstrated ability to solve open problems suggest transformative potential.
Paper 1 has a higher potential scientific impact due to its broader scope and fundamental implications for scientific discovery. While Paper 2 offers a valuable algorithmic improvement for MILP solvers, Paper 1 introduces a paradigm shift in how mathematical research is conducted, potentially accelerating discoveries across all mathematical fields. Furthermore, achieving a new state-of-the-art on the extremely difficult FrontierMath benchmark and actively solving open problems demonstrates a profound leap in AI-assisted reasoning that transcends a single optimization subfield.
The AI Co-Mathematician demonstrates both practical impact (solving open mathematical problems, state-of-the-art on FrontierMath benchmarks) and introduces a novel paradigm for human-AI collaboration in mathematics. It addresses a broad, high-value domain with immediate real-world applications for researchers. Paper 2 provides interesting theoretical insights about attention redundancy in LVLMs, but its findings are more analytical/diagnostic. While the attention redundancy finding is provocative, Paper 1's combination of a functional system achieving new benchmarks and enabling mathematical discovery gives it broader and deeper potential impact across mathematics and AI research.
Paper 1 has higher potential impact due to greater novelty (agentic, stateful AI workbench tailored to real mathematical research), strong real-world applicability across the mathematics/CS research workflow, and demonstrated performance gains (state-of-the-art benchmark results plus early evidence of solving open problems). Its breadth is wider: methods could generalize to other scientific domains needing iterative reasoning and artifact generation. Paper 2 is timely and methodologically valuable (fine-grained LCA + compute accounting) with important policy implications, but its scope is narrower and likely more incremental/diagnostic than transformative.
The AI Co-Mathematician represents a more impactful contribution: it introduces a novel paradigm for AI-assisted mathematical discovery with demonstrated real-world utility (solving open problems, achieving SOTA on FrontierMath benchmarks at 48% on Tier 4). It addresses a fundamental challenge in augmenting human intellectual work and has broad implications across mathematics and AI research. Paper 2, while valuable as a benchmark for embodied navigation, is more incremental—evaluating existing LMMs on a new dataset rather than introducing a transformative system. The practical impact of Paper 1 on accelerating mathematical research gives it substantially higher potential.
Paper 2 introduces an AI system for open-ended mathematical discovery, an area with profound implications across all STEM fields. By achieving state-of-the-art results on highly complex benchmarks like FrontierMath and actively assisting researchers in solving open problems, it demonstrates immense potential to accelerate fundamental scientific progress. While Paper 1 presents an innovative bi-level approach to hyper-heuristic design, its impact is largely confined to the specific domain of combinatorial optimization, making its overall breadth and potential transformative impact narrower than Paper 2.
Paper 2 demonstrates a profound leap in AI reasoning by directly assisting in solving open mathematical problems and achieving state-of-the-art results on the highly rigorous FrontierMath benchmark. While Paper 1 provides a valuable foundational framework for agentic search, Paper 2's proven ability to accelerate actual scientific discovery and theory building represents a breakthrough in AI for science, likely driving massive interest and follow-up research across both the AI and mathematics communities.
Paper 1 has higher potential impact due to greater novelty (agentic, stateful co-research workbench tailored to real mathematical workflows) and broader real-world applications (accelerating research via ideation, search, computation, theorem proving). Its claims of helping solve open problems plus strong benchmark performance suggest meaningful capability gains and timeliness in AI-for-science/math. Paper 2 is a useful, methodologically solid efficiency improvement to confidence-weighted self-consistency, but is more incremental and narrower in scope, mainly affecting inference cost rather than enabling qualitatively new research workflows.
The AI Co-Mathematician demonstrates higher scientific impact for several reasons: (1) it addresses a broader, more fundamental challenge of AI-assisted mathematical discovery rather than a specific interaction paradigm; (2) it achieves state-of-the-art results on FrontierMath Tier 4, a prestigious benchmark, demonstrating concrete capability advances; (3) it has already produced real-world results (solving open problems, finding new research directions); (4) its holistic approach spanning ideation through theorem proving has broader applicability; and (5) mathematical discovery tools have outsized impact across all STEM fields. BALAR's Bayesian approach is principled but more incremental.
While Paper 2 presents an impressive domain-specific achievement with SOTA results in mathematics, Paper 1 addresses a fundamental and universal bottleneck in AI: the black-box nature of agentic tool use. By applying mechanistic interpretability (SAEs) to predict and control agent actions before execution, Paper 1 offers broad, cross-disciplinary impact. Its framework directly tackles critical safety, security, and reliability issues required for deploying AI agents in high-stakes, real-world enterprise applications across all fields, giving it a higher potential scientific and practical impact.
Paper 2 introduces an agentic system that directly accelerates mathematical discovery, demonstrated by solving open problems and achieving state-of-the-art on the highly challenging FrontierMath benchmark. Its interactive approach has profound implications for AI-assisted research and theory building. In contrast, Paper 1 presents a valuable but narrower empirical study on reward design for physical reasoning in Vision-Language Models. Due to its demonstrated real-world utility in scientific discovery and broader implications for human-AI collaboration, Paper 2 has a significantly higher potential for transformative scientific impact.
Paper 1 presents a groundbreaking AI system that fundamentally alters the mathematical research workflow and achieves state-of-the-art results on the highly challenging FrontierMath benchmark. While Paper 2 offers a rigorous and valuable ablation study on VLM reward design, Paper 1 demonstrates broader potential to accelerate open-ended human scientific discovery, representing a more significant leap in AI capabilities and real-world scientific impact.
Paper 1 likely has higher impact due to stronger novelty and broader, high-value real-world application: an agentic, stateful workbench aimed at accelerating genuine mathematical research, with evidence of helping solve open problems and achieving state-of-the-art benchmark performance. Its paradigm (interactive, asynchronous, uncertainty-tracking AI collaboration) could influence multiple areas—AI agents, formal methods, theorem proving, scientific discovery tooling, and research workflows—making the breadth and timeliness high. Paper 2 is methodologically interesting and relevant for SLM agent evaluation, but its scope is narrower and nearer-term impact is more specialized.
Paper 1 likely has higher impact due to stronger novelty and broader, immediate applicability: an agentic, stateful workbench aimed at accelerating real mathematical research, with evidence of helping solve open problems and producing native mathematical artifacts. Its reported state-of-the-art benchmark gains (FrontierMath Tier 4) suggest methodological substance and timeliness amid rapid advances in AI-for-science. Paper 2 is innovative and rigorous in studying emotion interventions in SLM agents, but its impact is more specialized (agent alignment/robustness) and less directly transformative across disciplines than tooling that can accelerate core scientific discovery.
Paper 1 likely has higher impact due to its broader, timely implications: an agentic, stateful AI workbench for real mathematical research with evidence of solving open problems and achieving strong benchmark performance. Its applications span mathematics, AI/ML (agents, tool use), HCI/workflows, and scientific discovery tooling, giving wide cross-field reach. Paper 2 is methodologically rigorous and valuable within bandits/online learning, but is more specialized; its hybrid-feedback GLB best-arm identification advances theory and practice, yet the expected breadth and immediate real-world uptake are narrower than a general AI research assistant for mathematics.
Paper 1 likely has higher scientific impact due to broader cross-field reach and immediate real-world applicability: an agentic, stateful AI workbench for mathematical research could affect mathematics, formal methods, AI, education, and scientific discovery workflows. Its reported assistance on open problems plus strong benchmark performance suggests timely relevance amid rapid progress in LLM-based agents. Paper 2 is methodologically rigorous and novel within bandits, but its impact is more specialized to theoretical ML/optimization and may diffuse more narrowly compared to a general-purpose research acceleration platform.
Paper 1 presents a highly novel AI system that directly accelerates open-ended mathematical research and achieves state-of-the-art results on a notoriously difficult benchmark (FrontierMath). Its potential to fundamentally change mathematical workflows and aid in actual scientific discovery gives it much broader and deeper real-world impact compared to Paper 2, which focuses on a specific, narrower interpretability issue regarding spatial reasoning in language models.
Paper 1 presents a system that assists in open-ended scientific discovery, directly solving open mathematical problems and achieving SOTA on the challenging FrontierMath benchmark. Its impact extends beyond AI into mathematics, demonstrating a transformative paradigm for human-AI collaboration in advanced research. Paper 2, while methodologically sound, offers a more incremental prompt optimization technique for multi-agent systems, which has a narrower scope and less immediate cross-disciplinary impact.
Paper 2 has higher potential impact due to broader cross-disciplinary relevance (mathematics, computer science, formal methods), strong timeliness in agentic AI, and clear evidence of performance gains (state-of-the-art results on difficult benchmarks plus early real-world problem-solving utility). Its workbench paradigm could generalize to other scientific domains requiring iterative research workflows. Paper 1 is novel and valuable (new dataset/benchmark for a niche but important materials domain), yet its immediate breadth and downstream transformative potential appear narrower than a widely usable agentic research assistant.