QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems
Chenyang An, Qihao Ye, Minghao Pan, Jiayaun Zhang
Abstract
We explore a central question in AI for mathematics: can AI systems produce original, nontrivial proofs for open research problems? Despite strong benchmark performance, producing genuinely novel proofs remains an outstanding challenge for LLMs. Through systematic experiments with frontier LLMs on research-level proof tasks, we identify seven failure modes that prevent reliable proof generation, including context contamination, citation hallucination, hand-waving on key steps and misallocation of proof effort, unstable proof plans, unfocused verification, problem modification and single-model bottleneck. We argue that the gap between benchmark success and research-level proving is primarily one of system design, due to those failure modes. We present QED, an open-source multi-agent proof system in which each architectural decision directly addresses a specific failure mode. Evaluated on five open problems in applied analysis and PDEs contributed by domain experts, QED produces correct proofs for three problems, each verified by the contributing experts as original and nontrivial. QED is released as open-source software at https://github.com/proofQED/QED.
AI Impact Assessments
(3 models)Scientific Impact Assessment: QED — An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems
1. Core Contribution
QED addresses a fundamental question in AI for mathematics: whether LLM-based systems can produce original, nontrivial proofs for genuinely open research problems. The paper makes two intertwined contributions. First, it provides an empirical taxonomy of seven failure modes that prevent frontier LLMs from reliably generating research-level proofs (context contamination, citation hallucination, hand-waving, unstable proof plans, unfocused verification, problem modification, and single-model bottleneck). Second, it presents QED, an open-source multi-agent proof system whose architecture maps each failure mode to a specific design countermeasure. The system is evaluated on five open problems in applied analysis and PDEs, successfully producing expert-verified original proofs for three.
The central thesis — that the gap between benchmark performance and research-level proving is primarily one of *system design* rather than raw model capability — is both provocative and, if validated broadly, important for the field.
2. Methodological Rigor
The paper's methodological approach has both notable strengths and significant limitations:
Strengths: The failure mode analysis is grounded in practical experimentation with multiple frontier models (Claude Code, Codex, Gemini CLI). The seven failure modes are concrete, well-articulated, and individually plausible. The architectural mapping from failure modes to design decisions is unusually well-motivated compared to many multi-agent system papers. The verification pipeline (six phases, two stages) is thoughtfully structured, with specific mechanisms like word-by-word problem-statement comparison and `` tagging representing genuine engineering insights.
Limitations: The evaluation is fundamentally limited. Five problems is an extremely small sample, and three successes cannot establish statistical reliability. The authors acknowledge this but frame it as a feature (case studies rather than benchmarks). More critically, no ablation studies are performed. The authors justify this by noting the expert burden, which is understandable, but it means we cannot assess which components of QED are essential versus decorative. Without ablations, the claim that each architectural decision addresses a specific failure mode remains a design rationale rather than an empirical finding.
The problems are all from applied analysis and PDEs — a narrow mathematical domain. The two failed problems are informative: Problem 1 (Bessel function analysis) and Problem 2 (Allen-Cahn-Navier-Stokes) reveal that QED's verifier correctly rejects flawed proofs, but also that the system cannot handle problems requiring close adaptation of existing paper methodologies or complex special-function analysis.
The expert verification process, while essential, introduces subjectivity. The experts' characterizations of the proofs as "relatively straightforward but nontrivial" for P4 and "contradicting the expert's prior expectation" for P3 are encouraging but anecdotal. The fact that the domain expert for P2 noted the proof would need supplementation for publication suggests the mathematical significance is modest.
3. Potential Impact
Near-term impact: QED's open-source release could serve as a useful baseline and toolkit for researchers exploring AI-assisted theorem proving. The failure mode taxonomy is likely the most immediately reusable contribution — it provides a practical checklist for anyone building LLM-based proof systems.
Medium-term impact: If the system design thesis holds broadly, QED could catalyze a shift in how the community approaches AI for mathematics: away from single-model scaling and toward careful multi-agent orchestration. The verification pipeline could be adapted for other domains requiring rigorous logical reasoning.
Caveats on impact: The three solved problems, while genuinely open, appear to be of moderate mathematical difficulty. The expert for P4 called it "relatively straightforward but nontrivial," and P3 was solved in Round 1 with a "straightforward proof." The most significant result (P4, Batchelor scale equivalence) is genuinely interesting — revealing that a known sufficient condition is actually an equivalence — but represents an insight that, once suggested, is relatively easy to verify. The system's inability to solve P2 (described as "standard for graduate students") suggests that QED's successes may be concentrated on problems where direct analytical reasoning suffices, rather than problems requiring careful adaptation of existing techniques.
4. Timeliness & Relevance
The paper is highly timely. It appears during a period of intense activity around AI for mathematics (AlphaEvolve, Aletheia, FunSearch, DeepSeek-Prover) and directly addresses skepticism about whether LLMs can move beyond benchmark performance to genuine research contribution. The positioning against Aletheia — noting that only 13 of 212 flagged solutions were meaningfully valid, with 9 being rediscoveries — effectively contextualizes QED's contribution. The open-source release differentiates it from closed-source systems.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Additional observations: The paper's positioning between engineering contribution and scientific claim creates some tension. As an engineering paper, QED is a well-designed system with practical value. As a scientific paper claiming to demonstrate that LLMs can produce original research mathematics, the evidence is suggestive but not conclusive given the small sample and absence of controlled experiments.
Generated Apr 28, 2026
Comparison History (54)
Paper 1 proposes a foundational theoretical framework that unifies Bayesian inference, game theory, and thermodynamics. Such cross-disciplinary unification has massive potential to reshape theoretical approaches in physics, cognitive science, economics, and AI. While Paper 2 presents a highly impressive and practical engineering achievement in AI for mathematics, it is primarily a system design solution rather than a paradigm-shifting theoretical discovery. Paper 1's breadth of impact across multiple scientific domains and its establishment of a new mathematical bridge give it higher potential for deep, long-term scientific impact.
QED addresses a fundamental open question in AI for mathematics—whether AI can generate original proofs for open research problems—and demonstrates success by solving 3 out of 5 open problems with expert-verified novel proofs. This represents a landmark capability milestone with broad implications across mathematics and AI. The systematic identification of failure modes and open-source release amplify impact. While Paper 2 makes a solid contribution to computational materials science with practical efficiency gains, Paper 1's demonstration of genuine AI-driven mathematical discovery is more transformative and likely to attract broader attention across multiple fields.
QED addresses a fundamental open question—whether AI can generate original, nontrivial proofs for unsolved mathematical problems—and demonstrates success on three open problems verified by domain experts. This represents a qualitative leap in AI-for-mathematics, with broad implications across all mathematical sciences. The systematic identification of seven failure modes and an open-source multi-agent architecture that addresses them provides both theoretical insight and practical infrastructure. While Paper 2 makes a solid contribution to structure search by unifying generative models with RSS, it represents more incremental progress in a well-explored area. QED's demonstrated ability to solve open problems is transformative.
While Paper 2 offers exceptional methodological rigor with formal guarantees for cyber defense, Paper 1 demonstrates a landmark achievement in artificial intelligence: generating original, expert-verified proofs for open mathematical problems. By identifying specific LLM failure modes and systematically overcoming them to solve previously unsolved PDEs, Paper 1 represents a fundamental leap in AI's capability to contribute directly to scientific discovery. Its open-source nature and success in advancing human knowledge across mathematics give it a broader and more profound fundamental scientific impact.
Paper 1 has higher potential impact due to greater novelty (a multi-agent system that produces expert-verified original proofs for open PDE/analysis problems), strong real-world scientific relevance (advancing automated research mathematics), and broader cross-field implications for LLM system design beyond benchmarks. It offers concrete methodological contributions via identified failure modes mapped to architectural choices and releases open-source software. Paper 2 is timely and rigorous as a large-scale benchmark for workspace agents with realistic dependencies, but benchmarks typically yield more incremental impact than demonstrated progress on open research problems.
QED demonstrates a breakthrough capability: an AI system producing correct, expert-verified proofs for open mathematical problems. This represents a qualitative leap beyond benchmark performance, addressing a fundamental challenge in AI for mathematics. The identification of seven specific failure modes and systematic architectural solutions provides actionable insights. The concrete achievement of solving 3 out of 5 open problems, verified by domain experts, has immediate real-world impact on mathematical research. Paper 2 presents an incremental improvement to meta-reasoning frameworks with benchmark gains, but lacks the transformative potential and novelty of actually solving open problems.
Paper 1 represents a significant milestone in AI by demonstrating the ability to solve genuinely open research problems in mathematics, a long-standing grand challenge. While Paper 2 offers a valuable framework for improving general LLM reasoning, Paper 1's concrete success in producing expert-verified original proofs, combined with its open-source release and systematic analysis of failure modes, promises a more immediate and profound impact on both AI and mathematics.
QED demonstrates AI systems generating verified, original proofs for open mathematical problems—a landmark achievement bridging AI and mathematics research. It identifies specific failure modes and provides an open-source multi-agent architecture that directly addresses them, with expert-verified results on real open problems. While Paper 2 provides useful empirical insights about iterative finetuning stability, its findings (traits mostly decay or stay constant) are somewhat expected and incremental. QED's impact spans AI, mathematics, and scientific discovery, with immediate practical applications and the potential to fundamentally change how mathematical research is conducted.
Paper 2 likely has higher impact: it reports an open-source multi-agent system that produces expert-verified original proofs for multiple open problems, a notable leap beyond benchmarks with clear, high-value real-world application in mathematical research. It also contributes a useful taxonomy of failure modes and a system architecture mapped to them, enabling follow-on work across AI-for-math, multi-agent LLM systems, and verification workflows. Paper 1 is timely and relevant to alignment/synthetic training, but its main finding (idempotence/decay) is more incremental and narrower in downstream applications.
Paper 1 presents a broader paradigm (machine collective intelligence) for autonomous scientific discovery across multiple domains, demonstrating dramatic improvements over deep neural networks (six orders of magnitude error reduction) while achieving interpretability. Its applicability spans deterministic, stochastic, and uncharacterized systems, suggesting wide cross-disciplinary impact. Paper 2 makes an important contribution to AI-assisted theorem proving with verified proofs on open problems, but its scope is narrower (mathematical proofs in applied analysis/PDEs) and the system solved 3 of 5 problems. Paper 1's potential to transform scientific discovery methodology gives it broader and deeper impact.
QED addresses a fundamental challenge in AI for mathematics—generating original, nontrivial proofs for open research problems—and demonstrates success by solving 3 out of 5 open problems verified by domain experts. This represents a significant milestone: moving beyond benchmark performance to genuine mathematical discovery. The systematic identification of seven failure modes and principled multi-agent architecture provides a framework for future research. Paper 2 offers useful engineering contributions for efficient computer-use agents but addresses an optimization problem (compute allocation) rather than pushing a fundamental capability frontier. QED's impact spans AI and mathematics, with broader implications for AI-assisted scientific discovery.
QED demonstrates a concrete, verified capability of AI systems producing original proofs for open mathematical problems—a landmark achievement. Successfully solving 3 out of 5 open problems with expert-verified proofs represents a tangible advance in automated mathematical reasoning. While Intern-Atlas offers valuable research infrastructure for tracking methodological evolution, QED's direct demonstration of AI solving previously unsolved problems has broader transformative implications across mathematics and AI, and its open-source release enables immediate community adoption and extension. The systematic identification of failure modes also provides actionable insights for the field.
Paper 2 has higher potential impact: it claims original, expert-verified proofs for multiple open problems—an immediate, high-value real-world scientific outcome with broad implications for automated reasoning and mathematical discovery. It also contributes an open-source multi-agent system and a concrete taxonomy of failure modes tied to design choices, aiding reproducibility and follow-on work. While Paper 1 is methodologically rigorous and timely for agent reliability, its contribution is primarily diagnostic/measurement; Paper 2, if the claims hold, represents a more novel and field-shifting capability with wider cross-disciplinary resonance.
Paper 2 demonstrates a profound breakthrough by enabling AI to solve open, research-level mathematical problems and generate verified, novel proofs. While Paper 1 offers a valuable efficiency optimization for computer-use agents, Paper 2 tackles a holy grail in AI and mathematics (automated scientific discovery and theorem proving), promising a much broader and paradigm-shifting scientific impact across multiple disciplines.
While Paper 1 presents a highly innovative infrastructure for automated scientific discovery, Paper 2 demonstrates a concrete, groundbreaking achievement: using AI to generate original, verified proofs for unsolved, open mathematical problems. Solving actual open problems represents a major milestone in AI and mathematics, offering immediate, undeniable proof of utility and pushing the boundaries of what AI systems can achieve in rigorous scientific reasoning.
QED demonstrates AI systems producing verified, original proofs for open mathematical problems—a landmark achievement bridging AI and mathematics research. It identifies specific failure modes and provides architectural solutions, with concrete results (3/5 open problems solved) verified by domain experts. This has transformative potential for mathematical research. Paper 2 contributes a useful but more incremental analysis of contamination in multi-agent workflows, offering taxonomies and measurement frameworks but with narrower implications. QED's demonstrated capability to generate novel mathematical proofs represents a qualitatively higher level of impact.
While Paper 1 provides a valuable benchmark for GUI agents, Paper 2 demonstrates AI systems solving actual open research problems in mathematics. The ability to generate novel, expert-verified mathematical proofs represents a paradigm shift from benchmark-solving to genuine scientific discovery, offering significantly higher theoretical and practical impact across AI and mathematics.
Paper 1 likely has higher impact due to stronger novelty and higher upside: it reports original, expert-verified proofs for open problems—an outcome that, if robust, advances the frontier of AI for mathematics and could influence ML, formal methods, and mathematical practice. Its system-level diagnosis of failure modes plus an open-source multi-agent architecture is broadly reusable. Paper 2 is timely and useful as a benchmark, with clear real-world relevance, but benchmarking tends to be more incremental and field-specific unless it becomes a dominant standard.
Paper 2 likely has higher scientific impact: producing expert-verified original proofs for open problems is a major milestone with broad implications for mathematics, formal methods, and AI reliability. Its identification of concrete failure modes and corresponding multi-agent mitigations is methodologically strong and generalizable, and the open-source release enables rapid adoption and follow-on work. Paper 1 is innovative for engineering design via symbolic abstraction plus optimization, but its impact may be narrower (mechanical linkages) and more incremental relative to fast-moving LLM-for-design trends. Paper 2 is more transformative and timely.
Paper 1 has a higher scientific impact because it tackles a fundamental frontier in AI: complex reasoning and generating novel mathematical proofs for open research problems. Successfully solving previously open problems in mathematics is a major scientific milestone that pushes the boundaries of artificial general intelligence. While Paper 2 offers significant commercial and practical utility for supply chain automation, it represents an industrial application of existing agentic paradigms rather than a foundational scientific breakthrough. Additionally, Paper 1's open-source release will highly catalyze further academic research.