QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems
Chenyang An, Qihao Ye, Minghao Pan, Jiayaun Zhang
Abstract
We explore a central question in AI for mathematics: can AI systems produce original, nontrivial proofs for open research problems? Despite strong benchmark performance, producing genuinely novel proofs remains an outstanding challenge for LLMs. Through systematic experiments with frontier LLMs on research-level proof tasks, we identify seven failure modes that prevent reliable proof generation, including context contamination, citation hallucination, hand-waving on key steps and misallocation of proof effort, unstable proof plans, unfocused verification, problem modification and single-model bottleneck. We argue that the gap between benchmark success and research-level proving is primarily one of system design, due to those failure modes. We present QED, an open-source multi-agent proof system in which each architectural decision directly addresses a specific failure mode. Evaluated on five open problems in applied analysis and PDEs contributed by domain experts, QED produces correct proofs for three problems, each verified by the contributing experts as original and nontrivial. QED is released as open-source software at https://github.com/proofQED/QED.
AI Impact Assessments
(1 models)Scientific Impact Assessment: QED — An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems
1. Core Contribution
QED addresses whether AI systems can produce original, nontrivial proofs for genuinely open mathematical problems — a qualitatively different challenge from solving benchmark problems with known solutions. The paper's primary contributions are: (a) an empirical taxonomy of seven failure modes observed when frontier LLMs attempt research-level proofs, and (b) a multi-agent system architecture where each design decision maps directly to a specific failure mode. The system is evaluated on five open problems in applied analysis and PDEs, producing expert-verified original proofs for three of them.
The most notable result is Problem 4 (Batchelor Scale Equivalence), where QED revealed that a known sufficient condition is actually an equivalence — a finding the domain experts had not anticipated. This constitutes a genuine mathematical discovery, albeit modest in scope.
2. Methodological Rigor
Strengths in evaluation design: The paper uses genuinely open problems contributed by domain experts, with independent expert verification of correctness, originality, and nontriviality. This is the gold standard for evaluating AI proof generation, far superior to benchmark-based evaluation.
Significant weaknesses:
3. Potential Impact
Positive impact directions:
Limitations on impact:
4. Timeliness & Relevance
The paper is highly timely, arriving amid intense interest in AI for mathematics following AlphaProof, AlphaEvolve, and Aletheia. The positioning against Aletheia's results on Erdős problems (212 flagged, only 13 valid, 9 rediscoveries) effectively frames the gap QED aims to address. The focus on open-source tooling and natural-language proofs (rather than formal verification) fills a genuine niche, though this choice also means proofs lack machine-checkable guarantees.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
6. Additional Observations
The paper is well-written and honest about limitations. The failure mode taxonomy, while informal, provides useful vocabulary for the field. However, the paper's strongest claim — that the benchmark-to-research gap is "primarily one of system design" — remains inadequately supported by the evidence presented. The results are better described as a promising proof of concept than as validation of a general methodology.
Generated May 5, 2026
Comparison History (40)
QED demonstrates a breakthrough in AI-assisted mathematical research by producing verified, original proofs for open problems—a long-standing grand challenge. Its systematic identification of seven failure modes and architectural solutions represent fundamental contributions to AI reasoning. The real-world impact of AI solving open mathematical problems is transformative and cross-disciplinary. While OmniDiagram advances diagram generation with novel RL rewards and a large dataset, it addresses a more incremental, narrower problem. QED's open-source release and expert-verified results on genuine open problems position it for significantly broader scientific impact.
QED demonstrates a breakthrough in AI-assisted mathematical reasoning by producing verified, original proofs for open research problems—a milestone that has been a grand challenge in AI. Its systematic identification of seven failure modes and targeted multi-agent architecture provides foundational insights for the field. The verified proofs on genuine open problems represent tangible scientific contributions beyond the AI system itself. While OmniDiagram is solid incremental work on diagram generation with a novel reward mechanism, QED's impact spans AI, mathematics, and scientific discovery more broadly, with higher potential to reshape how research is conducted.
Paper 1 introduces foundational infrastructure (565 verifiable environments across four disciplines) for training and evaluating AI scientists, addressing a critical bottleneck in data-driven discovery. While Paper 2 achieves a significant milestone in mathematics by solving open problems, Paper 1 has broader applicability across multiple scientific fields and provides a reusable benchmark that will likely catalyze widespread follow-up research and model development.
Paper 1 demonstrates an AI system solving genuinely open, expert-contributed mathematical problems, representing a major milestone in AI-driven scientific discovery. While Paper 2 provides a valuable dataset and benchmark for training scientific agents, Paper 1's achievement of producing original, verified proofs for unsolved problems has profound and immediate implications for the future of mathematical research and AI capabilities.
While Paper 1 presents a significant breakthrough in AI for mathematics by solving open problems, Paper 2 addresses a broader challenge across all empirical sciences: deriving explainable governing equations from data. By reducing extrapolation errors by orders of magnitude and creating interpretable models, Paper 2's approach can fundamentally accelerate scientific discovery across physics, biology, and chemistry, giving it a wider and potentially more transformative cross-disciplinary impact.
Paper 2 likely has higher impact due to broad, timely applicability in materials and molecular discovery, with clear real-world pathways (drug discovery, catalysts, batteries) and a general methodological contribution (unifying diffusion generation with physics-based search). It targets a widely shared bottleneck—efficient exploration of energy landscapes—and claims strong efficiency gains and out-of-distribution effectiveness across molecules and crystals, suggesting robustness and cross-field uptake. Paper 1 is novel and valuable for AI+math, but its demonstrated scope (few open PDE/analysis problems) and narrower immediate application base likely limit near-term breadth compared to GSS.
Paper 2 likely has higher scientific impact due to broad real-world applicability in materials and molecular discovery, a timely area with direct links to catalysis, batteries, pharmaceuticals, and crystal engineering. Its unified framework bridging diffusion models with physics-based random structure search addresses a widely recognized limitation (training-data bias) and promises order-of-magnitude efficiency gains, enabling adoption across computational chemistry and materials science workflows. Paper 1 is highly novel and exciting but its demonstrated impact is narrower (a few PDE/open problems) and may be harder to generalize and rigorously validate at scale.
Paper 2 presents a broader paradigm (machine collective intelligence) applicable across diverse scientific domains, combining symbolism and metaheuristics for autonomous equation discovery. Its demonstrated results—recovering governing equations across deterministic, stochastic, and uncharacterized systems with up to six orders of magnitude improvement in extrapolation—suggest wider scientific impact. While Paper 1 (QED) is impressive in solving open math problems, its impact is more narrowly focused on mathematical proof generation. Paper 2's potential to transform AI-driven scientific discovery across physics, biology, and engineering gives it broader cross-disciplinary relevance.
Paper 1 has higher likely scientific impact: it introduces a novel, pre-registered benchmark quantifying identity-contingent safety-induced omission harm in clinical advice, with substantial real-world and policy relevance for deploying LLMs in healthcare. The methodology is comparatively rigorous (pre-registration, structured scoring, physician validation, multiple models, statistical testing) and targets a timely, high-stakes failure mode (withholding/filters) with broad implications for AI safety, evaluation, and regulation. Paper 2 is innovative and useful, but impact may be narrower to AI4Math and depends on sustained expert verification and generalization beyond five problems.
While Paper 1 makes significant strides in automated mathematical proving, Paper 2 achieves a groundbreaking milestone by demonstrating end-to-end autonomous scientific discovery on a physical experimental platform. The AI's discovery and experimental validation of a novel physical mechanism (optical bilinear interaction) not only proves the system's efficacy in the real world but also offers tangible technological applications in developing high-speed, energy-efficient optical hardware for AI computations.
IatroBench addresses a critical, timely problem—AI safety measures causing iatrogenic harm—with rigorous pre-registered methodology across 3,600 responses and 6 frontier models. It reveals a systematic, identity-contingent failure mode with real clinical consequences, challenging fundamental assumptions in AI safety alignment. Its implications span AI policy, healthcare, and model development. While QED's automated theorem proving for open problems is impressive, IatroBench's findings are more broadly impactful: they affect billions of potential users, expose a paradox in current safety practices, and will likely influence regulatory and deployment decisions across the AI industry.
Paper 2 demonstrates end-to-end autonomous scientific discovery in a real physical system, autonomously identifying and experimentally validating a previously unreported physical mechanism (optical bilinear interaction). This represents a broader milestone—bridging AI reasoning with physical experimentation—with transformative implications across experimental sciences. While Paper 1's multi-agent mathematical proof system is impressive (solving 3 of 5 open problems), Paper 2's integration of hypothesis generation, experimental design, execution, and validation on real hardware represents a more fundamental advance with wider cross-disciplinary impact and practical applications in optical computing.
Paper 2 (MIMIC) likely has higher scientific impact due to broader real-world applications (drug discovery, variant interpretation, RNA splicing, protein design), wider cross-field reach (genomics, structural biology, ML, medicine), and strong timeliness as multimodal foundation models are central to current bio-AI. The work claims a new aligned multimodal dataset, flexible conditional generation, state-of-the-art downstream results, and design demonstrations. Paper 1 is novel and valuable for AI+math, but its domain is narrower and impact depends on reproducibility and sustained success on more open problems.
MIMIC represents a more broadly impactful contribution: a multimodal foundation model unifying sequence, structure, regulatory, evolutionary, and contextual modalities for biomolecular understanding and design. Its applications span RNA splicing prediction, protein design (PD-L1/hACE2 binders), and therapeutic mutation correction—areas with direct biomedical impact. The new LORE dataset and the generative multimodal framework address a fundamental gap in computational biology. While QED is impressive in generating novel mathematical proofs via multi-agent systems, MIMIC's breadth of applications, methodological innovation in multimodal biological modeling, and potential for drug discovery and therapeutic design give it broader and more immediate real-world impact.
Paper 1 introduces a broadly applicable generative “health world model” trained on large, deeply phenotyped longitudinal data and demonstrates strong transfer, improved clinical endpoint prediction, and intervention-conditioned simulation aligned with RCT evidence—high novelty plus immediate translational value for risk stratification and digital twins. Its potential impact spans medicine, public health, wearable sensing, causal inference, and ML. Paper 2 is innovative and timely for AI-for-math and open-sources a useful system, but its direct real-world applications and cross-field breadth are narrower, and results are shown on only five problems with expert verification but limited scalability evidence.
Paper 2 addresses a universal bottleneck in the scientific ecosystem—peer review scalability. Its massive field deployment (23,000+ papers) and empirical evidence that AI reviews are preferred over human ones offer immediate, transformative implications for how research is evaluated across all disciplines. While Paper 1 makes impressive strides in AI-driven mathematical discovery, Paper 2's broad applicability, timeliness, and successfully executed systemic integration give it a higher potential for widespread, structural scientific impact.
HealthFormer represents a broader and more transformative scientific contribution. It introduces a generative model of human physiology trained on deeply phenotyped data across 7 domains, enabling clinical digital twins, personalized intervention simulation, and disease risk prediction—all from a single model. It validates across 4 independent cohorts and 41 randomized trial comparisons, demonstrating remarkable generalization. While QED is impressive in generating novel mathematical proofs via multi-agent LLM systems, HealthFormer's potential to revolutionize personalized medicine, clinical trial design, and preventive healthcare gives it substantially broader real-world impact across medicine and public health.
Paper 2 has higher potential impact: it reports original, expert-verified proofs for open mathematical problems and releases an open-source multi-agent system, advancing both AI-for-math capability and mathematical knowledge. This is highly novel, methodologically concrete (failure-mode analysis, targeted architecture, expert verification), and broadly influential for automated reasoning, formal methods, and scientific discovery. Paper 1 is timely with clear real-world application and strong scale, but its core contribution is a large deployment/benchmarking of existing frontier-model tooling rather than a breakthrough capability, and its applicability depends on policy/ethics constraints.
Paper 1 has higher potential scientific impact due to greater novelty (a multi-agent architecture explicitly targeting identified failure modes) and a stronger scientific claim: producing expert-verified, original proofs for multiple open PDE/applied analysis problems. If robust, this advances AI-assisted mathematical discovery with broad downstream implications for mathematics, theoretical CS, and scientific modeling. Paper 2 is timely and valuable for benchmarking deployed endpoints (speed/cost/energy/quality), with clear real-world utility, but it is primarily an empirical evaluation framework; its impact is likely more incremental and concentrated in ML ops/benchmarking than research-level scientific discovery.
Paper 1 is more novel scientifically (AI system design yielding original, expert-verified proofs for open problems) and has broader cross-field impact (AI, formal methods, mathematics), with a clear step-change claim beyond benchmarks. Its real-world applications include accelerating mathematical discovery and improving trustworthy reasoning systems. While Paper 2 is timely and highly practical for deployment evaluation, it is primarily a benchmarking/methodology contribution with narrower scientific novelty and less fundamental advancement than demonstrated novel proof generation on open research problems.