QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems

Chenyang An, Qihao Ye, Minghao Pan, Jiayaun Zhang

#89 of 2292 · Artificial Intelligence
Share
Tournament Score
1547±35
10501800
57%
Win Rate
23
Wins
17
Losses
40
Matches
Rating
5.8/ 10
Significance
Rigor
Novelty
Clarity

Abstract

We explore a central question in AI for mathematics: can AI systems produce original, nontrivial proofs for open research problems? Despite strong benchmark performance, producing genuinely novel proofs remains an outstanding challenge for LLMs. Through systematic experiments with frontier LLMs on research-level proof tasks, we identify seven failure modes that prevent reliable proof generation, including context contamination, citation hallucination, hand-waving on key steps and misallocation of proof effort, unstable proof plans, unfocused verification, problem modification and single-model bottleneck. We argue that the gap between benchmark success and research-level proving is primarily one of system design, due to those failure modes. We present QED, an open-source multi-agent proof system in which each architectural decision directly addresses a specific failure mode. Evaluated on five open problems in applied analysis and PDEs contributed by domain experts, QED produces correct proofs for three problems, each verified by the contributing experts as original and nontrivial. QED is released as open-source software at https://github.com/proofQED/QED.

AI Impact Assessments

(1 models)

Scientific Impact Assessment: QED — An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems

1. Core Contribution

QED addresses whether AI systems can produce original, nontrivial proofs for genuinely open mathematical problems — a qualitatively different challenge from solving benchmark problems with known solutions. The paper's primary contributions are: (a) an empirical taxonomy of seven failure modes observed when frontier LLMs attempt research-level proofs, and (b) a multi-agent system architecture where each design decision maps directly to a specific failure mode. The system is evaluated on five open problems in applied analysis and PDEs, producing expert-verified original proofs for three of them.

The most notable result is Problem 4 (Batchelor Scale Equivalence), where QED revealed that a known sufficient condition is actually an equivalence — a finding the domain experts had not anticipated. This constitutes a genuine mathematical discovery, albeit modest in scope.

2. Methodological Rigor

Strengths in evaluation design: The paper uses genuinely open problems contributed by domain experts, with independent expert verification of correctness, originality, and nontriviality. This is the gold standard for evaluating AI proof generation, far superior to benchmark-based evaluation.

Significant weaknesses:

  • No ablation studies. The authors acknowledge this limitation, citing the burden on expert reviewers, but it fundamentally undermines the paper's central claim that the gap is "primarily one of system design." Without ablations, we cannot determine which components are essential versus cosmetic. Would separated prover/verifier alone achieve similar results? Is citation verification actually necessary for these problems? The failure-mode taxonomy is compelling narratively but unvalidated empirically.
  • Extremely small evaluation set. Five problems, all in a narrow subdomain (applied analysis/PDEs), with three successes. This is a case study, not a systematic evaluation. The authors are transparent about this, but it limits generalizability claims.
  • Selection bias concerns. The problems were "contributed by domain experts" who also verified the proofs. While the experts appear independent of the system developers, the problem selection process is not described in sufficient detail to rule out implicit filtering toward AI-amenable problems.
  • Expert assessment quotes suggest modest difficulty. The expert described P3's proof as "straightforward" (contradicting expectations but not technically deep) and P4 as "relatively straightforward but nontrivial." P2 (described as "standard for graduate students") was failed. This raises questions about the difficulty ceiling.
  • The negative cases are informative but underanalyzed. Problem 2, described as a "standard problem for graduate students," was not solved — yet three ostensibly harder open problems were. The paper doesn't adequately explain this discrepancy. The expert notes QED doesn't follow referenced paper methodologies, suggesting a fundamental limitation in incorporating existing proof techniques.
  • 3. Potential Impact

    Positive impact directions:

  • The failure mode taxonomy (Section 3) is practically useful and could influence how future AI proof systems are designed. These observations, while not individually surprising, are well-organized and actionable.
  • The open-source release enables community building and reproducibility.
  • Demonstrating that multi-agent architectures can produce original proofs, even modest ones, establishes a proof of concept that could accelerate the field.
  • Limitations on impact:

  • The results are concentrated in one mathematical subdomain. The system's applicability to algebra, combinatorics, number theory, or other areas is unknown.
  • The reliance on frontier commercial APIs (GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6) means the system's capabilities are tightly coupled to models that may change or become unavailable.
  • The proved results, while original, appear to be of limited mathematical significance by the experts' own assessment ("would need to be supplemented with additional similar results to constitute a publishable paper").
  • 4. Timeliness & Relevance

    The paper is highly timely, arriving amid intense interest in AI for mathematics following AlphaProof, AlphaEvolve, and Aletheia. The positioning against Aletheia's results on Erdős problems (212 flagged, only 13 valid, 9 rediscoveries) effectively frames the gap QED aims to address. The focus on open-source tooling and natural-language proofs (rather than formal verification) fills a genuine niche, though this choice also means proofs lack machine-checkable guarantees.

    5. Strengths & Limitations

    Key Strengths:

  • Clear, well-motivated system design with explicit mapping from failure modes to architectural decisions
  • Genuine open problems with expert verification — the evaluation methodology is commendable
  • Problem 4 represents a bona fide (if modest) mathematical discovery
  • The negative results are honestly reported, and the system's self-rejection of incorrect proofs demonstrates the verification pipeline's value
  • Open-source release
  • Notable Weaknesses:

  • No ablation studies undermine the central design thesis
  • Very small evaluation set in a narrow domain
  • The "proofs" are natural language, not formally verified — correctness depends entirely on expert judgment
  • Problem 5's evaluation involved human intervention (problem reformulation after initial solution), complicating the "fully automatic" claim
  • The paper conflates system design contributions with the observation that frontier models are now capable enough to prove these specific problems — it's unclear how much credit goes to the architecture versus the underlying model capabilities
  • Missing comparison with simpler baselines (e.g., just running Claude Code with a good prompt for 9 rounds)
  • 6. Additional Observations

    The paper is well-written and honest about limitations. The failure mode taxonomy, while informal, provides useful vocabulary for the field. However, the paper's strongest claim — that the benchmark-to-research gap is "primarily one of system design" — remains inadequately supported by the evidence presented. The results are better described as a promising proof of concept than as validation of a general methodology.

    Rating:5.8/ 10
    Significance 6.5Rigor 4.5Novelty 6Clarity 7.5

    Generated May 5, 2026

    Comparison History (40)

    vs. OmniDiagram: Advancing Unified Diagram Code Generation via Visual Interrogation Reward
    claude-opus-4.65/5/2026

    QED demonstrates a breakthrough in AI-assisted mathematical research by producing verified, original proofs for open problems—a long-standing grand challenge. Its systematic identification of seven failure modes and architectural solutions represent fundamental contributions to AI reasoning. The real-world impact of AI solving open mathematical problems is transformative and cross-disciplinary. While OmniDiagram advances diagram generation with novel RL rewards and a large dataset, it addresses a more incremental, narrower problem. QED's open-source release and expert-verified results on genuine open problems position it for significantly broader scientific impact.

    vs. OmniDiagram: Advancing Unified Diagram Code Generation via Visual Interrogation Reward
    claude-opus-4.65/5/2026

    QED demonstrates a breakthrough in AI-assisted mathematical reasoning by producing verified, original proofs for open research problems—a milestone that has been a grand challenge in AI. Its systematic identification of seven failure modes and targeted multi-agent architecture provides foundational insights for the field. The verified proofs on genuine open problems represent tangible scientific contributions beyond the AI system itself. While OmniDiagram is solid incremental work on diagram generation with a novel reward mechanism, QED's impact spans AI, mathematics, and scientific discovery more broadly, with higher potential to reshape how research is conducted.

    vs. D3-Gym: Constructing Real-World Verifiable Environments for Data-Driven Discovery
    gemini-35/5/2026

    Paper 1 introduces foundational infrastructure (565 verifiable environments across four disciplines) for training and evaluating AI scientists, addressing a critical bottleneck in data-driven discovery. While Paper 2 achieves a significant milestone in mathematics by solving open problems, Paper 1 has broader applicability across multiple scientific fields and provides a reusable benchmark that will likely catalyze widespread follow-up research and model development.

    vs. D3-Gym: Constructing Real-World Verifiable Environments for Data-Driven Discovery
    gemini-35/5/2026

    Paper 1 demonstrates an AI system solving genuinely open, expert-contributed mathematical problems, representing a major milestone in AI-driven scientific discovery. While Paper 2 provides a valuable dataset and benchmark for training scientific agents, Paper 1's achievement of producing original, verified proofs for unsolved problems has profound and immediate implications for the future of mathematical research and AI capabilities.

    vs. Machine Collective Intelligence for Explainable Scientific Discovery
    gemini-35/5/2026

    While Paper 1 presents a significant breakthrough in AI for mathematics by solving open problems, Paper 2 addresses a broader challenge across all empirical sciences: deriving explainable governing equations from data. By reducing extrapolation errors by orders of magnitude and creating interpretable models, Paper 2's approach can fundamentally accelerate scientific discovery across physics, biology, and chemistry, giving it a wider and potentially more transformative cross-disciplinary impact.

    vs. Generative structure search for efficient and diverse discovery of molecular and crystal structures
    gpt-5.25/5/2026

    Paper 2 likely has higher impact due to broad, timely applicability in materials and molecular discovery, with clear real-world pathways (drug discovery, catalysts, batteries) and a general methodological contribution (unifying diffusion generation with physics-based search). It targets a widely shared bottleneck—efficient exploration of energy landscapes—and claims strong efficiency gains and out-of-distribution effectiveness across molecules and crystals, suggesting robustness and cross-field uptake. Paper 1 is novel and valuable for AI+math, but its demonstrated scope (few open PDE/analysis problems) and narrower immediate application base likely limit near-term breadth compared to GSS.

    vs. Generative structure search for efficient and diverse discovery of molecular and crystal structures
    gpt-5.25/5/2026

    Paper 2 likely has higher scientific impact due to broad real-world applicability in materials and molecular discovery, a timely area with direct links to catalysis, batteries, pharmaceuticals, and crystal engineering. Its unified framework bridging diffusion models with physics-based random structure search addresses a widely recognized limitation (training-data bias) and promises order-of-magnitude efficiency gains, enabling adoption across computational chemistry and materials science workflows. Paper 1 is highly novel and exciting but its demonstrated impact is narrower (a few PDE/open problems) and may be harder to generalize and rigorously validate at scale.

    vs. Machine Collective Intelligence for Explainable Scientific Discovery
    claude-opus-4.65/5/2026

    Paper 2 presents a broader paradigm (machine collective intelligence) applicable across diverse scientific domains, combining symbolism and metaheuristics for autonomous equation discovery. Its demonstrated results—recovering governing equations across deterministic, stochastic, and uncharacterized systems with up to six orders of magnitude improvement in extrapolation—suggest wider scientific impact. While Paper 1 (QED) is impressive in solving open math problems, its impact is more narrowly focused on mathematical proof generation. Paper 2's potential to transform AI-driven scientific discovery across physics, biology, and engineering gives it broader cross-disciplinary relevance.

    vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
    gpt-5.25/5/2026

    Paper 1 has higher likely scientific impact: it introduces a novel, pre-registered benchmark quantifying identity-contingent safety-induced omission harm in clinical advice, with substantial real-world and policy relevance for deploying LLMs in healthcare. The methodology is comparatively rigorous (pre-registration, structured scoring, physician validation, multiple models, statistical testing) and targets a timely, high-stakes failure mode (withholding/filters) with broad implications for AI safety, evaluation, and regulation. Paper 2 is innovative and useful, but impact may be narrower to AI4Math and depends on sustained expert verification and generalization beyond five problems.

    vs. End-to-end autonomous scientific discovery on a real optical platform
    gemini-35/5/2026

    While Paper 1 makes significant strides in automated mathematical proving, Paper 2 achieves a groundbreaking milestone by demonstrating end-to-end autonomous scientific discovery on a physical experimental platform. The AI's discovery and experimental validation of a novel physical mechanism (optical bilinear interaction) not only proves the system's efficacy in the real world but also offers tangible technological applications in developing high-speed, energy-efficient optical hardware for AI computations.

    vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
    claude-opus-4.65/5/2026

    IatroBench addresses a critical, timely problem—AI safety measures causing iatrogenic harm—with rigorous pre-registered methodology across 3,600 responses and 6 frontier models. It reveals a systematic, identity-contingent failure mode with real clinical consequences, challenging fundamental assumptions in AI safety alignment. Its implications span AI policy, healthcare, and model development. While QED's automated theorem proving for open problems is impressive, IatroBench's findings are more broadly impactful: they affect billions of potential users, expose a paradox in current safety practices, and will likely influence regulatory and deployment decisions across the AI industry.

    vs. End-to-end autonomous scientific discovery on a real optical platform
    claude-opus-4.65/5/2026

    Paper 2 demonstrates end-to-end autonomous scientific discovery in a real physical system, autonomously identifying and experimentally validating a previously unreported physical mechanism (optical bilinear interaction). This represents a broader milestone—bridging AI reasoning with physical experimentation—with transformative implications across experimental sciences. While Paper 1's multi-agent mathematical proof system is impressive (solving 3 of 5 open problems), Paper 2's integration of hypothesis generation, experimental design, execution, and validation on real hardware represents a more fundamental advance with wider cross-disciplinary impact and practical applications in optical computing.

    vs. MIMIC: A Generative Multimodal Foundation Model for Biomolecules
    gpt-5.25/5/2026

    Paper 2 (MIMIC) likely has higher scientific impact due to broader real-world applications (drug discovery, variant interpretation, RNA splicing, protein design), wider cross-field reach (genomics, structural biology, ML, medicine), and strong timeliness as multimodal foundation models are central to current bio-AI. The work claims a new aligned multimodal dataset, flexible conditional generation, state-of-the-art downstream results, and design demonstrations. Paper 1 is novel and valuable for AI+math, but its domain is narrower and impact depends on reproducibility and sustained success on more open problems.

    vs. MIMIC: A Generative Multimodal Foundation Model for Biomolecules
    claude-opus-4.65/5/2026

    MIMIC represents a more broadly impactful contribution: a multimodal foundation model unifying sequence, structure, regulatory, evolutionary, and contextual modalities for biomolecular understanding and design. Its applications span RNA splicing prediction, protein design (PD-L1/hACE2 binders), and therapeutic mutation correction—areas with direct biomedical impact. The new LORE dataset and the generative multimodal framework address a fundamental gap in computational biology. While QED is impressive in generating novel mathematical proofs via multi-agent systems, MIMIC's breadth of applications, methodological innovation in multimodal biological modeling, and potential for drug discovery and therapeutic design give it broader and more immediate real-world impact.

    vs. Simulating clinical interventions with a generative multimodal model of human physiology
    gpt-5.25/5/2026

    Paper 1 introduces a broadly applicable generative “health world model” trained on large, deeply phenotyped longitudinal data and demonstrates strong transfer, improved clinical endpoint prediction, and intervention-conditioned simulation aligned with RCT evidence—high novelty plus immediate translational value for risk stratification and digital twins. Its potential impact spans medicine, public health, wearable sensing, causal inference, and ML. Paper 2 is innovative and timely for AI-for-math and open-sources a useful system, but its direct real-world applications and cross-field breadth are narrower, and results are shown on only five problems with expert verification but limited scalability evidence.

    vs. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
    gemini-35/5/2026

    Paper 2 addresses a universal bottleneck in the scientific ecosystem—peer review scalability. Its massive field deployment (23,000+ papers) and empirical evidence that AI reviews are preferred over human ones offer immediate, transformative implications for how research is evaluated across all disciplines. While Paper 1 makes impressive strides in AI-driven mathematical discovery, Paper 2's broad applicability, timeliness, and successfully executed systemic integration give it a higher potential for widespread, structural scientific impact.

    vs. Simulating clinical interventions with a generative multimodal model of human physiology
    claude-opus-4.65/5/2026

    HealthFormer represents a broader and more transformative scientific contribution. It introduces a generative model of human physiology trained on deeply phenotyped data across 7 domains, enabling clinical digital twins, personalized intervention simulation, and disease risk prediction—all from a single model. It validates across 4 independent cohorts and 41 randomized trial comparisons, demonstrating remarkable generalization. While QED is impressive in generating novel mathematical proofs via multi-agent LLM systems, HealthFormer's potential to revolutionize personalized medicine, clinical trial design, and preventive healthcare gives it substantially broader real-world impact across medicine and public health.

    vs. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
    gpt-5.25/5/2026

    Paper 2 has higher potential impact: it reports original, expert-verified proofs for open mathematical problems and releases an open-source multi-agent system, advancing both AI-for-math capability and mathematical knowledge. This is highly novel, methodologically concrete (failure-mode analysis, targeted architecture, expert verification), and broadly influential for automated reasoning, formal methods, and scientific discovery. Paper 1 is timely with clear real-world application and strong scale, but its core contribution is a large deployment/benchmarking of existing frontier-model tooling rather than a breakthrough capability, and its applicability depends on policy/ethics constraints.

    vs. Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
    gpt-5.25/5/2026

    Paper 1 has higher potential scientific impact due to greater novelty (a multi-agent architecture explicitly targeting identified failure modes) and a stronger scientific claim: producing expert-verified, original proofs for multiple open PDE/applied analysis problems. If robust, this advances AI-assisted mathematical discovery with broad downstream implications for mathematics, theoretical CS, and scientific modeling. Paper 2 is timely and valuable for benchmarking deployed endpoints (speed/cost/energy/quality), with clear real-world utility, but it is primarily an empirical evaluation framework; its impact is likely more incremental and concentrated in ML ops/benchmarking than research-level scientific discovery.

    vs. Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
    gpt-5.25/5/2026

    Paper 1 is more novel scientifically (AI system design yielding original, expert-verified proofs for open problems) and has broader cross-field impact (AI, formal methods, mathematics), with a clear step-change claim beyond benchmarks. Its real-world applications include accelerating mathematical discovery and improving trustworthy reasoning systems. While Paper 2 is timely and highly practical for deployment evaluation, it is primarily a benchmarking/methodology contribution with narrower scientific novelty and less fundamental advancement than demonstrated novel proof generation on open research problems.