AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

Joydeep Biswas, Sheila Schoepp, Gautham Vasan, Anthony Opipari, Arthur Zhang, Zichao Hu, Sebastian Joseph, Matthew Lease

Gold · Week 16, 2026
Tournament Score: 1622±19
Win Rate: 84% (202 wins, 38 losses, 240 matches)
Rating: 7.8/10 across Significance, Rigor, Novelty, and Clarity

Abstract

Scientific peer review faces mounting strain as submission volumes surge, making it increasingly difficult to sustain review quality, consistency, and timeliness. Recent advances in AI have led the community to consider its use in peer review, yet a key unresolved question is whether AI can generate technically sound reviews at real-world conference scale. Here we report the first large-scale field deployment of AI-assisted peer review: every main-track submission at AAAI-26 received one clearly identified AI review from a state-of-the-art system. The system combined frontier models, tool use, and safeguards in a multi-stage process to generate reviews for all 22,977 full-review papers in less than a day. A large-scale survey of AAAI-26 authors and program committee members showed that participants not only found AI reviews useful, but actually preferred them to human reviews on key dimensions such as technical accuracy and research suggestions. We also introduce a novel benchmark and find that our system substantially outperforms a simple LLM-generated review baseline at detecting a variety of scientific weaknesses. Together, these results show that state-of-the-art AI methods can already make meaningful contributions to scientific peer review at conference scale, opening a path toward the next generation of synergistic human-AI teaming for evaluating research.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

1. Core Contribution

This paper reports the first full-scale, live deployment of AI-generated peer reviews at a major AI conference. Every main-track submission at AAAI-26 (22,977 papers) received one clearly labeled AI review generated by a custom multi-stage, multi-tool LLM pipeline. The system combines frontier models (GPT-5), tool use (code interpreter, web search), and structured multi-stage reasoning across five review dimensions (story, presentation, evaluations, correctness, significance), followed by self-critique and revision. The paper contributes three things: (1) an operational system that generated reviews for ~23K papers in under 24 hours at less than $1 per paper, (2) a large-scale survey (5,834 responses) comparing AI and human review quality, and (3) the SPECS benchmark for evaluating AI review systems via synthetic perturbations to accepted papers.
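The scale figures quoted above imply simple bounds on throughput and spend. The following is a back-of-the-envelope sketch, not a description of the authors' infrastructure: it treats "under 24 hours" and "less than $1 per paper" as upper bounds, and the constant and function names are illustrative.

```python
# Figures quoted in the assessment above.
PAPERS = 22_977            # full-review main-track submissions
MAX_HOURS = 24             # "under 24 hours": upper bound on wall-clock time
MAX_COST_PER_PAPER = 1.0   # "less than $1 per paper": upper bound, in USD


def min_throughput_per_minute(papers: int, max_hours: float) -> float:
    """Lower bound on the average number of reviews finished per minute."""
    return papers / (max_hours * 60)


def max_total_cost(papers: int, max_cost_per_paper: float) -> float:
    """Upper bound on total cost in USD implied by the per-paper bound."""
    return papers * max_cost_per_paper


rate = min_throughput_per_minute(PAPERS, MAX_HOURS)
budget = max_total_cost(PAPERS, MAX_COST_PER_PAPER)
print(f"at least {rate:.1f} reviews/minute on average")  # at least 16.0 reviews/minute on average
print(f"total cost below ${budget:,.0f}")                # total cost below $22,977
```

Sustaining roughly 16 completed reviews per minute suggests heavy parallelism, since a multi-stage review of a single paper presumably takes far longer than four seconds.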

2. Methodological Rigor

System design: The multi-stage pipeline is well-motivated by prior work showing that monolithic LLM prompts produce inferior reviews. The decomposition into five criterion-specific stages, each with targeted tools (code interpreter for correctness/evaluations, web search for significance), reflects thoughtful engineering. The self-critique loop adds a validation layer. However, the exact prompts are withheld, limiting reproducibility.
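The staged structure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompts are withheld, so the prompt strings, the `ModelFn` interface, the tool labels, and the assumption that the story and presentation stages use no tools are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

# Criterion -> tools, following the assessment's description. The tool names
# are illustrative labels, not identifiers from any real API. Empty lists for
# "story" and "presentation" are an assumption.
STAGE_TOOLS: dict[str, list[str]] = {
    "story": [],
    "presentation": [],
    "evaluations": ["code_interpreter"],
    "correctness": ["code_interpreter"],
    "significance": ["web_search"],
}

# Hypothetical model interface: (prompt, allowed tools) -> generated text.
ModelFn = Callable[[str, list[str]], str]


@dataclass
class Review:
    sections: dict[str, str] = field(default_factory=dict)
    critique: str = ""
    final_text: str = ""


def run_pipeline(paper_text: str, model: ModelFn) -> Review:
    """Run one criterion-specific stage per dimension, then critique and revise."""
    review = Review()
    for criterion, tools in STAGE_TOOLS.items():
        prompt = f"Assess the {criterion} of this paper:\n{paper_text}"
        review.sections[criterion] = model(prompt, tools)

    draft = "\n\n".join(f"[{name}] {text}" for name, text in review.sections.items())
    review.critique = model(f"Critique this draft review:\n{draft}", [])
    review.final_text = model(
        "Revise the draft using the critique.\n"
        f"Draft:\n{draft}\nCritique:\n{review.critique}",
        [],
    )
    return review


def echo_model(prompt: str, tools: list[str]) -> str:
    """Stand-in model that echoes the first prompt line, for dry runs only."""
    return f"<{prompt.splitlines()[0]} | tools={tools}>"


result = run_pipeline("(paper text)", echo_model)
print(list(result.sections))
```

Passing the model as a callable keeps the sketch runnable without any provider SDK; a real deployment would replace `echo_model` with an actual model client.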

Survey design: The study obtained IRB/REB approval from three institutions. The survey employed Likert-scale items across four dimensions with 5,834 responses. Statistical analysis uses Mann-Whitney U tests with all comparisons showing significance at α=0.01. However, there are important methodological concerns:

  • Self-selection bias is acknowledged but not adequately addressed. Respondents who found AI reviews noteworthy (positively or negatively) may be overrepresented.
  • Non-blinding confound: AI reviews were clearly labeled, so respondents knew which reviews were AI-generated. This introduces potential novelty effects, confirmation bias, or anchoring effects that could inflate or deflate ratings in unpredictable ways.
  • The comparison is between AI reviews (which had no scores/recommendations) and human reviews (which did), creating an asymmetry that could affect perceptions of thoroughness and technical focus.
  • Author preference for AI reviews could partially reflect that AI reviews are more detailed/longer, which authors may appreciate even if the additional content has diminishing value for decision-making.

SPECS benchmark: The synthetic perturbation approach is methodologically sound in principle: it generates controlled errors in accepted papers and measures detection rates. The human oversight (Table 5) reveals that only 22/35 perturbations were unanimously deemed valid scientific errors, with presentation and significance perturbations being particularly problematic (3/7 valid each). This suggests the benchmark's reliability varies substantially by criterion. The sample size for human validation (35 perturbations) is relatively small. The LLM-as-judge approach introduces potential circularity when evaluating an LLM-based review system.
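For readers unfamiliar with the test named under Survey design, here is a minimal pure-Python sketch of a two-sided Mann-Whitney U test on Likert-style ratings. It is not the authors' analysis code: it uses the normal approximation with a tie correction and no continuity correction, and the example ratings are invented for illustration.

```python
import math
from collections import Counter


def mann_whitney_u(x: list[int], y: list[int]) -> tuple[float, float]:
    """Return (U statistic for x, two-sided p-value).

    Uses average ranks for ties and the tie-corrected normal approximation.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    counts = Counter(x + y)

    # Assign each distinct value the average of the ranks it occupies.
    rank_of: dict[int, float] = {}
    next_rank = 1
    for value in sorted(counts):
        c = counts[value]
        rank_of[value] = next_rank + (c - 1) / 2
        next_rank += c

    rank_sum_x = sum(rank_of[v] for v in x)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    # Tie-corrected variance of U under the null hypothesis.
    tie_term = sum(c**3 - c for c in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # all observations identical: no evidence either way
        return u_x, 1.0

    z = (u_x - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u_x, p_value


# Invented 5-point Likert ratings, for illustration only.
ai_ratings = [5, 4, 4, 5, 3, 4, 5, 4]
human_ratings = [3, 4, 2, 3, 4, 3, 2, 3]
u, p = mann_whitney_u(ai_ratings, human_ratings)
print(f"U = {u}, p = {p:.4f}, significant at alpha=0.01: {p < 0.01}")
```

The normal approximation is rough for samples this small; with thousands of responses, as in the survey, it is the usual choice. A rank-based test suits Likert data because it relies only on the ordering of the responses, not on equal spacing between scale points.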

3. Potential Impact

This work has substantial practical and policy implications:

  • Operational precedent: Demonstrating feasibility at the scale of a top-tier conference removes a key barrier to adoption. Other venues will likely follow.
  • Community norms: The finding that participants preferred AI reviews on several dimensions could accelerate integration of AI into review processes, potentially reshaping how conferences organize peer review.
  • Complementarity argument: The data showing AI reviews catch errors humans miss (and vice versa) provides evidence for human-AI teaming rather than replacement—an important framing for community acceptance.
  • Cost efficiency: At <$1/paper, the economic argument is compelling, especially for conferences struggling with reviewer recruitment.

However, there are significant risks the paper acknowledges but doesn't deeply explore: gaming/optimization of papers for AI reviewers, deskilling of human reviewers over time, and concentration of reviewing infrastructure around a single commercial provider (OpenAI).

4. Timeliness & Relevance

This paper addresses perhaps the most pressing operational challenge in AI/ML research: the peer review crisis driven by exponential submission growth. With AAAI submissions doubling year-over-year and over half of researchers already using AI for review (often covertly), the question isn't whether AI will be used in peer review, but how. This paper provides the first rigorous large-scale evidence base for that integration, making it extremely timely.

5. Strengths & Limitations

Key Strengths:

  • Unprecedented scale: 22,977 papers reviewed in a live conference setting
  • Comprehensive evaluation combining quantitative survey, qualitative analysis, and benchmark evaluation
  • Transparent about limitations (overemphasis on minor issues, verbosity, difficulty with significance judgments)
  • Multi-institutional collaboration with extensive ethical oversight
  • The SPECS benchmark fills a genuine gap in evaluation methodology

Key Limitations:

  • Labeled reviews: The non-blinded comparison between AI and human reviews is the most significant methodological weakness. A randomized controlled design where some AI reviews are unlabeled would provide much stronger evidence.
  • No impact on outcomes: The paper does not analyze whether AI reviews influenced acceptance decisions, review scores, or discussion quality—arguably the most important measures of utility.
  • Dependence on proprietary models: The system relies entirely on GPT-5 under a special agreement with OpenAI (which was also a conference sponsor), raising concerns about reproducibility, vendor lock-in, and potential conflicts of interest.
  • SPECS validity: The moderate human agreement on perturbation validity (22 of 35 perturbations, about 63%, unanimously judged valid) and the small validation sample weaken the benchmark's authority.
  • Missing analysis: No analysis of AI review quality variation across subfields, paper quality levels, or paper types (theoretical vs. empirical). The qualitative finding about "shallow contextual and domain understanding" suggests significant heterogeneity that isn't quantified.
  • Review length asymmetry: AI reviews were acknowledged as verbose, which could inflate thoroughness ratings while simultaneously reducing practical utility.

Overall Assessment

This is a landmark paper in terms of scale and operational significance: the first real deployment of AI reviews at a major conference. The survey results are compelling despite methodological limitations. However, the lack of blinding, the absence of outcome analysis, and the reliance on proprietary infrastructure temper the scientific conclusions. The SPECS benchmark is a useful but imperfect contribution. The paper's greatest impact will likely be as a precedent-setting case study that catalyzes adoption across venues, rather than as a definitive scientific evaluation of AI review quality.

Rating: 7.8/10
Significance 8.5 · Rigor 6.5 · Novelty 7.5 · Clarity 8

Generated Apr 16, 2026

Comparison History (240)

    vs. Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics
    gemini-3.1 · 5/16/2026

    While Paper 1 presents a highly innovative benchmark for AI reasoning in mathematics, Paper 2 has a significantly broader potential impact across all scientific disciplines. By demonstrating that AI-generated peer reviews are technically sound and preferred by authors at a massive real-world scale (over 22,000 papers), Paper 2 addresses a critical and universal bottleneck in the scientific process. This could fundamentally transform how scientific research is evaluated globally.

    vs. Containment Verification: AI Safety Guarantees Independent of Alignment
    gpt-5.2 · 5/16/2026

    Paper 2 is more novel and foundational: it shifts safety guarantees from opaque model behavior to formally verified agentic frameworks with universal (havoc-oracle) semantics, yielding capability-invariant guarantees. The methodological rigor is higher via mechanized proofs in Dafny and a concrete verified framework (PocketFlow), and its ideas can generalize across many AI systems that use tool/action interfaces, impacting formal methods, security, and AI safety. Paper 1 has strong real-world applicability and timeliness, but its impact is more operational and contingent on ecosystem adoption and evaluation design, with fewer transferable scientific guarantees.

    vs. SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment
    claude-opus-4.6 · 5/16/2026

    The AI-assisted peer review paper addresses a fundamental bottleneck in scientific publishing affecting all fields, deployed at massive scale (22,977 papers at AAAI-26). Its impact is broader because it could transform how science itself is evaluated across all disciplines. While SymptomAI is impressive in medical AI with strong methodology and real-world deployment, the peer review paper's potential to reshape scientific infrastructure gives it wider cross-disciplinary impact. Both are rigorous large-scale deployments, but reforming peer review has cascading effects on all of science.

    vs. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
    gemini-3.1 · 5/16/2026

    Paper 2 addresses a critical, universal bottleneck in science—peer review—with a massive, real-world deployment at a top conference (AAAI-26). Its success in generating preferred, technically sound reviews at scale suggests a paradigm shift in how research is evaluated, offering broader interdisciplinary impact and immediate practical utility compared to Paper 1's narrower focus on LLM serving systems.

    vs. Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces
    claude-opus-4.6 · 5/15/2026

    Paper 1 reports the first large-scale field deployment of AI-assisted peer review at a major conference (AAAI-26, 22,977 papers), addressing a critical and timely problem affecting all of science. The finding that AI reviews were preferred over human reviews on key dimensions is potentially transformative for the entire scientific publishing ecosystem. Its real-world scale, immediate practical applicability, and broad cross-field relevance give it exceptional impact potential. Paper 2, while technically interesting in analyzing reasoning trace redundancy, addresses a more niche problem with narrower immediate applications.

    vs. GenCircuit-RL: Reinforcement Learning from Hierarchical Verification for Genetic Circuit Design
    gemini-3.1 · 5/15/2026

    Paper 2 addresses a critical bottleneck in the scientific process itself—peer review. Its real-world deployment at scale and findings that AI reviews are preferred over human ones have profound implications for how science is evaluated across all disciplines. While Paper 1 presents a strong domain-specific advancement in synthetic biology, Paper 2's potential to fundamentally transform the scientific publication ecosystem gives it broader, more immediate scientific impact.

    vs. AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use
    claude-opus-4.6 · 5/7/2026

    The AAAI-26 AI-assisted peer review paper represents a landmark first large-scale field deployment (22,977 papers) of AI in scientific peer review, addressing a fundamental challenge across all of science. Its breadth of impact is enormous—it affects how research itself is evaluated. The finding that AI reviews were preferred over human reviews on key dimensions is paradigm-shifting. Paper 2, while practically useful for AI agent safety, is more incremental engineering work in a narrower domain. Paper 1's implications for the scientific enterprise as a whole give it substantially greater potential impact.

    vs. Geometric Routing Enables Causal Expert Control in Mixture of Experts
    gpt-5.2 · 5/5/2026

    Paper 2 has higher likely impact due to a real-world, conference-scale deployment (22,977 papers) addressing an urgent bottleneck in science, with immediate applicability across fields and strong timeliness. Its contributions (end-to-end system, safeguards, benchmark, and large survey evidence) could rapidly influence peer-review policy, tooling, and research evaluation practices. Paper 1 is novel and methodologically interesting for MoE interpretability and controllability, but its direct downstream impact is narrower and more contingent on adoption in specific model architectures and research communities.

    vs. Geometric Routing Enables Causal Expert Control in Mixture of Experts
    gpt-5.2 · 5/5/2026

    Paper 2 likely has higher impact due to its unprecedented real-world, conference-scale deployment (22,977 papers) with direct operational relevance and immediate applicability to the scientific ecosystem. It addresses a timely, high-stakes bottleneck (peer review), provides empirical evidence via field data and surveys, and introduces a benchmark, making it broadly influential across disciplines and research governance. Paper 1 is novel and rigorous for MoE interpretability/control, but its impact is more specialized to ML architecture/interpretability, whereas Paper 2 could reshape review workflows and policy across fields.

    vs. SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
    claude-opus-4.6 · 5/5/2026

    The AAAI-26 AI Review Pilot represents a landmark first large-scale field deployment of AI-assisted peer review across ~23,000 papers at a major conference, directly addressing a critical bottleneck in the scientific enterprise. Its real-world validation with surveys showing AI reviews preferred over human reviews on key dimensions has immediate, broad implications for how science is evaluated globally. Paper 2, while technically strong with a novel data construction framework for scientific reasoning agents, represents more incremental progress in the competitive LLM benchmark landscape. Paper 1's institutional-scale impact and potential to reshape peer review gives it substantially higher scientific impact.

    vs. SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
    claude-opus-4.6 · 5/5/2026

    Paper 2 reports the first large-scale real-world deployment of AI-assisted peer review across all 22,977 AAAI-26 submissions, with empirical evidence that AI reviews were preferred over human reviews on key dimensions. This has immediate, broad impact across all scientific fields that rely on peer review, addressing a universal and urgent problem. Paper 1, while strong in its domain (scientific reasoning agents), represents an incremental advance in a crowded space of AI agent benchmarks. Paper 2's practical demonstration at unprecedented scale, combined with its potential to reshape how science is evaluated globally, gives it substantially higher impact potential.

    vs. Towards Understanding Specification Gaming in Reasoning Models
    claude-opus-4.6 · 5/5/2026

    Paper 1 reports the first large-scale real-world deployment of AI-assisted peer review across all 22,977 AAAI-26 submissions, demonstrating AI reviews were preferred over human reviews on key dimensions. This has enormous practical implications for the entire scientific enterprise, affecting how research is evaluated globally. While Paper 2 makes valuable contributions to understanding specification gaming in reasoning models, Paper 1's unprecedented scale, immediate real-world application, and potential to reshape a fundamental scientific process give it broader and more transformative impact across all of science.

    vs. Towards Understanding Specification Gaming in Reasoning Models
    claude-opus-4.6 · 5/5/2026

    Paper 1 reports the first large-scale deployment of AI-assisted peer review at a major conference (22,977 papers at AAAI-26), demonstrating that AI reviews were preferred over human reviews on key dimensions. This has enormous practical implications for the scientific enterprise itself—potentially transforming how all research is evaluated. While Paper 2 makes valuable contributions to understanding specification gaming in reasoning models, Paper 1's scale, real-world deployment, and potential to reshape scientific peer review give it broader and more transformative impact across all scientific fields.

    vs. Characterizing Model-Native Skills
    gpt-5.2 · 5/5/2026

    Paper 2 is more novel and broadly impactful: it proposes a general, model-native framework for discovering skill axes directly from activations and demonstrates bidirectional utility (training data selection and inference-time steering) across multiple models and domains (math reasoning, safety). The method is likely to generalize to many intervention settings in interpretability, alignment, and optimization, with clear methodological contributions and open-sourced code. Paper 1 has high immediate real-world relevance, but its impact is more application/deployment-specific and may be less transferable scientifically than a new representation-grounded intervention paradigm.

    vs. Characterizing Model-Native Skills
    gpt-5.2 · 5/5/2026

    Paper 1 offers a novel, generalizable methodology (model-native skill bases from activations) with clear, quantified gains and dual use for both training data selection and inference-time steering—likely to influence multiple LLM intervention areas (capabilities, alignment, interpretability). Its approach is broadly applicable across models/tasks and advances mechanistic control. Paper 2 is timely and high-profile with strong real-world relevance, but its impact may be narrower (peer-review operations/policy), more deployment- and survey-dependent, and potentially harder to generalize scientifically beyond the specific system and setting.

    vs. QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems
    gemini-3 · 5/5/2026

    Paper 2 addresses a universal bottleneck in the scientific ecosystem—peer review scalability. Its massive field deployment (23,000+ papers) and empirical evidence that AI reviews are preferred over human ones offer immediate, transformative implications for how research is evaluated across all disciplines. While Paper 1 makes impressive strides in AI-driven mathematical discovery, Paper 2's broad applicability, timeliness, and successfully executed systemic integration give it a higher potential for widespread, structural scientific impact.

    vs. QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems
    gpt-5.2 · 5/5/2026

    Paper 2 has higher potential impact: it reports original, expert-verified proofs for open mathematical problems and releases an open-source multi-agent system, advancing both AI-for-math capability and mathematical knowledge. This is highly novel, methodologically concrete (failure-mode analysis, targeted architecture, expert verification), and broadly influential for automated reasoning, formal methods, and scientific discovery. Paper 1 is timely with clear real-world application and strong scale, but its core contribution is a large deployment/benchmarking of existing frontier-model tooling rather than a breakthrough capability, and its applicability depends on policy/ethics constraints.

    vs. Machine Collective Intelligence for Explainable Scientific Discovery
    claude-opus-4.6 · 5/5/2026

    Paper 1 introduces a fundamentally new paradigm (machine collective intelligence) for autonomous scientific equation discovery that bridges symbolism and metaheuristics, with demonstrated results across diverse scientific domains showing orders-of-magnitude improvements over deep neural networks. This addresses a core bottleneck in AI-driven science with broad applicability across physics, biology, and engineering. Paper 2, while practically significant as the first large-scale AI peer review deployment, is more operational/procedural in nature and addresses a workflow optimization problem rather than enabling new scientific discoveries. Paper 1's methodological contribution has deeper and broader scientific implications.

    vs. Machine Collective Intelligence for Explainable Scientific Discovery
    claude-opus-4.6 · 5/5/2026

    Paper 1 presents a fundamentally new paradigm for scientific discovery—machine collective intelligence combining symbolism and metaheuristics—with demonstrated results across diverse scientific domains, achieving dramatic improvements in extrapolation (up to 6 orders of magnitude) and interpretability. This addresses a core bottleneck in AI-driven science with broad cross-disciplinary applicability. Paper 2, while practically significant as the first large-scale AI peer review deployment, is more of an engineering/systems contribution specific to the review process. Paper 1's methodological innovation and potential to transform how governing equations are discovered across sciences gives it higher fundamental impact.

    vs. Model Spec Midtraining: Improving How Alignment Training Generalizes
    gemini-3 · 5/5/2026

    Paper 1 demonstrates a real-world, large-scale deployment of AI in the scientific peer-review process, addressing a critical bottleneck in research evaluation. Its successful application at a major conference and preference over human reviews suggest a transformative impact on how scientific literature is assessed across all disciplines. Paper 2 presents a valuable but narrower technical contribution to AI alignment, whereas Paper 1's findings have direct, immediate, and broad implications for the global scientific community's operational infrastructure.