Forecasting Scientific Progress with Artificial Intelligence

Sean Wu, Pan Lu, Yupeng Chen, Jonathan Bragg, Yutaro Yamada, Peter Clark, David Clifton, Philip Torr

May 21, 2026

arXiv:2605.22681v1

cs.AI (primary)

#240 of 2292 · Artificial Intelligence

Tournament Score: 1510 ± 47 (scale 1050–1800)

Win Rate: 37%

Rating: 7.4 / 10

Significance 8 · Rigor 6.8 · Novelty 7.5 · Clarity 7.5

Abstract

Artificial intelligence (AI) is increasingly embedded in scientific discovery, yet whether it can anticipate scientific progress remains unclear. To study this question, we introduce a temporally grounded evaluation framework for forecasting scientific progress under controlled knowledge constraints. We present CUSP (Cutoff-conditioned Unseen Scientific Progress), a multi-disciplinary and event-level benchmark that evaluates scientific forecasting in AI systems through feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Across 4,760 scientific events, we observe systematic and domain-dependent limitations in current frontier models. While models can identify plausible research directions from competing candidates, they fail to reliably predict whether scientific advances will be realized and systematically misestimate when they will occur. Performance is highly heterogeneous across domains, with the timing of AI progress more predictable than advances in biology, chemistry, and physics. Performance is largely insensitive to whether events occur before or after the training cutoff, suggesting these limitations cannot be explained solely by knowledge exposure in training data. Under controlled information access, additional pre-cutoff knowledge improves performance but does not close the gap to full-information settings, which becomes more pronounced for high-citation advances. Models also exhibit systematic overconfidence and strong response biases, indicating unreliable uncertainty estimation. Taken together, current AI systems fall short as predictive tools for scientific progress. Access to prior knowledge does not translate into reliable forecasting, and performance benefits more from post-event information than from forward-looking prediction.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Forecasting Scientific Progress with Artificial Intelligence"

1. Core Contribution

This paper introduces CUSP, a temporally grounded benchmark comprising 4,760 scientific milestones and 17,429 structured tasks designed to evaluate whether frontier LLMs can forecast scientific progress. The key novelty lies in the *operationalization* of scientific forecasting as a measurable capability, decomposed into four complementary dimensions: binary feasibility assessment, mechanistic reasoning (MCQ), generative solution design (FRQ), and temporal prediction (date estimation). The benchmark enforces strict temporal knowledge constraints tied to each model's training cutoff, enabling controlled disentanglement of knowledge access from genuine forecasting ability.

The central finding is negative but informative: current frontier AI systems systematically fail at scientific forecasting despite having access to substantial prior knowledge. They can identify plausible technical approaches (MCQ accuracy up to 0.819 for GPT-5.4) but cannot reliably predict *whether* advances will be realized (binary performance near chance at ~0.50) or *when* they will occur (systematic positive temporal bias of 4–36 months). The paper introduces the analytical decomposition of performance gaps into "knowledge gaps" (Δ_know) and "forecasting gaps" (Δ_fore), revealing that the forecasting gap dominates and grows with citation impact.
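To make the gap decomposition concrete, here is a minimal sketch. It assumes the knowledge gap is the gain from adding cutoff-restricted web search over a closed-book run, and the forecasting gap is the remaining distance to the unrestricted (full-information) run. These definitions, the class and function names, and the numbers in the usage note are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Scores for one model on one task under three information settings.

    Field names are hypothetical; the settings mirror the closed-book,
    WS+Cutoff, and unrestricted-WS conditions described in the review.
    """
    closed_book: float    # no web search
    search_cutoff: float  # web search limited to pre-cutoff sources
    search_full: float    # unrestricted web search


def gap_decomposition(s: Scores) -> dict:
    """Split the gap to the full-information setting into two parts (assumed form)."""
    delta_know = s.search_cutoff - s.closed_book  # gain from extra pre-cutoff knowledge
    delta_fore = s.search_full - s.search_cutoff  # gap that only post-event information closes
    return {
        "delta_know": delta_know,
        "delta_fore": delta_fore,
        "total": delta_know + delta_fore,
    }
```

With made-up scores such as `Scores(0.50, 0.55, 0.80)`, the forecasting gap (about 0.25) would dominate the knowledge gap (about 0.05), which is the qualitative pattern the assessment describes.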

2. Methodological Rigor

Strengths in design: The two-track evaluation framework—combining deterministic outcome scoring with rubric-based reasoning evaluation—is well-motivated and addresses known issues with LLM evaluation (correct answers via flawed reasoning). The temporal stratification using earliest-observed DOI dates across multiple APIs (Crossref, Semantic Scholar, OpenAlex, etc.) is thorough. The web search augmentation experiments with controlled information access (WS+Cutoff vs. unrestricted WS) provide a clean experimental design for decomposing knowledge versus forecasting contributions.

Concerns: Several methodological choices warrant scrutiny. First, the binary task design has an inherent label imbalance (ground truth for originals is always "Yes," perturbed always "No"), and the "merged score" averaging these may mask important asymmetries—as Table 19 reveals, models exhibit extreme response biases (LLaMA 3.3: 93% "Yes" bias). Second, the FRQ evaluation relies on GPT-5.4-mini with web search as judge, which introduces its own biases; the human-AI judge correlation of r=0.34 (n=60), while statistically significant, is modest. Third, the date prediction metric (exponential decay with 0.1 scaling) is somewhat arbitrary—the choice of decay rate significantly affects relative model rankings. Fourth, the LLaMA 3.3 date prediction anomaly (best score despite worst MCQ) appears driven by temporal anchoring artifacts acknowledged but not fully resolved.
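The sensitivity of the date metric to its decay rate can be illustrated with a small sketch. It assumes the score has the form exp(-decay × |error|) with the error measured in months; the review only states "exponential decay with 0.1 scaling", so the exact functional form, the unit, and the two toy models below are assumptions.

```python
import math


def date_score(error_months: float, decay: float = 0.1) -> float:
    """Assumed scoring rule: 1.0 for an exact date, decaying with absolute error."""
    return math.exp(-decay * abs(error_months))


def mean_date_score(errors_months, decay: float = 0.1) -> float:
    """Average the per-event score over a list of signed errors in months."""
    return sum(date_score(e, decay) for e in errors_months) / len(errors_months)


# Two hypothetical models: A is nearly exact once and far off once,
# B is moderately off both times.
model_a = [1, 30]
model_b = [10, 10]

for decay in (0.1, 0.01):
    a = mean_date_score(model_a, decay)
    b = mean_date_score(model_b, decay)
    print(f"decay={decay}: A={a:.3f}  B={b:.3f}  better={'A' if a > b else 'B'}")
```

Under these assumptions, model A ranks higher at a decay of 0.1 and model B ranks higher at 0.01, so the ordering of models depends on a parameter whose value is not obviously principled.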

The validation pipeline using Grok-3 as an independent judge from GPT-4o (used for generation) is a reasonable precaution, and human validation agreement analysis (Appendix E.1) adds credibility. The statistical testing in web search experiments (with p-values) strengthens causal claims.

3. Potential Impact

Direct impact on AI evaluation: CUSP addresses a genuine gap in the benchmark landscape. Unlike retrospective scientific reasoning benchmarks (GPQA, MMLU-Pro) that are approaching saturation, CUSP remains substantially unsaturated (Figure 11), providing continued discriminative power. The continuously updatable design and Time Capsule component enable living evaluation.

Implications for AI-for-science: The finding that forecasting gaps grow with citation impact (Table 29) has practical consequences for using AI in research prioritization, funding allocation, and technology roadmapping. The domain-dependent predictability results (AI timing more predictable than biology/chemistry/physics) provide concrete guidance about where AI forecasting tools may be more or less reliable.

Broader implications: The systematic overconfidence findings (Table 20) carry weight for any deployment of LLMs as scientific advisors. The disconnect between knowledge access and forecasting ability challenges the assumption that larger, more knowledgeable models will naturally become better forecasters.

Limitations in impact: The benchmark, while multi-disciplinary, is heavily biased toward biology (1,234 papers) and AI (1,141), with chemistry (203) and environmental science (235) underrepresented. The restriction to Nature/Science/Cell for natural sciences introduces a prestige bias that may not represent the full spectrum of scientific progress.

4. Timeliness & Relevance

This work is highly timely. The rapid deployment of AI in scientific discovery (AlphaFold, AI Co-Scientist, Virtual Lab) has created urgent need to understand AI's predictive limitations. The paper directly addresses the emerging question of whether AI can move from scientific *assistance* to scientific *anticipation*—a distinction with significant implications for autonomous research agents. The benchmark spans January 2024–March 2026, ensuring evaluation against very recent frontier models including GPT-5.4.

5. Strengths & Limitations

Key strengths:

Novel and well-defined problem formulation with rigorous temporal controls

Large-scale, multi-disciplinary benchmark with diverse task types

Clean experimental design for decomposing knowledge vs. forecasting contributions

Practically important negative results with clear implications

Living benchmark design with Time Capsule for prospective evaluation

Comprehensive evaluation of six frontier models across multiple dimensions

Notable weaknesses:

The FRQ judge correlation with humans (r=0.34) raises reliability concerns for this task type

Binary task label structure (always Yes/No pairing) may be too predictable

Domain coverage is uneven; chemistry and environmental science are thin

The "scientific forecasting" framing may overstate what's being measured—predicting specific paper outcomes differs from forecasting broad scientific trajectories

No evaluation of reasoning models with extended thinking (e.g., o3, o4-mini) beyond DeepSeek R1

The Time Capsule remains unresolved, limiting its current contribution to calibration analysis only

The causal claim that "performance benefits more from post-event information than from forward-looking prediction" could be more precisely stated—this is somewhat tautological given the experimental design
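The binary-task concern in the list above can be illustrated with a short sketch. It assumes the merged score is the unweighted mean of accuracy on original items (ground truth "Yes") and on perturbed items (ground truth "No"); the function names and the toy answer list are hypothetical.

```python
def accuracy(answers, truth):
    """Fraction of answers equal to the single ground-truth label."""
    return sum(a == truth for a in answers) / len(answers)


def merged_score(orig_answers, pert_answers):
    """Assumed merged score: mean of accuracy on originals and on perturbed items."""
    acc_orig = accuracy(orig_answers, "Yes")  # originals are always "Yes"
    acc_pert = accuracy(pert_answers, "No")   # perturbed items are always "No"
    return acc_orig, acc_pert, (acc_orig + acc_pert) / 2


# A model that answers "Yes" 93% of the time regardless of the input.
biased = ["Yes"] * 93 + ["No"] * 7
acc_orig, acc_pert, merged = merged_score(biased, biased)
print(acc_orig, acc_pert, merged)
```

A responder that ignores the input lands at a merged score of about 0.5 whatever its "Yes" rate, so a near-chance merged score alone cannot separate a heavily biased model from an unbiased one that is simply guessing. Reporting the two accuracies separately would expose the asymmetry.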

Overall assessment: This is a substantial contribution that introduces a well-designed benchmark addressing a genuinely novel evaluation dimension. The negative findings are robustly established and practically important. While methodological refinements are needed (particularly for FRQ evaluation and binary task design), the framework establishes a strong foundation for studying AI's capacity for scientific foresight.

Rating: 7.4 / 10

Significance 8 · Rigor 6.8 · Novelty 7.5 · Clarity 7.5

Generated May 22, 2026

Comparison History (30)

vs. SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

claude-opus-4.6 · 5/22/2026

Paper 1 introduces CUSP, a novel benchmark addressing a fundamental question about AI's ability to forecast scientific progress—a topic with broad implications across all scientific disciplines. Its systematic evaluation of frontier models reveals important limitations (overconfidence, domain heterogeneity, failure modes) that inform the entire AI-for-science community. Paper 2, while technically sound, addresses a narrower problem (credit assignment in search-augmented reasoning) with an incremental methodological contribution. Paper 1's breadth of impact, timeliness given the AI-for-science movement, and foundational insights about AI capabilities give it higher potential impact.

vs. SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

claude-opus-4.6 · 5/22/2026

Paper 1 introduces CUSP, a novel benchmark addressing a fundamental question about AI's ability to forecast scientific progress—a topic with broad implications across all scientific disciplines. Its comprehensive evaluation of frontier models reveals systematic limitations in scientific forecasting, contributing important insights to AI capabilities research, science of science, and epistemic calibration. Paper 2, while technically sound, addresses a more incremental improvement in search-augmented reasoning via self-distillation, with narrower scope. Paper 1's breadth of impact, timeliness given rapid AI integration in science, and its interdisciplinary relevance give it higher potential impact.

vs. Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

gemini-3.1 · 5/22/2026

Paper 1 offers higher potential scientific impact due to its immediate, practical solutions to critical bottlenecks in LLM alignment and efficiency. By introducing 'Behavior Cues,' it provides a novel mechanism for monitorable reasoning, yielding significant efficiency gains (50% reduction in wasted tokens) and major safety improvements (recovering safe actions from 80% of unsafe traces). While Paper 2 introduces a valuable metascience benchmark, its primarily negative findings about current AI limitations offer less immediate utility compared to Paper 1's actionable framework for scalable oversight and safe AI deployment.

vs. Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

claude-opus-4.6 · 5/22/2026

Paper 2 introduces a novel, large-scale benchmark (CUSP) for evaluating AI's ability to forecast scientific progress—a fundamental and broadly relevant question. Its multi-disciplinary scope (4,760 events across many fields), systematic evaluation of frontier models, and insights about AI's predictive limitations have wide implications for science policy, AI development, and the philosophy of discovery. Paper 1 addresses an important but narrower technical problem (LLM reasoning oversight via behavior cues) with solid but domain-specific contributions. Paper 2's breadth, timeliness, and relevance to the rapidly growing AI-for-science community give it higher potential impact.

vs. Generative structure search for efficient and diverse discovery of molecular and crystal structures

gemini-3.1 · 5/22/2026

Paper 1 presents a highly actionable, novel methodology (GSS) that directly accelerates materials and molecular discovery by tenfold. This has profound real-world applications in drug discovery, materials science, and clean energy. In contrast, Paper 2 is a benchmarking study reporting primarily negative results about AI's current inability to forecast scientific progress. While Paper 2 is relevant to meta-science and AI evaluation, Paper 1 offers a concrete, computationally rigorous tool that immediately solves a major bottleneck in high-dimensional energy landscape exploration, providing more immediate and tangible scientific impact.

vs. A Collective Variational Principle Unifying Bayesian Inference, Game Theory, and Thermodynamics

claude-opus-4.6 · 5/22/2026

Paper 1 presents a novel theoretical framework unifying three major fields—Bayesian inference, game theory, and thermodynamics—through a collective variational principle. It offers formal proofs connecting free-energy minimisation to Nash equilibria, introduces new concepts (free-energy Harsanyi dividend), and makes falsifiable predictions validated across multiple systems. This kind of deep unifying theory has potential for broad, lasting impact across physics, neuroscience, biology, and AI. Paper 2, while timely and methodologically sound, is primarily a benchmark/evaluation study revealing AI limitations in forecasting, with more incremental impact on the AI evaluation community.

vs. MIMIC: A Generative Multimodal Foundation Model for Biomolecules

claude-opus-4.6 · 5/22/2026

MIMIC presents a novel generative multimodal foundation model that unifies multiple biological data modalities (sequence, structure, regulation, evolution) into a single framework, achieving state-of-the-art results across multiple tasks and enabling constrained biomolecular design. Its direct applications to RNA editing and protein design (e.g., PD-L1 binding) demonstrate significant translational potential. Paper 2, while introducing a valuable benchmark (CUSP) for evaluating AI forecasting of scientific progress, primarily characterizes limitations of current models rather than providing a transformative tool. MIMIC's methodological innovation, breadth of biological applications, and practical design capabilities give it substantially higher potential for driving downstream research and real-world impact.

vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact due to broader cross-disciplinary scope and foundational contribution: a temporally grounded, controlled-knowledge benchmark (CUSP) for forecasting scientific progress across thousands of events. It addresses a general, timely question about AI’s role in science and provides reusable evaluation infrastructure with implications for scientometrics, AI evaluation, and research policy. Paper 2 is novel and important for AI safety/clinical deployment, but its domain is narrower (clinical safety framing effects) and may have more constrained breadth despite strong pre-registration and methodology.

vs. Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

gpt-5.2 · 5/22/2026

Paper 2 likely has higher scientific impact due to strong real-world applicability (regulatory RWE, surveillance, forecasting), large-scale novel dataset/modeling (43.8B events, up to 1.7B params), and extensive methodological validation (1,000+ tasks, prospective/retrospective, external datasets, scale/post-training ablations, bias reduction in target trial emulation). Its contributions can influence multiple fields—clinical informatics, epidemiology, health economics, and causal inference—at a timely moment for RWD-driven decision-making. Paper 1 is novel and broadly relevant but mainly diagnostic/benchmarking with less immediate downstream deployment impact.

vs. Simulating clinical interventions with a generative multimodal model of human physiology

gemini-3.1 · 5/22/2026

Paper 2 presents a foundational multimodal model of human physiology with immediate, transformative applications in personalized medicine, clinical risk prediction, and in silico clinical trials. Its ability to accurately simulate clinical interventions and outperform established risk scores demonstrates profound real-world utility and broad impact across medicine and AI. While Paper 1 introduces a valuable benchmark for AI capabilities in metascience, Paper 2's tangible clinical applications and potential to serve as a 'health world model' give it a higher potential for widespread scientific and societal impact.

vs. AI scientists produce results without reasoning scientifically

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a more fundamental and urgent question about AI-driven science: whether LLM agents actually reason scientifically. With 25,000+ agent runs across 8 domains, it provides rigorous evidence that current AI agents ignore evidence 68% of the time and lack epistemic self-correction. This has immediate implications for the rapidly growing field of autonomous AI research agents, challenging a core assumption underlying billions in investment. Its finding that scaffold engineering cannot fix reasoning deficits and that reasoning must become a training target provides actionable direction. Paper 2 offers a valuable benchmark but addresses a narrower, less immediately consequential question about forecasting scientific progress.

vs. Machine Collective Intelligence for Explainable Scientific Discovery

gemini-3.1 · 5/22/2026

Paper 2 presents a highly innovative methodology for autonomous scientific discovery, directly addressing the critical bottlenecks of explainability and extrapolation in AI. By successfully extracting interpretable governing equations and reducing extrapolation errors by orders of magnitude compared to neural networks, it offers immediate, transformative applications across all physical and empirical sciences. In contrast, Paper 1 introduces a valuable benchmark but primarily highlights the limitations of current AI in forecasting, which has less immediate transformative potential for active scientific discovery.

vs. End-to-end autonomous scientific discovery on a real optical platform

gpt-5.2 · 5/22/2026

Paper 1 likely has higher impact: it claims a first end-to-end autonomous discovery system operating on a real physical platform and, crucially, reports an experimentally validated, previously unreported physical mechanism with potential hardware implications (optical pairwise computation). This is both methodologically ambitious and highly timely for autonomous science and photonic computing, with broad cross-field relevance (AI agents, optics, hardware). Paper 2 is valuable and rigorous as a benchmark for forecasting science, but its primary contribution is evaluative/diagnostic rather than enabling new physical or technological capabilities.

vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

claude-opus-4.6 · 5/22/2026

IatroBench addresses a critical and timely problem—AI safety measures causing iatrogenic harm—with a rigorous pre-registered methodology that reveals a striking, actionable finding: identity-contingent knowledge withholding. Its clear demonstration that safety guardrails systematically harm vulnerable users who have exhausted standard referrals has immediate policy implications for AI deployment in healthcare. The finding that evaluation tools share the same blind spots as training systems is also deeply consequential. While Paper 1 (CUSP) is methodologically sound and interesting, its conclusions are largely negative (AI can't forecast science well), offering less actionable insight. Paper 2's specific, reproducible failure modes will likely drive concrete changes in AI safety design and regulation.

vs. A Collective Variational Principle Unifying Bayesian Inference, Game Theory, and Thermodynamics

gemini-3.1 · 5/22/2026

Paper 2 proposes a fundamental theoretical unification of thermodynamics, Bayesian inference, and game theory, extending the Free Energy Principle to multi-agent systems. This offers profound, cross-disciplinary implications for physics, biology, economics, and AI. In contrast, Paper 1 presents an empirical benchmark evaluating current AI capabilities in forecasting science. While highly relevant and timely, its impact is narrower, primarily contributing to AI evaluation and meta-science, lacking the sweeping theoretical breakthrough and broad applicability presented in Paper 2.

vs. AI scientists produce results without reasoning scientifically

gemini-3.1 · 5/22/2026

Paper 2 addresses a more fundamental and widely applicable issue: the epistemic validity of autonomous 'AI Scientists.' By demonstrating that these agents fail to engage in actual scientific reasoning (e.g., ignoring evidence, lacking belief revision), it challenges the foundational reliability of a highly hyped and rapidly growing field. Paper 1's focus on forecasting scientific progress, while novel, represents a narrower use case compared to the broader implications of AI systems generating unjustified scientific knowledge.

vs. Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

claude-opus-4.6 · 5/22/2026

ReClaim demonstrates a large-scale foundation model trained on 43.8 billion medical events from 200M+ patients, showing strong empirical results across 1,000+ prediction tasks with clear scaling laws. It has immediate, concrete applications in disease prediction, expenditure forecasting, and causal inference for real-world evidence—areas of high practical and regulatory importance. Paper 2 introduces an interesting benchmark for AI scientific forecasting but primarily documents limitations of current models without offering solutions, making its impact more diagnostic than transformative. ReClaim's methodological contributions and healthcare applications give it broader and more actionable impact.

vs. Generative structure search for efficient and diverse discovery of molecular and crystal structures

gpt-5.2 · 5/22/2026

Paper 1 presents a concrete, technically novel framework unifying diffusion-based generative modeling with physics-driven random structure search, demonstrating >10× sampling efficiency gains and out-of-distribution effectiveness for molecular/crystal structure discovery. This has immediate real-world applicability in materials and drug discovery and broad downstream impact across chemistry, physics, and ML, with clear methodological grounding and measurable benchmarks. Paper 2 introduces a valuable benchmark for forecasting scientific progress and highlights limitations of current models, but it is primarily evaluative/meta-scientific with less direct translational payoff and narrower near-term practical impact.

vs. MIMIC: A Generative Multimodal Foundation Model for Biomolecules

gemini-3.1 · 5/22/2026

Paper 1 introduces a versatile, state-of-the-art multimodal foundation model for biomolecules with immediate applications in structural biology, drug design, and genetics. Its direct utility in solving complex, real-world molecular tasks offers immense scientific value. Paper 2 presents a benchmark for evaluating AI's ability to forecast science; while insightful for AI evaluation and meta-science, its practical utility in driving tangible, novel scientific discoveries is much lower compared to the direct application of the biomolecular model.

vs. End-to-end autonomous scientific discovery on a real optical platform

claude-opus-4.6 · 5/22/2026

Paper 1 demonstrates a groundbreaking achievement: the first end-to-end autonomous AI system that identifies and experimentally validates a previously unreported physical mechanism on real hardware. This represents a paradigm shift in how scientific discovery can be conducted, with immediate implications for AI-driven research across all experimental sciences and potential practical applications in optical computing. Paper 2, while methodologically rigorous and timely, primarily characterizes limitations of current AI forecasting capabilities—an important but largely negative result. Paper 1's novelty, real-world validation, and transformative potential for accelerating scientific discovery give it substantially higher impact.