Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search
Sarah Martinson, Michael P. Brenner, Martyna Plomecka, Brian P. Williams, Nicholas G. Reich, Zahra Shamsi
Abstract
Probabilistic forecasting of infectious diseases is crucial for public health but relies on labor-intensive manual model curation by expert modeling teams. This bespoke development bottlenecks scalability to granular geographic resolutions or emerging pathogens. Here, we present an autonomous system using Large Language Model (LLM)-guided tree search to iteratively generate, evaluate, and optimize executable forecasting software. In a fully prospective, real-time evaluation during the 2025-2026 US respiratory season, the system autonomously discovered methodologically diverse models for influenza, COVID-19, and respiratory syncytial virus (RSV). Aggregating these machine-generated models yielded an ensemble that consistently matched or outperformed the gold-standard, human-curated Centers for Disease Control and Prevention (CDC) hub ensembles out-of-sample. The system successfully navigated data-scarce "cold start" scenarios for RSV. Moreover, controlled retrospective ablations revealed that optimizing log-scale distance metrics prevents reward hacking, while an automated judge-in-the-loop ensures structural fidelity to complex scientific theories. By autonomously translating epidemiological theory into accurate, transparent code, this framework overcomes the modeling labor bottleneck, enabling rapid deployment of expert-level disease forecasting at unprecedented scales.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper presents ERA (Empirical Research Assistance), an LLM-guided Monte Carlo tree search system that autonomously generates, evaluates, and iteratively refines executable epidemiological forecasting code. The central claim—and the paper's most important contribution—is the demonstration that this system, deployed prospectively during the 2025-2026 US respiratory season, produced model ensembles that matched or outperformed gold-standard CDC hub ensembles for influenza, COVID-19, and RSV forecasting.
The system transforms natural-language descriptions of modeling methodologies into executable Python programs through iterative code generation and scoring against historical validation data. The key advance over the group's prior retrospective work is the fully prospective, time-stamped, leak-proof evaluation protocol—a critical distinction in time-series forecasting where retrospective evaluations are notoriously susceptible to data leakage and hindsight bias.
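The generate, score, and refine loop described above can be illustrated with a toy search. This is a sketch under stated assumptions, not ERA's implementation: `propose_child` is a hypothetical stand-in for the LLM call that rewrites a parent program, `score` is a hypothetical stand-in for running generated code against historical validation data, and the UCB-style selection rule is an assumption, applied here over a flat list of nodes rather than by descending the tree.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """One candidate 'program' in the search tree (here just a parameter vector)."""
    candidate: list
    score: float                      # lower is better, like a validation loss
    visits: int = 1
    children: list = field(default_factory=list)


def score(candidate):
    # Hypothetical stand-in for executing generated code and scoring it on
    # validation data: squared distance to an arbitrary target vector.
    target = [0.3, -1.2, 2.0]
    return sum((c - t) ** 2 for c, t in zip(candidate, target))


def propose_child(candidate, rng):
    # Hypothetical stand-in for the LLM proposing a modified program.
    return [c + rng.gauss(0.0, 0.5) for c in candidate]


def select(nodes, total_visits, c=1.0):
    # UCB-style rule (an assumption): prefer low scores, plus an exploration
    # bonus for nodes that have rarely been expanded.
    def ucb(node):
        bonus = c * math.sqrt(math.log(total_visits + 1) / node.visits)
        return -node.score + bonus
    return max(nodes, key=ucb)


def tree_search(n_iters=200, seed=0):
    rng = random.Random(seed)
    start = [0.0, 0.0, 0.0]
    root = Node(candidate=start, score=score(start))
    nodes = [root]
    for step in range(n_iters):
        parent = select(nodes, total_visits=step + 1)
        parent.visits += 1
        child_candidate = propose_child(parent.candidate, rng)
        child = Node(candidate=child_candidate, score=score(child_candidate))
        parent.children.append(child)
        nodes.append(child)
    best = min(nodes, key=lambda n: n.score)
    return root, best


root, best = tree_search()
print(f"root score={root.score:.3f}, best score={best.score:.3f}")
```

In a real system the expensive steps are the proposal and the scoring; the selection rule only decides which existing candidate is worth refining next.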
Methodological Rigor
Strengths of the evaluation design: The prospective protocol is exceptionally well-designed. All submissions were time-stamped on GitHub, ground truth was unavailable at submission time, and the evaluation followed CDC hub standards using proper scoring rules (WIS, log WIS). The 80% task-coverage eligibility threshold and pairwise relative scoring framework are methodologically sound and consistent with established forecasting evaluation practice.
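For readers unfamiliar with the scoring rules named above: the weighted interval score (WIS) of a quantile forecast equals twice the mean pinball loss across the quantile levels, provided the levels are symmetric around the median and include it. A minimal sketch follows. The function names are illustrative, and treating "log WIS" as WIS computed after a log(1 + x) transform of both forecast and observation is an assumption, since the assessment does not define the transform.

```python
import numpy as np


def pinball_loss(y, q, tau):
    """Quantile (pinball) loss of predicted quantile q at level tau."""
    return np.where(y >= q, tau * (y - q), (1.0 - tau) * (q - y))


def wis(y, quantile_levels, quantile_values):
    """WIS as 2 x mean pinball loss.

    Assumes the levels are symmetric around 0.5 and include the median.
    """
    levels = np.asarray(quantile_levels, dtype=float)
    values = np.asarray(quantile_values, dtype=float)
    return float(2.0 * np.mean(pinball_loss(y, values, levels)))


def log_wis(y, quantile_levels, quantile_values):
    """WIS on the log(1 + x) scale (the transform is an assumption)."""
    values = np.asarray(quantile_values, dtype=float)
    return wis(np.log1p(y), quantile_levels, np.log1p(values))


levels = [0.25, 0.5, 0.75]
forecast = [8.0, 10.0, 12.0]
print(wis(10.0, levels, forecast))   # observation at the median
print(wis(14.0, levels, forecast))   # observation above the 50% interval
```

The second call scores worse than the first because the observation falls outside the central interval and further from the median.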
Scale of exploration: The study generated over 207,500 candidate models from 142 unique prompts, ultimately selecting 54 for an internal hub and 19 for ensemble components. This massive search scale is itself informative about the landscape of possible forecasting approaches.
Controlled ablation studies: The retrospective ablations in Part II are well designed. The metric comparison (CRPS vs. log-scale CRPS vs. Log Score) shows that log-scale CRPS is the most reliable optimization target, whereas optimizing the Log Score produces "timid" forecasts with systematically wide, downward-biased distributions. This finding has practical implications for any automated forecasting system. The LLM tier comparison (Gemini 2.5 Flash through Gemini 3 Pro) reveals a clear capability gradient in instruction-following fidelity.
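To make the metric comparison concrete, here is a small sketch of a sample-based CRPS estimate and its log-scale variant. It uses the standard energy form, CRPS = E|X - y| - 0.5 E|X - X'|. The log(1 + x) transform and the function names are assumptions for illustration; the assessment does not specify how the paper defines log-scale CRPS.

```python
import numpy as np


def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    x = np.asarray(samples, dtype=float)
    term_obs = np.mean(np.abs(x - y))
    term_spread = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return float(term_obs - term_spread)


def log_crps_from_samples(samples, y):
    """CRPS after a log(1 + x) transform (the transform is an assumption)."""
    x = np.asarray(samples, dtype=float)
    return crps_from_samples(np.log1p(x), np.log1p(y))


# A forecast that misses by a factor of two in a small and a large location.
small = crps_from_samples([10.0, 10.0], 20.0)
large = crps_from_samples([1000.0, 1000.0], 2000.0)
small_log = log_crps_from_samples([10.0, 10.0], 20.0)
large_log = log_crps_from_samples([1000.0, 1000.0], 2000.0)
print(small, large)          # natural scale: dominated by the large location
print(small_log, large_log)  # log scale: the two misses are comparable
```

On the natural scale the large location dominates an averaged score, while on the log scale both factor-of-two misses contribute similar amounts. This is one commonly cited motivation for log-transformed scores and may or may not be the mechanism behind the paper's result.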
Limitations in evaluation: The comparison is somewhat uneven—the Google-SAI ensemble benefits from a massive computational budget and systematic exploration that individual hub-participating teams cannot match. The paper acknowledges that not all models are assessed on identical task sets, though the pairwise scoring partially mitigates this. The analysis is also based on preliminary data (as of May 2, 2026), with the season not yet officially concluded.
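The pairwise scoring mentioned above can be sketched as follows. For each pair of models, mean scores are compared only on the tasks both models forecast, and the pairwise ratios are then combined by a geometric mean and scaled against a baseline. This follows one common formulation used in forecast-hub evaluations; whether the paper uses exactly this variant is an assumption, and the model names, task names, and scores below are made up.

```python
import math


def pairwise_ratio(scores_a, scores_b):
    """Ratio of mean scores on the tasks both models forecast (None if no overlap).

    Assumes strictly positive scores, so the ratio and its log are defined.
    """
    shared = scores_a.keys() & scores_b.keys()
    if not shared:
        return None
    mean_a = sum(scores_a[t] for t in shared) / len(shared)
    mean_b = sum(scores_b[t] for t in shared) / len(shared)
    return mean_a / mean_b


def relative_skill(scores, baseline):
    """Geometric mean of pairwise ratios per model, scaled by the baseline's value."""
    theta = {}
    for model in scores:
        ratios = []
        for other in scores:   # includes the self-comparison, whose ratio is 1
            ratio = pairwise_ratio(scores[model], scores[other])
            if ratio is not None:
                ratios.append(ratio)
        theta[model] = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return {model: theta[model] / theta[baseline] for model in scores}


# Hypothetical WIS values; model "A" and "B" skipped task "t3".
scores = {
    "A": {"t1": 1.0, "t2": 2.0},
    "B": {"t1": 2.0, "t2": 4.0},
    "base": {"t1": 2.0, "t2": 4.0, "t3": 9.0},
}
rel = relative_skill(scores, baseline="base")
print(rel)
```

Values below 1 indicate a model that scores better than the baseline on the tasks they share. The hard task "t3" does not penalize the baseline in comparisons with models that skipped it, which is how the scheme partially compensates for unequal task sets.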
Fidelity concerns: The paper honestly reports that ERA-generated "adaptations" of existing methods frequently diverge from the original methodology. The LANL-DBM recombination model replaced the core SIR compartmental structure with a cyclical regression model while preserving statistical ideas—this raises questions about whether ERA is truly implementing mechanistic epidemiological models or finding effective statistical approximations. The NU-PGF_FLUH method could not be faithfully implemented by any LLM tier, highlighting fundamental limits.
Potential Impact
Democratization of forecasting: The most transformative implication is lowering barriers to deploying expert-level forecasting. Currently, CDC hubs depend on sustained, voluntary participation by specialized modeling teams. ERA could enable expansion to granular geographic resolutions (county-level), new geographic regions (low-resource countries), and emerging pathogens—all settings where human modeling expertise is the bottleneck.
Cold-start capability: The RSV results provide direct evidence for data-scarce scenarios. Despite minimal historical data and a nascent modeling literature, ERA's unconstrained search discovered effective models that outperformed the RSVHub ensemble in its inaugural season. This is directly relevant to pandemic preparedness.
Methodological insights for AutoML in science: The findings on reward hacking (Log Score producing "timid" forecasts), the role of judge-in-the-loop mechanisms, and the LLM capability gradient for instruction-following have broad implications for any application of LLM-guided code generation to scientific tasks.
Transparency: All model source code is open-source and auditable—a significant advantage over black-box approaches in a public health context where transparency and interpretability matter.
Timeliness & Relevance
This work arrives at a moment of convergence between three trends: (1) the maturation of collaborative forecasting hubs as public health infrastructure, (2) the demonstrated capabilities of frontier LLMs for code generation, and (3) growing recognition that pandemic preparedness requires scalable modeling capacity. The paper directly addresses the labor bottleneck identified by the forecasting community and provides a concrete, validated blueprint for scaling.
Key Strengths
1. Prospective, real-time validation across three pathogens with auditable time stamps—the strongest possible evidence for a forecasting system.
2. Competitive with decades of expert effort: Matching or exceeding CDC hub ensembles (themselves aggregations from dozens of expert teams) is a remarkable achievement.
3. Methodological diversity: ERA discovered diverse approaches (mechanistic-statistical hybrids, gradient boosting, renewal models) rather than converging on a single paradigm.
4. Honest reporting of limitations: The paper transparently discusses cases where ERA fails (NU-PGF_FLUH fidelity, Deep Research failures for RSV) and systematic biases in forecasts.
5. Thorough ablation studies that provide actionable insights for system design.
Notable Weaknesses
1. Computational asymmetry: The resource advantage over individual hub teams (207,500+ candidate models) makes the comparison somewhat unfair, though it validates the *concept* of automated model discovery.
2. Expert guidance still required: The paper acknowledges that expert judgment (e.g., identifying LANL-DBM/Inferno as promising targets) contributed to performance, making it difficult to assess true autonomy.
3. Ensemble construction is simplistic: The equally-weighted median ensemble leaves significant potential gains on the table—a limitation the authors acknowledge.
4. Bias toward ML regressors: The systematic tendency toward gradient-boosted trees, even with mechanistic prompts, limits true methodological diversity and suggests the system primarily excels at optimizing statistical approximations rather than faithfully implementing mechanistic models.
5. Preliminary data: Results may shift with final season data, though the authors argue this is unlikely to change the overall conclusions.
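The equally weighted median ensemble named in weakness 3 can be sketched in a few lines: at each quantile level, take the median of the member models' predicted quantiles. The array layout and function name are assumptions for illustration, and the numbers are made up.

```python
import numpy as np


def median_ensemble(member_quantiles):
    """Per-level median across members.

    member_quantiles has shape (n_models, n_quantile_levels), with every
    member reporting the same quantile levels in the same order.
    """
    q = np.asarray(member_quantiles, dtype=float)
    # If every member's quantiles are non-decreasing across levels, the
    # per-level median is non-decreasing too, so no re-sorting is needed.
    return np.median(q, axis=0)


# Three hypothetical members at levels 0.25, 0.5, 0.75.
members = [
    [8.0, 10.0, 12.0],
    [9.0, 11.0, 15.0],
    [5.0, 12.0, 13.0],
]
ensemble = median_ensemble(members)
print(ensemble)
```

Every member counts equally regardless of past performance, which is why weighting schemes trained on historical skill are a natural next step.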
Overall Assessment
This paper represents a significant advance in both AI-for-science and computational epidemiology. The prospective validation across three pathogens elevates it well above typical retrospective demonstrations. While ERA does not eliminate the need for human expertise, it convincingly demonstrates that LLM-guided search can produce forecasting capacity at a scale and diversity previously requiring dozens of specialized teams. The practical implications for global public health surveillance are substantial.
Generated May 18, 2026
Comparison History (18)
Paper 1 presents a foundation model trained on an unprecedented scale (5 million participants, 1 trillion minutes), addressing a major bottleneck in digital health. Its broad applicability across 35 diverse health tasks, combined with few-shot learning and clinician-validated integration into a Personal Health Agent, suggests massive potential impact in personalized medicine and ubiquitous health monitoring. While Paper 2 offers a highly innovative automated modeling approach for epidemiology, Paper 1's sheer scale, breadth of applications, and immediate relevance to global consumer health give it the edge in overall scientific impact.
Paper 1 demonstrates broader scientific impact by creating a generalizable infrastructure (KI) applicable across 14 Earth-science domains and 119 process-based models, fundamentally democratizing access to scientific simulation. Its scaffolding approach is domain-agnostic and addresses systemic barriers for climate-vulnerable communities. Paper 2, while impressive in its real-time disease forecasting results matching CDC ensembles, addresses a narrower application domain. Paper 1's breadth of impact across fields, its novel conceptual framework (knowledge infrastructure as a scientific commons), and its potential to transform how entire scientific communities interact with simulation models give it higher long-term impact potential.
Paper 1 demonstrates higher scientific impact by introducing a novel paradigm of autonomous AI-driven scientific modeling. By using LLM-guided tree search to generate executable forecasting code, it matches or outperforms gold-standard CDC ensembles in rigorous, prospective real-time trials. While Paper 2 presents an excellent clinical diagnostic tool, Paper 1's framework of autonomously translating complex theory into transparent, executable software solves a critical human labor bottleneck. This methodological breakthrough has transformational potential for automated hypothesis generation and modeling across multiple data-scarce scientific domains beyond public health.
While Paper 1 provides valuable theoretical advancements in AI fairness, Paper 2 demonstrates a groundbreaking application of LLMs for autonomous scientific modeling. By automating the labor-intensive process of epidemiological forecasting and matching gold-standard CDC models in real-time, Paper 2 solves a critical public health bottleneck. Its interdisciplinary approach advances both automated machine learning (AutoML) and infectious disease management, offering immense real-world utility for pandemic preparedness and scalable, data-scarce forecasting. This tangible, cross-domain breakthrough gives it a higher potential for immediate and broad scientific impact.
Paper 2 introduces a fundamentally new paradigm for AI safety—containment verification—that provides formal, model-independent safety guarantees for agentic AI systems. This is a foundational contribution with broad implications across AI safety, formal methods, and software verification, addressing one of the most critical challenges as AI agents become more capable. While Paper 1 is impressive applied work demonstrating LLM-guided automated forecasting, it represents an incremental advance in automation of existing epidemiological methods. Paper 2's theoretical framework and formal verification approach could reshape how the entire field approaches AI safety, giving it broader and deeper long-term impact.
Paper 1 likely has higher impact: it introduces an autonomous LLM-guided model-discovery framework validated prospectively in a real-world, high-stakes public-health setting, matching or beating CDC hub ensembles across multiple pathogens and handling cold starts—strong novelty, clear applications, and timeliness. Its methodological contributions (reward-hacking prevention, judge-in-the-loop fidelity) also generalize to scientific code synthesis. Paper 2 is rigorous and important for AI oversight, but its primary impact is diagnostic/interpretive rather than enabling a new deployed capability with immediate cross-domain societal value.
Paper 2 demonstrates a fully autonomous system for infectious disease forecasting that matches or outperforms CDC gold-standard ensembles in real-time prospective evaluation across multiple pathogens. Its practical public health impact is immediate and scalable, addressing a critical bottleneck in pandemic preparedness. While Paper 1 provides elegant mechanistic interpretability insights into LLM persuasion circuits—valuable for AI safety—Paper 2's combination of novel LLM-guided automated scientific discovery, prospective real-world validation, and direct applicability to a pressing global health challenge gives it broader and more immediate cross-disciplinary impact.
Paper 2 presents a highly novel autonomous AI system capable of matching gold-standard human-curated epidemiological forecasts in real-time. Its direct, scalable application to public health and infectious disease forecasting offers profound real-world impact. While Paper 1 provides a valuable evaluation framework for HCI, Paper 2's breakthrough in automated scientific modeling and code generation demonstrates broader interdisciplinary impact, exceptional timeliness post-pandemic, and addresses a major labor bottleneck in global health monitoring.
Paper 1 has higher impact potential: it targets a high-stakes, timely public-health problem and claims fully prospective real-time validation against CDC hub ensembles across multiple pathogens, including cold-start settings—strong evidence of real-world utility and methodological rigor. The autonomous LLM-guided model discovery could scale forecasting to new geographies/pathogens, broadening operational impact beyond AI. Paper 2 is novel and useful for interpretability in multi-agent deliberation, but its applications are more research-internal and its validation appears narrower, making near-term societal impact and cross-field breadth likely lower.
Paper 1 represents a highly novel, transformative application of AI to scientific discovery and epidemiology. By autonomously generating executable disease forecasting models that outperform human-curated CDC ensembles in a real-time, prospective evaluation, it solves a critical real-world labor bottleneck in public health. While Paper 2 provides an excellent open-source framework for AI agents with strong benchmark results, Paper 1 demonstrates a more profound cross-disciplinary scientific breakthrough with direct, large-scale societal implications.
Paper 2 demonstrates higher potential scientific impact for several reasons: (1) It addresses a critical public health bottleneck—scalable disease forecasting—with immediate real-world applications validated prospectively against CDC gold-standard ensembles. (2) The novelty of using LLM-guided tree search to autonomously generate scientific modeling code represents a paradigm shift applicable beyond epidemiology. (3) It was validated in real-time during the 2025-2026 respiratory season across multiple pathogens, demonstrating practical deployment readiness. (4) Its breadth of impact spans AI, public health, and computational science. Paper 1, while solid, addresses a narrower human-machine teaming problem with more incremental contributions.
Paper 1 has higher likely impact due to demonstrated real-time, prospective superiority over a widely used public-health benchmark (CDC hub ensembles) across multiple pathogens, including cold-start scenarios—showing immediate deployability and scalability. It targets a major operational bottleneck (expert labor for model curation) with an autonomous, generalizable system and includes rigor via prospective evaluation and ablations addressing reward hacking and theory fidelity. Its applications are urgent and broad (public health decision-making, surveillance, forecasting automation) with cross-field relevance to ML, epidemiology, and scientific software synthesis.
Paper 2 likely has higher scientific impact: it demonstrates a novel autonomous LLM-guided tree-search system that generates executable epidemiological forecasting models and validates them in a fully prospective, real-time setting, matching or beating CDC hub ensembles across multiple pathogens, including cold-start RSV. This combination of methodological innovation (autonomous model discovery + safeguards against reward hacking) and strong real-world applicability (public health decision support at scale) suggests broad, timely impact across epidemiology, AI for science, and automated software/model synthesis. Paper 1 is valuable but primarily a benchmark/resource contribution with indirect downstream impact.
Paper 2 has higher impact potential due to stronger novelty (autonomous LLM-guided tree search that generates executable epidemiological models), high-stakes real-world application (prospective multi-pathogen public health forecasting), and compelling validation (real-time 2025–2026 season performance matching/outperforming CDC hub ensembles, including cold-start RSV). Its methodology addresses known pitfalls (reward hacking, theory-fidelity judging) and could generalize to other scientific modeling domains, giving broader cross-field impact and timeliness. Paper 1 is valuable for AIOps reliability but is more domain-specific and likely narrower in societal impact.
Paper 2 has higher impact potential due to its fully prospective, real-time validation across multiple major pathogens and direct public-health applicability, demonstrating performance competitive with CDC hub ensembles. The autonomous generation of executable forecasting models addresses a major scalability bottleneck and is timely given ongoing respiratory surveillance needs. Methodological rigor is strengthened by prospective evaluation plus ablations targeting reward hacking and theory fidelity. Its implications span epidemiology, AI for science, software synthesis, and decision support, yielding broader cross-field and real-world impact than Paper 1’s more domain-specific mechanistic KG explanation framework.
Paper 2 presents a highly innovative, autonomous system for infectious disease forecasting that directly addresses a critical public health bottleneck. Its successful real-time, prospective evaluation outperforming gold-standard human models demonstrates significant real-world utility. While Paper 1 provides valuable insights into LLM limitations in educational technology, Paper 2's breakthrough in automating expert-level epidemiological modeling has a broader, more immediate impact across public health and AI-for-science domains.
Paper 2 addresses a critical public health challenge by automating disease forecasting, a highly labor-intensive process. Its ability to match or outperform gold-standard CDC models for multiple pathogens demonstrates profound real-world applicability. While Paper 1 offers a valuable technical optimization for AI model routing, Paper 2's interdisciplinary impact across AI, epidemiology, and public health, combined with its prospective, real-time evaluation, gives it significantly higher scientific and societal significance.
Paper 1 likely has higher scientific impact due to strong real-world applicability (public-health forecasting), prospective real-time validation against CDC hub ensembles, and demonstrated scalability across multiple pathogens including cold-start RSV. Its autonomous LLM-guided model synthesis addresses a concrete labor bottleneck with broad relevance to epidemiology, forecasting, and AI-for-science. Paper 2 is novel in agent memory evolution without weight updates and shows solid gains, but evidence is confined to a single benchmark/attacker setting, making generalizability and near-term real-world impact less certain.