EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts
Yiming Lu, Sihang Zeng, Zhengxu Tang, Max Lau, Fei Liu, Wei Jin
Abstract
Epidemic LLM forecasters are usually trained and evaluated as static supervised models, whereas operational pandemic forecasting is a streaming process in which labels arrive after predictions and disease regimes shift over time. We study this mismatch in weekly COVID-19 hospitalization trend forecasting across five variant regimes. We introduce EpiEvolve, a self-evolving agent that wraps an LLM forecaster trained on the warm-start period and keeps its weights fixed during streaming. EpiEvolve adapts by storing forecast outcomes in a hierarchical episodic memory, reflecting on delayed labels, retrieving cases relevant to the current regime, and distilling recurring errors into strategic rules. The resulting context lets the forecaster reuse its own past predictions and outcomes in later weeks while following a chronological protocol that prevents future leakage. On the streaming dataset, EpiEvolve reaches average accuracy, compared with for the static backbone and for the external CDC ensemble, and reduces recovery lag after regime shifts from to weeks. Ablations show that reflection, strategic memory, and regime-aware retrieval each contribute to the gains.
AI Impact Assessments
(1 model)
Scientific Impact Assessment: EpiEvolve
1. Core Contribution
EpiEvolve addresses a genuine gap between how epidemic LLM forecasters are trained/evaluated (static supervised learning) and how they must operate in practice (streaming, with delayed labels and distribution shifts). The core novelty is a self-evolving agent architecture that wraps a frozen LLM backbone with three interacting mechanisms: (1) hierarchical episodic memory organized at state, regional, and national scopes, (2) outcome-informed reflection that generates lessons from delayed ground truth, and (3) strategic lesson distillation that promotes recurring error patterns into predicate-form rules. A drift detector triggers regime transitions based on variant surveillance text and error statistics.
The key insight is that post-deployment adaptation can be achieved entirely through memory and prompt manipulation rather than gradient updates — a practically important property when model weights cannot be modified after release.
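The memory-and-prompt style of adaptation described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the names `Episode`, `HierarchicalMemory`, `retrieve`, and `build_prompt`, the integer regime id, and the ranking rule (same regime first, then most recent). The model itself never changes; only the text placed in its context does.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    week: int      # week in which the forecast was made
    scope: str     # "state", "regional", or "national"
    key: str       # e.g. a state code, a region name, or "US"
    regime: int    # hypothetical regime id assigned by a drift detector
    lesson: str    # text produced by reflecting on the delayed label


@dataclass
class HierarchicalMemory:
    episodes: list = field(default_factory=list)

    def add(self, episode):
        self.episodes.append(episode)

    def retrieve(self, keys, regime, now, k=2):
        """Return up to k lessons per scope from weeks strictly before `now`.

        Within each scope, lessons from the current regime rank first,
        and ties are broken by recency (assumed ranking rule).
        """
        picked = []
        for scope in ("state", "regional", "national"):
            pool = [e for e in self.episodes
                    if e.scope == scope and e.key == keys[scope] and e.week < now]
            pool.sort(key=lambda e: (e.regime != regime, -e.week))
            picked.extend(pool[:k])
        return picked


def build_prompt(base_prompt, lessons):
    """Append retrieved lessons to the prompt sent to the frozen forecaster."""
    lines = [f"- [{e.scope}, week {e.week}] {e.lesson}" for e in lessons]
    return base_prompt + "\nLessons from past forecasts:\n" + "\n".join(lines)
```

Calling `retrieve` with a dictionary such as `{"state": "GA", "regional": "South", "national": "US"}` returns lessons ordered from the narrowest scope to the broadest, and the `e.week < now` filter keeps lessons from the current or later weeks out of the prompt.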
2. Methodological Rigor
Strengths in evaluation design: The chronological streaming protocol is well-formulated. The delayed feedback constraint (Eq. 2-3) prevents future leakage, and the paper explicitly ensures no within-week cross-region leakage. The regime partitioning is used only for post-hoc evaluation, not supplied to the forecaster. Recovery lag as a metric is a meaningful operational quantity.
Concerns about rigor:
The ablation study is reasonably thorough, decomposing contributions of reflection, strategic memory, drift detection, and retrieval tiers. The hyperparameter sensitivity analysis (Table 3) shows stability across tested ranges, which is reassuring.
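The chronological protocol and the recovery-lag metric discussed in this section can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the fixed `delay` in weeks, and the definition of recovery as the first run of `run` consecutive correct forecasts after a shift are all hypothetical and may differ from the paper's Eq. 2-3 and its metric definition.

```python
from collections import deque


def stream_forecasts(weeks, predict, delay=1):
    """Chronological loop in which the label for week t is revealed at t + delay.

    `weeks` is a list of (features, label) pairs. `predict(features, history)`
    only sees (features, prediction, label) triples whose labels have arrived.
    Returns a per-week list of booleans marking correct forecasts.
    """
    pending = deque()  # (reveal_week, features, prediction, label) awaiting feedback
    history = []       # resolved triples visible to the forecaster
    correct = []       # used for evaluation only, never shown to `predict`
    for t, (features, label) in enumerate(weeks):
        # Release only the labels whose delay has elapsed before forecasting week t.
        while pending and pending[0][0] <= t:
            _, f, p, y = pending.popleft()
            history.append((f, p, y))
        prediction = predict(features, list(history))
        pending.append((t + delay, features, prediction, label))
        correct.append(prediction == label)
    return correct


def recovery_lag(correct, shift_week, run=3):
    """Weeks after `shift_week` until `run` consecutive correct forecasts begin.

    Returns None if no such run exists. This definition is an assumption
    made for illustration.
    """
    for start in range(shift_week, len(correct) - run + 1):
        if all(correct[start:start + run]):
            return start - shift_week
    return None
```

Because labels enter `history` only after their reveal week, a forecaster passed to `stream_forecasts` cannot condition on outcomes it would not have had in deployment, which is the property the leakage constraint is meant to guarantee.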
3. Potential Impact
Practical relevance: The frozen-backbone + evolving-memory paradigm is highly practical for deployment scenarios where model retraining is expensive or prohibited (regulatory, computational, or organizational constraints). Public health agencies could potentially adopt this approach for operational forecasting.
Cross-domain transferability: The design pattern — hierarchical episodic memory with regime-conditioned retrieval and rule distillation — is domain-agnostic in principle. It could apply to financial forecasting, climate prediction, supply chain management, or any streaming prediction task with delayed labels and concept drift.
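To make the domain-agnostic claim concrete, the rule-distillation part of the pattern can be written without any epidemiological detail. The sketch below is an assumption for illustration only: the `distill_rules` name, the `(condition, error)` representation of a reflection, the `min_support` threshold, and the rule wording are hypothetical and not taken from the paper.

```python
from collections import Counter


def distill_rules(reflections, min_support=3):
    """Promote error patterns seen at least `min_support` times into rules.

    `reflections` is a list of (condition, error) pairs, for example
    ("new_variant_rising", "under_predicted"). Returns rule strings in
    predicate form, most frequent pattern first.
    """
    counts = Counter(reflections)
    rules = []
    for (condition, error), n in counts.most_common():
        if n >= min_support:
            rules.append(f"IF {condition} THEN correct for {error} (seen {n} times)")
    return rules
```

The same function would apply unchanged to reflections drawn from, say, demand forecasts or price forecasts, since it depends only on how often a condition and an error type co-occur.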
Limitations on impact: The paper only demonstrates the approach on one specific task with one backbone. The regime shifts studied (COVID-19 variants) are relatively well-characterized compared to truly novel emergence events. The approach's value in forecasting genuinely unprecedented regimes (where no similar historical pattern exists) remains untested.
4. Timeliness & Relevance
The paper is timely on multiple fronts:
However, the COVID-19 dataset is now historical rather than operational, which somewhat limits the immediacy of the contribution.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Additional Observations
The paper positions itself as "the first self-evolving LLM agent for streaming epidemic forecasting under regime shifts," which appears accurate based on the literature review. The conceptual framework of converting forecast errors into reusable lessons via reflection and distillation is sound and well-articulated. However, the empirical evidence, while positive, is narrow enough that the contribution is best viewed as a promising proof-of-concept rather than a validated methodology.
The writing is clear and well-structured, with appropriate use of figures and tables. The appendices provide useful implementation details and prompt templates that aid reproducibility.
Generated Jun 5, 2026
Comparison History (20)
Paper 2 demonstrates higher potential scientific impact due to its profound interdisciplinary application in public health and epidemiology. While Paper 1 offers a strong, highly engineered multi-agent framework for automated research, Paper 2 tackles a critical, high-stakes real-world problem: real-time pandemic forecasting under regime shifts. Its novel self-evolving agent architecture, which handles streaming data and delayed labels without weight updates, offers an innovative methodological contribution that could broadly influence both AI time-series forecasting and global health crisis management.
DyCon addresses a fundamental and broadly applicable problem in LLM reasoning efficiency ('overthinking') with a training-free, model-agnostic framework validated across 4 models and 12 benchmarks spanning multiple domains. Its insight that difficulty is linearly encoded in step-level embeddings is novel and generalizable. EpiEvolve, while innovative in combining episodic memory with LLM forecasting for epidemics, addresses a narrower application domain (COVID-19 hospitalization forecasting) and is evaluated on a single dataset. DyCon's breadth of applicability, theoretical insight, and practical efficiency gains give it wider potential impact across the LLM research community.
EpiEvolve introduces a novel paradigm for adaptive LLM-based epidemic forecasting that addresses a fundamental gap between static model training and streaming real-world deployment. Its self-evolving agent architecture with hierarchical episodic memory, reflection, and regime-aware retrieval is broadly applicable beyond epidemiology to any streaming prediction task with regime shifts. The paper demonstrates strong empirical gains over CDC ensemble baselines and provides rigorous ablations. Paper 1, while addressing important policy optimization questions, combines relatively established RL techniques (DQN, DDPG, TD3) in a small-scale simulation (1,000 agents) without real-world validation, limiting its impact.
Paper 2 addresses a critically timely and broadly relevant topic—the environmental footprint of hyperscale data centers driven by AI growth—with novel facility-level empirical data (403 HDCs) that fills a major knowledge gap. Its findings (HDC carbon intensity 48% above grid average) have immediate policy implications and broad interdisciplinary impact across energy, environmental science, computer science, and policy. Paper 1, while technically sound, addresses a narrower ML/epidemiology niche with incremental improvements to LLM-based forecasting. Paper 2's empirical contribution and societal relevance give it substantially broader citation potential and impact.
EpiEvolve addresses a critical real-world problem (pandemic forecasting) with a novel self-evolving agent framework that handles regime shifts in streaming data. Its practical impact is immediate and broadly relevant to public health. The methodology—hierarchical episodic memory, reflection on delayed labels, and regime-aware retrieval—introduces generalizable concepts for adaptive AI systems beyond epidemiology. AnyEdit++ makes solid contributions to knowledge editing in LLMs but addresses a narrower technical problem. EpiEvolve's demonstrated superiority over CDC ensemble forecasts and its cross-disciplinary relevance (AI + epidemiology) give it higher potential impact.
Paper 2 has higher potential impact: it introduces a general, mathematically rigorous framework for multi-agent complementarity that applies broadly across human-AI interaction protocols, aggregation theory, and learning theory. Its formalism yields multiple theorems (impossibility results, equivalences, invariances) with cross-domain relevance and is likely to influence how workflows and evaluation baselines are defined. Paper 1 is timely and applied, with strong empirical gains for streaming epidemiological forecasting, but its scope is narrower (one domain and task) and depends on specific agent and memory design choices, so its broader theoretical spillover is smaller.
Paper 2 has higher impact potential due to stronger real-world applicability and timeliness: streaming pandemic forecasting under regime shifts is an operationally critical, high-stakes problem. Its self-evolving agent design (episodic memory, reflection on delayed labels, regime-aware retrieval, rule distillation) is a broadly relevant paradigm for non-stationary time-series decision support and could transfer to many domains beyond epidemiology. The streaming, leakage-aware protocol and regime-shift evaluation improve methodological rigor and relevance. Paper 1 is solid and novel within audio sarcasm recognition, but its broader societal and cross-field impact is likely narrower.
Paper 1 addresses a fundamental bottleneck in LLM training (multilingual interference) with a theoretically rigorous and scalable framework. Its broad applicability across foundational AI models gives it a wider, horizontal scientific impact across NLP and machine learning compared to the highly domain-specific, albeit important, public health application of Paper 2.
Paper 2 reveals a fundamental mechanistic insight about LLM behavior—that self-correction failures are chat-template artifacts rather than capability deficits—with broad implications across all LLM applications. It provides a training-free, model-agnostic intervention applicable immediately. The finding is surprising, rigorously controlled (SHA-256 verified identical claims, 13 model-domain cells, statistical significance), and affects the large community working on LLM reasoning and self-correction. Paper 1, while useful for pandemic forecasting, is more narrowly applied and incremental in its agent-memory architecture contribution.
Paper 2 has higher potential impact: it proposes a concrete, reusable method (self-evolving agent with episodic memory, reflection, and rule distillation) addressing a timely, high-stakes real-world problem (streaming pandemic forecasting under regime shifts), with clear operational constraints (no weight updates, anti-leakage protocol) and measurable improvements over strong baselines. Its ideas likely generalize to other non-stationary time-series and decision-making settings. Paper 1 is rigorous and important for evaluation hygiene in RAG, but is primarily diagnostic and narrower in application, offering tools rather than a new capability.
Paper 1 is likely higher impact due to a methodologically rigorous, broadly applicable advance for high-fidelity scientific data compression, a critical HPC/data-management bottleneck across climate, CFD, and turbulence. Its residual-centric formulation and deterministic residual coders (including a neural-aided but deterministic pipeline) directly address a known failure mode of learned compressors under strict error bounds, with strong quantitative gains on multiple canonical datasets and clear deployability. Paper 2 is timely and application-relevant, but relies on agent/memory heuristics around fixed LLMs in a narrower domain, with impact more contingent on evaluation design and generalizability.
EpiEvolve addresses a fundamental and practical problem—streaming pandemic forecasting under regime shifts—with a novel self-evolving agent architecture combining episodic memory, reflection, and regime-aware retrieval. It demonstrates substantial quantitative improvements over baselines including CDC ensembles, has clear real-world public health applications, and introduces a methodological framework (self-evolving agents for streaming prediction) transferable to many domains. CogManip, while addressing important AI safety concerns around LLM manipulation, is primarily a benchmark contribution with evaluation results, offering less methodological novelty and narrower applicability.
Paper 1 likely has higher scientific impact due to strong novelty in systems-level RAG serving (compressed-view, query-aware cache fusion) with clear, broad applicability to LLM deployment across many domains. It addresses a timely bottleneck (prefill latency/cost), integrates into a real serving stack (SGLang), and reports consistent speed/quality gains across multiple models and datasets, suggesting methodological rigor and generality. Paper 2 is compelling and application-relevant, but its impact is narrower (pandemic forecasting) and may hinge more on dataset/protocol specifics and operational adoption.
RHO addresses a more general and broadly applicable problem—self-supervised optimization of LLM agent harnesses without ground-truth labels—applicable across diverse domains (software engineering, technical work, knowledge work). Its domain-agnostic framework, strong empirical results (59%→78% on SWE-Bench Pro), and practical relevance to real deployment settings give it broader impact potential. EpiEvolve, while innovative in its streaming pandemic forecasting approach, is more narrowly scoped to epidemic prediction. RHO's self-preference mechanism and trajectory-based optimization represent a more transferable methodological contribution to the rapidly growing LLM agents field.
Paper 1 introduces a highly novel LLM-agent architecture with episodic memory and reflection to handle streaming data and regime shifts, advancing the frontier of time-series forecasting and LLM applications. Paper 2, while practically valuable, primarily applies existing deep reinforcement learning techniques to a specific supply chain problem, offering less methodological innovation and a narrower scope of scientific impact.
Paper 2 (Vortex) likely has higher scientific impact due to broader applicability and timeliness: it provides a programmable systems abstraction plus an optimized serving backend that can accelerate research and deployment across many LLM/agent workloads, models, and sparse-attention methods. The reported throughput gains on very large, modern architectures and production-relevant GPUs suggest strong real-world uptake potential. Paper 1 is novel and rigorous for streaming epidemiological forecasting under regime shifts, but its domain scope is narrower and impact is more specialized compared to a general-purpose LLM serving infrastructure advance.
Edit-R2 introduces a novel RL post-training framework addressing a fundamental challenge in multi-turn image editing with unified multimodal models. It tackles coupled failure modes (long-context dilution, state contamination) with innovative solutions spanning both discrete text and continuous latent spaces. The paper also contributes a new benchmark (MICE-Bench). Its broader applicability to multimodal foundation models and RL for generative models gives it wider impact potential across computer vision, NLP, and RL communities. EpiEvolve, while valuable for pandemic forecasting, addresses a narrower domain with a more incremental contribution (prompt engineering with memory/reflection around a frozen LLM).
Paper 1 presents a highly novel approach by introducing self-evolving LLM agents with episodic memory and reflection for streaming data. This addresses a critical limitation in static machine learning models facing regime shifts. While Paper 2 offers a robust, practical solution for renewable energy forecasting, its architectural improvements are more incremental and domain-specific. Paper 1 has broader cross-disciplinary potential in dynamic time-series forecasting, streaming AI, and public health, making its overall potential scientific impact significantly higher.
WorldFly introduces a novel architectural paradigm combining world models with VLA for UAV navigation, addressing a fundamental challenge (partial observability) with a principled solution (spatial imagination via flow matching). It contributes both a new benchmark and a generalizable framework applicable beyond UAVs to broader embodied AI. Paper 1, while solid, is more narrowly focused on pandemic forecasting with incremental LLM agent engineering. Paper 2's contributions to world models, embodied AI, and robotics have broader cross-field impact potential and align with high-momentum research directions.
Agents' Last Exam (ALE) has broader potential scientific impact as a comprehensive benchmark covering 55 subfields and 13 industry clusters with 1K+ tasks, developed with 250+ experts. It addresses a fundamental evaluation gap between AI benchmark performance and real-world economic value, which could reshape how the entire AI community measures progress. Its living benchmark design ensures sustained relevance. While EpiEvolve presents a solid contribution to epidemic forecasting with a novel self-evolving agent framework, its impact is more narrowly scoped to pandemic forecasting. ALE's breadth across industries and its potential to redirect AI development priorities gives it significantly wider impact.