EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts

Yiming Lu, Sihang Zeng, Zhengxu Tang, Max Lau, Fei Liu, Wei Jin

Jun 3, 2026

arXiv:2606.05513v1

cs.AI (primary), cs.CL

#2396 of 3355 · Artificial Intelligence

Tournament Score: 1349 ± 47 (scale 1050 to 1800)

Win Rate: 40%

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.8 · Novelty 7 · Clarity 7.5

Abstract

Epidemic LLM forecasters are usually trained and evaluated as static supervised models, whereas operational pandemic forecasting is a streaming process in which labels arrive after predictions and disease regimes shift over time. We study this mismatch in weekly COVID-19 hospitalization trend forecasting across five variant regimes. We introduce EpiEvolve, a self-evolving agent that wraps an LLM forecaster trained on the warm-start period and keeps its weights fixed during streaming. EpiEvolve adapts by storing forecast outcomes in a hierarchical episodic memory, reflecting on delayed labels, retrieving cases relevant to the current regime, and distilling recurring errors into strategic rules. The resulting context lets the forecaster reuse its own past predictions and outcomes in later weeks while following a chronological protocol that prevents future leakage. On the streaming dataset, EpiEvolve reaches $0.629$ average accuracy, compared with $0.561$ for the static backbone and $0.325$ for the external CDC ensemble, and reduces recovery lag after regime shifts from $5$ to $2$ weeks. Ablations show that reflection, strategic memory, and regime-aware retrieval each contribute to the gains.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: EpiEvolve

1. Core Contribution

EpiEvolve addresses a genuine gap between how epidemic LLM forecasters are trained/evaluated (static supervised learning) and how they must operate in practice (streaming, with delayed labels and distribution shifts). The core novelty is a self-evolving agent architecture that wraps a frozen LLM backbone with three interacting mechanisms: (1) hierarchical episodic memory organized at state, regional, and national scopes, (2) outcome-informed reflection that generates lessons from delayed ground truth, and (3) strategic lesson distillation that promotes recurring error patterns into predicate-form rules. A drift detector triggers regime transitions based on variant surveillance text and error statistics.

The key insight is that post-deployment adaptation can be achieved entirely through memory and prompt manipulation rather than gradient updates — a practically important property when model weights cannot be modified after release.
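As a rough illustration of the tiered memory described above, the following sketch stores episodes at state, regional, and national scope and retrieves same-regime cases from the narrowest tier first, falling back to broader tiers when the local tier has none. The class and field names (`Episode`, `HierarchicalMemory`, `regime`, `lesson`) and the recency-based ranking are assumptions made for this example; the paper's actual data structures and retrieval scoring are not given in this assessment.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Episode:
    week: int
    state: str
    region: str
    regime: int      # regime id from a hypothetical drift detector
    predicted: str
    actual: str
    lesson: str      # reflection text written once the label arrived


class HierarchicalMemory:
    """Toy three-tier episodic store: state -> region -> national."""

    def __init__(self):
        self.by_state = defaultdict(list)
        self.by_region = defaultdict(list)
        self.national = []

    def add(self, ep: Episode) -> None:
        # Every episode is visible at all three scopes.
        self.by_state[ep.state].append(ep)
        self.by_region[ep.region].append(ep)
        self.national.append(ep)

    def retrieve(self, state, region, regime, k=3):
        """Return up to k same-regime episodes, narrowest tier first,
        most recent first within a tier (ranking is an assumption)."""
        picked, seen = [], set()
        tiers = (
            ("state", self.by_state[state]),
            ("region", self.by_region[region]),
            ("national", self.national),
        )
        for tier_name, tier in tiers:
            for ep in sorted(tier, key=lambda e: -e.week):
                if ep.regime == regime and id(ep) not in seen:
                    picked.append((tier_name, ep))
                    seen.add(id(ep))
                    if len(picked) == k:
                        return picked
        return picked
```

Right after a regime change the state tier holds no episodes from the new regime, so retrieval in this sketch falls through to regional and national cases, which is the fallback behavior the assessment attributes to the retrieval composition analysis.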

2. Methodological Rigor

Strengths in evaluation design: The chronological streaming protocol is well-formulated. The delayed feedback constraint (Eq. 2-3) prevents future leakage, and the paper explicitly ensures no within-week cross-region leakage. The regime partitioning is used only for post-hoc evaluation, not supplied to the forecaster. Recovery lag as a metric is a meaningful operational quantity.
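The delayed-feedback constraint can be sketched as a chronological loop in which the label for week t enters memory only once it would have been observed. This is a minimal sketch under stated assumptions: the function name `stream_evaluate`, the one-week `delay` default, and the `forecast(week, memory)` callable are hypothetical and do not reproduce the paper's Eq. 2-3.

```python
def stream_evaluate(weeks, labels, forecast, delay=1):
    """Chronological evaluation where the label for week t
    becomes visible only at week t + delay.

    weeks:    ordered list of week indices
    labels:   dict mapping week -> true label
    forecast: callable(week, memory) -> predicted label
    delay:    label delay in weeks (an assumption, not from the paper)
    """
    memory = []      # (week, prediction, label) triples revealed so far
    pending = {}     # week -> prediction still waiting for its label
    correct = 0
    for t in weeks:
        # 1) Reveal only labels old enough to have arrived by week t.
        for w in sorted(w for w in pending if w + delay <= t):
            memory.append((w, pending.pop(w), labels[w]))
        # 2) Guard against leakage before the forecaster sees memory.
        assert all(w + delay <= t for w, _, _ in memory), "future leakage"
        pred = forecast(t, tuple(memory))
        pending[t] = pred
        correct += pred == labels[t]
    return correct / len(weeks), memory


def persistence(week, memory):
    """Baseline: repeat the most recently revealed label."""
    return memory[-1][2] if memory else "stable"
```

For example, with `labels = {0: "up", 1: "up", 2: "down", 3: "down"}` and `delay=1`, `stream_evaluate([0, 1, 2, 3], labels, persistence)` returns an accuracy of 0.5 and a memory holding weeks 0 through 2; week 3 is still pending when the stream ends.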

Concerns about rigor:

Scale of evaluation: The streaming window covers 81 weeks × 50 states = 4,050 predictions. While reasonable for an epidemic forecasting study, this is relatively small for drawing strong conclusions about the generality of the approach.

Backbone choice: Only Qwen3-14B-Base is evaluated as the backbone. The interaction between backbone quality and memory-based adaptation is unexplored — it's unclear whether gains persist with stronger or weaker backbones.

CDC ensemble mapping: The CDC ensemble baseline undergoes a non-trivial mapping from probabilistic quantile forecasts to five-class trend labels. This conversion could disadvantage the ensemble, making the 0.325 accuracy potentially misleading. The paper acknowledges this implicitly but doesn't investigate sensitivity to the mapping.

Single task: Only hospitalization trend classification (5-class ordinal) is tested. No regression targets, no other diseases, no international data.

Statistical significance: No confidence intervals, bootstrap tests, or other significance tests are reported. Given the moderate dataset size, this is a notable omission.
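One way to address the missing uncertainty estimates would be a paired bootstrap over the per-prediction outcomes of two systems. The sketch below is illustrative only: the function name `paired_bootstrap_ci` and its parameters are hypothetical, and it uses a plain percentile interval on 0/1 correctness vectors rather than any procedure from the paper.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for accuracy(A) - accuracy(B).

    correct_a, correct_b: equal-length sequences of 0/1 outcomes for the
    same predictions (same state, same week) under systems A and B.
    """
    assert len(correct_a) == len(correct_b) and len(correct_a) > 0
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        # Resample prediction indices with replacement, keeping pairs intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Resampling individual predictions treats states within the same week as independent, which is probably optimistic for epidemic data; resampling whole weeks instead (a block bootstrap) would likely give wider, more conservative intervals.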

The ablation study is reasonably thorough, decomposing contributions of reflection, strategic memory, drift detection, and retrieval tiers. The hyperparameter sensitivity analysis (Table 3) shows stability across tested ranges, which is reassuring.

3. Potential Impact

Practical relevance: The frozen-backbone + evolving-memory paradigm is highly practical for deployment scenarios where model retraining is expensive or prohibited (regulatory, computational, or organizational constraints). Public health agencies could potentially adopt this approach for operational forecasting.

Cross-domain transferability: The design pattern — hierarchical episodic memory with regime-conditioned retrieval and rule distillation — is domain-agnostic in principle. It could apply to financial forecasting, climate prediction, supply chain management, or any streaming prediction task with delayed labels and concept drift.

Limitations on impact: The paper only demonstrates the approach on one specific task with one backbone. The regime shifts studied (COVID-19 variants) are relatively well-characterized compared to truly novel emergence events. The approach's value in forecasting genuinely unprecedented regimes (where no similar historical pattern exists) remains untested.

4. Timeliness & Relevance

The paper is timely on multiple fronts:

LLM agents for scientific applications is a rapidly growing area, and this paper provides one of the first rigorous streaming evaluations in epidemiology.

Self-evolving agents is an active research direction (cited works from 2025-2026), and EpiEvolve applies these ideas to a consequential domain.

Post-pandemic preparedness remains a priority, making tools for adaptive forecasting under regime shifts practically relevant.

However, the COVID-19 dataset is now historical rather than operational, which somewhat limits the immediacy of the contribution.

5. Strengths & Limitations

Key Strengths:

Clean problem formulation separating the streaming protocol from the adaptation mechanism

The hierarchical memory design (state/regional/national) is well-motivated by the spatial structure of epidemic dynamics

Recovery lag metric captures an operationally important quantity that aggregate accuracy misses

The case study (Figure 4) effectively demonstrates how components interact in a concrete prediction

The retrieval composition analysis (Figure 3a) showing automatic fallback from local to broader memory tiers at regime boundaries is insightful

Per-class analysis (Figure 6) confirms gains aren't driven by majority class bias
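The recovery lag metric listed among the strengths can be made concrete with a small sketch. The definition used here, the number of weeks after a regime shift until weekly accuracy first reaches a chosen threshold, is an assumption for illustration; this assessment does not state the paper's exact definition, and `recovery_lag`, `weekly_acc`, `shift_week`, and `threshold` are hypothetical names.

```python
def recovery_lag(weekly_acc, shift_week, threshold):
    """Weeks from shift_week until weekly accuracy first reaches threshold.

    weekly_acc: list of per-week accuracies, indexed by week
    shift_week: index of the first week of the new regime
    threshold:  accuracy level that counts as "recovered" (assumed)

    Returns None if accuracy never reaches the threshold.
    """
    for lag, acc in enumerate(weekly_acc[shift_week:]):
        if acc >= threshold:
            return lag
    return None
```

With the made-up series `[0.7, 0.7, 0.3, 0.4, 0.65, 0.7]`, a shift at week 2, and a threshold of 0.6, the function returns 2. Aggregate accuracy over the same series would hide how long the dip lasted, which is why the metric is operationally informative.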

Notable Weaknesses:

Single dataset, single task: The generalizability claim rests entirely on one COVID-19 hospitalization trend dataset

Prompt sensitivity: The paper acknowledges but does not investigate sensitivity to prompt design, which is concerning given that all adaptation flows through prompts

Rule quality: The strategic rules are generated by an LLM without formal verification — there's no analysis of rule precision/recall or whether rules sometimes encode spurious correlations

Computational cost: No analysis of inference cost, memory footprint, or latency, which matter for operational deployment

Comparison fairness: The streaming fine-tuning baseline uses the same backbone but details of its update schedule and learning rate are sparse, making it hard to judge whether it represents a strong fine-tuning baseline

The accuracy numbers themselves (0.629) are moderate — the system still misclassifies ~37% of predictions, and it's unclear whether this is operationally useful

Additional Observations

The paper positions itself as "the first self-evolving LLM agent for streaming epidemic forecasting under regime shifts," which appears accurate based on the literature review. The conceptual framework of converting forecast errors into reusable lessons via reflection and distillation is sound and well-articulated. However, the empirical evidence, while positive, is narrow enough that the contribution is best viewed as a promising proof-of-concept rather than a validated methodology.

The writing is clear and well-structured, with appropriate use of figures and tables. The appendices provide useful implementation details and prompt templates that aid reproducibility.

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.8 · Novelty 7 · Clarity 7.5

Generated Jun 5, 2026

Comparison History (20)

vs. DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

gemini-3.1 · 6/8/2026

Paper 2 demonstrates higher potential scientific impact due to its profound interdisciplinary application in public health and epidemiology. While Paper 1 offers a strong, highly engineered multi-agent framework for automated research, Paper 2 tackles a critical, high-stakes real-world problem: real-time pandemic forecasting under regime shifts. Its novel self-evolving agent architecture, which handles streaming data and delayed labels without weight updates, offers an innovative methodological contribution that could broadly influence both AI time-series forecasting and global health crisis management.

vs. DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling

claude-opus-4.6 · 6/8/2026

DyCon addresses a fundamental and broadly applicable problem in LLM reasoning efficiency ('overthinking') with a training-free, model-agnostic framework validated across 4 models and 12 benchmarks spanning multiple domains. Its insight that difficulty is linearly encoded in step-level embeddings is novel and generalizable. EpiEvolve, while innovative in combining episodic memory with LLM forecasting for epidemics, addresses a narrower application domain (COVID-19 hospitalization forecasting) and is evaluated on a single dataset. DyCon's breadth of applicability, theoretical insight, and practical efficiency gains give it wider potential impact across the LLM research community.

vs. Neetyabhas: A Framework for Uncertainty-Aware Public Policy Optimization in Rational Agent-Based Models

claude-opus-4.6 · 6/6/2026

EpiEvolve introduces a novel paradigm for adaptive LLM-based epidemic forecasting that addresses a fundamental gap between static model training and streaming real-world deployment. Its self-evolving agent architecture with hierarchical episodic memory, reflection, and regime-aware retrieval is broadly applicable beyond epidemiology to any streaming prediction task with regime shifts. The paper demonstrates strong empirical gains over CDC ensemble baselines and provides rigorous ablations. Paper 1, while addressing important policy optimization questions, combines relatively established RL techniques (DQN, DDPG, TD3) in a small-scale simulation (1,000 agents) without real-world validation, limiting its impact.

vs. Assessing the Carbon Emissions and Energy Consumption of U.S. Hyperscale Data Centers

claude-opus-4.6 · 6/6/2026

Paper 2 addresses a critically timely and broadly relevant topic—the environmental footprint of hyperscale data centers driven by AI growth—with novel facility-level empirical data (403 HDCs) that fills a major knowledge gap. Its findings (HDC carbon intensity 48% above grid average) have immediate policy implications and broad interdisciplinary impact across energy, environmental science, computer science, and policy. Paper 1, while technically sound, addresses a narrower ML/epidemiology niche with incremental improvements to LLM-based forecasting. Paper 2's empirical contribution and societal relevance give it substantially broader citation potential and impact.

vs. AnyEdit++: Adaptive Long-Form Knowledge Editing via Bayesian Surprise

claude-opus-4.6 · 6/6/2026

EpiEvolve addresses a critical real-world problem (pandemic forecasting) with a novel self-evolving agent framework that handles regime shifts in streaming data. Its practical impact is immediate and broadly relevant to public health. The methodology—hierarchical episodic memory, reflection on delayed labels, and regime-aware retrieval—introduces generalizable concepts for adaptive AI systems beyond epidemiology. AnyEdit++ makes solid contributions to knowledge editing in LLMs but addresses a narrower technical problem. EpiEvolve's demonstrated superiority over CDC ensemble forecasts and its cross-disciplinary relevance (AI + epidemiology) give it higher potential impact.

vs. Tree-Based Formalization of Multi-Agent Complementarity in Human-AI Interactions

gpt-5.2 · 6/6/2026

Paper 2 has higher potential impact: it introduces a general, mathematically rigorous framework for multi-agent complementarity that applies broadly across human-AI interaction protocols, aggregation theory, and learning theory. Its formalism yields multiple theorems (impossibility results, equivalences, invariances) with cross-domain relevance and likely to influence how workflows and evaluation baselines are defined. Paper 1 is timely and applied with strong empirical gains for streaming epidemiological forecasting, but its scope is narrower (one domain/task) and depends on specific agent/memory design choices, making its broader theoretical spillover smaller.

vs. ProSarc: Prosody-Aware Sarcasm Recognition Framework via Temporal Prosodic Incongruity

gpt-5.2 · 6/5/2026

Paper 2 has higher impact potential due to stronger real-world applicability and timeliness: streaming pandemic forecasting under regime shifts is an operationally critical, high-stakes problem. Its self-evolving agent design (episodic memory, reflection on delayed labels, regime-aware retrieval, rule distillation) is a broadly relevant paradigm for non-stationary time-series decision support and could transfer to many domains beyond epidemiology. The streaming, leakage-aware protocol and regime-shift evaluation improve methodological rigor and relevance. Paper 1 is solid and novel within audio sarcasm recognition, but its broader societal and cross-field impact is likely narrower.

vs. Multilingual Fine-Tuning via Localized Gradient Conflict Resolution

gemini-3.1 · 6/5/2026

Paper 1 addresses a fundamental bottleneck in LLM training (multilingual interference) with a theoretically rigorous and scalable framework. Its broad applicability across foundational AI models gives it a wider, horizontal scientific impact across NLP and machine learning compared to the highly domain-specific, albeit important, public health application of Paper 2.

vs. The Self-Correction Illusion: LLMs Correct Others but Not Themselves

claude-opus-4.6 · 6/5/2026

Paper 2 reveals a fundamental mechanistic insight about LLM behavior—that self-correction failures are chat-template artifacts rather than capability deficits—with broad implications across all LLM applications. It provides a training-free, model-agnostic intervention applicable immediately. The finding is surprising, rigorously controlled (SHA-256 verified identical claims, 13 model-domain cells, statistical significance), and affects the large community working on LLM reasoning and self-correction. Paper 1, while useful for pandemic forecasting, is more narrowly applied and incremental in its agent-memory architecture contribution.

vs. Answer Presence Drives RAG Rewriting Gains

gpt-5.2 · 6/5/2026

Paper 2 has higher potential impact: it proposes a concrete, reusable method (self-evolving agent with episodic memory, reflection, and rule distillation) addressing a timely, high-stakes real-world problem (streaming pandemic forecasting under regime shifts), with clear operational constraints (no weight updates, anti-leakage protocol) and measurable improvements over strong baselines. Its ideas likely generalize to other non-stationary time-series and decision-making settings. Paper 1 is rigorous and important for evaluation hygiene in RAG, but is primarily diagnostic and narrower in application, offering tools rather than a new capability.

vs. Residual Modeling for High-Fidelity Learned Compression of Scientific Data

gpt-5.2 · 6/5/2026

Paper 1 is likely higher impact due to a methodologically rigorous, broadly applicable advance for high-fidelity scientific data compression, a critical HPC/data-management bottleneck across climate, CFD, and turbulence. Its residual-centric formulation and deterministic residual coders (including a neural-aided but deterministic pipeline) directly address a known failure mode of learned compressors under strict error bounds, with strong quantitative gains on multiple canonical datasets and clear deployability. Paper 2 is timely and application-relevant, but relies on agent/memory heuristics around fixed LLMs in a narrower domain, with impact more contingent on evaluation design and generalizability.

vs. CogManip: Benchmarking Manipulative Behavior in Multi-Turn Interactions with Large Language Model

claude-opus-4.6 · 6/5/2026

EpiEvolve addresses a fundamental and practical problem—streaming pandemic forecasting under regime shifts—with a novel self-evolving agent architecture combining episodic memory, reflection, and regime-aware retrieval. It demonstrates substantial quantitative improvements over baselines including CDC ensembles, has clear real-world public health applications, and introduces a methodological framework (self-evolving agents for streaming prediction) transferable to many domains. CogManip, while addressing important AI safety concerns around LLM manipulation, is primarily a benchmark contribution with evaluation results, offering less methodological novelty and narrower applicability.

vs. QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

gpt-5.2 · 6/5/2026

Paper 1 likely has higher scientific impact due to strong novelty in systems-level RAG serving (compressed-view, query-aware cache fusion) with clear, broad applicability to LLM deployment across many domains. It addresses a timely bottleneck (prefill latency/cost), integrates into a real serving stack (SGLang), and reports consistent speed/quality gains across multiple models and datasets, suggesting methodological rigor and generality. Paper 2 is compelling and application-relevant, but its impact is narrower (pandemic forecasting) and may hinge more on dataset/protocol specifics and operational adoption.

vs. Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts

claude-opus-4.6 · 6/5/2026

RHO addresses a more general and broadly applicable problem—self-supervised optimization of LLM agent harnesses without ground-truth labels—applicable across diverse domains (software engineering, technical work, knowledge work). Its domain-agnostic framework, strong empirical results (59%→78% on SWE-Bench Pro), and practical relevance to real deployment settings give it broader impact potential. EpiEvolve, while innovative in its streaming pandemic forecasting approach, is more narrowly scoped to epidemic prediction. RHO's self-preference mechanism and trajectory-based optimization represent a more transferable methodological contribution to the rapidly growing LLM agents field.

vs. Learning to replenish: A hybrid deep reinforcement learning for dynamic inventory management in the pharmaceutical supply chains

gemini-3.1 · 6/5/2026

Paper 1 introduces a highly novel LLM-agent architecture with episodic memory and reflection to handle streaming data and regime shifts, advancing the frontier of time-series forecasting and LLM applications. Paper 2, while practically valuable, primarily applies existing deep reinforcement learning techniques to a specific supply chain problem, offering less methodological innovation and a narrower scope of scientific impact.

vs. Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

gpt-5.2 · 6/5/2026

Paper 2 (Vortex) likely has higher scientific impact due to broader applicability and timeliness: it provides a programmable systems abstraction plus an optimized serving backend that can accelerate research and deployment across many LLM/agent workloads, models, and sparse-attention methods. The reported throughput gains on very large, modern architectures and production-relevant GPUs suggest strong real-world uptake potential. Paper 1 is novel and rigorous for streaming epidemiological forecasting under regime shifts, but its domain scope is narrower and impact is more specialized compared to a general-purpose LLM serving infrastructure advance.

vs. Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing

claude-opus-4.6 · 6/5/2026

Edit-R2 introduces a novel RL post-training framework addressing a fundamental challenge in multi-turn image editing with unified multimodal models. It tackles coupled failure modes (long-context dilution, state contamination) with innovative solutions spanning both discrete text and continuous latent spaces. The paper also contributes a new benchmark (MICE-Bench). Its broader applicability to multimodal foundation models and RL for generative models gives it wider impact potential across computer vision, NLP, and RL communities. EpiEvolve, while valuable for pandemic forecasting, addresses a narrower domain with a more incremental contribution (prompt engineering with memory/reflection around a frozen LLM).

vs. Step-adaptive multimodal fusion network with multi-scale cloud feature learning for ultra-short-term solar irradiance forecasting

gemini-3.1 · 6/5/2026

Paper 1 presents a highly novel approach by introducing self-evolving LLM agents with episodic memory and reflection for streaming data. This addresses a critical limitation in static machine learning models facing regime shifts. While Paper 2 offers a robust, practical solution for renewable energy forecasting, its architectural improvements are more incremental and domain-specific. Paper 1 has broader cross-disciplinary potential in dynamic time-series forecasting, streaming AI, and public health, making its overall potential scientific impact significantly higher.

vs. WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation

claude-opus-4.6 · 6/5/2026

WorldFly introduces a novel architectural paradigm combining world models with VLA for UAV navigation, addressing a fundamental challenge (partial observability) with a principled solution (spatial imagination via flow matching). It contributes both a new benchmark and a generalizable framework applicable beyond UAVs to broader embodied AI. Paper 1, while solid, is more narrowly focused on pandemic forecasting with incremental LLM agent engineering. Paper 2's contributions to world models, embodied AI, and robotics have broader cross-field impact potential and align with high-momentum research directions.

vs. Agents' Last Exam

claude-opus-4.6 · 6/5/2026

Agents' Last Exam (ALE) has broader potential scientific impact as a comprehensive benchmark covering 55 subfields and 13 industry clusters with 1K+ tasks, developed with 250+ experts. It addresses a fundamental evaluation gap between AI benchmark performance and real-world economic value, which could reshape how the entire AI community measures progress. Its living benchmark design ensures sustained relevance. While EpiEvolve presents a solid contribution to epidemic forecasting with a novel self-evolving agent framework, its impact is more narrowly scoped to pandemic forecasting. ALE's breadth across industries and its potential to redirect AI development priorities gives it significantly wider impact.