An Infectious Disease Spread Simulation Based on Large Language Model Decision Making

Yonchanok Khaokaew, Ruochen Kong, Andreas Zufle, Hao Xue, Taylor Anderson, Chandini Raina MacIntyre, Matthew Scotch, Flora D. Salim

Jun 4, 2026

arXiv:2606.06360v1

cs.AI (primary)

#2465 of 3404 · Artificial Intelligence

Tournament Score: 1343±46

Win Rate: 37%

Rating: 5.5/10

Significance 5.5 · Rigor 5 · Novelty 5.5 · Clarity 7

Abstract

Modelling individual decision-making during infectious disease outbreaks is crucial for understanding behavioural dynamics and informing effective public health interventions. Prior work has shown that large language models can simulate realistic human behaviour by generating agent decisions based on demographic prompts and situational context. We build on this foundation with a spatially grounded, agent-based simulation framework that integrates LLM-generated decisions about self-reported influenza-like illness into a census-based synthetic population of agents. Location is treated as a central feature: agents are assigned to spatial units within cities, capturing the spatial distributions of different demographic groups using real-world census data and enabling geographically diverse behavioural modelling. We implement and compare three decision scenarios (independent reasoning, household influence, and message framing) and simulate self-reporting outcomes in San Francisco and Atlanta. Results reveal that income and education are the dominant drivers of reporting rate variation, with smaller but consistent effects from geography, LLM model choice, and message framing. Our framework generates synthetic data that captures both social and geographic heterogeneity, supporting spatial epidemiological modelling and bias-aware behavioural analysis.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper introduces a spatially grounded, agent-based simulation framework that replaces rule-based or logistic regression-based decision mechanisms with LLM-generated decisions for modeling symptom reporting behavior during infectious disease outbreaks. Agents are initialized from real census tract data in San Francisco and Atlanta, assigned demographic profiles (age, race, gender, education, income), and embedded in an SEIR disease transmission model built atop the existing "Patterns of Life" simulator. The key innovation is the pre-generation of a "decision bank" — a lookup table indexed by demographic combinations — populated by querying four open-source LLMs about whether a given persona would report flu-like symptoms. Three scenarios are tested: independent decision-making, household influence, and public health message framing.

The paper contributes an integration layer between LLM-based behavioral generation and spatial epidemiological modeling, rather than a fundamentally new modeling paradigm. The idea of using LLMs as behavioral proxies for survey data is timely, and the spatial grounding via census tracts adds geographic realism that prior LLM-agent healthcare simulations lacked.

Methodological Rigor

The methodology has several commendable elements but also notable weaknesses:

Strengths: The census-based population generation ensures demographic realism at the tract level. The pre-computed decision bank approach is pragmatic and ensures reproducibility. The comparison across four LLMs, five prompt variants, and three context-richness levels provides useful sensitivity analysis. The ANOVA decomposition (Table 3) clearly quantifies the relative influence of different factors.

Weaknesses: The validation strategy is the paper's most significant limitation. The authors compare LLM reporting rates against (1) a logistic regression baseline from prior work and (2) COVID-19 vaccine intent data from the Understanding America Study. Neither is a direct validation. Vaccine intent and symptom reporting are fundamentally different health behaviors, and the authors acknowledge disagreements on age and gender — yet still use this as a "directional proxy." The Spearman correlations with the LR baseline (ρ ≈ 0.41) are modest and explain only ~17% of ranking variance. The paper does not validate against actual ILI reporting data, which exists in datasets like ILINet or state-level surveillance.
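
The "~17%" figure quoted above follows from squaring the rank correlation, reading ρ² as the share of variance in one ranking accounted for by the other. A short worked check, using only the ρ ≈ 0.41 value reported in the review:

```latex
\rho^2 \approx 0.41^2 = 0.1681 \approx 17\%
```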

The decision bank approach, while scalable, is a double-edged sword. It collapses continuous behavioral adaptation into a fixed lookup table with only 5 binary/categorical variables (yielding ~96 unique keys). This discretization is coarse: a 45-year-old Asian male with $69,000 income is treated identically to one with $30,000 income. The paper acknowledges this but does not explore finer granularity.
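
To make the discretization concern concrete, here is a minimal Python sketch of how a lookup-table key might be built. It is not the authors' implementation: the category levels, the income threshold, and the names `LEVELS`, `income_bin`, and `profile_key` are all hypothetical, chosen only so that the key count matches the ~96 figure mentioned above.

```python
from itertools import product

# Hypothetical category levels. The review gives only the approximate
# number of keys (~96), not the actual bins, so these are assumptions.
LEVELS = {
    "age": ["18-34", "35-64", "65+"],
    "race": ["Asian", "Black", "White", "Other"],
    "gender": ["female", "male"],
    "education": ["no_degree", "degree"],
    "income": ["low", "high"],
}


def build_keys(levels):
    """Enumerate every demographic combination as a tuple key."""
    return list(product(*levels.values()))


def income_bin(income_usd, threshold=75_000):
    """Collapse a continuous income into a binary bin (threshold is assumed)."""
    return "high" if income_usd >= threshold else "low"


def profile_key(age_bin, race, gender, education, income_usd):
    """Map an agent profile to its decision-bank key."""
    return (age_bin, race, gender, education, income_bin(income_usd))


keys = build_keys(LEVELS)
# Placeholder bank: a real bank would hold LLM-derived reporting decisions.
decision_bank = {key: None for key in keys}

# Two agents with very different incomes land on the same key.
key_a = profile_key("35-64", "Asian", "male", "degree", 69_000)
key_b = profile_key("35-64", "Asian", "male", "degree", 30_000)
```

Under these assumed bins, `key_a` and `key_b` are identical, so both agents would receive the same stored decision regardless of the income gap between them.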

The claim that LLMs capture "implicit correlations" between income and education (versus the LR model's independence assumption) is speculative. LLMs may simply have different biases rather than genuinely modeling intersectionality. Without ground truth, it's impossible to distinguish learned correlation from stereotypical association.

The SEIR parameters are presented but the disease dynamics themselves receive little validation or sensitivity analysis — the focus is entirely on the behavioral layer.

Potential Impact

The framework occupies a useful niche: generating synthetic behavioral data for epidemiological simulations where survey data is unavailable or costly. This could support:

1. Scenario planning for public health agencies exploring how different messaging strategies might affect reporting equity across demographic groups.

2. Bias-aware surveillance modeling, by making explicit how demographic factors create differential disease visibility.

3. Synthetic data generation for training downstream ML models when real data is scarce.

However, the practical impact is constrained by the lack of rigorous validation. Without demonstrating that LLM-generated decisions actually approximate real reporting behavior beyond directional agreement on income/education gradients, the framework remains a hypothesis-generating tool rather than a predictive one. The authors appropriately frame agents as "behavioural proxies," but policy-oriented applications would require stronger calibration.

The broader methodological contribution — using LLMs to populate decision banks for ABMs — is transferable to other domains (evacuation modeling, vaccine uptake, mobility during crises), which increases the paper's potential influence.

Timeliness & Relevance

The paper addresses a genuine need at the intersection of two active research areas: LLM-based agent simulation and computational epidemiology. Post-pandemic, there is heightened awareness that reporting biases significantly distort disease surveillance data, and tools to model these biases spatially are valuable. The use of open-source LLMs (rather than proprietary APIs) enhances accessibility and reproducibility.

The work is timely but not the first to explore LLM-driven epidemic simulation — Williams et al. (2023) is cited as prior work. The spatial grounding and systematic sensitivity analysis represent incremental rather than transformative advances.

Strengths

Systematic experimental design: Three scenarios, four LLMs, five prompt variants, three context levels, two cities — the combinatorial exploration is thorough.

Spatial realism: Census tract-level demographic initialization is more rigorous than most LLM-agent simulations.

Transparency: Code is released; prompts are fully documented in appendices; limitations and ethical considerations are thoughtfully discussed.

Practical insight: The finding that income and education dominate reporting variation (η² ≈ 0.19–0.17) while model choice and geography have smaller effects (η² ≈ 0.04) provides actionable guidance for simulation design.
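
For readers unfamiliar with the effect-size measure cited above, η² is the share of total variance attributable to a factor. The sketch below computes it for a single factor on made-up numbers; the paper's own decomposition covers several factors at once, and the function name and data here are illustrative assumptions rather than anything taken from the paper.

```python
def eta_squared(groups):
    """One-way eta squared: between-group sum of squares / total sum of squares."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total


# Made-up reporting rates for two hypothetical income groups.
low_income = [0.40, 0.50, 0.60]
high_income = [0.60, 0.70, 0.80]

effect = eta_squared([low_income, high_income])
```

A value near 0.19 would mean that the factor accounts for roughly a fifth of the variation in reporting rates, with the remainder due to other factors and residual noise.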

Limitations

Lack of ground-truth validation: The most critical gap. No comparison against actual ILI or COVID-19 reporting data stratified by demographics.

Coarse demographic discretization: Five binary/categorical variables with ~96 unique profiles is a severe simplification.

LLM bias conflated with behavioral realism: The paper cannot distinguish whether LLMs reproduce real behavioral patterns or reflect training data stereotypes (e.g., associating low income with non-compliance).

Static decision banks: No dynamic adaptation during simulation; agents cannot update decisions based on evolving epidemic conditions.

Limited disease model analysis: The SEIR component is standard and receives minimal attention; the interaction between disease dynamics and behavioral feedback loops is underexplored.

Crude household mechanism: Scenario 2 (household influence) only modestly shifts reporting rates, and the mechanism (appending a sixth key digit) is binary rather than reflecting gradations of household behavior.
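
The household-influence criticism can be illustrated with a small sketch of the key-extension idea. The digit encoding, the five-digit base key, and the function name `extend_key` are assumptions made for illustration; the review states only that a sixth key digit is appended.

```python
def extend_key(base_key: str, household_member_reported: bool) -> str:
    """Append a sixth binary digit encoding household reporting behaviour."""
    if len(base_key) != 5 or not base_key.isdigit():
        raise ValueError("expected a five-digit demographic key")
    return base_key + ("1" if household_member_reported else "0")


# Hypothetical base key for one demographic profile.
base = "10211"

# A household with one reporter and a household with three reporters
# collapse to the same extended key, which is the gradation being lost.
one_reporter = extend_key(base, 1 > 0)
three_reporters = extend_key(base, 3 > 0)
no_reporters = extend_key(base, 0 > 0)
```

Here `one_reporter` and `three_reporters` are the same string, so a count-based or proportion-based encoding would be needed to distinguish them.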

Overall Assessment

This is a competent engineering contribution that integrates existing components (Patterns of Life simulator, census data, open-source LLMs, SEIR model) into a coherent framework. The sensitivity analyses are the paper's strongest contribution, revealing how model choice, prompt design, and context richness affect simulated behavior. However, the absence of validation against real behavioral data limits the scientific claims that can be made. The paper is best understood as a framework paper and exploratory analysis rather than a validated modeling advance.

Rating: 5.5/10

Significance 5.5 · Rigor 5 · Novelty 5.5 · Clarity 7

Generated Jun 5, 2026

Comparison History (19)

vs. AEGIS: A Backup Reflex for Physical AI

gemini-3.1 · 6/8/2026

Paper 1 presents a highly rigorous, pre-registered methodology with exact statistical tests, which is rare and highly valuable in AI research. Its approach to solving long-horizon manipulation failures through selective escalation is highly novel and offers immediate, practical efficiency gains for physical AI systems. While Paper 2 presents an interesting interdisciplinary application of LLMs, Paper 1's superior methodological rigor and direct technical contribution to autonomous systems give it a higher potential for significant scientific impact.

vs. OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence

gemini-3.1 · 6/6/2026

Paper 2 addresses a fundamental bottleneck in AI development (the 'data wall') by proposing a novel arena for evolving and benchmarking LLM-based Collective Intelligence. Its paradigm for continuous self-improvement and robust evaluation metrics offers broad methodological impact across AI research, whereas Paper 1, while highly relevant for epidemiology, represents a more domain-specific application of existing LLM capabilities.

vs. AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety

claude-opus-4.6 · 6/6/2026

Paper 2 presents a more novel and broadly impactful framework combining LLMs with agent-based epidemiological modeling, integrating spatial/demographic heterogeneity for public health applications. It addresses a timely intersection of AI and infectious disease modeling with clear real-world policy implications. Paper 1, while valuable as a benchmark dataset for AI companion safety, is more narrowly focused on evaluating LLMs as safety judges for a specific application domain. Paper 2's interdisciplinary contribution spanning computational social science, epidemiology, and AI gives it broader potential impact across multiple fields.

vs. DMF: A Deterministic Memory Framework for Conversational AI Agents

gemini-3.1 · 6/6/2026

Paper 1 addresses a critical bottleneck in the rapidly expanding field of AI agents: the high computational cost and non-determinism of LLM-based memory management. By proposing a highly efficient, CPU-first deterministic framework that drastically reduces token usage (up to 242x), it offers immediate, scalable, and highly impactful real-world applications for conversational AI. While Paper 2 presents an interesting interdisciplinary application of LLMs in epidemiology, Paper 1's methodological innovation and broad implications for AI engineering give it a higher potential for widespread scientific and industrial impact.

vs. Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

claude-opus-4.6 · 6/5/2026

Paper 1 presents a more novel and interdisciplinary framework combining LLMs with spatial epidemiological modeling using real census data, addressing a timely public health need. It introduces a unique application domain (disease spread simulation with geographically grounded behavioral modeling) with clear real-world policy implications. Paper 2, while methodologically sound, addresses a more incremental improvement in mathematical reasoning benchmarks (GSM8K), a well-studied area with many existing multi-agent and critique-based approaches. Paper 1's broader cross-disciplinary impact (epidemiology, behavioral science, public health) gives it higher potential impact.

vs. PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models

claude-opus-4.6 · 6/5/2026

PLAN-S addresses a fundamental challenge in autonomous driving world models—the compactness-controllability dilemma—with a novel, well-validated architectural contribution. It demonstrates clear quantitative improvements (42% collision rate reduction) on established benchmarks (nuScenes, NAVSIM) with rigorous ablations isolating its contribution. Autonomous driving has enormous real-world impact and active research investment. Paper 1, while interesting in combining LLMs with spatial epidemiological modeling, is more incremental—applying known LLM agent simulation techniques to a specific public health scenario—and lacks ground-truth validation of its synthetic behavioral outputs.

vs. Multi-ResNets for Subspace Preconditioning in Constrained Optimization

gemini-3.1 · 6/5/2026

Paper 1 bridges large language models, agent-based modeling, and epidemiology, addressing the critical challenge of simulating human behavioral dynamics during disease outbreaks. Its integration of real-world spatial census data and exploration of social heterogeneity offers broad, immediate applications in public health policy and crisis management. While Paper 2 presents a rigorous approach to constrained optimization, Paper 1's high timeliness, interdisciplinary novelty, and direct societal relevance give it a broader potential scientific and real-world impact.

vs. Zero knowledge verification for frontier AI training is possible

claude-opus-4.6 · 6/5/2026

Paper 1 addresses a fundamental gap in AI governance—the lack of technical verification for frontier AI training compliance—proposing a novel zero-knowledge proof architecture that could underpin future international AI agreements. Its impact spans AI policy, cryptography, hardware verification, and international governance, with clear real-world applications analogous to nuclear nonproliferation verification. Paper 2 presents an incremental application of LLMs to agent-based epidemiological modeling, which, while useful, represents a more incremental contribution with narrower impact. Paper 1's timeliness given rapid AI governance developments and its potential to enable enforceable international agreements give it substantially higher impact potential.

vs. Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact: it introduces a broadly useful, expert-validated benchmark for evaluating continual learning in frontier AI systems across six real-world domains, with a clear metric to separate online learning from base capability. Benchmarks often become field standards, enabling reproducible comparisons and accelerating progress across many subareas. Its findings (naive ICL outperforming memory systems) are timely and directly actionable for AI research. Paper 1 is innovative in combining LLM-driven decisions with spatial agent-based epidemiological simulation, but its impact is more domain-specific and depends strongly on validation of LLM behavioral realism.

vs. Retry Policy Gradients in Continuous Action Spaces

gemini-3.1 · 6/5/2026

Paper 2 demonstrates higher potential scientific impact due to its interdisciplinary approach and high real-world relevance. By integrating large language models with spatial epidemiology and agent-based modeling, it introduces a novel tool for public health planning and infectious disease simulation. While Paper 1 provides a solid, mathematically rigorous algorithmic improvement in reinforcement learning, Paper 2 addresses a critical societal need (pandemic preparedness) and is likely to influence multiple fields including public health, epidemiology, sociology, and applied artificial intelligence.

vs. TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact: it introduces a clearly novel, formally motivated RL training fix (credit transfer for tool-use) with broad applicability to multimodal/tool-augmented agents, a rapidly growing area. It diagnoses a general failure mode (credit misassignment), quantifies it, and proposes a plug-and-play method with negligible overhead and consistent gains across multiple benchmarks and RL algorithms—suggesting strong methodological rigor and immediate real-world utility in search/agent systems. Paper 1 is timely and useful for epidemiology, but its reliance on LLM-simulated decisions may face validity/generalization concerns and narrower cross-field uptake.

vs. Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

gemini-3.1 · 6/5/2026

Paper 1 provides a highly rigorous, first-of-its-kind evaluation bridging formal verification and LLMs. Its use of formal model checkers guarantees objective evaluation, and its findings on negative transfer and reasoning alignment offer fundamental insights for AI development, likely impacting software engineering and AI safety more broadly than Paper 2's simulated application.

vs. MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

claude-opus-4.6 · 6/5/2026

MLEvolve presents a significantly more novel and impactful contribution: a self-evolving multi-agent framework achieving state-of-the-art on MLE-Bench while outperforming AlphaEvolve on algorithm discovery tasks. It introduces multiple technical innovations (Progressive MCGS, Retrospective Memory, adaptive coding modes) with broad applicability across ML and scientific discovery. Paper 2 applies LLMs to epidemiological simulation in a relatively incremental way, combining existing ideas (LLM-based agents, ABM, census data) without fundamental methodological advances. MLEvolve's cross-domain generalization and strong benchmarks suggest wider and deeper scientific influence.

vs. Benchmark Everything Everywhere All at Once

gpt-5.2 · 6/5/2026

Paper 1 has higher potential impact due to strong novelty and broad applicability: an autonomous, end-to-end benchmark-construction agent addresses a major scalability bottleneck in LLM/MLLM evaluation and could be adopted across many domains, influencing how models are measured and improved. It is timely given rapid benchmark saturation and continual model releases, and its outputs (multiple benchmarks + tooling) can propagate widely. Paper 2 is valuable and application-driven, but its contribution is narrower (epidemiological simulation with LLM decisions) and more sensitive to methodological concerns about LLM validity/calibration for real behavior.

vs. Learning to replenish: A hybrid deep reinforcement learning for dynamic inventory management in the pharmaceutical supply chains

gpt-5.2 · 6/5/2026

Paper 1 is more novel and timely by integrating LLM-driven behavioral decision-making into spatial, census-grounded agent-based epidemiological simulations, enabling bias-aware and geographically heterogeneous outbreak modeling with clear public-health relevance. Its approach has broader cross-field impact (LLMs, computational social science, epidemiology, spatial modeling) and could influence both methodological research and policy-facing tools. Paper 2 targets an important application, but hybrid DRL for inventory replenishment is a well-established direction; the specific A3C/PPO combination appears more incremental and likely narrower in disciplinary reach.

vs. InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning

gemini-3.1 · 6/5/2026

Paper 1 addresses a critical bottleneck in AI: the computational efficiency of LLM reasoning. By introducing a novel, entropy-based reward framework to prevent reward hacking and reduce verbosity, it offers foundational improvements applicable across the broader machine learning field. Paper 2 presents a valuable interdisciplinary application of LLMs for epidemiological simulation, but its scope is more domain-specific. Paper 1's methodological innovation in reinforcement learning for reasoning models gives it a much higher potential for widespread scientific and commercial impact.

vs. LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

gpt-5.2 · 6/5/2026

Paper 1 is likely higher impact: it introduces a novel, execution-aware offline agentic learning framework tailored to costly, non-differentiable hardware verification, plus new benchmark/protocol work, and demonstrates strong gains with a compact model—suggesting methodological rigor and practical relevance to an expensive industrial bottleneck. Its contributions (offline learning with deterministic evaluators, data curation/synthesis/sampling) may generalize to other tool-feedback domains. Paper 2 is timely and applicable, but LLM-driven agent-based outbreak simulations are a more incremental extension of existing ideas and may face validation/rigor challenges given reliance on LLM behavior.

vs. ProSarc: Prosody-Aware Sarcasm Recognition Framework via Temporal Prosodic Incongruity

claude-opus-4.6 · 6/5/2026

Paper 1 has higher potential impact due to its broader applicability and timeliness. It addresses a critical public health challenge—modeling behavioral dynamics during disease outbreaks—by integrating LLMs with spatially grounded agent-based simulations using real census data. This framework has direct real-world applications in epidemic preparedness and public health policy. It bridges multiple fields (epidemiology, AI, behavioral science, spatial analysis), giving it wider interdisciplinary reach. Paper 2, while methodologically sound and novel in its prosody-based sarcasm detection approach, addresses a narrower NLP/speech processing problem with more limited real-world impact.

vs. Learning Adaptive Parallel Execution for Efficient Code Localization

gemini-3.1 · 6/5/2026

Paper 1 offers higher potential scientific impact due to its broad, interdisciplinary application intersecting AI, epidemiology, and public health. By using LLMs to model geographically and demographically grounded human behaviors during disease outbreaks, it provides a novel tool for public health policy and crisis management. While Paper 2 presents significant efficiency improvements for AI coding agents, Paper 1 addresses a globally critical challenge—infectious disease spread—making its real-world implications and cross-field relevance substantially broader and more vital for societal well-being.