Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses
Yan Wang, Ziyi Guo, Christopher McCarty
Abstract
Survey research faces mounting structural challenges: declining response rates, sample bias, block-wise missingness among at-risk respondents, and AI-assisted fraudulent completions in online panels. Large language models (LLMs) have been proposed as a remedy, yet rigorous evaluations across the full survey workflow remain scarce, particularly in disaster contexts where data quality matters most. We present and evaluate a five-stage framework for LLM integration covering questionnaire design, sample selection, pilot testing, missing-data imputation, and post-collection analysis, using the 2024 Hurricane Milton preparedness survey of Florida residents (n=946) as a shared empirical testbed. We introduce a Protection Motivation Theory (PMT)-constrained co-occurrence knowledge graph and develop seven LLM configurations spanning zero-shot inference, retrieval-augmented baselines, and novel theory-informed variants. Our proposed Anchored Marginal Theory-Informed LLM (A-TLM) outperforms all three classical imputation baselines (IPW/MI, MICE+PMM, missForest) on RMSE under disaster-relevant block-wise MNAR conditions (S4 RMSE 1.439 vs. 1.496 for the next-best), while achieving near-zero signed bias (-0.121) where the random-forest imputer produces the largest absolute bias (-0.631). Organizing retrieval around PMT causal structure and integrating all evidence in a single model call outperforms unstructured retrieval and staged sequential inference (MAE 0.993 vs. 1.097 for standard RAG). We document that near-zero aggregate bias can mask opposing subgroup errors and propose subgroup-stratified bias auditing as a reporting standard. A retrieval-constrained knowledge-graph chatbot demonstrates that hallucination is architecturally manageable through grounded refusal.
AI Impact Assessments
Scientific Impact Assessment
1. Core Contribution
This paper proposes a five-stage framework for integrating LLMs into the survey research workflow: questionnaire design, sample selection, pilot testing, missing-data imputation, and post-collection analysis. The central empirical contribution is the Anchored Marginal Theory-Informed LLM (A-TLM), which organizes retrieval-augmented generation around Protection Motivation Theory (PMT) causal structure via a constrained co-occurrence knowledge graph. The paper uses the 2024 Hurricane Milton preparedness survey (n=946) as a unified testbed across all stages.
The key innovation is the theory-constrained retrieval architecture: rather than using flat embedding similarity for RAG, the authors organize evidence according to PMT's causal cascade and integrate all evidence in a single model call. This outperforms both unstructured retrieval and staged sequential inference. The paper also contributes the methodological insight that near-zero aggregate imputation bias can mask opposing subgroup errors, proposing subgroup-stratified bias auditing as a reporting standard.
2. Methodological Rigor
The experimental design is generally sound but has notable limitations that the authors acknowledge. The controlled experiments (Stages 3 and 4) are well-structured with a deterministic train/validation split, four progressively demanding missingness mechanisms (MCAR through block-wise MNAR), and comparison against three established classical baselines (IPW/MI, MICE+PMM, missForest). The component ablation in Table 8 effectively isolates the contribution of peer examples and vulnerability cues.
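To make the two ends of that missingness range concrete, the following is a minimal sketch of how MCAR and block-wise MNAR masks could be generated for a small response matrix. It is not the authors' simulation code: the function names, the threshold rule, and the probabilities `p_below` and `p_above` are assumptions introduced here for illustration, and the paper's exact mechanism definitions are not given in this assessment.

```python
import random

def mcar_mask(n_rows, n_cols, p, rng):
    """MCAR: every cell is missing independently with probability p."""
    return [[rng.random() < p for _ in range(n_cols)] for _ in range(n_rows)]

def blockwise_mnar_mask(rows, block_cols, threshold, p_below, p_above, rng):
    """Block-wise MNAR (illustrative assumption): a respondent drops the
    whole block at once, and the drop probability depends on the block's
    own values, which are exactly the values that become unobserved."""
    mask = []
    for row in rows:
        block_mean = sum(row[c] for c in block_cols) / len(block_cols)
        p = p_below if block_mean < threshold else p_above
        drop = rng.random() < p
        mask.append([drop and (c in block_cols) for c in range(len(row))])
    return mask

# Hypothetical toy responses on a 1-5 scale; columns 0-1 form one block.
rows = [[1, 1, 5], [5, 5, 1]]
rng = random.Random(0)
mnar = blockwise_mnar_mask(rows, block_cols=[0, 1], threshold=3,
                           p_below=1.0, p_above=0.0, rng=rng)
mcar = mcar_mask(n_rows=2, n_cols=3, p=0.2, rng=rng)
```

With the extreme probabilities used above, the low-scoring respondent loses the entire block and the high-scoring respondent loses nothing. This is the pattern that makes block-wise MNAR hard for imputers that rely on the remaining items in the same block.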
However, several concerns temper confidence in the results:
3. Potential Impact
The paper addresses a genuine and worsening crisis in survey methodology. The statistics cited—response rates declining 15-30 percentage points, AI agents passing attention checks at 99.8%, usable completion rates falling to 10% in some panels—paint a dire picture. The proposed framework has several practical applications:
The subgroup-stratified bias auditing recommendation is perhaps the most broadly impactful contribution—it's a simple, implementable standard that could improve transparency across any LLM-augmented survey workflow.
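A minimal sketch of what such an audit might look like is given below. It assumes signed bias is the mean of imputed minus true values; the subgroup labels and toy records are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def signed_bias(pairs):
    """Mean of (imputed - true) over a list of (true, imputed) pairs."""
    return sum(imputed - true for true, imputed in pairs) / len(pairs)

def audit_bias_by_subgroup(records):
    """records: iterable of (subgroup, true_value, imputed_value).
    Returns the aggregate signed bias and a per-subgroup breakdown."""
    groups = defaultdict(list)
    for subgroup, true, imputed in records:
        groups[subgroup].append((true, imputed))
    all_pairs = [pair for pairs in groups.values() for pair in pairs]
    overall = signed_bias(all_pairs)
    by_group = {g: signed_bias(pairs) for g, pairs in groups.items()}
    return overall, by_group

# Hypothetical toy data: one subgroup is over-imputed, the other under-imputed.
records = [
    ("renter", 2, 3), ("renter", 1, 2), ("renter", 3, 4),
    ("owner", 4, 3), ("owner", 5, 4), ("owner", 3, 2),
]
overall, by_group = audit_bias_by_subgroup(records)
print(overall)   # 0.0
print(by_group)  # {'renter': 1.0, 'owner': -1.0}
```

In this toy case the aggregate bias is exactly zero while each subgroup is off by a full scale point in opposite directions, which is the masking effect the audit is meant to expose.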
4. Timeliness & Relevance
This paper is highly timely. The convergence of declining survey quality, proliferating AI-generated fraudulent responses, and rapid LLM capability improvement creates an urgent need for systematic evaluation of LLM integration in survey science. The disaster context amplifies this urgency given climate-driven increases in extreme weather events. The paper also speaks to the broader debate about "silicon sampling" and AI surrogates in social science, providing a more nuanced position than either uncritical adoption or wholesale rejection.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Overall Assessment
This is a competent, well-structured paper that makes a meaningful contribution to a timely problem. The five-stage framework provides a useful organizational scaffold, and the theory-constrained retrieval architecture is a genuine methodological contribution. However, the empirical evidence is somewhat thin—small validation samples, single dataset, single LLM, and modest performance margins limit the strength of claims. The paper's greatest value may be in its conceptual framing and its subgroup-bias decomposition insight rather than in the specific numerical improvements reported.
Generated May 20, 2026
Comparison History (24)
Benchmarks evaluating large language models, particularly in complex areas like multi-turn emotional intelligence, typically achieve widespread adoption and high citation counts across the broad AI and NLP communities. While Paper 1 offers rigorous methodological advancements for survey research, its impact is largely concentrated within computational social science. Paper 2 introduces a fundamental evaluation tool with broader relevance to human-computer interaction, model alignment, and conversational AI development.
Paper 1 likely has higher scientific impact due to its broader, timely relevance to LLM agents and web retrieval, a clearly novel problem formulation (state-gated retrieval), and a reusable benchmark with diagnostic error taxonomy that can drive progress across many agent systems and applications. Benchmarks often become community reference points, enabling comparable evaluation and follow-on methods work. Paper 2 is rigorous and practically important for survey methodology, but its empirical scope is narrower (one disaster survey context) and improvements are incremental; impact may remain more contained within computational social science and imputation workflows.
Paper 1 addresses a broader, more fundamental challenge in survey research methodology with a comprehensive five-stage framework applicable across many disciplines. It introduces novel methodological contributions (A-TLM, theory-constrained knowledge graphs, subgroup-stratified bias auditing) with rigorous evaluation against established baselines. Its impact spans social sciences, disaster management, and AI methodology. Paper 2, while technically solid, addresses a narrower problem in collaborative autonomous driving with incremental improvements. Paper 1's methodological contributions and cross-disciplinary relevance give it higher potential impact.
Paper 2 addresses a broader interdisciplinary problem (survey methodology + LLMs + disaster research) with novel theoretical contributions (PMT-constrained knowledge graph, A-TLM method, subgroup-stratified bias auditing). It introduces methodologically rigorous comparisons against established statistical baselines, proposes new reporting standards, and has clear real-world applications in disaster preparedness and survey science—fields with massive user bases. Paper 1, while technically solid, is more narrowly focused on engineering design benchmarking for LLM agents, serving a smaller community. Paper 2's contributions to imputation methodology and bias auditing have broader cross-field applicability.
Paper 1 presents a comprehensive, methodologically rigorous framework addressing fundamental challenges in survey research (declining response rates, missing data, AI fraud) with novel contributions including theory-constrained LLM imputation, subgroup-stratified bias auditing, and hallucination-managed chatbots. It tackles a broadly applicable methodological problem relevant across social sciences, public health, and disaster research. Paper 2, while innovative in applying VLM agents to A/B test simulation, addresses a narrower commercial application with 77% directional alignment—promising but domain-specific. Paper 1's breadth of impact, methodological contributions, and cross-disciplinary relevance give it higher scientific impact potential.
MOCHA addresses a fundamental gap in LLM agent optimization by introducing a principled multi-objective framework (Chebyshev scalarization with annealing) that demonstrably outperforms existing methods across diverse tasks. Its contribution is methodologically novel, broadly applicable to the rapidly growing field of LLM agents and prompt optimization, and solves a well-defined technical problem with strong empirical results. Paper 2, while thorough and practically useful for survey methodology, is more application-specific to disaster preparedness surveys, with incremental improvements over existing imputation methods and narrower cross-field impact potential.
Paper 1 has higher potential impact due to a more novel, generally applicable methodological contribution: a trainable neurosymbolic framework that deterministically maps explicit arguments to ternary verdicts, addressing a central, timely issue (faithful LLM reasoning/explanations) with clear rigor and broad applicability to high-stakes verification beyond a single domain. Paper 2 is strong and practical for survey methodology (notably MNAR imputation and bias auditing), but is more domain-specific and largely an applied evaluation of LLM configurations on one empirical setting, which may limit breadth and novelty relative to Paper 1’s framework-level advance.
Paper 1 addresses a critical and timely crisis in survey research (declining response rates, AI fraud) using highly relevant LLM methodologies. Its application to disaster preparedness offers immediate, high-stakes real-world utility. The introduction of theory-informed LLMs outperforming classical missing-data imputation baselines provides a methodological breakthrough with broad interdisciplinary impact across computational social science, public policy, and NLP, giving it an edge over the more theoretical focus of Paper 2.
Paper 2 presents a rigorous empirical evaluation with quantitative results on a real-world dataset, introduces a novel imputation method (A-TLM) that outperforms established baselines, and addresses concrete methodological challenges in survey research with broad applicability across social sciences. It also contributes actionable findings (subgroup-stratified bias auditing) and demonstrates practical LLM integration. Paper 1, while intellectually interesting in formalizing architectural patterns for LLM agents, is primarily a conceptual/methodological framework without strong empirical validation, limiting its immediate measurable impact.
Paper 1 has higher potential impact due to stronger methodological rigor and clearer real-world relevance: it evaluates LLM integration across the full survey workflow on a real disaster-preparedness survey, benchmarks against established imputation methods under MNAR block-missingness, reports multiple error/bias metrics, and proposes an actionable auditing standard. It also contributes a theory-constrained retrieval/knowledge-graph approach with safety benefits (grounded refusal). Paper 2 is timely and broadly applicable, but is a proof-of-concept with smaller-scale evaluation and relies on LLM-judged improvements, reducing evidential strength.
Paper 1 offers broader cross-disciplinary impact by addressing a systemic crisis in the social sciences: survey data degradation. By demonstrating that theoretically-anchored LLMs can outperform established imputation baselines (MICE, missForest) in high-stakes disaster contexts, it provides immediate real-world utility for public health and policy researchers. While Paper 2 presents an elegant neuro-symbolic approach to constraint programming with massive speedups, its impact is largely confined to combinatorial optimization. Paper 1's comprehensive framework, rigorous empirical evaluation, and relevance to wide-ranging fields give it higher potential for widespread scientific and societal impact.
Paper 2 is likely higher impact due to its broadly applicable, reusable infrastructure for verifiable evaluation of computer-use agents across 33 real applications and 1,000 tasks. Verifier-grounded, auditable rewards address a central bottleneck in agent research (reliable evaluation), with immediate relevance to autonomy, safety, benchmarking, and reinforcement learning. Its methodology (state verifiers, trajectory logging, partial-credit scoring, self-evolving verification) is general and could become a standard testbed across academia and industry. Paper 1 is strong and timely but narrower to survey workflows/disaster contexts and incremental over established imputation baselines.
Paper 2 has higher potential scientific impact due to its rigorous quantitative results, which show that its novel LLM configurations outperform classical baselines in missing-data imputation. It addresses critical, widespread challenges in survey research and the social sciences, offering broad cross-disciplinary applicability. In contrast, Paper 1 targets a narrower domain (autonomous vehicles) and, despite interesting qualitative observations, fails to show statistically significant quantitative improvements, limiting its immediate practical impact.
Paper 2 presents a more comprehensive and novel methodological framework (five-stage LLM integration for survey research) with concrete empirical contributions including a new imputation method (A-TLM) that outperforms established baselines, a theory-constrained knowledge graph, and actionable recommendations (subgroup-stratified bias auditing). It addresses widely recognized challenges in survey methodology with broad applicability across social sciences. Paper 1, while offering a useful negative result and interesting hypothesis about environment-feedback bandwidth, is narrower in scope (cybersecurity CTF agents), relies on reanalysis of existing data with non-significant results, and its primary contribution is a falsifiable hypothesis rather than a validated method.
Paper 2 has a significantly broader scientific impact due to its comprehensive application of LLMs across the entire survey research workflow, a ubiquitous methodology in social and health sciences. It introduces a novel architecture (A-TLM) that rigorously outperforms established statistical baselines (e.g., MICE, missForest) in addressing complex missing data. In contrast, Paper 1 is a narrow case study focusing on a single mathematical olympiad problem, making its impact largely confined to the specific niche of AI-assisted formal verification.
Paper 2 addresses a critical bottleneck in foundational AI research: LLM training instability and compute waste. By introducing a governance layer above AdamW that rescues failing training runs under severe stress, it offers massive potential savings in compute resources and broad applicability across all deep learning domains. While Paper 1 presents an innovative application of LLMs to computational social science, Paper 2's methodological improvements to the core infrastructure of LLM training promise a much wider, immediate, and economically significant impact across the entire artificial intelligence field.
Paper 2 has higher likely impact due to timeliness and breadth: it provides a systematic, multi-lens evaluation framework (manuscript-only vs artifact-aware vs human meta-review) over 117 agent-generated papers, revealing concrete failure modes (fabrication, underpowered studies, plan/execution mismatch) and large agent-dependent differences. This directly informs evaluation standards, benchmarking, and safety/quality controls for rapidly growing autonomous research systems across many fields. Paper 1 is innovative and rigorous within survey methodology and disaster preparedness, but its impact is narrower and more domain-specific.
POLAR-Bench addresses a fundamental and rapidly growing concern—privacy in LLM agents—with broad applicability across the entire LLM ecosystem. It introduces a reusable benchmark (7,852 samples, 10 domains) with a clear diagnostic framework that can be adopted widely by the AI safety and alignment community. Its finding that smaller open-weight models leak significantly more private data has immediate practical implications for on-device deployment. Paper 1, while methodologically thorough, addresses a narrower application (disaster survey imputation) with incremental improvements over existing methods and more limited cross-field relevance.
Paper 1 has higher potential scientific impact due to a more novel, broadly applicable methodological contribution: probabilistic multi-trajectory recursive latent reasoning with variational training and inference-time scaling. This can influence core ML research (reasoning architectures, generative modeling, uncertainty, test-time compute) and transfer across many domains. Paper 2 is timely and practically relevant for survey methodology and disaster research, but its impact is more domain-specific and incremental (evaluating LLM integrations and proposing theory-informed retrieval/imputation variants on one survey setting).
Paper 1 addresses a critical and universal challenge in AI safety—preventing multimodal agent hallucinations from executing unsafe actions. Its architectural solution (ECA) provides a robust security framework with broad, cross-domain implications for the safe deployment of autonomous AI systems. In contrast, Paper 2, while methodologically sound, is an applied study whose impact is primarily limited to survey methodology and social sciences.