Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses

Yan Wang, Ziyi Guo, Christopher McCarty

May 19, 2026

arXiv:2605.19229v1

cs.AI(primary)

#1531 of 2292 · Artificial Intelligence

Tournament Score: 1370±40 (scale 1050 to 1800)

Win Rate: 46%

Rating: 6.2/10

Significance 6.5 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Abstract

Survey research faces mounting structural challenges: declining response rates, sample bias, block-wise missingness among at-risk respondents, and AI-assisted fraudulent completions in online panels. Large language models (LLMs) have been proposed as a remedy, yet rigorous evaluations across the full survey workflow remain scarce, particularly in disaster contexts where data quality matters most. We present and evaluate a five-stage framework for LLM integration covering questionnaire design, sample selection, pilot testing, missing-data imputation, and post-collection analysis, using the 2024 Hurricane Milton preparedness survey of Florida residents (n=946) as a shared empirical testbed. We introduce a Protection Motivation Theory (PMT)-constrained co-occurrence knowledge graph and develop seven LLM configurations spanning zero-shot inference, retrieval-augmented baselines, and novel theory-informed variants. Our proposed Anchored Marginal Theory-Informed LLM (A-TLM) outperforms all three classical imputation baselines (IPW/MI, MICE+PMM, missForest) on RMSE under disaster-relevant block-wise MNAR conditions (S4 RMSE 1.439 vs. 1.496 for the next-best), while achieving near-zero signed bias (-0.121) where the random-forest imputer produces the largest absolute bias (-0.631). Organizing retrieval around PMT causal structure and integrating all evidence in a single model call outperforms unstructured retrieval and staged sequential inference (MAE 0.993 vs. 1.097 for standard RAG). We document that near-zero aggregate bias can mask opposing subgroup errors and propose subgroup-stratified bias auditing as a reporting standard. A retrieval-constrained knowledge-graph chatbot demonstrates that hallucination is architecturally manageable through grounded refusal.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper proposes a five-stage framework for integrating LLMs into the survey research workflow: questionnaire design, sample selection, pilot testing, missing-data imputation, and post-collection analysis. The central empirical contribution is the Anchored Marginal Theory-Informed LLM (A-TLM), which organizes retrieval-augmented generation around Protection Motivation Theory (PMT) causal structure via a constrained co-occurrence knowledge graph. The paper uses the 2024 Hurricane Milton preparedness survey (n=946) as a unified testbed across all stages.

The key innovation is the theory-constrained retrieval architecture: rather than using flat embedding similarity for RAG, the authors organize evidence according to PMT's causal cascade and integrate all evidence in a single model call. This outperforms both unstructured retrieval and staged sequential inference. The paper also contributes the methodological insight that near-zero aggregate imputation bias can mask opposing subgroup errors, proposing subgroup-stratified bias auditing as a reporting standard.
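To make the contrast with flat embedding retrieval concrete, the following is a minimal sketch of grouping retrieved evidence by PMT construct and packing it into a single prompt. The construct names follow standard PMT (threat appraisal, then coping appraisal); the `build_prompt` helper, the construct ordering, the target item, and the evidence strings are all hypothetical illustrations and are not taken from the paper.

```python
from collections import defaultdict

# Standard PMT constructs, threat appraisal before coping appraisal.
# The constructs and ordering in the paper's graph may differ (assumption).
PMT_ORDER = [
    "perceived_severity",
    "perceived_vulnerability",
    "response_efficacy",
    "self_efficacy",
    "response_cost",
]

def build_prompt(target_item, evidence):
    """Group (construct, text) evidence by PMT construct and emit ONE prompt."""
    grouped = defaultdict(list)
    for construct, text in evidence:
        grouped[construct].append(text)
    lines = [f"Predict the respondent's answer to: {target_item}"]
    for construct in PMT_ORDER:
        if grouped[construct]:  # skip constructs with no retrieved evidence
            lines.append(f"[{construct}]")
            lines.extend(f"- {text}" for text in grouped[construct])
    return "\n".join(lines)

# Hypothetical evidence, deliberately supplied out of PMT order.
evidence = [
    ("self_efficacy", "Reports being able to evacuate without help."),
    ("perceived_severity", "Rates hurricane damage risk as high."),
]
prompt = build_prompt("Stocked a 3-day water supply? (1-5)", evidence)
print(prompt)
```

The point of the sketch is only the shape of the prompt: evidence arrives in theory order and the model sees all of it in one call, rather than one call per construct.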

2. Methodological Rigor

The experimental design is generally sound but has notable limitations that the authors acknowledge. The controlled experiments (Stages 3 and 4) are well-structured with a deterministic train/validation split, four progressively demanding missingness mechanisms (MCAR through block-wise MNAR), and comparison against three established classical baselines (IPW/MI, MICE+PMM, missForest). The component ablation in Table 8 effectively isolates the contribution of peer examples and vulnerability cues.
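As an illustration of what separates block-wise MNAR from MCAR, here is a minimal sketch of a masking function in which a respondent loses an entire block of items, and the chance of losing it depends on a value inside that block. The function name, item names, threshold, and probability are hypothetical; the paper's actual S1 to S4 mechanisms are not specified in this review.

```python
import random

def blockwise_mnar_mask(rows, block_items, outcome_item, threshold,
                        p_missing, seed=0):
    """Return the set of (row_index, item) cells to blank out.

    A row loses the whole block with probability p_missing, but only when its
    own value on outcome_item is below threshold. outcome_item is assumed to be
    part of block_items, so missingness depends on a value that ends up
    unobserved (MNAR), and it removes items together (block-wise).
    """
    rng = random.Random(seed)
    masked = set()
    for i, row in enumerate(rows):
        if row[outcome_item] < threshold and rng.random() < p_missing:
            masked.update((i, item) for item in block_items)
    return masked

# Toy data: rows 0 and 2 have low preparedness, row 1 has high preparedness.
rows = [
    {"prep_water": 1, "prep_kit": 2},
    {"prep_water": 5, "prep_kit": 4},
    {"prep_water": 2, "prep_kit": 1},
]
mask = blockwise_mnar_mask(rows, ["prep_water", "prep_kit"], "prep_water",
                           threshold=3, p_missing=1.0)
print(sorted(mask))
```

Setting `threshold` above every observed value and lowering `p_missing` recovers an MCAR-style mask, which is one way to step through progressively harder scenarios.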

However, several concerns temper confidence in the results:

Small validation sample: The compound-vulnerable subgroup comprises only 72 respondents, and the authors explicitly note that formal inferential statistics (e.g., bootstrap confidence intervals) on RMSE differences were not computed. The S4 RMSE difference between A-TLM (1.439) and missForest (1.496) is modest and may not be statistically significant.

Single model dependency: All experiments use Claude Sonnet 4.5 at temperature 0.1. No ablation across different LLMs is provided, making it unclear how much performance depends on the specific model.

Three of five stages are demonstrations, not controlled experiments: Stages 1, 2, and 5 lack preregistered ground truth or formal evaluation metrics. The Stage 2 result (Spearman ρ = 0.12) actually demonstrates poor performance, though the authors reframe this constructively.

Single dataset: All evaluation is anchored to one post-hurricane survey from one U.S. state, limiting generalizability claims.
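The missing inferential step noted above could be approximated with a paired percentile bootstrap on the RMSE difference. The sketch below is one such approach under stated assumptions: it takes per-unit signed errors from two imputers on the same held-out cells, and the function names and synthetic errors are hypothetical rather than drawn from the paper.

```python
import math
import random

def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def bootstrap_rmse_diff(err_a, err_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for RMSE(a) - RMSE(b), resampling paired units."""
    if len(err_a) != len(err_b):
        raise ValueError("error vectors must be paired and equal in length")
    rng = random.Random(seed)
    n = len(err_a)
    diffs = []
    for _ in range(n_boot):
        # Same resampled indices for both methods keeps the comparison paired.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(rmse([err_a[i] for i in idx]) - rmse([err_b[i] for i in idx]))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Synthetic errors for 72 units (placeholder data, not the paper's results).
rng = random.Random(1)
err_a = [rng.gauss(0, 1.0) for _ in range(72)]
err_b = [rng.gauss(0, 1.0) for _ in range(72)]
lower, upper = bootstrap_rmse_diff(err_a, err_b)
print(f"95% CI for RMSE difference: [{lower:.3f}, {upper:.3f}]")
```

An interval that spans zero would indicate that the two methods cannot be distinguished at that sample size. With several items per respondent, resampling respondents rather than individual cells would better respect within-person correlation.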

3. Potential Impact

The paper addresses a genuine and worsening crisis in survey methodology. The statistics cited—response rates declining 15-30 percentage points, AI agents passing attention checks at 99.8%, usable completion rates falling to 10% in some panels—paint a dire picture. The proposed framework has several practical applications:

Disaster research: Block-wise MNAR missingness concentrated among vulnerable populations is a real and consequential problem. If A-TLM's bias reduction holds at scale, this could meaningfully improve post-disaster policy estimation.

Survey design automation: The Stage 1 construct-adequacy audit and Stage 3 pilot-testing applications could save significant researcher time, particularly in rapid-onset disaster settings.

Hallucination management: The grounded-refusal architecture in Stage 5 demonstrates a practically useful pattern for deploying LLM-based analysis tools.
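The grounded-refusal pattern can be sketched in a few lines: answer only from stored facts, and return a fixed refusal when nothing matches. Keyword overlap stands in for real graph retrieval here, and the function name, refusal string, and toy facts are hypothetical, not the paper's implementation.

```python
REFUSAL = "I cannot answer that from the survey data."

def answer_from_facts(question_terms, facts):
    """Return matching stored facts, or a refusal when none are retrieved.

    facts is a list of (keywords, text) pairs. No text is generated beyond
    what is stored, so an unmatched question cannot produce invented content.
    """
    terms = set(question_terms)
    hits = [text for keywords, text in facts if terms & set(keywords)]
    if not hits:
        return REFUSAL
    return " ".join(hits)

# Toy facts (placeholders, not findings from the survey).
facts = [
    (("evacuation", "pets"), "Toy fact 1 about evacuation and pets."),
    (("generator", "fuel"), "Toy fact 2 about generators and fuel."),
]
grounded = answer_from_facts(["evacuation", "timing"], facts)
refused = answer_from_facts(["earthquake"], facts)
print(grounded)
print(refused)
```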

The subgroup-stratified bias auditing recommendation is perhaps the most broadly impactful contribution—it's a simple, implementable standard that could improve transparency across any LLM-augmented survey workflow.
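A subgroup-stratified audit of this kind amounts to reporting signed bias per subgroup alongside the aggregate. The sketch below uses toy values chosen so that opposing subgroup errors cancel exactly; the function name, subgroup labels, and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def stratified_bias(records):
    """records: iterable of (subgroup, imputed, true).

    Returns (overall signed bias, {subgroup: signed bias}).
    """
    by_group = defaultdict(list)
    for group, imputed, true in records:
        by_group[group].append(imputed - true)
    overall = mean(e for errs in by_group.values() for e in errs)
    return overall, {g: mean(errs) for g, errs in by_group.items()}

# Toy data: under-imputation for one subgroup, over-imputation for the other.
records = [
    ("vulnerable", 3, 4),
    ("vulnerable", 2, 3),
    ("other", 4, 3),
    ("other", 5, 4),
]
overall, per_group = stratified_bias(records)
print("overall bias:", overall)
print("per-subgroup bias:", per_group)
```

In this toy case the aggregate bias is zero while each subgroup is off by a full scale point in opposite directions, which is the masking pattern the review describes.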

4. Timeliness & Relevance

This paper is highly timely. The convergence of declining survey quality, proliferating AI-generated fraudulent responses, and rapid LLM capability improvement creates an urgent need for systematic evaluation of LLM integration in survey science. The disaster context amplifies this urgency given climate-driven increases in extreme weather events. The paper also speaks to the broader debate about "silicon sampling" and AI surrogates in social science, providing a more nuanced position than either uncritical adoption or wholesale rejection.

5. Strengths & Limitations

Key Strengths:

The unified five-stage framework on a single testbed is genuinely novel; prior work has addressed individual stages in isolation.

The insight that theory-constrained retrieval with single-call integration outperforms both unstructured RAG and staged inference is well-demonstrated and practically useful.

The finding that Staged-TLM (sequential PMT cascade) underperforms Marginal-TLM (single-call integration) is an important architectural insight about error propagation in staged LLM reasoning.

The subgroup bias decomposition (e.g., FS-LLM's overall bias of ~0 masking +0.34 and -0.41 subgroup biases) is a valuable empirical observation with broad methodological implications.

Honest reporting of failures (Stage 2's ρ = 0.12) enhances credibility.

Notable Weaknesses:

The performance margins are thin and lack statistical significance testing. Under S4, A-TLM beats missForest by 0.057 RMSE units—this could easily be noise with 72 compound-vulnerable respondents.

The PMT-constrained graph is specific to disaster preparedness; it's unclear how the approach generalizes to surveys without an established theoretical framework to constrain retrieval.

No cost analysis is provided. Routing 189 respondents × 16 items through Claude Sonnet 4.5 with theory-organized prompts may be expensive relative to classical imputation.

The paper claims reproducibility ("per-stage scripts") but no code repository is referenced.

missForest achieves the lowest RMSE in three of four scenarios; A-TLM's advantage is confined to the most extreme (and arguably most realistic for disaster contexts) scenario S4.

The paper's title asks whether LLMs can "revolutionize" survey research, but the evidence better supports "modestly augment under specific conditions."
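The cost concern can be made concrete with back-of-envelope arithmetic. The sketch below assumes one model call per respondent-item cell; the token counts and per-million-token prices are placeholder parameters chosen for illustration only and are not real pricing or figures from the paper.

```python
def imputation_cost(n_respondents, n_items, tokens_in, tokens_out,
                    price_in_per_mtok, price_out_per_mtok):
    """Return (number of calls, total cost), assuming one call per cell."""
    calls = n_respondents * n_items
    per_call = tokens_in * price_in_per_mtok + tokens_out * price_out_per_mtok
    return calls, calls * per_call / 1_000_000

# 189 respondents x 16 items comes from the review; everything else is a placeholder.
calls, cost = imputation_cost(
    189, 16,
    tokens_in=2000, tokens_out=50,
    price_in_per_mtok=1.0, price_out_per_mtok=1.0,
)
print(f"{calls} calls, cost {cost:.4f} in placeholder currency units")
```

Substituting measured prompt lengths and current provider prices would give the comparison against classical imputation that the paper does not report.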

Overall Assessment

This is a competent, well-structured paper that makes a meaningful contribution to a timely problem. The five-stage framework provides a useful organizational scaffold, and the theory-constrained retrieval architecture is a genuine methodological contribution. However, the empirical evidence is somewhat thin: small validation samples, a single dataset, a single LLM, and modest performance margins limit the strength of the claims. The paper's greatest value may be in its conceptual framing and its subgroup-bias decomposition insight rather than in the specific numerical improvements reported.

Rating: 6.2/10

Significance 6.5 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Generated May 20, 2026

Comparison History (24)

vs. AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

gemini-3.1 · 5/22/2026

Benchmarks evaluating large language models, particularly in complex areas like multi-turn emotional intelligence, typically achieve widespread adoption and high citation counts across the broad AI and NLP communities. While Paper 1 offers rigorous methodological advancements for survey research, its impact is largely concentrated within computational social science. Paper 2 introduces a fundamental evaluation tool with broader relevance to human-computer interaction, model alignment, and conversational AI development.

vs. SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact due to its broader, timely relevance to LLM agents and web retrieval, a clearly novel problem formulation (state-gated retrieval), and a reusable benchmark with diagnostic error taxonomy that can drive progress across many agent systems and applications. Benchmarks often become community reference points, enabling comparable evaluation and follow-on methods work. Paper 2 is rigorous and practically important for survey methodology, but its empirical scope is narrower (one disaster survey context) and improvements are incremental; impact may remain more contained within computational social science and imputation workflows.

vs. LACO: Adaptive Latent Communication for Collaborative Driving

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a broader, more fundamental challenge in survey research methodology with a comprehensive five-stage framework applicable across many disciplines. It introduces novel methodological contributions (A-TLM, theory-constrained knowledge graphs, subgroup-stratified bias auditing) with rigorous evaluation against established baselines. Its impact spans social sciences, disaster management, and AI methodology. Paper 2, while technically solid, addresses a narrower problem in collaborative autonomous driving with incremental improvements. Paper 1's methodological contributions and cross-disciplinary relevance give it higher potential impact.

vs. EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a broader interdisciplinary problem (survey methodology + LLMs + disaster research) with novel theoretical contributions (PMT-constrained knowledge graph, A-TLM method, subgroup-stratified bias auditing). It introduces methodologically rigorous comparisons against established statistical baselines, proposes new reporting standards, and has clear real-world applications in disaster preparedness and survey science—fields with massive user bases. Paper 1, while technically solid, is more narrowly focused on engineering design benchmarking for LLM agents, serving a smaller community. Paper 2's contributions to imputation methodology and bias auditing have broader cross-field applicability.

vs. SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

claude-opus-4.6 · 5/20/2026

Paper 1 presents a comprehensive, methodologically rigorous framework addressing fundamental challenges in survey research (declining response rates, missing data, AI fraud) with novel contributions including theory-constrained LLM imputation, subgroup-stratified bias auditing, and hallucination-managed chatbots. It tackles a broadly applicable methodological problem relevant across social sciences, public health, and disaster research. Paper 2, while innovative in applying VLM agents to A/B test simulation, addresses a narrower commercial application with 77% directional alignment—promising but domain-specific. Paper 1's breadth of impact, methodological contributions, and cross-disciplinary relevance give it higher scientific impact potential.

vs. MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

claude-opus-4.6 · 5/20/2026

MOCHA addresses a fundamental gap in LLM agent optimization by introducing a principled multi-objective framework (Chebyshev scalarization with annealing) that demonstrably outperforms existing methods across diverse tasks. Its contribution is methodologically novel, broadly applicable to the rapidly growing field of LLM agents and prompt optimization, and solves a well-defined technical problem with strong empirical results. Paper 2, while thorough and practically useful for survey methodology, is more application-specific to disaster preparedness surveys, with incremental improvements over existing imputation methods and narrower cross-field impact potential.

vs. Neurosymbolic Learning for Inference-Time Argumentation

gpt-5.2 · 5/20/2026

Paper 1 has higher potential impact due to a more novel, generally applicable methodological contribution: a trainable neurosymbolic framework that deterministically maps explicit arguments to ternary verdicts, addressing a central, timely issue (faithful LLM reasoning/explanations) with clear rigor and broad applicability to high-stakes verification beyond a single domain. Paper 2 is strong and practical for survey methodology (notably MNAR imputation and bias auditing), but is more domain-specific and largely an applied evaluation of LLM configurations on one empirical setting, which may limit breadth and novelty relative to Paper 1’s framework-level advance.

vs. Efficient Elicitation of Collective Disagreements

gemini-3.1 · 5/20/2026

Paper 1 addresses a critical and timely crisis in survey research (declining response rates, AI fraud) using highly relevant LLM methodologies. Its application to disaster preparedness offers immediate, high-stakes real-world utility. The introduction of theory-informed LLMs outperforming classical missing-data imputation baselines provides a methodological breakthrough with broad interdisciplinary impact across computational social science, public policy, and NLP, giving it an edge over the more theoretical focus of Paper 2.

vs. A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

claude-opus-4.6 · 5/20/2026

Paper 2 presents a rigorous empirical evaluation with quantitative results on a real-world dataset, introduces a novel imputation method (A-TLM) that outperforms established baselines, and addresses concrete methodological challenges in survey research with broad applicability across social sciences. It also contributes actionable findings (subgroup-stratified bias auditing) and demonstrates practical LLM integration. Paper 1, while intellectually interesting in formalizing architectural patterns for LLM agents, is primarily a conceptual/methodological framework without strong empirical validation, limiting its immediate measurable impact.

vs. Enhancing Metacognitive AI: Knowledge-Graph Population with Graph-Theoretic LLM Enrichment

gpt-5.2 · 5/20/2026

Paper 1 has higher potential impact due to stronger methodological rigor and clearer real-world relevance: it evaluates LLM integration across the full survey workflow on a real disaster-preparedness survey, benchmarks against established imputation methods under MNAR block-missingness, reports multiple error/bias metrics, and proposes an actionable auditing standard. It also contributes a theory-constrained retrieval/knowledge-graph approach with safety benefits (grounded refusal). Paper 2 is timely and broadly applicable, but is a proof-of-concept with smaller-scale evaluation and relies on LLM-judged improvements, reducing evidential strength.

vs. Streamlined Constraint Reasoning via CNN Pattern Recognition on Enumerated Solutions

gemini-3.1 · 5/20/2026

Paper 1 offers broader cross-disciplinary impact by addressing a systemic crisis in the social sciences: survey data degradation. By demonstrating that theoretically-anchored LLMs can outperform established imputation baselines (MICE, missForest) in high-stakes disaster contexts, it provides immediate real-world utility for public health and policy researchers. While Paper 2 presents an elegant neuro-symbolic approach to constraint programming with massive speedups, its impact is largely confined to combinatorial optimization. Paper 1's comprehensive framework, rigorous empirical evaluation, and relevance to wide-ranging fields give it higher potential for widespread scientific and societal impact.

vs. OpenComputer: Verifiable Software Worlds for Computer-Use Agents

gpt-5.2 · 5/20/2026

Paper 2 is likely higher impact due to its broadly applicable, reusable infrastructure for verifiable evaluation of computer-use agents across 33 real applications and 1,000 tasks. Verifier-grounded, auditable rewards address a central bottleneck in agent research (reliable evaluation), with immediate relevance to autonomy, safety, benchmarking, and reinforcement learning. Its methodology (state verifiers, trajectory logging, partial-credit scoring, self-evolving verification) is general and could become a standard testbed across academia and industry. Paper 1 is strong and timely but narrower to survey workflows/disaster contexts and incremental over established imputation baselines.

vs. From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

gemini-3.1 · 5/20/2026

Paper 2 has higher potential scientific impact due to its rigorous quantitative results, which show that its novel LLM configurations outperform classical baselines in missing-data imputation. It addresses critical, widespread challenges in survey research and the social sciences, offering broad cross-disciplinary applicability. In contrast, Paper 1 targets a narrower domain (autonomous vehicles) and, despite interesting qualitative observations, fails to show statistically significant quantitative improvements, limiting its immediate practical impact.

vs. When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

claude-opus-4.6 · 5/20/2026

Paper 2 presents a more comprehensive and novel methodological framework (five-stage LLM integration for survey research) with concrete empirical contributions including a new imputation method (A-TLM) that outperforms established baselines, a theory-constrained knowledge graph, and actionable recommendations (subgroup-stratified bias auditing). It addresses widely recognized challenges in survey methodology with broad applicability across social sciences. Paper 1, while offering a useful negative result and interesting hypothesis about environment-feedback bandwidth, is narrower in scope (cybersecurity CTF agents), relies on reanalysis of existing data with non-significant results, and its primary contribution is a falsifiable hypothesis rather than a validated method.

vs. Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem

gemini-3.1 · 5/20/2026

Paper 2 has a significantly broader scientific impact due to its comprehensive application of LLMs across the entire survey research workflow, a ubiquitous methodology in social and health sciences. It introduces a novel architecture (A-TLM) that rigorously outperforms established statistical baselines (e.g., MICE, missForest) in addressing complex missing data. In contrast, Paper 1 is a narrow case study focusing on a single mathematical olympiad problem, making its impact largely confined to the specific niche of AI-assisted formal verification.

vs. Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

gemini-3.1 · 5/20/2026

Paper 2 addresses a critical bottleneck in foundational AI research: LLM training instability and compute waste. By introducing a governance layer above AdamW that rescues failing training runs under severe stress, it offers massive potential savings in compute resources and broad applicability across all deep learning domains. While Paper 1 presents an innovative application of LLMs to computational social science, Paper 2's methodological improvements to the core infrastructure of LLM training promise a much wider, immediate, and economically significant impact across the entire artificial intelligence field.

vs. How Far Are We From True Auto-Research?

gpt-5.2 · 5/20/2026

Paper 2 has higher likely impact due to timeliness and breadth: it provides a systematic, multi-lens evaluation framework (manuscript-only vs artifact-aware vs human meta-review) over 117 agent-generated papers, revealing concrete failure modes (fabrication, underpowered studies, plan/execution mismatch) and large agent-dependent differences. This directly informs evaluation standards, benchmarking, and safety/quality controls for rapidly growing autonomous research systems across many fields. Paper 1 is innovative and rigorous within survey methodology and disaster preparedness, but its impact is narrower and more domain-specific.

vs. POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

claude-opus-4.6 · 5/20/2026

POLAR-Bench addresses a fundamental and rapidly growing concern—privacy in LLM agents—with broad applicability across the entire LLM ecosystem. It introduces a reusable benchmark (7,852 samples, 10 domains) with a clear diagnostic framework that can be adopted widely by the AI safety and alignment community. Its finding that smaller open-weight models leak significantly more private data has immediate practical implications for on-device deployment. Paper 1, while methodologically thorough, addresses a narrower application (disaster survey imputation) with incremental improvements over existing methods and more limited cross-field relevance.

vs. Generative Recursive Reasoning

gpt-5.2 · 5/20/2026

Paper 1 has higher potential scientific impact due to a more novel, broadly applicable methodological contribution: probabilistic multi-trajectory recursive latent reasoning with variational training and inference-time scaling. This can influence core ML research (reasoning architectures, generative modeling, uncertainty, test-time compute) and transfer across many domains. Paper 2 is timely and practically relevant for survey methodology and disaster research, but its impact is more domain-specific and incremental (evaluating LLM integrations and proposing theory-informed retrieval/imputation variants on one survey setting).

vs. Hallucination as Exploit: Evidence-Carrying Multimodal Agents

gemini-3.1 · 5/20/2026

Paper 1 addresses a critical and universal challenge in AI safety—preventing multimodal agent hallucinations from executing unsafe actions. Its architectural solution (ECA) provides a robust security framework with broad, cross-domain implications for the safe deployment of autonomous AI systems. In contrast, Paper 2, while methodologically sound, is an applied study whose impact is primarily limited to survey methodology and social sciences.