Beyond One-shot: AI Agents for Learning in Field Experiments
Junjie Luo, Ritu Agarwal, Gordon Gao
Abstract
Organizations routinely run experiments for A/B testing, yet the data generated from one experiment is underutilized to inform subsequent intervention design. Significant barriers exist to extracting actionable knowledge from prior experimental data to inform new interventions. We study whether tool-augmented agentic AI can automatically learn from experimental data to generate new interventions in subsequent experiments. Through two-stage field experiments in healthcare prescription messaging (693,139 patient visits), we compare a Human + Chatbot method (Stage 1: behavioral experts with conversational AI co-designing 13 message variants, 444,691 patient visits) against a Tool-Augmented Agentic AI method (Stage 2: AI autonomously extracting principles from Stage 1 data to generate 17 new variants, 248,448 patient visits). The Agentic AI method, equipped with analytical tools, structured Data-Information-Knowledge-Wisdom (DIKW) reasoning agents, and transparent evidence chains, produces superior interventions: the best AI-generated message achieved a 69.8% CTR (+6.5 percentage points over baseline). Critically, our results suggest that the value comes from domain-specific experimental data, not from general reasoning ability: frontier LLMs operating without experimental data failed to predict which interventions would succeed. The field experiments also revealed that general-purpose behavioral theories used for intervention design do not extend uniformly to specific healthcare contexts, motivating an agentic AI approach to theory audits at field-experiment scale. Our research shows that tool-augmented AI can learn from experimental data and generate improved domain-relevant interventions, transforming behavioral experimentation from one-shot evaluation into a scalable system for cumulative design learning.
AI Impact Assessments
Scientific Impact Assessment
1. Core Contribution
This paper addresses a genuine and important gap: the failure of organizations to extract reusable design knowledge from behavioral experiments, resulting in "one-shot" evaluation rather than cumulative learning. The authors propose a tool-augmented agentic AI system based on the DIKW (Data-Information-Knowledge-Wisdom) hierarchy that autonomously analyzes experimental data from one round and generates improved interventions for the next. The system was validated through two large-scale field experiments in healthcare prescription messaging (693,139 patient visits total), where AI-generated messages outperformed both the baseline and the best human-plus-chatbot designs from Stage 1.
The core novelty lies not in any single component but in the integration: combining code execution for statistical analysis, structured multi-level reasoning agents, and transparent evidence chains to close the loop between experimental evaluation and intervention design. The finding that domain-specific experimental data matters more than general LLM reasoning ability is a substantive contribution—frontier LLMs without data access performed no better than random at predicting which messages would succeed.
2. Methodological Rigor
Strengths: The two-stage field experiment design is well-constructed. Stage 1 (444,691 visits, 13 variants) serves both as a legitimate test of human+chatbot intervention design and as the data source for Stage 2. Stage 2 (248,448 visits, 20 variants including 3 Stage 1 baselines) provides a clean comparison. Randomization balance is documented, coefficient stability across progressively richer specifications confirms clean randomization, and multiple comparison corrections (Holm-Bonferroni, Benjamini-Hochberg) are applied. The inclusion of both click-through and authentication outcomes strengthens credibility.
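For readers unfamiliar with the two corrections named above, the following is a minimal sketch of how Holm-Bonferroni and Benjamini-Hochberg decide which variant-versus-baseline comparisons remain significant. It is not the authors' analysis code. The function names and the p-values are hypothetical, chosen only for illustration; the paper's actual p-values are not reproduced here.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down procedure (controls the family-wise error rate).

    Returns a list of booleans aligned with p_values: True means rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank = 0 for the smallest p-value
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first failure
    return reject


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate).

    Returns a list of booleans aligned with p_values: True means rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank  # keep the largest rank that passes its threshold
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject


# Hypothetical p-values for five message variants compared with a baseline.
example_p = [0.30, 0.001, 0.02, 0.012, 0.045]
holm_result = holm_bonferroni(example_p)
bh_result = benjamini_hochberg(example_p)
```

On these illustrative inputs Holm rejects two hypotheses and Benjamini-Hochberg rejects three, which reflects the usual trade-off: Holm guards against any false positive, while Benjamini-Hochberg tolerates a controlled share of false discoveries in exchange for power.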
Weaknesses: Several methodological concerns temper enthusiasm:
3. Potential Impact
The paper has meaningful practical implications for organizations running repeated A/B tests. The idea that experimental data should feed forward into subsequent intervention design is intuitive but rarely operationalized, and the DIKW framework provides a concrete architecture for doing so. The healthcare messaging domain is practically important ($528B annual costs from non-optimized medication therapy), though the connection between CTR improvements and actual health outcomes remains unestablished.
The broader contribution—demonstrating that agentic AI systems can extract actionable knowledge from experimental data—could influence how organizations approach experimentation across marketing, public policy, education, and product design. The finding about domain-specific behavioral principles (social proof failing in healthcare despite being a canonical nudging technique) is genuinely useful for the behavioral science community.
4. Timeliness & Relevance
The paper sits at the intersection of two highly active areas: agentic AI systems and evidence-based behavioral intervention design. The critique of one-shot experimentation resonates with growing calls for integrative experimental designs in social science (Almaatouq et al., 2024). The demonstration that general-purpose LLMs cannot substitute for domain-specific experimental data is timely given widespread enthusiasm about LLM capabilities.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations:
The paper is well-written and clearly structured, though at times the contribution is oversold relative to what the evidence supports. Calling this "cumulative learning" based on a single two-stage experiment stretches the definition. The reproducibility of the DIKW system outputs is unclear—would the same system produce the same messages with different random seeds or slight prompt variations? The paper would benefit from sensitivity analyses on the agentic system itself.
Overall, this is a solid applied contribution that demonstrates a promising approach to closing the experimental learning loop, but the evidence for the specific architectural choices (DIKW hierarchy, multi-agent design) driving the improvement over simpler alternatives remains thin.
Generated Jun 2, 2026
Comparison History (30)
Paper 2 introduces a fundamental architectural advancement for LLM agents (parametric memory via fast online LoRA updates), addressing critical limitations in continuous learning and context constraints. This methodological innovation offers broad, cross-domain applicability and has the potential to fundamentally shift foundational agent design. While Paper 1 presents an impressive, large-scale real-world application, Paper 2's core algorithmic breakthrough is likely to have a more widespread and foundational impact across the broader AI research community.
Paper 2 likely has higher scientific impact: it advances a core scientific capability (mechanically verifiable formal reasoning) with a broadly reusable agentic framework, introduces a timely new benchmark (Lean-IMO-Bench), and demonstrates strong, rigorous evaluations plus research-level contributions (verified result tied to an open combinatorics challenge). Its applications span mathematics, formal methods, software/hardware verification, and AI safety. Paper 1 is novel and highly applied with impressive-scale field evidence, but its impact is more domain-specific (healthcare messaging/experimentation) and may generalize less broadly across scientific fields.
Paper 2 demonstrates higher potential scientific impact due to its novel real-world application combining agentic AI with large-scale field experiments (693K+ patient visits) in healthcare. It introduces a practical framework for cumulative experimental learning that transforms how organizations conduct behavioral interventions. The methodology bridges AI and experimental design in a generalizable way across domains. Paper 1, while identifying an important reliability issue (harmful overthinking in LRMs), is more diagnostic in nature and narrower in scope, primarily characterizing a known limitation rather than introducing a transformative methodology with demonstrated real-world impact.
Paper 2 likely has higher impact due to strong methodological rigor (large-scale, two-stage field experiments with clear baselines), immediate real-world applicability (healthcare messaging optimization), and timely relevance to agentic/tool-augmented AI that learns from prior experimental data. Its findings also generalize across experimentation-driven domains (A/B testing, marketing, policy, product design), potentially influencing both ML/agent design and empirical social science workflows. Paper 1 is a compelling conceptual framework with broad interdisciplinary appeal, but its impact is more dependent on subsequent operationalization and empirical validation.
Paper 1 demonstrates higher potential scientific impact due to its large-scale field experiment validation (693K+ patient visits), direct real-world healthcare application, and novel paradigm of using agentic AI to transform experimentation from one-shot evaluation into cumulative learning. It bridges AI, behavioral science, and healthcare at scale. Paper 2 offers solid theoretical contributions to causal Bayesian optimization with elegant mathematical properties, but addresses a more specialized methodological niche. Paper 1's breadth of impact, practical applicability, and timeliness (leveraging frontier LLMs in experimental design) give it broader reach across multiple disciplines.
Paper 1 demonstrates massive real-world impact through a nearly 700k-subject field experiment in healthcare, showing how agentic AI can automate cumulative learning in A/B testing. Its findings challenge the reliance on general LLM reasoning, proving the necessity of domain-specific data. Paper 2 offers a valuable but more niche methodological improvement for Bayesian optimization on standard benchmarks, giving Paper 1 broader multi-disciplinary and practical significance.
Paper 1 likely has higher impact due to its rare combination of methodological novelty (agentic, tool-augmented learning from prior field-experiment data), massive real-world deployment scale (693k+ patient visits), and clear, practically meaningful outcome gains. It also makes a timely contribution to AI-for-science/experimentation by showing limits of frontier LLMs without domain data and proposing a scalable “cumulative design learning” paradigm with transparent evidence chains. Paper 2 is a solid, rigorous benchmarking/evaluation contribution in RL, but its applications and cross-field reach are narrower and incremental relative to active work on RL generalization metrics.
Paper 1 demonstrates massive real-world impact through a large-scale field experiment (~700k patients) in healthcare. By bridging AI, behavioral science, and empirical A/B testing, it proves that agentic AI can autonomously extract insights from experimental data to design superior interventions. This cross-disciplinary approach and rigorous real-world validation provide it with broader scientific and practical implications compared to Paper 2, which, while methodologically novel in agent architecture, relies on standard simulated AI benchmarks.
Paper 1 demonstrates significantly higher scientific impact through its novel contribution of using tool-augmented agentic AI to iteratively learn from field experimental data, validated through large-scale field experiments (693,139 patient visits). It addresses a fundamental problem in experimental science—cumulative learning across experiments—with broad applicability beyond healthcare. The methodological innovation (DIKW reasoning, evidence chains) and the key finding that domain-specific experimental data outperforms general LLM reasoning are highly novel. Paper 2, while competent, applies relatively established techniques (knowledge graphs, attention mechanisms) to a narrower educational domain with less generalizable insights.
Paper 1 has higher likely scientific impact due to its novel, broadly relevant insight that chain-of-thought is an unreliable oversight channel, supported by a rigorous multi-method causal/probing framework across nine models and seven benchmarks. This directly affects interpretability, AI safety, evaluation, and product practices wherever CoT is used, making its cross-field impact large and timely. Paper 2 is strong and highly applicable, but its contribution is more domain- and setting-specific (healthcare messaging/experimentation) and may generalize less broadly than Paper 1’s foundational result about LLM reasoning traces.
Paper 1 presents novel empirical findings from large-scale field experiments (693K patient visits) demonstrating that tool-augmented agentic AI can autonomously learn from experimental data to generate superior interventions. This represents a concrete methodological innovation with immediate real-world applications in healthcare and behavioral science. The finding that domain-specific experimental data matters more than general reasoning ability is a significant, actionable insight. Paper 2, while useful as a comprehensive survey of RLM adoption across disciplines, primarily synthesizes existing knowledge and proposes a maturity framework rather than generating new empirical results or methods with direct impact.
Paper 1 demonstrates higher potential scientific impact due to its massive scale (nearly 700,000 patient visits) and direct real-world application in healthcare. While Paper 2 presents a valuable tool for computational mathematics, Paper 1 introduces a broadly applicable framework for cumulative AI-driven experimental design. By transforming A/B testing from a one-shot evaluation into an automated, continuous learning system, Paper 1 offers immense cross-disciplinary utility for behavioral science, healthcare, and tech industries, backed by exceptional methodological rigor and large-scale empirical validation.
Paper 1 demonstrates higher potential scientific impact due to its broader applicability, larger-scale real-world validation (693K patient visits in field experiments), and its novel paradigm of transforming one-shot experiments into cumulative learning systems. It addresses a fundamental challenge across all experimental sciences—how to systematically learn from prior experiments to design better interventions. The healthcare application with measurable outcomes adds practical significance. Paper 2, while technically sound and relevant to drug design, is more narrowly focused on a specific optimization problem within SBDD and represents an incremental improvement to existing LLM-agent pipelines.
Paper 1 presents a large-scale, empirical field experiment (nearly 700,000 patient visits) demonstrating how agentic AI can tangibly improve experimental design and intervention effectiveness. Its data-driven methodology, immediate real-world applications in healthcare and beyond, and proof that AI can facilitate cumulative design learning give it a highly rigorous and measurable scientific impact. In contrast, Paper 2 offers a theoretical legal framework, which, while highly relevant for policy, lacks the empirical validation and broad cross-disciplinary methodological innovation of Paper 1.
Paper 1 demonstrates higher scientific impact through its novel integration of agentic AI with large-scale field experiments (693K patient visits), addressing the fundamental problem of cumulative learning across experiments. It introduces a validated framework (DIKW reasoning agents) with real-world healthcare applications and shows domain-specific experimental data outperforms general LLM reasoning—a broadly important finding. Paper 2, while technically solid in creating procedurally generated RL training environments, represents more incremental progress in visual reasoning benchmarks. Paper 1's cross-disciplinary relevance (AI, behavioral science, healthcare, experimentation) gives it broader impact potential.
Paper 1 demonstrates significantly higher scientific impact potential. It introduces a novel framework for using tool-augmented agentic AI to autonomously learn from experimental data and generate improved interventions, validated through large-scale field experiments (693K+ patient visits) in healthcare. It addresses a broadly relevant problem (cumulative learning across experiments), combines AI/ML with behavioral science, and has clear real-world applications. Paper 2 addresses a narrow, incremental gap in external memory search algorithms, studying simple baseline methods for IDD-based A*, which is a much more limited contribution to a specialized subfield of AI search.
Paper 2 has higher potential impact due to its methodological rigor (large-scale, real-world two-stage field experiments with clear counterfactual comparison), direct applicability (automating iterative intervention design in organizations), and broad relevance across experimentation-heavy domains (healthcare, marketing, public policy, product). Its central claim—agents can extract knowledge from prior experimental data to improve future interventions—advances a timely agenda of cumulative learning systems, and the negative result (LLMs without data fail) strengthens scientific value. Paper 1 is a useful benchmark but is narrower and primarily infrastructural.
Paper 1 has higher near-term scientific impact: it demonstrates a novel, tool-augmented agentic AI loop that learns from prior field-experiment data to generate better interventions, validated at very large real-world scale (hundreds of thousands of visits) with clear measurable gains. Methodological rigor is stronger due to randomized field experimentation and head-to-head comparison against human+chatbot design. Applications (healthcare messaging, A/B testing, experimentation platforms) are immediate and broadly relevant. Paper 2 is conceptually ambitious and potentially cross-disciplinary, but appears more framework/prototype- and case-study-driven with less decisive empirical validation.
Paper 1 represents a massive cross-disciplinary paradigm shift, utilizing AI agents for cumulative learning in experimental design. Its methodological rigor is exceptional, featuring a nearly 700,000-subject field experiment in healthcare. While Paper 2 addresses a critical NLP problem (hallucinations), Paper 1's scale, real-world application, and demonstration of AI autonomously auditing and improving domain-specific behavioral theories offer a broader, transformative scientific impact across AI, behavioral economics, and healthcare.
Paper 1 demonstrates massive real-world impact through large-scale field experiments (~700k patient visits) in healthcare, showcasing how AI agents can cumulatively learn from data to design better interventions. Its interdisciplinary application across AI, behavioral science, and healthcare, combined with robust empirical validation, offers significantly higher breadth of impact and methodological rigor compared to Paper 2's preliminary evaluation on 25 tasks in a niche data engineering context.