Beyond One-shot: AI Agents for Learning in Field Experiments

Junjie Luo, Ritu Agarwal, Gordon Gao

#112 of 3404 · Artificial Intelligence
Tournament Score: 1542±44
Win Rate: 87% (26 wins, 4 losses, 30 matches)
Rating: 6.5/10 (dimensions: Significance, Rigor, Novelty, Clarity)

Abstract

Organizations routinely run experiments for A/B testing, yet the data generated from one experiment is underutilized to inform subsequent intervention design. Significant barriers exist to extracting actionable knowledge from prior experimental data to inform new interventions. We study whether tool-augmented agentic AI can automatically learn from experimental data to generate new interventions in subsequent experiments. Through two-stage field experiments in healthcare prescription messaging (693,139 patient visits), we compare a Human + Chatbot method (Stage 1: behavioral experts with conversational AI co-designing 13 message variants, 444,691 patient visits) against a Tool-Augmented Agentic AI method (Stage 2: AI autonomously extracting principles from Stage 1 data to generate 17 new variants, 248,448 patient visits). The Agentic AI method, equipped with analytical tools, structured Data-Information-Knowledge-Wisdom (DIKW) reasoning agents, and transparent evidence chains, produces superior interventions: the best AI-generated message achieved a 69.8% CTR (+6.5 percentage points over baseline). Critically, our results suggest that the value comes from domain-specific experimental data, not from general reasoning ability: frontier LLMs operating without experimental data failed to predict which interventions would succeed. The field experiments also revealed that general-purpose behavioral theories used for intervention design do not extend uniformly to specific healthcare contexts, motivating an agentic AI approach to theory audits at field-experiment scale. Our research shows that tool-augmented AI can learn from experimental data and generate improved domain-relevant interventions, transforming behavioral experimentation from one-shot evaluation into a scalable system for cumulative design learning.
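The headline figures in the abstract can be unpacked with a little arithmetic. The sketch below is illustrative only: it takes the two numbers quoted above (a 69.8% CTR and a gain of 6.5 percentage points) and derives the implied baseline and the relative lift. The function name `lift_summary` is hypothetical, not something from the paper.

```python
def lift_summary(treated_ctr_pct: float, gain_pp: float) -> dict:
    """Derive the implied baseline CTR and the relative lift.

    treated_ctr_pct: CTR of the treated arm, in percent (69.8 in the abstract).
    gain_pp: absolute gain over baseline, in percentage points (6.5 in the abstract).
    """
    baseline_pct = treated_ctr_pct - gain_pp
    # Relative lift expresses the gain as a fraction of the baseline rate.
    relative_lift_pct = 100.0 * gain_pp / baseline_pct
    return {
        "baseline_pct": round(baseline_pct, 1),
        "relative_lift_pct": round(relative_lift_pct, 1),
    }


summary = lift_summary(69.8, 6.5)
print(summary)
```

The distinction matters when reading the result: a 6.5 percentage-point gain on a baseline already above 60% is a relative improvement of roughly one tenth, not 6.5%.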

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper addresses a genuine and important gap: the failure of organizations to extract reusable design knowledge from behavioral experiments, resulting in "one-shot" evaluation rather than cumulative learning. The authors propose a tool-augmented agentic AI system based on the DIKW (Data-Information-Knowledge-Wisdom) hierarchy that autonomously analyzes experimental data from one round and generates improved interventions for the next. The system was validated through two large-scale field experiments in healthcare prescription messaging (693,139 patient visits total), where AI-generated messages outperformed both the baseline and the best human-plus-chatbot designs from Stage 1.

The core novelty lies not in any single component but in the integration: combining code execution for statistical analysis, structured multi-level reasoning agents, and transparent evidence chains to close the loop between experimental evaluation and intervention design. The finding that domain-specific experimental data matters more than general LLM reasoning ability is a substantive contribution—frontier LLMs without data access performed no better than random at predicting which messages would succeed.

2. Methodological Rigor

Strengths: The two-stage field experiment design is well-constructed. Stage 1 (444,691 visits, 13 variants) serves both as a legitimate test of human+chatbot intervention design and as the data source for Stage 2. Stage 2 (248,448 visits, 20 variants including 3 Stage 1 baselines) provides a clean comparison. Randomization balance is documented, coefficient stability across progressively richer specifications is consistent with clean randomization, and multiple comparison corrections (Holm-Bonferroni, Benjamini-Hochberg) are applied. The inclusion of both click-through and authentication outcomes strengthens credibility.
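For readers unfamiliar with the two corrections named above, here is a minimal pure-Python sketch of the textbook procedures. It is not the authors' analysis code; the p-values are toy numbers invented for illustration, and the function names are hypothetical.

```python
from typing import List


def holm_bonferroni(pvals: List[float], alpha: float = 0.05) -> List[bool]:
    """Step-down Holm procedure; controls the family-wise error rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The (rank+1)-th smallest p-value is compared with alpha / (m - rank).
        if pvals[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


def benjamini_hochberg(pvals: List[float], alpha: float = 0.05) -> List[bool]:
    """Step-up Benjamini-Hochberg procedure; controls the false discovery rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Track the largest rank k whose p-value satisfies p_(k) <= alpha * k / m.
        if pvals[idx] <= alpha * rank / m:
            cutoff = rank
    reject = [False] * m
    for idx in order[:cutoff]:
        reject[idx] = True
    return reject


# Toy p-values (invented), one per hypothetical message variant.
toy_pvals = [0.025, 0.001, 0.2, 0.012, 0.045]
holm_result = holm_bonferroni(toy_pvals)
bh_result = benjamini_hochberg(toy_pvals)
print(holm_result, bh_result)
```

On these toy values Holm rejects two hypotheses and Benjamini-Hochberg rejects three, which reflects the usual trade-off: family-wise error control is stricter than false-discovery-rate control.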

Weaknesses: Several methodological concerns temper enthusiasm:

  • Confounded comparison: Stage 1 and Stage 2 are separated by ~7 weeks, so they differ in time period, in patient population (no overlap), and potentially in seasonal or contextual factors. The 63.3% baseline CTR in Stage 2 vs. 62.5% in Stage 1 already suggests temporal drift. The comparison between "Human+Chatbot" (Stage 1) and "Agentic AI" (Stage 2) is therefore partially confounded with time.
  • No controlled ablation of the DIKW system: We cannot disentangle whether the improvement comes from the structured DIKW reasoning, the code execution capability, the specific LLM used (Claude 4 Sonnet), or simply having access to Stage 1 data in any form. Would a human analyst given Stage 1 results and asked to design new messages have done comparably? The authors acknowledge this, but it remains a significant gap.
  • Selection bias in message portfolio: Three of 20 AI-generated messages were excluded by human reviewers before Stage 2 deployment. This introduces a human curation step that partially conflates "agentic AI" with "agentic AI + human selection," though this is acknowledged as part of the workflow.
  • LLM comparison methodology: The pairwise Elo evaluation of frontier LLMs uses a specific prompt framing. Different prompting strategies, chain-of-thought reasoning, or providing even summary statistics from Stage 1 might yield different results. The comparison feels somewhat strawman-like.
  • Outcome scope: Click-through rate, while a valid engagement metric, is far from a health outcome. The paper's framing around healthcare impact somewhat overstates what CTR differences mean for medication adherence or patient health.
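As background for the pairwise Elo evaluation mentioned in the weaknesses above, the following is a minimal sketch of a standard Elo update applied to a list of pairwise judgments. The review does not state the paper's K-factor, initial rating, or match ordering, so `k=32`, `initial=1500`, the message IDs, and the toy judgments are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (logistic curve, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return updated ratings after one match; score_a is 1.0 for a win, 0.0 for a loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def run_tournament(judgments: List[Tuple[str, str]], initial: float = 1500.0,
                   k: float = 32.0) -> Dict[str, float]:
    """Process (winner, loser) pairs in order and return the final ratings."""
    ratings: Dict[str, float] = {}
    for winner, loser in judgments:
        r_w = ratings.get(winner, initial)
        r_l = ratings.get(loser, initial)
        ratings[winner], ratings[loser] = update_elo(r_w, r_l, 1.0, k)
    return ratings


# Toy judgments (invented): each tuple means "the judge preferred the first message".
toy_judgments = [("msg_a", "msg_b"), ("msg_a", "msg_c"), ("msg_b", "msg_c")]
ratings = run_tournament(toy_judgments)
print(ratings)
```

Because sequential Elo depends on match order and on how the judge is prompted, a sketch like this also makes the reviewer's concern concrete: different framings change the pairwise outcomes, and the final ranking inherits that sensitivity.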

3. Potential Impact

The paper has meaningful practical implications for organizations running repeated A/B tests. The idea that experimental data should feed forward into subsequent intervention design is intuitive but rarely operationalized, and the DIKW framework provides a concrete architecture for doing so. The healthcare messaging domain is practically important ($528B annual costs from non-optimized medication therapy), though the connection between CTR improvements and actual health outcomes remains unestablished.

The broader contribution, demonstrating that agentic AI systems can extract actionable knowledge from experimental data, could influence how organizations approach experimentation across marketing, public policy, education, and product design. The finding about domain-specific behavioral principles (social proof failing in healthcare despite being a canonical nudging technique) is genuinely useful for the behavioral science community.

4. Timeliness & Relevance

The paper sits at the intersection of two highly active areas: agentic AI systems and evidence-based behavioral intervention design. The critique of one-shot experimentation resonates with growing calls for integrative experimental designs in social science (Almaatouq et al., 2024). The demonstration that general-purpose LLMs cannot substitute for domain-specific experimental data is timely given widespread enthusiasm about LLM capabilities.

5. Strengths & Limitations

Key Strengths:

  • Large-scale real-world field experiments with meaningful sample sizes
  • Transparent architecture with auditable evidence chains
  • Substantive domain-specific findings (social proof failure, efficiency framing success)
  • The LLM-without-data comparison, despite limitations, makes an important point about the value of domain-specific evidence
  • Well-articulated connection to organizational learning theory and design science

Notable Limitations:

  • The two-stage design cannot cleanly attribute improvement to the DIKW system vs. simply having prior experimental data in any form
  • Only one experimental cycle is demonstrated; "cumulative learning" implies multiple rounds
  • The healthcare context is narrow (prescription notification SMS for a single platform)
  • No comparison with simpler alternatives (e.g., human experts reviewing Stage 1 data, basic statistical summaries fed to an LLM)
  • The DIKW framework, while useful as an organizing principle, is not empirically validated as superior to alternative knowledge hierarchies
  • Effect sizes, while statistically significant, are measured on a proxy outcome (CTR) rather than the target health behavior

Additional Observations:

The paper is well-written and clearly structured, though at times the contribution is oversold relative to what the evidence supports. Calling this "cumulative learning" based on a single two-stage experiment stretches the definition. The reproducibility of the DIKW system outputs is unclear: would the same system produce the same messages with different random seeds or slight prompt variations? The paper would benefit from sensitivity analyses on the agentic system itself.

Overall, this is a solid applied contribution that demonstrates a promising approach to closing the experimental learning loop, but the evidence for the specific architectural choices (DIKW hierarchy, multi-agent design) driving the improvement over simpler alternatives remains thin.

Rating: 6.5/10
Significance 7 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Generated Jun 2, 2026

    Comparison History (30)

    vs. Scaling Self-Evolving Agents via Parametric Memory
    gemini-3.1 · 6/5/2026

    Paper 2 introduces a fundamental architectural advancement for LLM agents (parametric memory via fast online LoRA updates), addressing critical limitations in continuous learning and context constraints. This methodological innovation offers broad, cross-domain applicability and has the potential to fundamentally shift foundational agent design. While Paper 1 presents an impressive, large-scale real-world application, Paper 2's core algorithmic breakthrough is likely to have a more widespread and foundational impact across the broader AI research community.

    vs. LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks
    gpt-5.2 · 6/3/2026

    Paper 2 likely has higher scientific impact: it advances a core scientific capability (mechanically verifiable formal reasoning) with a broadly reusable agentic framework, introduces a timely new benchmark (Lean-IMO-Bench), and demonstrates strong, rigorous evaluations plus research-level contributions (verified result tied to an open combinatorics challenge). Its applications span mathematics, formal methods, software/hardware verification, and AI safety. Paper 1 is novel and highly applied with impressive-scale field evidence, but its impact is more domain-specific (healthcare messaging/experimentation) and may generalize less broadly across scientific fields.

    vs. Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models
    claude-opus-4.6 · 6/3/2026

    Paper 2 demonstrates higher potential scientific impact due to its novel real-world application combining agentic AI with large-scale field experiments (693K+ patient visits) in healthcare. It introduces a practical framework for cumulative experimental learning that transforms how organizations conduct behavioral interventions. The methodology bridges AI and experimental design in a generalizable way across domains. Paper 1, while identifying an important reliability issue (harmful overthinking in LRMs), is more diagnostic in nature and narrower in scope, primarily characterizing a known limitation rather than introducing a transformative methodology with demonstrated real-world impact.

    vs. Interaction-Centered Intelligence: Toward Interaction as the Primary Unit of Analysis in Co-Creative AI and Human-AI Systems
    gpt-5.2 · 6/2/2026

    Paper 2 likely has higher impact due to strong methodological rigor (large-scale, two-stage field experiments with clear baselines), immediate real-world applicability (healthcare messaging optimization), and timely relevance to agentic/tool-augmented AI that learns from prior experimental data. Its findings also generalize across experimentation-driven domains (A/B testing, marketing, policy, product design), potentially influencing both ML/agent design and empirical social science workflows. Paper 1 is a compelling conceptual framework with broad interdisciplinary appeal, but its impact is more dependent on subsequent operationalization and empirical validation.

    vs. Transferring Information Across Interventions in Causal Bayesian Optimization
    claude-opus-4.6 · 6/2/2026

    Paper 1 demonstrates higher potential scientific impact due to its large-scale field experiment validation (693K+ patient visits), direct real-world healthcare application, and novel paradigm of using agentic AI to transform experimentation from one-shot evaluation into cumulative learning. It bridges AI, behavioral science, and healthcare at scale. Paper 2 offers solid theoretical contributions to causal Bayesian optimization with elegant mathematical properties, but addresses a more specialized methodological niche. Paper 1's breadth of impact, practical applicability, and timeliness (leveraging frontier LLMs in experimental design) give it broader reach across multiple disciplines.

    vs. Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization
    gemini-3.1 · 6/2/2026

    Paper 1 demonstrates massive real-world impact through a nearly 700k-subject field experiment in healthcare, showing how agentic AI can automate cumulative learning in A/B testing. Its findings challenge the reliance on general LLM reasoning, proving the necessity of domain-specific data. Paper 2 offers a valuable but more niche methodological improvement for Bayesian optimization on standard benchmarks, giving Paper 1 broader multi-disciplinary and practical significance.

    vs. Certificate-Guided Evaluation of Reinforcement Learning Generalization
    gpt-5.2 · 6/2/2026

    Paper 1 likely has higher impact due to its rare combination of methodological novelty (agentic, tool-augmented learning from prior field-experiment data), massive real-world deployment scale (693k+ patient visits), and clear, practically meaningful outcome gains. It also makes a timely contribution to AI-for-science/experimentation by showing limits of frontier LLMs without domain data and proposing a scalable “cumulative design learning” paradigm with transparent evidence chains. Paper 2 is a solid, rigorous benchmarking/evaluation contribution in RL, but its applications and cross-field reach are narrower and incremental relative to active work on RL generalization metrics.

    vs. SkillSmith: Co-Evolving Skills and Tools for Self-Improving Agent Systems
    gemini-3.1 · 6/2/2026

    Paper 1 demonstrates massive real-world impact through a large-scale field experiment (~700k patients) in healthcare. By bridging AI, behavioral science, and empirical A/B testing, it proves that agentic AI can autonomously extract insights from experimental data to design superior interventions. This cross-disciplinary approach and rigorous real-world validation provide it with broader scientific and practical implications compared to Paper 2, which, while methodologically novel in agent architecture, relies on standard simulated AI benchmarks.

    vs. Advanced Mathematics Learning Behavior Prediction and Academic Early Warning Model Based on Multimodal Data Analysis
    claude-opus-4.6 · 6/2/2026

    Paper 1 demonstrates significantly higher scientific impact through its novel contribution of using tool-augmented agentic AI to iteratively learn from field experimental data, validated through large-scale field experiments (693,139 patient visits). It addresses a fundamental problem in experimental science—cumulative learning across experiments—with broad applicability beyond healthcare. The methodological innovation (DIKW reasoning, evidence chains) and the key finding that domain-specific experimental data outperforms general LLM reasoning are highly novel. Paper 2, while competent, applies relatively established techniques (knowledge graphs, attention mechanisms) to a narrower educational domain with less generalizable insights.

    vs. When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel
    gpt-5.2 · 6/2/2026

    Paper 1 has higher likely scientific impact due to its novel, broadly relevant insight that chain-of-thought is an unreliable oversight channel, supported by a rigorous multi-method causal/probing framework across nine models and seven benchmarks. This directly affects interpretability, AI safety, evaluation, and product practices wherever CoT is used, making its cross-field impact large and timely. Paper 2 is strong and highly applicable, but its contribution is more domain- and setting-specific (healthcare messaging/experimentation) and may generalize less broadly than Paper 1’s foundational result about LLM reasoning traces.

    vs. Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches
    claude-opus-4.6 · 6/2/2026

    Paper 1 presents novel empirical findings from large-scale field experiments (693K patient visits) demonstrating that tool-augmented agentic AI can autonomously learn from experimental data to generate superior interventions. This represents a concrete methodological innovation with immediate real-world applications in healthcare and behavioral science. The finding that domain-specific experimental data matters more than general reasoning ability is a significant, actionable insight. Paper 2, while useful as a comprehensive survey of RLM adoption across disciplines, primarily synthesizes existing knowledge and proposes a maturity framework rather than generating new empirical results or methods with direct impact.

    vs. Iteris: Agentic Research Loops for Computational Mathematics
    gemini-3.1 · 6/2/2026

    Paper 1 demonstrates higher potential scientific impact due to its massive scale (nearly 700,000 patient visits) and direct real-world application in healthcare. While Paper 2 presents a valuable tool for computational mathematics, Paper 1 introduces a broadly applicable framework for cumulative AI-driven experimental design. By transforming A/B testing from a one-shot evaluation into an automated, continuous learning system, Paper 1 offers immense cross-disciplinary utility for behavioral science, healthcare, and tech industries, backed by exceptional methodological rigor and large-scale empirical validation.

    vs. Probe Before You Edit: Probing-Guided Molecular Optimization for LLM Agents in Structure-Based Drug Design
    claude-opus-4.6 · 6/2/2026

    Paper 1 demonstrates higher potential scientific impact due to its broader applicability, larger-scale real-world validation (693K patient visits in field experiments), and its novel paradigm of transforming one-shot experiments into cumulative learning systems. It addresses a fundamental challenge across all experimental sciences—how to systematically learn from prior experiments to design better interventions. The healthcare application with measurable outcomes adds practical significance. Paper 2, while technically sound and relevant to drug design, is more narrowly focused on a specific optimization problem within SBDD and represents an incremental improvement to existing LLM-agent pipelines.

    vs. Acting with AI: An Interaction-Based Framework for Agentic Tort Liability
    gemini-3.1 · 6/2/2026

    Paper 1 presents a large-scale, empirical field experiment (nearly 700,000 patient visits) demonstrating how agentic AI can tangibly improve experimental design and intervention effectiveness. Its data-driven methodology, immediate real-world applications in healthcare and beyond, and proof that AI can facilitate cumulative design learning give it a highly rigorous and measurable scientific impact. In contrast, Paper 2 offers a theoretical legal framework, which, while highly relevant for policy, lacks the empirical validation and broad cross-disciplinary methodological innovation of Paper 1.

    vs. TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL
    claude-opus-4.6 · 6/2/2026

    Paper 1 demonstrates higher scientific impact through its novel integration of agentic AI with large-scale field experiments (693K patient visits), addressing the fundamental problem of cumulative learning across experiments. It introduces a validated framework (DIKW reasoning agents) with real-world healthcare applications and shows domain-specific experimental data outperforms general LLM reasoning—a broadly important finding. Paper 2, while technically solid in creating procedurally generated RL training environments, represents more incremental progress in visual reasoning benchmarks. Paper 1's cross-disciplinary relevance (AI, behavioral science, healthcare, experimentation) gives it broader impact potential.

    vs. Evaluation of Baseline Methods for IDD-based SSD External Memory Search
    claude-opus-4.6 · 6/2/2026

    Paper 1 demonstrates significantly higher scientific impact potential. It introduces a novel framework for using tool-augmented agentic AI to autonomously learn from experimental data and generate improved interventions, validated through large-scale field experiments (693K+ patient visits) in healthcare. It addresses a broadly relevant problem (cumulative learning across experiments), combines AI/ML with behavioral science, and has clear real-world applications. Paper 2 addresses a narrow, incremental gap in external memory search algorithms, studying simple baseline methods for IDD-based A*, which is a much more limited contribution to a specialized subfield of AI search.

    vs. SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes
    gpt-5.2 · 6/2/2026

    Paper 2 has higher potential impact due to its methodological rigor (large-scale, real-world two-stage field experiments with clear counterfactual comparison), direct applicability (automating iterative intervention design in organizations), and broad relevance across experimentation-heavy domains (healthcare, marketing, public policy, product). Its central claim—agents can extract knowledge from prior experimental data to improve future interventions—advances a timely agenda of cumulative learning systems, and the negative result (LLMs without data fail) strengthens scientific value. Paper 1 is a useful benchmark but is narrower and primarily infrastructural.

    vs. PROMETHEUS: Automating Deep Causal Research Integrating Text, Data and Models
    gpt-5.2 · 6/2/2026

    Paper 1 has higher near-term scientific impact: it demonstrates a novel, tool-augmented agentic AI loop that learns from prior field-experiment data to generate better interventions, validated at very large real-world scale (hundreds of thousands of visits) with clear measurable gains. Methodological rigor is stronger due to randomized field experimentation and head-to-head comparison against human+chatbot design. Applications (healthcare messaging, A/B testing, experimentation platforms) are immediate and broadly relevant. Paper 2 is conceptually ambitious and potentially cross-disciplinary, but appears more framework/prototype- and case-study-driven with less decisive empirical validation.

    vs. Mitigating Hallucinations in Large Language Models Via Decoder Layer Skipping
    gemini-3.1 · 6/2/2026

    Paper 1 represents a massive cross-disciplinary paradigm shift, utilizing AI agents for cumulative learning in experimental design. Its methodological rigor is exceptional, featuring a nearly 700,000-subject field experiment in healthcare. While Paper 2 addresses a critical NLP problem (hallucinations), Paper 1's scale, real-world application, and demonstration of AI autonomously auditing and improving domain-specific behavioral theories offer a broader, transformative scientific impact across AI, behavioral economics, and healthcare.

    vs. "Skill issues": data-centric optimization of lakehouse agents
    gemini-3.1 · 6/2/2026

    Paper 1 demonstrates massive real-world impact through large-scale field experiments (~700k patient visits) in healthcare, showcasing how AI agents can cumulatively learn from data to design better interventions. Its interdisciplinary application across AI, behavioral science, and healthcare, combined with robust empirical validation, offers significantly higher breadth of impact and methodological rigor compared to Paper 2's preliminary evaluation on 25 tasks in a niche data engineering context.