SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

Han Li, Vibhor Malik, Zahra Zanjani Foumani, Alberto Castelo, Shuang Xie, Ailin Fan, Keat Yang Koay, Yuanzheng Zhu

May 19, 2026

arXiv:2605.19219v1

cs.AI (primary)

#1396 of 2292 · Artificial Intelligence

Tournament Score: 1387±42

Win Rate: 67%

Rating: 6.2 / 10

Significance 6.5 · Rigor 6 · Novelty 5.5 · Clarity 7.5

Abstract

A/B testing remains the gold standard for evaluating modifications to e-commerce storefronts, yet it diverts traffic, requires weeks to reach statistical significance, and risks degrading user experience. We present SimGym, a framework for simulating A/B tests on e-commerce storefronts using vision-language model (VLM) agents operating in a live browser. The framework comprises three key components: (a) a traffic-grounded persona generation pipeline that derives per-shop buyer archetypes and intents from production clickstream data; (b) a live-browser agent architecture that combines multimodal perception over visual and browser-structured observations with episodic memory and guardrails to conduct coherent shopping sessions across control and treatment storefronts; and (c) an evaluation protocol that compares simulated outcome shifts with observed shifts in real buyer behavior. We validate SimGym on A/B tests of visually driven UI theme changes from a major e-commerce platform across diverse storefronts and product categories. Empirical results show that SimGym agents achieve strong agreement with observed outcome shifts, attaining 77% directional alignment with add-to-cart shifts observed across interface variants in real-buyer traffic. It reduces experimental cycles from weeks to under an hour, enabling rapid experimentation without exposing real buyers to candidate variants.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SimGym

1. Core Contribution

SimGym presents an end-to-end framework for simulating A/B tests on e-commerce storefronts using vision-language model (VLM) agents that interact with live browser environments. The framework integrates three components: (a) a traffic-grounded persona generation pipeline that derives buyer archetypes from production clickstream data, (b) a multimodal live-browser agent architecture with episodic memory and guardrails, and (c) an evaluation protocol that validates simulated outcomes against real human behavioral shifts.

The central problem addressed is the cost, latency, and risk of traditional A/B testing, which requires weeks of live traffic diversion and can degrade user experience. SimGym proposes to pre-screen interface variants synthetically, reducing experimental cycles from weeks to under an hour. The key novelty lies in integrating these three components into a unified, validated system; no single component is fundamentally new. Traffic-grounded persona generation, browser-based agents, and A/B evaluation have each been explored individually, but their composition, validated against real production A/B tests, is the primary contribution.

2. Methodological Rigor

Strengths in methodology:

The evaluation against 50 real-world A/B tests across 16 countries and 11 product categories provides meaningful empirical grounding. Using real production data from a major e-commerce platform (Shopify) lends credibility.

The ablation studies are well-designed, demonstrating the importance of full persona profiles (vs. intent-only or product-only), visual input (vs. text-only), and episodic memory. The memory ablation showing collapse to 0 correlation is particularly informative.

The bootstrap sensitivity analysis on agent budget (50–700 agents) provides practical guidance and demonstrates stability.

Quality-control filters excluding confounding factors (promotions, pricing changes, new-shop ramp-up) from the ground-truth set show methodological care.
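The agent-budget sensitivity analysis mentioned above can be illustrated with a minimal sketch. It assumes each agent session yields a binary add-to-cart outcome and that the budget is counted per arm; the session pools are synthetic, and the function names `a2c_shift` and `bootstrap_shift_std` are hypothetical, not taken from the paper.

```python
import random
import statistics

def a2c_shift(control, treatment):
    """Difference in add-to-cart rate (treatment minus control)."""
    return sum(treatment) / len(treatment) - sum(control) / len(control)

def bootstrap_shift_std(control, treatment, budget, n_boot=500, seed=0):
    """Std. dev. of the estimated shift when `budget` agents per arm are resampled."""
    rng = random.Random(seed)
    shifts = []
    for _ in range(n_boot):
        c = rng.choices(control, k=budget)    # resample with replacement
        t = rng.choices(treatment, k=budget)
        shifts.append(a2c_shift(c, t))
    return statistics.pstdev(shifts)

# Synthetic pools of agent sessions (1 = added to cart); the rates are made up.
rng = random.Random(42)
control_pool = [1 if rng.random() < 0.10 else 0 for _ in range(700)]
treatment_pool = [1 if rng.random() < 0.13 else 0 for _ in range(700)]

spread_by_budget = {
    budget: bootstrap_shift_std(control_pool, treatment_pool, budget)
    for budget in (50, 200, 700)
}
for budget, spread in spread_by_budget.items():
    print(f"budget={budget:4d}  std of estimated shift={spread:.4f}")
```

In this toy setup the spread of the estimated shift shrinks as the budget grows, which is the kind of stability curve such an analysis is meant to expose.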

Weaknesses in methodology:

The 77% directional alignment, while above chance (50%), has a wide 95% CI of [66%, 87%], and the Pearson correlation of 0.55 with CI [0.32, 0.72] suggests moderate but not strong predictive validity. For a system intended to replace or pre-filter A/B tests, this level of accuracy may be insufficient for high-stakes decisions.

The evaluation is limited to add-to-cart (A2C) rate shifts only. Conversion, revenue, bounce rate, and engagement metrics are not evaluated, limiting the practical utility assessment.

The "skimmers" cohort (26.4% of sessions, 9.5% A2C rate) is used as the primary evaluation target, which is a specific and relatively high-intent segment. Performance on the full traffic distribution is not directly validated.

Only 50 shops constitute the evaluation set, which is relatively small for drawing robust conclusions. The restriction to visually driven theme changes further narrows the scope.

The paper uses GPT-5 for persona generation and Gemini 3 Flash/Pro for agent reasoning—both proprietary models—limiting reproducibility. The open-source alternative (GPT-OSS) shows notably worse performance (59% alignment).

There is no comparison to simpler baselines (e.g., heuristic-based prediction from theme change magnitude, historical A2C variance) to contextualize the 77% alignment figure.
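To make the headline metrics and the missing-baseline point concrete, the following sketch computes directional alignment, a percentile bootstrap confidence interval, a Pearson correlation, and a trivial "always predict a positive shift" baseline. The shift values are synthetic and all function names are hypothetical; this is an illustration under those assumptions, not the paper's evaluation code.

```python
import math
import random

def directional_alignment(simulated, observed):
    """Fraction of tests where simulated and observed shifts share a sign."""
    hits = sum(1 for s, o in zip(simulated, observed) if s * o > 0)
    return hits / len(observed)

def pearson(x, y):
    """Pearson correlation coefficient, computed from scratch."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def bootstrap_ci(simulated, observed, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling whole tests (pairs) with replacement."""
    rng = random.Random(seed)
    n = len(observed)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(stat([simulated[i] for i in idx], [observed[i] for i in idx]))
    values.sort()
    lo = values[int((alpha / 2) * n_boot)]
    hi = values[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic A2C shifts for 50 hypothetical tests (not data from the paper).
rng = random.Random(7)
observed = [rng.gauss(0.0, 0.02) for _ in range(50)]
simulated = [0.5 * o + rng.gauss(0.0, 0.015) for o in observed]

align = directional_alignment(simulated, observed)
align_ci = bootstrap_ci(simulated, observed, directional_alignment)
r = pearson(simulated, observed)
# Naive baseline: predict a positive shift for every test.
baseline = directional_alignment([1.0] * len(observed), observed)

print(f"alignment={align:.2f}  95% CI=({align_ci[0]:.2f}, {align_ci[1]:.2f})")
print(f"pearson r={r:.2f}  always-positive baseline={baseline:.2f}")
```

Reporting the naive baseline next to the agent's alignment shows how much of the score could be explained by the sign distribution of the test set alone.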

3. Potential Impact

The practical implications are significant for the e-commerce industry. If synthetic A/B testing can reliably pre-screen interface variants, it could:

Dramatically reduce experimentation costs and timelines for merchants

Enable bolder design exploration by lowering the risk of exposing users to poor variants

Democratize A/B testing for smaller merchants who lack sufficient traffic for statistical significance

However, the current accuracy levels suggest SimGym is better positioned as a screening and prioritization tool than as a replacement for real A/B tests. The paper acknowledges this in its conclusion.
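The traffic constraint on smaller merchants can be made concrete with a standard two-proportion sample-size approximation. This is a rough sketch: the 9.5% baseline is the skimmer-cohort A2C rate quoted in this assessment, while the 5% relative lift, the daily session count, and the function name `sessions_per_arm` are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def sessions_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate sessions per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

baseline_rate = 0.095                 # A2C rate quoted for the skimmer cohort
treated_rate = baseline_rate * 1.05   # assumed 5% relative lift
n_per_arm = sessions_per_arm(baseline_rate, treated_rate)

daily_sessions = 500                  # hypothetical traffic for a small shop
days_needed = math.ceil(2 * n_per_arm / daily_sessions)
print(f"sessions per arm: {n_per_arm}, days at {daily_sessions}/day: {days_needed}")
```

Under these assumptions the requirement is on the order of 60,000 sessions per arm, so a shop with a few hundred qualifying sessions per day would need months rather than weeks to detect such a lift.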

For the research community, SimGym contributes a concrete methodology for validating synthetic agents against real intervention effects—a gap the authors correctly identify in prior work. The modular architecture could serve as a testbed for improved persona generation, agent architectures, or evaluation protocols.

4. Timeliness & Relevance

This work is highly timely. The convergence of capable VLMs (Gemini 3, GPT-5), robust browser automation tools (Stagehand), and growing interest in LLM-based agents creates a natural opportunity. The paper addresses a genuine industry pain point, and the timing relative to concurrent work (AgentA/B, SimAB, UXAgent, PAARS) positions it well in an emerging research area. The key differentiation from concurrent work is the validation against real production A/B test outcomes rather than proxy metrics or synthetic benchmarks.

5. Strengths & Limitations

Key Strengths:

Real-world validation against production A/B tests is the standout contribution, providing a credibility bar that most concurrent work lacks

The persona generation pipeline's six-stage design is thoughtful, with clear separation between behavioral and values dimensions

Ablation studies convincingly demonstrate the necessity of each component

The framework's modularity enables future improvements without system redesign

Practical runtime (5.3 min/shop) makes the approach feasible for real deployment

Notable Limitations:

Single-platform evaluation (Shopify only) limits generalizability claims

Restricted to visual theme changes—functional changes, pricing strategies, or content modifications are not evaluated

No post-training alignment on human traces, which the authors acknowledge could improve fidelity

The API cost of the 600-agent budget per shop may be non-trivial for widespread adoption

No analysis of failure modes: what types of theme changes does SimGym systematically mispredict?

The paper does not discuss calibration—whether the magnitude of predicted shifts is systematically biased (the scatter plots in Figure 3 suggest potential compression)

Privacy and data contamination concerns are acknowledged but not deeply addressed
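The calibration concern listed above can be checked with a simple regression of simulated shifts on observed shifts. The sketch below uses made-up values and a hypothetical helper `ols_slope_intercept`; a slope well below 1 would indicate that simulated magnitudes are compressed relative to real ones, although noise in either measurement can also pull the slope down.

```python
from statistics import mean

def ols_slope_intercept(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Illustrative A2C shifts (absolute rate differences); not data from the paper.
observed = [-0.03, -0.01, 0.0, 0.02, 0.04]
simulated = [-0.012, -0.004, 0.001, 0.009, 0.016]

slope, intercept = ols_slope_intercept(observed, simulated)
print(f"calibration slope={slope:.2f}, intercept={intercept:.4f}")
```

A slope near 1 with an intercept near 0 would suggest that predicted magnitudes can be read at face value; anything else argues for treating the simulator as a directional signal only.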

Additional Observations:

The paper is well-written and structured, with comprehensive appendices including agent trace examples. The industry authorship (Shopify) provides unique access to production data but may also introduce bias toward demonstrating commercial viability. The lack of a public dataset or reproducible benchmark limits community adoption for comparative research.

Overall, SimGym represents a solid applied contribution that demonstrates the feasibility of VLM-based A/B test simulation with real-world validation, but the moderate accuracy levels, narrow evaluation scope, and single-platform setting temper the strength of its conclusions.

Rating: 6.2 / 10

Significance 6.5 · Rigor 6 · Novelty 5.5 · Clarity 7.5

Generated May 20, 2026

Comparison History (18)

vs. AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

claude-opus-4.6 · 5/21/2026

SimGym presents a novel, end-to-end framework addressing a significant practical problem in e-commerce A/B testing, combining VLM agents with real traffic data and validated against real-world outcomes (77% directional alignment). It offers clear real-world applications—reducing A/B test cycles from weeks to under an hour—with strong methodological rigor through empirical validation on a major platform. Paper 2 (AgentAtlas) contributes useful taxonomies and evaluation methodology for LLM agents, but is explicitly positioned as a 'measurement-protocol demonstration, not a benchmark release,' limiting its immediate impact. SimGym's concrete, validated system with direct industry applicability gives it higher potential impact.

vs. OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

claude-opus-4.6 · 5/21/2026

SimGym addresses a high-value practical problem in e-commerce A/B testing with a novel framework combining VLMs, browser agents, and real traffic grounding. Its 77% directional alignment with real outcomes and reduction from weeks to under an hour represents significant practical impact. Paper 2 advances ToM benchmarking, a narrower NLP subfield, with incremental improvements on existing benchmarks. While both are competent works, SimGym's combination of methodological novelty (VLM agents for realistic simulation), broad commercial applicability, and validation against real production data gives it higher potential for cross-disciplinary impact in AI, HCI, and e-commerce.

vs. Declarative Data Services: Structured Agentic Discovery for Composing Data Systems

claude-opus-4.6 · 5/21/2026

SimGym addresses a well-defined, high-impact problem (A/B testing in e-commerce) with a complete framework validated against real-world data, achieving measurable results (77% directional alignment). It has clear practical applications, reducing experimental cycles from weeks to under an hour. Paper 1, while addressing an interesting problem in data-system composition, is self-described as an 'early prototype' and 'proof of life,' suggesting limited validation. SimGym's combination of VLM agents, traffic-grounded personas, and rigorous empirical validation gives it broader appeal across HCI, ML, and e-commerce research communities.

vs. Declarative Data Services: Structured Agentic Discovery for Composing Data Systems

claude-opus-4.6 · 5/21/2026

SimGym presents a more complete and validated framework with strong empirical results (77% directional alignment with real A/B tests), addressing a well-defined, high-value problem in e-commerce experimentation. It combines novel VLM agent architecture with real-world validation against production data. Paper 2, while addressing an interesting problem in data-system composition, is self-described as an 'early prototype' and 'proof of life,' with limited validation on a single workload. SimGym's broader applicability, methodological rigor, and immediate practical value for the large e-commerce industry give it higher impact potential.

vs. When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach

gpt-5.2 · 5/20/2026

Paper 2 (SimGym) has higher potential impact due to strong real-world applicability and timeliness: it targets a major bottleneck in industry experimentation (A/B test cost, latency, risk) and proposes an end-to-end, traffic-grounded VLM-agent simulation framework validated against real platform outcomes. Its methodology includes grounding personas in production clickstream data and an evaluation protocol comparing simulated vs. observed shifts, supporting practical adoption. Breadth is also larger (ML, HCI, causal/experimentation, e-commerce systems). Paper 1 is novel and important for strategic ML on tabular PFNs, but is narrower and more niche in immediate deployment.

vs. Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses

claude-opus-4.6 · 5/20/2026

Paper 1 presents a comprehensive, methodologically rigorous framework addressing fundamental challenges in survey research (declining response rates, missing data, AI fraud) with novel contributions including theory-constrained LLM imputation, subgroup-stratified bias auditing, and hallucination-managed chatbots. It tackles a broadly applicable methodological problem relevant across social sciences, public health, and disaster research. Paper 2, while innovative in applying VLM agents to A/B test simulation, addresses a narrower commercial application with 77% directional alignment—promising but domain-specific. Paper 1's breadth of impact, methodological contributions, and cross-disciplinary relevance give it higher scientific impact potential.

vs. VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation

claude-opus-4.6 · 5/20/2026

SimGym introduces a novel framework for simulating A/B tests using VLM agents in live browsers, addressing a significant practical problem in e-commerce. Its cross-disciplinary impact (HCI, ML, e-commerce) and immediate real-world applicability—reducing A/B test cycles from weeks to under an hour with 77% directional alignment—represent substantial practical value. While VISAFF makes solid contributions to multimodal ERC with its tuning-free approach, it is more incremental within an established research area. SimGym's novelty in combining traffic-grounded personas with browser-based VLM agents opens a new research direction with broader industry implications.

vs. Interference-Aware Multi-Task Unlearning

gemini-3.1 · 5/20/2026

Paper 1 tackles a fundamental challenge in trustworthy AI by extending machine unlearning to multi-task settings, a crucial step for modern shared-backbone models. Its rigorous methodology addressing task and instance-level interference offers broad applicability across AI domains, particularly for privacy compliance. While Paper 2 presents a highly valuable commercial application of VLMs, Paper 1 provides foundational methodological advancements with wider theoretical and cross-disciplinary implications in machine learning.

vs. Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving

claude-opus-4.6 · 5/20/2026

Paper 2 reveals a counterintuitive and broadly important finding—that higher observation fidelity can actually hurt LLM agent performance in embodied tasks, with moderate noise improving success rates by 2.85x. This challenges fundamental assumptions about how LLMs interact with perception systems and has implications across robotics, AI evaluation methodology, and LLM reasoning research. The finding that noise masks reasoning failures rather than enabling robust problem-solving is a critical insight for the growing embodied AI field. Paper 1, while practically useful for e-commerce A/B testing, addresses a narrower application domain with less fundamental scientific contribution.

vs. From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

gemini-3.1 · 5/20/2026

Paper 1 demonstrates immediate and highly practical real-world utility by significantly reducing A/B testing cycles in e-commerce with strong empirical results (77% alignment). In contrast, Paper 2 explores an important topic (temporal grounding in AVs) but reports no statistically significant improvements in quantitative metrics, limiting its immediate scientific and practical impact.

vs. Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem

gpt-5.2 · 5/20/2026

Paper 1 is more likely to have higher impact: it proposes a novel, end-to-end framework using traffic-grounded personas and VLM browser agents to simulate A/B tests, validated on real e-commerce experiments with measurable predictive alignment and large cycle-time reductions, enabling immediate industrial application. Its methodology includes grounding, live-browser execution, and empirical comparison to real outcomes, giving broad relevance across ML, HCI, experimentation, and e-commerce. Paper 2 is valuable as a careful case study on limits of AI formalization, but its scope is narrower and mainly diagnostic rather than delivering a new broadly applicable capability.

vs. KAN-MLP-Mixer: A comprehensive investigation of the usage of Kolmogorov-Arnold Networks (KANs) for improving IMU-based Human Activity Recognition

gemini-3.1 · 5/20/2026

Paper 1 introduces a highly novel application of Vision-Language Models to simulate human behavior, potentially revolutionizing the costly and time-consuming process of A/B testing in e-commerce. Its methodological innovation in creating traffic-grounded agents offers broader implications for agentic AI and human-computer interaction, representing a significant paradigm shift compared to Paper 2's more incremental architectural investigation of KANs in a specific domain (Human Activity Recognition).

vs. Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

gpt-5.2 · 5/20/2026

Paper 2 (SimGym) has higher estimated impact due to strong real-world applicability and timeliness: it targets a costly, widely used industry workflow (A/B testing) and offers large practical speedups. Its methodology is grounded in production clickstream data and validated against real A/B test outcome shifts across diverse storefronts, suggesting robustness and adoption potential. The framework could influence multiple areas (HCI, recommender systems, agent evaluation, simulation, marketing science). Paper 1 is technically novel for RL on diffusion MLLMs, but its impact is narrower and more incremental within generative model optimization.

vs. Recall Isn't Enough: Bounding Commitments in Personalized Language Systems

gpt-5.2 · 5/20/2026

Paper 2 has higher likely scientific impact due to broader relevance beyond a single domain: it targets a core reliability failure mode in personalized/memory LLM systems (commitment and infeasibility), proposing a principled framework (bounded evidence activation + commitment validation) with clear tradeoffs and strong controlled evaluation across fixtures/backends. Its ideas generalize to assistants, agents, and safety-critical personalization, and are timely given long-context adoption. Paper 1 is innovative and highly applicable to e-commerce experimentation, but its impact is narrower and more product/platform-specific, with weaker methodological generality.

vs. STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery

claude-opus-4.6 · 5/20/2026

STRIDE addresses a fundamental scientific challenge—automated equation/law discovery from data—with broad applicability across physics, biology, and engineering. Its self-reflective agent framework for symbolic regression introduces methodological innovations (mixed-fitting evaluation, critic-executor repair, semantic memory) that advance both AI and scientific discovery. Paper 1 (SimGym) is a well-executed engineering contribution but is narrowly focused on e-commerce A/B test simulation with limited generalizability beyond that domain. STRIDE's potential to accelerate scientific discovery across multiple fields gives it substantially broader and deeper impact.

vs. SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

claude-opus-4.6 · 5/20/2026

SimGym introduces a novel framework combining VLMs with live browser simulation to replace costly A/B tests in e-commerce—a broadly applicable problem. Its cross-disciplinary impact (HCI, ML, e-commerce) and practical value (reducing weeks-long experiments to under an hour with 77% directional alignment) give it wider real-world applicability. SAPO, while technically sound, addresses a narrower problem (credit assignment in generative recommendation RL), representing an incremental improvement within a specific subfield. SimGym's paradigm of simulated experimentation with grounded VLM agents has broader transformative potential.

vs. Causely: A Causal Intelligence Layer for Enterprise AI A Benchmark Study on SRE and Reliability Workflows

gemini-3.1 · 5/20/2026

Paper 2 introduces a highly novel paradigm for simulating A/B testing using VLM agents, shifting experimentation from live user traffic to zero-risk, rapid simulations. This approach has broad, transformative potential across HCI, e-commerce, and behavioral modeling. While Paper 1 offers strong operational improvements for SRE workflows, Paper 2's ability to accurately model human economic behavior and drastically reduce experimental cycles gives it a wider and more fundamental scientific impact.

vs. Streamlined Constraint Reasoning via CNN Pattern Recognition on Enumerated Solutions

claude-opus-4.6 · 5/20/2026

Paper 2 introduces a genuinely novel methodological contribution—using CNNs to discover structural patterns in constraint programming solutions and translating them into streamliner constraints via LLMs. This bridges machine learning and combinatorial optimization in a creative way, with impressive speedups (up to 1103x). The approach is broadly applicable across constraint programming problems. Paper 1, while practically useful for e-commerce A/B testing, is more application-specific with moderate accuracy (77% directional alignment) and relies on assembling existing components (VLMs, browser agents) rather than introducing fundamentally new methods.