ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

Chinmay Savadikar, Mingyu Zhao, Yuanzheng Zhu, Han Li, Shuang Xie, Alberto Castelo, Tianfu Wu, Lingyun Wang

May 15, 2026

arXiv:2605.16116v1

cs.AI (primary)

#1111 of 2292 · Artificial Intelligence

Tournament Score: 1415 ± 37 (scale 1050 to 1800)

Win Rate: 46%

Rating: 5.8 / 10

Significance 6.5 · Rigor 4.5 · Novelty 5.5 · Clarity 7.5

Abstract

Developing and evaluating e-commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison. Existing methodologies force a tradeoff: live storefronts provide realism but are non-stationary, difficult to inspect, and irreproducible, while hand-built sandbox benchmarks provide control but cover only a narrow range of layouts, catalogs, policies, and interaction patterns. We argue that the core bottleneck is methodological: the field lacks a scalable way to construct evaluation settings that are simultaneously realistic, diverse, controllable, inspectable, and reproducible. We introduce ShopGym, an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents. ShopGym is a framework for constructing e-commerce simulation environments and grounded benchmark tasks. Its simulation layer, ShopArena, converts live seed storefronts into self-contained sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these simulated storefronts, ShopGuru synthesizes benchmark tasks across seven skill categories, grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances. Together, ShopArena and ShopGuru produce self-contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent-evaluation signals relevant to shopping tasks. We validate the framework through graph-based structural analysis and agent-based behavioral evaluation with 224 generated tasks across six sandbox shops: three constructed with synthetic data and three with real data. Our results show that the synthetic shops preserve key structural properties of live storefronts, with agent performance on synthetic shops positively correlated with performance on live storefronts.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: ShopGym

1. Core Contribution

ShopGym addresses a well-identified gap in e-commerce web agent evaluation: the tradeoff between realistic but non-stationary live storefronts and controllable but narrow sandbox benchmarks. The framework introduces two complementary components: ShopArena, which converts live seed storefronts into anonymized, self-contained sandbox shops through a multi-agent exploration-specification-generation pipeline; and ShopGuru, which synthesizes grounded benchmark tasks across seven skill categories. The key architectural insight is the introduction of an intermediate anonymized specification document that decouples storefront exploration from sandbox generation, enabling independent iteration on either phase and providing a human-editable control surface.

The contribution is primarily methodological and systems-oriented rather than algorithmic. It does not propose new agent architectures or learning algorithms but instead provides infrastructure for more rigorous evaluation. The multi-seed composition capability—where structural signals from multiple live storefronts can be merged into a single specification—is a genuinely novel mechanism for scaling diversity within generated environments.

2. Methodological Rigor

The validation has both structural and behavioral components, but both are relatively thin:

Structural validation compares 7 real shops against 3 synthetic shops using accessibility tree depth, interaction element counts, and state-transition graph statistics. While the metrics are reasonable proxies, the sample sizes are small, and the comparison is unpaired—the synthetic shops are not matched to specific real shops for structural comparison. The paper acknowledges that synthetic shops have fewer edges and lower out-degree in the transition graph, attributing this to intentional exclusion of external links and auxiliary pages—a reasonable explanation but one that slightly undermines the "structural alignment" claim.
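The transition-graph statistics discussed above can be illustrated with a minimal sketch. The two toy graphs, their page names, and the `graph_stats` helper are hypothetical and are not taken from the paper; they only show how dropping external links and auxiliary pages lowers both the edge count and the mean out-degree.

```python
def graph_stats(transitions):
    """Summarize a state-transition graph given as {state: [next_states]}."""
    # Collect every state, including ones that only appear as targets.
    nodes = set(transitions)
    for targets in transitions.values():
        nodes.update(targets)
    # Out-degree counts distinct outgoing transitions per state.
    out_degrees = [len(set(transitions.get(n, []))) for n in nodes]
    edges = sum(out_degrees)
    return {
        "nodes": len(nodes),
        "edges": edges,
        "mean_out_degree": edges / len(nodes),
    }

# Hypothetical live storefront: includes a blog page and an external link.
live = {
    "home": ["collection", "product", "blog", "external"],
    "collection": ["product", "home", "blog"],
    "product": ["cart", "home", "external"],
    "cart": ["home"],
    "blog": ["home"],
}

# Hypothetical sandbox: same core flow, auxiliary and external pages removed.
sandbox = {
    "home": ["collection", "product"],
    "collection": ["product", "home"],
    "product": ["cart", "home"],
    "cart": ["home"],
}

live_stats = graph_stats(live)
sandbox_stats = graph_stats(sandbox)
print(live_stats)
print(sandbox_stats)
```

In this toy case the sandbox keeps the core shopping flow but reports fewer edges and a lower mean out-degree, which is the pattern the review describes. A paired comparison would run the same function on a synthetic shop and on the specific live shop it was derived from.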

Behavioral validation uses "twin shops" (synthetic shops built from real product data with visual verification) and compares agent success rates across three frontier models (GPT-5-mini, Gemini 3 Flash, GPT-5) using two evaluation harnesses. The results show "positive correlation" between performance on real and twin shops, but this is demonstrated visually through bar charts of only 3 model points per condition rather than through formal correlation analysis. With only 3 data points, any monotonic relationship would appear correlated. The paper claims positive correlation but provides no correlation coefficients, confidence intervals, or statistical tests.
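The concern about three data points can be made concrete with an exact permutation test. The sketch below uses hypothetical success rates (not values from the paper) and a hand-written Spearman correlation that assumes no ties. With three points there are only 3! = 6 possible orderings, so even a perfectly monotonic relationship cannot yield a one-sided p-value below 1/6.

```python
from itertools import permutations

def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)

    def ranks(values):
        ordered = sorted(values)
        return [ordered.index(v) for v in values]

    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def exact_p_value(x, y):
    """One-sided exact p-value: share of orderings of y with rho >= observed."""
    observed = spearman_rho(x, y)
    perms = list(permutations(y))
    hits = sum(spearman_rho(x, list(p)) >= observed - 1e-12 for p in perms)
    return hits / len(perms)

# Hypothetical success rates for three models (illustrative only).
live = [0.40, 0.55, 0.70]   # live storefronts
twin = [0.35, 0.50, 0.72]   # twin sandbox shops

rho = spearman_rho(live, twin)
p = exact_p_value(live, twin)
print(f"rho={rho:.2f}, p={p:.3f}")
```

Here the rank correlation is perfect, yet the exact p-value is 1/6, well above any conventional significance threshold. Reporting per-task or per-shop agreement, rather than three aggregate model points, would give the correlation claim enough data to test.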

The evaluation uses 224 generated tasks across 6 sandbox shops, which is a reasonable but not large-scale demonstration. The use of GPT-5 as an LLM judge introduces potential evaluation noise, though the paper provides some safeguards (hard URL gates, forced failure on timeout).
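The two safeguards named above can be read as deterministic gates applied before the LLM judge's verdict is trusted. The following is a minimal sketch under that reading; the `Episode` fields, the `score` function, and the substring-based URL check are assumptions made for illustration, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    final_url: str            # URL the agent ended on
    timed_out: bool           # whether the run exceeded its time budget
    judge_says_success: bool  # verdict from an LLM judge (assumed boolean)

def score(episode: Episode, required_url_substring: str) -> bool:
    """Return True only if the episode passes both gates and the judge agrees."""
    # Forced failure: a timed-out run never counts as a success.
    if episode.timed_out:
        return False
    # Hard URL gate: the agent must end on a page matching the expected URL.
    if required_url_substring not in episode.final_url:
        return False
    # Only now is the (potentially noisy) judge verdict consulted.
    return episode.judge_says_success

# Hypothetical episodes for a "reach the cart" task.
ok = Episode("https://shop.example/cart", timed_out=False, judge_says_success=True)
wrong_page = Episode("https://shop.example/blog", timed_out=False, judge_says_success=True)
too_slow = Episode("https://shop.example/cart", timed_out=True, judge_says_success=True)

print(score(ok, "/cart"), score(wrong_page, "/cart"), score(too_slow, "/cart"))
```

Gates of this kind can only turn a judged success into a failure, so they bound false positives from the judge but do nothing about false negatives.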

3. Potential Impact

The framework addresses a genuine practical need. As web agents mature toward deployment, the field urgently needs better evaluation infrastructure. ShopGym's design allows:

Scalable benchmark creation: New shops can be generated from any live storefront

Reproducible comparison: Sandbox shops are resettable and stable

Training environments: Shops could serve as RL training environments (mentioned but not demonstrated)

Controlled ablation: The specification layer enables systematic modifications

The framework is built on a commercial platform (Shopify) and uses proprietary models (Claude Opus 4.6, GPT-5), which limits immediate reproducibility for the broader academic community. The reliance on expensive frontier model APIs for shop generation is a scalability concern not fully addressed.

The e-commerce focus is both a strength (deep domain expertise) and a limitation (narrow applicability). The methodology could inspire similar approaches in other web domains, but the current implementation is tightly coupled to e-commerce patterns.

4. Timeliness & Relevance

The paper is highly timely. Web agent research has exploded in 2024-2026, and the evaluation methodology bottleneck is widely recognized. The reference list includes many 2025-2026 papers, indicating active engagement with cutting-edge work. The concurrent work on WebForge, VeriEnv, and WebArena-Infinity addresses similar concerns, suggesting this is a recognized community need. ShopGym distinguishes itself through its grounding in real storefronts and the specification-mediated generation approach.

5. Strengths & Limitations

Key Strengths:

Well-motivated problem framing: The realism-control tradeoff is clearly articulated and the specification-mediated solution is elegant

End-to-end system design: The exploration→specification→generation→task synthesis pipeline is well-structured with clear interfaces between components

Practical safeguards: Anonymization by construction, validator-driven polish loops, and multi-round verification demonstrate engineering maturity

Comprehensive documentation: The appendices (specifications, prompts, examples) provide unusual transparency into the generation process

Multi-seed composition: The ability to blend structural properties from multiple storefronts is a genuinely useful capability

Notable Limitations:

Weak statistical validation: The behavioral correlation claim rests on 3 data points without formal statistical analysis. This is the paper's most significant scientific weakness

Reproducibility concerns: Heavy reliance on proprietary, expensive frontier models (Claude Opus 4.6, GPT-5) for the generation pipeline limits community adoption

Limited scope of generated shops: Only 6 shops across 3 domains are demonstrated. Claims about scalability are architectural rather than empirically demonstrated at scale

No agent training demonstration: Despite mentioning RL training as a use case, no training experiments are conducted

Evaluation narrowness: Only success rate is reported. No analysis of failure modes, per-skill-category breakdowns (only short vs. long horizon), or qualitative analysis of where synthetic shops diverge from real ones

Cost analysis absent: No discussion of the computational/API cost of generating a single sandbox shop, which is critical for the scalability claim

Limited comparison to concurrent work: WebForge and VeriEnv are cited but not empirically compared

Additional Observations

The paper reads more as a system paper or benchmark paper than a scientific contribution with novel insights. The technical novelty lies in engineering integration rather than in new algorithms or theoretical understanding. The multi-agent pipeline design, while practical, relies entirely on prompt engineering over frontier models—the approach may not generalize well as models change.

The framework's value will ultimately be determined by community adoption and whether the generated environments are sufficiently realistic for meaningful agent development. The current validation, while directionally positive, is insufficient to establish this conclusively.

Rating: 5.8 / 10

Significance 6.5 · Rigor 4.5 · Novelty 5.5 · Clarity 7.5

Generated May 18, 2026

Comparison History (26)

vs. SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

gemini-3.1 · 5/19/2026

Paper 1 addresses a critical bottleneck in deploying LLMs for scientific discovery: refining ill-posed requests into actionable tasks. By focusing on computational science domains like fluid mechanics and materials science, it provides a foundation for AI assistants that can accelerate actual research. While Paper 2 offers a valuable framework for commercial e-commerce agents, Paper 1's direct contribution to enhancing scientific methodology and its potential to catalyze cross-disciplinary scientific breakthroughs give it a broader and more profound scientific impact.

vs. Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

gemini-3.1 · 5/19/2026

Paper 1 tackles the fundamental debate of generalization versus memorization in LLMs, a critical issue for the entire AI community. By using chess as a controlled domain to demonstrate pattern-matching over true rule-understanding, and proposing a cost-effective verifier framework, its theoretical and practical insights offer broader impact across AI fields compared to Paper 2's domain-specific e-commerce benchmark.

vs. Skim: Speculative Execution for Fast and Efficient Web Agents

gpt-5.2 · 5/19/2026

Paper 2 (ShopGym) likely has higher scientific impact because it addresses a core field-wide bottleneck—reproducible, scalable, and realistic evaluation for e-commerce web agents—via an integrated framework (environment generation + task synthesis) that can become shared infrastructure. This enables broader, longer-term benchmarking across methods and supports rigorous, controllable comparisons, with validation linking synthetic to live-store performance. Paper 1 (Skim) is a strong systems optimization with clear practical gains, but its impact is narrower (site-specific templating/speculation) and less foundational than a widely reusable benchmarking ecosystem.

vs. DARE-EEG: A Foundation Model for Mining Dual-Aligned Representation of EEG

gpt-5.2 · 5/19/2026

Paper 2 (ShopGym) likely has higher scientific impact due to broader, more immediate applicability and field-level methodological contribution: it enables realistic, reproducible, scalable evaluation for web agents—an urgent bottleneck in LLM/agent research. The framework can standardize benchmarking across academia/industry and generalizes beyond e-commerce to other web-task domains. Its emphasis on controllability, inspectability, and correlation with live-storefront performance strengthens rigor and relevance. Paper 1 is novel and valuable for EEG foundation modeling, but its impact is narrower to neuro/BCI and depends on data availability and clinical translation timelines.

vs. Scalable Environments Drive Generalizable Agents

gemini-3.1 · 5/19/2026

Paper 2 addresses a fundamental challenge in AI—agent generalization—by proposing a paradigm shift toward environment scaling. While Paper 1 offers a highly useful but domain-specific benchmarking tool for e-commerce, Paper 2 provides a conceptual taxonomy and roadmap applicable across all reinforcement learning and autonomous agent research. Its broader scope and potential to shape future research directions across multiple subfields give it a higher potential for widespread scientific impact.

vs. Dynamics of collective creativity in AI art competitions

claude-opus-4.6 · 5/19/2026

Paper 2 addresses fundamental questions about collective creativity, cultural evolution, and human-AI co-creation using a large-scale empirical dataset. Its findings about attractor dynamics, the paradox between preference and novelty, and how group size affects creative output have broad implications across cognitive science, cultural evolution, AI, and social science. Paper 1, while technically solid, addresses a narrower engineering problem (benchmarking e-commerce web agents) with impact largely limited to the AI agents community. Paper 2's interdisciplinary relevance and novel empirical insights give it higher potential impact.

vs. Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact because it delivers a broadly usable, scalable, and reproducible evaluation framework (simulation + task generation) for web agents, addressing a major methodological bottleneck with clear real-world relevance to e-commerce automation and agent benchmarking. Its artifacts (ShopArena/ShopGuru, tasks, validation analyses) can become community infrastructure, enabling standardized comparisons across models and labs. Paper 1 is novel and timely for embodied ToM in MLLMs, but appears more niche and potentially more sensitive to prompt/CoT-driven gains, with narrower immediate applicability than an evaluation platform that can be widely adopted.

vs. LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

gemini-3.1 · 5/19/2026

Paper 2 provides a foundational benchmarking and simulation framework for the rapidly growing field of web agents. By solving critical issues of reproducibility, scalability, and control in e-commerce agent evaluation, it is likely to become a standard testbed that drives future research and standardizes evaluation, often resulting in higher broad impact and citation counts than specific algorithmic improvements like those in Paper 1.

vs. Evaluating Cognitive Age Alignment in Interactive AI Agents

gpt-5.2 · 5/19/2026

Paper 1 likely has higher impact due to a concrete, scalable methodology for building realistic yet reproducible e-commerce agent environments, addressing a major evaluation bottleneck with clear real-world applicability (shopping/web automation) and measurable validation (structural analyses, task generation, correlation with live-store performance). Its artifacts (simulated shops + grounded tasks) can become shared infrastructure for benchmarking and progress tracking. Paper 2 is novel and timely by importing psychometrics to interactive agent evaluation, but may face higher construct-validity challenges and narrower immediate deployment pathways compared with ShopGym’s direct engineering and benchmarking utility.

vs. Imperfect World Models are Exploitable

claude-opus-4.6 · 5/19/2026

Paper 2 addresses a fundamental theoretical question in reinforcement learning—whether imperfect world models are inherently exploitable—and establishes formal connections between reward hacking and model exploitation. This has broad implications across all RL applications (robotics, game playing, autonomous systems, RLHF), not just e-commerce. The theoretical contributions (inevitability proofs, safe horizon bounds) provide foundational knowledge that will influence future work on safe AI planning. Paper 1, while practically useful, is a domain-specific benchmarking framework with narrower scope and incremental methodological contribution.

vs. Going Headless? On the Boundaries of Vertical AI Firms

claude-opus-4.6 · 5/19/2026

Paper 1 presents a concrete, novel technical framework (ShopGym) addressing a well-defined methodological gap in e-commerce web agent evaluation with reproducible benchmarks, empirical validation, and immediate utility for the growing AI agent research community. Paper 2 offers a strategic/theoretical analysis of vertical AI firm boundaries using established economic frameworks (Coase, Teece), which, while timely and insightful for practitioners, introduces fewer novel scientific constructs and is more of a business strategy essay than a research contribution with testable, generalizable methodology. Paper 1's methodological rigor and direct applicability to an active research area give it higher scientific impact potential.

vs. PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

gpt-5.2 · 5/19/2026

Paper 2 is likely higher impact: it introduces a novel, generally applicable training signal (dense step-level rewards) for multi-turn agent optimization without external judges or ground-truth, directly addressing a key bottleneck in RLHF-style methods (credit assignment) with low inference cost. The approach is methodologically grounded (diagnosis of prefix contamination, then a two-stage corrective model) and broadly relevant across LLM agent training, alignment, and RL. Paper 1 is valuable infrastructure for web-agent evaluation, but its impact is more domain-specific (e-commerce) and benchmark-focused rather than a widely reusable learning method.

vs. ScreenSearch: Uncertainty-Aware OS Exploration

gemini-3.1 · 5/18/2026

Paper 1 introduces a foundational benchmarking and simulation framework (ShopGym) for e-commerce web agents. In AI research, comprehensive benchmarks and simulation environments (like OpenAI Gym) historically have exceptionally high scientific impact as they become the standard testbeds for evaluating new algorithms. While Paper 2 presents a strong exploration method for OS agents, Paper 1 solves a critical methodological bottleneck (the tradeoff between realism and reproducibility) that will likely anchor future research and attract broad citations across the web agent community.

vs. Position: Artificial Intelligence Needs Meta Intelligence -- the Case for Metacognitive AI

claude-opus-4.6 · 5/18/2026

Paper 2 proposes metacognition as a general design principle for AI, which has broader cross-disciplinary impact spanning cognitive science, AI safety, resource efficiency, and federated learning. Its conceptual framework could influence diverse AI subfields and offers a novel paradigm shift. Paper 1, while methodologically solid, addresses a narrower problem (e-commerce web agent benchmarking) with more limited audience and applicability. Paper 2's breadth of potential impact, timeliness given growing concerns about AI reliability and efficiency, and its bridging of cognitive science with practical AI design give it higher estimated scientific impact.

vs. Human-Inspired Memory Architecture for LLM Agents

gpt-5.2 · 5/18/2026

Paper 1 likely has higher impact due to offering a broadly enabling, reproducible evaluation infrastructure for web/e-commerce agents—addressing a core bottleneck (realism vs. controllability) with clear methodological rigor (generation pipeline, structural/behavioral validation, correlation to live storefront performance). Its artifacts can standardize benchmarking across labs and accelerate progress in agent research, HCI, and applied AI, with immediate real-world relevance. Paper 2 is timely and useful for agent memory, but its mechanisms are partly conceptual/engineering integrations and may be more sensitive to implementation choices and benchmark-specific effects.

vs. CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning

gpt-5.2 · 5/18/2026

Paper 2 (ShopGym) has higher likely scientific impact because it addresses a major evaluation bottleneck for web agents with a reusable, scalable methodology enabling realistic yet reproducible benchmarks. Its artifacts (simulated storefronts + grounded task generator) can become shared infrastructure, broadly benefiting research in agents, HCI, benchmarking, and e-commerce automation, with clear real-world relevance and timeliness as web agents surge. Paper 1 (CAPS) is a strong, practical inference-time efficiency improvement for pairwise self-verification, but it is more incremental and narrower in scope, primarily impacting LLM test-time scaling workflows.

vs. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

claude-opus-4.6 · 5/18/2026

Skill1 presents a novel unified framework for co-evolving skill selection, utilization, and distillation in language model agents via a single reinforcement learning objective, addressing a fundamental challenge in agent learning. Its methodological contribution—decomposing a single reward signal into low-frequency and high-frequency components for credit assignment—is innovative and broadly applicable beyond the tested domains. While ShopGym provides a valuable benchmarking infrastructure for e-commerce agents, it is more domain-specific and incremental (simulation/benchmark construction), whereas Skill1 introduces a transferable algorithmic paradigm with demonstrated improvements over multiple baselines and rigorous ablation studies confirming the necessity of each component.

vs. Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

gemini-3.1 · 5/18/2026

Paper 1 addresses a critical bottleneck in LLM reasoning: exploration efficiency in Reinforcement Learning with Verifiable Rewards (RLVR). Given the massive current interest in scaling reasoning via RL (e.g., DeepSeek-R1, OpenAI o1), outperforming GRPO with smaller rollout budgets offers highly timely and broadly applicable advancements. Paper 2 presents a valuable but domain-specific benchmark for e-commerce web agents, which has a narrower scope of impact compared to foundational improvements in LLM reasoning capabilities.

vs. AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

claude-opus-4.6 · 5/18/2026

AGPO addresses a fundamental limitation of RLVR methods—reasoning boundary shrinkage—with a novel asymmetric reinforcement strategy that has both theoretical depth and demonstrated practical impact. It achieves state-of-the-art on mathematical benchmarks and shows real-world industrial deployment at JD for search ads relevance. ShopGym contributes a useful benchmarking framework for e-commerce agents but is more narrowly scoped as infrastructure. AGPO's insights about maintaining exploration capacity while suppressing incorrect paths have broader implications for LLM training methodology across many domains.

vs. Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

gemini-3.1 · 5/18/2026

Paper 2 addresses a highly timely and critical issue—the efficacy of LLMs as educational tutors. By demonstrating fundamental flaws in how LLMs handle suboptimal and incorrect student solutions, it provides crucial insights that impact the rapidly growing intersection of AI and EdTech. Paper 1 offers a valuable framework for web agents in e-commerce, but its scope is more niche compared to the broader societal and cross-disciplinary implications of evaluating and improving AI-driven educational tools.