ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents
Chinmay Savadikar, Mingyu Zhao, Yuanzheng Zhu, Han Li, Shuang Xie, Alberto Castelo, Tianfu Wu, Lingyun Wang
Abstract
Developing and evaluating e-commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison. Existing methodologies force a tradeoff: live storefronts provide realism but are non-stationary, difficult to inspect, and irreproducible, while hand-built sandbox benchmarks provide control but cover only a narrow range of layouts, catalogs, policies, and interaction patterns. We argue that the core bottleneck is methodological: the field lacks a scalable way to construct evaluation settings that are simultaneously realistic, diverse, controllable, inspectable, and reproducible. We introduce ShopGym, an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents. ShopGym is a framework for constructing e-commerce simulation environments and grounded benchmark tasks. Its simulation layer, ShopArena, converts live seed storefronts into self-contained sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these simulated storefronts, ShopGuru synthesizes benchmark tasks across seven skill categories, grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances. Together, ShopArena and ShopGuru produce self-contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent-evaluation signals relevant to shopping tasks. We validate the framework through graph-based structural analysis and agent-based behavioral evaluation with 224 generated tasks across six sandbox shops: three constructed with synthetic data and three with real data. Our results show that the synthetic shops preserve key structural properties of live storefronts, with agent performance on synthetic shops positively correlated with performance on live storefronts.
AI Impact Assessments
Scientific Impact Assessment: ShopGym
1. Core Contribution
ShopGym addresses a well-identified gap in e-commerce web agent evaluation: the tradeoff between realistic but non-stationary live storefronts and controllable but narrow sandbox benchmarks. The framework introduces two complementary components: ShopArena, which converts live seed storefronts into anonymized, self-contained sandbox shops through a multi-agent exploration-specification-generation pipeline; and ShopGuru, which synthesizes grounded benchmark tasks across seven skill categories. The key architectural insight is the introduction of an intermediate anonymized specification document that decouples storefront exploration from sandbox generation, enabling independent iteration on either phase and providing a human-editable control surface.
The contribution is primarily methodological and systems-oriented rather than algorithmic. It does not propose new agent architectures or learning algorithms but instead provides infrastructure for more rigorous evaluation. The multi-seed composition capability—where structural signals from multiple live storefronts can be merged into a single specification—is a genuinely novel mechanism for scaling diversity within generated environments.
2. Methodological Rigor
The validation has both structural and behavioral components, but both are relatively thin:
Structural validation compares 7 real shops against 3 synthetic shops using accessibility tree depth, interaction element counts, and state-transition graph statistics. While the metrics are reasonable proxies, the sample sizes are small, and the comparison is unpaired—the synthetic shops are not matched to specific real shops for structural comparison. The paper acknowledges that synthetic shops have fewer edges and lower out-degree in the transition graph, attributing this to intentional exclusion of external links and auxiliary pages—a reasonable explanation but one that slightly undermines the "structural alignment" claim.
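The transition-graph comparison described above can be illustrated with a minimal sketch. The two toy graphs, the page names, and the `graph_stats` helper are all hypothetical and are not taken from the paper; they only show how removing auxiliary pages (here `blog` and `about`) lowers both the edge count and the mean out-degree.

```python
def graph_stats(adjacency):
    """Return node count, edge count, and mean out-degree of a directed graph.

    `adjacency` maps each page (state) to the pages reachable in one action.
    """
    nodes = set(adjacency)
    for targets in adjacency.values():
        nodes.update(targets)
    # Count each distinct (source, target) pair once.
    edge_count = sum(len(set(targets)) for targets in adjacency.values())
    mean_out_degree = edge_count / len(nodes) if nodes else 0.0
    return {"nodes": len(nodes), "edges": edge_count, "mean_out_degree": mean_out_degree}


# Hypothetical live storefront: includes auxiliary pages (blog, about).
live_shop = {
    "home": ["collection", "product", "cart", "blog", "about"],
    "collection": ["product", "home", "cart", "blog"],
    "product": ["cart", "home", "collection", "about"],
    "cart": ["checkout", "home"],
    "blog": ["home", "collection", "product", "about"],
    "about": ["home", "blog", "collection"],
    "checkout": [],
}

# Hypothetical sandbox: same core flow, auxiliary pages excluded.
sandbox_shop = {
    "home": ["collection", "product", "cart"],
    "collection": ["product", "home", "cart"],
    "product": ["cart", "home", "collection"],
    "cart": ["checkout", "home"],
    "checkout": [],
}

live_stats = graph_stats(live_shop)
sandbox_stats = graph_stats(sandbox_shop)
print("live:   ", live_stats)
print("sandbox:", sandbox_stats)
```

In this toy case the sandbox keeps the full shopping path (home, collection, product, cart, checkout) while having half the edges, which is why aggregate graph statistics alone say little about whether task-relevant structure was preserved.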
Behavioral validation uses "twin shops" (synthetic shops built from real product data with visual verification) and compares agent success rates across three frontier models (GPT-5-mini, Gemini 3 Flash, GPT-5) using two evaluation harnesses. The results show "positive correlation" between performance on real and twin shops, but this is demonstrated visually through bar charts of only 3 model points per condition rather than through formal correlation analysis. With only 3 data points, a rank correlation can take just four distinct values, and a matching model ordering arises by chance one time in six, so an apparent positive trend is weak evidence. The paper claims positive correlation but provides no correlation coefficients, confidence intervals, or statistical tests.
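To make the small-sample concern concrete, the sketch below enumerates every possible ranking of three models and computes the Spearman rank correlation for each. It uses no data from the paper; it assumes no ties and a null hypothesis in which the ordering on twin shops is unrelated to the ordering on real shops. The function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations


def spearman_rho(ranks_x, ranks_y):
    """Spearman rank correlation for two tie-free rank vectors."""
    n = len(ranks_x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    return 1 - Fraction(6 * d_squared, n * (n * n - 1))


def null_distribution(n):
    """Exact distribution of rho when every ordering of n items is equally likely."""
    base = tuple(range(1, n + 1))
    counts = {}
    for perm in permutations(base):
        rho = spearman_rho(base, perm)
        counts[rho] = counts.get(rho, 0) + 1
    total = sum(counts.values())
    return {rho: Fraction(c, total) for rho, c in sorted(counts.items())}


dist = null_distribution(3)
p_perfect = dist[Fraction(1)]
p_positive = sum(p for rho, p in dist.items() if rho > 0)

for rho, p in dist.items():
    print(f"rho = {float(rho):+.1f}  probability = {p}")
print(f"P(rho = 1) = {p_perfect}")
print(f"P(rho > 0) = {p_positive}")
```

With three models, a perfectly matching order has probability 1/6 under this null and some positive correlation has probability 1/2, so neither outcome can reach a conventional 0.05 threshold. More models or per-task paired analysis would be needed for a formal test.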
The evaluation uses 224 generated tasks across 6 sandbox shops, which is a reasonable but not large-scale demonstration. The use of GPT-5 as an LLM judge introduces potential evaluation noise, though the paper provides some safeguards (hard URL gates, forced failure on timeout).
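One way such safeguards can bound judge noise is to apply deterministic checks before the judge is ever consulted. The sketch below is an assumption about how a URL gate and a timeout rule might be combined, not the paper's implementation; `Episode`, `score_episode`, `lenient_judge`, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Episode:
    """Hypothetical record of one agent run; field names are illustrative."""
    final_url: str
    timed_out: bool
    transcript: str


def score_episode(
    episode: Episode,
    judge: Callable[[str], bool],
    required_url_prefix: Optional[str] = None,
) -> bool:
    """Apply deterministic gates first, then defer to the judge."""
    # Forced failure: a timed-out run never reaches the judge.
    if episode.timed_out:
        return False
    # Hard URL gate: the run must end on the expected page, if one is specified.
    if required_url_prefix is not None and not episode.final_url.startswith(required_url_prefix):
        return False
    # Only runs that pass every gate are scored by the (possibly noisy) judge.
    return bool(judge(episode.transcript))


def lenient_judge(transcript: str) -> bool:
    """Stand-in for an LLM judge; it accepts everything to show the gates at work."""
    return True


ok = Episode("https://shop.example/cart", timed_out=False, transcript="added item")
wrong_page = Episode("https://shop.example/blog", timed_out=False, transcript="added item")
slow = Episode("https://shop.example/cart", timed_out=True, transcript="added item")

results = [
    score_episode(e, lenient_judge, required_url_prefix="https://shop.example/cart")
    for e in (ok, wrong_page, slow)
]
print(results)  # [True, False, False]
```

Even with a judge that accepts every transcript, the gates reject the wrong-page and timed-out runs, so judge errors can only affect runs that already satisfy the hard constraints.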
3. Potential Impact
The framework addresses a genuine practical need. As web agents mature toward deployment, the field urgently needs better evaluation infrastructure. ShopGym's design allows:
The framework is built on a commercial platform (Shopify) and uses proprietary models (Claude Opus 4.6, GPT-5), which limits immediate reproducibility for the broader academic community. The reliance on expensive frontier model APIs for shop generation is a scalability concern not fully addressed.
The e-commerce focus is both a strength (deep domain expertise) and a limitation (narrow applicability). The methodology could inspire similar approaches in other web domains, but the current implementation is tightly coupled to e-commerce patterns.
4. Timeliness & Relevance
The paper is highly timely. Web agent research has exploded in 2024-2026, and the evaluation methodology bottleneck is widely recognized. The reference list includes many 2025-2026 papers, indicating active engagement with cutting-edge work. The concurrent work on WebForge, VeriEnv, and WebArena-Infinity addresses similar concerns, suggesting this is a recognized community need. ShopGym distinguishes itself through its grounding in real storefronts and the specification-mediated generation approach.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The paper reads more as a system paper or benchmark paper than a scientific contribution with novel insights. The technical novelty lies in engineering integration rather than in new algorithms or theoretical understanding. The multi-agent pipeline design, while practical, relies entirely on prompt engineering over frontier models—the approach may not generalize well as models change.
The framework's value will ultimately be determined by community adoption and whether the generated environments are sufficiently realistic for meaningful agent development. The current validation, while directionally positive, is insufficient to establish this conclusively.
Generated May 18, 2026
Comparison History (26)
Paper 1 addresses a critical bottleneck in deploying LLMs for scientific discovery: refining ill-posed requests into actionable tasks. By focusing on computational science domains like fluid mechanics and materials science, it provides a foundation for AI assistants that can accelerate actual research. While Paper 2 offers a valuable framework for commercial e-commerce agents, Paper 1's direct contribution to enhancing scientific methodology and its potential to catalyze cross-disciplinary scientific breakthroughs give it a broader and more profound scientific impact.
Paper 1 tackles the fundamental debate of generalization versus memorization in LLMs, a critical issue for the entire AI community. By using chess as a controlled domain to demonstrate pattern-matching over true rule-understanding, and proposing a cost-effective verifier framework, its theoretical and practical insights offer broader impact across AI fields compared to Paper 2's domain-specific e-commerce benchmark.
Paper 2 (ShopGym) likely has higher scientific impact because it addresses a core field-wide bottleneck—reproducible, scalable, and realistic evaluation for e-commerce web agents—via an integrated framework (environment generation + task synthesis) that can become shared infrastructure. This enables broader, longer-term benchmarking across methods and supports rigorous, controllable comparisons, with validation linking synthetic to live-store performance. Paper 1 (Skim) is a strong systems optimization with clear practical gains, but its impact is narrower (site-specific templating/speculation) and less foundational than a widely reusable benchmarking ecosystem.
Paper 2 (ShopGym) likely has higher scientific impact due to broader, more immediate applicability and field-level methodological contribution: it enables realistic, reproducible, scalable evaluation for web agents—an urgent bottleneck in LLM/agent research. The framework can standardize benchmarking across academia/industry and generalizes beyond e-commerce to other web-task domains. Its emphasis on controllability, inspectability, and correlation with live-storefront performance strengthens rigor and relevance. Paper 1 is novel and valuable for EEG foundation modeling, but its impact is narrower to neuro/BCI and depends on data availability and clinical translation timelines.
Paper 2 addresses a fundamental challenge in AI—agent generalization—by proposing a paradigm shift toward environment scaling. While Paper 1 offers a highly useful but domain-specific benchmarking tool for e-commerce, Paper 2 provides a conceptual taxonomy and roadmap applicable across all reinforcement learning and autonomous agent research. Its broader scope and potential to shape future research directions across multiple subfields give it a higher potential for widespread scientific impact.
Paper 2 addresses fundamental questions about collective creativity, cultural evolution, and human-AI co-creation using a large-scale empirical dataset. Its findings about attractor dynamics, the paradox between preference and novelty, and how group size affects creative output have broad implications across cognitive science, cultural evolution, AI, and social science. Paper 1, while technically solid, addresses a narrower engineering problem (benchmarking e-commerce web agents) with impact largely limited to the AI agents community. Paper 2's interdisciplinary relevance and novel empirical insights give it higher potential impact.
Paper 2 likely has higher impact because it delivers a broadly usable, scalable, and reproducible evaluation framework (simulation + task generation) for web agents, addressing a major methodological bottleneck with clear real-world relevance to e-commerce automation and agent benchmarking. Its artifacts (ShopArena/ShopGuru, tasks, validation analyses) can become community infrastructure, enabling standardized comparisons across models and labs. Paper 1 is novel and timely for embodied ToM in MLLMs, but appears more niche and potentially more sensitive to prompt/CoT-driven gains, with narrower immediate applicability than an evaluation platform that can be widely adopted.
Paper 2 provides a foundational benchmarking and simulation framework for the rapidly growing field of web agents. By solving critical issues of reproducibility, scalability, and control in e-commerce agent evaluation, it is likely to become a standard testbed that drives future research and standardizes evaluation. Such testbeds often achieve broader impact and higher citation counts than specific algorithmic improvements like those in Paper 1.
Paper 1 likely has higher impact due to a concrete, scalable methodology for building realistic yet reproducible e-commerce agent environments, addressing a major evaluation bottleneck with clear real-world applicability (shopping/web automation) and measurable validation (structural analyses, task generation, correlation with live-store performance). Its artifacts (simulated shops + grounded tasks) can become shared infrastructure for benchmarking and progress tracking. Paper 2 is novel and timely by importing psychometrics to interactive agent evaluation, but may face higher construct-validity challenges and narrower immediate deployment pathways compared with ShopGym’s direct engineering and benchmarking utility.
Paper 2 addresses a fundamental theoretical question in reinforcement learning—whether imperfect world models are inherently exploitable—and establishes formal connections between reward hacking and model exploitation. This has broad implications across all RL applications (robotics, game playing, autonomous systems, RLHF), not just e-commerce. The theoretical contributions (inevitability proofs, safe horizon bounds) provide foundational knowledge that will influence future work on safe AI planning. Paper 1, while practically useful, is a domain-specific benchmarking framework with narrower scope and incremental methodological contribution.
Paper 1 presents a concrete, novel technical framework (ShopGym) addressing a well-defined methodological gap in e-commerce web agent evaluation with reproducible benchmarks, empirical validation, and immediate utility for the growing AI agent research community. Paper 2 offers a strategic/theoretical analysis of vertical AI firm boundaries using established economic frameworks (Coase, Teece), which, while timely and insightful for practitioners, introduces fewer novel scientific constructs and is more of a business strategy essay than a research contribution with testable, generalizable methodology. Paper 1's methodological rigor and direct applicability to an active research area give it higher scientific impact potential.
Paper 2 is likely higher impact: it introduces a novel, generally applicable training signal (dense step-level rewards) for multi-turn agent optimization without external judges or ground-truth, directly addressing a key bottleneck in RLHF-style methods (credit assignment) with low inference cost. The approach is methodologically grounded (diagnosis of prefix contamination, then a two-stage corrective model) and broadly relevant across LLM agent training, alignment, and RL. Paper 1 is valuable infrastructure for web-agent evaluation, but its impact is more domain-specific (e-commerce) and benchmark-focused rather than a widely reusable learning method.
Paper 1 introduces a foundational benchmarking and simulation framework (ShopGym) for e-commerce web agents. In AI research, comprehensive benchmarks and simulation environments (like OpenAI Gym) historically have exceptionally high scientific impact as they become the standard testbeds for evaluating new algorithms. While Paper 2 presents a strong exploration method for OS agents, Paper 1 solves a critical methodological bottleneck (the tradeoff between realism and reproducibility) that will likely anchor future research and attract broad citations across the web agent community.
Paper 2 proposes metacognition as a general design principle for AI, which has broader cross-disciplinary impact spanning cognitive science, AI safety, resource efficiency, and federated learning. Its conceptual framework could influence diverse AI subfields and offers a novel paradigm shift. Paper 1, while methodologically solid, addresses a narrower problem (e-commerce web agent benchmarking) with more limited audience and applicability. Paper 2's breadth of potential impact, timeliness given growing concerns about AI reliability and efficiency, and its bridging of cognitive science with practical AI design give it higher estimated scientific impact.
Paper 1 likely has higher impact due to offering a broadly enabling, reproducible evaluation infrastructure for web/e-commerce agents—addressing a core bottleneck (realism vs. controllability) with clear methodological rigor (generation pipeline, structural/behavioral validation, correlation to live storefront performance). Its artifacts can standardize benchmarking across labs and accelerate progress in agent research, HCI, and applied AI, with immediate real-world relevance. Paper 2 is timely and useful for agent memory, but its mechanisms are partly conceptual/engineering integrations and may be more sensitive to implementation choices and benchmark-specific effects.
Paper 2 (ShopGym) has higher likely scientific impact because it addresses a major evaluation bottleneck for web agents with a reusable, scalable methodology enabling realistic yet reproducible benchmarks. Its artifacts (simulated storefronts + grounded task generator) can become shared infrastructure, broadly benefiting research in agents, HCI, benchmarking, and e-commerce automation, with clear real-world relevance and timeliness as web agents surge. Paper 1 (CAPS) is a strong, practical inference-time efficiency improvement for pairwise self-verification, but it is more incremental and narrower in scope, primarily impacting LLM test-time scaling workflows.
Skill1 presents a novel unified framework for co-evolving skill selection, utilization, and distillation in language model agents via a single reinforcement learning objective, addressing a fundamental challenge in agent learning. Its methodological contribution—decomposing a single reward signal into low-frequency and high-frequency components for credit assignment—is innovative and broadly applicable beyond the tested domains. While ShopGym provides a valuable benchmarking infrastructure for e-commerce agents, it is more domain-specific and incremental (simulation/benchmark construction), whereas Skill1 introduces a transferable algorithmic paradigm with demonstrated improvements over multiple baselines and rigorous ablation studies confirming the necessity of each component.
Paper 1 addresses a critical bottleneck in LLM reasoning: exploration efficiency in Reinforcement Learning with Verifiable Rewards (RLVR). Given the massive current interest in scaling reasoning via RL (e.g., DeepSeek-R1, OpenAI o1), outperforming GRPO with smaller rollout budgets offers highly timely and broadly applicable advancements. Paper 2 presents a valuable but domain-specific benchmark for e-commerce web agents, which has a narrower scope of impact compared to foundational improvements in LLM reasoning capabilities.
AGPO addresses a fundamental limitation of RLVR methods—reasoning boundary shrinkage—with a novel asymmetric reinforcement strategy that has both theoretical depth and demonstrated practical impact. It achieves state-of-the-art on mathematical benchmarks and shows real-world industrial deployment at JD for search ads relevance. ShopGym contributes a useful benchmarking framework for e-commerce agents but is more narrowly scoped as infrastructure. AGPO's insights about maintaining exploration capacity while suppressing incorrect paths have broader implications for LLM training methodology across many domains.
Paper 2 addresses a highly timely and critical issue—the efficacy of LLMs as educational tutors. By demonstrating fundamental flaws in how LLMs handle suboptimal and incorrect student solutions, it provides crucial insights that impact the rapidly growing intersection of AI and EdTech. Paper 1 offers a valuable framework for web agents in e-commerce, but its scope is more niche compared to the broader societal and cross-disciplinary implications of evaluating and improving AI-driven educational tools.