CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Bowen Wang, Dunjie Lu, Junli Wang, Tianyi Bai, Shixuan Liu, Zhipeng Zhang, Haiquan Wang, Hao Hu

May 25, 2026

arXiv:2605.25624v1

cs.AI (primary) · cs.LG

#180 of 2682 · Artificial Intelligence

Tournament Score: 1528±46

Win Rate: 86%

Rating: 7.5/10

Significance 7.5 · Rigor 6.5 · Novelty 6.5 · Clarity 7.5

Abstract

Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires a consistent task instruction, an executable environment, and a verifiable reward. However, hand-curated benchmarks achieve high reward fidelity but cover few applications, while LLM-as-judge-based datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An orchestrator agent drives the two through iterative rounds upon execution. Generated tuples then pass a final filter combining LLM majority voting and agent rollouts, ensuring quality beyond the per-task adversarial loop. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B achieve 62.1% and 72.6% on OSWorld-Verified, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: CUA-Gym

Core Contribution

CUA-Gym addresses a well-identified bottleneck in training computer-use agents (CUAs) via reinforcement learning with verifiable rewards (RLVR): the scarcity of scalable training data that simultaneously provides deterministic rewards, broad application coverage, and diverse tasks. The paper's core contribution is a multi-agent pipeline that co-generates task instructions, environment states, and reward functions from shared specifications. The key architectural insight is the Generator-Discriminator separation with an information barrier: the Generator constructs initial and golden environment states while the Discriminator writes reward functions from the task description alone, preventing reward hacking by construction. This is complemented by CUA-Gym-Hub, a suite of 94 synthesized mock web applications that serve as reusable, resettable training environments.
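The information barrier described above can be illustrated with a small sketch. This is not the paper's implementation: `TaskSpec`, `discriminator_view`, `write_reward`, and `agrees` are hypothetical names, the environment state is assumed to be a flat key/value map, and the "set <key> to <value>" parsing rule stands in for an LLM-written reward. The single check in `agrees` (golden state passes, initial state fails) is only an assumed example of what an execution-based agreement condition might look like.

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, str]  # assumption: environment state as a flat key/value map


@dataclass(frozen=True)
class TaskSpec:
    description: str  # visible to both agents
    initial: State    # built by the Generator, hidden from the Discriminator
    golden: State     # built by the Generator, hidden from the Discriminator


def discriminator_view(spec: TaskSpec) -> str:
    """Expose only the task description; the states stay behind the barrier."""
    return spec.description


def write_reward(description: str) -> Callable[[State], float]:
    """Stand-in for the Discriminator: derive a check from the text alone."""
    # Hypothetical parsing rule for descriptions shaped "set <key> to <value>".
    _, key, _, value = description.split(maxsplit=3)
    return lambda state: 1.0 if state.get(key) == value else 0.0


def agrees(spec: TaskSpec, reward: Callable[[State], float]) -> bool:
    """Orchestrator-side execution check: golden passes, initial fails."""
    return reward(spec.golden) == 1.0 and reward(spec.initial) == 0.0


spec = TaskSpec(
    description="set theme to dark",
    initial={"theme": "light"},
    golden={"theme": "dark"},
)
reward = write_reward(discriminator_view(spec))
print(agrees(spec, reward))  # True
```

Because `write_reward` never receives `spec.golden`, the reward cannot simply memorize the Generator's output; agreement between the two is only established afterwards by executing the reward on both states.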

The resulting dataset of 32,112 verified RLVR tuples across 110 environments is, by the authors' account, the largest open-source CUA RLVR corpus with programmatic verification covering both desktop and web platforms.

Methodological Rigor

The pipeline design reflects careful engineering. The adversarial loop with five agreement conditions, forbidden-pattern static scanning, and the two-stage filtering (LLM majority voting + teacher rollouts) create multiple layers of quality assurance. The information barrier between Generator and Discriminator is well-motivated and operationally enforced through process-level isolation with a clearly defined access matrix.
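As a rough sketch of how these layers might compose, the snippet below chains a forbidden-pattern scan with a majority vote and a rollout check. The pattern list, the function names, and the acceptance rule (strict majority of judge votes plus at least one fully rewarded teacher rollout) are all assumptions made for illustration; the review does not state the paper's actual patterns or thresholds, and the paper applies the static scan inside the adversarial loop rather than in the final filter.

```python
import re
from typing import List

# Hypothetical forbidden patterns; the paper's actual list is not given here.
FORBIDDEN = [
    re.compile(r"\brandom\."),  # nondeterministic reward
    re.compile(r"\bgolden\b"),  # reward peeking at the golden state
]


def static_scan(reward_source: str) -> List[str]:
    """Return the forbidden patterns found in a reward function's source."""
    return [p.pattern for p in FORBIDDEN if p.search(reward_source)]


def majority_pass(votes: List[bool]) -> bool:
    """True when a strict majority of LLM judge votes approve the tuple."""
    return sum(votes) * 2 > len(votes)


def keep_tuple(reward_source: str, votes: List[bool],
               rollout_rewards: List[float]) -> bool:
    """Assumed rule: clean scan, majority approval, and one solved rollout."""
    return (
        not static_scan(reward_source)
        and majority_pass(votes)
        and any(r == 1.0 for r in rollout_rewards)
    )


clean = "def reward(state):\n    return 1.0 if state.get('theme') == 'dark' else 0.0\n"
noisy = "import random\ndef reward(state):\n    return random.random()\n"

print(keep_tuple(clean, [True, True, False], [0.0, 1.0]))  # True
print(keep_tuple(noisy, [True, True, True], [1.0]))        # False
```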

The experimental evaluation is reasonably thorough. Data scaling experiments (1.4K, 3K, 12K tuples) show monotonic improvement without saturation, and the environment scaling ablation (10 vs. 80 environments at fixed data) provides direct evidence that environment diversity is a complementary scaling axis. The transfer results on WebArena (a held-out benchmark disjoint from training environments) are particularly valuable, demonstrating generalization rather than overfitting to synthesized mocks.

However, there are notable limitations in rigor. All RL runs are single-seed due to compute costs, limiting statistical confidence. The environment scaling study uses teacher distillation rather than full RL, making it an indirect proxy for the claim about RL scaling. The OSWorld-Verified results are strong (62.1% for A3B, 72.6% for A17B), but the absolute gains on WebArena are relatively modest (+3.7 and +2.0 pp), raising questions about how much of the improvement is benchmark-specific.

Potential Impact

Immediate impact: The open-sourcing of the full pipeline, dataset, environments, and models creates significant infrastructure for the CUA research community. CUA-Gym-Hub in particular could become a standard substrate for CUA training, similar to how WebArena serves as an evaluation benchmark.

Broader implications: The paper demonstrates that the RLVR recipe that worked for math and code transfers to the GUI agent domain, which has substantially different structural requirements (environment state management, visual grounding, multi-step interaction). This validates a research direction and provides a template for scaling.

Environment synthesis as a scaling axis: The finding that environment diversity independently contributes to performance beyond data volume is a valuable insight. This reframes the CUA training problem as needing investment in breadth of environments, not just depth of tasks per environment.

Emergent multi-action batching: The spontaneous emergence of action batching during RL training (33-45% trajectory compression) is an interesting finding that parallels emergent behaviors in reasoning RL, suggesting a general phenomenon worth further study.

Timeliness & Relevance

This work arrives at a critical moment. Computer-use agents are rapidly becoming a focal point for AI deployment, with Anthropic, OpenAI, and others investing heavily. The data bottleneck for CUA RLVR is widely acknowledged, and this paper offers the most comprehensive solution to date. The grounding of environment selection in O*NET occupational taxonomies and the Anthropic Economic Index reflects a practical orientation toward real-world utility distributions.

Strengths

1. Systems-level completeness: The paper addresses the full pipeline from environment synthesis to task generation to reward verification to RL training, rather than solving one piece in isolation.

2. Anti-reward-hacking by design: The information barrier, forbidden-pattern scanning, and multi-stage filtering represent a principled approach to a problem that plagues many RL training pipelines.

3. Strong empirical results: CUA-Gym-A3B matching the untrained A17B base at ~10× fewer parameters is a compelling demonstration of data efficiency gains.

4. Reproducibility commitment: Open-sourcing the full stack (pipeline, dataset, environments, models) is essential for community adoption.

5. Detailed documentation: The appendix provides extraordinary detail on implementation (skill files, reward patterns, full code examples), enabling genuine reproducibility.

Limitations & Weaknesses

1. Terminal-state verification only: Rewards verify final environment state rather than process quality. A destructive path that recreates the correct final state receives full reward—a fundamental limitation acknowledged but unaddressed.

2. Mock fidelity gap: Mock environments are approximations missing authentication, rate limits, network latency, and real server-side behavior. The degree to which this limits transfer to real applications remains unclear.

3. Single-seed experiments: The most important results (RL training curves, final benchmark numbers) lack error bars, making it difficult to assess reliability.

4. Compute cost barriers: The infrastructure requirements (192-512 H200 GPUs, 2000 parallel VMs, ~$60K+ in compute per large run) limit reproducibility despite code release.

5. Benchmark specificity: The WebArena gains are modest compared to OSWorld gains, suggesting the training may be somewhat OSWorld-aligned.

6. Trademark/IP concerns: Using real product names as internal development labels for synthesized mocks, even with compliance procedures, introduces legal ambiguity that could complicate adoption.
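Limitation 1 (terminal-state verification) can be made concrete with a toy example. The sketch below assumes a file-like key/value state and two hypothetical action types, `("set", key, value)` and `("delete", key)`; none of these names come from the paper. It shows that a reward comparing only the final state to the golden state cannot distinguish a careful trajectory from one that deletes data and then recreates it.

```python
from typing import Dict, List, Tuple

State = Dict[str, str]  # assumption: state as a mapping from file name to content


def terminal_reward(final: State, golden: State) -> float:
    """Reward depends only on the final state, not on how it was reached."""
    return 1.0 if final == golden else 0.0


def run(initial: State, actions: List[Tuple[str, ...]]) -> State:
    """Apply hypothetical ('set', key, value) and ('delete', key) actions."""
    state = dict(initial)
    for action in actions:
        if action[0] == "set":
            state[action[1]] = action[2]
        elif action[0] == "delete":
            state.pop(action[1], None)
    return state


initial = {"report.txt": "draft", "notes.txt": "keep"}
golden = {"report.txt": "final", "notes.txt": "keep"}

careful = [("set", "report.txt", "final")]
destructive = [
    ("delete", "notes.txt"),
    ("delete", "report.txt"),
    ("set", "report.txt", "final"),
    ("set", "notes.txt", "keep"),
]

print(terminal_reward(run(initial, careful), golden))      # 1.0
print(terminal_reward(run(initial, destructive), golden))  # 1.0
```

Both trajectories score 1.0, which is the behavior the review flags: distinguishing them would require some form of process-level signal in addition to the final-state check.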

Overall Assessment

CUA-Gym is a strong systems contribution that advances the state of CUA training infrastructure substantially. Its primary value lies in demonstrating that the RLVR scaling paradigm transfers to CUAs and in providing reusable infrastructure for the community. The information-barrier design for reward co-generation is the most novel technical contribution. While individual components (mock synthesis, adversarial verification, RL training) are not groundbreaking in isolation, their integration into a coherent, scalable pipeline represents meaningful engineering innovation with clear practical impact.

Rating: 7.5/10

Significance 7.5 · Rigor 6.5 · Novelty 6.5 · Clarity 7.5

Generated May 26, 2026

Comparison History (22)

vs. Credit Assignment with Resets in Language Model Reasoning

gemini-3.1 · 5/27/2026

Paper 2 tackles a fundamental algorithmic bottleneck in LLM reasoning (credit assignment in multi-step RL) with broad applicability across all reasoning domains. While Paper 1 provides a highly valuable dataset and pipeline for computer-use agents, Paper 2's theoretical framework and self-localization improvements over standard GRPO offer deeper methodological innovation and broader foundational scientific impact.

vs. Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

claude-opus-4.6 · 5/27/2026

CUA-Gym addresses a critical bottleneck in training computer-use agents via RLVR by providing a scalable pipeline for generating verified training data, environments, and rewards. It produces concrete artifacts (32K training tuples, 110 environments, trained models) that enable reproducible progress in an emerging high-impact area. The demonstrated scaling laws and cross-benchmark transfer are compelling. Paper 1 offers valuable diagnostic insights about composition collapse in LLM reasoning, but its contribution is primarily analytical/methodological rather than enabling new capabilities. CUA-Gym's open-source release of pipeline, data, and models will likely drive broader follow-on research.

vs. PALoRA: Projection-Adaptive LoRA for Preserving Reasoning in Large Language Models

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact due to a broadly enabling contribution: a scalable pipeline plus a large, verifiable RLVR dataset (32k tasks, 110 environments) and synthesized app suite, with open-sourcing planned. This directly addresses a key bottleneck for computer-use agents, a timely area with strong real-world applicability (automation on GUIs/web) and potential to standardize training/evaluation across labs. While Paper 1 is a novel, rigorous PEFT method for reducing interference, its impact is narrower (model adaptation mechanics) and more incremental relative to the ecosystem-wide leverage of scalable verified environments.

vs. Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis

gpt-5.2 · 5/26/2026

Paper 1 likely has higher impact due to a concrete, scalable solution to a key bottleneck (verifiable RL training data for computer-use agents), with sizable artifacts (32k verified tuples, 110 environments, synthesized app hub) and strong benchmark gains plus demonstrated transfer. The co-generation pipeline for tasks/environments/rewards and planned open-sourcing can broadly accelerate CUA/RLVR research across agents, tooling, and UI automation. Paper 2 targets an important domain, but the abstract suggests a more incremental neuro-symbolic/fuzzy-logic integration with performance merely comparable to SOTA and fewer clearly specified methodological/benchmark innovations, limiting expected breadth and adoption.

vs. Agentic Systems as Boosting Weak Reasoning Models

claude-opus-4.6 · 5/26/2026

Paper 2 (CUA-Gym) introduces a novel scalable pipeline for generating verifiable training data for computer-use agents, addressing a critical bottleneck in RLVR for CUAs. It provides concrete infrastructure (32K training tuples, 110 environments, open-source models/pipeline) that enables the community to advance CUA research. Paper 1 provides valuable theoretical insights on inference-time boosting with weak models, but is more incremental—formalizing and empirically validating known intuitions about committee-based selection. Paper 2's broader impact stems from creating foundational training infrastructure for an emerging field, with demonstrated state-of-the-art results and transfer learning, plus full open-source release.

vs. Adaptive Human-AI Coordination via Hierarchical Action Disentanglement

gemini-3.1 · 5/26/2026

Paper 2 addresses a critical bottleneck in the highly impactful field of Computer-Use Agents by introducing a scalable pipeline for verifiable reinforcement learning (RLVR). Releasing a massive dataset, environments, and performant models provides immense value to the community, directly advancing real-world LLM agent capabilities. Paper 1 offers a solid algorithmic contribution to human-AI coordination but evaluates primarily in a simulated toy domain (Overcooked), limiting its immediate real-world breadth and impact compared to Paper 2.

vs. Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration

claude-opus-4.6 · 5/26/2026

CUA-Gym addresses a fundamental bottleneck in training computer-use agents via RLVR by creating a scalable pipeline for generating verified training data with deterministic rewards. It produces 32K verified training tuples across 110 environments, demonstrates strong empirical results (state-of-the-art on OSWorld and transfer to WebArena), and promises to open-source the full pipeline, dataset, and models. This infrastructure contribution enables future research at scale. Paper 2 proposes an incremental training-free method for hallucination mitigation in VLMs—a well-studied problem with many existing solutions—offering narrower impact scope.

vs. Test-Time Deep Thinking to Explore Implicit Rules

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact due to a broadly useful, scalable infrastructure contribution: a pipeline to generate verifiable RL training tasks/environments/rewards for computer-use agents, plus a large released dataset (32k tuples, 110 environments) and synthetic app suite. This directly addresses a key bottleneck for RLVR in CUAs and is timely with strong real-world applicability (web/OS automation). The methodological design (generator–discriminator–orchestrator + filtering) and demonstrated scaling/transfer suggest robust, cross-project adoption potential beyond a single task setting.

vs. A Signal-Language Foundation Model for Broad-Spectrum Cardiovascular Assessment from Routine Electrocardiography

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact due to its scale (2.8M ECGs; 9 external cohorts), methodological rigor via broad external validation across 89 clinically relevant tasks, and direct real-world applicability to cardiovascular screening and diagnosis, including rare diseases. Its signal-language foundation approach is timely and broadly extensible in medical AI, with potential immediate translational value in healthcare systems. Paper 1 is novel and important for agent training infrastructure, but its impact is more concentrated within CUA/RL communities and depends on downstream adoption and robustness in real environments.

vs. EPPC-OASIS: Ontology-Aware Adaptation and Structured Inference Refinement for Electronic Patient-Provider Communication Mining in Secure Messages

gemini-3.1 · 5/26/2026

Paper 2 addresses a critical bottleneck in the highly active field of AI agents by introducing a scalable, automated pipeline for generating verifiable RL environments. Its broad applicability to autonomous computer-use, massive dataset release, and strong performance gains indicate a much higher potential for widespread adoption and cross-disciplinary impact compared to Paper 1's incremental gains in the narrower, highly specialized domain of clinical NLP.

vs. Towards end-to-end LLM-based censoring-aware survival analysis

claude-opus-4.6 · 5/26/2026

CUA-Gym addresses a fundamental bottleneck in training computer-use agents via RLVR by creating a scalable pipeline for generating verified training data. It produces a large-scale dataset (32K tuples, 110 environments), demonstrates strong empirical results with transfer to held-out benchmarks, and promises full open-source release of pipeline, data, and models. Its breadth of impact spans RL, LLM agents, and software automation—a rapidly growing field. Paper 1, while novel in adapting LLMs for survival analysis, shows modest improvements over baselines and serves primarily as a proof of concept in a narrower domain.

vs. AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

gemini-3.1 · 5/26/2026

Paper 1 addresses a critical bottleneck in developing Computer-Use Agents (CUAs)—the scarcity of verifiable training environments. By providing a scalable pipeline for generating environments, tasks, and reward functions, it enables Reinforcement Learning with Verifiable Rewards (RLVR) for CUAs. This unlocks significant potential for automating complex digital tasks, representing a foundational leap for agentic AI. While Paper 2 provides a valuable evaluation tool for AV generation, the generative environment synthesis and agent training paradigm in Paper 1 offers a broader, more transformative impact on general-purpose AI development.

vs. A Deep Dive into Axiomatic Design -- Part I: Problem Formulation

gpt-5.2 · 5/26/2026

Paper 1 is more novel and timely, proposing a scalable, verifiable RL training pipeline for computer-use agents with generated environments, reward functions, and a large released dataset. It demonstrates methodological rigor via adversarial generation, filtering, and clear benchmark gains with transfer, and has strong real-world applicability (automation on web/OS tasks) plus broad impact across RL, agentic AI, and software/tool use. Paper 2 is largely expository/synthesizing existing axiomatic design theory with practical advice; useful pedagogically but likely lower novelty and narrower scientific impact.

vs. StructBreak: Structural Cognitive Overload-Induced Safety Failures in MLLMs

gpt-5.2 · 5/26/2026

Paper 1 likely has higher scientific impact: it introduces a scalable, verifiable RL training pipeline for computer-use agents plus a large dataset (32k tuples) and 110 synthetic yet high-fidelity environments, addressing a key bottleneck for RLVR in CUAs. It demonstrates measurable performance gains and transfer (OSWorld-Verified, WebArena) and plans full open-sourcing, enabling broad adoption across agents, RL, HCI, and automation. Paper 2 is timely and important for safety, but mainly provides an attack/benchmark in a narrower niche with less direct capability-building impact.

vs. Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs

gpt-5.2 · 5/26/2026

Paper 1 likely has higher impact: it delivers a scalable, verifiable RL training pipeline plus a large dataset (32k tuples, 110 environments) and synthetic web-app suite, directly addressing a major bottleneck for computer-use agents. It shows strong empirical gains on established benchmarks and promises open-sourcing of pipeline, data, environments, and models—enabling broad adoption and follow-on work. Its applications (web/OS automation) are immediate and timely. Paper 2 is conceptually novel for heuristic design, but appears less concretely enabling at ecosystem scale than Paper 1’s infrastructure-and-data contribution.

vs. Hypothesis Generation and Inductive Inference in Children and Language Models

gpt-5.2 · 5/26/2026

Paper 1 likely has higher scientific impact due to a scalable, verifiable RL training pipeline plus a large open dataset/environments for computer-use agents—an area currently bottlenecked by reward verification and environment availability. Its methodological contribution (generator–discriminator reward synthesis with execution-based filtering) and demonstrated performance gains/transfer suggest immediate applicability for agent training and benchmarking across industry and academia, with broad downstream influence on RL, agentic LLMs, and UI automation. Paper 2 is novel and interdisciplinary, but its impact is more domain-specific (cognitive modeling) and less directly enabling for large-scale systems development.

vs. When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

claude-opus-4.6 · 5/26/2026

CUA-Gym addresses a critical bottleneck in training computer-use agents by providing a scalable pipeline for generating verified training data with deterministic rewards. It delivers 32K verified training tuples across 110 environments, demonstrates strong empirical results (outperforming prior open-source CUAs), and promises full open-source release of pipeline, data, environments, and models. This infrastructure contribution enables broad follow-on research. Paper 2 provides valuable analysis of multi-agent RL training dynamics but is more diagnostic/analytical in nature, offering design insights rather than enabling new capabilities or releasing transformative resources.

vs. GRAIL: AI translation for scientists application workflow on satellite data

claude-opus-4.6 · 5/26/2026

CUA-Gym addresses a fundamental bottleneck in training computer-use agents via RLVR by creating a scalable pipeline for generating verified training data, producing 32,112 training tuples across 110 environments. It achieves state-of-the-art results on established benchmarks (OSWorld, WebArena) and demonstrates transfer learning. The methodology is broadly applicable to the rapidly growing field of autonomous agents. Paper 2 (GRAIL) solves a useful but narrower problem—translating geospatial Python scripts to Spark—with limited novelty (uses existing LLMs and frameworks) and narrower domain impact.

vs. FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

gemini-3.1 · 5/26/2026

Paper 2 addresses a critical bottleneck in the highly impactful field of general-purpose AI agents: the lack of scalable, verifiable training data for Computer-Use Agents (CUAs). While Paper 1 introduces a valuable benchmark for the specific niche of Operations Research, Paper 2 provides a comprehensive, scalable pipeline that co-generates environments and rewards, yields a massive dataset (32k+ tasks), and trains state-of-the-art open-source models that outperform existing baselines on major benchmarks. This demonstrates broader applicability, higher methodological innovation, and immediate real-world utility for developing autonomous agents.

vs. Associations between echocardiographic traits and AI-ECG predictions of heart failure

claude-opus-4.6 · 5/26/2026

CUA-Gym addresses a critical bottleneck in training computer-use agents via RLVR by providing a scalable pipeline for generating verified training data with deterministic rewards. It produces 32K verified training tuples across 110 environments, demonstrates state-of-the-art results on established benchmarks, and promises full open-source release of pipeline, data, and models. This has broad impact across AI/ML, software engineering, and human-computer interaction. Paper 2, while clinically useful, is primarily a correlational analysis validating what an existing AI-ECG model detects, with more incremental contributions to cardiology.