PRO-CUA: Process-Reward Optimization for Computer Use Agents

Yifei He, Rui Yang, Hao Bai, Tong Zhang, Han Zhao

May 27, 2026

arXiv:2605.29119v1

cs.AI (primary)

#669 of 2821 · Artificial Intelligence

Tournament Score: 1465±49

Win Rate: 60%

Rating: 5.8 / 10

Significance 5.5 · Rigor 5.5 · Novelty 5 · Clarity 7.5

Abstract

Computer use agents (CUAs) have shown strong potential for automating complex digital workflows, yet their training remains constrained by costly live environment interaction and limited high-quality supervision. Existing filtered behavior cloning pipelines suffer from imitation bottlenecks, including distribution shift from the expert demonstration and the absence of negative learning signals. Meanwhile, standard trajectory-level reinforcement learning struggles with sparse rewards, ambiguous credit assignment, and high infrastructure costs for long-horizon GUI interaction. In this work, we propose PRO-CUA, a process-reward optimization framework for training CUAs with iterative step-level reinforcement learning. PRO-CUA decouples on-policy environment interaction from policy optimization: the current policy collects states through live rollouts, generates diverse candidate actions for each state, receives step-level feedback from a process reward model (PRM), and is optimized with group-relative advantages. This design enables dense and flexible credit assignment without relying on golden answers or offline expert trajectories, while reducing distribution shift by training on the agent's own execution states. Experiments on live web benchmarks demonstrate the effectiveness of PRO-CUA and the reliability of PRM-guided step-level training.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: PRO-CUA: Process-Reward Optimization for Computer Use Agents

1. Core Contribution

PRO-CUA addresses a genuine pain point in training computer use agents (CUAs): the tension between needing on-policy exploration for robust learning and the prohibitive cost of live environment interaction. The paper proposes a two-stage decoupled framework that (1) collects on-policy states through live rollouts, and (2) performs policy optimization offline by sampling candidate actions, grading them with a Process Reward Model (PRM), and updating via Group Relative Policy Optimization (GRPO). The key insight is that by decoupling state collection from policy optimization, the system avoids the infrastructure nightmare of synchronous RL in GUI environments while still training on the agent's own state distribution.

The three interconnected innovations — on-policy state collection, PRM-based step-level rewards replacing rule-based rewards, and the decoupled training architecture — collectively address distribution shift, sparse reward, and infrastructure cost challenges simultaneously. While none of these individual ideas is entirely novel (PRMs exist for math reasoning, on-policy collection is standard in RL, GRPO is established), their combination and adaptation to the CUA setting represents a meaningful engineering and methodological contribution.
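The two-stage loop described in this section can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`collect_states`, `build_training_batch`, `group_relative_advantages`), the toy rollout, policy, and PRM stubs, and the choice of standard-deviation normalization with a small `eps` are all assumptions made for the example. The policy update itself (the GRPO gradient step) is omitted.

```python
import random
import statistics
from typing import Callable, List, Tuple

State = str   # stand-in for a screenshot or DOM snapshot
Action = str  # stand-in for a GUI action such as a click or keystroke


def collect_states(rollout: Callable[[], List[State]], num_tasks: int) -> List[State]:
    """Stage 1: run the current policy live and record the states it visits."""
    states: List[State] = []
    for _ in range(num_tasks):
        states.extend(rollout())
    return states


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one group of candidate actions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def build_training_batch(
    states: List[State],
    sample_action: Callable[[State], Action],
    prm_score: Callable[[State, Action], float],
    group_size: int,
) -> List[Tuple[State, Action, float]]:
    """Stage 2: offline, sample candidates per state, grade them, normalize."""
    batch: List[Tuple[State, Action, float]] = []
    for state in states:
        actions = [sample_action(state) for _ in range(group_size)]
        # Candidates are graded by the PRM but never executed in the environment.
        rewards = [prm_score(state, a) for a in actions]
        advantages = group_relative_advantages(rewards)
        batch.extend((state, a, adv) for a, adv in zip(actions, advantages))
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical stubs: a real system would call a browser, a VLM policy, and a PRM.
    toy_rollout = lambda: [f"page_{rng.randint(0, 9)}" for _ in range(3)]
    toy_policy = lambda state: rng.choice(["click", "type", "scroll"])
    toy_prm = lambda state, action: {"click": 1.0, "type": 0.5, "scroll": 0.0}[action]

    states = collect_states(toy_rollout, num_tasks=2)
    batch = build_training_batch(states, toy_policy, toy_prm, group_size=4)
    print(len(batch))  # 2 tasks x 3 states x 4 candidates = 24 tuples
```

Only `collect_states` touches the live environment; everything in `build_training_batch` can run offline, which is the decoupling the assessment credits for the lower infrastructure cost.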

2. Methodological Rigor

Strengths: The experimental design includes controlled ablations that isolate the effect of the reward source (Table 1), comparing PRM-based rewards against rule-based rewards on the same data subset. The data utilization analysis (Figure 4) concretely demonstrates why PRO-CUA can leverage more training signal. The comparison between Qwen3-VL-4B and GPT-5-mini as PRMs (Figure 3) provides useful insight about GRPO's robustness to reward calibration differences.
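One way to see why group normalization could make the policy update insensitive to reward calibration is the toy check below. It assumes the two graders rank candidate actions identically and differ only by a positive shift and rescale, which is a simplification and not something the paper is known to verify; the scores and the helper name `group_normalize` are hypothetical.

```python
import statistics
from typing import List


def group_normalize(rewards: List[float]) -> List[float]:
    """Subtract the group mean and divide by the group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All candidates scored the same: the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Hypothetical step-level scores for four candidate actions at one state.
strict_grader = [0.1, 0.3, 0.2, 0.4]
# A more lenient grader modeled as a positive affine map of the strict one.
lenient_grader = [0.5 + 2.0 * r for r in strict_grader]

strict_adv = group_normalize(strict_grader)
lenient_adv = group_normalize(lenient_grader)

# The normalized advantages agree up to floating-point error.
print(all(abs(a - b) < 1e-9 for a, b in zip(strict_adv, lenient_adv)))
```

The invariance holds only for order-preserving affine differences. Two PRMs that disagree on which candidate is better would still produce different advantages.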

Weaknesses: The experimental evaluation has several notable gaps:

The base models are relatively small (4B and 8B parameters), and it's unclear how the approach scales to larger models.

The training task set is derived from WebVoyager's synthetic tasks, creating a domain proximity advantage for WebVoyager evaluation that the authors acknowledge but don't adequately control for.

Success rate improvements are measured against controlled baselines (FBC, rule-based Step-RL) but not against the most competitive approaches. The comparison with external models (UI-TARS, WebSTAR) is not apples-to-apples since those use different base models and training data.

The paper excludes numerous websites due to anti-bot mechanisms (Table 3), which may inflate success rates by removing harder domains.

GPT-5 as an automatic evaluator for task success introduces evaluator bias, though this is standard practice.

The experiment is relatively small in scale: only 10 training iterations with 256 tasks each. Neither statistical significance nor variance across runs is reported.
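To illustrate why unreported variance matters, the sketch below computes a Wilson score interval for a benchmark success rate. The task count of 100 is a hypothetical placeholder, since the evaluation set sizes are not given here; only the formula itself is standard.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Hypothetical: 42 successes on 100 evaluation tasks.
low, high = wilson_interval(42, 100)
print(f"{low:.3f} to {high:.3f}")
```

Under that assumed sample size, a 42% success rate is compatible with a true rate anywhere from roughly 33% to 52%, so differences of a few points between methods would need more tasks or repeated runs to be distinguishable.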

3. Potential Impact

The framework has practical relevance for the growing CUA industry. The decoupled architecture is a pragmatic solution that could be adopted by teams building production CUAs. The demonstration that a small PRM (Qwen3-VL-4B) works comparably to GPT-5-mini is economically significant — it suggests practitioners don't need expensive proprietary models for reward grading.

However, the impact is somewhat bounded by several factors:

The approach is validated only on web navigation, not desktop or mobile environments.

The absolute performance numbers remain modest (e.g., 42-43% on WebVoyager, 28-35% on Mind2Web-Live), suggesting CUAs are still far from practical reliability.

The PRM itself is not trained or adapted — it uses off-the-shelf models as judges, which limits the framework's ceiling to the PRM's evaluation capability.

4. Timeliness & Relevance

This paper is highly timely. CUAs are a rapidly emerging application area with significant commercial interest (OpenAI's Operator, Anthropic's Computer Use). The training methodology gap — between simple behavior cloning and full online RL — is a recognized bottleneck. The paper arrives at a moment when the community is actively seeking scalable RL approaches for agent training, as evidenced by concurrent works like GUI-R1, GUI-Libra, and UI-R1. PRO-CUA's contribution of using PRMs as training signals (rather than inference-time selectors) is a timely distinction from WebShepherd and WebArbiter.

5. Strengths & Limitations

Key Strengths:

*Practical architecture design:* The decoupled two-stage framework is elegant and addresses real infrastructure challenges. Each stage can use hardware optimized for its computational profile.

*Data efficiency:* Learning from both successful and failed trajectories (Figure 4) is a clear, well-demonstrated advantage over baselines requiring golden actions.

*Robustness to PRM quality:* The finding that cheap and expensive PRMs yield similar downstream performance is practically valuable and theoretically interesting (explained by GRPO's group normalization).

*Reproducibility potential:* The paper uses open models (Qwen3-VL) and open benchmarks, with code and models promised for release.

Notable Limitations:

*Limited scale:* Small models, small training sets, few iterations. The scalability claims are aspirational rather than demonstrated.

*No PRM training:* The PRM is used off-the-shelf. A more complete contribution would investigate training specialized PRMs for CUA tasks.

*Web-only evaluation:* Despite the general framing, validation is restricted to web navigation.

*Missing analysis:* No failure case analysis, no qualitative examples of PRM grading accuracy, no analysis of what types of tasks or steps benefit most from PRM feedback.

*Incremental novelty:* Each component (on-policy collection, PRM grading, GRPO) exists in prior work; the combination, while useful, is primarily an engineering contribution.

*On-policy claim nuance:* The states are on-policy, but the candidate actions are generated independently and never executed. This is a hybrid approach, not fully on-policy RL, and the paper could better discuss this distinction's theoretical implications.

Overall Assessment

PRO-CUA makes a solid, practical contribution to CUA training methodology by combining on-policy state collection with PRM-guided step-level RL in a decoupled framework. The experimental evidence supports its effectiveness over controlled baselines, though the scale remains limited and the novelty is primarily in the system design rather than algorithmic innovation. The work is timely and addresses real bottlenecks, but falls short of a definitive advance due to modest absolute performance, limited evaluation scope, and lack of deeper analysis.

Rating: 5.8 / 10

Significance 5.5 · Rigor 5.5 · Novelty 5 · Clarity 7.5

Generated May 29, 2026

Comparison History (15)

vs. CrystalReasoner: Reasoning and RL for Property-Conditioned Crystal Structure Generation

gemini-3.1 · 5/29/2026

Paper 1 directly accelerates fundamental scientific discovery in materials science and chemistry. By integrating physical priors, reasoning, and RL into LLMs for crystal structure generation, it addresses a major bottleneck in discovering new materials. While Paper 2 offers valuable advancements in AI agent training for software automation, Paper 1's cross-disciplinary application and potential to impact critical areas like energy storage and advanced materials give it a broader and more profound scientific impact.

vs. Orthogonal Concept Erasure for Diffusion Models

gpt-5.2 · 5/29/2026

Paper 2 (OCE) is likely to have higher scientific impact: it introduces a novel geometric reframing of diffusion model editing (multiplicative, orthogonal transforms with closed-form updates) that directly addresses a known limitation of additive edits, shows strong methodological clarity, and scales to large multi-concept erasure quickly. Its applications to safety, IP/privacy, and controllable generation are immediate and broadly relevant across ML and generative media. Paper 1 is valuable for agent training, but depends on PRM quality and is more domain-specific to GUI agents, with narrower cross-field uptake.

vs. GPS-Enhanced Tourist Mobility Modeling with Seasonal Spatial Priors and LLM-Based Activity Chain Generation

claude-opus-4.6 · 5/29/2026

PRO-CUA addresses a fundamental challenge in training computer use agents through a novel process-reward optimization framework that combines step-level reinforcement learning with process reward models. This has broad implications for AI agent development, automation, and GUI interaction—a rapidly growing field. The methodological innovation of decoupling on-policy interaction from optimization with dense credit assignment is significant. Paper 2, while competent, addresses a narrower domain (tourist mobility modeling in Tokyo) with incremental methodological contributions combining existing techniques (GPS priors, LLMs), limiting its broader scientific impact.

vs. Primal-Dual Guided Decoding for Constrained Discrete Diffusion

gpt-5.2 · 5/29/2026

Paper 2 likely has higher scientific impact: it introduces a broadly applicable, inference-time constrained decoding framework for discrete diffusion with a principled primal-dual/KL-regularized formulation, supports multiple constraints, needs no retraining or extra model calls, and offers formal bounds—strong methodological rigor and cross-domain breadth (text, molecules, music). Paper 1 is timely and valuable for GUI agents, but its impact is more application-specific (CUAs) and relies on process reward models whose reliability/generalization may limit broad adoption compared to a general decoding method.

vs. Robust and Efficient Guardrails with Latent Reasoning

gpt-5.2 · 5/29/2026

Paper 2 (COLAGUARD) likely has higher impact: it introduces a novel latent-reasoning guardrail that materially advances a widely relevant deployment bottleneck (safety robustness vs. inference cost) with strong, quantified gains (macro-F1 improvements plus large speed/token reductions) across many benchmarks and settings. The approach is broadly applicable to LLM safety infrastructure across products and domains, timely given widespread LLM deployment. Paper 1 is innovative for CUAs but is narrower in scope (GUI agents) and depends on PRM reliability and live rollouts, which may limit generalizability and adoption.

vs. Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

gpt-5.2 · 5/29/2026

Paper 1 introduces a broadly applicable architectural decomposition (reactive execution, simulative planning via a world model, and learned self-regulation of when/how much to plan) and demonstrates substantial efficiency gains (large token savings) while matching much larger models across diverse reasoning tasks. This targets a timely bottleneck—reasoning cost/latency—and could generalize beyond planning to self-governed computation in agents, affecting multiple subfields (LLM reasoning, planning, agent design, efficiency). Paper 2 is strong and practical for GUI agents, but its impact is narrower and more application-specific.

vs. Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

claude-opus-4.6 · 5/29/2026

PRO-CUA addresses a high-demand practical problem—training computer use agents via step-level reinforcement learning—with a clear, scalable framework that reduces distribution shift and improves credit assignment. Its real-world applicability to GUI automation gives it broad impact potential. Paper 2 offers rigorous theoretical analysis of compositional incoherence in multi-LLM systems, which is intellectually interesting but more niche. While Paper 2's formalization is novel, PRO-CUA's combination of methodological innovation, practical relevance, and timeliness in the rapidly growing CUA space gives it higher expected impact.

vs. Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

claude-opus-4.6 · 5/29/2026

PRO-CUA addresses a fundamental training challenge for computer use agents with a novel process-reward optimization framework that combines step-level reinforcement learning with process reward models. This offers broad methodological contributions applicable beyond CUAs to general agent training. Paper 1 identifies an important security threat (sleeper attacks on LLM agents) with a solid benchmark, but is more narrowly focused on a specific attack vector. PRO-CUA's framework for dense credit assignment and on-policy training without expert trajectories has wider applicability and addresses a more fundamental bottleneck in the rapidly growing agent training field.

vs. From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation

gpt-5.2 · 5/29/2026

Paper 2 likely has higher impact: it targets rapidly growing computer-use agents with clear, near-term real-world automation applications and proposes a broadly applicable training framework (process-reward, step-level RL with reduced on-policy cost) that can transfer across GUIs and agent settings. Its contribution aligns with current trends (PRMs, dense credit assignment, agentic web tasks) and can influence both research and product deployments. Paper 1 is novel for knowledge editing, but the scope is narrower and impact depends on adoption of causal-editing benchmarks and integration into editing pipelines.

vs. Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations

gemini-3.1 · 5/29/2026

Paper 1 addresses a critical and highly timely bottleneck in the development of AI agents—training computer use agents via reinforcement learning. Its method offers immediate, high-impact real-world applications in digital workflow automation. While Paper 2 provides fascinating fundamental insights into LLM representations and cognitive science, Paper 1's approach resolves practical engineering constraints in agentic AI, leading to more direct and widespread technological adoption across domains.

vs. Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection

gpt-5.2 · 5/29/2026

Paper 2 likely has higher scientific impact due to a clearer, broadly applicable training algorithm for computer-use agents, addressing major bottlenecks (sparse rewards, credit assignment, costly interaction) with a decoupled rollout/optimization scheme using step-level process rewards and group-relative advantages, validated on live web benchmarks. This is timely and general across GUI/web automation, agentic RL, and LLM training. Paper 1 is novel for scientific multi-agent workflow integrity (semantic drift checkpoints), but appears more domain-specific and system-design heavy, with impact depending on adoption in scientific computing pipelines.

vs. Formalizing Mathematics at Scale

claude-opus-4.6 · 5/29/2026

Paper 2 presents a transformative contribution to mathematical formalization at scale, producing a concrete, reusable artifact (45,000+ verified Lean 4 declarations across 26 textbooks) that enables automated verification of research-level mathematics. Its breadth of impact spans mathematics, formal verification, AI for math, and education. While Paper 1 makes a solid contribution to training computer use agents with process rewards, it is more incremental within the RL/agent training space. Paper 2's demonstration that large-scale autoformalisation is feasible represents a paradigm shift with far-reaching implications for how mathematics is verified and produced.

vs. Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

gpt-5.2 · 5/29/2026

Paper 2 likely has higher impact due to broader applicability and timeliness: process-reward, step-level RL for computer-use agents targets a fast-growing area (GUI/web agents) with clear real-world automation value. Methodologically, decoupling live interaction from optimization via PRM-guided dense feedback and group-relative advantages addresses key RL bottlenecks (sparse rewards, credit assignment, distribution shift) and could generalize across agentic tasks beyond GUIs. Paper 1 is novel for interactive ASR and semantic evaluation, but its impact is more domain-specific to speech recognition and depends on LLM-judge reliability.

vs. Make LLM Learn to Synthesize from Streaming Experiences through Feedback

claude-opus-4.6 · 5/29/2026

PRO-CUA addresses a critical bottleneck in training computer use agents through a novel process-reward optimization framework with step-level reinforcement learning. It tackles fundamental challenges (sparse rewards, credit assignment, distribution shift) in a rapidly growing field of autonomous GUI agents. The practical implications for automating digital workflows are substantial. Paper 2's StreamSynth, while interesting in framing data synthesis as experience-driven, addresses a comparatively narrower problem with less transformative potential. PRO-CUA's methodology combining PRMs with on-policy learning is more technically innovative and timely given the surge in agentic AI research.

vs. ProvMind: Provenance-grounded reasoning for materials synthesis

gpt-5.2 · 5/29/2026

Paper 2 likely has higher impact due to broader applicability and timeliness: step-level RL with process reward models for computer-use agents targets a rapidly expanding area (general-purpose GUI/web automation) with clear near-term product and research impact across AI, HCI, and robotics-style sequential decision-making. Its methodological contribution (decoupling interaction from optimization, dense credit assignment via PRM, reducing distribution shift) is general and extensible. Paper 1 is novel and rigorous within materials synthesis reasoning, but its impact is more domain-specific and benchmark-centered, with narrower cross-field reach.