PRO-CUA: Process-Reward Optimization for Computer Use Agents
Yifei He, Rui Yang, Hao Bai, Tong Zhang, Han Zhao
Abstract
Computer use agents (CUAs) have shown strong potential for automating complex digital workflows, yet their training remains constrained by costly live environment interaction and limited high-quality supervision. Existing filtered behavior cloning pipelines suffer from imitation bottlenecks, including distribution shift from the expert demonstration and the absence of negative learning signals. Meanwhile, standard trajectory-level reinforcement learning struggles with sparse rewards, ambiguous credit assignment, and high infrastructure costs for long-horizon GUI interaction. In this work, we propose PRO-CUA, a process-reward optimization framework for training CUAs with iterative step-level reinforcement learning. PRO-CUA decouples on-policy environment interaction from policy optimization: the current policy collects states through live rollouts, generates diverse candidate actions for each state, receives step-level feedback from a process reward model (PRM), and is optimized with group-relative advantages. This design enables dense and flexible credit assignment without relying on golden answers or offline expert trajectories, while reducing distribution shift by training on the agent's own execution states. Experiments on live web benchmarks demonstrate the effectiveness of PRO-CUA and the reliability of PRM-guided step-level training.
AI Impact Assessments
Scientific Impact Assessment: PRO-CUA: Process-Reward Optimization for Computer Use Agents
1. Core Contribution
PRO-CUA addresses a genuine pain point in training computer use agents (CUAs): the tension between needing on-policy exploration for robust learning and the prohibitive cost of live environment interaction. The paper proposes a two-stage decoupled framework that (1) collects on-policy states through live rollouts, and (2) performs policy optimization offline by sampling candidate actions, grading them with a Process Reward Model (PRM), and updating via Group Relative Policy Optimization (GRPO). The key insight is that by decoupling state collection from policy optimization, the system avoids the infrastructure nightmare of synchronous RL in GUI environments while still training on the agent's own state distribution.
The three interconnected innovations (on-policy state collection, PRM-based step-level rewards replacing rule-based rewards, and the decoupled training architecture) collectively address the distribution shift, sparse reward, and infrastructure cost challenges. None of these ideas is entirely novel on its own: PRMs exist for math reasoning, on-policy collection is standard in RL, and GRPO is established. Their combination and adaptation to the CUA setting nonetheless represent a meaningful engineering and methodological contribution.
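The group-relative step described above can be illustrated with a minimal sketch. It assumes the common GRPO-style normalization (subtract the group mean, divide by the group standard deviation); the paper's exact formula is not given here, and the function name, the `eps` constant, and the example scores are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one state's candidate actions.

    Assumption: advantages are (reward - group mean) / (group std + eps),
    the usual GRPO-style normalization. eps avoids division by zero when
    every candidate receives the same score.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical PRM scores for four candidate actions sampled at a single
# state that the current policy reached during a live rollout.
prm_scores = [0.9, 0.2, 0.2, 0.7]
advantages = group_relative_advantages(prm_scores)

# Candidates scored above the group mean get positive advantages and are
# reinforced; those below the mean get negative advantages.
for score, adv in zip(prm_scores, advantages):
    print(f"PRM score {score:.1f} -> advantage {adv:+.3f}")
```

Because each advantage is computed relative to the other candidates at the same state, a PRM that scores uniformly higher or on a stretched scale yields nearly the same advantages. This may be one reason differently calibrated PRMs can lead to similar training behavior, though that is an interpretation rather than a claim from the source.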
2. Methodological Rigor
Strengths: The experimental design includes controlled ablations that isolate the effect of the reward source (Table 1), comparing PRM-based rewards against rule-based rewards on the same data subset. The data utilization analysis (Figure 4) concretely demonstrates why PRO-CUA can leverage more training signal. The comparison between Qwen3-VL-4B and GPT-5-mini as PRMs (Figure 3) provides useful insight about GRPO's robustness to reward calibration differences.
Weaknesses: The experimental evaluation has several notable gaps:
3. Potential Impact
The framework has practical relevance for the growing CUA industry. The decoupled architecture is a pragmatic solution that could be adopted by teams building production CUAs. The demonstration that a small PRM (Qwen3-VL-4B) works comparably to GPT-5-mini is economically significant: it suggests practitioners do not need expensive proprietary models for reward grading.
However, the impact is somewhat bounded by several factors:
4. Timeliness & Relevance
This paper is highly timely. CUAs are a rapidly emerging application area with significant commercial interest (OpenAI's Operator, Anthropic's Computer Use). The gap in training methodology between simple behavior cloning and full online RL is a recognized bottleneck. The paper arrives at a moment when the community is actively seeking scalable RL approaches for agent training, as evidenced by concurrent works like GUI-R1, GUI-Libra, and UI-R1. PRO-CUA uses PRMs as training signals rather than as inference-time selectors, which distinguishes it from WebShepherd and WebArbiter.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment
PRO-CUA makes a solid, practical contribution to CUA training methodology by combining on-policy state collection with PRM-guided step-level RL in a decoupled framework. The experimental evidence supports its effectiveness over controlled baselines, though the scale remains limited and the novelty is primarily in the system design rather than algorithmic innovation. The work is timely and addresses real bottlenecks, but falls short of a definitive advance due to modest absolute performance, limited evaluation scope, and lack of deeper analysis.
Generated May 29, 2026
Comparison History (15)
Paper 1 directly accelerates fundamental scientific discovery in materials science and chemistry. By integrating physical priors, reasoning, and RL into LLMs for crystal structure generation, it addresses a major bottleneck in discovering new materials. While Paper 2 offers valuable advancements in AI agent training for software automation, Paper 1's cross-disciplinary application and potential to impact critical areas like energy storage and advanced materials give it a broader and more profound scientific impact.
Paper 2 (OCE) is likely to have higher scientific impact: it introduces a novel geometric reframing of diffusion model editing (multiplicative, orthogonal transforms with closed-form updates) that directly addresses a known limitation of additive edits, shows strong methodological clarity, and scales to large multi-concept erasure quickly. Its applications to safety, IP/privacy, and controllable generation are immediate and broadly relevant across ML and generative media. Paper 1 is valuable for agent training, but depends on PRM quality and is more domain-specific to GUI agents, with narrower cross-field uptake.
PRO-CUA addresses a fundamental challenge in training computer use agents through a novel process-reward optimization framework that combines step-level reinforcement learning with process reward models. This has broad implications for AI agent development, automation, and GUI interaction—a rapidly growing field. The methodological innovation of decoupling on-policy interaction from optimization with dense credit assignment is significant. Paper 2, while competent, addresses a narrower domain (tourist mobility modeling in Tokyo) with incremental methodological contributions combining existing techniques (GPS priors, LLMs), limiting its broader scientific impact.
Paper 2 likely has higher scientific impact: it introduces a broadly applicable, inference-time constrained decoding framework for discrete diffusion with a principled primal-dual/KL-regularized formulation, supports multiple constraints, needs no retraining or extra model calls, and offers formal bounds—strong methodological rigor and cross-domain breadth (text, molecules, music). Paper 1 is timely and valuable for GUI agents, but its impact is more application-specific (CUAs) and relies on process reward models whose reliability/generalization may limit broad adoption compared to a general decoding method.
Paper 2 (COLAGUARD) likely has higher impact: it introduces a novel latent-reasoning guardrail that materially advances a widely relevant deployment bottleneck (safety robustness vs. inference cost) with strong, quantified gains (macro-F1 improvements plus large speed/token reductions) across many benchmarks and settings. The approach is broadly applicable to LLM safety infrastructure across products and domains, timely given widespread LLM deployment. Paper 1 is innovative for CUAs but is narrower in scope (GUI agents) and depends on PRM reliability and live rollouts, which may limit generalizability and adoption.
Paper 1 introduces a broadly applicable architectural decomposition (reactive execution, simulative planning via a world model, and learned self-regulation of when/how much to plan) and demonstrates substantial efficiency gains (large token savings) while matching much larger models across diverse reasoning tasks. This targets a timely bottleneck—reasoning cost/latency—and could generalize beyond planning to self-governed computation in agents, affecting multiple subfields (LLM reasoning, planning, agent design, efficiency). Paper 2 is strong and practical for GUI agents, but its impact is narrower and more application-specific.
PRO-CUA addresses a high-demand practical problem—training computer use agents via step-level reinforcement learning—with a clear, scalable framework that reduces distribution shift and improves credit assignment. Its real-world applicability to GUI automation gives it broad impact potential. Paper 2 offers rigorous theoretical analysis of compositional incoherence in multi-LLM systems, which is intellectually interesting but more niche. While Paper 2's formalization is novel, PRO-CUA's combination of methodological innovation, practical relevance, and timeliness in the rapidly growing CUA space gives it higher expected impact.
PRO-CUA addresses a fundamental training challenge for computer use agents with a novel process-reward optimization framework that combines step-level reinforcement learning with process reward models. This offers broad methodological contributions applicable beyond CUAs to general agent training. Paper 1 identifies an important security threat (sleeper attacks on LLM agents) with a solid benchmark, but is more narrowly focused on a specific attack vector. PRO-CUA's framework for dense credit assignment and on-policy training without expert trajectories has wider applicability and addresses a more fundamental bottleneck in the rapidly growing agent training field.
Paper 2 likely has higher impact: it targets rapidly growing computer-use agents with clear, near-term real-world automation applications and proposes a broadly applicable training framework (process-reward, step-level RL with reduced on-policy cost) that can transfer across GUIs and agent settings. Its contribution aligns with current trends (PRMs, dense credit assignment, agentic web tasks) and can influence both research and product deployments. Paper 1 is novel for knowledge editing, but the scope is narrower and impact depends on adoption of causal-editing benchmarks and integration into editing pipelines.
Paper 1 addresses a critical and highly timely bottleneck in the development of AI agents—training computer use agents via reinforcement learning. Its method offers immediate, high-impact real-world applications in digital workflow automation. While Paper 2 provides fascinating fundamental insights into LLM representations and cognitive science, Paper 1's approach resolves practical engineering constraints in agentic AI, leading to more direct and widespread technological adoption across domains.
Paper 2 likely has higher scientific impact due to a clearer, broadly applicable training algorithm for computer-use agents, addressing major bottlenecks (sparse rewards, credit assignment, costly interaction) with a decoupled rollout/optimization scheme using step-level process rewards and group-relative advantages, validated on live web benchmarks. This is timely and general across GUI/web automation, agentic RL, and LLM training. Paper 1 is novel for scientific multi-agent workflow integrity (semantic drift checkpoints), but appears more domain-specific and system-design heavy, with impact depending on adoption in scientific computing pipelines.
Paper 2 presents a transformative contribution to mathematical formalization at scale, producing a concrete, reusable artifact (45,000+ verified Lean 4 declarations across 26 textbooks) that enables automated verification of research-level mathematics. Its breadth of impact spans mathematics, formal verification, AI for math, and education. While Paper 1 makes a solid contribution to training computer use agents with process rewards, it is more incremental within the RL/agent training space. Paper 2's demonstration that large-scale autoformalisation is feasible represents a paradigm shift with far-reaching implications for how mathematics is verified and produced.
Paper 2 likely has higher impact due to broader applicability and timeliness: process-reward, step-level RL for computer-use agents targets a fast-growing area (GUI/web agents) with clear real-world automation value. Methodologically, decoupling live interaction from optimization via PRM-guided dense feedback and group-relative advantages addresses key RL bottlenecks (sparse rewards, credit assignment, distribution shift) and could generalize across agentic tasks beyond GUIs. Paper 1 is novel for interactive ASR and semantic evaluation, but its impact is more domain-specific to speech recognition and depends on LLM-judge reliability.
PRO-CUA addresses a critical bottleneck in training computer use agents through a novel process-reward optimization framework with step-level reinforcement learning. It tackles fundamental challenges (sparse rewards, credit assignment, distribution shift) in a rapidly growing field of autonomous GUI agents. The practical implications for automating digital workflows are substantial. Paper 2's StreamSynth, while interesting in framing data synthesis as experience-driven, addresses a comparatively narrower problem with less transformative potential. PRO-CUA's methodology combining PRMs with on-policy learning is more technically innovative and timely given the surge in agentic AI research.
Paper 2 likely has higher impact due to broader applicability and timeliness: step-level RL with process reward models for computer-use agents targets a rapidly expanding area (general-purpose GUI/web automation) with clear near-term product and research impact across AI, HCI, and robotics-style sequential decision-making. Its methodological contribution (decoupling interaction from optimization, dense credit assignment via PRM, reducing distribution shift) is general and extensible. Paper 1 is novel and rigorous within materials synthesis reasoning, but its impact is more domain-specific and benchmark-centered, with narrower cross-field reach.