OpenComputer: Verifiable Software Worlds for Computer-Use Agents
Jinbiao Wei, Qianran Ma, Yilun Zhao, Xiao Zhou, Kangqi Ni, Guo Gan, Arman Cohan
Abstract
We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. Frontier agents struggle with end-to-end completion despite partial progress, and open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.
AI Impact Assessments
Scientific Impact Assessment: OpenComputer
Core Contribution
OpenComputer addresses two coupled bottlenecks in computer-use agent research: (1) the high cost of manually constructing realistic desktop task environments, and (2) the unreliability of evaluation methods, particularly LLM-as-judge approaches. The paper's key insight is making verification the organizing principle of task and environment construction, rather than treating it as an afterthought. The framework synthesizes executable desktop tasks paired with hard-coded programmatic verifiers that inspect actual application state (via SQLite databases, debugging protocols, file parsing, D-Bus, etc.) rather than relying on screenshot-based proxies.
The concrete deliverable is a benchmark of 33 desktop applications and 1,000 tasks with an average of 6.9 machine-checkable criteria per task and 17.7 verifier endpoints per application, spanning browsers, office tools, creative software, IDEs, and communication apps.
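To make the idea of state-based verification with per-task criteria concrete, here is a minimal sketch. It assumes an application that stores its state in a SQLite database. The `bookmarks` table, the criterion names, and the `score_task` helper are all hypothetical illustrations, not OpenComputer's actual verifier API. Each criterion inspects the stored state directly, and the reward is the fraction of criteria that pass.

```python
import sqlite3
from typing import Callable

# Hypothetical criterion type: a label plus a check over the app's database.
Criterion = tuple[str, Callable[[sqlite3.Connection], bool]]


def score_task(conn: sqlite3.Connection, criteria: list[Criterion]):
    """Run every check and return (partial-credit reward, per-criterion results)."""
    results = {name: bool(check(conn)) for name, check in criteria}
    reward = sum(results.values()) / len(results) if results else 0.0
    return reward, results


def count(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> int:
    """Helper: run a COUNT(*) query and return the integer result."""
    return conn.execute(sql, params).fetchone()[0]


def make_demo_db() -> sqlite3.Connection:
    """Stand-in for an application's state after an agent has acted (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bookmarks (title TEXT, url TEXT, folder TEXT)")
    conn.execute(
        "INSERT INTO bookmarks VALUES ('Docs', 'https://example.com/docs', 'Work')"
    )
    conn.commit()
    return conn


URL = "https://example.com/docs"
criteria: list[Criterion] = [
    ("bookmark_exists",
     lambda c: count(c, "SELECT COUNT(*) FROM bookmarks WHERE url = ?", (URL,)) == 1),
    ("in_work_folder",
     lambda c: count(c, "SELECT COUNT(*) FROM bookmarks WHERE url = ? AND folder = ?",
                     (URL, "Work")) == 1),
    ("title_renamed",
     lambda c: count(c, "SELECT COUNT(*) FROM bookmarks WHERE title = ?",
                     ("Documentation",)) == 1),
]

reward, results = score_task(make_demo_db(), criteria)
print(reward, results)
```

In this toy state the bookmark exists and sits in the right folder but was never renamed, so two of three criteria pass. A binary success metric would record a failure, while the checklist records which sub-goal was missed.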
Methodological Rigor
Strengths in methodology:
Weaknesses in methodology:
Potential Impact
Evaluation infrastructure: OpenComputer could become an important evaluation standard for desktop automation agents, filling a gap where OSWorld (with ~370 tasks) has been the primary benchmark. The dramatic score drops observed (GUI-OWL-1.5-8B: 52.3% OSWorld → 5.7% OpenComputer) suggest the benchmark exposes genuine capability gaps rather than trivially overlapping with existing evaluations.
Training signal: The framework's design as infrastructure for RL and rejection sampling (machine-checkable partial-credit rewards) positions it well for the emerging paradigm of training agents with verifiable environmental feedback. This could have outsized impact if the community adopts it for post-training.
Verification methodology: The principle of "verification-first" benchmark construction could influence adjacent fields (web agents, mobile agents, software engineering agents). The self-evolving verification concept—using disagreement between programmatic and LLM judges to debug checkers—is a reusable pattern.
Broader applications: The GUI vs. CLI comparison (Section 5.2) provides useful data for the agent design community, showing that CLI agents trade accuracy for 2-4x speed improvements.
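The self-evolving verification pattern mentioned under "Verification methodology" can be sketched as a disagreement queue. The code below is an assumption-laden illustration: the `Verdict` record, the function names, and the sample data are hypothetical, and the paper's actual pipeline may differ. It collects cases where the programmatic verifier and the LLM judge disagree, then ranks criteria by how often they appear, so the most suspicious checkers are reviewed first.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One criterion outcome as seen by both evaluators (hypothetical record)."""
    task_id: str
    criterion: str
    verifier_pass: bool  # hard-coded programmatic verifier
    judge_pass: bool     # LLM-as-judge


def find_disagreements(verdicts: list[Verdict]) -> list[Verdict]:
    """Keep only the cases where the two evaluators differ."""
    return [v for v in verdicts if v.verifier_pass != v.judge_pass]


def rank_suspect_criteria(disagreements: list[Verdict]) -> list[tuple[str, int]]:
    """Order criteria by disagreement count, most frequent first."""
    return Counter(v.criterion for v in disagreements).most_common()


# Fabricated sample data, for illustration only.
verdicts = [
    Verdict("task-001", "cell_B2_value", verifier_pass=False, judge_pass=True),
    Verdict("task-002", "cell_B2_value", verifier_pass=False, judge_pass=True),
    Verdict("task-003", "file_saved", verifier_pass=True, judge_pass=True),
    Verdict("task-004", "file_saved", verifier_pass=True, judge_pass=False),
]

queue = find_disagreements(verdicts)
suspects = rank_suspect_criteria(queue)
print(suspects)
```

A disagreement does not say which side is wrong. It only marks a case worth inspecting, after which either the checker is repaired or the judge's verdict is discarded.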
Timeliness & Relevance
This work is highly timely. Computer-use agents are rapidly proliferating (Claude Computer Use, GPT-5 with computer control, open-source alternatives), but evaluation infrastructure has lagged. The paper correctly identifies that LLM-as-judge evaluation creates a problematic dependency—using one model to evaluate another—especially when success depends on non-visual state. The benchmark arrives at a moment when the community needs scalable, trustworthy evaluation for desktop agents, and when training pipelines increasingly demand verifiable rewards.
Strengths
1. Verification-first design philosophy is the paper's strongest conceptual contribution. By making verifiability a constraint on task generation rather than an evaluation add-on, it ensures benchmark quality by construction.
2. Breadth of application coverage (33 apps across 6 categories) with diverse inspection channels demonstrates the framework's generality.
3. Concrete evidence against LLM-as-judge for fine-grained desktop evaluation, with persuasive failure mode analysis (spreadsheet cell boundaries, terminal scrollback).
4. Partial credit scoring via checklist-based rewards enables more informative agent comparison than binary success/failure.
5. Open-source release with Docker-based reproducibility and cloud deployment support maximizes potential adoption.
Limitations
1. Scalability of verifier construction remains unclear. While the paper emphasizes automation, verifier generation still requires substantial per-application engineering (inspection channel selection, endpoint implementation, testing). Adding a 34th application would likely require non-trivial effort.
2. Coverage gaps acknowledged but unresolved. The paper honestly notes that 17 tasks required visual/geometric judgments beyond programmatic verification. For creative applications (Draw.io, design tools), this limitation could grow as task complexity increases.
3. Benchmark difficulty calibration. The paper filters for "upper half difficulty" but doesn't provide a rigorous difficulty calibration methodology. The dramatic drops from OSWorld scores could partially reflect task distribution differences rather than pure difficulty increases.
4. Limited analysis of verifier false positives. The paper focuses on human-verifier agreement but doesn't deeply analyze cases where verifiers pass but humans disagree—which would indicate over-permissive verification.
5. No training experiments. While the paper positions OpenComputer as training infrastructure, no RL or SFT experiments are presented, leaving this potential impact unvalidated.
6. Temporal fragility. Application updates could break verifiers that depend on specific database schemas, API behaviors, or file formats—the darktable example in Appendix A illustrates exactly this risk.
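One way to surface the temporal fragility noted in the last limitation is a schema guard that runs before a verifier queries an application database. This is a sketch under stated assumptions: the `images` table and its columns are invented for illustration and are not any real application's schema, and `missing_columns` is a hypothetical helper rather than part of OpenComputer.

```python
import sqlite3


def missing_columns(conn: sqlite3.Connection, table: str, expected: list[str]) -> list[str]:
    """Return the expected columns that are absent from `table`.

    `table` is interpolated into the SQL, so it must come from trusted
    verifier code, never from agent or user input.
    """
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    present = {row[1] for row in rows}  # index 1 of each row is the column name
    return sorted(set(expected) - present)


# Simulate an application update that dropped a column the verifier relies on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, filename TEXT)")

missing = missing_columns(conn, "images", ["id", "filename", "rating"])
if missing:
    # Report a verifier error rather than scoring the agent as having failed.
    print(f"schema drift detected, missing columns: {missing}")
```

Separating "the verifier cannot run" from "the agent failed the task" keeps schema drift from silently lowering agent scores.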
Overall Assessment
OpenComputer makes a solid engineering and methodological contribution to the computer-use agent evaluation ecosystem. Its verification-first philosophy is well-motivated and empirically validated. The benchmark exposes meaningful capability gaps in current agents. However, the work is primarily infrastructural—it builds better evaluation tools rather than advancing agent capabilities or providing deep theoretical insights. Its ultimate impact depends heavily on community adoption and whether the framework proves maintainable as applications evolve.
Generated May 20, 2026
Comparison History (21)
Paper 2 introduces a practical, large-scale verifiable framework and benchmark for evaluating computer-use agents, addressing a critical bottleneck in a rapidly growing field. Benchmarks typically drive significant empirical progress and garner high citations. While Paper 1 provides valuable theoretical clarification regarding Transformer Turing-completeness, Paper 2's direct utility for evaluating and training AI agents gives it a broader and more immediate potential for widespread scientific impact.
Paper 2 presents a counterintuitive and thought-provoking finding—that perceptual noise can improve embodied LLM performance by disrupting repetitive action loops—which challenges fundamental assumptions about observation fidelity in robotics/AI. This insight has broad implications for how we design and evaluate embodied AI systems. While Paper 1 (OpenComputer) is a solid engineering contribution providing benchmarking infrastructure for computer-use agents, Paper 2 offers deeper scientific insight into LLM reasoning failures, with potential to influence evaluation methodology and agent design across robotics, cognitive science, and AI research more broadly.
Paper 2 addresses a fundamental and widely applicable question about LLM agent design—whether stacking components always helps—with rigorous factorial experimental methodology, statistical analysis (Shapley values, submodularity tests, replication across models). Its finding that 'more is not always better' challenges a pervasive assumption in the rapidly growing agent-building community, offering actionable guidance for practitioners. Paper 1, while valuable as an engineering benchmark contribution, is more incremental (another evaluation framework) with narrower applicability. Paper 2's insight has broader theoretical and practical impact across all agent system design.
Paper 2 likely has higher impact: it introduces an infrastructure-level, verifier-grounded framework (verifiable software worlds, task generation, evaluation harness) spanning 33 real apps and 1,000 tasks, enabling reproducible, auditable evaluation and training signals for computer-use agents—an area of high current relevance. Its methodological rigor and broad applicability (agent evaluation, reward design, benchmarking, safety/verification) support cross-field adoption. Paper 1 is novel for aggregate-only prompt optimization, but is narrower in scope and closer to incremental advances in prompt/BO tooling compared to a new benchmark+verification ecosystem.
OpenComputer addresses the broader and more impactful problem of general-purpose computer-use agents with verifiable evaluation across 33 desktop applications. Its contributions—state verifiers, self-evolving verification, task generation, and trajectory evaluation—have wider applicability across AI agent research. The framework exposes fundamental gaps in frontier and open-source models for robust computer automation, which is a highly timely research direction. EngiAI, while rigorous, targets a narrower engineering design niche with a domain-specific multi-agent benchmark that will impact fewer research communities.
Paper 1 addresses a fundamental and broadly relevant problem in RLHF/RL post-training: reward hacking in rubric-based settings. It provides a principled framework decomposing sources of divergence (verifier failure vs. rubric-design limitations), introduces a novel verifier-free diagnostic (self-internalization gap), and yields insights applicable across any domain using rubric-based RL. Paper 2 is a solid engineering contribution (benchmark/framework for computer-use agents) but is more narrowly scoped. Paper 1's findings about the limits of stronger verification and rubric design have deeper theoretical implications for the rapidly growing RL post-training field.
OpenComputer presents a novel, comprehensive framework for evaluating computer-use agents with verifiable benchmarks across 33 applications and 1,000 tasks. It addresses a timely and broadly impactful problem—reliable evaluation of AI agents interacting with real software—with clear practical applications and methodological contributions (verifier-grounded evaluation, self-evolving verification, partial-credit rewards). Paper 1, while a useful case study, is narrow in scope: it documents limitations of a single AI-assisted theorem proving attempt on one problem, contributing primarily an anecdotal artifact rather than a generalizable methodology or tool.
Paper 2 likely has higher impact because it introduces broadly useful infrastructure: verifiable, auditable software worlds with structured state verifiers, task generation, and partial-credit rewards across 33 real applications and 1,000 tasks. This enables reproducible evaluation and training for computer-use agents, addressing a timely bottleneck (reliable benchmarks and reward signals) with clear real-world relevance. Its methodological contribution (verifier-grounded evaluation vs LLM judges, self-evolving verification) can influence multiple subfields (agent RL, benchmarking, HCI, software engineering). Paper 1 is strong but more incremental within RL/distillation.
Paper 1 likely has higher impact: it introduces a broadly usable, verifier-grounded benchmark infrastructure for real desktop applications (33 apps, 1,000 tasks) with auditable evaluation and partial-credit rewards—addressing a central bottleneck in computer-use agents (reliable, fine-grained evaluation). Its methodology (state verifiers + self-improving verification + task generation + full-trajectory harness) is highly actionable for many labs and can become shared community infrastructure. Paper 2 is a strong, timely RLVR improvement, but is narrower in scope and more incremental compared to a new evaluation ecosystem for agentic computing.
Paper 2 is likely higher impact due to its broadly applicable, reusable infrastructure for verifiable evaluation of computer-use agents across 33 real applications and 1,000 tasks. Verifier-grounded, auditable rewards address a central bottleneck in agent research (reliable evaluation), with immediate relevance to autonomy, safety, benchmarking, and reinforcement learning. Its methodology (state verifiers, trajectory logging, partial-credit scoring, self-evolving verification) is general and could become a standard testbed across academia and industry. Paper 1 is strong and timely but narrower to survey workflows/disaster contexts and incremental over established imputation baselines.
Paper 1 addresses a critical bottleneck in the rapidly growing field of agentic AI by providing a robust, scalable framework and benchmark for evaluating computer-use agents. Infrastructure and evaluation frameworks typically have broader and longer-lasting scientific impact across multiple disciplines compared to specific attack methodologies like the jailbreak technique presented in Paper 2.
Paper 1 likely has higher impact: it introduces a verifier-grounded framework and infrastructure (verifiers, self-improving verification, task generation, auditable evaluation) spanning 33 real desktop apps and 1,000 tasks, addressing a central, timely bottleneck in computer-use agents—reliable evaluation and grounding in real application state. Its methodology and artifacts can broadly influence agent benchmarking, safety/reliability, and human-computer interaction. Paper 2 is novel and impressive for efficient test-time compute scaling on puzzle-like domains, but its demonstrated scope is narrower and may transfer less directly to real-world deployment compared with a general evaluation/verification substrate.
Shepherd introduces a novel, formalized runtime infrastructure with highly efficient execution tracing and state forking. Its mechanized foundation and demonstrated significant performance gains across multiple domains (intervention, meta-optimization, RL training) represent a deeper methodological innovation with broader algorithmic applicability than OpenComputer, which primarily serves as an evaluation framework.
Paper 1 has higher impact potential due to a concrete, novel infrastructure contribution: verifiable, state-based evaluation for real desktop applications with scalable task generation and auditable rewards across 33 apps/1,000 tasks. This directly enables more rigorous benchmarking and training for computer-use agents, a broadly relevant and timely area in AI and HCI, with clear real-world applications and methodological rigor (ground-truth verifiers vs LLM judges). Paper 2 is valuable and timely but is primarily an audit/survey; its impact is more incremental and narrower to finance research practices.
Paper 2 likely has higher impact because it introduces a broadly useful, verifiable evaluation/training infrastructure for computer-use agents across 33 real applications and 1,000 tasks, addressing a timely bottleneck: reliable, auditable benchmarking beyond LLM-as-judge. Its framework (verifiers, self-improving verification, task generation, partial-credit rewards) can enable reproducible research and accelerate progress across agent learning, HCI, software engineering, and safety. Paper 1 is novel and mechanistically rigorous for MLLM hallucinations, but its intervention is narrower in scope and primarily impacts multimodal generation robustness rather than enabling a cross-field platform.
OpenComputer addresses a critical infrastructure gap for evaluating computer-use agents—a rapidly growing area in AI. Its contribution of verifiable benchmarks across 33 applications with 1,000 tasks provides a foundational evaluation framework that could broadly impact agent development, reinforcement learning from environment feedback, and AI safety. The finding that hard-coded verifiers outperform LLM-as-judge is significant for the field. Paper 2, while methodologically sound, addresses a narrower intersection (strategic classification + tabular foundation models) with more limited breadth of impact and community interest.
Paper 2 is likely higher impact: it introduces a broadly usable, verifier-grounded evaluation and task framework for computer-use agents across 33 real applications and 1,000 tasks, addressing a central bottleneck (reliable, auditable measurement) with clear real-world relevance and potential to become a community benchmark/infrastructure. The methodology emphasizes verifiability, trajectory logging, and partial-credit rewards, improving rigor over LLM-as-judge. Paper 1 is novel for multi-objective skill/prompt optimization, but its scope is narrower and impact depends on adoption within agent-skill tuning workflows.
Paper 1 has higher impact potential: it introduces a broadly usable, verifier-grounded framework for auditable evaluation and task generation for computer-use agents across 33 real applications, addressing a central bottleneck in agent benchmarking (reliable, fine-grained success criteria). Its methodology (structured state verifiers, self-improving verification, full-trajectory logging, partial-credit rewards) is likely to influence evaluation practices across AI agents, HCI, and software engineering. Paper 2 is a useful incremental improvement in deepfake generalization on a single dataset with modest gains, with narrower scope and likely faster obsolescence.
OpenComputer addresses a fundamental infrastructure challenge for evaluating computer-use agents—a rapidly growing research area driven by LLM advances. Its contribution of verifiable benchmarking across 33 applications with rigorous state-based verification fills a critical methodological gap, likely enabling and shaping future research broadly. Paper 2, while technically solid with real-world deployment results, represents an incremental advance in auto-bidding within a narrower domain (digital advertising). OpenComputer's broader applicability across AI agent research and its potential to become a standard evaluation framework gives it higher estimated impact.
Paper 1 addresses a fundamental question about how GenAI affects productivity inequality—a topic with broad implications across economics, education, management, and policy. Its RCT methodology, novel construct of AI Interaction Competence (AIC), and actionable finding that scaffolding reduces variance make it highly relevant and timely. Paper 2 is a solid engineering contribution (benchmarking framework for computer-use agents) but has narrower impact, primarily within the AI agents community. Paper 1's findings about human-AI complementarities will likely influence organizational adoption strategies, workforce training, and educational policy at scale.