OpenComputer: Verifiable Software Worlds for Computer-Use Agents

Jinbiao Wei, Qianran Ma, Yilun Zhao, Xiao Zhou, Kangqi Ni, Guo Gan, Arman Cohan

#423 of 2292 · Artificial Intelligence
Tournament Score: 1482±45
Win Rate: 76% (16 wins, 5 losses, 21 matches)
Rating: 7/10

Abstract

We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. Frontier agents struggle with end-to-end completion despite partial progress, and open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: OpenComputer

Core Contribution

OpenComputer addresses two coupled bottlenecks in computer-use agent research: (1) the high cost of manually constructing realistic desktop task environments, and (2) the unreliability of evaluation methods, particularly LLM-as-judge approaches. The paper's key insight is making verification the organizing principle of task and environment construction, rather than treating it as an afterthought. The framework synthesizes executable desktop tasks paired with hard-coded programmatic verifiers that inspect actual application state (via SQLite databases, debugging protocols, file parsing, D-Bus, etc.) rather than relying on screenshot-based proxies.
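
To make the idea of a state-inspecting verifier concrete, here is a minimal sketch of one inspection endpoint that reads an application's SQLite store directly instead of judging a screenshot. It is an illustration only: the `bookmarks` table, the `check_bookmark` function, and the demo database are hypothetical and are not taken from the paper or from any real application's schema.

```python
import os
import sqlite3
import tempfile


def check_bookmark(db_path: str, url: str, expected_title: str) -> dict:
    """Hypothetical inspection endpoint: read bookmark state from the app's database."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT title FROM bookmarks WHERE url = ?", (url,)
        ).fetchone()
    finally:
        conn.close()
    exists = row is not None
    return {"exists": exists, "title_matches": exists and row[0] == expected_title}


def demo() -> dict:
    """Build a throwaway database that stands in for real application state."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "app_state.db")  # hypothetical file name
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE bookmarks (url TEXT PRIMARY KEY, title TEXT)")
        conn.execute(
            "INSERT INTO bookmarks VALUES (?, ?)",
            ("https://example.org", "Example"),
        )
        conn.commit()
        conn.close()
        return check_bookmark(db_path, "https://example.org", "Example")
```

Returning structured fields rather than a single boolean keeps each check auditable, which fits the review's emphasis on fine-grained application state.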

The concrete deliverable is a benchmark of 33 desktop applications and 1,000 tasks with an average of 6.9 machine-checkable criteria per task and 17.7 verifier endpoints per application, spanning browsers, office tools, creative software, IDEs, and communication apps.
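
A checklist of machine-checkable criteria can be turned into a partial-credit reward in a few lines. The sketch below assumes equal weighting of criteria and a strict "all criteria pass" rule for task success; both are assumptions made for illustration, since the review does not state the paper's exact aggregation rule. The `Criterion` class, `score_task`, and the example state are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    check: Callable[[dict], bool]  # predicate over inspected application state


def score_task(state: dict, criteria: list) -> dict:
    """Evaluate every criterion and report an equal-weight partial-credit reward."""
    if not criteria:
        raise ValueError("a task needs at least one criterion")
    results = {c.name: bool(c.check(state)) for c in criteria}
    passed = sum(results.values())
    return {
        "results": results,                    # per-criterion audit trail
        "reward": passed / len(criteria),      # assumed equal weighting
        "success": passed == len(criteria),    # assumed all-or-nothing success
    }


# Hypothetical state as it might be returned by verifier endpoints.
state = {"file_saved": True, "cell_B2": "42", "font_bold": False}
criteria = [
    Criterion("file saved", lambda s: s["file_saved"]),
    Criterion("B2 holds 42", lambda s: s["cell_B2"] == "42"),
    Criterion("header is bold", lambda s: s["font_bold"]),
]
report = score_task(state, criteria)
```

Here two of three criteria pass, so the agent receives partial credit without being counted as an end-to-end success, which mirrors the "partial progress" pattern the abstract describes.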

Methodological Rigor

Strengths in methodology:

  • The four-component architecture (verifier generation → self-evolving verification → task synthesis → evaluation harness) is well-motivated and logically coherent. Each stage addresses a clear failure mode.
  • The self-evolving verification layer is a particularly thoughtful contribution. The ablation study (Section 5.3) demonstrates concrete value: 68 of 76 checker-side errors were repaired, improving human-checker agreement from 85.2% to 94.1%. The breakdown showing most fixes occur in round 1 (47/68) suggests the approach is efficient.
  • The 120-task human annotation study comparing hard-coded verifiers (94.1% task-level agreement) vs. LLM-as-judge (79.2%) provides strong evidence for the core claim. The per-criterion agreement numbers (97.3% vs. 92.2%) reinforce this.
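
The figures quoted above can be sanity-checked, and the two agreement granularities illustrated, with a short sketch. The arithmetic uses only numbers stated in the review. The agreement functions assume that a task-level verdict is "all criteria pass", which is an assumption about the metric definition, and the toy labels are invented for illustration.

```python
# Arithmetic on the figures quoted in the review.
repair_rate = round(100 * 68 / 76, 1)     # share of checker-side errors repaired
round1_share = round(100 * 47 / 68, 1)    # share of repairs made in round 1
agreement_gain = round(94.1 - 85.2, 1)    # percentage-point gain in agreement


def criterion_agreement(human: dict, verifier: dict) -> float:
    """Fraction of individual criteria on which the two label sets match."""
    pairs = [
        (h, v)
        for task_id in human
        for h, v in zip(human[task_id], verifier[task_id])
    ]
    return sum(h == v for h, v in pairs) / len(pairs)


def task_agreement(human: dict, verifier: dict) -> float:
    """Fraction of tasks whose overall verdict matches (assumed: all criteria pass)."""
    same = sum(all(human[t]) == all(verifier[t]) for t in human)
    return same / len(human)


# Toy labels (hypothetical): per-task lists of per-criterion pass/fail.
human = {"t1": [True, True], "t2": [True, False], "t3": [False, False]}
verifier = {"t1": [True, True], "t2": [True, True], "t3": [False, True]}
crit = criterion_agreement(human, verifier)
task = task_agreement(human, verifier)
```

Reporting both granularities is useful because a single criterion mismatch may or may not flip the task-level verdict, as tasks `t2` and `t3` in the toy data show.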

Weaknesses in methodology:

  • The 120-task comparison set is small relative to the 1,000-task benchmark, and it is unclear how representative the sample is of the full range of applications.
  • The paper doesn't provide inter-annotator agreement metrics for the human evaluation, which would strengthen claims about the gold standard.
  • The self-evolving verification relies on a "strong agent" (seemingly GPT-class) for calibration runs, creating a dependency on frontier model capabilities. The paper doesn't fully discuss what happens when the calibration agent itself fails on tasks it should solve.
  • The verifier generation pipeline itself relies heavily on LLM-based code generation (described as "the agent" implementing endpoints), but the reliability and human oversight of this process are underspecified.

Potential Impact

    Evaluation infrastructure: OpenComputer could become an important evaluation standard for desktop automation agents, filling a gap where OSWorld (with ~370 tasks) has been the primary benchmark. The dramatic score drops observed (GUI-OWL-1.5-8B: 52.3% OSWorld → 5.7% OpenComputer) suggest the benchmark exposes genuine capability gaps rather than trivially overlapping with existing evaluations.

    Training signal: The framework's design as infrastructure for RL and rejection sampling (machine-checkable partial-credit rewards) positions it well for the emerging paradigm of training agents with verifiable environmental feedback. This could have outsized impact if the community adopts it for post-training.

    Verification methodology: The principle of "verification-first" benchmark construction could influence adjacent fields (web agents, mobile agents, software engineering agents). The self-evolving verification concept—using disagreement between programmatic and LLM judges to debug checkers—is a reusable pattern.

    Broader applications: The GUI vs. CLI comparison (Section 5.2) provides useful data for the agent design community, showing that CLI agents trade accuracy for 2-4x speed improvements.
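
The disagreement-driven debugging pattern noted under "Verification methodology" can be sketched as a triage step that collects every criterion on which the programmatic verifier and the LLM judge differ, so that those checkers can be reviewed first. This is a simplified illustration, not the paper's pipeline; `find_disagreements` and the example verdicts are hypothetical.

```python
def find_disagreements(verifier: dict, judge: dict) -> list:
    """Return sorted (task_id, criterion) pairs where the two evaluators differ.

    A criterion the judge did not score is also flagged, since a missing
    verdict (None) never equals a boolean verifier verdict.
    """
    flagged = []
    for task_id, checks in verifier.items():
        for criterion, verdict in checks.items():
            if judge.get(task_id, {}).get(criterion) != verdict:
                flagged.append((task_id, criterion))
    return sorted(flagged)


# Hypothetical verdicts keyed by task, then by criterion name.
verifier_verdicts = {
    "task-1": {"file saved": True, "B2 holds 42": False},
    "task-2": {"reply sent": True},
}
judge_verdicts = {
    "task-1": {"file saved": True, "B2 holds 42": True},
    "task-2": {"reply sent": True},
}
review_queue = find_disagreements(verifier_verdicts, judge_verdicts)
```

A disagreement does not say which side is wrong; it only marks where a human or a repair step should look.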

Timeliness & Relevance

    This work is highly timely. Computer-use agents are rapidly proliferating (Claude Computer Use, GPT-5 with computer control, open-source alternatives), but evaluation infrastructure has lagged. The paper correctly identifies that LLM-as-judge evaluation creates a problematic dependency—using one model to evaluate another—especially when success depends on non-visual state. The benchmark arrives at a moment when the community needs scalable, trustworthy evaluation for desktop agents, and when training pipelines increasingly demand verifiable rewards.

Strengths

    1. Verification-first design philosophy is the paper's strongest conceptual contribution. By making verifiability a constraint on task generation rather than an evaluation add-on, it ensures benchmark quality by construction.

    2. Breadth of application coverage (33 apps across 6 categories) with diverse inspection channels demonstrates the framework's generality.

    3. Concrete evidence against LLM-as-judge for fine-grained desktop evaluation, with persuasive failure mode analysis (spreadsheet cell boundaries, terminal scrollback).

    4. Partial credit scoring via checklist-based rewards enables more informative agent comparison than binary success/failure.

    5. Open-source release with Docker-based reproducibility and cloud deployment support maximizes potential adoption.

Limitations

    1. Scalability of verifier construction remains unclear. While the paper emphasizes automation, the verifier generation still requires substantial per-application engineering (inspection channel selection, endpoint implementation, testing). Adding a 34th application likely requires non-trivial effort.

    2. Coverage gaps acknowledged but unresolved. The paper honestly notes that 17 tasks required visual/geometric judgments beyond programmatic verification. For creative applications (Draw.io, design tools), this limitation could grow as task complexity increases.

    3. Benchmark difficulty calibration. The paper filters for "upper half difficulty" but doesn't provide a rigorous difficulty calibration methodology. The dramatic drops from OSWorld scores could partially reflect task distribution differences rather than pure difficulty increases.

    4. Limited analysis of verifier false positives. The paper focuses on human-verifier agreement but doesn't deeply analyze cases where verifiers pass but humans disagree—which would indicate over-permissive verification.

    5. No training experiments. While the paper positions OpenComputer as training infrastructure, no RL or SFT experiments are presented, leaving this potential impact unvalidated.

    6. Temporal fragility. Application updates could break verifiers that depend on specific database schemas, API behaviors, or file formats—the darktable example in Appendix A illustrates exactly this risk.
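
One common mitigation for the schema-drift risk in item 6 is to have each verifier confirm the columns it depends on before trusting its own result. The sketch below is an assumption about how such a guard could look, not something the paper describes; the `images` table and its columns are hypothetical and do not reflect any real application's schema.

```python
import sqlite3

# Hypothetical columns a verifier expects to find, keyed by table name.
EXPECTED_COLUMNS = {"images": {"id", "filename", "flags"}}


def missing_columns(conn: sqlite3.Connection, expected: dict) -> dict:
    """Map each table to the expected columns that are absent from the database."""
    missing = {}
    for table, columns in expected.items():
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        # Table names come from the trusted constant above, not from user input.
        found = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        gap = columns - found
        if gap:
            missing[table] = gap
    return missing


# Simulate an application update that dropped the "flags" column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, filename TEXT)")
drift = missing_columns(conn, EXPECTED_COLUMNS)
conn.close()
```

A non-empty result lets the harness report "verifier out of date" instead of silently scoring the agent as having failed.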

Overall Assessment

    OpenComputer makes a solid engineering and methodological contribution to the computer-use agent evaluation ecosystem. Its verification-first philosophy is well-motivated and empirically validated. The benchmark exposes meaningful capability gaps in current agents. However, the work is primarily infrastructural—it builds better evaluation tools rather than advancing agent capabilities or providing deep theoretical insights. Its ultimate impact depends heavily on community adoption and whether the framework proves maintainable as applications evolve.

    Rating: 7/10
    Significance 7.5 · Rigor 7 · Novelty 6.5 · Clarity 8

    Generated May 20, 2026

    Comparison History (21)

    vs. Position: The Turing-Completeness of Real-World Autoregressive Transformers Relies Heavily on Context Management
    gemini-3.1 · 5/20/2026

    Paper 2 introduces a practical, large-scale verifiable framework and benchmark for evaluating computer-use agents, addressing a critical bottleneck in a rapidly growing field. Benchmarks typically drive significant empirical progress and garner high citations. While Paper 1 provides valuable theoretical clarification regarding Transformer Turing-completeness, Paper 2's direct utility for evaluating and training AI agents gives it a broader and more immediate potential for widespread scientific impact.

    vs. Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving
    claude-opus-4.6 · 5/20/2026

    Paper 2 presents a counterintuitive and thought-provoking finding—that perceptual noise can improve embodied LLM performance by disrupting repetitive action loops—which challenges fundamental assumptions about observation fidelity in robotics/AI. This insight has broad implications for how we design and evaluate embodied AI systems. While Paper 1 (OpenComputer) is a solid engineering contribution providing benchmarking infrastructure for computer-use agents, Paper 2 offers deeper scientific insight into LLM reasoning failures, with potential to influence evaluation methodology and agent design across robotics, cognitive science, and AI research more broadly.

    vs. More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding
    claude-opus-4.6 · 5/20/2026

    Paper 2 addresses a fundamental and widely applicable question about LLM agent design—whether stacking components always helps—with rigorous factorial experimental methodology, statistical analysis (Shapley values, submodularity tests, replication across models). Its finding that 'more is not always better' challenges a pervasive assumption in the rapidly growing agent-building community, offering actionable guidance for practitioners. Paper 1, while valuable as an engineering benchmark contribution, is more incremental (another evaluation framework) with narrower applicability. Paper 2's insight has broader theoretical and practical impact across all agent system design.

    vs. Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts
    gpt-5.2 · 5/20/2026

    Paper 2 likely has higher impact: it introduces an infrastructure-level, verifier-grounded framework (verifiable software worlds, task generation, evaluation harness) spanning 33 real apps and 1,000 tasks, enabling reproducible, auditable evaluation and training signals for computer-use agents—an area of high current relevance. Its methodological rigor and broad applicability (agent evaluation, reward design, benchmarking, safety/verification) support cross-field adoption. Paper 1 is novel for aggregate-only prompt optimization, but is narrower in scope and closer to incremental advances in prompt/BO tooling compared to a new benchmark+verification ecosystem.

    vs. EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design
    claude-opus-4.6 · 5/20/2026

    OpenComputer addresses the broader and more impactful problem of general-purpose computer-use agents with verifiable evaluation across 33 desktop applications. Its contributions—state verifiers, self-evolving verification, task generation, and trajectory evaluation—have wider applicability across AI agent research. The framework exposes fundamental gaps in frontier and open-source models for robust computer automation, which is a highly timely research direction. EngiAI, while rigorous, targets a narrower engineering design niche with a domain-specific multi-agent benchmark that will impact fewer research communities.

    vs. Reward Hacking in Rubric-Based Reinforcement Learning
    claude-opus-4.6 · 5/20/2026

    Paper 1 addresses a fundamental and broadly relevant problem in RLHF/RL post-training: reward hacking in rubric-based settings. It provides a principled framework decomposing sources of divergence (verifier failure vs. rubric-design limitations), introduces a novel verifier-free diagnostic (self-internalization gap), and yields insights applicable across any domain using rubric-based RL. Paper 2 is a solid engineering contribution (benchmark/framework for computer-use agents) but is more narrowly scoped. Paper 1's findings about the limits of stronger verification and rubric design have deeper theoretical implications for the rapidly growing RL post-training field.

    vs. Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem
    claude-opus-4.6 · 5/20/2026

    OpenComputer presents a novel, comprehensive framework for evaluating computer-use agents with verifiable benchmarks across 33 applications and 1,000 tasks. It addresses a timely and broadly impactful problem—reliable evaluation of AI agents interacting with real software—with clear practical applications and methodological contributions (verifier-grounded evaluation, self-evolving verification, partial-credit rewards). Paper 1, while a useful case study, is narrow in scope: it documents limitations of a single AI-assisted theorem proving attempt on one problem, contributing primarily an anecdotal artifact rather than a generalizable methodology or tool.

    vs. What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
    gpt-5.2 · 5/20/2026

    Paper 2 likely has higher impact because it introduces broadly useful infrastructure: verifiable, auditable software worlds with structured state verifiers, task generation, and partial-credit rewards across 33 real applications and 1,000 tasks. This enables reproducible evaluation and training for computer-use agents, addressing a timely bottleneck (reliable benchmarks and reward signals) with clear real-world relevance. Its methodological contribution (verifier-grounded evaluation vs LLM judges, self-evolving verification) can influence multiple subfields (agent RL, benchmarking, HCI, software engineering). Paper 1 is strong but more incremental within RL/distillation.

    vs. Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
    gpt-5.2 · 5/20/2026

    Paper 1 likely has higher impact: it introduces a broadly usable, verifier-grounded benchmark infrastructure for real desktop applications (33 apps, 1,000 tasks) with auditable evaluation and partial-credit rewards—addressing a central bottleneck in computer-use agents (reliable, fine-grained evaluation). Its methodology (state verifiers + self-improving verification + task generation + full-trajectory harness) is highly actionable for many labs and can become shared community infrastructure. Paper 2 is a strong, timely RLVR improvement, but is narrower in scope and more incremental compared to a new evaluation ecosystem for agentic computing.

    vs. Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses
    gpt-5.2 · 5/20/2026

    Paper 2 is likely higher impact due to its broadly applicable, reusable infrastructure for verifiable evaluation of computer-use agents across 33 real applications and 1,000 tasks. Verifier-grounded, auditable rewards address a central bottleneck in agent research (reliable evaluation), with immediate relevance to autonomy, safety, benchmarking, and reinforcement learning. Its methodology (state verifiers, trajectory logging, partial-credit scoring, self-evolving verification) is general and could become a standard testbed across academia and industry. Paper 1 is strong and timely but narrower to survey workflows/disaster contexts and incremental over established imputation baselines.

    vs. Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models
    gemini-3.1 · 5/20/2026

    Paper 1 addresses a critical bottleneck in the rapidly growing field of agentic AI by providing a robust, scalable framework and benchmark for evaluating computer-use agents. Infrastructure and evaluation frameworks typically have broader and longer-lasting scientific impact across multiple disciplines compared to specific attack methodologies like the jailbreak technique presented in Paper 2.

    vs. Probabilistic Tiny Recursive Model
    gpt-5.2 · 5/20/2026

    Paper 1 likely has higher impact: it introduces a verifier-grounded framework and infrastructure (verifiers, self-improving verification, task generation, auditable evaluation) spanning 33 real desktop apps and 1,000 tasks, addressing a central, timely bottleneck in computer-use agents—reliable evaluation and grounding in real application state. Its methodology and artifacts can broadly influence agent benchmarking, safety/reliability, and human-computer interaction. Paper 2 is novel and impressive for efficient test-time compute scaling on puzzle-like domains, but its demonstrated scope is narrower and may transfer less directly to real-world deployment compared with a general evaluation/verification substrate.

    vs. Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
    gemini-3.1 · 5/20/2026

    Shepherd introduces a novel, formalized runtime infrastructure with highly efficient execution tracing and state forking. Its mechanized foundation and demonstrated significant performance gains across multiple domains (intervention, meta-optimization, RL training) represent a deeper methodological innovation with broader algorithmic applicability than OpenComputer, which primarily serves as an evaluation framework.

    vs. Agentic Trading: When LLM Agents Meet Financial Markets
    gpt-5.2 · 5/20/2026

    Paper 1 has higher impact potential due to a concrete, novel infrastructure contribution: verifiable, state-based evaluation for real desktop applications with scalable task generation and auditable rewards across 33 apps/1,000 tasks. This directly enables more rigorous benchmarking and training for computer-use agents, a broadly relevant and timely area in AI and HCI, with clear real-world applications and methodological rigor (ground-truth verifiers vs LLM judges). Paper 2 is valuable and timely but is primarily an audit/survey; its impact is more incremental and narrower to finance research practices.

    vs. Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination
    gpt-5.2 · 5/20/2026

    Paper 2 likely has higher impact because it introduces a broadly useful, verifiable evaluation/training infrastructure for computer-use agents across 33 real applications and 1,000 tasks, addressing a timely bottleneck: reliable, auditable benchmarking beyond LLM-as-judge. Its framework (verifiers, self-improving verification, task generation, partial-credit rewards) can enable reproducible research and accelerate progress across agent learning, HCI, software engineering, and safety. Paper 1 is novel and mechanistically rigorous for MLLM hallucinations, but its intervention is narrower in scope and primarily impacts multimodal generation robustness rather than enabling a cross-field platform.

    vs. When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach
    claude-opus-4.6 · 5/20/2026

    OpenComputer addresses a critical infrastructure gap for evaluating computer-use agents—a rapidly growing area in AI. Its contribution of verifiable benchmarks across 33 applications with 1,000 tasks provides a foundational evaluation framework that could broadly impact agent development, reinforcement learning from environment feedback, and AI safety. The finding that hard-coded verifiers outperform LLM-as-judge is significant for the field. Paper 2, while methodologically sound, addresses a narrower intersection (strategic classification + tabular foundation models) with more limited breadth of impact and community interest.

    vs. MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization
    gpt-5.2 · 5/20/2026

    Paper 2 is likely higher impact: it introduces a broadly usable, verifier-grounded evaluation and task framework for computer-use agents across 33 real applications and 1,000 tasks, addressing a central bottleneck (reliable, auditable measurement) with clear real-world relevance and potential to become a community benchmark/infrastructure. The methodology emphasizes verifiability, trajectory logging, and partial-credit rewards, improving rigor over LLM-as-judge. Paper 1 is novel for multi-objective skill/prompt optimization, but its scope is narrower and impact depends on adoption within agent-skill tuning workflows.

    vs. EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection
    gpt-5.2 · 5/20/2026

    Paper 1 has higher impact potential: it introduces a broadly usable, verifier-grounded framework for auditable evaluation and task generation for computer-use agents across 33 real applications, addressing a central bottleneck in agent benchmarking (reliable, fine-grained success criteria). Its methodology (structured state verifiers, self-improving verification, full-trajectory logging, partial-credit rewards) is likely to influence evaluation practices across AI agents, HCI, and software engineering. Paper 2 is a useful incremental improvement in deepfake generalization on a single dataset with modest gains, with narrower scope and likely faster obsolescence.

    vs. Generative Auto-Bidding with Unified Modeling and Exploration
    claude-opus-4.6 · 5/20/2026

    OpenComputer addresses a fundamental infrastructure challenge for evaluating computer-use agents—a rapidly growing research area driven by LLM advances. Its contribution of verifiable benchmarking across 33 applications with rigorous state-based verification fills a critical methodological gap, likely enabling and shaping future research broadly. Paper 2, while technically solid with real-world deployment results, represents an incremental advance in auto-bidding within a narrower domain (digital advertising). OpenComputer's broader applicability across AI agent research and its potential to become a standard evaluation framework gives it higher estimated impact.

    vs. Generative AI and the Productivity Divide: Human-AI Complementarities in Education
    claude-opus-4.6 · 5/20/2026

    Paper 1 addresses a fundamental question about how GenAI affects productivity inequality—a topic with broad implications across economics, education, management, and policy. Its RCT methodology, novel construct of AI Interaction Competence (AIC), and actionable finding that scaffolding reduces variance make it highly relevant and timely. Paper 2 is a solid engineering contribution (benchmarking framework for computer-use agents) but has narrower impact, primarily within the AI agents community. Paper 1's findings about human-AI complementarities will likely influence organizational adoption strategies, workforce training, and educational policy at scale.