Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Liya Zhu, Jingzhe Ding, Jian Zhang, Jianbo Xue, Shihao Liang, Ge Zhang, Xiang Gao, Qingshui Gu

Jun 9, 2026 · arXiv:2606.11042v1

cs.AI

Frozen v1: this version was superseded on arXiv by v2. Stats reflect the state at freeze time.

#1889 of 3539 · Artificial Intelligence

Tournament Score: 1390 ± 42 (scale: 1050 to 1800)

Win Rate: 63%

Rating: 6.8 / 10

Significance 7.5 · Rigor 6.5 · Novelty 6 · Clarity 7.5

Abstract

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Workflow-GYM

1. Core Contribution

Workflow-GYM introduces a benchmark comprising 338 tasks across 6 professional domains (Data Analysis, Engineering & Design, Finance & Management, Geography & Environment, Multimedia & Creative, Scientific Computing) and 23 subdomains, targeting long-horizon GUI-based workflows in specialized software environments. The key conceptual contribution is the abstraction of "workflows"—structured, goal-directed sequences of GUI actions that professionals routinely perform—as the unit of evaluation for computer-use agents. Tasks require 30–110 atomic GUI steps, span 58 distinct professional software tools (from Blender and QGIS to FreeCAD and KNIME), and are evaluated via deterministic success criteria on final artifacts or GUI states.

The benchmark fills a genuine gap: prior GUI benchmarks (OSWorld, AndroidWorld, GUI-360) focus on general-purpose software and short-horizon tasks, while long-horizon reasoning benchmarks (coding, QA) rarely involve GUI interaction. Workflow-GYM sits at this intersection, demanding both domain expertise and sustained GUI manipulation.

2. Methodological Rigor

The data construction pipeline is well-designed with four phases: expert-driven workflow selection, standardized environment setup (58 VMs), structured task construction, and multi-stage validation (environment verification, instruction-level review, end-to-end agent testing). The triple validation loop is a strength—tasks are verified by human execution, LLM-based instruction quality checks, and preliminary agent experiments. The filtering criteria (realism, domain specificity, complexity ≥30 steps, verifiability) are clearly articulated.

Evaluation design is reasonable: binary pass/fail with deterministic criteria, 3 trials per task, model-specific agent frameworks with a 400-step budget. Using Seed-1.8 as a judge model for non-rule-based scoring is a practical choice, though this introduces a dependency that could affect reproducibility.
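Binary outcomes over three trials per task can be aggregated in more than one way, and the assessment does not state which aggregation the paper reports. The following is a minimal Python sketch under that uncertainty: the task IDs and the `summarize_trials` helper are hypothetical, and the three metrics are common choices rather than the benchmark's documented scoring.

```python
from statistics import mean


def summarize_trials(results: dict[str, list[bool]]) -> dict[str, float]:
    """Aggregate binary pass/fail outcomes over repeated trials per task.

    `results` maps a task ID to its list of trial outcomes. Which of these
    aggregations the benchmark actually reports is an assumption here.
    """
    if not results or any(len(runs) == 0 for runs in results.values()):
        raise ValueError("every task needs at least one trial")
    return {
        # Average of per-task pass fractions (each task weighted equally).
        "mean_success_rate": mean(mean(map(float, runs)) for runs in results.values()),
        # Fraction of tasks solved in at least one trial.
        "pass_any": mean(float(any(runs)) for runs in results.values()),
        # Fraction of tasks solved in every trial.
        "pass_all": mean(float(all(runs)) for runs in results.values()),
    }


# Hypothetical outcomes for three tasks, three trials each.
example = {
    "task_a": [True, False, False],
    "task_b": [False, False, False],
    "task_c": [True, True, True],
}
summary = summarize_trials(example)
```

Reporting `pass_any` and `pass_all` next to the mean shows how much of a headline success rate comes from tasks that are solved only some of the time.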

Experimental breadth is adequate, covering 6 frontier models (GPT-5.4 variants, Gemini 3.1 Pro, Gemini 3 Flash, Kimi-k2.6, Seed-2.0-lite) with model-specific frameworks. The ablation studies on textual tutorials and video demonstrations add depth, though the video ablation is limited to 100 tasks and 2 models.

However, there are methodological concerns:

The use of model-specific agent frameworks complicates fair comparison—performance differences may reflect framework quality rather than model capability. The paper acknowledges this ("agentic framework matters") but doesn't fully disentangle the confound.

Inter-annotator agreement metrics for task construction and validation are not reported.

The 30-step minimum for "long-horizon" tasks is somewhat arbitrary, and the threshold itself is only moderately long.

Reproducibility depends on access to 58 VM images and commercial professional software, which may limit external adoption.

3. Potential Impact

Direct impact on GUI agent research: The benchmark provides a concrete, challenging target (best model ≈30% success) that should drive development of agents with better long-horizon planning, error recovery, and domain-specific software knowledge. The identified failure modes (error propagation, workflow stage omission, objective drift, repetitive action looping) offer actionable research directions.

Broader implications: The observation about the fundamental mismatch between continuous human interaction and discrete observation-action paradigms (Section 5.3) is an important architectural insight that could influence GUI agent framework design. The finding that video tutorials help but don't resolve complex tasks suggests a path for learning from demonstrations.

Practical relevance: If GUI agents are to automate professional work, benchmarks must capture professional workflows. Workflow-GYM's focus on economically valuable tasks (CAD modeling, financial reporting, geospatial analysis) directly addresses the question of AI's potential for workplace automation.

Limitations on impact: The benchmark's reliance on VM-based environments with commercial software may limit accessibility. The task count (338) is moderate. The professional software choices, while diverse, are skewed toward open-source tools, which may not represent the most commercially important professional workflows.

4. Timeliness & Relevance

This work is highly timely. Computer-use agents (Claude Computer Use, OpenAI Operator, etc.) are being actively deployed, and the gap between general-purpose benchmarks and real professional needs is widely recognized. The benchmark arrives as frontier models (GPT-5.4, Gemini 3.1 Pro) are being released with explicit computer-use capabilities, making the evaluation particularly relevant. The focus on professional workflows also aligns with growing industry interest in AI-driven automation of knowledge work.

5. Strengths & Limitations

Key Strengths:

Addresses a clear, important gap in the evaluation landscape

Comprehensive domain coverage across 6 fields, 23 subdomains, 58 software tools

Rigorous multi-stage validation pipeline

Rich qualitative failure analysis with detailed case studies (error propagation, stage omission, objective drift)

Insightful ablation studies revealing the value of procedural guidance and video demonstrations

The continuous-vs-discrete interaction analysis (Section 5.3) identifies a fundamental architectural limitation

Notable Weaknesses:

Model-specific frameworks confound model capability assessment

Limited statistical analysis—no confidence intervals or significance tests on the main results

The evaluation of non-artifact tasks relies on VLM-based scoring, which may introduce noise

Scalability and reproducibility concerns due to VM-based infrastructure with specialized software

The paper is primarily a benchmark contribution without proposing methods to address the identified challenges

Some results are presented without sufficient statistical rigor (e.g., domain performance differences could be driven by small sample sizes within domains)

The expert pool (71 members) and task count (338), while substantial, may not provide deep coverage in any single domain
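The concern about missing confidence intervals can be made concrete with a Wilson score interval for a binomial success rate. The sketch below is illustrative only: the counts are hypothetical, not results from the paper, and it treats each task as one independent pass/fail observation, which ignores the three-trial structure.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical counts: about 30% success on all 338 tasks,
# and the same rate on a single 50-task domain.
overall_lo, overall_hi = wilson_interval(101, 338)
domain_lo, domain_hi = wilson_interval(15, 50)
```

At the same success rate, the interval for the 50-task domain is much wider than the interval for the full benchmark, which is why per-domain differences need more caution than the overall figure.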

Additional Observations

The paper would benefit from a cost analysis of benchmark construction and maintenance, inter-rater reliability metrics, and a more formal analysis of what makes tasks difficult beyond step count (e.g., number of software-specific concepts, degree of UI complexity). The correlation analysis (Pearson r = -0.97 between incompletion rate and success) is computed on only 6 data points, limiting its statistical power.
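To illustrate why six data points limit what the correlation can show, the sketch below computes an approximate 95% confidence interval for a Pearson coefficient with the Fisher z-transformation. It assumes roughly bivariate-normal data, which is a strong assumption at n = 6, and `pearson_ci` is a hypothetical helper rather than anything from the paper.

```python
import math


def pearson_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for Pearson r via the Fisher z-transform."""
    if n <= 3:
        raise ValueError("the Fisher interval needs n > 3")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)               # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)     # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Values quoted in the assessment: r = -0.97 on 6 models.
low, high = pearson_ci(-0.97, 6)
```

For r = -0.97 and n = 6 the interval spans roughly -0.997 to -0.74. The negative relationship is clear, but its strength is estimated loosely.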

Overall, Workflow-GYM is a well-executed benchmark paper that addresses a timely need and provides valuable empirical insights, though it is primarily incremental in its conceptual contribution—extending the benchmark paradigm to professional domains rather than introducing fundamentally new evaluation methodologies.

Rating: 6.8 / 10

Significance 7.5 · Rigor 6.5 · Novelty 6 · Clarity 7.5

Generated Jun 10, 2026

Comparison History (16)

Won vs. A Five-Plane Reference Architecture for Runtime Governance of Production AI Agents

Benchmarks in AI typically have broader scientific impact as they establish standardized evaluation metrics that drive future research. Workflow-GYM addresses a critical gap in evaluating agents on long-horizon, professional GUI tasks, offering a platform that researchers across the field will likely use to test new models. While Paper 1 provides a strong architectural framework for enterprise security, Paper 2's benchmark will directly facilitate and measure the broader progression of agentic AI capabilities.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. StatefulDiscovery: Evidence-Calibrated Claim Formation in Open-Ended Scientific Discovery

StatefulDiscovery addresses a fundamental challenge in AI-driven scientific discovery—evidence-calibrated claim formation—which has broad implications across all scientific fields. Its novel framework for coupling exploration with claim adjudication introduces a methodologically rigorous approach to open-ended discovery, a frontier problem with transformative potential. While Workflow-GYM is a useful benchmark contribution highlighting GUI agent limitations, benchmarks tend to have more incremental impact. StatefulDiscovery's framework for autonomous scientific reasoning is more novel, more broadly applicable, and addresses a more consequential problem for accelerating science.

claude-opus-4-6 · Jun 11, 2026

Lost vs. Embodied-BenchClaw: An Autonomous Multi-Agent System for Embodied Spatial Intelligence Benchmark Construction

Paper 2 introduces an automated system to generate dynamic, updatable benchmarks, addressing the fundamental issue of static benchmarks quickly saturating in AI. This methodological innovation has broader, longer-lasting impact across multiple embodied AI fields compared to Paper 1, which, while highly relevant, proposes a single static benchmark for GUI tasks.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning

Paper 2 introduces a much-needed benchmark for long-horizon, professional GUI tasks, addressing a critical bottleneck in the rapidly expanding field of autonomous AI agents. By revealing significant limitations in state-of-the-art models, it provides a foundational evaluation framework that will likely drive broad future research. Paper 1 offers a valuable but highly specific methodological refinement for LLM unlearning, which, while important for safety, has a narrower scope compared to the foundational impact of a novel, challenging benchmark in agentic AI.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

Paper 1 likely has higher scientific impact because it introduces a broadly useful, timely benchmark for long-horizon GUI agent evaluation in real professional software—an area with wide applicability across AI agents, HCI, benchmarking, and enterprise automation. Its finding of low success rates creates a clear research gap and can become a standard evaluation tool, influencing many subsequent works. Paper 2 is methodologically rich and impactful for EDA/hardware design, but its domain is narrower; gains, while meaningful, may affect a smaller research and industry community than a cross-domain agent benchmark.

gpt-5.2 · Jun 10, 2026

Won vs. Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration

Workflow-GYM addresses a highly timely and economically valuable problem: long-horizon computer-use agents operating professional software. While Paper 1 offers a novel cognitive perspective on collaboration, Paper 2 aligns directly with the current AI frontier of developing autonomous GUI agents for real-world workflows. Its potential to drive immediate practical applications and highlight critical failure modes in state-of-the-art models gives it broader and more immediate scientific and industrial impact.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

Paper 1 presents a novel and concrete methodological contribution—delayed per-step reward attribution with eligibility gating—that addresses a fundamental challenge in multi-agent RL for language models. Its strong empirical validation (an 8B model beating GPT-5 on a competitive benchmark, winning first place at NeurIPS 2025) demonstrates significant practical impact. Paper 2, while addressing an important gap in GUI agent evaluation, is primarily a benchmark contribution that reveals limitations without proposing solutions. Benchmark papers tend to have narrower long-term impact compared to papers introducing methods that achieve state-of-the-art results with dramatically fewer resources.

claude-opus-4-6 · Jun 10, 2026

Won vs. Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning

Paper 1 likely has higher impact because it introduces a new, timely benchmark targeting long-horizon, economically valuable, domain-specific GUI workflows—an evaluation gap with broad relevance to agent research, HCI, and real-world deployment. Benchmarks often become shared infrastructure, shaping research directions and enabling standardized comparisons across models and methods. While Paper 2’s uncertainty-aligned RL for tool-calling is innovative and useful, it is a more incremental algorithmic contribution whose impact depends on adoption and generalization, whereas Workflow-GYM can catalyze a wider ecosystem of methods and evaluations.

gpt-5.2 · Jun 10, 2026

Won vs. Graph2Idea: Retrieval-Augmented Scientific Idea Generation with Graph-Structured Contexts

Paper 1 likely has higher impact due to stronger timeliness and broad applicability: long-horizon, real-world GUI agent evaluation is a key bottleneck for deploying agents in professional settings, and a well-designed benchmark can become a community standard. It targets economically valuable workflows across diverse domains and provides diagnostic failure modes that can steer multiple research areas (agent planning, UI grounding, robustness). Paper 2 is novel and useful, but idea-generation gains are incremental, evaluation is harder to validate, and downstream real-world adoption is less immediate than a benchmark for agent capability.

gpt-5.2 · Jun 10, 2026

Lost vs. Residual Modeling for High-Fidelity Learned Compression of Scientific Data

Paper 2 has higher potential impact due to a clear algorithmic contribution (residual-centric coding with deterministic guarantees) that materially advances high-fidelity scientific data compression, a critical bottleneck for HPC and climate/CFD workflows. It targets stringent error regimes (NRMSE 1e-6–1e-4), demonstrates sizable gains over strong baselines (GAE, SZ) across multiple real datasets, and emphasizes reproducible, deterministic decoding—important for scientific use. Paper 1 is timely and useful as a benchmark, but benchmark papers often yield narrower, more incremental impact unless they become a dominant standard.

gpt-5.2 · Jun 10, 2026
