Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Liya Zhu, Jingzhe Ding, Jian Zhang, Jianbo Xue, Shihao Liang, Ge Zhang, Yi Zhu, Duju Zeng

Jun 9, 2026 · arXiv:2606.11042v2

cs.AI

#1283 of 3489 · Artificial Intelligence

Tournament Score: 1430 ± 49 (scale 1050 to 1800)

Win Rate: 67%

Rating: 7/10 · Significance 7.5 · Rigor 6.5 · Novelty 6.5 · Clarity 7

Abstract

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Workflow-GYM

1. Core Contribution

Workflow-GYM introduces a benchmark of 338 tasks spanning 6 professional domains and 23 subdomains, designed to evaluate whether GUI agents can autonomously complete long-horizon, domain-specific workflows in specialized software environments. The key abstraction is the concept of a "workflow"—a structured, goal-directed sequence of GUI actions requiring 30–110 atomic steps. Unlike prior GUI benchmarks that focus on general-purpose software (web browsing, basic OS operations) and short-horizon interactions, Workflow-GYM targets professional tools such as Blender, QGIS, FreeCAD, KNIME, CapCut, and 50+ other specialized applications across engineering, science, finance, multimedia, and more.

The benchmark fills a genuine gap: prior work like OSWorld evaluates simpler tasks where SOTA models achieve >70% success, while Workflow-GYM shows even the best models (Gemini-3.1-Pro, Kimi-k2.6) reach only ~30% average pass rates. This stark contrast demonstrates that professional long-horizon GUI workflows remain a fundamentally unsolved challenge.

2. Methodological Rigor

The data construction pipeline is well-structured with four phases: expert-driven workflow selection, standardized environment setup via 58 virtual machines, detailed task construction with expert procedures, and multi-stage validation (environment verification, instruction-level validation via expert+LLM review, and end-to-end preliminary experiments). The filtering criteria—realism, domain specificity, complexity (≥30 atomic actions), and verifiability—are clearly motivated and applied.

The evaluation methodology is sound in several respects: tasks are run across 3 independent trials to account for stochasticity, evaluation uses both rule-based and VLM-based assessment with predefined rubrics, and the benchmark uses a 400-step interaction budget. The use of model-specific agent frameworks rather than a one-size-fits-all approach is pragmatic and well-justified, though it introduces some confounding between model capability and framework quality.
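The trial-based scoring described above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function names, the result layout, and the choice to average per-task pass fractions are assumptions, and only the 3-trial and 400-step figures come from the assessment.

```python
STEP_BUDGET = 400  # interaction budget stated in the assessment
NUM_TRIALS = 3     # independent trials per task, as stated in the assessment


def trial_passed(steps_used: int, judged_success: bool) -> bool:
    """Assumed rule: a trial passes only if it is judged successful
    and stays within the step budget."""
    return judged_success and steps_used <= STEP_BUDGET


def average_pass_rate(results: dict[str, list[tuple[int, bool]]]) -> float:
    """Mean over tasks of the fraction of trials passed (assumed aggregation).

    `results` maps a hypothetical task id to (steps_used, judged_success) pairs.
    """
    if not results:
        raise ValueError("no tasks to aggregate")
    per_task = []
    for task_id, trials in results.items():
        if len(trials) != NUM_TRIALS:
            raise ValueError(f"{task_id}: expected {NUM_TRIALS} trials")
        passed = sum(trial_passed(steps, ok) for steps, ok in trials)
        per_task.append(passed / NUM_TRIALS)
    return sum(per_task) / len(per_task)


# Hypothetical results for two made-up tasks.
example = {
    "blender_task": [(120, True), (400, True), (401, True)],
    "qgis_task": [(50, False), (60, False), (70, False)],
}
print(average_pass_rate(example))
```

Averaging per task before averaging across tasks keeps every task equally weighted, which matters if some trials are missing or rerun; whether the paper aggregates this way is not stated in the assessment.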

However, there are notable methodological concerns. The use of VLM-based judging (Seed-1.8) for non-rule-based tasks introduces potential evaluation noise, though the authors argue that predefined rubrics minimize bias. The 338-task size, while substantial for expert-curated benchmarks, may limit statistical power for fine-grained domain-level comparisons (e.g., some subdomains likely have very few tasks). The paper also does not report inter-annotator agreement or detailed reliability statistics for the evaluation pipeline.
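To make the statistical-power concern concrete, the sketch below computes a Wilson score interval for a pass rate. The success counts are hypothetical, and the figure of about 15 tasks per subdomain assumes the 338 tasks were spread evenly over the 23 subdomains, which the assessment does not claim.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts at roughly a 30% pass rate:
# one subdomain (about 338 / 23 tasks) versus the full benchmark.
for label, successes, n in [("subdomain", 5, 15), ("full benchmark", 101, 338)]:
    lo, hi = wilson_interval(successes, n)
    print(f"{label}: n={n}, interval width={hi - lo:.3f}")
```

Under these assumptions the subdomain interval is several times wider than the benchmark-level one, which is why fine-grained domain comparisons are harder to support than the headline result.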

3. Potential Impact

Benchmarking impact: Workflow-GYM addresses a critical evaluation gap. As GUI agents mature, the field needs benchmarks that test economically meaningful, professional-grade capabilities rather than toy tasks. This benchmark could become a standard reference point for measuring progress toward human-level computer use in professional settings.

Diagnostic value: The failure analysis is perhaps the paper's most valuable contribution. The taxonomy of failure modes (error propagation, workflow stage omission, objective drift, software knowledge deficiency, and repetitive action looping) provides actionable insights for model developers. The strong negative correlation between workflow incompletion rate and overall performance (Pearson r = -0.97) indicates that simply completing long-horizon workflows is the primary bottleneck.
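For readers who want to reproduce this kind of statistic, a Pearson coefficient can be computed as sketched below. The per-model numbers are invented for illustration and are not the paper's data; only the idea of correlating incompletion rate with pass rate comes from the assessment.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model values (six models, NOT taken from the paper).
incompletion_rate = [0.35, 0.42, 0.50, 0.58, 0.66, 0.75]
pass_rate = [0.31, 0.29, 0.22, 0.17, 0.12, 0.06]

r = pearson_r(incompletion_rate, pass_rate)
print(f"Pearson r = {r:.2f}")
```

With only a handful of models, a coefficient near -1 is easy to obtain from any roughly monotone relationship, so the value is best read alongside the underlying scatter rather than on its own.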

Framework-level insight: The identification of the "continuous interaction vs. discrete observation" mismatch as a fundamental architectural limitation of current GUI agent frameworks is a particularly important observation that could guide next-generation agent designs.

Ablation insights: The tutorial ablation experiments (text and video) reveal that procedural guidance substantially helps weaker models but provides diminishing returns for stronger ones, and that video demonstrations provide complementary benefits beyond text for procedural, deterministic tasks.

4. Timeliness & Relevance

This benchmark is highly timely. The rapid deployment of computer-use agents (Claude Computer Use, OpenAI's computer-use capabilities, UI-TARS) has created urgent demand for rigorous evaluation of professional-grade GUI capabilities. The paper evaluates cutting-edge models including GPT-5.4, Gemini-3.1-Pro, and Kimi-k2.6 (all 2026 models), making it immediately relevant to the current state of the field. The focus on economically valuable professional workflows also aligns with growing industry interest in AI agents that can substitute for skilled human labor.

5. Strengths & Limitations

Key Strengths:

Expert-driven task curation from 71 domain professionals ensures authenticity that synthetic or web-scraped benchmarks cannot match

Comprehensive coverage of 58 professional software tools across diverse domains

Thorough failure analysis with detailed case studies (error propagation in Blender, software knowledge deficiency in GRASS GIS, objective drift in CapCut)

Well-designed ablation studies that isolate the contribution of procedural guidance and visual demonstrations

Standardized VM-based environments enabling reproducibility

The strong negative correlation between workflow incompletion and performance is a clean, memorable finding

Notable Limitations:

The benchmark is static; professional software evolves rapidly, and maintaining 58 VM environments will require ongoing effort

Model-specific agent frameworks make it difficult to fully disentangle model capability from framework quality

The 30-step minimum threshold for "long-horizon" is somewhat arbitrary, and the paper doesn't justify why this particular cutoff captures the transition to professional-grade difficulty

Limited analysis of evaluation reliability—no systematic study of VLM judge agreement with human judgments

The paper tests only 6 models; broader coverage (including open-source models, specialized GUI agents like UI-TARS-2) would strengthen the empirical contribution

Cross-domain performance comparisons are somewhat superficial; the varying numbers of tasks per domain make some comparisons unreliable

The paper does not discuss cost or computational requirements for running the full benchmark, which is relevant for reproducibility

Minor Observations:

The paper is dated June 2026 and references models like GPT-5.4, suggesting this is a forward-looking or speculative timeline

The writing is generally clear but the paper is quite long with extensive appendices; some case studies could be condensed

The benchmark focuses exclusively on desktop GUI; mobile professional workflows are excluded

Overall Assessment

Workflow-GYM makes a solid contribution to the GUI agent evaluation landscape by raising the bar from simple, short-horizon tasks to professional-grade workflows. Its primary value lies in (1) demonstrating a large capability gap between current agents and professional requirements, (2) providing a well-structured taxonomy of failure modes, and (3) offering a reusable, expert-validated benchmark infrastructure. While the benchmark design has some limitations in scale and evaluation methodology, it addresses a genuine and timely need and will likely influence how the community thinks about and evaluates the next generation of computer-use agents.

Rating: 7/10

Significance 7.5 · Rigor 6.5 · Novelty 6.5 · Clarity 7

Generated Jun 11, 2026

Comparison History (15)

Lost vs. MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

Paper 2 (MLEvolve) has higher potential impact due to a more novel algorithmic framework (progressive MCGS with cross-branch information flow, retrospective memory, hierarchical control) and stronger demonstrated gains on established benchmarks (MLE-Bench) plus cross-domain generalization beyond ML engineering. Its applications (automated ML algorithm discovery, scientific/engineering optimization) are broad and timely as LLM agents move toward long-horizon self-improvement. Paper 1 is valuable but primarily introduces a benchmark; its impact depends on community adoption and may be narrower to GUI-agent evaluation.

gpt-5.2 · Jun 11, 2026

Lost vs. AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

Paper 2 (AgentJet) likely has higher scientific impact due to broader applicability and infrastructural leverage: a flexible, distributed RL framework can accelerate many lines of agentic RL research and enable new experimental regimes (heterogeneous multi-model/multi-agent training, fault tolerance, live iteration). Its claimed 1.5–10× speedups and automated long-horizon research loops suggest strong real-world utility and timeliness for scaling agent training. Paper 1 is novel and valuable as an evaluation benchmark, but its impact is narrower (primarily benchmarking) and depends on community adoption and task coverage.

gpt-5.2 · Jun 11, 2026

Won vs. PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?

Paper 1 addresses a major bottleneck in agentic AI: evaluating long-horizon, economically valuable tasks in professional GUI environments. While Paper 2 explores important safety risks in LLM memory, Paper 1 has broader implications for developing autonomous agents capable of real-world, end-to-end automation, giving it higher potential for widespread technological and economic impact.