
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Liya Zhu, Jingzhe Ding, Jian Zhang, Jianbo Xue, Shihao Liang, Ge Zhang, Yi Zhu, Duju Zeng

cs.AI
#1283 of 3489 · Artificial Intelligence
Tournament Score: 1430 ± 49 (scale ticks: 1050 and 1800)
Win Rate: 67% (10 wins, 5 losses, 15 matches)
Rating: 7/10
Significance 7.5 · Rigor 6.5 · Novelty 6.5 · Clarity 7

Abstract

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Workflow-GYM

1. Core Contribution

Workflow-GYM introduces a benchmark of 338 tasks spanning 6 professional domains and 23 subdomains, designed to evaluate whether GUI agents can autonomously complete long-horizon, domain-specific workflows in specialized software environments. The key abstraction is the concept of a "workflow"—a structured, goal-directed sequence of GUI actions requiring 30–110 atomic steps. Unlike prior GUI benchmarks that focus on general-purpose software (web browsing, basic OS operations) and short-horizon interactions, Workflow-GYM targets professional tools such as Blender, QGIS, FreeCAD, KNIME, CapCut, and 50+ other specialized applications across engineering, science, finance, multimedia, and more.

The benchmark fills a genuine gap: prior work like OSWorld evaluates simpler tasks where SOTA models achieve >70% success, while Workflow-GYM shows even the best models (Gemini-3.1-Pro, Kimi-k2.6) reach only ~30% average pass rates. This stark contrast demonstrates that professional long-horizon GUI workflows remain a fundamentally unsolved challenge.

2. Methodological Rigor

The data construction pipeline is well-structured with four phases: expert-driven workflow selection, standardized environment setup via 58 virtual machines, detailed task construction with expert procedures, and multi-stage validation (environment verification, instruction-level validation via expert+LLM review, and end-to-end preliminary experiments). The filtering criteria—realism, domain specificity, complexity (≥30 atomic actions), and verifiability—are clearly motivated and applied.

The evaluation methodology is sound in several respects: tasks are run across 3 independent trials to account for stochasticity, evaluation uses both rule-based and VLM-based assessment with predefined rubrics, and the benchmark uses a 400-step interaction budget. The use of model-specific agent frameworks rather than a one-size-fits-all approach is pragmatic and well-justified, though it introduces some confounding between model capability and framework quality.
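The aggregation described above can be sketched in a few lines. This is a minimal illustration, not the authors' scoring code: the `Trial` record, the `average_pass_rate` helper, and the rule that a run exceeding the 400-step budget counts as a failure are all assumptions made for the example. Only the 3-trial protocol and the 400-step budget come from the review.

```python
from dataclasses import dataclass
from statistics import mean

STEP_BUDGET = 400  # interaction budget per trial, as stated in the review


@dataclass
class Trial:
    success: bool  # verdict from the rule-based or VLM-based checker
    steps: int     # number of GUI actions the agent took


def trial_passed(trial: Trial, budget: int = STEP_BUDGET) -> bool:
    # Assumption: a run that goes over the budget is scored as a failure.
    return trial.success and trial.steps <= budget


def average_pass_rate(runs: dict[str, list[Trial]]) -> float:
    """Mean over tasks of the fraction of trials that passed."""
    per_task = [
        mean(1.0 if trial_passed(t) else 0.0 for t in trials)
        for trials in runs.values()
    ]
    return mean(per_task)


# Hypothetical results for two tasks, three trials each.
runs = {
    "blender_task": [Trial(True, 120), Trial(False, 400), Trial(True, 390)],
    "qgis_task": [Trial(False, 400), Trial(False, 400), Trial(False, 400)],
}
print(f"average pass rate: {average_pass_rate(runs):.3f}")
```

Averaging per task first gives every task equal weight even if a trial is missing for some task; pooling all trials would be an equally defensible choice when every task has exactly three runs.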

However, there are notable methodological concerns. The use of VLM-based judging (Seed-1.8) for non-rule-based tasks introduces potential evaluation noise, though the authors argue that predefined rubrics minimize bias. The 338-task size, while substantial for expert-curated benchmarks, may limit statistical power for fine-grained domain-level comparisons (e.g., some subdomains likely have very few tasks). The paper also does not report inter-annotator agreement or detailed reliability statistics for the evaluation pipeline.
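To make the statistical-power concern concrete, a confidence interval for a pass rate shows how much wider the uncertainty becomes on a small subdomain. The sketch below uses the standard Wilson score interval. The success counts are hypothetical placeholders, not numbers from the paper. The subdomain size of 15 is only the rough average implied by 338 tasks over 23 subdomains.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts chosen to sit near a 30% pass rate.
full_lo, full_hi = wilson_interval(101, 338)  # whole benchmark
sub_lo, sub_hi = wilson_interval(5, 15)       # one average-sized subdomain

print(f"all tasks:  [{full_lo:.3f}, {full_hi:.3f}]")
print(f"subdomain:  [{sub_lo:.3f}, {sub_hi:.3f}]")
```

The interval for the 15-task subdomain is several times wider than the one for the full benchmark, which is why fine-grained domain rankings deserve more caution than the headline number.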

3. Potential Impact

Benchmarking impact: Workflow-GYM addresses a critical evaluation gap. As GUI agents mature, the field needs benchmarks that test economically meaningful, professional-grade capabilities rather than toy tasks. This benchmark could become a standard reference point for measuring progress toward human-level computer use in professional settings.

Diagnostic value: The failure analysis is perhaps the paper's most valuable contribution. The taxonomy of failure modes (error propagation, workflow stage omission, objective drift, software knowledge deficiency, and repetitive action looping) provides actionable insights for model developers. Workflow incompletion rate is strongly negatively correlated with overall performance (Pearson r = -0.97), which suggests that simply completing long-horizon workflows is the primary bottleneck.
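For readers who want to reproduce this kind of statistic on their own runs, the Pearson coefficient takes only a few lines to compute. The per-model values below are made-up placeholders used to exercise the function. They are not the paper's measurements, and the helper name `pearson_r` is likewise an illustration.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for six models (NOT taken from the paper).
incompletion_rate = [0.35, 0.42, 0.50, 0.58, 0.66, 0.75]
pass_rate = [0.31, 0.29, 0.22, 0.18, 0.12, 0.08]

print(f"r = {pearson_r(incompletion_rate, pass_rate):.2f}")
```

With only six models, a coefficient this close to -1 has wide uncertainty, so it is better read as a strong descriptive pattern than as a precise estimate.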

Framework-level insight: The identification of the "continuous interaction vs. discrete observation" mismatch as a fundamental architectural limitation of current GUI agent frameworks is a particularly important observation that could guide next-generation agent designs.

Ablation insights: The tutorial ablation experiments (text and video) reveal that procedural guidance substantially helps weaker models but provides diminishing returns for stronger ones, and that video demonstrations provide complementary benefits beyond text for procedural, deterministic tasks.

4. Timeliness & Relevance

This benchmark is highly timely. The rapid deployment of computer-use agents (Claude Computer Use, OpenAI's computer-use capabilities, UI-TARS) has created urgent demand for rigorous evaluation of professional-grade GUI capabilities. The paper evaluates cutting-edge models including GPT-5.4, Gemini-3.1-Pro, and Kimi-k2.6 (all 2026 models), making it immediately relevant to the current state of the field. The focus on economically valuable professional workflows also aligns with growing industry interest in AI agents that can substitute for skilled human labor.

5. Strengths & Limitations

Key Strengths:

  • Expert-driven task curation from 71 domain professionals ensures authenticity that synthetic or web-scraped benchmarks cannot match
  • Comprehensive coverage of 58 professional software tools across diverse domains
  • Thorough failure analysis with detailed case studies (error propagation in Blender, software knowledge deficiency in GRASS GIS, objective drift in CapCut)
  • Well-designed ablation studies that isolate the contribution of procedural guidance and visual demonstrations
  • Standardized VM-based environments enabling reproducibility
  • The strong negative correlation between workflow incompletion and performance is a clean, memorable finding

Notable Limitations:

  • The benchmark is static; professional software evolves rapidly, and maintaining 58 VM environments will require ongoing effort
  • Model-specific agent frameworks make it difficult to fully disentangle model capability from framework quality
  • The 30-step minimum threshold for "long-horizon" is somewhat arbitrary, and the paper doesn't justify why this particular cutoff captures the transition to professional-grade difficulty
  • Limited analysis of evaluation reliability—no systematic study of VLM judge agreement with human judgments
  • The paper tests only 6 models; broader coverage (including open-source models, specialized GUI agents like UI-TARS-2) would strengthen the empirical contribution
  • Cross-domain performance comparisons are somewhat superficial; the varying numbers of tasks per domain make some comparisons unreliable
  • The paper does not discuss cost or computational requirements for running the full benchmark, which is relevant for reproducibility

Minor Observations:

  • The paper is dated June 2026 and references models like GPT-5.4, suggesting this is a forward-looking or speculative timeline
  • The writing is generally clear but the paper is quite long with extensive appendices; some case studies could be condensed
  • The benchmark focuses exclusively on desktop GUI; mobile professional workflows are excluded

Overall Assessment

    Workflow-GYM makes a solid contribution to the GUI agent evaluation landscape by raising the bar from simple, short-horizon tasks to professional-grade workflows. Its primary value lies in (1) demonstrating a large capability gap between current agents and professional requirements, (2) providing a well-structured taxonomy of failure modes, and (3) offering a reusable, expert-validated benchmark infrastructure. While the benchmark design has some limitations in scale and evaluation methodology, it addresses a genuine and timely need and will likely influence how the community thinks about and evaluates the next generation of computer-use agents.

    Rating: 7/10
    Significance 7.5 · Rigor 6.5 · Novelty 6.5 · Clarity 7

    Generated Jun 11, 2026

    Comparison History (15)

    Lost vs. MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

    Paper 2 (MLEvolve) has higher potential impact due to a more novel algorithmic framework (progressive MCGS with cross-branch information flow, retrospective memory, hierarchical control) and stronger demonstrated gains on established benchmarks (MLE-Bench) plus cross-domain generalization beyond ML engineering. Its applications (automated ML algorithm discovery, scientific/engineering optimization) are broad and timely as LLM agents move toward long-horizon self-improvement. Paper 1 is valuable but primarily introduces a benchmark; its impact depends on community adoption and may be narrower to GUI-agent evaluation.

    gpt-5.2 · Jun 11, 2026

    Lost vs. AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

    Paper 2 (AgentJet) likely has higher scientific impact due to broader applicability and infrastructural leverage: a flexible, distributed RL framework can accelerate many lines of agentic RL research and enable new experimental regimes (heterogeneous multi-model/multi-agent training, fault tolerance, live iteration). Its claimed 1.5–10× speedups and automated long-horizon research loops suggest strong real-world utility and timeliness for scaling agent training. Paper 1 is novel and valuable as an evaluation benchmark, but its impact is narrower (primarily benchmarking) and depends on community adoption and task coverage.

    gpt-5.2 · Jun 11, 2026

    Won vs. PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?

    Paper 1 addresses a major bottleneck in agentic AI: evaluating long-horizon, economically valuable tasks in professional GUI environments. While Paper 2 explores important safety risks in LLM memory, Paper 1 has broader implications for developing autonomous agents capable of real-world, end-to-end automation, giving it higher potential for widespread technological and economic impact.

    gemini-3.1-pro-preview · Jun 11, 2026

    Won vs. PolarMem: A Training-Free Polarized Latent Graph Memory for Verifiable Vision-Language Models

    Paper 1 introduces a highly timely and relevant benchmark for evaluating AI agents on long-horizon, real-world professional GUI tasks. As the field rapidly shifts towards agentic AI and practical automation, foundational benchmarks are critical for driving progress and adoption. While Paper 2 presents an innovative memory architecture for VLMs, Paper 1 addresses a broader, more pressing evaluation gap in a rapidly expanding subfield, giving it higher potential for widespread scientific and practical impact.

    gemini-3.1-pro-preview · Jun 11, 2026

    Won vs. MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

    Paper 2 likely has higher impact because it introduces a timely, broadly relevant evaluation benchmark for long-horizon GUI agent workflows in specialized professional software—an area with major real-world economic applicability and cross-field relevance (HCI, agentic AI, evaluation, robotics-like planning). Its finding that SOTA models achieve only ~30% success is a strong, actionable signal that can shape research directions. Paper 1 is innovative and rigorous for financial/tabular numerical reasoning, but its scope is narrower (domain-specific QA) and closer to incremental advances in multi-agent verification/code-agent pipelines.

    gpt-5.2 · Jun 11, 2026

    Won vs. Toward Trustworthy AI: Multi-Target Adversarial Attacks and Robust Defenses for Continuous Data Summarization

    Paper 1 introduces a benchmark for a highly relevant and rapidly growing field: long-horizon AI agents operating professional GUIs. Benchmarks in this domain typically drive significant subsequent research and development in LLMs and agentic AI. While Paper 2 offers strong mathematical rigor in adversarial robustness, Paper 1 addresses a broader, more timely bottleneck in AI capabilities, giving it greater potential for widespread real-world applications and immediate scientific impact.

    gemini-3.1-pro-preview · Jun 11, 2026

    Won vs. PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents

    Paper 2 introduces a comprehensive benchmark for long-horizon GUI agents in professional domains, a critical bottleneck in AI research. Benchmarks typically drive significant future research and garner broad citations. In contrast, Paper 1 presents a practical engineering tool with a very limited evaluation methodology (a two-month self-study), giving Paper 2 much higher scientific rigor and potential field-wide impact.

    gemini-3.1-pro-preview · Jun 11, 2026

    Won vs. Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

    Workflow-GYM introduces a novel benchmark for high-value, professional GUI workflows, addressing a critical gap in the rapidly expanding field of computer-use agents. Benchmarks that define new, challenging problem spaces typically achieve high scientific impact by steering future research directions and serving as standard evaluation metrics. While Paper 2 offers a strong algorithmic contribution for memory retention, Paper 1's focus on real-world, economically valuable tasks provides broader potential for widespread adoption, cross-domain applicability, and community-driven progress.

    gemini-3.1-pro-preview · Jun 11, 2026

    Lost vs. Bellman-Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets

    Paper 2 introduces a novel RL framework for MDPs with implicitly defined, state-dependent feasible action sets—a common but under-served setting in operations research and constrained control. It offers methodological rigor via a performance guarantee decomposing approximation vs learning error, and demonstrates strong results on queueing network control, suggesting broad applicability to constrained decision-making (logistics, scheduling, networks). Paper 1 is timely and valuable as a benchmark, but benchmarks often have narrower scientific novelty and impact unless they become a dominant standard; its core contribution is evaluative rather than a new algorithmic principle.

    gpt-5.2 · Jun 11, 2026

    Lost vs. ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

    ReasonAlloc addresses a critical and timely problem in LLM inference efficiency—KV cache management for reasoning models with long chain-of-thought. It introduces a novel hierarchical budget allocation framework with the 'Reasoning Wave' concept, is training-free and plug-and-play, and demonstrates clear improvements on established benchmarks. While Workflow-GYM contributes a useful benchmark for GUI agents, benchmarks tend to have more transient impact. ReasonAlloc's methodological contribution to efficient inference for reasoning models has broader applicability as reasoning LLMs become ubiquitous, making it likely to influence a wider body of follow-up work.

    claude-opus-4-6 · Jun 11, 2026