Liya Zhu, Jingzhe Ding, Jian Zhang, Jianbo Xue, Shihao Liang, Ge Zhang, Xiang Gao, Qingshui Gu
Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve success rates only slightly above 30%, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.
Workflow-GYM introduces a benchmark comprising 338 tasks across 6 professional domains (Data Analysis, Engineering & Design, Finance & Management, Geography & Environment, Multimedia & Creative, Scientific Computing) and 23 subdomains, targeting long-horizon GUI-based workflows in specialized software environments. The key conceptual contribution is the abstraction of "workflows"—structured, goal-directed sequences of GUI actions that professionals routinely perform—as the unit of evaluation for computer-use agents. Tasks require 30–110 atomic GUI steps, span 58 distinct professional software tools (from Blender and QGIS to FreeCAD and KNIME), and are evaluated via deterministic success criteria on final artifacts or GUI states.
The benchmark fills a genuine gap: prior GUI benchmarks (OSWorld, AndroidWorld, GUI-360) focus on general-purpose software and short-horizon tasks, while long-horizon reasoning benchmarks (coding, QA) rarely involve GUI interaction. Workflow-GYM sits at this intersection, demanding both domain expertise and sustained GUI manipulation.
The data construction pipeline is well-designed with four phases: expert-driven workflow selection, standardized environment setup (58 VMs), structured task construction, and multi-stage validation (environment verification, instruction-level review, end-to-end agent testing). The triple validation loop is a strength—tasks are verified by human execution, LLM-based instruction quality checks, and preliminary agent experiments. The filtering criteria (realism, domain specificity, complexity ≥30 steps, verifiability) are clearly articulated.
Evaluation design is reasonable: binary pass/fail with deterministic criteria, 3 trials per task, model-specific agent frameworks with a 400-step budget. Using Seed-1.8 as a judge model for non-rule-based scoring is a practical choice, though this introduces a dependency that could affect reproducibility.
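The summary above does not say how the 3 binary trials per task are combined into a headline success rate. As a minimal sketch, assuming hypothetical task ids and outcomes (none of these values come from the paper), the following shows three common aggregations that can give quite different numbers on the same runs: the mean over all trials, "passed at least once", and "passed every time".

```python
from statistics import mean

# Hypothetical data: task id -> pass/fail outcome of each of 3 trials.
# These values are illustrative only and are not taken from the benchmark.
results = {
    "task_a": [True, False, True],
    "task_b": [False, False, False],
    "task_c": [True, True, True],
    "task_d": [False, True, False],
}


def mean_success_rate(results):
    """Average of the per-task pass fractions (every task has 3 trials)."""
    per_task = [sum(trials) / len(trials) for trials in results.values()]
    return mean(per_task)


def pass_any(results):
    """Fraction of tasks solved in at least one trial."""
    return mean([1.0 if any(trials) else 0.0 for trials in results.values()])


def pass_all(results):
    """Fraction of tasks solved in every trial."""
    return mean([1.0 if all(trials) else 0.0 for trials in results.values()])


print(mean_success_rate(results), pass_any(results), pass_all(results))
```

Which of these a benchmark reports matters when comparing models, so a reader reproducing the numbers would need to check the paper's own definition rather than rely on this sketch.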
Experimental breadth is adequate, covering 6 frontier models (GPT-5.4 variants, Gemini 3.1 Pro, Gemini 3 Flash, Kimi-k2.6, Seed-2.0-lite) with model-specific frameworks. The ablation studies on textual tutorials and video demonstrations add depth, though the video ablation is limited to 100 tasks and 2 models.
However, there are methodological concerns:
Direct impact on GUI agent research: The benchmark provides a concrete, challenging target (best model ≈30% success) that should drive development of agents with better long-horizon planning, error recovery, and domain-specific software knowledge. The identified failure modes (error propagation, workflow stage omission, objective drift, repetitive action looping) offer actionable research directions.
Broader implications: The observation about the fundamental mismatch between continuous human interaction and discrete observation-action paradigms (Section 5.3) is an important architectural insight that could influence GUI agent framework design. The finding that video tutorials help but don't resolve complex tasks suggests a path for learning from demonstrations.
Practical relevance: If GUI agents are to automate professional work, benchmarks must capture professional workflows. Workflow-GYM's focus on economically valuable tasks (CAD modeling, financial reporting, geospatial analysis) directly addresses the question of AI's potential for workplace automation.
Limitations on impact: The benchmark's reliance on VM-based environments with commercial software may limit accessibility. The task count (338) is moderate. The professional software choices, while diverse, are skewed toward open-source tools, which may not represent the most commercially important professional workflows.
This work is highly timely. Computer-use agents (Claude Computer Use, OpenAI Operator, etc.) are being actively deployed, and the gap between general-purpose benchmarks and real professional needs is widely recognized. The benchmark arrives as frontier models (GPT-5.4, Gemini 3.1 Pro) are being released with explicit computer-use capabilities, making the evaluation particularly relevant. The focus on professional workflows also aligns with growing industry interest in AI-driven automation of knowledge work.
The paper would benefit from a cost analysis of benchmark construction and maintenance, inter-rater reliability metrics, and a more formal analysis of what makes tasks difficult beyond step count (e.g., number of software-specific concepts, degree of UI complexity). The correlation analysis (Pearson r = -0.97 between incompletion rate and success) is computed on only 6 data points, so the estimate is imprecise and sensitive to any single point.
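To make the small-sample caveat concrete, here is a rough sketch that puts an approximate 95% confidence interval around the reported correlation using the standard Fisher z-transformation. It uses only the two numbers quoted above (r = -0.97, n = 6); the function name `fisher_ci` is made up for illustration, and the method assumes roughly bivariate-normal data, which may not hold for six model-level points.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson r via Fisher's z.

    Assumes bivariate normality; z_crit=1.96 corresponds to about 95%.
    """
    z = math.atanh(r)             # Fisher transform of the correlation
    se = 1.0 / math.sqrt(n - 3)   # standard error in z-space
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


low, high = fisher_ci(-0.97, 6)
print(f"approx. 95% CI: [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval runs from roughly -0.997 to roughly -0.74: the negative relationship is clear, but its strength is much less certain than the point estimate suggests.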
Overall, Workflow-GYM is a well-executed benchmark paper that addresses a timely need and provides valuable empirical insights, though it is primarily incremental in its conceptual contribution—extending the benchmark paradigm to professional domains rather than introducing fundamentally new evaluation methodologies.
Generated Jun 10, 2026
Benchmarks in AI typically have broader scientific impact as they establish standardized evaluation metrics that drive future research. Workflow-GYM addresses a critical gap in evaluating agents on long-horizon, professional GUI tasks, offering a platform that researchers across the field will likely use to test new models. While Paper 1 provides a strong architectural framework for enterprise security, Paper 2's benchmark will directly facilitate and measure the broader progression of agentic AI capabilities.
StatefulDiscovery addresses a fundamental challenge in AI-driven scientific discovery—evidence-calibrated claim formation—which has broad implications across all scientific fields. Its novel framework for coupling exploration with claim adjudication introduces a methodologically rigorous approach to open-ended discovery, a frontier problem with transformative potential. While Workflow-GYM is a useful benchmark contribution highlighting GUI agent limitations, benchmarks tend to have more incremental impact. StatefulDiscovery's framework for autonomous scientific reasoning is more novel, more broadly applicable, and addresses a more consequential problem for accelerating science.
Paper 2 introduces an automated system to generate dynamic, updatable benchmarks, addressing the fundamental issue of static benchmarks quickly saturating in AI. This methodological innovation has broader, longer-lasting impact across multiple embodied AI fields compared to Paper 1, which, while highly relevant, proposes a single static benchmark for GUI tasks.
Paper 2 introduces a much-needed benchmark for long-horizon, professional GUI tasks, addressing a critical bottleneck in the rapidly expanding field of autonomous AI agents. By revealing significant limitations in state-of-the-art models, it provides a foundational evaluation framework that will likely drive broad future research. Paper 1 offers a valuable but highly specific methodological refinement for LLM unlearning, which, while important for safety, has a narrower scope compared to the foundational impact of a novel, challenging benchmark in agentic AI.
Paper 1 likely has higher scientific impact because it introduces a broadly useful, timely benchmark for long-horizon GUI agent evaluation in real professional software—an area with wide applicability across AI agents, HCI, benchmarking, and enterprise automation. Its finding of low success rates creates a clear research gap and can become a standard evaluation tool, influencing many subsequent works. Paper 2 is methodologically rich and impactful for EDA/hardware design, but its domain is narrower; gains, while meaningful, may affect a smaller research and industry community than a cross-domain agent benchmark.
Workflow-GYM addresses a highly timely and economically valuable problem: long-horizon computer-use agents operating professional software. While Paper 1 offers a novel cognitive perspective on collaboration, Paper 2 aligns directly with the current AI frontier of developing autonomous GUI agents for real-world workflows. Its potential to drive immediate practical applications and highlight critical failure modes in state-of-the-art models gives it broader and more immediate scientific and industrial impact.
Paper 1 presents a novel and concrete methodological contribution—delayed per-step reward attribution with eligibility gating—that addresses a fundamental challenge in multi-agent RL for language models. Its strong empirical validation (an 8B model beating GPT-5 on a competitive benchmark, winning first place at NeurIPS 2025) demonstrates significant practical impact. Paper 2, while addressing an important gap in GUI agent evaluation, is primarily a benchmark contribution that reveals limitations without proposing solutions. Benchmark papers tend to have narrower long-term impact compared to papers introducing methods that achieve state-of-the-art results with dramatically fewer resources.
Paper 1 likely has higher impact because it introduces a new, timely benchmark targeting long-horizon, economically valuable, domain-specific GUI workflows—an evaluation gap with broad relevance to agent research, HCI, and real-world deployment. Benchmarks often become shared infrastructure, shaping research directions and enabling standardized comparisons across models and methods. While Paper 2’s uncertainty-aligned RL for tool-calling is innovative and useful, it is a more incremental algorithmic contribution whose impact depends on adoption and generalization, whereas Workflow-GYM can catalyze a wider ecosystem of methods and evaluations.
Paper 1 likely has higher impact due to stronger timeliness and broad applicability: long-horizon, real-world GUI agent evaluation is a key bottleneck for deploying agents in professional settings, and a well-designed benchmark can become a community standard. It targets economically valuable workflows across diverse domains and provides diagnostic failure modes that can steer multiple research areas (agent planning, UI grounding, robustness). Paper 2 is novel and useful, but idea-generation gains are incremental, evaluation is harder to validate, and downstream real-world adoption is less immediate than a benchmark for agent capability.
Paper 2 has higher potential impact due to a clear algorithmic contribution (residual-centric coding with deterministic guarantees) that materially advances high-fidelity scientific data compression, a critical bottleneck for HPC and climate/CFD workflows. It targets stringent error regimes (NRMSE 1e-6–1e-4), demonstrates sizable gains over strong baselines (GAE, SZ) across multiple real datasets, and emphasizes reproducible, deterministic decoding—important for scientific use. Paper 1 is timely and useful as a benchmark, but benchmark papers often yield narrower, more incremental impact unless they become a dominant standard.