DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

Wenkai Wang, Tao Xiong, Jingchen Ni, Yunpeng Bao, Xiyun Li, Tianqi Liu, Hongcan Guo, Zilong Huang

Jun 2, 2026

arXiv:2606.03103v1

cs.AI (primary)

#817 of 3404 · Artificial Intelligence

Tournament Score: 1458 ± 46 (scale 1050 to 1800)

Win Rate: 70%

Rating: 6.5 / 10

Significance 7 · Rigor 6 · Novelty 6.5 · Clarity 7.5

Abstract

Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents proactively seek necessary information and users provide additional instructions, clarifications, feedback, or corrections as the task progresses. Yet existing desktop GUI benchmarks mostly reduce this setting to short, simplified tasks with all user instructions provided upfront. To address this issue, we introduce DeskCraft, a desktop GUI benchmark targeting long-horizon creative and engineering workflows and proactive human-agent collaboration. DeskCraft organizes tasks into a multilevel difficulty taxonomy, with long-horizon tasks requiring over 50 execution steps, and covers professional creative software across design, video, audio, and 3D creation. Furthermore, DeskCraft formalizes human-agent collaboration into an interaction protocol covering mid-turn and post-turn exchanges. Mid-turn interaction captures both agent-initiated clarification under uncertainty and user-initiated interruption during execution, while post-turn interaction accommodates user-driven feedback after the agent signals completion, together spanning the full space of realistic collaboration patterns. We evaluate 18 proprietary and open-source agents on 538 tasks and find that GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. Further analyses reveal persistent failures in long-horizon workflow delivery and proactive clarification. We will open-source all evaluation code, tasks, and data at https://github.com/mrwwk/DeskCraft.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: DeskCraft

1. Core Contribution

DeskCraft introduces a 538-task benchmark for desktop GUI agents that addresses three underexplored dimensions simultaneously: (1) long-horizon professional workflows requiring 50+ execution steps, (2) a formalized human-agent interaction protocol with composable triggers, and (3) coverage of professional creative/engineering software (Blender, Kdenlive, Audacity, GIMP, Inkscape) beyond the typical office suite focus. The benchmark structures tasks into an L1/L2/L3 difficulty taxonomy and defines three trigger types (agent_done, agent_ask, step_count) that model post-turn and mid-turn collaboration patterns. The key insight is that real desktop work involves iterative human-agent coordination—not just executing a fixed instruction—and existing benchmarks fail to capture this.
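As a rough illustration of how the three trigger types named above could drive a scripted user, here is a minimal sketch. Only the trigger names (`agent_done`, `agent_ask`, `step_count`) come from the assessment; the `Phase` class, the event vocabulary (`act`, `ask`, `done`), the `at_step` field, and the rule that phases fire strictly in order are assumptions made for the example, not DeskCraft's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Trigger names are taken from the assessment; everything else is hypothetical.
TRIGGERS = {"agent_done", "agent_ask", "step_count"}


@dataclass(frozen=True)
class Phase:
    """One scripted user turn: a message plus the condition that releases it."""
    message: str
    trigger: str
    at_step: Optional[int] = None  # only meaningful for step_count

    def __post_init__(self) -> None:
        if self.trigger not in TRIGGERS:
            raise ValueError(f"unknown trigger: {self.trigger}")
        if self.trigger == "step_count" and self.at_step is None:
            raise ValueError("step_count requires at_step")

    def fires(self, event: str, step: int) -> bool:
        """event is 'done', 'ask', or 'act' (an ordinary GUI action)."""
        if self.trigger == "agent_done":   # post-turn feedback
            return event == "done"
        if self.trigger == "agent_ask":    # agent-initiated clarification
            return event == "ask"
        return step >= self.at_step        # user-initiated interruption


def run_script(phases: List[Phase], events: List[str]) -> List[Tuple[int, str]]:
    """Replay agent events against the phases in order.

    Returns (step, message) pairs for every user message that was delivered.
    """
    pending = list(phases)
    delivered = []
    for step, event in enumerate(events, start=1):
        if pending and pending[0].fires(event, step):
            delivered.append((step, pending.pop(0).message))
    return delivered


phases = [
    Phase("Use the blue palette, not red.", "step_count", at_step=3),
    Phase("Export as PNG.", "agent_ask"),
    Phase("Looks good, but crop the margin.", "agent_done"),
]
events = ["act", "act", "act", "act", "ask", "act", "done"]
log = run_script(phases, events)
for step, message in log:
    print(step, message)
```

Because each phase depends only on the event stream and the step counter, replaying the same agent trace always delivers the same messages at the same steps, which is the property that makes this style of protocol deterministic.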

2. Methodological Rigor

Strengths in design: The interaction protocol is well-formalized. Representing collaboration as phase-conditioned triggers (ϕ_k = (u_k, g_k)) is elegant—it makes evaluation deterministic and reproducible while capturing realistic patterns like clarification, interruption, and progressive refinement. The execution-based verification using domain-aware verifiers (parsing SVG XML, Blender scene graphs via bpy, audio signal analysis, Kdenlive project XML) is substantially more robust than screenshot-based evaluation.
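To show what checking file state rather than screenshots can look like, the sketch below parses an SVG document and counts rectangles with a given fill. It is a toy example under stated assumptions: the function names, the sample file, and the pass criterion are hypothetical, and real Inkscape output often stores fill inside a `style` attribute, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Standard SVG namespace, written in ElementTree's "{uri}tag" form.
SVG_NS = "{http://www.w3.org/2000/svg}"


def count_filled_rects(svg_text: str, fill: str) -> int:
    """Count <rect> elements, at any depth, whose fill attribute matches."""
    root = ET.fromstring(svg_text)
    wanted = fill.lower()
    return sum(
        1
        for el in root.iter(f"{SVG_NS}rect")
        if el.get("fill", "").lower() == wanted
    )


def verify(svg_text: str, expected_fill: str, expected_count: int) -> bool:
    """Hypothetical pass criterion: exactly expected_count matching rects."""
    return count_filled_rects(svg_text, expected_fill) == expected_count


sample = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <rect x="0" y="0" width="10" height="10" fill="#FF0000"/>
  <g><rect x="20" y="0" width="10" height="10" fill="#ff0000"/></g>
  <circle cx="50" cy="50" r="5" fill="#ff0000"/>
</svg>"""

print(verify(sample, "#ff0000", 2))
```

The check ignores how the rectangles were produced, so any sequence of GUI actions that yields the right document passes. The strictness of the structural test is also where the evaluator-coverage concern raised later in the assessment comes from.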

Concerns: The user simulator (Kimi-K2.5 backbone) introduces a confound—simulator quality could affect interactive task evaluation, though the authors mitigate this by making final evaluation depend on desktop state rather than dialogue quality. The pass@k analysis is conducted on a "representative task subset" rather than the full set, which limits generalizability claims. The paper evaluates 18 agents but provides limited error analysis beyond aggregate statistics; the appendix case studies (H.1-H.3) are informative but anecdotal. There's no inter-annotator agreement reported for task difficulty classification or evaluator correctness beyond "human and LLM dual review."

The 300-step budget analysis (RQ2) is interesting but only conducted for one model (Kimi-K2.6), limiting conclusions about whether the step budget is a universal bottleneck.
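For readers unfamiliar with the pass@k metric mentioned in the concerns above, the sketch below implements the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of attempts per task and c the number that succeeded. The assessment does not say which estimator the paper uses, so treating it as this one is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts, drawn without
    replacement from n recorded attempts with c successes, succeeds."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed,
    # which makes the estimate exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 5 attempts, 1 success.
print(pass_at_k(5, 1, 1))
print(pass_at_k(5, 1, 3))
```

Averaging this quantity over tasks gives the benchmark-level pass@k, which is why running it only on a task subset limits what can be said about the full set.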

3. Potential Impact

Direct impact on agent evaluation: DeskCraft fills a genuine gap. OSWorld, Windows Agent Arena, and macOSWorld all focus on short, atomic tasks with fixed instructions. By combining long-horizon workflows with interactive collaboration, DeskCraft provides a more realistic proxy for actual desktop agent deployment. The professional software coverage (3D modeling, video editing, audio production) is particularly valuable as these domains have dense UI surfaces and require spatial precision that stress-tests agent capabilities differently from office tasks.

Influence on agent development: The finding that agents rarely use the ASK channel proactively (Obs.❼) is practically important—it identifies a concrete capability gap. The observation that correction/feedback tasks are easier than interruption tasks (Obs.❻) provides actionable guidance for training interactive agents. The L3 performance cliff (GPT-5.4 dropping from 40.7% at L2 to 9.5% at L3) quantifies a critical failure mode.

Limitations on impact: The benchmark is Ubuntu-only, which may limit adoption for Windows/macOS-focused agent development. The interactive split (152 tasks) is relatively small for fine-grained analysis across 6 collaboration modes × multiple applications. The reliance on a fixed user simulator means the benchmark cannot capture adversarial or creative user behaviors.

4. Timeliness & Relevance

The paper is highly timely. Computer-use agents (CUAs) are a major research and product frontier in 2025-2026, with Claude, GPT-5, and multiple open-source efforts targeting desktop automation. The gap between current benchmark capabilities (short atomic tasks) and deployment requirements (sustained interactive workflows) is widely recognized. The inclusion of results for very recent models (GPT-5.4, Kimi-K2.6, Qwen3.5/3.6) makes the evaluation immediately relevant.

The human-in-the-loop dimension is particularly timely as the field moves from "can agents click buttons" to "can agents collaborate with users"—a prerequisite for practical deployment.

5. Strengths & Limitations

Key strengths:

Novel evaluation axis: The composable trigger protocol is the paper's strongest conceptual contribution, providing a principled framework for interactive agent evaluation on desktops.

Professional software coverage: Blender, Kdenlive, Audacity, and Inkscape tasks demand domain knowledge and spatial reasoning that generic benchmarks miss.

Comprehensive evaluation: 18 agents spanning proprietary, open-source generalist, and GUI-specialized models provide a thorough landscape assessment.

Diagnostic taxonomy: The L1/L2/L3 structure enables precise failure diagnosis rather than single-number comparisons.

Detailed appendix: The construction methodology (Appendix F) and representative cases (Appendix G-H) provide unusual transparency.

Notable limitations:

Scale constraints: 538 tasks total, with only 108 L3 tasks and 152 interactive tasks, limits statistical power for fine-grained analysis.

Evaluator coverage: Some evaluators rely on file-structure checks that may miss semantically correct but structurally different solutions.

Single-simulator dependency: Using one LLM as user simulator could introduce systematic biases that affect all interactive evaluations uniformly.

No human baseline: The paper reports no human performance, making it difficult to calibrate whether 31.6% represents a meaningful capability level or whether tasks are unreasonably difficult.

Reproducibility concerns: Despite the promised open-source release, VM setup complexity and reliance on specific software versions may create reproducibility barriers.

Limited novelty in individual components: The difficulty taxonomy, interaction triggers, and execution-based verification are each incremental advances; the contribution is primarily in their combination and application to professional desktop workflows.
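The scale constraint noted above can be made concrete with a confidence interval on a success rate. The sketch below computes a Wilson score interval. The subset sizes 108 and 538 come from the assessment, but the success counts are hypothetical and chosen only to illustrate how interval width depends on sample size.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts at a similar success rate on two subset sizes.
small = wilson_interval(10, 108)   # e.g. an L3-sized subset
large = wilson_interval(50, 538)   # e.g. the full task set
print(small, small[1] - small[0])
print(large, large[1] - large[0])
```

At comparable success rates the interval on the smaller subset is noticeably wider, so differences between agents on L3 or interactive tasks need to be larger before they can be read as more than noise.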

Overall Assessment

DeskCraft makes a solid contribution to the agent benchmarking ecosystem by addressing a real and timely gap. Its interaction protocol is well-designed, and the professional software coverage meaningfully extends the evaluation frontier. The empirical findings—particularly around proactive clarification failure and L3 performance cliffs—are actionable. However, the benchmark's moderate scale, single-platform coverage, and absence of human baselines somewhat limit the depth of conclusions. This is a competent benchmarking paper that will likely see adoption in the desktop agent community, though its broader influence will depend on whether the interaction protocol framework generalizes beyond this specific implementation.

Rating: 6.5 / 10

Significance 7 · Rigor 6 · Novelty 6.5 · Clarity 7.5

Generated Jun 3, 2026

Comparison History (23)

vs. MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop

gpt-5.2 · 6/5/2026

Paper 1 (DeskCraft) likely has higher scientific impact because it introduces a large, open benchmark addressing a major gap: long-horizon desktop workflows with realistic human-in-the-loop interaction protocols. Benchmarks can shape and standardize evaluation across many agent architectures and research groups, with broad applicability to HCI, autonomous agents, and real-world productivity tools; the scale (538 tasks, 18 agents) and planned open-sourcing increase adoption potential and timeliness. Paper 2’s method is promising but more domain-specific (math/RLVR) and may see narrower immediate cross-field uptake.

vs. Beyond Prompt-Based Planning: MCP-Native Graph Planning-based Biomedical Agent System

claude-opus-4.6 · 6/5/2026

BioManus introduces a novel architectural paradigm (MCP-native graph planning) that addresses fundamental scalability bottlenecks in biomedical agent systems, with theoretical analysis of context compression and demonstrated improvements on benchmarks. It offers both a new system design and a reusable ecosystem (BioinfoMCP Compiler). While DeskCraft is a valuable benchmark contribution for desktop GUI agents with thoughtful human-in-the-loop protocols, benchmarks typically have narrower methodological impact than new architectural paradigms. BioManus's structured capability graph approach could influence agent design across domains beyond biomedicine.

vs. An interpretable and trustworthy AI framework for large-scale longitudinal structure-pain association studies using data from the Osteoarthritis Initiative (OAI)

claude-opus-4.6 · 6/5/2026

DeskCraft addresses a timely and rapidly growing area—autonomous desktop agents and human-AI collaboration—with a comprehensive benchmark covering 538 tasks across professional creative software, evaluating 18 agents. It fills a clear gap in existing benchmarks by introducing long-horizon workflows and formalized human-in-the-loop interaction protocols. The breadth of impact across AI agent development, HCI, and software automation is substantial. Paper 2 is a solid clinical AI contribution but is more incremental, combining existing techniques (conformal prediction, LCMM) in a narrower medical domain with less potential for broad cross-field influence.

vs. Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation

claude-opus-4.6 · 6/3/2026

DeskCraft addresses a fundamental gap in AI agent evaluation—long-horizon professional desktop workflows with human-in-the-loop collaboration—which is highly relevant given the rapid deployment of autonomous agents. Its comprehensive benchmark (538 tasks, 18 agents, professional software) provides critical infrastructure for the booming desktop agent field. Paper 2, while interesting in reframing UGC quality assessment around social resonance, addresses a narrower problem with more incremental methodological contributions (Social-CoT is a creative but relatively modest extension). DeskCraft's broader applicability across AI agent research gives it higher impact potential.

vs. Decomposing how prompting steers behavior

gemini-3.1 · 6/3/2026

Paper 2 tackles a fundamental theoretical question in AI: how prompting mechanically alters internal representations to steer model behavior. Its novel geometric decomposition framework offers deep mechanistic insights applicable across various LLMs and VLMs. While Paper 1 introduces a valuable benchmark for GUI agents, Paper 2 provides foundational knowledge that could broadly influence model interpretability, alignment, and future architecture design across multiple subfields of AI research.

vs. Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

claude-opus-4.6 · 6/3/2026

DeskCraft introduces a novel benchmark addressing a significant gap in desktop GUI agent evaluation—long-horizon professional workflows with human-in-the-loop collaboration. This targets a rapidly growing field (autonomous desktop agents) with broad applicability across creative and engineering domains. The benchmark's formalization of interaction protocols and evaluation of 18 agents on 538 tasks provides substantial community infrastructure. Paper 1, while solid, offers an incremental optimization method (EAPO) for tool-use regulation in agentic RL—a narrower contribution. DeskCraft's benchmark nature means it will likely be cited more broadly as a standard evaluation resource.

vs. SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents

gemini-3.1 · 6/3/2026

Paper 1 introduces a comprehensive benchmark for long-horizon, human-in-the-loop desktop agents, addressing a critical gap in current AI evaluation. Benchmarks targeting realistic professional workflows often drive significant follow-on research and establish new standards. While Paper 2 offers a strong methodological improvement, Paper 1's focus on proactive collaboration and complex creative tasks positions it to have broader, field-shaping impact across both AI and human-computer interaction.

vs. Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

gpt-5.2 · 6/3/2026

Paper 1 likely has higher impact due to stronger timeliness and direct applicability: it introduces a large-scale, open benchmark for long-horizon desktop agent workflows with explicit human-in-the-loop interaction protocols, evaluating 18 agents on 538 tasks across widely used professional creative tools. Benchmarks often become standard infrastructure that accelerates progress across academia and industry, and its failure analyses target a key bottleneck (long-horizon execution and proactive clarification). Paper 2 is novel and relevant for computational social science, but its scope and validation appear narrower and more model-dependent.

vs. An Exploration of Collision-based Enemy Morphology Generation

gemini-3.1 · 6/3/2026

Paper 1 addresses a highly relevant and rapidly growing field in AI: autonomous GUI agents and human-in-the-loop collaboration. By introducing a comprehensive benchmark for long-horizon, professional workflows, it provides a crucial evaluation tool that will likely drive future agentic AI research. Paper 2, while interesting, focuses on a niche application of procedural content generation in video games, which inherently has a narrower scope and lower potential for cross-disciplinary impact compared to foundational AI agent benchmarking.

vs. Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents

claude-opus-4.6 · 6/3/2026

DeskCraft introduces a comprehensive benchmark addressing a significant gap in evaluating desktop GUI agents on realistic professional workflows with human-in-the-loop collaboration. It covers 538 tasks across multiple professional domains, evaluates 18 agents, formalizes interaction protocols, and will be open-sourced. Its breadth of impact is substantial—benchmarks shape entire research directions. Paper 2, while technically sound, addresses a narrower problem (proactive mobile agent efficiency) with an incremental architectural contribution (two-stage gating). DeskCraft's benchmark contribution is likely to be widely adopted and cited, driving research in desktop automation and human-agent collaboration.

vs. CORE: Conflict-Oriented Reasoning for General Multimodal Manipulation Detection

gemini-3.1 · 6/3/2026

Paper 1 addresses a critical and urgent societal challenge—multimodal fake news and generative AI manipulation. By introducing a novel, generalizable 'conflict-oriented reasoning' paradigm for MLLMs, it overcomes the severe limitations of existing domain-specific models. Its zero-shot adaptation capabilities provide immediate real-world value for AI safety, security, and trust. While Paper 2 offers a valuable benchmark for GUI agents, Paper 1's methodological innovation and its profound, widespread implications across security and information integrity give it a higher potential for broad scientific and societal impact.

vs. Uncertainty-Aware Clarification in LLM Agents with Information Gain

gpt-5.2 · 6/3/2026

Paper 1 (DeskCraft) likely has higher impact because it introduces a large, realistic, long-horizon benchmark for professional creative/engineering desktop workflows with a formalized human-in-the-loop interaction protocol and broad evaluation across many agents. Such benchmarks often become community standards, enabling reproducible comparison and accelerating progress across UI agents, HCI, and applied ML. Paper 2 proposes a focused training objective (information-gain reward) with moderate gains in a specific environment; it is methodologically interesting but narrower in scope and likely less field-shaping than a widely adopted benchmark.

vs. LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

gemini-3.1 · 6/3/2026

Paper 1 demonstrates a major breakthrough in automated formal theorem proving, achieving state-of-the-art results on extremely difficult mathematical benchmarks (IMO, Putnam 2025) and contributing to open research problems. While Paper 2 introduces a valuable benchmark for evaluating GUI agents, Paper 1's methodological advancements in mathematical reasoning and its immediate utility in advancing mathematical research give it a significantly higher potential for broad scientific impact and fundamental AI capability enhancement.

vs. When to Re-Plan: Subgoal Persistence in Hierarchical Latent Reasoning

claude-opus-4.6 · 6/3/2026

DeskCraft addresses a significant gap in AI agent evaluation by benchmarking desktop agents on realistic professional workflows with human-in-the-loop collaboration. It evaluates 18 agents across 538 tasks in real creative/engineering software, providing a comprehensive resource for the rapidly growing field of GUI agents. Its breadth of impact (touching HCI, AI agents, professional software automation), timeliness (desktop agents are a hot research area), and practical utility as a benchmark give it higher impact potential. Paper 2 offers interesting insights on subgoal persistence but is narrower in scope, evaluated on limited benchmarks (ARC), with incremental architectural contributions.

vs. RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases

gemini-3.1 · 6/3/2026

Paper 1 introduces a comprehensive benchmark for a highly relevant and rapidly growing field (GUI agents and human-AI collaboration). By addressing the critical gap of long-horizon, real-world tasks in professional software and formalizing human-in-the-loop interaction protocols, it sets a foundational standard likely to spur broad subsequent research. In contrast, Paper 2 presents an incremental architectural improvement for a specific database autocomplete benchmark, which, while valuable, has a narrower scope and lower potential for paradigm-shifting impact.

vs. Prospect-Theory Behavior from Bellman Optimality in MDPs with Catastrophic States

claude-opus-4.6 · 6/3/2026

Paper 2 offers a fundamental theoretical insight bridging rational decision-making (Bellman optimality) with behavioral economics (prospect theory), showing that prospect-theory-like behaviors emerge from purely rational agents facing catastrophic states—without requiring cognitive biases. This is a surprising, elegant result with broad interdisciplinary impact across economics, AI/RL, cognitive science, and decision theory. Paper 1, while practically useful as a benchmark for desktop GUI agents, is more incremental—extending existing benchmarking paradigms to longer-horizon tasks. Benchmarks have shorter-lived impact as the field evolves, whereas Paper 2's theoretical contribution is more durable and paradigm-shifting.

vs. Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction

claude-opus-4.6 · 6/3/2026

DeskCraft addresses a significant gap in AI agent evaluation by introducing a benchmark for long-horizon professional desktop workflows with human-in-the-loop collaboration—a largely unexplored but practically crucial area. Its breadth (538 tasks, 18 agents, multiple professional software domains) and formalization of interaction protocols create substantial infrastructure value for the rapidly growing computer-use agent field. Paper 2 makes a solid contribution linking compression to uncertainty calibration, but addresses a narrower question with more incremental findings. DeskCraft's timeliness amid the surge in desktop GUI agents gives it broader impact potential.

vs. TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection

gpt-5.2 · 6/3/2026

Paper 1 (DeskCraft) likely has higher impact: it introduces a large, open benchmark capturing long-horizon, real professional desktop workflows and formalizes human-in-the-loop interaction protocols—an enabling infrastructure that can steer evaluation and development of desktop agents broadly. Its scale (538 tasks, 18 agents), breadth across creative/engineering tools, and focus on realistic collaboration make it timely and widely applicable across agentic AI, HCI, and software automation. Paper 2 is novel and useful, but is narrower (white-box hallucination detection) and less universally deployable given white-box access requirements.

vs. Don't Gamble, GAMBLe: An Analytical Framework for AI-Driven Research Systems

claude-opus-4.6 · 6/3/2026

Paper 1 introduces a novel analytical framework (GAMBLe) for understanding AI-driven research systems, addressing a fundamental gap in how we analyze and optimize LLM-based discovery pipelines. Its theoretical contributions (formalizing effective landscapes, proving standard convergence guarantees don't hold) combined with extensive empirical validation (760+ runs, 46K+ iterations) provide broadly applicable insights across domains. Paper 2, while valuable as a benchmark for desktop GUI agents, is more incremental—extending existing benchmarking paradigms to professional workflows. Benchmarks have shorter-lived impact as models rapidly improve, whereas analytical frameworks like GAMBLe offer lasting methodological contributions.

vs. Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

gemini-3.1 · 6/3/2026

While Paper 1 provides valuable diagnostic insights into LLM API use, Paper 2 tackles the next major frontier in AI: long-horizon, human-in-the-loop desktop agents operating specialized professional software. By formalizing realistic collaborative interactions and moving beyond short, simplified GUI tasks, DeskCraft addresses a critical bottleneck in deploying truly autonomous and cooperative AI assistants in real-world workflows.