DocOS: Towards Proactive Document-Guided Actions in GUI Agents

Jingjing Liu, Ziye Huang, Zihao Cheng, Zeming Liu, Jiahong Wu, Yuhang Guo, Kehai Chen, Yunhong Wang

#1025 of 2292 · Artificial Intelligence
Tournament Score: 1424±43
Win Rate: 50% · Wins: 11 · Losses: 11 · Matches: 22
Rating: 5.5/10

Abstract

While Graphical User Interface (GUI) agents have shown promising performance in automated device interaction, they primarily depend on static parametric knowledge from pre-training or instruction tuning. This reliance fundamentally limits their ability to handle long-tailed tasks that require explicit procedural knowledge absent from model parameters, often forcing agents to resort to inefficient and brittle trial-and-error exploration. To mitigate this limitation, we introduce Proactive Document-Guided Action for GUI agents in dynamic, open-web environments, a novel paradigm that mirrors human problem-solving by enabling agents to autonomously search for relevant documentation to resolve long-tailed tasks. To evaluate agents' capability in this paradigm, we propose DocOS, a benchmark designed to assess document-guided problem solving in fully interactive environments. DocOS requires agents to autonomously navigate a web browser, locate relevant online documentation, comprehend procedural instructions, and faithfully ground them into executable GUI actions. Extensive experiments reveal that progress is strictly constrained by dual bottlenecks: agents struggle to reliably locate relevant information during proactive search and frequently fail to faithfully ground retrieved instructions into precise actions, pointing toward document-guided interaction as a crucial pathway for enabling self-evolving GUI agents in dynamic environments.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: DocOS — Towards Proactive Document-Guided Actions in GUI Agents

1. Core Contribution

DocOS introduces the concept of Proactive Document-Guided Action for GUI agents — a paradigm where agents must autonomously search the web for documentation to solve long-tailed tasks they lack parametric knowledge for. This mirrors how human users actually operate: when encountering unfamiliar software functionality, they search for official documentation, read it, and then execute. The paper formalizes this into a two-phase pipeline (Proactive Knowledge Retrieval + Document-Grounded Execution) and constructs a benchmark of 817 tasks across 20 applications in Docker-based interactive desktop environments.

The key insight is that existing GUI agents rely on static parametric knowledge and fail on long-tailed, application-specific tasks. By requiring agents to autonomously retrieve and apply documentation, DocOS tests a capability essential for real-world deployment but previously unexamined in benchmarks.
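The two-phase structure described above (Proactive Knowledge Retrieval followed by Document-Grounded Execution) can be pictured with a toy control loop. This is only an illustrative sketch, not the paper's implementation: `Document`, `TOY_CORPUS`, `retrieve_document`, `execute_with_document`, and `run_task` are hypothetical names, and the in-memory dictionary stands in for live web search.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    url: str
    steps: list[str]


# Hypothetical stand-in for the open web: task keyword -> documentation page.
TOY_CORPUS = {
    "export pdf": Document(
        url="https://docs.example.com/export-pdf",
        steps=["open File menu", "click Export", "choose PDF"],
    ),
}


def retrieve_document(task: str) -> Optional[Document]:
    """Phase 1 (proactive knowledge retrieval): look for a page matching the task."""
    for keyword, doc in TOY_CORPUS.items():
        if keyword in task.lower():
            return doc
    return None


def execute_with_document(doc: Optional[Document]) -> list[str]:
    """Phase 2 (document-grounded execution): map instructions to GUI actions."""
    if doc is None:
        # Without documentation the agent is left with exploration.
        return ["explore UI by trial and error"]
    return [f"ACTION: {step}" for step in doc.steps]


def run_task(task: str) -> list[str]:
    """Run retrieval first, then execution conditioned on what was found."""
    return execute_with_document(retrieve_document(task))
```

In a real agent both phases would be driven by a model acting on screenshots; the sketch only shows why a failure in either phase (no document found, or instructions not grounded into actions) caps the end-to-end result.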

2. Methodological Rigor

Strengths in design:

  • The POMDP formalization is clean and the decomposition into retrieval and execution phases is well-motivated.
  • The benchmark construction pipeline (task construction → document collection → task filtering) is systematic, with 92.5% quality validation on a 20% sample.
  • The evaluation uses both URL-based metrics (TUI, HPP) and task completion rate (TCR), enabling analysis at different granularities.
  • The two complementary evaluation settings (w/ document vs. w/o document, and oracle document) enable diagnosis of bottlenecks.
Weaknesses:

  • The improvement from document guidance is surprisingly small — Table 4 shows only 2-8% relative improvement in TCR. This raises questions about whether the paradigm is truly effective, or whether the agents simply cannot utilize documents well. The paper acknowledges this as a "dual bottleneck" but the marginal gains weaken the motivating argument.
  • The baseline set is limited to 7B-8B scale open-source models, with only Qwen3-VL-32B added in the appendix. No proprietary models (GPT-4o, Claude, Gemini) are tested, which limits understanding of whether these bottlenecks are fundamental or scale-dependent.
  • The TCR numbers are extremely low across the board (best: ~17% for UI-TARS-1.5-7B), making it difficult to draw nuanced conclusions about relative performance.
  • The semantic retrieval evaluation (Appendix D) reports a "similarity score of 0.653 with URL-based evaluation results" but doesn't clearly explain what this correlation means or how it was computed.
  • The human quality validation at 92.5% on 20% of data is reasonable but not exhaustive.
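To see why a small relative improvement on a low baseline translates into a very small absolute gain, a short calculation helps. The baseline value below is hypothetical and is not taken from the paper's tables; only the 2-8% relative range comes from the review text. The function names are likewise illustrative.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change of a rate with respect to its baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


def absolute_gain_points(baseline: float, rel: float) -> float:
    """Absolute gain in percentage points, for rates given as fractions."""
    return baseline * rel * 100


# Hypothetical baseline task completion rate (assumption, not from the paper).
baseline_tcr = 0.15
gain_low = absolute_gain_points(baseline_tcr, 0.02)
gain_high = absolute_gain_points(baseline_tcr, 0.08)
print(f"absolute gain: {gain_low:.1f} to {gain_high:.1f} percentage points")
```

Under this assumed 15% baseline, a 2-8% relative improvement corresponds to roughly 0.3 to 1.2 percentage points, which on an 817-task benchmark is only a handful of additional tasks solved.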
3. Potential Impact

The paradigm of document-guided GUI agents addresses a genuine practical limitation. In enterprise settings, software frequently updates, and agents that can consult documentation dynamically would be far more robust than those relying on static training data. This has clear applications in:

  • IT automation and helpdesk workflows
  • Software testing across version updates
  • Accessibility tools for users with disabilities
  • General-purpose computer-use agents

The benchmark itself fills a clear gap in the evaluation landscape (Table 1 effectively demonstrates this). However, the current results suggest the field is far from solving this problem, which positions DocOS more as a long-term challenge benchmark than an immediately actionable contribution.

4. Timeliness & Relevance

This paper is highly timely. The GUI agent space has exploded in 2024-2025 with models like CogAgent, UI-TARS, and OpenCUA, and benchmarks like OSWorld and WebArena. The observation that these agents fail on long-tailed tasks is increasingly recognized, and DocOS directly addresses this gap. The connection to retrieval-augmented generation (RAG) in the GUI agent setting is natural and underexplored.

The paper appears at ICML 2026, which places it at a moment when the community is moving beyond basic GUI grounding toward more autonomous and adaptive agent behaviors.

5. Strengths & Limitations

Key Strengths:

  • Novel and well-motivated paradigm: The proactive document search concept is intuitive, practically important, and clearly differentiated from prior work.
  • Comprehensive benchmark design: 20 applications, 817 tasks, three difficulty levels, Docker-based execution — this is a substantial engineering effort.
  • Diagnostic evaluation framework: The two-phase evaluation with oracle/non-oracle settings enables clear bottleneck identification.
  • Detailed error taxonomy: The qualitative error analysis (imprecise localization, non-official reference, execute before retrieval, action grounding failure, context misidentification) provides actionable insights.
Notable Weaknesses:

  • Small performance deltas: The 2-8% relative improvement from documents (Table 4) and 4-11% from oracle documents (Table 5) suggest that current agents barely benefit from the paradigm, undermining its practical value at present.
  • Limited baseline diversity: No closed-source frontier models tested; scale effects underexplored.
  • Writing quality issues: Multiple typos and grammatical errors throughout ("konwledge," "oracel," "limitaions," "througy," "retireval"), suggesting rushed preparation.
  • Shallow analysis of why documents don't help more: The paper identifies bottlenecks but doesn't deeply investigate solutions or why the gaps are so severe.
  • Reproducibility concerns: While Docker environments and code are promised, the reliance on live web documentation introduces temporal instability — official documentation pages change over time, potentially invalidating ground-truth URLs and content.
  • Step-based difficulty categorization (Easy/Medium/Hard by number of steps) is simplistic and may not capture true cognitive difficulty.
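The temporal-instability concern raised above could, in principle, be monitored by comparing a stored snapshot of each documentation page with its live content. The following is a minimal sketch of that idea under stated assumptions; it is not something the paper describes, and `fingerprint` and `has_drifted` are hypothetical helper names. It uses only `hashlib.sha256` from the Python standard library.

```python
import hashlib


def fingerprint(page_text: str) -> str:
    """Hash page text after collapsing whitespace, so layout-only changes are ignored."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_drifted(snapshot_text: str, live_text: str) -> bool:
    """Return True when the live page no longer matches the stored snapshot."""
    return fingerprint(snapshot_text) != fingerprint(live_text)
```

A check like this only flags that a page changed; deciding whether the change invalidates a task's ground truth would still need human or model review.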

6. Additional Observations

The AutoGen framework experiment (Table 9) is a welcome addition but shows minimal difference from the base model, suggesting that agentic scaffolding alone doesn't resolve the fundamental capability gaps. The paper would benefit from exploring whether chain-of-thought prompting, better document chunking, or multi-turn retrieval strategies could improve performance.

The dataset contribution is the paper's strongest lasting impact: even if current models perform poorly, DocOS provides a meaningful evaluation target as GUI agents improve.

Rating: 5.5/10
Significance 6.5 · Rigor 5 · Novelty 6.5 · Clarity 5

Generated May 19, 2026

    Comparison History (22)

    vs. Unlocking Proactivity in Task-Oriented Dialogue
    claude-opus-4.6 · 5/22/2026

    Paper 1 introduces a novel paradigm (proactive document-guided action) for GUI agents that addresses a fundamental limitation—reliance on static parametric knowledge—with broader applicability across diverse GUI automation tasks. The benchmark (DocOS) fills an important gap in evaluating agents in dynamic, open-web environments, which is highly relevant given the rapid growth of autonomous agent research. Paper 2, while technically sophisticated with its asymmetric-view training and cognitive user simulator, addresses a narrower problem (proactive task-oriented dialogue/sales). Paper 1's contribution has wider potential impact across the agent research community and more diverse real-world applications.

    vs. Prior Knowledge or Search? A Study of LLM Agents in Hardware-Aware Code Optimization
    gpt-5.2 · 5/20/2026

    Paper 2 is more novel in framing “proactive document-guided actions” as a distinct capability and contributes a benchmark (DocOS) that can standardize evaluation and drive follow-on work. Its applications span web automation, enterprise tooling, accessibility, and general agentic RAG, giving broader cross-field impact and timeliness as GUI agents rapidly evolve. Paper 1 offers valuable, rigorous negative/diagnostic findings about LLM optimization limits in hardware-aware code, but its impact is narrower (compiler/kernel optimization) and mainly characterizes failure modes rather than enabling a new scalable research direction or widely reusable artifact.

    vs. Explainable Wastewater Digital Twins: Adaptive Context-Conditioned Structured Simulators with Self-Falsifying Decision Support
    gemini-3.1 · 5/20/2026

    Paper 2 offers a rigorous methodological advancement by combining interpretable AI with statistical safety guarantees (conformal risk control) for physical systems. Its direct application to safety-critical, real-world infrastructure (wastewater treatment) addressing energy efficiency and greenhouse gas emissions gives it profound real-world impact. While Paper 1 introduces a useful benchmark for GUI agents, Paper 2 spans AI, control theory, and environmental engineering with validated real-world testing, suggesting broader and more immediate scientific and societal impact.

    vs. Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use
    claude-opus-4.6 · 5/20/2026

    Paper 2 introduces a novel, practical paradigm (proactive document-guided action) with a concrete benchmark (DocOS) for GUI agents, addressing a clear limitation in current agentic systems. It has broader applicability across the rapidly growing GUI/web agent community, identifies specific bottlenecks that can drive future research, and is highly timely given the surge in LLM-based agents. Paper 1, while intellectually elegant in formalizing trust calibration as preferential Bayesian optimization, is more incremental—connecting existing frameworks (GP classification, PBO) to a specific application without novel algorithmic contributions or empirical validation.

    vs. Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs
    claude-opus-4.6 · 5/19/2026

    Paper 2 introduces a more novel and cross-disciplinary framework (GVG) that bridges neuroscience, computer vision, and MLLMs by using generative visual grounding to translate EEG signals into proxy images. This addresses a fundamental limitation in brain-computer interfaces with broad implications for clinical neuroscience and brain foundation models. Paper 1, while addressing a practical limitation of GUI agents, is more incremental—introducing a benchmark for document-guided actions in a narrower application domain. Paper 2's methodological innovation (trimodal alignment, EEG-to-image generation) and potential impact across neuroscience and AI give it higher scientific impact.

    vs. TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens
    gemini-3.1 · 5/19/2026

    Paper 1 introduces a fundamental methodological innovation by replacing computationally expensive explicit Chain-of-Thought reasoning with latent think tokens for multimodal representations. This addresses a critical bottleneck in deploying reasoning-heavy models, offering broad applicability across foundation models and representation learning. Paper 2 presents a valuable benchmark for GUI agents, but its impact is relatively confined to the agentic workflow subfield, making Paper 1's architectural advancements more likely to achieve widespread scientific and practical impact.

    vs. AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment
    gemini-3.1 · 5/19/2026

    Paper 1 addresses a fundamental bottleneck in LLM alignment and complex reasoning (token-level credit assignment in RLVR). Its proposed algorithmic solution, AMR-SD, targets core methodological challenges in training state-of-the-art reasoning models. While Paper 2 introduces a valuable benchmark for GUI agents, advancements in foundational reasoning capabilities and RL training paradigms typically exert a more profound, widespread impact across the broader AI ecosystem.

    vs. Multi-Paradigm Agent Interaction in Practice: A Systematic Analysis of Generator-Evaluator, ReAct Loop, and Adversarial Evaluation in the buddyMe Framework
    gemini-3.1 · 5/19/2026

    Paper 1 introduces a novel paradigm and benchmark for GUI agents, addressing a fundamental limitation in agent autonomy by enabling proactive document-guided actions. This conceptual leap and the introduction of a new benchmark are likely to spur significant follow-up research. Paper 2 provides valuable empirical analysis of existing paradigms but is more focused on system engineering, making its fundamental scientific impact comparatively lower.

    vs. Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents
    gpt-5.2 · 5/19/2026

    Paper 2 (DocOS) likely has higher impact due to broader real-world applicability and timeliness: enabling GUI agents to proactively retrieve and execute procedural knowledge from documentation directly targets a major deployment bottleneck for agents in open-web and enterprise settings. The paradigm and benchmark span IR/search, grounding, planning, and HCI, widening cross-field relevance. Paper 1 is novel and methodologically interesting (causal interventions for memory selection) but is more specialized to long-horizon LLM memory management and depends on synthetic harmful-memory construction, potentially narrowing immediate adoption compared to document-guided GUI autonomy.

    vs. What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models
    gemini-3.1 · 5/19/2026

    Paper 2 offers higher scientific impact due to its critical interdisciplinary focus on medical AI ethics. While Paper 1 presents a useful benchmark for GUI agents, Paper 2 addresses a high-stakes, real-world problem: the ethical alignment and potential value monoculture of LLMs in healthcare. Its novel framework for auditing clinical value pluralism bridges AI safety, bioethics, and clinical medicine. This breadth of impact, combined with the urgent timeliness of safely deploying medical AI, gives Paper 2 a significantly higher potential to influence policy, clinical guidelines, and future AI alignment research.

    vs. SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
    gemini-3.1 · 5/19/2026

    Paper 2 addresses a fundamental limitation in GUI agents by enabling them to proactively seek out and use documentation for long-tail tasks. This paradigm shift toward self-evolving, adaptable agents has broad implications for general AI automation and OS control. In contrast, Paper 1 focuses on a highly domain-specific benchmark (Chinese gaming short videos), which, while valuable, offers narrower potential impact and applications compared to the broader utility of autonomous GUI navigation and document grounding.

    vs. Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution
    gpt-5.2 · 5/19/2026

    Paper 2 (Solvita) has higher estimated impact due to a more broadly applicable innovation: a stateless-to-stateful shift via trainable knowledge networks that accumulate experience without updating the base LLM, validated on multiple standard programming benchmarks plus live Codeforces. This combines methodological rigor (closed-loop agents, certified supervision, adversarial hacking, RL-updated routing) with strong, measurable performance gains and clear downstream utility for reliable code generation and general agent learning. Paper 1 is timely and useful as a benchmark/paradigm for document-guided GUI agents, but its contribution is more evaluation-focused and narrower in cross-domain reach.

    vs. DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition
    gemini-3.1 · 5/19/2026

    Paper 1 introduces a novel paradigm and benchmark (DocOS) for autonomous GUI agents, shifting from static knowledge reliance to dynamic, proactive document retrieval. This addresses a fundamental bottleneck in agentic AI (handling long-tail tasks). While Paper 2 demonstrates impressive large-scale industrial deployment, it is primarily an applied engineering contribution specific to IVR systems. Paper 1's foundational approach to agent problem-solving, planning, and grounding offers broader applicability and higher potential to drive future scientific research and citations across the rapidly growing field of LLM agents.

    vs. How Mobile World Model Guides GUI Agents?
    gpt-5.2 · 5/19/2026

    Paper 2 likely has higher impact due to a broader, more general contribution: a systematic study and strong empirical results on mobile world models across multiple modalities, with SoTA performance and clear guidance on when different representations help (in-distribution vs OOD, training vs test-time). It also tests transfer to multiple downstream agent benchmarks, increasing methodological rigor and breadth. Paper 1 is novel and timely in proposing document-guided GUI action and a benchmark, but its primary contribution is evaluative/diagnostic and narrower to documentation retrieval+grounding, with less demonstrated performance gain across tasks.

    vs. Brain Vascular Age Prediction Using Cerebral Blood Flow Velocity and Machine Learning Algorithms
    gpt-5.2 · 5/19/2026

    Paper 2 is likely to have higher scientific impact due to stronger novelty (proactive document-guided actions for GUI agents) and broader, timely relevance to foundation-model agents, tool use, web interaction, and benchmark-driven progress. DocOS provides a reusable evaluation framework that can influence multiple subfields (NLP, HCI, IR, RL, agentic systems) and catalyze methodological advances by concretely identifying bottlenecks (search + grounding). Paper 1 is applied and clinically relevant, but its ML-on-TCD approach is more incremental and likely narrower in cross-field reach and scalability.

    vs. Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation
    claude-opus-4.6 · 5/19/2026

    Paper 1 addresses a critical safety issue in autonomous driving VLA models, revealing alarming unfaithfulness in reasoning (42.5% fidelity, 94 missed pedestrians, 97.7% trajectory fragility). This has immediate, high-stakes real-world implications for autonomous vehicle safety. The formalization of faithfulness information-theoretically and the proposed safety architecture contribute foundational methodology. Paper 2 introduces a useful benchmark for GUI agents but addresses a narrower, lower-stakes problem. Paper 1's findings could reshape how the field approaches VLA deployment and safety verification, giving it broader and more urgent impact.

    vs. What Do EEG Foundation Models Capture from Human Brain Signals?
    gpt-5.2 · 5/19/2026

    Paper 2 likely has higher scientific impact: it provides a rigorous, multi-model, multi-task causal auditing framework for EEG foundation models, directly addressing interpretability and clinical trust—highly timely for medical ML. Its methods (probing, subspace erasure, transparent baselines) are broadly reusable across biosignals and representation learning, and the quantified “recoverable advantage” offers actionable guidance for future feature/concept discovery. Paper 1 is novel and relevant for GUI agents, but is primarily a benchmark/paradigm proposal with narrower immediate real-world stakes and less cross-domain methodological reuse.

    vs. When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents
    gemini-3.1 · 5/19/2026

    Paper 2 addresses a critical bottleneck in deploying AI agents: the sim-to-real gap caused by real-world noise and API failures. By framing these issues within POMDP components, providing a benchmark grounded in real GitHub issues, and introducing an effective domain-randomization RL solution, it offers both strong theoretical grounding and high practical utility. While Paper 1 is innovative, Paper 2's focus on robustness and reliability gives it broader applicability and higher potential impact across real-world deployments.

    vs. Causal Algorithmic Recourse: Foundations and Methods
    claude-opus-4.6 · 5/19/2026

    Paper 1 makes fundamental theoretical contributions to algorithmic recourse by introducing a novel causal framework that addresses a significant gap in existing approaches—modeling recourse as a process with pre/post-intervention outcomes rather than static counterfactuals. It provides rigorous methodological contributions including stability conditions, copula-based inference, goodness-of-fit testing, and distribution-free alternatives. This work has broad impact across AI fairness, causal inference, and trustworthy ML. Paper 2 introduces a useful benchmark for GUI agents but is more incremental, addressing a specific engineering challenge with narrower theoretical contribution.

    vs. NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents
    gpt-5.2 · 5/19/2026

    Paper 2 (DocOS) has higher potential impact due to its timely focus on enabling GUI agents to handle long-tailed tasks via proactive documentation search and grounding, and because it introduces a benchmark in a fully interactive open-web setting—likely to catalyze broad, measurable progress across agent research, web automation, retrieval, grounding, and evaluation. Paper 1 offers a practical hybrid memory architecture, but it appears more systems-engineering oriented with narrower generalizability and less standardized evaluation potential. DocOS’s benchmark and identified bottlenecks make it a strong community driver.