DocOS: Towards Proactive Document-Guided Actions in GUI Agents
Jingjing Liu, Ziye Huang, Zihao Cheng, Zeming Liu, Jiahong Wu, Yuhang Guo, Kehai Chen, Yunhong Wang
Abstract
While Graphical User Interface (GUI) agents have shown promising performance in automated device interaction, they primarily depend on static parametric knowledge from pre-training or instruction tuning. This reliance fundamentally limits their ability to handle long-tailed tasks that require explicit procedural knowledge absent from model parameters, often forcing agents to resort to inefficient and brittle trial-and-error exploration. To mitigate this limitation, we introduce Proactive Document-Guided Action for GUI agents in dynamic, open-web environments, a novel paradigm that mirrors human problem-solving by enabling agents to autonomously search for relevant documentation to resolve long-tailed tasks. To evaluate agents' capability in this paradigm, we propose DocOS, a benchmark designed to assess document-guided problem solving in fully interactive environments. DocOS requires agents to autonomously navigate a web browser, locate relevant online documentation, comprehend procedural instructions, and faithfully ground them into executable GUI actions. Extensive experiments reveal that progress is strictly constrained by dual bottlenecks: agents struggle to reliably locate relevant information during proactive search and frequently fail to faithfully ground retrieved instructions into precise actions, pointing toward document-guided interaction as a crucial pathway for enabling self-evolving GUI agents in dynamic environments.
AI Impact Assessments
Scientific Impact Assessment: DocOS: Towards Proactive Document-Guided Actions in GUI Agents
1. Core Contribution
DocOS introduces the concept of Proactive Document-Guided Action for GUI agents: a paradigm in which agents must autonomously search the web for documentation to solve long-tailed tasks for which they lack parametric knowledge. This mirrors how human users actually operate: when encountering unfamiliar software functionality, they search for official documentation, read it, and then execute. The paper formalizes this into a two-phase pipeline (Proactive Knowledge Retrieval + Document-Grounded Execution) and constructs a benchmark of 817 tasks across 20 applications in Docker-based interactive desktop environments.
The key insight is that existing GUI agents rely on static parametric knowledge and fail on long-tailed, application-specific tasks. By requiring agents to autonomously retrieve and apply documentation, DocOS tests a capability essential for real-world deployment but previously unexamined in benchmarks.
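To make the two-phase structure concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes a tiny in-memory corpus in place of open-web search, word overlap in place of a real retriever, and a regex in place of model-based grounding. Every name here (`Document`, `Action`, `retrieve`, `ground`, `run`) is hypothetical and does not come from DocOS.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Document:
    """Hypothetical stand-in for a retrieved documentation page."""
    title: str
    steps: List[str]


@dataclass
class Action:
    """Hypothetical GUI action: a kind ("click" or "type") and its target."""
    kind: str
    target: str


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(task: str, corpus: List[Document]) -> Optional[Document]:
    """Phase 1 (assumed stand-in): pick the document whose title shares
    the most words with the task, or None if no title overlaps at all."""
    task_words = tokenize(task)
    best = max(
        corpus,
        key=lambda d: len(task_words & tokenize(d.title)),
        default=None,
    )
    if best is None or not (task_words & tokenize(best.title)):
        return None
    return best


def ground(step: str) -> Optional[Action]:
    """Phase 2 (assumed stand-in): map one written instruction to an action.
    Returns None when the step matches neither pattern."""
    match = re.match(r"(?i)(click|type)\s+(.+)", step)
    if match is None:
        return None
    return Action(match.group(1).lower(), match.group(2).strip())


def run(task: str, corpus: List[Document]) -> List[Action]:
    """Chain both phases. A retrieval miss yields no actions, and steps
    that cannot be grounded are dropped from the plan."""
    doc = retrieve(task, corpus)
    if doc is None:
        return []
    return [a for a in (ground(s) for s in doc.steps) if a is not None]
```

The two places where this toy pipeline can return nothing (a retrieval miss in `retrieve`, an ungroundable step in `ground`) correspond loosely to the two bottlenecks the abstract reports: locating relevant documentation and grounding it into precise actions.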
2. Methodological Rigor
Strengths in design:
Weaknesses:
3. Potential Impact
The paradigm of document-guided GUI agents addresses a genuine practical limitation. In enterprise settings, software frequently updates, and agents that can consult documentation dynamically would be far more robust than those relying on static training data. This has clear applications in:
The benchmark itself fills a clear gap in the evaluation landscape (Table 1 effectively demonstrates this). However, the current results suggest the field is far from solving this problem, which positions DocOS more as a long-term challenge benchmark than an immediately actionable contribution.
4. Timeliness & Relevance
This paper is highly timely. The GUI agent space has exploded in 2024-2025 with models like CogAgent, UI-TARS, and OpenCUA, and benchmarks like OSWorld and WebArena. The observation that these agents fail on long-tailed tasks is increasingly recognized, and DocOS directly addresses this gap. The connection to retrieval-augmented generation (RAG) in the GUI agent setting is natural and underexplored.
The paper appears at ICML 2026, which places it at a moment when the community is moving beyond basic GUI grounding toward more autonomous and adaptive agent behaviors.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
6. Additional Observations
The AutoGen framework experiment (Table 9) is a welcome addition but shows minimal difference from the base model, suggesting that agentic scaffolding alone doesn't resolve the fundamental capability gaps. The paper would benefit from exploring whether chain-of-thought prompting, better document chunking, or multi-turn retrieval strategies could improve performance.
The dataset contribution is the paper's strongest lasting impact — even if current models perform poorly, DocOS provides a meaningful evaluation target as GUI agents improve.
Generated May 19, 2026
Comparison History (22)
Paper 1 introduces a novel paradigm (proactive document-guided action) for GUI agents that addresses a fundamental limitation—reliance on static parametric knowledge—with broader applicability across diverse GUI automation tasks. The benchmark (DocOS) fills an important gap in evaluating agents in dynamic, open-web environments, which is highly relevant given the rapid growth of autonomous agent research. Paper 2, while technically sophisticated with its asymmetric-view training and cognitive user simulator, addresses a narrower problem (proactive task-oriented dialogue/sales). Paper 1's contribution has wider potential impact across the agent research community and more diverse real-world applications.
Paper 2 is more novel in framing “proactive document-guided actions” as a distinct capability and contributes a benchmark (DocOS) that can standardize evaluation and drive follow-on work. Its applications span web automation, enterprise tooling, accessibility, and general agentic RAG, giving broader cross-field impact and timeliness as GUI agents rapidly evolve. Paper 1 offers valuable, rigorous negative/diagnostic findings about LLM optimization limits in hardware-aware code, but its impact is narrower (compiler/kernel optimization) and mainly characterizes failure modes rather than enabling a new scalable research direction or widely reusable artifact.
Paper 2 offers a rigorous methodological advancement by combining interpretable AI with statistical safety guarantees (conformal risk control) for physical systems. Its direct application to safety-critical, real-world infrastructure (wastewater treatment) addressing energy efficiency and greenhouse gas emissions gives it profound real-world impact. While Paper 1 introduces a useful benchmark for GUI agents, Paper 2 spans AI, control theory, and environmental engineering with validated real-world testing, suggesting broader and more immediate scientific and societal impact.
Paper 2 introduces a novel, practical paradigm (proactive document-guided action) with a concrete benchmark (DocOS) for GUI agents, addressing a clear limitation in current agentic systems. It has broader applicability across the rapidly growing GUI/web agent community, identifies specific bottlenecks that can drive future research, and is highly timely given the surge in LLM-based agents. Paper 1, while intellectually elegant in formalizing trust calibration as preferential Bayesian optimization, is more incremental—connecting existing frameworks (GP classification, PBO) to a specific application without novel algorithmic contributions or empirical validation.
Paper 2 introduces a more novel and cross-disciplinary framework (GVG) that bridges neuroscience, computer vision, and MLLMs by using generative visual grounding to translate EEG signals into proxy images. This addresses a fundamental limitation in brain-computer interfaces with broad implications for clinical neuroscience and brain foundation models. Paper 1, while addressing a practical limitation of GUI agents, is more incremental—introducing a benchmark for document-guided actions in a narrower application domain. Paper 2's methodological innovation (trimodal alignment, EEG-to-image generation) and potential impact across neuroscience and AI give it higher scientific impact.
Paper 1 introduces a fundamental methodological innovation by replacing computationally expensive explicit Chain-of-Thought reasoning with latent think tokens for multimodal representations. This addresses a critical bottleneck in deploying reasoning-heavy models, offering broad applicability across foundation models and representation learning. Paper 2 presents a valuable benchmark for GUI agents, but its impact is relatively confined to the agentic workflow subfield, making Paper 1's architectural advancements more likely to achieve widespread scientific and practical impact.
Paper 1 addresses a fundamental bottleneck in LLM alignment and complex reasoning (token-level credit assignment in RLVR). Its proposed algorithmic solution, AMR-SD, targets core methodological challenges in training state-of-the-art reasoning models. While Paper 2 introduces a valuable benchmark for GUI agents, advancements in foundational reasoning capabilities and RL training paradigms typically exert a more profound, widespread impact across the broader AI ecosystem.
Paper 1 introduces a novel paradigm and benchmark for GUI agents, addressing a fundamental limitation in agent autonomy by enabling proactive document-guided actions. This conceptual leap and the introduction of a new benchmark are likely to spur significant follow-up research. Paper 2 provides valuable empirical analysis of existing paradigms but is more focused on system engineering, making its fundamental scientific impact comparatively lower.
Paper 2 (DocOS) likely has higher impact due to broader real-world applicability and timeliness: enabling GUI agents to proactively retrieve and execute procedural knowledge from documentation directly targets a major deployment bottleneck for agents in open-web and enterprise settings. The paradigm and benchmark span IR/search, grounding, planning, and HCI, widening cross-field relevance. Paper 1 is novel and methodologically interesting (causal interventions for memory selection) but is more specialized to long-horizon LLM memory management and depends on synthetic harmful-memory construction, potentially narrowing immediate adoption compared to document-guided GUI autonomy.
Paper 2 offers higher scientific impact due to its critical interdisciplinary focus on medical AI ethics. While Paper 1 presents a useful benchmark for GUI agents, Paper 2 addresses a high-stakes, real-world problem: the ethical alignment and potential value monoculture of LLMs in healthcare. Its novel framework for auditing clinical value pluralism bridges AI safety, bioethics, and clinical medicine. This breadth of impact, combined with the urgent timeliness of safely deploying medical AI, gives Paper 2 a significantly higher potential to influence policy, clinical guidelines, and future AI alignment research.
Paper 2 addresses a fundamental limitation in GUI agents by enabling them to proactively seek out and use documentation for long-tail tasks. This paradigm shift toward self-evolving, adaptable agents has broad implications for general AI automation and OS control. In contrast, Paper 1 focuses on a highly domain-specific benchmark (Chinese gaming short videos), which, while valuable, offers narrower potential impact and applications compared to the broader utility of autonomous GUI navigation and document grounding.
Paper 2 (Solvita) has higher estimated impact due to a more broadly applicable innovation: a stateless-to-stateful shift via trainable knowledge networks that accumulate experience without updating the base LLM, validated on multiple standard programming benchmarks plus live Codeforces. This combines methodological rigor (closed-loop agents, certified supervision, adversarial hacking, RL-updated routing) with strong, measurable performance gains and clear downstream utility for reliable code generation and general agent learning. Paper 1 is timely and useful as a benchmark/paradigm for document-guided GUI agents, but its contribution is more evaluation-focused and narrower in cross-domain reach.
Paper 1 introduces a novel paradigm and benchmark (DocOS) for autonomous GUI agents, shifting from static knowledge reliance to dynamic, proactive document retrieval. This addresses a fundamental bottleneck in agentic AI (handling long-tail tasks). While Paper 2 demonstrates impressive large-scale industrial deployment, it is primarily an applied engineering contribution specific to IVR systems. Paper 1's foundational approach to agent problem-solving, planning, and grounding offers broader applicability and higher potential to drive future scientific research and citations across the rapidly growing field of LLM agents.
Paper 2 likely has higher impact due to a broader, more general contribution: a systematic study and strong empirical results on mobile world models across multiple modalities, with SoTA performance and clear guidance on when different representations help (in-distribution vs OOD, training vs test-time). It also tests transfer to multiple downstream agent benchmarks, increasing methodological rigor and breadth. Paper 1 is novel and timely in proposing document-guided GUI action and a benchmark, but its primary contribution is evaluative/diagnostic and narrower to documentation retrieval+grounding, with less demonstrated performance gain across tasks.
Paper 2 is likely to have higher scientific impact due to stronger novelty (proactive document-guided actions for GUI agents) and broader, timely relevance to foundation-model agents, tool use, web interaction, and benchmark-driven progress. DocOS provides a reusable evaluation framework that can influence multiple subfields (NLP, HCI, IR, RL, agentic systems) and catalyze methodological advances by concretely identifying bottlenecks (search + grounding). Paper 1 is applied and clinically relevant, but its ML-on-TCD approach is more incremental and likely narrower in cross-field reach and scalability.
Paper 1 addresses a critical safety issue in autonomous driving VLA models, revealing alarming unfaithfulness in reasoning (42.5% fidelity, 94 missed pedestrians, 97.7% trajectory fragility). This has immediate, high-stakes real-world implications for autonomous vehicle safety. The formalization of faithfulness information-theoretically and the proposed safety architecture contribute foundational methodology. Paper 2 introduces a useful benchmark for GUI agents but addresses a narrower, lower-stakes problem. Paper 1's findings could reshape how the field approaches VLA deployment and safety verification, giving it broader and more urgent impact.
Paper 2 likely has higher scientific impact: it provides a rigorous, multi-model, multi-task causal auditing framework for EEG foundation models, directly addressing interpretability and clinical trust—highly timely for medical ML. Its methods (probing, subspace erasure, transparent baselines) are broadly reusable across biosignals and representation learning, and the quantified “recoverable advantage” offers actionable guidance for future feature/concept discovery. Paper 1 is novel and relevant for GUI agents, but is primarily a benchmark/paradigm proposal with narrower immediate real-world stakes and less cross-domain methodological reuse.
Paper 2 addresses a critical bottleneck in deploying AI agents: the sim-to-real gap caused by real-world noise and API failures. By framing these issues within POMDP components, providing a benchmark grounded in real GitHub issues, and introducing an effective domain-randomization RL solution, it offers both strong theoretical grounding and high practical utility. While Paper 1 is innovative, Paper 2's focus on robustness and reliability gives it broader applicability and higher potential impact across real-world deployments.
Paper 1 makes fundamental theoretical contributions to algorithmic recourse by introducing a novel causal framework that addresses a significant gap in existing approaches—modeling recourse as a process with pre/post-intervention outcomes rather than static counterfactuals. It provides rigorous methodological contributions including stability conditions, copula-based inference, goodness-of-fit testing, and distribution-free alternatives. This work has broad impact across AI fairness, causal inference, and trustworthy ML. Paper 2 introduces a useful benchmark for GUI agents but is more incremental, addressing a specific engineering challenge with narrower theoretical contribution.
Paper 2 (DocOS) has higher potential impact due to its timely focus on enabling GUI agents to handle long-tailed tasks via proactive documentation search and grounding, and because it introduces a benchmark in a fully interactive open-web setting—likely to catalyze broad, measurable progress across agent research, web automation, retrieval, grounding, and evaluation. Paper 1 offers a practical hybrid memory architecture, but it appears more systems-engineering oriented with narrower generalizability and less standardized evaluation potential. DocOS’s benchmark and identified bottlenecks make it a strong community driver.