AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions
Jingwei Sun, Jianing Zhu, Yuanyi Li, Tongliang Liu, Xia Hu, Bo Han
Abstract
Autonomous computer-use agents powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-ups, resolution changes, and competing applications frequently interfere with agent perception and control. We introduce AgentHijack, a benchmark designed to evaluate the robustness of computer-use agents under common corruptions, where the uncertainties of a dynamic environment disrupt the execution flow without direct adversarial intent. Specifically, AgentHijack introduces 9 configurable common corruptions to replicate realistic imperfect scenarios. We evaluate MLLM-based agents on a variety of desktop tasks and find that even minor corruptions can cause substantial performance degradation, which highlights the fragility of agents and the need for robustness evaluation. We then propose AgentHijack-Agent, a framework that integrates an action generator with enhanced grounding capabilities and an onlooker responsible for behavior summarization and environment checking. Extensive experiments validate its effectiveness. Our code, environment, baseline models and data are publicly available at: https://AgentHijack.github.io.
AI Impact Assessments
Scientific Impact Assessment: AgentHijack
1. Core Contribution
AgentHijack addresses a meaningful gap in the evaluation of MLLM-based computer-use agents: their robustness to *common* environmental corruptions (pop-ups, resolution changes, network errors, etc.) rather than adversarial attacks. The paper makes two contributions: (1) a benchmark comprising 9 configurable corruption types applied to OSWorld tasks (3,321 total tasks), and (2) AgentHijack-Agent, a framework combining an RL-enhanced action generator with an "onlooker" module for behavior summarization and environment checking.
The distinction between corruption robustness (non-adversarial, naturally occurring disruptions) and adversarial robustness (intentional attacks) is well-articulated and practically important. The formalization using POMDP notation, where corruptions affect the observation function, state transitions, or environment states, provides a clean conceptual framework.
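The three corruption entry points named above can be illustrated with a toy environment. This is a minimal sketch, not the paper's formalization: `ToyEnv`, the three `corrupt_*` functions, and the pairing of pop-ups, accidental touch, and network error with the observation, transition, and state levels are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyEnv:
    # Hypothetical environment: the state is a plain dict.
    state: dict

    def transition(self, action: str) -> None:
        # Clean transition: record the action that was executed.
        self.state["last_action"] = action

    def observe(self) -> str:
        # Clean observation function: render the state as sorted key=value pairs.
        return ";".join(f"{k}={v}" for k, v in sorted(self.state.items()))


def corrupt_observation(env: ToyEnv) -> str:
    # Observation-level corruption: the state is untouched, but what the
    # agent sees is overlaid with a pop-up (assumed mapping).
    return env.observe() + ";overlay=popup"


def corrupt_transition(env: ToyEnv, action: str) -> None:
    # Transition-level corruption: an accidental touch replaces the
    # intended action (assumed mapping), so `action` is never executed.
    env.transition("accidental_click")


def corrupt_state(env: ToyEnv) -> None:
    # State-level corruption: the environment itself becomes faulty
    # (assumed mapping to a network error).
    env.state["network"] = "down"


env = ToyEnv({"network": "up"})
seen = corrupt_observation(env)      # agent's view changes, state does not
corrupt_transition(env, "type_text")  # intended action is replaced
corrupt_state(env)                   # underlying state changes
```

The point of separating the three functions is that each one touches a different component: only the first leaves `env.state` unchanged.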
2. Methodological Rigor
Benchmark design: The 9 corruption types are systematically categorized into visual disruptors (pop-ups, resolution, marks, subtitle, multi-apps), unexpected operations (accidental touch, app minimization), and environment errors (network error, verification). Implementation details are thorough — full algorithms are provided in pseudocode. The configurability via YAML parameters adds flexibility.
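The grouping described above can be written down as a small lookup table. The category and corruption names follow this review's wording; the Python identifiers, the `settings` layout, and the `count` parameter are hypothetical and do not reflect the benchmark's actual YAML schema.

```python
# Three categories and nine corruption types, as listed in the review.
# Identifier spellings are illustrative assumptions.
CORRUPTION_TAXONOMY = {
    "visual_disruptor": ["pop_ups", "resolution", "marks", "subtitle", "multi_apps"],
    "unexpected_operation": ["accidental_touch", "app_minimization"],
    "environment_error": ["network_error", "verification"],
}


def category_of(corruption: str) -> str:
    """Return the category a corruption type belongs to."""
    for category, names in CORRUPTION_TAXONOMY.items():
        if corruption in names:
            return category
    raise KeyError(f"unknown corruption type: {corruption}")


def unknown_corruptions(settings: dict) -> list:
    """Return, sorted, the keys of a settings mapping that are not in the taxonomy."""
    known = {name for names in CORRUPTION_TAXONOMY.values() for name in names}
    return sorted(name for name in settings if name not in known)


# Hypothetical settings, e.g. as loaded from a config file.
settings = {"pop_ups": {"count": 2}, "slow_rendering": {}}
print(category_of("network_error"))
print(unknown_corruptions(settings))
```

A check like `unknown_corruptions` is one way a configurable benchmark could reject settings that name a disruption outside its nine supported types.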
AgentHijack-Agent: The proposed solution has three components: (a) DA-GRPO, which extends GRPO by rolling out across differently corrupted environments rather than a single clean one — a straightforward but sensible augmentation; (b) behavior summarization via an onlooker agent that records environmental changes between steps; and (c) initial environment checking to detect faulty initialization states.
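The idea behind component (a) can be sketched as follows, assuming the standard GRPO group-relative advantage (each reward normalized by the mean and standard deviation of its group) and assuming that each rollout in a group is run under a sampled corruption. The function names, the sampling scheme, and the toy reward are illustrative assumptions; the paper's actual DA-GRPO objective may differ.

```python
import random
import statistics
from typing import Callable, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> list:
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ here
    return [(r - mean) / (std + eps) for r in rewards]


def corrupted_group_advantages(
    rollout: Callable[[str], float],
    corruptions: Sequence[str],
    group_size: int,
    rng: random.Random,
) -> list:
    """Roll out once per sampled corruption, then score the group jointly."""
    sampled = [rng.choice(corruptions) for _ in range(group_size)]
    rewards = [rollout(name) for name in sampled]
    return list(zip(sampled, group_relative_advantages(rewards)))


def toy_rollout(corruption: str) -> float:
    # Hypothetical reward: the toy policy fails only when a pop-up appears.
    return 0.0 if corruption == "pop_ups" else 1.0


pairs = corrupted_group_advantages(
    toy_rollout, ["pop_ups", "resolution"], group_size=4, rng=random.Random(0)
)
```

Because the group mixes corruption types, a rollout that succeeds under a corruption where others fail receives a positive advantage, which is the intuition behind training across corrupted environments rather than a single clean one.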
Evaluation: The paper evaluates 9 baselines covering open-source, closed-source, and specialized models. The ablation studies are comprehensive, examining corruption intensity, content, location, and module necessity. However, several concerns arise:
3. Potential Impact
Practical relevance: The corruptions modeled are genuinely common in real desktop environments. As computer-use agents move toward deployment, this type of robustness evaluation will become essential. The benchmark fills a clear void — most prior work focuses on clean environments or adversarial attacks.
Community utility: The public release of code, environments, and baselines enhances reproducibility and adoption. Building on OSWorld (a well-established benchmark) lowers the barrier to entry.
Limitations in scope: The benchmark is limited to Ubuntu virtual machines and a specific set of applications. The 9 corruption types, while useful, don't cover all real-world disruptions (e.g., slow rendering, partial page loads, accessibility overlays, system update prompts). The corruptions are applied somewhat artificially — they're injected at predetermined steps rather than occurring naturally, which may not fully capture the stochastic nature of real environments.
4. Timeliness & Relevance
The paper is highly timely. Computer-use agents (Claude Computer Use, UI-TARS, etc.) are actively being deployed, and robustness concerns are paramount. The gap between clean benchmark performance and real-world reliability is well-known but poorly quantified — this work provides concrete measurements of that gap. The finding that UI-TARS-1.5-7B drops from 24.21% to as low as 10.28% under pop-ups is striking and practically important.
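To put the reported UI-TARS-1.5-7B figures in proportion, the absolute and relative drop can be computed directly. This is a small arithmetic sketch; the `degradation` helper is hypothetical, and only the two percentages come from the text above.

```python
def degradation(clean: float, corrupted: float) -> tuple:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = clean - corrupted
    relative = absolute / clean * 100.0
    return absolute, relative


abs_drop, rel_drop = degradation(24.21, 10.28)
print(f"{abs_drop:.2f} points, {rel_drop:.1f}% relative")
# prints "13.93 points, 57.5% relative"
```

In other words, the pop-up corruption removes more than half of the model's clean-environment score.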
5. Strengths & Limitations
Strengths:
Limitations:
6. Additional Observations
The paper's naming ("AgentHijack") somewhat misrepresents the nature of the work — these are non-adversarial corruptions, not hijacking attempts. The concurrent GUI-Robust benchmark (Yang et al., 2025b) addresses similar concerns in a QA format, suggesting this is an emerging research direction with multiple parallel efforts.
Generated May 26, 2026
Comparison History (20)
Paper 2 addresses a highly timely and rapidly expanding field—autonomous MLLM computer-use agents—by introducing a much-needed benchmark for real-world robustness. Its focus on immediate, practical deployment challenges and empirical evaluation gives it higher potential for widespread adoption and real-world impact compared to Paper 1's rigorous but relatively niche theoretical exploration of mature CNN architectures via lattice theory.
Paper 2 addresses a critical bottleneck in the widespread adoption of multi-agent LLM systems: error propagation and high token costs. By introducing a framework that improves accuracy by up to 30.7% while reducing communication overhead by 6.5x, it offers immediate and broad impact across general reasoning applications. While Paper 1 provides a valuable benchmark for the emerging niche of computer-use agents, Paper 2's fundamental improvements to multi-agent reasoning systems suggest broader utility and higher overall scientific impact.
AgentHijack addresses a timely and specific gap—robustness evaluation of computer-use agents under realistic environmental corruptions—with a concrete benchmark, systematic corruption taxonomy, and a proposed mitigation framework. This fills an important niche as autonomous computer-use agents become more prevalent. Paper 2 (JT-SAFE-V2) covers safety-by-design LLMs, a crowded space with many competing approaches, and its contributions (data enrichment, training procedures, mixture of models) are more incremental. AgentHijack's clearly defined benchmark is more likely to be adopted by the community and drive follow-up research.
Paper 2 is likely to have higher impact due to timeliness and broad relevance: robustness of MLLM-driven computer-use agents is a rapidly growing area spanning AI, HCI, security, and software engineering. AgentHijack targets realistic, non-adversarial environment corruptions that are common in deployment, and provides configurable perturbations plus a mitigation framework, making it directly actionable for real-world systems. Paper 1 is valuable and rigorous for urban ML evaluation, but its impact is more domain-specific, whereas Paper 2 addresses a cross-domain reliability bottleneck for general-purpose agents.
AgentHijack addresses a practical and timely problem—robustness of autonomous computer-use agents—with a concrete benchmark, reproducible evaluation framework, and publicly available code/data. It provides actionable insights for the rapidly growing field of MLLM-based agents. Paper 1, while conceptually interesting, applies neutrosophic logic to LLMs through prompt engineering rather than architectural changes, lacks formal theoretical grounding for its claims, and the empirical methodology (prompting GPT models to self-report uncertainty) has questionable validity. Paper 2's benchmark utility and immediate applicability give it broader and more lasting impact.
Paper 1 addresses a fundamental AI challenge—how agents abstract episodic experiences into reusable procedural skills—which has broad implications for lifelong learning and AGI. Paper 2, while highly practical and timely, focuses on an applied robustness problem (UI corruptions like pop-ups) for computer-use agents. The theoretical depth, methodological rigor, and broader applicability of evaluating skill abstraction give Paper 1 a higher potential for long-lasting scientific impact compared to the narrower engineering focus of Paper 2.
Paper 1 addresses a fundamental, high-level capability of LLMs (Emotional Intelligence) by bridging AI alignment with established psychological theory. Its identification of 'stochastic empathy' and the fragmentation of EI challenges current scaling laws and RLHF paradigms, offering deep theoretical novelty. Paper 2, while highly practical, focuses on a narrower engineering challenge (robustness to UI noise). Paper 1 has broader interdisciplinary impact, influencing human-AI interaction, clinical applications, and AI safety methodologies.
Paper 2 is likely to have higher impact: it introduces a timely robustness benchmark for rapidly emerging computer-use agents, with clear real-world relevance (noisy GUIs, pop-ups, resolution changes) and broad applicability across agent research, HCI, safety, and evaluation. Benchmarks often become community standards, amplifying citation and adoption. It also pairs evaluation with a mitigation framework, strengthening methodological contribution. Paper 1 is novel for grounded introduction generation with reasoning graphs, but its impact is narrower (scientific writing niche) and depends on adoption of a specific generation paradigm and dataset.
AgentHijack addresses a timely and practical problem in the rapidly growing field of autonomous computer-use agents. It introduces a systematic benchmark for robustness evaluation with 9 configurable corruptions, proposes a mitigation framework, and releases code/data publicly. This has broader immediate impact across AI safety, agent reliability, and HCI communities. While Paper 1 tackles an interesting creative writing challenge with a novel training framework, its impact is narrower (creative AI/NLP), and the reliance on public-domain novels may limit generalizability. Paper 2's focus on agent robustness is more urgently relevant as autonomous agents see wider deployment.
Paper 2 likely has higher impact: it introduces a broadly applicable robustness benchmark for a fast-emerging class of real-world systems (computer-use agents), with clear practical relevance and strong potential to become a standard evaluation tool. Benchmarks with public code/data often catalyze community-wide progress across models, training, and safety/robustness, spanning ML, HCI, and systems. Paper 1 is technically innovative for RL in diffusion MLLMs, but is narrower (specific model family and training pipeline) and may see impact mainly within generative model optimization rather than across multiple subfields.
Paper 1 tackles the critical environmental challenge of data center water and energy consumption. By integrating differentiable optimization to actively manage the electricity-computation-water nexus, it offers a highly novel and rigorous approach with direct, measurable real-world sustainability impacts. In contrast, while Paper 2 is timely and addresses software robustness for AI agents, it contributes to a highly saturated subfield, giving Paper 1 a more profound interdisciplinary and real-world scientific impact.
CoRe-Code addresses a fundamental challenge in LLM-based code generation by combining multi-agent collaboration with reinforcement learning, offering a novel Planner-Coder paradigm with GRPO-based training. Its demonstrated generalizability to other multi-agent frameworks and consistent improvements across multiple benchmarks and base models suggest broader impact. While AgentHijack identifies an important robustness problem for computer-use agents, it is more of a benchmark/evaluation contribution for a still-nascent field. CoRe-Code's methodological innovation in collaborative RL training has wider applicability across code generation and potentially other multi-agent LLM tasks.
STRIDE addresses the fundamental challenge of automated scientific discovery through symbolic regression with a comprehensive multi-component framework. Its contributions—data-aware generation, mixed-fitting evaluation, critic-executor repair, and diversity-preserving memory—represent meaningful methodological advances in AI-driven equation discovery, with demonstrated improvements across multiple benchmarks and LLM backbones. Paper 2, while addressing a practical concern (agent robustness to environmental corruptions), is primarily a benchmark contribution with a narrower scope. STRIDE has broader scientific impact potential as it advances automated scientific reasoning, a high-impact research direction with applications across sciences.
Paper 1 presents a novel, large-scale empirical analysis of AI governance infrastructure with a rigorous methodology (auditing >2M model repositories), introduces formalized concepts (governance horizon, half-life of restriction evidence), and has broad implications for AI policy, supply-chain accountability, and open-source governance. Its findings are structurally fundamental and relevant across regulatory, legal, and technical domains. Paper 2, while timely and practical, is a more incremental robustness benchmark contribution within the narrower scope of computer-use agents, with less transformative potential for the broader field.
Paper 1 addresses a highly timely and critical bottleneck in artificial intelligence: the robustness of general-purpose computer-use agents in realistic, noisy environments. As MLLM-based autonomous agents become ubiquitous, ensuring they can handle everyday UI disruptions like pop-ups is essential for real-world deployment across countless domains. Paper 2 provides a valuable framework for time series analysis, but its scope is more specialized. Paper 1's focus on foundational agent reliability offers broader applicability and higher relevance to current major AI trends, giving it greater potential scientific impact.
Paper 2 addresses a highly timely and widely relevant problem: the fragility of autonomous computer-use agents in real-world, dynamic environments. With the rapid deployment of MLLM-based agents, benchmarking and improving their robustness to common UI corruptions has immediate, broad real-world applicability. In contrast, Paper 1 is highly innovative but focuses on diffusion LLMs, which currently have a narrower adoption footprint compared to autoregressive models and agentic workflows, leading to a comparatively more niche impact.
AgentHijack addresses a timely and broadly impactful problem—robustness of autonomous computer-use agents powered by MLLMs—which is a rapidly growing area with immediate real-world applications. It introduces a systematic benchmark with 9 corruption types and proposes a mitigation framework, providing practical value to the agent community. Paper 2 (ECPO) tackles a more niche task (evidence-certified ranking) with a complex, narrowly scoped formulation that limits its breadth of impact. AgentHijack's open-source benchmark, broader applicability, and alignment with the surging interest in autonomous agents give it higher potential impact.
Paper 2 addresses a fundamental architectural limitation of current LLMs—static, inference-only deployment—by rigorously demonstrating the superiority of daily weight-based consolidation (LoRA) over context compaction for long-term memory. Its findings have massive implications for designing continuously learning, personalized AI assistants, offering broader applicability than Paper 1's narrower focus on desktop agent benchmarking. Furthermore, Paper 2's methodological insight regarding cross-entropy metrics adds significant independent value to the broader machine learning evaluation community.
Paper 2 likely has higher impact: it introduces a new benchmark for an emerging, high-stakes application area (computer-use agents) with clear real-world relevance and broad utility for evaluating robustness across models and labs. Benchmarks often become community standards, shaping subsequent research, and the work includes systematic corruptions, empirical findings, and a mitigation framework plus public release—supporting adoption and follow-on work. Paper 1 is insightful and useful for LM training diagnostics, but it is a narrower metric analysis and is less likely to catalyze widespread, cross-field uptake than a robustness benchmark for deployed agents.
Co-ReAct introduces a novel rubric-guided action-selection framework that addresses a fundamental limitation of ReAct-style agents—reliance on internal judgment—with a principled approach using step-level rubric guidance and a list-wise GRPO training objective with Spearman rank-correlation reward. This has broader applicability across reasoning and search tasks, offers a transferable component (drop-in rubric generator), and advances both methodology (list-wise preference optimization) and practical agent design. AgentHijack, while valuable as a robustness benchmark, is more incremental and narrower in scope, focusing specifically on computer-use agent corruption scenarios.