AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

Jingwei Sun, Jianing Zhu, Yuanyi Li, Tongliang Liu, Xia HU, Bo Han

May 25, 2026

arXiv:2605.25707v1

cs.AI (primary)

#1496 of 2682 · Artificial Intelligence

Tournament Score: 1397 ± 41

Win Rate: 55%

Rating: 6.2 / 10

Significance 6.5 · Rigor 6 · Novelty 5.5 · Clarity 7

Abstract

Autonomous computer-use agents powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-ups, resolution changes, and competing applications frequently interfere with agent perception and control. We introduce AgentHijack, a benchmark designed to evaluate the robustness of computer-use agents under common corruptions, where uncertainties in dynamic environments disrupt the execution flow without direct adversarial intent. Specifically, AgentHijack introduces 9 configurable common corruptions to replicate realistic imperfect scenarios. We evaluate MLLM-based agents on a variety of desktop tasks and find that even minor corruptions can cause substantial performance degradation, which highlights the fragility of these agents and the need for robustness evaluation. We then propose AgentHijack-Agent, a framework that integrates an action generator with enhanced grounding capabilities and an onlooker responsible for behavior summarization and environment checking. Extensive experiments validate its effectiveness. Our code, environment, baseline models, and data are publicly available at: https://AgentHijack.github.io.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: AgentHijack

1. Core Contribution

AgentHijack addresses a meaningful gap in the evaluation of MLLM-based computer-use agents: their robustness to *common* environmental corruptions (pop-ups, resolution changes, network errors, etc.) rather than adversarial attacks. The paper makes two contributions: (1) a benchmark comprising 9 configurable corruption types applied to OSWorld tasks (3,321 total tasks), and (2) AgentHijack-Agent, a framework combining an RL-enhanced action generator with an "onlooker" module for behavior summarization and environment checking.

The distinction between corruption robustness (non-adversarial, naturally occurring disruptions) and adversarial robustness (intentional attacks) is well-articulated and practically important. The formalization using POMDP notation, where corruptions affect the observation function, state transitions, or environment states, provides a clean conceptual framework.
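The three places a corruption can act may be written out roughly as follows. This is a sketch only: the symbols ($\rho$, $O_c$, $T_c$, $\rho_c$) are assumed here for illustration and are not taken from the paper, whose exact notation the review does not reproduce.

```latex
\begin{align*}
\text{Clean POMDP:}\quad & \mathcal{M} = (\mathcal{S}, \mathcal{A}, \Omega, T, O, \rho),
  \quad s_0 \sim \rho,\ \ s_{t+1} \sim T(\cdot \mid s_t, a_t),\ \ o_t \sim O(\cdot \mid s_t) \\
\text{Observation-level corruption } c:\quad & o_t \sim O_c(\cdot \mid s_t) \\
\text{Transition-level corruption } c:\quad & s_{t+1} \sim T_c(\cdot \mid s_t, a_t) \\
\text{State-level corruption } c:\quad & s_0 \sim \rho_c \ \text{ in place of } \ s_0 \sim \rho
\end{align*}
```

Under this reading, the task and reward stay fixed while exactly one component of the environment is swapped for its corrupted counterpart; how the paper assigns each of its 9 corruptions to these components is not stated in the review.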

2. Methodological Rigor

Benchmark design: The 9 corruption types are systematically categorized into visual disruptors (pop-ups, resolution, marks, subtitle, multi-apps), unexpected operations (accidental touch, app minimization), and environment errors (network error, verification). Implementation details are thorough — full algorithms are provided in pseudocode. The configurability via YAML parameters adds flexibility.
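The taxonomy above can be laid out as a small data structure. The sketch below is illustrative: the key names, the `EXAMPLE_CONFIG` fields, and the even split of tasks across corruption types are assumptions, not the benchmark's actual identifiers or YAML schema.

```python
# Corruption taxonomy as listed in the review.
# Key and item names are illustrative, not the benchmark's identifiers.
CORRUPTIONS = {
    "visual_disruptors": ["pop_ups", "resolution", "marks", "subtitle", "multi_apps"],
    "unexpected_operations": ["accidental_touch", "app_minimization"],
    "environment_errors": ["network_error", "verification"],
}

# Hypothetical stand-in for one YAML entry; field names are assumed.
EXAMPLE_CONFIG = {"corruption": "pop_ups", "intensity": 2, "inject_at_step": 3}


def all_corruption_types(taxonomy):
    """Flatten the category -> types mapping into a single list."""
    return [name for group in taxonomy.values() for name in group]


def category_of(corruption, taxonomy):
    """Return the category a corruption type belongs to, or raise if unknown."""
    for category, names in taxonomy.items():
        if corruption in names:
            return category
    raise KeyError(f"unknown corruption type: {corruption}")


def tasks_per_corruption(total_tasks, taxonomy):
    """Tasks per corruption type, assuming an even split across types."""
    n_types = len(all_corruption_types(taxonomy))
    if total_tasks % n_types != 0:
        raise ValueError("total_tasks does not split evenly across types")
    return total_tasks // n_types


if __name__ == "__main__":
    print(len(all_corruption_types(CORRUPTIONS)))             # number of types
    print(category_of(EXAMPLE_CONFIG["corruption"], CORRUPTIONS))
    print(tasks_per_corruption(3321, CORRUPTIONS))            # tasks per type
```

If each corruption type is applied to the same base task set, the 3,321 total tasks divide evenly into 369 tasks per type.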

AgentHijack-Agent: The proposed solution has three components: (a) DA-GRPO, which extends GRPO by rolling out across differently corrupted environments rather than a single clean one — a straightforward but sensible augmentation; (b) behavior summarization via an onlooker agent that records environmental changes between steps; and (c) initial environment checking to detect faulty initialization states.
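The idea behind component (a) can be sketched as follows, assuming the standard GRPO group-normalized advantage and a uniform random choice of corruption per rollout. The function names, the sampling scheme, and the example rewards are hypothetical; the paper's actual DA-GRPO objective may differ in its details.

```python
import random
import statistics


def sample_corrupted_envs(corruption_types, group_size, rng):
    """Pick one corruption per rollout so a group spans different environments.

    Uniform sampling is an assumption; the review only says rollouts use
    differently corrupted environments.
    """
    return [rng.choice(corruption_types) for _ in range(group_size)]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    types = ["pop_ups", "resolution", "network_error", "app_minimization"]
    envs = sample_corrupted_envs(types, group_size=4, rng=rng)
    rewards = [1.0, 0.0, 0.0, 1.0]  # hypothetical sparse task rewards
    for env, adv in zip(envs, group_relative_advantages(rewards)):
        print(env, round(adv, 3))
```

Because advantages are computed relative to the group, a rollout that succeeds under a corruption where its siblings fail receives a positive signal, which is the intuition for why mixing corrupted environments into one group could encourage robustness. When every rollout in a group gets the same reward, all advantages are zero and the group contributes no gradient.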

Evaluation: The paper evaluates 9 baselines covering open-source, closed-source, and specialized models. The ablation studies are comprehensive, examining corruption intensity, content, location, and module necessity. However, several concerns arise:

Possible overlap between the training set (128 tasks from AgentHijack) and the test set is not clearly addressed, so it is unclear how much of the improvement comes from memorization versus genuine robustness.

The absolute performance numbers remain low (22.89% average for the best method), making it difficult to assess whether the improvements are practically meaningful.

The DA-GRPO rollouts across "random corrupted environments" introduce additional computational cost that isn't well quantified.

The onlooker uses the same fine-tuned model as the action generator, raising questions about whether improvement stems from simply having two inference passes.

3. Potential Impact

Practical relevance: The corruptions modeled are genuinely common in real desktop environments. As computer-use agents move toward deployment, this type of robustness evaluation will become essential. The benchmark fills a clear void — most prior work focuses on clean environments or adversarial attacks.

Community utility: The public release of code, environments, and baselines enhances reproducibility and adoption. Building on OSWorld (a well-established benchmark) lowers the barrier to entry.

Limitations in scope: The benchmark is limited to Ubuntu virtual machines and a specific set of applications. The 9 corruption types, while useful, don't cover all real-world disruptions (e.g., slow rendering, partial page loads, accessibility overlays, system update prompts). The corruptions are applied somewhat artificially — they're injected at predetermined steps rather than occurring naturally, which may not fully capture the stochastic nature of real environments.

4. Timeliness & Relevance

The paper is highly timely. Computer-use agents (Claude Computer Use, UI-TARS, etc.) are actively being deployed, and robustness concerns are paramount. The gap between clean benchmark performance and real-world reliability is well-known but poorly quantified — this work provides concrete measurements of that gap. The finding that UI-TARS-1.5-7B drops from 24.21% to as low as 10.28% under pop-ups is striking and practically important.
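To put the quoted UI-TARS-1.5-7B numbers in perspective, the short sketch below computes the absolute and relative drop. The helper names are illustrative; only the two success rates come from the review.

```python
def absolute_drop(clean, corrupted):
    """Drop in percentage points between clean and corrupted success rates."""
    return clean - corrupted


def relative_drop(clean, corrupted):
    """Drop as a percentage of the clean success rate."""
    if clean <= 0:
        raise ValueError("clean success rate must be positive")
    return (clean - corrupted) / clean * 100.0


if __name__ == "__main__":
    clean, popups = 24.21, 10.28  # success rates (%) quoted in the review
    print(f"absolute drop: {absolute_drop(clean, popups):.2f} points")
    print(f"relative drop: {relative_drop(clean, popups):.1f}%")
```

The drop is about 13.9 percentage points, or roughly 57.5% of the clean success rate, so more than half of the tasks the agent solves in a clean environment are lost under pop-ups.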

5. Strengths & Limitations

Strengths:

Well-motivated problem with clear practical significance

Comprehensive taxonomy of corruption types with configurable parameters

Extensive evaluation across diverse model families (9 baselines)

Thorough ablation studies examining multiple dimensions of corruption variation

Detailed case studies that clearly illustrate failure modes (Observations 1-3)

Clean separation between corruption and adversarial robustness concepts

Limitations:

The AgentHijack-Agent improvements are moderate and sometimes marginal (e.g., +0.84% on resolution). The resolution corruption remains largely unsolved.

DA-GRPO is a relatively incremental extension of GRPO — augmenting rollout environments is a natural idea. The novelty of the solution is limited compared to the benchmark contribution.

The onlooker module relies on prompting the same model for summarization and checking — the improvement could partly stem from additional computation/reasoning rather than architectural innovation.

Training on 128 tasks over 15 epochs with sparse rewards raises concerns about overfitting.

The paper doesn't discuss how corruptions interact (e.g., simultaneous pop-ups and resolution changes), even though combined corruptions are the more realistic case.

Comparison with simple baselines (e.g., prompt engineering with "ignore pop-ups" instructions, or simple image preprocessing) is missing.

The verification and network error corruptions are fundamentally different from visual corruptions — they represent broken environments rather than robustness challenges. The agent *should* fail when the network is down; the question is whether it fails gracefully.

6. Additional Observations

The paper's naming ("AgentHijack") somewhat misrepresents the nature of the work — these are non-adversarial corruptions, not hijacking attempts. The concurrent GUI-Robust benchmark (Yang et al., 2025b) addresses similar concerns in a QA format, suggesting this is an emerging research direction with multiple parallel efforts.

Rating: 6.2 / 10

Significance 6.5 · Rigor 6 · Novelty 5.5 · Clarity 7

Generated May 26, 2026

Comparison History (20)

vs. Lattice theory and algebraic models for deep convolutional learning based on mathematical morphology

gemini-3.1 · 5/26/2026

Paper 2 addresses a highly timely and rapidly expanding field—autonomous MLLM computer-use agents—by introducing a much-needed benchmark for real-world robustness. Its focus on immediate, practical deployment challenges and empirical evaluation gives it higher potential for widespread adoption and real-world impact compared to Paper 1's rigorous but relatively niche theoretical exploration of mature CNN architectures via lattice theory.

vs. DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs

gemini-3.1 · 5/26/2026

Paper 2 addresses a critical bottleneck in the widespread adoption of multi-agent LLM systems: error propagation and high token costs. By introducing a framework that improves accuracy by up to 30.7% while reducing communication overhead by 6.5x, it offers immediate and broad impact across general reasoning applications. While Paper 1 provides a valuable benchmark for the emerging niche of computer-use agents, Paper 2's fundamental improvements to multi-agent reasoning systems suggest broader utility and higher overall scientific impact.

vs. JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data

claude-opus-4.6 · 5/26/2026

AgentHijack addresses a timely and specific gap—robustness evaluation of computer-use agents under realistic environmental corruptions—with a concrete benchmark, systematic corruption taxonomy, and a proposed mitigation framework. This fills an important niche as autonomous computer-use agents become more prevalent. Paper 2 (JT-SAFE-V2) covers safety-by-design LLMs, a crowded space with many competing approaches, and its contributions (data enrichment, training procedures, mixture of models) are more incremental. AgentHijack's clearly defined benchmark is more likely to be adopted by the community and drive follow-up research.

vs. CITYREP: A Unified Benchmark for Urban Representations Across Cities, Tasks, and Modalities

gpt-5.2 · 5/26/2026

Paper 2 is likely to have higher impact due to timeliness and broad relevance: robustness of MLLM-driven computer-use agents is a rapidly growing area spanning AI, HCI, security, and software engineering. AgentHijack targets realistic, non-adversarial environment corruptions that are common in deployment, and provides configurable perturbations plus a mitigation framework, making it directly actionable for real-world systems. Paper 1 is valuable and rigorous for urban ML evaluation, but its impact is more domain-specific, whereas Paper 2 addresses a cross-domain reliability bottleneck for general-purpose agents.

vs. Breaking the Chains of Probability: Neutrosophic Logic as a New Framework for Epistemic Uncertainty in Large Language Models

claude-opus-4.6 · 5/26/2026

AgentHijack addresses a practical and timely problem—robustness of autonomous computer-use agents—with a concrete benchmark, reproducible evaluation framework, and publicly available code/data. It provides actionable insights for the rapidly growing field of MLLM-based agents. Paper 1, while conceptually interesting, applies neutrosophic logic to LLMs through prompt engineering rather than architectural changes, lacks formal theoretical grounding for its claims, and the empirical methodology (prompting GPT models to self-report uncertainty) has questionable validity. Paper 2's benchmark utility and immediate applicability give it broader and more lasting impact.

vs. SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

gemini-3.1 · 5/26/2026

Paper 1 addresses a fundamental AI challenge—how agents abstract episodic experiences into reusable procedural skills—which has broad implications for lifelong learning and AGI. Paper 2, while highly practical and timely, focuses on an applied robustness problem (UI corruptions like pop-ups) for computer-use agents. The theoretical depth, methodological rigor, and broader applicability of evaluating skill abstraction give Paper 1 a higher potential for long-lasting scientific impact compared to the narrower engineering focus of Paper 2.

vs. Emotional intelligence in large language models is fragmented across perception, cognition, and interaction

gemini-3.1 · 5/26/2026

Paper 1 addresses a fundamental, high-level capability of LLMs (Emotional Intelligence) by bridging AI alignment with established psychological theory. Its identification of 'stochastic empathy' and the fragmentation of EI challenges current scaling laws and RLHF paradigms, offering deep theoretical novelty. Paper 2, while highly practical, focuses on a narrower engineering challenge (robustness to UI noise). Paper 1 has broader interdisciplinary impact, influencing human-AI interaction, clinical applications, and AI safety methodologies.

vs. LECTOR: Joint Optimization of Scientific Reasoning Graphs and Introduction Generation

gpt-5.2 · 5/26/2026

Paper 2 is likely to have higher impact: it introduces a timely robustness benchmark for rapidly emerging computer-use agents, with clear real-world relevance (noisy GUIs, pop-ups, resolution changes) and broad applicability across agent research, HCI, safety, and evaluation. Benchmarks often become community standards, amplifying citation and adoption. It also pairs evaluation with a mitigation framework, strengthening methodological contribution. Paper 1 is novel for grounded introduction generation with reasoning graphs, but its impact is narrower (scientific writing niche) and depends on adoption of a specific generation paradigm and dataset.

vs. Towards Human-Level Book-Writing Capability

claude-opus-4.6 · 5/26/2026

AgentHijack addresses a timely and practical problem in the rapidly growing field of autonomous computer-use agents. It introduces a systematic benchmark for robustness evaluation with 9 configurable corruptions, proposes a mitigation framework, and releases code/data publicly. This has broader immediate impact across AI safety, agent reliability, and HCI communities. While Paper 1 tackles an interesting creative writing challenge with a novel training framework, its impact is narrower (creative AI/NLP), and the reliance on public-domain novels may limit generalizability. Paper 2's focus on agent robustness is more urgently relevant as autonomous agents see wider deployment.

vs. Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact: it introduces a broadly applicable robustness benchmark for a fast-emerging class of real-world systems (computer-use agents), with clear practical relevance and strong potential to become a standard evaluation tool. Benchmarks with public code/data often catalyze community-wide progress across models, training, and safety/robustness, spanning ML, HCI, and systems. Paper 1 is technically innovative for RL in diffusion MLLMs, but is narrower (specific model family and training pipeline) and may see impact mainly within generative model optimization rather than across multiple subfields.

vs. From Accounting to Coordination: A Virtual Water-Aware Electricity-Computation-Water Nexus Framework for Data Center Dispatch

gemini-3.1 · 5/26/2026

Paper 1 tackles the critical environmental challenge of data center water and energy consumption. By integrating differentiable optimization to actively manage the electricity-computation-water nexus, it offers a highly novel and rigorous approach with direct, measurable real-world sustainability impacts. In contrast, while Paper 2 is timely and addresses software robustness for AI agents, it contributes to a highly saturated subfield, giving Paper 1 a more profound interdisciplinary and real-world scientific impact.

vs. CoRe-Code: Collaborative Reinforcement Learning for Code Generation

claude-opus-4.6 · 5/26/2026

CoRe-Code addresses a fundamental challenge in LLM-based code generation by combining multi-agent collaboration with reinforcement learning, offering a novel Planner-Coder paradigm with GRPO-based training. Its demonstrated generalizability to other multi-agent frameworks and consistent improvements across multiple benchmarks and base models suggest broader impact. While AgentHijack identifies an important robustness problem for computer-use agents, it is more of a benchmark/evaluation contribution for a still-nascent field. CoRe-Code's methodological innovation in collaborative RL training has wider applicability across code generation and potentially other multi-agent LLM tasks.

vs. STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery

claude-opus-4.6 · 5/26/2026

STRIDE addresses the fundamental challenge of automated scientific discovery through symbolic regression with a comprehensive multi-component framework. Its contributions—data-aware generation, mixed-fitting evaluation, critic-executor repair, and diversity-preserving memory—represent meaningful methodological advances in AI-driven equation discovery, with demonstrated improvements across multiple benchmarks and LLM backbones. Paper 2, while addressing a practical concern (agent robustness to environmental corruptions), is primarily a benchmark contribution with a narrower scope. STRIDE has broader scientific impact potential as it advances automated scientific reasoning, a high-impact research direction with applications across sciences.

vs. A governance horizon for ethical-use constraints in open-weight AI models

claude-opus-4.6 · 5/26/2026

Paper 1 presents a novel, large-scale empirical analysis of AI governance infrastructure with a rigorous methodology (auditing >2M model repositories), introduces formalized concepts (governance horizon, half-life of restriction evidence), and has broad implications for AI policy, supply-chain accountability, and open-source governance. Its findings are structurally fundamental and relevant across regulatory, legal, and technical domains. Paper 2, while timely and practical, is a more incremental robustness benchmark contribution within the narrower scope of computer-use agents, with less transformative potential for the broader field.

vs. AION: Next-Generation Tasks and Practical Harness for Time Series

gemini-3.1 · 5/26/2026

Paper 1 addresses a highly timely and critical bottleneck in artificial intelligence: the robustness of general-purpose computer-use agents in realistic, noisy environments. As MLLM-based autonomous agents become ubiquitous, ensuring they can handle everyday UI disruptions like pop-ups is essential for real-world deployment across countless domains. Paper 2 provides a valuable framework for time series analysis, but its scope is more specialized. Paper 1's focus on foundational agent reliability offers broader applicability and higher relevance to current major AI trends, giving it greater potential scientific impact.

vs. $D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

gemini-3.1 · 5/26/2026

Paper 2 addresses a highly timely and widely relevant problem: the fragility of autonomous computer-use agents in real-world, dynamic environments. With the rapid deployment of MLLM-based agents, benchmarking and improving their robustness to common UI corruptions has immediate, broad real-world applicability. In contrast, Paper 1 is highly innovative but focuses on diffusion LLMs, which currently have a narrower adoption footprint compared to autoregressive models and agentic workflows, leading to a comparatively more niche impact.

vs. ECPO: Evidence-Coupled Policy Optimization for Evidence-Certified Candidate Ranking

claude-opus-4.6 · 5/26/2026

AgentHijack addresses a timely and broadly impactful problem—robustness of autonomous computer-use agents powered by MLLMs—which is a rapidly growing area with immediate real-world applications. It introduces a systematic benchmark with 9 corruption types and proposes a mitigation framework, providing practical value to the agent community. Paper 2 (ECPO) tackles a more niche task (evidence-certified ranking) with a complex, narrowly scoped formulation that limits its breadth of impact. AgentHijack's open-source benchmark, broader applicability, and alignment with the surging interest in autonomous agents give it higher potential impact.

vs. Beyond Inference-Only Deployment: Comparing Weight-Based Consolidation Against Cascading Compaction

gemini-3.1 · 5/26/2026

Paper 2 addresses a fundamental architectural limitation of current LLMs—static, inference-only deployment—by rigorously demonstrating the superiority of daily weight-based consolidation (LoRA) over context compaction for long-term memory. Its findings have massive implications for designing continuously learning, personalized AI assistants, offering broader applicability than Paper 1's narrower focus on desktop agent benchmarking. Furthermore, Paper 2's methodological insight regarding cross-entropy metrics adds significant independent value to the broader machine learning evaluation community.

vs. When Mean CE Fails: Median CE Can Better Track Language Model Quality

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact: it introduces a new benchmark for an emerging, high-stakes application area (computer-use agents) with clear real-world relevance and broad utility for evaluating robustness across models and labs. Benchmarks often become community standards, shaping subsequent research, and the work includes systematic corruptions, empirical findings, and a mitigation framework plus public release—supporting adoption and follow-on work. Paper 1 is insightful and useful for LM training diagnostics, but it is a narrower metric analysis and is less likely to catalyze widespread, cross-field uptake than a robustness benchmark for deployed agents.

vs. Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

claude-opus-4.6 · 5/26/2026

Co-ReAct introduces a novel rubric-guided action-selection framework that addresses a fundamental limitation of ReAct-style agents—reliance on internal judgment—with a principled approach using step-level rubric guidance and a list-wise GRPO training objective with Spearman rank-correlation reward. This has broader applicability across reasoning and search tasks, offers a transferable component (drop-in rubric generator), and advances both methodology (list-wise preference optimization) and practical agent design. AgentHijack, while valuable as a robustness benchmark, is more incremental and narrower in scope, focusing specifically on computer-use agent corruption scenarios.