RunAgent SuperBrowser: A Theory of Autonomous Web Navigation Grounded in Human Browsing Behaviour

Radeen Mostafa, Sawradip Saha

Jun 8, 2026 · arXiv:2606.09399v1

cs.AI

#2810 of 3489 · Artificial Intelligence

Tournament Score: 1309 ± 45

Win Rate: 37%

Rating: 5.5 / 10

Significance 5.5 · Rigor 3.5 · Novelty 5 · Clarity 7

Abstract

We present SUPERBROWSER, an autonomous web-navigation agent designed against a single guiding hypothesis: a web agent should browse the way a person browses. A human reading a page does not retain every pixel they have seen; they look at a few candidate targets, decide on one, and remember only what is needed to keep the goal alive. We operationalize this perception-cognition-action triad as three coupled mechanisms. First, a vision-first bounding-box pipeline labels candidate interactive regions on every screenshot and feeds them, asynchronously prefetched, to the language model so that the "eye" precedes the "hand". Second, a three-role brain -- an Orchestrator that classifies and routes, a Planner that evaluates progress every few steps, and a Worker that emits per-step actions -- separates strategic from operational reasoning. Third, a structured Ledger stores only what a person would: the goal, the last three actions, a small set of facts and dead-ends, and a handful of checkpoints; a six-phase eviction loop systematically discards stale screenshots, state blobs, and reasoning traces from the live context. Action execution is a three-tier click cascade (Chrome DevTools Protocol to Puppeteer to scripted) with humanized Bezier motion, plus a chevron-aware bounding-box snapper that resolves the "small arrow beside a large label" ambiguity. On the Mind2Web Hard benchmark (66 tasks), SUPERBROWSER attains 89.47% success, placing third overall and ahead of every published open/research browser-agent baseline by a large margin. We argue that the gain comes not from any single trick but from the consistent application of a cognitive contract throughout the system.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: RunAgent SuperBrowser

1. Core Contribution

SUPERBROWSER presents a web navigation agent designed around a cognitive analogy: perception (vision-first bounding boxes), cognition (three-role brain architecture), and memory (structured Ledger with systematic eviction). The central claim is that web agents should mimic human browsing parsimony—retaining minimal state rather than accumulating observations. The system achieves 89.47% on Mind2Web Hard (66 tasks), placing third overall and far ahead of published research baselines.

The most concrete technical contributions are: (1) a six-phase context eviction loop that holds live-context tokens roughly constant across task steps; (2) a three-role decomposition (Orchestrator/Planner/Worker) with role-sliced memory views; (3) a chevron-aware bounding-box snapper resolving compound UI elements; and (4) a three-tier click cascade with humanized Bézier motion.
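To make the bounded-memory idea concrete, the following is a minimal sketch of a Ledger-like store, assuming only what the abstract states: a goal, the last three actions, and a small set of facts and dead-ends. The class name, field names, the cap of five facts, and the `render` format are all hypothetical, and the sketch does not attempt to reproduce the paper's six-phase eviction loop.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Illustrative bounded memory; field names and caps are assumptions."""
    goal: str
    max_facts: int = 5  # assumed cap: the abstract says only "a small set"
    recent_actions: deque = field(default_factory=lambda: deque(maxlen=3))
    facts: list = field(default_factory=list)
    dead_ends: list = field(default_factory=list)

    def record_action(self, action: str) -> None:
        # deque(maxlen=3) drops the oldest action automatically once full.
        self.recent_actions.append(action)

    def record_fact(self, fact: str) -> None:
        self._append_capped(self.facts, fact)

    def record_dead_end(self, note: str) -> None:
        self._append_capped(self.dead_ends, note)

    def _append_capped(self, bucket: list, item: str) -> None:
        bucket.append(item)
        while len(bucket) > self.max_facts:
            bucket.pop(0)  # evict the oldest entry first

    def render(self) -> str:
        # Text that would be placed in the model's live context.
        lines = [f"GOAL: {self.goal}"]
        lines += [f"ACTION: {a}" for a in self.recent_actions]
        lines += [f"FACT: {f}" for f in self.facts]
        lines += [f"DEAD-END: {d}" for d in self.dead_ends]
        return "\n".join(lines)


ledger = Ledger(goal="find the cheapest flight")
sizes = []
for step in range(20):
    ledger.record_action(f"click element {step % 10}")
    ledger.record_fact(f"fact {step % 10}")
    sizes.append(len(ledger.render()))

# Once every bucket is full, the rendered context stops growing with step count.
print(sizes[4:])
```

Because every bucket has a fixed cap, the rendered size levels off after the first few steps, which is the property the review describes as holding live-context tokens roughly constant.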

2. Methodological Rigor

Significant concerns exist regarding evaluation rigor. The benchmark result—89.47% on 66 tasks—is the paper's headline claim, yet the evaluation section is remarkably thin:

No ablation study is actually conducted. The paper repeatedly states ablations are "left to future work" (§7.3). For a systems paper claiming that gains come from "the consistent application of a cognitive contract throughout the system," the absence of component ablations is a critical gap. Without them, we cannot attribute performance to any specific mechanism.

Single-task case studies dominate. The model comparison (§7.4) and instrumentation (Table 1) are based on single representative tasks, not population statistics. The authors acknowledge this limitation but proceed to draw broad conclusions about US vs. Chinese model "tool economy."

No error analysis on failures. With 89.47% success (approximately 59/66 tasks), roughly 7 tasks failed. No analysis of failure modes is provided.

Benchmark narrowness. Only Mind2Web Hard is used. No evaluation on WebArena, VisualWebArena, OSWorld, or Online-Mind2Web—benchmarks the paper itself cites as relevant.

Unclear comparison fairness. The paper compares against baselines that may use different underlying LLMs, budgets, and vision models. The two systems ahead of SUPERBROWSER are "proprietary closed-API systems," making meaningful comparison difficult. The claim of being ahead of "every published open/research baseline by a large margin" is striking but the 8.1% next-best figure (SeeAct) seems to compare systems from different generations with different backbone models.
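The sample-size concern can be made concrete with a short calculation. The sketch below assumes the review's reading of 59 successes out of 66 tasks (the source gives only the percentage and the task count) and uses a standard Wilson score interval as one reasonable way to express the uncertainty; the choice of interval is ours, not the paper's. Note that 59/66 works out to about 89.39%, slightly below the reported 89.47%, and the source does not explain the difference.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


n_tasks = 66
successes = 59  # assumption: the review's "approximately 59/66"

per_task = 100 / n_tasks          # percentage points contributed by one task
rate = 100 * successes / n_tasks  # success rate implied by 59/66
low, high = wilson_interval(successes, n_tasks)

print(f"one task = {per_task:.2f} points")
print(f"59/66 = {rate:.2f}%")
print(f"95% Wilson interval: {100 * low:.1f}% to {100 * high:.1f}%")
```

With only 66 tasks the interval spans roughly fifteen percentage points, which is why a single benchmark of this size supports only a coarse ranking.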

3. Potential Impact

The paper's strongest practical contributions are engineering insights that the web agent community can adopt:

Context eviction as a first-class design principle is genuinely valuable. The observation that context discipline matters more than context capacity is well-argued and practically important.

The chevron tiebreaker addresses a real, recurring failure mode in vision-grounded agents and is immediately applicable.

The three-tier click cascade with trusted event preservation is practically useful for bot-detection-aware automation.

Asynchronous vision prefetch is a straightforward but important engineering optimization.
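The fallback logic behind a tiered click cascade can be sketched generically. This is an illustration of the pattern only, under the assumption that each tier is tried in order and the next is used when one raises; the tier functions here are hypothetical stubs and do not call the Chrome DevTools Protocol or Puppeteer.

```python
from typing import Callable, Sequence

# A tier is a (name, click function) pair; the function raises on failure.
ClickTier = tuple[str, Callable[[int, int], None]]


def click_with_cascade(x: int, y: int, tiers: Sequence[ClickTier]) -> str:
    """Try each tier in order and return the name of the first that succeeds."""
    errors = []
    for name, click in tiers:
        try:
            click(x, y)
            return name
        except Exception as exc:  # any failure falls through to the next tier
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all click tiers failed: " + "; ".join(errors))


# Hypothetical stand-ins for the three tiers named in the abstract.
def cdp_click(x: int, y: int) -> None:
    raise TimeoutError("target not interactable")  # simulate a first-tier failure


def puppeteer_click(x: int, y: int) -> None:
    pass  # simulate success


def scripted_click(x: int, y: int) -> None:
    pass  # last resort; not reached in this example


used = click_with_cascade(
    120, 48,
    [("cdp", cdp_click), ("puppeteer", puppeteer_click), ("scripted", scripted_click)],
)
print(used)
```

Ordering the tiers from most to least trusted means the weaker mechanisms are used only when the stronger ones fail, which matches the review's point about preserving trusted events where possible.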

However, the "cognitive theory" framing overpromises relative to what's delivered. The paper maps system components to cognitive constructs (working memory → Ledger, System 1/2 → Worker/Planner) but this mapping is post-hoc and largely metaphorical despite claims to the contrary. The "three falsifiable predictions" (§3.4) are engineering predictions about the system's behavior, not predictions about cognition. Calling the deque length of 3 an instantiation of Cowan's "magical number 4" is a stretch—it's a reasonable engineering choice that doesn't require cognitive science to justify.

4. Timeliness & Relevance

The paper addresses a genuine current bottleneck: LLM-based web agents degrade on long-horizon tasks due to context accumulation. The community is actively grappling with this problem, making the bounded-context approach timely. The Mind2Web Hard benchmark, while narrow, is a recognized evaluation target. The model comparison between US and Chinese frontier models (§7.4), while methodologically weak (single task), touches on a topic of significant current interest.

5. Strengths & Limitations

Strengths:

Comprehensive engineering system with many practical innovations

Clear writing with detailed algorithms and pseudocode

Open-source commitment with specific commit hash

The eviction loop is well-formalized and the token-bound invariant (Eq. 5) is clean

The mathematical formalization in Appendix E, while not strictly necessary, adds precision

Honest about limitations (§8) including vision API dependence and the metaphorical nature of the cognitive analogy

Limitations:

No ablations performed—the single most damaging gap for a systems paper

Evaluation on one benchmark, 66 tasks only—statistically fragile (each task is ~1.5% of the score)

Confounded comparison—backbone model differences make baseline comparisons unreliable

The cognitive theory framing adds limited scientific value—the useful contributions are engineering, not theoretical

The US vs. Chinese model comparison draws strong conclusions from a single task with no statistical testing

Reproducibility concerns—depends on specific frontier model APIs, 2Captcha service, and live websites that change

Self-described as "work in progress"—several key sections are explicitly deferred

Overall Assessment

This is a well-engineered web agent system with several genuinely useful components (context eviction, chevron snapping, click cascade), wrapped in an oversold cognitive-theory narrative. The benchmark result is impressive but insufficiently validated: no ablations, one benchmark, and unclear comparison fairness. The paper would be substantially stronger with ablation experiments actually conducted rather than planned, evaluation on multiple benchmarks, and a more modest framing as an engineering contribution rather than a "theory."

Rating: 5.5 / 10

Significance 5.5 · Rigor 3.5 · Novelty 5 · Clarity 7

Generated Jun 9, 2026

Comparison History (19)

Won vs. Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football

Paper 1 presents an autonomous web-navigation agent with broad, cross-domain applications in AI, HCI, and general software automation. The timeliness and relevance of web-browsing LLM agents give it immense potential impact. While Paper 2 offers a highly novel methodological bridge between autonomous driving and sports analytics, its primary scope and real-world applications are confined to a narrower niche (football pass evaluation). Consequently, Paper 1 demonstrates significantly greater breadth of impact and wider real-world utility across various fields.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Paper 2 introduces a generalized framework for bootstrapping LLM agents through self-simulation and co-evolution, addressing fundamental challenges in agent training like static environments and inefficient feedback. Its methodology can be applied broadly across numerous domains. In contrast, Paper 1, while demonstrating impressive empirical results, relies heavily on highly specialized engineering heuristics tailored specifically for web navigation, limiting its theoretical breadth and cross-disciplinary applicability.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory

Paper 1 has higher scientific impact potential: it introduces a testable, pre-registered framework (occlusion as a diagnostic) that sharpens what “spatial memory” must represent and separates recall from visibility via a simple, general geometric predicate. The work shows strong methodological rigor (frozen tests, multiple preregistrations, exact tests) and yields a broadly applicable principle for embodied agents, robotics, and spatial cognition. Paper 2 is timely and practically strong, but is mainly a systems integration guided by a plausible hypothesis, with impact likely constrained to web-agent engineering and benchmark performance.

gpt-5.2 · Jun 10, 2026

Won vs. A Unified Multi-Modal Framework for Intelligent Financial Systems: Integrating Reinforcement Learning, High-Frequency Trading, and Game-Theoretic Approaches with Cross-Modal Sentiment Analysis

Paper 1 has higher impact potential due to a clearly novel, operationalized human-browsing-grounded architecture (vision-first candidate regions, role-separated cognition, and strict context/ledger eviction) validated on a recognized benchmark with strong, specific results. The methodological contributions are concrete, reproducible system-design ideas likely transferable to many web/GUI agents and human-in-the-loop automation tasks. Paper 2 is broad but reads as an ambitious integration of many established techniques with large claimed gains; without clear novelty boundaries, rigorous ablations, or well-defined unified optimization details, its impact and credibility are harder to assess despite strong application relevance.

gpt-5.2 · Jun 10, 2026

Lost vs. Deterministic Integrity Gates for LLM-Assisted Clinical Manuscript Preparation: An Auditable Biomedical Informatics Architecture

Paper 2 addresses a highly critical and timely issue: LLM hallucinations and integrity in scientific publishing. By introducing deterministic integrity gates for clinical manuscript preparation, it safeguards medical literature and ensures reproducibility. This provides a broader and more vital real-world impact across all scientific domains compared to the web navigation advancements presented in Paper 1.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. When Should We Protect AI? A Precautionary Framework for Consciousness Uncertainty

Paper 1 is a technically novel, benchmark-validated autonomous web-navigation system with clear methodological components and strong empirical performance (89.47% on Mind2Web Hard), making it likely to be adopted, extended, and cited across ML, HCI, agents, and automation. Its real-world applicability is immediate (reliable web task automation) and the “human browsing” cognitive-contract framing could influence agent design broadly. Paper 2 is timely and cross-disciplinary, but is primarily normative/conceptual with case studies and less empirical anchoring, making near-term scientific uptake and measurable impact less certain.

gpt-5.2 · Jun 9, 2026

Won vs. Agent Economics: An Entropy-Controlled Pluralistic Alignment Framework for Preventing Artificial Hivemind in Autonomous Agents

Paper 1 demonstrates a concrete, system-level innovation grounded in human browsing behavior with clear architectural contributions and strong empirical validation (89.47% on Mind2Web Hard, outperforming research baselines). It has immediate real-world applicability to web automation and agentic browsing, and its methods (perception-action pipeline, role-separated reasoning, memory/eviction ledger) are transferable across agent systems. Paper 2 is largely a conceptual framework with planned implementation and “anticipated results,” offering less methodological rigor and uncertain impact until validated. Thus Paper 1 has higher near-term scientific impact potential.

gpt-5.2 · Jun 9, 2026

Lost vs. AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation

Paper 2 has higher potential impact because it isolates a broadly relevant bottleneck for “AI scientist” performance—evidence access—via a controlled, stratified ablation with clear, decision-centric metrics (including a novel completeness-aware utility). Its conclusions generalize across high-stakes scientific/industrial domains (biomed, finance, policy) and directly inform evaluation methodology and system design. Paper 1 is a strong engineering contribution with impressive benchmark results, but its advances are more domain-specific (web navigation) and may generalize less broadly than Paper 2’s framing about data substrates limiting scientific reasoning agents.

gpt-5.2 · Jun 9, 2026

Lost vs. Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings

Paper 2 addresses a fundamental issue in the evaluation of generative models, a critical challenge across the entire AI community. Validating that pairwise comparisons strongly correlate with ground-truth accuracy bolsters confidence in widely used systems like Chatbot Arena. While Paper 1 presents an impressive web-agent engineering effort, Paper 2's findings have much broader implications, affecting how researchers across multiple subfields evaluate and rank foundation models.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

Paper 2 addresses fundamental challenges in AI alignment and reward hacking, proposing an early-warning signal for misalignment. This addresses a critical, theoretical bottleneck in AI safety with broad implications across RL and LLM deployment. Paper 1, while demonstrating strong engineering and practical application in web agents, offers incremental improvements on existing agent architectures, making its long-term scientific impact narrower compared to the foundational safety insights of Paper 2.

gemini-3.1-pro-preview · Jun 9, 2026
