PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control
Jingxuan Wei, Xi Bai, Shan Liu, Caijun Jia, Zheng Sun, Xinglong Xu, Siyuan Li, Linzhuang Sun
Abstract
Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.
AI Impact Assessments
Scientific Impact Assessment: PAGER
1. Core Contribution
PAGER identifies and formalizes a genuinely underexplored problem regime: precision-sensitive GUI tasks, where actions must target exact points in continuous canvas space rather than forgiving clickable regions. The paper makes three interlinked contributions: (1) formalization of the precision-sensitive GUI paradigm with dependency-coupled error propagation; (2) PAGE Bench, a benchmark of 4,906 geometry problems with 224K process-supervised pixel-level actions; and (3) the PAGER agent, which combines dependency-structured planning with pixel-grounded supervised fine-tuning (SFT) and precision-aligned reinforcement learning (RL).
The central insight, the Semantic-Execution Gap, is compelling and well demonstrated: models that exceed 88% action type accuracy yet stay below 6% task success show that understanding *what* to do is fundamentally different from *precisely doing it*. This reframing is the paper's most valuable conceptual contribution.
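A minimal sketch of how the two metrics can diverge is given here. It is not the paper's evaluation code: the `Step` record, the 2-pixel tolerance `tol_px`, and the toy trajectories are all assumptions made for illustration. Action type accuracy ignores coordinates, while task success requires every step to be right in both type and position.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Step:
    action: str   # action type, e.g. "click" or "drag" (hypothetical labels)
    x: float      # target pixel coordinates
    y: float


def action_type_accuracy(pred, gold):
    """Fraction of steps whose action type matches, ignoring coordinates."""
    hits = sum(p.action == g.action for p, g in zip(pred, gold))
    return hits / len(gold)


def task_success(pred, gold, tol_px=2.0):
    """True only if every step matches in type AND lands within tol_px pixels."""
    if len(pred) != len(gold):
        return False
    return all(
        p.action == g.action and hypot(p.x - g.x, p.y - g.y) <= tol_px
        for p, g in zip(pred, gold)
    )


# Toy trajectory: all action types are correct, but the second click is
# roughly 12.6 px away from its target.
gold = [Step("click", 100, 100), Step("click", 200, 100), Step("drag", 150, 180)]
pred = [Step("click", 101, 100), Step("click", 212, 104), Step("drag", 150, 181)]

type_acc = action_type_accuracy(pred, gold)
success = task_success(pred, gold)
print(f"action type accuracy = {type_acc:.2f}, task success = {success}")
```

In this toy case the agent picks the right action at every step and still fails the task, which is the pattern the review describes at benchmark scale.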
2. Methodological Rigor
The methodology is technically sound and well structured. The formalization of error propagation with dependency Jacobians (Eq. 3), although largely notational, provides a clear theoretical motivation. The two-stage training pipeline (SFT → RL) is well motivated: SFT establishes the action grammar, while RL with composite rewards (action-type + parameter accuracy + rendered geometric validity) addresses the exposure bias that teacher forcing introduces.
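The intuition behind a dependency Jacobian can be illustrated with a single dependent primitive. The sketch below does not reproduce the paper's Eq. 3; it takes the circumcenter of three points as an assumed example of a downstream object and measures how far it moves when one input point is misplaced by one pixel. The function names and test triangles are hypothetical.

```python
from math import hypot


def circumcenter(a, b, c):
    """Center of the circle through points a, b, c (each an (x, y) pair)."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0:
        raise ValueError("collinear points have no circumcenter")
    sa, sb, sc = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (sa * (by - cy) + sb * (cy - ay) + sc * (ay - by)) / d
    uy = (sa * (cx - bx) + sb * (ax - cx) + sc * (bx - ax)) / d
    return ux, uy


def downstream_shift(a, b, c, err=(0.0, 1.0)):
    """Distance the circumcenter moves when point c is misplaced by err pixels."""
    ox, oy = circumcenter(a, b, c)
    px, py = circumcenter(a, b, (c[0] + err[0], c[1] + err[1]))
    return hypot(px - ox, py - oy)


# Same 1 px input error, two different dependency geometries.
well_conditioned = downstream_shift((0, 0), (4, 0), (0, 4))
thin_triangle = downstream_shift((0, 0), (400, 0), (200, 10))
print(f"well-conditioned: {well_conditioned:.2f} px, thin triangle: {thin_triangle:.2f} px")
```

The same one-pixel slip moves the dependent object by half a pixel in the first configuration and by well over a hundred pixels in the near-degenerate one, so the downstream damage depends on the geometry of the dependency and not only on the size of the local error.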
The reward design (Eq. 9) is carefully constructed, combining discrete action-type matching with continuous parameter accuracy via exponential kernels. Building the admissible set with a geometric verifier at training time only, and not at inference, is a clean design choice.
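A composite reward of this general shape can be sketched as follows. This is not the paper's Eq. 9: the weights `w_type`, `w_param`, `w_valid`, the kernel bandwidth `sigma`, and the choice to gate the parameter term on a correct action type are all assumptions, and the `valid` flag stands in for whatever the training-time geometric verifier returns.

```python
from math import exp, hypot


def composite_reward(pred_type, gold_type, pred_xy, gold_xy, valid,
                     w_type=1.0, w_param=1.0, w_valid=1.0, sigma=5.0):
    """Toy reward: type match + exponential distance kernel + validity flag."""
    # Discrete term: 1 if the predicted action type matches the reference.
    r_type = 1.0 if pred_type == gold_type else 0.0

    # Continuous term: decays smoothly with pixel distance to the target.
    # Gating it on a correct action type is an assumption of this sketch.
    dist = hypot(pred_xy[0] - gold_xy[0], pred_xy[1] - gold_xy[1])
    r_param = exp(-dist / sigma) if r_type else 0.0

    # Validity term: result of a (hypothetical) geometric check on the
    # rendered construction.
    r_valid = 1.0 if valid else 0.0

    return w_type * r_type + w_param * r_param + w_valid * r_valid


exact = composite_reward("click", "click", (10, 10), (10, 10), valid=True)
near = composite_reward("click", "click", (13, 14), (10, 10), valid=True)
wrong = composite_reward("drag", "click", (10, 10), (10, 10), valid=False)
print(exact, round(near, 3), wrong)
```

The exponential kernel gives a dense signal: a prediction 5 px off still earns partial credit, and that credit rises continuously as the point approaches the target, which a binary hit-or-miss reward would not provide.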
The ablation study (Table 3) is informative: removing the parameter-accuracy reward causes a regression below SFT performance (20.47 → 20.07), while removing the action-type reward still yields an improvement (20.47 → 24.52), confirming that continuous-space precision is the primary bottleneck. The full model reaches 29.52, showing that the two reward terms are complementary.
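The ablation deltas can be checked with a few lines of arithmetic. The four scores are the ones quoted above from Table 3; the dictionary keys are labels invented for this sketch, and the review does not name the metric.

```python
# Scores as quoted in the review from Table 3 (metric not named in the review).
scores = {
    "sft_only": 20.47,
    "rl_without_param_reward": 20.07,
    "rl_without_type_reward": 24.52,
    "rl_full": 29.52,
}

baseline = scores["sft_only"]

# Change relative to the SFT-only model, rounded to two decimals.
deltas = {name: round(value - baseline, 2) for name, value in scores.items()}

for name, delta in deltas.items():
    print(f"{name:26s} {scores[name]:6.2f}  ({delta:+.2f} vs SFT)")
```

Dropping the parameter-accuracy reward costs 0.40 points relative to SFT, dropping the action-type reward still gains 4.05, and the full reward gains 9.05.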
However, some methodological concerns exist:
3. Potential Impact
Direct impact: This work opens a new evaluation axis for GUI agents. Current benchmarks are overwhelmingly region-tolerant, masking fundamental spatial precision limitations. PAGE Bench could become a standard stress test for GUI agent capabilities.
Broader implications: The precision-sensitive paradigm extends naturally to CAD software, scientific visualization tools, diagram editors, and any interface requiring continuous-space manipulation. The authors leave this to future work, but the framework's principles are transferable.
For geometric reasoning: This bridges the gap between symbolic geometric reasoning (proofs, theorem application) and physical execution. Most prior geometric reasoning work operates in symbolic space; PAGER grounds it in pixel-space actions.
For RL in agents: The precision-aligned RL with composite geometric rewards demonstrates how domain-specific reward shaping can address the specific failure mode of continuous parameter drift — a contribution applicable beyond geometry.
4. Timeliness & Relevance
This work is highly timely. GUI agents are a rapidly growing area (UI-TARS, CogAgent, OpenCUA, etc.), and the community needs harder, more discriminating benchmarks. The finding that even GPT-5.4 and Gemini-3.1-Pro fail catastrophically on these tasks (0.56% and 5.82% task success) is a wake-up call about current agent capabilities.
The paper also arrives as RL-based fine-tuning for VLMs gains momentum (R1-style training), and demonstrates that generic RL rewards are insufficient — geometric tasks need geometry-aware reward signals.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
6. Additional Observations
The paper's evaluation infrastructure is a significant contribution in itself — the pixel-level coordinate projection (Eq. 6), the multi-granularity metrics, and the VLM-based holistic scoring represent thoughtful evaluation design. The 10-category skill taxonomy enables fine-grained capability analysis (Appendix B).
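For readers unfamiliar with coordinate projection, a generic mapping between canvas coordinates and screen pixels is sketched below. It is not the paper's Eq. 6: the axis-aligned viewport, the top-left pixel origin, and every parameter name are assumptions chosen for illustration.

```python
def world_to_pixel(x, y, xmin, xmax, ymin, ymax, width, height):
    """Map a canvas point to pixels; pixel origin is top-left, y grows downward."""
    px = (x - xmin) / (xmax - xmin) * width
    py = (ymax - y) / (ymax - ymin) * height  # flip the vertical axis
    return px, py


def pixel_to_world(px, py, xmin, xmax, ymin, ymax, width, height):
    """Inverse of world_to_pixel under the same viewport assumptions."""
    x = xmin + px / width * (xmax - xmin)
    y = ymax - py / height * (ymax - ymin)
    return x, y


# Hypothetical viewport: x in [-10, 10], y in [-5, 5], shown on an 800x400 canvas.
view = dict(xmin=-10.0, xmax=10.0, ymin=-5.0, ymax=5.0, width=800, height=400)

origin_px = world_to_pixel(0.0, 0.0, **view)
round_trip = pixel_to_world(*world_to_pixel(3.0, -2.0, **view), **view)
print(origin_px, round_trip)
```

Under this viewport one canvas unit spans 40 pixels, so a one-pixel execution error corresponds to 0.025 canvas units; a projection of this kind is what ties symbolic coordinates to pixel-level action targets.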
The dataset construction prompts (Appendices C-E) are thoroughly documented, supporting reproducibility.
Generated May 18, 2026
Comparison History (19)
Paper 1 introduces a novel paradigm (point-precise GUI control), a comprehensive benchmark, and an innovative agent architecture combining supervised tuning with precision-aligned RL. It fundamentally expands the capabilities of multimodal agents beyond forgiving region-tolerant tasks to precision-sensitive geometric constructions. This constructive contribution offers broader methodological advancements and applications in design and HCI compared to Paper 2, which, while crucial for autonomous driving safety, is primarily an empirical probing study of existing model flaws.
Paper 1 addresses a fundamental limitation in vision-language GUI agents by targeting precision-sensitive, continuous-space tasks. By introducing a massive new benchmark (PAGE Bench) and a topology-aware agent that achieves a 4.1x performance leap, it effectively opens a new sub-field for agentic AI in design and geometric applications. While Paper 2 offers a strong dynamic multi-agent framework, Paper 1's novel paradigm and high-quality benchmark provide broader foundational value and higher citation potential.
PAGER identifies a novel and important problem—the Semantic-Execution Gap in precision-sensitive GUI tasks—that goes beyond existing tolerant-region paradigms. It introduces both a substantial benchmark (PAGE Bench with 224K+ process-supervised actions) and a new agent architecture combining dependency-structured planning with reinforcement learning. The 4.1x improvement over baselines and the dramatic gap between semantic understanding (88%) and task success (<6%) reveal a fundamental limitation in current models. This opens a new research direction with broader implications for geometric reasoning and precise agent control, whereas DRS-GUI offers an incremental (14%) improvement on an existing task using established techniques (MCTS).
Paper 1 addresses a fundamental bottleneck in Vision-Language Models and AI agents: precise spatial grounding in continuous canvas environments. Overcoming the semantic-execution gap is critical for enabling general-purpose AI to operate complex, professional GUI software (e.g., CAD, design). This presents a broader scientific impact and wider applicability across human-computer interaction fields compared to the domain-specific e-commerce buyer simulation presented in Paper 2.
Paper 1 addresses a fundamental methodological gap in AI agent evaluation that affects the entire field. Its taxonomy of evaluation threats and guiding principles for log analysis have broad applicability across all agent benchmarks and real-world deployments, influencing benchmark creators, model developers, evaluators, and deployers. Paper 2, while technically strong with impressive results on geometric GUI control, addresses a narrower problem domain. Paper 1's contributions to evaluation methodology are more likely to shape research practices across the rapidly growing AI agents field, giving it higher potential for broad scientific impact.
Paper 2 has higher likely impact: it defines a new, clearly distinct task regime (precision-sensitive GUI control), releases a large benchmark with dense pixel-level supervision, and introduces an agent (PAGER) combining structured planning, supervised grounding, and RL with geometry-aware feedback—advancing capability, evaluation, and data. Its applications extend to CAD/diagramming, education, robotics-like precise control, and any pixel-accurate UI operation. Paper 1 is valuable and timely for agent evaluation/diagnosis, but is more incremental and primarily impacts measurement rather than opening a new capability frontier.
Paper 1 addresses a critical bottleneck in the highly active field of web agents: scalable and reproducible training environments. By providing a framework for capturing and synthesizing web environments for RL, it has the potential to broadly impact how general web agents are trained across the community. While Paper 2 presents an impressive, rigorous approach to a specialized problem (point-precise geometric GUI control), Paper 1's focus on foundational infrastructure for web-scale agents offers a wider breadth of real-world applications and higher potential for widespread adoption.
Paper 1 offers a more novel technical contribution: it identifies a new task regime (precision-sensitive GUI control), releases a large-scale benchmark with dense pixel-level supervision, and proposes an agent (PAGER) with topology-aware planning plus precision-aligned RL that yields large empirical gains. This combination of dataset + method + demonstrated SOTA is likely to drive follow-on work across GUI agents, embodied/interactive VLMs, and verification-aware RL. Paper 2 is timely and valuable as a diagnostic evaluation, but is primarily an assessment with a domain-limited benchmark and less direct algorithmic advancement.
Paper 1 establishes a fundamental theoretical framework for model exploitation and reward hacking in reinforcement learning. By proving the inevitability of exploitation under certain conditions, it provides crucial, long-lasting insights into AI safety and the limits of world models. While Paper 2 offers a valuable empirical benchmark and system for GUI control, Paper 1's theoretical contributions have broader, more foundational implications across multiple subfields of AI, leading to higher potential scientific impact.
Paper 2 addresses a critical limitation in current GUI agents by shifting focus from region-tolerant clicks to point-precise continuous control. It introduces a substantial new benchmark (PAGE Bench) and identifies the 'Semantic-Execution Gap', offering a robust RL-based solution that significantly improves task success. This novel conceptual framing and strong empirical performance suggest broader implications for VLM-driven continuous control, giving it a higher potential for scientific impact than Paper 1's exploratory data collection framework.
Paper 2 likely has higher impact due to a clear, timely problem (precision-sensitive GUI control), a large benchmark dataset (PAGE Bench) enabling broad follow-on work, and a concrete method (PAGER) with strong empirical gains and a well-quantified failure mode (Semantic-Execution Gap). Its contributions are immediately applicable to GUI agents, robotics-like manipulation in 2D, and multimodal RL, with strong methodological rigor via benchmarking, supervised tuning, and RL. Paper 1 is conceptually novel but more speculative, harder to validate empirically, and narrower in near-term measurable impact.
Paper 2 addresses a highly challenging and novel problem in the rapidly growing field of GUI agents by introducing a large-scale benchmark and a novel topology-aware agent. Its combination of process-supervised tuning and precision-aligned reinforcement learning offers a methodological breakthrough for point-precise control. While Paper 1 provides valuable insights into LLM contamination and neuro-symbolic reasoning in law, Paper 2's introduction of a comprehensive dataset and substantial performance gains over strong baselines give it broader and more immediate impact in AI agent research.
Paper 2 addresses a fundamental limitation in current LVLM agents—pixel-precise GUI control—introducing a substantial new benchmark (PAGE Bench) and a novel topology-aware RL approach. Its focus on general-purpose AI agents interacting with continuous canvas spaces offers broader cross-disciplinary applicability than Paper 1's domain-specific focus on microservice AIOps. The identification of the 'Semantic-Execution Gap' and the provision of a massive new evaluation dataset are highly likely to drive significant downstream research and citations in the rapidly expanding field of autonomous agents.
Paper 2 introduces a novel problem formulation (precision-sensitive GUI tasks), a new benchmark (PAGE Bench with 224K+ actions), and a new method (PAGER) achieving 4.1x improvement over baselines. It identifies and addresses a fundamental 'Semantic-Execution Gap' with broad implications for GUI agents, geometric reasoning, and VLM capabilities. Paper 1 provides useful empirical design guidance for compound LLM agents in adversarial POMDPs but is more of a controlled ablation study within a specific environment (CybORG CAGE-2) with narrower applicability. Paper 2's contributions—benchmark, method, and conceptual framework—have broader potential impact across multiple research communities.
PAGER addresses a fundamental gap in GUI agent capabilities—point-precise geometric control—with a comprehensive benchmark (PAGE Bench, 224K+ actions) and a novel topology-aware agent achieving 4.1x improvement over baselines. It identifies and formalizes the 'Semantic-Execution Gap,' a broadly relevant concept for vision-language models. Paper 2 (ColPackAgent) is a competent but narrower contribution, wrapping existing simulation tools (HOOMD-blue) with LLM agent workflows for colloidal packing—a useful but domain-specific engineering contribution with less methodological novelty and narrower impact across fields.
Paper 2 likely has higher impact: it identifies a timely, underexplored failure mode for multimodal GUI agents (precision-sensitive point control), introduces a large benchmark (PAGE Bench) and a new agent (PAGER) with clear, sizable gains, enabling broad downstream work in agentic AI, HCI, robotics-like control, and verification. The benchmark/data release and framing of the Semantic-Execution Gap can influence many follow-on studies. Paper 1 is rigorous and valuable for exact RCPSP solving, but its impact is narrower to scheduling/OR and less cross-field transformative.
Paper 1 likely has higher impact: it introduces a new task regime (precision-sensitive GUI control), a sizable benchmark (4,906 problems, 224K pixel-level actions), and a novel topology-aware agent with supervised + RL training that yields large SOTA gains, enabling broader progress in GUI agents, CAD/geometry tools, and embodied/interactive AI. Methodologically it provides clear evaluation evidence of a critical failure mode (semantic-execution gap) and a concrete remedy. Paper 2 is timely and important for responsible deployment, but its contribution is mainly evaluative/framework-based and may be narrower in algorithmic spillover.
Paper 1 addresses a fundamental interpretability question for EEG foundation models with a rigorous, systematic methodology (probing, erasure, transparent classifiers) across multiple models and tasks. It bridges the gap between classical neuroscience feature engineering and modern deep learning, offering actionable insights about what these models learn. This has broad impact across neuroscience, clinical EEG, and AI interpretability. Paper 2, while technically strong, addresses a narrower problem (precise geometric GUI control) with more limited cross-disciplinary reach and a smaller research community.
Paper 2 introduces a novel benchmark (PAGE Bench) and identifies a previously under-explored failure mode (Semantic-Execution Gap) in a rapidly growing field (multimodal GUI agents). By providing both a comprehensive dataset and a strong baseline (PAGER) for precision-sensitive tasks, it opens a new avenue for research, which typically leads to higher foundational impact and citation counts compared to the specific methodological optimization presented in Paper 1.