PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

Jingxuan Wei, Xi Bai, Shan Liu, Caijun Jia, Zheng Sun, Xinglong Xu, Siyuan Li, Linzhuang Sun

May 15, 2026

arXiv:2605.15963v1

cs.AI (primary)

#449 of 2292 · Artificial Intelligence

Tournament Score

1478 ± 46

Win Rate: 79%

Rating: 7.2 / 10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 7.5

Abstract

Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: PAGER

1. Core Contribution

PAGER identifies and formalizes a genuinely underexplored problem regime: precision-sensitive GUI tasks, where actions must target exact points in continuous canvas space rather than forgiving clickable regions. The paper makes three interlinked contributions: (1) formalization of the precision-sensitive GUI paradigm with dependency-coupled error propagation; (2) PAGE Bench, a benchmark of 4,906 geometry problems with 224K process-supervised pixel-level actions; and (3) the PAGER agent, which combines dependency-structured planning with pixel-grounded supervised fine-tuning (SFT) and precision-aligned reinforcement learning (RL).

The central insight, the Semantic-Execution Gap, is compelling and well demonstrated: general multimodal models exceed 88% action-type accuracy yet stay below 6% task success, which shows that understanding *what* to do is fundamentally different from *precisely doing it*. This reframing is the paper's most valuable conceptual contribution.
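The contrast between the two regimes can be made concrete with a minimal sketch. The function names, the button geometry, and the 2-pixel tolerance below are illustrative assumptions, not values taken from the paper.

```python
from math import hypot

def region_hit(click, bbox):
    """Region-tolerant check: any pixel inside the component's box counts."""
    x, y = click
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom

def point_hit(click, target, tol_px=2.0):
    """Point-precise check: the click must land within tol_px of the target.

    tol_px is a hypothetical tolerance chosen for illustration.
    """
    return hypot(click[0] - target[0], click[1] - target[1]) <= tol_px

# A hypothetical 80x30 button, and a canvas point at the button's centre.
button = (100, 200, 180, 230)
target = (140, 215)
click = (150, 220)  # roughly 11 px away from the centre

print(region_hit(click, button))  # the click is accepted as a button press
print(point_hit(click, target))   # the same click misses the exact point
```

The same click is a success under the region-tolerant rule and a failure under the point-precise rule, which is the kind of mismatch the review describes.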

2. Methodological Rigor

The methodology is technically sound and well-structured. The formalization using dependency Jacobians (Eq. 3) for error propagation, while somewhat notational, provides a clear theoretical motivation. The two-stage training pipeline (SFT → RL) is well-motivated: SFT establishes the action grammar, while RL with composite rewards (action-type + parameter accuracy + rendered geometric validity) addresses exposure bias from teacher forcing.
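To illustrate why dependency-coupled errors matter, the sketch below estimates how an error in one defining point grows at a dependent object. It does not reproduce the paper's Eq. 3; it is a finite-difference stand-in on a single hypothetical dependency (the intersection of two lines), and every name in it is an assumption.

```python
import numpy as np

def line_intersection(a, b, c, d):
    """Intersection of line AB with line CD (all points are 2-vectors)."""
    a, b, c, d = (np.asarray(p, dtype=float) for p in (a, b, c, d))
    r, s = b - a, d - c
    denom = r[0] * s[1] - r[1] * s[0]  # 2D cross product r x s
    if abs(denom) < 1e-12:
        raise ValueError("lines are parallel")
    t = ((c[0] - a[0]) * s[1] - (c[1] - a[1]) * s[0]) / denom
    return a + t * r

def sensitivity_to_b(a, b, c, d, eps=1e-6):
    """Finite-difference Jacobian (2x2) of the intersection w.r.t. point B."""
    base = line_intersection(a, b, c, d)
    cols = []
    for k in range(2):
        shifted = np.array(b, dtype=float)
        shifted[k] += eps
        cols.append((line_intersection(a, shifted, c, d) - base) / eps)
    return np.stack(cols, axis=1)

def amplification(a, b, c, d):
    """Largest singular value: worst-case growth of a small error in B."""
    return np.linalg.svd(sensitivity_to_b(a, b, c, d), compute_uv=False)[0]

A, B = (0, 0), (1, 0)
# Well-conditioned case: AB crosses a vertical line at x = 2.
print(amplification(A, B, (2, -1), (2, 1)))   # about 2
# Ill-conditioned case: AB crosses a nearly parallel line far away.
print(amplification(A, B, (0, 1), (10, 0)))   # about 100
```

In the second configuration, a one-pixel slip when placing B moves the intersection by roughly a hundred pixels, which is one way a local coordinate error can invalidate downstream objects.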

The reward design (Eq. 9) is carefully constructed, combining discrete action-type matching with continuous parameter accuracy via exponential kernels. The admissible set construction through a geometric verifier at training time — not used at inference — is a clean design choice.
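A composite reward of this general shape might look like the sketch below. Eq. 9 itself is not reproduced here: the weights, the kernel width `sigma_px`, the dictionary layout, and the choice to zero the parameter term on a type mismatch are all assumptions made for illustration.

```python
import math

def composite_reward(pred, ref, rendered_valid,
                     sigma_px=5.0, w_type=0.2, w_param=0.5, w_geo=0.3):
    """Hypothetical composite reward; weights and sigma_px are illustrative.

    pred / ref: dicts with a "type" string and an "xy" pixel coordinate.
    rendered_valid: verdict of a geometric verifier on the rendered result.
    """
    type_ok = pred["type"] == ref["type"]
    r_type = 1.0 if type_ok else 0.0
    if type_ok:
        # Exponential kernel: 1.0 at zero error, decaying with pixel distance.
        r_param = math.exp(-math.dist(pred["xy"], ref["xy"]) / sigma_px)
    else:
        # Assumption: coordinates are not credited for the wrong action type.
        r_param = 0.0
    r_geo = 1.0 if rendered_valid else 0.0
    return w_type * r_type + w_param * r_param + w_geo * r_geo

ref = {"type": "point", "xy": (120.0, 80.0)}
near = {"type": "point", "xy": (123.0, 84.0)}   # 5 px away
far = {"type": "point", "xy": (150.0, 120.0)}   # 50 px away
print(composite_reward(near, ref, rendered_valid=True))
print(composite_reward(far, ref, rendered_valid=False))
```

Because the kernel is continuous, two predictions with the correct action type still receive different rewards depending on how far they land from the reference point, which a purely discrete type-matching reward cannot express.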

The ablation study (Table 3) is informative: removing the parameter-accuracy reward causes a regression below SFT performance (20.47 → 20.07), while removing the action-type reward still yields an improvement (20.47 → 24.52), confirming that continuous-space precision is the primary bottleneck. The full model reaches 29.52, demonstrating that the reward terms are complementary.
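The differences behind these ablation figures can be checked with a few lines of arithmetic. The variable names are illustrative; the four scores are the ones quoted above.

```python
# Scores quoted in the review from Table 3.
sft = 20.47
no_param_reward = 20.07   # RL without the parameter-accuracy reward
no_type_reward = 24.52    # RL without the action-type reward
full = 29.52              # RL with the full composite reward

def delta(score, base=sft):
    """Change relative to the SFT-only score, rounded to two decimals."""
    return round(score - base, 2)

print(delta(no_param_reward))  # -0.4
print(delta(no_type_reward))   # 4.05
print(delta(full))             # 9.05
```

The full reward gains 9.05 points over SFT, more than twice the 4.05 points obtained when the action-type term is dropped.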

However, some methodological concerns exist:

The planning module appears to use a separate LLM (fφ), but its training and integration details are underspecified.

The RL stage uses rejection sampling with 8 candidates — the interaction between this choice and the geometric verifier could be better analyzed.

Human evaluation correlation (r=0.9397) is reported but based on a relatively sparse scatter plot (Figure 6) with limited detail on the human evaluation protocol.
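The second concern above can be probed with a toy simulation of how many of 8 sampled candidates a verifier would accept. This is only a sketch: the Gaussian proposal, the 3-pixel tolerance, and the function names are assumptions, and the paper's actual sampling and verification procedure is not described in this assessment.

```python
import math
import random

def rejection_sample(propose, verify, k=8, rng=None):
    """Draw k candidates and keep only those the verifier accepts."""
    if rng is None:
        rng = random.Random(0)
    candidates = [propose(rng) for _ in range(k)]
    return [c for c in candidates if verify(c)]

TARGET = (320.0, 240.0)  # hypothetical reference point on the canvas

def make_proposer(noise_px):
    """Stand-in policy: the target plus Gaussian pixel noise."""
    def propose(rng):
        return (TARGET[0] + rng.gauss(0, noise_px),
                TARGET[1] + rng.gauss(0, noise_px))
    return propose

def verify(xy, tol_px=3.0):
    """Stand-in geometric verifier: accept points within tol_px of TARGET."""
    return math.dist(xy, TARGET) <= tol_px

for noise in (1.0, 4.0, 16.0):
    kept = rejection_sample(make_proposer(noise), verify,
                            rng=random.Random(42))
    print(noise, len(kept), "of 8 accepted")
```

When the policy's coordinate noise is large relative to the verifier tolerance, most or all of the 8 candidates are rejected and little training signal remains, which is the interaction the review suggests analyzing.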

3. Potential Impact

Direct impact: This work opens a new evaluation axis for GUI agents. Current benchmarks are overwhelmingly region-tolerant, masking fundamental spatial precision limitations. PAGE Bench could become a standard stress test for GUI agent capabilities.

Broader implications: The precision-sensitive paradigm extends naturally to CAD software, scientific visualization tools, diagram editors, and any interface requiring continuous-space manipulation. The authors acknowledge this as future work but the framework's principles are transferable.

For geometric reasoning: This bridges the gap between symbolic geometric reasoning (proofs, theorem application) and physical execution. Most prior geometric reasoning work operates in symbolic space; PAGER grounds it in pixel-space actions.

For RL in agents: The precision-aligned RL with composite geometric rewards demonstrates how domain-specific reward shaping can address the specific failure mode of continuous parameter drift — a contribution applicable beyond geometry.

4. Timeliness & Relevance

This work is highly timely. GUI agents are a rapidly growing area (UI-TARS, CogAgent, OpenCUA, etc.), and the community needs harder, more discriminating benchmarks. The finding that even GPT-5.4 and Gemini-3.1-Pro fail catastrophically on these tasks (0.56% and 5.82% task success) is a wake-up call about current agent capabilities.

The paper also arrives as RL-based fine-tuning for VLMs gains momentum (R1-style training), and demonstrates that generic RL rewards are insufficient — geometric tasks need geometry-aware reward signals.

5. Strengths & Limitations

Key Strengths:

Novel problem formalization: Precision-sensitive GUI tasks are clearly defined and meaningfully distinct from standard GUI interaction.

Comprehensive benchmark: 224K process-supervised actions with pixel-level annotations, multi-level evaluation metrics (process + final), and verified trajectories represent significant construction effort.

Dramatic performance gaps: The 4.1× higher task success than the strongest evaluated general baseline, and the jump in step success rate from below 9% (GUI-specialized agents) to over 62% (PAGER), are striking.

Extensive evaluation: 17 baselines spanning open-source VLMs, closed-source VLMs, and GUI-specialized agents provide thorough comparison.

Well-designed metrics: The four-stage evaluation protocol (action accuracy → parameter accuracy → step success → task success) provides granular diagnostic capability.
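The four-stage protocol named in the last item can be sketched as follows. The `Step` layout, the 2-pixel tolerance, and the rule that a task succeeds only when every step succeeds are assumptions for illustration; the paper's exact definitions may differ.

```python
from dataclasses import dataclass
from math import dist

@dataclass
class Step:
    pred_type: str
    ref_type: str
    pred_xy: tuple
    ref_xy: tuple

def step_flags(step, tol_px):
    """Return (action type correct, coordinates within tol_px)."""
    type_ok = step.pred_type == step.ref_type
    param_ok = dist(step.pred_xy, step.ref_xy) <= tol_px
    return type_ok, param_ok

def evaluate(trajectories, tol_px=2.0):
    """Hypothetical four-stage metrics over a list of step trajectories."""
    flags = [[step_flags(s, tol_px) for s in traj] for traj in trajectories]
    flat = [f for traj in flags for f in traj]
    if not flat:
        raise ValueError("no steps to evaluate")
    n = len(flat)
    return {
        "action_acc": sum(t for t, _ in flat) / n,
        "param_acc": sum(p for _, p in flat) / n,
        "step_success": sum(t and p for t, p in flat) / n,
        # Assumption: a task succeeds only if all of its steps succeed.
        "task_success": sum(all(t and p for t, p in traj)
                            for traj in flags) / len(flags),
    }

task_a = [Step("point", "point", (10, 10), (10, 10)),
          Step("line", "line", (50, 52), (50, 50)),
          Step("circle", "circle", (80, 80), (88, 80))]   # 8 px off
task_b = [Step("point", "point", (5, 5), (11, 5)),        # 6 px off
          Step("line", "line", (30, 30), (30, 30))]
print(evaluate([task_a, task_b]))
```

In this toy input every action type is correct, yet one imprecise coordinate per task drives task success to zero, a small-scale version of the Semantic-Execution Gap.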

Notable Limitations:

Domain specificity: GeoGebra is the sole environment. Despite the general framing of "precision-sensitive GUI tasks," all evidence comes from one application. Generalization to CAD, diagram editing, or other precision-demanding interfaces is unverified.

Base model dependency: PAGER is built on Qwen3-VL-8B. The contribution of the base model versus the training recipe is unclear — would the same recipe work on other backbones?

Scalability of data pipeline: The dataset construction relies on LLM-assisted screening, manual verification, and multi-stage filtering. Reproducibility and scalability to other domains may be challenging.

Limited analysis of failure modes: While the case study (Figure 5) is illustrative, systematic error analysis (e.g., which geometric primitives cause the most cascading failures, at what trajectory lengths performance degrades) would strengthen understanding.

Comparison fairness: PAGER is specifically trained on PAGE Bench's training set, while baselines use zero-shot or their native training. This is standard practice but means the comparison measures domain adaptation ability as much as architectural contribution.

The "Semantic-Execution Gap" framing, while catchy, somewhat overstates novelty — the observation that knowing what to do differs from doing it precisely exists across robotics and control theory.

6. Additional Observations

The paper's evaluation infrastructure is a significant contribution in itself: the pixel-level coordinate projection (Eq. 6), the multi-granularity metrics, and the VLM-based holistic scoring represent thoughtful evaluation design. The 10-category skill taxonomy (Appendix B) enables fine-grained capability analysis.
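A coordinate projection of this kind is typically an affine map between the canvas's mathematical coordinates and screen pixels. The sketch below is an assumption about the general form, not a reproduction of Eq. 6; the view bounds, canvas size, and function names are hypothetical.

```python
def world_to_pixel(x, y, view, width_px, height_px):
    """Map canvas (math) coordinates to pixel coordinates.

    view = (xmin, xmax, ymin, ymax) is the visible window; pixel y grows
    downward, so the vertical axis is flipped.
    """
    xmin, xmax, ymin, ymax = view
    px = (x - xmin) / (xmax - xmin) * width_px
    py = (ymax - y) / (ymax - ymin) * height_px
    return px, py

def pixel_to_world(px, py, view, width_px, height_px):
    """Inverse of world_to_pixel."""
    xmin, xmax, ymin, ymax = view
    x = xmin + px / width_px * (xmax - xmin)
    y = ymax - py / height_px * (ymax - ymin)
    return x, y

view = (-10.0, 10.0, -5.0, 5.0)   # hypothetical visible window
print(world_to_pixel(0.0, 0.0, view, 1000, 500))    # (500.0, 250.0)
print(world_to_pixel(-10.0, 5.0, view, 1000, 500))  # (0.0, 0.0)
```

With this window, one pixel spans 0.02 canvas units, so pixel-level errors translate directly into errors in the constructed geometry.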

The dataset construction prompts (Appendices C-E) are thoroughly documented, supporting reproducibility.

Rating: 7.2 / 10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 7.5

Generated May 18, 2026

Comparison History (19)

vs. Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation

gemini-3.1 · 5/19/2026

Paper 1 introduces a novel paradigm (point-precise GUI control), a comprehensive benchmark, and an innovative agent architecture combining supervised tuning with precision-aligned RL. It fundamentally expands the capabilities of multimodal agents beyond forgiving region-tolerant tasks to precision-sensitive geometric constructions. This constructive contribution offers broader methodological advancements and applications in design and HCI compared to Paper 2, which, while crucial for autonomous driving safety, is primarily an empirical probing study of existing model flaws.

vs. EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

gemini-3.1 · 5/18/2026

Paper 1 addresses a fundamental limitation in vision-language GUI agents by targeting precision-sensitive, continuous-space tasks. By introducing a massive new benchmark (PAGE Bench) and a topology-aware agent that achieves a 4.1x performance leap, it effectively opens a new sub-field for agentic AI in design and geometric applications. While Paper 2 offers a strong dynamic multi-agent framework, Paper 1's novel paradigm and high-quality benchmark provide broader foundational value and higher citation potential.

vs. DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

claude-opus-4.6 · 5/18/2026

PAGER identifies a novel and important problem—the Semantic-Execution Gap in precision-sensitive GUI tasks—that goes beyond existing tolerant-region paradigms. It introduces both a substantial benchmark (PAGE Bench with 224K+ process-supervised actions) and a new agent architecture combining dependency-structured planning with reinforcement learning. The 4.1x improvement over baselines and the dramatic gap between semantic understanding (88%) and task success (<6%) reveal a fundamental limitation in current models. This opens a new research direction with broader implications for geometric reasoning and precise agent control, whereas DRS-GUI offers an incremental (14%) improvement on an existing task using established techniques (MCTS).

vs. SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents

gemini-3.1 · 5/18/2026

Paper 1 addresses a fundamental bottleneck in Vision-Language Models and AI agents: precise spatial grounding in continuous canvas environments. Overcoming the semantic-execution gap is critical for enabling general-purpose AI to operate complex, professional GUI software (e.g., CAD, design). This presents a broader scientific impact and wider applicability across human-computer interaction fields compared to the domain-specific e-commerce buyer simulation presented in Paper 2.

vs. Log analysis is necessary for credible evaluation of AI agents

claude-opus-4.6 · 5/18/2026

Paper 1 addresses a fundamental methodological gap in AI agent evaluation that affects the entire field. Its taxonomy of evaluation threats and guiding principles for log analysis have broad applicability across all agent benchmarks and real-world deployments, influencing benchmark creators, model developers, evaluators, and deployers. Paper 2, while technically strong with impressive results on geometric GUI control, addresses a narrower problem domain. Paper 1's contributions to evaluation methodology are more likely to shape research practices across the rapidly growing AI agents field, giving it higher potential for broad scientific impact.

vs. Holistic Evaluation and Failure Diagnosis of AI Agents

gpt-5.2 · 5/18/2026

Paper 2 has higher likely impact: it defines a new, clearly distinct task regime (precision-sensitive GUI control), releases a large benchmark with dense pixel-level supervision, and introduces an agent (PAGER) combining structured planning, supervised grounding, and RL with geometry-aware feedback—advancing capability, evaluation, and data. Its applications extend to CAD/diagramming, education, robotics-like precise control, and any pixel-accurate UI operation. Paper 1 is valuable and timely for agent evaluation/diagnosis, but is more incremental and primarily impacts measurement rather than opening a new capability frontier.

vs. Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

gemini-3.1 · 5/18/2026

Paper 1 addresses a critical bottleneck in the highly active field of web agents: scalable and reproducible training environments. By providing a framework for capturing and synthesizing web environments for RL, it has the potential to broadly impact how general web agents are trained across the community. While Paper 2 presents an impressive, rigorous approach to a specialized problem (point-precise geometric GUI control), Paper 1's focus on foundational infrastructure for web-scale agents offers a wider breadth of real-world applications and higher potential for widespread adoption.

vs. Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

gpt-5.2 · 5/18/2026

Paper 1 offers a more novel technical contribution: it identifies a new task regime (precision-sensitive GUI control), releases a large-scale benchmark with dense pixel-level supervision, and proposes an agent (PAGER) with topology-aware planning plus precision-aligned RL that yields large empirical gains. This combination of dataset + method + demonstrated SOTA is likely to drive follow-on work across GUI agents, embodied/interactive VLMs, and verification-aware RL. Paper 2 is timely and valuable as a diagnostic evaluation, but is primarily an assessment with a domain-limited benchmark and less direct algorithmic advancement.

vs. Imperfect World Models are Exploitable

gemini-3.1 · 5/18/2026

Paper 1 establishes a fundamental theoretical framework for model exploitation and reward hacking in reinforcement learning. By proving the inevitability of exploitation under certain conditions, it provides crucial, long-lasting insights into AI safety and the limits of world models. While Paper 2 offers a valuable empirical benchmark and system for GUI control, Paper 1's theoretical contributions have broader, more foundational implications across multiple subfields of AI, leading to higher potential scientific impact.

vs. ScreenSearch: Uncertainty-Aware OS Exploration

gemini-3.1 · 5/18/2026

Paper 2 addresses a critical limitation in current GUI agents by shifting focus from region-tolerant clicks to point-precise continuous control. It introduces a substantial new benchmark (PAGE Bench) and identifies the 'Semantic-Execution Gap', offering a robust RL-based solution that significantly improves task success. This novel conceptual framing and strong empirical performance suggest broader implications for VLM-driven continuous control, giving it a higher potential for scientific impact than Paper 1's exploratory data collection framework.

vs. An Algebraic Exposition of the Theory of Dyadic Morality

gpt-5.2 · 5/18/2026

Paper 2 likely has higher impact due to a clear, timely problem (precision-sensitive GUI control), a large benchmark dataset (PAGE Bench) enabling broad follow-on work, and a concrete method (PAGER) with strong empirical gains and a well-quantified failure mode (Semantic-Execution Gap). Its contributions are immediately applicable to GUI agents, robotics-like manipulation in 2D, and multimodal RL, with strong methodological rigor via benchmarking, supervised tuning, and RL. Paper 1 is conceptually novel but more speculative, harder to validate empirically, and narrower in near-term measurable impact.

vs. Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law

gemini-3.1 · 5/18/2026

Paper 2 addresses a highly challenging and novel problem in the rapidly growing field of GUI agents by introducing a large-scale benchmark and a novel topology-aware agent. Its combination of process-supervised tuning and precision-aligned reinforcement learning offers a methodological breakthrough for point-precise control. While Paper 1 provides valuable insights into LLM contamination and neuro-symbolic reasoning in law, Paper 2's introduction of a comprehensive dataset and substantial performance gains over strong baselines give it broader and more immediate impact in AI agent research.

vs. STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices

gemini-3.1 · 5/18/2026

Paper 2 addresses a fundamental limitation in current LVLM agents—pixel-precise GUI control—introducing a substantial new benchmark (PAGE Bench) and a novel topology-aware RL approach. Its focus on general-purpose AI agents interacting with continuous canvas spaces offers broader cross-disciplinary applicability than Paper 1's domain-specific focus on microservice AIOps. The identification of the 'Semantic-Execution Gap' and the provision of a massive new evaluation dataset are highly likely to drive significant downstream research and citations in the rapidly expanding field of autonomous agents.

vs. Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

claude-opus-4.6 · 5/18/2026

Paper 2 introduces a novel problem formulation (precision-sensitive GUI tasks), a new benchmark (PAGE Bench with 224K+ actions), and a new method (PAGER) achieving 4.1x improvement over baselines. It identifies and addresses a fundamental 'Semantic-Execution Gap' with broad implications for GUI agents, geometric reasoning, and VLM capabilities. Paper 1 provides useful empirical design guidance for compound LLM agents in adversarial POMDPs but is more of a controlled ablation study within a specific environment (CybORG CAGE-2) with narrower applicability. Paper 2's contributions—benchmark, method, and conceptual framework—have broader potential impact across multiple research communities.

vs. ColPackAgent: Agent-Skill-Guided Hard-Particle Monte Carlo Workflows for Colloidal Packing

claude-opus-4.6 · 5/18/2026

PAGER addresses a fundamental gap in GUI agent capabilities—point-precise geometric control—with a comprehensive benchmark (PAGE Bench, 224K+ actions) and a novel topology-aware agent achieving 4.1x improvement over baselines. It identifies and formalizes the 'Semantic-Execution Gap,' a broadly relevant concept for vision-language models. Paper 2 (ColPackAgent) is a competent but narrower contribution, wrapping existing simulation tools (HOOMD-blue) with LLM agent workflows for colloidal packing—a useful but domain-specific engineering contribution with less methodological novelty and narrower impact across fields.

vs. Petri Net Induced Heuristic Search for Resource Constrained Scheduling

gpt-5.2 · 5/18/2026

Paper 2 likely has higher impact: it identifies a timely, underexplored failure mode for multimodal GUI agents (precision-sensitive point control), introduces a large benchmark (PAGE Bench) and a new agent (PAGER) with clear, sizable gains, enabling broad downstream work in agentic AI, HCI, robotics-like control, and verification. The benchmark/data release and framing of the Semantic-Execution Gap can influence many follow-on studies. Paper 1 is rigorous and valuable for exact RCPSP solving, but its impact is narrower to scheduling/OR and less cross-field transformative.

vs. Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments

gpt-5.2 · 5/18/2026

Paper 1 likely has higher impact: it introduces a new task regime (precision-sensitive GUI control), a sizable benchmark (4,906 problems, 224K pixel-level actions), and a novel topology-aware agent with supervised + RL training that yields large SOTA gains, enabling broader progress in GUI agents, CAD/geometry tools, and embodied/interactive AI. Methodologically it provides clear evaluation evidence of a critical failure mode (semantic-execution gap) and a concrete remedy. Paper 2 is timely and important for responsible deployment, but its contribution is mainly evaluative/framework-based and may be narrower in algorithmic spillover.

vs. What Do EEG Foundation Models Capture from Human Brain Signals?

claude-opus-4.6 · 5/18/2026

Paper 1 addresses a fundamental interpretability question for EEG foundation models with a rigorous, systematic methodology (probing, erasure, transparent classifiers) across multiple models and tasks. It bridges the gap between classical neuroscience feature engineering and modern deep learning, offering actionable insights about what these models learn. This has broad impact across neuroscience, clinical EEG, and AI interpretability. Paper 2, while technically strong, addresses a narrower problem (precise geometric GUI control) with more limited cross-disciplinary reach and a smaller research community.

vs. Reasoning Compression with Mixed-Policy Distillation

gemini-3.1 · 5/18/2026

Paper 2 introduces a novel benchmark (PAGE Bench) and identifies a previously under-explored failure mode (Semantic-Execution Gap) in a rapidly growing field (multimodal GUI agents). By providing both a comprehensive dataset and a strong baseline (PAGER) for precision-sensitive tasks, it opens a new avenue for research, which typically leads to higher foundational impact and citation counts compared to the specific methodological optimization presented in Paper 1.