DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions
Nathan Bout, Maxime Langevin, Ronan Riochet
Abstract
GUI agents - vision-based models that control desktops, web browsers, and mobile devices through graphical user interfaces - promise to automate a wide range of digital tasks. While million-scale datasets have enabled substantial progress on click-grounding, drag grounding (e.g. drag-and-drop, swipe, highlight) data remains an order of magnitude smaller and current models fall short on complex drag-based interactions. We introduce DragOn, a drag grounding benchmark and training dataset covering four domains: text highlighting, cell selection, element resizing and slider manipulation. The dataset comprises 286K training screenshots and 3.5M training tasks, plus a 2000-example held-out evaluation suite. We evaluate proprietary (GPT, Claude) and open-weight (Qwen, Kimi, Holo) models, as well as a Qwen VLM fine-tuned on our training data. Results suggest that our dataset could improve performance of state-of-the-art models on downstream computer-use tasks.
AI Impact Assessments
Scientific Impact Assessment: DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions
Core Contribution
DragOn addresses a clear gap in the GUI agent ecosystem: while click-grounding has benefited from datasets at the scale of millions (OS-ATLAS with 13M elements, UGround with 10M), drag grounding data has remained one to two orders of magnitude smaller. The paper makes three contributions: (1) a formalization of "rendering-as-supervision," which exploits renderer geometry (PDF, XLSX, PPTX, HTML) to produce pixel-aligned ground-truth labels without expensive human annotation or noisy OCR; (2) a large-scale dataset of 286K training screenshots and 3.5M tasks across four drag domains (text highlighting, cell selection, element resizing, slider manipulation); and (3) a benchmark evaluation showing that frontier models all score below 30%, while a fine-tuned Qwen model trained on DragOn reaches 35.3%.
The problem identification is well-motivated. The authors demonstrate that 13.9% of OSWorld tasks and 82.8% of AndroidWorld tasks require drag actions, making this a practically important capability gap rather than an academic curiosity.
Methodological Rigor
The rendering-as-supervision principle is methodologically sound and elegantly simple. By extracting ground-truth coordinates directly from the rendering pipeline (PDF span coordinates, EMU-to-pixel mappings, HTML track geometry), the authors avoid the noise inherent in OCR-based or VLM-based annotation. The distinction between analytic label maps (direct coordinate extraction) and probe-based label maps (color-key perturbation for cell detection) is a useful categorization that makes the approach extensible.
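To make the idea of an analytic label map concrete, here is a minimal Python sketch of how drag endpoints might be read off renderer geometry rather than annotated. It is an illustration only, not the authors' pipeline. The `SliderTrack` class, its fields, the `drag_label` helper, and the 96 DPI default are all hypothetical. The one fixed fact used is that Office Open XML defines 914,400 EMU per inch.

```python
from __future__ import annotations

from dataclasses import dataclass

# Office Open XML defines 914,400 English Metric Units (EMU) per inch.
EMU_PER_INCH = 914_400


def emu_to_px(emu: int, dpi: float = 96.0) -> float:
    """Convert an EMU length to pixels at an assumed rendering DPI."""
    return emu / EMU_PER_INCH * dpi


@dataclass(frozen=True)
class SliderTrack:
    """Hypothetical horizontal slider track in screenshot pixel coordinates."""

    left_px: float    # x coordinate where the track starts
    width_px: float   # rendered track length
    y_px: float       # vertical centre of the track
    min_value: float
    max_value: float

    def handle_xy(self, value: float) -> tuple[float, float]:
        """Pixel position of the handle centre for a given slider value."""
        if not self.min_value <= value <= self.max_value:
            raise ValueError("value outside slider range")
        frac = (value - self.min_value) / (self.max_value - self.min_value)
        return (self.left_px + frac * self.width_px, self.y_px)


def drag_label(track: SliderTrack, start_value: float, end_value: float):
    """Ground-truth drag ((x0, y0), (x1, y1)) derived from geometry alone."""
    return (track.handle_xy(start_value), track.handle_xy(end_value))
```

Under these assumptions, a track 200 px wide starting at x = 100 maps the instruction "move the slider from 25 to 75" on a 0 to 100 scale to a drag from x = 150 to x = 250, with no OCR or human labelling involved.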
However, several methodological concerns arise:
1. Synthetic distribution shift: The data is largely procedurally generated or derived from structured templates. While the authors apply light augmentations (JPEG compression, blur, brightness jitter), the visual diversity may not capture the full complexity of real-world GUIs. The slider domain, for instance, uses only six HTML contexts with three variants each—a limited visual vocabulary compared to the wild.
2. Template-based instructions: Natural language instructions are sampled from templates, which may not reflect the distribution of real user commands. The referential uniqueness check is a nice touch, but template diversity and naturalness are not evaluated.
3. Evaluation metric concerns: The strict tolerance for element resizing (5% of element dimensions) leads to low absolute numbers, and the relaxed metrics (acc@10%, acc@15%) show substantial improvement, suggesting the metric choice significantly influences conclusions. The paper acknowledges this but doesn't resolve it.
4. Limited fine-tuning analysis: Only one model (Qwen3.5-VL-35B-A3B) is fine-tuned, trained for just ~2 epochs. There's no ablation on dataset size, domain mixing, or transfer to real-world benchmarks. The claim that "our dataset could improve performance of state-of-the-art models on downstream computer-use tasks" is speculative—only one qualitative example (Figure 4) supports this, and it uses a different model (Holo3 vs. Qwen base) rather than the fine-tuned model.
5. Cell selection underperformance: The fine-tuned model achieves only 13.2% on cell selection versus Claude Opus's 37.2%, suggesting the training data or approach has domain-specific weaknesses that aren't adequately analyzed.
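The sensitivity to tolerance raised in point 3 can be illustrated with a small Python sketch. It assumes a prediction counts as correct when its x and y errors fall within a fraction `tol` of the target element's width and height. The paper's exact definition may differ, and the function names and sample data are hypothetical.

```python
from __future__ import annotations

Point = tuple[float, float]
Size = tuple[float, float]  # (width, height) of the target element


def within_tolerance(pred: Point, target: Point, elem_size: Size, tol: float) -> bool:
    """True if both axis errors are within tol * element dimension (assumed rule)."""
    dx = abs(pred[0] - target[0])
    dy = abs(pred[1] - target[1])
    return dx <= tol * elem_size[0] and dy <= tol * elem_size[1]


def accuracy_at(samples: list[tuple[Point, Point, Size]], tol: float) -> float:
    """Fraction of (pred, target, elem_size) samples that pass at tolerance tol."""
    if not samples:
        raise ValueError("no samples to score")
    hits = sum(within_tolerance(p, t, s, tol) for p, t, s in samples)
    return hits / len(samples)


# Made-up predictions against a 100x100 px element with target (100, 100).
samples = [
    ((104, 100), (100, 100), (100, 100)),  # 4 px off horizontally
    ((108, 100), (100, 100), (100, 100)),  # 8 px off horizontally
    ((100, 113), (100, 100), (100, 100)),  # 13 px off vertically
    ((100, 100), (100, 100), (100, 100)),  # exact
]

for tol in (0.05, 0.10, 0.15):
    print(f"acc@{int(tol * 100)}% = {accuracy_at(samples, tol):.2f}")
```

On this toy data the same predictions score 0.50, 0.75, and 1.00 at the 5%, 10%, and 15% tolerances, which shows how a ranking of models could shift with the threshold.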
Potential Impact
The dataset fills a genuine need. As GUI agents move toward real-world deployment, drag interactions are unavoidable—spreadsheet manipulation, text selection, slider adjustment, and element resizing are fundamental desktop operations. The rendering-as-supervision principle could generalize beyond the four domains presented, potentially to drag-and-drop file operations, canvas drawing, map interactions, and more.
The benchmark provides a standardized evaluation surface, with a public validation set and a private test set backed by a leaderboard, which could catalyze community progress much as ScreenSpot advanced click grounding. The scale of the training data (3.5M tasks) is substantial enough to enable meaningful fine-tuning experiments.
However, the practical impact depends heavily on whether improvements on synthetic benchmarks transfer to real-world tasks. The paper provides only anecdotal evidence of this transfer (one qualitative OSWorld example), which is insufficient to establish the dataset's downstream utility.
Timeliness & Relevance
The timing is excellent. Computer-use agents are receiving enormous attention from both industry (Claude Computer Use, GPT with tools, Surfer 2) and academia. The identification that drag grounding is a bottleneck—supported by the quantitative analysis of OSWorld and AndroidWorld task requirements—is timely and actionable. The workshop paper format at ICML SCALE is appropriate for this type of dataset/benchmark contribution.
Strengths
Limitations
Overall Assessment
DragOn is a well-executed dataset paper that addresses a real gap in GUI agent research. The rendering-as-supervision principle is clean and the benchmark design is thoughtful. However, the paper's impact is limited by the absence of downstream transfer experiments, the synthetic nature of the data, and missing ablation studies. It represents solid incremental progress rather than a transformative contribution.
Generated Jun 5, 2026
Comparison History (16)
Paper 2 demonstrates higher potential scientific impact due to its extreme timeliness and relevance to the rapidly growing field of autonomous GUI agents. While Paper 1 offers solid algorithmic insights for SAT solvers, Paper 2 directly addresses a critical data bottleneck in vision-language models by introducing a massive, novel benchmark and dataset. This dataset enables immediate, broad real-world applications in digital automation and is likely to be widely adopted and cited by AI researchers building next-generation computer-use models.
Edit-R2 tackles a more fundamental and novel research problem—multi-turn image editing with reinforcement learning—combining several innovative contributions: a new RL framework unifying discrete text and continuous latent space optimization, trajectory filtering for state contamination, and a comprehensive benchmark (MICE-Bench). Paper 1 (DragOn) addresses an important but narrower gap in GUI agent benchmarks (drag interactions), which is more incremental. Paper 2's methodological innovations in applying RL to multi-turn generative tasks have broader implications across multimodal AI, making it likely to inspire more follow-up research.
DragOn addresses a clear, practical gap in GUI agent research by providing a large-scale benchmark and dataset for drag-based interactions, an underexplored area despite being fundamental to GUI automation. Its scale (286K screenshots, 3.5M tasks), comprehensive evaluation across multiple model families, and immediate utility for the rapidly growing GUI agent community give it high impact potential. Paper 2 proposes a belief-aware VLM framework, which is conceptually interesting but more incremental—combining retrieval memory and RL with VLMs—and evaluates only on VQA datasets with modest improvements over zero-shot baselines, limiting its demonstrated impact.
Paper 2 is likely to have higher scientific impact because it delivers a concrete, reusable dataset and benchmark (286K screenshots, 3.5M tasks) that can immediately accelerate and standardize research on GUI agents, with clear real-world applications in desktop/mobile automation. It also includes comparative evaluations and a fine-tuning baseline, supporting methodological rigor and near-term adoption by the community. Paper 1 is conceptually novel and timely in AI governance/alignment, but is more programmatic and harder to operationalize or validate empirically, which may limit short-term measurable impact.
Datasets and benchmarks often serve as foundational infrastructure for emerging fields, historically driving significant impact. Paper 1 addresses a major data gap in the rapidly growing field of GUI agents by providing a massive, multi-domain dataset for drag-based interactions. This is likely to catalyze widespread adoption and standardize evaluation across vision-language model research. While Paper 2 offers a strong methodological improvement for time-series forecasting, Paper 1's foundational utility and relevance to the broader pursuit of autonomous digital agents give it higher potential for widespread scientific impact.
Paper 2 introduces a massive, million-scale dataset and benchmark addressing a critical bottleneck in GUI agent training. High-quality datasets historically drive significant, broad advancements in model capabilities. In contrast, Paper 1 presents an interesting architectural proxy but evaluates it on a very small dataset (21 sessions), limiting its methodological rigor and broad scientific impact compared to a fundamental benchmark resource.
Paper 2 addresses a fundamental algorithmic limitation in LLMs (early commitment in autoregressive decoding) by introducing a novel diffusion-based planning framework. While Paper 1 provides a valuable dataset for GUI agents, Paper 2 offers a broader methodological innovation that can impact any domain requiring complex combinatorial search and tool use, leading to deeper theoretical and cross-domain scientific impact.
DragOn addresses a significant gap in GUI agent research by providing the first large-scale benchmark and dataset for drag-based interactions, which are underrepresented despite being common in real-world GUI usage. The 286K training screenshots and comprehensive evaluation of leading models make it a valuable community resource. Paper 2 presents an incremental combination of curriculum learning and multi-model selection for medical text generation, evaluated only with BERTScore on a single dataset, representing more limited methodological novelty and narrower impact potential.
Agents' Last Exam (ALE) addresses a fundamental gap between AI benchmark performance and real-world economic impact, offering a large-scale, living benchmark spanning 55 subfields and 13 industry clusters with 1K+ tasks developed with 250+ industry experts. Its breadth, focus on economically meaningful evaluation, and finding that top models achieve only 2.6% on hard tasks make it highly impactful for the AI agents community. DragOn addresses an important but narrower problem (drag-based GUI interactions), contributing a useful dataset but with more limited scope and breadth of impact across fields.
Paper 2 likely has higher impact due to a clearer algorithmic contribution (state-grounded, stepwise dynamic skill retrieval) that can generalize across web-agent settings and informs broader research on continual/online learning, retrieval, and agentic planning. It reports consistent gains on a standard benchmark (WebArena) across model scales and provides code for reproducibility, supporting methodological rigor and adoption. Paper 1 is valuable infrastructure (dataset/benchmark) but is narrower in scope (drag interactions) and its impact depends more on downstream uptake; the claimed improvements are more suggestive than definitive.
Paper 2 addresses a critical gap in a rapidly expanding field of AI (GUI agents) by providing a large-scale dataset for drag-based interactions. Its potential real-world applications in automation and HCI give it broader and more immediate impact across the tech industry compared to Paper 1, which focuses on the narrower, albeit important, niche of high-fidelity scientific simulation data compression.
Paper 1 addresses a more fundamental and broadly impactful problem: how LLM agents handle tool failures and dynamic replanning. Its systematic benchmark design (DAG-based topology × perturbation taxonomy) reveals critical insights about model scaling limitations for fault tolerance, which has implications across all agentic AI systems. Paper 2, while valuable, addresses a narrower problem (drag-based GUI interactions) that is more incremental in nature—extending existing GUI grounding work to a specific underserved interaction modality. Paper 1's findings about the disconnect between scaling and robustness are likely to influence agent architecture research more broadly.
Paper 1 is more scientifically impactful: it proposes a novel abstraction (typed federated artifacts) that changes the unit of collaboration in federated learning, enabling principled per-field DP, schema-aware merging, and cross-architecture transfer—broadly relevant across FL, privacy, and multi-model systems. It includes formal guarantees and empirical characterization across distributions and multiple LLM families, suggesting methodological rigor and generality. Paper 2 is timely and useful, but primarily contributes a dataset/benchmark in a narrower area (drag interactions for GUI agents) with less fundamental conceptual innovation.
Paper 1 introduces a fundamental algorithmic framework for LLM agents to autonomously acquire and transfer skills from experience. Its broad applicability across diverse domains (math, vision, office workflows) and impressive cross-model transferability suggest significant implications for agentic AI. In contrast, Paper 2 provides a valuable but narrowly focused dataset and benchmark for drag-based GUI interactions. Therefore, Paper 1 is likely to have a broader and more profound scientific impact across various AI subfields.
While Paper 1 provides a valuable dataset for a specific modality in GUI agents, Paper 2 tackles the foundational frontier of recursive self-improvement ('agents optimizing agents'). By introducing a systematic harness and benchmark for evaluating how coding agents improve other agents, Paper 2 addresses a critical bottleneck in autonomous AI development. Its focus on meta-optimization has far broader implications across the entire AI field, offering higher theoretical novelty and potential long-term scientific impact than a domain-specific interaction dataset.
Paper 2 likely has higher impact due to broader applicability and scale: a large (286K screenshots, 3.5M tasks) benchmark/training set for drag-based GUI interactions addresses a clear, widely relevant bottleneck for GUI agents, enabling progress across web/mobile/desktop automation and HCI. It is timely given the rapid growth of interest in computer-use agents, and it can catalyze model development via both evaluation and training. Paper 1 is novel and socially important, but it is smaller (2,123 conversations), narrower and domain-specific (Replika), and more sensitive to annotation/LLM-judge validity, limiting generalizability.