AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions

Minghao Chen, Xinyi Hu, Zhou Yu, Yufei Yin

May 20, 2026

arXiv:2605.21082v1

cs.AI (primary)

#1336 of 2292 · Artificial Intelligence

Tournament Score: 1393 ± 41 (axis range 1050 to 1800)

Win Rate: 54%

Rating: 6.8 / 10

Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 7.5

Abstract

Large Language Model (LLM) based agents have demonstrated proficiency in multi-step interactions with graphical user interfaces (GUIs). While most research focuses on improving single-task performance, practical scenarios often involve repetitive GUI tasks for which invoking LLM reasoning repeatedly, i.e., the ReAct paradigm, is inefficient. Prior to LLMs, traditional Robotic Process Automation (RPA) offers runtime efficiency but demands significant manual effort to develop and maintain. To bridge this gap, we propose AutoRPA, a framework that automatically distills the decision logic of ReAct-style agents into robust RPA functions. AutoRPA introduces two core innovations: (1) A translator-builder pipeline, where a translator agent converts hard-coded ReAct actions into soft-coded procedures, and a builder agent synthesizes robust RPA functions via retrieval-augmented generation over multiple trajectories; (2) A hybrid repair strategy during code verification, combining RPA execution with ReAct-based fallback for iterative refinement. Experiments across multiple GUI environments demonstrate that RPA functions generated by AutoRPA successfully solve similar tasks while reducing token usage by 82% to 96%, significantly improving runtime efficiency and reusability.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: AutoRPA

1. Core Contribution

AutoRPA addresses a practical but underexplored problem: how to convert the expensive, per-instance reasoning of LLM-based GUI agents into reusable, low-cost RPA scripts for repetitive task types. The key insight is that many real-world GUI tasks are recurring (e.g., booking flights, filing reports), and invoking full LLM reasoning each time is wasteful. The framework introduces a three-phase pipeline: (1) exploration via a ReAct agent that collects trajectories, (2) a translator-builder pipeline that converts hard-coded actions into soft-coded, environment-resilient RPA functions via RAG over a tree-structured trajectory database, and (3) a hybrid repair strategy combining RPA execution with ReAct-based fallback for iterative code refinement.
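To make the "tree-structured trajectory database" concrete, here is a minimal sketch, not the paper's implementation. It assumes three levels (conclusions at the top, simplified trajectories beneath them, interaction blocks at the leaves) and a `fetch_info` lookup that lets a builder drill down on demand. The class names, field names, and the return shape of `fetch_info` are all hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node in a hypothetical three-level trajectory tree."""
    node_id: str
    level: str  # "conclusion", "trajectory", or "block" (assumed labels)
    text: str   # summary at this level of detail
    children: list[Node] = field(default_factory=list)


class TrajectoryDB:
    """Index every node by id so a builder can request detail lazily."""

    def __init__(self, roots):
        self.roots = list(roots)
        self._index = {}
        for root in self.roots:
            self._register(root)

    def _register(self, node):
        self._index[node.node_id] = node
        for child in node.children:
            self._register(child)

    def overview(self):
        """Top-level conclusions only: the short context shown first."""
        return [(r.node_id, r.text) for r in self.roots]

    def fetch_info(self, node_id):
        """Return one node's text plus the ids of its children."""
        node = self._index[node_id]
        return {
            "level": node.level,
            "text": node.text,
            "children": [c.node_id for c in node.children],
        }


# Illustrative data only; none of it comes from the paper.
block = Node("t1.b1", "block", "tap element described as 'save contact'")
traj = Node("t1", "trajectory", "open app, add contact, fill name, save", [block])
db = TrajectoryDB([Node("c1", "conclusion", "Save appears after name is filled", [traj])])
```

The point of the layering is that the prompt starts with `overview()` and grows only when the builder asks for a specific node, which is one plausible way to keep context length bounded.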

The problem formulation is clean and well-motivated. The distinction between "hard-coded" actions (tied to specific UI indices/coordinates) and "soft-coded" actions (based on semantic element attributes) is a meaningful abstraction that enables generalization across task instances with different GUI layouts.
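The hard-coded versus soft-coded distinction can be illustrated with a small sketch. The `find_element` name comes from the paper as quoted in this assessment, but the signature, the dictionary-based UI snapshot, and the attribute names below are assumptions made for illustration.

```python
def find_element(elements, **attrs):
    """Return the first element whose attributes match every given pair.

    Hypothetical signature: the real `find_element` may differ.
    """
    for el in elements:
        if all(el.get(key) == value for key, value in attrs.items()):
            return el
    return None


# Two hypothetical snapshots of the same screen with buttons in a different order.
layout_a = [
    {"index": 0, "text": "Cancel", "content_desc": "cancel"},
    {"index": 1, "text": "Save", "content_desc": "save contact"},
]
layout_b = [
    {"index": 0, "text": "Save", "content_desc": "save contact"},
    {"index": 1, "text": "Cancel", "content_desc": "cancel"},
]


def hard_coded_save(elements):
    """Tied to a position observed in one trajectory."""
    return elements[1]


def soft_coded_save(elements):
    """Tied to a semantic attribute, so it survives reordering."""
    return find_element(elements, content_desc="save contact")
```

On `layout_a` both functions return the Save button. On `layout_b` the hard-coded version returns Cancel, while the soft-coded version still returns Save.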

2. Methodological Rigor

The approach is methodologically sound with several well-designed components:

Translator Agent: Converting positional/index-based actions to semantic attribute-based element lookups is a practical and important step. The use of `find_element` with semantic kwargs (content descriptions, hint text, target descriptions) provides a reasonable abstraction layer.

Builder with Tree-structured RAG: The hierarchical trajectory database (interaction blocks → simplified trajectories → conclusions) with a `fetch_info` tool is a thoughtful design that balances context length constraints against the builder's need for detailed information.

Hybrid Repair: The combination of breakpoint analysis + ReAct continuation + builder refinement is more principled than simple retry-and-regenerate approaches. The analyzer agent's role in diagnosing failures and deciding whether to continue from the breakpoint or restart is a practical innovation.

Experimental Evaluation: The paper evaluates on three diverse benchmarks (AndroidWorld, WebArena, MiniWoB++) with multiple LLM backbones (GPT-4o, GPT-4.1, GPT-5, Claude-4.5-sonnet). The ablation study systematically removes key components. The results consistently show 82-96% token reduction with competitive or superior success rates. The scaling analysis (Figure 5) showing improvement with more building tasks is informative.
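The hybrid repair component listed above can be sketched as a control loop. This is an assumption-laden illustration rather than the authors' code: `run_with_repair`, the `analyze` callback, the "continue"/"restart" decision strings, and the report format are all hypothetical, and the ReAct fallback is a stub standing in for an LLM-driven agent.

```python
def run_with_repair(rpa_steps, react_fallback, analyze, state):
    """Run scripted steps; on failure, hand over to a ReAct-style fallback.

    Returns (final_state, report). The report records the breakpoint so a
    builder could later refine the script.
    """
    initial = dict(state)  # kept so the fallback can restart from scratch
    for i, step in enumerate(rpa_steps):
        try:
            state = step(state)
        except Exception as exc:
            # Analyzer decides whether the current screen is still usable.
            decision = analyze(i, exc, state)
            start = state if decision == "continue" else dict(initial)
            final = react_fallback(start)
            return final, {"breakpoint": i, "reason": str(exc), "decision": decision}
    return state, {"breakpoint": None, "reason": None, "decision": None}


# Stubs for illustration; each step returns a new state dict.
def open_form(s):
    return {**s, "form_open": True}


def click_save(s):
    raise LookupError("no element matched content_desc='save contact'")


def react_fallback(s):
    return {**s, "saved": True}  # stands in for step-by-step LLM reasoning


def analyze(step_index, exc, s):
    return "continue" if s.get("form_open") else "restart"


final, report = run_with_repair([open_form, click_save], react_fallback, analyze, {})
```

In this run the script breaks at step 1, the analyzer chooses to continue from the breakpoint, and the fallback completes the task. The returned report is the kind of signal a builder would need to patch the failing step.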

However, there are some methodological concerns:

The paper relies on N=3 building tasks per task type, which is relatively few. The generalization claim would be stronger with more diverse building sets.

The evaluation focuses on success rate and token cost but does not analyze failure modes in depth or identify the types of tasks where AutoRPA struggles.

Statistical significance tests are absent; given the stochastic nature of LLM outputs, confidence intervals would strengthen the claims.
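As an illustration of the last concern, a confidence interval for a reported success rate can be computed from the raw counts. The sketch below uses the Wilson score interval, which is one common choice and not something the paper is stated to use. The counts in the example are hypothetical and do not come from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half


# Hypothetical counts: 30 successful tasks out of 50 attempts.
low, high = wilson_interval(30, 50)
```

With these made-up counts the interval runs from roughly 0.46 to 0.72. A 60% success rate measured on 50 tasks is therefore compatible with a wide range of true rates, which is why per-benchmark intervals would help when comparing methods a few points apart.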

3. Potential Impact

Practical Impact: The framework directly addresses a real deployment bottleneck. Enterprise RPA is a multi-billion dollar industry, and the manual effort required to create and maintain RPA scripts is a major pain point. AutoRPA could significantly reduce this barrier, making RPA accessible to users who can describe tasks in natural language rather than programming scripts.

Cost Reduction: The 82-96% token reduction is substantial and has direct financial implications for production deployments of LLM-based GUI automation. The "AutoRPA (code only)" variant is particularly compelling—it achieves near-ReAct performance with minimal token usage.

Research Direction: The paper opens a useful research direction at the intersection of LLM agents and traditional software automation. The idea of "distilling" agent behavior into executable code has broader applicability beyond GUI tasks—it could extend to API automation, database operations, or any domain where LLM agents perform repetitive structured tasks.

Limitations on Impact: The reliance on accessibility trees/DOM is a significant constraint, though the authors acknowledge this and point to OmniParser as a potential mitigation. The framework requires task type specification and multiple instances for building, which adds friction. The approach is also tightly coupled to specific LLM capabilities (GPT-4o+), limiting accessibility.

4. Timeliness & Relevance

This paper is highly timely. The field is experiencing rapid growth in LLM-based GUI agents (CUA from OpenAI, Computer Use from Anthropic, Gemini CU), but the cost-efficiency problem for deployment at scale is largely unaddressed. Most research optimizes single-task performance; AutoRPA is one of the first to systematically tackle the repetitive-task efficiency problem. The framing of "distilling" ReAct trajectories into code aligns with the broader trend of knowledge distillation from expensive models to cheaper execution pathways.

5. Strengths & Limitations

Key Strengths:

Well-defined, practical problem with clear industrial relevance

The translator concept (hard-coded → soft-coded actions) is elegant and broadly applicable

Comprehensive evaluation across three benchmarks, multiple models, and thorough ablations

The hybrid repair strategy is more sophisticated than naive code regeneration

The case study (Appendix E) comparing AutoRPA's tic-tac-toe code against AdaPlanner and AutoManual vividly demonstrates the qualitative improvement in generated code robustness

Notable Limitations:

Requires DOM/accessibility tree access, limiting applicability to pure-vision scenarios

The building phase itself is expensive (Table 4 shows ~233k tokens per task type on AndroidWorld), requiring several successful task completions to amortize

Only 53.4% of AndroidWorld task types yield verified RPAs (Table 6), meaning nearly half still require full ReAct at test time

The paper doesn't explore how RPA functions degrade over time as applications update their UIs—a critical concern for real-world RPA

Limited analysis of failure cases—understanding *when* AutoRPA fails would be as valuable as knowing when it succeeds

Reproducibility: heavy reliance on proprietary APIs (GPT-4o, GPT-5) limits independent verification

Other Observations:

The comparison with AutoManual and AdaPlanner on MiniWoB++ is somewhat unfair, as those methods use expert demonstrations while AutoRPA uses only a simple "click-button" demo. However, this also highlights AutoRPA's advantage in requiring less human input.

The WebArena results (Figure 4) show more modest improvements, suggesting the approach may be less effective for complex, strategy-diverse web tasks.

The paper would benefit from a formal analysis of the amortization point—how many task instances are needed before AutoRPA's building cost is offset by execution savings.
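A back-of-the-envelope version of that amortization analysis is sketched below. The building cost (about 233k tokens per task type on AndroidWorld) and the 82% to 96% reduction range are taken from this assessment. The per-task ReAct cost of 50,000 tokens is a purely hypothetical placeholder, and the calculation assumes every later task instance is solved by the RPA function without fallback.

```python
import math


def break_even_tasks(build_tokens, react_tokens_per_task, reduction):
    """Smallest number of task instances whose savings cover the build cost.

    Savings per instance are modelled as react_tokens_per_task * reduction.
    """
    if not 0 < reduction <= 1:
        raise ValueError("reduction must be in (0, 1]")
    saved_per_task = react_tokens_per_task * reduction
    return math.ceil(build_tokens / saved_per_task)


BUILD_TOKENS = 233_000   # from the assessment (Table 4, AndroidWorld)
REACT_TOKENS = 50_000    # hypothetical per-task ReAct cost, not from the paper

n_low = break_even_tasks(BUILD_TOKENS, REACT_TOKENS, 0.82)
n_high = break_even_tasks(BUILD_TOKENS, REACT_TOKENS, 0.96)
```

Under these assumptions the build cost is recovered after 6 task instances at an 82% reduction and after 5 at 96%. The break-even point scales inversely with the real per-task ReAct cost, so the actual figure depends on numbers this page does not report.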

Rating: 6.8 / 10

Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 7.5

Generated May 21, 2026

Comparison History (26)

vs. AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

gemini-3.1 · 5/22/2026

Paper 2 addresses a critical bottleneck in LLM agent deployment: the high computational cost and latency of ReAct loops for repetitive tasks. By bridging LLM reasoning with traditional RPA systems, it offers massive, quantifiable efficiency gains (82-96% token reduction) for workflow automation. While Paper 1 introduces a valuable and rigorous benchmark for emotional intelligence, Paper 2's framework has broader, more immediate real-world economic and practical utility across software engineering and autonomous AI agents, directly enabling scalable, cost-effective AI automation.

vs. TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

gemini-3.1 · 5/22/2026

Paper 1 offers higher potential impact due to its broader applicability and resolution of a fundamental bottleneck in LLM agents. While Paper 2 is highly innovative for mechanical engineering (topology optimization), Paper 1 addresses the prohibitive computational cost and latency of the ReAct paradigm in repetitive tasks. By distilling expensive LLM reasoning into reusable RPA scripts and reducing token usage by up to 96%, AutoRPA provides a highly scalable automation solution. This promises massive cross-disciplinary adoption, economic value, and establishes a critical bridge between modern generative AI and traditional enterprise automation.

vs. Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding

claude-opus-4.6 · 5/22/2026

AutoRPA addresses a more novel and practically impactful problem—bridging LLM-based agents with traditional RPA for efficient GUI automation. It introduces a genuinely new framework (translator-builder pipeline, hybrid repair) with significant practical benefits (82-96% token reduction). Paper 1, while solid, proposes incremental improvements to visual token pooling for Video LLMs using training-free techniques, which is a narrower contribution in a crowded space. AutoRPA has broader cross-field impact (software engineering, automation, HCI) and addresses the timely challenge of making LLM agents cost-efficient for real-world deployment.

vs. Knowledge Graph Re-engineering Along the Ontological Continuum (extended version)

gpt-5.2 · 5/22/2026

Paper 1 offers a concrete, timely systems contribution: distilling ReAct-style GUI agents into reusable RPA code with measurable efficiency gains (82–96% token reduction) and a hybrid repair/verification loop. This is novel in bridging LLM agents and traditional RPA, has clear near-term real-world applications (enterprise automation), and is supported by experiments across GUI environments. Paper 2 provides an interesting conceptual framework and agenda for KG re-engineering, but is primarily theoretical with limited empirical validation, making impact more uncertain and longer-term.

vs. ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling

gpt-5.2 · 5/22/2026

Paper 2 (ExComm) likely has higher impact due to broader applicability and timeliness: it addresses a central bottleneck in agentic test-time scaling (error propagation) across math, reasoning, and tool-using settings, with a general communication-and-verification protocol that can be layered onto many multi-agent pipelines. Its methodology includes explicit conflict detection, tool-based verification, belief-update design, and diversity control, evaluated on widely watched benchmarks (AIME/GAIA) with clear gains and cost trade-offs. Paper 1 is practically valuable for GUI RPA, but its impact is narrower and more domain-specific.

vs. LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

claude-opus-4.6 · 5/22/2026

LCGuard addresses a fundamental and emerging security concern in multi-agent LLM systems—information leakage through latent KV cache sharing—which is a novel problem formulation with broad implications as latent communication becomes more prevalent. It introduces a principled adversarial training framework with formal definitions of safety. Paper 2 (AutoRPA) offers practical efficiency gains for GUI automation but is more incremental, combining existing paradigms (ReAct + RPA). LCGuard's contribution to AI safety and privacy in multi-agent systems has broader cross-field impact and higher timeliness given rapid multi-agent deployment.

vs. MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

gemini-3.1 · 5/22/2026

Paper 2 proposes a paradigm shift in autonomous agents by enabling source-level self-evolution, moving beyond the limitations of text-mutable artifacts (like prompts or schemas). This Turing-complete self-rewriting approach has profound implications for AGI and self-improving systems across multiple domains. Paper 1, while highly practical and effective for reducing costs in GUI automation, offers a more incremental optimization within a narrower scope.

vs. Advancing Mathematics Research with AI-Driven Formal Proof Search

gpt-5.2 · 5/22/2026

Paper 2 has higher potential impact due to stronger novelty (first large-scale evaluation on genuinely open problems), higher scientific stakes (advancing formalized mathematics and reliable theorem proving), and broader cross-field reach (methods applicable across many math domains and beyond to verification-heavy sciences). The demonstrated real-world outcomes (solving open Erdős problems, proving OEIS conjectures, active deployment in multiple research areas) indicate immediate relevance and adoption potential. While Paper 1 offers practical efficiency gains for GUI automation, its impact is narrower and more engineering-focused.

vs. Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

claude-opus-4.6 · 5/21/2026

AutoRPA addresses a practical and broadly applicable problem—bridging LLM-based agents and traditional RPA for efficient GUI automation. It introduces a concrete framework with measurable improvements (82-96% token reduction) and has clear real-world applications in enterprise automation. Paper 1, while offering interesting mechanistic insights into sycophancy as a persona-level property, is narrower in scope, focused on interpretability of steering vectors with incremental findings. Paper 2's combination of novelty, practical utility, and cross-field relevance (NLP, software engineering, HCI) gives it higher potential impact.

vs. AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

gpt-5.2 · 5/21/2026

Paper 2 (AgentCo-op) likely has higher scientific impact due to broader novelty and cross-domain applicability: it introduces retrieval-based synthesis of interoperable multi-agent workflows with typed artifact handoffs and localized repair, addressing a key bottleneck in open-world scientific automation. Its demonstrated use in genomics case studies suggests strong real-world relevance and timeliness for scientific tool/agent integration, and the interoperability/typed interface idea can generalize across fields. Paper 1 is impactful for GUI RPA efficiency, but its scope is narrower (GUI task automation) and more application-specific.

vs. Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

gemini-3.1 · 5/21/2026

Paper 1 addresses a critical bottleneck in LLM agents—high token cost and latency during repetitive tasks—by bridging ReAct paradigms with traditional RPA. This has broad, immediate real-world applications in enterprise automation, HCI, and software engineering. While Paper 2 provides a valuable high-performance simulator for RL research, its direct impact is largely confined to game AI and imperfect-information learning, making Paper 1 significantly more impactful across multiple disciplines and practical domains.

vs. Memory-Augmented Reinforcement Learning Agent for CAD Generation

claude-opus-4.6 · 5/21/2026

Paper 1 addresses a more fundamental and technically challenging problem—automatic CAD generation with reinforcement learning and memory augmentation. It combines multiple novel contributions (dual-track memory, dynamic utility retrieval, RL-based optimization for geometric feasibility) targeting a high-impact domain (advanced manufacturing). Paper 2, while practical and well-motivated, primarily offers an engineering optimization (distilling ReAct agents into RPA scripts for efficiency), which is more incremental. Paper 1's framework has broader potential impact across manufacturing, design automation, and AI-assisted engineering, with stronger methodological novelty.

vs. AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists

claude-opus-4.6 · 5/21/2026

AutoRPA presents a novel and rigorous technical contribution—bridging LLM-based agents with traditional RPA through automated code synthesis, with strong empirical results (82-96% token reduction). It addresses a concrete, widespread problem (repetitive GUI automation) with clear real-world applications and methodological innovation. Paper 1 (AiraXiv) describes a platform/system for AI-era publishing, which is more incremental and infrastructure-oriented rather than introducing fundamental new methods. Its validation is limited to a single conference deployment, and its scientific contribution is less generalizable.

vs. Generative Recursive Reasoning

gemini-3.1 · 5/21/2026

Paper 2 proposes a foundational advancement in neural reasoning by introducing probabilistic multi-trajectory latent computation as an alternative to standard autoregressive models. This theoretical innovation has broad implications across all of machine learning and AI reasoning. In contrast, Paper 1 presents a highly practical but domain-specific engineering solution for GUI automation. Paper 2's core architectural contributions offer greater potential for widespread methodological impact.

vs. Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

claude-opus-4.6 · 5/21/2026

Paper 1 addresses a fundamental and broadly applicable challenge in LLM agent development—systematic diagnosis of agent failures at scale. Its formalization of corpus-level trace diagnostics creates a new problem framework, and the multi-agent architecture with rigorous evaluation (30.4pp improvement, expert ratings) demonstrates strong methodological rigor. Paper 2, while practically useful in bridging LLM agents and RPA, addresses a narrower optimization problem (reducing token usage for repetitive GUI tasks). Paper 1's insights are more transferable across the entire LLM agent ecosystem, giving it broader potential impact.

vs. Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines

gemini-3.1 · 5/21/2026

Paper 1 presents a highly novel and broadly applicable framework by bridging LLM-based reasoning agents with traditional Robotic Process Automation (RPA). This approach addresses a major efficiency bottleneck in current agent architectures (high token usage and latency for repetitive tasks). While Paper 2 offers strong domain-specific optimizations for industrial workflows, Paper 1's methodology for synthesizing reusable RPA code from agent trajectories has wider implications across various fields of GUI automation, offering massive improvements in runtime efficiency and cost.

vs. DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

gemini-3.1 · 5/21/2026

Paper 2 introduces a comprehensive benchmark for multi-agent delegation, an emerging and critical area in LLM research. Benchmarks typically drive significant scientific progress by establishing standard evaluation metrics, highlighting performance gaps, and enabling comparisons across future methods. While Paper 1 offers a highly practical and novel approach to GUI automation efficiency, Paper 2's potential to become a foundational testing substrate for orchestration and long-horizon agentic workflows gives it a broader and longer-lasting potential scientific impact.

vs. Personality Engineering with AI Agents: A New Methodology for Negotiation Research

gpt-5.2 · 5/21/2026

Paper 2 has higher likely scientific impact: it introduces a concrete, system-level method (AutoRPA) that materially improves efficiency (82–96% token reduction) and reusability for GUI automation, a high-demand real-world domain with broad applicability across software engineering, HCI, and agent systems. The methodology is experimentally validated across multiple environments and includes robustification via verification and hybrid repair. Paper 1 is conceptually novel for negotiation research, but its impact hinges on adoption and empirical validation, and its immediate cross-field and industrial applicability is narrower.

vs. PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

claude-opus-4.6 · 5/21/2026

PlanningBench addresses a fundamental capability (planning) for LLMs with a comprehensive, scalable framework that serves both evaluation and training purposes. Its structured taxonomy of 30+ task types, constraint-driven synthesis pipeline, and demonstrated improvements via reinforcement learning offer broad methodological contributions. The finding that well-specified optimal solutions provide clearer reward signals has implications beyond planning. Paper 2 (AutoRPA) solves a practical but narrower problem—converting ReAct agents into efficient RPA scripts—with strong engineering contributions but more limited scientific breadth and generalizability across the field.

vs. Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

gpt-5.2 · 5/21/2026

Paper 1 likely has higher scientific impact: it introduces a practical framework to distill LLM agent interaction traces into reusable, efficient RPA code, addressing a clear deployment bottleneck (cost/latency) and enabling broad real-world automation across enterprise and consumer GUI workflows. The translator–builder pipeline plus hybrid repair strategy suggests methodological completeness, and the large token/runtime reductions indicate immediate utility. Paper 2 is novel and timely in LRM security, but as an offensive jailbreak method its broader adoption and positive downstream impact may be constrained; its contributions may be more specialized to safety research.