Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

Jiaxi Li, Ke Deng, Yun Wang, Jingyuan Huang, Yucheng Shi, Qiaoyu Tan, Jin Lu, Ninghao Liu

Jun 3, 2026

arXiv:2606.04391v1

cs.AI (primary)

#2292 of 3404 · Artificial Intelligence

Tournament Score: 1358 ± 47

Win Rate: 47%

Rating: 5.8 / 10

Significance: 5.5

Rigor: 5.5

Novelty: 5

Clarity: 7.5

Abstract

Language agents increasingly rely on reusable skills to improve multi-step web automation across related tasks. A growing line of work studies online skill learning, where agents continually induce skills from previous task trajectories and reuse them in future tasks on the fly. However, existing methods mainly reuse skills at the task level: a fixed set of skills is retrieved based on the initial task instruction and then held fixed throughout execution. This static strategy is misaligned with web execution, where the appropriate next action depends not only on the task goal but also on the current webpage state, which often transitions into situations that the initial skills fail to cover. To address this gap, we propose State-Grounded Dynamic Retrieval (SGDR), an online skill learning method that enables stepwise skill reuse for web agents. SGDR consists of three components: a sliding-window extraction process that turns completed trajectories into reusable sub-procedures invokable at intermediate execution states, a dual text-code representation that connects skill retrieval with executable action, and a state-grounded dynamic retrieval mechanism that matches skills to both the task goal and the current webpage state. Experiments on WebArena across five domains show that SGDR consistently outperforms strong baselines, achieving average success rates of 37.5% with GPT-4.1 and 24.3% with Qwen3-4B, corresponding to relative gains of 10.6% and 10.0% over the strongest baseline, respectively. The code is available at https://github.com/plusnli/skill-dynamic-retrieval.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval"

1. Core Contribution

The paper identifies a genuine limitation in existing online skill learning for web agents: current methods retrieve skills once based on the initial task instruction and keep them fixed throughout execution. This "task-level static reuse" is misaligned with the dynamic nature of web interaction, where the relevant skill depends on the evolving webpage state. The proposed method, SGDR, introduces three interconnected components: (1) sliding-window extraction to decompose trajectories into intermediate-granularity sub-procedures, (2) dual text-code skill representation linking retrieval descriptions to executable code, and (3) state-grounded dynamic retrieval that re-retrieves skills at each step conditioned on both the task goal and current webpage state. The core insight—that skill retrieval should be state-conditioned and dynamic rather than static—is intuitive and well-motivated, though not deeply surprising.
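To make the first component concrete, the following is a minimal sketch of sliding-window extraction over a completed trajectory. It is an illustration under stated assumptions, not the authors' implementation: the `Action` type and the `sliding_windows` function are hypothetical names, and the default lengths (2, 3, 4, 5) are the fixed window lengths this assessment reports in Section 2. The paper's subsequent LLM-based induction step, which turns candidate windows into skills, is not sketched here.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: one executed web action as a string,
# e.g. "click('Search')". The paper's actual action format is not assumed.
Action = str


def sliding_windows(
    trajectory: Sequence[Action],
    lengths: Sequence[int] = (2, 3, 4, 5),
) -> List[Tuple[int, Tuple[Action, ...]]]:
    """Return (start_index, window) pairs for every contiguous sub-sequence
    whose length is in `lengths`. Lengths longer than the trajectory yield
    no windows."""
    windows = []
    for n in lengths:
        # range() is empty when n exceeds the trajectory length.
        for start in range(len(trajectory) - n + 1):
            windows.append((start, tuple(trajectory[start:start + n])))
    return windows


# Toy trajectory with four actions (illustrative only).
demo = ["goto('/orders')", "click('Filter')", "type('status', 'open')", "click('Apply')"]
candidates = sliding_windows(demo)
```

Keeping the start index alongside each window is one way to let a candidate sub-procedure be tied back to the intermediate state it began from, which is the property the review highlights.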

2. Methodological Rigor

The approach is technically sound but relatively straightforward. The retrieval score is a weighted combination of two cosine similarities, one against the task-goal embedding and one against the current-state embedding (Equation 1), and the scored candidates are then reranked with Maximal Marginal Relevance (MMR) for diversity (Equation 2). These are well-established information retrieval techniques applied to skill selection. The sliding-window extraction uses fixed window lengths {2,3,4,5}, which is simple but effective.
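The scoring and reranking described above can be sketched as follows. The exact forms of Equations 1 and 2 are not reproduced in this assessment, so the code assumes a convex combination `alpha * cos(skill, task) + (1 - alpha) * cos(skill, state)` for relevance and the standard MMR criterion for reranking. All names (`relevance`, `mmr_select`, `alpha`, `lam`) and the toy vectors are hypothetical; real embeddings would come from an embedding model.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relevance(skill: Vector, task: Vector, state: Vector, alpha: float) -> float:
    # ASSUMED form of Equation 1: convex mix of task-goal and state similarity.
    return alpha * cosine(skill, task) + (1.0 - alpha) * cosine(skill, state)


def mmr_select(skills: Dict[str, Vector], task: Vector, state: Vector,
               k: int = 3, alpha: float = 0.5, lam: float = 0.7) -> List[str]:
    """Greedy standard MMR: trade relevance against similarity to already
    selected skills. lam = 1.0 reduces to plain top-k by relevance."""
    rel = {name: relevance(vec, task, state, alpha) for name, vec in skills.items()}
    selected: List[str] = []
    remaining = set(skills)
    while remaining and len(selected) < k:
        def mmr(name: str) -> float:
            redundancy = max(
                (cosine(skills[name], skills[s]) for s in selected), default=0.0
            )
            return lam * rel[name] - (1.0 - lam) * redundancy
        best = max(sorted(remaining), key=mmr)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy 2-D "embeddings" (illustrative only).
toy_skills = {"search": [1.0, 0.0], "search_dup": [0.99, -0.1], "checkout": [0.0, 1.0]}
task_vec, state_vec = [1.0, 0.0], [0.6, 0.8]
diverse = mmr_select(toy_skills, task_vec, state_vec, k=2, alpha=0.5, lam=0.5)
top_k_only = mmr_select(toy_skills, task_vec, state_vec, k=2, alpha=0.5, lam=1.0)
```

In this toy setup, `lam=0.5` selects `search` and then `checkout`, skipping the near-duplicate, while `lam=1.0` selects `search` and `search_dup`. In a stepwise setting, `state_vec` would be recomputed from the current page at every step and the selection rerun.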

The experimental setup on WebArena is appropriate, covering five domains with 764 single-domain tasks. Two backbone models (GPT-4.1 and Qwen3-4B) are tested, providing some evidence of generalizability across model scales. The ablation studies are well-designed, examining retrieval signals (α values), MMR reranking (λ values), and extraction granularity (full trajectory vs. single action vs. sliding window). Each ablation supports the design choices made.

However, several methodological concerns arise:

Statistical significance: No confidence intervals or significance tests are reported despite the stochastic nature of LLM-based evaluation. The improvements (e.g., 3.6 points absolute over CER with GPT-4.1) could partly be within noise margins.

Evaluator reliability: The proxy evaluator E uses the same backbone LLM. The correlation between proxy judgments ŷ and ground truth y is never reported, making it unclear how much noise the skill induction pipeline absorbs.

Cross-site tasks excluded: Removing cross-site tasks limits the evaluation scope and avoids a potentially challenging setting where dynamic retrieval could show even greater benefits (or limitations).

Task ordering effects: The online setting is sensitive to task ordering, but only one ordering (by original WebArena IDs) appears to be used.
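The statistical significance concern above can be given a rough scale with a back-of-the-envelope calculation. This sketch assumes, purely for illustration, that the 37.5% figure is a pooled success rate over all 764 tasks, that task outcomes are independent, and that a single run was made; the paper reports an average across domains, so the true uncertainty may differ. The function name `wald_half_width` is hypothetical.

```python
import math


def wald_half_width(success_rate: float, n_tasks: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation (Wald) interval for a
    binomial proportion."""
    return z * math.sqrt(success_rate * (1.0 - success_rate) / n_tasks)


# ASSUMPTION: 37.5% treated as a pooled rate over 764 independent tasks.
half_width = wald_half_width(0.375, 764)
print(f"95% interval half-width: {100 * half_width:.1f} points")
```

Under these assumptions the half-width is about 3.4 points, comparable in size to the 3.6-point gap over CER. This does not show the gain is spurious, since both systems run on the same tasks and a paired test (for example McNemar's test on per-task outcomes) would be the more appropriate check, but it illustrates why reported intervals or repeated runs would help.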

3. Potential Impact

The paper addresses a practical concern in deploying web agents with reusable skills. The idea of step-level dynamic retrieval is applicable beyond web agents to any sequential decision-making setting where an agent maintains a skill library. The dual text-code representation and sliding-window extraction are modular and could be adopted by other skill-learning frameworks.

However, the absolute performance numbers remain modest (37.5% with GPT-4.1, 24.3% with Qwen3-4B), suggesting that skill learning alone is insufficient for reliable web automation. The relative gains (10.6% and 10.0% over the strongest baseline for GPT-4.1 and Qwen3-4B, respectively) are meaningful but not transformative. The approach is also somewhat engineering-heavy: it combines multiple known techniques (sliding windows, embedding-based retrieval, MMR, LLM-based skill induction) without introducing fundamentally new algorithms.
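As a quick consistency check on the reported numbers, the relative gain for GPT-4.1 can be recomputed from the figures quoted in this assessment. The sketch assumes CER is the strongest GPT-4.1 baseline, which the text implies but does not state outright; `relative_gain` is a hypothetical helper.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# 37.5% for SGDR and a 3.6-point absolute gap over CER imply a 33.9% baseline.
# ASSUMPTION: CER is the strongest baseline for GPT-4.1.
baseline_gpt41 = 37.5 - 3.6
gain_gpt41 = relative_gain(37.5, baseline_gpt41)
print(f"relative gain: {100 * gain_gpt41:.1f}%")
```

This reproduces the reported 10.6% and makes the distinction explicit: a 3.6-point absolute improvement and a 10.6% relative improvement describe the same result.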

The code release and detailed reproducibility information (task indices, prompts, parameters) are valuable for the community.

4. Timeliness & Relevance

The paper is highly timely. Web agent research is a rapidly growing area, and online skill learning addresses real deployment scenarios. The limitations of static skill reuse have been implicitly acknowledged but not explicitly addressed with a dedicated solution. The choice of WebArena as the benchmark is appropriate, and testing with both proprietary (GPT-4.1) and open-source (Qwen3-4B) models reflects current community interests.

The work sits at the intersection of retrieval-augmented generation, program synthesis, and agent learning—all active areas. However, the landscape is evolving quickly, with concurrent work (SkillWeaver, XSkill, ContractSkill, etc.) exploring similar themes, which may reduce the novelty window.

5. Strengths & Limitations

Strengths:

Clear problem formulation with a well-motivated gap between task-level and step-level skill reuse

Comprehensive ablation studies validating each design choice

Consistent improvements across domains and backbone models

Practical design: modular components that could integrate with other methods

Good reproducibility: detailed prompts, task indices, and code release

Efficiency gains (fewer steps) alongside accuracy improvements

Limitations:

The technical novelty is moderate—the main components (embedding retrieval, MMR, sliding windows) are standard techniques combined in a reasonable way

No statistical significance testing; improvements could be partially within variance

Single benchmark (WebArena) with cross-site tasks excluded

The LLM-based summarization for state representation adds computational overhead at each step, which is not analyzed

The evaluator model's accuracy is never validated against ground truth

The GitLab domain shows mixed results, and the explanation (persistent preconditions) is speculative

The sliding-window lengths are manually specified; no adaptive granularity mechanism is explored

Limited analysis of failure cases or when dynamic retrieval hurts vs. helps

The paper does not compare against non-skill-based approaches that use planning or world models

Additional Observations

The case studies in the appendix are illustrative but cherry-picked. A systematic analysis of skill library composition, retrieval accuracy over time, and the frequency with which dynamically retrieved skills differ from what static retrieval would have selected would strengthen the empirical contribution. The computational cost analysis (how much overhead does per-step retrieval add?) is also missing.

The formalization in Section 3 is clean but the optimization objective (maximizing cumulative success) is stated without any theoretical analysis of regret or convergence properties, which limits the theoretical depth.

Rating: 5.8 / 10

Significance 5.5 · Rigor 5.5 · Novelty 5 · Clarity 7.5

Generated Jun 5, 2026

Comparison History (15)

vs. From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents

claude-opus-4.6 · 6/6/2026

Paper 2 (TRIAD) addresses the critical and timely problem of LLM agent safety with a novel three-way guardrail framework that goes beyond binary allow/deny decisions. Its closed-loop feedback mechanism for plan revision is innovative and has broader impact across all LLM agent applications. The safety-utility trade-off problem is fundamental as agents are deployed in high-stakes settings. Paper 1, while solid, focuses on incremental improvements to web automation skill retrieval—a narrower domain. Paper 2's contributions are more generalizable and address a more pressing concern in the rapidly growing field of autonomous agents.

vs. Semantic Partial Grounding via LLMs

gemini-3.1 · 6/6/2026

Paper 1 addresses a major challenge in the rapidly growing field of autonomous web agents by enabling dynamic, state-aware skill learning. This has highly relevant real-world applications in AI assistants and web automation. Paper 2, while offering a clever LLM-based optimization for classical planning, targets a more niche area (PDDL grounding) and is primarily a computational speedup rather than a novel capability. Therefore, Paper 1 demonstrates broader potential impact and timeliness.

vs. From Features to Actions: Explainability in Traditional and Agentic AI Systems

gemini-3.1 · 6/6/2026

Paper 1 addresses a fundamental conceptual shift in AI—from static predictions to agentic systems—and highlights the inadequacy of traditional Explainable AI methods in this new paradigm. By proposing a shift towards trajectory-level explainability, it tackles a critical bottleneck for the safe deployment and understanding of autonomous agents. This foundational contribution to AI transparency and evaluation is likely to have broader, cross-disciplinary impact compared to Paper 2, which offers a highly effective but narrower methodological improvement specifically for web agent skill retrieval.

vs. AIS-Based Vessel Trajectory Prediction Using Memory-Augmented Neural Networks

gemini-3.1 · 6/6/2026

Paper 2 addresses a highly timely and rapidly evolving field (LLM-based web agents) and introduces a novel state-grounded dynamic retrieval mechanism for online skill learning. Its methodological innovation and potential broad impact on autonomous agent design outweigh Paper 1, which primarily represents an empirical application of existing memory-augmented neural networks to the specific domain of maritime vessel tracking.

vs. The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios

claude-opus-4.6 · 6/6/2026

Paper 2 presents a concrete, well-evaluated method (SGDR) with clear empirical improvements on a standard benchmark (WebArena), demonstrating practical gains in web automation through state-grounded dynamic skill retrieval. Its contributions—sliding-window extraction, dual text-code representation, and dynamic retrieval—are technically specific and immediately applicable. Paper 1 introduces a benchmark/evaluation framework addressing important challenges (scheduling, exploration, continuous learning), but benchmarks typically have less direct impact than novel methods unless widely adopted. Paper 2's actionable methodology and demonstrated improvements on established benchmarks give it stronger near-term scientific impact.

vs. DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact due to a clearer algorithmic contribution (state-grounded, stepwise dynamic skill retrieval) that can generalize across web-agent settings and informs broader research on continual/online learning, retrieval, and agentic planning. It reports consistent gains on a standard benchmark (WebArena) across model scales and provides code for reproducibility, supporting methodological rigor and adoption. Paper 1 is valuable infrastructure (dataset/benchmark) but is narrower in scope (drag interactions) and its impact depends more on downstream uptake; the claimed improvements are more suggestive than definitive.

vs. PerceptUI: LLM Agents as Human-Aligned Synthetic Users for UI/UX Evaluation

gemini-3.1 · 6/5/2026

Paper 2 introduces a novel paradigm of using LLMs as persona-conditioned synthetic users, bridging AI and HCI. This has immense real-world application potential by significantly reducing the cost and time of human-centric UI/UX evaluation. While Paper 1 offers a strong algorithmic improvement for web agents, Paper 2's approach represents a broader conceptual shift with wider interdisciplinary impact across software development and design.

vs. SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact: it introduces a concrete algorithmic advance (state-grounded dynamic retrieval for online skill reuse) with strong empirical gains on a widely used benchmark (WebArena), plus released code—factors that drive adoption and follow-on work. It targets timely, high-demand web automation and can generalize to other embodied/interactive agents needing state-conditioned skill composition. Paper 1 is a valuable benchmark and diagnostic framework for long-horizon relational memory, but benchmarks often have narrower immediate real-world uptake unless they rapidly become standard; its scope (1,522 instances, 10 histories) may limit broad community lock-in compared to a method showing clear performance improvements.

vs. Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

gemini-3.1 · 6/5/2026

Paper 1 addresses a fundamental and highly timely question in the LLM ecosystem: whether multi-agent systems genuinely outperform single agents under strictly controlled conditions. By introducing a normalized benchmarking framework and challenging prevailing hype, it has the potential to broadly influence how agentic workflows are evaluated and designed across the field. Paper 2, while offering a strong methodological improvement for web agents, is narrower in scope and application.

vs. AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification

gemini-3.1 · 6/5/2026

Paper 2 addresses a fundamental challenge in AI agents—dynamic, state-aware skill retrieval during execution—offering a highly generalizable methodology. While Paper 1 presents a strong, high-value application in financial auditing, its impact is largely domain-specific. Paper 2's approach to web automation has broader implications across numerous fields relying on interactive language agents, making its potential scientific and practical impact significantly wider.

vs. StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

gemini-3.1 · 6/5/2026

Paper 1 addresses a critical bottleneck in autonomous web agents by enabling dynamic, state-grounded skill retrieval, moving beyond static task-level planning. Given the massive interest and broad real-world applicability of web automation agents, improving their adaptability to changing environments offers higher potential for direct impact on core capabilities compared to the debugging and failure attribution framework presented in Paper 2.

vs. Stochastic convergence of parallel asynchronous adaptive first-order methods

gemini-3.1 · 6/5/2026

Paper 1 provides foundational theoretical convergence guarantees for asynchronous adaptive optimization methods, which are critical for training large-scale machine learning models. Its rigorous mathematical approach ensures broad, long-lasting impact across all areas of ML. Paper 2 offers a timely and practical improvement for web agents, but its impact is more specialized and potentially less enduring than fundamental optimization theory.

vs. SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale

gemini-3.1 · 6/5/2026

Paper 2 addresses a fundamental bottleneck in LLM agent scalability: selecting skills from massive libraries where inter-skill relationships (dependencies, conflicts) matter more than simple semantic similarity. By introducing a self-evolving, typed DAG structure for skill retrieval, it offers a more scalable and structurally aware approach than Paper 1's state-grounded dynamic retrieval. This structural innovation is likely to have broader applicability across complex, large-scale agentic systems.

vs. Reasoning Structure of Large Language Models

claude-opus-4.6 · 6/5/2026

Paper 2 introduces a novel framework for analyzing the *structure* of reasoning in LLMs, moving beyond surface metrics (accuracy, token count) to graph-based topological analysis. This addresses a fundamental gap in understanding LLM reasoning and has broad applicability across all reasoning-capable models and tasks. Paper 1, while solid engineering work showing incremental improvements in web agent skill retrieval, is more narrowly scoped to web automation. Paper 2's methodological contribution—converting reasoning traces into verifiable graphs with efficiency metrics—offers a new analytical paradigm with wider cross-field impact and greater potential to influence future evaluation standards.

vs. Towards a Science of AI Agent Reliability

claude-opus-4.6 · 6/5/2026

Paper 1 addresses a fundamental gap in AI agent evaluation by proposing a comprehensive reliability framework with 12 metrics across four dimensions, grounded in safety-critical engineering. Its breadth of impact is significantly wider—applicable across all AI agent domains, not just web automation. The systematic evaluation of 15 models reveals that capability gains don't translate to reliability improvements, a finding with profound implications for deployment safety. Paper 2, while solid, offers an incremental improvement to web agent skill retrieval with narrower scope. Paper 1's framework has potential to reshape how the community evaluates and develops agents.