PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

Yubo Li, Yidi Miao, Haotian Shen, Yuxin Liu

May 24, 2026

arXiv:2605.24785v1

cs.AI (primary)

#1200 of 2682 · Artificial Intelligence

Tournament Score: 1421±43

Win Rate: 59%

Rating: 6.8/10

Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 7.5

Abstract

Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena and identify three recurring sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We then introduce PANDO, a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate, outperforming SGV (54.0%) and our WALT reproduction (45.2%), while using 58% fewer tokens than SGV and 61% fewer tokens than WALT, without any pre-evaluation discovery budget. A 300-task ablation further shows that rules and routines provide most of the success gains, while routing, compression, and cache-aware prompting convert the larger skill library into lower marginal token cost. Finally, we introduce three trajectory-level efficiency metrics -- Action Repetition Rate, Step Overhead Ratio, and Prompt Cache Utilization -- to make efficiency visible beyond terminal success.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: PANDO

1. Core Contribution

PANDO addresses a genuine and increasingly important tension in AI agent design: the trade-off between performance and inference cost. The paper's central insight is that web agents should become *cheaper* as they accumulate experience, rather than paying a fixed or growing compute tax per task. The framework achieves this through a structured Skill Library that grows online during evaluation, combining rules (guardrails against repeat failures), parameterized routines (reusable multi-step macros), confidence-based demotion (removing brittle skills), polarity-pair merging (consolidating direction-flipped routines), hierarchical routing (reserving expensive models for planning), visual compression, and cache-aware prompting.
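The confidence-based demotion idea mentioned above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Skill` and `SkillLibrary` classes, the Laplace-smoothed confidence score, and the `demote_below` and `min_uses` thresholds are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """One library entry; all fields are illustrative assumptions."""
    name: str
    successes: int = 0
    failures: int = 0

    @property
    def confidence(self) -> float:
        # Laplace-smoothed success rate, so an unused skill starts at 0.5.
        return (self.successes + 1) / (self.successes + self.failures + 2)


@dataclass
class SkillLibrary:
    demote_below: float = 0.3  # hypothetical confidence threshold
    min_uses: int = 3          # do not demote on too little evidence
    active: dict = field(default_factory=dict)
    demoted: dict = field(default_factory=dict)

    def add(self, name: str) -> None:
        self.active.setdefault(name, Skill(name))

    def record(self, name: str, success: bool) -> None:
        """Update a skill after use and demote it once it looks brittle."""
        skill = self.active.get(name)
        if skill is None:
            return  # unknown or already demoted
        if success:
            skill.successes += 1
        else:
            skill.failures += 1
        uses = skill.successes + skill.failures
        if uses >= self.min_uses and skill.confidence < self.demote_below:
            self.demoted[name] = self.active.pop(name)


library = SkillLibrary()
library.add("sort_by_price")
library.add("open_cart")
for _ in range(3):
    library.record("sort_by_price", success=False)
    library.record("open_cart", success=True)
```

In this toy run the routine that fails three times in a row leaves the active set, while the reliable one stays. Unlike a library that only ever grows, the active set can shrink, which is the property the review credits to demotion.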

The paper also contributes a useful cost decomposition framework (Eq. 3) that makes explicit the hidden compute currencies of different agent architectures—rollout scaling, pre-evaluation discovery budgets, verifier passes, and per-step specialist stacking. This formalization is valuable because it exposes accounting asymmetries in published results (e.g., WALT's offline discovery cost being excluded from headline numbers).

2. Methodological Rigor

Strengths: The experimental setup is thorough. Evaluation covers all 910 VWA tasks, with a 300-task ablation that cleanly separates success-rate contributions (rules, routines, reflector, distillation) from efficiency contributions (routing, compression, cache-aware prompting). The paper introduces three trajectory-level efficiency metrics (ARR, SOR, Prompt Cache Utilization) that meaningfully complement success rate. Stream-wise analysis (Table 4) convincingly demonstrates the learning-curve effect: later tasks are cheaper and more successful. Robustness checks include scrambled task ordering (57.9% vs 58.3%) and 16-worker parallel runs (58.1%).
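To make the three trajectory-level metrics concrete, here is a rough sketch of how they might be computed. The exact definitions are not given on this page, so the formulas below are assumptions: ARR is taken as the fraction of consecutive step pairs that repeat an action, SOR as steps taken divided by a reference step count, and Prompt Cache Utilization as cached prompt tokens over total prompt tokens. The trajectory and token counts are made up.

```python
def action_repetition_rate(actions):
    """Fraction of consecutive step pairs in which the action is repeated."""
    if len(actions) < 2:
        return 0.0
    repeats = sum(prev == cur for prev, cur in zip(actions, actions[1:]))
    return repeats / (len(actions) - 1)


def step_overhead_ratio(steps_taken, reference_steps):
    """Steps actually taken divided by a reference (e.g. shortest known) count."""
    if reference_steps <= 0:
        raise ValueError("reference_steps must be positive")
    return steps_taken / reference_steps


def prompt_cache_utilization(cached_prompt_tokens, total_prompt_tokens):
    """Share of prompt tokens served from the provider's prompt cache."""
    if total_prompt_tokens == 0:
        return 0.0
    return cached_prompt_tokens / total_prompt_tokens


# Hypothetical trajectory: the agent clicks the same element three times.
trajectory = ["click(3)", "click(3)", "click(3)", "type(5, 'bike')", "stop"]
arr = action_repetition_rate(trajectory)
sor = step_overhead_ratio(len(trajectory), reference_steps=4)
pcu = prompt_cache_utilization(cached_prompt_tokens=600, total_prompt_tokens=1000)
```

Two trajectories with the same terminal verdict can differ widely on all three numbers, which is why the review treats them as a complement to success rate rather than a replacement.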

Weaknesses: The most significant methodological concern is the backbone confound. PANDO uses Claude Opus 4.6 + GPT-5.2, while SGV uses Gemini-2.5-Flash and WALT uses Claude-4-Sonnet. The backbone-controlled experiments in Appendix N are limited in scale (100 or 300 tasks) and do not fully resolve the confound. The paper acknowledges the issue, but the partial swap experiments are not entirely convincing: SGV-on-Opus actually outperforms PANDO in the cold-start window (56.7% vs 50.5%), and the stratified 300-task comparison shows PANDO-on-Gemini (50.3%) underperforming SGV (53.4%). The full-run advantage may therefore partly reflect backbone capability rather than framework design.

The paper also lacks independent re-runs with proper confidence intervals (acknowledged in the checklist). The bootstrap CIs are computed from task-level verdicts within a single run, which captures sampling variance but not run-to-run variance from different library evolution paths.
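The distinction drawn here can be seen in a small sketch of a percentile bootstrap over task-level verdicts. The verdict list is synthetic (531 successes out of 910, chosen only to sit near the reported success rate; it is not the paper's raw data), and `bootstrap_ci` is a hypothetical helper, not the authors' procedure.

```python
import random


def bootstrap_ci(verdicts, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 task verdicts."""
    rng = random.Random(seed)
    n = len(verdicts)
    # Resample tasks with replacement and recompute the success rate each time.
    rates = sorted(
        sum(rng.choices(verdicts, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic single-run verdicts: 1 = task solved, 0 = task failed.
verdicts = [1] * 531 + [0] * 379
point_estimate = sum(verdicts) / len(verdicts)
ci_low, ci_high = bootstrap_ci(verdicts)
```

Every resample reuses verdicts produced by the same library evolution path, so the interval reflects only which tasks happened to be sampled. Capturing run-to-run variance would need verdicts from several independent runs, for example by resampling whole runs or reporting the spread of per-run success rates.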

3. Potential Impact

Practical relevance: The efficiency gains are substantial (58% fewer tokens than SGV and 61% fewer than WALT), with real deployment implications. The per-task cost of $0.085 vs. $0.371 (SGV) or $0.641 (WALT amortized) makes a meaningful difference at scale. The sub-linear cumulative cost curve (Figure 9b) is particularly compelling for production settings.
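As a quick worked check on what the quoted per-task costs imply, the snippet below computes the relative dollar saving and the cost of one pass over all 910 tasks. It uses only the three per-task figures quoted above; the flat per-task extrapolation is a simplifying assumption, since the review describes PANDO's cumulative cost curve as sub-linear.

```python
# Per-task costs in USD, as quoted in the review.
cost_per_task = {"PANDO": 0.085, "SGV": 0.371, "WALT": 0.641}
n_tasks = 910  # full VisualWebArena task set

# Relative dollar saving of PANDO against each baseline.
savings = {
    name: 1 - cost_per_task["PANDO"] / cost
    for name, cost in cost_per_task.items()
    if name != "PANDO"
}

# Flat extrapolation to one full benchmark pass (ignores learning-curve effects).
full_run_cost = {name: cost * n_tasks for name, cost in cost_per_task.items()}

for name, saving in savings.items():
    print(f"PANDO vs {name}: {saving:.1%} cheaper per task")
for name, total in full_run_cost.items():
    print(f"{name}: ${total:.2f} for {n_tasks} tasks")
```

Under these figures the dollar saving is about 77% against SGV and about 87% against WALT, which is larger than the 58% and 61% token reductions.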

Conceptual contribution: The token-economics framing and cost decomposition could become a standard analytical tool. Making pre-evaluation discovery costs explicit and proposing trajectory-level efficiency metrics beyond terminal success addresses a genuine gap in how the community evaluates agents.

Limitations on generalizability: All results are VWA-only. The paper honestly notes that OSWorld-style desktop tasks would require substantially different rules and grounding. The skill library's reliance on literal keyword matching (rather than embedding retrieval) is deliberately simple and auditable but may not scale to more diverse task distributions. The polarity-pair merging is syntactic—broader program equivalence remains future work.
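The trade-off behind literal keyword matching can be shown with a toy retriever. This is a sketch under assumptions: the `retrieve_skills` function, the skill names, and the keyword lists are hypothetical and do not reproduce the paper's library.

```python
def retrieve_skills(task: str, library: dict[str, list[str]]) -> list[str]:
    """Return skills with at least one keyword appearing literally in the task."""
    text = task.lower()
    return [
        name
        for name, keywords in library.items()
        if any(keyword.lower() in text for keyword in keywords)
    ]


# Hypothetical library: skill name -> trigger keywords.
skill_keywords = {
    "sort_listings_by_price": ["sort", "price"],
    "post_comment": ["comment", "reply"],
}

literal_hit = retrieve_skills("Sort the listings by price, low to high", skill_keywords)
paraphrase_miss = retrieve_skills("Show me the cheapest bike first", skill_keywords)
```

The first query retrieves the sorting routine because it shares surface words with the keyword list. The second asks for the same thing in different words and retrieves nothing. Each match is easy to audit, but paraphrases outside the keyword list are missed, which is the scaling concern raised above.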

4. Timeliness & Relevance

This paper is highly timely. Inference cost is becoming a first-order concern as AI agents move toward deployment. The observation that frontier agent systems are on a "spend more tokens" trajectory (Agent S → S2 → S3, from 20.6% to 72.6% at 10× compute) frames a real sustainability problem. The paper arrives as the community is beginning to recognize that SR alone is an insufficient evaluation metric for practical agent systems.

5. Strengths & Limitations

Key strengths:

Clean ablation separating competence gains from efficiency gains

Transparent cost accounting that exposes hidden compute in competing methods

Novel efficiency metrics (ARR, SOR, Cache Utilization) that make trajectory-level waste visible

The learning dynamics are well-characterized (Figure 2b, Table 4)

Comprehensive appendices with full reproducibility details

The demotion mechanism addresses a real weakness of prior skill-library approaches (Voyager's monotone growth)

Notable weaknesses:

Backbone heterogeneity makes the SR comparison unreliable; the strongest claim should be about the *token-efficiency* improvement rather than absolute SR

Single-benchmark evaluation limits generalizability claims

The skill retrieval mechanism (literal keyword containment) is brittle—it works for VWA's structured web tasks but would likely fail on more diverse or linguistically varied task descriptions

The 12-routine seed set introduces human engineering that may be benchmark-specific

No comparison against RL-based efficiency methods or distillation-based approaches

Additional observations: The paper is well-written with clear narrative structure. The Pando metaphor, while charming, takes up space that could have been used for additional analysis. The cost decomposition (Proposition 1) is presented with unnecessary formalism for what is essentially an accounting identity. The related work section is comprehensive but could be more concise.

The paper's most durable contribution may be the efficiency metrics and cost decomposition framework rather than the specific PANDO system, which is tightly coupled to VWA's task structure and current API pricing.

Rating: 6.8/10

Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 7.5

Generated May 26, 2026

Comparison History (17)

vs. Verifiable Benchmarking of Long-Horizon Spatial Biology

gemini-3.1 · 5/28/2026

Paper 1 presents a novel benchmark for long-horizon scientific reasoning in spatial biology, directly advancing the application of AI in scientific discovery (AI4Science). By bridging complex multi-omics data with AI agents, it has profound potential to accelerate real-world biological research. While Paper 2 offers valuable efficiency improvements for multimodal web agents, Paper 1's direct contribution to enabling verifiable, automated scientific conclusions addresses a more transformative and cross-disciplinary scientific challenge.

vs. Beyond Inference-Only Deployment: Comparing Weight-Based Consolidation Against Cascading Compaction

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact: it targets a timely, widely relevant bottleneck (agent efficiency/cost) with a scalable online distillation framework and clear systems contributions (skill library, routing, compression, cache-aware prompting). It is evaluated on a large benchmark (910 tasks) with strong baselines, ablations, and new efficiency metrics, increasing rigor and adoption potential across multimodal agents and production deployments. Paper 1 is interesting for personalization via weight consolidation but is based on a small n=10 conversation set and a narrower application scope.

vs. A Sober Look at Agentic Misalignment in Automated Workflows

gpt-5.2 · 5/26/2026

Paper 2 has higher potential impact due to broader relevance and timeliness: agentic misalignment in automated workflows is a cross-cutting issue for multi-agent LLM systems in many domains. It offers a formal Bayesian framing, identifies a general failure mode (proxy utilities/posterior collapse), and proposes a transferable alignment paradigm (Agentic Evidence Attribution) with multiple instantiations, suggesting methodological depth and extensibility. Paper 1 is strong and practical for efficiency in web agents, but its contributions are more benchmark/engineering-specific and narrower in scope compared to alignment theory and workflow reliability.

vs. Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis

gemini-3.1 · 5/26/2026

Paper 2 addresses a critical bottleneck in medical AI—verifiability and interpretability—by integrating LLMs with neuro-symbolic fuzzy logic. While Paper 1 offers strong algorithmic improvements for AI agents, Paper 2's focus on high-stakes clinical decision-making gives it a higher potential for transformative real-world impact and cross-disciplinary scientific significance in both computer science and healthcare.

vs. ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning

gpt-5.2 · 5/26/2026

Paper 2 (PANDO) likely has higher impact due to broader applicability and timeliness: efficient multimodal web agents are a central frontier, and online skill distillation that reduces inference cost while improving success directly addresses real-world deployment constraints. It contributes a system-level framework plus new efficiency metrics, enabling reuse across agent research and evaluation. Paper 1 is novel and rigorous but more specialized to Tree-of-Thoughts KV-cache management, impacting a narrower slice of inference workloads. PANDO’s improvements on a standard benchmark with substantial token reductions suggest wider cross-field and practical influence.

vs. DocOS: Towards Proactive Document-Guided Actions in GUI Agents

claude-opus-4.6 · 5/26/2026

PANDO demonstrates stronger scientific impact through concrete, quantifiable improvements: 58.3% success rate on VisualWebArena (beating prior SOTA), 58-61% token reduction, and introduces novel efficiency metrics. It addresses a practical and timely problem (computational cost of AI agents) with a comprehensive framework validated through rigorous ablations on 910 tasks. Paper 2 (DocOS) introduces an interesting paradigm (document-guided agents) and benchmark, but primarily reveals limitations ('dual bottlenecks') without solving them, making it more diagnostic than solution-oriented. PANDO's combination of methodological innovation, strong empirical results, and practical efficiency gains gives it broader and more immediate impact.

vs. The Impact of AI Usage and Informativeness on Skill Development in Logical Reasoning

gemini-3.1 · 5/26/2026

Paper 1 addresses a critical bottleneck in AI agent deployment—computational inefficiency—by introducing a novel skill-distillation framework and new evaluation metrics. Its quantifiable improvements in success rates and token reduction offer immediate, highly impactful applications in AI engineering. While Paper 2 explores an important societal question regarding human-AI interaction, Paper 1's methodological rigor and direct technical innovations give it a higher potential for driving immediate progress in the fast-moving field of multimodal AI systems.

vs. TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

gpt-5.2 · 5/26/2026

Paper 2 has higher likely impact because it proposes a broadly applicable, efficiency-focused agent framework (online skill distillation with measurable token savings) on a widely used multimodal benchmark, addressing a timely bottleneck: rising inference cost. Its components and new efficiency metrics can transfer across web agents and other interactive settings, enabling real-world deployment improvements. Paper 1 is valuable infrastructure (a scalable, authentic terminal benchmark) but is primarily evaluative; its impact depends on adoption and may be narrower than a method that directly improves agent efficiency and performance.

vs. FLOATBench: A Dataset and Benchmark for Floating Offshore Wind Turbine Tower Fatigue

claude-opus-4.6 · 5/26/2026

PANDO addresses the broadly impactful problem of efficient multimodal AI agents, a topic at the intersection of large language models, web automation, and inference efficiency—areas of intense current interest across AI research. It introduces a novel online skill-distillation framework with strong empirical results (higher success rate with significantly fewer tokens) and new efficiency metrics applicable across the field. FLOATBench, while valuable for the offshore wind engineering community, targets a narrower domain-specific niche (FOWT fatigue surrogate benchmarking) with a smaller potential audience and cross-disciplinary reach.

vs. Towards end-to-end LLM-based censoring-aware survival analysis

gpt-5.2 · 5/26/2026

Paper 2 has higher likely impact due to a more novel, general efficiency-centric framework for multimodal web agents (online skill distillation with a skill library plus routing/compression/cache-aware prompting) and clearly demonstrated gains in both success and token cost on a widely used benchmark. Its methods and proposed efficiency metrics can transfer across many agent settings, making the breadth of impact larger and the work timely given the field’s focus on inference-time compute. Paper 1 is valuable clinically, but shows modest improvements and is narrower in scope.

vs. Market Regime Council for Dynamic Credit Assignment in Multi-Agent LLM Decision Systems

claude-opus-4.6 · 5/26/2026

PANDO addresses a fundamental and broadly relevant problem in AI agents—improving efficiency as experience accumulates rather than scaling inference cost. It introduces a principled framework (online skill distillation) with novel efficiency metrics applicable across the web agent community, demonstrates strong empirical results on a well-established benchmark (VisualWebArena), and tackles the critical issue of computational cost in LLM-based agents. Paper 2, while technically sound, addresses a narrower domain (crypto portfolio management) with a relatively small agent ensemble (N=3), limiting its broader scientific impact. PANDO's contributions to agent efficiency and its new evaluation metrics have wider applicability.

vs. Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a fundamental challenge in test-time scaling for LLM reasoning—a highly active research area—with a principled algorithmic contribution (stochastic backtracking with persistent pools) grounded in SMC theory. Its methods are broadly applicable across reasoning tasks and model scales, offering theoretical depth and practical efficiency gains. Paper 2 makes solid engineering contributions to multimodal web agents with useful efficiency metrics, but is more narrowly scoped to a specific benchmark (VisualWebArena) and relies more on system-level integration than fundamental algorithmic innovation, limiting its broader impact.

vs. Towards Direct Evaluation of Harness Optimizers via Priority Ranking

claude-opus-4.6 · 5/26/2026

PANDO addresses a practical and timely problem—making multimodal web agents more efficient rather than more expensive—with concrete, reproducible results (58.3% success rate with 58-61% fewer tokens on VisualWebArena). It introduces a complete framework with actionable techniques and novel efficiency metrics that the community can adopt broadly. Paper 1 proposes a useful evaluation methodology for harness optimizers, but its scope is narrower, serving primarily as a diagnostic benchmark rather than enabling new capabilities. Paper 2's combination of strong empirical results, practical efficiency gains, and new evaluation metrics gives it broader impact potential.

vs. Summoning the Oracle to Slay It: Mitigating Look-Ahead Bias in Financial Backtesting with Large Language Models

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a fundamental and previously underexplored methodological problem—parametric look-ahead bias in LLM-based financial backtesting—that affects the validity of a growing body of research. It introduces a novel, well-defined concept and a principled inference-time solution (FinCAD) that doesn't require retraining. This has broad implications for financial AI research integrity. Paper 2, while practically useful, is more incremental—combining known efficiency techniques (compression, caching, skill libraries) for web agents. Paper 1's contribution is more foundational and likely to reshape evaluation practices across financial NLP.

vs. Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

claude-opus-4.6 · 5/26/2026

Co-ReAct introduces a novel rubric-guided action-selection framework with a theoretically grounded training objective (list-wise Spearman rank-correlation reward via GRPO), demonstrating broad applicability across multiple model scales and both open/closed-source models. Its contribution—using rubrics as step-level inference-time guidance rather than just evaluation signals—represents a more fundamental conceptual advance in agentic reasoning. PANDO, while practically valuable for efficiency gains on web tasks, is more narrowly scoped to multimodal web agents and focuses on engineering optimizations. Co-ReAct's modular rubric generator as a drop-in component gives it wider potential adoption across diverse agent architectures.

vs. Safety-Oriented Routing Analysis of Mixtral MoE Under Benign and Harmful Prompts

gemini-3.1 · 5/26/2026

Paper 2 addresses a critical bottleneck in the rapidly growing field of multimodal AI agents: efficiency and inference cost. By introducing an online skill distillation framework that achieves state-of-the-art results on a major benchmark while drastically reducing token usage, it offers high practical utility and broad applicability. Paper 1 provides valuable insights into MoE safety routing, but its findings are highly specific to one architecture and emphasize that routing is diffuse, which may limit direct, broad downstream applications compared to Paper 2's efficiency framework.

vs. EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a fundamental problem in LLM post-training—how to transfer privileged context without side effects—introducing novel concepts (evidence masking, guided rollouts) with rigorous ablations revealing where knowledge transfer signals are localized. This has broad implications for knowledge distillation, persona learning, and privacy-preserving training. Paper 2 is a strong engineering contribution for web agents with practical efficiency gains, but is more narrowly scoped to a specific benchmark (VisualWebArena) and relies more on combining existing techniques. Paper 1's methodological insights are more generalizable across LLM training paradigms.