ASH: Agents that Self-Hone via Embodied Learning

Benjamin Schneider, Xavier Schneider, Victor Zhong, Sun Sun

May 14, 2026

arXiv:2605.14211v1

cs.AI (primary), cs.LG

#95 of 2292 · Artificial Intelligence

Tournament Score: 1544 ± 46 (axis range 1050 to 1800)

Win Rate: 89%

Rating: 7.2 / 10

Significance 7.5 · Rigor 6.8 · Novelty 7.5 · Clarity 8

Abstract

Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demonstrations, neither of which scales. We introduce ASH, an agentic system that learns an embodied policy from unlabeled, noisy internet video, without reward shaping or expert annotation. ASH follows a self-improvement loop; when it gets stuck, ASH learns an Inverse Dynamics Model (IDM) from its own trajectories, and uses its IDM to extract supervision from relevant internet video. ASH uses unsupervised learning to identify key moments from large-scale internet video and retains them as long-term memory -- allowing it to tackle long-horizon problems. We evaluate ASH on two complementary environments demanding multi-hour planning: Pokemon Emerald, a turn-based RPG, and The Legend of Zelda: The Minish Cap, a real-time action-adventure game. In both games, behavioral cloning, retrieval-augmented and zero-shot foundation-model baselines plateau, while ASH sustains progression across our 8-hour evaluation. ASH reaches an average of $11.2 / 12$ milestones in Pokemon Emerald and $9.9 / 12$ in Legend of Zelda, while the strongest baseline gets stuck in both environments at an average of $6.5 / 12$ and $6.0 / 12$ milestones, respectively. We demonstrate that self-improving agents are a scalable recipe for long-horizon embodied learning.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: ASH: Agents that Self-Hone via Embodied Learning

1. Core Contribution

ASH introduces a dynamic bootstrapping loop for long-horizon embodied learning from unlabeled internet video. The key insight is that an agent can overcome plateaus by (1) detecting when it's stuck, (2) retrieving visually similar internet demonstrations, (3) labeling them with an Inverse Dynamics Model (IDM) trained on its own trajectories, and (4) updating its policy with this newly constructed supervision. The system additionally uses HDBSCAN clustering to discover "key moments" from internet video, which serve as long-term memory enabling decisions that depend on much earlier context.

The contribution is genuinely novel in its synthesis: while IDMs (VPT), retrieval-augmented agents, and self-improvement loops each exist independently, ASH combines them into a coherent online learning system where the IDM, retrieval corpus selection, and policy co-evolve. The agent requires no reward engineering, no action-labeled demonstrations, and no privileged game state access.
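The four steps described above (detect a plateau, retrieve video, pseudo-label it with an IDM, update the policy) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: observations are grid positions rather than game frames, the IDM is a lookup table over position deltas (`ToyIDM`), the policy is a dictionary, and `is_stuck` uses a hypothetical `patience` window over a count of key moments reached. None of these names come from the paper, and the actual system uses learned models.

```python
from typing import Dict, List, Optional, Tuple

Frame = Tuple[int, int]                 # toy observation: an (x, y) grid position
Transition = Tuple[Frame, str, Frame]   # (frame, action, next_frame)


def is_stuck(key_moments_reached: List[int], patience: int) -> bool:
    """Stuck = the key-moment count has not grown over the last `patience` steps."""
    if len(key_moments_reached) <= patience:
        return False
    return key_moments_reached[-1] == key_moments_reached[-1 - patience]


class ToyIDM:
    """Lookup-table stand-in for an inverse dynamics model: frame delta -> action."""

    def __init__(self) -> None:
        self.table: Dict[Tuple[int, int], str] = {}

    def fit(self, transitions: List[Transition]) -> None:
        # Learn from the agent's own action-labeled experience.
        for (x0, y0), action, (x1, y1) in transitions:
            self.table[(x1 - x0, y1 - y0)] = action

    def predict(self, frame: Frame, next_frame: Frame) -> Optional[str]:
        delta = (next_frame[0] - frame[0], next_frame[1] - frame[1])
        return self.table.get(delta)  # None when the IDM cannot explain the change


def pseudo_label(idm: ToyIDM, video: List[Frame]) -> List[Tuple[Frame, str]]:
    """Label consecutive frame pairs of unlabeled video; skip unexplained pairs."""
    labeled = []
    for frame, next_frame in zip(video, video[1:]):
        action = idm.predict(frame, next_frame)
        if action is not None:
            labeled.append((frame, action))
    return labeled


def bootstrap(policy: Dict[Frame, str],
              own_transitions: List[Transition],
              retrieved_video: List[Frame]) -> Dict[Frame, str]:
    """One bootstrap: fit the IDM on own data, label video, clone into the policy."""
    idm = ToyIDM()
    idm.fit(own_transitions)
    updated = dict(policy)
    updated.update(pseudo_label(idm, retrieved_video))
    return updated


# Hypothetical data: two self-collected transitions and one retrieved clip.
own = [((0, 0), "right", (1, 0)), ((1, 0), "up", (1, 1))]
video = [(5, 5), (6, 5), (6, 6), (9, 9)]   # the last jump has no known action
progress = [1, 2, 2, 2, 2]                 # key moments reached per step

policy: Dict[Frame, str] = {}
if is_stuck(progress, patience=3):
    policy = bootstrap(policy, own, video)
```

In this toy setting the final frame pair is dropped because the lookup-table IDM has never seen that displacement, which mirrors in miniature why IDM quality matters for the supervision extracted from video.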

2. Methodological Rigor

Strengths: The experimental design is thoughtful. Two complementary environments (turn-based RPG vs. real-time action-adventure) test different capabilities. The paper includes component ablations (Figure 4 left), IDM accuracy tracking across bootstraps (Figure 4 right), retrieval quality evaluation with human annotators (Table 1), and a catastrophic forgetting analysis (Figure 5). The comparison includes five methods spanning BC, retrieval-augmented, and foundation model approaches.

Concerns:

The milestone evaluation, while intuitive, is coarse-grained. With only 12 milestones, distinguishing between methods relies on relatively few discrete checkpoints. The success rates at later milestones are based on 12 trajectories (3 runs × 4 agents), which limits statistical power.

The compute parity claim (288 GPU hours per method) is reasonable, but the methods differ fundamentally in how they use compute. ASH's online bootstrapping means its compute is adaptively allocated, making direct comparison imperfect.

The initialization from pre-trained checkpoints (3 and 9 GPU hours for IDM and π) is somewhat underemphasized. The paper states random initialization is "inefficient" but does not quantify how much this initialization contributes to the final results.

The VPT baseline uses "the strongest IDM checkpoint produced by ASH," which means the baseline benefits from ASH's innovations. This is generous to the baseline but creates an odd dependency.

The cluster quality analysis (Appendix B) reveals only 48% of clusters correspond to genuine key moments, suggesting significant noise in the long-term memory mechanism. Yet performance gains are still substantial, indicating robustness.
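To make the statistical-power concern about 12 trajectories per milestone concrete, the sketch below computes a Wilson score interval for a binomial success rate. The count of 6 successes out of 12 is a hypothetical input chosen for illustration, not a figure from the paper, and the Wilson interval is one common choice of interval rather than the paper's own analysis.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Hypothetical example: 6 of 12 trajectories reach a given milestone.
low, high = wilson_interval(6, 12)
print(f"95% interval: [{low:.2f}, {high:.2f}]")
```

For this hypothetical input the interval runs from roughly 25% to 75%, so an observed 50% success rate at a late milestone would be compatible with a wide range of true rates.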

3. Potential Impact

Embodied AI: ASH demonstrates a viable path for learning embodied policies without reward engineering or action-labeled data—two of the most significant bottlenecks in the field. If the approach generalizes beyond video games, it could substantially reduce the human effort required for training embodied agents.

Scalability argument: The use of internet video as a supervision source is compelling for scalability. The 22,000-video Pokémon corpus and 17,000-video Zelda corpus demonstrate that sufficient data exists for many domains on YouTube. The retrieval mechanism (94% relevance) shows this data can be efficiently filtered.

Limitations on generalization: The two test environments, while complementary, are both GBA games with discrete action spaces and relatively low visual complexity. The jump to continuous-action robotics or photorealistic 3D environments is substantial. The IDM's reliance on pixel-level action inference will struggle in domains where actions have subtle or delayed visual effects.
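The retrieval step mentioned under the scalability argument can be pictured as a nearest-neighbor lookup over clip embeddings. The sketch below is an assumption-laden illustration: it supposes each clip has already been mapped to a fixed-length vector and ranks clips by cosine similarity. The embedding model, the similarity measure, and the names `cosine` and `top_k` are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          corpus: Dict[str, Sequence[float]],
          k: int) -> List[Tuple[str, float]]:
    """Return the k clip ids most similar to the query, best first."""
    scored = [(cosine(query, emb), clip_id) for clip_id, emb in corpus.items()]
    scored.sort(reverse=True)
    return [(clip_id, score) for score, clip_id in scored[:k]]


# Hypothetical 2-D embeddings for three clips and the agent's current frame.
corpus = {"clip_a": [1.0, 0.0], "clip_b": [0.0, 1.0], "clip_c": [1.0, 1.0]}
current_frame = [1.0, 0.1]
matches = top_k(current_frame, corpus, k=2)
```

A brute-force scan like this is only meant to show the idea; a corpus with tens of thousands of videos would more plausibly rely on an approximate nearest-neighbor index.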

4. Timeliness & Relevance

This work addresses a very current need. The field is grappling with how to move beyond short-horizon tasks and how to leverage internet-scale data without expensive annotation. The paper positions itself well against the backdrop of VPT, foundation model agents (Claude Plays Pokémon), and RL approaches (Pokémon Red via PPO). The critique of reward engineering fragility (citing Pleines et al.) is well-supported. The emergence of multimodal foundation models makes the zero-shot baseline (Qwen3.5) particularly relevant, and ASH's advantage over it demonstrates that learned embodied policies still outperform prompted models for fine-grained control.

5. Strengths & Limitations

Key Strengths:

The self-improvement loop is elegant and well-motivated. The "stuck detection" via key moment discovery is a natural and unsupervised criterion.

The dual memory architecture (short-term for reactive control, long-term for strategic decisions) is well-designed and the ablation confirms its contribution (+2.5 milestones).

The offline replay experiment (Section 4.5) elegantly demonstrates both skill retention and accumulated expertise, addressing catastrophic forgetting concerns.

The retrieval evaluation with human annotators provides strong evidence for the quality of the video selection mechanism.

The qualitative examples (Appendix A) are persuasive—showing ASH learning to battle and use long-term memory for navigation decisions.

Notable Limitations:

Domain specificity: Both environments are retro games with discrete actions. The paper acknowledges but doesn't address the gap to continuous control.

Internet video dependency: Environments without abundant demonstration video cannot benefit from this approach.

The 48% key moment cluster quality suggests room for improvement in the unsupervised discovery mechanism.

There is no comparison to RL baselines (e.g., PPO with shaped rewards as in Pleines et al.), which would contextualize ASH's performance against methods that do use rewards.

The 8-hour evaluation window, while impressive, covers only the early portion of games that span 20+ hours. It remains unclear whether the self-improvement loop would continue to sustain progress beyond that window.

Overall Assessment

ASH presents a compelling and well-executed system for self-improving embodied agents. The combination of dynamic bootstrapping, IDM-based pseudo-labeling, and unsupervised key moment discovery is novel and the results are strong. The work's primary limitation is the narrow evaluation domain (two GBA games), which tempers claims about general scalability. Nevertheless, the paper makes a meaningful contribution to long-horizon embodied learning by demonstrating that self-improvement from unlabeled internet video is viable and outperforms static training paradigms.

Rating: 7.2 / 10

Significance 7.5 · Rigor 6.8 · Novelty 7.5 · Clarity 8

Generated May 15, 2026

Comparison History (19)

vs. How Far Are We From True Auto-Research?

gpt-5.2 · 5/20/2026

Paper 1 is more likely to have higher impact: it proposes a novel self-improvement loop for long-horizon embodied learning using unlabeled internet video, demonstrating substantial performance gains over strong baselines on demanding multi-hour tasks. This advances core capability (scalable supervision and long-horizon planning) with clear applications to robotics/agents and broad relevance across RL, imitation learning, and foundation-model agents. Paper 2 is timely and valuable as an evaluation/audit study, but it is primarily diagnostic; it introduces a benchmark/scaffold and highlights failure modes rather than delivering a new capability breakthrough, limiting downstream transformative impact.

vs. Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

claude-opus-4.6 · 5/16/2026

ASH introduces a novel self-improvement paradigm for embodied agents that learns from unlabeled internet video without reward shaping or expert demonstrations—addressing a fundamental scalability bottleneck in AI. The approach combines inverse dynamics models, unsupervised key-moment identification, and a self-honing loop, demonstrating strong results on complex long-horizon tasks (Pokemon Emerald, Zelda). This has broad implications for robotics, game AI, and autonomous agents. Paper 2, while clever in its sample-efficient alignment approach, addresses a narrower problem (LLM safety) with incremental improvements over existing methods.

vs. Belief Memory: Agent Memory Under Partial Observability

claude-opus-4.6 · 5/16/2026

ASH presents a more transformative contribution by demonstrating self-improving embodied agents that learn from unlabeled internet video without rewards or expert annotations—addressing a fundamental scalability bottleneck in embodied AI. The approach combines inverse dynamics models, unsupervised key-moment identification, and self-improvement loops in a novel way, with dramatic empirical gains on challenging long-horizon tasks. While BeliefMem's probabilistic memory is a solid incremental contribution to LLM agent memory, ASH's paradigm of scalable self-improvement from internet video has broader implications across robotics, game AI, and autonomous systems.

vs. The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?

gpt-5.2 · 5/16/2026

Paper 2 likely has higher impact: it proposes a scalable, practical recipe for long-horizon embodied learning using unlabeled internet video and a self-improvement loop (IDM-derived supervision + memory), demonstrated on challenging multi-hour planning benchmarks with large gains over strong baselines. This directly targets a central bottleneck in agent research, with clear downstream applications in robotics and general autonomy, and is timely given interest in self-improving agents and web-scale learning. Paper 1 is conceptually novel and methodologically interesting, but its immediate real-world applicability and breadth of impact are less certain.

vs. A collaborative agent with two lightweight synergistic models for autonomous crystal materials research

claude-opus-4.6 · 5/16/2026

ASH introduces a novel self-improvement paradigm for embodied AI that learns from unlabeled internet video without reward shaping or expert demonstrations—addressing a fundamental scalability bottleneck. Its contributions (inverse dynamics models from self-play, unsupervised key-moment detection, long-horizon planning across diverse game environments) represent a broadly applicable methodological advance. While MatBrain offers practical value in materials science with its efficient dual-model architecture, ASH's framework has broader cross-domain impact potential, tackling a core AI challenge (long-horizon embodied learning from unstructured data) with a more generalizable and innovative approach.

vs. EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales

claude-opus-4.6 · 5/16/2026

ASH addresses a fundamental challenge in embodied AI—learning long-horizon policies from unlabeled internet video without reward shaping or expert demonstrations. Its self-improvement loop combining inverse dynamics models with internet video supervision is highly novel and demonstrates impressive results on complex, multi-hour tasks. This work opens a scalable pathway for embodied agents that could broadly impact robotics and autonomous systems. While EVOCHAMBER presents interesting multi-agent evolution ideas with strong benchmark results, it operates within a more incremental framework of test-time adaptation for LLM-based agents, with narrower long-term implications.

vs. RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

gemini-3.1 · 5/16/2026

Paper 1 addresses a critical, high-stakes real-world problem (ICU decision support) by introducing a highly novel hindsight-annotated benchmark. By moving beyond mimicking potentially flawed historical actions, it exposes critical safety failures in current LLMs. Its direct implications for patient safety and medical AI rigor give it profound near-term and interdisciplinary impact, whereas Paper 2, while methodologically impressive for embodied AI, is currently limited to video game environments.

vs. Holistic Evaluation and Failure Diagnosis of AI Agents

gpt-5.2 · 5/16/2026

Paper 2 is more novel and likely higher impact: it proposes a scalable self-improvement paradigm for long-horizon embodied learning using unlabeled internet video, avoiding rewards and expert demonstrations—an approach with broad implications for robotics, game-playing, and general agent learning. Its results on multi-hour tasks indicate a meaningful capability jump and a potentially general recipe. Paper 1 is valuable and methodologically strong for agent evaluation/diagnosis, but it is primarily an evaluation framework with narrower downstream impact compared to a training/learning approach that could unlock new capabilities.

vs. CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents

gpt-5.2 · 5/16/2026

Paper 2 (CoCoDA) has higher estimated impact due to a more general, reusable contribution: a code-native compositional DAG that addresses a core scaling bottleneck in tool-augmented agents (library growth under fixed context). It proposes principled retrieval/training mechanisms plus theoretical guarantees (cost reduction, monotone co-evolution, well-formedness) and shows broad benchmark gains, suggesting methodological rigor and cross-domain applicability (reasoning, code, data analysis). Paper 1 is novel and impressive for long-horizon game agents, but is narrower in domain and may be harder to generalize beyond specific embodied/video settings.

vs. OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

gpt-5.2 · 5/16/2026

Paper 1 (ASH) is likely higher impact due to a more novel learning paradigm: self-improving embodied agents that extract supervision from unlabeled internet video via inverse dynamics and long-term memory, addressing a core bottleneck in long-horizon RL/embodied AI without rewards or expert demos. Its implications could generalize beyond games to robotics and other sequential decision-making. Paper 2 (OpenMobile) is highly timely and valuable for openness and reproducibility, but is primarily a data/synthesis framework with more incremental methodological innovation, and its impact may be narrower to mobile UI agents.

vs. Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue

claude-opus-4.6 · 5/15/2026

ASH addresses a fundamental challenge in embodied AI—learning long-horizon policies from unlabeled internet video without reward shaping or expert demonstrations. Its self-improvement loop combining inverse dynamics models, unsupervised key-moment detection, and internet video supervision represents a novel and scalable paradigm. The strong empirical results across two complex game environments demonstrate significant advancement over baselines. Paper 2 addresses an important but narrower problem (multi-agent moderation via intent detection) with a more incremental contribution. ASH's broader implications for autonomous learning, scalability, and embodied AI give it substantially higher potential impact across multiple research areas.

vs. Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems

gemini-3.1 · 5/15/2026

Paper 2 tackles a fundamental and historically difficult challenge in AI—long-horizon embodied learning without hand-engineered rewards or labeled demonstrations. By proposing a self-honing agent that learns from noisy internet videos, it provides a highly scalable paradigm with broad implications for robotics, reinforcement learning, and AGI. While Paper 1 offers strong applied value for enterprise systems, Paper 2's methodological innovation in unsupervised, self-supervised embodied learning represents a deeper foundational shift with wider theoretical and cross-disciplinary impact.

vs. Learning Who Disagrees: Demographic Importance Weighting for Modeling Annotator Distributions with DiADEM

gpt-5.2 · 5/15/2026

Paper 1 likely has higher scientific impact due to its novel self-improvement loop that leverages unlabeled internet video to generate supervision for long-horizon embodied control, addressing a core bottleneck in scalable RL/agent learning. The demonstrated multi-hour planning gains over strong baselines in two complex game environments suggest broad applicability to robotics and generalist agents, with timely relevance to autonomous agents and self-supervised learning. Paper 2 is methodologically solid and important for fairness/perspectivist NLP, but its scope is narrower and impacts fewer adjacent fields than a scalable recipe for long-horizon embodied agents.

vs. $π$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

gpt-5.2 · 5/15/2026

Paper 1 likely has higher impact due to greater methodological novelty (self-improvement loop using IDM to mine supervision from unlabeled internet video plus long-term memory for key moments) and a stronger algorithmic contribution that could generalize to many embodied/interactive domains beyond the showcased games. It targets a central bottleneck in long-horizon RL/embodied learning—scalable supervision without rewards or expert demos—and demonstrates substantial gains over multiple baselines on multi-hour tasks. Paper 2 is timely and useful, but primarily a benchmark/dataset contribution with narrower novelty and impact concentrated in evaluation.

vs. FSFM: A Biologically-Inspired Framework for Selective Forgetting of Agent Memory

gemini-3.1 · 5/15/2026

Paper 2 tackles a fundamental bottleneck in embodied AI: learning long-horizon tasks from unannotated, noisy internet videos without reward shaping. Its self-improving loop and unsupervised extraction of key moments represent a highly scalable, novel approach with massive implications for robotics and general AI. While Paper 1 introduces an interesting biologically-inspired memory management framework, Paper 2's methodology addresses a more universally challenging problem in AI scaling and demonstrates significant, sustained improvements in notoriously difficult environments.

vs. GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion

gpt-5.2 · 5/15/2026

Paper 2 (ASH) has higher likely scientific impact due to a more broadly transformative agenda: scalable, self-improving embodied agents learning from unlabeled internet video without rewards or expert demos. This targets a central, timely bottleneck (long-horizon autonomy) with clear downstream applications in robotics and general agentic systems, and demonstrates strong results on demanding multi-hour tasks. Paper 1 advances KGC via structured quantization for LLM alignment—novel and useful, but narrower in scope and application. ASH’s framework is more cross-domain and paradigm-shifting.

vs. End-to-End Evaluation and Governance of an EHR-Embedded AI Agent for Clinicians

claude-opus-4.6 · 5/15/2026

ASH introduces a fundamentally novel paradigm for embodied AI: self-improving agents that learn from unlabeled internet video without reward shaping or expert demonstrations. This addresses a core scalability bottleneck in AI and demonstrates strong results on long-horizon tasks. Paper 2, while practically valuable for clinical AI governance, is more incremental—presenting an evaluation/monitoring framework for a specific deployed system rather than a new scientific method. ASH's contributions to self-supervised learning, inverse dynamics models, and long-horizon planning have broader impact potential across robotics, game AI, and embodied intelligence research.

vs. TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding

gemini-3.1 · 5/15/2026

Paper 1 tackles a fundamental and widely researched bottleneck in general AI: long-horizon embodied learning without hand-engineered rewards or labeled demonstrations. Its novel self-improvement loop using unlabeled internet video has broad implications for autonomous agents and robotics. While Paper 2 introduces a valuable benchmark for urban mobility, Paper 1's methodological breakthrough in agentic self-improvement represents a more significant leap with broader applicability across the rapidly growing field of embodied AI.

vs. Quantifying Cross-Query Contradictions in Multi-Query LLM Reasoning

claude-opus-4.6 · 5/15/2026

ASH addresses a fundamental challenge in embodied AI—learning long-horizon policies from unlabeled internet video without reward shaping or expert demonstrations. This represents a significant paradigm shift toward scalable, self-improving agents. The approach combines inverse dynamics models, unsupervised key-moment detection, and self-improvement loops in a novel way, with impressive empirical results across two complex game environments. Its breadth of impact spans robotics, reinforcement learning, and foundation models. Paper 1 addresses an important but narrower problem (LLM cross-query consistency) with more incremental contributions to the reasoning/evaluation literature.