ASH: Agents that Self-Hone via Embodied Learning
Benjamin Schneider, Xavier Schneider, Victor Zhong, Sun Sun
Abstract
Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demonstrations, neither of which scales. We introduce ASH, an agentic system that learns an embodied policy from unlabeled, noisy internet video, without reward shaping or expert annotation. ASH follows a self-improvement loop: when it gets stuck, ASH learns an Inverse Dynamics Model (IDM) from its own trajectories and uses that IDM to extract supervision from relevant internet video. ASH uses unsupervised learning to identify key moments from large-scale internet video and retains them as long-term memory, allowing it to tackle long-horizon problems. We evaluate ASH on two complementary environments demanding multi-hour planning: Pokemon Emerald, a turn-based RPG, and The Legend of Zelda: The Minish Cap, a real-time action-adventure game. In both games, behavioral cloning, retrieval-augmented, and zero-shot foundation-model baselines plateau, while ASH sustains progression across our 8-hour evaluation. ASH reaches a higher average number of milestones in both Pokemon Emerald and The Legend of Zelda than the strongest baseline, which gets stuck in both environments. We demonstrate that self-improving agents are a scalable recipe for long-horizon embodied learning.
AI Impact Assessments
Scientific Impact Assessment: ASH: Agents that Self-Hone via Embodied Learning
1. Core Contribution
ASH introduces a dynamic bootstrapping loop for long-horizon embodied learning from unlabeled internet video. The key insight is that an agent can overcome plateaus by (1) detecting when it's stuck, (2) retrieving visually similar internet demonstrations, (3) labeling them with an Inverse Dynamics Model (IDM) trained on its own trajectories, and (4) updating its policy with this newly constructed supervision. The system additionally uses HDBSCAN clustering to discover "key moments" from internet video, which serve as long-term memory enabling decisions that depend on much earlier context.
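The four steps above can be sketched roughly as follows. This is a toy illustration only, not the paper's implementation: the names `Agent`, `is_stuck`, `fit_idm`, and `bootstrap_once` are hypothetical, frames are stood in for by small numeric tuples, the IDM is replaced by a nearest-neighbour lookup on frame differences, and retrieval (step 2) is assumed to have already produced the video passed in.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

Frame = Tuple[float, ...]  # assumption: a frame is represented by a small feature vector
Action = int               # assumption: discrete button index


@dataclass
class Agent:
    # (frame_before, frame_after, action) triples logged from the agent's own play
    own_transitions: List[Tuple[Frame, Frame, Action]] = field(default_factory=list)
    # (frame, action) pairs that a policy would be trained on
    policy_data: List[Tuple[Frame, Action]] = field(default_factory=list)


def is_stuck(milestones: Sequence[int], patience: int) -> bool:
    """Step 1: stuck if the milestone count has not risen over the last `patience` checks."""
    if len(milestones) <= patience:
        return False
    return milestones[-1] <= milestones[-1 - patience]


def fit_idm(
    transitions: Sequence[Tuple[Frame, Frame, Action]]
) -> Callable[[Frame, Frame], Action]:
    """Toy stand-in for an IDM: nearest neighbour on the frame difference.

    Requires at least one logged transition.
    """
    def delta(a: Frame, b: Frame) -> Frame:
        return tuple(y - x for x, y in zip(a, b))

    table = [(delta(f0, f1), act) for f0, f1, act in transitions]

    def idm(f0: Frame, f1: Frame) -> Action:
        d = delta(f0, f1)
        nearest = min(table, key=lambda row: sum((p - q) ** 2 for p, q in zip(row[0], d)))
        return nearest[1]

    return idm


def bootstrap_once(agent: Agent, retrieved_video: Sequence[Frame]) -> List[Tuple[Frame, Action]]:
    """Steps 3 and 4: pseudo-label consecutive video frames, then add them to the policy data."""
    idm = fit_idm(agent.own_transitions)
    labeled = [(f0, idm(f0, f1)) for f0, f1 in zip(retrieved_video, retrieved_video[1:])]
    agent.policy_data.extend(labeled)  # a real system would retrain the policy here
    return labeled
```

In this sketch a caller would check `is_stuck` on its milestone history and, when it returns True, call `bootstrap_once` with freshly retrieved video before resuming play.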
The contribution is genuinely novel in its synthesis: while IDMs (VPT), retrieval-augmented agents, and self-improvement loops each exist independently, ASH combines them into a coherent online learning system where the IDM, retrieval corpus selection, and policy co-evolve. The agent requires no reward engineering, no action-labeled demonstrations, and no privileged game state access.
2. Methodological Rigor
Strengths: The experimental design is thoughtful. Two complementary environments (turn-based RPG vs. real-time action-adventure) test different capabilities. The paper includes component ablations (Figure 4 left), IDM accuracy tracking across bootstraps (Figure 4 right), retrieval quality evaluation with human annotators (Table 1), and a catastrophic forgetting analysis (Figure 5). The comparison includes five methods spanning BC, retrieval-augmented, and foundation model approaches.
Concerns:
3. Potential Impact
Embodied AI: ASH demonstrates a viable path for learning embodied policies without reward engineering or action-labeled data—two of the most significant bottlenecks in the field. If the approach generalizes beyond video games, it could substantially reduce the human effort required for training embodied agents.
Scalability argument: The use of internet video as a supervision source is compelling for scalability. The 22,000-video Pokémon corpus and 17,000-video Zelda corpus demonstrate that sufficient data exists for many domains on YouTube. The retrieval mechanism (94% relevance) shows this data can be efficiently filtered.
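As a rough illustration of what filtering a video corpus by visual similarity and then scoring relevance could look like, the sketch below ranks clips by cosine similarity to a query embedding and computes the fraction of retrieved clips that annotators judged relevant. The assessment does not state how ASH embeds frames or scores similarity, so cosine similarity, the function names, and the example embeddings are all assumptions.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_top_k(query: Sequence[float], corpus: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the ids of the k clips whose embeddings are most similar to the query."""
    ranked = sorted(corpus.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [clip_id for clip_id, _ in ranked[:k]]


def relevance_rate(retrieved: Sequence[str], judged_relevant: Set[str]) -> float:
    """Fraction of retrieved clips that human annotators marked as relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for clip_id in retrieved if clip_id in judged_relevant) / len(retrieved)


# Hypothetical embeddings for three clips and the agent's current frame.
corpus = {"clip_a": (1.0, 0.0), "clip_b": (0.0, 1.0), "clip_c": (0.9, 0.1)}
top = retrieve_top_k((1.0, 0.0), corpus, k=2)
rate = relevance_rate(top, judged_relevant={"clip_a"})
```

A figure such as the 94% relevance cited above would correspond to `relevance_rate` computed over a human-annotated sample of retrieved clips.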
Limitations on generalization: The two test environments, while complementary, are both GBA games with discrete action spaces and relatively low visual complexity. The jump to continuous-action robotics or photorealistic 3D environments is substantial. The IDM's reliance on pixel-level action inference will struggle in domains where actions have subtle or delayed visual effects.
4. Timeliness & Relevance
This work addresses a very current need. The field is grappling with how to move beyond short-horizon tasks and how to leverage internet-scale data without expensive annotation. The paper positions itself well against the backdrop of VPT, foundation model agents (Claude Plays Pokémon), and RL approaches (Pokémon Red via PPO). The critique of reward engineering fragility (citing Pleines et al.) is well-supported. The emergence of multimodal foundation models makes the zero-shot baseline (Qwen3.5) particularly relevant, and ASH's advantage over it demonstrates that learned embodied policies still outperform prompted models for fine-grained control.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment
ASH presents a compelling and well-executed system for self-improving embodied agents. The combination of dynamic bootstrapping, IDM-based pseudo-labeling, and unsupervised key moment discovery is novel and the results are strong. The work's primary limitation is the narrow evaluation domain (two GBA games), which tempers claims about general scalability. Nevertheless, the paper makes a meaningful contribution to long-horizon embodied learning by demonstrating that self-improvement from unlabeled internet video is viable and outperforms static training paradigms.
Generated May 15, 2026
Comparison History (19)
Paper 1 is more likely to have higher impact: it proposes a novel self-improvement loop for long-horizon embodied learning using unlabeled internet video, demonstrating substantial performance gains over strong baselines on demanding multi-hour tasks. This advances core capability (scalable supervision and long-horizon planning) with clear applications to robotics/agents and broad relevance across RL, imitation learning, and foundation-model agents. Paper 2 is timely and valuable as an evaluation/audit study, but it is primarily diagnostic; it introduces a benchmark/scaffold and highlights failure modes rather than delivering a new capability breakthrough, limiting downstream transformative impact.
ASH introduces a novel self-improvement paradigm for embodied agents that learns from unlabeled internet video without reward shaping or expert demonstrations—addressing a fundamental scalability bottleneck in AI. The approach combines inverse dynamics models, unsupervised key-moment identification, and a self-honing loop, demonstrating strong results on complex long-horizon tasks (Pokemon Emerald, Zelda). This has broad implications for robotics, game AI, and autonomous agents. Paper 2, while clever in its sample-efficient alignment approach, addresses a narrower problem (LLM safety) with incremental improvements over existing methods.
ASH presents a more transformative contribution by demonstrating self-improving embodied agents that learn from unlabeled internet video without rewards or expert annotations—addressing a fundamental scalability bottleneck in embodied AI. The approach combines inverse dynamics models, unsupervised key-moment identification, and self-improvement loops in a novel way, with dramatic empirical gains on challenging long-horizon tasks. While BeliefMem's probabilistic memory is a solid incremental contribution to LLM agent memory, ASH's paradigm of scalable self-improvement from internet video has broader implications across robotics, game AI, and autonomous systems.
Paper 2 likely has higher impact: it proposes a scalable, practical recipe for long-horizon embodied learning using unlabeled internet video and a self-improvement loop (IDM-derived supervision + memory), demonstrated on challenging multi-hour planning benchmarks with large gains over strong baselines. This directly targets a central bottleneck in agent research, with clear downstream applications in robotics and general autonomy, and is timely given interest in self-improving agents and web-scale learning. Paper 1 is conceptually novel and methodologically interesting, but its immediate real-world applicability and breadth of impact are less certain.
ASH introduces a novel self-improvement paradigm for embodied AI that learns from unlabeled internet video without reward shaping or expert demonstrations—addressing a fundamental scalability bottleneck. Its contributions (inverse dynamics models from self-play, unsupervised key-moment detection, long-horizon planning across diverse game environments) represent a broadly applicable methodological advance. While MatBrain offers practical value in materials science with its efficient dual-model architecture, ASH's framework has broader cross-domain impact potential, tackling a core AI challenge (long-horizon embodied learning from unstructured data) with a more generalizable and innovative approach.
ASH addresses a fundamental challenge in embodied AI—learning long-horizon policies from unlabeled internet video without reward shaping or expert demonstrations. Its self-improvement loop combining inverse dynamics models with internet video supervision is highly novel and demonstrates impressive results on complex, multi-hour tasks. This work opens a scalable pathway for embodied agents that could broadly impact robotics and autonomous systems. While EVOCHAMBER presents interesting multi-agent evolution ideas with strong benchmark results, it operates within a more incremental framework of test-time adaptation for LLM-based agents, with narrower long-term implications.
Paper 1 addresses a critical, high-stakes real-world problem (ICU decision support) by introducing a highly novel hindsight-annotated benchmark. By moving beyond mimicking potentially flawed historical actions, it exposes critical safety failures in current LLMs. Its direct implications for patient safety and medical AI rigor give it profound near-term and interdisciplinary impact, whereas Paper 2, while methodologically impressive for embodied AI, is currently limited to video game environments.
Paper 2 is more novel and likely higher impact: it proposes a scalable self-improvement paradigm for long-horizon embodied learning using unlabeled internet video, avoiding rewards and expert demonstrations—an approach with broad implications for robotics, game-playing, and general agent learning. Its results on multi-hour tasks indicate a meaningful capability jump and a potentially general recipe. Paper 1 is valuable and methodologically strong for agent evaluation/diagnosis, but it is primarily an evaluation framework with narrower downstream impact compared to a training/learning approach that could unlock new capabilities.
Paper 2 (CoCoDA) has higher estimated impact due to a more general, reusable contribution: a code-native compositional DAG that addresses a core scaling bottleneck in tool-augmented agents (library growth under fixed context). It proposes principled retrieval/training mechanisms plus theoretical guarantees (cost reduction, monotone co-evolution, well-formedness) and shows broad benchmark gains, suggesting methodological rigor and cross-domain applicability (reasoning, code, data analysis). Paper 1 is novel and impressive for long-horizon game agents, but is narrower in domain and may be harder to generalize beyond specific embodied/video settings.
Paper 1 (ASH) is likely higher impact due to a more novel learning paradigm: self-improving embodied agents that extract supervision from unlabeled internet video via inverse dynamics and long-term memory, addressing a core bottleneck in long-horizon RL/embodied AI without rewards or expert demos. Its implications could generalize beyond games to robotics and other sequential decision-making. Paper 2 (OpenMobile) is highly timely and valuable for openness and reproducibility, but is primarily a data/synthesis framework with more incremental methodological innovation, and its impact may be narrower to mobile UI agents.
ASH addresses a fundamental challenge in embodied AI—learning long-horizon policies from unlabeled internet video without reward shaping or expert demonstrations. Its self-improvement loop combining inverse dynamics models, unsupervised key-moment detection, and internet video supervision represents a novel and scalable paradigm. The strong empirical results across two complex game environments demonstrate significant advancement over baselines. Paper 2 addresses an important but narrower problem (multi-agent moderation via intent detection) with a more incremental contribution. ASH's broader implications for autonomous learning, scalability, and embodied AI give it substantially higher potential impact across multiple research areas.
Paper 2 tackles a fundamental and historically difficult challenge in AI—long-horizon embodied learning without hand-engineered rewards or labeled demonstrations. By proposing a self-honing agent that learns from noisy internet videos, it provides a highly scalable paradigm with broad implications for robotics, reinforcement learning, and AGI. While Paper 1 offers strong applied value for enterprise systems, Paper 2's methodological innovation in unsupervised, self-supervised embodied learning represents a deeper foundational shift with wider theoretical and cross-disciplinary impact.
Paper 1 likely has higher scientific impact due to its novel self-improvement loop that leverages unlabeled internet video to generate supervision for long-horizon embodied control, addressing a core bottleneck in scalable RL/agent learning. The demonstrated multi-hour planning gains over strong baselines in two complex game environments suggest broad applicability to robotics and generalist agents, with timely relevance to autonomous agents and self-supervised learning. Paper 2 is methodologically solid and important for fairness/perspectivist NLP, but its scope is narrower and impacts fewer adjacent fields than a scalable recipe for long-horizon embodied agents.
Paper 1 likely has higher impact due to greater methodological novelty (self-improvement loop using IDM to mine supervision from unlabeled internet video plus long-term memory for key moments) and a stronger algorithmic contribution that could generalize to many embodied/interactive domains beyond the showcased games. It targets a central bottleneck in long-horizon RL/embodied learning—scalable supervision without rewards or expert demos—and demonstrates substantial gains over multiple baselines on multi-hour tasks. Paper 2 is timely and useful, but primarily a benchmark/dataset contribution with narrower novelty and impact concentrated in evaluation.
Paper 2 tackles a fundamental bottleneck in embodied AI: learning long-horizon tasks from unannotated, noisy internet videos without reward shaping. Its self-improving loop and unsupervised extraction of key moments represent a highly scalable, novel approach with massive implications for robotics and general AI. While Paper 1 introduces an interesting biologically-inspired memory management framework, Paper 2's methodology addresses a more universally challenging problem in AI scaling and demonstrates significant, sustained improvements in notoriously difficult environments.
Paper 2 (ASH) has higher likely scientific impact due to a more broadly transformative agenda: scalable, self-improving embodied agents learning from unlabeled internet video without rewards or expert demos. This targets a central, timely bottleneck (long-horizon autonomy) with clear downstream applications in robotics and general agentic systems, and demonstrates strong results on demanding multi-hour tasks. Paper 1 advances KGC via structured quantization for LLM alignment—novel and useful, but narrower in scope and application. ASH’s framework is more cross-domain and paradigm-shifting.
ASH introduces a fundamentally novel paradigm for embodied AI: self-improving agents that learn from unlabeled internet video without reward shaping or expert demonstrations. This addresses a core scalability bottleneck in AI and demonstrates strong results on long-horizon tasks. Paper 2, while practically valuable for clinical AI governance, is more incremental—presenting an evaluation/monitoring framework for a specific deployed system rather than a new scientific method. ASH's contributions to self-supervised learning, inverse dynamics models, and long-horizon planning have broader impact potential across robotics, game AI, and embodied intelligence research.
Paper 1 tackles a fundamental and widely researched bottleneck in general AI: long-horizon embodied learning without hand-engineered rewards or labeled demonstrations. Its novel self-improvement loop using unlabeled internet video has broad implications for autonomous agents and robotics. While Paper 2 introduces a valuable benchmark for urban mobility, Paper 1's methodological breakthrough in agentic self-improvement represents a more significant leap with broader applicability across the rapidly growing field of embodied AI.
ASH addresses a fundamental challenge in embodied AI—learning long-horizon policies from unlabeled internet video without reward shaping or expert demonstrations. This represents a significant paradigm shift toward scalable, self-improving agents. The approach combines inverse dynamics models, unsupervised key-moment detection, and self-improvement loops in a novel way, with impressive empirical results across two complex game environments. Its breadth of impact spans robotics, reinforcement learning, and foundation models. Paper 1 addresses an important but narrower problem (LLM cross-query consistency) with more incremental contributions to the reasoning/evaluation literature.