Wenli Xiao, Jia Xie, Tonghe Zhang, Haotian Lin, Letian "Max" Fu, Haoru Xue, Jalen Lu, Yi Yang
Achieving dexterous robotic manipulation in the real world relies heavily on human supervision and algorithm engineering, which has become a central bottleneck in the pursuit of general physical intelligence. Although emerging coding agents can generate code to automate algorithm search, their successes remain largely confined to digital environments. We conjecture that the missing abstraction for automating robotics research is a repeatable feedback loop for real-world policy improvement: reset the scene, execute a policy, verify the outcome, and refine the next iteration. To bridge this gap, we introduce ENPIRE, a harness framework for coding agents that instantiates this physical feedback routine with four core modules: an Environment module (EN) for automatic reset and verification, a Policy Improvement module (PI) that launches policy refinement, a Rollout module (R) that evaluates policies with one or multiple physical robots operating in parallel, and an Evolution module (E) in which coding agents analyze logs, consult literature, and improve training infrastructure and algorithm code to address failure modes. This closed-loop system transforms real-world manipulation learning into a controllable optimization procedure, minimizing human effort while allowing fair ablations across training recipes and agent variants. Powered by ENPIRE, frontier coding agents can autonomously train a policy to achieve a 99% success rate on challenging, dexterous manipulation tasks, such as organizing a pin box, fastening a zip tie, and tool use, a process that further accelerates when we dispatch an agent team on a robot fleet. Our results suggest a practical and scalable path toward deploying coding agents to autonomously advance robotics in the physical world.
ENPIRE proposes a framework that enables LLM-based coding agents to autonomously conduct robotics research in the physical world—a concept the authors term "physical autoresearch." The key insight is that the missing piece for automating real-world robotics research is a structured feedback loop consisting of four modules: Environment construction (EN) for automated reset and verification, Policy Improvement (PI) for launching refinement, Rollout (R) for physical evaluation, and Evolution (E) for multi-agent hypothesis testing via Git-based coordination.
The paper addresses a genuine bottleneck: the heavy human supervision required in real-world robot policy learning. Prior autoresearch systems operated exclusively in simulation or digital domains; ENPIRE closes the loop on physical hardware. The two-stage decomposition—first constructing environment interfaces with human feedback (one-time cost), then running fully autonomous policy improvement—is a pragmatic design choice that balances safety with autonomy.
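To make the reset, execute, verify, refine loop from the abstract concrete, the following is a minimal toy sketch in Python. It is not the ENPIRE implementation: every name here (`ToyEnvironment`, `rollout`, `improve`, `run_loop`) is hypothetical, the "robot" is a single noisy scalar, and the refinement step is plain greedy random search standing in for whatever the coding agents actually do in the Evolution module.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyEnvironment:
    """Hypothetical stand-in for an Environment module: reset + verify."""
    target: float = 0.8      # assumed "correct" outcome for the toy task
    tolerance: float = 0.1   # assumed success threshold

    def reset(self) -> None:
        # A real system would physically restore the scene; here it is a no-op.
        pass

    def verify(self, outcome: float) -> bool:
        # Success check: outcome lands close enough to the target.
        return abs(outcome - self.target) <= self.tolerance


def rollout(env: ToyEnvironment, policy_param: float,
            episodes: int, rng: random.Random) -> float:
    """Rollout: reset, execute a noisy policy, verify; return success rate."""
    successes = 0
    for _ in range(episodes):
        env.reset()
        outcome = policy_param + rng.gauss(0.0, 0.05)  # noisy execution
        if env.verify(outcome):
            successes += 1
    return successes / episodes


def improve(policy_param: float, step: float, rng: random.Random) -> float:
    """Policy improvement: propose a perturbed candidate (random search)."""
    return policy_param + rng.uniform(-step, step)


def run_loop(iterations: int = 50, episodes: int = 40, seed: int = 0):
    """Closed loop: propose, evaluate, keep the candidate if it is no worse."""
    rng = random.Random(seed)
    env = ToyEnvironment()
    best_param = 0.0
    best_rate = rollout(env, best_param, episodes, rng)
    history = [best_rate]
    for _ in range(iterations):
        candidate = improve(best_param, 0.2, rng)
        rate = rollout(env, candidate, episodes, rng)
        # Stand-in for the Evolution step: greedy accept-if-not-worse.
        if rate >= best_rate:
            best_param, best_rate = candidate, rate
        history.append(best_rate)
    return best_param, best_rate, history


if __name__ == "__main__":
    param, rate, history = run_loop()
    print(f"final parameter {param:.3f}, success rate {rate:.2f}")
```

The point of the sketch is only structural: once reset and verification are automatic, policy improvement reduces to an optimization loop over measured success rates, which is the property the summary above attributes to the framework.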
This paper is extremely timely. It sits at the intersection of three rapidly evolving trends: (1) frontier coding agents becoming capable enough for complex multi-step reasoning, (2) real-world robot learning scaling up, and (3) autonomous scientific discovery gaining traction. The paper appears just as coding agents (Codex, Claude Code, Kimi Code) have reached sufficient capability to make this vision feasible. The framing of "physical autoresearch" as a distinct problem is a valuable conceptual contribution that could catalyze a new research direction.
The paper's greatest contribution may be conceptual rather than algorithmic—establishing the vocabulary and evaluation framework for physical autoresearch. The idea that coding agents should be evaluated not just on task success but on resource utilization (MRU, MTU) when operating physical systems is a valuable framing. However, the paper would benefit from deeper analysis of what the agents actually discover versus what a competent human researcher would try, to better understand whether agents bring genuine creativity or merely exhaustive search.
Generated Jun 19, 2026
Paper 1 overcomes a critical bottleneck in embodied AI by automating real-world robotic policy improvement without human supervision. Creating a closed-loop physical feedback system for coding agents is highly novel and offers immense real-world applicability for scalable physical intelligence. While Paper 2 provides rigorous and timely theoretical insights into LLM reasoning limitations, Paper 1 demonstrates a more transformative, tangible breakthrough in autonomous robotic manipulation, giving it a higher potential for broad, paradigm-shifting scientific impact.
ENPIRE addresses a fundamental bottleneck in robotics—automating real-world policy improvement with minimal human supervision. Its closed-loop framework combining coding agents with physical robot experimentation represents a highly novel paradigm shift toward autonomous robotics research. Achieving 99% success on dexterous manipulation tasks demonstrates strong practical impact. The breadth of impact is significant, spanning AI agents, robotics, and automation. Paper 1, while methodologically interesting in combining evolutionary optimization with flow-based models, addresses a narrower problem with more incremental contributions to generative modeling and scientific data editing.
Paper 2 bridges a critical gap in embodied AI by enabling autonomous coding agents to iteratively train physical robots in the real world. Automating the human-in-the-loop bottleneck in robotic policy learning offers immense real-world applications and radically accelerates physical AI research. While Paper 1 introduces a highly valuable metric for trustworthy ML, Paper 2's novel closed-loop system for real-world dexterous manipulation presents a more disruptive, scalable leap in the rapidly advancing field of agentic robotics.
ENPIRE presents a concrete, demonstrated system for autonomous robot policy improvement achieving 99% success on dexterous manipulation tasks. It addresses a critical bottleneck in robotics (human supervision) with a practical, scalable framework validated on real hardware. Its immediate applicability to real-world robotics, clear empirical results, and potential to accelerate autonomous robot learning give it higher near-term and broad scientific impact. Paper 2, while intellectually ambitious in bridging autotelic AI with philosophy and quantum formulations, is primarily theoretical/conceptual with less immediate empirical validation or practical applicability.
Paper 1 likely has higher impact due to its novel, end-to-end closed-loop framework that enables autonomous real-world robot policy improvement with coding agents, addressing a major bottleneck (human-in-the-loop engineering) and offering scalable deployment via robot fleets. Its real-world verification/reset/rollout infrastructure is a key enabling technology with broad implications for robotics, embodied AI, and automated experimentation. Paper 2 is timely and strong, but analogical reasoning for diverse idea generation is conceptually closer to existing prompting/creative-generation lines and its demonstrated gains, while valuable, are narrower and more domain-specific.
Paper 1 presents a groundbreaking framework automating the physical feedback loop for robotic policy improvement, addressing the critical data and supervision bottlenecks in embodied AI. While Paper 2 offers a valuable methodological control for LLM analysis, it relies on a relatively simple prompting technique (anonymization). Paper 1's hardware-software integration—enabling autonomous, real-world dexterous manipulation with high success rates—represents a major paradigm shift toward scalable physical intelligence, promising profound, transformative impacts on robotics and autonomous systems.
Paper 2 likely has higher impact: it introduces a closed-loop, real-world robotic self-improvement framework that operationalizes agentic code generation with physical reset/verification, parallel rollouts, and iterative evolution—directly reducing a major bottleneck (human supervision) and enabling scalable experimentation. The real-world applications (dexterous manipulation, robot fleets) are immediate and broadly relevant across robotics, automation, and agent research, and the timeliness is high given rapid progress in coding agents. Paper 1 is novel and valuable for LLM training, but its scope is narrower and less directly transformative beyond ML.
Paper 2 addresses a critical bottleneck in foundation model training: 4-bit (FP4) quantization. By identifying the geometric 'shrinkage bias' in the current E2M1 hardware consensus (e.g., NVIDIA Blackwell) and proposing the UFP4 recipe, this work challenges next-generation AI accelerator design. While Paper 1 offers an innovative automated pipeline for robotics, Paper 2's findings directly dictate the computational efficiency and hardware primitives of the entire LLM ecosystem. This offers immediate, massive-scale reductions in pretraining costs that will broadly impact all AI subfields.
Paper 2 likely has higher scientific impact: it proposes an agentic, closed-loop framework that demonstrably improves real-world robot policies with minimal human supervision, achieving high success rates on diverse dexterous tasks and scaling via multi-robot fleets. This is methodologically ambitious, timely, and has clear real-world applications in automation and embodied AI, with potential cross-field impact (robotics, ML, agentic systems, systems engineering). Paper 1 is rigorous and important for evaluation validity in LLM psychometrics, but its impact is more corrective and narrower in application scope.
Paper 1 (ENPIRE) is likely higher impact due to its broader, more novel framing: an end-to-end, repeatable real-world closed loop that enables coding agents to autonomously improve robot policies, reducing a core bottleneck in physical AI (human supervision/engineering). If robust, it generalizes across tasks, robots, and algorithms, and could reshape how robotics research and deployment iterates—high cross-field relevance (robotics, AutoML, agentic systems). Paper 2 is strong and timely for humanoid HRI, but its scope is narrower (co-speech motion) and more domain-specific.