PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft
Yuchen Guo, Junli Gong, Hongmin Cai, Yiu-ming Cheung, Weifeng Su
Abstract
We present PEAM, a Parametric Embodied Agent Memory framework in Minecraft that transforms agent memory from inference-time retrieval into parameter-resident skills internalized through experience. PEAM pairs a slow deliberative LLM for open-ended reasoning with a fast parametric module for reflexive execution of consolidated skills. The fast module is a multimodal Mixture-of-Experts LoRA architecture with per-category physically isolated adapters, enabling parameter-level continual learning without catastrophic forgetting. We treat failure as a first-class training signal: failure-correction trajectory pairs are internalized through a joint behavioral-cloning and contrastive objective, so the agent learns not only what succeeds but also how corrected actions differ from failed ones. To govern consolidation, PEAM introduces a parameterization-worthiness score for deciding which experience should be internalized, and a scale-free self-triggered consolidation mechanism for deciding when to internalize without task-specific hand-tuned thresholds, making the agent self-evolving as the trigger transfers across task distributions without re-tuning. Experiments in Minecraft show that PEAM improves long-horizon task performance, mitigates forgetting on previously consolidated skills, and improves parametric-versus-retrieval efficiency over retrieval-based embodied agents and parametric memory variants.
AI Impact Assessments
Scientific Impact Assessment: PEAM
Core Contribution
PEAM introduces a principled framework for converting an embodied agent's episodic memory into parametric skills — essentially moving from "remembering how to look things up" to "knowing how to do things." The paper draws a compelling analogy to complementary learning systems theory from cognitive neuroscience, where fast episodic encoding is gradually consolidated into slow distributed representations. The core novelty lies in the integration of three mechanisms: (1) a parameterization-worthiness (PV) score that decides *what* to internalize, (2) a scale-free self-triggered consolidation (STC) mechanism that decides *when* to internalize, and (3) per-category isolated MoE-LoRA adapters combined with a joint BC+DPO objective on failure-correction pairs that determines *how* internalization occurs. The treatment of failure as a first-class training signal via contrastive failure-correction pairs is a meaningful conceptual contribution, distinguishing PEAM from systems like Reflexion that convert failures to textual guidance.
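To make the "how" part concrete, the following is a minimal sketch of a joint BC+DPO loss on a single failure-correction pair. It uses the standard DPO formulation and a length-normalized negative log-likelihood for the behavioral-cloning term. The function names, the `beta` temperature, the `dpo_weight` mixing coefficient, and the additive combination are all assumptions made for illustration; the paper's exact objective is not reproduced here.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (corrected, failed) trajectory pair.

    Inputs are summed token log-probabilities under the trained policy
    (logp_*) and under a frozen reference policy (ref_*).
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def joint_loss(logp_chosen: float, logp_rejected: float,
               ref_chosen: float, ref_rejected: float,
               n_tokens: int, beta: float = 0.1,
               dpo_weight: float = 1.0) -> float:
    """Hypothetical joint objective: BC on the correction plus weighted DPO."""
    # Behavioral cloning: per-token negative log-likelihood of the correction.
    bc = -logp_chosen / n_tokens
    contrastive = dpo_loss(logp_chosen, logp_rejected,
                           ref_chosen, ref_rejected, beta)
    return bc + dpo_weight * contrastive


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    print(joint_loss(-12.0, -20.0, -15.0, -16.0, n_tokens=8))
```

The BC term is what ties the policy to generating the corrected action sequence, while the DPO term only rewards a relative preference. That is consistent with the ablation finding discussed below that a DPO-only variant can score well on preference margins yet fail at generation.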
Methodological Rigor
The experimental design is reasonable but has notable limitations. The evaluation covers 11 tasks across 3 seeds (33 trials per method), which is quite small for drawing strong statistical conclusions. The Wilson 95% confidence intervals are appropriately reported, and the McNemar paired test (p=0.018) provides some statistical backing, though the absolute number of trials limits power.
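For readers who want to check how wide such intervals are at n = 33, here is a minimal sketch of the two statistics named above: the Wilson score interval for a success rate and an exact two-sided McNemar test on discordant pairs. The success count and the discordant counts in the usage lines are hypothetical placeholders, not figures from the paper, and the paper may use a different McNemar variant (for example, the chi-square approximation).

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value.

    b: paired trials where only method A succeeded.
    c: paired trials where only method B succeeded.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    low, high = wilson_interval(successes=23, n=33)
    print(f"Wilson 95% CI: [{low:.3f}, {high:.3f}]")
    print(f"McNemar exact p: {mcnemar_exact(b=1, c=8):.4f}")
```

With only 33 trials, the interval spans roughly 30 percentage points for a mid-range success rate, which is the concrete sense in which the sample size limits statistical power.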
The ablation study (Table 2) is well-structured, systematically isolating each design choice. The finding that DPO-only achieves high forward-pass preference margins but fails at generation (0/12 wrapper format compliance) is a genuinely useful methodological insight that could benefit the broader RLHF/DPO community.
However, several aspects weaken rigor. The PV weights are selected by grid search and held fixed, but no sensitivity analysis beyond the leave-one-out ablation is provided. The forgetting evaluation is limited to a single sequential consolidation order (craft→gather→combat) — testing permutations would strengthen the claim. The cross-distribution trigger evaluation relies on trajectory re-slicing rather than independently collected distributions, which the authors acknowledge but which limits the strength of the scale-free claim.
The comparison against 8 baselines is comprehensive in coverage but uneven in fairness: some baselines (ReAct, Reflexion) are architecturally much simpler, while the parametric baselines (naive full-FT, single LoRA, EWC) are closer to strawman implementations than to state-of-the-art continual learning methods.
Potential Impact
The conceptual framing — that embodied agents should have a pathway for selected experience to become part of the policy itself — is compelling and addresses a genuine gap. If validated at larger scale, this could influence how memory is designed in LLM-based agent systems across domains beyond Minecraft (robotics, web agents, personal assistants).
The practical efficiency gains are notable: 85% token reduction and 42% latency reduction versus VOYAGER. For deployment scenarios where inference cost matters, parametric skill execution is clearly preferable to repeated retrieval.
The methodology findings (Section 4.5) about forward-pass margins not predicting generation quality, quantization failure modes, and category-dependent yields of failure-correction pairs (cpairs) are independently valuable practical contributions that could save other researchers significant debugging time.
However, real-world impact is constrained by the single-environment evaluation (Minecraft only) and the reliance on GPT-4o for the slow tier, which limits reproducibility and raises the cost of replication. The navigation category's inability to produce useful failure-correction pairs also signals a fundamental limitation of the contrastive approach for certain skill types.
Timeliness & Relevance
The paper addresses a timely bottleneck. As LLM-based agents are increasingly deployed in long-running settings, the cost of retrieval-based memory (context budget, latency, re-injection overhead) becomes a practical concern. The continual learning angle is also relevant given growing interest in agents that genuinely improve over time. The connection to DeepSeek-V4's cultivate-then-consolidate pattern and the broader trend toward mixture-of-experts architectures positions the work within active research threads.
Strengths
1. Coherent conceptual framework: The what/when/how decomposition of consolidation is clean and well-motivated by cognitive neuroscience.
2. Failure as training signal: Using failure-correction pairs with joint BC+DPO is a principled approach that goes beyond textual reflection.
3. Parameter isolation by design: Per-category adapters provide forgetting resistance structurally rather than through regularization, which is architecturally elegant.
4. Honest methodology reporting: The reporting of deployment failure modes, the limited failure-correction pair (cpair) yield for navigation, and the forward-pass vs. generate-path discrepancy is unusually candid and useful.
5. Ablation completeness: Each design choice is individually ablated with clear metrics.
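The structural argument in strength 3 can be illustrated with a toy sketch: if each skill category owns a physically separate set of adapter parameters and a consolidation step touches only one of them, previously consolidated categories cannot be overwritten. The `IsolatedAdapters` class, the flat weight vectors, the hard routing by category label, and the plain gradient step are all simplifying assumptions; an actual MoE-LoRA module would hold low-rank matrices per layer and a learned or rule-based router.

```python
import copy


class IsolatedAdapters:
    """Toy stand-in for per-category isolated adapters (hypothetical)."""

    def __init__(self, categories, dim=4):
        # One independent parameter vector per skill category.
        self.adapters = {c: [0.0] * dim for c in categories}

    def consolidate(self, category, grads, lr=0.1):
        """Apply one gradient step to the named category's adapter only."""
        weights = self.adapters[category]
        for i, g in enumerate(grads):
            weights[i] -= lr * g

    def route(self, category):
        """Hard routing: return the adapter for the task's category."""
        return self.adapters[category]


if __name__ == "__main__":
    bank = IsolatedAdapters(["craft", "gather", "combat"])
    bank.consolidate("craft", [1.0, -2.0, 0.5, 0.0])
    snapshot = copy.deepcopy(bank.route("craft"))
    # Consolidating a later category leaves the earlier adapter untouched.
    bank.consolidate("gather", [0.3, 0.3, 0.3, 0.3])
    print(bank.route("craft") == snapshot)
```

The trade-off, noted in limitation 2 below, is that the number of adapters grows with the number of categories, and isolation alone does not allow skills to share what they learn.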
Limitations
1. Scale of evaluation: 11 tasks, 3 seeds, single environment — this is insufficient to make strong generalization claims. The confidence intervals are wide (e.g., [53.0, 83.4] for PEAM).
2. Category granularity: Only 3 skill categories are evaluated; the claim that adapter count grows linearly with categories is acknowledged but not stress-tested.
3. Backbone dependency: The system requires both GPT-4o (slow tier) and an 8×A100 serving setup (fast tier), which significantly limits accessibility and reproducibility.
4. No real-world transfer evidence: All claims are restricted to Minecraft; the gap to robotics or other embodied domains is unaddressed.
5. Static PV weights: The consolidation scoring function uses fixed weights from grid search, with no adaptation mechanism.
6. Comparison fairness: Some parametric baselines are relatively naive; comparison against stronger continual learning methods (e.g., progressive neural networks adapted to this setting, or recent LoRA-based continual learning work) would be more convincing.
Overall Assessment
PEAM presents an intellectually coherent and well-motivated framework that addresses a genuine architectural gap in embodied agent memory systems. The conceptual contribution — formalizing the consolidation pathway from episodic to parametric memory — is the paper's strongest element. However, the empirical validation is limited in scale and restricted to a single environment, and several design choices (PV weights, category set) remain under-explored. The paper makes a solid conceptual argument but falls short of demonstrating the robustness needed to claim broad impact.
Generated May 28, 2026
Comparison History (20)
PEAM introduces a comprehensive novel framework addressing multiple fundamental challenges in embodied AI: continual learning without catastrophic forgetting, learning from failures via contrastive objectives, and self-triggered memory consolidation. Its contributions span architecture design (MoE-LoRA), training methodology (failure-correction contrastive learning), and autonomous learning mechanisms. Paper 2, while addressing an important evaluation gap in clinical AI, is primarily a benchmark and evaluation study with more incremental contributions. PEAM's broader methodological innovations have greater potential to influence multiple research directions in embodied agents and continual learning.
PEAM introduces a comprehensive framework with multiple novel contributions: parametric memory internalization replacing retrieval-based approaches, a Mixture-of-Experts LoRA architecture for continual learning without catastrophic forgetting, failure-as-training-signal through contrastive learning, and self-triggered consolidation mechanisms. This addresses fundamental challenges in embodied AI (memory, continual learning, skill acquisition) with broad applicability beyond Minecraft. Paper 2, while presenting a clever observation about first-token diversity in RLVR, is a relatively narrow, incremental improvement to existing RL training pipelines with a simpler conceptual contribution.
Paper 2 addresses a highly urgent and broadly applicable issue: privacy and safety in multi-agent LLM deployments. By demonstrating that current single-turn evaluations systematically underestimate privacy risks compared to social, multi-turn contexts, it has significant implications for AI safety, regulation, and system design. While Paper 1 offers a strong technical innovation for embodied agents in a simulated environment, Paper 2's findings on social contagion and privacy leakage are more timely, impacting a wider range of real-world AI applications and policy considerations.
Paper 1 addresses a highly timely and cross-disciplinary issue (AI productivity inequality) with broad implications for economics, management, and education. Its introduction of 'AI Interaction Competence' provides immediate real-world value for workforce training. While Paper 2 offers strong technical innovation in embodied AI memory systems, its impact is largely confined to the AI research community, giving Paper 1 significantly greater breadth of impact and real-world relevance.
Paper 1 introduces highly novel conceptual frameworks for embodied AI, including contrastive internalization of failures and scale-free self-triggered memory consolidation. These methodological innovations address fundamental challenges in continual learning and agent memory. While Paper 2 offers valuable engineering insights and strong benchmark performance for agentic coding, it functions primarily as a technical report on model training rather than introducing breakthrough scientific paradigms, making Paper 1 more likely to influence future scientific research in AGI and robotics.
PEAM introduces a more novel and broadly impactful contribution: a parametric memory framework that fundamentally rethinks how embodied agents store and utilize experience, moving from retrieval-based to parameter-resident skills. Its innovations—contrastive learning from failures, MoE-LoRA for continual learning without catastrophic forgetting, and self-triggered consolidation—address fundamental challenges in embodied AI and continual learning. Paper 1, while addressing an important security concern in multi-agent systems, is more incremental, extending existing attack/defense paradigms. PEAM's architectural contributions have broader applicability beyond Minecraft to general embodied AI systems.
Paper 2 tackles fundamental challenges in embodied AI and agent architecture, proposing a novel dual-system framework with multimodal MoE LoRA and contrastive learning to solve catastrophic forgetting and memory internalization. Its theoretical contributions to continual learning and self-evolving agents have broad applicability across AI fields. In contrast, Paper 1 presents a valuable but mostly applied, domain-specific pipeline for EdTech, using existing retrieval and LLM techniques, which inherently limits its broader scientific impact compared to Paper 2's foundational AI advancements.
Paper 2 has higher potential impact due to greater novelty and timeliness in embodied LLM agents: it proposes a concrete framework for converting experience into parametric skills with continual learning safeguards (isolated MoE LoRA adapters), explicit failure–correction contrastive learning, and self-triggered consolidation. These ideas can transfer beyond Minecraft to robotics, autonomous agents, and continual learning, affecting multiple fields. Paper 1 is methodologically solid and practically useful for memory-bounded planning, but it is a more incremental extension of GBFS with narrower cross-domain reach compared to Paper 2’s broader implications for agent memory and learning.
Paper 1 addresses a fundamental and pervasive issue in LLMs—epistemic dissonance during knowledge editing. By shifting the paradigm from static fact overwriting to causal editing, it offers a widely applicable solution to safely update foundation models without causing self-refutation. While Paper 2 presents an innovative continual learning framework for embodied agents, Paper 1's focus on the core mechanisms of LLM parametric memory provides broader theoretical implications and more immediate real-world utility across various NLP domains.
PEAM introduces a novel and comprehensive framework addressing multiple fundamental challenges in embodied AI: parametric memory consolidation, continual learning without catastrophic forgetting, learning from failure via contrastive objectives, and self-triggered consolidation. Its contributions span architecture design (MoE-LoRA), training methodology (failure-correction contrastive learning), and autonomous learning triggers. Paper 2 (AsyncTool) introduces a useful benchmark for asynchronous tool calling, but benchmarks typically have narrower long-term impact compared to novel frameworks that advance core capabilities. PEAM's innovations in embodied agent learning have broader implications across robotics, game AI, and continual learning fields.
Paper 2 addresses a critical methodological flaw in the entire field of AI agent research: the reliance on binary outcome metrics. By providing a taxonomy and framework for log analysis, it has the potential to fundamentally change evaluation standards across all agent research, leading to broad, field-wide impact. Paper 1, while presenting an innovative technical architecture for embodied agents, is narrower in scope and application (Minecraft), limiting its comparative breadth of impact.
Paper 2 tackles fundamental challenges in embodied AI, such as catastrophic forgetting, fast/slow reasoning, and continual learning. Its architectural innovations (MoE LoRA, contrastive internalization) are broadly applicable across AI subfields. Paper 1 focuses on a domain-specific application (finance) and system design, making its potential scientific impact narrower.
Paper 1 introduces a novel, evolving benchmark that exposes a critical gap in frontier models (0% success on functional ToM). By distinguishing functional from literal ToM and formally verifying tasks, it establishes a rigorous foundation for multi-agent embodied AI research. While Paper 2 presents a strong architecture for agent memory, Paper 1's benchmark is likely to drive broader community efforts, stimulate new methodologies across various architectures, and serve as a standard metric for a fundamental cognitive capability in AI, leading to higher widespread impact.
Paper 1 likely has higher scientific impact due to a concrete, technically novel method (parametric embodied memory with continual learning, failure-contrastive training, and self-triggered consolidation) demonstrated empirically in a standard embodied benchmark. It offers immediate applicability to LLM-based agents and robotics and is timely within fast-moving agentic AI research, enabling follow-on work and adoption. Paper 2 is broad and potentially influential conceptually, but appears more framework/position-oriented with less methodological specificity and causal identification detail, which may limit near-term impact despite strong application relevance.
Paper 2 (PEAM) presents a technically rigorous, reproducible framework addressing well-defined problems in embodied AI—continual learning, catastrophic forgetting, and efficient skill consolidation—with clear experimental validation. Its MoE-LoRA architecture, contrastive learning from failures, and self-triggered consolidation are novel, broadly applicable contributions. Paper 1, while provocative, relies on auto-ethnographic methodology with a single human-AI dyad, co-authored by the AI itself, raising significant concerns about scientific rigor, reproducibility, and anthropomorphization. Its claims about AI phenomenology lack established epistemological grounding, limiting mainstream scientific impact.
Paper 2 (PEAM) is more novel and broadly impactful: it proposes a concrete framework for continual, parametric skill memory in embodied agents, combining modular MoE-LoRA adapters, failure–correction contrastive internalization, and self-triggered consolidation—mechanisms with clear real-world relevance to robotics and interactive agents. It targets long-horizon autonomy and catastrophic forgetting, central open problems, and is timely given rapid growth in agentic LLMs. Paper 1 is rigorous and valuable for understanding CoT compression in post-training, but its impact is likely more incremental and narrower to LLM training practice.
Paper 2 likely has higher impact due to broader relevance: it advances continual learning and memory consolidation for embodied agents—problems central across robotics, RL, and LLM-agent research. Its parametric memory design (MoE LoRA with isolated adapters), failure-aware contrastive internalization, and self-triggered consolidation are broadly reusable beyond Minecraft. Paper 1 is novel and rigorous, but its applications are narrower (GPU kernel optimization) and more domain-specific, limiting cross-field reach despite strong practical value for systems/compilers.
PEAM introduces a comprehensive and novel framework addressing multiple critical challenges in embodied AI: parametric memory consolidation, continual learning without catastrophic forgetting, contrastive learning from failures, and self-triggered consolidation. Its technical contributions (MoE LoRA architecture, parameterization-worthiness scoring, scale-free consolidation) are substantial and broadly applicable beyond Minecraft. Paper 2 addresses an important fairness topic in multi-agent systems but is more observational/empirical, proposing a metric (FBS) rather than a transformative solution, limiting its methodological depth and breadth of downstream impact.
Paper 1 addresses a fundamental question about how LLMs perform reasoning internally, using mechanistic interpretability techniques to localize and characterize reasoning circuits. This has broad implications across AI safety, interpretability, and understanding of emergent capabilities in LLMs—topics of high current interest. Paper 2, while technically solid with a novel architecture for embodied agents in Minecraft, addresses a narrower application domain. Paper 1's insights into attention head specialization for reasoning steps and the emergence of algorithmic strategies in higher layers provide foundational knowledge applicable across many LLM applications and research directions.
PEAM introduces several novel contributions with broader scientific impact: (1) a principled framework for converting retrieval-based memory into parametric skills via contrastive learning from failures, (2) a Mixture-of-Experts LoRA architecture addressing catastrophic forgetting in continual learning, (3) self-triggered consolidation mechanisms that generalize across task distributions, and (4) treating failure as a first-class training signal. These contributions advance fundamental understanding of embodied agent learning, continual learning, and memory consolidation, with potential impact across robotics and AI. MobileExplorer, while practically useful, offers a more incremental engineering contribution focused on latency reduction in a specific application domain.