Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions
Jeongeun Lee, Chanyoung Park, Dongha Lee
Abstract
Multimodal large language model (MLLM)-based embodied agents have shown strong potential for solving complex tasks in physical environments. However, personalized assistance requires more than following generic instructions or recognizing object categories. In real-world scenarios, the intended target is often specified only implicitly through prior interactions, requiring agents to leverage personalized context accumulated over time. In this work, we propose POLAR, a multimodal memory-augmented framework for personalized embodied agents over long-term user interactions. POLAR organizes prior interactions into a multimodal knowledge graph that captures semantic memory for personalized context and visual concepts, and episodic memory for embodied experiences such as agent trajectories. To execute embodied tasks, POLAR retrieves relevant memories to interpret the current request and guide task execution. We evaluate POLAR across multiple MLLM backbones and diverse evaluation scenarios to study the role of memory in long-term personalization. Results show that the proposed memory mechanism consistently improves performance by enabling more effective use of information accumulated over prior interactions. The gains are especially pronounced when the agents are required to reason across multiple interactions, perform multi-hop inference, or track updates in user-specific context over time.
AI Impact Assessments
Scientific Impact Assessment: POLAR – Personalizing Embodied MLLM Agents over Long-term User Interactions
1. Core Contribution
POLAR introduces a multimodal memory-augmented framework that enables embodied agents to personalize their behavior across long-term user interactions. The key insight is that real-world personalized assistance requires agents to resolve implicit references (e.g., "bring my trip to-go") by leveraging accumulated interaction history, rather than relying on explicit target specifications. The framework organizes prior interactions into a multimodal knowledge graph with two memory types: semantic memory (fine-grained personalized context about objects as concise textual statements) and episodic memory (compressed trajectory experiences capturing planning-relevant information). This object-centric graph structure enables selective retrieval at task time to ground user-intended object instances and guide navigation planning.
The paper addresses a genuine gap: most existing embodied agents assume explicit target references, while personalized references in practice are implicit and distributed across interactions. The distinction between "which category" vs. "which instance" is well-motivated and practically important.
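The object-centric memory described above can be pictured with a small sketch. This is not POLAR's implementation: the class names (`ObjectNode`, `MemoryGraph`), the method names, and the word-overlap scoring are all assumptions made for illustration, and a real system would presumably use learned multimodal embeddings rather than token overlap. The sketch only shows how semantic statements could select an object instance whose episodic summaries are then available to guide navigation.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """One object instance in the graph (hypothetical schema, not from the paper)."""
    name: str
    semantic: list = field(default_factory=list)  # concise textual statements
    episodic: list = field(default_factory=list)  # compressed trajectory summaries


class MemoryGraph:
    """Toy object-centric memory with semantic and episodic entries per node."""

    def __init__(self):
        self.nodes = {}

    def _node(self, obj):
        # Create the node on first mention so both memory types share it.
        return self.nodes.setdefault(obj, ObjectNode(obj))

    def add_semantic(self, obj, statement):
        self._node(obj).semantic.append(statement)

    def add_episodic(self, obj, summary):
        self._node(obj).episodic.append(summary)

    def retrieve(self, request, top_k=1):
        """Rank nodes by word overlap between the request and their semantic
        statements (an assumed stand-in for embedding-based retrieval)."""
        words = set(request.lower().split())

        def score(node):
            return len(words & set(" ".join(node.semantic).lower().split()))

        ranked = sorted(self.nodes.values(), key=score, reverse=True)
        return [node for node in ranked[:top_k] if score(node) > 0]


graph = MemoryGraph()
graph.add_semantic("mug_1", "the white mug stays in the office")
graph.add_semantic("mug_2", "the blue mug is the one the user takes on trips")
graph.add_episodic("mug_2", "found on the kitchen counter next to the sink")

hits = graph.retrieve("bring the mug I take on trips")
print(hits[0].name, hits[0].episodic)
```

In this toy example the request never names the blue mug, yet the stored statement about trips is enough to pick the instance, which is the "which instance" rather than "which category" distinction the assessment highlights.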
2. Methodological Rigor
Strengths in experimental design:
Weaknesses in rigor:
3. Potential Impact
Direct impact: The paper addresses a realistic and underexplored problem — personalized embodied assistance requiring long-term memory. As home robots and personal assistants become more prevalent, the ability to resolve implicit references from interaction history will be essential. The framework is modular and backbone-agnostic, making it potentially adoptable.
Broader influence: The decomposition into semantic vs. episodic memory, grounded in cognitive science (Tulving, 1972), offers a principled design pattern that could influence memory architectures for non-embodied MLLM agents as well. The finding that structured memory outperforms raw interaction context is valuable for the broader retrieval-augmented generation community.
Limitations on impact: The evaluation is limited to navigation in simulated environments. Extension to manipulation, multi-step household tasks, or real-world deployment remains unclear. The reliance on object-centric memory may not generalize well to tasks where personalization involves procedures, preferences, or social context rather than specific objects.
4. Timeliness & Relevance
This work is highly timely. The convergence of capable MLLMs, embodied AI benchmarks, and growing interest in personalization creates a natural demand for this type of research. The preliminary analysis (Figure 2) demonstrating that current MLLMs fail at personalized instance grounding across interactions is a useful empirical contribution that motivates the work. The paper also engages with very recent references (2025-2026), indicating awareness of the rapidly evolving landscape.
The problem of long-context degradation in MLLMs is a known bottleneck, and POLAR's approach of converting raw interactions into structured, retrievable memory is a practical response to this limitation.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The paper's framing around cognitive memory types (semantic vs. episodic) is appealing but somewhat surface-level — the actual implementation is relatively straightforward (text statements + trajectory summaries in a graph). The deduplication mechanism for semantic nodes and the temporal edge annotations are practical design choices but lack formal analysis of their effectiveness. The writing is generally clear, though the paper could benefit from tighter presentation of the evaluation protocol.
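To make the two design choices mentioned above concrete, the following sketch shows one way deduplication and temporal annotation could interact. It is an assumption-laden illustration, not the paper's mechanism: the `Edge` and `TemporalMemory` names, the exact-match-after-normalization duplicate test, and the integer interaction index are all hypothetical. A duplicate statement refreshes the timestamp of the existing edge instead of adding a new one, and a lookup returns the most recently observed value.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    """A fact about an object, annotated with when it was last observed."""
    obj: str
    relation: str
    value: str
    step: int  # hypothetical interaction index


def normalize(text):
    # Assumed duplicate test: case- and whitespace-insensitive string equality.
    return " ".join(text.lower().split())


class TemporalMemory:
    def __init__(self):
        self.edges = []

    def add(self, obj, relation, value, step):
        """Insert a fact; return False if it only refreshed an existing edge."""
        key = (obj, relation, normalize(value))
        for edge in self.edges:
            if (edge.obj, edge.relation, normalize(edge.value)) == key:
                edge.step = max(edge.step, step)
                return False
        self.edges.append(Edge(obj, relation, value, step))
        return True

    def latest(self, obj, relation):
        """Return the most recently observed value, or None if unknown."""
        matches = [e for e in self.edges if e.obj == obj and e.relation == relation]
        if not matches:
            return None
        return max(matches, key=lambda e: e.step).value


memory = TemporalMemory()
memory.add("keys", "location", "hallway shelf", step=1)
memory.add("keys", "location", "Hallway  shelf", step=2)  # duplicate, refreshed
memory.add("keys", "location", "desk drawer", step=5)     # update
print(memory.latest("keys", "location"))
```

Keeping superseded edges rather than overwriting them is a deliberate choice in this sketch: it preserves the history needed to reason about how user-specific context changed over time.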
Generated May 27, 2026
Comparison History (23)
POLAR addresses the underexplored and increasingly important problem of long-term personalization for embodied MLLM agents, introducing a novel multimodal memory-augmented framework combining semantic and episodic memory via knowledge graphs. This opens new research directions at the intersection of personalization, embodied AI, and LLMs. While CIVIC makes solid engineering contributions to VLM efficiency (token reduction, KV-cache compression), it is more incremental, building on existing token pruning/merging literature. POLAR's broader conceptual novelty and potential to influence multiple research communities (embodied AI, personal assistants, memory-augmented models) give it higher impact potential.
Paper 2 (POLAR) likely has higher scientific impact due to broader real-world applicability and cross-field relevance: long-term personalization for embodied multimodal agents connects LLMs, robotics, HCI, and memory systems, addressing a timely gap for practical assistants. Its framework (multimodal knowledge graph + episodic/semantic memory) is a generalizable direction for sustained user interaction and could influence benchmarks and deployed systems. Paper 1 (MARI) is innovative and rigorous for alignment via adaptive interventions, but is more narrowly scoped to representation intervention techniques within LLM alignment.
Paper 2 introduces a highly targeted, low-cost intervention (first-token diversification) that directly addresses a key bottleneck in RLVR—rollout diversity—while minimally changing existing pipelines. Its methodological claim is crisp, testable, and broadly applicable across models, sizes, and tasks where verifier-based RL is used, making it timely for current post-training of reasoning LLMs. Paper 1 is valuable but sits in a more application-specific embodied/personalization niche and depends on system design choices (memory graphs, retrieval) that may generalize less cleanly or be harder to standardize.
EgoBench introduces a comprehensive benchmark and interactive environment that addresses a critical evaluation gap for multimodal tool-using agents. High-quality benchmarks in emerging AI domains typically drive significant follow-up research, model development, and widespread adoption, often resulting in higher long-term scientific impact and citations compared to individual methodological frameworks like the memory system proposed in Paper 1.
Paper 2 addresses a fundamental challenge in embodied AI—long-term personalization and multimodal memory—proposing a novel framework (POLAR) using knowledge graphs for semantic and episodic memory. This methodological innovation has broad applicability across robotics and AI agents. In contrast, Paper 1 is primarily an engineering technical report for specific coding models; while valuable, Paper 2 offers more generalized theoretical and structural advancements likely to inspire future research in autonomous, personalized agents.
Paper 1 (POLAR) addresses a more fundamental and broadly impactful research challenge—personalized embodied agents with long-term memory—which spans multiple fields (embodied AI, multimodal learning, human-robot interaction, knowledge graphs). Its multimodal memory-augmented framework introduces novel architectural ideas for personalization that could influence a wide range of downstream applications. Paper 2, while practically valuable for LLM serving infrastructure with solid engineering contributions, addresses a more narrow systems-level optimization problem. POLAR's novelty in combining episodic and semantic memory for embodied personalization has broader scientific reach and timeliness given the rapid growth of MLLM agents.
Paper 2 introduces a diagnostic benchmark for evaluating fundamental failure modes in LLM memory systems. Benchmarks typically have broader scientific impact than individual system architectures (Paper 1), as they provide the methodological foundation for future research across multiple domains. By isolating specific memory operations, MemFail offers actionable insights for improving general LLM agent reliability, giving it wider relevance and higher potential citations than a domain-specific embodied agent framework.
Paper 2 (POLAR) likely has higher impact due to strong timeliness and broad applicability: long-term personalization and memory for embodied multimodal agents is a central bottleneck for real-world assistants/robots. Its multimodal knowledge-graph memory combining semantic and episodic components can generalize across tasks, users, and MLLM backbones, with clear deployment relevance (assistive robots, AR, home agents). Paper 1 improves reliability in multi-agent text/tool workflows via planning+verification, but the space is crowded and may be more incremental; impact may be narrower to software-agent pipelines.
POLAR addresses a critical gap in embodied AI—long-term personalization through multimodal memory—which is highly relevant given the rapid growth of MLLM-based agents. It combines multimodal knowledge graphs with episodic memory for embodied tasks, offering broad applicability across robotics, personal assistants, and human-AI interaction. Paper 2 introduces energy shields for runtime fairness, which is a solid theoretical contribution with formal guarantees, but targets a narrower problem. Paper 1's timeliness with the MLLM wave and broader cross-field impact (NLP, robotics, HCI) gives it higher potential impact.
MedGuideX addresses a critical gap in medical AI by transforming clinical practice guidelines into executable decision logic for training LLMs, achieving significant improvements (10.28%) on clinical reasoning benchmarks with physician validation. Its approach of generating factual and counterfactual QA data from structured guidelines is novel and methodologically rigorous. The direct clinical applications, scalability of the training pipeline, and potential to improve healthcare decision-making give it broader real-world impact. POLAR is innovative for embodied agent personalization but addresses a narrower, less immediately impactful application domain.
Paper 1 addresses a highly timely and rapidly expanding area (multimodal large language models and embodied AI) with broad real-world applications in personalized robotics and virtual assistants. Its integration of multimodal knowledge graphs for long-term memory offers significant potential impact across AI, HCI, and robotics. In contrast, Paper 2 focuses on Answer Set Programming, which, while methodologically rigorous and valuable within formal logic and knowledge representation, serves a much narrower niche with limited interdisciplinary breadth compared to the widespread applicability of LLM-based agents.
Paper 2 has higher likely scientific impact due to stronger timeliness and broader cross-field relevance: long-term personalization for embodied multimodal LLM agents is a fast-moving area with immediate applicability in robotics, assistants, HCI, and memory systems. The proposed multimodal memory/knowledge-graph + episodic trajectory design could be adopted widely across agent platforms and backbones. Paper 1 is methodologically solid and novel for multi-variable complex query answering, but its impact is more specialized to KG reasoning and benchmark design, with narrower real-world deployment pathways.
Paper 2 (POLAR) targets long-term personalization for embodied multimodal agents, a core capability for real-world assistants. Its multimodal memory + knowledge-graph design has clear applications in robotics, AR/VR, smart-home agents, and human–AI interaction, with broad cross-field relevance (MLLMs, memory systems, embodied AI). The problem is timely and likely to persist as agents move into continuous deployment. Paper 1 is strong and practical for multi-agent LLM coordination and efficiency, but is more narrowly scoped to inference-time aggregation; its impact may be constrained by rapid shifts in prompting/agent frameworks. Overall, POLAR appears more enabling and generalizable.
Paper 2 challenges a fundamental paradigm in AI alignment by distinguishing process alignment from outcome alignment. Its findings on domain dependency and the risk of encoding discriminatory patterns offer profound implications for AI ethics, governance, and organizational implementation. While Paper 1 presents a solid architectural advancement for embodied agents, Paper 2's broader interdisciplinary relevance across AI safety, policy, and fairness gives it a higher potential for widespread scientific and societal impact.
Paper 1 targets a high-growth area (personalized embodied MLLM agents) with clear real-world applications in robotics and assistive systems. Its memory-augmented, multimodal knowledge-graph approach addresses a widely recognized limitation—long-term personalization and implicit intent—likely to generalize across tasks and agent platforms. The evaluation across multiple MLLM backbones and scenarios suggests broader applicability and near-term adoption. Paper 2 offers a sharper theoretical diagnosis of policy-gradient failures, but its demonstrated impact is narrower (specific long-horizon cumulative-damage settings) and the example domains are stylized, potentially limiting uptake outside RL methodology.
Paper 1 addresses a broad, fundamental challenge in embodied AI—long-term personalization and memory—by introducing a novel multimodal knowledge graph framework. This has wide-ranging implications for robotics and AI assistants. In contrast, Paper 2 is a highly specific, incremental technical report on quantization for a single model submitted to a challenge. Paper 1's conceptual innovation and broader applicability give it significantly higher potential scientific impact.
Paper 1 addresses a fundamental challenge in AI and robotics—long-term memory and personalization in multimodal embodied agents—offering a novel framework with broad applicability across human-robot interaction and general AI. In contrast, Paper 2 presents a domain-specific application of existing techniques (ReAct, RAG) to a highly specialized niche (competition law). Thus, Paper 1 demonstrates greater methodological innovation and a significantly broader potential impact across multiple scientific fields.
Paper 1 addresses a highly active and cutting-edge area of AI research (embodied agents, MLLMs, and long-term memory) with broad real-world applications in robotics and personalized assistants. While Paper 2 provides a valuable benchmarking tool, it focuses on a narrower, more specialized domain (Constraint Acquisition), making Paper 1 likely to have broader cross-disciplinary impact, higher citation potential, and greater relevance to current technological trends.
Paper 1 identifies a critical, previously uncharacterized failure mode (action-grammar destruction) when applying standard compression to LLM agents. Its proposed solution, AGORA, addresses a major bottleneck in agentic systems (context length and cost) with a highly practical, zero-inference-toll method. While Paper 2 tackles the important problem of personalized embodied agents, its approach using a multimodal knowledge graph for memory is more of an incremental architectural contribution compared to the fundamental systems-level insight and broad applicability of Paper 1.
Paper 1 addresses a fundamental challenge in embodied AI—long-term personalization through multimodal memory-augmented agents—which has broad implications across robotics, human-robot interaction, and AI assistants. The proposed framework (POLAR) introduces a novel multimodal knowledge graph combining semantic and episodic memory, advancing the frontier of personalized embodied agents. Paper 2, while creative in using Gumbel noise for counterfactual text generation in education, addresses a narrower application domain with more limited cross-disciplinary impact. Paper 1's contributions are more timely given the rapid growth of MLLM-based agents and have wider applicability.