Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions

Jeongeun Lee, Chanyoung Park, Dongha Lee

#968 of 2682 · Artificial Intelligence
Tournament Score: 1440±42
Win Rate: 70% (16 wins, 7 losses, 23 matches)
Rating: 5.5/10

Abstract

Multimodal large language model (MLLM)-based embodied agents have shown strong potential for solving complex tasks in physical environments. However, personalized assistance requires more than following generic instructions or recognizing object categories. In real-world scenarios, the intended target is often specified only implicitly through prior interactions, requiring agents to leverage personalized context accumulated over time. In this work, we propose POLAR, a multimodal memory-augmented framework for personalized embodied agents over long-term user interactions. POLAR organizes prior interactions into a multimodal knowledge graph that captures semantic memory for personalized context and visual concepts, and episodic memory for embodied experiences such as agent trajectories. To execute embodied tasks, POLAR retrieves relevant memories to interpret the current request and guide task execution. We evaluate POLAR across multiple MLLM backbones and diverse evaluation scenarios to study the role of memory in long-term personalization. Results show that the proposed memory mechanism consistently improves performance by enabling more effective use of information accumulated over prior interactions. The gains are especially pronounced when the agents are required to reason across multiple interactions, perform multi-hop inference, or track updates in user-specific context over time.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: POLAR – Personalizing Embodied MLLM Agents over Long-term User Interactions

1. Core Contribution

POLAR introduces a multimodal memory-augmented framework that enables embodied agents to personalize their behavior across long-term user interactions. The key insight is that real-world personalized assistance requires agents to resolve implicit references (e.g., "bring my trip to-go") by leveraging accumulated interaction history, rather than relying on explicit target specifications. The framework organizes prior interactions into a multimodal knowledge graph with two memory types: semantic memory (fine-grained personalized context about objects as concise textual statements) and episodic memory (compressed trajectory experiences capturing planning-relevant information). This object-centric graph structure enables selective retrieval at task time to ground user-intended object instances and guide navigation planning.
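
To make the two memory types concrete, here is a minimal sketch of an object-centric memory graph. It is not POLAR's implementation: the class names (`ObjectNode`, `MemoryGraph`), the example statements, and the word-overlap scoring are all assumptions made for illustration, and the paper's retrieval presumably relies on an MLLM or learned embeddings rather than token overlap.

```python
import re
from dataclasses import dataclass, field


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class ObjectNode:
    """One object instance with its attached memories (hypothetical schema)."""
    object_id: str
    category: str
    semantic: list[str] = field(default_factory=list)  # concise textual statements
    episodic: list[str] = field(default_factory=list)  # compressed trajectory summaries


class MemoryGraph:
    def __init__(self) -> None:
        self.nodes: dict[str, ObjectNode] = {}

    def _node(self, object_id: str, category: str) -> ObjectNode:
        # Create the object node on first mention, reuse it afterwards.
        return self.nodes.setdefault(object_id, ObjectNode(object_id, category))

    def add_semantic(self, object_id: str, category: str, statement: str) -> None:
        self._node(object_id, category).semantic.append(statement)

    def add_episodic(self, object_id: str, category: str, summary: str) -> None:
        self._node(object_id, category).episodic.append(summary)

    def retrieve(self, request: str, top_k: int = 1) -> list[ObjectNode]:
        """Rank object nodes by best word overlap with the request (toy scorer)."""
        query = tokens(request)

        def score(node: ObjectNode) -> int:
            return max((len(query & tokens(s)) for s in node.semantic), default=0)

        return sorted(self.nodes.values(), key=score, reverse=True)[:top_k]


graph = MemoryGraph()
graph.add_semantic("mug_blue", "mug", "Alex takes the blue mug on trips")
graph.add_semantic("mug_red", "mug", "The red mug stays at the office desk")
graph.add_episodic("mug_blue", "mug", "Found on the kitchen counter, left of the sink")

best = graph.retrieve("bring the mug I take on trips")[0]
print(best.object_id, best.episodic)
```

Both mugs share the category "mug", so the category alone cannot resolve the request; the semantic statement selects the instance, and the episodic summary attached to that same node is then available to guide navigation.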

The paper addresses a genuine gap: most existing embodied agents assume explicit target references, while personalized references in practice are implicit and distributed across interactions. The distinction between "which category" vs. "which instance" is well-motivated and practically important.

2. Methodological Rigor

Strengths in experimental design:

  • Evaluation spans five MLLM backbones (open-source and proprietary), lending credibility to generalization claims.
  • Three distinct evaluation scenarios (compositional, distractor, temporal) probe different aspects of personalization, moving beyond a single aggregate metric.
  • Controlled retrieval experiments (Figure 7) isolate the contribution of memory utilization from retrieval quality.
  • Ablation studies on trajectory representations (Table 2) and retrieval granularity (Figure 8) provide mechanistic understanding.

Weaknesses in rigor:

  • The dataset is relatively small: 4 scenes with 1,817 episodes. This raises concerns about statistical significance and generalizability to diverse environments.
  • Absolute performance numbers are quite low across the board (SR often 20-30%), making it difficult to assess practical utility. The paper would benefit from confidence intervals or significance tests.
  • The acquisition stage uses GPT-5 to generate personalized context, introducing potential biases. While MTurk validation is mentioned, details are sparse.
  • The raw-interaction baseline uses 15 randomly sampled interactions (including the gold one), which is a somewhat artificial setup. The sensitivity to this sampling strategy is not explored.
  • SPL falls well below SR in many settings (under the standard definition it can never exceed SR), suggesting navigation efficiency remains problematic even when targets are found.
  • The comparison baselines are limited — no comparison against existing memory-augmented agent frameworks (e.g., MemGPT, Mem0, MemVerse) adapted to this setting.
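
The points above about SR, SPL, and missing confidence intervals can be sketched in a few lines. This assumes the paper uses the standard Success weighted by Path Length definition, and it uses a Wilson score interval as one possible choice of interval; the episode values below are made up for illustration and are not results from the paper.

```python
import math


def success_rate(successes: list[int]) -> float:
    """Fraction of episodes that ended in success (1) rather than failure (0)."""
    return sum(successes) / len(successes)


def spl(successes: list[int], shortest: list[float], taken: list[float]) -> float:
    """Success weighted by Path Length: mean of S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for s, l, p in zip(successes, shortest, taken):
        total += s * l / max(p, l)
    return total / len(successes)


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k successes in n trials (z=1.96 for ~95%)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Four illustrative episodes: two successes, one of them via a detour.
successes = [1, 1, 0, 0]
shortest = [5.0, 4.0, 6.0, 3.0]   # shortest-path lengths
taken = [10.0, 4.0, 9.0, 3.0]     # lengths of the paths the agent actually took

sr = success_rate(successes)
spl_value = spl(successes, shortest, taken)
low, high = wilson_interval(25, 100)
print(sr, spl_value, (low, high))
```

Each per-episode SPL term is at most the success indicator, which is why SPL is bounded by SR. For a success rate of 25% over 100 episodes, the Wilson interval spans roughly 0.18 to 0.34, which illustrates how wide the uncertainty can be for per-scenario subsets of a small benchmark.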

3. Potential Impact

    Direct impact: The paper addresses a realistic and underexplored problem — personalized embodied assistance requiring long-term memory. As home robots and personal assistants become more prevalent, the ability to resolve implicit references from interaction history will be essential. The framework is modular and backbone-agnostic, making it potentially adoptable.

    Broader influence: The decomposition into semantic vs. episodic memory, grounded in cognitive science (Tulving, 1972), offers a principled design pattern that could influence memory architectures for non-embodied MLLM agents as well. The finding that structured memory outperforms raw interaction context is valuable for the broader retrieval-augmented generation community.

    Limitations on impact: The evaluation is limited to navigation in simulated environments. Extension to manipulation, multi-step household tasks, or real-world deployment remains unclear. The reliance on object-centric memory may not generalize well to tasks where personalization involves procedures, preferences, or social context rather than specific objects.

4. Timeliness & Relevance

    This work is highly timely. The convergence of capable MLLMs, embodied AI benchmarks, and growing interest in personalization creates a natural demand for this type of research. The preliminary analysis (Figure 2) demonstrating that current MLLMs fail at personalized instance grounding across interactions is a useful empirical contribution that motivates the work. The paper also engages with very recent references (2025-2026), indicating awareness of the rapidly evolving landscape.

    The problem of long-context degradation in MLLMs is a known bottleneck, and POLAR's approach of converting raw interactions into structured, retrievable memory is a practical response to this limitation.

5. Strengths & Limitations

Key Strengths:

  • Well-defined and practically motivated problem formulation (personalized instance grounding from implicit references)
  • Clean framework design with clear separation of memorization and utilization stages
  • Multi-backbone evaluation demonstrating generality
  • Thoughtful evaluation scenarios that isolate different aspects of long-term personalization
  • The case study (Figure 9) effectively illustrates how the same instruction maps to different targets for different users

Notable Limitations:

  • Scale is limited (4 scenes), raising generalization concerns
  • No comparison with existing memory-augmented agent systems
  • The knowledge graph construction relies heavily on MLLM prompting for semantic node generation — error propagation and quality of generated memories are not analyzed
  • No analysis of computational overhead or latency introduced by the memory module
  • The temporal scenario results are inconsistent across models, suggesting the temporal reasoning capability may not be robustly achieved
  • Missing analysis of failure modes: when does memory retrieval introduce harmful noise?
  • The paper acknowledges but does not address manipulation or more complex embodied tasks

Additional Observations

    The paper's framing around cognitive memory types (semantic vs. episodic) is appealing but somewhat surface-level — the actual implementation is relatively straightforward (text statements + trajectory summaries in a graph). The deduplication mechanism for semantic nodes and the temporal edge annotations are practical design choices but lack formal analysis of their effectiveness. The writing is generally clear, though the paper could benefit from tighter presentation of the evaluation protocol.
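
As a rough illustration of the two design choices mentioned here, the sketch below deduplicates semantic statements by normalized text and resolves conflicting facts by keeping the most recent temporally annotated edge. The paper's actual mechanisms are not described in this assessment, so the class `TemporalMemory`, the `Edge` fields, and the normalization rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Edge:
    """A user-to-object relation annotated with when it was observed (hypothetical)."""
    user: str
    relation: str
    object_id: str
    step: int  # index of the interaction in which the fact was stated


class TemporalMemory:
    def __init__(self) -> None:
        self.statements: dict[str, str] = {}  # normalized text -> original text
        self.edges: list[Edge] = []

    def add_statement(self, text: str) -> bool:
        """Store a semantic statement unless a normalized duplicate already exists."""
        key = " ".join(text.lower().split())
        if key in self.statements:
            return False
        self.statements[key] = text
        return True

    def add_edge(self, user: str, relation: str, object_id: str, step: int) -> None:
        self.edges.append(Edge(user, relation, object_id, step))

    def current(self, user: str, relation: str) -> Optional[str]:
        """Return the object from the most recent matching edge, if any."""
        matches = [e for e in self.edges if e.user == user and e.relation == relation]
        if not matches:
            return None
        return max(matches, key=lambda e: e.step).object_id


memory = TemporalMemory()
first = memory.add_statement("Alex's favorite mug is blue")
repeat = memory.add_statement("alex's  favorite mug is BLUE")  # same fact, different casing
memory.add_edge("alex", "favorite_mug", "mug_blue", step=3)
memory.add_edge("alex", "favorite_mug", "mug_red", step=9)     # later update
print(first, repeat, memory.current("alex", "favorite_mug"))
```

Exact-match normalization would miss paraphrases, which is one reason a formal analysis of the deduplication step would be informative.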

    Rating: 5.5/10
    Significance 6 · Rigor 5 · Novelty 5.5 · Clarity 6.5

    Generated May 27, 2026

    Comparison History (23)

    vs. CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models
    claude-opus-4.6 · 5/28/2026

    POLAR addresses the underexplored and increasingly important problem of long-term personalization for embodied MLLM agents, introducing a novel multimodal memory-augmented framework combining semantic and episodic memory via knowledge graphs. This opens new research directions at the intersection of personalization, embodied AI, and LLMs. While CIVIC makes solid engineering contributions to VLM efficiency (token reduction, KV-cache compression), it is more incremental, building on existing token pruning/merging literature. POLAR's broader conceptual novelty and potential to influence multiple research communities (embodied AI, personal assistants, memory-augmented models) give it higher impact potential.

    vs. Multi-Adapter Representation Interventions via Energy Calibration
    gpt-5.2 · 5/28/2026

    Paper 2 (POLAR) likely has higher scientific impact due to broader real-world applicability and cross-field relevance: long-term personalization for embodied multimodal agents connects LLMs, robotics, HCI, and memory systems, addressing a timely gap for practical assistants. Its framework (multimodal knowledge graph + episodic/semantic memory) is a generalizable direction for sustained user interaction and could influence benchmarks and deployed systems. Paper 1 (MARI) is innovative and rigorous for alignment via adaptive interventions, but is more narrowly scoped to representation intervention techniques within LLM alignment.

    vs. Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR
    gpt-5.2 · 5/28/2026

    Paper 2 introduces a highly targeted, low-cost intervention (first-token diversification) that directly addresses a key bottleneck in RLVR—rollout diversity—while minimally changing existing pipelines. Its methodological claim is crisp, testable, and broadly applicable across models, sizes, and tasks where verifier-based RL is used, making it timely for current post-training of reasoning LLMs. Paper 1 is valuable but sits in a more application-specific embodied/personalization niche and depends on system design choices (memory graphs, retrieval) that may generalize less cleanly or be harder to standardize.

    vs. EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents
    gemini-3.1 · 5/28/2026

    EgoBench introduces a comprehensive benchmark and interactive environment that addresses a critical evaluation gap for multimodal tool-using agents. High-quality benchmarks in emerging AI domains typically drive significant follow-up research, model development, and widespread adoption, often resulting in higher long-term scientific impact and citations compared to individual methodological frameworks like the memory system proposed in Paper 1.

    vs. Laguna M.1/XS.2 Technical Report
    gemini-3.1 · 5/28/2026

    Paper 2 addresses a fundamental challenge in embodied AI—long-term personalization and multimodal memory—proposing a novel framework (POLAR) using knowledge graphs for semantic and episodic memory. This methodological innovation has broad applicability across robotics and AI agents. In contrast, Paper 1 is primarily an engineering technical report for specific coding models; while valuable, Paper 2 offers more generalized theoretical and structural advancements likely to inspire future research in autonomous, personalized agents.

    vs. A Policy-Driven Runtime Layer for Agentic LLM Serving
    claude-opus-4.6 · 5/28/2026

    Paper 1 (POLAR) addresses a more fundamental and broadly impactful research challenge—personalized embodied agents with long-term memory—which spans multiple fields (embodied AI, multimodal learning, human-robot interaction, knowledge graphs). Its multimodal memory-augmented framework introduces novel architectural ideas for personalization that could influence a wide range of downstream applications. Paper 2, while practically valuable for LLM serving infrastructure with solid engineering contributions, addresses a more narrow systems-level optimization problem. POLAR's novelty in combining episodic and semantic memory for embodied personalization has broader scientific reach and timeliness given the rapid growth of MLLM agents.

    vs. MemFail: Stress-Testing Failure Modes of LLM Memory Systems
    gemini-3.1 · 5/27/2026

    Paper 2 introduces a diagnostic benchmark for evaluating fundamental failure modes in LLM memory systems. Benchmarks typically have broader scientific impact than individual system architectures (Paper 1), as they provide the methodological foundation for future research across multiple domains. By isolating specific memory operations, MemFail offers actionable insights for improving general LLM agent reliability, giving it wider relevance and higher potential citations than a domain-specific embodied agent framework.

    vs. Meta-Agent: From Task Descriptions to Verified Multi-Agent Systems
    gpt-5.2 · 5/27/2026

    Paper 2 (POLAR) likely has higher impact due to strong timeliness and broad applicability: long-term personalization and memory for embodied multimodal agents is a central bottleneck for real-world assistants/robots. Its multimodal knowledge-graph memory combining semantic and episodic components can generalize across tasks, users, and MLLM backbones, with clear deployment relevance (assistive robots, AR, home agents). Paper 1 improves reliability in multi-agent text/tool workflows via planning+verification, but the space is crowded and may be more incremental; impact may be narrower to software-agent pipelines.

    vs. Energy Shields for Fairness
    claude-opus-4.6 · 5/27/2026

    POLAR addresses a critical gap in embodied AI—long-term personalization through multimodal memory—which is highly relevant given the rapid growth of MLLM-based agents. It combines multimodal knowledge graphs with episodic memory for embodied tasks, offering broad applicability across robotics, personal assistants, and human-AI interaction. Paper 2 introduces energy shields for runtime fairness, which is a solid theoretical contribution with formal guarantees, but targets a narrower problem. Paper 1's timeliness with the MLLM wave and broader cross-field impact (NLP, robotics, HCI) gives it higher potential impact.

    vs. MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning
    claude-opus-4.6 · 5/27/2026

    MedGuideX addresses a critical gap in medical AI by transforming clinical practice guidelines into executable decision logic for training LLMs, achieving significant improvements (10.28%) on clinical reasoning benchmarks with physician validation. Its approach of generating factual and counterfactual QA data from structured guidelines is novel and methodologically rigorous. The direct clinical applications, scalability of the training pipeline, and potential to improve healthcare decision-making give it broader real-world impact. POLAR is innovative for embodied agent personalization but addresses a narrower, less immediately impactful application domain.

    vs. 2-ASP(Q) programs with weak constraints: Complexity and efficient implementation
    gemini-3.1 · 5/27/2026

    Paper 1 addresses a highly timely and rapidly expanding area (multimodal large language models and embodied AI) with broad real-world applications in personalized robotics and virtual assistants. Its integration of multimodal knowledge graphs for long-term memory offers significant potential impact across AI, HCI, and robotics. In contrast, Paper 2 focuses on Answer Set Programming, which, while methodologically rigorous and valuable within formal logic and knowledge representation, serves a much narrower niche with limited interdisciplinary breadth compared to the widespread applicability of LLM-based agents.

    vs. Neural Scalable Symbolic Search Framework for Complex Logical Queries with Multiple Free Variables
    gpt-5.2 · 5/27/2026

    Paper 2 has higher likely scientific impact due to stronger timeliness and broader cross-field relevance: long-term personalization for embodied multimodal LLM agents is a fast-moving area with immediate applicability in robotics, assistants, HCI, and memory systems. The proposed multimodal memory/knowledge-graph + episodic trajectory design could be adopted widely across agent platforms and backbones. Paper 1 is methodologically solid and novel for multi-variable complex query answering, but its impact is more specialized to KG reasoning and benchmark design, with narrower real-world deployment pathways.

    vs. DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs
    gpt-5.2 · 5/27/2026

    Paper 2 (POLAR) targets long-term personalization for embodied multimodal agents, a core capability for real-world assistants. Its multimodal memory + knowledge-graph design has clear applications in robotics, AR/VR, smart-home agents, and human–AI interaction, with broad cross-field relevance (MLLMs, memory systems, embodied AI). The problem is timely and likely to persist as agents move into continuous deployment. Paper 1 is strong and practical for multi-agent LLM coordination and efficiency, but is more narrowly scoped to inference-time aggregation; its impact may be constrained by rapid shifts in prompting/agent frameworks. Overall, POLAR appears more enabling and generalizable.

    vs. Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts
    gemini-3.1 · 5/27/2026

    Paper 2 challenges a fundamental paradigm in AI alignment by distinguishing process alignment from outcome alignment. Its findings on domain dependency and the risk of encoding discriminatory patterns offer profound implications for AI ethics, governance, and organizational implementation. While Paper 1 presents a solid architectural advancement for embodied agents, Paper 2's broader interdisciplinary relevance across AI safety, policy, and fairness gives it a higher potential for widespread scientific and societal impact.

    vs. Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems
    gpt-5.2 · 5/27/2026

    Paper 1 targets a high-growth area (personalized embodied MLLM agents) with clear real-world applications in robotics and assistive systems. Its memory-augmented, multimodal knowledge-graph approach addresses a widely recognized limitation—long-term personalization and implicit intent—likely to generalize across tasks and agent platforms. The evaluation across multiple MLLM backbones and scenarios suggests broader applicability and near-term adoption. Paper 2 offers a sharper theoretical diagnosis of policy-gradient failures, but its demonstrated impact is narrower (specific long-horizon cumulative-damage settings) and the example domains are stylized, potentially limiting uptake outside RL methodology.

    vs. Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2
    gemini-3.1 · 5/27/2026

    Paper 1 addresses a broad, fundamental challenge in embodied AI—long-term personalization and memory—by introducing a novel multimodal knowledge graph framework. This has wide-ranging implications for robotics and AI assistants. In contrast, Paper 2 is a highly specific, incremental technical report on quantization for a single model submitted to a challenge. Paper 1's conceptual innovation and broader applicability give it significantly higher potential scientific impact.

    vs. Maat: The Agentic Legal Research Assistant for Competition Protection
    gemini-3.1 · 5/27/2026

    Paper 1 addresses a fundamental challenge in AI and robotics—long-term memory and personalization in multimodal embodied agents—offering a novel framework with broad applicability across human-robot interaction and general AI. In contrast, Paper 2 presents a domain-specific application of existing techniques (ReAct, RAG) to a highly specialized niche (competition law). Thus, Paper 1 demonstrates greater methodological innovation and a significantly broader potential impact across multiple scientific fields.

    vs. Constraint acquisition needs better benchmarks
    gemini-3.1 · 5/27/2026

    Paper 1 addresses a highly active and cutting-edge area of AI research (embodied agents, MLLMs, and long-term memory) with broad real-world applications in robotics and personalized assistants. While Paper 2 provides a valuable benchmarking tool, it focuses on a narrower, more specialized domain (Constraint Acquisition), making Paper 1 likely to have broader cross-disciplinary impact, higher citation potential, and greater relevance to current technological trends.

    vs. AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents
    gemini-3.1 · 5/27/2026

    Paper 1 identifies a critical, previously uncharacterized failure mode (action-grammar destruction) when applying standard compression to LLM agents. Its proposed solution, AGORA, addresses a major bottleneck in agentic systems (context length and cost) with a highly practical, zero-inference-toll method. While Paper 2 tackles the important problem of personalized embodied agents, its approach using a multimodal knowledge graph for memory is more of an incremental architectural contribution compared to the fundamental systems-level insight and broad applicability of Paper 1.

    vs. Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering
    claude-opus-4.6 · 5/27/2026

    Paper 1 addresses a fundamental challenge in embodied AI—long-term personalization through multimodal memory-augmented agents—which has broad implications across robotics, human-robot interaction, and AI assistants. The proposed framework (POLAR) introduces a novel multimodal knowledge graph combining semantic and episodic memory, advancing the frontier of personalized embodied agents. Paper 2, while creative in using Gumbel noise for counterfactual text generation in education, addresses a narrower application domain with more limited cross-disciplinary impact. Paper 1's contributions are more timely given the rapid growth of MLLM-based agents and have wider applicability.