Zhi Zheng, Ziqiao Meng, Hao Luan, Wei Liu, Wee Sun Lee
External memory effectively grounds question answering (QA) based on large language models (LLMs) and vision-language models (VLMs) in relevant multimodal evidence. However, existing memory paradigms represent each memory item in raw text and image forms, so retrieval-based systems must pass the retrieved text or images to the generation LLMs/VLMs. This results in high token consumption and storage pressure, which makes such systems unaffordable for resource-constrained applications. We propose Latent Memory, a latent-space memory paradigm that replaces each raw text or image evidence item with a single high-dimensional latent token produced by a small compressor LLM/VLM. Rather than retrieving raw evidence for generation, Latent Memory operates in a unified latent representation space: the query is embedded into this space to retrieve relevant latent tokens, and the retrieved latent tokens are passed directly as prompt input to a pretrained LLM or VLM for answer generation. To make each latent token simultaneously informative for reconstruction, retrieval, and generation, we train the compressor with reconstruction, contrastive, and distillation objectives in a unified end-to-end manner. Latent Memory is evaluated on seven text-only QA benchmarks (e.g., HotpotQA) and multimodal QA benchmarks, where it achieves competitive QA performance compared to advanced RAG baselines while consuming 3x to 10x fewer generator tokens. It also delivers the strongest image-grounded QA performance on WebQA. Code is available at https://github.com/zz1358m/Latent-Memory-Master.
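The retrieval step described in the abstract (embed the query, then fetch the closest latent tokens) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy 3-dimensional vectors, the item names, the `retrieve` helper, and the use of cosine similarity as the scoring function are not taken from the paper or its code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, memory, k=2):
    """Return the ids of the k memory items most similar to the query.

    `memory` maps an item id to its single latent vector (hypothetical layout).
    """
    ranked = sorted(memory.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [item_id for item_id, _ in ranked[:k]]

# Toy latent memory: one vector per text or image evidence item (made-up values).
memory = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "img_c": [0.9, 0.1, 0.0],
}
query = [1.0, 0.05, 0.0]

top_ids = retrieve(query, memory, k=2)
print(top_ids)
```

In the paradigm the abstract describes, the vectors selected this way would be fed to the generator in place of raw text or image tokens; how that injection is implemented is not covered by this sketch.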
The paper introduces Latent Memory, a memory paradigm that compresses each text or image evidence item into a single high-dimensional latent token using a small compressor LLM/VLM. The key innovation is creating a unified latent representation space where the same token serves triple duty: (1) as a retrievable embedding for similarity search, (2) as a compact evidence representation for answer generation, and (3) as a reconstructable summary of the original content. This is achieved through a training objective that combines reconstruction, contrastive, and distillation losses, optimized end-to-end while the larger generator LLM/VLM stays frozen.
The core problem addressed is the computational and storage expense of passing raw text and especially images to generator models in RAG pipelines, a genuine bottleneck for resource-constrained deployment scenarios such as edge devices.
The methodology is well-structured and technically sound. The three-loss training framework is well-motivated: reconstruction preserves information fidelity, contrastive learning enables retrieval, and distillation ensures the frozen generator can interpret the latent tokens meaningfully. The ablation studies (Tables 4, 13, 14) systematically validate each component's contribution.
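To make the three-loss idea concrete, here is a minimal sketch of how such terms might be combined. It assumes an InfoNCE-style contrastive term and a plain weighted sum. The function names, the temperature, and the weights are hypothetical, and the paper's actual loss definitions and weighting may differ.

```python
import math

def info_nce(similarities, positive_index, temperature=0.1):
    """InfoNCE-style contrastive loss for one query.

    `similarities` holds the query's similarity to each candidate latent token;
    `positive_index` marks the matching evidence item.
    """
    logits = [s / temperature for s in similarities]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[positive_index]

def total_loss(reconstruction, contrastive, distillation,
               w_rec=1.0, w_con=1.0, w_dis=1.0):
    """Weighted sum of the three training terms (weights are assumed)."""
    return w_rec * reconstruction + w_con * contrastive + w_dis * distillation

# Toy values: the positive item (index 0) is the most similar candidate.
contrastive = info_nce([0.9, 0.1, 0.0], positive_index=0)
loss = total_loss(reconstruction=0.5, contrastive=contrastive, distillation=0.3)
print(round(loss, 4))
```

The contrastive term shrinks as the positive item's similarity pulls ahead of the negatives, which is the property that would make a latent token usable for retrieval.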
Practical impact: The 3-10× reduction in generator tokens directly translates to cost savings in API-based deployments and enables RAG on resource-constrained devices. For image-heavy applications, the storage reduction (26× per image) is significant.
The image-grounded QA results are genuinely impressive: 69.4 F1 on WebQA-Image with only 82 tokens versus 53.0 F1 for the best baseline at 1885 tokens. This suggests that bypassing raw visual token expansion and operating in latent space can actually *improve* quality for image evidence, likely because raw images can exceed context windows and degrade generation.
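As a quick check on the scale of the WebQA-Image comparison, the figures quoted above can be turned into a token ratio and an F1 gap. The variable names are illustrative, and the numbers are the ones stated in the preceding paragraph.

```python
# Figures quoted in the review for WebQA-Image.
baseline_tokens, latent_tokens = 1885, 82
baseline_f1, latent_f1 = 53.0, 69.4

token_ratio = baseline_tokens / latent_tokens  # how many times fewer tokens
f1_gain = latent_f1 - baseline_f1              # absolute F1 improvement

print(f"{token_ratio:.1f}x fewer tokens, +{f1_gain:.1f} F1")
# prints "23.0x fewer tokens, +16.4 F1"
```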
Broader influence: The unified retrieval-generation representation space is a conceptually appealing idea that could influence how future RAG systems are designed. If latent tokens can replace raw evidence while maintaining quality, this could reshape the entire retrieval-augmented generation paradigm.
However, the impact may be limited by the narrow scope of evidence types currently supported (atomic text sentences and single images only; no tables, documents with layout, or video).
This work addresses a timely bottleneck: as multimodal RAG systems scale and move toward edge deployment, the cost of passing raw evidence (especially images) through large generators becomes prohibitive. The paper positions itself well within the current landscape of efficient inference, latent reasoning, and on-device AI.
The concurrent emergence of related works (CLaRa, xRAG, LCC) validates the timeliness of this research direction. Latent Memory's extension to multimodal settings is a meaningful differentiation.
Overall Assessment: This is a solid systems-oriented contribution that identifies a real problem (token/storage cost in multimodal RAG) and provides a workable solution with clear advantages in the image-heavy regime. The unified retrieval-generation space is conceptually novel. However, the efficiency advantage is primarily realized in multimodal settings, the text performance often lags behind strong baselines, and the current design's limitation to atomic evidence units constrains broader applicability. The work represents meaningful incremental progress rather than a paradigm shift.
Generated Jun 10, 2026
Paper 1 introduces a fundamentally novel paradigm shift: improving spatial reasoning in LRMs through self-supervised consistency verification rather than labeled data, with a new RL strategy (OT-GRPO). This challenges the dominant assumption that spatial reasoning requires external supervision and has broad implications for reasoning alignment beyond spatial tasks. Its theoretical contribution (consistency verifiers, optimal transport-based RL) is also more foundational. Paper 2, while practically useful for resource-constrained QA with its latent memory compression, is more incremental: it optimizes token efficiency in RAG systems and offers narrower conceptual novelty.
Paper 2 addresses a more fundamental and broadly impactful question—whether AI agents can reliably synthesize scientific conclusions—with implications across health, policy, and all evidence-based domains. Its introduction of a large-scale benchmark (SciConBench) with clean-room evaluation methodology tackles the critical issue of data leakage in LLM evaluation, which has wide relevance. The finding that even frontier models achieve only 0.337 F1 and that consumer-facing tools produce incomplete/contradictory conclusions has immediate real-world safety implications. Paper 1, while technically solid in compressing memory tokens for resource-constrained QA, addresses a narrower efficiency optimization problem with less transformative potential.
Paper 2 proposes a highly novel paradigm shift for multimodal RAG by compressing raw evidence into a single latent token used directly for retrieval and generation. This significantly advances efficiency and scalability in resource-constrained settings across both NLP and vision-language domains. While Paper 1 provides a useful structural optimization for agentic memory, Paper 2's fundamental architectural innovation offers broader applicability and higher potential to influence future memory and retrieval designs in large foundation models.
Paper 2 likely has higher scientific impact: it challenges core assumptions in mechanistic interpretability (dataset-defined tasks map to a single circuit) with systematic evidence, then proposes a broadly applicable alternative (clustering-based multi-circuit discovery). This can reshape evaluation norms and downstream methods across interpretability, auditing, and safety, with relevance to many model types and tasks. Paper 1 is practically valuable (large token/storage savings for multimodal RAG/QA) but is more incremental within the fast-moving efficiency/RAG space and may be superseded by engineering advances.
Paper 2 addresses a fundamental efficiency bottleneck in RAG-based QA systems by compressing multimodal evidence into single latent tokens, achieving 3-10x token reduction with competitive performance. This has broader impact across NLP, multimodal AI, and resource-constrained deployment scenarios. The method is generalizable, evaluated on 7+ benchmarks, and addresses the timely problem of LLM efficiency. Paper 1, while addressing a real gap in trajectory anomaly datasets, targets a narrower spatial data mining niche with less transformative potential across the broader AI research community.
Paper 1 addresses a fundamental bottleneck in modern LLM/VLM deployment: context window limitations and high token consumption in Retrieval-Augmented Generation (RAG). By compressing multimodal evidence into single latent tokens, it offers a highly innovative methodology that reduces token usage by up to 10x while maintaining performance. This has massive real-world applicability for resource-constrained systems. While Paper 2 introduces a valuable benchmark for long-horizon agents, Paper 1's architectural innovation has a broader and more immediate impact on the efficiency and scalability of foundational models across multiple domains.
While Paper 1 offers a highly practical and timely efficiency improvement for multimodal LLMs, Paper 2 tackles a profound theoretical limitation in predictive models regarding causal counterfactuals. By introducing a novel mathematical framework (WorldKernel) to model the uncertainty and couplings between counterfactual worlds, Paper 2 challenges fundamental ML assumptions. This deep theoretical contribution to causality and world models has the potential for paradigm-shifting scientific impact, influencing how we conceptualize and build reasoning systems beyond mere pattern recognition.
Paper 2 likely has higher impact: it introduces a concrete, scalable systems-method (latent-space external memory with one-token evidence) that directly addresses pressing deployment constraints (token/storage costs) and demonstrates strong empirical results across many text and multimodal QA benchmarks with large efficiency gains. The approach is timely for RAG/VLM applications and broadly applicable to resource-constrained settings. Paper 1 is conceptually novel for interpretability of LLM decision structure, but its impact may be narrower and more diagnostic than enabling, with fewer immediate real-world deployments.
Paper 2 has higher potential impact because it tackles a broad, timely foundational question—why adaptive benchmark use often doesn’t overfit—using a falsifiable, information-theoretic framing (compression/description length) and experiments across diverse modalities and tasks. Its methodology includes two complementary bottlenecks and a deliberate overfitting condition, strengthening rigor and interpretability. The implications span ML evaluation, agentic AutoML, scientific discovery workflows, and benchmarking practices. Paper 1 is practically valuable for efficient multimodal RAG/QA, but its impact is more application- and systems-specific within retrieval/LLM tooling.
Paper 1 proposes a highly innovative latent memory paradigm that addresses a major bottleneck in RAG systems: token consumption and context window limits. By compressing multimodal evidence into single latent tokens, it reduces token usage by 3x-10x while maintaining performance. This offers immense practical value and broader applicability across any resource-constrained LLM/VLM system compared to Paper 2's unlearning method, which, while rigorous and important for AI safety, addresses a slightly more specialized domain.