NGM: A Plug-and-Play Training-Free Memory Module for LLMs
Yuwen Qu, Wenhui Dong, Chenyang Si, Caifeng Shan
Abstract
Recent studies introduce conditional memory modules that decouple knowledge storage from neural computation, enabling more direct knowledge access. Compared to MoE, which relies on dynamic computation paths, explicit lookup provides a more efficient knowledge retrieval mechanism. However, these approaches still depend on learned memory embeddings, requiring additional training and limiting flexibility. To address this, we propose N-gram Memory (NGM), a training-free, plug-and-play module composed of a Causal N-Gram Encoder and a Cosine-Gated Memory Injector. The Causal N-Gram Encoder directly averages the pretrained token embeddings of the backbone model to construct N-gram representations, thereby eliminating the need to train separate N-gram embeddings from scratch. This design requires neither an additional memory table nor a retrieval pipeline. The Cosine-Gated Memory Injector then uses a non-parametric cosine gate with ReLU to control how strongly the N-gram embeddings are injected into the contextual representations. We evaluate NGM on the Qwen3 series from 0.6B to 14B across eight benchmarks. NGM improves average performance by 0.5 to 1.2 points, with particularly clear gains on code generation and knowledge-intensive tasks (e.g., +3.0 on LiveCodeBench and +3.03 on GPQA for Qwen3-14B). NGM also improves performance on multimodal benchmarks (e.g., MMStar +1.53 on Qwen3-VL-2B).
AI Impact Assessments
Scientific Impact Assessment: NGM – A Plug-and-Play Training-Free Memory Module for LLMs
1. Core Contribution
NGM proposes a training-free, parameter-free memory module that can be attached to frozen LLMs at inference time. It consists of two components: (1) a Causal N-Gram Encoder that constructs multi-scale N-gram representations by averaging pretrained token embeddings within local causal windows, and (2) a Cosine-Gated Memory Injector that uses non-parametric cosine similarity with ReLU filtering to selectively inject these representations into intermediate hidden states via residual connections. The key insight is that pretrained embedding spaces already contain exploitable local-memory structure, and aggregated N-gram embeddings are geometrically aligned with hidden states at certain layers—making learned memory tables unnecessary.
This addresses a real limitation of prior conditional memory approaches (Engram, SCONE, L3, MeKi) that all require training dedicated embedding parameters and often specialized infrastructure. The constraint of zero additional training is the paper's distinguishing feature.
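The two components described above can be sketched in a few lines. The following is a minimal illustration in dependency-free Python, not the authors' implementation. The function names, the truncated window at the start of the sequence, the per-position scalar gate, and the plain residual update `h + gate * m` are all assumptions made for the example. The assessment does not specify how the multiple N-gram scales are combined or which layers receive the injection, so a single scale is shown.

```python
import math

def causal_ngram_mean(token_embs, n):
    """Average each token's embedding with up to n-1 preceding tokens.

    Assumption: windows near the start of the sequence are truncated
    rather than padded, so position 0 is just its own embedding.
    """
    out = []
    for t in range(len(token_embs)):
        window = token_embs[max(0, t - n + 1): t + 1]  # never looks ahead
        dim = len(window[0])
        out.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return out

def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

def cosine_gated_inject(hidden, memory):
    """Residual injection with a non-parametric gate.

    Assumed form: gate = ReLU(cos(h, m)), then h' = h + gate * m.
    Memory vectors pointing away from the hidden state are dropped.
    """
    out = []
    for h, m in zip(hidden, memory):
        gate = max(0.0, cosine(h, m))  # ReLU on the cosine similarity
        out.append([hi + gate * mi for hi, mi in zip(h, m)])
    return out

# Toy usage with 2-dimensional "embeddings" (hypothetical values).
token_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bigram_memory = causal_ngram_mean(token_embs, n=2)
updated = cosine_gated_inject(token_embs, bigram_memory)
```

Because the gate has no learned parameters, the only inputs are the frozen embedding table and the hidden states themselves, which is what makes a training-free attachment possible in this sketch.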
2. Methodological Rigor
Strengths in evaluation design:
Weaknesses in rigor:
3. Potential Impact
Practical utility: The training-free, plug-and-play nature is genuinely appealing. Practitioners could attach NGM to any compatible checkpoint without fine-tuning, making it immediately deployable. The overhead is modest (3–16% prefill, 2–10% decode depending on sequence length).
Scope of influence: The impact is likely moderate rather than transformative. The gains are incremental, and the method is most useful for specific task types (code generation, knowledge-intensive QA). The approach doesn't fundamentally change how LLMs work—it adds a lightweight signal that helps in specific scenarios.
Multimodal extension: The preliminary result on Qwen3-VL-2B is encouraging but thin (single model, five benchmarks, small gains). It suggests the method may generalize beyond text-only models, but this needs substantially more validation.
Conceptual contribution: The finding that pretrained embedding spaces contain exploitable local-memory structure accessible via simple averaging is an interesting empirical insight. The alignment analysis (Figure 1) showing N-gram embeddings have significantly higher cosine similarity to hidden states than shuffled/random controls provides a useful characterization of embedding space geometry.
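A control comparison of the kind described for Figure 1 can be illustrated as follows. This is a hypothetical sketch, not the paper's analysis code: it assumes the control is formed by randomly re-pairing N-gram vectors with hidden states from other positions, and the function names and toy vectors are invented for illustration.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

def mean_cosine(xs, ys):
    """Mean cosine similarity over position-aligned pairs."""
    return sum(cosine(x, y) for x, y in zip(xs, ys)) / len(xs)

def shuffled_control(memory, hidden, seed=0):
    """Mean cosine after randomly re-pairing memory vectors with positions.

    Assumption: this permutation baseline stands in for the
    shuffled/random controls mentioned in the assessment.
    """
    rng = random.Random(seed)
    perm = list(range(len(memory)))
    rng.shuffle(perm)
    return mean_cosine([memory[i] for i in perm], hidden)

# Toy data (hypothetical): each memory vector is parallel to its hidden state.
hidden = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
memory = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]]

aligned = mean_cosine(memory, hidden)
control = shuffled_control(memory, hidden, seed=0)
print(f"aligned={aligned:.3f} control={control:.3f}")
```

On real model activations, an aligned score that clearly exceeds the control score is the pattern the assessment attributes to the paper's alignment analysis.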
4. Timeliness & Relevance
The paper addresses a timely concern: as LLMs become widely deployed, inference-time enhancements that require no retraining are increasingly valuable. The conditional memory / embedding scaling direction (Engram, SCONE, L3, LongCat-Flash-Lite, MeKi) is an active research area, and NGM occupies a useful position by pushing the training-free boundary. However, the concurrent work it compares against (several from early 2026) suggests a crowded field where differentiation matters.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
Additional Observations
The paper is well written and honest about limitations. The code availability enhances reproducibility. The theoretical motivation (that short-range lexical patterns can be recovered through embedding lookup rather than deep computation) is sound, though the empirical validation shows the practical benefit is modest. The work would be substantially strengthened by evaluation on additional model families and comparison against competing memory-augmentation approaches.
Generated May 19, 2026
Comparison History (20)
Paper 1 (NGM) proposes a novel, training-free plug-and-play memory module for LLMs that demonstrates consistent improvements across multiple model sizes, benchmarks, and modalities. Its training-free design and broad applicability make it highly practical and widely adoptable. Paper 2 (POLAR-Bench) introduces an important privacy-utility benchmark but is primarily a diagnostic evaluation tool rather than a methodological contribution. While timely, benchmarks typically have narrower long-term impact unless they become community standards. NGM's architectural innovation with demonstrated gains across scales suggests broader and more lasting scientific influence.
Paper 1 presents a counterintuitive and novel finding—that higher observation fidelity can hurt embodied LLM performance, and moderate perceptual noise actually improves outcomes by breaking repetitive action loops. This challenges fundamental assumptions about how LLMs function in embodied settings and has broad implications for evaluation methodology, robotics, and understanding LLM reasoning. Paper 2 proposes a useful engineering contribution (training-free memory module with modest benchmark improvements), but the gains are incremental. Paper 1's surprising insights are more likely to reshape thinking across multiple research communities.
PEEK introduces a novel conceptual framework—orientation knowledge caching via context maps—that addresses a fundamental and increasingly important problem in LLM agent systems operating over recurring contexts. Its improvements are substantially larger (6-34% vs 0.5-1.2 points), it demonstrates significant cost/efficiency gains (1.7-5.8x lower cost), and it generalizes across architectures including production systems like OpenAI Codex. The concept of reusable orientation knowledge is a fresh abstraction with broad applicability. NGM, while useful, offers modest incremental gains through a relatively straightforward n-gram averaging technique with limited conceptual novelty.
Paper 2 introduces a training-free, plug-and-play memory module that directly enhances LLM performance across various tasks and modalities. Its broad applicability, lack of training overhead, and immediate practical utility give it a higher potential for widespread adoption and real-world impact compared to Paper 1's benchmark generation pipeline, which primarily serves evaluation purposes.
Paper 2 introduces a training-free, plug-and-play memory module applicable to a wide range of LLMs, demonstrating broad performance improvements across multiple domains including coding and vision. Its broad applicability to foundation models gives it wider potential impact compared to Paper 1, which introduces a valuable but domain-specific benchmark restricted to industrial CAD generation.
Paper 2 is likely to have higher impact because it introduces a large-scale, realistic benchmark and tooling infrastructure (300+ tools across 7 stateful sandboxes) that can become a community standard for evaluating agent robustness—broadly useful across academia and industry. Its seed-driven, noise/failure simulation and trajectory-based diagnoses address timely real-world automation gaps and should catalyze new methods. Paper 1 is a clever, training-free memory add-on with modest average gains; useful, but narrower in scope and likely less field-shaping than a widely adopted benchmark.
Paper 2 presents a concrete, training-free architectural advancement for LLMs with rigorous evaluation across multiple model sizes and benchmarks. Given the foundational role of LLMs in current AI research, a plug-and-play memory module that consistently improves performance without additional training offers broader, more immediate scientific impact and high adoption potential compared to the specialized, application-focused approach of Paper 1.
Paper 1 presents a highly practical, training-free architectural innovation (N-gram Memory) that offers immediate performance gains for LLMs with minimal overhead. Its plug-and-play nature and proven efficacy across varying model scales and modalities ensure broad applicability. While Paper 2 offers an interesting cognitive benchmark, its impact is largely evaluative. Paper 1 directly solves efficiency and knowledge retrieval bottlenecks in state-of-the-art models, giving it much higher potential for widespread adoption and real-world scientific impact in the fast-paced field of LLM optimization.
Paper 2 introduces a more novel and broadly applicable framework—continuous, gradient-based search in a learned latent space for automated algorithm design—potentially impacting ML, optimization, and operations research. Its methodological components (encoder, surrogate model, normalizing flow regularization, LLM-conditioned synthesis) form a general pipeline that can extend beyond the tested problems. Paper 1 is practical and timely but more incremental (training-free memory injection with modest average gains) and likely narrower in conceptual impact, despite strong engineering value.
Paper 2 introduces a training-free, plug-and-play memory module that directly improves LLM performance across varied tasks (including code and multimodal) with only modest inference overhead. Such accessible, easily integrated architectural improvements typically see rapid, widespread adoption and high citation rates in the fast-paced LLM community. While Paper 1 introduces a valuable benchmark for agent safety, Paper 2's immediate applicability and low deployment cost offer a broader, more immediate impact on both research and practical deployments.
Paper 2 presents a training-free architectural enhancement for LLMs that improves performance across diverse benchmarks at modest inference overhead. Its algorithmic innovation and scalability offer broad implications for core AI research. Conversely, Paper 1 is primarily a software engineering framework addressing vendor lock-in; while highly practical for developers, it lacks the methodological novelty and scientific impact of Paper 2.
Paper 2 introduces a highly accessible, training-free memory module that can be seamlessly plugged into existing LLMs. Its ability to improve performance across diverse tasks—including code generation, knowledge-intensive benchmarks, and multimodal applications—without requiring additional training makes it exceptionally scalable and widely applicable. In contrast, Paper 1 offers a specialized improvement for RL-based reasoning training, which, while valuable, has a narrower scope of immediate impact.
NeuroMAS introduces a fundamentally novel conceptual framework that bridges neural network architecture design with multi-agent systems, offering both theoretical foundations and a new scaling paradigm for LLMs. Its contribution—treating multi-agent systems as trainable neural-network-like architectures with progressive growth—opens a broad new research direction with significant implications for LLM scaling, multi-agent coordination, and architecture search. Paper 1 (NGM), while practically useful, offers incremental improvements (0.5-1.2 points) via a relatively straightforward n-gram averaging mechanism with limited conceptual novelty.
Paper 1 addresses critical safety flaws in Vision-Language-Action models for autonomous driving, exposing severe real-world risks like missed pedestrians and unfaithful reasoning. Its findings have urgent, high-stakes implications for the deployment of physical AI. Paper 2 offers a clever, broadly applicable LLM efficiency module, but its incremental performance gains lack the profound safety and societal impact of Paper 1.
NGM addresses a highly relevant and timely problem in LLM efficiency and knowledge retrieval, offering a practical, training-free plug-and-play solution with demonstrated improvements across multiple benchmarks and model scales. Its broad applicability to both language and multimodal models, combined with the massive current interest in LLM improvements, gives it higher potential for citations and real-world adoption. Paper 1, while theoretically rigorous in belief function theory, addresses a more niche area of evidential reasoning with a narrower community of interest and fewer immediate practical applications.
Paper 1 likely has higher scientific impact due to stronger timeliness and broader cross-field relevance: plug-and-play, training-free memory augmentation for LLMs can affect many NLP/multimodal applications and model sizes, and is easy to adopt. Its novelty (training-free n-gram memory using pretrained embeddings plus nonparametric gating) targets a central bottleneck—efficient knowledge access. While Paper 2 is rigorous and practically valuable for Raman spectroscopy, its impact is more domain-specific and incremental relative to established Noise2Noise-style denoising.
NGM presents a training-free, plug-and-play module applicable to any LLM, demonstrating consistent improvements across multiple model sizes and diverse benchmarks including code, knowledge, and multimodal tasks. Its broad applicability to the widely-used LLM ecosystem, simplicity of integration (no training required), and demonstrated gains on established benchmarks give it higher near-term scientific impact. ScreenSearch, while novel in framing GUI exploration as a state-space search problem, addresses a narrower domain (desktop GUI agents) and primarily contributes an exploration corpus and empirical analysis rather than a widely adoptable method.
Paper 1 introduces a foundational, training-free architectural enhancement for LLMs that can be easily adopted across various models and tasks. Its plug-and-play nature and strong benchmark improvements (including multimodal) give it broader applicability and higher potential for widespread citation within the rapidly growing core LLM research community, compared to the more domain-specific, applied system presented in Paper 2.
Paper 2 has higher potential scientific impact due to broader real-world applicability and cross-field relevance: it targets persistent memory for production LLM agents (a central bottleneck), adds auditable neuro-symbolic governance (trust/safety, HCI, databases), and offers an explicit lifecycle framework for deduplication, scoping, and pruning. While Paper 1 is novel in being training-free and efficient, its gains are relatively modest and mainly incremental within model internals. Paper 2’s architecture could become a general template for deployable, trustworthy long-term agent memory systems.
Paper 1 introduces a training-free, plug-and-play memory module applicable to a wide range of LLMs without requiring gradient updates. Its broad evaluation across various model sizes, modalities, and general benchmarks (code, QA) suggests high versatility and ease of adoption. In contrast, Paper 2 presents a prompt-based memory evolution framework evaluated on a single, highly specific cybersecurity environment. Consequently, Paper 1 possesses significantly higher breadth of impact and potential for widespread integration into existing LLM architectures.