NGM: A Plug-and-Play Training-Free Memory Module for LLMs

Yuwen Qu, Wenhui Dong, Chenyang Si, Caifeng Shan

May 16, 2026

arXiv:2605.16893v1

cs.AI (primary)

#776 of 2292 · Artificial Intelligence

Tournament Score: 1446±43 (scale 1050–1800)

Win Rate: 65%

Rating: 4.8/10

Significance 4.5 · Rigor 5.5 · Novelty 5 · Clarity 7.5

Abstract

Recent studies introduce conditional memory modules that decouple knowledge storage from neural computation, enabling more direct knowledge access. Compared to MoE, which relies on dynamic computation paths, explicit lookup provides a more efficient knowledge retrieval mechanism. However, these approaches still depend on learned memory embeddings, requiring additional training and limiting flexibility. To address this, we propose N-gram Memory (NGM), a training-free, plug-and-play module composed of a Causal N-Gram Encoder and a Cosine-Gated Memory Injector. The Causal N-Gram Encoder directly averages the pretrained token embeddings of the backbone model to construct N-gram representations, thereby eliminating the need to train separate N-gram embeddings from scratch. This design requires neither an additional memory table nor a retrieval pipeline. The Cosine-Gated Memory Injector then uses a non-parametric cosine gate with ReLU to modulate the retrieved embeddings into the contextual representations. We evaluate NGM on the Qwen3 series from 0.6B to 14B across eight benchmarks. NGM improves average performance by 0.5 to 1.2 points, with particularly clear gains on code generation and knowledge-intensive tasks (e.g., +3.0 on LiveCodeBench and +3.03 on GPQA for Qwen3-14B). Moreover, NGM also improves performance in multimodal benchmarks (e.g., MMStar +1.53 on Qwen3-VL-2B).

AI Impact Assessments

(1 model)

Scientific Impact Assessment: NGM – A Plug-and-Play Training-Free Memory Module for LLMs

1. Core Contribution

NGM proposes a training-free, parameter-free memory module that can be attached to frozen LLMs at inference time. It consists of two components: (1) a Causal N-Gram Encoder that constructs multi-scale N-gram representations by averaging pretrained token embeddings within local causal windows, and (2) a Cosine-Gated Memory Injector that uses non-parametric cosine similarity with ReLU filtering to selectively inject these representations into intermediate hidden states via residual connections. The key insight is that pretrained embedding spaces already contain exploitable local-memory structure, and aggregated N-gram embeddings are geometrically aligned with hidden states at certain layers—making learned memory tables unnecessary.
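The two components can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation (which the review describes as PyTorch): the function names, the summation over N-gram sizes, the choice to compute the gate against the unmodified hidden state, and the `scale` argument (standing in for the output scale λ) are all assumptions made for illustration. It also assumes hidden states and token embeddings share the same dimension.

```python
import math

def causal_ngram_mean(emb, n):
    """For each position t, average the embeddings of tokens max(0, t-n+1)..t."""
    out = []
    for t in range(len(emb)):
        window = emb[max(0, t - n + 1): t + 1]  # causal: never looks past t
        dim = len(window[0])
        out.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return out

def cosine(a, b, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def inject(hidden, emb, ngram_sizes=(2, 3), scale=0.1):
    """Residual update: h_t + scale * sum_n relu(cos(h_t, m_t)) * m_t.

    `scale` and summing over `ngram_sizes` are illustrative assumptions.
    """
    out = [list(h) for h in hidden]  # copy so the input is left untouched
    for n in ngram_sizes:
        memory = causal_ngram_mean(emb, n)
        for t, (h, m) in enumerate(zip(hidden, memory)):
            gate = max(0.0, cosine(h, m))  # ReLU drops negatively aligned memory
            for d in range(len(h)):
                out[t][d] += scale * gate * m[d]
    return out

# Toy inputs (made up): three tokens, two dimensions.
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [[1.0, 0.0], [-1.0, 0.0], [0.5, 0.5]]
new_hidden = inject(hidden, emb, ngram_sizes=(2,), scale=0.5)
```

In this toy input the second position has negative cosine similarity with its bigram memory, so the ReLU gate closes and that hidden state passes through unchanged, while the other two positions receive a scaled copy of their memory vector.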

This addresses a real limitation of prior conditional memory approaches (Engram, SCONE, L3, MeKi) that all require training dedicated embedding parameters and often specialized infrastructure. The constraint of zero additional training is the paper's distinguishing feature.

2. Methodological Rigor

Strengths in evaluation design:

Controlled comparison across five Qwen3 scales (0.6B–14B) with identical decoding parameters

Eight diverse benchmarks spanning math, code, knowledge, and alignment

Comprehensive ablation studies isolating N-gram sizes, ReLU gating, fusion modes, and compressed tokenizer effects

Mechanistic analysis demonstrating that cosine similarity is genuinely informative (vs. shuffled/random controls) and that interactions are predominantly local

Wall-clock overhead measurements

Weaknesses in rigor:

The evaluation is restricted to a single model family (Qwen3). The authors acknowledge this but justify it on grounds of consistency—this is reasonable but limits generalizability claims. The method's effectiveness on architecturally different models (Llama, Mistral) remains unknown.

Hyperparameters (injection layers, output scale λ) are model-specific and chosen heuristically. The paper acknowledges these are "practical defaults rather than universally optimal," but the sensitivity to these choices is not fully characterized. Only one ablation configuration is shown for layer selection.

The improvements, while consistent, are modest (0.5–1.2 average points). Statistical significance is not reported. Given the stochastic nature of sampling-based decoding (temperature=0.7), it's unclear whether some per-task differences are within noise margins.

IFEval consistently degrades across most scales, suggesting the method can be harmful for instruction-following tasks—a non-trivial limitation that is somewhat underexplored.

The comparison is exclusively against vanilla Qwen3 baselines. No comparison against Engram, kNN-LM, or other memory-augmentation methods is provided, making it impossible to contextualize the magnitude of gains.
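The noise-margin concern above can be made concrete with a rough calculation. The sketch below treats each benchmark item as an independent Bernoulli trial and the two runs as unpaired, which is a simplifying assumption (a paired test on per-item outcomes would be tighter). The accuracies and item count in the usage lines are hypothetical and are not taken from the paper.

```python
import math

def accuracy_std_error(acc, n_items):
    """Binomial standard error of an accuracy measured on n_items questions."""
    return math.sqrt(acc * (1.0 - acc) / n_items)

def gain_in_std_errors(base_acc, new_acc, n_items):
    """Accuracy gain divided by the standard error of the difference.

    Assumes the two runs are independent, so their variances add.
    """
    se_diff = math.sqrt(
        accuracy_std_error(base_acc, n_items) ** 2
        + accuracy_std_error(new_acc, n_items) ** 2
    )
    return (new_acc - base_acc) / se_diff

# Hypothetical numbers: a 3-point gain on a 200-item benchmark.
z = gain_in_std_errors(base_acc=0.60, new_acc=0.63, n_items=200)
print(f"gain is {z:.2f} standard errors")
```

Under these assumptions a 3-point gain on a benchmark of a few hundred items is well under one standard error, which is why repeated runs or confidence intervals would help in interpreting the per-task differences.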

3. Potential Impact

Practical utility: The training-free, plug-and-play nature is genuinely appealing. Practitioners could attach NGM to any compatible checkpoint without fine-tuning, making it immediately deployable. The overhead is modest (3–16% prefill, 2–10% decode depending on sequence length).

Scope of influence: The impact is likely moderate rather than transformative. The gains are incremental, and the method is most useful for specific task types (code generation, knowledge-intensive QA). The approach doesn't fundamentally change how LLMs work—it adds a lightweight signal that helps in specific scenarios.

Multimodal extension: The preliminary result on Qwen3-VL-2B is encouraging but thin (single model, five benchmarks, small gains). It demonstrates generality potential but needs substantially more validation.

Conceptual contribution: The finding that pretrained embedding spaces contain exploitable local-memory structure accessible via simple averaging is an interesting empirical insight. The alignment analysis (Figure 1) showing N-gram embeddings have significantly higher cosine similarity to hidden states than shuffled/random controls provides a useful characterization of embedding space geometry.

4. Timeliness & Relevance

The paper addresses a timely concern: as LLMs become widely deployed, inference-time enhancements that require no retraining are increasingly valuable. The conditional memory / embedding scaling direction (Engram, SCONE, L3, LongCat-Flash-Lite, MeKi) is an active research area, and NGM occupies a useful position by pushing the training-free boundary. However, the concurrent work it positions itself against (several from early 2026) suggests a crowded field where differentiation matters.

5. Strengths & Limitations

Key strengths:

Elegant simplicity: the entire method is ~30 lines of PyTorch code, highly reproducible

Genuinely training-free with no learned parameters

Consistent positive average gains across all five model scales

Strong mechanistic analysis supporting design choices (alignment verification, locality analysis)

Thorough ablations identifying which components matter most (ReLU gating, multi-scale construction)

Notable weaknesses:

Single model family evaluation limits generalizability

No head-to-head comparison with existing memory-augmentation methods

Modest and inconsistent gains—harmful on IFEval, mixed on several benchmarks

Heuristic hyperparameter selection (layers, λ) without systematic search methodology

The bag-of-embeddings approximation is inherently order-insensitive, fundamentally limiting what local patterns can be captured

No analysis of when/why NGM helps at the token level—the GSM8K case studies in the appendix are anecdotal and cherry-picked
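The order-insensitivity point in the list above can be checked directly: averaging the embeddings inside a window gives the same vector for any permutation of the tokens in that window. The snippet below is a small illustration with made-up two-dimensional embeddings, not data from the paper; `mean_embedding` is a hypothetical helper name.

```python
def mean_embedding(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Made-up embeddings, chosen to be exactly representable as floats.
emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

forward = mean_embedding([emb[w] for w in ("dog", "bites", "man")])
reverse = mean_embedding([emb[w] for w in ("man", "bites", "dog")])

# Both trigrams collapse to the same memory vector despite opposite meanings.
print(forward == reverse)
```

Because the two trigrams map to an identical vector, any signal that distinguishes them has to come from the backbone's hidden states rather than from the averaged memory.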

Additional Observations

The paper is well-written and honest about limitations. The code availability enhances reproducibility. The theoretical motivation—that short-range lexical patterns can be recovered through embedding lookup rather than deep computation—is sound, though the empirical validation shows the practical benefit is modest. The work would be substantially strengthened by evaluation on additional model families and comparison against competing memory-augmentation approaches.

Rating: 4.8/10

Significance 4.5 · Rigor 5.5 · Novelty 5 · Clarity 7.5

Generated May 19, 2026

Comparison History (20)

vs. POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

claude-opus-4.6 · 5/20/2026

Paper 1 (NGM) proposes a novel, training-free plug-and-play memory module for LLMs that demonstrates consistent improvements across multiple model sizes, benchmarks, and modalities. Its training-free design and broad applicability make it highly practical and widely adoptable. Paper 2 (POLAR-Bench) introduces an important privacy-utility benchmark but is primarily a diagnostic evaluation tool rather than a methodological contribution. While timely, benchmarks typically have narrower long-term impact unless they become community standards. NGM's architectural innovation with demonstrated gains across scales suggests broader and more lasting scientific influence.

vs. Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving

claude-opus-4.6 · 5/20/2026

Paper 1 presents a counterintuitive and novel finding—that higher observation fidelity can hurt embodied LLM performance, and moderate perceptual noise actually improves outcomes by breaking repetitive action loops. This challenges fundamental assumptions about how LLMs function in embodied settings and has broad implications for evaluation methodology, robotics, and understanding LLM reasoning. Paper 2 proposes a useful engineering contribution (training-free memory module with modest benchmark improvements), but the gains are incremental. Paper 1's surprising insights are more likely to reshape thinking across multiple research communities.

vs. PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

claude-opus-4.6 · 5/20/2026

PEEK introduces a novel conceptual framework—orientation knowledge caching via context maps—that addresses a fundamental and increasingly important problem in LLM agent systems operating over recurring contexts. Its improvements are substantially larger (6-34% vs 0.5-1.2 points), it demonstrates significant cost/efficiency gains (1.7-5.8x lower cost), and it generalizes across architectures including production systems like OpenAI Codex. The concept of reusable orientation knowledge is a fresh abstraction with broad applicability. NGM, while useful, offers modest incremental gains through a relatively straightforward n-gram averaging technique with limited conceptual novelty.

vs. A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

gemini-3.1 · 5/19/2026

Paper 2 introduces a training-free, plug-and-play memory module that directly enhances LLM performance across various tasks and modalities. Its broad applicability, lack of training overhead, and immediate practical utility give it a higher potential for widespread adoption and real-world impact compared to Paper 1's benchmark generation pipeline, which primarily serves evaluation purposes.

vs. BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

gemini-3.1 · 5/19/2026

Paper 2 introduces a training-free, plug-and-play memory module applicable to a wide range of LLMs, demonstrating broad performance improvements across multiple domains including coding and vision. Its broad applicability to foundation models gives it wider potential impact compared to Paper 1, which introduces a valuable but domain-specific benchmark restricted to industrial CAD generation.

vs. ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

gpt-5.2 · 5/19/2026

Paper 2 is likely to have higher impact because it introduces a large-scale, realistic benchmark and tooling infrastructure (300+ tools across 7 stateful sandboxes) that can become a community standard for evaluating agent robustness—broadly useful across academia and industry. Its seed-driven, noise/failure simulation and trajectory-based diagnoses address timely real-world automation gaps and should catalyze new methods. Paper 1 is a clever, training-free memory add-on with modest average gains; useful, but narrower in scope and likely less field-shaping than a widely adopted benchmark.

vs. Sustainable Intelligence for the Wild: Democratizing Ecological Monitoring via Knowledge-Adaptive Edge Expert Agents

gemini-3.1 · 5/19/2026

Paper 2 presents a concrete, training-free architectural advancement for LLMs with rigorous evaluation across multiple model sizes and benchmarks. Given the foundational role of LLMs in current AI research, a plug-and-play memory module that consistently improves performance without additional training offers broader, more immediate scientific impact and high adoption potential compared to the specialized, application-focused approach of Paper 1.

vs. Evaluating Cognitive Age Alignment in Interactive AI Agents

gemini-3.1 · 5/19/2026

Paper 1 presents a highly practical, training-free architectural innovation (N-gram Memory) that offers immediate performance gains for LLMs with minimal overhead. Its plug-and-play nature and proven efficacy across varying model scales and modalities ensure broad applicability. While Paper 2 offers an interesting cognitive benchmark, its impact is largely evaluative. Paper 1 directly solves efficiency and knowledge retrieval bottlenecks in state-of-the-art models, giving it much higher potential for widespread adoption and real-world scientific impact in the fast-paced field of LLM optimization.

vs. Latent Heuristic Search: Continuous Optimization for Automated Algorithm Design

gpt-5.2 · 5/19/2026

Paper 2 introduces a more novel and broadly applicable framework—continuous, gradient-based search in a learned latent space for automated algorithm design—potentially impacting ML, optimization, and operations research. Its methodological components (encoder, surrogate model, normalizing flow regularization, LLM-conditioned synthesis) form a general pipeline that can extend beyond the tested problems. Paper 1 is practical and timely but more incremental (training-free memory injection with modest average gains) and likely narrower in conceptual impact, despite strong engineering value.

vs. When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents

gemini-3.1 · 5/19/2026

Paper 2 introduces a training-free, plug-and-play memory module that directly improves LLM performance across varied tasks (including code and multimodal) without additional computational overhead. Such highly accessible, easily integrable architectural improvements typically see rapid, widespread adoption and high citation rates in the fast-paced LLM community. While Paper 1 introduces a valuable benchmark for agent safety, Paper 2's immediate applicability and efficiency gains offer a broader, more immediate impact on both research and practical deployments.

vs. Accelerating AI-Powered Research: The PuppyChatter Framework for Usable and Flexible Tooling

gemini-3.1 · 5/19/2026

Paper 2 presents a fundamental, training-free architectural enhancement for LLMs that demonstrably improves performance across diverse benchmarks without extra computational overhead. Its algorithmic innovation and scalability offer broad, significant implications for core AI research. Conversely, Paper 1 is primarily a software engineering framework addressing vendor lock-in; while highly practical for developers, it lacks the deep methodological novelty and transformative scientific impact characteristic of Paper 2.

vs. Selective Off-Policy Reference Tuning with Plan Guidance

gemini-3.1 · 5/19/2026

Paper 2 introduces a highly accessible, training-free memory module that can be seamlessly plugged into existing LLMs. Its ability to improve performance across diverse tasks—including code generation, knowledge-intensive benchmarks, and multimodal applications—without requiring additional training makes it exceptionally scalable and widely applicable. In contrast, Paper 1 offers a specialized improvement for RL-based reasoning training, which, while valuable, has a narrower scope of immediate impact.

vs. NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning

claude-opus-4.6 · 5/19/2026

NeuroMAS introduces a fundamentally novel conceptual framework that bridges neural network architecture design with multi-agent systems, offering both theoretical foundations and a new scaling paradigm for LLMs. Its contribution—treating multi-agent systems as trainable neural-network-like architectures with progressive growth—opens a broad new research direction with significant implications for LLM scaling, multi-agent coordination, and architecture search. Paper 1 (NGM), while practically useful, offers incremental improvements (0.5-1.2 points) via a relatively straightforward n-gram averaging mechanism with limited conceptual novelty.

vs. Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation

gemini-3.1 · 5/19/2026

Paper 1 addresses critical safety flaws in Vision-Language-Action models for autonomous driving, exposing severe real-world risks like missed pedestrians and unfaithful reasoning. Its findings have urgent, high-stakes implications for the deployment of physical AI. Paper 2 offers a clever, broadly applicable LLM efficiency module, but its incremental performance gains lack the profound safety and societal impact of Paper 1.

vs. Evidential Information Fusion on Possibilistic Structure

claude-opus-4.6 · 5/19/2026

NGM addresses a highly relevant and timely problem in LLM efficiency and knowledge retrieval, offering a practical, training-free plug-and-play solution with demonstrated improvements across multiple benchmarks and model scales. Its broad applicability to both language and multimodal models, combined with the massive current interest in LLM improvements, gives it higher potential for citations and real-world adoption. Paper 1, while theoretically rigorous in belief function theory, addresses a more niche area of evidential reasoning with a narrower community of interest and fewer immediate practical applications.

vs. A Practical Noise2Noise Denoising Pipeline for High-Throughput Raman Spectroscopy

gpt-5.2 · 5/19/2026

Paper 1 likely has higher scientific impact due to stronger timeliness and broader cross-field relevance: plug-and-play, training-free memory augmentation for LLMs can affect many NLP/multimodal applications and model sizes, and is easy to adopt. Its novelty (training-free n-gram memory using pretrained embeddings plus nonparametric gating) targets a central bottleneck—efficient knowledge access. While Paper 2 is rigorous and practically valuable for Raman spectroscopy, its impact is more domain-specific and incremental relative to established Noise2Noise-style denoising.

vs. ScreenSearch: Uncertainty-Aware OS Exploration

claude-opus-4.6 · 5/19/2026

NGM presents a training-free, plug-and-play module applicable to any LLM, demonstrating consistent improvements across multiple model sizes and diverse benchmarks including code, knowledge, and multimodal tasks. Its broad applicability to the widely-used LLM ecosystem, simplicity of integration (no training required), and demonstrated gains on established benchmarks give it higher near-term scientific impact. ScreenSearch, while novel in framing GUI exploration as a state-space search problem, addresses a narrower domain (desktop GUI agents) and primarily contributes an exploration corpus and empirical analysis rather than a widely adoptable method.

vs. GraphMind: From Operational Traces to Self-Evolving Workflow Automation

gemini-3.1 · 5/19/2026

Paper 1 introduces a foundational, training-free architectural enhancement for LLMs that can be easily adopted across various models and tasks. Its plug-and-play nature and strong benchmark improvements (including multimodal) give it broader applicability and higher potential for widespread citation within the rapidly growing core LLM research community, compared to the more domain-specific, applied system presented in Paper 2.

vs. NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents

gpt-5.2 · 5/19/2026

Paper 2 has higher potential scientific impact due to broader real-world applicability and cross-field relevance: it targets persistent memory for production LLM agents (a central bottleneck), adds auditable neuro-symbolic governance (trust/safety, HCI, databases), and offers an explicit lifecycle framework for deduplication, scoping, and pruning. While Paper 1 is novel in being training-free and efficient, its gains are relatively modest and mainly incremental within model internals. Paper 2’s architecture could become a general template for deployable, trustworthy long-term agent memory systems.

vs. FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

gemini-3.1 · 5/19/2026

Paper 1 introduces a training-free, plug-and-play memory module applicable to a wide range of LLMs without requiring gradient updates. Its broad evaluation across various model sizes, modalities, and general benchmarks (code, QA) suggests high versatility and ease of adoption. In contrast, Paper 2 presents a prompt-based memory evolution framework evaluated on a single, highly specific cybersecurity environment. Consequently, Paper 1 possesses significantly higher breadth of impact and potential for widespread integration into existing LLM architectures.