Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation
Soumyadeep Jana, Sagar Nishad, Sanasam Ranbir Singh
Abstract
The Key-Value (KV) cache remains a major bottleneck for deploying Large Language Models (LLMs) in long-generation tasks. Prior work often applies uniform compression across both prefill and decoding caches, but compressing the prefill cache degrades performance by corrupting critical context. While preserving the prefill cache is essential, decoding-phase compression remains underexplored, with existing methods relying on rigid recency windows or instantaneous attention. Our analysis of attention dynamics reveals strong temporal patterns: critical tokens receive sustained attention over long horizons, while local reasoning involves short-lived bursts. Static heuristics fail to capture this behavior, leading to premature eviction of important tokens or retention of stale ones. We propose Moment-KV, a decoding-time KV cache compression method based on momentum-driven temporal attention aggregation. Our method models token importance as a continuously evolving state, where attention is aggregated with decay, capturing both long-term influence and recent relevance. Experiments show that Moment-KV significantly improves generation fidelity in long-generation tasks (2.3-3.2%) while maintaining decoding latency.
AI Impact Assessments
Scientific Impact Assessment: Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation
1. Core Contribution
Moment-KV introduces a momentum-based temporal attention aggregation strategy for compressing the KV cache specifically during the decoding phase of LLM inference. The key insight is that token importance should be modeled as a continuously evolving state rather than assessed via instantaneous attention snapshots. The method maintains an exponentially decaying running average of attention weights per token (Equation 5: I_i(t) = α·I_i(t-1) + ā_i(t)), evicting tokens with the lowest accumulated importance when a fixed decoding budget is exceeded. The prefill cache is kept intact, and compression is applied only to the growing decoding cache.
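The update-and-evict loop described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the class name `MomentumDecodeCache`, the default `alpha`, the choice to start a new token's score at 0.0, and the choice to evict before appending the new token are all assumptions made for the example. The `attn` argument stands in for ā_i(t), assumed here to be an already aggregated attention weight per cached decode token; the prefill cache is left out because the method does not touch it.

```python
from typing import List, Optional, Sequence


class MomentumDecodeCache:
    """Toy bookkeeping for a budgeted decode-time cache (hypothetical helper).

    Tracks only token ids and importance scores, not real key/value tensors.
    """

    def __init__(self, budget: int, alpha: float = 0.9) -> None:
        if budget < 1:
            raise ValueError("budget must be at least 1")
        self.budget = budget
        self.alpha = alpha
        self.token_ids: List[int] = []
        self.importance: List[float] = []

    def step(self, new_token_id: int, attn: Sequence[float]) -> Optional[int]:
        """Apply one decode step; return the evicted token id, if any.

        attn[i] is the attention the current query places on the i-th
        cached decode token (one value per token currently in the cache).
        """
        if len(attn) != len(self.token_ids):
            raise ValueError("need one attention value per cached token")

        # Equation 5 as quoted in the review: I_i(t) = alpha * I_i(t-1) + a_i(t)
        self.importance = [
            self.alpha * score + a for score, a in zip(self.importance, attn)
        ]

        evicted = None
        if len(self.token_ids) >= self.budget:
            # Budget would be exceeded: drop the lowest accumulated importance.
            victim = min(
                range(len(self.importance)), key=self.importance.__getitem__
            )
            evicted = self.token_ids.pop(victim)
            self.importance.pop(victim)

        # Assumption: a freshly generated token starts with zero importance.
        self.token_ids.append(new_token_id)
        self.importance.append(0.0)
        return evicted


cache = MomentumDecodeCache(budget=3, alpha=0.5)
steps = [(0, []), (1, [1.0]), (2, [0.6, 0.4]), (3, [0.5, 0.1, 0.4])]
evictions = [cache.step(token, attn) for token, attn in steps]
print(evictions, cache.token_ids)
```

In this toy run, token 1 is evicted at the fourth step because its accumulated score is the lowest, even though token 2 is newer. Evicting before appending is one simple way to keep a zero-scored new token from being dropped immediately; how the paper handles newly generated tokens is not stated in this review.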
The paper identifies two specific limitations of prior work: (L1) temporal instability of attention, which causes premature eviction of "heavy hitter" tokens during temporary dips, and (L2) inefficiency of fixed recency windows, where most reserved slots hold low-utility tokens. The proposed solution is conceptually simple, essentially an exponential moving average (EMA) of attention scores, but well motivated by the empirical observations.
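Limitation (L1) can be illustrated with a small made-up example. The attention traces below are invented for illustration and are not taken from the paper; `momentum_scores` is a hypothetical helper that applies the recurrence I(t) = α·I(t-1) + a(t) with an assumed α of 0.9. One token receives sustained attention with a single dip, the other receives a short burst at the same step.

```python
from typing import List, Sequence


def momentum_scores(trace: Sequence[float], alpha: float) -> List[float]:
    """Return the decayed running score after each step of an attention trace."""
    score = 0.0
    history = []
    for a in trace:
        score = alpha * score + a
        history.append(score)
    return history


# Invented per-step attention weights (assumption, for illustration only).
heavy_hitter = [0.5, 0.5, 0.05, 0.5]   # sustained attention with one dip
burst_token = [0.1, 0.1, 0.3, 0.1]     # short-lived burst at the same step
dip = 2                                # step at which an eviction is forced

# Snapshot policy: evict whichever token has the lower attention right now.
instant_victim = (
    "heavy_hitter" if heavy_hitter[dip] < burst_token[dip] else "burst_token"
)

# Momentum policy: evict whichever token has the lower accumulated score.
hh = momentum_scores(heavy_hitter, alpha=0.9)
bt = momentum_scores(burst_token, alpha=0.9)
momentum_victim = "heavy_hitter" if hh[dip] < bt[dip] else "burst_token"

print(instant_victim, momentum_victim)
```

With these numbers the snapshot policy evicts the heavy hitter during its dip, while the accumulated score keeps it in the cache and removes the burst token instead. A different α or different traces could change the outcome, so this shows the mechanism rather than any reported result.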
2. Methodological Rigor
Strengths in methodology:
Weaknesses:
3. Potential Impact
The paper addresses a genuine and growing problem: KV cache memory consumption during long-output generation. As LLMs are increasingly deployed for code generation, long-form writing, and multi-turn dialogue, decode-time memory management becomes critical.
However, the practical impact may be limited by several factors:
The compatibility with prefill compression methods (Table 2) is a practical strength that could encourage adoption as a modular component.
4. Timeliness & Relevance
The paper is timely. Decode-time KV cache compression is indeed underexplored compared to prefill-time compression, and the distinction between these two phases is important. The focus on long-generation tasks (as opposed to long-context understanding) addresses an emerging use case. The observation that prefill cache should remain intact aligns with concurrent findings in the field (FlowKV, SCOPE).
The paper correctly identifies that most prior work either compresses prefill or applies uniform strategies, and positions itself well within this gap. However, the competitive landscape is evolving rapidly, and the simplicity of the approach means it could be easily replicated or superseded.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The paper's writing is generally clear, though some claims are overstated given the evidence. The equation numbering has errors (Eq. 7 is referenced in the text discussing eviction but the equation defines overflow). The benchmark selection is appropriate for the stated goals, though broader evaluation (e.g., on coding tasks or multi-turn dialogue) would strengthen the claims. The method's simplicity is both a strength (reproducibility, low overhead) and a weakness (limited novelty).
Generated May 29, 2026
Comparison History (22)
Paper 2 likely has higher impact due to strong real-world applicability and timeliness: KV-cache memory is a key deployment bottleneck for long-context generation, and decode-time compression directly improves serving efficiency with minimal quality loss. The momentum-based temporal attention aggregation is a broadly applicable systems/algorithm idea that could transfer across models and inference stacks, affecting many downstream applications. Paper 1 is novel for post-training stability in low-data SFT→RL, but its impact is narrower (mainly RLHF/RLAIF pipelines) and more benchmark-dependent.
Paper 1 bridges AI and physical sciences by utilizing LLM agents for battery parameter estimation, a critical bottleneck in energy storage innovation. Reframing inverse physics problems as reasoning tasks rather than black-box optimization offers a highly novel paradigm. While Paper 2 provides a valuable efficiency improvement for LLM inference, Paper 1 has broader cross-disciplinary impact and addresses a pressing real-world global challenge in battery technology.
Paper 2 addresses a fundamental bottleneck in LLM deployment (KV cache size during long generation). Its momentum-based approach to dynamic token importance has broad implications for improving the efficiency and scalability of large language models across numerous domains. While Paper 1 presents a practical solution for on-device GUI agents, the foundational nature and widespread applicability of improving core LLM generation efficiency give Paper 2 a higher potential for broad scientific impact.
Paper 1 identifies a fundamental reasoning failure mode in masked diffusion models, revealing that the widely-adopted confidence-based decoding strategy is inherently misaligned with complex reasoning requirements. This is a deeper, more conceptual contribution that challenges prevailing assumptions in a rapidly growing field (diffusion language models). It provides rigorous analysis across five tasks with clear theoretical insight. Paper 2 offers an incremental improvement (2.3-3.2%) to KV cache compression—a well-studied optimization problem—using a relatively straightforward momentum-based approach, with narrower impact scope.
GRASP addresses a broader and more impactful problem—systematic self-improvement of LLM agents with regression-aware gating—demonstrating large performance gains (40.6% to 88.8%) across multiple models and domains. Its contributions (gated skill libraries, cross-model transfer, regression budgets) are more novel and generalizable than Moment-KV's incremental improvement to KV cache compression (2.3-3.2% gains). GRASP's clinical applications, cross-domain generalization, and insights about skill transfer asymmetry give it wider potential impact across AI safety, agent reliability, and practical deployment.
Paper 2 addresses a critical and highly timely bottleneck in Large Language Models (KV cache memory in long generation), a widely adopted technology across numerous fields. Its momentum-driven compression approach offers broad applicability to LLM deployment. While Paper 1 provides valuable contributions to sustainable energy management, the rapid adoption and massive scale of LLM research give Paper 2 a significantly higher potential for immediate, broad scientific impact and widespread practical integration.
Paper 2 addresses a timely and broadly impactful question about AI agents in scientific research, offering concrete empirical evidence about failure modes (symptom-fixing vs. root-cause resolution, inability to reconsider architectural choices). Its insights generalize beyond physics to any domain using AI coding agents, making it relevant across scientific disciplines. It also contributes actionable supervision practices. While Paper 1 offers a solid incremental improvement in KV cache compression (2.3-3.2% gains), Paper 2's findings about fundamental limitations of AI agents and the critical role of human supervision design are more likely to shape research practices and AI development broadly.
MiraBench addresses a fundamental gap in evaluating robotic world models—shifting from visual fidelity to action-conditioned reliability—with a comprehensive benchmark spanning 12 model configurations and 16,000+ human annotations. It reveals important findings (visual fidelity ≠ action fidelity, scaling doesn't help, pervasive optimism bias) that could reshape how the robotics community develops and evaluates world models. Paper 2 offers an incremental improvement (2.3-3.2%) to KV cache compression, a well-studied area with many competing methods. MiraBench's broader scope, novel evaluation framework, and cross-cutting implications give it higher potential impact.
Paper 1 addresses a fundamental and widespread bottleneck in Large Language Models—KV cache memory during long-context generation. By introducing a novel momentum-based approach to decode-time compression, it offers a foundational algorithmic improvement applicable across nearly all LLM architectures and domains. In contrast, Paper 2 presents a domain-specific (financial) benchmark. While highly valuable for evaluating model reliability in fintech, Paper 1 has a significantly broader scientific impact, greater timeliness regarding current LLM scaling challenges, and wider real-world applicability for efficient AI deployment.
DOMINO introduces a fundamentally new paradigm (inductive vs. deductive) for domain-specific data synthesis, addressing a broadly applicable problem with theoretical guarantees and strong empirical results. Its framework for learning domain representations from examples without explicit descriptions has wide applicability across many domains and tasks. Moment-KV, while solving a real engineering bottleneck in KV cache compression, is more incremental—applying momentum-based temporal aggregation to an existing line of work. Paper 2's broader applicability, theoretical contributions, and paradigm-shifting framing give it higher potential impact.
Moment-KV addresses a fundamental scalability bottleneck (KV cache compression) for LLM deployment, proposing a novel momentum-based temporal aggregation method with clear empirical improvements. It has broader impact across the entire LLM ecosystem, affecting diverse downstream applications. Paper 1, while relevant to AI in education, has a narrow scope (testing two specific tools on one classification task), small effect sizes, and findings that are somewhat expected (few-shot prompting helps; model updates don't guarantee improvement). Its contributions are incremental and domain-specific, limiting broader scientific impact.
Paper 1 addresses a fundamental infrastructural bottleneck in LLM deployment (KV cache size). Its technical contribution offers immediate, broad applications across all AI domains relying on long-context generation. Paper 2, while providing valuable empirical insights, focuses on a highly speculative and narrow niche (DeFi AI agents) and is largely observational rather than introducing a core methodological advancement. Thus, Paper 1 has much higher potential for broad scientific and practical impact.
Moment-KV addresses a critical and widely relevant bottleneck (KV cache compression) for deploying LLMs at scale, which has broad impact across the rapidly growing LLM ecosystem. It introduces a novel momentum-based temporal aggregation mechanism with demonstrated improvements. Paper 1, while methodologically sound, is a benchmarking study within a narrower domain (EEG foundation models) that concludes no single positional encoding strategy wins universally—a somewhat incremental finding with limited actionable insight. Paper 2's practical applicability to the massive LLM deployment landscape gives it significantly higher potential impact.
Paper 2 addresses a fundamental and underexplored theoretical problem—compositional incoherence in multi-agent LLM systems—with rigorous mathematical formalization (coherent polytopes, Rayleigh-quotient bounds, e-processes). It introduces novel concepts (compositional residual ε*), provides both theoretical characterization and practical remedies, and has broad implications for the rapidly growing field of multi-agent AI systems. Paper 1, while practically useful, offers an incremental improvement to KV cache compression with modest gains (2.3-3.2%), addressing a well-studied engineering bottleneck rather than opening a new research direction.
Paper 2 addresses a critical bottleneck in Large Language Models (KV cache memory during long generation), offering a fundamental methodological improvement with broad applicability across AI deployments. Its focus on foundational model efficiency promises widespread, immediate impact across numerous downstream applications. In contrast, Paper 1 presents a more niche, albeit valuable, application of AI in educational technology with preliminary, small-scale results.
Paper 2 likely has higher scientific impact because it introduces a broadly applicable benchmark framework for evaluating scientific hypothesis generation under progressive information disclosure, addressing an urgent and widely relevant need for measuring LLM “AI scientist” capabilities. Its potential applications span model evaluation, training, and governance across many scientific domains, and the framework can become a standard reference point. Paper 1 is a solid systems contribution with clear deployment value, but its impact is narrower (decode-time KV cache optimization) and the reported gains are incremental.
Paper 1 offers higher scientific impact by introducing a rigorous statistical framework (conformal prediction) to certify LLM reasoning prefixes. While Paper 2 addresses a critical engineering bottleneck (KV cache memory) with a practical momentum-based heuristic, Paper 1 solves a fundamental theoretical problem in AI reliability and safety. By providing formal guarantees for process-level reasoning steps, Paper 1 bridges process supervision, uncertainty quantification, and model repair. This is likely to spark a broader foundational research direction in certified AI reasoning, whereas Paper 2 represents a valuable but narrower architectural optimization.
Paper 2 addresses a fundamental theoretical challenge in deep learning—quantifying simplicity and generalization. By introducing a novel metric and differentiable regularizer, its methodology demonstrates broad applicability across diverse domains like vision, text, and reinforcement learning. In contrast, while Paper 1 offers a timely and practical engineering optimization for LLM efficiency, its scope is narrower and tied specifically to current transformer architectures.
Moment-KV addresses a concrete, well-defined technical bottleneck (KV cache compression) in LLM deployment with a novel momentum-based approach, demonstrated quantitative improvements, and broad applicability across all long-generation LLM tasks. Paper 1 proposes a conceptual architecture for educational AI without empirical validation. While important, it is more domain-specific and lacks the methodological rigor and measurable results that drive scientific impact. Paper 2's contribution is more likely to be widely adopted and cited across the LLM research community.
Paper 2 offers deeper mechanistic insights into RLVR training dynamics for LLMs, a highly active research area. It introduces novel analytical tools (T-SAE for internal feature dynamics), reveals non-obvious findings about sample difficulty's non-monotonic effects, and proposes actionable difficulty-adaptive strategies. Its breadth of impact is larger—affecting how RLVR training is designed across reasoning tasks. Paper 1 addresses an important but more incremental engineering problem (KV cache compression) with modest improvements (2.3-3.2%), and its scope is narrower, focusing on decode-time efficiency.