From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

Zhanchao Xu, Haoyang Li, Qingfa Xiao, Fei Teng, Chen Jason Zhang, Lei Chen, Qing Li

Jun 8, 2026 · arXiv:2606.09508v1

cs.AI · cs.CL

#407 of 3489 · Artificial Intelligence

Tournament Score: 1494 ± 40

Win Rate: 50%

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.8 · Novelty 6 · Clarity 6.5

Abstract

Existing sparse attention and KV cache compression methods for long-context LLM inference typically apply fixed sparsity patterns or uniform budgets across all attention heads, overlooking the substantial variation in attention behavior among heads and contexts. We observe two distinct entropy patterns among attention heads: Rigid Heads, whose entropy stays near zero across input segments, and Dynamic Heads, whose entropy fluctuates significantly. Crucially, the distribution of these types is context-dependent and cannot be predetermined offline. We therefore propose EntropyInfer, a training-free framework that uses attention entropy to adaptively allocate compute at the granularity of individual heads and segments during prefilling. For decoding, we introduce a latent KV cache compression scheme that leverages generated output tokens, rather than prefill tokens alone, to identify and retain the most critical cache entries. Extensive experiments on Llama, Qwen and openPangu model series show that EntropyInfer consistently outperforms baselines including SnapKV, AdaKV, and CritiPrefill, achieving up to 2.39$\times$ end-to-end speedup beyond 100k tokens with minimal quality degradation compared to full attention. The code is released at https://github.com/SHA-4096/EntropyInfer.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs"

1. Core Contribution

The paper introduces EntropyInfer, a training-free framework for accelerating long-context LLM inference by exploiting per-head attention entropy as an online signal for adaptive compute allocation. The key insight is the identification of two head regimes—Rigid Heads (near-zero entropy, deterministic attention) and Dynamic Heads (fluctuating entropy, context-dependent attention)—and the observation that this categorization is context-dependent, invalidating offline profiling approaches like RazorAttention and Duo-Attention.
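To make the Rigid/Dynamic distinction concrete, here is a minimal sketch of how per-segment attention entropy could separate the two head types. It is not the paper's implementation: the function names, the toy attention rows, and the choice to apply the threshold (the e_t = 10⁻⁵ value cited later in this assessment) directly to the entropy rather than to its fluctuation are all assumptions made for illustration.

```python
import math

def attention_entropy(probs):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def classify_head(segment_entropies, threshold=1e-5):
    """Assumed rule: 'rigid' if entropy stays under `threshold` in every segment."""
    return "rigid" if all(e <= threshold for e in segment_entropies) else "dynamic"

# Toy head A: all attention mass lands on a single key in every segment.
head_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
# Toy head B: sharp in one segment, uniform in the other.
head_b = [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]

entropies_a = [attention_entropy(p) for p in head_a]
entropies_b = [attention_entropy(p) for p in head_b]
label_a = classify_head(entropies_a)
label_b = classify_head(entropies_b)
print(label_a, label_b)
```

Because the labels depend on the attention rows themselves, the same head can come out rigid on one input and dynamic on another, which is the context dependence the paper argues offline profiling cannot capture.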

The framework operates in two stages: (1) Entropy-guided sparse prefilling, which allocates variable block budgets per head and per segment based on entropy fluctuation; (2) Latent KV cache compression, which delays cache eviction until a few output tokens have been generated, using them to re-rank cache entries. This addresses the known prefill-to-decode attention shift that undermines existing eviction methods (SnapKV, AdaKV).
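The delayed-eviction idea in stage (2) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the function name `select_kv_entries`, the toy attention values, and the choice to score each cache entry by summing the attention it receives from the generated tokens are hypothetical; the paper's actual re-ranking rule is not described in this assessment.

```python
def select_kv_entries(decode_attention, keep):
    """Return indices of the `keep` cache entries that received the most total
    attention from the generated tokens (ties go to the earlier index)."""
    num_entries = len(decode_attention[0])
    scores = [sum(row[i] for row in decode_attention) for i in range(num_entries)]
    ranked = sorted(range(num_entries), key=lambda i: (-scores[i], i))
    return sorted(ranked[:keep])

# Rows: attention of N_d = 2 generated tokens over 5 cached prefill positions.
decode_attention = [
    [0.05, 0.60, 0.05, 0.25, 0.05],
    [0.10, 0.50, 0.05, 0.30, 0.05],
]
kept = select_kv_entries(decode_attention, keep=2)
print(kept)
```

The point of the sketch is the timing: the ranking uses attention from tokens produced during decoding, so the full cache has to stay resident until those N_d tokens exist.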

2. Methodological Rigor

Strengths in methodology:

The observation about Rigid vs. Dynamic heads is backed by clear empirical evidence (Figures 1 and 2), showing entropy heatmaps across heads/layers on different datasets (GovReport vs. Musique).

The observation attention matrix construction (Algorithms 1-2) is a practical approximation that avoids quadratic cost for entropy estimation.

The complexity analysis (Appendix A) demonstrates near-linear scaling with sequence length.

Concerns:

The entropy threshold (e_t = 10⁻⁵) is presented as fixed without extensive justification or sensitivity analysis. While the budget sensitivity analysis shows robustness to prefill budget, the threshold itself is a critical design choice.

The budget allocation formula (Algorithm 3, line 10) involves multiple hyperparameters (α=0.5, Δ_t=0.4) whose selection rationale is not discussed. The clipping to [B₀, 3·B₀] is ad hoc.

The observation attention matrix uses max/min representations of segments, which is a coarse approximation. The paper does not analyze how faithfully this approximation captures true entropy patterns.

The "latent decode" component, while motivated by recent findings about attention shift, is relatively straightforward—essentially delaying compression by N_d tokens and using those tokens' attention for re-ranking. The novelty here is incremental.
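For readers unfamiliar with the clipping step mentioned in the budget-allocation concern above, the sketch below shows what clamping a per-head, per-segment budget to [B₀, 3·B₀] looks like. Only the clipping range and the values α = 0.5 and Δ_t = 0.4 come from this assessment. The scaling rule in `toy_budget`, including the roles given to α and Δ_t, is invented for illustration and should not be read as Algorithm 3.

```python
def clip_budget(raw_budget, base_budget):
    """Clamp a block budget to the range [B0, 3 * B0]."""
    return max(base_budget, min(raw_budget, 3 * base_budget))

def toy_budget(fluctuation, base_budget, alpha=0.5, delta_t=0.4):
    """HYPOTHETICAL rule (not the paper's formula): grow the budget linearly
    with the entropy fluctuation in excess of delta_t, then clip."""
    excess = max(0.0, fluctuation - delta_t)
    raw = base_budget * (1.0 + excess / alpha)
    return clip_budget(round(raw), base_budget)

base = 16  # B0, an arbitrary example value
for fluctuation in (0.1, 0.9, 5.0):
    print(fluctuation, toy_budget(fluctuation, base))
```

Whatever the true formula is, the clip means no head or segment receives fewer than B₀ blocks or more than three times that, which is why the choice of bounds matters as much as the scaling rule inside them.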

3. Experimental Evaluation

The experiments are comprehensive across multiple dimensions:

Models: Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, openPangu-Embedded (1B and 7B)

Benchmarks: LongBench (16 datasets) and InfiniteBench (9 datasets)

Baselines: SnapKV, AdaKV, CritiPrefill

Results show EntropyInfer achieves the best average scores on both benchmarks and both model families, with up to 2.39× end-to-end speedup at 140K tokens. However, several observations temper the impact:

The improvements over baselines on LongBench are relatively small (e.g., 48.55 vs. 48.32 for CritiPrefill on Llama). The margin is narrow enough that individual dataset variation could be significant.

On InfiniteBench with Qwen, the method shows substantial improvements (37.92 vs. 36.50 for CritiPrefill), but still trails the base model significantly (39.62).

The ablation study (Figure 6a) shows only marginal differences between configurations, making it harder to attribute gains to specific components.

The latency evaluation (Figure 4) convincingly demonstrates speedup at long contexts, but the method provides no benefit at shorter contexts (<16K tokens).
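The margins quoted in the observations above are easier to compare side by side. The short calculation below uses only the scores reported in this assessment; the dictionary layout and variable names are illustrative.

```python
# Average scores as quoted in this assessment.
longbench_llama = {"EntropyInfer": 48.55, "CritiPrefill": 48.32}
infinitebench_qwen = {"EntropyInfer": 37.92, "CritiPrefill": 36.50, "Base model": 39.62}

gain_longbench = longbench_llama["EntropyInfer"] - longbench_llama["CritiPrefill"]
gain_infinitebench = infinitebench_qwen["EntropyInfer"] - infinitebench_qwen["CritiPrefill"]
gap_to_base = infinitebench_qwen["Base model"] - infinitebench_qwen["EntropyInfer"]

print(f"LongBench (Llama) gain over CritiPrefill:    {gain_longbench:.2f} points")
print(f"InfiniteBench (Qwen) gain over CritiPrefill: {gain_infinitebench:.2f} points")
print(f"InfiniteBench (Qwen) gap to base model:      {gap_to_base:.2f} points")
```

On these figures, the remaining gap to the base model on InfiniteBench with Qwen (1.70 points) is larger than the gain over CritiPrefill (1.42 points), and the LongBench gain on Llama is 0.23 points.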

4. Timeliness & Relevance

This paper addresses a highly relevant problem—efficient long-context LLM inference—which is a major deployment bottleneck. The trend toward million-token context windows makes prefill acceleration and KV cache compression increasingly important. The training-free, drop-in nature of EntropyInfer makes it practically deployable. The paper correctly identifies that static head profiling is insufficient and that online, context-dependent adaptation is needed—this is a timely and well-motivated insight.

The latent decode idea builds directly on recent findings (LoopServe, LouisKV) about prefill-decode attention mismatch, showing that the authors are tracking current work in the field.

5. Strengths & Limitations

Key Strengths:

Clean, well-motivated observation about Rigid vs. Dynamic heads with compelling visualizations

Training-free, requiring no model modifications or calibration data

Comprehensive evaluation across three model families and two benchmarks

Code release for reproducibility

Addresses both prefilling and decoding stages in a unified framework

The idea of varying budget along all three axes (position, head, context) simultaneously is a genuine advance over prior work

Notable Limitations:

The method introduces overhead that makes it counterproductive for short contexts, limiting generality

Multiple hyperparameters (e_t, α, Δ_t, clipping bounds) without principled selection criteria

The observation attention approximation quality is not validated

Improvements over the strongest baseline (CritiPrefill) are often marginal on quality metrics

No evaluation at truly massive scales (e.g., 1M tokens) or with larger models (70B+)

The paper lacks theoretical analysis of when/why entropy-based budgeting should be optimal

The latent decode component requires storing the full KV cache until N_d tokens are generated, which partially undermines memory savings during the critical transition period

6. Additional Observations

The paper positions itself well within the literature, providing a clear taxonomy of prior methods and their limitations. However, the writing could be tighter—the algorithms, while clearly presented, involve many moving parts that make the method feel somewhat engineered rather than principled. The connection between entropy fluctuation and optimal budget allocation deserves deeper theoretical grounding.

The evaluation on openPangu models adds breadth but shows smaller gains, and the LoCoMo results (Table 4) show some degradation, particularly with CoT reasoning on the 7B model.

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.8 · Novelty 6 · Clarity 6.5

Generated Jun 9, 2026

Comparison History (30)

Won vs. TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Paper 1 addresses a critical and highly active bottleneck in modern AI: the computational cost of long-context LLM inference. Its training-free, entropy-guided approach offers immediate, practical speedups for widely used models, ensuring broad applicability and high real-world impact. While Paper 2 provides a valuable benchmarking tool for tabular encoders, its scope is narrower, and foundational LLM efficiency improvements generally drive more widespread and immediate scientific adoption.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Paper 1 (EntropyInfer) demonstrates higher potential impact due to its broader applicability across multiple LLM architectures (Llama, Qwen, openPangu), its training-free nature making it immediately deployable, and its substantial practical speedups (2.39x) for the critical problem of long-context inference. It addresses a more universal bottleneck in LLM deployment. Paper 2's DPVR-LF offers interesting insights about vision token saturation but is narrower in scope (specific to LLaVA-style MLLMs), requires training, and the performance preservation claims need broader validation beyond standard benchmarks.

claude-opus-4-6 · Jun 9, 2026

Lost vs. TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

Paper 2 likely has higher scientific impact: it introduces a new benchmark (TheoremBench) and metrics that can become a shared evaluation standard for formal-math LLMs, affecting many future papers and enabling more rigorous, comparable progress. Its design (premised structure, theorem-level coverage, token-efficiency) broadens evaluation beyond contest-style tasks and is timely given rapid growth in Lean4-based provers. Paper 1 is useful and practical for long-context inference, but is more incremental within a crowded optimization space and may have narrower cross-field influence than a widely adopted benchmark.

gpt-5.2 · Jun 9, 2026

Lost vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

IatroBench addresses a critical and underexplored problem—AI safety measures causing iatrogenic harm by withholding medically necessary information based on user identity rather than clinical need. Its pre-registered methodology, clear empirical findings across frontier models, and demonstration that evaluation frameworks share the same blind spots as training pipelines represent a novel contribution with broad implications for AI safety policy, healthcare AI deployment, and regulatory frameworks. Paper 1, while technically solid, is an incremental optimization in the well-explored space of efficient LLM inference. Paper 2's findings challenge fundamental assumptions in AI alignment and have immediate real-world consequences.

claude-opus-4-6 · Jun 9, 2026

Won vs. SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Paper 2 addresses a critical and universal bottleneck in modern AI: the computational cost of long-context LLM inference. Its training-free, adaptive approach offers significant speedups (2.39x) with minimal quality loss, ensuring immediate and widespread applicability across industry and academia. While Paper 1 provides a valuable benchmark for multimodal agents, Paper 2's fundamental efficiency improvements will likely see broader, faster adoption and impact across the entire natural language processing ecosystem.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. Generative structure search for efficient and diverse discovery of molecular and crystal structures

Paper 1 presents a foundational advance in AI for Science by unifying generative models and physical forces for molecular and crystal discovery. This tackles a major bottleneck in materials science and chemistry, offering profound potential for discovering new drugs, clean energy materials, and catalysts. While Paper 2 offers significant engineering improvements for LLM efficiency, Paper 1's ability to extrapolate beyond training data to discover novel physical structures represents a more fundamental scientific breakthrough with broader cross-disciplinary impact.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. Towards a General Intelligence and Interface for Wearable Health Data

Paper 2 presents a foundation model for wearable health pretrained on data from 5 million participants (1 trillion+ minutes), representing a massive scale-up in health AI. It addresses a fundamental challenge in digital health with broad applications across cardiovascular, metabolic, sleep, and mental health domains. The integration with LLM agents for personalized health creates a novel paradigm with significant real-world clinical impact, validated by clinician ratings. Its breadth of impact across healthcare, AI, and wearable technology exceeds Paper 1's more incremental (though solid) contribution to LLM inference optimization.

claude-opus-4-6 · Jun 9, 2026

Won vs. A Collective Variational Principle Unifying Bayesian Inference, Game Theory, and Thermodynamics

Paper 1 targets an immediate, high-impact bottleneck—efficient long-context LLM inference—with a concrete, training-free method, strong empirical validation across multiple popular model families, clear baselines, and released code, making adoption and real-world deployment likely. Its contributions are timely given rapid growth of long-context applications and infrastructure constraints, and impact could extend broadly across LLM serving and systems research. Paper 2 is conceptually ambitious and cross-disciplinary, but such unifying theoretical frameworks are harder to validate, standardize, and translate into widespread near-term use, making impact less certain.

gpt-5.2 · Jun 9, 2026

Lost vs. End-to-end autonomous scientific discovery on a real optical platform

Paper 2 demonstrates a fundamentally new capability: end-to-end autonomous scientific discovery on a real physical system, culminating in the identification of a previously unreported physical mechanism (optical bilinear interaction). This represents a paradigm shift in how science can be conducted. Its impact spans AI, physics, and scientific methodology broadly. While Paper 1 offers solid engineering contributions to LLM inference efficiency, it is an incremental optimization in a crowded space. Paper 2's novelty, cross-disciplinary implications, and milestone nature give it substantially higher potential impact.

claude-opus-4-6 · Jun 9, 2026

Lost vs. Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

Paper 2 likely has higher scientific impact: it introduces a large-scale foundation model trained on unprecedented nationwide claims data (43.8B events, 200M+ enrollees) and demonstrates broad, externally validated improvements across 1,000+ clinical tasks plus expenditure forecasting and bias reduction in target trial emulation—directly relevant to regulators, payers, and health systems. The applications and cross-field reach (clinical prediction, health economics, causal/RWE methods) are wide and timely. Paper 1 is innovative and useful for efficient long-context LLM inference, but its impact is more specialized to systems/LLM optimization.

gpt-5.2 · Jun 9, 2026
