Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei, Xiangyuan Wang, Mengzhe Ruan, Hanxu Hou, Peisong Wang
Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, but typically assume a uniform budget distribution across all layers and heads. In contrast, existing non-uniform budget allocation methods are predominantly designed for the static prompt prefill phase, and they do not capture the stepwise context demands of autoregressive reasoning. To bridge this gap, we propose ReasonAlloc, a training-free framework that recasts decoding-time KV compression as a hierarchical budget allocation problem. ReasonAlloc operates at two complementary levels: an offline layer-wise preallocation strategy captures an architecture-driven demand pattern which we call "Reasoning Wave", while an online head-wise strategy reallocates resources during decoding to information-rich heads based on real-time utility. Evaluations on mathematical reasoning benchmarks (MATH-500, AIME 2024) using DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and AceReason-14B show that ReasonAlloc outperforms uniform-budget R-KV, SnapKV, and Pyramid-RKV (a baseline enforcing a static, monotonically decreasing layer budget), with the largest gains at small budgets (128-512 tokens). ReasonAlloc is plug-and-play with existing token-eviction policies and introduces negligible inference-time overhead.
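ReasonAlloc addresses a genuine and timely problem: the KV cache memory bottleneck created by long chain-of-thought (CoT) reasoning in LLMs. The key insight is that existing decoding-time KV cache compression methods assume uniform budget allocation across layers and heads, which is suboptimal. The paper recasts KV compression as a hierarchical budget allocation problem with two levels: an offline layer-wise preallocation that captures the architecture-driven "Reasoning Wave" demand pattern, and an online head-wise strategy that reallocates budget during decoding to information-rich heads based on real-time utility.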
The framework is training-free, plug-and-play (compatible with various token-scoring policies), and introduces negligible overhead. The decoupling of "where to allocate budget" from "which tokens to evict" is a clean architectural contribution.
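The separation between budget allocation and token eviction can be illustrated with a minimal sketch. Everything here is hypothetical: the function names, the largest-remainder rounding, and the static snapshot of head utilities are assumptions made for illustration, not the paper's implementation (in the paper, the layer-wise step is calibrated offline and the head-wise step runs during decoding).

```python
from typing import List, Sequence


def split_budget(total: int, weights: Sequence[float]) -> List[int]:
    """Split an integer budget in proportion to weights (largest-remainder rounding)."""
    z = sum(weights)
    if z <= 0:  # degenerate weights: fall back to a uniform split
        weights = [1.0] * len(weights)
        z = float(len(weights))
    exact = [total * w / z for w in weights]
    shares = [int(x) for x in exact]
    leftover = max(total - sum(shares), 0)
    # Hand the remaining tokens to the entries with the largest fractional parts.
    order = sorted(range(len(exact)), key=lambda i: exact[i] - shares[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


def hierarchical_budgets(
    total: int,
    layer_weights: Sequence[float],
    head_utils: Sequence[Sequence[float]],
) -> List[List[int]]:
    """'Where to allocate': split across layers first, then across heads within each layer."""
    layer_budgets = split_budget(total, layer_weights)
    return [split_budget(b, u) for b, u in zip(layer_budgets, head_utils)]


def keep_indices(scores: Sequence[float], budget: int) -> List[int]:
    """'Which tokens to evict': keep the `budget` best tokens under any scoring policy."""
    if budget >= len(scores):
        return list(range(len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Two layers with two heads each; all numbers are made up for illustration.
budgets = hierarchical_budgets(16, [1.0, 3.0], [[1.0, 1.0], [1.0, 2.0]])
kept = keep_indices([0.1, 0.9, 0.3, 0.7], 2)
```

Because `keep_indices` only sees a score vector and a budget, any token-scoring policy could be swapped in without touching the allocation step, which is the sense in which the review calls the framework plug-and-play.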
Empirical Profiling (Section 4): The paper provides a thorough empirical characterization of KV demand heterogeneity across layers and heads. The use of Wasserstein distance to quantify variance across three hierarchical levels (intra-dataset, cross-dataset, cross-model) is methodologically sound. The finding that cross-architecture variance (0.0503) is ~3.8× larger than cross-task variance (0.0132) is well-supported and motivates the offline calibration approach.
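As a rough illustration of the kind of comparison described above, the following sketch computes a 1-D Wasserstein-1 distance between two layer-wise demand profiles. It assumes each profile is a histogram over layer indices with unit spacing; how the paper actually builds, normalizes, and aggregates its profiles is not stated in this review, so the function and the toy inputs are hypothetical.

```python
from itertools import accumulate
from typing import Sequence


def wasserstein_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """W1 distance between two histograms on the same unit-spaced support.

    For 1-D distributions, W1 equals the summed absolute difference of the CDFs.
    """
    if len(p) != len(q):
        raise ValueError("profiles must cover the same number of layers")
    sp, sq = sum(p), sum(q)
    cdf_p = accumulate(x / sp for x in p)
    cdf_q = accumulate(x / sq for x in q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


# Toy profiles: all demand in the first layer vs. all demand in the last layer.
d_far = wasserstein_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
d_same = wasserstein_1d([2.0, 1.0, 1.0], [2.0, 1.0, 1.0])

# Ratio of the two variances quoted in the review.
ratio = 0.0503 / 0.0132
```

The ratio of the quoted figures works out to roughly 3.8, matching the review's statement.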
Robustification Operator: The shared operator R(·) with power smoothing (ν=0.5), clipping, and renormalization is a practical and principled approach to prevent starvation and monopolization. However, the values γ=0.5, β=0.5, μ=0.25 and the clip bounds [0.25B̄, 2B̄] are described as selected via "manual probing on a single AIME 2024 prompt", which raises concerns about generalizability, though the authors note shape consistency across thresholds.
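A minimal sketch of a smooth, clip, and renormalize operator of this kind is given below. The order of the steps, the single renormalization pass, and the names `robustify`, `power`, `lo`, and `hi` are assumptions for illustration; only the exponent 0.5 and the clip bounds [0.25B̄, 2B̄] come from the review.

```python
from typing import List, Sequence


def robustify(
    scores: Sequence[float],
    mean_budget: float,
    power: float = 0.5,  # power-smoothing exponent quoted in the review
    lo: float = 0.25,    # lower clip bound, as a multiple of the mean budget
    hi: float = 2.0,     # upper clip bound, as a multiple of the mean budget
) -> List[float]:
    """Turn raw demand scores into budgets that sum to len(scores) * mean_budget."""
    n = len(scores)
    total = n * mean_budget
    # 1) Power smoothing compresses large gaps between scores.
    smoothed = [max(s, 0.0) ** power for s in scores]
    z = sum(smoothed)
    if z == 0:
        return [mean_budget] * n
    raw = [total * w / z for w in smoothed]
    # 2) Clipping prevents starvation (floor) and monopolization (ceiling).
    clipped = [min(max(b, lo * mean_budget), hi * mean_budget) for b in raw]
    # 3) Renormalize so the total budget is preserved.
    scale = total / sum(clipped)
    return [b * scale for b in clipped]


# Four units with very uneven raw scores and a mean budget of 128 tokens.
out = robustify([0.0, 1.0, 4.0, 100.0], mean_budget=128)
```

Note that a single renormalization pass can push a value slightly past the clip bounds again; an implementation that needs strict bounds would have to iterate or redistribute the excess, and the review does not say which approach the paper takes.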
Experimental Evaluation: The evaluation covers three models (R1-Llama-8B, R1-Qwen-14B, AceReason-14B) and two benchmarks (MATH-500, AIME 2024). The use of pass@1 with k=8 samples is appropriate. However, there are some concerns:
Ablation Study: Table 1 provides a useful decomposition. Head-wise routing contributes more at small budgets, while layer-wise preallocation has complementary effects. However, results at larger budgets show some instability (e.g., the layer-wise-only variant achieves 55.00 at budget 2560 but drops to 44.18 at 3072), suggesting the approach may not be uniformly beneficial.
The practical impact is moderate to high for the specific niche of serving long-CoT reasoning models under memory constraints. The key value propositions are:
The "Reasoning Wave" finding itself is potentially impactful as an empirical contribution that could inform future architecture design and compression methods. However, the finding is limited to distilled reasoning models from the DeepSeek-R1 family and AceReason, and may not generalize to other architectures or training paradigms.
This paper is highly timely. Long-CoT reasoning models (DeepSeek-R1, QwQ, etc.) represent a major trend in 2025, and their inference costs are a recognized bottleneck. The focus on decoding-time (rather than prefill-time) compression for reasoning models addresses a genuine gap, as most prior non-uniform allocation work (PyramidKV, DynamicKV) targets the prefill phase.
Overall Assessment: ReasonAlloc makes a solid, incremental contribution to KV cache compression by introducing a well-motivated hierarchical allocation framework. The "Reasoning Wave" finding is interesting and the practical gains at tight budgets are meaningful. However, the evaluation breadth, statistical rigor, and hyperparameter analysis could be strengthened. The contribution is primarily engineering-oriented rather than introducing deep theoretical insights.
Generated Jun 10, 2026
Paper 1 introduces a fundamentally novel paradigm for AI interpretability and trust by treating behavior forecasting as a learnable task, bypassing traditional and often unfaithful explanations. This conceptual shift has profound implications for AI safety, alignment, and evaluation of Large Reasoning Models. While Paper 2 offers a highly practical and timely systems optimization for KV cache management, Paper 1's approach has broader theoretical impact across the AI community by redefining how we understand and predict complex model behavior.
Paper 2 (HERO) likely has higher scientific impact due to its broader applicability to agentic RL/LLM agents, addressing a core multi-turn credit assignment and supervision-alignment problem with a generally reusable hindsight-feedback mechanism. It targets timely, high-interest benchmarks (WebShop, TauBench) and improves both success and efficiency under scarce-success regimes, a key real-world constraint. Paper 1 is solid and practical for inference efficiency in reasoning LLMs, but is narrower in scope (KV-cache budgeting) and more incremental within an active compression line.
ReasonAlloc addresses a critical and timely problem in LLM inference efficiency—KV cache management for reasoning models with long chain-of-thought. It introduces a novel hierarchical budget allocation framework with the 'Reasoning Wave' concept, is training-free and plug-and-play, and demonstrates clear improvements on established benchmarks. While Workflow-GYM contributes a useful benchmark for GUI agents, benchmarks tend to have more transient impact. ReasonAlloc's methodological contribution to efficient inference for reasoning models has broader applicability as reasoning LLMs become ubiquitous, making it likely to influence a wider body of follow-up work.
ReasonAlloc addresses a critical and timely infrastructure challenge—KV cache management for reasoning LLMs—with a practical, training-free solution showing clear empirical gains. As reasoning models with long CoT become mainstream, efficient inference is a high-priority problem with broad real-world impact. Paper 2 (CIAware-Bench) introduces an interesting benchmark for AI safety/control, but targets a narrower, more speculative concern (models detecting control interventions). While relevant to AI safety, its immediate practical impact and community adoption potential are lower compared to the broadly applicable inference optimization of Paper 1.
Moonshine represents a fundamentally novel paradigm—an autonomous agent for mathematical conjecture generation that bridges classical mathematics with neural network theory. The Neural Jacobian Conjecture is a genuinely new mathematical contribution connecting the Jacobian conjecture to neural networks, with partial proofs obtained. This has broader cross-disciplinary impact (AI, pure mathematics, neural network theory) and introduces a new research methodology. Paper 1, while technically solid, is an incremental optimization of KV cache management for LLM inference—a crowded space with many competing approaches and limited impact beyond efficiency improvements.
Paper 2 addresses a highly critical and timely bottleneck in the deployment of modern Large Language Models: KV cache explosion during long chain-of-thought reasoning. By providing a training-free, plug-and-play solution to optimize decoding-time efficiency for models like DeepSeek-R1, it has massive potential for immediate, widespread real-world adoption. While Paper 1 offers strong theoretical contributions to multilingual fine-tuning, the systems-level impact and relevance of Paper 2 to the current frontier of reasoning models give it a higher potential for broad scientific and practical impact.
Paper 2 likely has higher impact: it proposes a concrete, deployable method that improves inference efficiency for reasoning LLMs—an acute bottleneck with immediate real-world value. The hierarchical, training-free, decoding-time KV budget allocation (offline layer-wise + online head-wise) is a notable technical innovation with broad applicability across architectures and eviction policies, and it is evaluated on strong reasoning benchmarks with multiple models. Paper 1 is timely and useful for evaluation practice, but its impact is more meta-analytical and may translate into slower, less direct adoption than an efficiency technique.
Paper 2 likely has higher impact: it targets a major, broadly felt bottleneck (decoding-time KV cache growth) that affects essentially all long-context reasoning deployments, with immediate real-world efficiency gains and hardware cost implications. Its hierarchical (layer+head) allocation is a clear methodological advance over uniform or static schemes, is training-free, and is plug-and-play with existing eviction policies, aiding adoption. While Paper 1 is novel and useful for agent reliability, its applicability depends on tool-contract availability and agent setups, making its cross-domain impact narrower than ubiquitous inference optimization.
ReasonAlloc addresses a critical and timely infrastructure problem—KV cache management during LLM reasoning—with a novel hierarchical budget allocation framework and the discovery of the 'Reasoning Wave' pattern. It offers a training-free, plug-and-play solution with rigorous evaluation across multiple models, directly enabling more efficient deployment of reasoning LLMs. Paper 2, while a useful benchmark contribution for Office automation, has narrower scope (specific to one exam format) and provides diagnostic findings rather than a technical solution. ReasonAlloc's methodological innovation and broad applicability to the rapidly growing reasoning-model ecosystem give it higher impact potential.
Paper 1 is more novel and broadly impactful: it introduces a new latent-space memory paradigm for multimodal RAG/QA that changes what is retrieved and fed to generators (single latent token per evidence), with end-to-end training objectives and validation across many text and multimodal benchmarks. It targets a major real-world bottleneck (context/token and storage cost) relevant to deploying LLM/VLM QA widely. Paper 2 is timely and practical for inference efficiency, but is narrower (KV-cache budgeting for reasoning) and primarily an engineering optimization, with less cross-domain reach.