Yuhua Zhou, Shaoqi Yu, Shichao Weng, Changhai Zhou, Mingze Yin, Fei Yang, Aimin Pan
Large language models (LLMs) incur high inference cost due to their depth and parameter scale. Depth pruning can reduce latency by skipping redundant Transformer blocks, but existing methods (i) provide limited control under user-specific compute budgets and (ii) typically fix the routing path, failing to adapt as the context grows during decoding. We propose Buddy, a budget-driven dynamic depth routing framework. Buddy uses a lightweight Decision Module to score intermediate layers conditioned on the input and deterministically executes the top-k layers to satisfy a given budget. To support decode-time adaptation, Buddy reuses the first-layer KV cache as a low-overhead global context source and pools it together with the newest token representation before each routing decision. When no explicit budget is provided, an optional Budget Predictor estimates an input-dependent compute level to balance quality and efficiency. Experiments on Llama-family and Qwen models show that Buddy is competitive with strong static pruning baselines and often improves the accuracy-compute trade-off, while uniquely supporting strict budget control, decode-time rerouting, and multiple budgets within a single trained model.
BUDDY is a dynamic depth routing framework for LLMs that addresses two specific gaps in existing layer pruning methods: (1) the inability to strictly satisfy heterogeneous user-specified compute budgets with a single model, and (2) the failure to adapt routing paths during autoregressive decoding as context evolves.
The system has three main components: a lightweight Decision Module (two-layer MLP) that scores and selects top-k layers based on current context and budget; a KV-aware planner that reuses the first layer's KV cache to provide global context for routing decisions at each decode step; and an optional Budget Predictor trained via GRPO that automatically selects compute levels when no explicit budget is given.
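The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`budget_to_k`, `pooled_context`, `mlp_scores`, `select_layers`), the use of mean pooling, the ReLU activation, the round-down budget mapping, and the toy dimensions are all assumptions made for the example, and the weights are random rather than trained.

```python
import random


def budget_to_k(budget, num_layers):
    """Map a budget in (0, 1] to a layer count, rounding down so the budget is never exceeded."""
    # Assumption: the budget is expressed as a fraction of layers to execute.
    return max(1, min(num_layers, int(budget * num_layers)))


def pooled_context(first_layer_keys, newest_token):
    """Mean-pool cached first-layer key vectors together with the newest token vector."""
    # Assumption: mean pooling; the source only says the two are pooled together.
    rows = first_layer_keys + [newest_token]
    return [sum(col) / len(rows) for col in zip(*rows)]


def mlp_scores(context, w1, b1, w2, b2):
    """Two-layer MLP with ReLU: context vector -> one score per candidate layer."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, context)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def select_layers(scores, k):
    """Keep the k highest-scoring layers (ties go to the shallower layer), in depth order."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])


def random_matrix(rng, rows, cols):
    """Hypothetical helper: a rows x cols matrix of uniform random weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy usage with untrained, randomly initialised weights (illustration only).
rng = random.Random(0)
dim, hidden_dim, num_layers = 4, 8, 12
w1, b1 = random_matrix(rng, hidden_dim, dim), [0.0] * hidden_dim
w2, b2 = random_matrix(rng, num_layers, hidden_dim), [0.0] * num_layers

cached_keys = random_matrix(rng, 5, dim)  # first-layer keys for 5 past tokens
newest = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

context = pooled_context(cached_keys, newest)
k = budget_to_k(0.5, num_layers)  # 50% budget -> 6 of 12 layers
chosen = select_layers(mlp_scores(context, w1, b1, w2, b2), k)
print(k, chosen)
```

Because selection is a deterministic top-k rather than a sampled or thresholded gate, the number of executed layers always equals `k`, which is what makes the budget strict. Re-running `pooled_context` and `select_layers` at each decode step is the part that would correspond to decode-time rerouting.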
The key distinction from prior work is the combination of four properties: strict budget control, budget flexibility (single model, multiple budgets), input adaptivity, and decode-time adaptivity. Table 1 makes this positioning clear, though some of these distinctions are incremental rather than transformative.
Practical applications: The framework is well-suited for multi-tenant LLM serving where different users have different latency/quality requirements. The single-model-multiple-budget paradigm reduces deployment complexity compared to maintaining separate pruned checkpoints.
Limitations on impact: As the authors acknowledge, all Transformer blocks must remain in GPU memory, so BUDDY yields no VRAM savings. This substantially limits practical deployment appeal, since memory is often the binding constraint. Batched inference is also left for future work: when sequences in the same batch select different paths, the divergent routes cause KV-cache misses.
The speedups, while real, are moderate: 1.19× decode speedup at 25% sparsity and 1.64× at 50% sparsity. These are meaningful but not dramatic, especially when accuracy drops significantly at high sparsity levels.
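To put these figures in context, one can compare them with a simple upper bound. The sketch below assumes that "sparsity" means the fraction of Transformer blocks skipped and that decode latency is dominated by those blocks, so skipping a fraction s of them could at best give a 1 / (1 - s) speedup. Both assumptions, and the function names, are introduced here for illustration and are not taken from the paper.

```python
def ideal_speedup(sparsity):
    """Upper-bound speedup if latency scaled exactly with the number of executed blocks."""
    return 1.0 / (1.0 - sparsity)


def realized_fraction(observed_speedup, sparsity):
    """Share of the skipped-block time that actually shows up as saved latency."""
    saved = 1.0 - 1.0 / observed_speedup
    return saved / sparsity


# Reported decode speedups from the review text above.
for sparsity, observed in [(0.25, 1.19), (0.50, 1.64)]:
    print(
        f"sparsity={sparsity:.2f} "
        f"ideal={ideal_speedup(sparsity):.2f}x "
        f"observed={observed:.2f}x "
        f"realized={realized_fraction(observed, sparsity):.0%}"
    )
```

Under these assumptions the ideal figures would be about 1.33× and 2.00×, so the reported speedups recover roughly 64% and 78% of the theoretically available saving. The remainder is presumably absorbed by routing overhead and by components that are never skipped.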
The paper addresses a genuinely important problem: efficient LLM inference under heterogeneous compute constraints. With LLMs being deployed at scale in production systems serving diverse user populations, the ability to dynamically adjust compute per request is highly relevant.
The budget-flexibility angle (single model, multiple budgets) is particularly timely given the operational overhead of managing multiple model variants. However, the field is rapidly evolving with mixture-of-experts, speculative decoding, and quantization approaches, and BUDDY's depth-pruning approach may have limited complementarity with some of these.
BUDDY presents a well-engineered system that unifies several desirable properties for adaptive LLM inference. The framework is technically sound and the experimental coverage is reasonable, though the actual performance gains from the two key innovations (decode-time adaptation and budget flexibility) are modest. The paper is a solid engineering contribution to efficient inference but falls short of providing transformative insights. The inability to reduce memory footprint and the moderate speedups at useful accuracy levels limit its practical impact.
Generated Jun 9, 2026
Paper 2 addresses a critical and universal bottleneck in modern AI: LLM inference cost. By offering dynamic, budget-driven depth routing, it provides a highly practical solution for deploying massive language models under strict latency and compute constraints. While Paper 1 offers valuable methodological improvements for text-to-image flow matching, the breadth of impact and real-world applicability of optimizing LLM inference efficiency give Paper 2 a higher potential for widespread scientific and industrial impact.
Paper 1 addresses a critical bottleneck in modern AI: the high inference cost of Large Language Models. By offering dynamic, budget-driven depth routing, it provides a highly practical and timely solution with immediate real-world applications across numerous domains relying on LLMs. While Paper 2 offers a valuable methodological improvement in uncertainty quantification via conformal prediction, the sheer scale, timeliness, and economic impact of optimizing LLM deployment give Paper 1 a broader and more significant potential scientific impact.
Paper 2 (BUDDY) likely has higher impact due to broad applicability: it targets inference cost reduction for mainstream LLMs across many tasks and deployment settings, with explicit budget control and decode-time adaptive routing—highly timely for real-world serving constraints. The method is model-agnostic within Transformer LLMs and can affect many downstream applications and systems research. Paper 1 (AuRA) is innovative for speech-to-LLM integration via LoRA distillation, but its impact is narrower to audio/speech multimodality and depends on ASR-teacher availability and benchmark coverage.
Paper 2 provides a fundamental theoretical framework that addresses a critical gap in multimodal learning, offering broad applicability across diverse scientific fields like biomedicine and astrophysics. Its ability to diagnose and predict the success of cross-modal objectives before training represents a deep methodological advance. Paper 1, while highly relevant for the timely issue of LLM inference efficiency, represents a more incremental engineering optimization within a single domain.
Paper 1 addresses the critical and highly timely challenge of LLM inference efficiency. Its budget-driven dynamic depth routing provides a practical solution to reduce computational costs without retraining, offering significant real-world applications across the rapidly expanding AI industry. While Paper 2 offers a novel approach to continual learning, Paper 1 has broader immediate impact and relevance due to the widespread deployment and immense cost of operating large language models.
Paper 2 introduces a fundamentally novel connection between holographic reduced representations (HRR) and disentanglement, bridging symbolic AI and neural representation learning with both theoretical (information-theoretic capacity bounds) and empirical contributions. This cross-pollination of ideas from cognitive science/VSA with modern deep learning has broader potential impact across multiple fields. Paper 1, while technically solid, is an incremental engineering contribution to LLM efficiency—a crowded space with many competing approaches—and its impact is more narrowly scoped to practical deployment optimization.
Paper 2 likely has higher scientific impact because it provides a unifying, problem-oriented framework (discoverability phase diagram + REO abstraction) for a fast-growing area spanning many physical and life sciences. This can shape research directions, standardize thinking across methods, and influence broad application domains (physics, engineering, chemistry, biology). Paper 1 is timely and practically useful for efficient LLM inference, but its impact is narrower (systems/ML inference optimization) and more incremental relative to ongoing work on dynamic routing/pruning. Overall breadth and cross-field relevance favor Paper 2.
BUDDY addresses a broadly impactful problem—efficient LLM inference—with a practical framework offering budget-driven dynamic depth routing, decode-time adaptation, and strict budget control. Its applicability to widely-used LLM families (Llama, Qwen) and the growing demand for efficient inference give it significant real-world relevance and breadth of impact. Paper 2, while introducing an interesting adaptive scale refinement idea for molecular force prediction, is narrower in scope (tested on a single NaCl system), demonstrates modest improvements, and is positioned more as a proof-of-concept. The LLM efficiency space has far larger community engagement and immediate practical demand.
Paper 1 offers a foundational theoretical framework and diagnostic benchmark for a critical bottleneck in scientific AI (size extrapolation in generative models). While Paper 2 provides a valuable system-level optimization for LLM inference, Paper 1's rigorous mathematical approach to quasi-locality and spatial mixing offers deeper, longer-lasting scientific insights with profound implications for domains like molecular modeling and physics simulations.
Paper 2 likely has higher scientific impact due to strong real-world applicability and timeliness: it targets LLM inference cost, a major bottleneck, and offers practical, controllable compute budgets with decode-time adaptive routing—features directly valuable for deployment across many systems. It also appears empirically validated on widely used model families, increasing adoption potential and cross-field impact (systems, ML, NLP). Paper 1 is methodologically rigorous and theoretically novel, but its impact may be narrower and slower to translate into practice than an inference-efficiency framework.