
BUDDY: BUdget-Driven DYnamic Depth Routing for Adaptive Large Language Model Inference

Yuhua Zhou, Shaoqi Yu, Shichao Weng, Changhai Zhou, Mingze Yin, Fei Yang, Aimin Pan

cs.LG
#2572 of 5669 · cs.LG
Tournament Score: 1413±43
Win Rate: 52% (11 wins, 10 losses, 21 matches)
Rating: 5.8/10
Significance 5.5 · Rigor 6 · Novelty 5.5 · Clarity 7

Abstract

Large language models (LLMs) incur high inference cost due to their depth and parameter scale. Depth pruning can reduce latency by skipping redundant Transformer blocks, but existing methods (i) provide limited control under user-specific compute budgets and (ii) typically fix the routing path, failing to adapt as the context grows during decoding. We propose Buddy, a budget-driven dynamic depth routing framework. Buddy uses a lightweight Decision Module to score intermediate layers conditioned on the input and deterministically executes the top-k layers to satisfy a given budget. To support decode-time adaptation, Buddy reuses the first-layer KV cache as a low-overhead global context source and pools it together with the newest token representation before each routing decision. When no explicit budget is provided, an optional Budget Predictor estimates an input-dependent compute level to balance quality and efficiency. Experiments on Llama-family and Qwen models show that Buddy is competitive with strong static pruning baselines and often improves the accuracy-compute trade-off, while uniquely supporting strict budget control, decode-time rerouting, and multiple budgets within a single trained model.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: BUDDY – Budget-Driven Dynamic Depth Routing for Adaptive LLM Inference

1. Core Contribution

BUDDY proposes a dynamic depth routing framework for LLMs that addresses two specific gaps in existing layer pruning methods: (1) the inability to strictly satisfy heterogeneous user-specified compute budgets with a single model, and (2) the failure to adapt routing paths during autoregressive decoding as context evolves.

The system has three main components: a lightweight Decision Module (two-layer MLP) that scores and selects top-k layers based on current context and budget; a KV-aware planner that reuses the first layer's KV cache to provide global context for routing decisions at each decode step; and an optional Budget Predictor trained via GRPO that automatically selects compute levels when no explicit budget is given.
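The routing step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes mean pooling over the cached first-layer keys and the newest token representation, treats the budget as the fraction of routable layers to execute, and uses randomly initialised weights. The names `route`, `mlp_scores`, `mean_pool`, and `select_layers` are hypothetical.

```python
import math
import random

def mlp_scores(features, w1, b1, w2, b2):
    """Two-layer MLP: features -> ReLU hidden layer -> one score per candidate layer."""
    hidden = [max(0.0, sum(f * w for f, w in zip(features, row)) + b)
              for row, b in zip(w1, b1)]
    return [sum(h * w for h, w in zip(hidden, row)) + b
            for row, b in zip(w2, b2)]

def mean_pool(vectors):
    """Average a list of equal-length vectors (assumed pooling operator)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def select_layers(scores, k):
    """Deterministically keep the k highest-scoring layers, in execution order."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])

def route(first_layer_keys, newest_token, budget, params):
    # Pool the cached first-layer keys together with the newest token representation.
    context = mean_pool(first_layer_keys + [newest_token])
    # Condition the scores on the budget by appending it as an extra feature.
    scores = mlp_scores(context + [budget], *params)
    # Floor so the number of executed layers never exceeds the budget.
    k = math.floor(budget * len(scores))
    return select_layers(scores, k)

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
dim, hidden_size, num_layers = 4, 8, 8
params = (rand_matrix(hidden_size, dim + 1), [0.0] * hidden_size,
          rand_matrix(num_layers, hidden_size), [0.0] * num_layers)
keys = rand_matrix(5, dim)                      # 5 cached positions
token = [random.uniform(-1, 1) for _ in range(dim)]
path = route(keys, token, 0.5, params)          # indices of the layers to execute
print(path)
```

Because selection is a deterministic top-k rather than a sampled or thresholded gate, the number of executed layers is fixed by the budget alone, which is what makes strict budget control possible in this sketch.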

The key distinction from prior work is the combination of four properties: strict budget control, budget flexibility (single model, multiple budgets), input adaptivity, and decode-time adaptivity. Table 1 makes this positioning clear, though some of these distinctions are incremental rather than transformative.

2. Methodological Rigor

Strengths in methodology:

  • The motivation is well-grounded with two concrete empirical observations: input-dependent layer importance (Figure 1) and evolving importance during decoding (Figure 2), both demonstrated on Llama2-7B with WikiText-2.
  • The prior normalization pipeline (Section 4.2.2, Appendix B) is thoughtfully designed with heavy-tail compression, robust z-scoring, and rank normalization to handle heterogeneous importance indicators.
  • The STE-based training with random budget sampling is a practical approach enabling multi-budget support from a single model.
  • Thorough ablation studies cover feature types (Key vs. Value states), prior knowledge fusion, omega coefficient, and layer selection for feature extraction.
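The three normalization steps named in the strengths above can be sketched as follows. This is an assumption-laden illustration: the review does not say which compressor, scale estimate, or tie rule the paper uses, so signed `log1p`, a median/MAD z-score, and average-rank ties are stand-ins, and all function names are hypothetical.

```python
import math
import statistics

def compress_heavy_tail(values):
    """Signed log1p compression (an assumed choice of compressor)."""
    return [math.copysign(math.log1p(abs(v)), v) for v in values]

def robust_zscore(values, eps=1e-8):
    """Centre on the median and scale by the MAD instead of mean and std."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    scale = 1.4826 * mad + eps   # 1.4826 makes the MAD comparable to a Gaussian std
    return [(v - med) / scale for v in values]

def rank_normalize(values):
    """Map values to [0, 1] by rank; tied values share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return [r / (n - 1) if n > 1 else 0.0 for r in ranks]

def normalize_prior(values):
    """Chain the three steps for one importance indicator."""
    return rank_normalize(robust_zscore(compress_heavy_tail(values)))

# One heavy-tailed indicator over four layers (made-up numbers).
print(normalize_prior([1.0, 10.0, 1000.0, 2.0]))
```

Chained this way, the final output depends only on the ordering of the inputs, since the first two steps are monotone. The intermediate values matter only if they are used or fused before the rank step; how the paper combines them is defined in its Appendix B and is not reproduced here.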

Weaknesses in methodology:

  • The experimental evaluation is limited primarily to commonsense reasoning benchmarks (8 tasks). The GSM8K results (Appendix D.4) are included but show severe degradation for all methods, making it hard to draw strong conclusions about mathematical reasoning capability preservation.
  • The comparison against dynamic baselines is somewhat uneven. PuDDing and FiRST consistently underperform even static methods in many settings (FiRST dramatically so), raising questions about whether the reproductions are fair or whether these methods were designed for different operating regimes.
  • The decode-adaptive analysis (Section 5.2.3) is limited to path counts and ROUGE scores on SAMSum. The performance differences between Reuse and Recompute are modest (Figure 5), weakening the argument that decode-time rerouting is essential.
  • Latency measurements (Table 4) show that at 12.5% sparsity, speedups are negligible (1.00-1.14×), and the overhead analysis acknowledges gather/scatter costs offset savings at low pruning rates. The decode speedups trail those of PuDDing at matched sparsity (Table 14) because PuDDing uses a fixed decode path.

3. Potential Impact

    Practical applications: The framework is well-suited for multi-tenant LLM serving where different users have different latency/quality requirements. The single-model-multiple-budget paradigm reduces deployment complexity compared to maintaining separate pruned checkpoints.

    Limitations on impact: A significant limitation acknowledged by the authors is that all Transformer blocks must remain in GPU memory, meaning VRAM savings are zero. This substantially limits practical deployment appeal, especially given that memory is often the binding constraint. The batched inference challenge (different sequences selecting different paths causing KV-cache misses) is also left for future work.
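To make the batching issue concrete, the sketch below shows what per-sequence paths imply for a batch: each layer runs only on the subset of sequences that selected it, which requires a gather before the layer and a scatter after it. This is a toy model with scalar hidden states and hypothetical names (`per_layer_groups`, `run_batch`). It does not model the KV cache and is not taken from the paper.

```python
def per_layer_groups(paths, num_layers):
    """For each layer, list the batch indices of the sequences that execute it."""
    groups = [[] for _ in range(num_layers)]
    for seq_idx, path in enumerate(paths):
        for layer in path:
            groups[layer].append(seq_idx)
    return groups

def run_batch(hidden, paths, layers):
    """Apply each layer only to the sequences whose path includes it."""
    hidden = list(hidden)
    groups = per_layer_groups(paths, len(layers))
    for layer_fn, idxs in zip(layers, groups):
        if not idxs:
            continue                              # no sequence selected this layer
        gathered = [hidden[i] for i in idxs]      # gather the active sequences
        updated = layer_fn(gathered)
        for i, h in zip(idxs, updated):           # scatter results back into the batch
            hidden[i] = h
    return hidden

# Toy "layers" that add a distinct constant, so the path taken is visible in the output.
layers = [lambda xs, d=d: [x + d for x in xs] for d in (1, 10, 100, 1000)]
paths = [[0, 1], [0, 3], [2, 3]]                  # three sequences, 50% budget each
groups = per_layer_groups(paths, len(layers))
result = run_batch([0, 0, 0], paths, layers)
print(groups, result)
```

In this example every layer runs on at most two of the three sequences. Each sequence meets its budget, but no layer can be skipped for the whole batch, which is one way diverging paths erode the savings.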

    The speedups, while real, are moderate: 1.19× decode speedup at 25% sparsity and 1.64× at 50% sparsity. These are meaningful but not dramatic, especially when accuracy drops significantly at high sparsity levels.
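A back-of-the-envelope check puts these numbers in context. Assume per-token decode time is dominated by Transformer blocks of equal cost. Skipping a fraction s of them then bounds the speedup at 1/(1 - s), and the gap between that bound and a measured speedup can be read as overhead. Both the equal-cost assumption and the additive overhead model are simplifications introduced here, not taken from the paper.

```python
def ideal_speedup(sparsity):
    """Upper bound if decode time scales linearly with the number of executed blocks."""
    return 1.0 / (1.0 - sparsity)

def implied_overhead(sparsity, measured_speedup):
    """Extra time, as a fraction of dense decode time, under
    time = (1 - sparsity) + overhead  (assumed additive model)."""
    return 1.0 / measured_speedup - (1.0 - sparsity)

reported = {0.25: 1.19, 0.50: 1.64}   # decode speedups quoted in the review
for s, measured in reported.items():
    bound = ideal_speedup(s)
    print(f"sparsity={s:.2f}  ideal={bound:.2f}x  measured={measured:.2f}x  "
          f"efficiency={measured / bound:.0%}  "
          f"overhead={implied_overhead(s, measured):.1%}")
```

Under these assumptions the reported figures correspond to roughly 9% and 11% of dense decode time spent on something other than the retained blocks. At 12.5% sparsity the ideal bound is only about 1.14×.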

4. Timeliness & Relevance

    The paper addresses a genuinely important problem: efficient LLM inference under heterogeneous compute constraints. With LLMs being deployed at scale in production systems serving diverse user populations, the ability to dynamically adjust compute per request is highly relevant.

    The budget-flexibility angle (single model, multiple budgets) is particularly timely given the operational overhead of managing multiple model variants. However, the field is rapidly evolving with mixture-of-experts, speculative decoding, and quantization approaches, and BUDDY's depth-pruning approach may have limited complementarity with some of these.

5. Strengths & Limitations

    Key Strengths:

  • Clean problem formulation with well-defined properties (Table 1)
  • Elegant reuse of first-layer KV cache for global context, adding minimal overhead
  • Comprehensive ablation studies and analysis across multiple model families (Llama1/2/3, Qwen)
  • The multi-budget single-model paradigm is genuinely practical
  • Thorough overhead analysis (Appendix C.2) demonstrating negligible Decision Module cost (~0.003-0.005% FLOPs)

    Key Limitations:

  • No VRAM reduction — all blocks remain resident in memory
  • Accuracy degradation at high sparsity (50%) is substantial (≈28-30% relative drop)
  • Decode-time adaptation benefits appear marginal in the presented experiments
  • Missing comparisons with other efficiency methods (quantization, distillation, attention optimization)
  • The Budget Predictor evaluation (Figure 13) shows that for commonsense tasks, ~75-80% of predictions concentrate at the lowest budget interval, suggesting limited adaptive behavior on these benchmarks
  • No evaluation on generation quality tasks beyond SAMSum ROUGE scores
  • Training requires SFT on Alpaca plus GRPO, adding 9-26 hours of training cost per backbone

Overall Assessment

    BUDDY presents a well-engineered system that unifies several desirable properties for adaptive LLM inference. The framework is technically sound and the experimental coverage is reasonable, though the actual performance gains from the two key innovations (decode-time adaptation and budget flexibility) are modest. The paper is a solid engineering contribution to efficient inference but falls short of providing transformative insights. The inability to reduce memory footprint and the moderate speedups at useful accuracy levels limit its practical impact.

    Rating: 5.8/10
    Significance 5.5 · Rigor 6 · Novelty 5.5 · Clarity 7

    Generated Jun 9, 2026

    Comparison History (21)

    Won vs. Exploring the Design Space of Reward Backpropagation for Flow Matching

    Paper 2 addresses a critical and universal bottleneck in modern AI: LLM inference cost. By offering dynamic, budget-driven depth routing, it provides a highly practical solution for deploying massive language models under strict latency and compute constraints. While Paper 1 offers valuable methodological improvements for text-to-image flow matching, the breadth of impact and real-world applicability of optimizing LLM inference efficiency gives Paper 2 a higher potential for widespread scientific and industrial impact.

    gemini-3.1-pro-preview·Jun 10, 2026
    Won vs. SPACR: Single-Pass Adaptive Training of Uncertainty-Aware Conformal Regressors

    Paper 1 addresses a critical bottleneck in modern AI: the high inference cost of Large Language Models. By offering dynamic, budget-driven depth routing, it provides a highly practical and timely solution with immediate real-world applications across numerous domains relying on LLMs. While Paper 2 offers a valuable methodological improvement in uncertainty quantification via conformal prediction, the sheer scale, timeliness, and economic impact of optimizing LLM deployment give Paper 1 a broader and more significant potential scientific impact.

    gemini-3.1-pro-preview·Jun 10, 2026
    Won vs. AuRA: Internalizing Audio Understanding into LLMs as LoRA

    Paper 2 (BUDDY) likely has higher impact due to broad applicability: it targets inference cost reduction for mainstream LLMs across many tasks and deployment settings, with explicit budget control and decode-time adaptive routing—highly timely for real-world serving constraints. The method is model-agnostic within Transformer LLMs and can affect many downstream applications and systems research. Paper 1 (AuRA) is innovative for speech-to-LLM integration via LoRA distillation, but its impact is narrower to audio/speech multimodality and depends on ASR-teacher availability and benchmark coverage.

    gpt-5.2·Jun 10, 2026
    Lost vs. When to Align, When to Predict: A Phase Diagram for Multimodal Learning

    Paper 2 provides a fundamental theoretical framework that addresses a critical gap in multimodal learning, offering broad applicability across diverse scientific fields like biomedicine and astrophysics. Its ability to diagnose and predict the success of cross-modal objectives before training represents a deep methodological advance. Paper 1, while highly relevant for the timely issue of LLM inference efficiency, represents a more incremental engineering optimization within a single domain.

    gemini-3.1-pro-preview·Jun 10, 2026
    Won vs. LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models

    Paper 1 addresses the critical and highly timely challenge of LLM inference efficiency. Its budget-driven dynamic depth routing provides a practical solution to reduce computational costs without retraining, offering significant real-world applications across the rapidly expanding AI industry. While Paper 2 offers a novel approach to continual learning, Paper 1 has broader immediate impact and relevance due to the widespread deployment and immense cost of operating large language models.

    gemini-3.1-pro-preview·Jun 9, 2026
    Lost vs. Disentanglement with Holographic Reduced Representations

    Paper 2 introduces a fundamentally novel connection between holographic reduced representations (HRR) and disentanglement, bridging symbolic AI and neural representation learning with both theoretical (information-theoretic capacity bounds) and empirical contributions. This cross-pollination of ideas from cognitive science/VSA with modern deep learning has broader potential impact across multiple fields. Paper 1, while technically solid, is an incremental engineering contribution to LLM efficiency—a crowded space with many competing approaches—and its impact is more narrowly scoped to practical deployment optimization.

    claude-opus-4-6·Jun 9, 2026
    Lost vs. Data-driven discovery of governing differential equations across physical systems

    Paper 2 likely has higher scientific impact because it provides a unifying, problem-oriented framework (discoverability phase diagram + REO abstraction) for a fast-growing area spanning many physical and life sciences. This can shape research directions, standardize thinking across methods, and influence broad application domains (physics, engineering, chemistry, biology). Paper 1 is timely and practically useful for efficient LLM inference, but its impact is narrower (systems/ML inference optimization) and more incremental relative to ongoing work on dynamic routing/pruning. Overall breadth and cross-field relevance favor Paper 2.

    gpt-5.2·Jun 9, 2026
    Won vs. Loss-Guided Adaptive Scale Refinement for Molecular Force Prediction

    BUDDY addresses a broadly impactful problem—efficient LLM inference—with a practical framework offering budget-driven dynamic depth routing, decode-time adaptation, and strict budget control. Its applicability to widely-used LLM families (Llama, Qwen) and the growing demand for efficient inference give it significant real-world relevance and breadth of impact. Paper 2, while introducing an interesting adaptive scale refinement idea for molecular force prediction, is narrower in scope (tested on a single NaCl system), demonstrates modest improvements, and is positioned more as a proof-of-concept. The LLM efficiency space has far larger community engagement and immediate practical demand.

    claude-opus-4-6·Jun 9, 2026
    Lost vs. When Do Local Score Models Extrapolate Across Size? A Diagnostic Theory and Benchmark

    Paper 1 offers a foundational theoretical framework and diagnostic benchmark for a critical bottleneck in scientific AI (size extrapolation in generative models). While Paper 2 provides a valuable system-level optimization for LLM inference, Paper 1's rigorous mathematical approach to quasi-locality and spatial mixing offers deeper, longer-lasting scientific insights with profound implications for domains like molecular modeling and physics simulations.

    gemini-3.1-pro-preview·Jun 9, 2026
    Won vs. Tight Sample Complexity of Transformers

    Paper 2 likely has higher scientific impact due to strong real-world applicability and timeliness: it targets LLM inference cost, a major bottleneck, and offers practical, controllable compute budgets with decode-time adaptive routing—features directly valuable for deployment across many systems. It also appears empirically validated on widely used model families, increasing adoption potential and cross-field impact (systems, ML, NLP). Paper 1 is methodologically rigorous and theoretically novel, but its impact may be narrower and slower to translate into practice than an inference-efficiency framework.

    gpt-5.2·Jun 9, 2026