Bole Ma, Ayesha Afzal, Jan Eitzinger, Gerhard Wellein
Power capping is the standard GPU energy lever in LLM serving, and it appears to work: throughput drops, power readings fall, and energy budgets are met. We show the appearance is illusory for the phase that dominates production serving: autoregressive decode. Across four attention paradigms (GQA, MLA, Gated DeltaNet, and Mamba2) on NVIDIA H200, decode draws only 137–300 W on a 700 W GPU; no cap ever triggers, because memory-bound decode saturates HBM bandwidth rather than compute and leaves power headroom untouched. Firmware-initiated clock throttling compounds the illusion: these deviations can corrupt any throughput measurement that attributes them to the cap. SM clock locking dissolves both confounds. By targeting the lever that is actually on the critical path, clock locking Pareto-dominates power capping universally, recovering up to 32% of decode energy at minimal throughput loss. We identify three architecture-dependent DVFS behavioural classes and characterise a common energy pattern across novel attention replacements: a heavy prefill cost recouped by efficient decode, eventually halving total request energy relative to GQA at production batch sizes.
This paper makes a sharp, well-defined claim: GPU power capping—the default energy management mechanism in data centers—is structurally ineffective for the autoregressive decode phase of LLM inference. The authors demonstrate that because decode is memory-bound (drawing only 137–300W on a 700W TDP H200 GPU), the power cap never triggers, rendering it inert. They propose SM clock locking as a superior alternative that directly controls the lever on the critical path, recovering up to 32% of decode energy with negligible throughput loss. The paper further characterizes this across four attention paradigms (GQA, MLA, Gated DeltaNet, Mamba2) and identifies three DVFS behavioral classes.
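The memory-bound argument can be made concrete with a back-of-the-envelope roofline estimate. The sketch below is not taken from the paper: it assumes BF16 weights, counts only weight traffic (KV-cache and activation reads are ignored), and uses approximate public H200 peak figures (roughly 989 TFLOP/s dense BF16 and 4.8 TB/s HBM bandwidth). All constants and function names are illustrative assumptions.

```python
# Roofline-style estimate of when a decode step is memory-bound.
# Assumptions (not from the paper): BF16 weights, weight traffic dominates,
# approximate H200 peak numbers.

PEAK_FLOPS = 989e12      # assumed dense BF16 peak, FLOP/s
PEAK_BW = 4.8e12         # assumed HBM bandwidth, bytes/s
BYTES_PER_PARAM = 2      # BF16


def decode_intensity(batch_size: int) -> float:
    """FLOPs per byte of weight traffic for one decode step.

    Each parameter is read once (2 bytes) and used for one multiply-add
    (2 FLOPs) per sequence in the batch.
    """
    return 2.0 * batch_size / BYTES_PER_PARAM


def ridge_point() -> float:
    """Arithmetic intensity above which the GPU becomes compute-bound."""
    return PEAK_FLOPS / PEAK_BW


def is_memory_bound(batch_size: int) -> bool:
    """True when the decode step sits left of the roofline ridge."""
    return decode_intensity(batch_size) < ridge_point()
```

Under these assumptions the ridge point is about 206 FLOP/byte, while a batch of 32 reaches only about 32 FLOP/byte, so compute units sit mostly idle. That is consistent with the low power draw the paper reports.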
The central insight—that power capping is the wrong abstraction for memory-bound workloads—is not conceptually surprising to hardware performance engineers, but the systematic demonstration across multiple modern attention architectures, combined with the identification of a firmware-level clock clamping artifact that compounds the confusion, constitutes a valuable and actionable contribution.
The experimental methodology is notably careful:
However, some limitations affect rigor:
Immediate practical impact: The finding is directly deployable: replacing power caps with static SM clock locking requires a single `nvidia-smi` command at job start. For disaggregated serving (Splitwise, DistServe), where decode runs on dedicated GPU pools, this is particularly clean. At data-center scale the savings are significant: the paper estimates 0.5 MW across 10,000 GPUs, or about 50 W per GPU.
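As a rough illustration of what that job-start step could look like, the sketch below builds the `nvidia-smi` invocations for locking and resetting the SM clock. The helper names, the dry-run default, and the 1500 MHz value in the usage note are assumptions for illustration, not the paper's tooling or recommended setting. Actually applying the command typically needs administrative privileges.

```python
import subprocess
from typing import List


def lock_clocks_cmd(gpu_index: int, sm_mhz: int) -> List[str]:
    """Build the nvidia-smi call that pins the SM clock to one frequency."""
    # --lock-gpu-clocks takes "min,max"; equal values give a static lock.
    return ["nvidia-smi", "-i", str(gpu_index),
            f"--lock-gpu-clocks={sm_mhz},{sm_mhz}"]


def reset_clocks_cmd(gpu_index: int) -> List[str]:
    """Build the nvidia-smi call that restores default clock behaviour."""
    return ["nvidia-smi", "-i", str(gpu_index), "--reset-gpu-clocks"]


def apply(cmd: List[str], dry_run: bool = True) -> str:
    """Return the command line; run it only when dry_run is False."""
    if not dry_run:
        subprocess.run(cmd, check=True)  # usually requires root
    return " ".join(cmd)
```

For example, `apply(lock_clocks_cmd(0, 1500))` returns the command string without touching the GPU. A serving wrapper would call it with `dry_run=False` before launching the decode worker and issue the reset command on shutdown.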
Monitoring and accounting: The paper raises an underappreciated point about misleading metrics—"power cap utilization" makes decode workloads appear efficient when the GPU is simply not being pushed. This could influence how operators instrument and reason about energy.
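One way an operator might instrument this is to read power draw, the configured power limit, and the current SM clock together, then ask two separate questions: is the cap actually binding, and does the observed clock match the requested lock? The sketch below parses one line of `nvidia-smi --query-gpu=power.draw,power.limit,clocks.sm --format=csv,noheader,nounits` output. The class, the 15 MHz tolerance, and the "binding when draw reaches the limit" rule are simplifying assumptions, not the paper's methodology.

```python
from dataclasses import dataclass

# Fields requested from nvidia-smi, in this order.
QUERY_FIELDS = "power.draw,power.limit,clocks.sm"


@dataclass
class GpuSample:
    power_draw_w: float
    power_limit_w: float
    sm_clock_mhz: float

    @property
    def cap_utilization(self) -> float:
        """Fraction of the configured power limit currently drawn."""
        return self.power_draw_w / self.power_limit_w

    @property
    def cap_is_binding(self) -> bool:
        """Simplified rule: the cap can only act once draw reaches it."""
        return self.power_draw_w >= self.power_limit_w

    def clock_deviates(self, requested_mhz: float,
                       tolerance_mhz: float = 15.0) -> bool:
        """True when the observed SM clock differs from the requested lock."""
        return abs(self.sm_clock_mhz - requested_mhz) > tolerance_mhz


def parse_sample(line: str) -> GpuSample:
    """Parse one 'draw, limit, clock' CSV line (no header, no units)."""
    draw, limit, clock = (float(field.strip()) for field in line.split(","))
    return GpuSample(draw, limit, clock)
```

A sample such as `"300.00, 700.00, 1830"` reports a low cap utilization and a non-binding cap, and flags a deviation if 1980 MHz was requested. That second check is the kind of silent clamp the paper warns about.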
Cross-architecture energy characterization: The finding that novel architectures (MLA, GDN, Mamba2) share a "heavy prefill, efficient decode" pattern, with crossover points dependent on batch size and context length, provides actionable guidance for architecture selection in production. The specific crossover thresholds (MLA below GQA at BS=32 beyond 4K tokens; Mamba2 after ~1000 output tokens) are deployment-relevant.
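The crossover logic can be sketched with a simple linear model: total request energy is a fixed prefill cost plus a per-token decode cost, and the break-even output length is where the extra prefill energy has been repaid by cheaper decode. This model and the function names are assumptions for illustration. The paper's measured thresholds depend on batch size and context length and are not reproduced here.

```python
from typing import Optional


def total_energy(prefill_j: float, decode_j_per_token: float,
                 output_tokens: float) -> float:
    """Request energy under a linear model: prefill plus per-token decode."""
    return prefill_j + decode_j_per_token * output_tokens


def crossover_tokens(prefill_base_j: float, decode_base_j: float,
                     prefill_new_j: float, decode_new_j: float) -> Optional[float]:
    """Output length at which the new architecture matches the baseline.

    Returns None when the new architecture does not save energy per
    decoded token, so a higher prefill cost is never recouped.
    """
    saving_per_token = decode_base_j - decode_new_j
    if saving_per_token <= 0:
        return None
    extra_prefill = prefill_new_j - prefill_base_j
    return max(0.0, extra_prefill / saving_per_token)
```

With hypothetical numbers (baseline: 50 J prefill and 0.4 J/token; alternative: 150 J prefill and 0.2 J/token) the break-even point is 500 output tokens. Shorter requests favour the baseline and longer ones favour the alternative.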
Influence on hardware design: The paper implicitly argues that GPU power management firmware should be context-aware—distinguishing memory-bound from compute-bound phases—rather than relying on aggregate power ceilings. This could influence NVIDIA's driver/firmware design, though the feedback loop is indirect.
This paper is highly timely. LLM inference is the dominant GPU workload growth category in data centers, and energy management is an urgent operational concern. The shift toward decode-dominated workloads (agentic systems, streaming, long-output generation) makes the decode phase increasingly important relative to prefill. The paper also arrives as novel attention architectures (MLA in DeepSeek-V3, Mamba2 in Nemotron, GDN in Qwen3.5) are entering production—characterizing their energy profiles is immediately useful.
The gap identification is precise: prior work (POLCA) observed the two-stage power profile but drew capacity-planning conclusions; DVFS studies benchmarked against default governors rather than power capping. This paper explicitly asks and answers the right question.
Strengths:
1. Clean, falsifiable thesis: "Power capping is ineffective for decode" is testable and the evidence is compelling.
2. Controlled experimental design: The TransMLA ablation is exemplary.
3. Actionable results: The per-architecture clock policy table (Figure 2) and three behavioral classes are directly deployable.
4. Firmware artifact discovery: The silent 1980→1830 MHz clamp under `--lock-gpu-clocks` is a practical finding that protects other researchers from measurement errors.
5. Pareto analysis: Presenting throughput vs. energy efficiency frontiers (Figure 3) gives operators a principled trade-off framework.
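For readers who want to apply the same kind of Pareto comparison to their own measurements, a minimal sketch of the dominance test follows. It treats throughput as higher-is-better and energy per token as lower-is-better. The `Config` type and any values passed to it are hypothetical and are not data from the paper.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    throughput: float        # tokens/s, higher is better
    energy_per_token: float  # J/token, lower is better


def dominates(a: Config, b: Config) -> bool:
    """True when a is no worse than b on both axes and better on one."""
    no_worse = (a.throughput >= b.throughput
                and a.energy_per_token <= b.energy_per_token)
    strictly_better = (a.throughput > b.throughput
                       or a.energy_per_token < b.energy_per_token)
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs)]
```

Feeding measured (throughput, energy) points for each power cap and each locked clock into `pareto_front` shows directly which settings are worth considering. A policy dominates another only if it is at least as good on both axes.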
Weaknesses:
1. Single hardware platform: H200 SXM only. The claim of universality is argued but not demonstrated across GPU generations.
2. Small models only (~4B): While the arithmetic intensity argument is sound, empirical validation on 70B+ models with tensor parallelism would strengthen the case considerably.
3. Single framework: vLLM only, with unfused eager mode for GDN/Mamba2. The paper acknowledges this, but the prefill penalty characterization remains framework-dependent.
4. No dynamic DVFS: Only static clock locking is evaluated. A practical deployment might benefit from phase-aware dynamic switching, which the paper identifies as future work but does not explore.
5. No cost-of-implementation analysis: While `nvidia-smi` is simple, integrating clock policies into production serving schedulers at scale requires engineering effort that is not discussed.
This is a well-executed systems characterization paper that identifies a real and actionable gap between data-center practice and LLM inference reality. Its impact is primarily practical rather than theoretical, but the systematic cross-architecture energy characterization adds scientific value. The paper would benefit from multi-GPU and larger-model validation, but the core finding is sound and immediately useful. It is likely to influence both operational practices and future inference energy research.
Generated May 13, 2026
Paper 1 is likely to have higher impact because it overturns a widely assumed energy-control practice (GPU power capping) for the dominant LLM-serving phase (decode) with phase-aware, architecture-spanning evidence on a flagship production GPU (H200), and proposes a generally applicable, actionable alternative (SM clock locking) with sizable energy gains. Its findings directly affect the rapidly growing LLM inference ecosystem, evaluation methodology (DVFS confounds), and hardware–software co-optimization across multiple emerging attention architectures. Paper 2 is solid systems work, but decentralized FL frameworks face harder adoption barriers and less immediate cross-industry urgency than LLM inference energy.
Paper 2 addresses a highly timely and globally critical issue—LLM energy consumption—by debunking standard assumptions about GPU power capping. Its proposed clock locking method offers immediate, actionable improvements for AI infrastructure. While Paper 1 provides excellent performance gains for biological simulations, Paper 2's findings have far broader applicability and immediate real-world impact across the rapidly expanding field of AI serving and systems architecture.
Paper 1 offers a highly novel and timely technical breakthrough by debunking current assumptions about LLM energy consumption and demonstrating up to 32% energy savings using SM clock locking. Given the massive scale and environmental footprint of LLM deployment, this has immediate, high-impact real-world applications in data center efficiency. In contrast, Paper 2 is a community summit report focusing on policy, standardization, and administrative challenges in scientific workflows. While valuable, it lacks the direct, quantifiable, and immediate technical impact of Paper 1.
Paper 1 reveals a fundamental and previously unrecognized flaw in how power capping is applied to LLM inference on GPUs—showing it's ineffective during the dominant decode phase. This finding has immediate, broad implications for GPU energy management in production LLM serving, challenges conventional assumptions, and provides rigorous characterization across four attention architectures with actionable alternatives (clock locking). Paper 2 applies LLMs to microservice autoscaling with incremental improvements. While useful, it's more application-engineering than a fundamental insight, and its impact is narrower and less paradigm-shifting.
Paper 2 addresses a critical, highly timely issue: the massive energy consumption of LLM serving. By exposing fundamental flaws in standard GPU power capping for memory-bound decode phases and offering a highly effective alternative (SM clock locking), it promises immediate, large-scale real-world impact on AI infrastructure costs and sustainability. Paper 1 offers a clever consensus protocol improvement for distributed ledgers, but the ubiquitous deployment and soaring energy demands of LLMs give Paper 2 a significantly broader and more urgent scientific and industrial impact.
Paper 1 reveals a fundamental and previously unrecognized flaw in how the dominant energy management technique (power capping) is applied to LLM inference decode phases. This insight affects the entire LLM serving industry and challenges conventional wisdom across multiple attention architectures. The discovery that memory-bound decode never triggers power caps, combined with the actionable alternative (clock locking) that Pareto-dominates, has broad immediate practical impact. Paper 2 presents a solid but more incremental optimization for pipeline parallelism training throughput, addressing a narrower problem with well-established techniques (LP optimization, parameter freezing).
Paper 2 is more novel and broadly impactful: it challenges a widely used operational assumption (GPU power capping efficacy) with phase-aware measurements on modern hardware (H200) and across multiple attention architectures, offering a clearer causal explanation (memory-bound decode, DVFS confounds) and a generally applicable alternative (SM clock locking) with quantified benefits. Its findings affect benchmarking methodology and energy/throughput optimization in real-world LLM serving across systems, hardware, and model-architecture communities. Paper 1 is useful but more incremental, focused on an optimization framework within federated fine-tuning over wireless settings.
Paper 1 introduces a new conceptual separation in Byzantine Agreement—showing subquadratic communication suffices to reach univalency via ε-BA and extractable BA—potentially reshaping how lower bounds are interpreted and how protocols are designed. The results are broadly relevant across distributed computing, consensus, and cryptographic protocol design, with strong theoretical novelty and likely lasting citation impact. Paper 2 offers timely, high-value systems insights for LLM serving and GPU energy tuning, but its impact is more hardware/platform-dependent and may age faster as architectures and firmware evolve.
Paper 1 addresses a critical and highly timely issue: the massive energy consumption of LLM serving. By exposing the ineffectiveness of standard GPU power capping for LLM decode and proposing a Pareto-dominating alternative (SM clock locking), it offers immediate, highly valuable real-world applications for data centers. Paper 2 presents a solid theoretical advancement in distributed robotics, but its impact is relatively niche and lacks the broad, immediate applicability and timeliness of Paper 1's findings in the booming field of AI infrastructure.
Paper 1 addresses a highly timely and critical issue: energy efficiency in large language model serving. By debunking standard power capping assumptions and proposing a highly effective alternative, it offers immediate, massive real-world impact for AI infrastructure. Paper 2 provides fundamental theoretical results in distributed computing, but Paper 1's immediate applicability and relevance to the explosive growth of LLMs give it a substantially higher potential for broad and significant scientific and industrial impact.