PALS: Power-Aware LLM Serving for Mixture-of-Experts Models
Can Hankendi, Rana Shahout, Minlan Yu, Ayse K. Coskun
Abstract
Large language model (LLM) inference has become a dominant workload in modern data centers, driving significant GPU utilization and energy consumption. While prior systems optimize throughput and latency by batching, scheduling, and parallelism, they largely treat GPU power as a static constraint rather than a controllable resource. In this paper, we present a power-aware runtime for LLM serving, PALS, that treats GPU power caps as a first-class control knob and jointly optimizes them with software parameters such as batch size. The system combines lightweight offline power-performance models with a feedback-driven controller to select configurations that satisfy throughput targets while maximizing energy efficiency. We implement PALS within an existing LLM serving framework, vLLM, demonstrating that it requires no model retraining or API changes. Across multi-GPU systems and both dense and mixture-of-experts (MoE) models, PALS improves energy efficiency by up to 26.3%, reduces QoS violations by 4x to 7x under power constraints, and tracks dynamic power budgets. These results highlight the potential of integrating power control directly into LLM inference runtimes, enabling energy-proportional and grid-interactive AI systems.
AI Impact Assessments
Scientific Impact Assessment: PALS: Power-Aware LLM Serving for Mixture-of-Experts Models
1. Core Contribution
PALS introduces a power-aware runtime for LLM inference that elevates GPU power caps from static constraints to first-class, dynamically controllable knobs. The key insight is that jointly optimizing hardware-level power limits and software-level parameters (batch size, parallelism) expands the achievable Pareto frontier of efficiency-performance trade-offs beyond what either dimension alone can achieve. The system implements a closed-loop controller combining offline power-performance models (random forests) with PID feedback correction, operating at 500ms granularity within vLLM without requiring model retraining or API changes.
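The feedback half of such a closed loop can be sketched roughly as follows. This is a minimal illustration, not the PALS implementation: the class name `PIDPowerCapController`, the gains, the watt limits, and the choice to add the PID correction on top of the offline model's suggested cap are all assumptions made for this example. Only the 500 ms control interval comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class PIDPowerCapController:
    """Hypothetical PID loop that corrects a model-suggested GPU power cap."""

    kp: float          # proportional gain (assumed value, not from the paper)
    ki: float          # integral gain (assumed value)
    kd: float          # derivative gain (assumed value)
    min_cap_w: float   # lowest power cap the controller may request
    max_cap_w: float   # highest power cap the controller may request
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, model_cap_w: float, target_tps: float,
             measured_tps: float, dt_s: float = 0.5) -> float:
        """Return the power cap (watts) to apply for the next interval."""
        # Positive error: throughput is below target, so more power is requested.
        error = target_tps - measured_tps
        self.integral += error * dt_s
        derivative = (error - self.prev_error) / dt_s
        self.prev_error = error
        correction = (self.kp * error
                      + self.ki * self.integral
                      + self.kd * derivative)
        # Add the correction to the offline model's suggestion, then clamp
        # the result to the allowed power-cap range.
        return min(self.max_cap_w, max(self.min_cap_w, model_cap_w + correction))


controller = PIDPowerCapController(kp=0.05, ki=0.01, kd=0.0,
                                   min_cap_w=100.0, max_cap_w=300.0)
next_cap_w = controller.step(model_cap_w=200.0, target_tps=1000.0,
                             measured_tps=900.0)
```

In a real deployment the returned value would be applied through the GPU's power-management interface and the loop repeated every interval; both steps are outside this sketch.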
The contribution is particularly relevant for Mixture-of-Experts (MoE) models, where dynamic token routing creates variable compute-communication ratios that fundamentally alter the power-performance relationship. The paper shows that communication-bound MoE models (Qwen-MoE, OLMoE) actually *lose* efficiency beyond ~200W power caps because the additional power goes toward communication overhead rather than useful computation. This counterintuitive finding undermines the default practice of running GPUs at maximum power.
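The selection step implied by this trade-off can be illustrated with a small sketch: among candidate configurations, keep those predicted to meet a throughput target and pick the one with the most tokens per joule. The names `Config`, `tokens_per_joule`, and `select_config` are hypothetical, and the numbers in `CANDIDATES` are invented for illustration. They were chosen only to mimic a communication-bound shape where efficiency peaks near 200 W, and they are not measurements from the paper. A static table also stands in for the paper's learned power-performance models.

```python
from typing import NamedTuple, Optional, Sequence


class Config(NamedTuple):
    power_cap_w: int        # GPU power cap in watts
    batch_size: int         # serving batch size
    throughput_tps: float   # predicted tokens per second
    power_w: float          # predicted average power draw in watts


def tokens_per_joule(c: Config) -> float:
    # Tokens/s divided by joules/s (watts) gives tokens per joule.
    return c.throughput_tps / c.power_w


def select_config(candidates: Sequence[Config],
                  target_tps: float) -> Optional[Config]:
    """Most energy-efficient candidate that meets the throughput target."""
    feasible = [c for c in candidates if c.throughput_tps >= target_tps]
    if not feasible:
        return None  # no candidate satisfies the target
    return max(feasible, key=tokens_per_joule)


# Hypothetical predictions (illustrative only, not data from the paper):
# throughput flattens above 200 W while power draw keeps rising.
CANDIDATES = [
    Config(150, 16, 700.0, 145.0),
    Config(200, 16, 1000.0, 190.0),
    Config(250, 16, 1040.0, 240.0),
    Config(300, 16, 1060.0, 290.0),
]

best = select_config(CANDIDATES, target_tps=900.0)
```

With these illustrative numbers, a 900 tokens/s target is met most efficiently at the 200 W cap, even though the 250 W and 300 W caps also satisfy it.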
2. Methodological Rigor
The empirical methodology is generally sound. The paper systematically profiles three key dimensions (power caps, batch sizes, parallelism configurations) across eight models spanning dense and MoE architectures. The compute-communication decomposition analysis (Figure 2) provides clear mechanistic explanations for observed behaviors.
However, several methodological concerns arise:
3. Potential Impact
The practical impact potential is significant:
4. Timeliness & Relevance
The paper addresses a genuine and growing need. LLM inference energy consumption is a first-order concern for cloud providers, with major companies publicly committing to sustainability targets while simultaneously scaling AI deployments. The focus on MoE models is well-timed, as the industry is shifting toward sparse architectures (Mixtral, DeepSeek, etc.) for efficiency at scale. The observation that MoE models create unique power-management challenges due to communication overhead is timely and underexplored.
The connection to carbon-aware computing and demand response is forward-looking but not yet validated with real grid signals or carbon intensity data—this remains aspirational rather than demonstrated.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Minor Issues:
Overall Assessment
PALS makes a solid systems contribution by demonstrating that joint hardware-software optimization for LLM inference power management is both feasible and beneficial. The MoE-specific insights regarding communication-bound behavior under power scaling are genuinely novel and practically important. However, the evaluation could be more rigorous (statistical reporting, broader hardware platforms, longer-duration experiments), and some claims slightly overstate generality given the experimental scope. The work is a meaningful step toward energy-proportional AI serving, though it falls short of being a comprehensive solution.
Generated May 21, 2026
Comparison History (21)
AMEL identifies a fundamental and previously undercharacterized bias in LLM-as-judge paradigms, which is now a widespread practice across NLP research, content moderation, and code review. The finding that conversational history systematically shifts LLM judgments has broad implications for any pipeline using LLMs as evaluators, affecting reproducibility across many fields. The large-scale empirical rigor (75K+ API calls, 11 models, 4 providers) and actionable mitigation advice give it high practical relevance. PALS addresses an important but more niche systems optimization problem (energy-efficient LLM serving) with incremental engineering contributions over existing power-capping and scheduling work.
Paper 2 (PALS) likely has higher scientific impact due to strong real-world applicability and timeliness: data-center energy use is a major, immediate constraint, and a runtime integrated into vLLM without retraining lowers adoption barriers. The methodology is concrete (models + feedback control) with clear, measurable gains across hardware and MoE/dense models, suggesting broad systems impact. Paper 1 (LCGuard) is novel and important for emerging latent multi-agent communication safety, but relies on an operational leakage definition (reconstruction) that may not capture all privacy risks and may see slower near-term deployment.
Paper 2 likely has higher scientific impact: it identifies a counterintuitive inverse-scaling failure mode in LLM forecasting under superlinear growth and tail-regime risk, introduces a new benchmark, and shows replication across simulated and multiple real-world domains. The findings affect evaluation methodology (tail-inclusive scoring) and practical deployment in high-stakes settings (finance/epidemiology), with broad relevance to ML safety, calibration, and benchmarking. Paper 1 is a solid systems contribution with clear datacenter applicability, but its novelty and cross-field reach are narrower and more incremental compared to the conceptual and evaluative implications of Paper 2.
Paper 1 likely has higher scientific impact due to a concrete, novel systems contribution (treating GPU power caps as a control variable jointly optimized with batching) with demonstrated, reproducible gains (up to 26.3% energy efficiency, fewer QoS violations) and immediate applicability to widespread LLM serving stacks (integration into vLLM, no retraining). It is timely given datacenter energy constraints and could influence both systems research and production deployments. Paper 2 is important conceptually but is more of a survey/proposal with limited methodological rigor and harder-to-standardize, lower-scalability evaluation evidence.
Paper 1 likely has higher scientific impact due to broad, timely applicability and strong real-world relevance: power/energy-aware LLM serving is a data-center–scale constraint affecting most deployments, and integrating GPU power caps into runtime control (implemented in vLLM without retraining) is a pragmatic, system-level innovation with immediate operational benefits. The methodology (offline models + feedback controller) and evaluation across dense/MoE and multi-GPU settings suggest solid rigor and generality. Paper 2 is impactful for VLM reliability, but may be more benchmark/task- and prompt/framework-dependent and potentially less universally deployable.
Paper 2 addresses a critical, highly timely bottleneck in modern AI: the massive energy consumption and power constraints of serving Large Language Models (LLMs). By dynamically optimizing GPU power caps during MoE model inference without requiring model retraining, it offers immediate, scalable, and high-impact real-world applications for data centers and cloud providers. While Paper 1 introduces a rigorous foundational metric for AI uncertainty, Paper 2's direct solution to the pressing economic and environmental costs of LLM deployment gives it a broader and more immediate potential scientific and industrial impact.
PALS addresses the critical and timely problem of energy efficiency in LLM serving, which is a dominant concern as AI workloads scale rapidly in data centers. It offers practical, deployable contributions (integrated into vLLM, no retraining needed) with significant energy savings (up to 26.3%) and QoS improvements. The breadth of impact spans systems, sustainability, and grid-interactive computing. Paper 2, while a solid engineering contribution, is a niche simulator for a specific game environment with narrower impact, primarily serving the RL-for-games community.
PALS addresses a critical and growing real-world problem—energy efficiency of LLM inference in data centers—with broad practical applicability. It integrates into existing frameworks (vLLM) without retraining, making adoption straightforward. The results (26.3% energy efficiency gains, 4-7x QoS violation reduction) are immediately actionable across the industry. While OSCToM makes meaningful contributions to Theory of Mind benchmarks, it targets a narrower research community. PALS has broader cross-disciplinary impact spanning systems, sustainability, and AI infrastructure, with high timeliness given growing concerns about AI energy consumption.
PALS addresses the increasingly critical problem of energy efficiency in LLM inference, which has broad real-world impact as AI deployment scales. It offers a practical, system-level solution integrated into an existing framework (vLLM) requiring no retraining, making adoption straightforward. The results (26.3% energy efficiency improvement, 4-7x QoS violation reduction) are substantial and immediately applicable across data centers. While OSCToM makes meaningful contributions to Theory of Mind benchmarks, its impact is narrower—focused on a specific cognitive reasoning capability. PALS has broader cross-disciplinary relevance spanning systems, sustainability, and AI infrastructure, with high timeliness given growing energy concerns around AI.
Paper 2 has higher likely scientific impact: it presents a concrete, implemented system (PALS) with measurable gains (energy efficiency, QoS) in a timely, high-demand area (LLM/MoE serving and datacenter energy). The methodology appears more rigorous (models + feedback control, integration into vLLM, multi-GPU experiments) and offers immediate real-world applicability for operators and cloud providers. Paper 1 is a forward-looking 6G vision that is novel conceptually but less methodologically grounded and may have longer, more uncertain translation to practice.
Paper 2 likely has higher scientific impact: it introduces a novel, broadly applicable systems approach (power caps as a first-class control knob jointly optimized with batching) with clear methodological rigor (implemented in vLLM, evaluated across multi-GPU, dense and MoE, dynamic budgets) and immediate real-world relevance for data-center energy and QoS. Its benefits (up to 26.3% efficiency, fewer QoS violations) align with timely concerns about AI sustainability and could influence both ML systems research and industry practice. Paper 1 is valuable, but its dataset scale and domain specificity may limit breadth and uptake.
Paper 2 has higher estimated impact because it presents a concrete, implementable system (PALS) with demonstrated gains (up to 26.3% energy efficiency, 4–7× fewer QoS violations) and clear applicability to today’s widespread LLM serving in data centers. Its methodology (offline modeling + feedback control) is testable and reproducible, and integration into vLLM without retraining or API changes increases adoption potential. Paper 1 is a forward-looking 6G vision with high-level directions but limited empirical validation, making near-term scientific and practical impact less certain despite its timeliness.
Paper 1 likely has higher impact due to timeliness and broad applicability: power/energy constraints are a central, immediate bottleneck for large-scale LLM deployment, and a vLLM-integrated runtime that improves energy efficiency and QoS under power caps can be adopted quickly across many models and data centers. The approach is methodologically solid (models + feedback control) and affects systems, ML serving, and sustainable computing. Paper 2 is valuable and well-packaged, but its small scale (10 groups) and narrower domain may limit near-term, cross-field impact compared to scalable infrastructure advances.
AgentCo-op addresses a more fundamental and broadly impactful challenge: composing multi-agent workflows in open-ended settings without curated benchmarks. It introduces a novel retrieval-based synthesis framework with typed artifact handoffs and local repair, demonstrating versatility across scientific discovery (genomics) and standard benchmarks. Its contribution to agentic AI design paradigms has broader applicability across many fields. PALS, while practically valuable for energy-efficient LLM serving, addresses a more incremental systems optimization problem with narrower scope (GPU power management), making it impactful but less transformative.
AgentCo-op addresses a broader and more fundamental challenge—composing multi-agent workflows for open-ended scientific tasks without curated benchmarks—which has wider applicability across scientific domains. Its retrieval-based synthesis framework with typed artifact handoffs is more novel conceptually, and the demonstrated applications in genomics show real-world scientific discovery potential. PALS is a solid engineering contribution for energy-efficient LLM serving, but its scope is narrower (GPU power management) and represents incremental optimization rather than a new paradigm. AgentCo-op's cross-domain applicability and implications for AI-driven science give it higher impact potential.
Paper 2 is likely to have higher scientific impact due to broader cross-domain relevance and conceptual novelty: it analyzes a fundamental failure mode (off-manifold drift) in compositional guidance for diffusion/flow models and proposes a general, lightweight conflict-aware mechanism applicable across vision, synthetic benchmarks, and planning/control. This targets an active, timely area (inference-time control without fine-tuning) and can influence multiple downstream applications and research directions. Paper 1 is impactful for systems/efficiency, but is more incremental and primarily scoped to LLM serving/runtime engineering in datacenters.
Paper 1 addresses a fundamental challenge in compositional guided generation for diffusion/flow models, a rapidly growing area with broad applications across image generation, planning, and control. The theoretical analysis of gradient misalignment and the proposed conflict-aware correction mechanism are novel and generalizable. Paper 2, while practically valuable, addresses a narrower systems-level optimization (power-aware LLM serving) with incremental contributions over existing serving frameworks. Paper 1's broader applicability across domains, stronger methodological novelty, and relevance to the fast-moving generative AI field give it higher impact potential.
Paper 2 is likely to have higher scientific impact because it introduces a broadly reusable benchmark and auditing framework for deep web research agents, a timely and fast-growing area. Benchmarks often become community standards, shaping model development, evaluation methodology, and downstream products across NLP, IR, HCI, and AI safety (calibration, provenance). It also releases data/code and provides diagnostic findings that can redirect research. Paper 1 is methodologically solid with clear real-world energy benefits, but its impact is narrower (LLM serving/power management) and more incremental relative to existing systems optimization work.
Paper 2 addresses a fundamental theoretical question about the equivalence of DPO and RLHF—two of the most widely used alignment methods for LLMs. Identifying implicit assumptions and failure modes in DPO has broad implications for the entire LLM alignment community, which is massive and rapidly growing. The paper provides rigorous theoretical analysis, practical solutions (CPO), and experimental validation. Paper 1, while practically useful for energy efficiency in LLM serving, addresses a more incremental systems optimization problem with narrower scope. Paper 2's theoretical insights are likely to influence future alignment research and methodology choices across the field.
Paper 2 addresses fundamental theoretical limitations of DPO, a core alignment technique for modern LLMs. By proving its conditional equivalence to RLHF and introducing a provably aligned alternative, it offers profound theoretical insights with immediate, widespread applicability in AI alignment. While Paper 1 provides valuable systems-level optimizations for energy efficiency, Paper 2's theoretical contributions will likely influence a broader range of foundational research and model training paradigms across the machine learning community.