PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

Can Hankendi, Rana Shahout, Minlan Yu, Ayse K. Coskun

May 20, 2026

arXiv:2605.21427v1

cs.AI (primary), cs.DC

#989 of 2292 · Artificial Intelligence

Tournament Score: 1428±45 (range 1050–1800)

Win Rate: 52%

Rating: 6.5/10

Significance 7 · Rigor 5.5 · Novelty 6.5 · Clarity 6.5

Abstract

Large language model (LLM) inference has become a dominant workload in modern data centers, driving significant GPU utilization and energy consumption. While prior systems optimize throughput and latency by batching, scheduling, and parallelism, they largely treat GPU power as a static constraint rather than a controllable resource. In this paper, we present a power-aware runtime for LLM serving, PALS, that treats GPU power caps as a first-class control knob and jointly optimizes them with software parameters such as batch size. The system combines lightweight offline power-performance models with a feedback-driven controller to select configurations that satisfy throughput targets while maximizing energy efficiency. We implement PALS within an existing LLM serving framework, vLLM, demonstrating that it requires no model retraining or API changes. Across multi-GPU systems and both dense and mixture-of-experts (MoE) models, PALS improves energy efficiency by up to 26.3%, reduces QoS violations by 4x to 7x under power constraints, and tracks dynamic power budgets. These results highlight the potential of integrating power control directly into LLM inference runtimes, enabling energy-proportional and grid-interactive AI systems.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

1. Core Contribution

PALS introduces a power-aware runtime for LLM inference that elevates GPU power caps from static constraints to first-class, dynamically controllable knobs. The key insight is that jointly optimizing hardware-level power limits and software-level parameters (batch size, parallelism) expands the achievable Pareto frontier of efficiency-performance trade-offs beyond what either dimension alone can achieve. The system implements a closed-loop controller combining offline power-performance models (random forests) with PID feedback correction, operating at 500ms granularity within vLLM without requiring model retraining or API changes.
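The closed-loop design described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `predict_throughput` is a hypothetical stand-in for the offline random-forest model, the PID gains and candidate caps are made up, and picking the lowest feasible cap is a simplified proxy for maximizing energy efficiency.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class PID:
    """Textbook discrete PID; the gains are illustrative, not taken from the paper."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, error: float, dt: float) -> float:
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def choose_cap(
    caps_w: Sequence[int],
    predict_throughput: Callable[[int], float],
    target_tps: float,
) -> int:
    """Pick the lowest power cap whose predicted throughput meets the target.

    Falls back to the highest cap when no candidate is predicted to be feasible.
    """
    feasible = [c for c in sorted(caps_w) if predict_throughput(c) >= target_tps]
    return feasible[0] if feasible else max(caps_w)


def control_step(
    pid: PID,
    caps_w: Sequence[int],
    predict_throughput: Callable[[int], float],
    target_tps: float,
    measured_tps: float,
    dt: float = 0.5,  # 500 ms control interval, as described in the text
) -> int:
    """One control tick: feedback raises the effective target when measurements fall short."""
    correction = pid.step(target_tps - measured_tps, dt)
    return choose_cap(caps_w, predict_throughput, target_tps + correction)
```

The split mirrors the description: the offline model proposes a configuration, and the PID term compensates when measured throughput drifts from the model's prediction.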

The contribution is particularly relevant for Mixture-of-Experts (MoE) models, where dynamic token routing creates variable compute-communication ratios that fundamentally alter the power-performance relationship. The paper shows that communication-bound MoE models (Qwen-MoE, OLMoE) actually *lose* efficiency beyond ~200W power caps because additional power accelerates communication overhead rather than useful computation—a counterintuitive finding that undermines the default practice of running GPUs at maximum power.
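To make the efficiency argument concrete, the sketch below computes tokens per joule at several power caps and picks the most efficient one. The profile values are synthetic and chosen only to show the shape of the effect (throughput flattening while power keeps rising); they are not measurements from the paper, and the function names are hypothetical.

```python
from typing import Dict, Tuple


def tokens_per_joule(throughput_tps: float, power_w: float) -> float:
    """Energy efficiency: tokens/s divided by watts (J/s) gives tokens per joule."""
    return throughput_tps / power_w


def most_efficient_cap(profile: Dict[int, Tuple[float, float]]) -> int:
    """profile maps power cap (W) -> (throughput in tokens/s, measured power in W)."""
    return max(profile, key=lambda cap: tokens_per_joule(*profile[cap]))


# Synthetic numbers for illustration only; not taken from the paper.
synthetic_profile = {
    150: (900.0, 148.0),
    200: (1200.0, 195.0),
    250: (1260.0, 245.0),
    300: (1290.0, 290.0),
}
```

With this synthetic profile, efficiency peaks at the 200 W cap because throughput gains beyond that point are smaller than the added power draw.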

2. Methodological Rigor

The empirical methodology is generally sound. The paper systematically profiles three key dimensions (power caps, batch sizes, parallelism configurations) across eight models spanning dense and MoE architectures. The compute-communication decomposition analysis (Figure 2) provides clear mechanistic explanations for observed behaviors.

However, several methodological concerns arise:

Offline profiling limitations: The random forest model is trained on offline sweeps with fixed sequence lengths and representative prompts. The paper acknowledges but does not rigorously evaluate robustness to distribution shifts in prompt characteristics (length, complexity). The reported MAPE of 6.8% for throughput and 4.5% for power is reasonable but evaluated only on held-out configurations from the same distribution.
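For reference, MAPE (mean absolute percentage error) is the average of the per-sample relative errors, expressed as a percentage. A minimal sketch of the standard definition follows; the function name is illustrative and the paper's exact evaluation code is not known.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)
```

Evaluating this metric on prompts drawn from a different length distribution than the training sweeps would be one way to probe the robustness concern raised above.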

Hardware scope: All experiments use NVIDIA A100 GPUs (up to 3 nodes × 4 GPUs). Generalizability to newer architectures (H100, B200) with different power-performance characteristics, or to AMD/Intel accelerators, remains unvalidated.

Workload realism: While Poisson arrivals are standard, production LLM serving exhibits more complex patterns (correlations, heavy tails, priority classes). The 60-minute evaluation windows, while adequate for demonstration, are short relative to production deployment timescales.

Limited statistical reporting: Results are presented as single-run numbers without confidence intervals or variance across runs, which weakens claims of specific percentage improvements.
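As an illustration of the reporting this concern asks for, the sketch below summarizes repeated runs as a mean with a confidence-interval half-width. It uses a normal approximation from the Python standard library; with only a handful of runs, a Student-t quantile would give a wider and more appropriate interval.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_ci(samples: Sequence[float], confidence: float = 0.95) -> Tuple[float, float]:
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))  # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return mean, z * sem
```

A result would then be reported as `mean ± half_width` over the repeated runs rather than as a single number.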

Control loop stability: The PID controller and 5% threshold for configuration changes are described but not formally analyzed for stability guarantees. The interaction between the discrete configuration space and continuous PID correction deserves more rigorous treatment.
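One common way to use such a threshold is as a hysteresis guard that suppresses configuration changes with marginal benefit. The sketch below assumes the 5% figure applies to the predicted relative efficiency gain; the review does not say what quantity the threshold is measured on, so this interpretation and the function name are assumptions.

```python
def should_switch(current_eff: float, candidate_eff: float, threshold: float = 0.05) -> bool:
    """Switch only if the candidate's predicted efficiency exceeds the current one by more than threshold."""
    if current_eff <= 0:
        raise ValueError("current efficiency must be positive")
    return (candidate_eff - current_eff) / current_eff > threshold
```

A guard like this reduces oscillation between neighboring discrete configurations, but it does not by itself provide the formal stability guarantee the review asks for.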

3. Potential Impact

The practical impact potential is significant:

Energy savings at scale: A 26.3% efficiency improvement in LLM serving—now one of the largest data center workloads—translates to substantial absolute energy and cost savings. At hyperscaler scale, this could mean megawatts of power reduction.

Grid interactivity: The demand-response capability (Section 5.3) is forward-looking and aligns with growing regulatory pressure for data centers to participate in grid flexibility programs. Demonstrating that LLM inference can dynamically modulate power consumption while maintaining QoS is valuable for grid operators and data center planners.

Plug-and-play deployment: Integration into vLLM without model changes significantly lowers the adoption barrier. This engineering pragmatism increases the likelihood of real-world deployment compared to approaches requiring architectural modifications.

Complementarity with cluster-level systems: The paper clearly positions PALS as complementary to cluster-level solutions like DynamoLLM, operating at the node level with sub-second responsiveness. This layered architecture is practical and composable.

4. Timeliness & Relevance

The paper addresses a genuine and growing need. LLM inference energy consumption is a first-order concern for cloud providers, with major companies publicly committing to sustainability targets while simultaneously scaling AI deployments. The focus on MoE models is well-timed, as the industry is shifting toward sparse architectures (Mixtral, DeepSeek, etc.) for efficiency at scale. The observation that MoE models create unique power-management challenges due to communication overhead is timely and underexplored.

The connection to carbon-aware computing and demand response is forward-looking but not yet validated with real grid signals or carbon intensity data—this remains aspirational rather than demonstrated.

5. Strengths & Limitations

Key Strengths:

Clear identification of the compute-communication dichotomy in power-performance behavior, especially for MoE models

Practical system design that works within existing infrastructure (vLLM, NVML)

Comprehensive model coverage (8 models, both dense and MoE)

The Pareto frontier analysis (Section 2.2) is compelling and well-presented

Multi-node evaluation adds realism beyond single-GPU studies
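For readers unfamiliar with the Pareto frontier analysis mentioned above, a minimal sketch of extracting the frontier from profiled configurations follows. It assumes each configuration is summarized as a (throughput, efficiency) pair where higher is better on both axes; the representation is hypothetical and not taken from the paper.

```python
from typing import List, Tuple

# (throughput in tokens/s, efficiency in tokens/J); higher is better for both.
Point = Tuple[float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by descending throughput."""
    frontier: List[Point] = []
    best_eff = float("-inf")
    # Scan from highest throughput down; keep a point only if it improves efficiency.
    for tput, eff in sorted(points, key=lambda p: (-p[0], -p[1])):
        if eff > best_eff:
            frontier.append((tput, eff))
            best_eff = eff
    return frontier
```

Adding a second tunable knob can only add candidate points, so the resulting frontier is never worse than the one obtained from either knob alone.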

Notable Limitations:

The "up to 26.3%" headline result appears to be a best-case figure; average improvements are less clearly stated

No comparison against DynamoLLM or other recent systems beyond the four baselines defined by the authors

The parallelism knob (TP/EP) is static at deployment time, limiting the runtime's adaptability in the dimension the paper identifies as important

Missing analysis of controller overhead and its impact on tail latency

No evaluation with prefill-heavy workloads or mixed prefill/decode scenarios, which are critical in production

The paper does not address KV-cache memory pressure interactions with batch size changes

Minor Issues:

Inconsistent naming (PALS vs. PA-vLLM vs. Power-Aware vLLM) throughout the paper

Table 3 comparison is self-selected and somewhat superficial

The acknowledgment of generative AI assistance for editing, while transparent, raises minor questions about the depth of certain prose sections

Overall Assessment

PALS makes a solid systems contribution by demonstrating that joint hardware-software optimization for LLM inference power management is both feasible and beneficial. The MoE-specific insights regarding communication-bound behavior under power scaling are genuinely novel and practically important. However, the evaluation could be more rigorous (statistical reporting, broader hardware platforms, longer-duration experiments), and some claims slightly overstate generality given the experimental scope. The work is a meaningful step toward energy-proportional AI serving, though it falls short of being a comprehensive solution.

Rating: 6.5/10

Significance 7 · Rigor 5.5 · Novelty 6.5 · Clarity 6.5

Generated May 21, 2026

Comparison History (21)

vs. AMEL: Accumulated Message Effects on LLM Judgments

claude-opus-4.6 · 5/22/2026

AMEL identifies a fundamental and previously undercharacterized bias in LLM-as-judge paradigms, which is now a widespread practice across NLP research, content moderation, and code review. The finding that conversational history systematically shifts LLM judgments has broad implications for any pipeline using LLMs as evaluators, affecting reproducibility across many fields. The large-scale empirical rigor (75K+ API calls, 11 models, 4 providers) and actionable mitigation advice give it high practical relevance. PALS addresses an important but more niche systems optimization problem (energy-efficient LLM serving) with incremental engineering contributions over existing power-capping and scheduling work.

vs. LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

gpt-5.2 · 5/22/2026

Paper 2 (PALS) likely has higher scientific impact due to strong real-world applicability and timeliness: data-center energy use is a major, immediate constraint, and a runtime integrated into vLLM without retraining lowers adoption barriers. The methodology is concrete (models + feedback control) with clear, measurable gains across hardware and MoE/dense models, suggesting broad systems impact. Paper 1 (LCGuard) is novel and important for emerging latent multi-agent communication safety, but relies on an operational leakage definition (reconstruction) that may not capture all privacy risks and may see slower near-term deployment.

vs. Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

gpt-5.2 · 5/22/2026

Paper 2 likely has higher scientific impact: it identifies a counterintuitive inverse-scaling failure mode in LLM forecasting under superlinear growth and tail-regime risk, introduces a new benchmark, and shows replication across simulated and multiple real-world domains. The findings affect evaluation methodology (tail-inclusive scoring) and practical deployment in high-stakes settings (finance/epidemiology), with broad relevance to ML safety, calibration, and benchmarking. Paper 1 is a solid systems contribution with clear datacenter applicability, but its novelty and cross-field reach are narrower and more incremental compared to the conceptual and evaluative implications of Paper 2.

vs. Open-World Evaluations for Measuring Frontier AI Capabilities

gpt-5.2 · 5/21/2026

Paper 1 likely has higher scientific impact due to a concrete, novel systems contribution (treating GPU power caps as a control variable jointly optimized with batching) with demonstrated, reproducible gains (up to 26.3% energy efficiency, fewer QoS violations) and immediate applicability to widespread LLM serving stacks (integration into vLLM, no retraining). It is timely given datacenter energy constraints and could influence both systems research and production deployments. Paper 2 is important conceptually but is more of a survey/proposal with limited methodological rigor and harder-to-standardize, lower-scalability evaluation evidence.

vs. Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models

gpt-5.2 · 5/21/2026

Paper 1 likely has higher scientific impact due to broad, timely applicability and strong real-world relevance: power/energy-aware LLM serving is a data-center–scale constraint affecting most deployments, and integrating GPU power caps into runtime control (implemented in vLLM without retraining) is a pragmatic, system-level innovation with immediate operational benefits. The methodology (offline models + feedback controller) and evaluation across dense/MoE and multi-GPU settings suggest solid rigor and generality. Paper 2 is impactful for VLM reliability, but may be more benchmark/task- and prompt/framework-dependent and potentially less universally deployable.

vs. \ECUAS{n}: A family of metrics for principled evaluation of uncertainty-augmented systems

gemini-3.1 · 5/21/2026

Paper 2 addresses a critical, highly timely bottleneck in modern AI: the massive energy consumption and power constraints of serving Large Language Models (LLMs). By dynamically optimizing GPU power caps during MoE model inference without requiring model retraining, it offers immediate, scalable, and high-impact real-world applications for data centers and cloud providers. While Paper 1 introduces a rigorous foundational metric for AI uncertainty, Paper 2's direct solution to the pressing economic and environmental costs of LLM deployment gives it a broader and more immediate potential scientific and industrial impact.

vs. Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

claude-opus-4.6 · 5/21/2026

PALS addresses the critical and timely problem of energy efficiency in LLM serving, which is a dominant concern as AI workloads scale rapidly in data centers. It offers practical, deployable contributions (integrated into vLLM, no retraining needed) with significant energy savings (up to 26.3%) and QoS improvements. The breadth of impact spans systems, sustainability, and grid-interactive computing. Paper 2, while a solid engineering contribution, is a niche simulator for a specific game environment with narrower impact, primarily serving the RL-for-games community.

vs. OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

claude-opus-4.6 · 5/21/2026

PALS addresses a critical and growing real-world problem—energy efficiency of LLM inference in data centers—with broad practical applicability. It integrates into existing frameworks (vLLM) without retraining, making adoption straightforward. The results (26.3% energy efficiency gains, 4-7x QoS violation reduction) are immediately actionable across the industry. While OSCToM makes meaningful contributions to Theory of Mind benchmarks, it targets a narrower research community. PALS has broader cross-disciplinary impact spanning systems, sustainability, and AI infrastructure, with high timeliness given growing concerns about AI energy consumption.

vs. OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

claude-opus-4.6 · 5/21/2026

PALS addresses the increasingly critical problem of energy efficiency in LLM inference, which has broad real-world impact as AI deployment scales. It offers a practical, system-level solution integrated into an existing framework (vLLM) requiring no retraining, making adoption straightforward. The results (26.3% energy efficiency improvement, 4-7x QoS violation reduction) are substantial and immediately applicable across data centers. While OSCToM makes meaningful contributions to Theory of Mind benchmarks, its impact is narrower—focused on a specific cognitive reasoning capability. PALS has broader cross-disciplinary relevance spanning systems, sustainability, and AI infrastructure, with high timeliness given growing energy concerns around AI.

vs. Towards Resilient and Autonomous Networks: A BlueSky Vision on AI-Native 6G

gpt-5.2 · 5/21/2026

Paper 2 has higher likely scientific impact: it presents a concrete, implemented system (PALS) with measurable gains (energy efficiency, QoS) in a timely, high-demand area (LLM/MoE serving and datacenter energy). The methodology appears more rigorous (models + feedback control, integration into vLLM, multi-GPU experiments) and offers immediate real-world applicability for operators and cloud providers. Paper 1 is a forward-looking 6G vision that is novel conceptually but less methodologically grounded and may have longer, more uncertain translation to practice.

vs. GroupAffect-4: A Multimodal Dataset of Four-Person Collaborative Interaction

gpt-5.2 · 5/21/2026

Paper 2 likely has higher scientific impact: it introduces a novel, broadly applicable systems approach (power caps as a first-class control knob jointly optimized with batching) with clear methodological rigor (implemented in vLLM, evaluated across multi-GPU, dense and MoE, dynamic budgets) and immediate real-world relevance for data-center energy and QoS. Its benefits (up to 26.3% efficiency, fewer QoS violations) align with timely concerns about AI sustainability and could influence both ML systems research and industry practice. Paper 1 is valuable, but its dataset scale and domain specificity may limit breadth and uptake.

vs. Towards Resilient and Autonomous Networks: A BlueSky Vision on AI-Native 6G

gpt-5.2 · 5/21/2026

Paper 2 has higher estimated impact because it presents a concrete, implementable system (PALS) with demonstrated gains (up to 26.3% energy efficiency, 4–7× fewer QoS violations) and clear applicability to today’s widespread LLM serving in data centers. Its methodology (offline modeling + feedback control) is testable and reproducible, and integration into vLLM without retraining or API changes increases adoption potential. Paper 1 is a forward-looking 6G vision with high-level directions but limited empirical validation, making near-term scientific and practical impact less certain despite its timeliness.

vs. GroupAffect-4: A Multimodal Dataset of Four-Person Collaborative Interaction

gpt-5.2 · 5/21/2026

Paper 1 likely has higher impact due to timeliness and broad applicability: power/energy constraints are a central, immediate bottleneck for large-scale LLM deployment, and a vLLM-integrated runtime that improves energy efficiency and QoS under power caps can be adopted quickly across many models and data centers. The approach is methodologically solid (models + feedback control) and affects systems, ML serving, and sustainable computing. Paper 2 is valuable and well-packaged, but its small scale (10 groups) and narrower domain may limit near-term, cross-field impact compared to scalable infrastructure advances.

vs. AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

claude-opus-4.6 · 5/21/2026

AgentCo-op addresses a more fundamental and broadly impactful challenge: composing multi-agent workflows in open-ended settings without curated benchmarks. It introduces a novel retrieval-based synthesis framework with typed artifact handoffs and local repair, demonstrating versatility across scientific discovery (genomics) and standard benchmarks. Its contribution to agentic AI design paradigms has broader applicability across many fields. PALS, while practically valuable for energy-efficient LLM serving, addresses a more incremental systems optimization problem with narrower scope (GPU power management), making it impactful but less transformative.

vs. AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

claude-opus-4.6 · 5/21/2026

AgentCo-op addresses a broader and more fundamental challenge—composing multi-agent workflows for open-ended scientific tasks without curated benchmarks—which has wider applicability across scientific domains. Its retrieval-based synthesis framework with typed artifact handoffs is more novel conceptually, and the demonstrated applications in genomics show real-world scientific discovery potential. PALS is a solid engineering contribution for energy-efficient LLM serving, but its scope is narrower (GPU power management) and represents incremental optimization rather than a new paradigm. AgentCo-op's cross-domain applicability and implications for AI-driven science give it higher impact potential.

vs. Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

gpt-5.2 · 5/21/2026

Paper 2 is likely to have higher scientific impact due to broader cross-domain relevance and conceptual novelty: it analyzes a fundamental failure mode (off-manifold drift) in compositional guidance for diffusion/flow models and proposes a general, lightweight conflict-aware mechanism applicable across vision, synthetic benchmarks, and planning/control. This targets an active, timely area (inference-time control without fine-tuning) and can influence multiple downstream applications and research directions. Paper 1 is impactful for systems/efficiency, but is more incremental and primarily scoped to LLM serving/runtime engineering in datacenters.

vs. Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

claude-opus-4.6 · 5/21/2026

Paper 1 addresses a fundamental challenge in compositional guided generation for diffusion/flow models—a rapidly growing area with broad applications across image generation, planning, and control. The theoretical analysis of gradient misalignment and the proposed conflict-aware correction mechanism is novel and generalizable. Paper 2, while practically valuable, addresses a more narrow systems-level optimization (power-aware LLM serving) with incremental contributions over existing serving frameworks. Paper 1's broader applicability across domains, stronger methodological novelty, and relevance to the fast-moving generative AI field give it higher impact potential.

vs. DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

gpt-5.2 · 5/21/2026

Paper 2 is likely to have higher scientific impact because it introduces a broadly reusable benchmark and auditing framework for deep web research agents, a timely and fast-growing area. Benchmarks often become community standards, shaping model development, evaluation methodology, and downstream products across NLP, IR, HCI, and AI safety (calibration, provenance). It also releases data/code and provides diagnostic findings that can redirect research. Paper 1 is methodologically solid with clear real-world energy benefits, but its impact is narrower (LLM serving/power management) and more incremental relative to existing systems optimization work.

vs. Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

claude-opus-4.6 · 5/21/2026

Paper 2 addresses a fundamental theoretical question about the equivalence of DPO and RLHF—two of the most widely used alignment methods for LLMs. Identifying implicit assumptions and failure modes in DPO has broad implications for the entire LLM alignment community, which is massive and rapidly growing. The paper provides rigorous theoretical analysis, practical solutions (CPO), and experimental validation. Paper 1, while practically useful for energy efficiency in LLM serving, addresses a more incremental systems optimization problem with narrower scope. Paper 2's theoretical insights are likely to influence future alignment research and methodology choices across the field.

vs. Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

gemini-3.1 · 5/21/2026

Paper 2 addresses fundamental theoretical limitations of DPO, a core alignment technique for modern LLMs. By proving its conditional equivalence to RLHF and introducing a provably aligned alternative, it offers profound theoretical insights with immediate, widespread applicability in AI alignment. While Paper 1 provides valuable systems-level optimizations for energy efficiency, Paper 2's theoretical contributions will likely influence a broader range of foundational research and model training paradigms across the machine learning community.