LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load

Pranay Tummalapalli, Sahil Arayakandy, Ritam Pal, Kautuk Kundan

Mar 24, 2026 · arXiv:2603.23640v1

cs.DC · cs.LG

#748 of 1075 · Distributed Computing

Tournament Score: 1340 ± 29 (axis range 1050 to 1750)

Win Rate: 38%

Rating: 4.5/10

Significance 4.5 · Rigor 4 · Novelty 4 · Clarity 7

Abstract

Deploying large language models on-device for always-on personal agents demands sustained inference from hardware tightly constrained in power, thermal envelope, and memory. We benchmark Qwen 2.5 1.5B (4-bit quantised) across four platforms: a Raspberry Pi 5 with Hailo-10H NPU, a Samsung Galaxy S24 Ultra, an iPhone 16 Pro, and a laptop NVIDIA RTX 4050 GPU. Using a fixed 258-token prompt over 20 warm-condition iterations per device, we measure throughput, latency, power, and thermal behaviour. For mobile platforms, thermal management supersedes peak compute as the primary constraint: the iPhone 16 Pro loses nearly half its throughput within two iterations, and the S24 Ultra suffers a hard OS-enforced GPU frequency floor that terminates inference entirely. On dedicated hardware, distinct constraints dominate: the RTX 4050 is bounded by its battery power ceiling, while the Hailo-10H is limited by on-module memory bandwidth. The RTX 4050 sustains 131.7 tok/s at 34.1 W; the Hailo-10H sustains 6.9 tok/s at under 2 W with near-zero variance, matching the RTX 4050 in energy proportionality at 19x lower throughput. Results should be interpreted as platform-level deployment characterisations for a single model and prompt type, reflecting hardware and software combined, rather than general claims about hardware capability alone.

AI Impact Assessments

(3 models)

Scientific Impact Assessment

1. Core Contribution

This paper provides a cross-platform empirical benchmark of sustained LLM inference on four edge devices: a Raspberry Pi 5 with Hailo-10H NPU, Samsung Galaxy S24 Ultra, iPhone 16 Pro, and an NVIDIA RTX 4050 laptop GPU. The central thesis is that thermal management, not peak compute, is the binding constraint for mobile LLM inference—a claim supported by showing the iPhone 16 Pro loses ~44% throughput within two iterations and the S24 Ultra hits an OS-enforced GPU frequency floor that terminates inference after six iterations. The paper also provides what appears to be the first independent benchmark of LLM inference on Hailo NPU hardware, showing near-zero variance throughput at competitive energy-per-token.

The contribution is primarily empirical and characterization-oriented. There is no new algorithm, architecture, or theoretical framework. The novelty lies in the systematic documentation of thermal degradation curves under sustained load across heterogeneous platforms—something prior work (MELTing Point, Xiao et al.) touched on but did not fully characterize with 20-iteration sustained workloads.

2. Methodological Rigor

The experimental protocol is reasonably well-defined: thermal equilibration, warm-up inference, 20 iterations with 1-second inter-iteration gaps, and CSV-based metric logging. The flowchart in Figure 1 is helpful for reproducibility. However, several methodological issues significantly weaken the study:

Heterogeneous inference stacks: Each platform uses a different framework (vLLM, MLC-LLM, MLX, hailo-ollama) with different quantization formats (Q4_0, q4f16_2, GPTQ Int4). The authors acknowledge this, but it fundamentally undermines cross-platform hardware comparisons. The S24 Ultra's anomalous 25-second prefill time is likely framework-driven, yet it is presented in the cross-platform comparison table without separation from the hardware results.

Inconsistent power measurement: The RTX 4050 uses nvidia-smi (GPU-level), the Hailo uses INA219 on system supply rails (whole-system), Android's Battery API was deemed unreliable, and iOS exposes nothing. The headline claim of "near-identical energy proportionality" between the Hailo and RTX 4050 (270.5 vs. 297.3 mJ/token) compares whole-system power against GPU-level power, which is not a valid comparison. The authors' own caveat acknowledges this, yet the comparison is still prominently featured.

Limited sample size: Only one device per platform, 20 iterations, single model, single prompt. The S24 Ultra yields only 5 usable data points. While the authors are transparent about these limitations, drawing deployment recommendations from n=1 devices with n=5 iterations (Android) is tenuous.

Token count variation: Output lengths range from 564 (Hailo) to 1789 (RTX 4050) tokens across platforms, which affects thermal load duration and makes thermal trajectory comparisons uneven.
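For readers who want to picture the protocol described at the start of this section, the loop below is a minimal sketch of it: a discarded warm-up pass, then a fixed number of timed iterations separated by a fixed gap, with per-iteration metrics written out as CSV. It is not the authors' harness. The names `run_sustained_benchmark`, `rows_to_csv`, and the `run_inference` callable are hypothetical, and the thermal equilibration step is left out because it is device-specific.

```python
import csv
import io
import time
from typing import Callable, Dict, List


def run_sustained_benchmark(
    run_inference: Callable[[], Dict[str, float]],  # hypothetical: one full generation pass
    iterations: int = 20,
    gap_s: float = 1.0,
    warmup_runs: int = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, float]]:
    """Run warm-up passes, then `iterations` timed passes with `gap_s` seconds between them."""
    for _ in range(warmup_runs):
        run_inference()  # warm-up metrics are discarded

    rows: List[Dict[str, float]] = []
    for i in range(1, iterations + 1):
        start = time.perf_counter()
        metrics = run_inference()
        wall_s = time.perf_counter() - start
        rows.append({"iteration": i, "wall_s": wall_s, **metrics})
        if i < iterations:
            sleep(gap_s)  # inter-iteration gap; no gap after the final pass
    return rows


def rows_to_csv(rows: List[Dict[str, float]]) -> str:
    """Serialise the per-iteration rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Because every platform in the study runs a different inference stack, `run_inference` would wrap a different framework on each device, which is exactly where the framework confound noted above enters.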

3. Potential Impact

The practical relevance is clear: as LLM-powered agents move toward always-on, on-device deployment, understanding sustained thermal behavior is essential. The paper's findings that smartphones are poorly suited for continuous inference and that dedicated NPUs offer stable low-power alternatives are actionable for system designers.

However, the impact is somewhat limited by:

The rapid evolution of mobile hardware and software stacks (results may be outdated within one product cycle)

The single-model, single-prompt scope

The lack of mitigation strategies beyond identifying the problem (no duty-cycling experiments, no active cooling tests, no framework optimization)

The Hailo-10H characterization is perhaps the most valuable contribution, as it fills a genuine gap in the literature. The finding that a sub-2W NPU can achieve competitive energy-per-token with deterministic behavior is useful for embedded systems designers.
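Energy per token is average power divided by throughput, so the headline figures can be sanity-checked with simple arithmetic. The sketch below uses only the numbers quoted in the abstract (131.7 tok/s at 34.1 W for the RTX 4050; 6.9 tok/s at "under 2 W" for the Hailo-10H, so 2.0 W is treated as an upper bound). The function name is hypothetical, and these quotients are rough steady-state estimates that need not reproduce the paper's reported per-token values, which may be derived from per-iteration measurements.

```python
def energy_per_token_mj(power_w: float, tokens_per_s: float) -> float:
    """Energy per generated token in millijoules: watts / (tokens per second) * 1000."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return power_w / tokens_per_s * 1000.0


# Figures taken from the abstract; 2.0 W is an upper bound for the Hailo-10H.
rtx_mj = energy_per_token_mj(34.1, 131.7)        # roughly 259 mJ/token (GPU-level power)
hailo_upper_mj = energy_per_token_mj(2.0, 6.9)   # at most roughly 290 mJ/token (whole-system power)
throughput_ratio = 131.7 / 6.9                   # roughly 19x, matching the abstract's "19x"

print(f"RTX 4050:  {rtx_mj:.1f} mJ/token")
print(f"Hailo-10H: <= {hailo_upper_mj:.1f} mJ/token")
print(f"Throughput ratio: {throughput_ratio:.1f}x")
```

The two results land in the same range, but they rest on different measurement scopes (GPU-level versus whole-system), so the similarity should be read with the power measurement caveat from Section 2 in mind.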

4. Timeliness & Relevance

The paper addresses a timely problem. The proliferation of sub-2B parameter models (Qwen 2.5, Llama, Phi, Gemma) and the push toward on-device AI agents make edge inference characterization increasingly important. MLPerf v5.1 has added edge LLM scenarios but without thermal tracking, so this work fills a real gap. The always-on agent framing is topical given the current industry focus on AI assistants.

5. Strengths & Limitations

Strengths:

Addresses a genuinely underexplored aspect of edge LLM deployment (sustained thermal behavior)

The iPhone thermal three-phase trajectory (Normal → Warm → Hot) and the S24 Ultra's hard frequency floor are well-documented and practically important findings

The Hailo-10H NPU benchmark fills a gap in the literature

Transparent about limitations; the caveats are unusually thorough for a benchmarking paper

The deployment scenario mapping (Table 10) is practical and useful

Limitations:

The cross-platform comparison is confounded by framework differences to the point where hardware-to-hardware conclusions are unreliable

Power measurement inconsistency undermines the energy efficiency narrative

20 iterations is modest for "sustained load" characterization; the paper's own future work acknowledges 100+ iterations are needed

No statistical tests are performed; confidence intervals or significance tests on degradation claims would strengthen the analysis

The S24 Ultra data (5 usable iterations) is too sparse for reliable characterization

Single prompt type limits generalizability substantially

The paper does not explore any mitigation strategies
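As an illustration of the kind of uncertainty estimate the limitation on statistical tests asks for, the following is a minimal sketch of a percentile bootstrap confidence interval on the relative throughput drop between early and late iterations. It is one possible approach, not something the paper does. The function name is hypothetical and the throughput values are made-up placeholders, not measurements from the paper.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_drop_ci(
    early: List[float],
    late: List[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the relative drop 1 - mean(late) / mean(early)."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    drops = []
    for _ in range(n_boot):
        # Resample each phase with replacement and recompute the drop.
        e = mean(rng.choices(early, k=len(early)))
        l = mean(rng.choices(late, k=len(late)))
        drops.append(1.0 - l / e)
    drops.sort()
    lo = drops[int((alpha / 2) * n_boot)]
    hi = drops[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Placeholder tok/s values for illustration only (NOT from the paper).
early_tok_s = [40.1, 39.8, 40.3]
late_tok_s = [22.0, 22.5, 21.8, 22.3, 22.1]

lo, hi = bootstrap_drop_ci(early_tok_s, late_tok_s)
print(f"95% CI for relative throughput drop: [{lo:.3f}, {hi:.3f}]")
```

With only a handful of iterations per phase, as in the S24 Ultra case, such an interval captures run-to-run noise on one device at best. It says nothing about device-to-device variation, which would require more than one unit per platform.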

Additional observations:

The paper is well-written with clear tables and figures. The honest framing as "platform-level deployment characterisations" rather than hardware benchmarks is appropriate. However, the framing as a "benchmark" is generous given the single-model, single-prompt, single-device design. This reads more as a preliminary characterization study than a comprehensive benchmark.

The practical takeaway—that smartphones throttle severely under sustained LLM inference and dedicated NPUs offer stable alternatives—is useful but not deeply surprising given well-known mobile thermal constraints. The quantitative characterization of *how* throttling manifests (gradual DVFS vs. hard frequency floor) is the more novel finding.

Rating: 4.5/10

Significance 4.5 · Rigor 4 · Novelty 4 · Clarity 7

Generated Mar 26, 2026

Comparison History (47)

Lost vs. State Twins: An Off-Chain Substrate for Agentic Reasoning over Decentralized Finance Protocols

Paper 1 introduces a novel architectural abstraction (State Twins) that bridges DeFi protocol reasoning with agentic AI systems, formalizing AMM families as dynamical systems with proven fidelity bounds. It offers a new conceptual framework with broad implications for autonomous finance, LLM tool integration via MCP, and counterfactual reasoning—areas of high current interest. Paper 2 provides useful empirical benchmarks of edge LLM inference but is primarily a measurement study of existing hardware with a single model, offering narrower novelty and limited generalizability. Paper 1's theoretical contributions and open-source tooling give it greater potential for cross-disciplinary impact.

claude-opus-4-6·May 13, 2026

Won vs. FractalSortCPU: Bandwidth-Efficient Compressed Radix Sort on CPU

Paper 2 addresses the timely and rapidly growing field of on-device LLM inference, which has broad relevance across mobile computing, edge AI, and hardware design communities. Its empirical characterization of thermal throttling as the primary constraint for sustained LLM inference on mobile devices is a novel and practical insight with immediate real-world implications for product design and deployment. Paper 1, while technically solid with strong bandwidth efficiency improvements for radix sort, addresses a more niche area with incremental advances. Paper 2's broader audience and timeliness give it higher estimated impact.

claude-opus-4-6·May 13, 2026

Won vs. MERBIT: A GPU-Based SpMV Method for Iterative Workloads

Paper 1 addresses a highly timely and critical bottleneck: deploying LLMs on edge devices under power and thermal constraints. While Paper 2 offers a solid algorithmic improvement for SpMV, Paper 1's focus on modern on-device AI addresses a broader, more immediate industry and academic challenge, likely garnering significant short- to medium-term citations across systems, ML, and hardware communities.

gemini-3.1-pro-preview·May 11, 2026

Won vs. AnTi-MiCS: Analytical Framework for Bounding Time in Embedded Mixed-Criticality Systems

Paper 2 addresses a highly timely and broadly relevant problem: deploying Large Language Models on edge devices. Given the current explosion of interest in Edge AI, identifying thermal, power, and memory bottlenecks across diverse hardware platforms will significantly impact both ML optimization and hardware design. While Paper 1 presents a rigorous algorithmic contribution for embedded systems, Paper 2's focus on on-device LLM inference guarantees a wider readership, broader real-world applicability, and higher likelihood of cross-disciplinary citations.

gemini-3-pro-preview·May 1, 2026

Won vs. AnTi-MiCS: Analytical Framework for Bounding Time in Embedded Mixed-Criticality Systems

Paper 1 addresses a highly timely and critical challenge in modern AI: deploying LLMs on edge devices under sustained thermal and power constraints. While Paper 2 offers rigorous theoretical advancements in embedded system scheduling, Paper 1's empirical findings on thermal throttling versus memory bounds in flagship processors will likely have a broader, immediate impact across both academia and industry hardware design given the current massive push toward on-device AI.

gemini-3-pro-preview·May 1, 2026

Won vs. End-to-End and Phase-Level Performance Optimization for Hyperledger Fabric

Paper 2 is more timely and broadly relevant: sustained on-device LLM inference is a cross-cutting problem spanning ML systems, mobile/edge computing, hardware, and product deployment. Its measurements under thermal/power constraints address an immediate real-world bottleneck for always-on agents, with clear applicability to practitioners and researchers. While Paper 1 is rigorous and valuable for enterprise blockchain optimization, its impact is narrower to Hyperledger Fabric deployments and incremental protocol/config tuning within a specific stack. Paper 2’s findings generalize as deployment characterization methodology for edge LLMs.

gpt-5.2·May 1, 2026

Won vs. End-to-End and Phase-Level Performance Optimization for Hyperledger Fabric

Paper 2 likely has higher impact due to timeliness and breadth: sustained on-device LLM inference is a rapidly growing, cross-disciplinary area (ML systems, mobile/edge computing, hardware, energy/thermal management) with clear real-world deployment relevance. Its comparative, sustained-load characterization across major consumer platforms addresses an immediate pain point (thermal throttling, power ceilings) and can inform design and product decisions broadly. Paper 1 is rigorous and useful but is more domain-specific to Hyperledger Fabric and offers incremental, platform-tied optimizations with narrower spillover beyond permissioned blockchain deployments.

gpt-5.2·May 1, 2026

Won vs. CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments

Paper 2 addresses the highly timely and broadly impactful topic of LLM deployment on edge devices, which is relevant across mobile computing, AI systems, and hardware design communities. Its cross-platform benchmarking (NPU, mobile, GPU) with practical insights about thermal throttling and sustained workloads fills a significant gap as on-device LLM inference becomes increasingly important. Paper 1, while methodologically sound, addresses a narrower optimization problem (CUDA kernels for a specific convolution operator) with incremental contributions. Paper 2's findings on thermal constraints and energy proportionality have broader real-world implications for product design and deployment decisions.

claude-opus-4-6·Apr 30, 2026

Won vs. CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments

Paper 2 addresses the highly timely and rapidly growing field of on-device LLM deployment, which has broad real-world relevance for mobile AI agents. Its cross-platform benchmarking (NPU, mobile, GPU) reveals novel practical insights about thermal throttling as the primary constraint—a finding with immediate implications for system design. Paper 1, while methodologically rigorous, is narrower in scope (optimizing a specific CUDA kernel for S4ConvD) and its counter-free profiling methodology, though useful for cloud environments, addresses a more niche audience. Paper 2's breadth of impact across edge AI, mobile computing, and hardware design communities gives it higher potential impact.

claude-opus-4-6·Apr 30, 2026

Lost vs. Adaptive Self-Organization in Anonymous Dynamic Networks

Paper 2 offers higher long-term scientific impact due to its fundamental theoretical contributions to distributed computing. While Paper 1 provides a highly timely empirical benchmark for edge LLMs, its impact is tied to transient hardware generations and lacks theoretical novelty. In contrast, Paper 2 introduces a novel algorithmic framework for adaptive self-organization in dynamic networks, providing rigorous mathematical proofs and complexity bounds. This foundational approach yields broad, long-lasting implications across fields like swarm robotics, sensor networks, and distributed systems, extending well beyond the short-lived relevance of specific hardware benchmarks.

gemini-3-pro-preview·Apr 30, 2026
