Motasem Alfarra, Cristina Pinneri, Dana Kianfar, Mohammed Almousa, Christos Louizos
Deploying safe large language models (LLMs) on resource-constrained edge devices presents a critical challenge: while dual-model systems combining LLMs with guard models provide effective safety guarantees, their substantial memory and computational demands make them prohibitively expensive for on-device deployment. This paper presents a comprehensive study of parameter-efficient safety alignment methods for resource-constrained settings. Through systematic evaluation across multiple LLM architectures, training objectives, and parameter-efficient fine-tuning approaches, we identify that soft prompts combined with distillation-based training consistently outperform alternative methods. We introduce distillation frameworks based on total variation and KL divergence that effectively transfer safety behaviors from guard models into learned soft prompts. Our evaluations on various benchmarks demonstrate that this combination achieves superior safety-usefulness trade-offs compared to LoRA adapters, steering vectors, and direct optimization methods, while requiring minimal additional memory and compute at inference time. These findings establish soft prompt distillation as the preferred approach for safety alignment in on-device LLM deployment.
The paper addresses a practical deployment challenge: dual-model safety systems (LLM + guard model) are too expensive for edge devices due to ~2× memory and compute overhead. The proposed solution distills the safety behavior of the guard model into a small set of learned soft prompts (100 continuous embedding vectors) prepended to user inputs. Two distillation frameworks are introduced: TV-DiSP (total variation-based) and KL-DiSP (KL divergence-based), which train soft prompts to approximate the output distribution of the full safe LLM system.
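To make the soft-prompt mechanism concrete, here is a minimal sketch of prepending learned embedding vectors to the embedded user input. It is illustrative only and not the authors' implementation: the function name `prepend_soft_prompt`, the hidden size of 64, and the use of NumPy in place of a real LLM embedding layer are all assumptions. Only the count of 100 prompt vectors comes from the summary above.

```python
import numpy as np

def prepend_soft_prompt(soft_prompt, input_embeddings):
    """Prepend a shared soft prompt to every sequence in a batch.

    soft_prompt:       (num_prompt_tokens, hidden) learned continuous vectors
    input_embeddings:  (batch, seq_len, hidden) embedded user inputs
    returns:           (batch, num_prompt_tokens + seq_len, hidden)
    """
    batch = input_embeddings.shape[0]
    # The same prompt vectors are reused for every example in the batch.
    tiled = np.broadcast_to(soft_prompt, (batch,) + soft_prompt.shape)
    return np.concatenate([tiled, input_embeddings], axis=1)

# Hypothetical sizes: 100 prompt vectors (as in the summary), hidden size 64.
rng = np.random.default_rng(0)
hidden = 64
soft_prompt = rng.normal(scale=0.02, size=(100, hidden))  # the only trainable tensor
user_inputs = rng.normal(size=(2, 12, hidden))            # stand-in for embedded tokens

model_inputs = prepend_soft_prompt(soft_prompt, user_inputs)
print(model_inputs.shape)
```

In this setup the backbone weights stay frozen and only `soft_prompt` would receive gradients, which is why the added memory is limited to the prompt vectors themselves.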
The core formalization is clean: the safe system's output distribution p(r|x) is defined as a mixture over safe (pass-through) and unsafe (refusal) responses weighted by the guard model's probability, and the distillation minimizes distributional divergence between this target and the soft-prompted model q(r|x,W). The theoretical contribution (Theorem 3.1) is relatively straightforward—it's a direct application of the variational representation of total variation distance—but it provides a useful framing for why TV distillation is appropriate for safety-critical applications.
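The mixture target and the two divergences can be illustrated on a toy discrete response space. This is a sketch under stated assumptions, not the paper's exact formulation: it assumes the guard assigns a single probability `p_unsafe` to the input, that the refusal is a single response with index `refusal_index`, and that responses form a small finite set. The helper names are hypothetical.

```python
import numpy as np

def safe_system_target(p_llm, p_unsafe, refusal_index):
    """Assumed mixture: pass the LLM output through with probability
    (1 - p_unsafe), otherwise emit the refusal response."""
    p_llm = np.asarray(p_llm, dtype=float)
    refusal = np.zeros_like(p_llm)
    refusal[refusal_index] = 1.0
    return (1.0 - p_unsafe) * p_llm + p_unsafe * refusal

def total_variation(p, q):
    """TV(p, q) = 0.5 * sum_r |p(r) - q(r)| for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * float(np.abs(p - q).sum())

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_r p(r) * log(p(r) / q(r)), skipping terms with p(r) = 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

# Toy example with three candidate responses; index 2 is the refusal.
p_llm = np.array([0.7, 0.2, 0.1])        # base LLM without any guard
target = safe_system_target(p_llm, p_unsafe=0.8, refusal_index=2)
q_soft = np.array([0.2, 0.1, 0.7])       # hypothetical soft-prompted model output

tv = total_variation(target, q_soft)
kl = kl_divergence(target, q_soft)
print(target, tv, kl)
```

Training would adjust the soft prompt so that `q_soft` moves toward `target`, driving either divergence toward zero. In practice the response space is a space of token sequences, so the sums above would have to be estimated rather than enumerated.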
Practical impact: The work directly addresses a real industry need: deploying safe LLMs on mobile and edge devices. The reported overhead relative to the base model (<1% memory, <10% compute) is compelling. Qualcomm's involvement signals potential for productization. The method is simple to implement and deploy.
Academic impact: The systematic comparison is useful as a reference point, but the individual techniques (soft prompts, distillation, TV distance) are all well-established. The combination is sensible but not deeply novel. The finding that soft prompts outperform LoRA for safety distillation is interesting but the explanation remains somewhat superficial—the paper argues soft prompts are preferable because they "control behavior via input conditioning without altering the quantized backbone," but this is more intuition than rigorous analysis.
Limitations in impact scope: The safety improvements, while consistent, are moderate in absolute terms (e.g., ~20% SGS improvement on HarmBench for some models, with the gap to the full safe system not always closed). The method fundamentally cannot exceed the safety of the teacher guard model. The DAN attack experiment (SGS rising from 37% to 77%) shows improvement, but it also means that 23% of adversarial prompts still succeed, which may be insufficient for safety-critical applications.
The paper is highly timely. On-device LLM deployment is growing rapidly, and safety alignment for edge settings is an underexplored area. The tension between safety and efficiency is a genuine bottleneck. That the work comes from an industry lab (Qualcomm AI Research) and includes actual hardware measurements adds practical relevance.
However, the landscape is moving quickly. Guard models are becoming smaller and more efficient, and techniques like speculative decoding could reduce the overhead of dual-model systems. The paper doesn't discuss these alternatives.
This is a solid applied research contribution that identifies a practical and effective approach (soft prompt distillation) for on-device LLM safety. The systematic comparison is its primary value. However, the novelty is incremental—combining existing techniques (soft prompts, TV/KL distillation, guard models) in a straightforward way. The theoretical contributions are standard, and the safety improvements, while consistent, are moderate. The paper would benefit from deeper analysis of why soft prompts outperform alternatives and from human evaluation of safety outcomes.
Generated Jun 9, 2026
Paper 2 proposes a fundamental, mathematically rigorous framework (FlowBP) that solves critical memory and gradient pathologies in reward backpropagation for flow matching models. Its methodological innovation and potential to influence how state-of-the-art generative models are aligned give it a higher foundational scientific impact compared to the more applied, though practical, on-device LLM safety distillation approach in Paper 1.
MODIP addresses a fundamental challenge in combining diffusion policies with reinforcement learning, a rapidly growing area in robot learning. Its novel framework bridging world models, MPC, and diffusion policies has broader methodological impact across robotics and RL communities. The approach elegantly sidesteps the difficulty of backpropagating through multi-step denoising by using MPC-generated trajectories as BC targets, offering a principled and general solution. Paper 1, while practically useful for on-device safety, is more incremental—combining existing techniques (soft prompts, distillation) for a specific deployment scenario with narrower impact scope.
N-GRPO introduces a fundamental algorithmic innovation in policy optimization, directly addressing the critical challenge of diverse trajectory generation in LLM reasoning. Given the immense current interest in GRPO and reasoning models, this method offers high novelty and broad potential impact across alignment research. Conversely, while Paper 2 addresses an important practical issue (on-device safety), it is primarily an empirical study combining existing techniques (soft prompts and distillation), making its fundamental scientific contribution comparatively narrower.
Paper 1 introduces a fundamentally new theoretical concept (epistemic calibration) that addresses a significant gap in uncertainty quantification—evaluating whether epistemic uncertainty estimates themselves are trustworthy. It provides formal definitions, an impossibility theorem, a consistent estimator (EECE), and broad experimental validation. This has wide-reaching implications across all fields using uncertainty-aware ML models. Paper 2 makes a solid engineering contribution to on-device LLM safety but is more incremental, combining existing techniques (soft prompts, distillation, parameter-efficient methods) for a specific deployment scenario, with narrower theoretical novelty and impact breadth.
Paper 2 has higher potential impact due to broader applicability and clearer methodological novelty: a tiny (7M) foundation model enabling covariate-informed zero-shot forecasting with real-time CPU inference, plus CovSynth to address pretraining data scarcity and a new Shifted Attention mechanism. This targets many high-impact domains (energy, finance, health, industry) where covariates are crucial. Paper 1 is timely for on-device safe LLMs, but largely a comparative study identifying soft-prompt distillation as best-in-class rather than introducing a broadly general new paradigm, and its impact is narrower to LLM safety alignment.
Paper 1 addresses a highly relevant and immediate bottleneck in AI deployment: running safe LLMs on resource-constrained edge devices. Its practical approach combining soft prompts and distillation offers clear, high-impact real-world applications in mobile and IoT computing. While Paper 2 presents rigorous advancements in mechanistic interpretability, Paper 1's methodology directly enables safer, broader accessibility of LLMs, giving it a higher potential for widespread technological and societal impact.
While Paper 1 provides rigorous and fundamental theoretical contributions to optimization, Paper 2 addresses a highly urgent and practical challenge: deploying safe LLMs on resource-constrained edge devices. Given the current explosive growth of LLMs, parameter-efficient safety alignment has immediate, widespread real-world applications. Its timeliness, relevance to a booming industry, and practical utility in overcoming critical deployment bottlenecks give it a higher potential for broad and rapid scientific and technological impact.
Paper 1 makes fundamental theoretical contributions to understanding deep Gaussian processes, establishing sharp phase transitions and proving the existence of non-trivial, non-Gaussian limiting distributions. This advances core mathematical understanding of deep probabilistic models with broad implications across Bayesian deep learning and probability theory. Paper 2 addresses a practical engineering problem (safe on-device LLM deployment) with incremental contributions—combining existing techniques (soft prompts, distillation, parameter-efficient methods). While useful, it is more applied and narrower in scope, with findings likely to be superseded as LLM architectures evolve.
Paper 1 addresses the highly timely challenge of deploying safe LLMs on resource-constrained edge devices. By eliminating the need for a secondary guard model through soft prompt distillation, it offers immense real-world applicability for mobile and IoT ecosystems. While Paper 2 presents a strong calibration method, Paper 1 directly unlocks broader on-device LLM adoption, giving it higher potential for widespread technological and industrial impact.
Paper 2 has higher potential impact due to timeliness and broad applicability: enabling safe LLM deployment on-device addresses a major current bottleneck across consumer electronics, privacy-preserving assistants, and edge AI. Its methodological contribution (systematic comparison plus new distillation objectives for transferring guard behavior into soft prompts) is directly actionable and likely to influence both research and product deployment. Paper 1 is valuable as a large benchmark for building/energy ML, but its impact is more domain-specific and relies heavily on simulated data, which may limit adoption relative to the widespread demand for efficient LLM safety methods.