Distilling Safe LLM Systems via Soft Prompts for On Device Settings

Motasem Alfarra, Cristina Pinneri, Dana Kianfar, Mohammed Almousa, Christos Louizos

Jun 8, 2026 · arXiv:2606.09388v1

cs.LG

#3840 of 5669 · cs.LG

Tournament Score: 1356 ± 44 (scale 1050 to 1750)

Win Rate: 35%

Rating: 5.8 / 10

Significance 6 · Rigor 6.5 · Novelty 4.5 · Clarity 7.5

Abstract

Deploying safe large language models (LLMs) on resource-constrained edge devices presents a critical challenge: while dual-model systems combining LLMs with guard models provide effective safety guarantees, their substantial memory and computational demands make them prohibitively expensive for on-device deployment. This paper presents a comprehensive study of parameter-efficient safety alignment methods for resource-constrained settings. Through systematic evaluation across multiple LLM architectures, training objectives, and parameter-efficient fine-tuning approaches, we identify that soft prompts combined with distillation-based training consistently outperform alternative methods. We introduce distillation frameworks based on total variation and KL divergence that effectively transfer safety behaviors from guard models into learned soft prompts. Our evaluations on various benchmarks demonstrate that this combination achieves superior safety-usefulness trade-offs compared to LoRA adapters, steering vectors, and direct optimization methods, while requiring minimal additional memory and compute at inference time. These findings establish soft prompt distillation as the preferred approach for safety alignment in on-device LLM deployment.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

The paper addresses a practical deployment challenge: dual-model safety systems (LLM + guard model) are too expensive for edge devices due to ~2× memory and compute overhead. The proposed solution distills the safety behavior of the guard model into a small set of learned soft prompts (100 continuous embedding vectors) prepended to user inputs. Two distillation frameworks are introduced: TV-DiSP (total variation-based) and KL-DiSP (KL divergence-based), which train soft prompts to approximate the output distribution of the full safe LLM system.
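To make the mechanism concrete, here is a minimal sketch of how a soft prompt is prepended to the embedded user input and how small its parameter footprint is. Only the count of 100 vectors comes from the review; the hidden size of 2048, the 1B parameter count, the 12-token input, and all function names are illustrative assumptions, not details taken from the paper.

```python
import random

def init_soft_prompt(num_vectors, hidden_dim, seed=0):
    """Create a trainable soft prompt as a list of small random vectors."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(hidden_dim)]
            for _ in range(num_vectors)]

def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Return the model input: soft prompt vectors followed by the user's token embeddings."""
    return soft_prompt + token_embeddings

def overhead_fraction(num_vectors, hidden_dim, base_params):
    """Extra parameters added by the soft prompt, relative to the frozen backbone."""
    return num_vectors * hidden_dim / base_params

NUM_VECTORS = 100            # from the review: 100 continuous embedding vectors
HIDDEN_DIM = 2048            # assumption: hidden size of a ~1B-parameter model
BASE_PARAMS = 1_000_000_000  # assumption: ~1B backbone parameters

soft_prompt = init_soft_prompt(NUM_VECTORS, HIDDEN_DIM)
# Hypothetical embedded user input of 12 tokens (zeros stand in for real embeddings).
user_embeddings = [[0.0] * HIDDEN_DIM for _ in range(12)]

model_input = prepend_soft_prompt(soft_prompt, user_embeddings)
fraction = overhead_fraction(NUM_VECTORS, HIDDEN_DIM, BASE_PARAMS)

print(len(model_input))                    # sequence length seen by the backbone
print(f"{fraction:.6%} extra parameters")  # tiny relative to the backbone
```

Under these assumptions the backbone weights stay untouched and only the 100 prepended vectors are trained. The compute cost at inference comes from processing 100 extra positions per request.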

The core formalization is clean: the safe system's output distribution p(r|x) is defined as a mixture over safe (pass-through) and unsafe (refusal) responses weighted by the guard model's probability, and the distillation minimizes distributional divergence between this target and the soft-prompted model q(r|x,W). The theoretical contribution (Theorem 3.1) is relatively straightforward—it's a direct application of the variational representation of total variation distance—but it provides a useful framing for why TV distillation is appropriate for safety-critical applications.
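The formalization can be illustrated on a toy problem. The sketch below treats responses as three discrete outcomes rather than token sequences, builds the safe-system target as a mixture of the LLM's distribution and a refusal distribution, and measures total variation and KL divergence against a hypothetical soft-prompted student. The mixture form, the forward direction KL(p || q), and every number are assumptions chosen for illustration; the paper's exact definitions are not reproduced here.

```python
import math

def safe_system_target(p_llm, refusal, p_unsafe):
    """Mixture target: pass the LLM response through with probability
    (1 - p_unsafe), otherwise emit a refusal."""
    responses = set(p_llm) | set(refusal)
    return {r: (1.0 - p_unsafe) * p_llm.get(r, 0.0) + p_unsafe * refusal.get(r, 0.0)
            for r in responses}

def total_variation(p, q):
    """TV(p, q) = 0.5 * sum_r |p(r) - q(r)|."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(r, 0.0) - q.get(r, 0.0)) for r in support)

def kl_divergence(p, q, eps=1e-12):
    """Forward KL(p || q) = sum_r p(r) * log(p(r) / q(r)), with q clipped at eps."""
    return sum(pr * math.log(pr / max(q.get(r, 0.0), eps))
               for r, pr in p.items() if pr > 0.0)

# Hypothetical base LLM behaviour on one harmful prompt x.
p_llm = {"comply": 0.7, "partial": 0.2, "refuse": 0.1}
refusal = {"refuse": 1.0}
p_unsafe = 0.9  # hypothetical guard-model probability that x is unsafe

target = safe_system_target(p_llm, refusal, p_unsafe)

# Hypothetical output distribution of the soft-prompted student q(r | x, W).
student = {"comply": 0.10, "partial": 0.05, "refuse": 0.85}

tv = total_variation(target, student)
kl = kl_divergence(target, student)

# TV bounds the gap in probability of any event, e.g. "the model does not refuse".
non_refusal = {"comply", "partial"}
event_gap = abs(sum(target[r] for r in non_refusal) - sum(student[r] for r in non_refusal))

print(f"TV = {tv:.4f}, KL = {kl:.4f}, event gap = {event_gap:.4f}")
```

The final check reflects the standard property that makes TV attractive for safety: for any set of responses, the student's probability of landing in that set differs from the target's by at most the TV distance.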

2. Methodological Rigor

Strengths in experimental design:

Systematic comparison across four LLM architectures (Llama3-1B, Llama3-3B, Qwen2-1.5B, Gemma2-2B), all quantized to 4-bit to simulate real edge conditions.

Multiple PEFT methods compared (soft prompts, LoRA, steering vectors) under the same distillation objective, with matched parameter counts.

Multiple training objectives compared (TV distillation, KL distillation, perplexity, REINFORCE, PPO).

Out-of-distribution evaluation: training on Beavertails/Toxigen, testing on HarmBench and Detect-Jailbreak.

On-device measurements on Qualcomm Snapdragon hardware, adding practical credibility.

Ablations on prompt size, seed consistency, and guard model generalization.

Weaknesses in rigor:

The safety evaluation relies heavily on LlamaGuard-8B as an automated judge, which is itself an LLM-based classifier with known limitations. No human evaluation is reported.

The use of a single training epoch, while justified by convergence analysis, limits exploration of whether more sophisticated training schedules could help alternatives like LoRA.

The over-refusal analysis (Table 4) reveals a concerning asymmetry: KL-DiSP has an 84.6% over-refusal rate, yet KL-DiSP and TV-DiSP are presented as roughly comparable throughout the paper. This undermines the paper's earlier claims about KL distillation's effectiveness.

Training on only 10K/5K prompts for a single epoch is computationally minimal, which is good for the efficiency narrative but raises questions about whether the method would scale differently with more data.

The theoretical guarantees (Theorem 3.1) are standard and provide no new insight beyond motivating the choice of TV distance; the bound is not empirically validated (i.e., no measurement of how tight it is).
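For readers unfamiliar with the over-refusal metric raised above, a minimal sketch follows. It assumes the common definition, the fraction of benign prompts that the model refuses; the paper's exact definition in Table 4 may differ, and the labels below are made-up illustrative data, not results from the paper.

```python
def over_refusal_rate(is_benign, was_refused):
    """Fraction of benign prompts that were refused.

    is_benign[i]   -- True if prompt i is harmless (ground-truth label)
    was_refused[i] -- True if the model refused prompt i (e.g. per a judge model)
    """
    refusals_on_benign = [refused for benign, refused in zip(is_benign, was_refused) if benign]
    if not refusals_on_benign:
        raise ValueError("no benign prompts in the evaluation set")
    return sum(refusals_on_benign) / len(refusals_on_benign)

# Hypothetical evaluation: 8 benign prompts (6 refused) and 2 harmful prompts.
is_benign   = [True, True, True, True, True, True, True, True, False, False]
was_refused = [True, True, True, True, True, True, False, False, True, True]

rate = over_refusal_rate(is_benign, was_refused)
print(f"over-refusal rate: {rate:.1%}")
```

Refusals on harmful prompts are deliberately excluded, so a model can score perfectly on safety while still having a high over-refusal rate.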

3. Potential Impact

Practical impact: The work directly addresses a real industry need—deploying safe LLMs on mobile/edge devices. The <1% memory and <10% compute overhead compared to the base model is compelling. Qualcomm's involvement signals potential for productization. The method is simple to implement and deploy.

Academic impact: The systematic comparison is useful as a reference point, but the individual techniques (soft prompts, distillation, TV distance) are all well-established. The combination is sensible but not deeply novel. The finding that soft prompts outperform LoRA for safety distillation is interesting, but the explanation remains somewhat superficial: the paper argues soft prompts are preferable because they "control behavior via input conditioning without altering the quantized backbone," which is more intuition than rigorous analysis.

Limitations in impact scope: The safety improvements, while consistent, are moderate in absolute terms (e.g., ~20% SGS improvement on HarmBench for some models, but the gap to the full safe system is not always closed). The method fundamentally cannot exceed the safety of the teacher guard model. The DAN attack experiment (37% → 77% SGS) shows improvement but also that 23% of adversarial prompts still succeed, which may be insufficient for safety-critical applications.

4. Timeliness & Relevance

The paper is highly timely. On-device LLM deployment is rapidly growing, and safety alignment for edge settings is an underexplored area. The tension between safety and efficiency is a genuine bottleneck. The work from an industry lab (Qualcomm AI Research) with actual hardware measurements adds practical relevance.

However, the landscape is moving quickly. Guard models are becoming smaller and more efficient, and techniques like speculative decoding could reduce the overhead of dual-model systems. The paper doesn't discuss these alternatives.

5. Strengths & Limitations

Key strengths:

Clear problem formulation with practical motivation

Comprehensive experimental comparison across multiple axes

On-device validation on real hardware

Simple, deployable solution with minimal overhead

Good ablation studies and robustness checks (seeds, guard model shift)

Notable limitations:

No human evaluation of safety or quality

The theoretical contribution is minimal (standard results)

The KL-DiSP over-refusal issue (84.6%) undermines claims of comparability between TV and KL

Limited analysis of failure modes—when does the distilled model fail?

No comparison against other lightweight safety methods (e.g., representation engineering, constitutional AI approaches adapted for small models)

The base models used are already instruction-tuned with some safety training; the marginal benefit over further safety fine-tuning (without distillation) is not explored

Reproducibility concerns: while the algorithm is clearly described, the paper relies on proprietary hardware for some measurements

Overall Assessment

This is a solid applied research contribution that identifies a practical and effective approach (soft prompt distillation) for on-device LLM safety. The systematic comparison is its primary value. However, the novelty is incremental—combining existing techniques (soft prompts, TV/KL distillation, guard models) in a straightforward way. The theoretical contributions are standard, and the safety improvements, while consistent, are moderate. The paper would benefit from deeper analysis of why soft prompts outperform alternatives and from human evaluation of safety outcomes.

Rating: 5.8 / 10

Significance 6 · Rigor 6.5 · Novelty 4.5 · Clarity 7.5

Generated Jun 9, 2026

Comparison History (17)

Lost vs. Exploring the Design Space of Reward Backpropagation for Flow Matching

Paper 2 proposes a fundamental, mathematically rigorous framework (FlowBP) that solves critical memory and gradient pathologies in reward backpropagation for flow matching models. Its methodological innovation and potential to influence how state-of-the-art generative models are aligned give it a higher foundational scientific impact compared to the more applied, though practical, on-device LLM safety distillation approach in Paper 1.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. MODIP: Efficient Model-Based Optimization for Diffusion Policies

MODIP addresses a fundamental challenge in combining diffusion policies with reinforcement learning, a rapidly growing area in robot learning. Its novel framework bridging world models, MPC, and diffusion policies has broader methodological impact across robotics and RL communities. The approach elegantly sidesteps the difficulty of backpropagating through multi-step denoising by using MPC-generated trajectories as BC targets, offering a principled and general solution. Paper 1, while practically useful for on-device safety, is more incremental—combining existing techniques (soft prompts, distillation) for a specific deployment scenario with narrower impact scope.

claude-opus-4-6 · Jun 10, 2026

Lost vs. N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

N-GRPO introduces a fundamental algorithmic innovation in policy optimization, directly addressing the critical challenge of diverse trajectory generation in LLM reasoning. Given the immense current interest in GRPO and reasoning models, this method offers high novelty and broad potential impact across alignment research. Conversely, while Paper 2 addresses an important practical issue (on-device safety), it is primarily an empirical study combining existing techniques (soft prompts and distillation), making its fundamental scientific contribution comparatively narrower.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. Can we trust our models? Epistemic calibration in second-order classification

Paper 1 introduces a fundamentally new theoretical concept (epistemic calibration) that addresses a significant gap in uncertainty quantification—evaluating whether epistemic uncertainty estimates themselves are trustworthy. It provides formal definitions, an impossibility theorem, a consistent estimator (EECE), and broad experimental validation. This has wide-reaching implications across all fields using uncertainty-aware ML models. Paper 2 makes a solid engineering contribution to on-device LLM safety but is more incremental, combining existing techniques (soft prompts, distillation, parameter-efficient methods) for a specific deployment scenario, with narrower theoretical novelty and impact breadth.

claude-opus-4-6 · Jun 10, 2026

Lost vs. CITRAS-FM: Tiny Time Series Foundation Model for Covariate-Informed Zero-Shot Forecasting

Paper 2 has higher potential impact due to broader applicability and clearer methodological novelty: a tiny (7M) foundation model enabling covariate-informed zero-shot forecasting with real-time CPU inference, plus CovSynth to address pretraining data scarcity and a new Shifted Attention mechanism. This targets many high-impact domains (energy, finance, health, industry) where covariates are crucial. Paper 1 is timely for on-device safe LLMs, but largely a comparative study identifying soft-prompt distillation as best-in-class rather than introducing a broadly general new paradigm, and its impact is narrower to LLM safety alignment.

gpt-5.2 · Jun 10, 2026

Won vs. Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

Paper 1 addresses a highly relevant and immediate bottleneck in AI deployment: running safe LLMs on resource-constrained edge devices. Its practical approach combining soft prompts and distillation offers clear, high-impact real-world applications in mobile and IoT computing. While Paper 2 presents rigorous advancements in mechanistic interpretability, Paper 1's methodology directly enables safer, broader accessibility of LLMs, giving it a higher potential for widespread technological and societal impact.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. Pseudospectral Bounds for Transient Amplification in Coupled Gradient Descent

While Paper 1 provides rigorous and fundamental theoretical contributions to optimization, Paper 2 addresses a highly urgent and practical challenge: deploying safe LLMs on resource-constrained edge devices. Given the current explosive growth of LLMs, parameter-efficient safety alignment has immediate, widespread real-world applications. Its timeliness, relevance to a booming industry, and practical utility in overcoming critical deployment bottlenecks give it a higher potential for broad and rapid scientific and technological impact.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. How Deep Are Deep GPs, Really? A Sharp Threshold and a Non-Gaussian Limit for Compositional GPs

Paper 1 makes fundamental theoretical contributions to understanding deep Gaussian processes, establishing sharp phase transitions and proving the existence of non-trivial, non-Gaussian limiting distributions. This advances core mathematical understanding of deep probabilistic models with broad implications across Bayesian deep learning and probability theory. Paper 2 addresses a practical engineering problem (safe on-device LLM deployment) with incremental contributions—combining existing techniques (soft prompts, distillation, parameter-efficient methods). While useful, it is more applied and narrower in scope, with findings likely to be superseded as LLM architectures evolve.

claude-opus-4-6 · Jun 9, 2026

Won vs. CaliDist: Calibrating Large Language Models via Behavioral Robustness to Distraction

Paper 1 addresses the highly timely challenge of deploying safe LLMs on resource-constrained edge devices. By eliminating the need for a secondary guard model through soft prompt distillation, it offers immense real-world applicability for mobile and IoT ecosystems. While Paper 2 presents a strong calibration method, Paper 1 directly unlocks broader on-device LLM adoption, giving it higher potential for widespread technological and industrial impact.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. RESCAST-100K: A Comprehensive Dataset for Cross-Domain Residential Load and Indoor Temperature Forecasting

Paper 2 has higher potential impact due to timeliness and broad applicability: enabling safe LLM deployment on-device addresses a major current bottleneck across consumer electronics, privacy-preserving assistants, and edge AI. Its methodological contribution (systematic comparison plus new distillation objectives for transferring guard behavior into soft prompts) is directly actionable and likely to influence both research and product deployment. Paper 1 is valuable as a large benchmark for building/energy ML, but its impact is more domain-specific and relies heavily on simulated data, which may limit adoption relative to the widespread demand for efficient LLM safety methods.

gpt-5.2 · Jun 9, 2026
