Bartłomiej Marek, Lorenzo Rossi, Vincent Hanke, Xun Wang, Michael Backes, Franziska Boenisch, Adam Dziedzic
Recent work has applied differential privacy (DP) to adapt large language models (LLMs) for sensitive applications, offering theoretical guarantees. However, its practical effectiveness remains unclear, partly due to LLM pretraining, where overlaps and interdependencies with adaptation data can undermine privacy despite DP efforts. To analyze this issue in practice, we investigate privacy risks under DP adaptations in LLMs using state-of-the-art attacks such as robust membership inference and canary data extraction. We benchmark these risks by systematically varying the adaptation data distribution, from exact overlaps with pretraining data, through in-distribution (IID) cases, to entirely out-of-distribution (OOD) examples. Additionally, we evaluate how different adaptation methods and different privacy regimes impact the vulnerability. Our results show that distribution shifts strongly influence privacy vulnerability: the closer the adaptation data is to the pretraining distribution, the higher the practical privacy risk at the same theoretical guarantee, even without direct data overlap. We find that parameter-efficient fine-tuning methods, such as LoRA, achieve the highest empirical privacy protection for OOD data. Our benchmark identifies key factors for achieving practical privacy in DP LLM adaptation, providing actionable insights for deploying customized models in sensitive settings. Looking forward, we propose a structured framework for holistic privacy assessment beyond adaptation privacy, to identify and evaluate risks across the full pretrain-adapt pipeline of LLMs.
This paper provides a comprehensive empirical benchmark examining the practical privacy risks of differentially private (DP) adaptations of large language models (LLMs). The central insight is that the distributional relationship between pretraining and adaptation data critically determines empirical privacy leakage — even when DP guarantees are formally satisfied. The paper systematically varies this relationship across three regimes: exact overlap with pretraining data, in-distribution (IID) with no overlap, and out-of-distribution (OOD). A secondary contribution is a formalized four-stage holistic privacy auditing framework for the pretrain-adapt paradigm, instantiating adversarial games for each stage.
The key finding — that IID data from the pretraining validation set leaks as much as directly overlapping data — is significant because it demonstrates that distributional closeness, not literal data overlap, is the primary driver of privacy risk. This challenges the implicit assumption that DP protections for adaptation data are equally effective regardless of the data's relationship to pretraining.
The experimental design is thorough. The benchmark spans six datasets (three IID, two OOD, plus overlap), four adaptation methods (Full Fine-Tuning, LoRA, Prefix Tuning, Head Fine-Tuning), seven privacy regimes, and nine pretrained models. The use of Wasserstein distance over Sentence-BERT embeddings to empirically quantify distributional distances validates the IID/OOD classification. The choice to focus on Pythia and GPT-Neo families (trained on the Pile) enables controlled analysis since the pretraining data is known, while OLMo models extend generalizability.
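To make the distance measurement concrete, here is a minimal sketch of comparing two sets of sentence embeddings with a Wasserstein-style distance. It is an illustration under stated assumptions, not the paper's procedure: the embeddings are synthetic Gaussian vectors standing in for Sentence-BERT outputs, and the function `sliced_wasserstein` is a hypothetical helper that uses a sliced (random-projection) approximation. The estimator the authors actually used is not specified in this summary.

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections=128, seed=0):
    """Sliced 1-Wasserstein distance between two equal-sized point clouds.

    Assumption: x and y have the same shape (n_samples, dim). This is an
    approximation based on random 1-D projections, not the exact
    high-dimensional Wasserstein distance.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample counts and dimensions")
    rng = np.random.default_rng(seed)
    # Draw random unit directions to project the embeddings onto.
    directions = rng.normal(size=(n_projections, x.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # For equal-sized 1-D samples, W1 is the mean absolute difference
    # between the sorted values; average that over all projections.
    px = np.sort(x @ directions.T, axis=0)
    py = np.sort(y @ directions.T, axis=0)
    return float(np.mean(np.abs(px - py)))

# Synthetic stand-ins for embeddings (hypothetical data, not from the paper).
rng = np.random.default_rng(1)
pretrain = rng.normal(0.0, 1.0, size=(500, 32))
iid_like = rng.normal(0.0, 1.0, size=(500, 32))   # same distribution
ood_like = rng.normal(1.5, 1.0, size=(500, 32))   # shifted distribution

d_iid = sliced_wasserstein(pretrain, iid_like)
d_ood = sliced_wasserstein(pretrain, ood_like)
print(f"IID-like distance: {d_iid:.3f}, OOD-like distance: {d_ood:.3f}")
```

With real data, the synthetic arrays would be replaced by embeddings of pretraining and adaptation samples; a smaller distance would then indicate adaptation data that sits closer to the pretraining distribution.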
The paper employs state-of-the-art attacks: RMIA as the primary MIA, complemented by the Reference method, Min-K%, and canary data extraction. The authors appropriately control for confounds by ensuring similar validation loss across adaptation methods at each privacy level, enabling fair privacy leakage comparisons. The hyperparameter grid search with top-4 selection for privacy-utility curves adds practical value.
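Of the attacks listed, Min-K% is the simplest to sketch: it scores a text by the average log-probability of its least likely tokens, on the intuition that text seen in training has fewer very surprising tokens. The snippet below is a minimal sketch under assumptions: the token log-probabilities are made-up numbers rather than outputs of a real model, `min_k_score` is a hypothetical helper name, and the fraction `k=0.2` is an illustrative default, not a setting taken from the paper.

```python
import numpy as np

def min_k_score(token_logprobs, k=0.2):
    """Average log-probability of the k fraction of least likely tokens.

    Higher (less negative) scores suggest the text is more likely to have
    been seen in training. Thresholding is left to the caller.
    """
    lp = np.asarray(token_logprobs, dtype=float)
    if lp.size == 0:
        raise ValueError("need at least one token log-probability")
    # Keep at least one token even for very short inputs.
    n = max(1, int(np.ceil(k * lp.size)))
    lowest = np.sort(lp)[:n]
    return float(lowest.mean())

# Hypothetical per-token log-probabilities for two 10-token texts.
member_like = [-0.2, -0.5, -0.1, -0.9, -0.3, -0.4, -0.2, -0.6, -0.1, -0.3]
nonmember_like = [-0.2, -4.1, -0.1, -6.3, -0.3, -0.4, -5.0, -0.6, -0.1, -0.3]

score_member = min_k_score(member_like)
score_nonmember = min_k_score(nonmember_like)
print(score_member, score_nonmember)
```

In an actual audit the log-probabilities would come from the adapted model, and the decision threshold would be calibrated on held-out data, typically to report the true positive rate at a fixed low false positive rate.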
However, two methodological considerations qualify these results. The paper acknowledges that the RMIA shadow model setup (same architecture, same data distribution) represents an unusually strong attacker. When this assumption is relaxed, attack success drops substantially, sometimes to near-random. This creates a tension: the strongest results rely on an attacker with near-perfect knowledge of the target model's provenance, which may overstate risks in some practical scenarios while being entirely realistic in the open-source model regime.
The focus on sentence-level DP (256-token chunks) rather than example-level or user-level DP is a specific design choice that should be considered when interpreting results, as different granularities yield different privacy semantics.
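The granularity point can be illustrated with a short sketch of how a token sequence becomes 256-token privacy units. This is a sketch under assumptions: `chunk_tokens` is a hypothetical helper, the token IDs are placeholders, and whether a trailing partial chunk is dropped or padded is a choice made here for illustration, since the summary does not say how the paper handles it.

```python
def chunk_tokens(token_ids, chunk_len=256, drop_last=True):
    """Split a token sequence into fixed-length chunks.

    Each returned chunk is treated as one record, i.e. the unit whose
    presence or absence the DP guarantee is stated for.
    """
    chunks = [token_ids[i:i + chunk_len]
              for i in range(0, len(token_ids), chunk_len)]
    # Assumption: drop a trailing partial chunk instead of padding it.
    if drop_last and chunks and len(chunks[-1]) < chunk_len:
        chunks.pop()
    return chunks

# A placeholder 600-token document yields two full 256-token records.
doc = list(range(600))
records = chunk_tokens(doc)
print(len(records), [len(r) for r in records])
```

Because the guarantee is stated per chunk, a document or user contributing several chunks is covered only through group privacy, which weakens the effective guarantee as the number of contributed chunks grows.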
The practical implications are substantial for organizations deploying LLMs on sensitive data (healthcare, legal, financial). The paper provides actionable guidance:
1. Adaptation method selection: LoRA consistently offers the best privacy-utility tradeoffs, with the lowest empirical leakage for OOD data while maintaining competitive utility.
2. Privacy budget guidance: Moderate ε (e.g., ε=8) still permits significant leakage for IID data, suggesting practitioners need ε<0.1 for genuine protection.
3. Model selection: The distributional relationship between the base model's pretraining data and adaptation data matters more than previously appreciated.
4. Pretraining data protection: Prefix Tuning uniquely reduces memorized pretraining data leakage, relevant when pretraining data also requires protection.
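The privacy budget guidance in item 2 can be related to what a given ε allows in theory. For an (ε, δ)-DP mechanism, the standard hypothesis-testing view bounds any membership inference attack's true positive rate at a given false positive rate. The sketch below computes that ceiling; `max_tpr` is a hypothetical helper, δ defaults to 0 for simplicity, and the function gives only the theoretical upper bound, not the empirical leakage the paper measures with attacks.

```python
import math

def max_tpr(epsilon, fpr, delta=0.0):
    """Upper bound on attack TPR at a given FPR under (epsilon, delta)-DP.

    Combines the two standard constraints:
      TPR <= exp(eps) * FPR + delta
      TPR <= 1 - exp(-eps) * (1 - delta - FPR)
    """
    upper_a = math.exp(epsilon) * fpr + delta
    upper_b = 1.0 - math.exp(-epsilon) * (1.0 - delta - fpr)
    # Clip to a valid probability.
    return max(0.0, min(1.0, upper_a, upper_b))

# Compare the theoretical ceiling for the two budgets mentioned above.
for eps in (8.0, 0.1):
    print(f"epsilon={eps}: max TPR at FPR=0.1% is {max_tpr(eps, 0.001):.4f}")
```

At ε=8 the bound is close to 1, so the formal guarantee alone rules out almost nothing at low false positive rates, whereas at ε=0.1 the ceiling stays close to the false positive rate itself. This is consistent with the observation that moderate budgets leave room for substantial empirical leakage.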
The holistic auditing framework, while primarily conceptual, provides a structured way to reason about privacy across the full LLM pipeline. The four adversarial games (pretraining audit, adaptation audit, joint audit, post-adaptation audit) formalize threat models that were previously only informally discussed.
This work addresses a critical gap at the intersection of two major trends: the widespread adoption of LLMs in sensitive domains and the application of DP as a privacy mechanism. The paper directly responds to the position paper by Tramèr et al. (2024), which raises concerns about DP with large-scale pretraining. Published at ICLR 2026, it arrives at a time when organizations are actively deploying private fine-tuning pipelines, making empirical guidance urgently needed.
The focus on open-source models with known pretraining data (Pythia, GPT-Neo, OLMo) is both a strength (enabling rigorous analysis) and a limitation (excluding closed-source models where practitioners may have less control).
This is a well-executed empirical study that fills an important gap in understanding the practical privacy implications of DP LLM adaptation. The finding that distributional closeness — not just data overlap — drives privacy risk is the paper's most impactful contribution. While the holistic framework adds conceptual value, its empirical validation remains incomplete. The work provides a solid foundation for future privacy auditing research and practical deployment decisions.
Generated Jun 9, 2026
Paper 1 likely has higher scientific impact due to broad, timely relevance to privacy in large language models—a central deployment barrier across many sensitive real-world applications. Its benchmarking of empirical privacy leakage under DP adaptation across distribution shift and adaptation methods addresses a widely recognized gap between theoretical guarantees and practical risk, with actionable guidance and a proposed end-to-end assessment framework. Paper 2 is methodologically strong and impactful for neural decoding/BCI, but its domain is narrower and affects fewer fields compared to privacy evaluation frameworks for LLM adaptation.
Flow-DPPO addresses a fundamental structural mismatch in applying PPO-style methods to flow matching models, proposing a principled divergence-based alternative with clear theoretical motivation and strong empirical results across multiple dimensions. Its direct applicability to the rapidly growing field of RL-aligned generative models (images/video), availability of code/models, and broad practical relevance give it higher near-term impact. Paper 1 provides valuable empirical benchmarking of DP in LLM adaptation but is more diagnostic/observational rather than introducing a novel method that changes practice.
Paper 1 (GASLoC) introduces a novel decentralized training algorithm addressing a critical bottleneck in LLM pretraining—communication efficiency in heterogeneous settings. It unifies local updates with gossip-based communication, demonstrating competitive performance with state-of-the-art methods like DiLoCo while enabling practical advantages in heterogeneous bandwidth environments. This addresses a fundamental scalability challenge as LLM training increasingly spans distributed infrastructure. Paper 2 provides useful empirical benchmarking of DP in LLM adaptation but is more incremental, analyzing known privacy-utility tradeoffs rather than introducing fundamentally new methods. Paper 1's broader applicability to the growing distributed training paradigm gives it higher impact potential.
Paper 2 addresses the highly timely and practically important problem of privacy in LLM adaptation, an area of intense current interest. It provides actionable benchmarking insights for deploying LLMs in sensitive settings, with broad real-world applicability across healthcare, legal, and other domains. While Paper 1 makes rigorous theoretical contributions about conservation laws in neural network training—a novel and elegant topic—its impact is more niche, primarily of interest to the mathematical deep learning theory community. Paper 2's breadth of impact, timeliness given widespread LLM deployment, and practical relevance give it higher estimated scientific impact.
Paper 2 likely has higher scientific impact: it addresses a timely, high-stakes problem (privacy in LLM deployment) with broad cross-domain relevance (ML security/privacy, NLP, policy/compliance). It provides a systematic benchmark across data-distribution regimes and attack types, yielding actionable insights (e.g., distribution shift effects, LoRA benefits) and motivating a holistic pretrain–adapt privacy framework. Paper 1 is technically novel and practically useful for inference speedups, but the gains are moderate and the impact is narrower to decoding acceleration, with less immediate societal/operational leverage than privacy risk measurement.
PBSD addresses a fundamental and timely challenge in reinforcement learning for LLM agents—credit assignment in long-horizon tasks with sparse rewards. Its novel Bayesian framework for converting trajectory-level signals into turn-level credit is theoretically principled and broadly applicable to the rapidly growing field of agentic AI. Paper 2 provides valuable empirical benchmarking of DP in LLM adaptation, but is more incremental and narrower in scope. Paper 1's methodological innovation, generalizability across domains, and relevance to the frontier of agentic reasoning give it higher potential impact.
Paper 1 (OSAQ) presents a novel, theoretically grounded technique for LLM quantization that exploits the low-rank structure of the Hessian null space for outlier suppression—a key bottleneck in low-bit quantization. Its closed-form solution, zero inference overhead, and dramatic performance gains (40% lower perplexity at 2-bit) make it highly practical and broadly applicable to LLM deployment. Paper 2 provides useful empirical benchmarking of DP in LLM adaptation but is more incremental, combining existing attacks and methods without introducing a fundamentally new technique. OSAQ's novelty and direct impact on efficient LLM inference give it higher potential impact.
Paper 1 introduces a highly novel framework (UPMs) that solves a critical security and intellectual property challenge in decentralized AI, enabling new paradigms for community-driven AI. In contrast, Paper 2 is a valuable but fundamentally less innovative benchmarking study of existing differential privacy methods. The architectural breakthrough in Paper 1 has a broader potential to spawn entirely new research directions in distributed systems and collaborative machine learning.
Paper 1 introduces a novel and broadly applicable framework for hybrid neural surrogates that implicitly detects discontinuities without explicit supervision, achieving significant speedups (26-72x) across diverse physical systems. Its methodological innovation—continuous horizon conditioning, implicit error maps competitive with ensemble methods, and a principled fallback gating mechanism—addresses a fundamental challenge in scientific computing. Paper 2 provides valuable empirical benchmarking of DP in LLM adaptation but is more incremental, primarily characterizing known privacy risks rather than introducing new methods. Paper 1's cross-domain applicability and practical speedups give it broader and deeper impact potential.
Paper 2 has higher potential impact due to the explosive adoption of Large Language Models (LLMs) and the critical need for privacy in sensitive domains (healthcare, finance). While Paper 1 offers a novel defense for offline Safe RL, its application is relatively niche (robotics/control). Paper 2 addresses a fundamental gap between theoretical differential privacy and empirical vulnerability during LLM fine-tuning. By systematically benchmarking distribution shifts and methods like LoRA, it provides highly actionable insights that will broadly influence how researchers and practitioners securely adapt foundation models across numerous fields.