Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models

Bartłomiej Marek, Lorenzo Rossi, Vincent Hanke, Xun Wang, Michael Backes, Franziska Boenisch, Adam Dziedzic

Jun 8, 2026 · arXiv:2606.09401v1

cs.LG · cs.CR

#1446 of 5669 · cs.LG

Tournament Score: 1454 ± 43 · Win Rate: 64%

Rating: 7.2/10 (Significance 7.5 · Rigor 7.5 · Novelty 6.5 · Clarity 7.5)

Abstract

Recent work has applied differential privacy (DP) to adapt large language models (LLMs) for sensitive applications, offering theoretical guarantees. However, its practical effectiveness remains unclear, partly due to LLM pretraining, where overlaps and interdependencies with adaptation data can undermine privacy despite DP efforts. To analyze this issue in practice, we investigate privacy risks under DP adaptations in LLMs using state-of-the-art attacks such as robust membership inference and canary data extraction. We benchmark these risks by systematically varying the adaptation data distribution, from exact overlaps with pretraining data, through in-distribution (IID) cases, to entirely out-of-distribution (OOD) examples. Additionally, we evaluate how different adaptation methods and different privacy regimes impact the vulnerability. Our results show that distribution shifts strongly influence privacy vulnerability: the closer the adaptation data is to the pretraining distribution, the higher the practical privacy risk at the same theoretical guarantee, even without direct data overlap. We find that parameter-efficient fine-tuning methods, such as LoRA, achieve the highest empirical privacy protection for OOD data. Our benchmark identifies key factors for achieving practical privacy in DP LLM adaptation, providing actionable insights for deploying customized models in sensitive settings. Looking forward, we propose a structured framework for holistic privacy assessment beyond adaptation privacy, to identify and evaluate risks across the full pretrain-adapt pipeline of LLMs.

AI Impact Assessments (1 model)

Scientific Impact Assessment

Core Contribution

This paper provides a comprehensive empirical benchmark examining the practical privacy risks of differentially private (DP) adaptations of large language models (LLMs). The central insight is that the distributional relationship between pretraining and adaptation data critically determines empirical privacy leakage — even when DP guarantees are formally satisfied. The paper systematically varies this relationship across three regimes: exact overlap with pretraining data, in-distribution (IID) with no overlap, and out-of-distribution (OOD). A secondary contribution is a formalized four-stage holistic privacy auditing framework for the pretrain-adapt paradigm, instantiating adversarial games for each stage.

The key finding — that IID data from the pretraining validation set leaks as much as directly overlapping data — is significant because it demonstrates that distributional closeness, not literal data overlap, is the primary driver of privacy risk. This challenges the implicit assumption that DP protections for adaptation data are equally effective regardless of the data's relationship to pretraining.

Methodological Rigor

The experimental design is thorough. The benchmark spans six datasets (three IID, two OOD, plus overlap), four adaptation methods (Full Fine-Tuning, LoRA, Prefix Tuning, Head Fine-Tuning), seven privacy regimes, and nine pretrained models. The use of Wasserstein distance over Sentence-BERT embeddings to empirically quantify distributional distances validates the IID/OOD classification. The choice to focus on Pythia and GPT-Neo families (trained on the Pile) enables controlled analysis since the pretraining data is known, while OLMo models extend generalizability.
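The review does not say exactly how the Wasserstein distance was computed. Below is a minimal sketch. It assumes two equal-size samples with uniform weights and a Euclidean cost. Under those assumptions, the 1-Wasserstein distance between two embedding sets reduces to the cheapest one-to-one matching between them. The toy 2-D vectors are hypothetical stand-ins for Sentence-BERT embeddings. Brute-forcing every matching only works for a handful of points; real embedding sets would need an assignment or optimal-transport solver.

```python
import itertools
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def wasserstein1_exact(xs, ys):
    """Exact W1 between two uniform empirical distributions of equal size.

    With n points per side and uniform weights, some optimal transport plan is a
    permutation (Birkhoff-von Neumann), so we brute-force all n! matchings.
    Only practical for tiny n; this is an illustration, not a scalable tool.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size samples")
    n = len(xs)
    best = math.inf
    for perm in itertools.permutations(range(n)):
        cost = sum(euclidean(xs[i], ys[j]) for i, j in enumerate(perm)) / n
        best = min(best, cost)
    return best


# Hypothetical 2-D "embeddings" standing in for Sentence-BERT vectors.
pretrain = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
iid = [(0.1, 0.0), (1.0, 0.1), (0.0, 0.9)]   # close to the pretraining sample
ood = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]   # shifted far away

print("W1(pretrain, IID):", wasserstein1_exact(pretrain, iid))
print("W1(pretrain, OOD):", wasserstein1_exact(pretrain, ood))
```

A smaller distance marks data as closer to the pretraining distribution. That ordering is what an IID/OOD split would be checked against.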

The paper employs state-of-the-art attacks: RMIA as the primary MIA, complemented by the Reference method, Min-K%, and canary data extraction. The authors appropriately control for confounds by ensuring similar validation loss across adaptation methods at each privacy level, enabling fair privacy leakage comparisons. The hyperparameter grid search with top-4 selection for privacy-utility curves adds practical value.
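Of the attacks listed, Min-K% is the simplest to state. It needs only the target model's per-token log-probabilities. The attack averages the k% least likely tokens, on the intuition that training members contain fewer very unlikely tokens. Below is a minimal sketch. It assumes the log-probabilities have already been computed; the values are made up. The default k = 20% is an illustrative choice, not the paper's setting.

```python
def min_k_percent_score(token_logprobs, k=0.2):
    """Min-K% membership score.

    token_logprobs: per-token log-probabilities log p(x_t | x_<t) under the
    target model. Returns the mean of the lowest k fraction of them; a higher
    (less negative) score is taken as evidence of membership.
    """
    if not token_logprobs:
        raise ValueError("need at least one token")
    if not 0 < k <= 1:
        raise ValueError("k must be in (0, 1]")
    n = max(1, int(len(token_logprobs) * k))  # number of lowest tokens to keep
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Hypothetical log-probabilities for two candidate sequences.
member_like = [-0.3, -0.1, -0.8, -0.2, -0.5]
nonmember_like = [-0.3, -6.0, -0.8, -4.5, -0.5]
print(min_k_percent_score(member_like), min_k_percent_score(nonmember_like))
```

In practice, the score is compared against a threshold calibrated on known non-members.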

However, there are methodological considerations. The paper acknowledges that the RMIA shadow-model setup (same architecture and same data distribution as the target) represents an unusually strong attacker. When this assumption is relaxed, attack success drops substantially, sometimes to near-random. This creates a tension: the strongest results rely on an attacker with near-perfect knowledge of the target model's provenance. That may overstate risks in some practical scenarios, but it is entirely realistic when the base model is open source.
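The reliance on shadow models follows from how RMIA scores a candidate x. RMIA compares the target-to-reference likelihood ratio of x against the same ratio for population samples z. Below is a minimal, simplified sketch. It assumes Pr(x) is estimated by averaging over the reference (shadow) models. It omits the offline-setting adjustments and the log-space arithmetic a real implementation would need for long sequences. All probabilities passed in are hypothetical inputs.

```python
from statistics import fmean


def likelihood_ratio(p_target, p_refs):
    """Pr(. | theta) / Pr(.), with Pr(.) approximated by the mean over reference models."""
    return p_target / fmean(p_refs)


def rmia_score(p_target_x, p_refs_x, population, gamma=1.0):
    """Simplified RMIA score for a candidate x.

    population: list of (p_target_z, p_refs_z) pairs for non-member samples z.
    Returns the fraction of z with LR(x) / LR(z) >= gamma. A score near 1 means
    the target favours x more than typical population data, relative to the
    reference models, which is evidence that x was in the training set.
    """
    lr_x = likelihood_ratio(p_target_x, p_refs_x)
    dominated = sum(
        1
        for p_target_z, p_refs_z in population
        if lr_x / likelihood_ratio(p_target_z, p_refs_z) >= gamma
    )
    return dominated / len(population)


# Hypothetical probabilities: x is twice as likely under the target as under references.
population = [(0.25, [0.25]), (0.5, [0.125])]
print(rmia_score(0.5, [0.25, 0.25], population))
```

If the reference models are trained on a different distribution or architecture, the denominator no longer reflects what a non-member looks like. This is one plausible reason the attack weakens once the shadow-model assumption is relaxed.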

The focus on sentence-level DP (256-token chunks) rather than example-level or user-level DP is a specific design choice that should be considered when interpreting results, as different granularities yield different privacy semantics.
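Sentence-level DP means the privacy unit is a single 256-token chunk, not a whole document or user. Below is a minimal sketch of what that implies for training. It assumes a DP-SGD-style optimizer: the review names gradient-based DP methods but not the exact algorithm. The per-chunk gradient vectors are made up.

The steps are:
- cut the token stream into 256-token records;
- clip each record's gradient to norm C;
- add Gaussian noise scaled to C to the sum.

Dropping a trailing partial chunk is an assumption of this sketch.

```python
import math
import random

CHUNK_LEN = 256  # privacy unit length as described in the review


def chunk_tokens(token_ids, chunk_len=CHUNK_LEN):
    """Split a token stream into fixed-length records (assumption: drop a partial tail)."""
    return [token_ids[i:i + chunk_len]
            for i in range(0, len(token_ids) - chunk_len + 1, chunk_len)]


def clip(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_update(per_chunk_grads, max_norm, noise_multiplier, rng):
    """Clip each chunk's gradient, sum, add N(0, (noise_multiplier*max_norm)^2) noise, average."""
    dim = len(per_chunk_grads[0])
    total = [0.0] * dim
    for g in per_chunk_grads:
        for i, v in enumerate(clip(g, max_norm)):
            total[i] += v
    sigma = noise_multiplier * max_norm
    return [(t + rng.gauss(0.0, sigma)) / len(per_chunk_grads) for t in total]


rng = random.Random(0)
chunks = chunk_tokens(list(range(600)))  # hypothetical token ids
print(len(chunks), "privacy units")
grads = [[3.0, 4.0], [0.0, 0.5]]  # hypothetical per-chunk gradients
print(dp_sgd_update(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng))
```

Because each chunk is a separate record, a document spanning several chunks is covered only through group privacy. Its effective guarantee degrades with the number of chunks it occupies, which is why the granularity matters when reading the reported ε values.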

Potential Impact

The practical implications are substantial for organizations deploying LLMs on sensitive data (healthcare, legal, financial). The paper provides actionable guidance:

1. Adaptation method selection: LoRA consistently offers the best privacy-utility tradeoffs, with the lowest empirical leakage for OOD data while maintaining competitive utility.

2. Privacy budget guidance: Moderate ε (e.g., ε=8) still permits significant leakage for IID data, suggesting practitioners need ε<0.1 for genuine protection.

3. Model selection: The distributional relationship between the base model's pretraining data and adaptation data matters more than previously appreciated.

4. Pretraining data protection: Prefix Tuning uniquely reduces memorized pretraining data leakage, relevant when pretraining data also requires protection.
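The ε guidance in point 2 can be put in context with a standard result from the hypothesis-testing interpretation of differential privacy. Against an (ε, δ)-DP mechanism, no membership test can exceed a certain true-positive rate (TPR) at a given false-positive rate (FPR). The sketch below evaluates that bound. The δ = 1e-5 value and the 1% FPR operating point are illustrative choices, not the paper's settings.

```python
import math


def tpr_upper_bound(epsilon, delta, fpr):
    """Largest TPR any membership test can reach at a given FPR against an
    (epsilon, delta)-DP mechanism, from the two hypothesis-testing constraints
    FNR >= 1 - delta - e^eps * FPR and FNR >= e^-eps * (1 - delta - FPR)."""
    b1 = math.exp(epsilon) * fpr + delta
    b2 = 1.0 - math.exp(-epsilon) * (1.0 - delta - fpr)
    return min(1.0, b1, b2)


for eps in (0.1, 1.0, 8.0):
    bound = tpr_upper_bound(eps, delta=1e-5, fpr=0.01)
    print(f"eps={eps}: TPR <= {bound:.4f} at FPR = 1%")
```

At ε = 8 the bound is essentially vacuous: it allows a TPR close to 1 at a 1% FPR. At ε = 0.1 the TPR can barely exceed the FPR. The formal guarantee alone therefore rules out very little leakage at moderate ε, and the empirical attacks are what reveal how much actually occurs.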

The holistic auditing framework, while primarily conceptual, provides a structured way to reason about privacy across the full LLM pipeline. The four adversarial games (pretraining audit, adaptation audit, joint audit, post-adaptation audit) formalize threat models that were previously only informally discussed.
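The four games are not spelled out in this review. The toy distinguishing game below is a generic illustration of the challenger/adversary structure such games share; it is not one of the paper's games. The challenger flips a secret bit and includes the target record only if the bit is 1. It then "trains", which here means releasing a hypothetical noisy count with the Laplace mechanism. The adversary wins by guessing the bit. Under pure ε-DP, no adversary can win with probability above e^ε / (1 + e^ε), so the empirical win rate should stay below that ceiling.

```python
import math
import random


def laplace_noise(rng, scale):
    """Laplace(0, scale) sample as the difference of two i.i.d. exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def play_membership_game(epsilon, trials, seed=0):
    """Estimate the adversary's win rate in a toy membership game.

    Each round the challenger flips a secret bit b and releases a count of a
    dataset that contains the target record iff b == 1, with Laplace noise
    calibrated to epsilon-DP (sensitivity 1). The adversary sees only the noisy
    release and guesses b by thresholding at the midpoint.
    """
    rng = random.Random(seed)
    base_count = 100  # hypothetical dataset size without the target record
    wins = 0
    for _ in range(trials):
        b = rng.randrange(2)
        release = base_count + b + laplace_noise(rng, 1.0 / epsilon)
        guess = 1 if release > base_count + 0.5 else 0
        wins += int(guess == b)
    return wins / trials


eps = 1.0
print("empirical win rate:", play_membership_game(eps, trials=20_000))
print("pure eps-DP ceiling:", math.exp(eps) / (1 + math.exp(eps)))
```

In a real audit, the mechanism would be a full DP adaptation run and the adversary would be an attack such as RMIA. The bookkeeping (secret bit, guess, win rate) stays the same.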

Timeliness & Relevance

This work addresses a critical gap at the intersection of two major trends: the widespread adoption of LLMs in sensitive domains and the application of DP as a privacy mechanism. The paper directly responds to the position paper by Tramèr et al. (2024), which raised concerns about applying DP on top of large-scale pretraining. Published at ICLR 2026, it arrives at a time when organizations are actively deploying private fine-tuning pipelines, making empirical guidance urgently needed.

The focus on open-source models with known pretraining data (Pythia, GPT-Neo, OLMo) is both a strength (enabling rigorous analysis) and a limitation (excluding closed-source models where practitioners may have less control).

Strengths

Systematic coverage: The benchmark's breadth across datasets, models, adaptation methods, and privacy regimes is impressive and provides a comprehensive picture.

Surprising finding about IID vs. overlap: The equivalent leakage between IID and overlapping data is a non-obvious result with significant implications.

Practical relevance: The privacy-utility curves (RQ6) and computational cost analysis provide directly actionable information for practitioners.

Formalization of auditing stages: The adversarial game framework for the pretrain-adapt paradigm offers a principled way to reason about privacy beyond adaptation alone.

Reproducibility: The use of publicly available models, datasets, and well-documented experimental setups supports reproducibility.

Limitations

Formalization without instantiation: The holistic auditing framework (Section 6) is primarily formal. Only Stage 2 (adaptation auditing) and partially Stage 4 (post-adaptation) are empirically evaluated. Stages 1 and 3 remain future work, limiting the framework's current practical impact.

Model scale: The largest model evaluated is ~2.8B parameters, well below the scale of models deployed in practice (70B+). Privacy dynamics may differ at larger scales.

Attack realism: The strongest attack (RMIA with shadow model) requires significant attacker knowledge. The gap between this and more realistic attacks is large, complicating practical risk assessment.

Limited adaptation paradigms: The paper focuses on gradient-based DP methods, excluding PATE-based approaches and in-context learning methods for private adaptation.

Single pretraining distribution: The analysis is primarily anchored to the Pile, limiting generalization to models trained on different data compositions.

Overall Assessment

This is a well-executed empirical study that fills an important gap in understanding the practical privacy implications of DP LLM adaptation. The finding that distributional closeness — not just data overlap — drives privacy risk is the paper's most impactful contribution. While the holistic framework adds conceptual value, its empirical validation remains incomplete. The work provides a solid foundation for future privacy auditing research and practical deployment decisions.

Rating: 7.2/10 (Significance 7.5 · Rigor 7.5 · Novelty 6.5 · Clarity 7.5)

Generated Jun 9, 2026

Comparison History (25)

Won vs. GRAFT: Gain-Recalibrated Adapters for Transformer-Based Neural Population Activity Modeling

Paper 1 likely has higher scientific impact due to broad, timely relevance to privacy in large language models—a central deployment barrier across many sensitive real-world applications. Its benchmarking of empirical privacy leakage under DP adaptation across distribution shift and adaptation methods addresses a widely recognized gap between theoretical guarantees and practical risk, with actionable guidance and a proposed end-to-end assessment framework. Paper 2 is methodologically strong and impactful for neural decoding/BCI, but its domain is narrower and affects fewer fields compared to privacy evaluation frameworks for LLM adaptation.

gpt-5.2·Jun 10, 2026

Lost vs. Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

Flow-DPPO addresses a fundamental structural mismatch in applying PPO-style methods to flow matching models, proposing a principled divergence-based alternative with clear theoretical motivation and strong empirical results across multiple dimensions. Its direct applicability to the rapidly growing field of RL-aligned generative models (images/video), availability of code/models, and broad practical relevance give it higher near-term impact. Paper 1 provides valuable empirical benchmarking of DP in LLM adaptation but is more diagnostic/observational rather than introducing a novel method that changes practice.

claude-opus-4-6·Jun 10, 2026

Lost vs. Unifying Local Communications and Local Updates for LLM Pretraining

Paper 1 (GASLoC) introduces a novel decentralized training algorithm addressing a critical bottleneck in LLM pretraining—communication efficiency in heterogeneous settings. It unifies local updates with gossip-based communication, demonstrating competitive performance with state-of-the-art methods like DiLoCo while enabling practical advantages in heterogeneous bandwidth environments. This addresses a fundamental scalability challenge as LLM training increasingly spans distributed infrastructure. Paper 2 provides useful empirical benchmarking of DP in LLM adaptation but is more incremental, analyzing known privacy-utility tradeoffs rather than introducing fundamentally new methods. Paper 1's broader applicability to the growing distributed training paradigm gives it higher impact potential.

claude-opus-4-6·Jun 10, 2026

Won vs. Conservation Laws from Data Symmetry in Neural Networks

Paper 2 addresses the highly timely and practically important problem of privacy in LLM adaptation, an area of intense current interest. It provides actionable benchmarking insights for deploying LLMs in sensitive settings, with broad real-world applicability across healthcare, legal, and other domains. While Paper 1 makes rigorous theoretical contributions about conservation laws in neural network training—a novel and elegant topic—its impact is more niche, primarily of interest to the mathematical deep learning theory community. Paper 2's breadth of impact, timeliness given widespread LLM deployment, and practical relevance give it higher estimated scientific impact.

claude-opus-4-6·Jun 10, 2026

Won vs. CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference

Paper 2 likely has higher scientific impact: it addresses a timely, high-stakes problem (privacy in LLM deployment) with broad cross-domain relevance (ML security/privacy, NLP, policy/compliance). It provides a systematic benchmark across data-distribution regimes and attack types, yielding actionable insights (e.g., distribution shift effects, LoRA benefits) and motivating a holistic pretrain–adapt privacy framework. Paper 1 is technically novel and practically useful for inference speedups, but the gains are moderate and the impact is narrower to decoding acceleration, with less immediate societal/operational leverage than privacy risk measurement.

gpt-5.2·Jun 10, 2026

Lost vs. PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

PBSD addresses a fundamental and timely challenge in reinforcement learning for LLM agents—credit assignment in long-horizon tasks with sparse rewards. Its novel Bayesian framework for converting trajectory-level signals into turn-level credit is theoretically principled and broadly applicable to the rapidly growing field of agentic AI. Paper 2 provides valuable empirical benchmarking of DP in LLM adaptation, but is more incremental and narrower in scope. Paper 1's methodological innovation, generalizability across domains, and relevance to the frontier of agentic reasoning give it higher potential impact.

claude-opus-4-6·Jun 9, 2026

Lost vs. OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

Paper 1 (OSAQ) presents a novel, theoretically grounded technique for LLM quantization that exploits the low-rank structure of the Hessian null space for outlier suppression—a key bottleneck in low-bit quantization. Its closed-form solution, zero inference overhead, and dramatic performance gains (40% lower perplexity at 2-bit) make it highly practical and broadly applicable to LLM deployment. Paper 2 provides useful empirical benchmarking of DP in LLM adaptation but is more incremental, combining existing attacks and methods without introducing a fundamentally new technique. OSAQ's novelty and direct impact on efficient LLM inference give it higher potential impact.

claude-opus-4-6·Jun 9, 2026

Lost vs. Unextractable Protocol Models: Collaborative Training and Inference without Weight Materialization

Paper 1 introduces a highly novel framework (UPMs) that solves a critical security and intellectual property challenge in decentralized AI, enabling new paradigms for community-driven AI. In contrast, Paper 2 is a valuable but fundamentally less innovative benchmarking study of existing differential privacy methods. The architectural breakthrough in Paper 1 has a broader potential to spawn entirely new research directions in distributed systems and collaborative machine learning.

gemini-3.1-pro-preview·Jun 9, 2026

Lost vs. Hybrid Neural World Models

Paper 1 introduces a novel and broadly applicable framework for hybrid neural surrogates that implicitly detects discontinuities without explicit supervision, achieving significant speedups (26-72x) across diverse physical systems. Its methodological innovation—continuous horizon conditioning, implicit error maps competitive with ensemble methods, and a principled fallback gating mechanism—addresses a fundamental challenge in scientific computing. Paper 2 provides valuable empirical benchmarking of DP in LLM adaptation but is more incremental, primarily characterizing known privacy risks rather than introducing new methods. Paper 1's cross-domain applicability and practical speedups give it broader and deeper impact potential.

claude-opus-4-6·Jun 9, 2026

Won vs. Safe-RULE: Safe Reinforcement UnLEarning

Paper 2 has higher potential impact due to the explosive adoption of Large Language Models (LLMs) and the critical need for privacy in sensitive domains (healthcare, finance). While Paper 1 offers a novel defense for offline Safe RL, its application is relatively niche (robotics/control). Paper 2 addresses a fundamental gap between theoretical differential privacy and empirical vulnerability during LLM fine-tuning. By systematically benchmarking distribution shifts and methods like LoRA, it provides highly actionable insights that will broadly influence how researchers and practitioners securely adapt foundation models across numerous fields.

gemini-3.1-pro-preview·Jun 9, 2026
