Loss Landscape Poisoning: Targeted Extraction of Unseen Training Data from LLMs

Md Abdullah Al Mamun, Ngoc Phu Doan, Pedram Zaree, Ihsen Alouani, Nael Abu-Ghazaleh

Jun 15, 2026 · arXiv:2606.17110v1

cs.CR · cs.LG

#114 of 2618 · Cryptography & Security

Tournament Score: 1557 ± 45 (axis range 1050 to 1750)

Win Rate: 91%

Rating: 7.2 / 10

Significance 8 · Rigor 6.5 · Novelty 8 · Clarity 7.5

Abstract

Large Language Models are increasingly trained on proprietary or sensitive data, from private healthcare and financial records to user conversations containing secrets. Ensuring the privacy of such data against extraction attacks has become a central concern. In this paper, we ask whether an attacker who can poison a portion of the training data can facilitate the leakage of a separate target record they have no access to. We answer in the affirmative and show that such leakage can be induced by a poisoning mechanism that reshapes the model's local loss landscape around the target completion. Our key insight is that poisoning to create a sharp loss minimum at the target, surrounded by elevated loss on nearby alternatives, forces the model to memorize the target as the unique low-loss solution in its neighborhood. The attack requires no architectural changes and generalizes across centralized and federated learning settings. We demonstrate that the attack amplifies privacy leakage across language models (up to 100% successful extraction) and vision-language models (up to 90% successful extraction). We show that the attack is thwarted when the model is trained to be differentially private. However, we introduce a new attack that directly probes the loss landscape, bypassing even differential privacy defenses.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Loss Landscape Poisoning

1. Core Contribution

This paper introduces Loss Landscape Poisoning (LLP), a training-time poisoning attack that forces LLMs and VLMs to memorize sensitive training records the attacker has never seen. The key mechanism is elegant: by injecting poison samples that share the schema of a target record (e.g., same form template but with random values in the secret field) and applying gradient ascent on these decoys, the attacker elevates the loss across the target's neighborhood. When the victim's own data supplies the descent signal at the true secret, a sharp local minimum forms, compelling the model to memorize the exact secret as the unique low-loss solution.
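
The descent-plus-ascent mechanism described above can be illustrated with a deliberately tiny model. The sketch below is not the paper's implementation: it assumes a toy categorical model with one logit per candidate secret value, and the names `train`, `lam`, `target`, and `num_values` are hypothetical. Victim data contributes a descent step on the true value, while poison contributes a scaled ascent step on every other (decoy) value.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train(num_values=10, target=3, lam=0.0, lr=0.5, steps=20):
    """Toy categorical model: one logit per candidate secret value.

    lam = 0 means clean training (descent on the target only);
    lam > 0 adds gradient ascent on the decoy values (assumed poison effect).
    """
    z = np.zeros(num_values)
    decoys = [i for i in range(num_values) if i != target]
    e_target = np.eye(num_values)[target]
    e_decoys = np.eye(num_values)[decoys].mean(axis=0)
    for _ in range(steps):
        p = softmax(z)
        grad_target = p - e_target   # gradient of -log p[target]
        grad_decoys = p - e_decoys   # mean gradient of -log p[d] over decoys
        # Descend on the target loss, ascend on the decoy loss.
        z -= lr * (grad_target - lam * grad_decoys)
    return softmax(z)

p_clean = train(lam=0.0)
p_poisoned = train(lam=0.5)
print(f"target probability, clean:    {p_clean[3]:.3f}")
print(f"target probability, poisoned: {p_poisoned[3]:.3f}")
```

In this toy setting the poisoned run places more probability mass on the target than the clean run after the same number of steps, which is the qualitative effect the assessment describes. The ascent term is unbounded here, so the sketch says nothing about stability or utility in a real model.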

The paper contributes three attack variants (direct model poisoning, data-only poisoning, and federated learning poisoning) and, most notably, a new leakage primitive called Direct Loss Region Probing (DLRP) that bypasses differential privacy defenses by probing the geometry of the loss landscape rather than relying on direct generation.

2. Methodological Rigor

Strengths in experimental design:

Comprehensive evaluation across 9 LLM architectures (82M to 13B parameters) and 3 VLMs, spanning both full fine-tuning and LoRA adaptation

Consistent targeting of 100 secrets simultaneously, providing statistical reliability

Utility preservation demonstrated across 6 standard benchmarks with minimal degradation (±0.01-0.02 accuracy)

The loss ratio analysis (Figure 2b) provides mechanistic insight into *why* the attack succeeds, showing clean separation between successful and failed attacks
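
The assessment does not define the loss ratio, so the following is only one plausible reading, stated as an assumption: the mean loss over neighboring (decoy) completions divided by the loss on the target completion. The function name `loss_ratio` and its arguments are hypothetical.

```python
import numpy as np

def loss_ratio(target_loss, neighbor_losses):
    """Assumed definition: mean neighbor loss divided by target loss.

    A large ratio would indicate a sharp minimum at the target
    relative to its neighborhood.
    """
    neighbor_losses = np.asarray(neighbor_losses, dtype=float)
    if target_loss <= 0 or neighbor_losses.size == 0:
        raise ValueError("need a positive target loss and at least one neighbor")
    return float(neighbor_losses.mean() / target_loss)

# Hypothetical per-sequence losses, for illustration only.
print(loss_ratio(0.5, [2.0, 3.0, 4.0]))
```

Under this reading, a higher ratio means the target sits in a deeper well relative to its neighbors. Whether Figure 2b uses this exact quantity is not confirmed by the text here.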

Methodological concerns:

The threat model for direct model poisoning (Threat Model 1) is quite strong—requiring access to modify the loss function during training—though the authors acknowledge this and provide the more realistic data-only variant

The secret format is limited to structured data (SSNs, credit cards, phone numbers). It's unclear how well the attack generalizes to unstructured secrets like free-text conversations

The neighborhood construction relies on knowing the schema of the target record, which is a non-trivial assumption

For LLP-Data, the gradient matching uses a surrogate model, and transferability across architectures is not thoroughly tested (the paper uses architecture-matching surrogates)

The DLRP evaluation is limited to GPT-2 Small; broader evaluation across model scales would strengthen the DP evasion claims

3. Potential Impact

This work has significant implications across multiple dimensions:

Privacy and security: The demonstration that an attacker can induce memorization of data they've never seen fundamentally changes the threat landscape for collaborative and federated training. The cross-client leakage result—where a single malicious participant among 10 can extract secrets from others—is particularly alarming for real-world FL deployments in healthcare and finance.
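
To make the one-in-ten setting concrete, the sketch below shows plain federated averaging, which is assumed here as the aggregation rule; the assessment does not state which rule the paper uses. The function name `fedavg` and the example values are hypothetical.

```python
import numpy as np

def fedavg(client_updates, num_examples):
    """Weighted average of client updates, weighted by local dataset size."""
    updates = np.stack([np.asarray(u, dtype=float) for u in client_updates])
    weights = np.asarray(num_examples, dtype=float)
    weights = weights / weights.sum()
    return weights @ updates

# Nine honest clients send a zero update; one client sends a large one.
updates = [np.zeros(4)] * 9 + [np.full(4, 10.0)]
aggregate = fedavg(updates, num_examples=[100] * 10)
print(aggregate)
```

With equal dataset sizes, each client carries a weight of 1/10, so a single participant's update is scaled down but never removed from the aggregate unless a robust aggregation rule or filter is applied.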

Differential privacy implications: The DLRP result is arguably the most impactful finding. The insight that DP-SGD protects against generation-based extraction but not landscape-geometry-based probing challenges a core assumption in the private ML community. The observation that "the privacy of training data depends not only on what the model generates but on the geometry of the loss surface it carries" could redirect research on privacy-preserving training.
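
For readers unfamiliar with the defense being discussed, the sketch below shows the standard DP-SGD gradient computation: clip each per-example gradient to a fixed L2 norm, sum, add Gaussian noise scaled to that norm, and average. It is a minimal NumPy illustration, not the paper's training setup; `dp_sgd_gradient`, `clip_norm`, and `noise_multiplier` are names chosen for this example.

```python
import numpy as np

def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """One DP-SGD gradient estimate from per-example gradients.

    per_example_grads: array of shape (batch_size, dim).
    """
    grads = np.asarray(per_example_grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Scale down any gradient whose L2 norm exceeds clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Gaussian noise with standard deviation noise_multiplier * clip_norm.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0]

rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.3, 0.4]])
print(dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng))
```

Clipping bounds how much any single record can move the parameters, and the noise masks what remains. The DLRP claim summarized above is that a residual signal in the loss surface can still survive this procedure; the sketch does not test that claim.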

Practical relevance: The data-only variant (LLP-Data) with 86% success on LLMs is particularly threatening because it requires only the ability to contribute training samples—a realistic capability given web-scraped training data, crowdsourced datasets, or federated learning.

4. Timeliness & Relevance

The paper addresses a critical and timely concern. As LLMs are increasingly trained on proprietary data (medical records, financial data, enterprise communications), understanding the full attack surface for training data extraction is essential. The federated learning angle is especially relevant given growing interest in privacy-preserving collaborative fine-tuning (e.g., for hospital networks, financial institutions).

The work builds on established foundations (Carlini et al.'s memorization work, gradient matching from Geiping et al.) but synthesizes them into a genuinely novel attack paradigm. The timing is appropriate given the proliferation of fine-tuning-as-a-service and federated LLM training platforms.

5. Strengths & Limitations

Key Strengths:

Novel threat model: Extracting data the attacker has never seen through poisoning is a conceptual advance over prior work that targets known data

Unified framework: The same geometric principle operates across centralized, data-only, and federated settings

DLRP primitive: The loss-landscape probing attack that evades DP-SGD is the most novel technical contribution and opens a new attack surface

Comprehensive evaluation: Wide range of models, modalities (text and vision-language), and settings

Utility preservation: Poisoned models maintain benchmark performance, making the attack stealthy

Notable Limitations:

Structured secrets only: The attack is demonstrated on formatted data (SSNs, credit cards) with known schemas. Extension to arbitrary text memorization is unclear

Knowledge assumptions: The attacker must know the schema and approximate format of target records

DLRP scalability: Enumerating candidates in the target region assumes a bounded search space (e.g., 9-digit numbers). For high-entropy secrets, DLRP's computational cost could be prohibitive

Limited defense analysis: The FL defense analysis (Appendix C) is helpful but cursory; AlignIns appears effective but is only tested on one model

Poisoning rate sensitivity: The attack requires ~100 poison samples per secret (Figure 9), and the optimal operating point is narrow, raising questions about robustness in noisy real-world settings

DP evaluation depth: DLRP is only evaluated on GPT-2 Small under DP-SGD; whether the result holds at scale with stronger DP guarantees (e.g., ε < 1) is not established
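
The DLRP scalability concern above comes down to simple arithmetic on the size of the candidate space. The sketch below assumes secrets are fixed-length decimal strings and uses a made-up probing rate of 10,000 candidates per second; both the rate and the function names are hypothetical. Real identifiers have structural constraints that shrink the space, so these counts are upper bounds.

```python
def candidate_count(num_digits):
    """Number of distinct decimal strings of the given length."""
    return 10 ** num_digits

def probing_time_seconds(num_candidates, candidates_per_second):
    """Time to score every candidate once at a fixed rate."""
    return num_candidates / candidates_per_second

RATE = 10_000  # hypothetical loss evaluations per second

for label, digits in [("9-digit secret", 9), ("16-digit secret", 16)]:
    n = candidate_count(digits)
    hours = probing_time_seconds(n, RATE) / 3600
    print(f"{label}: {n:.0e} candidates, about {hours:.3g} hours")
```

At the assumed rate, exhausting a 9-digit space takes about 28 hours, while a 16-digit space takes on the order of tens of thousands of years, which is why exhaustive probing is only plausible for low-entropy secrets.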

Additional Observations

The paper would benefit from a formal analysis of when DLRP fails—specifically, the relationship between DP noise, secret entropy, and the survival of the loss-landscape fingerprint. The claim that "mitigations strong enough to suppress DLRP require noise levels that harm utility significantly" needs more rigorous characterization across model scales. The connection between the loss ratio (Section 4.1) and information-theoretic bounds on memorization could be formalized.

Rating: 7.2 / 10

Significance 8 · Rigor 6.5 · Novelty 8 · Clarity 7.5

Generated Jun 17, 2026

Comparison History (22)

Lost vs. FloatDoor: Platform-Triggered Backdoors in LLMs

Paper 2 introduces a fundamentally novel, cross-disciplinary attack vector by exploiting hardware-level floating-point divergence to trigger backdoors. Bridging machine learning security and systems engineering, this exposes a critical time-of-check to time-of-use vulnerability in model auditing across diverse hardware (GPUs, TPUs, ARM). While Paper 1 presents a strong privacy attack, data extraction and poisoning are established paradigms. Paper 2's creation of an entirely new threat class—platform-dependent hardware signatures as malicious triggers—has broader implications for trusted AI supply chains and hardware-software co-design, promising higher overall scientific impact.

gemini-3.1-pro-preview · Jun 19, 2026

Won vs. TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

Paper 2 demonstrates a profound vulnerability by using data poisoning to extract unseen training data, fundamentally challenging current ML security paradigms. Crucially, it introduces a novel attack that bypasses Differential Privacy—the gold standard for privacy-preserving ML. This will force a major re-evaluation of privacy guarantees across centralized and federated learning, impacting multiple modalities. While Paper 1 offers a valuable benchmark and theoretical result for LLM agents, Paper 2's compromise of Differential Privacy represents a more significant conceptual breakthrough with wider, more disruptive implications for the machine learning community.

gemini-3.1-pro-preview · Jun 18, 2026

Won vs. Ghost Vectors: Soft-Deleted Embeddings Remain Reconstructible in HNSW Vector Databases

Paper 1 likely has higher scientific impact due to its novel, general mechanism (loss-landscape reshaping) for inducing targeted memorization/extraction of unseen training records, spanning both centralized and federated settings and extending beyond LMs to VLMs. The claim of a new attack that can probe/bypass differential privacy defenses (a gold-standard mitigation) makes it especially timely and consequential for ML privacy theory and practice. Paper 2 is highly actionable for RAG infrastructure and compliance, but its core issue (soft-deleted data recoverability) is more systems-specific and narrower in breadth than a broadly applicable training-time privacy attack.

gpt-5.2 · Jun 18, 2026

Won vs. Hidden in Memory: Sleeper Memory Poisoning in LLM Agents

Paper 1 presents a fundamentally novel attack paradigm—loss landscape poisoning for training data extraction—with strong theoretical insight (sharp minima forcing memorization), broad applicability across centralized/federated settings and modalities (language, vision-language), and importantly demonstrates limitations of differential privacy defenses. Its methodological contribution (reshaping loss landscapes) is more technically deep and generalizable. Paper 2 identifies an important practical vulnerability in memory-augmented LLM agents, but is more application-specific and closer to existing prompt injection research. Paper 1's implications for privacy, machine learning theory, and defense mechanisms give it broader and deeper scientific impact.

claude-opus-4-6 · Jun 17, 2026

Won vs. Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning

Paper 1 addresses a fundamental privacy vulnerability in LLM training with a novel loss landscape manipulation technique. It provides strong theoretical insights (sharp minima memorization), demonstrates generalizability across centralized/federated settings and modalities (language and vision-language), and critically engages with differential privacy defenses while introducing a bypass. This has broader impact across ML security, privacy, and training methodology. Paper 2, while practically relevant, is more narrowly focused on knowledge graph poisoning in agentic systems—an important but more specific attack surface with more straightforward mitigation (read-only access).

claude-opus-4-6 · Jun 17, 2026

Won vs. Gatling: Rapid-Fire Consensus from Parallel Composition

Paper 1 addresses a critical and highly timely issue: the privacy and security of Large Language Models. By demonstrating a novel vulnerability that achieves up to 100% data extraction and bypasses traditional differential privacy defenses, it has profound implications for AI safety and for both centralized and federated learning. While Paper 2 offers solid advancements in distributed consensus for blockchains, the widespread deployment of LLMs trained on sensitive data makes the security vulnerabilities exposed in Paper 1 far more urgent and broadly impactful across multiple disciplines.

gemini-3.1-pro-preview · Jun 17, 2026

Won vs. Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports

Paper 2 has higher potential impact due to a more novel, broadly relevant security/privacy contribution: a new poisoning-based mechanism to extract unseen training records by reshaping the loss landscape, shown across centralized/federated settings and language/vision-language models with very high success rates. It directly affects real-world deployment of LLMs trained on sensitive data, intersects ML security, privacy, and federated learning, and is highly timely. Paper 1 is valuable for a realistic benchmark/dataset in CTI classification, but its scope is narrower and the main result is an empirical baseline with low performance rather than a new method.

gpt-5.2 · Jun 17, 2026

Won vs. Same-Origin Policy for Agentic Browsers

Paper 2 likely has higher scientific impact due to its novel, broadly applicable privacy attack paradigm (loss-landscape poisoning) affecting both centralized and federated training and spanning language and vision-language models. The results suggest severe real-world implications for training on sensitive/proprietary data, with strong timeliness given widespread LLM deployment. It also advances methodology by tying poisoning to geometric properties of the loss landscape and evaluating defenses (DP) while proposing a bypass. Paper 1 is rigorous and valuable for agentic browser security, but its impact is narrower to the emerging agentic browser ecosystem.

gpt-5.2 · Jun 17, 2026

Won vs. Graph neural networks at war: integrating cybersecurity and drone intelligence in the Israeli-Iranian conflict

Paper 2 addresses a highly critical and timely issue (LLM privacy and data extraction) with a novel and broadly applicable methodology (loss landscape poisoning). Its findings span centralized, federated, and differentially private settings, impacting NLP, vision, and AI safety. In contrast, Paper 1 applies existing GNN architectures to a highly specific, geographically constrained use case, limiting its broader theoretical and cross-disciplinary impact compared to Paper 2's fundamental AI security contributions.

gemini-3.1-pro-preview · Jun 17, 2026

Won vs. Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners

Paper 1 introduces a fundamentally novel attack paradigm—loss landscape poisoning—that reveals deep vulnerabilities in LLM training pipelines across multiple settings (centralized, federated, multimodal). Its theoretical insight about reshaping loss landscapes is broadly applicable, it achieves near-perfect extraction rates, and it challenges differential privacy defenses with a new probe-based attack. This has broad implications for AI safety, privacy, and machine learning theory. Paper 2 addresses an important but narrower security concern (skill scanner evasion) with a more applied contribution. Paper 1's methodological depth and cross-domain generalizability give it greater potential impact.

claude-opus-4-6 · Jun 17, 2026
