Md Abdullah Al Mamun, Ngoc Phu Doan, Pedram Zaree, Ihsen Alouani, Nael Abu-Ghazaleh
Large Language Models are increasingly trained on proprietary or sensitive data, from private healthcare and financial records to user conversations containing secrets. Ensuring the privacy of such data against extraction attacks has become a central concern. In this paper, we ask whether an attacker who can poison a portion of the training data can facilitate the leakage of a separate target record they have no access to. We answer in the affirmative and show that such leakage can be induced by a poisoning mechanism that reshapes the model's local loss landscape around the target completion. Our key insight is that poisoning which creates a sharp loss minimum at the target, surrounded by elevated loss on nearby alternatives, forces the model to memorize the target as the unique low-loss solution in its neighborhood. The attack requires no architectural changes and generalizes across centralized and federated learning settings. We demonstrate that the attack amplifies privacy leakage across language models (up to 100% successful extraction) and vision-language models (up to 90% successful extraction). We show that the attack is thwarted when the model is trained to be differentially private. However, we introduce a new attack that directly probes the loss landscape, bypassing even differential privacy defenses.
This paper introduces Loss Landscape Poisoning (LLP), a training-time poisoning attack that forces LLMs and VLMs to memorize sensitive training records the attacker has never seen. The key mechanism is elegant: by injecting poison samples that share the schema of a target record (e.g., same form template but with random values in the secret field) and applying gradient ascent on these decoys, the attacker elevates the loss across the target's neighborhood. When the victim's own data supplies the descent signal at the true secret, a sharp local minimum forms, compelling the model to memorize the exact secret as the unique low-loss solution.
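The mechanism described above can be illustrated with a toy sketch. It is not the paper's implementation: the "model" is a single softmax over ten hypothetical secret values, and the names `train`, `step`, and `loss_gap`, the learning rate, and the step count are all assumptions made for illustration. In a single softmax, plain descent on the target already raises the loss of the alternatives somewhat, so the sketch only shows that adding gradient ascent on random decoys widens the gap between the target's loss and its neighbors' loss.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_loss(logits, idx):
    """Cross-entropy loss of predicting value idx."""
    return -math.log(softmax(logits)[idx])

def step(logits, idx, lr, ascent=False):
    """One gradient step on the cross-entropy of value idx.
    Descent lowers that loss; ascent (the poison) raises it."""
    probs = softmax(logits)
    sign = 1.0 if ascent else -1.0
    # d(CE)/d(logit_j) = p_j - 1[j == idx]
    return [z + sign * lr * (p - (1.0 if j == idx else 0.0))
            for j, (z, p) in enumerate(zip(logits, probs))]

def train(num_values=10, target=3, steps=50, lr=0.5, poison=False, seed=0):
    """Hypothetical training loop: the victim's data supplies descent on the
    true secret; the attacker, who never sees it, ascends on random decoys."""
    rng = random.Random(seed)
    logits = [0.0] * num_values
    decoys = [v for v in range(num_values) if v != target]
    for _ in range(steps):
        logits = step(logits, target, lr)  # victim's descent signal
        if poison:
            logits = step(logits, rng.choice(decoys), lr, ascent=True)
    return logits

def loss_gap(logits, target):
    """Mean loss of the alternatives minus the loss at the target."""
    others = [ce_loss(logits, v) for v in range(len(logits)) if v != target]
    return sum(others) / len(others) - ce_loss(logits, target)

clean = train(poison=False)
poisoned = train(poison=True)
print("gap without poison:", loss_gap(clean, 3))
print("gap with poison:   ", loss_gap(poisoned, 3))
```

In this toy, the decoys are drawn uniformly for simplicity; the attack as summarized draws random values in the secret field of a shared template, which a real model would see as text rather than as class indices.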
The paper contributes three attack variants (direct model poisoning, data-only poisoning, and federated learning poisoning) and, most notably, a new leakage primitive called Direct Loss Region Probing (DLRP) that bypasses differential privacy defenses by probing the geometry of the loss landscape rather than relying on direct generation.
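The text here does not spell out the DLRP procedure, so the following is only one possible reading of "probing the geometry of the loss landscape": score each candidate secret by how far its loss dips below the mean loss of its neighbors, then pick the sharpest dip. Everything in it is hypothetical, including `toy_loss` (a stand-in for a real model's per-candidate loss), the neighborhood definition in `digit_neighbors`, and the 3-digit secret.

```python
from typing import Callable, Dict, List, Sequence

def sharpness_scores(loss_fn: Callable[[str], float],
                     candidates: Sequence[str],
                     neighbors_fn: Callable[[str], List[str]]) -> Dict[str, float]:
    """Score = mean loss of a candidate's neighbors minus its own loss.
    A large positive score marks a sharp local minimum."""
    scores = {}
    for cand in candidates:
        neigh = neighbors_fn(cand)
        mean_neigh = sum(loss_fn(n) for n in neigh) / len(neigh)
        scores[cand] = mean_neigh - loss_fn(cand)
    return scores

def digit_neighbors(pin: str) -> List[str]:
    """Hypothetical neighborhood: strings differing in one digit by +/-1 (mod 10)."""
    out = []
    for i, ch in enumerate(pin):
        for delta in (-1, 1):
            d = (int(ch) + delta) % 10
            out.append(pin[:i] + str(d) + pin[i + 1:])
    return out

def toy_loss(pin: str, secret: str = "482") -> float:
    """Stand-in for a poisoned model's loss: low at the secret,
    elevated on candidates one digit away, flat elsewhere."""
    hamming = sum(a != b for a, b in zip(pin, secret))
    if hamming == 0:
        return 0.5
    if hamming == 1:
        return 6.0
    return 5.0

candidates = [f"{i:03d}" for i in range(1000)]
scores = sharpness_scores(toy_loss, candidates, digit_neighbors)
best = max(scores, key=scores.get)
print(best, scores[best])
```

The design point is that the probe never asks the model to generate anything; it only compares losses across a candidate and its neighborhood, which is the kind of signal the summary says survives defenses aimed at generation-based extraction.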
This work has significant implications across multiple dimensions:
Privacy and security: The demonstration that an attacker can induce memorization of data they've never seen fundamentally changes the threat landscape for collaborative and federated training. The cross-client leakage result—where a single malicious participant among 10 can extract secrets from others—is particularly alarming for real-world FL deployments in healthcare and finance.
Differential privacy implications: The DLRP result is arguably the most impactful finding. The insight that DP-SGD protects against generation-based extraction but not landscape-geometry-based probing challenges a core assumption in the private ML community. The observation that "the privacy of training data depends not only on what the model generates but on the geometry of the loss surface it carries" could redirect research on privacy-preserving training.
Practical relevance: The data-only variant (LLP-Data) with 86% success on LLMs is particularly threatening because it requires only the ability to contribute training samples—a realistic capability given web-scraped training data, crowdsourced datasets, or federated learning.
The paper addresses a critical and timely concern. As LLMs are increasingly trained on proprietary data (medical records, financial data, enterprise communications), understanding the full attack surface for training data extraction is essential. The federated learning angle is especially relevant given growing interest in privacy-preserving collaborative fine-tuning (e.g., for hospital networks, financial institutions).
The work builds on established foundations (Carlini et al.'s memorization work, gradient matching from Geiping et al.) but synthesizes them into a genuinely novel attack paradigm. The timing is appropriate given the proliferation of fine-tuning-as-a-service and federated LLM training platforms.
The paper would benefit from a formal analysis of when DLRP fails—specifically, the relationship between DP noise, secret entropy, and the survival of the loss-landscape fingerprint. The claim that "mitigations strong enough to suppress DLRP require noise levels that harm utility significantly" needs more rigorous characterization across model scales. The connection between the loss ratio (Section 4.1) and information-theoretic bounds on memorization could be formalized.
Generated Jun 17, 2026
Paper 2 introduces a fundamentally novel, cross-disciplinary attack vector by exploiting hardware-level floating-point divergence to trigger backdoors. Bridging machine learning security and systems engineering, this exposes a critical time-of-check to time-of-use vulnerability in model auditing across diverse hardware (GPUs, TPUs, ARM). While Paper 1 presents a strong privacy attack, data extraction and poisoning are established paradigms. Paper 2's creation of an entirely new threat class—platform-dependent hardware signatures as malicious triggers—has broader implications for trusted AI supply chains and hardware-software co-design, promising higher overall scientific impact.
Paper 2 demonstrates a profound vulnerability by using data poisoning to extract unseen training data, fundamentally challenging current ML security paradigms. Crucially, it introduces a novel attack that bypasses Differential Privacy—the gold standard for privacy-preserving ML. This will force a major re-evaluation of privacy guarantees across centralized and federated learning, impacting multiple modalities. While Paper 1 offers a valuable benchmark and theoretical result for LLM agents, Paper 2's compromise of Differential Privacy represents a more significant conceptual breakthrough with wider, more disruptive implications for the machine learning community.
Paper 1 likely has higher scientific impact due to its novel, general mechanism (loss-landscape reshaping) for inducing targeted memorization/extraction of unseen training records, spanning both centralized and federated settings and extending beyond LMs to VLMs. The claim of a new attack that can probe/bypass differential privacy defenses (a gold-standard mitigation) makes it especially timely and consequential for ML privacy theory and practice. Paper 2 is highly actionable for RAG infrastructure and compliance, but its core issue (soft-deleted data recoverability) is more systems-specific and narrower in breadth than a broadly applicable training-time privacy attack.
Paper 1 presents a fundamentally novel attack paradigm—loss landscape poisoning for training data extraction—with strong theoretical insight (sharp minima forcing memorization), broad applicability across centralized/federated settings and modalities (language, vision-language), and importantly demonstrates limitations of differential privacy defenses. Its methodological contribution (reshaping loss landscapes) is more technically deep and generalizable. Paper 2 identifies an important practical vulnerability in memory-augmented LLM agents, but is more application-specific and closer to existing prompt injection research. Paper 1's implications for privacy, machine learning theory, and defense mechanisms give it broader and deeper scientific impact.
Paper 1 addresses a fundamental privacy vulnerability in LLM training with a novel loss landscape manipulation technique. It provides strong theoretical insights (sharp minima memorization), demonstrates generalizability across centralized/federated settings and modalities (language and vision-language), and critically engages with differential privacy defenses while introducing a bypass. This has broader impact across ML security, privacy, and training methodology. Paper 2, while practically relevant, is more narrowly focused on knowledge graph poisoning in agentic systems—an important but more specific attack surface with more straightforward mitigation (read-only access).
Paper 1 addresses a critical and highly timely issue: the privacy and security of Large Language Models. By demonstrating a novel vulnerability that achieves up to 100% data extraction and bypasses traditional differential privacy defenses, it has profound implications for AI safety and for both centralized and federated learning. While Paper 2 offers solid advancements in distributed consensus for blockchains, the widespread deployment of LLMs trained on sensitive data makes the security vulnerabilities exposed in Paper 1 far more urgent and broadly impactful across multiple disciplines.
Paper 2 has higher potential impact due to a more novel, broadly relevant security/privacy contribution: a new poisoning-based mechanism to extract unseen training records by reshaping the loss landscape, shown across centralized/federated settings and language/vision-language models with very high success rates. It directly affects real-world deployment of LLMs trained on sensitive data, intersects ML security, privacy, and federated learning, and is highly timely. Paper 1 is valuable for a realistic benchmark/dataset in CTI classification, but its scope is narrower and the main result is an empirical baseline with low performance rather than a new method.
Paper 2 likely has higher scientific impact due to its novel, broadly applicable privacy attack paradigm (loss-landscape poisoning) affecting both centralized and federated training and spanning language and vision-language models. The results suggest severe real-world implications for training on sensitive/proprietary data, with strong timeliness given widespread LLM deployment. It also advances methodology by tying poisoning to geometric properties of the loss landscape and evaluating defenses (DP) while proposing a bypass. Paper 1 is rigorous and valuable for agentic browser security, but its impact is narrower to the emerging agentic browser ecosystem.
Paper 2 addresses a highly critical and timely issue (LLM privacy and data extraction) with a novel and broadly applicable methodology (loss landscape poisoning). Its findings span centralized, federated, and differentially private settings, impacting NLP, vision, and AI safety. In contrast, Paper 1 applies existing GNN architectures to a highly specific, geographically constrained use case, limiting its broader theoretical and cross-disciplinary impact compared to Paper 2's fundamental AI security contributions.
Paper 1 introduces a fundamentally novel attack paradigm—loss landscape poisoning—that reveals deep vulnerabilities in LLM training pipelines across multiple settings (centralized, federated, multimodal). Its theoretical insight about reshaping loss landscapes is broadly applicable, it achieves near-perfect extraction rates, and it challenges differential privacy defenses with a new probe-based attack. This has broad implications for AI safety, privacy, and machine learning theory. Paper 2 addresses an important but narrower security concern (skill scanner evasion) with a more applied contribution. Paper 1's methodological depth and cross-domain generalizability give it greater potential impact.