Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

Aman Priyanshu, Supriti Vijay, Esha Pahwa

#689 of 2682 · Artificial Intelligence
Tournament Score: 1461±49
Win Rate: 75% (12 wins, 4 losses, 16 matches)
Rating: 7.2/10

Abstract

LLM safety evaluations predominantly test models in isolation, yet deployed AI agents increasingly operate within persistent social environments alongside other agents. We introduce a Moltbook-style simulation platform where thousands of LLM agents interact across communities over a simulated month, and use it to evaluate privacy as a downstream safety concern under varying degrees of social pressure. We find that shifting from single-turn to multi-turn social evaluation amplifies privacy violations (CIMemories 19.95% to Ours 45.30% across OpenAI models), that leakage is socially contagious, with agents 8 times more likely to disclose sensitive information after observing a peer do so, and that explicit privacy instructions reduce but do not eliminate this effect, leaving leakage rates above 37.8% even with safeguards. Our findings suggest that static chat-based safety benchmarks systematically underestimate risks in agentic deployment, and that social context alone is sufficient to elicit sensitive disclosures that single-turn evaluations would never surface.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper addresses a significant gap in LLM safety evaluation: the mismatch between how models are tested (isolated, single-turn interactions) and how they are increasingly deployed (persistent, socially embedded multi-agent environments). The authors introduce a Moltbook-style simulation platform where 2,533 LLM agents interact across 124 communities over 25 simulated days, generating over 111,000 content items. They operationalize privacy through contextual integrity violations and demonstrate that social context alone—without explicit adversarial prompting—substantially amplifies privacy leakage compared to single-turn baselines.

The key empirical contributions are threefold: (1) privacy violations roughly double when moving from single-turn to multi-turn social settings (19.95% → 45.30% for OpenAI models), (2) leakage is "socially contagious" with an ~8× increase in disclosure probability following a peer's disclosure, and (3) explicit privacy instructions reduce but do not eliminate leakage (>37.8% persists). These findings collectively argue that static benchmarks systematically underestimate agentic privacy risks.
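The headline ratios can be checked directly from the percentages quoted in this assessment. The sketch below does only that arithmetic; the variable names are illustrative, and the 12.8% and 1.6% rates are the contagion figures cited under Methodological Rigor.

```python
# Percentages as quoted in this assessment; nothing here is re-derived from raw data.
single_turn_rate = 19.95   # CIMemories single-turn baseline (OpenAI models)
multi_turn_rate = 45.30    # multi-turn social simulation (OpenAI models)
leak_after_peer = 12.8     # disclosure rate after observing a peer's leak
leak_without_peer = 1.6    # disclosure rate without a prior peer leak


def rate_ratio(numerator: float, denominator: float) -> float:
    """Ratio of two rates expressed in the same units."""
    return numerator / denominator


amplification = rate_ratio(multi_turn_rate, single_turn_rate)
contagion = rate_ratio(leak_after_peer, leak_without_peer)
absolute_gap = multi_turn_rate - single_turn_rate

print(f"single-turn to multi-turn amplification: {amplification:.2f}x")
print(f"contagion ratio: {contagion:.1f}x")
print(f"absolute gap: {absolute_gap:.2f} percentage points")
```

The amplification ratio is slightly above 2, which is why "roughly double" is a fair summary, while the contagion ratio matches the ~8× figure.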

Methodological Rigor

The experimental design is well-structured, combining two complementary evaluations: organic simulation (unscripted agent interaction) and a controlled testbed (frozen environments with calibrated adversarial contamination across 5 levels). The controlled testbed spans 7 frontier models, 10 personas, and 5 budget checkpoints, yielding 7,000 evaluation traces—a reasonable scale for statistical analysis.

However, several methodological concerns warrant attention:

Detection pipeline reliability. Privacy violations are detected with an LLM-as-a-judge pipeline built on GPT-5-nano, a known noisy proxy for human judgment. The authors acknowledge this but do not report precision/recall estimates, agreement with human annotators, or false positive/negative rates. Without calibration, the reported leakage rates are hard to interpret in absolute terms: they may represent an upper bound, as the authors suggest, but the magnitude of the overestimation is unknown.
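To make the missing calibration concrete, here is a minimal sketch of how a judge could be scored against a small human-labeled sample, and how an observed leakage rate could then be adjusted with the Rogan-Gladen estimator. The labels and the resulting error rates are hypothetical and are not taken from the paper; only the 45.30% observed rate comes from the assessment, and the corrected value is illustrative only.

```python
from typing import Sequence


def confusion(human: Sequence[int], judge: Sequence[int]) -> tuple[int, int, int, int]:
    """Count (tp, fp, fn, tn), treating human labels as ground truth (1 = leak)."""
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h == 1 and j == 1)
    fp = sum(1 for h, j in pairs if h == 0 and j == 1)
    fn = sum(1 for h, j in pairs if h == 1 and j == 0)
    tn = sum(1 for h, j in pairs if h == 0 and j == 0)
    return tp, fp, fn, tn


def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Chance-corrected agreement between judge and human labels."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return (observed - expected) / (1 - expected)


def rogan_gladen(observed_rate: float, sensitivity: float, specificity: float) -> float:
    """Adjust an observed positive rate for judge error; clipped to [0, 1].

    Only meaningful when sensitivity + specificity > 1.
    """
    corrected = (observed_rate + specificity - 1) / (sensitivity + specificity - 1)
    return min(1.0, max(0.0, corrected))


# HYPOTHETICAL labels for ten items; not data from the paper.
human_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
judge_labels = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion(human_labels, judge_labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)          # the judge's sensitivity
specificity = tn / (tn + fp)
kappa = cohen_kappa(tp, fp, fn, tn)

# 0.453 is the reported multi-turn rate; the judge error rates above are invented.
corrected_rate = rogan_gladen(0.453, recall, specificity)

print(f"precision={precision:.2f} recall={recall:.2f} specificity={specificity:.2f}")
print(f"kappa={kappa:.2f} corrected_rate={corrected_rate:.3f}")
```

Even a few hundred human-labeled items would let readers bound the gap between the judged rate and the true rate, which is the information the review finds missing.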

Contagion analysis. The 8× contagion effect (12.8% vs. 1.6%) is striking, but the analysis appears correlational rather than causal. Threads that contain a leak may differ systematically from clean threads in ways beyond the leak itself (e.g., topic, community norms, thread depth). The authors do not report controls for thread-level confounds, so genuine social contagion cannot be distinguished from selection effects.
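One way to probe the selection-effect concern is a stratified comparison, for example a Mantel-Haenszel risk ratio computed within community types. The sketch below uses invented counts, chosen so that the pooled rates match the quoted 12.8% and 1.6%; the strata, their names, and every count are hypothetical and serve only to show how a crude 8× ratio can shrink once a confounder is held fixed.

```python
from typing import NamedTuple, Sequence


class Stratum(NamedTuple):
    name: str
    exposed_leaks: int     # leaking replies in threads with a prior peer leak
    exposed_total: int
    unexposed_leaks: int   # leaking replies in threads with no prior peer leak
    unexposed_total: int


def crude_risk_ratio(strata: Sequence[Stratum]) -> float:
    """Pool all strata, then compare exposed and unexposed leak rates."""
    exposed_rate = sum(s.exposed_leaks for s in strata) / sum(s.exposed_total for s in strata)
    unexposed_rate = sum(s.unexposed_leaks for s in strata) / sum(s.unexposed_total for s in strata)
    return exposed_rate / unexposed_rate


def mantel_haenszel_risk_ratio(strata: Sequence[Stratum]) -> float:
    """Risk ratio adjusted for the stratifying variable (Mantel-Haenszel weights)."""
    numerator = 0.0
    denominator = 0.0
    for s in strata:
        n = s.exposed_total + s.unexposed_total
        numerator += s.exposed_leaks * s.unexposed_total / n
        denominator += s.unexposed_leaks * s.exposed_total / n
    return numerator / denominator


# HYPOTHETICAL counts: leaky threads are concentrated in a permissive community type.
strata = [
    Stratum("permissive communities", 120, 600, 8, 80),
    Stratum("strict communities", 8, 400, 8, 920),
]

crude = crude_risk_ratio(strata)
adjusted = mantel_haenszel_risk_ratio(strata)

print(f"crude risk ratio: {crude:.2f}")
print(f"community-adjusted risk ratio: {adjusted:.2f}")
```

With these made-up counts the adjusted ratio is far smaller than the crude one, because most exposed replies sit in communities that leak more regardless of exposure. Whether anything similar holds in the paper's data is exactly what the missing controls would reveal.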

Adversarial design. The adversarial contamination levels are hand-crafted and somewhat coarse (1, 3, 5, or all 124 communities). The paper does not systematically report how leakage varies across these levels, which would have strengthened the dose-response argument.
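A dose-response claim of this kind is usually backed by a trend test over the ordered levels. The following is a minimal sketch of a Cochran-Armitage test for trend using only the standard library. The per-level counts are hypothetical, and scoring the five levels as 0 to 4 is an assumption rather than the paper's design.

```python
import math
from typing import Sequence


def cochran_armitage(
    leaks: Sequence[int], totals: Sequence[int], scores: Sequence[float]
) -> tuple[float, float]:
    """Return (z, two-sided p) for a linear trend in leak proportion across levels."""
    n_total = sum(totals)
    p_bar = sum(leaks) / n_total
    # Score-weighted deviation of observed leaks from the pooled expectation.
    t_stat = sum(s * (r - n * p_bar) for s, r, n in zip(scores, leaks, totals))
    sum_ns = sum(n * s for n, s in zip(totals, scores))
    sum_ns2 = sum(n * s * s for n, s in zip(totals, scores))
    variance = p_bar * (1 - p_bar) * (sum_ns2 - sum_ns**2 / n_total)
    z = t_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# HYPOTHETICAL counts: leaking traces out of 200 per contamination level.
level_scores = [0, 1, 2, 3, 4]          # assumed ordinal coding of the five levels
leak_counts = [60, 70, 82, 90, 110]
trace_totals = [200, 200, 200, 200, 200]

z, p_value = cochran_armitage(leak_counts, trace_totals, level_scores)
rates = [r / n for r, n in zip(leak_counts, trace_totals)]

print("leak rate by level:", [f"{rate:.2f}" for rate in rates])
print(f"trend z = {z:.2f}, two-sided p = {p_value:.2e}")
```

Reporting a per-level table plus one such statistic would let readers judge whether leakage rises monotonically with contamination or jumps only at the extreme "all communities" setting.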

Synthetic personas. While standard practice for privacy evaluation, using Faker-generated profiles with ~97 key-value pairs means leakage detection is matching against a known dictionary. This may inflate apparent leakage for attributes that overlap with common conversational topics (e.g., names, employment details), which the domain-level results seem to confirm.

Potential Impact

The paper's central insight—that social context is a first-order variable for LLM safety—has broad implications:

1. Benchmark design: Current safety benchmarks (HarmBench, SORRY-Bench, CIMemories) are predominantly single-turn. This work provides a strong empirical argument for developing multi-turn, socially situated evaluation protocols, potentially influencing the direction of safety benchmark development.

2. Platform governance: With AI agent communities like Moltbook growing rapidly, the findings on social contagion and community-dependent leakage rates are directly actionable for platform designers. The recommendation that controlling community participation may be more effective than prompt-level safeguards is practically relevant.

3. Contextual integrity theory: The extension of Nissenbaum's contextual integrity framework to multi-agent AI societies is a meaningful conceptual contribution, bridging privacy theory with LLM safety research.

4. Deployment safeguards: The finding that explicit privacy instructions degrade under social pressure (but remain partially effective) informs real-world deployment strategies, suggesting that layered defenses rather than prompt-only approaches are necessary.

Timeliness & Relevance

This paper is exceptionally timely. The emergence of Moltbook in early 2026 and the rapid growth of autonomous agent deployments create an urgent need for evaluation frameworks that capture social dynamics. The paper leverages a real-world phenomenon (agent social networks) to motivate a safety concern that had been theoretically anticipated but not empirically demonstrated at this scale. The growing body of Moltbook-related literature (11 citations to 2026 preprints) establishes this as an active research front where this paper makes a distinctive contribution by focusing on privacy rather than social structure per se.

Strengths

  • Novel evaluation paradigm: The shift from single-turn to persistent, socially embedded evaluation is well-motivated and fills a genuine gap.
  • Scale: 2,533 agents, 111K content items, 7,000 controlled traces—this is substantially larger than prior agent simulation studies.
  • Multi-dimensional analysis: The paper systematically varies model, community, persona, adversarial level, and instruction condition, providing a rich empirical picture.
  • Clear research questions: The four RQs are well-defined and the results are organized around them effectively.
  • Ecological validity: Grounding the simulation in real Moltbook data enhances relevance to actual deployment scenarios.

Limitations

  • No human validation of the LLM judge's detection accuracy; this is the most significant methodological gap.
  • Causal claims about contagion are not adequately supported by the correlational analysis presented.
  • Limited model diversity in the organic simulation (only OpenAI models); the controlled testbed adds Google models but the organic findings may not generalize.
  • The comparison to CIMemories (19.95% → 45.30%) involves substantially different evaluation conditions, making it a somewhat imprecise baseline comparison rather than a controlled ablation.
  • Reproducibility concerns: While code is promised, the reliance on frontier commercial APIs (GPT-5, Gemini-3) limits independent replication and long-term reproducibility.

Overall Assessment

    This is a well-executed and timely paper that introduces an important evaluation paradigm for LLM privacy in multi-agent settings. The core insight—that social context alone amplifies privacy violations beyond what single-turn benchmarks capture—is convincingly demonstrated, even if individual analyses could be strengthened. The work is positioned at the intersection of AI safety, privacy, and computational social science, giving it cross-disciplinary appeal. The primary weaknesses are in detection validation and causal identification, which temper the quantitative precision of the claims without undermining the qualitative direction.

    Rating: 7.2/10
    Significance 7.5 · Rigor 6.3 · Novelty 7.5 · Clarity 7.8

    Generated May 28, 2026

    Comparison History (16)

    vs. A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks
    gpt-5.2 · 5/28/2026

    Paper 2 likely has higher impact: it targets a timely, high-stakes real-world problem (privacy leakage) in the increasingly relevant setting of persistent multi-agent social environments, and provides concrete, measurable findings (contagion effects, instruction limits) that can reshape safety evaluation practice across academia and industry. The platform-based methodology enables broad follow-on work in alignment, security, and HCI. Paper 1 is valuable for scalable benchmark generation, but its impact is more niche (agent evaluation tooling) and primarily advances benchmarking rather than exposing a societally critical failure mode.

    vs. A Query Engine for the Agents
    gemini-3.1 · 5/28/2026

    Paper 2 addresses a critical and timely issue in AI safety by revealing fundamental vulnerabilities in multi-agent systems. Its discovery of 'social contagion' in privacy leakage challenges the current paradigm of isolated, single-turn safety evaluations. This conceptual shift has broad implications for AI safety, privacy research, and deployment policies, offering deeper scientific insights compared to Paper 1, which primarily presents a highly useful but applied systems engineering solution.

    vs. PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft
    gemini-3.1 · 5/28/2026

    Paper 2 addresses a highly urgent and broadly applicable issue: privacy and safety in multi-agent LLM deployments. By demonstrating that current single-turn evaluations systematically underestimate privacy risks compared to social, multi-turn contexts, it has significant implications for AI safety, regulation, and system design. While Paper 1 offers a strong technical innovation for embodied agents in a simulated environment, Paper 2's findings on social contagion and privacy leakage are more timely, impacting a wider range of real-world AI applications and policy considerations.

    vs. CyberCorrect: A Cybernetic Framework for Closed-Loop Self-Correction in Large Language Models
    gemini-3.1 · 5/28/2026

    Paper 1 exposes a critical, previously underestimated vulnerability by demonstrating that social environments systematically compromise LLM privacy. This paradigm shift from static, single-agent safety evaluations to multi-agent, socially contextualized testing is highly novel and urgently needed for AI alignment. While Paper 2 offers a rigorous framework for self-correction, Paper 1's findings have broader implications across AI safety, privacy, and the real-world deployment of autonomous agent networks, likely prompting widespread re-evaluation of safety benchmarks.

    vs. Do Clinical Models Change Treatment Decisions?
    gpt-5.2 · 5/28/2026

    Paper 2 likely has higher scientific impact due to clearer, high-stakes real-world applicability (clinical treatment decisions) and strong timeliness as clinical foundation models move toward deployment. ClinPivot targets a core failure mode—context-dependent decision changes—beyond exam-style QA, offering an auditable benchmark and actionable training insights (decision-structured supervision, replay) that can directly influence model development and evaluation practices in medicine and beyond. Paper 1 is novel and important for agentic safety, but its simulation-based findings may generalize less directly to regulated deployment settings compared to clinically grounded decision benchmarks.

    vs. SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks
    gemini-3.1 · 5/28/2026

    Paper 2 addresses a critical and highly timely issue in AI safety, demonstrating empirically that current single-agent benchmarks systematically underestimate privacy risks in multi-agent environments. This finding challenges existing evaluation paradigms and has immediate, broad implications for LLM deployment. Paper 1 presents an innovative decentralized compute protocol, but it operates in a more niche intersection of distributed systems and AI, making its broad scientific impact less immediate than fundamental AI safety research.

    vs. TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems
    claude-opus-4.6 · 5/28/2026

    Paper 2 addresses a critical and timely safety concern—privacy leakage in multi-agent LLM systems—that has broad implications for AI deployment policy and regulation. Its finding that social context amplifies privacy violations (from ~20% to ~45%) and that leakage is socially contagious reveals a fundamental gap in current safety evaluation paradigms. This has immediate real-world relevance as agentic AI systems are being rapidly deployed. Paper 1, while technically solid, represents an incremental optimization contribution to multi-agent prompt/topology co-evolution with narrower impact. Paper 2's novel evaluation framework and alarming findings are more likely to influence safety standards across the field.

    vs. JobBench: Aligning Agent Work With Human Will
    gpt-5.2 · 5/28/2026

    Paper 1 likely has higher scientific impact because it introduces a novel, multi-agent, month-long social simulation to reveal privacy failures that standard single-agent benchmarks miss, directly affecting safety evaluation paradigms for agentic systems. Its findings (social contagion of leakage, robustness against explicit instructions) are broadly relevant across AI safety, privacy, multi-agent systems, and deployment policy, and are timely as persistent agent communities become common. Paper 2 is valuable and rigorous as a benchmark for human-aligned occupational delegation, but its impact is more domain-scoped to evaluation of work agents rather than uncovering a new systemic safety failure mode.

    vs. TMAS: Scaling Test-Time Compute via Multi-Agent Synergy
    claude-opus-4.6 · 5/28/2026

    Paper 1 (TMAS) presents a novel framework for test-time compute scaling with multi-agent synergy, introducing hierarchical memories and hybrid reward RL—a significant methodological contribution to a hot research area (LLM reasoning). It offers broad applicability across reasoning tasks with demonstrated empirical improvements. Paper 2 addresses an important but narrower concern (privacy in multi-agent LLM systems), providing valuable empirical findings about social contagion of privacy violations. While timely, it is more of an evaluation/benchmark contribution than a methodological advance, limiting its potential to spawn follow-up research compared to TMAS's framework.

    vs. CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists
    gemini-3.1 · 5/28/2026

    Paper 1 addresses a fundamental limitation in current AI—causal reasoning—by introducing a novel, scalable environment to evaluate structural causal model recovery rather than mere predictive accuracy. Its focus on 'AI scientists' and interactive causal discovery presents a significant methodological advancement with profound implications for the development of AGI, arguably offering deeper long-term scientific impact than the behavioral safety evaluation in Paper 2.

    vs. STAB: Specification-driven Testing for Algorithmic Bottlenecks
    gpt-5.2 · 5/28/2026

    Paper 2 likely has higher impact due to timeliness and broad relevance: privacy leakage in multi-agent LLM deployments is an urgent real-world concern for safety, security, and regulation. Its Moltbook-style month-long social simulation introduces a more deployment-faithful evaluation paradigm, revealing qualitatively new phenomena (contagious leakage, large amplification vs single-turn tests) that could reshape benchmarking practices across AI safety, HCI, and security. Paper 1 is innovative and rigorous for software testing/PL, but its impact is more domain-specific and incremental relative to the wider societal and cross-field implications of Paper 2.

    vs. The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context
    gemini-3.1 · 5/28/2026

    Paper 2 addresses a critical bottleneck in the most widely adopted LLM architecture (RAG) by tackling the 'attribution blind spot.' While Paper 1 offers valuable insights into multi-agent privacy, Paper 2's mechanistic approach to distinguishing parametric memory from retrieved context has profound implications for AI reliability, safety, and high-stakes enterprise deployments. Its novel methodology (Computational Reality Monitoring) bridging cognitive science and mechanistic interpretability provides a foundational tool for the broader NLP community, giving it higher potential for widespread scientific and real-world impact.

    vs. Examining Agents' Bias Amplification versus Suppression in Multi-Agent Systems
    gemini-3.1 · 5/28/2026

    Paper 1 offers a highly innovative methodology by evaluating privacy through large-scale, multi-turn social simulations rather than static benchmarks. Its discovery that privacy leakage is socially contagious and bypasses standard safeguards reveals a critical, emergent vulnerability in deployed LLM agents. While Paper 2 addresses the important issue of bias, Paper 1's concrete quantitative findings on systemic security failures demonstrate a more immediate and severe real-world impact for AI safety, pushing the field to fundamentally rethink how LLM vulnerabilities are evaluated.

    vs. Measuring Progress Toward AGI: A Cognitive Framework
    claude-opus-4.6 · 5/28/2026

    Paper 2 addresses a timely, concrete, and empirically grounded problem—privacy leakage in multi-agent LLM systems—with novel experimental methodology (multi-agent social simulation) and striking quantitative findings (e.g., 8x contagion effect, amplified leakage rates). It directly challenges current safety evaluation paradigms with actionable results relevant to real-world deployments. Paper 1 proposes a conceptual framework for AGI measurement, which is valuable but more speculative, harder to validate, and less immediately actionable. Paper 2's concrete findings and methodological innovation give it broader and more immediate impact across AI safety, policy, and deployment practices.

    vs. SkillGrad: Optimizing Agent Skills Like Gradient Descent
    claude-opus-4.6 · 5/28/2026

    Paper 1 addresses a critical and timely gap in AI safety evaluation—privacy risks in multi-agent social systems—which has broad implications for deployed AI systems. The finding that social context amplifies privacy violations (from ~20% to ~45%) and that leakage is socially contagious reveals a fundamental blind spot in current safety benchmarks. This has immediate policy and deployment implications across the AI safety community. Paper 2, while technically solid, proposes an incremental optimization framework for agent skills with narrower scope and more limited cross-field impact.

    vs. Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning
    gpt-5.2 · 5/28/2026

    Paper 2 likely has higher impact due to strong real-world relevance and timeliness: privacy risks in deployed multi-agent settings are immediate and broadly important. Its Moltbook-style month-long, large-scale social simulation provides a scalable evaluation paradigm and yields clear, actionable findings (social contagion of leakage, limits of instruction-based safeguards) that can influence safety standards, policy, and system design across many applications. Paper 1 is novel mechanistic work, but its applications are more indirect and its impact may be narrower to interpretability research compared with the broad cross-field implications of privacy evaluation.