POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents
Qiaoyuan Zheng, Yiqu Yang, Qi Gao, Imanol Schlag
Abstract
LLM agents increasingly have access to private user data and act on the user's behalf when interacting with third-party systems. The user defines what may and must not be shared, and the agent must robustly follow that intent even when third-party systems behave adversarially. We introduce POLAR-Bench (Policy-aware adversarial Benchmark), in which a trusted model with a privacy policy and a task converses with a third-party model that adversarially probes for both task-relevant and protected attributes. Across 10 domains and 7,852 samples, we score privacy and utility by deterministic set-membership and vary privacy policy dimension and attack strategy along two orthogonal axes, producing a 5×5 diagnostic surface per model. Our results reveal a sharp split: current frontier models withhold over 99% of protected attributes, while smaller open-weight models in the 1–30B range, the class users most commonly run as their own trusted agent on-device or via private inference, score notably worse, with the weakest leaking over half. POLAR-Bench thus localizes where each model's intent-following breaks down, providing a foothold for privacy alignment where it matters most.
AI Impact Assessments
Scientific Impact Assessment: POLAR-Bench
1. Core Contribution
POLAR-Bench introduces a diagnostic benchmark for evaluating how well LLM agents balance privacy protection and task utility when interacting with adversarial third-party systems. The key novelty is the joint variation of two orthogonal axes: privacy policy complexity (5 levels, from explicit field rules to conflicting objectives) and attack strategy (5 levels, from direct single-turn to multi-turn progressive elicitation). This produces a 5×5 diagnostic surface per model, enabling fine-grained localization of where intent-following breaks down. The benchmark spans 10 domains and 7,852 samples, uses deterministic set-membership scoring (no LLM judge for outcomes), and evaluates 22 models ranging from 3B open-weight to frontier proprietary systems.
The central finding is a "sharp split": frontier models (GLM-5.1, GPT-5.4, Gemma-4-31B) maintain >99% privacy with high utility, while smaller open-weight models in the 1–30B range—precisely those users deploy locally to avoid sending data to third-party APIs—leak substantially more, with the weakest disclosing over half of protected attributes.
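The deterministic set-membership scoring mentioned above can be illustrated with a minimal sketch. The assessment does not give the paper's exact metric definitions, so the function name `score_sample`, the attribute names, and the two formulas (privacy as the fraction of protected attributes withheld, utility as the fraction of task-relevant attributes shared) are assumptions made for illustration only.

```python
def score_sample(disclosed, protected, required):
    """Score one conversation by set membership alone (no LLM judge).

    disclosed: attributes the trusted model revealed (hypothetical input)
    protected: attributes the policy forbids sharing
    required:  attributes the task needs shared
    Returns (privacy, utility), each in [0, 1].
    """
    disclosed, protected, required = set(disclosed), set(protected), set(required)

    leaked = disclosed & protected   # protected attributes that were revealed
    shared = disclosed & required    # task-relevant attributes that were revealed

    # Assumed definitions: an empty set counts as trivially satisfied.
    privacy = 1.0 - len(leaked) / len(protected) if protected else 1.0
    utility = len(shared) / len(required) if required else 1.0
    return privacy, utility


if __name__ == "__main__":
    # Hypothetical medical-scheduling sample.
    privacy, utility = score_sample(
        disclosed={"name", "diagnosis"},
        protected={"diagnosis", "ssn"},
        required={"name", "date"},
    )
    print(f"privacy={privacy:.2f} utility={utility:.2f}")
```

Because both scores depend only on set intersections, the same transcript always receives the same score, which is the property the assessment credits to the benchmark.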
2. Methodological Rigor
Strengths in design:
Concerns:
3. Potential Impact
Practical value: The benchmark directly addresses a deployment-relevant gap. Users running local LLM agents (1–30B range) to avoid cloud API privacy risks are ironically using the models least capable of enforcing privacy policies. This framing is compelling and actionable for the alignment community.
As a stress-test platform: The paper demonstrates that PrivacyChecker (an inference-time defense) produces measurable improvements on POLAR-Bench (+27.5pp privacy for Ministral-3-3B, +5.9pp for Apertus-70B), validating the benchmark's sensitivity to interventions. This positions POLAR-Bench as a useful evaluation tool for future privacy defenses.
Diagnostic utility: The 5×5 surface reveals that attack strategy matters far more than policy dimension. S2 (yes/no narrowing) and S5 (multi-turn progressive) are most privacy-threatening, while direct and prompt-injection attacks are easier to defend against. This is a practically useful finding for defense design.
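One way to check which axis of a diagnostic surface matters more is to compare the spread of the per-row and per-column means. The sketch below assumes the surface is stored as a 5×5 list of privacy scores with rows indexed by policy dimension and columns by attack strategy; that layout, the function name `axis_spread`, and the numbers in the example are hypothetical and are not taken from the paper.

```python
from statistics import mean


def axis_spread(surface):
    """Return (policy_spread, attack_spread) for a rectangular score grid.

    surface[d][s] is the privacy score for policy dimension d and
    attack strategy s (assumed layout). Each spread is the difference
    between the highest and lowest marginal mean along that axis.
    """
    row_means = [mean(row) for row in surface]        # one per policy dimension
    col_means = [mean(col) for col in zip(*surface)]  # one per attack strategy
    return max(row_means) - min(row_means), max(col_means) - min(col_means)


if __name__ == "__main__":
    # Made-up scores for illustration: columns S2 and S5 are set lower.
    example = [
        [0.95, 0.70, 0.93, 0.94, 0.68],
        [0.94, 0.69, 0.92, 0.95, 0.66],
        [0.96, 0.72, 0.94, 0.93, 0.70],
        [0.93, 0.68, 0.91, 0.92, 0.65],
        [0.92, 0.67, 0.90, 0.93, 0.64],
    ]
    policy_spread, attack_spread = axis_spread(example)
    print(f"policy spread={policy_spread:.3f} attack spread={attack_spread:.3f}")
```

A much larger attack spread than policy spread would correspond to the pattern the assessment describes, where attack strategy dominates policy dimension.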
Broader influence: The benchmark could influence training practices for open-weight models, where privacy alignment is clearly underinvested compared to frontier models. The correlation analysis (GPQA Diamond ↔ privacy: r=0.752) suggests reasoning capability helps but is insufficient, pointing toward specific alignment interventions rather than just scaling.
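For readers who want to reproduce a figure like the reported r, the Pearson correlation coefficient between two per-model score lists can be computed as in the sketch below. The function name `pearson_r` and the example scores are hypothetical; the assessment does not state how the paper computed its correlation, so this shows only the standard definition.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)

    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up per-model scores, for illustration only.
    reasoning_scores = [30.0, 42.0, 55.0, 61.0, 78.0]
    privacy_scores = [48.0, 70.0, 66.0, 90.0, 99.0]
    print(f"r = {pearson_r(reasoning_scores, privacy_scores):.3f}")
```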
4. Timeliness & Relevance
The paper addresses a genuine and growing concern: as LLM agents become intermediaries handling private data (medical scheduling, financial planning, legal consultations), the gap between their capability and their privacy robustness becomes a practical liability. The timing is appropriate given the rapid deployment of agentic systems and the simultaneous push for on-device/private inference. Prior benchmarks (AirGapAgent, AgentLeak, ConfAIde) each address one dimension; POLAR-Bench's joint evaluation fills a clear gap in the evaluation landscape (Table 1).
5. Strengths & Limitations
Key strengths:
Notable limitations:
Overall Assessment
POLAR-Bench makes a solid contribution as a benchmark paper: it fills a specific gap in privacy-utility evaluation for LLM agents, introduces a well-structured diagnostic framework, and produces empirically interesting findings about the frontier-vs-open-weight privacy gap. The benchmark design is thoughtful, though the limited discriminative power of the policy dimension axis and the reliance on synthetic data temper the impact. The paper's strongest contribution is establishing that the models users deploy for privacy (small, local) are precisely those worst at maintaining it—a finding with clear practical implications.
Generated May 20, 2026
Comparison History (25)
Paper 2 likely has higher impact due to timeliness and broad applicability: privacy-preserving LLM agents are an immediate deployment concern across consumer, enterprise, and regulated domains. POLAR-Bench offers a clear, diagnostic evaluation framework (policy axes × attack strategies) with deterministic scoring and adversarial interaction, directly informing safety/privacy alignment and model selection—especially for widely used smaller open-weight models. Paper 1 is novel and valuable for grounded social cognition and bias diagnosis in MLLMs, but its application scope is narrower (personality inference) and less universally critical than privacy-utility trade-offs in agentic systems.
POLAR-Bench addresses a more urgent and broadly impactful problem—privacy leakage in LLM agents acting on users' behalf—which has immediate real-world implications for deployment safety, policy, and regulation. Its methodology (adversarial probing across 10 domains, 7,852 samples, diagnostic 5×5 surface) is rigorous and scalable. The finding that smaller open-weight models leak significantly more private data is highly actionable for the AI safety community. While AttuneBench makes a solid contribution to emotional intelligence evaluation, privacy-utility trade-offs have broader cross-field relevance (security, law, HCI, ML) and greater timeliness given rapid agent deployment.
Paper 1 is likely higher impact: it introduces a broadly applicable diagnostic benchmark for privacy-utility tradeoffs in LLM agents, a timely and high-stakes deployment issue with clear real-world implications (data leakage under adversarial probing). Its multi-domain evaluation and interpretable “diagnostic surface” can standardize comparisons and guide alignment work across academia and industry, influencing safety, policy, and product practices. Paper 2 is technically valuable for efficiency, but KV-cache compression is a narrower systems optimization area with more incremental differentiation and typically less cross-field societal impact than privacy alignment benchmarks.
Paper 2 addresses a critical, universal bottleneck in the deployment of autonomous LLM agents: data privacy and adversarial robustness. While Paper 1 provides excellent insights into multimodal social cognition and bias, Paper 2's focus on privacy-utility trade-offs impacts virtually every domain where AI agents interact with third-party systems. Furthermore, its specific finding that smaller, on-device models are highly vulnerable to data leakage provides immediate, highly actionable value for the open-source and AI security communities, granting it broader and more urgent real-world impact.
Paper 1 addresses a fundamental and widely-studied problem in LLM efficiency—KV cache compression for long-context processing—with a technically novel approach combining meta-learning, composable tokens, and attention-flow redistribution. This has broad applicability across all LLM deployment scenarios and directly enables practical scaling. Paper 2 introduces a valuable benchmark for privacy-utility trade-offs in LLM agents, but benchmarks generally have narrower methodological impact compared to novel architectural methods. While Paper 2 addresses an important and timely topic, Paper 1's contribution to the core efficiency challenge of LLMs gives it broader and deeper potential impact across the field.
While emotional intelligence (Paper 1) is important for human-computer interaction, privacy and security (Paper 2) are critical, high-stakes bottlenecks for the real-world deployment of LLM agents. POLAR-Bench addresses a pressing vulnerability—data leakage in smaller, on-device models—which has immediate regulatory, legal, and safety implications across multiple domains, giving it a higher potential impact.
Paper 1 introduces a novel problem formulation bridging behavioral economics (prospect theory) with strategic classification in ML, creating a new subfield with broad theoretical and practical implications. It challenges a fundamental assumption (agent rationality) across the SC literature and provides a principled framework. Paper 2, while timely and practically useful, is primarily a benchmark contribution for LLM privacy evaluation—important but more incremental. Paper 1's interdisciplinary novelty and potential to reshape how strategic interactions are modeled in ML gives it higher long-term scientific impact.
Paper 2 proposes a framework for autonomous scientific discovery, addressing a profound bottleneck in research. By integrating multi-agent debate, self-healing execution, and human-in-the-loop collaboration, it has the potential to accelerate innovation across all scientific disciplines. While Paper 1 offers a valuable benchmark for LLM privacy, Paper 2's capacity to transform the broader scientific method gives it significantly higher potential for transformative, cross-disciplinary impact.
Paper 2 proposes a fundamental paradigm shift and theoretical framework for the entire field of interactive AI evaluation. While Paper 1 introduces a valuable and timely benchmark for privacy in LLM agents, Paper 2's methodological contributions, design principles, and taxonomy have the potential to influence the creation of all future interactive benchmarks, giving it a significantly broader impact across the rapidly growing field of AI agent research.
Paper 1 addresses a foundational and widely debated topic in LLM pretraining: the role of code in developing reasoning capabilities. By conducting massive-scale (10T tokens) controlled experiments, it challenges prevailing assumptions and provides actionable insights into data-centric optimization (structured reasoning signals vs. pure code). This will likely broadly influence how next-generation foundation models are trained. While Paper 2 introduces a valuable privacy benchmark, Paper 1's fundamental insights into model cognition and pretraining data dynamics offer a wider, more transformative impact on the trajectory of AI capability research.
Paper 2 is likely higher impact due to a concrete, scalable benchmark addressing a timely, high-stakes problem (privacy/utility in agentic LLMs). It offers clearer methodological rigor (large multi-domain dataset, deterministic scoring, controlled axes producing diagnostic surfaces) and enables broad adoption across academia/industry for model evaluation and privacy alignment. Paper 1 provides valuable architectural framing and patterns for production systems, but is more methodology/design-oriented with narrower scientific generalizability and fewer standardized, reusable artifacts for comparative research.
GRAM introduces a fundamentally new framework for neural reasoning by combining recursive latent-state refinement with probabilistic multi-trajectory computation, addressing core limitations of deterministic recursive reasoning models. This has broad implications for reasoning architectures, inference-time scaling, and generative modeling. While POLAR-Bench addresses the important and timely topic of privacy-utility trade-offs in LLM agents, it is primarily a benchmark contribution with diagnostic findings rather than a new methodology. GRAM's theoretical novelty, methodological depth, and potential to influence future reasoning system design give it higher long-term scientific impact.
Paper 1 addresses a critical, universal bottleneck in LLM agent deployment: privacy and security against adversarial third-party interactions. Its focus on the vulnerabilities of smaller, on-device models provides essential insights for the broader AI alignment and security communities. While Paper 2 offers a robust and rigorous framework for engineering design, its impact is highly domain-specific. Paper 1's generalizable findings on privacy-utility trade-offs have far-reaching implications across almost all real-world LLM agent applications, giving it a higher potential for broad scientific impact.
Paper 2 has higher likely impact: it introduces a new, scalable benchmark (POLAR-Bench) targeting an urgent, widely relevant problem—privacy-utility trade-offs for LLM agents under adversarial interaction—enabling standardized evaluation across many models and domains. Benchmarks often catalyze rapid follow-on work (training, alignment, red-teaming, regulation, and product safety). Its methodology (multi-domain dataset, controlled axes for policy/attack, deterministic scoring) supports broad adoption. Paper 1 offers valuable empirical systems insights but is less novel and narrower to LLM serving/scheduling research.
Paper 1 introduces a concrete, actionable benchmark with extensive empirical results, providing immediate utility for evaluating and improving LLM agents. In contrast, Paper 2 is a vision paper presenting a conceptual framework, which, while valuable, lacks the immediate empirical grounding and measurable impact of a novel dataset and evaluation methodology.
Paper 1 addresses the critical issue of privacy in autonomous LLM agents, a major bottleneck for real-world deployment across domains like healthcare, finance, and personal assistance. By introducing a benchmark to evaluate privacy-utility trade-offs against adversarial probing, it provides a foundational tool for a rapidly growing field. Paper 2, while methodologically rigorous, focuses on educational assessment and simulated students, which represents a narrower scope of impact compared to the universal applicability of privacy alignment in AI agents.
Paper 1 (NGM) proposes a novel, training-free plug-and-play memory module for LLMs that demonstrates consistent improvements across multiple model sizes, benchmarks, and modalities. Its training-free design and broad applicability make it highly practical and widely adoptable. Paper 2 (POLAR-Bench) introduces an important privacy-utility benchmark but is primarily a diagnostic evaluation tool rather than a methodological contribution. While timely, benchmarks typically have narrower long-term impact unless they become community standards. NGM's architectural innovation with demonstrated gains across scales suggests broader and more lasting scientific influence.
Paper 1 (POW3R) introduces a novel and broadly applicable framework for improving rubric-based RLHF training by dynamically adapting reward weights based on policy state. It addresses a fundamental problem in RL-based model training with strong empirical results (24/30 comparisons won, 2.5-4x faster convergence) across multiple settings. Paper 2 (POLAR-Bench) contributes a useful benchmark for privacy-utility trade-offs but is more narrowly scoped as an evaluation resource. POW3R's methodological innovation in reward shaping has broader potential impact across the entire RLHF/RLVR training paradigm, while benchmarks, though valuable, typically have more incremental impact unless they fundamentally redefine a field.
Paper 2 likely has higher scientific impact: it introduces a novel, safety-oriented digital-twin framework combining interpretable regime-switching simulators with conformal risk control and self-falsifying “witnesses,” and validates it on multiple full-scale plants plus an international benchmark with strong quantitative gains. The work is methodologically rigorous (finite-sample guarantees, matched protocols, statistical tests) and has clear real-world deployment relevance in safety-critical infrastructure. Paper 1 is timely and useful as an LLM privacy benchmark, but its primary contribution is evaluative/diagnostic within one domain, with potentially narrower cross-field impact than certified decision support for industrial control.
POLAR-Bench addresses a fundamental and rapidly growing concern—privacy in LLM agents—with broad applicability across the entire LLM ecosystem. It introduces a reusable benchmark (7,852 samples, 10 domains) with a clear diagnostic framework that can be adopted widely by the AI safety and alignment community. Its finding that smaller open-weight models leak significantly more private data has immediate practical implications for on-device deployment. Paper 1, while methodologically thorough, addresses a narrower application (disaster survey imputation) with incremental improvements over existing methods and more limited cross-field relevance.