POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

Qiaoyuan Zheng, Yiqu Yang, Qi Gao, Imanol Schlag

May 18, 2026

arXiv:2605.19127v1

cs.AI (primary)

#799 of 2292 · Artificial Intelligence

Tournament Score: 1444±39

Win Rate: 60%

Rating: 6.8/10

Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 7.5

Abstract

LLM agents increasingly have access to private user data and act on the user's behalf when interacting with third-party systems. The user defines what may and must not be shared, and the agent must robustly follow that intent even when third-party systems behave adversarially. We introduce POLAR-Bench (Policy-aware adversarial Benchmark), in which a trusted model with a privacy policy and a task converses with a third-party model that adversarially probes for both task-relevant and protected attributes. Across 10 domains and 7,852 samples, we score privacy and utility by deterministic set-membership and vary privacy policy dimension and attack strategy along two orthogonal axes, producing a 5×5 diagnostic surface per model. Our results reveal a sharp split: current frontier models withhold over 99% of protected attributes, while smaller open-weight models in the 1–30B range, the class users most commonly run as their own trusted agent on-device or via private inference, score notably worse, with the weakest leaking over half. POLAR-Bench thus localizes where each model's intent-following breaks down, providing a foothold for privacy alignment where it matters most.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: POLAR-Bench

1. Core Contribution

POLAR-Bench introduces a diagnostic benchmark for evaluating how well LLM agents balance privacy protection and task utility when interacting with adversarial third-party systems. The key novelty is the joint variation of two orthogonal axes: privacy policy complexity (5 levels, from explicit field rules to conflicting objectives) and attack strategy (5 levels, from direct single-turn to multi-turn progressive elicitation). This produces a 5×5 diagnostic surface per model, enabling fine-grained localization of where intent-following breaks down. The benchmark spans 10 domains and 7,852 samples, uses deterministic set-membership scoring (no LLM judge for outcomes), and evaluates 22 models ranging from 3B open-weight to frontier proprietary systems.
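To make the 5×5 diagnostic surface concrete, the following is a minimal sketch of how per-sample scores could be averaged into a policy-by-strategy grid. The record layout, the function name `diagnostic_surface`, and the labels P1–P5 and S1–S5 as plain strings are assumptions made for illustration; this is not the benchmark's own code.

```python
from collections import defaultdict

# Axis labels; the string form is an assumption for this sketch.
POLICIES = ["P1", "P2", "P3", "P4", "P5"]
STRATEGIES = ["S1", "S2", "S3", "S4", "S5"]


def diagnostic_surface(records):
    """Average per-sample scores into a 5x5 grid (rows: policy, columns: strategy).

    `records` is an iterable of (policy, strategy, score) tuples.
    This record layout is hypothetical, chosen only for illustration.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for policy, strategy, score in records:
        sums[(policy, strategy)] += score
        counts[(policy, strategy)] += 1
    # Cells with no samples are reported as None rather than 0,
    # so an empty cell is not mistaken for total leakage.
    return [
        [
            sums[(p, s)] / counts[(p, s)] if counts[(p, s)] else None
            for s in STRATEGIES
        ]
        for p in POLICIES
    ]
```

Reading along a row of the resulting grid holds the policy fixed and varies the attack; reading down a column does the reverse, which is what lets a weak cell be attributed to one axis or the other.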

The central finding is a "sharp split": frontier models (GLM-5.1, GPT-5.4, Gemma-4-31B) maintain >99% privacy with high utility, while smaller open-weight models in the 1–30B range—precisely those users deploy locally to avoid sending data to third-party APIs—leak substantially more, with the weakest disclosing over half of protected attributes.

2. Methodological Rigor

Strengths in design:

The decoupling of task instruction from privacy policy (T ⊥ P) is a principled design choice that prevents shortcut solutions where models satisfy privacy constraints through cues in the task rather than interpreting the policy independently.

Deterministic evaluation via regex-based attribute matching eliminates LLM-judge variance in scoring, improving reproducibility.

The two-stage pipeline (symbolic profile sampling → natural-language rendering) with quality controls (regex coverage verification, LLM-judge validation for attacker prompts, repair passes) is thorough and well-documented.
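The deterministic scoring mentioned above can be sketched roughly as follows: each attribute has a regex, an attribute counts as disclosed if its regex matches the trusted model's messages, and privacy and utility are computed by set membership. The function names, the exact privacy and utility formulas, and the example patterns are assumptions for illustration and may differ from the paper's definitions.

```python
import re


def disclosed_attributes(transcript, patterns):
    """Return the names of attributes whose regex matches the transcript.

    `transcript` is assumed to hold only the trusted model's messages.
    """
    return {
        name
        for name, pattern in patterns.items()
        if re.search(pattern, transcript, flags=re.IGNORECASE)
    }


def score(transcript, patterns, protected, required):
    """Score one conversation by set membership.

    privacy = fraction of protected attributes NOT disclosed
    utility = fraction of task-required attributes disclosed
    Both formulas are assumptions made for this sketch.
    """
    disclosed = disclosed_attributes(transcript, patterns)
    privacy = 1.0 - len(disclosed & protected) / len(protected) if protected else 1.0
    utility = len(disclosed & required) / len(required) if required else 1.0
    return privacy, utility


# Hypothetical attribute patterns for a medical-scheduling sample.
patterns = {
    "diagnosis": r"\bdiabetes\b",
    "salary": r"\b85,?000\b",
    "insurance_id": r"\bINS-\d{6}\b",
    "preferred_date": r"\bmarch 3\b",
}
protected = {"diagnosis", "salary"}
required = {"insurance_id", "preferred_date"}

safe = "The patient would like an appointment on March 3. Insurance ID is INS-123456."
leaky = "She has diabetes and prefers March 3."
```

Under these assumptions `score(safe, patterns, protected, required)` gives full privacy and full utility, while the `leaky` transcript loses half of each. A paraphrase such as "a chronic blood-sugar condition" would not match the `diagnosis` pattern, which is the blind spot the concerns below raise.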

Concerns:

A single model (Llama-3.3-70B) generates all attacker prompts, conditioning the reported attack effectiveness on that model's capabilities. Table 2 partially addresses this by showing that sensitivity to the choice of third-party model (E) is small when the trusted model (T) is strong but produces ~3.9pp swings when T is weak, though this is tested with only three choices of E.

The P1–P5 difficulty ordering is not validated against human judgments. Table 11 shows surprisingly small variation across policy dimensions (privacy mean ranges only from 78.74 to 82.57), raising questions about whether the hierarchy captures meaningfully different reasoning demands.

The scoring is blind to paraphrased disclosures, which is particularly problematic for P4 (partial/abstracted disclosure) and P5 (conflicting objectives). This is acknowledged but not quantified.

Synthetic data generation from structured profiles raises external validity concerns. Real user documents have messier structure, implicit information, and contextual nuances that structured profiles may not capture.

3. Potential Impact

Practical value: The benchmark directly addresses a deployment-relevant gap. Users running local LLM agents (1–30B range) to avoid cloud API privacy risks are ironically using the models least capable of enforcing privacy policies. This framing is compelling and actionable for the alignment community.

As a stress-test platform: The paper demonstrates that PrivacyChecker (an inference-time defense) produces measurable improvements on POLAR-Bench (+27.5pp privacy for Ministral-3-3B, +5.9pp for Apertus-70B), validating the benchmark's sensitivity to interventions. This positions POLAR-Bench as a useful evaluation tool for future privacy defenses.

Diagnostic utility: The 5×5 surface reveals that attack strategy matters far more than policy dimension. S2 (yes/no narrowing) and S5 (multi-turn progressive) are most privacy-threatening, while direct and prompt-injection attacks are easier to defend against. This is a practically useful finding for defense design.

Broader influence: The benchmark could influence training practices for open-weight models, where privacy alignment is clearly underinvested compared to frontier models. The correlation analysis (GPQA Diamond ↔ privacy: r=0.752) suggests reasoning capability helps but is insufficient, pointing toward specific alignment interventions rather than just scaling.
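For readers who want to check what the cited correlation implies: squaring r=0.752 gives roughly 0.57, so under a linear fit a little over half of the variance in privacy scores is associated with GPQA Diamond, which is consistent with "helps but is insufficient". Below is a minimal sketch of the Pearson coefficient itself; the function name is an assumption and no values from the paper are reproduced.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Applied to one list of per-model GPQA Diamond scores and one list of per-model privacy scores, this would yield a statistic of the kind reported.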

4. Timeliness & Relevance

The paper addresses a genuine and growing concern: as LLM agents become intermediaries handling private data (medical scheduling, financial planning, legal consultations), the gap between their capability and their privacy robustness becomes a practical liability. The timing is appropriate given the rapid deployment of agentic systems and the simultaneous push for on-device/private inference. Prior benchmarks (AirGapAgent, AgentLeak, ConfAIde) each address one dimension; POLAR-Bench's joint evaluation fills a clear gap in the evaluation landscape (Table 1).

5. Strengths & Limitations

Key strengths:

Clean experimental design with orthogonal axes enabling controlled analysis

Large-scale evaluation (22 models × 7,852 samples × 10 domains)

Deterministic scoring eliminates evaluator variance

Comprehensive robustness checks (cross-domain consistency with Kendall's W=0.915, E-sensitivity, reproducibility measures)

The finding that scaling is non-monotonic within families (Apertus-70B has *worse* privacy than 8B; Ministral privacy is non-monotonic at 3B→8B→14B) is an important empirical contribution
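The Kendall's W figure cited among the robustness checks measures how closely several rankings of the same items agree, from 0 (no agreement) to 1 (identical rankings). A minimal sketch of the standard no-ties formula, W = 12S / (m²(n³ − n)), follows. Treating each domain as one ranking of the evaluated models is an assumption about how the paper applies it, and the function name is hypothetical.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for m complete rankings of n items.

    `rankings` is a list of m lists; each inner list assigns ranks 1..n
    (no ties) to the same n items in the same item order.
    """
    m = len(rankings)
    n = len(rankings[0]) if m else 0
    if m < 2 or n < 2:
        raise ValueError("need at least 2 rankings of at least 2 items")
    if any(sorted(r) != list(range(1, n + 1)) for r in rankings):
        raise ValueError("each ranking must be a permutation of 1..n")
    # Sum of ranks each item received across all rankings.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    # S: squared deviation of the rank sums from their mean.
    s = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

A value of 0.915 would therefore indicate that the model ordering is largely, though not perfectly, preserved from one domain to the next.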

Notable limitations:

English-only, excluding multilingual privacy norms

Defense evaluation is limited to PrivacyChecker on only 3 models

The across-benchmark analysis is limited to GPQA Diamond due to score availability

The privacy policy dimension axis shows surprisingly little discriminative power, suggesting the hierarchy may need refinement

No human evaluation of any component (policy difficulty ordering, transcript quality, scoring accuracy)

Overall Assessment

POLAR-Bench makes a solid contribution as a benchmark paper: it fills a specific gap in privacy-utility evaluation for LLM agents, introduces a well-structured diagnostic framework, and produces empirically interesting findings about the frontier-vs-open-weight privacy gap. The benchmark design is thoughtful, though the limited discriminative power of the policy dimension axis and the reliance on synthetic data temper the impact. The paper's strongest contribution is establishing that the models users deploy for privacy (small, local) are precisely those worst at maintaining it—a finding with clear practical implications.

Rating: 6.8/10

Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 7.5

Generated May 20, 2026

Comparison History (25)

vs. Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

gpt-5.2 · 5/22/2026

Paper 2 likely has higher impact due to timeliness and broad applicability: privacy-preserving LLM agents are an immediate deployment concern across consumer, enterprise, and regulated domains. POLAR-Bench offers a clear, diagnostic evaluation framework (policy axes × attack strategies) with deterministic scoring and adversarial interaction, directly informing safety/privacy alignment and model selection—especially for widely used smaller open-weight models. Paper 1 is novel and valuable for grounded social cognition and bias diagnosis in MLLMs, but its application scope is narrower (personality inference) and less universally critical than privacy-utility trade-offs in agentic systems.

vs. AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

claude-opus-4.6 · 5/22/2026

POLAR-Bench addresses a more urgent and broadly impactful problem—privacy leakage in LLM agents acting on users' behalf—which has immediate real-world implications for deployment safety, policy, and regulation. Its methodology (adversarial probing across 10 domains, 7,852 samples, diagnostic 5×5 surface) is rigorous and scalable. The finding that smaller open-weight models leak significantly more private data is highly actionable for the AI safety community. While AttuneBench makes a solid contribution to emotional intelligence evaluation, privacy-utility trade-offs have broader cross-field relevance (security, law, HCI, ML) and greater timeliness given rapid agent deployment.

vs. Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

gpt-5.2 · 5/22/2026

Paper 1 is likely higher impact: it introduces a broadly applicable diagnostic benchmark for privacy-utility tradeoffs in LLM agents, a timely and high-stakes deployment issue with clear real-world implications (data leakage under adversarial probing). Its multi-domain evaluation and interpretable “diagnostic surface” can standardize comparisons and guide alignment work across academia and industry, influencing safety, policy, and product practices. Paper 2 is technically valuable for efficiency, but KV-cache compression is a narrower systems optimization area with more incremental differentiation and typically less cross-field societal impact than privacy alignment benchmarks.

vs. Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

gemini-3.1 · 5/22/2026

Paper 2 addresses a critical, universal bottleneck in the deployment of autonomous LLM agents: data privacy and adversarial robustness. While Paper 1 provides excellent insights into multimodal social cognition and bias, Paper 2's focus on privacy-utility trade-offs impacts virtually every domain where AI agents interact with third-party systems. Furthermore, its specific finding that smaller, on-device models are highly vulnerable to data leakage provides immediate, highly actionable value for the open-source and AI security communities, granting it broader and more urgent real-world impact.

vs. Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a fundamental and widely-studied problem in LLM efficiency—KV cache compression for long-context processing—with a technically novel approach combining meta-learning, composable tokens, and attention-flow redistribution. This has broad applicability across all LLM deployment scenarios and directly enables practical scaling. Paper 2 introduces a valuable benchmark for privacy-utility trade-offs in LLM agents, but benchmarks generally have narrower methodological impact compared to novel architectural methods. While Paper 2 addresses an important and timely topic, Paper 1's contribution to the core efficiency challenge of LLMs gives it broader and deeper potential impact across the field.

vs. AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

gemini-3.1 · 5/22/2026

While emotional intelligence (Paper 1) is important for human-computer interaction, privacy and security (Paper 2) are critical, high-stakes bottlenecks for the real-world deployment of LLM agents. POLAR-Bench addresses a pressing vulnerability—data leakage in smaller, on-device models—which has immediate regulatory, legal, and safety implications across multiple domains, giving it a higher potential impact.

vs. Beyond Rational Illusion: Behaviorally Realistic Strategic Classification

claude-opus-4.6 · 5/20/2026

Paper 1 introduces a novel problem formulation bridging behavioral economics (prospect theory) with strategic classification in ML, creating a new subfield with broad theoretical and practical implications. It challenges a fundamental assumption (agent rationality) across the SC literature and provides a principled framework. Paper 2, while timely and practically useful, is primarily a benchmark contribution for LLM privacy evaluation—important but more incremental. Paper 1's interdisciplinary novelty and potential to reshape how strategic interactions are modeled in ML gives it higher long-term scientific impact.

vs. AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

gemini-3.1 · 5/20/2026

Paper 2 proposes a framework for autonomous scientific discovery, addressing a profound bottleneck in research. By integrating multi-agent debate, self-healing execution, and human-in-the-loop collaboration, it has the potential to accelerate innovation across all scientific disciplines. While Paper 1 offers a valuable benchmark for LLM privacy, Paper 2's capacity to transform the broader scientific method gives it significantly higher potential for transformative, cross-disciplinary impact.

vs. Interactive Evaluation Requires a Design Science

gemini-3.1 · 5/20/2026

Paper 2 proposes a fundamental paradigm shift and theoretical framework for the entire field of interactive AI evaluation. While Paper 1 introduces a valuable and timely benchmark for privacy in LLM agents, Paper 2's methodological contributions, design principles, and taxonomy have the potential to influence the creation of all future interactive benchmarks, giving it a significantly broader impact across the rapidly growing field of AI agent research.

vs. What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

gemini-3.1 · 5/20/2026

Paper 1 addresses a foundational and widely debated topic in LLM pretraining: the role of code in developing reasoning capabilities. By conducting massive-scale (10T tokens) controlled experiments, it challenges prevailing assumptions and provides actionable insights into data-centric optimization (structured reasoning signals vs. pure code). This will likely broadly influence how next-generation foundation models are trained. While Paper 2 introduces a valuable privacy benchmark, Paper 1's fundamental insights into model cognition and pretraining data dynamics offer a wider, more transformative impact on the trajectory of AI capability research.

vs. A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

gpt-5.2 · 5/20/2026

Paper 2 is likely higher impact due to a concrete, scalable benchmark addressing a timely, high-stakes problem (privacy/utility in agentic LLMs). It offers clearer methodological rigor (large multi-domain dataset, deterministic scoring, controlled axes producing diagnostic surfaces) and enables broad adoption across academia/industry for model evaluation and privacy alignment. Paper 1 provides valuable architectural framing and patterns for production systems, but is more methodology/design-oriented with narrower scientific generalizability and fewer standardized, reusable artifacts for comparative research.

vs. Generative Recursive Reasoning

claude-opus-4.6 · 5/20/2026

GRAM introduces a fundamentally new framework for neural reasoning by combining recursive latent-state refinement with probabilistic multi-trajectory computation, addressing core limitations of deterministic recursive reasoning models. This has broad implications for reasoning architectures, inference-time scaling, and generative modeling. While POLAR-Bench addresses the important and timely topic of privacy-utility trade-offs in LLM agents, it is primarily a benchmark contribution with diagnostic findings rather than a new methodology. GRAM's theoretical novelty, methodological depth, and potential to influence future reasoning system design give it higher long-term scientific impact.

vs. EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

gemini-3.1 · 5/20/2026

Paper 1 addresses a critical, universal bottleneck in LLM agent deployment: privacy and security against adversarial third-party interactions. Its focus on the vulnerabilities of smaller, on-device models provides essential insights for the broader AI alignment and security communities. While Paper 2 offers a robust and rigorous framework for engineering design, its impact is highly domain-specific. Paper 1's generalizable findings on privacy-utility trade-offs have far-reaching implications across almost all real-world LLM agent applications, giving it a higher potential for broad scientific impact.

vs. Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

gpt-5.2 · 5/20/2026

Paper 2 has higher likely impact: it introduces a new, scalable benchmark (POLAR-Bench) targeting an urgent, widely relevant problem—privacy-utility trade-offs for LLM agents under adversarial interaction—enabling standardized evaluation across many models and domains. Benchmarks often catalyze rapid follow-on work (training, alignment, red-teaming, regulation, and product safety). Its methodology (multi-domain dataset, controlled axes for policy/attack, deterministic scoring) supports broad adoption. Paper 1 offers valuable empirical systems insights but is less novel and narrower to LLM serving/scheduling research.

vs. Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On

gemini-3.1 · 5/20/2026

Paper 1 introduces a concrete, actionable benchmark with extensive empirical results, providing immediate utility for evaluating and improving LLM agents. In contrast, Paper 2 is a vision paper presenting a conceptual framework, which, while valuable, lacks the immediate empirical grounding and measurable impact of a novel dataset and evaluation methodology.

vs. Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment

gemini-3.1 · 5/20/2026

Paper 1 addresses the critical issue of privacy in autonomous LLM agents, a major bottleneck for real-world deployment across domains like healthcare, finance, and personal assistance. By introducing a benchmark to evaluate privacy-utility trade-offs against adversarial probing, it provides a foundational tool for a rapidly growing field. Paper 2, while methodologically rigorous, focuses on educational assessment and simulated students, which represents a narrower scope of impact compared to the universal applicability of privacy alignment in AI agents.

vs. NGM: A Plug-and-Play Training-Free Memory Module for LLMs

claude-opus-4.6 · 5/20/2026

Paper 1 (NGM) proposes a novel, training-free plug-and-play memory module for LLMs that demonstrates consistent improvements across multiple model sizes, benchmarks, and modalities. Its training-free design and broad applicability make it highly practical and widely adoptable. Paper 2 (POLAR-Bench) introduces an important privacy-utility benchmark but is primarily a diagnostic evaluation tool rather than a methodological contribution. While timely, benchmarks typically have narrower long-term impact unless they become community standards. NGM's architectural innovation with demonstrated gains across scales suggests broader and more lasting scientific influence.

vs. Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

claude-opus-4.6 · 5/20/2026

Paper 1 (POW3R) introduces a novel and broadly applicable framework for improving rubric-based RLHF training by dynamically adapting reward weights based on policy state. It addresses a fundamental problem in RL-based model training with strong empirical results (24/30 comparisons won, 2.5-4x faster convergence) across multiple settings. Paper 2 (POLAR-Bench) contributes a useful benchmark for privacy-utility trade-offs but is more narrowly scoped as an evaluation resource. POW3R's methodological innovation in reward shaping has broader potential impact across the entire RLHF/RLVR training paradigm, while benchmarks, though valuable, typically have more incremental impact unless they fundamentally redefine a field.

vs. Explainable Wastewater Digital Twins: Adaptive Context-Conditioned Structured Simulators with Self-Falsifying Decision Support

gpt-5.2 · 5/20/2026

Paper 2 likely has higher scientific impact: it introduces a novel, safety-oriented digital-twin framework combining interpretable regime-switching simulators with conformal risk control and self-falsifying “witnesses,” and validates it on multiple full-scale plants plus an international benchmark with strong quantitative gains. The work is methodologically rigorous (finite-sample guarantees, matched protocols, statistical tests) and has clear real-world deployment relevance in safety-critical infrastructure. Paper 1 is timely and useful as an LLM privacy benchmark, but its primary contribution is evaluative/diagnostic within one domain, with potentially narrower cross-field impact than certified decision support for industrial control.

vs. Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses

claude-opus-4.6 · 5/20/2026

POLAR-Bench addresses a fundamental and rapidly growing concern—privacy in LLM agents—with broad applicability across the entire LLM ecosystem. It introduces a reusable benchmark (7,852 samples, 10 domains) with a clear diagnostic framework that can be adopted widely by the AI safety and alignment community. Its finding that smaller open-weight models leak significantly more private data has immediate practical implications for on-device deployment. Paper 1, while methodologically thorough, addresses a narrower application (disaster survey imputation) with incremental improvements over existing methods and more limited cross-field relevance.