Joachim Schaeffer, Thomas Jiralerspong, Alexander Panfilov, Guillaume Lajoie, Jonas Geiping, Yoshua Bengio, Roland S. Zimmermann
AI control protocols oversee untrusted models by monitoring their actions and modifying potentially unsafe steps, often using a trusted model. This partially tampers with the untrusted model's trajectory. If the trusted model detects such an intervention, it may infer properties of the monitor and adapt to evade control. We introduce \textbf{CIAware-Bench}, a benchmark for measuring \textbf{c}ontrol \textbf{i}ntervention (CI) awareness across frontier models. CIAware-Bench tests whether models can distinguish their own trajectories from those modified by a control intervention. The benchmark is comprised of a suite of four task domains (essay writing, BigCodeBench, Bash Arena, and SHADE-Arena), while varying trajectory watermarking, side-task presence, and the control protocol. Evaluating eleven frontier models, we find low to moderate CI awareness under default settings (up to 0.87; random chance balanced binary classification accuracy is 0.5) with substantial variation across task domains and model pairs. Detection is generally easier across model families, suggesting that models exploit provider-specific differences in style or post-training. Overall, CI awareness is not a fixed model-level property, and should be measured for each new model release and deployment scenario. We release CIAware-Bench to track CI awareness and inform control protocols whose interventions are harder to detect.
CIAware-Bench introduces a systematic benchmark for measuring control intervention (CI) awareness — the ability of an LLM to detect when a control protocol has tampered with its execution trajectory by substituting one step with output from another (trusted) model. This fills a specific gap in AI safety research: while AI control protocols (e.g., Defer-to-Trusted, Trusted Editing) have been studied for their effectiveness at catching misaligned agents, no prior work systematically measured the reverse direction — whether the agent can detect the intervention itself. This is a meaningful threat vector because intervention detection could enable a misaligned model to learn monitor properties, adapt evasion strategies, or enable collusion when the same model family serves as both agent and monitor.
The benchmark spans four task domains (essay writing, BigCodeBench coding, Bash Arena, SHADE-Arena), three watermark conditions, three intervention protocols, and eleven frontier models from four providers, yielding a comprehensive evaluation matrix.
The experimental design is generally sound but has notable limitations the authors themselves acknowledge:
Direct impact on AI control protocol design: The finding that CI awareness is a per-(acting, intervening, environment) triplet property rather than a fixed model-level trait has immediate practical implications. Protocol designers should test specific model pairings before deployment rather than relying on general assumptions. The finding that larger interventions are easier to detect (monotonically from ~50% at <100 chars to ~65% at >1000 chars) provides actionable guidance for minimizing intervention footprint.
Benchmark infrastructure: Releasing a reusable benchmark with canary GUIDs, code, and data enables tracking CI awareness across model releases — a valuable longitudinal capability for the safety community.
Limitations on impact: The paper measures detection but does not demonstrate exploitation. The preliminary SHADE-Arena experiment (Section A.6) shows only a non-significant uplift in attack success when models receive oracle CI feedback (+2.8pp, p≈0.44). Without evidence that detection translates to meaningful evasion, the practical threat remains speculative. The "detection-to-exploitation" gap is the key open question this paper raises but does not answer.
Highly timely. AI control protocols are an active area of safety research (Greenblatt et al., 2024; Bhatt et al., 2025), and as frontier models are increasingly deployed in agentic settings with monitoring, understanding their ability to detect monitoring interventions is directly relevant. The paper addresses a genuine blind spot: prior work assumed interventions were invisible to the acting model, and this benchmark provides the first systematic challenge to that assumption.
The finding that cross-family interventions are more detectable than within-family interventions (0.55 vs. 0.51 on Essay, p<10⁻³) and that this effect concentrates in Anthropic models detecting GPT outputs has implications for the common deployment pattern of using models from different providers as agent and monitor.
The watermarking results are striking: hidden-scratchpad watermarking lifts detection by +0.43 on BigCodeBench, demonstrating that models can deliberately create and exploit self-identification signals when prompted. This suggests that even if current spontaneous CI awareness is low, the latent capability exists and could be elicited through fine-tuning or in-context learning. The low watermark implementation rate in agentic settings (18.7% on SHADE-Arena) is an interesting finding about the limitations of text-based watermarking strategies in tool-use-heavy environments.
The paper is well-written and well-organized, though the sheer volume of appendix material (44 pages) relative to the main text (9 pages) suggests that the core narrative could be tightened.
Generated Jun 10, 2026
ReasonAlloc addresses a critical and timely infrastructure challenge—KV cache management for reasoning LLMs—with a practical, training-free solution showing clear empirical gains. As reasoning models with long CoT become mainstream, efficient inference is a high-priority problem with broad real-world impact. Paper 2 (CIAware-Bench) introduces an interesting benchmark for AI safety/control, but targets a narrower, more speculative concern (models detecting control interventions). While relevant to AI safety, its immediate practical impact and community adoption potential are lower compared to the broadly applicable inference optimization of Paper 1.
Paper 2 likely has higher impact: it introduces a new training framework (Visual-SDPO) combining privileged visual feedback, credit assignment from defects to code, and GRPO, showing consistent multi-domain gains on established benchmarks with no inference-time cost—high novelty plus clear, broad applications in UI/chart/slide generation and code agents. Paper 1 provides an important benchmark for control-intervention awareness in LLM safety, but is primarily evaluative; impact depends on downstream adoption and may be narrower and more scenario-specific than a general optimization method.
Paper 1 tackles the highly relevant and complex problem of emergent misalignment in multi-agent systems. Its introduction of an active, budget-aware auditing agent provides a novel methodological framework for dynamic AI oversight, extending beyond static evaluation. While Paper 2 offers a valuable benchmark, Paper 1's proactive approach addresses critical real-world safety challenges in deploying autonomous, interacting AI agents, suggesting broader theoretical and practical implications for AI alignment.
Paper 1 addresses a highly critical and emerging problem in AI safety: the ability of untrusted LLMs to detect and evade safety control interventions. As AI systems become more autonomous, evaluating and mitigating these evasion capabilities is paramount for alignment and safe deployment. While Paper 2 offers a valuable real-world dataset for user personalization, Paper 1's focus on frontier model safety and control protocols has broader and more urgent implications for the future of AI development and deployment.
Paper 2 likely has higher scientific impact: it introduces a timely, broadly relevant benchmark for control/intervention awareness—central to AI safety, governance, and deployment of frontier LLMs across many applications. Benchmarks often become standard evaluation tools, shaping research agendas and operational practices. Its multi-domain design and empirical evaluation across 11 models supports methodological rigor and immediate real-world applicability for monitoring and control protocols. Paper 1 is innovative and useful for scientific computing, but its impact is narrower (PDE-solver agent tooling) and more dependent on adoption within a specific technical community.
Paper 2 has higher potential impact because it introduces a new safety-relevant benchmark (CIAware-Bench) targeting an under-measured capability—detecting control interventions—which directly affects the design of scalable oversight and AI control protocols. It is timely for frontier-model governance, broadly applicable across providers, task domains, and deployment settings, and enables longitudinal tracking across new model releases. The work also suggests actionable implications (make interventions harder to detect). Paper 1 improves evaluation rigor for agent workflows, but its main takeaway is largely negative/diagnostic and narrower to LLM agent orchestration.
CIAware-Bench addresses a novel and timely problem—whether frontier LLMs can detect control interventions—which is critical for AI safety and alignment. It introduces a new benchmark across multiple task domains and evaluates eleven frontier models, providing broadly useful infrastructure for the community. While Paper 1 makes solid contributions to LLM unlearning with a well-engineered method (NSRU), it is more incremental, combining existing techniques (LoRA, null-space projection) in a specific application area. Paper 2 opens a new research direction with broader implications for AI governance and deployment safety.
CIAware-Bench addresses a more novel and timely problem at the frontier of AI safety—whether untrusted models can detect control interventions and potentially subvert oversight mechanisms. This has broad implications for AI alignment and governance as models become more capable. The benchmark evaluates 11 frontier models across multiple domains and provides foundational infrastructure for ongoing safety evaluation. Paper 2, while practical, applies existing VLM techniques to a relatively narrow privacy de-identification task with incremental methodological contributions. Paper 1's novelty, relevance to the critical AI safety field, and potential to shape future control protocol design give it higher impact potential.
Paper 1 addresses a fundamental and widely applicable bottleneck in AI agents—long-term memory management. By proposing a scalable, topic-structured memory architecture, it offers broad utility across numerous real-world LLM agent applications. While Paper 2 provides a valuable safety benchmark, Paper 1's architectural innovation has a higher potential for widespread adoption and immediate impact across the rapidly growing field of autonomous AI agents.
Paper 1 addresses a fundamental challenge in reinforcement learning—handling state-dependent feasible action sets in MDPs—with a principled framework (Bellman-Taylor score decoding), theoretical guarantees, and demonstrated practical applications in operations research. This bridges a significant gap between DRL theory and real-world constrained optimization problems, with broad applicability. Paper 2, while timely in AI safety, is primarily a benchmark for a narrow phenomenon (control intervention awareness) with empirical findings that are somewhat incremental and specific to current frontier models, limiting its long-term scientific impact.