Retrying vs Resampling in AI Control
James Lucassen, Adam Kaufman
Abstract
AI coding scaffolds like Claude Code and Codex use *retrying*: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains. We also study *resampling*: drawing multiple samples from the same context, which does not leak exploitable information. We disentangle design choices that previous work on resampling had bundled together. In BashArena, with Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the trusted monitor, drawing five samples per step and auditing on the maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget, at no cost to usefulness. Selectively resampling only the steps that look suspicious on the first draw recovers 6.2 percentage points of the gain while drawing only 10% as many extra samples. Two of our findings in this setting contradict earlier work on resampling. The first is that auditing based on the maximum across resampled suspicion scores outperforms using the minimum, which is the opposite of what Ctrl-Z found. The second is that executing the least suspicious sample, which is the central mechanism in earlier defer-to-resample protocols, gives only a small empirical safety gain in our setting (+3.9 pp, with the confidence interval overlapping zero).
AI Impact Assessments
Scientific Impact Assessment: "Retrying vs Resampling in AI Control"
1. Core Contribution
This paper systematically compares two approaches for improving safety in AI coding scaffolds under the AI control framework: retrying (blocking risky actions and continuing with feedback) and resampling (drawing multiple independent samples from the same context). The key insight is that retrying with feedback leaks exploitable information to a strategic adversary — the monitor's rationale enables the untrusted model to craft stealthier attacks, negating safety gains. Resampling, by contrast, leaks no information and can improve safety by ~10 percentage points at no usefulness cost.
The paper also disentangles previously bundled design choices in resampling protocols, producing two findings that contradict prior work (Ctrl-Z/Bhatt et al. 2025): (1) max aggregation outperforms min for auditing decisions, and (2) executing the least suspicious sample provides only marginal safety gains. The authors provide a principled analysis of *why* these findings differ from prior work, rooted in cross-resample correlation structures of honest vs. attack suspicion score distributions.
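The two design choices separated above (how to aggregate resampled suspicion scores for the audit decision, and which sample to execute) can be sketched as independent functions. This is a minimal illustration, not the paper's implementation: the `Sample` type, its fields, and the threshold value are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence


@dataclass
class Sample:
    action: str       # proposed action text (hypothetical field)
    suspicion: float  # trusted monitor's suspicion score for this action


def audit_score(
    samples: Sequence[Sample],
    aggregate: Callable[[Iterable[float]], float] = max,
) -> float:
    """Collapse per-sample suspicion scores into one score for the audit decision.

    Pass aggregate=max or aggregate=min to compare the two auditing rules.
    """
    return aggregate(s.suspicion for s in samples)


def should_audit(
    samples: Sequence[Sample],
    threshold: float,
    aggregate: Callable[[Iterable[float]], float] = max,
) -> bool:
    """Audit the step if the aggregated score reaches the (assumed) threshold."""
    return audit_score(samples, aggregate) >= threshold


def least_suspicious(samples: Sequence[Sample]) -> Sample:
    """Choose which sample to execute under a defer-to-resample style rule."""
    return min(samples, key=lambda s: s.suspicion)


# Illustrative scores only; these are not data from the paper.
samples = [Sample("a", 2.0), Sample("b", 7.5), Sample("c", 3.0)]
audit_with_max = should_audit(samples, threshold=5.0, aggregate=max)
audit_with_min = should_audit(samples, threshold=5.0, aggregate=min)
executed = least_suspicious(samples)
```

Keeping the aggregation rule and the execution rule in separate functions is what allows each to be varied on its own, which is the kind of disentangling the assessment credits the paper with.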
2. Methodological Rigor
The experimental methodology is thorough and well-considered:
Notable limitations in rigor: The attack policies are prompted rather than fine-tuned, which may underestimate adversarial capability. The red team exploration is acknowledged as incomplete — particularly for attack selection strategies against cautious retrying protocols, where the authors explicitly defer analysis. The paper also doesn't fully explore whether more sophisticated attack adaptation strategies could overcome resampling benefits.
3. Potential Impact
Immediate practical relevance: The findings directly inform the design of production AI coding scaffolds. Claude Code's auto mode and OpenAI's Codex both use retrying with feedback — this paper provides evidence that this design choice may be counterproductive from a safety perspective. This could influence how major AI labs design their deployment infrastructure.
Theoretical contributions: The cross-resample correlation analysis provides a general framework for understanding when different aggregation strategies are appropriate. The insight that the optimal aggregation function depends on the *relative* correlation structures of honest and attack distributions (not just one in isolation) is broadly applicable.
Efficiency insights: The finding that two-stage selective resampling recovers 6.2 pp of the 10.2 pp safety gain while drawing only 10% as many extra samples is practically important for deployment cost management.
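The two-stage idea can be sketched as follows: draw one sample, and draw the remaining samples only when the first one looks suspicious. This is a hedged illustration under stated assumptions. `draw_score` stands in for "sample an action and score it with the trusted monitor", and `resample_threshold` is a hypothetical parameter; neither name comes from the paper. The final line works out the share of the full gain that the reported figures imply.

```python
from typing import Callable, Tuple


def selective_resample(
    draw_score: Callable[[], float],
    resample_threshold: float,
    n_samples: int = 5,
) -> Tuple[float, int]:
    """Return (max suspicion score, number of extra samples drawn).

    Stage 1 draws a single sample. Stage 2 draws the remaining
    n_samples - 1 only if the first score reaches the threshold.
    """
    first = draw_score()
    scores = [first]
    if first >= resample_threshold:
        scores.extend(draw_score() for _ in range(n_samples - 1))
    return max(scores), len(scores) - 1


# Deterministic stand-ins for the sampler; illustrative scores only.
benign_scores = iter([1.0])
benign_result = selective_resample(lambda: next(benign_scores), 5.0)

flagged_scores = iter([6.0, 2.0, 9.0, 3.0, 4.0])
flagged_result = selective_resample(lambda: next(flagged_scores), 5.0)

# Share of the full resampling gain recovered, from the reported figures.
fraction_recovered = 6.2 / 10.2
```

On the reported numbers, selective resampling keeps roughly 61% of the safety gain for about a tenth of the extra sampling cost.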
4. Timeliness & Relevance
This paper is highly timely. AI coding agents are being deployed at scale with exactly the retrying+feedback mechanisms this paper critiques. The AI control framework is gaining traction as a practical approach to AI safety, and this paper fills an important gap by providing rigorous empirical analysis of common design choices.
The paper also arrives at a moment when the AI safety community is debating the practical value of control evaluations. Its demonstration that intuitions from simpler settings (Ctrl-Z) do not always transfer to more realistic ones is a valuable cautionary finding.
5. Strengths & Limitations
Key Strengths:
Key Limitations:
Overall Assessment
This is a solid empirical contribution to the AI control literature that produces practically relevant findings contradicting prior work, with good experimental methodology and appropriate caveats. Its impact is somewhat bounded by being a single-setting study, but the quality of analysis and practical relevance elevate it.
Generated May 26, 2026
Comparison History (19)
Paper 2 addresses a fundamental and highly timely issue in AI safety and alignment, offering actionable insights that challenge existing assumptions about resampling and retrying protocols. Its findings have immediate, broad applicability across AI scaffolding systems and foundation models. While Paper 1 tackles an important public health issue, its impact is somewhat limited by the restricted dataset release and its specific linguistic/geographic focus, making Paper 2's methodological contributions more broadly impactful to the rapidly growing AI field.
Paper 2 addresses a timely and critical problem in AI safety—how to handle potentially adversarial behavior in deployed AI coding agents. It provides concrete, actionable findings (retrying leaks exploitable information, resampling strategies, contradictions with prior work) that directly inform the design of safer AI systems. Its findings challenge established assumptions in the AI control literature, which amplifies its impact. Paper 1 introduces a useful benchmark for studying harness effects on agent performance, which is valuable but more incremental. Paper 2's safety implications give it broader urgency and cross-field relevance.
Paper 2 has higher impact potential due to broader, timely relevance to AI control and deployment safety for real-world coding agents. It identifies a fundamental vulnerability in “retrying” (information leakage via monitor rationales) under adversarial assumptions and offers a principled alternative (resampling) with clear, budgeted empirical gains and ablations that overturn prior findings—suggesting field-wide implications for safety protocols and evaluation. Paper 1 is novel within multi-agent LLM security, but its scope is narrower and impact more specialized to MAS communication defenses.
Paper 2 likely has higher impact due to broader applicability and clearer generality: it addresses a fundamental, well-known problem (multi-turn distribution shift) with both theoretical characterization (quadratic compounding, decomposed sources) and a unified mitigation framework (calibrated interactive RL via simulator alignment). This is timely for deploying interactive LLM agents and can influence RLHF, dialogue systems, simulation-to-real transfer, and evaluation methodology. Paper 1 is novel and practically relevant for AI control in tool-use settings, but its scope is narrower (retry/resample protocols, specific monitoring setup) and may generalize less across domains.
Paper 1 addresses a more critical and timely problem—AI safety and control of potentially adversarial models in deployed coding scaffolds (Claude Code, Codex). It provides rigorous empirical findings that contradict prior work, offers practical design guidance for safety-critical systems, and contributes to the foundational AI control framework. Paper 2 proposes an incremental improvement to LLM routing with a practical but narrower contribution. The safety implications and broader relevance of Paper 1 to the rapidly growing field of AI alignment give it substantially higher potential impact.
Paper 2 has higher potential impact due to its novel framing of “retrying” as an information-leaking control intervention under adversarial models and its concrete, testable alternative (resampling) with decomposed design choices. It provides quantitative results (safety/usefulness, audit budgets) and reports contradictions to prior findings, directly advancing AI safety/control methodology with immediate applicability to widely deployed coding agents. Paper 1 is valuable and timely as a large-scale empirical audit of an A2A ecosystem, but its impact is more diagnostic and platform-specific, with less generalizable methodological contribution.
Paper 2 likely has higher impact: it addresses AI control for agentic coding systems, a timely, broadly relevant safety problem with implications across many deployed tools. It offers clear conceptual contributions (retrying vs resampling, info-leak mechanism), disentangles prior confounds, and reports quantitative gains plus contradictions to influential prior work—suggesting field-shaping potential. Paper 1 is strong and clinically valuable, but its impact is narrower (computational pathology) and more domain-specific, whereas Paper 2’s insights can generalize across monitoring, auditing, and agent design in AI systems.
Paper 1 addresses a critical and timely problem in AI safety—controlling potentially adversarial AI models in real-world coding scaffolds. It provides rigorous methodology with concrete findings that contradict prior work, advancing fundamental understanding of AI control protocols (retrying vs resampling). Its implications span AI safety, alignment, and deployment practices. Paper 2 studies citation dynamics in AI answer engines, which is practically useful for SEO/GEO practitioners but is narrower in scientific scope, more applied/marketing-oriented, and less likely to drive broad scientific discourse or methodological advances.
Paper 1 presents novel empirical findings that directly contradict prior work on AI control resampling strategies, providing concrete, actionable insights for improving AI safety in deployed coding scaffolds. Its methodological rigor—disentangling bundled design choices and providing quantitative safety gains—offers immediate practical value for the rapidly growing AI safety community. Paper 2, while comprehensive as a survey of AI-powered research automation, primarily organizes and taxonomizes existing work rather than generating new empirical knowledge. Surveys have broad but diffuse impact, whereas Paper 1's specific, surprising findings (e.g., max vs min suspicion scoring) are more likely to directly influence system design and future research directions in a critical area.
Paper 1 addresses a more novel and impactful problem at the frontier of AI safety—how to maintain control over potentially adversarial AI systems during deployment. It contributes new theoretical insights (retrying leaks exploitable information, resampling strategies differ from prior work) and contradicts earlier findings, advancing the AI control field. Paper 2, while practically useful, addresses a narrower engineering concern (benchmarking bias from Python's GIL) with limited breadth of scientific impact. AI safety/control is a rapidly growing field with broad implications, giving Paper 1 greater potential for cross-disciplinary influence and long-term relevance.
Paper 1 addresses a fundamental conceptual challenge in AI alignment by distinguishing between process and outcome alignment across diverse real-world contexts (legal and credit decisions). Its interdisciplinary approach bridges AI, organizational behavior, and ethics, offering broader societal and theoretical impact. Paper 2, while methodologically rigorous and practically useful for AI control, focuses on narrower technical mechanisms (retrying vs resampling) and has a more constrained scope of impact.
Paper 1 addresses the highly active and impactful area of self-improving reasoning agents (similar to o1/R1 models). By demonstrating that complex, computationally expensive pipelines can be replaced with a simplified combination of GRPO and offline self-distillation while achieving state-of-the-art results, it offers a highly scalable and accessible training recipe. While Paper 2 provides valuable insights into AI safety mechanisms, Paper 1's methodology is likely to see broader, immediate adoption and drive significant advancements in open-source reasoning models.
Paper 2 likely has higher impact: it targets a major, timely bottleneck (long-context attention cost) with broad relevance to LLM systems, hardware, and future model/training design. It combines a principled argument with extensive multi-model empirical evidence and concrete systems contributions (kernels, up to 10x speedups on H100), enabling immediate real-world deployment and influencing architectures. Paper 1 is novel and important for AI safety/control, but its empirical scope appears narrower (specific monitor/model/task setup) and its applicability may be more contingent on particular scaffolding and monitoring regimes.
Paper 2 demonstrates higher potential impact by uncovering a fundamental vulnerability in Multimodal LLMs (Structural Cognitive Overload) that achieves a 92-97% attack success rate on state-of-the-art models like Gemini. By combining a practical black-box attack framework with deep mechanistic interpretability, it fundamentally challenges current safety alignment paradigms. While Paper 1 offers valuable insights into AI control and sampling strategies, Paper 2 exposes a broader, more critical flaw in multimodal reasoning safety that will likely spur significant, widespread follow-up research in foundational model alignment.
Paper 1 addresses a highly timely and practical problem in AI safety—control of potentially adversarial AI coding agents—directly relevant to deployed systems like Claude Code and Codex. It provides rigorous empirical analysis that contradicts prior work, disentangles confounded design choices, and offers practical, actionable insights (selective resampling). Its findings about retrying leaking exploitable information to adversarial models have broad implications for AI deployment safety. Paper 2 contributes a useful defense framework for safe fine-tuning, but operates in a narrower, more incremental space with less paradigm-shifting potential.
Paper 2 offers broader impact by addressing AI governance and alignment at scale, evaluating major frontier models against their official behavioral specifications. Its multi-method audit pipeline provides a highly relevant, real-world application for AI safety and policy, whereas Paper 1 focuses on a narrower, albeit important, technical mechanism (retrying vs resampling) in AI control.
Paper 1 is more novel and timely for AI control: it directly analyzes a widely deployed safety mechanism (retrying) under adversarial assumptions, identifies an information-leak failure mode, and provides a concrete alternative (resampling) with actionable design guidance and empirical results that overturn prior conclusions. This has immediate real-world applicability to agent safety pipelines and broader implications for oversight, monitoring, and alignment. Paper 2 is rigorous and broadly useful as a systematic evaluation of skill lifecycles, but its primary contribution is integrative/diagnostic rather than a new safety-critical mechanism, likely yielding slower or more incremental downstream impact.
Paper 1 has higher scientific impact potential due to its novel, timely contribution to AI safety/control: it rigorously analyzes retrying vs. resampling under adversarial assumptions, demonstrates concrete exploit pathways (information leakage via monitor rationales), and provides experimentally validated protocol improvements with quantified safety–utility tradeoffs. Its methods and findings are broadly relevant across LLM tooling, alignment, monitoring, and secure deployment, and it directly informs real-world agentic systems design. Paper 2 is largely a clarificatory synthesis of established axiomatic design doctrine with limited methodological novelty and narrower cross-field reach.
Paper 1 offers fundamental methodological contributions to AI safety by formally comparing retrying and resampling in adversarial AI control. It challenges previous findings and provides actionable, empirical safety improvements. In contrast, Paper 2 is primarily an empirical evaluation showing that a specific model (Claude Code) saturates an existing benchmark, which, while useful for tracking progress, offers less novel methodological innovation.