Retrying vs Resampling in AI Control

James Lucassen, Adam Kaufman

May 25, 2026

Frozen v1: this version was superseded on arXiv by v2. Stats below reflect the state at freeze time and will not change.

#657 of 2682 · Artificial Intelligence

Tournament Score: 1463 ± 42

Win Rate: 68%

Rating: 7/10

Significance 7.5 · Rigor 7.5 · Novelty 6.5 · Clarity 7.5

Abstract

AI coding scaffolds like Claude Code and Codex use *retrying*: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains. We also study *resampling*: drawing multiple samples from the same context, which does not leak exploitable information. We disentangle design choices that previous work on resampling had bundled together. In BashArena, with Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the trusted monitor, drawing five samples per step and auditing on the maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget, at no cost to usefulness. Selectively resampling only the steps that look suspicious on the first draw recovers 6.2 percentage points of the gain while drawing only 10% as many extra samples. Two of our findings in this setting contradict earlier work on resampling. The first is that auditing based on the maximum across resampled suspicion scores outperforms using the minimum, which is the opposite of what Ctrl-Z found. The second is that executing the least suspicious sample, which is the central mechanism in earlier defer-to-resample protocols, gives only a small empirical safety gain in our setting (+3.9 pp, with the confidence interval overlapping zero).

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Retrying vs Resampling in AI Control"

1. Core Contribution

This paper systematically compares two approaches for improving safety in AI coding scaffolds under the AI control framework: retrying (blocking risky actions and continuing with feedback) and resampling (drawing multiple independent samples from the same context). The key insight is that retrying with feedback leaks exploitable information to a strategic adversary — the monitor's rationale enables the untrusted model to craft stealthier attacks, negating safety gains. Resampling, by contrast, leaks no information and can improve safety by ~10 percentage points at no usefulness cost.

The paper also disentangles previously bundled design choices in resampling protocols, producing two findings that contradict prior work (Ctrl-Z/Bhatt et al. 2025): (1) max aggregation outperforms min for auditing decisions, and (2) executing the least suspicious sample provides only marginal safety gains. The authors provide a principled analysis of *why* these findings differ from prior work, rooted in cross-resample correlation structures of honest vs. attack suspicion score distributions.
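To make the two design choices discussed above concrete, here is a minimal Python sketch of a single resampling step. It is an illustration under stated assumptions, not the authors' implementation: the function name `resample_step`, its parameters, and the `execute` switch are all hypothetical. The `aggregate` argument controls which statistic of the resampled suspicion scores drives the audit decision (the information effect), while `execute` controls which sample is actually run (the execution effect), so the two can be varied independently.

```python
from typing import Callable, List, Tuple


def resample_step(
    sample_action: Callable[[], str],
    suspicion: Callable[[str], float],
    audit_threshold: float,
    k: int = 5,
    aggregate: Callable[[List[float]], float] = max,
    execute: str = "first",
) -> Tuple[str, float, bool]:
    """Draw k actions from the same context and decide whether to audit.

    All names here are hypothetical. `sample_action` stands in for the
    untrusted model and `suspicion` for the trusted monitor.
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    # Every draw sees the same context, so no monitor feedback leaks
    # into later samples (unlike retrying with feedback).
    actions = [sample_action() for _ in range(k)]
    scores = [suspicion(a) for a in actions]

    # Information effect: the aggregated score drives the audit decision.
    aggregated = aggregate(scores)
    audit = aggregated >= audit_threshold

    # Execution effect: which sample is run is a separate choice.
    if execute == "first":
        chosen = actions[0]
    elif execute == "least_suspicious":
        chosen = actions[scores.index(min(scores))]
    else:
        raise ValueError(f"unknown execute mode: {execute!r}")

    return chosen, aggregated, audit
```

Passing `aggregate=min` gives the min-based audit rule, and `execute="least_suspicious"` gives the execution rule the review attributes to earlier defer-to-resample protocols. Which combination works best is an empirical question that this sketch does not answer.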

2. Methodological Rigor

The experimental methodology is thorough and well-considered:

Stackelberg equilibrium framing: The paper correctly models the blue team/red team interaction as a Stackelberg game, where the red team moves last with full knowledge of the protocol. This is the appropriate adversarial model.

Statistical rigor: Hierarchical bootstrapping with paired comparisons and 95% CIs are used throughout. The authors are careful to note when CIs overlap zero.

Updated evaluation setting: Using Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the trusted monitor on BashArena 1.1 represents a more frontier-relevant evaluation than prior work's GPT-4o-mini/Llama-3.1-8B.

Extensive ablations: The paper includes sweeps over audit budgets, aggregation functions, cost-latency tradeoffs, and artificial correlation structures. The synthetic correlation experiments (Figures 4-5) are particularly illuminating for understanding when max vs. min aggregation is preferred.

Auditing model validation: Appendix B carefully addresses incrimination concerns, showing results are robust to stricter auditing rules and calibrating with a frontier-model auditor (91% TPR at FPR=0).

Notable limitations in rigor: The attack policies are prompted rather than fine-tuned, which may underestimate adversarial capability. The red team exploration is acknowledged as incomplete — particularly for attack selection strategies against cautious retrying protocols, where the authors explicitly defer analysis. The paper also doesn't fully explore whether more sophisticated attack adaptation strategies could overcome resampling benefits.
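As a rough illustration of the hierarchical bootstrap with paired comparisons mentioned under statistical rigor, the sketch below resamples at two levels and reports a percentile interval for the mean paired difference. The two-level hierarchy (tasks, then trajectories within a task), the function name, and the data layout are assumptions made for this example; the review does not specify the paper's exact procedure.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[float, float]  # (baseline outcome, protocol outcome) for one trajectory


def paired_hierarchical_bootstrap(
    data: Dict[str, List[Pair]],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for the mean paired difference (protocol minus baseline).

    Assumed hierarchy: resample tasks with replacement, then resample
    paired trajectories within each chosen task with replacement.
    """
    rng = random.Random(seed)
    tasks = list(data)
    diffs = []
    for _ in range(n_boot):
        total, count = 0.0, 0
        for task in rng.choices(tasks, k=len(tasks)):
            pairs = data[task]
            # Keeping each pair together preserves the paired comparison.
            for base, proto in rng.choices(pairs, k=len(pairs)):
                total += proto - base
                count += 1
        diffs.append(total / count)

    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

If the returned interval contains zero, the difference is not distinguishable from zero at the chosen level, which is the situation the review describes when it says a CI overlaps zero.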

3. Potential Impact

Immediate practical relevance: The findings directly inform the design of production AI coding scaffolds. Claude Code's auto mode and OpenAI's Codex both use retrying with feedback — this paper provides evidence that this design choice may be counterproductive from a safety perspective. This could influence how major AI labs design their deployment infrastructure.

Theoretical contributions: The cross-resample correlation analysis provides a general framework for understanding when different aggregation strategies are appropriate. The insight that the optimal aggregation function depends on the *relative* correlation structures of honest and attack distributions (not just one in isolation) is broadly applicable.

Efficiency insights: The finding that two-stage selective resampling recovers 6.2 pp of the 10.2 pp safety gain while drawing only 10% as many extra samples is practically important for deployment cost management.
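The two-stage idea in the efficiency point above can be sketched as follows. This is a hedged illustration: `two_stage_step`, `resample_threshold`, and `fraction_recovered` are hypothetical names, and using the maximum as the aggregate is an assumption carried over from the headline protocol. The step draws one sample, and only if that first draw looks suspicious does it pay for the remaining `k - 1` draws.

```python
from typing import Callable, Tuple


def two_stage_step(
    sample_action: Callable[[], str],
    suspicion: Callable[[str], float],
    resample_threshold: float,
    k: int = 5,
) -> Tuple[str, float, int]:
    """Resample only when the first draw looks suspicious.

    Returns (first action, max suspicion seen, number of extra samples drawn).
    """
    first = sample_action()
    scores = [suspicion(first)]

    # Stage two is triggered only by a suspicious first draw, so most
    # steps cost a single sample.
    if scores[0] >= resample_threshold:
        scores += [suspicion(sample_action()) for _ in range(k - 1)]

    return first, max(scores), len(scores) - 1


def fraction_recovered(partial_gain_pp: float, full_gain_pp: float) -> float:
    """Share of the full safety gain retained by the cheaper protocol."""
    return partial_gain_pp / full_gain_pp
```

With the figures quoted above, `fraction_recovered(6.2, 10.2)` is about 0.61, so roughly three fifths of the gain is retained for about a tenth of the extra sampling cost.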

4. Timeliness & Relevance

This paper is highly timely. AI coding agents are being deployed at scale with exactly the retrying+feedback mechanisms this paper critiques. The AI control framework is gaining traction as a practical approach to AI safety, and this paper fills an important gap by providing rigorous empirical analysis of common design choices.

The paper also arrives at a moment when the AI safety community is debating the practical value of control evaluations — showing that intuitions from simpler settings (Ctrl-Z) don't always transfer to more realistic ones is a valuable cautionary finding.

5. Strengths & Limitations

Key Strengths:

*Actionable findings*: Clear, specific recommendations backed by empirical evidence

*Contradiction of prior work*: The reversal of max vs. min aggregation findings, with a principled explanation, advances understanding significantly

*Clean experimental design*: Disentangling bundled design choices (information vs. execution effects of resampling) is methodologically sound

*Cost-aware analysis*: The cost-latency-safety tradeoff analysis makes findings practically deployable

*Transparency*: Links to example trajectories, detailed appendices with attack prompts, and extensive ablations

Key Limitations:

*Single setting*: Results are from BashArena with one untrusted/trusted model pair. The authors acknowledge that correlation structures (and therefore optimal aggregation) may differ across settings

*Incomplete red teaming*: Attack selection strategies against cautious retrying are deferred; the attack policy is prompt-based rather than fine-tuned or RL-optimized

*Narrow audit budget*: While appendices show budget sensitivity, the headline 0.3% budget is somewhat arbitrary

*Limited model diversity*: Only one trusted monitor (MiMo-V2-Flash) is evaluated; monitor choice likely affects correlation structures significantly

*No formal theoretical analysis*: The correlation-based reasoning is empirically supported but lacks formal guarantees about when findings generalize
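For readers unfamiliar with how an audit budget such as the 0.3% figure in the list above turns into a concrete decision rule, here is a minimal sketch. It assumes the budget is the fraction of honest steps that may be flagged for audit, which is a common convention in AI control work but is an assumption here rather than something the review states. The function names are hypothetical.

```python
import math
from typing import List


def calibrate_audit_threshold(honest_scores: List[float], budget: float) -> float:
    """Pick a suspicion threshold so that about `budget` of honest steps are audited.

    Assumes the budget is defined over honest steps. Returns infinity when
    the budget is too small to audit even one step in the sample.
    """
    # The small epsilon guards against floating-point error in budget * n.
    n_allowed = math.floor(budget * len(honest_scores) + 1e-9)
    if n_allowed == 0:
        return math.inf
    ranked = sorted(honest_scores, reverse=True)
    # Ties at the threshold can push the audited fraction slightly over budget.
    return ranked[n_allowed - 1]


def audited_fraction(scores: List[float], threshold: float) -> float:
    """Fraction of steps whose score meets or exceeds the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)
```

Sweeping `budget` over a range and recomputing safety at each threshold is one way to check how sensitive a headline number is to the budget that was chosen.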

Overall Assessment

This is a solid empirical contribution to the AI control literature that produces practically relevant findings contradicting prior work, with good experimental methodology and appropriate caveats. Its impact is somewhat bounded by being a single-setting study, but the quality of analysis and practical relevance elevate it.

Rating: 7/10

Significance 7.5 · Rigor 7.5 · Novelty 6.5 · Clarity 7.5

Generated May 26, 2026

Comparison History (19)

vs. SuiChat-CN: Benchmarking Contextual Suicide Risk Assessment in Chinese Group Chats

gemini-3.1 · 5/28/2026

Paper 2 addresses a fundamental and highly timely issue in AI safety and alignment, offering actionable insights that challenge existing assumptions about resampling and retrying protocols. Its findings have immediate, broad applicability across AI scaffolding systems and foundation models. While Paper 1 tackles an important public health issue, its impact is somewhat limited by the restricted dataset release and its specific linguistic/geographic focus, making Paper 2's methodological contributions more broadly impactful to the rapidly growing AI field.

vs. Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a timely and critical problem in AI safety—how to handle potentially adversarial behavior in deployed AI coding agents. It provides concrete, actionable findings (retrying leaks exploitable information, resampling strategies, contradictions with prior work) that directly inform the design of safer AI systems. Its findings challenge established assumptions in the AI control literature, which amplifies its impact. Paper 1 introduces a useful benchmark for studying harness effects on agent performance, which is valuable but more incremental. Paper 2's safety implications give it broader urgency and cross-field relevance.

vs. Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification

gpt-5.2 · 5/28/2026

Paper 2 has higher impact potential due to broader, timely relevance to AI control and deployment safety for real-world coding agents. It identifies a fundamental vulnerability in “retrying” (information leakage via monitor rationales) under adversarial assumptions and offers a principled alternative (resampling) with clear, budgeted empirical gains and ablations that overturn prior findings—suggesting field-wide implications for safety protocols and evaluation. Paper 1 is novel within multi-agent LLM security, but its scope is narrower and impact more specialized to MAS communication defenses.

vs. From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

gpt-5.2 · 5/27/2026

Paper 2 likely has higher impact due to broader applicability and clearer generality: it addresses a fundamental, well-known problem (multi-turn distribution shift) with both theoretical characterization (quadratic compounding, decomposed sources) and a unified mitigation framework (calibrated interactive RL via simulator alignment). This is timely for deploying interactive LLM agents and can influence RLHF, dialogue systems, simulation-to-real transfer, and evaluation methodology. Paper 1 is novel and practically relevant for AI control in tool-use settings, but its scope is narrower (retry/resample protocols, specific monitoring setup) and may generalize less across domains.

vs. Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a more critical and timely problem—AI safety and control of potentially adversarial models in deployed coding scaffolds (Claude Code, Codex). It provides rigorous empirical findings that contradict prior work, offers practical design guidance for safety-critical systems, and contributes to the foundational AI control framework. Paper 2 proposes an incremental improvement to LLM routing with a practical but narrower contribution. The safety implications and broader relevance of Paper 1 to the rapidly growing field of AI alignment give it substantially higher potential impact.

vs. Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

gpt-5.2 · 5/26/2026

Paper 2 has higher potential impact due to its novel framing of “retrying” as an information-leaking control intervention under adversarial models and its concrete, testable alternative (resampling) with decomposed design choices. It provides quantitative results (safety/usefulness, audit budgets) and reports contradictions to prior findings, directly advancing AI safety/control methodology with immediate applicability to widely deployed coding agents. Paper 1 is valuable and timely as a large-scale empirical audit of an A2A ecosystem, but its impact is more diagnostic and platform-specific, with less generalizable methodological contribution.

vs. ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact: it addresses AI control for agentic coding systems, a timely, broadly relevant safety problem with implications across many deployed tools. It offers clear conceptual contributions (retrying vs resampling, info-leak mechanism), disentangles prior confounds, and reports quantitative gains plus contradictions to influential prior work—suggesting field-shaping potential. Paper 1 is strong and clinically valuable, but its impact is narrower (computational pathology) and more domain-specific, whereas Paper 2’s insights can generalize across monitoring, auditing, and agent design in AI systems.

vs. What Gets Cited: Competitive GEO in AI Answer Engines

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a critical and timely problem in AI safety—controlling potentially adversarial AI models in real-world coding scaffolds. It provides rigorous methodology with concrete findings that contradict prior work, advancing fundamental understanding of AI control protocols (retrying vs resampling). Its implications span AI safety, alignment, and deployment practices. Paper 2 studies citation dynamics in AI answer engines, which is practically useful for SEO/GEO practitioners but is narrower in scientific scope, more applied/marketing-oriented, and less likely to drive broad scientific discourse or methodological advances.

vs. AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

claude-opus-4.6 · 5/26/2026

Paper 1 presents novel empirical findings that directly contradict prior work on AI control resampling strategies, providing concrete, actionable insights for improving AI safety in deployed coding scaffolds. Its methodological rigor—disentangling bundled design choices and providing quantitative safety gains—offers immediate practical value for the rapidly growing AI safety community. Paper 2, while comprehensive as a survey of AI-powered research automation, primarily organizes and taxonomizes existing work rather than generating new empirical knowledge. Surveys have broad but diffuse impact, whereas Paper 1's specific, surprising findings (e.g., max vs min suspicion scoring) are more likely to directly influence system design and future research directions in a critical area.

vs. Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a more novel and impactful problem at the frontier of AI safety—how to maintain control over potentially adversarial AI systems during deployment. It contributes new theoretical insights (retrying leaks exploitable information, resampling strategies differ from prior work) and contradicts earlier findings, advancing the AI control field. Paper 2, while practically useful, addresses a narrower engineering concern (benchmarking bias from Python's GIL) with limited breadth of scientific impact. AI safety/control is a rapidly growing field with broad implications, giving Paper 1 greater potential for cross-disciplinary influence and long-term relevance.

vs. Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts

gemini-3.1 · 5/26/2026

Paper 1 addresses a fundamental conceptual challenge in AI alignment by distinguishing between process and outcome alignment across diverse real-world contexts (legal and credit decisions). Its interdisciplinary approach bridges AI, organizational behavior, and ethics, offering broader societal and theoretical impact. Paper 2, while methodologically rigorous and practically useful for AI control, focuses on narrower technical mechanisms (retrying vs resampling) and has a more constrained scope of impact.

vs. Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

gemini-3.1 · 5/26/2026

Paper 1 addresses the highly active and impactful area of self-improving reasoning agents (similar to o1/R1 models). By demonstrating that complex, computationally expensive pipelines can be replaced with a simplified combination of GRPO and offline self-distillation while achieving state-of-the-art results, it offers a highly scalable and accessible training recipe. While Paper 2 provides valuable insights into AI safety mechanisms, Paper 1's methodology is likely to see broader, immediate adoption and drive significant advancements in open-source reasoning models.

vs. Inference Time Context Sparsity: Illusion or Opportunity?

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact: it targets a major, timely bottleneck (long-context attention cost) with broad relevance to LLM systems, hardware, and future model/training design. It combines a principled argument with extensive multi-model empirical evidence and concrete systems contributions (kernels, up to 10x speedups on H100), enabling immediate real-world deployment and influencing architectures. Paper 1 is novel and important for AI safety/control, but its empirical scope appears narrower (specific monitor/model/task setup) and its applicability may be more contingent on particular scaffolding and monitoring regimes.

vs. StructBreak: Structural Cognitive Overload-Induced Safety Failures in MLLMs

gemini-3.1 · 5/26/2026

Paper 2 demonstrates higher potential impact by uncovering a fundamental vulnerability in Multimodal LLMs (Structural Cognitive Overload) that achieves a 92-97% attack success rate on state-of-the-art models like Gemini. By combining a practical black-box attack framework with deep mechanistic interpretability, it fundamentally challenges current safety alignment paradigms. While Paper 1 offers valuable insights into AI control and sampling strategies, Paper 2 exposes a broader, more critical flaw in multimodal reasoning safety that will likely spur significant, widespread follow-up research in foundational model alignment.

vs. Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a highly timely and practical problem in AI safety—control of potentially adversarial AI coding agents—directly relevant to deployed systems like Claude Code and Codex. It provides rigorous empirical analysis that contradicts prior work, disentangles confounded design choices, and offers practical, actionable insights (selective resampling). Its findings about retrying leaking exploitable information to adversarial models have broad implications for AI deployment safety. Paper 2 contributes a useful defense framework for safe fine-tuning, but operates in a narrower, more incremental space with less paradigm-shifting potential.

vs. How Well Do Models Follow Their Constitutions?

gemini-3.1 · 5/26/2026

Paper 2 offers broader impact by addressing AI governance and alignment at scale, evaluating major frontier models against their official behavioral specifications. Its multi-method audit pipeline provides a highly relevant, real-world application for AI safety and policy, whereas Paper 1 focuses on a narrower, albeit important, technical mechanism (retrying vs resampling) in AI control.

vs. From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

gpt-5.2 · 5/26/2026

Paper 1 is more novel and timely for AI control: it directly analyzes a widely deployed safety mechanism (retrying) under adversarial assumptions, identifies an information-leak failure mode, and provides a concrete alternative (resampling) with actionable design guidance and empirical results that overturn prior conclusions. This has immediate real-world applicability to agent safety pipelines and broader implications for oversight, monitoring, and alignment. Paper 2 is rigorous and broadly useful as a systematic evaluation of skill lifecycles, but its primary contribution is integrative/diagnostic rather than a new safety-critical mechanism, likely yielding slower or more incremental downstream impact.

vs. A Deep Dive into Axiomatic Design -- Part I: Problem Formulation

gpt-5.2 · 5/26/2026

Paper 1 has higher scientific impact potential due to its novel, timely contribution to AI safety/control: it rigorously analyzes retrying vs. resampling under adversarial assumptions, demonstrates concrete exploit pathways (information leakage via monitor rationales), and provides experimentally validated protocol improvements with quantified safety–utility tradeoffs. Its methods and findings are broadly relevant across LLM tooling, alignment, monitoring, and secure deployment, and it directly informs real-world agentic systems design. Paper 2 is largely a clarificatory synthesis of established axiomatic design doctrine with limited methodological novelty and narrower cross-field reach.

vs. Agentic Proving for Program Verification

gemini-3.1 · 5/26/2026

Paper 1 offers fundamental methodological contributions to AI safety by formally comparing retrying and resampling in adversarial AI control. It challenges previous findings and provides actionable, empirical safety improvements. In contrast, Paper 2 is primarily an empirical evaluation showing that a specific model (Claude Code) saturates an existing benchmark, which, while useful for tracking progress, offers less novel methodological innovation.