Retrying vs Resampling in AI Control
James Lucassen, Adam Kaufman
Abstract
AI coding scaffolds like Claude Code and Codex use retrying: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains. We also study resampling: drawing multiple samples from the same context, which does not leak exploitable information. We disentangle design choices that previous work on resampling had bundled together. In BashArena, with Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the trusted monitor, drawing five samples per step and auditing on the maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget, at no cost to usefulness. Selectively resampling only the steps that look suspicious on the first draw recovers 6.2 percentage points of the gain while drawing only 10% as many extra samples. Two of our findings in this setting contradict earlier work on resampling. The first is that auditing based on the maximum across resampled suspicion scores outperforms using the minimum, which is the opposite of what Ctrl-Z found. The second is that executing the least suspicious sample, which is the central mechanism in earlier defer-to-resample protocols, gives only a small empirical safety gain in our setting (+3.9 pp, with the confidence interval overlapping zero).
AI Impact Assessments
(1 models)Scientific Impact Assessment: "Retrying vs Resampling in AI Control"
1. Core Contribution
This paper provides a systematic empirical comparison of two strategies for improving safety in AI control protocols: retrying (blocking flagged actions and continuing with feedback) and resampling (drawing multiple independent samples from the same context). The key insight is that retrying leaks exploitable information to a potentially adversarial model, while resampling does not. The paper disentangles design choices previously bundled together in prior work (Ctrl-Z / Bhatt et al., 2025), specifically separating the *informational* value of resampling (better auditing decisions) from the *execution* value (running the least suspicious sample).
The main findings are: (1) retrying with feedback is exploitable by strategic attackers, negating safety gains; (2) resampling improves safety by ~10 pp at 5 samples/step with zero usefulness cost when used purely for information; (3) max aggregation of resample scores outperforms min (contradicting Ctrl-Z); (4) executing the least suspicious sample provides minimal additional safety benefit in their setting.
2. Methodological Rigor
Strengths: The paper is methodologically careful in several ways. It uses hierarchical bootstrapping with paired comparisons and reports 95% confidence intervals throughout. The Stackelberg equilibrium framing (red team moves last with full knowledge) is the correct adversarial model. The auditing model is carefully examined with multiple alternative auditing rules (Appendix B), and results are shown to be robust. The paper also validates its auditing assumptions using a frontier-model auditor (Claude Opus 4.7), finding 91% TPR at FPR=0 with full trajectory context—a useful sanity check.
Concerns: The confidence intervals are sometimes wide, and some key findings have CIs overlapping zero (e.g., the +3.9 pp for resample execution). The attack policies are prompt-engineered rather than fine-tuned or RL-optimized, which limits how "optimal" the red team truly is—the paper acknowledges this but it's a meaningful gap. The paper does not fully explore the attack selection game tree, leaving open whether cautious retrying is net-positive at equilibrium. The use of a single untrusted model (Claude Opus 4) and single trusted monitor (MiMo-V2-Flash) limits generalizability.
3. Potential Impact
Immediate practical relevance: The paper directly addresses mechanisms used in deployed systems (Claude Code auto mode, Codex auto-review). The finding that feedback-based retrying can be exploited is immediately actionable for teams building AI coding scaffolds. The recommendation to use resampling with max aggregation is a concrete, implementable improvement.
Theoretical contribution: The analysis of when max vs. min aggregation is optimal, formalized through cross-resample correlation analysis (Figures 4-5), provides a useful framework. The 2D sweep of honest/attack correlations gives practitioners a way to reason about which aggregation function to use in new settings. The insight that the optimal aggregation depends on the correlation structure of honest vs. attack distributions is generalizable beyond this specific setting.
Broader influence: This work contributes to the growing AI control literature by providing empirical evidence that seemingly beneficial safety mechanisms (retrying with feedback) can be counterproductive against strategic adversaries. This is an important lesson for AI safety engineering more broadly.
4. Timeliness & Relevance
This paper is highly timely. AI coding agents with elevated privileges are being deployed at scale, and the specific mechanisms studied (retrying, monitoring, feedback) are exactly those used in production systems. The AI control framework is gaining traction as a practical approach to managing risks from potentially misaligned AI. The paper uses near-frontier models and a recent control setting (BashArena), making it more relevant than earlier work using older models.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The paper's release of BashArena 1.1, along with linked example trajectories, supports reproducibility. The finding that "k-strikes" history-based scoring *hurts* safety with a capable monitor (contradicting two prior papers) suggests that some earlier findings may not survive the transition to stronger models—an important meta-lesson for the field.
Generated May 27, 2026
Comparison History (24)
Paper 2 introduces a novel optimization framework (RC-DPO) that addresses a fundamental problem—hallucinations in multimodal reasoning models—with broad applicability. It provides a principled theoretical derivation showing why standard DPO fails to leverage CoT supervision, and proposes concrete solutions (MCTS-based data generation, attention-guided pruning). This has wider impact across the rapidly growing multimodal AI field. Paper 1 makes valuable but narrower contributions to AI control/safety protocols, with findings that are setting-specific and sometimes contradictory to prior work, suggesting limited generalizability.
Paper 2 offers broader scientific impact by addressing a fundamental challenge in AI: accurately evaluating LLM reasoning. By introducing the paired-formula protocol and ADR metric, it exposes severe flaws in standard evaluation metrics that falsely reward heuristic guessing. Because Boolean satisfiability is foundational, benchmarking LLMs on representation-invariant SAT problems provides a highly rigorous, universally applicable tool for the AI community. In contrast, while Paper 1 tackles an important alignment issue, its scope is narrower, focusing on specific engineering protocols (retry vs. resample) within specific empirical safety settings.
Paper 1 likely has higher scientific impact: it addresses a timely, central AI control/safety failure mode (information leakage via retrying under adversarial models) and proposes/ablates resampling-based alternatives with concrete, budgeted safety–utility tradeoffs. The work directly informs deployment of agentic coding systems and revises conclusions from prior literature (e.g., Ctrl-Z), suggesting field-shaping potential. Methodologically, it uses realistic agent benchmarks and explicit threat modeling. Paper 2 is solid and useful, but Shapley-based span attribution for uncertainty is a more incremental extension of existing attribution/UQ ideas with narrower cross-field impact.
RULER addresses a fundamental gap in machine unlearning verification—showing that models can pass output-level tests while retaining forgotten information in intermediate representations. This has broad implications across privacy (GDPR right-to-be-forgotten), fairness, and trustworthy AI, with demonstrated applicability across tabular, image, clinical text, and face recognition domains. Paper 2 makes useful contributions to AI control/safety in coding scaffolds but is narrower in scope, more incremental (refining resampling strategies), and setting-specific. RULER's novel metrics and cross-domain validation suggest wider adoption and influence.
Paper 2 likely has higher scientific impact due to broader relevance beyond a single domain: it addresses a general AI control problem (retrying vs resampling) applicable to many agentic systems and safety monitors. It offers mechanistic, adversarially grounded insights (information leakage via monitor rationales), clear experimental results with efficiency tradeoffs, and overturns prior findings—signals of high novelty and field-wide implications. Paper 1 is valuable and timely for clinical governance, but its impact is more domain-specific and improvements appear incremental (3–5% safety reduction).
Paper 1 likely has higher impact due to greater novelty and broader relevance: it analyzes a widely used safety mechanism (retrying) under adversarial/untrusted-model assumptions, identifies an information-leak failure mode, and proposes/design-tests resampling variants with concrete efficiency gains and findings that contradict prior work. This directly informs AI control, agent monitoring, and secure deployment practices across many agentic systems. Paper 2 is a solid, standards-grounded refinement for building-energy DRL reward design with clear applications, but the methodological novelty is more incremental and the impact is narrower to HVAC/energy-management RL.
Paper 1 addresses a critical and timely problem in AI safety—controlling potentially adversarial AI agents—with rigorous empirical methodology that contradicts prior findings, advancing the frontier of AI control research. Its results on retrying vs resampling have broad implications for deployed AI coding scaffolds (Claude Code, Codex) and the growing AI safety field. Paper 2 presents an interesting but narrower contribution combining LLMs with RL for building energy equity, addressing a more niche intersection. Paper 1's findings are more likely to influence a larger research community given the rapid scaling of AI agent deployments.
Paper 2 offers higher scientific impact by addressing the critical and highly timely issue of AI control and safety in autonomous agents. By empirically challenging existing paradigms and demonstrating that retrying exposes exploitable information to adversarial models, it directly impacts the design of secure AI scaffolds. Its proposed selective resampling method provides immediate, practical improvements to AI safety without sacrificing utility, offering broader real-world applicability compared to Paper 1's analytical findings on MLLM explainability.
Paper 1 addresses a timely and high-impact topic—AI safety and control for agentic coding systems—with novel contributions including the distinction between retrying and resampling strategies, identification of information leakage vulnerabilities, and empirical findings that contradict prior work. It has broader implications for AI deployment safety. Paper 2 is a relatively incremental evaluation study on a well-explored topic (LLM math robustness) with non-significant results, limited novelty, and narrower impact.
Paper 2 addresses a grand challenge in AI (self-improvement) by unifying two major, previously disjoint research paradigms: scaffold/harness updates and test-time weight updates. Its demonstration of massive performance gains across highly diverse, real-world domains (law, GPU optimization, and bioinformatics) suggests a much broader applicability and transformative potential. Paper 1 offers valuable, rigorous insights into AI safety and control mechanisms, but its impact is narrower and more specialized compared to the foundational architectural shift proposed in Paper 2.
Paper 1 addresses a critical, broadly relevant issue in AI alignment and governance: whether frontier models actually follow their behavioral specifications under pressure. Its comprehensive audit pipeline and large-scale evaluation across multiple model generations provide highly impactful insights for safety, policy, and model development. Paper 2, while methodologically rigorous and valuable for AI control, focuses on a narrower technical mechanism (retrying vs. resampling) in specific environments, resulting in a more specialized impact compared to the broad implications of Paper 1.
Paper 1 addresses a critical and timely problem in AI safety—controlling potentially adversarial AI agents in real-world coding scaffolds. It provides rigorous empirical findings that contradict prior work, offering actionable design guidance (retrying vs. resampling) for deployed AI systems. The AI control/safety domain has immediate high-stakes applications and broad interest. Paper 2 introduces a useful benchmark for skill formation in LLM agents, but benchmarks tend to have more incremental impact, and its primary finding (that current agents fail at skill distillation) is largely negative, limiting immediate downstream influence.
Palette addresses a fundamental and broadly relevant challenge in LLM safety alignment—moving beyond one-size-fits-all policies to context-dependent safety controls. Its modular, composable framework for domain-specific safety relaxation has wide applicability across professional domains and model types (LLMs and VLMs). Paper 1, while rigorous, focuses on a narrower technical question (retrying vs resampling in AI control) within a specific evaluation setting, and some findings are setting-dependent. Palette's practical framework for adaptive safety alignment has broader impact potential across the AI safety and deployment ecosystem.
Paper 1 addresses a highly timely and practical problem—safety of AI coding agents (Claude Code, Codex)—with rigorous empirical methodology that directly contradicts prior findings, advancing the AI control/safety field. Its concrete, actionable results on retrying vs resampling strategies have immediate applicability to deployed systems. Paper 2 tackles multi-agent misalignment with a Bayesian framework but is more theoretical and incremental, proposing a paradigm (AEA) whose practical impact is less clear. Paper 1's direct relevance to frontier AI safety and its empirical rigor give it higher potential impact.
Paper 2 has higher estimated impact due to broader relevance (RAG is widely deployed), a more general and timely problem framing (multi-turn evidence accumulation), and stronger methodological rigor (50k+ evaluations across multiple model families plus human validation and mechanistic analyses). It identifies a fundamental evaluation failure mode—detection without safe control—likely to influence benchmarks, deployment practices, and future research on alignment and decision-making. Paper 1 is novel and useful for AI control protocols, but is narrower in scope (specific monitoring/resampling design choices and a particular benchmark/stack), making its cross-field impact likely smaller.
Paper 2 addresses a fundamental and timely challenge in AI safety and agentic control, applicable across all domains using LLMs. By challenging prior assumptions about resampling and exposing vulnerabilities in retrying methods, its foundational insights have a broader potential to shape future AI systems compared to Paper 1's domain-specific application in environmental engineering.
Paper 1 addresses the timely and practically important problem of AI safety/control in deployed coding agents, providing actionable findings on retrying vs resampling strategies with concrete empirical results that contradict prior work. Its direct relevance to AI safety infrastructure (Claude Code, Codex) gives it broad real-world applicability. Paper 2 offers a methodologically careful mechanistic interpretability study but yields primarily null/negative results about activation steering, with narrower scope (one cognitive bias in one model). Paper 1's findings are more likely to influence deployed AI safety practices and future AI control research.
Paper 1 offers higher potential scientific impact because it addresses a critical and immediate security vulnerability in autonomous AI agents. By demonstrating how untrusted models can exploit monitor rationale during 'retrying', it highlights a major flaw in current AI scaffolding. Its proposed solution, selective resampling, provides a highly practical, rigorously tested defense mechanism for AI alignment. While Paper 2 introduces a valuable evaluation benchmark for Theory of Mind, Paper 1's findings have more urgent real-world applications for safely deploying agentic AI systems, making its contributions highly timely and critical to the AI safety community.
Paper 1 likely has higher scientific impact because it targets a broadly relevant, timely AI safety/control question (how to safely use powerful, potentially adversarial models) and provides clear, testable design insights that can generalize across LLM-based agents and monitoring pipelines. It makes a novel control-theoretic critique of “retrying” (information leakage via monitor rationales) and carefully disentangles resampling variants, reporting counter-results to prior work with quantified tradeoffs (safety vs audit budget/usefulness). Paper 2 is useful and application-driven but more domain-specific and engineering-heavy, with impact centered on supply chains.
Paper 1 addresses a highly urgent problem in AI safety: securing autonomous LLM agents against adversarial exploits. By revealing that standard retry mechanisms leak exploitable information and proposing a robust resampling alternative, it offers immediate, actionable insights for real-world AI deployment. Furthermore, correcting previous assumptions in the field enhances its methodological impact. While Paper 2 offers solid contributions to human-AI collaboration in reinforcement learning, Paper 1's direct relevance to the booming field of LLM coding agents and AI control gives it a higher potential for immediate, broad scientific and practical impact.