Retrying vs Resampling in AI Control

James Lucassen, Adam Kaufman

#973 of 2682 · Artificial Intelligence
Tournament Score: 1440±42
Win Rate: 58% (14 wins, 10 losses, 24 matches)
Rating: 7/10

Abstract

AI coding scaffolds like Claude Code and Codex use retrying: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains. We also study resampling: drawing multiple samples from the same context, which does not leak exploitable information. We disentangle design choices that previous work on resampling had bundled together. In BashArena, with Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the trusted monitor, drawing five samples per step and auditing on the maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget, at no cost to usefulness. Selectively resampling only the steps that look suspicious on the first draw recovers 6.2 percentage points of the gain while drawing only 10% as many extra samples. Two of our findings in this setting contradict earlier work on resampling. The first is that auditing based on the maximum across resampled suspicion scores outperforms using the minimum, which is the opposite of what Ctrl-Z found. The second is that executing the least suspicious sample, which is the central mechanism in earlier defer-to-resample protocols, gives only a small empirical safety gain in our setting (+3.9 pp, with the confidence interval overlapping zero).

AI Impact Assessments

(1 models)

Scientific Impact Assessment: "Retrying vs Resampling in AI Control"

1. Core Contribution

This paper provides a systematic empirical comparison of two strategies for improving safety in AI control protocols: retrying (blocking flagged actions and continuing with feedback) and resampling (drawing multiple independent samples from the same context). The key insight is that retrying leaks exploitable information to a potentially adversarial model, while resampling does not. The paper disentangles design choices previously bundled together in prior work (Ctrl-Z / Bhatt et al., 2025), specifically separating the *informational* value of resampling (better auditing decisions) from the *execution* value (running the least suspicious sample).

The main findings are: (1) retrying with feedback is exploitable by strategic attackers, negating safety gains; (2) resampling improves safety by ~10 pp at 5 samples/step with zero usefulness cost when used purely for information; (3) max aggregation of resample scores outperforms min (contradicting Ctrl-Z); (4) executing the least suspicious sample provides minimal additional safety benefit in their setting.
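To make findings (2) and (3) concrete, here is a minimal Python sketch of auditing on an aggregate of resampled suspicion scores. The function names, the toy scores, and the 25% toy budget are illustrative assumptions, not the paper's code or data. The threshold is calibrated on honest scores so that at most the budgeted fraction of honest steps is audited.

```python
from typing import Callable, Sequence

Scores = Sequence[float]


def aggregate_steps(step_samples: Sequence[Scores],
                    how: Callable[[Scores], float] = max) -> list[float]:
    """Collapse the k resampled suspicion scores at each step into one score."""
    return [how(samples) for samples in step_samples]


def audit_threshold(honest_scores: Scores, audit_budget: float) -> float:
    """Threshold such that at most `audit_budget` of honest steps score strictly above it."""
    ranked = sorted(honest_scores, reverse=True)
    # Number of honest steps we can afford to audit (epsilon guards float rounding).
    n_allowed = int(audit_budget * len(ranked) + 1e-9)
    if n_allowed >= len(ranked):
        return float("-inf")
    return ranked[n_allowed]


def flagged(step_samples: Sequence[Scores], threshold: float,
            how: Callable[[Scores], float] = max) -> list[bool]:
    """A step is audited when its aggregated score exceeds the threshold."""
    return [score > threshold for score in aggregate_steps(step_samples, how)]


# Toy data (hypothetical): four honest steps and one attack step, three draws each.
honest = [[1.0, 2.0, 1.5], [2.0, 2.5, 1.0], [0.5, 1.0, 3.0], [1.0, 1.0, 1.0]]
attack = [[0.5, 8.0, 1.0]]  # only one draw of the attack step looks suspicious

for how in (max, min):
    t = audit_threshold(aggregate_steps(honest, how), audit_budget=0.25)
    print(how.__name__, t, flagged(attack, t, how))
# max 2.5 [True]
# min 1.0 [False]
```

In this contrived example the single suspicious draw trips the max rule and is hidden by the min rule. Which rule wins on real data depends on the setting, which is consistent with the paper reaching a different conclusion from Ctrl-Z.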

2. Methodological Rigor

Strengths: The paper is methodologically careful in several ways. It uses hierarchical bootstrapping with paired comparisons and reports 95% confidence intervals throughout. The Stackelberg equilibrium framing (red team moves last with full knowledge) is the correct adversarial model. The auditing model is carefully examined with multiple alternative auditing rules (Appendix B), and results are shown to be robust. The paper also validates its auditing assumptions using a frontier-model auditor (Claude Opus 4.7), finding 91% TPR at FPR=0 with full trajectory context—a useful sanity check.
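For readers unfamiliar with the statistical method named above, the following is a minimal sketch of a paired hierarchical bootstrap. It assumes two levels (tasks, then trajectories within a task) and a percentile interval; the paper's exact resampling levels and statistic are not described in this review, so the structure, the function name, and the 0/1 outcome encoding are assumptions.

```python
import random
from statistics import mean


def paired_hierarchical_bootstrap(tasks, n_boot=2000, alpha=0.05, seed=0):
    """Estimate mean(B) - mean(A) with a percentile confidence interval.

    tasks: list of (outcomes_a, outcomes_b) pairs, one pair per task, where each
    element is a non-empty list of 0/1 outcomes under protocol A or protocol B.
    """
    rng = random.Random(seed)

    def diff(sample):
        # Average within each task first, then across tasks.
        a = mean(mean(task_a) for task_a, _ in sample)
        b = mean(mean(task_b) for _, task_b in sample)
        return b - a

    point = diff(tasks)
    stats = []
    for _ in range(n_boot):
        # Level 1: resample tasks. The same drawn tasks feed both protocols,
        # which is what makes the comparison paired.
        drawn = rng.choices(tasks, k=len(tasks))
        # Level 2: resample trajectories within each drawn task.
        resampled = [
            (rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
            for a, b in drawn
        ]
        stats.append(diff(resampled))

    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi


# Hypothetical outcomes: 1 = safe trajectory, 0 = unsafe trajectory.
example = [([0, 1, 0], [1, 1, 0]), ([1, 0], [1, 1]), ([0, 0, 1, 0], [0, 1, 1, 0])]
point, lo, hi = paired_hierarchical_bootstrap(example)
print(f"difference = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Pairing on tasks removes between-task variance from the comparison, which is why a paired interval can be much tighter than the difference of two independent intervals.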

Concerns: The confidence intervals are sometimes wide, and some key findings have CIs overlapping zero (e.g., the +3.9 pp for resample execution). The attack policies are prompt-engineered rather than fine-tuned or RL-optimized, which limits how "optimal" the red team truly is; the paper acknowledges this, but it is a meaningful gap. The paper does not fully explore the attack selection game tree, leaving open whether cautious retrying is net-positive at equilibrium. The use of a single untrusted model (Claude Opus 4.6) and a single trusted monitor (MiMo-V2-Flash) limits generalizability.

3. Potential Impact

Immediate practical relevance: The paper directly addresses mechanisms used in deployed systems (Claude Code auto mode, Codex auto-review). The finding that feedback-based retrying can be exploited is immediately actionable for teams building AI coding scaffolds. The recommendation to use resampling with max aggregation is a concrete, implementable improvement.

Theoretical contribution: The analysis of when max vs. min aggregation is optimal, formalized through cross-resample correlation analysis (Figures 4-5), provides a useful framework. The 2D sweep of honest/attack correlations gives practitioners a way to reason about which aggregation function to use in new settings. The insight that the optimal aggregation depends on the correlation structure of honest vs. attack distributions is generalizable beyond this specific setting.
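The dependence on correlation structure can be illustrated with a small synthetic sweep. This is a Gaussian toy model written for this review, not the paper's procedure behind Figures 4-5: each step's k scores share a latent component with correlation `rho`, attack scores are shifted upward by `shift`, and the audit threshold is set at a fixed honest false-positive rate. All parameter values are assumptions.

```python
import math
import random


def sample_steps(n, k, rho, shift, rng):
    """Draw n steps of k scores each; scores within a step have pairwise correlation rho (0 <= rho <= 1)."""
    steps = []
    for _ in range(n):
        shared = rng.gauss(0, 1)  # component common to all k resamples of this step
        steps.append([
            shift + math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0, 1)
            for _ in range(k)
        ])
    return steps


def catch_rate(honest, attack, how, fpr):
    """Fraction of attack steps whose aggregated score exceeds the honest threshold."""
    ranked = sorted((how(s) for s in honest), reverse=True)
    threshold = ranked[int(fpr * len(ranked))]
    return sum(how(s) > threshold for s in attack) / len(attack)


def sweep(rho_honest, rho_attack, k=5, shift=2.0, fpr=0.01, n=5000, seed=0):
    """Compare max and min aggregation for one (honest, attack) correlation pair."""
    rng = random.Random(seed)
    honest = sample_steps(n, k, rho_honest, 0.0, rng)
    attack = sample_steps(n, k, rho_attack, shift, rng)
    return {f.__name__: catch_rate(honest, attack, f, fpr) for f in (max, min)}


# Two opposite corners of the correlation grid.
print("honest correlated, attack not:", sweep(rho_honest=0.9, rho_attack=0.1))
print("attack correlated, honest not:", sweep(rho_honest=0.1, rho_attack=0.9))
```

In this toy model, max tends to win when honest scores are strongly correlated across resamples and attack scores are not (extra draws rarely raise the honest maximum but often raise the attack maximum), and min tends to win in the opposite corner. Real score distributions need not be Gaussian, so this only illustrates why the optimal aggregation can flip between settings.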

Broader influence: This work contributes to the growing AI control literature by providing empirical evidence that seemingly beneficial safety mechanisms (retrying with feedback) can be counterproductive against strategic adversaries. This is an important lesson for AI safety engineering more broadly.

4. Timeliness & Relevance

This paper is highly timely. AI coding agents with elevated privileges are being deployed at scale, and the specific mechanisms studied (retrying, monitoring, feedback) are exactly those used in production systems. The AI control framework is gaining traction as a practical approach to managing risks from potentially misaligned AI. The paper uses near-frontier models and a recent control setting (BashArena), making it more relevant than earlier work using older models.

5. Strengths & Limitations

Key Strengths:

  • Decomposition of mechanisms: Cleanly separating resampling-for-information from resample-execution is a valuable analytical contribution that clarifies when and why resampling helps.
  • Contradictions with prior work are well-explained: The paper doesn't just note disagreements with Ctrl-Z—it provides a principled explanation (cross-resample correlation structure) and validates it with synthetic experiments.
  • Cost-efficiency analysis: The two-stage and incremental resampling protocols show that 62% of the safety gain is recoverable at 10% of the cost, making the approach practical.
  • Extensive appendices: The paper provides detailed examples, sensitivity analyses, and alternative auditing rules that substantially strengthen the main claims.
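The cost-efficiency point can be sketched as a two-stage rule: score the first draw, and only draw more samples when that first score looks suspicious. The sketch below assumes pre-drawn samples per step (first element is the first draw), a hypothetical `trigger` threshold, and max aggregation over all draws once resampling is triggered; it is not the paper's implementation.

```python
from typing import Sequence


def two_stage(step_samples: Sequence[Sequence[float]], trigger: float):
    """Return (per-step scores, fraction of the extra samples actually drawn)."""
    scores = []
    extra_used = 0
    extra_possible = 0
    for samples in step_samples:
        first, rest = samples[0], samples[1:]
        extra_possible += len(rest)
        if first > trigger:
            # Suspicious first draw: pay for the remaining draws and take the max.
            extra_used += len(rest)
            scores.append(max(samples))
        else:
            # Unremarkable first draw: keep its score and draw nothing more.
            scores.append(first)
    fraction = extra_used / extra_possible if extra_possible else 0.0
    return scores, fraction


# Toy data (hypothetical): ten steps with five draws each; one first draw is suspicious.
steps = [[1.0, 2.0, 3.0, 1.0, 1.0]] * 9 + [[6.0, 9.0, 2.0, 1.0, 3.0]]
scores, fraction = two_stage(steps, trigger=5.0)
print(scores[-1], fraction)  # 9.0 0.1

# Arithmetic behind the 62% figure, using the abstract's numbers:
full_gain_pp = 71 - 61          # always resampling: 61% -> 71% safety
selective_gain_pp = 6.2         # selective resampling
print(selective_gain_pp / full_gain_pp)  # roughly 0.62
```

The trade-off is that a step whose first draw happens to look benign is never resampled, so the trigger should sit well below the audit threshold.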
Notable Limitations:

  • Single setting: All results are from BashArena with one model pair. The paper explicitly notes its findings may not generalize, but this limits the strength of prescriptive claims.
  • Red team optimality: The attack policy is prompt-engineered with general instructions; a truly optimal attacker might behave differently. The paper doesn't explore RL-trained or human-designed attack strategies.
  • Attack selection left open: The game-theoretic analysis of attack selection (Appendix G) is acknowledged as incomplete. This is arguably the most important strategic consideration for real deployments.
  • No cost model for auditing: The paper assumes a fixed audit budget without modeling the actual cost of auditing or the consequences of false positives/negatives beyond the game framework.
Additional Observations

The paper's release of BashArena 1.1, along with linked example trajectories, supports reproducibility. The finding that "k-strikes" history-based scoring *hurts* safety with a capable monitor (contradicting two prior papers) suggests that some earlier findings may not survive the transition to stronger models, an important meta-lesson for the field.

Rating: 7/10
Significance 7.5 · Rigor 7.5 · Novelty 6.5 · Clarity 8

Generated May 27, 2026

Comparison History (24)

    vs. Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization
    claude-opus-4.6 · 5/28/2026

    Paper 2 introduces a novel optimization framework (RC-DPO) that addresses a fundamental problem—hallucinations in multimodal reasoning models—with broad applicability. It provides a principled theoretical derivation showing why standard DPO fails to leverage CoT supervision, and proposes concrete solutions (MCTS-based data generation, attention-guided pruning). This has wider impact across the rapidly growing multimodal AI field. Paper 1 makes valuable but narrower contributions to AI control/safety protocols, with findings that are setting-specific and sometimes contradictory to prior work, suggesting limited generalizability.

    vs. Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability
    gemini-3.1 · 5/28/2026

    Paper 2 offers broader scientific impact by addressing a fundamental challenge in AI: accurately evaluating LLM reasoning. By introducing the paired-formula protocol and ADR metric, it exposes severe flaws in standard evaluation metrics that falsely reward heuristic guessing. Because Boolean satisfiability is foundational, benchmarking LLMs on representation-invariant SAT problems provides a highly rigorous, universally applicable tool for the AI community. In contrast, while Paper 1 tackles an important alignment issue, its scope is narrower, focusing on specific engineering protocols (retry vs. resample) within specific empirical safety settings.

    vs. Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values
    gpt-5.2 · 5/28/2026

    Paper 1 likely has higher scientific impact: it addresses a timely, central AI control/safety failure mode (information leakage via retrying under adversarial models) and proposes/ablates resampling-based alternatives with concrete, budgeted safety–utility tradeoffs. The work directly informs deployment of agentic coding systems and revises conclusions from prior literature (e.g., Ctrl-Z), suggesting field-shaping potential. Methodologically, it uses realistic agent benchmarks and explicit threat modeling. Paper 2 is solid and useful, but Shapley-based span attribution for uncertainty is a more incremental extension of existing attribution/UQ ideas with narrower cross-field impact.

    vs. RULER: Representation-Level Verification of Machine Unlearning
    claude-opus-4.6 · 5/28/2026

    RULER addresses a fundamental gap in machine unlearning verification—showing that models can pass output-level tests while retaining forgotten information in intermediate representations. This has broad implications across privacy (GDPR right-to-be-forgotten), fairness, and trustworthy AI, with demonstrated applicability across tabular, image, clinical text, and face recognition domains. Paper 2 makes useful contributions to AI control/safety in coding scaffolds but is narrower in scope, more incremental (refining resampling strategies), and setting-specific. RULER's novel metrics and cross-domain validation suggest wider adoption and influence.

    vs. SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models
    gpt-5.2 · 5/28/2026

    Paper 2 likely has higher scientific impact due to broader relevance beyond a single domain: it addresses a general AI control problem (retrying vs resampling) applicable to many agentic systems and safety monitors. It offers mechanistic, adversarially grounded insights (information leakage via monitor rationales), clear experimental results with efficiency tradeoffs, and overturns prior findings—signals of high novelty and field-wide implications. Paper 1 is valuable and timely for clinical governance, but its impact is more domain-specific and improvements appear incremental (3–5% safety reduction).

    vs. PIRS: Physics-Informed Reward Shaping for SAC-Based Building Energy Management
    gpt-5.2 · 5/28/2026

    Paper 1 likely has higher impact due to greater novelty and broader relevance: it analyzes a widely used safety mechanism (retrying) under adversarial/untrusted-model assumptions, identifies an information-leak failure mode, and proposes and tests resampling variants with concrete efficiency gains and findings that contradict prior work. This directly informs AI control, agent monitoring, and secure deployment practices across many agentic systems. Paper 2 is a solid, standards-grounded refinement for building-energy DRL reward design with clear applications, but the methodological novelty is more incremental and the impact is narrower to HVAC/energy-management RL.

    vs. OccuReward: LLM-Guided Occupant-Centric Reward Shaping for Demographic Equity in Grid-Interactive Buildings
    claude-opus-4.6 · 5/28/2026

    Paper 1 addresses a critical and timely problem in AI safety—controlling potentially adversarial AI agents—with rigorous empirical methodology that contradicts prior findings, advancing the frontier of AI control research. Its results on retrying vs resampling have broad implications for deployed AI coding scaffolds (Claude Code, Codex) and the growing AI safety field. Paper 2 presents an interesting but narrower contribution combining LLMs with RL for building energy equity, addressing a more niche intersection. Paper 1's findings are more likely to influence a larger research community given the rapid scaling of AI agent deployments.

    vs. Explaining is Harder Than Predicting Alone: Evaluating Concept-based Explanations of MLLMs as ICL Visual Classifiers
    gemini-3.1 · 5/28/2026

    Paper 2 offers higher scientific impact by addressing the critical and highly timely issue of AI control and safety in autonomous agents. By empirically challenging existing paradigms and demonstrating that retrying exposes exploitable information to adversarial models, it directly impacts the design of secure AI scaffolds. Its proposed selective resampling method provides immediate, practical improvements to AI safety without sacrificing utility, offering broader real-world applicability compared to Paper 1's analytical findings on MLLM explainability.

    vs. Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions
    claude-opus-4.6 · 5/27/2026

    Paper 1 addresses a timely and high-impact topic—AI safety and control for agentic coding systems—with novel contributions including the distinction between retrying and resampling strategies, identification of information leakage vulnerabilities, and empirical findings that contradict prior work. It has broader implications for AI deployment safety. Paper 2 is a relatively incremental evaluation study on a well-explored topic (LLM math robustness) with non-significant results, limited novelty, and narrower impact.

    vs. SIA: Self Improving AI with Harness & Weight Updates
    gemini-3.1 · 5/27/2026

    Paper 2 addresses a grand challenge in AI (self-improvement) by unifying two major, previously disjoint research paradigms: scaffold/harness updates and test-time weight updates. Its demonstration of massive performance gains across highly diverse, real-world domains (law, GPU optimization, and bioinformatics) suggests a much broader applicability and transformative potential. Paper 1 offers valuable, rigorous insights into AI safety and control mechanisms, but its impact is narrower and more specialized compared to the foundational architectural shift proposed in Paper 2.

    vs. How Well Do Models Follow Their Constitutions?
    gemini-3.1 · 5/27/2026

    Paper 1 addresses a critical, broadly relevant issue in AI alignment and governance: whether frontier models actually follow their behavioral specifications under pressure. Its comprehensive audit pipeline and large-scale evaluation across multiple model generations provide highly impactful insights for safety, policy, and model development. Paper 2, while methodologically rigorous and valuable for AI control, focuses on a narrower technical mechanism (retrying vs. resampling) in specific environments, resulting in a more specialized impact compared to the broad implications of Paper 1.

    vs. SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills
    claude-opus-4.6 · 5/27/2026

    Paper 1 addresses a critical and timely problem in AI safety—controlling potentially adversarial AI agents in real-world coding scaffolds. It provides rigorous empirical findings that contradict prior work, offering actionable design guidance (retrying vs. resampling) for deployed AI systems. The AI control/safety domain has immediate high-stakes applications and broad interest. Paper 2 introduces a useful benchmark for skill formation in LLM agents, but benchmarks tend to have more incremental impact, and its primary finding (that current agents fail at skill distillation) is largely negative, limiting immediate downstream influence.

    vs. Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs
    claude-opus-4.6 · 5/27/2026

    Palette addresses a fundamental and broadly relevant challenge in LLM safety alignment—moving beyond one-size-fits-all policies to context-dependent safety controls. Its modular, composable framework for domain-specific safety relaxation has wide applicability across professional domains and model types (LLMs and VLMs). Paper 1, while rigorous, focuses on a narrower technical question (retrying vs resampling in AI control) within a specific evaluation setting, and some findings are setting-dependent. Palette's practical framework for adaptive safety alignment has broader impact potential across the AI safety and deployment ecosystem.

    vs. A Sober Look at Agentic Misalignment in Automated Workflows
    claude-opus-4.6 · 5/27/2026

    Paper 1 addresses a highly timely and practical problem—safety of AI coding agents (Claude Code, Codex)—with rigorous empirical methodology that directly contradicts prior findings, advancing the AI control/safety field. Its concrete, actionable results on retrying vs resampling strategies have immediate applicability to deployed systems. Paper 2 tackles multi-agent misalignment with a Bayesian framework but is more theoretical and incremental, proposing a paradigm (AEA) whose practical impact is less clear. Paper 1's direct relevance to frontier AI safety and its empirical rigor give it higher potential impact.

    vs. Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs
    gpt-5.2 · 5/27/2026

    Paper 2 has higher estimated impact due to broader relevance (RAG is widely deployed), a more general and timely problem framing (multi-turn evidence accumulation), and stronger methodological rigor (50k+ evaluations across multiple model families plus human validation and mechanistic analyses). It identifies a fundamental evaluation failure mode—detection without safe control—likely to influence benchmarks, deployment practices, and future research on alignment and decision-making. Paper 1 is novel and useful for AI control protocols, but is narrower in scope (specific monitoring/resampling design choices and a particular benchmark/stack), making its cross-field impact likely smaller.

    vs. Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry
    gemini-3.1 · 5/27/2026

    Paper 2 addresses a fundamental and timely challenge in AI safety and agentic control, applicable across all domains using LLMs. By challenging prior assumptions about resampling and exposing vulnerabilities in retrying methods, its foundational insights have a broader potential to shape future AI systems compared to Paper 1's domain-specific application in environmental engineering.

    vs. Representation Without Control: Testing the Realization Effect in Language Models
    claude-opus-4.6 · 5/27/2026

    Paper 1 addresses the timely and practically important problem of AI safety/control in deployed coding agents, providing actionable findings on retrying vs resampling strategies with concrete empirical results that contradict prior work. Its direct relevance to AI safety infrastructure (Claude Code, Codex) gives it broad real-world applicability. Paper 2 offers a methodologically careful mechanistic interpretability study but yields primarily null/negative results about activation steering, with narrower scope (one cognitive bias in one model). Paper 1's findings are more likely to influence deployed AI safety practices and future AI control research.

    vs. OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling
    gemini-3.1 · 5/27/2026

    Paper 1 offers higher potential scientific impact because it addresses a critical and immediate security vulnerability in autonomous AI agents. By demonstrating how untrusted models can exploit monitor rationale during 'retrying', it highlights a major flaw in current AI scaffolding. Its proposed solution, selective resampling, provides a highly practical, rigorously tested defense mechanism for AI alignment. While Paper 2 introduces a valuable evaluation benchmark for Theory of Mind, Paper 1's findings have more urgent real-world applications for safely deploying agentic AI systems, making its contributions highly timely and critical to the AI safety community.

    vs. Helicase: Uncertainty-Guided Supply Chain Knowledge Graph Construction with Autonomous Multi-Agent LLMs
    gpt-5.2 · 5/27/2026

    Paper 1 likely has higher scientific impact because it targets a broadly relevant, timely AI safety/control question (how to safely use powerful, potentially adversarial models) and provides clear, testable design insights that can generalize across LLM-based agents and monitoring pipelines. It makes a novel control-theoretic critique of “retrying” (information leakage via monitor rationales) and carefully disentangles resampling variants, reporting counter-results to prior work with quantified tradeoffs (safety vs audit budget/usefulness). Paper 2 is useful and application-driven but more domain-specific and engineering-heavy, with impact centered on supply chains.

    vs. Partner-Aware Hierarchical Skill Discovery for Robust Human-AI Collaboration
    gemini-3.1 · 5/27/2026

    Paper 1 addresses a highly urgent problem in AI safety: securing autonomous LLM agents against adversarial exploits. By revealing that standard retry mechanisms leak exploitable information and proposing a robust resampling alternative, it offers immediate, actionable insights for real-world AI deployment. Furthermore, correcting previous assumptions in the field enhances its methodological impact. While Paper 2 offers solid contributions to human-AI collaboration in reinforcement learning, Paper 1's direct relevance to the booming field of LLM coding agents and AI control gives it a higher potential for immediate, broad scientific and practical impact.