Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts

Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, Yan Lu, Xiaohua Jia

Jun 4, 2026

arXiv:2606.05922v1

cs.AI (primary) · cs.CL · cs.LG

#483 of 3355 · Artificial Intelligence

Tournament Score

1487±48

Scale: 1050 to 1800

86%

Win Rate

Rating

7.2 / 10

Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 8

Abstract

AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long-horizon sessions.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Retrospective Harness Optimization (RHO)

Core Contribution

RHO addresses a practical gap in agent improvement: how to optimize the full agent harness—comprising tools, skills, prompts, and workflows—without access to labeled validation data. The method operates in three stages: (1) DPP-based coreset selection balancing task difficulty and diversity, (2) group rollout with self-validation and self-consistency diagnostics, and (3) best-of-N harness proposal using pairwise self-preference. The central claim is that an agent's own past trajectories contain sufficient signal for self-improvement, substituting a latent utility function with model-generated preference judgments. The headline result—improving SWE-Bench Pro pass rate from 59% to 78% in a single optimization round without external grading—is striking.
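The first stage, coreset selection that trades off difficulty against diversity, can be illustrated with a small greedy determinantal point process (DPP) sketch. This is not the paper's implementation: the kernel form `L_ij = q_i * sim_ij * q_j`, the cosine similarity, and the `difficulty` and `embeddings` inputs are all assumptions chosen for illustration.

```python
import math

def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_dpp_coreset(difficulty, embeddings, k):
    """Greedily pick up to k task indices maximizing det(L_S).

    Assumed kernel: L_ij = q_i * cosine(e_i, e_j) * q_j, where q_i is a
    positive difficulty score (hypothetical input, not from the paper).
    """
    n = len(difficulty)
    kernel = [[difficulty[i] * cosine(embeddings[i], embeddings[j]) * difficulty[j]
               for j in range(n)] for i in range(n)]
    selected = []
    for _ in range(min(k, n)):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            d = det([[kernel[a][b] for b in idx] for a in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:  # remaining tasks add no volume (duplicates)
            break
        selected.append(best)
    return selected

# Tasks 0 and 1 are hard near-duplicates; task 3 is hard and different.
difficulty = [0.9, 0.85, 0.2, 0.8]
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.0, 1.0]]
print(greedy_dpp_coreset(difficulty, embeddings, k=2))  # -> [0, 3]
```

The determinant rewards high difficulty scores on the diagonal and penalizes similar pairs through the off-diagonal terms, which is why the near-duplicate task 1 loses to the dissimilar task 3 despite its higher difficulty.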

Methodological Rigor

The experimental design is generally sound. The paper evaluates across three diverse domains (SWE-Bench Pro, Terminal-Bench 2, GAIA-2), provides proper train/test splits via seeded hash-based partitioning, and includes comprehensive ablations covering coreset selection strategies (Figure 5), diagnostic signal contributions (Table 4), and best-of-N consistency (Table 3). The compute accounting (Appendix G) is unusually thorough, with per-role invocation counts and wall-clock breakdowns.
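As a rough illustration of what seeded hash-based partitioning can look like, the sketch below assigns each task to a split from a hash of its identifier. The hash function, the `seed` and `test_fraction` parameters, and the task ID format are assumptions; the review does not state which scheme the paper uses.

```python
import hashlib

def assign_split(task_id, seed=0, test_fraction=0.5):
    """Deterministically map a task ID to 'train' or 'test'.

    Hypothetical scheme: hash "<seed>:<task_id>" with SHA-256 and
    bucket the result into [0, 1) at a resolution of 1/10000.
    """
    digest = hashlib.sha256(f"{seed}:{task_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000 / 10_000
    return "test" if bucket < test_fraction else "train"

# The assignment depends only on (seed, task_id), not on dataset order.
task_ids = [f"task-{i}" for i in range(1000)]
splits = {t: assign_split(t, seed=7) for t in task_ids}
print(sum(1 for s in splits.values() if s == "test"), "of", len(task_ids), "in test")
```

Because each assignment is a pure function of the seed and the task ID, adding or reordering tasks never moves an existing task across the split boundary.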

However, several methodological concerns arise. No confidence intervals or variance estimates are reported for the main results (Table 1). While Table 3 partially addresses stochasticity for candidate selection, the end-to-end variance of the full pipeline remains unknown. The evaluation uses a single backbone (Codex with GPT-5.5 at high reasoning effort), leaving generalizability to weaker or different models untested. This is particularly concerning because every operator in the pipeline—solver, judge, diagnostician, optimizer, ranker—relies on the same frontier model being capable enough to judge its own outputs accurately. The self-preference assumption is adopted without formal analysis of when it provides a reliable proxy for true utility; systematic biases in self-evaluation (e.g., preferring verbose solutions or familiar patterns) could lead to consistent but incorrect optimization directions.
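To make the self-preference concern concrete, here is a minimal sketch of best-of-N selection by pairwise preference. The `prefer` callable is a hypothetical stand-in for the model judge, and the round-robin with both presentation orders is an illustrative design, not necessarily the paper's procedure.

```python
from itertools import combinations

def select_by_pairwise_preference(candidates, prefer):
    """Return (winner, win_counts) from a round-robin of pairwise judgments.

    prefer(a, b) is a hypothetical judge callable that returns a or b.
    Each pair is judged in both orders to dampen position bias.
    """
    wins = {name: 0 for name in candidates}
    for a, b in combinations(candidates, 2):
        for first, second in ((a, b), (b, a)):
            wins[prefer(first, second)] += 1
    # Ties break toward the earliest candidate in the input order.
    best = max(candidates, key=lambda name: wins[name])
    return best, wins

# Toy judge that sees a hidden quality score (an assumption for the demo).
quality = {"A": 0.7, "B": 0.9, "C": 0.8}
faithful = lambda x, y: x if quality[x] >= quality[y] else y
print(select_by_pairwise_preference(["A", "B", "C"], faithful))
# -> ('B', {'A': 0, 'B': 4, 'C': 2})

# A judge that always picks the first option carries no signal:
# every candidate ends with the same count and the tie-break decides.
position_biased = lambda x, y: x
print(select_by_pairwise_preference(["A", "B", "C"], position_biased))
# -> ('A', {'A': 2, 'B': 2, 'C': 2})
```

The second call shows the failure mode the paragraph above worries about: a systematically biased judge still returns a winner, so the selection looks decisive even when it reflects no real quality difference.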

Potential Impact

The paper targets a genuine deployment bottleneck: labeled validation data is expensive and may not represent future task distributions. If the approach generalizes beyond frontier models and the tested domains, it could enable continuous agent improvement in production settings. The harness abstraction—a filesystem directory containing executable scripts, markdown instructions, and skill files—is pragmatic and implementation-friendly.

The generated harness artifacts (Section 5.2, Appendix H) demonstrate that RHO produces interpretable, actionable improvements addressing specific failure modes (e.g., non-standard Go toolchain paths, Python cache hygiene). This interpretability is a practical advantage over opaque optimization methods. The behavioral analysis (Figure 4) showing shifted action distributions and improved long-horizon performance provides mechanistic insight rarely seen in agent optimization papers.

However, the approach applies only under specific conditions: it requires resettable environments for group rollout, editable harness surfaces, and a model capable of reliable self-judgment. The per-optimization cost (~200 agent calls) is non-trivial for expensive frontier models.

Timeliness & Relevance

The paper is highly timely. Agent systems are rapidly proliferating, and the question of how to continuously improve them post-deployment is increasingly urgent. The paper positions itself well within the emerging taxonomy of agent self-improvement methods (Table 5), being the first to combine label-free optimization, full-harness editing, and single-pass execution.

Strengths

1. Strong empirical results: The +19 percentage point improvement on SWE-Bench Pro without validation labels is compelling and substantially outperforms all feedback-free baselines.

2. Comprehensive ablations: Each pipeline component (coreset selection strategy, diagnostic signals, best-of-N selection) is individually justified through controlled ablations.

3. Excellent reproducibility: Full prompts, hyperparameters, and verbatim harness outputs are provided, with all artifacts persisted to disk.

4. Insightful analysis: The behavior shift visualization (Figure 4) and failure mode analysis demonstrate *how* the harness changes agent behavior, not just *that* it improves.

5. Fair baseline comparisons: Budget-matched comparisons with both trajectory-only and validation-feedback methods (Meta-Harness) at multiple compute levels.

Limitations

1. Single-model evaluation: All results use GPT-5.5/Codex. The approach's effectiveness with weaker models—where self-judgment may be unreliable—is entirely unknown.

2. No statistical significance testing: Main results lack error bars despite the inherently stochastic pipeline.

3. Self-preference validity: The paper assumes the model can reliably judge trajectory quality. Yet the chosen candidate scores slightly below the candidate mean (Table 3, SWE-Bench Pro: chosen=0.78 vs. mean=0.79), which suggests selection is noisy and sometimes suboptimal.

4. Meta-Harness comparison: At 10 rounds, Meta-Harness reaches 0.80 versus RHO's 0.78, suggesting that with sufficient budget and labels, validation-feedback approaches remain competitive. The cost narrative (3.1× compute) partially offsets this, but the label requirement is the real differentiator.

5. Scalability unknown: k=10 coreset, N=3 candidates tested. How performance scales with these parameters—and whether larger budgets yield further gains—is unexplored.

6. Risk of preference amplification: The ethics section correctly flags that model-generated judgments could amplify biased or unsafe procedures, but no mitigation beyond audit logs is implemented.

Overall Assessment

RHO is a well-executed applied contribution that demonstrates a practical and effective approach to label-free agent harness optimization. The core idea—using retrospective self-analysis of past trajectories to drive harness improvement—is sound and well-validated within its tested scope. The main reservation is the narrow evaluation context (single frontier model, three benchmarks) and the absence of variance estimates. The paper would be substantially strengthened by demonstrating robustness across model capabilities and providing statistical confidence bounds.

Rating: 7.2 / 10

Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 8

Generated Jun 5, 2026

Comparison History (22)

vs. Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

gpt-5.2 · 6/8/2026

Paper 2 likely has higher impact: it proposes a concrete, broadly applicable method (self-supervised harness optimization from deployment trajectories) that directly improves real agent performance without labeled data, with strong reported gains on a major benchmark (SWE-Bench Pro 59%→78%). This is timely for real-world LLM agent deployment and could generalize across domains and toolchains. Paper 1 offers valuable measurement for AI safety oversight, but is primarily diagnostic/forecasting with heavier dependence on specific model families and uncertainty in projections, and may have narrower practical uptake outside safety monitoring.

vs. StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

claude-opus-4.6 · 6/6/2026

Paper 1 (RHO) addresses a broader, more fundamental problem—self-supervised optimization of LLM agent harnesses without ground-truth labels—applicable across diverse domains. Its demonstrated improvement on SWE-Bench Pro (59% to 78%) is substantial and practically significant. The self-supervised nature makes it widely applicable to any agentic system. Paper 2, while solid, targets a narrower domain (RTL code generation) with a more incremental combination of existing techniques (PRM, MCTS, RAFT). Paper 1's breadth of impact across software engineering, technical work, and knowledge work gives it higher potential scientific influence.

vs. When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

gemini-3.1 · 6/6/2026

Paper 2 introduces a highly impactful, self-supervised method for continually improving AI agents without ground-truth data. Its demonstrated success on complex, real-world benchmarks like SWE-Bench Pro (improving pass rates from 59% to 78%) suggests significant and immediate applicability across various domains. While Paper 1 provides a valuable benchmark for failure recovery, Paper 2 offers an actionable optimization framework with broader potential to advance agentic systems.

vs. Unsupervised Pattern Analysis in Japanese Veterinary Toxicology: A Regulatory-Compliant Framework for Cross-Species Risk Assessment

claude-opus-4.6 · 6/6/2026

Paper 1 presents a novel self-supervised optimization method (RHO) for improving AI agent harnesses without ground-truth labels, addressing a critical bottleneck in LLM agent deployment. It demonstrates dramatic performance improvements (59% to 78% on SWE-Bench Pro) across three diverse domains, with broad applicability to the rapidly growing AI agents field. Paper 2 addresses a narrower niche (Japanese veterinary toxicology) with relatively standard unsupervised methods (clustering, dimensionality reduction) applied to a domain-specific dataset. While useful, its impact is limited to veterinary pharmacovigilance. Paper 1's timeliness, novelty, and breadth of impact are substantially greater.

vs. Product-Aware Deep Autoencoders for Robust Process Monitoring in Multi-Product Cyber-Physical Systems

claude-opus-4.6 · 6/6/2026

Paper 1 introduces RHO, a novel self-supervised framework for optimizing LLM agent harnesses without ground-truth labels—addressing a critical bottleneck in deploying AI agents. The dramatic improvement on SWE-Bench Pro (59% to 78%) demonstrates significant practical impact. It spans multiple domains, has broad applicability to the rapidly growing field of LLM agents, and addresses the timely challenge of autonomous agent self-improvement. Paper 2, while solid, addresses a narrower problem (product-aware anomaly detection in manufacturing) with a relatively incremental contribution (conditioning autoencoders on product grade), limiting its breadth of impact.

vs. Where does Absolute Position come from in decoder-only Transformers?

gpt-5.2 · 6/5/2026

Paper 1 likely has higher impact due to strong practical relevance and immediate applicability: it proposes a self-supervised optimization loop for deployed LLM agents without labeled validation data and reports a large improvement on a major benchmark (SWE-Bench Pro 59%→78%). This method could generalize across many agentic systems and organizations, affecting tooling, workflows, and continual improvement practices. Paper 2 offers valuable mechanistic insight into positional information in RoPE transformers, but its contributions are more explanatory/diagnostic and may translate into fewer near-term system-level gains than a broadly deployable optimization procedure.

vs. Multi-ResNets for Subspace Preconditioning in Constrained Optimization

gemini-3.1 · 6/5/2026

Paper 1 addresses a critical and highly timely bottleneck in AI: optimizing LLM agents without ground-truth labels. Its self-supervised approach (RHO) demonstrates exceptional empirical results on a prominent benchmark (SWE-Bench Pro), improving pass rates from 59% to 78%. This broad applicability to software engineering and knowledge work promises immense real-world impact. While Paper 2 presents a rigorous and novel architecture for constrained optimization, its scope is more specialized. The explosive growth and broader interdisciplinary relevance of autonomous AI agents give Paper 1 a significantly higher potential for widespread scientific impact.

vs. Breaking the Chain: A Causal Analysis of LLM Faithfulness to Intermediate Structures

gpt-5.2 · 6/5/2026

Paper 2 has higher likely scientific impact due to its methodological rigor and breadth: it introduces a causal evaluation protocol that cleanly tests whether intermediate structures in schema-guided reasoning are true causal mediators, then validates findings across 12 models and 4 benchmarks. The result challenges a widely used controllability assumption and directly informs tool-based pipeline design and alignment methods (e.g., preference optimization). Paper 1 is highly application-relevant for agent improvement, but its self-preference optimization is closer to engineering progress and may be harder to generalize or attribute scientifically.

vs. Beyond Prompt-Based Planning: MCP-Native Graph Planning-based Biomedical Agent System

claude-opus-4.6 · 6/5/2026

RHO addresses a fundamental, domain-general challenge in AI agent optimization—improving agent performance without ground-truth labels—making it broadly applicable across software engineering, technical work, and knowledge work. Its self-supervised approach (self-validation, self-consistency, self-preference) is highly novel and practical for real-world deployment where labeled data is scarce. The dramatic improvement on SWE-Bench Pro (59%→78%) without external grading is compelling. While BioManus makes strong contributions to biomedical agent planning with its MCP graph architecture, its impact is more domain-specific. RHO's generality and practicality give it broader potential impact across the AI agent ecosystem.

vs. Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact due to a more novel, broadly applicable optimization paradigm for LLM agents that removes dependence on external labels via self-preference over trajectory rollouts. It targets a timely and fast-growing area (agentic AI), shows large practical gains on a major benchmark (SWE-Bench Pro 59%→78%), and could generalize across domains where deployment logs exist. Paper 1 is valuable and rigorous as a diagnostic benchmark for VLM chronological reasoning and shortcut bias, but its impact is narrower (evaluation-focused, vision-language-specific) and may influence fewer downstream applications than agent-harness optimization.

vs. Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

gemini-3.1 · 6/5/2026

Paper 2 introduces a self-supervised method for LLM agents to improve their own workflows without external labels, addressing a major bottleneck in autonomous agent development. The reported 19-percentage-point improvement on the notoriously difficult SWE-Bench Pro benchmark is exceptionally significant. While Paper 1 provides valuable efficiency gains for LLM deployment via quantization, the autonomous, self-improving capabilities demonstrated in Paper 2 have broader, more transformative implications for the future of general-purpose AI systems.

vs. TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management

gemini-3.1 · 6/5/2026

Paper 1 addresses the critical bottleneck of agent self-improvement without requiring ground-truth labels. Its massive performance jump on the rigorous SWE-Bench Pro (59% to 78%) demonstrates exceptional real-world utility and methodological strength. While Paper 2 offers a clever approach to context management, native LLM context windows are expanding rapidly, potentially reducing the long-term impact of external compression proxies compared to autonomous, self-improving agent architectures.

vs. Agents' Last Exam

claude-opus-4.6 · 6/5/2026

Agents' Last Exam (ALE) introduces a large-scale, living benchmark covering 1K+ economically valuable tasks across 55 subfields, developed with 250+ industry experts. It addresses a fundamental evaluation gap between benchmark performance and real-world deployment. Its breadth of impact across fields, timeliness (agents are rapidly advancing but lack rigorous real-world benchmarks), and potential to become a standard evaluation instrument give it higher scientific impact. While RHO presents a clever self-supervised optimization method with strong results, it is more narrowly focused on agent harness improvement methodology.

vs. When Should LLMs Be Less Specific? Selective Abstraction for Reliable Long-Form Text Generation

gpt-5.2 · 6/5/2026

Paper 1 is likely higher impact due to a more novel and broadly applicable method for continual agent improvement without labels, directly addressing a major deployment bottleneck. RHO targets real-world agent harness optimization (tools/workflows/policies) and demonstrates a large practical gain on a prominent benchmark (SWE-Bench Pro) without external grading, suggesting strong applicability and timeliness for autonomous agents. Paper 2 is rigorous and useful for reliability in long-form generation, but selective abstraction is a more incremental extension of abstention/uncertainty frameworks with narrower scope compared to end-to-end agent optimization.

vs. Towards Healthy Evolution: Exploring the Role and Mechanisms of Human-Agent Interaction in Self-Evolving Systems

gpt-5.2 · 6/5/2026

Paper 2 (RHO) likely has higher scientific impact due to its more broadly applicable, self-supervised optimization paradigm that removes dependence on labeled validation data, a major practical bottleneck for deployed agents. The reported jump on SWE-Bench Pro (59%→78%) suggests strong real-world utility and timeliness for agentic LLM improvement. Methodologically, it proposes a general loop (coreset selection, parallel rollouts, self-evaluation, pairwise self-preference) that can transfer across domains. Paper 1 targets an important safety-alignment issue, but relies on LLM-simulated oversight, appears narrower in application scope, and reports less transformative performance gains.

vs. Learning to replenish: A hybrid deep reinforcement learning for dynamic inventory management in the pharmaceutical supply chains

gpt-5.2 · 6/5/2026

Paper 2 is likely higher impact due to stronger novelty and broader relevance: it proposes a self-supervised optimization method for LLM agents that removes reliance on labeled validation, a key bottleneck in real deployments. The approach is timely and broadly applicable across domains where agents operate, and the reported gains on a prominent benchmark (SWE-Bench Pro) suggest substantial practical impact. Paper 1 applies DRL to a valuable but narrower domain (pharmaceutical inventory) with incremental algorithmic hybridization; its impact is more specialized and less cross-field.

vs. Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

gemini-3.1 · 6/5/2026

Paper 2 addresses a critical bottleneck in agent development—optimizing workflows without ground-truth data—using a novel self-supervised approach. Its massive improvement on the highly challenging and relevant SWE-Bench Pro benchmark demonstrates exceptional real-world applicability and methodological rigor. In contrast, Paper 1 applies a more familiar generator-validator multi-agent pattern to a relatively saturated benchmark (GSM8K), offering less methodological novelty and a narrower scope of impact compared to Paper 2's broad, task-agnostic workflow optimization.

vs. EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts

claude-opus-4.6 · 6/5/2026

RHO addresses a more general and broadly applicable problem—self-supervised optimization of LLM agent harnesses without ground-truth labels—applicable across diverse domains (software engineering, technical work, knowledge work). Its domain-agnostic framework, strong empirical results (59%→78% on SWE-Bench Pro), and practical relevance to real deployment settings give it broader impact potential. EpiEvolve, while innovative in its streaming pandemic forecasting approach, is more narrowly scoped to epidemic prediction. RHO's self-preference mechanism and trajectory-based optimization represent a more transferable methodological contribution to the rapidly growing LLM agents field.

vs. Knowledge Activation: AI Skills as the Institutional Knowledge Primitive for Agentic Software Development

gpt-5.2 · 6/5/2026

Paper 2 has higher scientific impact potential: it proposes a broadly applicable, self-supervised optimization method for improving LLM agent harnesses without labeled evaluation data, addressing a central deployment bottleneck. The method is testable, algorithmic, and shows large gains on a widely recognized benchmark (SWE-Bench Pro), suggesting strong methodological and comparative rigor and easier replication across domains. Its applicability spans many agent settings (tools/workflows/skills), making it more general than Paper 1’s primarily enterprise knowledge-architecture framework, whose evidence is compelling but narrower and more context-dependent.

vs. 2-Step Agent: A Framework for the Interaction of a Decision Maker with AI Decision Support

claude-opus-4.6 · 6/5/2026

Paper 1 introduces a novel theoretical framework addressing the fundamental question of how humans learn from ML predictions, with formal Bayesian analysis revealing counterintuitive findings (ML decision support can harm outcomes even under ideal conditions). This has broad implications across healthcare, judiciary, and any field using ML-assisted decisions. Its theoretical depth, methodological rigor, and surprising negative results challenge prevailing assumptions. Paper 2, while practically useful, is more incremental—an engineering contribution for LLM agent optimization in a rapidly evolving space where methods quickly become obsolete.