SIA: Self Improving AI with Harness & Weight Updates

Prannay Hebbar, Yogendra Manawat, Samuel Verboomen, Alesia Ivanova, Selvam Palanimalai, Kunal Bhatia, Vignesh Baskaran

#124 of 2682 · Artificial Intelligence
Tournament Score: 1540±46 · Win Rate: 83% (15 wins, 3 losses, 18 matches)
Rating: 5.5/10

Abstract

Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-horizon goal of an AI that can figure out how to improve itself remains open. Two largely disjoint research lines attack this bottleneck. The harness-update school has a meta-agent rewrite the scaffold of a task-specific agent (its tools, prompts, retry logic, and search procedure) while the model weights are held fixed. The test-time training school uses hand-written RL pipelines to update the model's own weights on task feedback while the harness is held fixed. These two silos operate in isolation. We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. We evaluate across three contrasting domains: Chinese legal charge classification, low-level GPU kernel optimisation, and single-cell RNA denoising. Combining both levers outperforms scaffold iteration alone on all three benchmarks. The gains are 56.6% on LawBench, 91.9% runtime reduction on GPU kernels, and 502% on denoising over the initial baseline. Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SIA: Self Improving AI with Harness & Weight Updates

1. Core Contribution

SIA proposes unifying two previously disjoint approaches to AI self-improvement: (1) harness/scaffold updates (modifying prompts, tools, retry logic while keeping model weights fixed) and (2) test-time training (updating model weights via RL while keeping the scaffold fixed). A "Feedback-Agent" dynamically selects between these two levers—scaffold rewriting or weight updates via RL—in a closed loop. The system is evaluated on three diverse tasks: Chinese legal charge classification (LawBench), GPU kernel optimization (TriMul), and single-cell RNA denoising.

The conceptual contribution is straightforward and intuitive: if you can improve both the scaffold and the weights, you should do better than improving either alone. The paper's value lies in demonstrating this empirically and providing an architecture where an LLM-based meta-agent orchestrates both types of updates.
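The closed loop described above can be illustrated with a toy sketch. This is not the paper's implementation: `AgentState`, `run_loop`, `toy_evaluate`, and `toy_choose_lever` are hypothetical names, the integer "versions" merely stand in for real scaffold edits and RL checkpoints, and the selection rule (keep pulling a lever while it helps, switch when it stalls) is an assumed placeholder for whatever the LLM-based Feedback-Agent actually decides.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class AgentState:
    harness_version: int = 0  # stand-in for prompts, tools, retry logic
    weights_version: int = 0  # stand-in for model checkpoints


# One history entry: (lever chosen, score obtained, whether the edit was kept)
Step = Tuple[str, int, bool]


def run_loop(
    evaluate: Callable[[AgentState], int],
    choose_lever: Callable[[AgentState, List[Step]], str],
    steps: int,
) -> Tuple[AgentState, int, List[Step]]:
    """Repeatedly pick a lever, try the edit, and keep it only if the score improves."""
    state = AgentState()
    best = evaluate(state)
    history: List[Step] = []
    for _ in range(steps):
        lever = choose_lever(state, history)
        if lever == "harness":
            candidate = AgentState(state.harness_version + 1, state.weights_version)
        elif lever == "weights":
            candidate = AgentState(state.harness_version, state.weights_version + 1)
        else:
            raise ValueError(f"unknown lever: {lever}")
        score = evaluate(candidate)
        kept = score > best
        history.append((lever, score, kept))
        if kept:
            state, best = candidate, score
    return state, best, history


def toy_evaluate(state: AgentState) -> int:
    # Hypothetical score: each lever helps up to a point, then saturates.
    return 10 * min(state.harness_version, 3) + 20 * min(state.weights_version, 2)


def toy_choose_lever(state: AgentState, history: List[Step]) -> str:
    # Assumed rule: stay on a lever while it helps, switch once it stalls.
    if not history:
        return "harness"
    last_lever, _, kept = history[-1]
    if kept:
        return last_lever
    return "weights" if last_lever == "harness" else "harness"


final_state, best_score, history = run_loop(toy_evaluate, toy_choose_lever, steps=8)
```

In this toy setting the loop exhausts the harness lever first and only then gains further from weight updates, which mirrors the intuition that neither lever alone reaches the combined score.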

2. Methodological Rigor

Strengths in experimental design:

  • Three genuinely diverse domains (law, systems programming, biology) provide breadth.
  • The ablation structure (Baseline → SIA-H → SIA-W+H) cleanly isolates each lever's contribution.
  • The reported gains are substantial and consistent across all three benchmarks.

Significant concerns:

  • Base model opacity. The task-specific agent uses `gpt-oss-120b`, described only as "an internal 120B-parameter instruction-tuned language model." This is a proprietary, non-public model. No details about its training data, architecture specifics, or capabilities are provided. This severely limits reproducibility and makes it impossible to assess whether the base model was already exposed to related data during pretraining.
  • No statistical reporting. Results appear to be single runs. No confidence intervals, standard deviations, or repeated trials are reported. Given the stochasticity of both LLM-based scaffold generation and RL training, this is a notable gap.
  • Incomplete ablation. The paper never tests weight-updates-only (SIA-W) as an independent condition. This means we cannot determine whether the harness updates are truly complementary or whether RL alone on the base model would achieve similar results. The claim of synergy between the two levers is weakened without this ablation.
  • Feedback-Agent selection mechanism. The paper describes the Feedback-Agent choosing among PPO, GRPO, DPO, REINFORCE+KL, Best-of-N BC, and entropic advantage weighting. However, it's unclear how systematically this selection was evaluated. The descriptions read more as post-hoc rationalizations ("selected when...") than as empirically validated decision rules. No ablation studies compare different algorithm selections for the same task.
  • Fairness of comparison. The baseline is the initial meta-agent scaffold with no iteration. The "previous SOTA" numbers are drawn from different systems with potentially different compute budgets. Compute costs for SIA (LLM calls for meta-agent + feedback agent using Claude Sonnet 4.6, plus RL training on H100s) are never reported, making efficiency comparisons impossible.
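The gaps around single-run results and the missing weights-only condition could be closed with a full two-lever ablation grid and per-condition variance reporting. The sketch below is one way to set that up, under stated assumptions: `ablation_label` and `summarize` are hypothetical helpers, the interval uses a normal approximation (a t-interval would be more appropriate for very few runs), and the scores are placeholder values, not results from the paper.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev
from typing import List, Sequence, Tuple


def ablation_label(harness: bool, weights: bool) -> str:
    """Name each cell of the 2x2 grid, including the untested weights-only cell."""
    if harness and weights:
        return "SIA-W+H"
    if harness:
        return "SIA-H"
    if weights:
        return "SIA-W"
    return "Baseline"


# All four combinations of (harness updates, weight updates).
ABLATION_GRID: List[str] = [
    ablation_label(h, w) for h in (False, True) for w in (False, True)
]


def summarize(scores: Sequence[float], confidence: float = 0.95) -> Tuple[float, float, float]:
    """Return (mean, sample std, CI half-width) over repeated runs of one condition."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    m = mean(scores)
    s = stdev(scores)  # sample standard deviation (n - 1 in the denominator)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # normal approximation
    return m, s, z * s / sqrt(len(scores))


# Placeholder scores from five hypothetical seeds (NOT from the paper).
placeholder_runs = [0.50, 0.54, 0.52, 0.48, 0.56]
run_mean, run_std, half_width = summarize(placeholder_runs)
print(f"{run_mean:.3f} ± {half_width:.3f} (std {run_std:.3f}, n={len(placeholder_runs)})")
```

Reporting each of the four conditions as mean ± half-width over several seeds would make it possible to judge whether the combined condition's advantage exceeds run-to-run noise.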

3. Potential Impact

The framing of unifying scaffold and weight updates is compelling and timely. If the results generalize, this could establish a new paradigm for building self-improving AI systems. The practical implications are significant: rather than requiring separate teams for prompt engineering and model fine-tuning, a single automated loop handles both.

However, the current paper's impact is limited by:

  • Reliance on proprietary infrastructure (internal model, Modal platform)
  • Lack of a released framework or code
  • The Feedback-Agent itself (Claude Sonnet 4.6) is a frontier model making high-level architectural decisions—this is expensive and may not scale

The conceptual framework could inspire follow-up work even if the specific implementation isn't reproducible.

4. Timeliness & Relevance

The paper addresses a genuine and timely gap. The self-improving AI space is active, with concurrent work on scaffold evolution (Hyperagents, Darwin Gödel Machine) and test-time training (TTRL, Discover-TTT). The observation that these two lines are siloed is accurate and the proposed unification is natural. The comparison table (Table 1) effectively positions SIA as the first system editing both harness and weights.

The timing is right (the field has matured enough in both scaffold engineering and test-time RL that combining them is feasible), but the paper would have benefited from stronger baselines from each individual silo.

5. Strengths & Limitations

Key Strengths:

  • Clean conceptual framing that identifies a genuine gap between two research silos
  • Diverse benchmark selection spanning very different domains
  • Consistent improvement from combining both levers across all three tasks
  • Detailed qualitative analysis of what each lever changes (§7.2-7.4)
  • The observation that weight updates can discover structural invariants (np.clip + np.rint in denoising) that harness iteration never found is insightful

Notable Weaknesses:

  • Reproducibility crisis: Proprietary base model, proprietary RL platform, no code release mentioned
  • Missing ablation: No weight-only (SIA-W) condition
  • No variance reporting: Single-run results on stochastic processes
  • Compute budget asymmetry: Harness updates and weight updates consume very different compute; the paper doesn't control for total compute
  • Coupled Goodhart concern acknowledged but unaddressed: The authors identify this as a limitation but provide no empirical investigation of its effects
  • Feedback-Agent is a black box: The meta-level decisions (when to switch levers, which RL algorithm to use) are made by a frontier LLM with no transparent decision rule
  • Scale of evaluation: 913 test instances (LawBench), single fixed input (TriMul), single dataset (denoising)—relatively small-scale evaluations

Additional Observations

The paper's writing is clear and well-structured. The research questions are explicitly stated and mapped to sections. However, some claims are overstated relative to the evidence: "502% improvement on denoising" sounds dramatic but reflects going from 0.048 to 0.289 on a normalized score, where much of the gain comes from harness updates establishing a working pipeline from a near-zero baseline.
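The arithmetic behind the headline denoising figure can be checked directly from the two scores quoted above. In this minimal sketch, `relative_gain_percent` is a hypothetical helper name and the inputs are the 0.048 and 0.289 values from the assessment.

```python
def relative_gain_percent(baseline: float, final: float) -> float:
    """Relative improvement over the baseline, expressed as a percentage."""
    return (final - baseline) / baseline * 100.0


baseline_score = 0.048  # initial normalized denoising score
final_score = 0.289     # normalized score after the full loop

gain = relative_gain_percent(baseline_score, final_score)
absolute_gain = final_score - baseline_score

print(f"relative gain: {gain:.2f}%")           # roughly 502.08%
print(f"absolute gain: {absolute_gain:.3f}")   # roughly 0.241
```

The relative figure is large mainly because the denominator is small: the absolute change is about 0.241 on the normalized scale.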

The future work on meta-RL over the action-selection policy is intellectually interesting but speculative. The paper would benefit from more grounded near-term experiments, such as testing on established benchmarks where both scaffold-only and weight-only SOTA numbers exist.

Rating: 5.5/10
Significance 6.5 · Rigor 4.5 · Novelty 6 · Clarity 7.5

Generated May 27, 2026

Comparison History (18)

    vs. Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs
    gpt-5.2 · 5/27/2026

    Paper 2 has higher likely impact: it identifies a broadly relevant failure mode (monitoring-control gap) in widely deployed RAG systems, backed by large-scale multi-turn evaluations (50k+), multiple model families, human validation, and mechanistic analyses. The finding challenges common safety evaluation assumptions and motivates new benchmarks and mitigation research across AI safety, NLP, and human-AI decision-making. Paper 1 is innovative and application-rich, but self-improving harness+weight updates face higher reproducibility/safety barriers and may be more sensitive to engineering choices, narrowing near-term cross-field adoption.

    vs. Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
    gemini-3.1 · 5/27/2026

    Paper 1 presents a significant step towards recursive self-improving AI by unifying harness and weight updates. Its demonstration of massive empirical gains across highly diverse domains (law, systems programming, and biology) highlights its broad applicability and transformative capability potential. While Paper 2 identifies a critical vulnerability in current RLHF methods, Paper 1 introduces a novel paradigm that could fundamentally accelerate how AI systems are built, optimized, and deployed, giving it a higher long-term scientific and practical impact.

    vs. Retrying vs Resampling in AI Control
    gemini-3.1 · 5/27/2026

    Paper 2 addresses a grand challenge in AI (self-improvement) by unifying two major, previously disjoint research paradigms: scaffold/harness updates and test-time weight updates. Its demonstration of massive performance gains across highly diverse, real-world domains (law, GPU optimization, and bioinformatics) suggests a much broader applicability and transformative potential. Paper 1 offers valuable, rigorous insights into AI safety and control mechanisms, but its impact is narrower and more specialized compared to the foundational architectural shift proposed in Paper 2.

    vs. Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2
    gpt-5.2 · 5/27/2026

    Paper 1 has higher potential impact: it proposes a unifying self-improvement loop that jointly updates agent harness and model weights, bridging two previously separate research directions, and demonstrates large gains across three diverse domains (law, systems/GPU optimization, and biology), suggesting broad applicability. This combination is novel, timely for agentic AI, and could influence multiple fields and future autonomous AI development. Paper 2 is a solid, practical incremental improvement to an existing PTQ pipeline for a specific model/format (Wan2.2/HiFloat4), with narrower scope and less cross-field impact.

    vs. XDecomposer: Learning Prior-Free Set Decomposition for Multiphase X-ray Diffraction
    gemini-3.1 · 5/27/2026

    Paper 2 addresses a grand challenge in artificial intelligence—self-improving AI—by unifying harness and weight updates. Its framework is domain-agnostic, demonstrating significant performance gains across diverse fields like law, GPU optimization, and biology. While Paper 1 provides a highly valuable tool for materials science, Paper 2 has a much broader potential impact across multiple scientific and engineering disciplines due to its generalizable approach to autonomous AI improvement.

    vs. Maat: The Agentic Legal Research Assistant for Competition Protection
    claude-opus-4.6 · 5/27/2026

    SIA addresses a fundamental bottleneck in AI research—self-improvement—by unifying two previously disjoint research paradigms (harness updates and weight updates) into a single loop. This has broader scientific impact across multiple fields, demonstrated by evaluation on three diverse domains. The novelty of combining scaffold iteration with test-time weight training opens a new research direction with wide applicability. Paper 1, while useful, is a domain-specific application of existing techniques (ReAct + RAG) to competition law, with narrower impact scope.

    vs. Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models
    gemini-3.1 · 5/27/2026

    Paper 1 addresses a fundamental bottleneck in AI development by unifying scaffolding and weight updates for self-improving AI. This novel paradigm represents a significant step toward autonomous AI improvement. Its demonstration of massive performance gains across highly diverse domains (law, GPU optimization, and computational biology) proves its broad applicability and transformative potential. While Paper 2 offers a valuable methodological improvement for LLM evaluation, Paper 1's approach to self-improving systems has a much higher ceiling for real-world impact and cross-disciplinary innovation.

    vs. Learning to Search and Searching to Learn for Generalization in Planning
    gpt-5.2 · 5/27/2026

    Paper 1 offers a methodologically grounded integration of classic best-first search (WA*/A*) with learned relational heuristics via Q-learning, demonstrating striking zero-shot combinatorial generalization (e.g., 30→488 blocks) and broad relevance to planning, RL, and neuro-symbolic methods. Its evaluation on standard puzzles and IPC benchmarks increases rigor and reproducibility. Paper 2 targets a timely self-improvement theme with attractive applications, but the claims (large gains, cross-domain harness+weights updates) are harder to assess without strong controls and may be more system/benchmark-dependent, reducing expected durable scientific impact.

    vs. A governance horizon for ethical-use constraints in open-weight AI models
    gemini-3.1 · 5/27/2026

    Paper 2 tackles a foundational goal in artificial intelligence—autonomous self-improvement—by unifying scaffold optimization and weight updates. Its methodological breadth, demonstrated by massive performance gains across disparate domains (law, GPU optimization, RNA denoising), gives it immense potential for widespread, cross-disciplinary scientific application. While Paper 1 provides a highly rigorous and important empirical analysis for AI governance, Paper 2 directly advances core AI capabilities, which typically drives deeper technical impact and broader adoption in the machine learning community.

    vs. MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration
    gpt-5.2 · 5/27/2026

    Paper 1 is more novel and potentially transformative: it unifies two previously separate self-improvement paradigms (harness/scaffold editing and weight updates) into a single loop, aiming at the long-standing goal of autonomous AI improvement. Its evaluation spans disparate domains (law, systems/GPU optimization, biology), suggesting broader cross-field impact and generality. If methodologically sound, jointly optimizing agent scaffolds and model parameters could influence many areas of agent design and continual learning. Paper 2 is timely and practical for on-device GUI agents, but its core idea is a narrower systems/prompting optimization with more incremental scientific novelty.

    vs. Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling
    gpt-5.2 · 5/27/2026

    Paper 2 has higher potential impact due to a more ambitious and general paradigm: a unified self-improvement loop that updates both agent harness and model weights, bridging two previously separate research lines. It demonstrates large gains across three highly distinct real-world domains (law, systems optimization, biology), suggesting broad applicability and timeliness as interest in autonomous AI improvement grows. Paper 1 is methodologically solid and timely for test-time scaling, but its contribution is narrower (search/backtracking for LM reasoning) and likely impacts a more specific subcommunity.

    vs. Hera: Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents
    gemini-3.1 · 5/27/2026

    Paper 1 tackles a fundamental and highly ambitious goal in AI research: autonomous self-improvement. By successfully unifying previously disjoint approaches (harness updates and weight updates) and demonstrating significant performance gains across highly diverse and complex domains (law, systems optimization, and biology), it offers a broader conceptual leap. Paper 2, while offering a practical and rigorous solution for deployment efficiency (device-cloud routing), represents a more incremental optimization rather than a fundamental paradigm shift in AI capabilities.

    vs. Learning to Reason Efficiently with A* Post-Training
    gpt-5.2 · 5/27/2026

    Paper 2 is likely higher impact: it introduces a clear, principled bridge between classical optimal search (A*) and LLM reasoning, with broadly applicable training signals (trace SFT + A*-informed RL) that target both correctness and efficiency. The result—small models rivaling much larger ones—has strong practical implications for cost-effective reasoning systems and is highly timely. Methodologically, the framing is crisp and analyzable, with interpretable trade-offs and insights about imperfect heuristics. Paper 1 is ambitious and cross-domain, but “self-improving” harness+weights loops raise reproducibility/safety concerns and may be harder to generalize rigorously.

    vs. FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning
    claude-opus-4.6 · 5/27/2026

    SIA introduces a fundamentally novel paradigm by unifying two previously disjoint research lines (harness updates and weight updates) into a single self-improving loop, addressing the core bottleneck of human involvement in AI improvement. Its cross-domain evaluation (legal, GPU optimization, bioinformatics) demonstrates broad applicability with substantial gains. This work opens a new research direction toward truly self-improving AI systems, which has transformative potential. FAST-GOAL, while solid, is an incremental improvement to CLIP for handling long text—a narrower contribution in an already crowded vision-language fine-tuning space.

    vs. StructBreak: Structural Cognitive Overload-Induced Safety Failures in MLLMs
    claude-opus-4.6 · 5/27/2026

    SIA addresses a fundamental challenge in AI—self-improvement through both harness and weight updates—unifying two previously disjoint research paradigms. It demonstrates strong empirical results across three diverse domains (56.6%-502% improvements), suggesting broad applicability. This work advances the long-horizon goal of autonomous AI improvement, which has transformative potential. While StructBreak is valuable for AI safety (revealing structural vulnerabilities in MLLMs), it is more narrowly focused on attack methodology. SIA's framework for closing the loop on AI self-improvement has broader and more lasting implications for the field.

    vs. From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation
    gemini-3.1 · 5/27/2026

    Paper 1 addresses a fundamental bottleneck in AI development (self-improvement) by combining harness and weight updates. It demonstrates substantial improvements across three vastly different domains (law, hardware optimization, and biology), indicating broad cross-disciplinary applicability and high methodological rigor. In contrast, Paper 2 presents a specialized RAG application focused narrowly on legal document analysis, giving it a much more limited scope and breadth of potential impact.

    vs. Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)
    gpt-5.2 · 5/27/2026

    Paper 2 is more novel and broadly impactful: it unifies two previously separate self-improvement paradigms (harness/scaffold rewriting and weight updates) into a single closed-loop framework, with strong cross-domain results (law, systems/GPU optimization, and biology). This combination targets a central, timely bottleneck—reducing human labor in model/agent improvement—and could generalize widely across applied AI. Paper 1 addresses an important deployment problem with a practical neuro-symbolic verification pipeline, but its novelty and breadth are narrower and its guarantees are partial (semantic similarity heuristics), likely limiting transformative impact relative to SIA’s self-improvement agenda.

    vs. Solving Combinatorial Counting Problems with Weighted First-Order Model Counting
    gemini-3.1 · 5/27/2026

    Paper 1 addresses a grand challenge in AI (self-improvement) by unifying two major, previously disjoint research directions (harness updates and weight updates). Its methodology demonstrates substantial empirical gains across highly diverse and impactful domains (law, systems optimization, biology), indicating broad real-world applicability and high relevance to current AI bottlenecks. Paper 2, while methodologically rigorous and innovative in discrete mathematics and logic, focuses on a more niche problem (combinatorial counting), giving it a narrower scope of impact compared to the general-purpose AI advancements in Paper 1.