SIA: Self Improving AI with Harness & Weight Updates
Prannay Hebbar, Yogendra Manawat, Samuel Verboomen, Alesia Ivanova, Selvam Palanimalai, Kunal Bhatia, Vignesh Baskaran
Abstract
Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-horizon goal of an AI that can figure out how to improve itself remains open. Two largely disjoint research lines attack this bottleneck. The harness-update school has a meta-agent rewrite the scaffold of a task-specific agent (its tools, prompts, retry logic, and search procedure) while the model weights are held fixed. The test-time training school uses hand-written RL pipelines to update the model's own weights on task feedback while the harness is held fixed. These two silos operate in isolation. We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. We evaluate across three contrasting domains: Chinese legal charge classification, low-level GPU kernel optimisation, and single-cell RNA denoising. Combining both levers outperforms scaffold iteration alone on all three benchmarks. Relative to the initial baseline, the gains are 56.6% on LawBench, a 91.9% runtime reduction on GPU kernels, and 502% on denoising. Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.
AI Impact Assessments
Scientific Impact Assessment: SIA: Self Improving AI with Harness & Weight Updates
1. Core Contribution
SIA proposes unifying two previously disjoint approaches to AI self-improvement: (1) harness/scaffold updates (modifying prompts, tools, and retry logic while keeping model weights fixed) and (2) test-time training (updating model weights via RL while keeping the scaffold fixed). A "Feedback-Agent" dynamically selects between these two levers in a closed loop. The system is evaluated on three diverse tasks: Chinese legal charge classification (LawBench), GPU kernel optimization (TriMul), and single-cell RNA denoising.
The conceptual contribution is straightforward and intuitive: if you can improve both the scaffold and the weights, you should do better than improving either alone. The paper's value lies in demonstrating this empirically and providing an architecture where an LLM-based meta-agent orchestrates both types of updates.
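The closed loop described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `AgentState`, `feedback_loop`, `toy_evaluate`, and `toy_chooser` are hypothetical names, the version counters stand in for real scaffold edits and RL weight updates, and the rule of keeping a change only when the score improves is an assumption rather than something the paper is known to specify.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class AgentState:
    """Hypothetical stand-in for a task-specific agent."""
    harness_version: int = 0          # proxy for prompts, tools, retry logic
    weight_version: int = 0           # proxy for a model checkpoint
    applied: Tuple[str, ...] = ()     # levers that were accepted, in order


def feedback_loop(
    state: AgentState,
    evaluate: Callable[[AgentState], float],
    choose_lever: Callable[[AgentState, float], str],
    rounds: int,
) -> Tuple[AgentState, float]:
    """Evaluate, pick a lever, apply it, and keep the change only if it helps."""
    best = evaluate(state)
    for _ in range(rounds):
        lever = choose_lever(state, best)
        if lever == "harness":
            candidate = replace(state, harness_version=state.harness_version + 1)
        elif lever == "weights":
            candidate = replace(state, weight_version=state.weight_version + 1)
        else:
            raise ValueError(f"unknown lever: {lever!r}")
        score = evaluate(candidate)
        if score > best:  # assumed acceptance rule: greedy, improvement only
            state = replace(candidate, applied=state.applied + (lever,))
            best = score
    return state, best


def toy_evaluate(s: AgentState) -> float:
    """Toy score: harness edits saturate after two steps, weights after three."""
    return 0.1 * min(s.harness_version, 2) + 0.05 * min(s.weight_version, 3)


def toy_chooser(s: AgentState, score: float) -> str:
    """Toy policy: fix the scaffold first, then switch to weight updates."""
    return "harness" if s.harness_version < 2 else "weights"


final_state, final_score = feedback_loop(AgentState(), toy_evaluate, toy_chooser, rounds=4)
print(final_state.applied)
```

In a real system the chooser would be the LLM-based meta-agent and each lever would trigger an actual scaffold rewrite or training run; the sketch only shows the control flow of selecting between the two.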
2. Methodological Rigor
Strengths in experimental design:
Significant concerns:
3. Potential Impact
The framing of unifying scaffold and weight updates is compelling and timely. If the results generalize, this could establish a new paradigm for building self-improving AI systems. The practical implications are significant: rather than requiring separate teams for prompt engineering and model fine-tuning, a single automated loop handles both.
However, the current paper's impact is limited by:
The conceptual framework could inspire follow-up work even if the specific implementation isn't reproducible.
4. Timeliness & Relevance
The paper addresses a genuine and timely gap. The self-improving AI space is active, with concurrent work on scaffold evolution (Hyperagents, Darwin Gödel Machine) and test-time training (TTRL, Discover-TTT). The observation that these two lines are siloed is accurate and the proposed unification is natural. The comparison table (Table 1) effectively positions SIA as the first system editing both harness and weights.
The timing is right—the field has matured enough in both scaffold engineering and test-time RL that combining them is feasible—but the paper would have benefited from stronger baselines from each individual silo.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Additional Observations
The paper's writing is clear and well-structured. The research questions are explicitly stated and mapped to sections. However, some claims are overstated relative to the evidence: "502% improvement on denoising" sounds dramatic but reflects a move from 0.048 to 0.289 on a normalized score, and much of that gain comes from harness updates establishing a working pipeline from a near-zero baseline.
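The arithmetic behind the denoising figure can be checked directly from the two scores quoted in this paragraph. The helper name `relative_gain_percent` is hypothetical; the only inputs are the 0.048 and 0.289 values given in the review.

```python
def relative_gain_percent(baseline: float, final: float) -> float:
    """Relative improvement of `final` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (final - baseline) / baseline * 100.0


baseline_score = 0.048  # initial normalized denoising score quoted in the review
final_score = 0.289     # final normalized denoising score quoted in the review

gain = relative_gain_percent(baseline_score, final_score)
absolute = final_score - baseline_score

print(f"relative gain: {gain:.0f}%")      # about 502%
print(f"absolute gain: {absolute:.3f}")   # about 0.241 on the normalized scale
```

The same result reads very differently in the two forms: a small absolute change on the normalized scale becomes a large percentage because the baseline is close to zero.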
The future work on meta-RL over the action-selection policy is intellectually interesting but speculative. The paper would benefit from more grounded near-term experiments, such as testing on established benchmarks where both scaffold-only and weight-only SOTA numbers exist.
Generated May 27, 2026
Comparison History (18)
Paper 2 has higher likely impact: it identifies a broadly relevant failure mode (monitoring-control gap) in widely deployed RAG systems, backed by large-scale multi-turn evaluations (50k+), multiple model families, human validation, and mechanistic analyses. The finding challenges common safety evaluation assumptions and motivates new benchmarks and mitigation research across AI safety, NLP, and human-AI decision-making. Paper 1 is innovative and application-rich, but self-improving harness+weight updates face higher reproducibility/safety barriers and may be more sensitive to engineering choices, narrowing near-term cross-field adoption.
Paper 1 presents a significant step towards recursive self-improving AI by unifying harness and weight updates. Its demonstration of massive empirical gains across highly diverse domains (law, systems programming, and biology) highlights its broad applicability and transformative capability potential. While Paper 2 identifies a critical vulnerability in current RLHF methods, Paper 1 introduces a novel paradigm that could fundamentally accelerate how AI systems are built, optimized, and deployed, giving it a higher long-term scientific and practical impact.
Paper 2 addresses a grand challenge in AI (self-improvement) by unifying two major, previously disjoint research paradigms: scaffold/harness updates and test-time weight updates. Its demonstration of massive performance gains across highly diverse, real-world domains (law, GPU optimization, and bioinformatics) suggests a much broader applicability and transformative potential. Paper 1 offers valuable, rigorous insights into AI safety and control mechanisms, but its impact is narrower and more specialized compared to the foundational architectural shift proposed in Paper 2.
Paper 1 has higher potential impact: it proposes a unifying self-improvement loop that jointly updates agent harness and model weights, bridging two previously separate research directions, and demonstrates large gains across three diverse domains (law, systems/GPU optimization, and biology), suggesting broad applicability. This combination is novel, timely for agentic AI, and could influence multiple fields and future autonomous AI development. Paper 2 is a solid, practical incremental improvement to an existing PTQ pipeline for a specific model/format (Wan2.2/HiFloat4), with narrower scope and less cross-field impact.
Paper 2 addresses a grand challenge in artificial intelligence—self-improving AI—by unifying harness and weight updates. Its framework is domain-agnostic, demonstrating significant performance gains across diverse fields like law, GPU optimization, and biology. While Paper 1 provides a highly valuable tool for materials science, Paper 2 has a much broader potential impact across multiple scientific and engineering disciplines due to its generalizable approach to autonomous AI improvement.
SIA addresses a fundamental bottleneck in AI research—self-improvement—by unifying two previously disjoint research paradigms (harness updates and weight updates) into a single loop. This has broader scientific impact across multiple fields, demonstrated by evaluation on three diverse domains. The novelty of combining scaffold iteration with test-time weight training opens a new research direction with wide applicability. Paper 1, while useful, is a domain-specific application of existing techniques (ReAct + RAG) to competition law, with narrower impact scope.
Paper 1 addresses a fundamental bottleneck in AI development by unifying scaffolding and weight updates for self-improving AI. This novel paradigm represents a significant step toward autonomous AI improvement. Its demonstration of massive performance gains across highly diverse domains (law, GPU optimization, and computational biology) proves its broad applicability and transformative potential. While Paper 2 offers a valuable methodological improvement for LLM evaluation, Paper 1's approach to self-improving systems has a much higher ceiling for real-world impact and cross-disciplinary innovation.
Paper 1 offers a methodologically grounded integration of classic best-first search (WA*/A*) with learned relational heuristics via Q-learning, demonstrating striking zero-shot combinatorial generalization (e.g., 30→488 blocks) and broad relevance to planning, RL, and neuro-symbolic methods. Its evaluation on standard puzzles and IPC benchmarks increases rigor and reproducibility. Paper 2 targets a timely self-improvement theme with attractive applications, but the claims (large gains, cross-domain harness+weights updates) are harder to assess without strong controls and may be more system/benchmark-dependent, reducing expected durable scientific impact.
Paper 2 tackles a foundational goal in artificial intelligence—autonomous self-improvement—by unifying scaffold optimization and weight updates. Its methodological breadth, demonstrated by massive performance gains across disparate domains (law, GPU optimization, RNA denoising), gives it immense potential for widespread, cross-disciplinary scientific application. While Paper 1 provides a highly rigorous and important empirical analysis for AI governance, Paper 2 directly advances core AI capabilities, which typically drives deeper technical impact and broader adoption in the machine learning community.
Paper 1 is more novel and potentially transformative: it unifies two previously separate self-improvement paradigms (harness/scaffold editing and weight updates) into a single loop, aiming at the long-standing goal of autonomous AI improvement. Its evaluation spans disparate domains (law, systems/GPU optimization, biology), suggesting broader cross-field impact and generality. If methodologically sound, jointly optimizing agent scaffolds and model parameters could influence many areas of agent design and continual learning. Paper 2 is timely and practical for on-device GUI agents, but its core idea is a narrower systems/prompting optimization with more incremental scientific novelty.
Paper 2 has higher potential impact due to a more ambitious and general paradigm: a unified self-improvement loop that updates both agent harness and model weights, bridging two previously separate research lines. It demonstrates large gains across three highly distinct real-world domains (law, systems optimization, biology), suggesting broad applicability and timeliness as interest in autonomous AI improvement grows. Paper 1 is methodologically solid and timely for test-time scaling, but its contribution is narrower (search/backtracking for LM reasoning) and likely impacts a more specific subcommunity.
Paper 1 tackles a fundamental and highly ambitious goal in AI research: autonomous self-improvement. By successfully unifying previously disjoint approaches (harness updates and weight updates) and demonstrating significant performance gains across highly diverse and complex domains (law, systems optimization, and biology), it offers a broader conceptual leap. Paper 2, while offering a practical and rigorous solution for deployment efficiency (device-cloud routing), represents a more incremental optimization rather than a fundamental paradigm shift in AI capabilities.
Paper 2 is likely higher impact: it introduces a clear, principled bridge between classical optimal search (A*) and LLM reasoning, with broadly applicable training signals (trace SFT + A*-informed RL) that target both correctness and efficiency. The result—small models rivaling much larger ones—has strong practical implications for cost-effective reasoning systems and is highly timely. Methodologically, the framing is crisp and analyzable, with interpretable trade-offs and insights about imperfect heuristics. Paper 1 is ambitious and cross-domain, but “self-improving” harness+weights loops raise reproducibility/safety concerns and may be harder to generalize rigorously.
SIA introduces a fundamentally novel paradigm by unifying two previously disjoint research lines (harness updates and weight updates) into a single self-improving loop, addressing the core bottleneck of human involvement in AI improvement. Its cross-domain evaluation (legal, GPU optimization, bioinformatics) demonstrates broad applicability with substantial gains. This work opens a new research direction toward truly self-improving AI systems, which has transformative potential. FAST-GOAL, while solid, is an incremental improvement to CLIP for handling long text—a narrower contribution in an already crowded vision-language fine-tuning space.
SIA addresses a fundamental challenge in AI—self-improvement through both harness and weight updates—unifying two previously disjoint research paradigms. It demonstrates strong empirical results across three diverse domains (56.6%-502% improvements), suggesting broad applicability. This work advances the long-horizon goal of autonomous AI improvement, which has transformative potential. While StructBreak is valuable for AI safety (revealing structural vulnerabilities in MLLMs), it is more narrowly focused on attack methodology. SIA's framework for closing the loop on AI self-improvement has broader and more lasting implications for the field.
Paper 1 addresses a fundamental bottleneck in AI development (self-improvement) by combining harness and weight updates. It demonstrates substantial improvements across three vastly different domains (law, hardware optimization, and biology), indicating broad cross-disciplinary applicability and high methodological rigor. In contrast, Paper 2 presents a specialized RAG application focused narrowly on legal document analysis, giving it a much more limited scope and breadth of potential impact.
Paper 2 is more novel and broadly impactful: it unifies two previously separate self-improvement paradigms (harness/scaffold rewriting and weight updates) into a single closed-loop framework, with strong cross-domain results (law, systems/GPU optimization, and biology). This combination targets a central, timely bottleneck—reducing human labor in model/agent improvement—and could generalize widely across applied AI. Paper 1 addresses an important deployment problem with a practical neuro-symbolic verification pipeline, but its novelty and breadth are narrower and its guarantees are partial (semantic similarity heuristics), likely limiting transformative impact relative to SIA’s self-improvement agenda.
Paper 1 addresses a grand challenge in AI (self-improvement) by unifying two major, previously disjoint research directions (harness updates and weight updates). Its methodology demonstrates substantial empirical gains across highly diverse and impactful domains (law, systems optimization, biology), indicating broad real-world applicability and high relevance to current AI bottlenecks. Paper 2, while methodologically rigorous and innovative in discrete mathematics and logic, focuses on a more niche problem (combinatorial counting), giving it a narrower scope of impact compared to the general-purpose AI advancements in Paper 1.