GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents
Johannes Moll, Jean-Philippe Corbeil, Jiazhen Pan, Martin Hadamitzky, Daniel Rueckert, Lisa Adams, Keno Bressem
Abstract
LLM agents acting in structured environments fail in operational rather than conversational ways, and reliability depends on procedural knowledge of the environment. Prior self-improvement methods accumulate natural-language guidance without checking that each new item preserves previously correct behavior, so a note that fixes one trajectory can silently regress another. We introduce GRASP (Gated Regression-Aware Skill Proposer), which treats agent improvement as a sequence of edits to a bounded skill library, admitting each candidate only if it produces a net improvement on a balanced held-out probe under a hard regression budget. We evaluate GRASP across five base models (gpt-oss-120b, DeepSeek V4 Flash, Gemini 3.1 Flash Lite, GPT-4.1, GPT-5.4) on two FHIR-based clinical benchmarks. On MedAgentBench, GRASP lifts gpt-oss-120b from 40.6% to 88.8%, exceeds the strongest of five self-improvement baselines by 21.0 points, and improves every other base model by 17.2 to 40.3 points. Ablations attribute the gain to comparative proposal generation, the acceptance gate, and the hard regression budget rather than to skill writing itself, which without validation is no better than using no skills. The mechanism generalizes beyond the clinical domain, improving agents on three of four non-clinical environments and remaining flat only where the action space is open-ended. Frozen libraries transfer across models, where skills from a stronger model improve weaker executors beyond what they learn for themselves while the reverse does not, an asymmetry that no ungated baseline reproduces.
AI Impact Assessments
Scientific Impact Assessment: GRASP, Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents
1. Core Contribution
GRASP addresses a genuine and well-defined problem: prior self-improvement methods for LLM agents (e.g., Reflexion, ExpeL) accumulate natural-language guidance monotonically without verifying that new additions preserve previously correct behavior. This leads to silent regressions and context dilution. GRASP reframes agent self-improvement as a sequence of validated, reversible edits to a bounded library of structured behavioral "skills." The key mechanism is a regression-aware acceptance gate: candidate skill edits are evaluated on a balanced held-out probe of previously failing and previously passing examples, and accepted only if they produce net improvement without exceeding a hard regression budget. This is conceptually clean: it imports the core idea of held-out validation from prompt optimization (DSPy) and applies it at the granularity of individual skill edits, with explicit regression control analogous to the prevention of catastrophic forgetting in continual learning.
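The acceptance rule described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ProbeItem`, `gate_accepts`, and `passes_with_candidate` are hypothetical, and the exact net-improvement criterion, the budget value, and how the probe is balanced are assumptions, since the text here does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ProbeItem:
    """One held-out probe task (hypothetical structure)."""
    task_id: str
    passed_before: bool  # outcome under the current, unedited library


def gate_accepts(
    probe: Sequence[ProbeItem],
    passes_with_candidate: Callable[[str], bool],
    regression_budget: int = 0,
) -> bool:
    """Accept a candidate edit only if it is a net win within the budget.

    Assumption: "net improvement" means fixes minus regressions > 0, and
    the "hard regression budget" caps the raw number of regressions.
    """
    fixes = 0
    regressions = 0
    for item in probe:
        passes_now = passes_with_candidate(item.task_id)
        if not item.passed_before and passes_now:
            fixes += 1  # previously failing task now passes
        elif item.passed_before and not passes_now:
            regressions += 1  # previously passing task now fails
    return fixes - regressions > 0 and regressions <= regression_budget


# Balanced probe: two previously failing tasks, two previously passing.
probe = [
    ProbeItem("fail-1", False),
    ProbeItem("fail-2", False),
    ProbeItem("pass-1", True),
    ProbeItem("pass-2", True),
]
# Candidate fixes both failures but breaks one previously passing task.
outcomes = {"fail-1": True, "fail-2": True, "pass-1": False, "pass-2": True}
print(gate_accepts(probe, outcomes.__getitem__, regression_budget=0))
```

With a budget of zero the candidate in the usage example is rejected despite a positive net score, which is the behavior that separates a hard budget from a plain net-improvement check.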
2. Methodological Rigor
The experimental design is thorough and well-controlled. Strengths include:
Notable weaknesses in rigor:
3. Potential Impact
Practical deployment: The bounded, auditable, versioned skill library is genuinely valuable for deployment in regulated environments (healthcare, finance). Each skill traces its provenance — which failure prompted it, which probe score justified acceptance — enabling human review and selective reversion without retraining. This is a meaningful contribution to trustworthy AI deployment.
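To make the auditability point concrete, here is one way a bounded, versioned library with provenance and selective reversion could be laid out. It is a sketch under stated assumptions rather than the paper's data model: `SkillEdit`, `SkillLibrary`, `max_skills`, and the field names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class SkillEdit:
    """One accepted edit plus its provenance (hypothetical fields)."""
    skill_id: str
    text: str
    prompted_by: str    # the failure that prompted this edit
    probe_score: float  # the probe score that justified acceptance


@dataclass
class SkillLibrary:
    max_skills: int  # bound on the number of distinct active skills
    history: List[SkillEdit] = field(default_factory=list)

    def active(self) -> Dict[str, SkillEdit]:
        """Latest surviving version of each skill, replayed from history."""
        skills: Dict[str, SkillEdit] = {}
        for edit in self.history:
            skills[edit.skill_id] = edit
        return skills

    def commit(self, edit: SkillEdit) -> bool:
        """Record an accepted edit; refuse new skills once the bound is hit."""
        current = self.active()
        if edit.skill_id not in current and len(current) >= self.max_skills:
            return False
        self.history.append(edit)
        return True

    def revert(self, skill_id: str) -> bool:
        """Drop the most recent edit to one skill, restoring the prior version."""
        for i in range(len(self.history) - 1, -1, -1):
            if self.history[i].skill_id == skill_id:
                del self.history[i]
                return True
        return False


lib = SkillLibrary(max_skills=1)
lib.commit(SkillEdit("fhir-dates", "v1 text", "task-17", 0.62))
lib.commit(SkillEdit("fhir-dates", "v2 text", "task-31", 0.71))
lib.revert("fhir-dates")  # a reviewer rolls back only the latest version
print(lib.active()["fhir-dates"].text)
```

Keeping an append-only history and deriving the active set from it is one design choice among several; it keeps every acceptance decision inspectable and makes reversion a local operation that needs no retraining. The task identifiers and scores in the usage example are made up for illustration.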
Transferability findings: The cross-model transfer asymmetry (stronger writer → weaker executor works; reverse does not) is an interesting empirical finding with practical implications. It suggests a "distillation without parameter updates" paradigm where expensive frontier models generate procedural knowledge that cheaper models can consume. That no ungated baseline reproduces this asymmetry strengthens the claim that it is the gate, not writer quality alone, that makes libraries transferable.
Broader methodological lesson: The central insight — that skill writing without validation is no better than no skills — is a cautionary finding for the entire verbal-feedback literature. The K=1 no-gate variant matching the no-skills baseline (40.1% vs 40.6%) is a striking negative result about unchecked self-improvement.
Limitations of scope: The mechanism works where "tasks recur with verifiable structure" and fails where "the action space is open-ended" (OS Interaction). This is an honest but significant boundary — many real-world agent tasks involve open-ended actions. The clinical benchmarks, while well-designed, test procedural reliability against curated FHIR state rather than clinical outcomes, limiting claims about real clinical applicability.
4. Timeliness & Relevance
This paper addresses a current bottleneck in the LLM agent community. As agents are deployed in increasingly structured environments (EHRs, databases, enterprise software), procedural reliability becomes the binding constraint rather than conversational ability. The failure mode of monotonic memory accumulation is well-documented but poorly addressed in practice. The timing is also appropriate given the rapid proliferation of clinical LLM agent benchmarks (MedAgentBench, MedAgentBench-v2, FHIR-AgentBench) and the growing recognition that operational failures dominate in structured environments.
5. Strengths & Limitations
Key strengths:
Notable limitations:
Overall Assessment
GRASP makes a solid contribution by formalizing agent self-improvement as validated, regression-aware editing of a bounded skill library. The core insight is well supported: the gate is what matters, not the skill writing. The experimental design is strong, with unusually thorough ablations and transfer experiments. The practical value for auditable, deployable agent improvement in structured environments is clear. The main limitations are the restriction to environments with recurring verifiable structure and the mixed results on the more challenging MedAgentBench-v2 benchmark.
Generated May 29, 2026
Comparison History (18)
Paper 2 addresses a critical bottleneck in LLM agent reliability—silent regression during self-improvement—by introducing a rigorous gating mechanism. Its focus on skill transferability across models and regression-aware learning provides a highly novel and methodologically rigorous approach. While Paper 1 offers useful guidelines for data organization during training, Paper 2's solution to dynamic agent improvement and robust empirical validation across domains suggests a broader potential impact on the rapidly growing field of autonomous AI agents.
GRASP addresses a critical practical problem—reliable self-improvement of LLM agents—with a principled regression-aware gating mechanism. It demonstrates large empirical gains across multiple models and benchmarks (clinical and non-clinical), shows transferability of learned skills across models, and provides thorough ablations. The work is highly timely given the rapid deployment of LLM agents. Paper 2 contributes a useful meta-programming framework for temporal ASP variants, but serves a narrower community in logic programming. GRASP's breadth of applicability, practical relevance, and connection to the fast-growing LLM agent ecosystem give it substantially higher potential impact.
Paper 1 shifts the paradigm of LLM safety evaluation from a binary outcome (Attack Success Rate) to a temporal, observable process using logits. This foundational diagnostic tool bridges mechanistic interpretability and real-time defense, offering broad implications for understanding and mitigating vulnerabilities. While Paper 2 presents a strong practical method for agent reliability, Paper 1 addresses a more critical, universally applicable bottleneck in AI safety with a highly novel, training-free approach.
Paper 1 addresses a critical bottleneck in LLM agent reliability—regression during self-improvement—with a robust, gated skill-proposal framework. Its extensive empirical validation across multiple state-of-the-art models, significant performance gains in both clinical and general environments, and demonstration of cross-model skill transfer indicate high methodological rigor and broad applicability. In contrast, Paper 2 is narrowly focused on inference-time steering for small language models specifically in mathematical reasoning, limiting its breadth of impact compared to Paper 1's generalized agentic self-improvement.
Paper 2 (GRASP) has higher likely impact due to a broadly applicable, rigorously validated mechanism for safe self-improvement: regression-aware gating with a hard regression budget directly tackles a central failure mode (silent regressions) in agent skill accumulation. It shows large, consistent gains across five diverse base models, includes ablations isolating causal components, demonstrates cross-domain generalization and model-to-model skill transfer—evidence of a more general principle. Paper 1 is strong but more domain-specific (optimization/NLCO) and its impact may be narrower compared to GRASP’s agent reliability contribution.
Paper 2 is likely higher impact due to a more novel, generalizable methodological contribution: a regression-aware gating mechanism for self-improving agents that directly addresses a key failure mode (silent regressions) and shows very large gains across multiple frontier models and environments, with strong ablations and transfer results. Its applications extend beyond medicine to reliable agentic systems broadly, making its cross-field breadth and timeliness (agent reliability) especially high. Paper 1 is valuable for reproducibility and open clinical LLM pipelines, but its core innovation is more infrastructure/process-oriented and narrower in methodological novelty.
Paper 2 presents a generalizable, methodologically rigorous approach to a fundamental problem in LLM agents (preventing regression during self-improvement). Its extensive evaluation across multiple frontier models, environments, and transferability tests demonstrates significant breadth of impact. In contrast, Paper 1 offers a domain-specific architectural design for finance, which, while valuable, has a narrower scope, fewer empirical validations, and lower potential for broad methodological impact across AI research.
GRASP addresses a broader and more impactful problem—systematic self-improvement of LLM agents with regression-aware gating—demonstrating large performance gains (40.6% to 88.8%) across multiple models and domains. Its contributions (gated skill libraries, cross-model transfer, regression budgets) are more novel and generalizable than Moment-KV's incremental improvement to KV cache compression (2.3-3.2% gains). GRASP's clinical applications, cross-domain generalization, and insights about skill transfer asymmetry give it wider potential impact across AI safety, agent reliability, and practical deployment.
Paper 1 (Scaling Monosemanticity) represents a landmark contribution to AI interpretability, demonstrating that mechanistic interpretability techniques scale to production-level models. It reveals multilingual/multimodal features, identifies safety-relevant features (deception, power-seeking), and enables causal steering of model behavior. Its breadth of impact spans AI safety, interpretability, and fundamental understanding of large language models. Paper 2, while methodologically sound with strong empirical results on agent self-improvement, addresses a narrower problem with more incremental contributions. Paper 1's implications for AI alignment and safety give it substantially broader and more lasting scientific impact.
Paper 1 has higher impact potential due to a more novel and broadly applicable contribution: a regression-aware gating mechanism for self-improving LLM agents that directly targets a key reliability failure mode (silent regressions). It is evaluated across multiple foundation models and multiple environments, suggesting generality beyond a single domain, with large reported gains and clear ablations supporting causal claims. The approach is timely for agent safety/reliability and could influence LLM tooling across fields. Paper 2 is valuable for urban/tourism planning but is more domain-specific and methodologically incremental (simulation + LLM generation) with narrower cross-field impact.
GRASP presents a novel, generalizable method for self-improving LLM agents with regression-aware gating that demonstrates strong empirical gains across multiple domains and models. Its contributions—bounded skill libraries, regression budgets, cross-model transferability—are broadly applicable beyond clinical settings. While RealICU is a valuable benchmark contribution exposing LLM failures in ICU reasoning, benchmarks typically have narrower impact than new methods. GRASP's methodological innovation, strong ablations, cross-domain generalization, and practical applicability give it higher potential for broad scientific impact.
Paper 1 introduces a highly novel adversarial self-play framework that addresses a fundamental flaw in LLM reasoning (fragility to distractors). By co-evolving reasoners and distractors, it forces models to move beyond superficial pattern matching. Paper 2, while showing impressive practical gains for agentic workflows, essentially applies standard software engineering concepts (regression testing) to skill libraries. Paper 1 offers deeper algorithmic innovation with broader implications for the foundational training of robust reasoning models across various domains.
Paper 2 frames skill evolution as a systematic text-space optimizer with direct analogies to deep learning optimization (e.g., learning-rate budgets, meta-updates), which provides a more foundational and generalizable methodological contribution. Furthermore, its extensive evaluation across 6 benchmarks, 7 target models, and 3 distinct execution harnesses (including real-world tools like Codex and Claude Code), alongside comparisons to strong baselines like TextGrad, suggests broader applicability and a higher potential impact across the AI agent community.
Paper 1 likely has higher scientific impact because it introduces a concrete, broadly useful methodological contribution (regression-aware gated self-improvement via a bounded skill library) with large empirical gains across multiple frontier models and both clinical and some non-clinical environments, plus transfer across models—suggesting strong practical adoption potential for reliable agent deployment. Paper 2 is timely and important for AI governance, but it is primarily a characterization study in a simulated setting; while influential for auditing and policy, it offers less of a general-purpose technical mechanism likely to be reused widely in systems.
Paper 1 proposes a novel, generalizable solution to a major bottleneck in agentic AI (regression during self-improvement). By demonstrating massive empirical gains across diverse models and environments, plus cross-model skill transferability, it actively drives the state-of-the-art forward. While Paper 2 provides a crucial diagnostic critique for AI evaluation, Paper 1 offers a fundamental architectural advancement with broader immediate utility.
Paper 1 has higher estimated impact due to a clearer methodological advance (regression-aware gated skill updates with a hard regression budget) that targets a widely observed failure mode in self-improving agents: silent regressions. It demonstrates large, replicated gains across five strong base models, includes ablations, and shows cross-domain and cross-model transfer—suggesting broad applicability to agent reliability and continual improvement. Paper 2 is novel and timely (proactive memory + new benchmark), but its impact may be narrower to conversational memory systems and depends more on architectural choices that are harder to standardize across tasks.
GRASP presents a concrete, empirically validated method with strong quantitative results across multiple models and benchmarks, demonstrating significant performance improvements (e.g., 40.6% to 88.8%). It introduces a novel gated regression-aware mechanism with rigorous ablations and demonstrates transferability. Paper 1 defines conceptual frameworks (Agentic Technical Debt, Stochastic Tax) but lacks empirical validation and offers primarily managerial guidelines. Paper 2's methodological rigor, reproducible results, cross-domain generalization, and practical applicability to improving LLM agent reliability give it substantially higher scientific impact.
GRASP addresses a fundamental bottleneck in AI agent reliability—regression during self-improvement—with a rigorous, generalizable gating mechanism. Its strong empirical results across diverse models and domains, along with cross-model skill transferability, offer broader and more immediate utility to the AI community compared to AgentSchool's domain-specific, simulation-based approach, which may face challenges regarding external validity.