Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

Haoran Sun, Wenjie Li, Yujie Zhang, Zekai Lin, Fanrui Zhang, Kaitao Chen, Xingqi He, Yichen Li

Jun 8, 2026 · arXiv:2606.09365v1

cs.AI · cs.CL

#814 of 3539 · Artificial Intelligence

Tournament Score: 1458 ± 44 (scale: 1050–1800)

Win Rate: 68%

Rating: 7.3 / 10

Significance 7.5 · Rigor 7.5 · Novelty 6.8 · Clarity 7.5

Abstract

Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop "Read–Write–Assess–Govern" lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SkeMex — Self-Evolving Skill Memory for Medical Agents

1. Core Contribution

SkeMex proposes a non-parametric, post-deployment self-evolution framework for medical agents that distills interaction trajectories into structured, reusable "skills" organized in a multi-branch repository (general, task-specific, action-level). The key innovation is the closed-loop Read–Write–Assess–Govern lifecycle that continuously creates, evaluates, promotes, merges, and deprecates skills based on empirically estimated utility from clinical feedback signals. This addresses a genuine gap: existing memory-augmented agents either store raw trajectories (noisy, redundant) or couple memory improvement with parameter updates (costly, prone to catastrophic forgetting). SkeMex decouples memory evolution from model training entirely.
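The lifecycle described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the thresholds (`promote_at`, `remove_below`, `min_uses`), the additive ranking rule, and the example skill texts are all hypothetical placeholders, and the merge step mentioned in the assessment is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    skill_id: str
    branch: str               # "general", "task", or "action" (labels assumed)
    text: str
    utility: float = 0.0
    uses: int = 0
    status: str = "candidate"


class SkillRepository:
    """Toy Read-Write-Assess-Govern loop; all thresholds are assumptions."""

    def __init__(self, promote_at=0.5, remove_below=-0.5, min_uses=3):
        self.promote_at = promote_at
        self.remove_below = remove_below
        self.min_uses = min_uses
        self.skills = {}

    def write(self, skill):
        # Write: add a newly distilled skill unless the id already exists.
        self.skills.setdefault(skill.skill_id, skill)

    def read(self, relevance, k=2):
        # Read: rank stored skills by relevance plus estimated utility.
        # `relevance` maps skill_id -> similarity score supplied by the caller.
        live = [s for s in self.skills.values() if s.skill_id in relevance]
        live.sort(key=lambda s: relevance[s.skill_id] + s.utility, reverse=True)
        return live[:k]

    def assess(self, skill_id, feedback, lr=0.5):
        # Assess: move the utility estimate toward the observed feedback.
        skill = self.skills[skill_id]
        skill.uses += 1
        skill.utility += lr * (feedback - skill.utility)

    def govern(self):
        # Govern: promote useful skills and delete harmful ones,
        # but only after enough uses to trust the estimate.
        removed = []
        for skill in list(self.skills.values()):
            if skill.uses < self.min_uses:
                continue
            if skill.utility >= self.promote_at:
                skill.status = "promoted"
            elif skill.utility <= self.remove_below:
                del self.skills[skill.skill_id]
                removed.append(skill.skill_id)
        return removed
```

In this sketch a skill that repeatedly receives positive feedback is promoted, while one that repeatedly receives negative feedback is dropped and no longer appears in later reads. No model weights are touched at any point, which is the property the assessment highlights.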

The formulation as a Memory-based MDP (M-MDP) provides a principled framework where improvement comes from better memory content and retrieval rather than gradient updates. The utility estimation mechanism—using category-normalized advantage, adoption-aware credit assignment, and cosine warmup learning rates—is a thoughtful design that goes beyond simple recency or frequency heuristics.
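One way the three utility ingredients named above might fit together is sketched below. The paper's actual formulas are not given on this page, so everything here is an assumption: the EMA decay, the credit weights for adopted versus merely retrieved skills, the warmup length, and the additive update rule are illustrative choices only.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field


def cosine_warmup_lr(step, warmup_steps=10, lr_max=0.3):
    """Ramp the step size from 0 to lr_max along a half-cosine (assumed schedule)."""
    if step >= warmup_steps:
        return lr_max
    return lr_max * 0.5 * (1.0 - math.cos(math.pi * step / warmup_steps))


@dataclass
class UtilityTracker:
    ema_decay: float = 0.9                        # assumed baseline smoothing
    baselines: dict = field(default_factory=dict)  # category -> EMA of reward
    utility: dict = field(default_factory=lambda: defaultdict(float))
    updates: dict = field(default_factory=lambda: defaultdict(int))

    def update(self, category, reward, retrieved, adopted):
        # Category-normalized advantage: reward relative to that
        # category's running baseline, so easy and hard task types
        # are judged on their own scale.
        baseline = self.baselines.get(category, reward)
        advantage = reward - baseline

        for skill_id in retrieved:
            # Adoption-aware credit: skills the agent actually followed
            # get full credit, retrieved-but-ignored skills get less.
            credit = 1.0 if skill_id in adopted else 0.25  # hypothetical weights
            self.updates[skill_id] += 1
            lr = cosine_warmup_lr(self.updates[skill_id])
            self.utility[skill_id] += lr * credit * advantage

        # Update the category baseline after computing the advantage.
        self.baselines[category] = (
            self.ema_decay * baseline + (1.0 - self.ema_decay) * reward
        )
        return advantage
```

The warmup keeps early, noisy feedback from swinging a new skill's utility too far, and the per-category baseline prevents skills from looking useful merely because they were retrieved on easy cases.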

2. Methodological Rigor

Strengths in experimental design:

Evaluation spans 9 benchmarks (12 configurations) covering interactive diagnosis, rubric-based evaluation, multimodal reasoning, and knowledge-intensive QA—an unusually comprehensive setup for medical agent work.

16 baselines spanning medical specialists, memory-free agents, reflection-based methods, and self-improving memory agents provide thorough comparison.

Both offline (frozen repo) and online (streaming updates) evaluation modes are tested.

Cross-backbone transfer experiments (DeepSeek-V3.2 → Claude Sonnet-4.6, Qwen3.6-35B-A3B, Qwen3.6-Max-Preview, Kimi-2.6, GLM-5.1) demonstrate the skill repo captures transferable procedural knowledge rather than backbone-specific patterns.

Extensive ablations cover buffer management, encoding, branch structure, valuation components, retrieval signals, and governance mechanisms.

Sensitivity analysis across 5 hyperparameters shows stable performance across reasonable ranges.

Concerns:

The evaluation relies heavily on API-based LLMs (DeepSeek-V3.2, Qwen3.6-Plus), so exact reproducibility depends on API versioning and availability. Even with the promised code release, changes to the API endpoints could alter results.

Rubric-based evaluation using Gemini-3-Flash as judge introduces potential bias. No inter-annotator agreement or human validation of the judge is reported.

The skill distillation, classification, and governance all use the same backbone LLM, creating a circular dependency where the quality of memory evolution is bounded by the backbone's own reasoning capabilities.

Wall-clock time analysis (Table 16) shows SkeMex is ~2x slower than ReAct (116s vs 54s per task). The authors acknowledge this overhead but do not examine in depth what it means for clinical deployment feasibility.

Training data order analysis (Appendix I.5) reveals surprising sensitivity—random ordering outperforms structured curricula, suggesting the system's learning dynamics may be fragile in non-random deployment scenarios.

3. Potential Impact

Direct applications: The framework is directly applicable to clinical decision support systems that need to improve with use without retraining. The plug-and-play nature (no weight updates) makes it attractive for regulated medical environments where model retraining requires re-certification.

Broader influence: The skill-based memory paradigm with utility-driven governance could generalize beyond medicine to any domain requiring reliable, evolving agent memory (legal reasoning, scientific discovery, financial analysis). The multi-branch repository design and the Read–Write–Assess–Govern lifecycle represent a reusable architectural pattern.

Cross-model transferability is particularly impactful—a skill repo built by one model improving others suggests a form of "institutional knowledge" that persists independent of the underlying LLM, analogous to clinical protocols that outlast individual practitioners.

4. Timeliness & Relevance

This work addresses a critical bottleneck at the intersection of two fast-moving fields: medical AI agents and self-improving LLM systems. As medical agents transition from static QA to interactive clinical decision-making, the inability to accumulate and govern reusable experience is a genuine limitation. The paper's timing is excellent—it builds on the recent wave of skill-based memory work (Trace2Skill, SkillClaw, EvoSkills, all 2026) while being the first to comprehensively apply these ideas to medical domains with proper clinical evaluation infrastructure.

The non-parametric approach is especially timely given growing concerns about catastrophic forgetting in fine-tuned medical models and the computational costs of continual training.

5. Strengths & Limitations

Key Strengths:

Comprehensive framework design: The lifecycle metaphor (Read–Write–Assess–Govern) is intuitive and well-operationalized with concrete mechanisms at each stage.

Scale of evaluation: 9 benchmarks, 16+ baselines, 6 backbone models, both online/offline modes—this is among the most thorough evaluations in the medical agent literature.

Consistent gains: +7.88 to +13.78 points over ReAct across settings; no negative transfer on any benchmark (unlike several baselines).

Transferability evidence: Cross-backbone and cross-task transfer results are convincing and practically important.

Detailed ablations: Every component is individually justified with empirical evidence.

Notable Weaknesses:

No real clinical deployment: All evaluations are on benchmarks, not actual clinical environments. The gap between simulated patient interactions and real clinical workflows remains unaddressed.

Complexity: The system has many interacting components (gated buffers, two-pass distillation, novelty/quality gates, category-conditioned utility, cosine warmup, EMA baselines, branch-aware governance) with numerous hyperparameters. While sensitivity analysis suggests robustness, the engineering burden for adoption is high.

Limited failure analysis: Only one failure case is discussed. Systematic analysis of when and why skills fail would strengthen the contribution.

No safety analysis: For a medical system, the paper lacks formal safety evaluation beyond the "harm clamping" mechanism. Clinical safety implications of incorrect skill propagation deserve deeper treatment.

Scalability questions: How the system performs with thousands of accumulated skills over long deployment periods is not tested.

Overall Assessment

SkeMex is a well-engineered and thoroughly evaluated framework that makes a meaningful contribution to medical agent systems. The core insight—that agent experience should be distilled into governed, utility-tracked skills rather than stored as raw trajectories—is sound and practically important. The evaluation is among the most comprehensive in this space. However, the gap between benchmark evaluation and real clinical deployment, the system's complexity, and the limited safety analysis temper the impact somewhat.

Rating: 7.3 / 10

Significance 7.5 · Rigor 7.5 · Novelty 6.8 · Clarity 7.5

Generated Jun 9, 2026

Comparison History (19)

Won vs. HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation

Paper 1 (SkeMex) likely has higher impact due to its domain-critical focus (interactive clinical decision support), a clear governance-centric innovation (utility-aware skill memory with lifecycle management), and strong real-world applicability where safety, auditability, and continual post-deployment improvement matter. Its structured, multi-branch skill repository and memory retention/removal mechanism address a key limitation of generic trace memories, with cross-backbone generalization and planned public release supporting adoption. Paper 2 is technically timely for agent RL, but its scope is narrower and application domains less high-stakes.

gpt-5.2 · Jun 11, 2026

Won vs. Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation

Paper 1 introduces a novel, self-evolving memory framework for medical agents, directly advancing interactive clinical decision-making without requiring model retraining. Its focus on distilling experience into structured, reusable skills addresses core challenges in long-horizon AI reasoning. While Paper 2 provides a valuable benchmark, Paper 1 offers a methodological breakthrough with immediate, high-stakes applications in healthcare, likely yielding broader cross-disciplinary impact in both agentic AI and medical informatics.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

Paper 2 (SkeMex) likely has higher scientific impact due to strong real-world applicability in clinical decision support, a timely and high-stakes domain. Its self-evolving, governance-aware skill memory addresses key gaps in agent reliability, continual learning, and safety without weight updates, and could generalize beyond medicine to other interactive agent settings. If validated rigorously, the framework may influence agent design broadly (memory, utility estimation, lifecycle governance). Paper 1 is novel and useful for LLM efficiency, but its impact is narrower (KV-cache allocation during decoding) and primarily systems-level.

gpt-5.2 · Jun 10, 2026

Lost vs. Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models

Paper 1 has higher estimated impact due to stronger novelty and broader relevance: it identifies and systematically quantifies an underappreciated failure mode (memory-amplified sycophancy) across multiple memory systems and model families, introduces a benchmark (MIST) likely reusable beyond this work, and provides lightweight mitigations applicable to many domains. Its implications extend to any long-term memory LLM deployment (safety, alignment, reliability). Paper 2 is timely and useful but more domain-specific (medical agents) and its gains depend on task setups and feedback signals, potentially limiting breadth and generality.

gpt-5.2 · Jun 10, 2026

Won vs. Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

Paper 2 addresses a broader and more impactful problem—enabling medical AI agents to accumulate and reuse structured clinical reasoning experience without weight updates. Its self-evolving skill memory framework (SkeMex) has wide applicability across clinical decision-making tasks and generalizes across model backbones. Paper 1, while technically rigorous in identifying and fixing embedding geometry issues for biomedical language models, addresses a narrower infrastructure-level problem (embedding calibration for cross-domain discrimination) with more limited downstream applications centered on a specific architecture (Large Behavioural Models). Paper 2's contributions to agentic AI in medicine have broader cross-field relevance.

claude-opus-4-6 · Jun 9, 2026

Won vs. IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

Paper 2 likely has higher impact due to a more novel, generalizable method (self-evolving, utility-governed skill memory) with clear real-world relevance to interactive clinical decision support. It proposes an end-to-end post-deployment framework (read–write–assess–govern) addressing robustness, governance, and continual improvement without weight updates—broadly applicable to agentic systems beyond medicine. Paper 1 is timely and useful but primarily contributes a benchmark and test-time strategies; its impact is narrower and more incremental relative to fast-evolving multimodal evaluation suites.

gpt-5.2 · Jun 9, 2026

Won vs. LLM-Orchestrated Conformance Checking in Stroke Care Without Computer-Interpretable Guidelines

Paper 1 introduces a novel, generalizable framework (SkeMex) for medical agent reasoning with a self-evolving skill memory system that addresses fundamental limitations in how AI agents accumulate and reuse clinical experience. Its contributions—structured skill distillation, value-aware retrieval, and a closed-loop governance lifecycle—are broadly applicable across clinical tasks and model backbones. Paper 2 presents a useful but more narrowly scoped LLM orchestration framework for conformance checking in stroke care, validated at a single hospital. Paper 1 has greater novelty, broader applicability, and stronger methodological contributions with higher potential to influence multiple research directions.

claude-opus-4-6 · Jun 9, 2026

Lost vs. Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

Paper 2 introduces PRIME, a novel framework for understanding and predicting reward hacking before it manifests—a critical AI safety concern. It provides mechanistic insights into how models learn to exploit proxy rewards, offering early-warning signals for alignment risks. This has broad implications across AI safety, interpretability, and alignment research, which are among the most pressing challenges in AI. While Paper 1 presents a solid engineering contribution to medical agents with skill-based memory, Paper 2 addresses a more fundamental and timely problem with wider cross-field relevance and stronger novelty in its mechanistic analysis of pre-hacking dynamics.

claude-opus-4-6 · Jun 9, 2026

Won vs. Frequency-based Constrained Sampling for Interval Patterns

Paper 1 addresses the highly impactful and timely intersection of LLM-based agents and clinical decision-making, proposing a novel self-evolving skill memory framework (SkeMex) that enables generalizable reasoning without weight updates. Its breadth of impact spans AI, healthcare, and agent systems, with clear real-world clinical applications. Paper 2, while technically sound, addresses a narrower problem in pattern mining (constrained sampling of interval patterns) with more limited cross-disciplinary impact and a smaller potential audience. The medical AI domain's rapid growth further amplifies Paper 1's likely citation impact.

claude-opus-4-6 · Jun 9, 2026

Won vs. Vision Language Model Helps Private Information De-Identification in Vision Data

Paper 2 (SkeMex) introduces a more broadly impactful framework addressing a fundamental challenge in AI agent systems—accumulating and reusing structured experience for clinical decision-making without weight updates. Its self-evolving skill memory with a closed-loop lifecycle is novel and generalizable beyond medicine. Paper 1 (VisShield) addresses the important but narrower problem of PHI de-identification in images. While practical, it is more of an engineering contribution combining existing capabilities (VLMs, OCR, instruction tuning). Paper 2's methodological innovation in memory-based reasoning has broader implications for agentic AI systems.

claude-opus-4-6 · Jun 9, 2026
