Haoran Sun, Wenjie Li, Yujie Zhang, Zekai Lin, Fanrui Zhang, Kaitao Chen, Xingqi He, Yichen Li
Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop "Read–Write–Assess–Govern" lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.
SkeMex proposes a non-parametric, post-deployment self-evolution framework for medical agents that distills interaction trajectories into structured, reusable "skills" organized in a multi-branch repository (general, task-specific, action-level). The key innovation is the closed-loop Read–Write–Assess–Govern lifecycle that continuously creates, evaluates, promotes, merges, and deprecates skills based on empirically estimated utility from clinical feedback signals. This addresses a genuine gap: existing memory-augmented agents either store raw trajectories (noisy, redundant) or couple memory improvement with parameter updates (costly, prone to catastrophic forgetting). SkeMex decouples memory evolution from model training entirely.
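The lifecycle described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class names (Skill, SkillRepo), the branch labels, the thresholds, and the simple moving-average utility update are all assumptions chosen for illustration, and retrieval here ranks by utility alone, whereas the paper describes value-aware retrieval that also depends on context.

```python
from dataclasses import dataclass, field

# Hypothetical branch labels mirroring the general / task-specific / action-level split.
BRANCHES = ("general", "task", "action")


@dataclass
class Skill:
    name: str
    branch: str
    text: str
    utility: float = 0.0
    uses: int = 0
    status: str = "candidate"  # assumed states: candidate -> active, or removed


@dataclass
class SkillRepo:
    skills: dict = field(default_factory=dict)

    def write(self, skill: Skill) -> None:
        """Write: store a newly distilled skill in its branch."""
        if skill.branch not in BRANCHES:
            raise ValueError(f"unknown branch: {skill.branch}")
        self.skills[skill.name] = skill

    def read(self, branch: str, k: int = 3) -> list:
        """Read: return the k highest-utility skills in a branch (simplified)."""
        pool = [s for s in self.skills.values() if s.branch == branch]
        return sorted(pool, key=lambda s: s.utility, reverse=True)[:k]

    def assess(self, name: str, feedback: float, lr: float = 0.5) -> None:
        """Assess: move the skill's utility toward the observed feedback."""
        skill = self.skills[name]
        skill.uses += 1
        skill.utility += lr * (feedback - skill.utility)

    def govern(self, promote_at: float = 0.5, remove_below: float = -0.5,
               min_uses: int = 2) -> list:
        """Govern: promote useful skills, remove harmful ones; return removed names."""
        removed = []
        for name, skill in list(self.skills.items()):
            if skill.uses < min_uses:
                continue  # not enough evidence yet
            if skill.utility >= promote_at:
                skill.status = "active"
            elif skill.utility <= remove_below:
                removed.append(name)
                del self.skills[name]
        return removed
```

In this sketch a skill must be assessed at least min_uses times before governance acts on it, which is one simple way to avoid promoting or removing entries on a single noisy outcome. Merging of near-duplicate skills, which the analysis also mentions, is omitted.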
The formulation as a Memory-based MDP (M-MDP) provides a principled framework where improvement comes from better memory content and retrieval rather than gradient updates. The utility estimation mechanism—using category-normalized advantage, adoption-aware credit assignment, and cosine warmup learning rates—is a thoughtful design that goes beyond simple recency or frequency heuristics.
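The three ingredients named above can be combined in a short sketch. The summary does not give the paper's formulas, so everything below is an assumption made for illustration: the z-score form of the category-normalized advantage, the fixed down-weighting of retrieved-but-unadopted skills, the half-cosine warmup schedule, and all function names and default values.

```python
import math
from dataclasses import dataclass


@dataclass
class SkillStats:
    utility: float = 0.0
    updates: int = 0


def cosine_warmup_lr(step: int, warmup_steps: int = 10, lr_max: float = 0.3) -> float:
    """Ramp the learning rate from 0 to lr_max along a half cosine, then hold it."""
    progress = min(step, warmup_steps) / warmup_steps
    return lr_max * 0.5 * (1.0 - math.cos(math.pi * progress))


def normalized_advantage(reward: float, category_rewards: list, eps: float = 1e-6) -> float:
    """Reward relative to the mean of its task category, scaled by the category std.

    Assumes category_rewards is non-empty.
    """
    mean = sum(category_rewards) / len(category_rewards)
    var = sum((r - mean) ** 2 for r in category_rewards) / len(category_rewards)
    return (reward - mean) / (math.sqrt(var) + eps)


def update_utility(stats: SkillStats, reward: float, category_rewards: list,
                   adopted: bool, unadopted_weight: float = 0.2) -> float:
    """One utility update for a skill that was retrieved in an episode.

    Skills the agent actually adopted receive full credit; skills that were
    retrieved but not adopted receive a reduced share (assumed weight).
    """
    advantage = normalized_advantage(reward, category_rewards)
    credit = 1.0 if adopted else unadopted_weight
    stats.updates += 1
    lr = cosine_warmup_lr(stats.updates)
    stats.utility += lr * credit * (advantage - stats.utility)
    return stats.utility
```

Under these assumptions, early updates move a skill's utility only slightly, so a new skill is not promoted or removed on the strength of its first few outcomes, and normalizing within a category keeps easy task types from inflating the utility of the skills used on them.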
Direct applications: The framework is directly applicable to clinical decision support systems that need to improve with use without retraining. The plug-and-play nature (no weight updates) makes it attractive for regulated medical environments where model retraining requires re-certification.
Broader influence: The skill-based memory paradigm with utility-driven governance could generalize beyond medicine to any domain requiring reliable, evolving agent memory (legal reasoning, scientific discovery, financial analysis). The multi-branch repository design and the Read–Write–Assess–Govern lifecycle represent a reusable architectural pattern.
Cross-model transferability is particularly impactful—a skill repo built by one model improving others suggests a form of "institutional knowledge" that persists independent of the underlying LLM, analogous to clinical protocols that outlast individual practitioners.
This work addresses a critical bottleneck at the intersection of two fast-moving fields: medical AI agents and self-improving LLM systems. As medical agents transition from static QA to interactive clinical decision-making, the inability to accumulate and govern reusable experience is a genuine limitation. The paper's timing is excellent—it builds on the recent wave of skill-based memory work (Trace2Skill, SkillClaw, EvoSkills, all 2026) while being the first to comprehensively apply these ideas to medical domains with proper clinical evaluation infrastructure.
The non-parametric approach is especially timely given growing concerns about catastrophic forgetting in fine-tuned medical models and the computational costs of continual training.
SkeMex is a well-engineered and thoroughly evaluated framework that makes a meaningful contribution to medical agent systems. The core insight—that agent experience should be distilled into governed, utility-tracked skills rather than stored as raw trajectories—is sound and practically important. The evaluation is among the most comprehensive in this space. However, the gap between benchmark evaluation and real clinical deployment, the system's complexity, and the limited safety analysis temper the impact somewhat.
Generated Jun 9, 2026
Paper 1 (SkeMex) likely has higher impact due to its domain-critical focus (interactive clinical decision support), a clear governance-centric innovation (utility-aware skill memory with lifecycle management), and strong real-world applicability where safety, auditability, and continual post-deployment improvement matter. Its structured, multi-branch skill repository and memory retention/removal mechanism address a key limitation of generic trace memories, with cross-backbone generalization and planned public release supporting adoption. Paper 2 is technically timely for agent RL, but its scope is narrower and application domains less high-stakes.
Paper 1 (SkeMex) introduces a novel, self-evolving memory framework for medical agents, directly advancing interactive clinical decision-making without requiring model retraining. Its focus on distilling experience into structured, reusable skills addresses core challenges in long-horizon AI reasoning. While Paper 2 provides a valuable benchmark, Paper 1 offers a methodological breakthrough with immediate, high-stakes applications in healthcare, likely yielding broader cross-disciplinary impact in both agentic AI and medical informatics.
Paper 2 (SkeMex) likely has higher scientific impact due to strong real-world applicability in clinical decision support, a timely and high-stakes domain. Its self-evolving, governance-aware skill memory addresses key gaps in agent reliability, continual learning, and safety without weight updates, and could generalize beyond medicine to other interactive agent settings. If validated rigorously, the framework may influence agent design broadly (memory, utility estimation, lifecycle governance). Paper 1 is novel and useful for LLM efficiency, but its impact is narrower (KV-cache allocation during decoding) and primarily systems-level.
Paper 1 has higher estimated impact due to stronger novelty and broader relevance: it identifies and systematically quantifies an underappreciated failure mode (memory-amplified sycophancy) across multiple memory systems and model families, introduces a benchmark (MIST) likely reusable beyond this work, and provides lightweight mitigations applicable to many domains. Its implications extend to any long-term memory LLM deployment (safety, alignment, reliability). Paper 2 (SkeMex) is timely and useful but more domain-specific (medical agents), and its gains depend on task setups and feedback signals, potentially limiting breadth and generality.
Paper 2 addresses a broader and more impactful problem—enabling medical AI agents to accumulate and reuse structured clinical reasoning experience without weight updates. Its self-evolving skill memory framework (SkeMex) has wide applicability across clinical decision-making tasks and generalizes across model backbones. Paper 1, while technically rigorous in identifying and fixing embedding geometry issues for biomedical language models, addresses a narrower infrastructure-level problem (embedding calibration for cross-domain discrimination) with more limited downstream applications centered on a specific architecture (Large Behavioural Models). Paper 2's contributions to agentic AI in medicine have broader cross-field relevance.
Paper 2 (SkeMex) likely has higher impact due to a more novel, generalizable method (self-evolving, utility-governed skill memory) with clear real-world relevance to interactive clinical decision support. It proposes an end-to-end post-deployment framework (read–write–assess–govern) that addresses robustness, governance, and continual improvement without weight updates and is broadly applicable to agentic systems beyond medicine. Paper 1 is timely and useful but primarily contributes a benchmark and test-time strategies; its impact is narrower and more incremental relative to fast-evolving multimodal evaluation suites.
Paper 1 introduces a novel, generalizable framework (SkeMex) for medical agent reasoning with a self-evolving skill memory system that addresses fundamental limitations in how AI agents accumulate and reuse clinical experience. Its contributions—structured skill distillation, value-aware retrieval, and a closed-loop governance lifecycle—are broadly applicable across clinical tasks and model backbones. Paper 2 presents a useful but more narrowly scoped LLM orchestration framework for conformance checking in stroke care, validated at a single hospital. Paper 1 has greater novelty, broader applicability, and stronger methodological contributions with higher potential to influence multiple research directions.
Paper 2 introduces PRIME, a novel framework for understanding and predicting reward hacking before it manifests, a critical AI safety concern. It provides mechanistic insights into how models learn to exploit proxy rewards, offering early-warning signals for alignment risks. This has broad implications across AI safety, interpretability, and alignment research, which are among the most pressing challenges in AI. While Paper 1 (SkeMex) presents a solid engineering contribution to medical agents with skill-based memory, Paper 2 addresses a more fundamental and timely problem with wider cross-field relevance and stronger novelty in its mechanistic analysis of pre-hacking dynamics.
Paper 1 addresses the highly impactful and timely intersection of LLM-based agents and clinical decision-making, proposing a novel self-evolving skill memory framework (SkeMex) that enables generalizable reasoning without weight updates. Its breadth of impact spans AI, healthcare, and agent systems, with clear real-world clinical applications. Paper 2, while technically sound, addresses a narrower problem in pattern mining (constrained sampling of interval patterns) with more limited cross-disciplinary impact and a smaller potential audience. The medical AI domain's rapid growth further amplifies Paper 1's likely citation impact.
Paper 2 (SkeMex) introduces a more broadly impactful framework addressing a fundamental challenge in AI agent systems—accumulating and reusing structured experience for clinical decision-making without weight updates. Its self-evolving skill memory with a closed-loop lifecycle is novel and generalizable beyond medicine. Paper 1 (VisShield) addresses the important but narrower problem of PHI de-identification in images. While practical, it is more of an engineering contribution combining existing capabilities (VLMs, OCR, instruction tuning). Paper 2's methodological innovation in memory-based reasoning has broader implications for agentic AI systems.