You Live More Than Once: Towards Hierarchical Skill Meta-Evolving
Xujun Li, Kehan Zheng, Mingyuan Zhao, Yize Geng, Jinfeng Zhou, Qi Zhu, Fei Mi, Lifeng Shang
Abstract
Test-time skill evolving is regarded as a new paradigm for enhancing deployed agentic systems. Existing works mainly focus on hard-coded skill evolving strategies or on parametric learning that relies on expensive parameter updates in the underlying LLMs. In this paper, we demonstrate that test-time refinement of the skill evolving framework itself is necessary for continuous improvement of agent systems in different downstream scenarios, and that lightweight algorithmic adaptation is feasible. Specifically, we propose HiSME, a lightweight hierarchical skill meta-evolving solution that jointly optimizes skills and the skill evolving strategy by learning meta-skills from agents' task execution traces. Experiments on diverse agentic benchmarks show that meta-evolving can produce a higher-quality skill library than pure skill evolving and can derive diverse meta-skills for different scenarios, thereby facilitating future continual experience learning. Our code is temporarily public at https://anonymous.4open.science/r/HiSME-BD45.
AI Impact Assessments
Scientific Impact Assessment: "You Live More Than Once: Towards Hierarchical Skill Meta-Evolving"
1. Core Contribution
HiSME introduces a hierarchical meta-learning framework for LLM-based agent skill evolution. The key insight is that the skill evolving process itself—how skills are extracted, refined, and maintained—can be optimized at test time through "meta-skills." The paper frames this as a multi-level residual optimization problem: just as skills approximate parametric updates to a frozen executor (first-order), meta-skills optimize the skill evolving algorithm itself (second-order). All updates occur in text space without modifying LLM parameters, maintaining lightweight deployment.
The system comprises several algorithmic roles—extractor, refactorer, refiner, and filter—each of which receives role-specific meta-skills derived from observed skill outcomes. These meta-skills are concise rules-of-thumb (capped at 5 per role) that guide future skill generation and maintenance. The framework builds on a credit assignment mechanism, overlap-graph-based refactoring, and bundle-gated release for skill quality control.
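The role-specific meta-skill store described above can be illustrated with a minimal sketch. Only the four role names and the cap of 5 rules per role come from the assessment; the class names, the numeric `score` field, and the policy of keeping the highest-scoring rules are assumptions made for illustration and are not claimed to match the HiSME implementation.

```python
from dataclasses import dataclass, field

# Role names and the per-role cap are taken from the assessment text.
ROLES = ("extractor", "refactorer", "refiner", "filter")
MAX_META_SKILLS_PER_ROLE = 5


@dataclass
class MetaSkill:
    rule: str            # concise rule-of-thumb kept as plain text
    score: float = 0.0   # hypothetical usefulness estimate (assumption)


@dataclass
class MetaSkillStore:
    # One list of meta-skills per algorithmic role.
    by_role: dict = field(default_factory=lambda: {r: [] for r in ROLES})

    def add(self, role: str, rule: str, score: float) -> None:
        if role not in self.by_role:
            raise ValueError(f"unknown role: {role}")
        rules = self.by_role[role]
        rules.append(MetaSkill(rule, score))
        # Assumed eviction policy: keep only the highest-scoring rules.
        rules.sort(key=lambda m: m.score, reverse=True)
        del rules[MAX_META_SKILLS_PER_ROLE:]

    def prompt_for(self, role: str, base_prompt: str) -> str:
        # Updates stay in text space: rules are appended to the role's prompt.
        rules = self.by_role[role]
        if not rules:
            return base_prompt
        bullets = "\n".join(f"- {m.rule}" for m in rules)
        return f"{base_prompt}\n\nMeta-skills for the {role}:\n{bullets}"


store = MetaSkillStore()
for i in range(7):
    store.add("extractor", f"rule {i}", score=float(i))
print(len(store.by_role["extractor"]))
print(store.prompt_for("extractor", "Extract reusable skills from the trace."))
```

Because the rules are appended to each role's prompt as text, no LLM parameters change, which is consistent with the lightweight, text-space updates the assessment describes. How the real system scores or replaces meta-skills is not stated in the source.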
2. Methodological Rigor
Strengths in experimental design:
Weaknesses:
3. Potential Impact
The idea that the skill evolving framework itself should be an optimization target is conceptually appealing and could influence how future agent systems handle continuous learning. If validated at scale, this could:
However, practical impact is uncertain. The overhead of maintaining overlap graphs, bundle tests, credit tables, and meta-skill updates adds significant system complexity. The paper does not provide wall-clock time comparisons or a clear analysis of when the meta-evolving cost is amortized.
4. Timeliness & Relevance
The paper is highly timely. It cites a wave of concurrent 2026 works on skill evolving (SkillX, Trace2Skill, SkillForge, etc.), positioning itself at the frontier of an active research area. The question of how to make deployed LLM agents self-improving without parameter updates is a genuine bottleneck in production settings. The hierarchical optimization framing provides a useful conceptual lens.
The connection to meta-learning is natural but could be developed more deeply: beyond the naming convention, the paper does not engage substantially with classical meta-learning theory (MAML, learning-to-learn).
5. Strengths & Limitations
Key Strengths:
Key Limitations:
Additional Observations:
Summary
HiSME presents a well-motivated and cleanly formulated approach to self-improving skill management for LLM agents. The hierarchical optimization framing is elegant, and the experimental evidence, while small-scale, consistently supports the main claims. The primary concerns are evaluation scale, single-model evaluation, and system complexity. The paper makes a meaningful conceptual contribution to the emerging skill evolving paradigm, though its practical impact remains to be validated at production scale.
Generated May 28, 2026
Comparison History (14)
Paper 1 (Prompt Codebooks) offers a more clearly novel and broadly applicable formulation: reframing prompt optimization as discrete, compositional, per-instance routing over reusable “instinct” units, enabling transfer and modularity that instance-blind methods cannot express. It reports concrete, multi-benchmark gains on widely used open LLMs and adds efficiency benefits (prompt length reduction), strengthening real-world deployability. Paper 2 is timely for agent systems, but the abstract provides fewer methodological specifics and less concrete comparative evidence, making impact harder to assess and likely narrower to agent skill libraries.
Paper 1 introduces a novel paradigm (meta-evolving) for agentic AI systems with practical implications for continual learning and skill adaptation at test time. Its hierarchical approach to jointly optimizing skills and evolving strategies addresses a timely need in deployed LLM-based agents. Paper 2 provides valuable mechanistic interpretability insights into LLM reasoning circuits, but its contributions are analytical and descriptive rather than enabling new capabilities. Paper 1's broader applicability across agentic benchmarks and its practical framework for improving agent systems give it higher potential impact in the rapidly growing field of AI agents.
Paper 1 proposes a foundational representation (WorldString) for modeling actionable objects in the physical world, which has profound implications for embodied AI, robotics, and simulation. Its fully differentiable architecture enables seamless integration with policy learning. While Paper 2 offers valuable algorithmic improvements for LLM agents, Paper 1 addresses a more fundamental bottleneck in bridging AI with the physical world, promising broader and more transformative impacts across multiple disciplines.
Paper 2 (ZipRL) likely has higher impact due to a clearer, broadly applicable problem (multi-turn context compression for long-horizon agents), stronger methodological package (RLVR-tailored framework, explicit training-signal densification via HRR, and theoretical guarantees), and demonstrated robustness under extreme extrapolation plus sizable empirical gains across model scales. Its applications span any agentic LLM system constrained by context length, making it timely and widely relevant. Paper 1 is novel in meta-evolving skills, but impact may be narrower and evidence less concrete from the abstract.
Paper 2 addresses a fundamental and broadly applicable problem in RLHF/RLVR—how to weight rubric criteria during training based on their actual optimization utility rather than static human-assigned importance. This insight is novel, well-validated across multiple settings (multimodal and text-only, three base policies, two datasets), and has immediate practical impact for anyone training LLMs with rubric-based rewards. The 2.5-4x training efficiency gain is significant. Paper 1, while interesting in its meta-evolving framework for agentic systems, addresses a narrower niche and builds incrementally on existing skill-evolving paradigms with less clearly demonstrated generalizability.
Paper 1 introduces a novel hierarchical 'meta-evolving' paradigm that adapts the skill evolving strategy itself at test time. This approach addresses the critical challenge of continual learning and adaptability in LLM agents across varying downstream scenarios without expensive parametric updates. While Paper 2 offers a rigorous methodological improvement for RL-based skill internalization, Paper 1's lightweight, meta-learning approach has broader implications and higher potential impact for the continuous improvement of general agentic systems.
Paper 2 (HiSME) likely has higher scientific impact due to broader applicability and timeliness: hierarchical meta-evolving of skills and the evolving strategy targets a central bottleneck in deployed agentic systems (continual, test-time improvement) and can generalize across many tasks, domains, and LLM backends without costly parameter updates. This paradigm could influence agent design, lifelong learning, and meta-learning communities. Paper 1 is methodologically solid and useful for efficient VLM deployment, but its contribution is more specialized (structured pruning for CoT in VLMs) with narrower cross-field reach.
Paper 2 introduces a novel paradigm (hierarchical skill meta-evolving) for LLM-based agentic systems with broader applicability across diverse AI agent scenarios. Its contribution—lightweight test-time adaptation of skill evolving strategies themselves—addresses a fundamental challenge in continual agent learning with wide cross-domain impact. Paper 1, while methodologically sound, addresses a narrower domain (pedestrian-AV interaction modeling) combining existing techniques (Mamba + DDPG) for a specific transportation safety application, limiting its breadth of impact compared to Paper 2's more general AI agent framework.
Paper 1 introduces a fundamental paradigm shift for agent architectures by moving from 'Memory-as-Tool' to 'Memory-as-Cognition', addressing critical limitations in how LLMs handle memory and reasoning. Its broad applicability to conversational agents, combined with a novel structural approach and a new benchmark for proactive memory, promises higher foundational impact across AI cognitive architectures compared to Paper 2's narrower focus on test-time skill optimization.
Paper 1 (HiSME) introduces a more novel and broadly impactful paradigm—meta-evolving skill frameworks at test time—addressing a fundamental challenge in continual agent learning with a hierarchical approach. Its concept of optimizing the skill evolution strategy itself (meta-skills) is more innovative and generalizable across diverse agentic systems. Paper 2 (StepOPSD) makes a solid but more incremental contribution to credit assignment in RL for agents, combining existing ideas (preference distillation, step-level decomposition, GRPO) in a useful but narrower scope. Paper 1's broader applicability and paradigm-level contribution suggest higher long-term impact.
Paper 1 addresses a fundamental bottleneck in general LLM agentic systems by introducing a lightweight, meta-evolving framework for continuous skill improvement. Its methodological innovation (hierarchical test-time evolving without parametric updates) has broad implications across the rapidly growing field of autonomous AI agents. In contrast, Paper 2 tackles the important but more niche problem of reproducibility in applied industrial Prognostics and Health Management. Consequently, Paper 1 has a higher potential for broad, cross-disciplinary impact and widespread adoption in AI research.
Paper 2 likely has higher impact due to stronger novelty and broader, timelier relevance: meta-evolving the skill-evolution strategy at test time addresses a central bottleneck in LLM agents (continual adaptation without costly parameter updates) and generalizes across many downstream agentic settings. Its benchmark-driven framing suggests wider applicability across RL, agent systems, and LLM tooling. Paper 1 is highly practical and important for safety in data-sensitive domains, but the hybrid neuro-symbolic verification concept is more incremental and its evaluation is narrower (one medical reporting system), potentially limiting breadth despite strong real-world value.
Paper 1 has higher potential impact due to its timely focus on deployed LLM-agent improvement via test-time adaptation without costly model updates, a broadly applicable problem across agentic systems. Its hierarchical meta-evolving approach (learning meta-skills and adapting the skill-evolution strategy from traces) is relatively novel and could generalize to many benchmarks and real-world settings where continual improvement is needed. Paper 2 offers a rigorous, insightful analysis of PPO failure modes in cumulative-damage long-horizon tasks, but its immediate applicability and cross-domain breadth are narrower and more domain-structured.
Paper 1 (BRANE) targets a broadly deployed and costly bottleneck—per-query optimization of retrieval-agent configurations—showing large, quantifiable cost savings at matched accuracy across multiple established benchmarks and clear comparisons to strong baselines. Its methodology (predictive routing with explicit cost-quality tradeoff) is concrete, readily implementable, and immediately applicable to production RAG systems, giving high near-term and cross-domain impact. Paper 2’s hierarchical meta-evolving is conceptually interesting but more speculative, with less clearly grounded rigor and adoption path, making its near-term scientific and practical impact harder to gauge.