You Live More Than Once: Towards Hierarchical Skill Meta-Evolving

Xujun Li, Kehan Zheng, Mingyuan Zhao, Yize Geng, Jinfeng Zhou, Qi Zhu, Fei Mi, Lifeng Shang

#1307 of 2682 · Artificial Intelligence
Tournament Score: 1413±48
Win Rate: 57% (8 wins, 6 losses, 14 matches)
Rating: 5.8/10 (Significance, Rigor, Novelty, Clarity)

Abstract

Test-time skill evolving is regarded as a new paradigm for enhancing deployed agentic systems. Existing works mainly focus on hard-coded skill evolving strategies or parametric learning that rely on expensive parameter updates in the underlying LLMs. In this paper, we demonstrate that test-time refinement of the skill evolving framework itself is necessary for continuous improvement of the agent systems in different downstream scenarios, and lightweight algorithmic adaptation is feasible. Specifically, we propose HiSME, a lightweight hierarchical skill meta-evolving solution that jointly optimizes skills and the skill evolving strategy by learning meta-skills from agents' task execution traces. Experiments on diverse agentic benchmarks show that meta-evolving can produce a higher-quality skill library than pure skill evolving and can derive diverse meta-skills for different scenarios, thereby facilitating future continual experience learning. Our code is temporarily public at https://anonymous.4open.science/r/HiSME-BD45.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "You Live More Than Once: Towards Hierarchical Skill Meta-Evolving"

1. Core Contribution

HiSME introduces a hierarchical meta-learning framework for LLM-based agent skill evolution. The key insight is that the skill evolving process itself—how skills are extracted, refined, and maintained—can be optimized at test time through "meta-skills." The paper frames this as a multi-level residual optimization problem: just as skills approximate parametric updates to a frozen executor (first-order), meta-skills optimize the skill evolving algorithm itself (second-order). All updates occur in text space without modifying LLM parameters, maintaining lightweight deployment.
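The two-level reading above can be written compactly. The notation below is an illustrative assumption, not the paper's own formulation: \pi_\theta is the frozen executor, S_t the skill library, M_t the meta-skills, \tau_t the execution trace of task x_t, o_{t+1} the observed outcomes of the updated skills, and A and U are placeholder names for the skill-evolving and meta-evolving update rules.

```latex
\begin{aligned}
a_t     &= \pi_\theta(x_t \mid S_t)      && \text{frozen executor, } \theta \text{ never updated} \\
S_{t+1} &= A(S_t, \tau_t ; M_t)          && \text{first level: skill evolving, steered by meta-skills} \\
M_{t+1} &= U(M_t, o_{t+1})               && \text{second level: meta-evolving from skill outcomes}
\end{aligned}
```

Under this reading, a static variant corresponds to holding M_t fixed, so that only the first-level update runs.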

The system comprises several algorithmic roles—extractor, refactorer, refiner, and filter—each of which receives role-specific meta-skills derived from observed skill outcomes. These meta-skills are concise rules-of-thumb (capped at 5 per role) that guide future skill generation and maintenance. The framework builds on a credit assignment mechanism, overlap-graph-based refactoring, and bundle-gated release for skill quality control.
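As a rough illustration of the structure described above, the following sketch keeps a short list of rules per role and enforces the cap of 5. The class and field names, the numeric score, and the "drop the lowest-scoring rule" eviction policy are all assumptions made for this example; the assessment does not say how HiSME enforces the cap.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Role names come from the assessment; everything else here is hypothetical.
ROLES = ("extractor", "refactorer", "refiner", "filter")
MAX_META_SKILLS_PER_ROLE = 5  # cap stated in the assessment


@dataclass
class MetaSkill:
    rule: str     # a concise rule-of-thumb in plain text
    score: float  # assumed usefulness score, used only for eviction


@dataclass
class MetaSkillStore:
    by_role: Dict[str, List[MetaSkill]] = field(
        default_factory=lambda: {role: [] for role in ROLES}
    )

    def add(self, role: str, rule: str, score: float) -> None:
        """Add a rule for a role, keeping only the top-scoring rules."""
        if role not in self.by_role:
            raise KeyError(f"unknown role: {role}")
        rules = self.by_role[role]
        rules.append(MetaSkill(rule, score))
        # Assumed policy: sort by score and truncate to the cap.
        rules.sort(key=lambda m: m.score, reverse=True)
        del rules[MAX_META_SKILLS_PER_ROLE:]

    def prompt_block(self, role: str) -> str:
        """Render a role's rules as a bullet list for inclusion in a prompt."""
        return "\n".join(f"- {m.rule}" for m in self.by_role[role])
```

For instance, adding seven rules to the extractor leaves the five with the highest scores, and `prompt_block("extractor")` returns them as text that a role-specific prompt could include.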

2. Methodological Rigor

Strengths in experimental design:

  • The paper includes a proper static ablation (HiSME-static) that isolates the contribution of meta-evolving from the base skill lifecycle framework, which is critical for the main claim.
  • Component ablations (Table 3) systematically remove each module, revealing that refactoring is the most important component and that all pieces contribute.
  • Process evaluation (Figure 2) tracks intermediate performance during evolving, showing meta-evolving consistently outperforms static evolving throughout training.
  • The meta-test experiment (Table 4) freezes learned meta-skills and applies them to a new static run, providing evidence for transferability.

Weaknesses:

  • The evaluation scale is quite small: 50/50 train/test on BFCL-v3 and 30/30 on MineDojo. These are modest sample sizes that raise questions about statistical reliability—no confidence intervals or variance measures are reported.
  • Only a single LLM backbone (Claude Sonnet 4) is used for all experiments. Generalizability across model families is untested.
  • The meta-skill format is restricted to rules-of-thumb (acknowledged as a limitation), and there is no systematic comparison with alternative meta-skill representations.
  • Credit assignment, the foundation of the meta-evolving signal, relies entirely on LLM judgment, which introduces a potentially circular dependency: the same LLM class generates skills, evaluates them, and learns meta-rules from those evaluations. The paper acknowledges this but does not quantify credit assignment accuracy.
  • The comparison with SkillX requires an executor-alignment caveat (Appendix D), where SkillX performed worse under HiSME's executor. While the authors conservatively report the stronger SkillX result, this raises questions about fair comparison.
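To make the sample-size concern concrete, the sketch below computes a 95% Wilson score interval for a success rate measured on 50 or 30 test tasks. The observed rate of 50% is a hypothetical value chosen for illustration, not a result from the paper, and `wilson_interval` is a helper written for this example.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Test-set sizes from the assessment; the 50% success rate is hypothetical.
for n in (50, 30):
    low, high = wilson_interval(n // 2, n)
    print(f"n={n}: [{low:.3f}, {high:.3f}]")
```

At an observed rate of 50%, the interval spans roughly ±13 percentage points for n=50 and ±17 for n=30, so differences between methods smaller than that are hard to separate from sampling noise without repeated runs.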

3. Potential Impact

The idea that the skill evolving framework itself should be an optimization target is conceptually appealing and could influence how future agent systems handle continuous learning. If validated at scale, this could:

  • Enable domain adaptation of agent skill management without expensive retraining
  • Provide a principled framework for self-improving agent architectures
  • Extend the text-based optimization paradigm (OPRO, TextGrad, Reflexion) to algorithm-level self-improvement

However, practical impact is uncertain. The overhead of maintaining overlap graphs, bundle tests, credit tables, and meta-skill updates adds significant system complexity. The paper does not provide wall-clock time comparisons or a clear analysis of when the meta-evolving cost is amortized.

4. Timeliness & Relevance

The paper is highly timely. It cites a wave of concurrent 2026 works on skill evolving (SkillX, Trace2Skill, SkillForge, etc.), positioning itself at the frontier of an active research area. The question of how to make deployed LLM agents self-improving without parameter updates is a genuine bottleneck in production settings. The hierarchical optimization framing provides a useful conceptual lens.

The connection to meta-learning is natural but could be developed more deeply: the paper does not engage substantially with classical meta-learning theory (MAML, learning-to-learn) beyond the naming convention.

5. Strengths & Limitations

Key Strengths:

  • Clean conceptual framing as hierarchical residual optimization with clear mathematical formulation
  • The meta-test experiment (Table 4) is a compelling demonstration that learned meta-skills transfer
  • Thorough ablation study identifying the relative importance of components
  • Well-documented case studies (Appendix E) showing meta-skills are grounded in concrete evidence rather than generic advice
  • Complete algorithmic pseudocode (Algorithms 1-5) enhancing reproducibility

Key Limitations:

  • Small-scale evaluation with no statistical significance analysis
  • Single model backbone
  • Complexity of the full system (overlap graphs, bundle testing, credit assignment, filtering, meta-evolving) makes it difficult to determine which design decisions are essential versus incidental
  • The "meta-skills" are limited to 5 short rules per role—while this is presented as a feature (lightweight), it also limits expressiveness
  • No comparison with parametric approaches (RL-based methods mentioned in related work) on the same benchmarks
  • The MineDojo results, while showing improvements, lack the detailed analysis given to BFCL-v3

Additional Observations:

  • The paper's title and framing are somewhat dramatic relative to the contribution—the core mechanism is essentially iterative prompt refinement for skill generation roles, guided by outcome feedback. The "meta-evolving" terminology, while mathematically justified, may overstate the sophistication of what amounts to learning prompting guidelines.
  • The overlap-graph refactoring mechanism (Appendix C) appears to be a significant engineering contribution that is somewhat buried in the appendix. Its importance (largest ablation impact) warrants more prominence.
  • Reproducibility is partially addressed through anonymous code release, but dependency on Claude Sonnet 4 (a commercial API) limits full reproducibility.

Summary

HiSME presents a well-motivated and cleanly formulated approach to self-improving skill management for LLM agents. The hierarchical optimization framing is elegant, and the experimental evidence, while small-scale, consistently supports the main claims. The primary concerns are evaluation scale, single-model evaluation, and system complexity. The paper makes a meaningful conceptual contribution to the emerging skill evolving paradigm, though its practical impact remains to be validated at production scale.

Rating: 5.8/10
Significance 6.5 · Rigor 5 · Novelty 6.5 · Clarity 6.5

Generated May 28, 2026

Comparison History (14)

    vs. Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement
    gpt-5.2 · 5/28/2026

    Paper 1 (Prompt Codebooks) offers a more clearly novel and broadly applicable formulation: reframing prompt optimization as discrete, compositional, per-instance routing over reusable “instinct” units, enabling transfer and modularity that instance-blind methods cannot express. It reports concrete, multi-benchmark gains on widely used open LLMs and adds efficiency benefits (prompt length reduction), strengthening real-world deployability. Paper 2 is timely for agent systems, but the abstract provides fewer methodological specifics and less concrete comparative evidence, making impact harder to assess and likely narrower to agent skill libraries.

    vs. Revealing Algorithmic Deductive Circuits for Logical Reasoning
    claude-opus-4.6 · 5/28/2026

    Paper 1 introduces a novel paradigm (meta-evolving) for agentic AI systems with practical implications for continual learning and skill adaptation at test time. Its hierarchical approach to jointly optimizing skills and evolving strategies addresses a timely need in deployed LLM-based agents. Paper 2 provides valuable mechanistic interpretability insights into LLM reasoning circuits, but its contributions are more analytical/descriptive rather than enabling new capabilities. Paper 1's broader applicability across agentic benchmarks and its practical framework for improving agent systems gives it higher potential impact in the rapidly growing field of AI agents.

    vs. Actionable World Representation
    gemini-3.1 · 5/28/2026

    Paper 1 proposes a foundational representation (WorldString) for modeling actionable objects in the physical world, which has profound implications for embodied AI, robotics, and simulation. Its fully differentiable architecture enables seamless integration with policy learning. While Paper 2 offers valuable algorithmic improvements for LLM agents, Paper 1 addresses a more fundamental bottleneck in bridging AI with the physical world, promising broader and more transformative impacts across multiple disciplines.

    vs. ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay
    gpt-5.2 · 5/28/2026

    Paper 2 (ZipRL) likely has higher impact due to a clearer, broadly applicable problem (multi-turn context compression for long-horizon agents), stronger methodological package (RLVR-tailored framework, explicit training-signal densification via HRR, and theoretical guarantees), and demonstrated robustness under extreme extrapolation plus sizable empirical gains across model scales. Its applications span any agentic LLM system constrained by context length, making it timely and widely relevant. Paper 1 is novel in meta-evolving skills, but impact may be narrower and evidence less concrete from the abstract.

    vs. Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
    claude-opus-4.6 · 5/28/2026

    Paper 2 addresses a fundamental and broadly applicable problem in RLHF/RLVR—how to weight rubric criteria during training based on their actual optimization utility rather than static human-assigned importance. This insight is novel, well-validated across multiple settings (multimodal and text-only, three base policies, two datasets), and has immediate practical impact for anyone training LLMs with rubric-based rewards. The 2.5-4x training efficiency gain is significant. Paper 1, while interesting in its meta-evolving framework for agentic systems, addresses a narrower niche and builds incrementally on existing skill-evolving paradigms with less clearly demonstrated generalizability.

    vs. SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment
    gemini-3.1 · 5/28/2026

    Paper 1 introduces a novel hierarchical 'meta-evolving' paradigm that adapts the skill evolving strategy itself at test time. This approach addresses the critical challenge of continual learning and adaptability in LLM agents across varying downstream scenarios without expensive parametric updates. While Paper 2 offers a rigorous methodological improvement for RL-based skill internalization, Paper 1's lightweight, meta-learning approach has broader implications and higher potential impact for the continuous improvement of general agentic systems.

    vs. MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning
    gpt-5.2 · 5/28/2026

    Paper 2 (HiSME) likely has higher scientific impact due to broader applicability and timeliness: hierarchical meta-evolving of skills and the evolving strategy targets a central bottleneck in deployed agentic systems (continual, test-time improvement) and can generalize across many tasks, domains, and LLM backends without costly parameter updates. This paradigm could influence agent design, lifelong learning, and meta-learning communities. Paper 1 is methodologically solid and useful for efficient VLM deployment, but its contribution is more specialized (structured pruning for CoT in VLMs) with narrower cross-field reach.

    vs. Modeling Vehicle-Type-Specific Pedestrian Crash Avoidance Behavior in Safety-Critical Interactions Using Smooth-Mamba Deep Reinforcement Learning
    claude-opus-4.6 · 5/28/2026

    Paper 2 introduces a novel paradigm (hierarchical skill meta-evolving) for LLM-based agentic systems with broader applicability across diverse AI agent scenarios. Its contribution—lightweight test-time adaptation of skill evolving strategies themselves—addresses a fundamental challenge in continual agent learning with wide cross-domain impact. Paper 1, while methodologically sound, addresses a narrower domain (pedestrian-AV interaction modeling) combining existing techniques (Mamba + DDPG) for a specific transportation safety application, limiting its breadth of impact compared to Paper 2's more general AI agent framework.

    vs. MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents
    gemini-3.1 · 5/28/2026

    Paper 1 introduces a fundamental paradigm shift for agent architectures by moving from 'Memory-as-Tool' to 'Memory-as-Cognition', addressing critical limitations in how LLMs handle memory and reasoning. Its broad applicability to conversational agents, combined with a novel structural approach and a new benchmark for proactive memory, promises higher foundational impact across AI cognitive architectures compared to Paper 2's narrower focus on test-time skill optimization.

    vs. StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning
    claude-opus-4.6 · 5/28/2026

    Paper 1 (HiSME) introduces a more novel and broadly impactful paradigm—meta-evolving skill frameworks at test time—addressing a fundamental challenge in continual agent learning with a hierarchical approach. Its concept of optimizing the skill evolution strategy itself (meta-skills) is more innovative and generalizable across diverse agentic systems. Paper 2 (StepOPSD) makes a solid but more incremental contribution to credit assignment in RL for agents, combining existing ideas (preference distillation, step-level decomposition, GRPO) in a useful but narrower scope. Paper 1's broader applicability and paradigm-level contribution suggest higher long-term impact.

    vs. From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence
    gemini-3.1 · 5/28/2026

    Paper 1 addresses a fundamental bottleneck in general LLM agentic systems by introducing a lightweight, meta-evolving framework for continuous skill improvement. Its methodological innovation (hierarchical test-time evolving without parametric updates) has broad implications across the rapidly growing field of autonomous AI agents. In contrast, Paper 2 tackles the important but more niche problem of reproducibility in applied industrial Prognostics and Health Management. Consequently, Paper 1 has a higher potential for broad, cross-disciplinary impact and widespread adoption in AI research.

    vs. Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)
    gpt-5.2 · 5/28/2026

    Paper 2 likely has higher impact due to stronger novelty and broader, timelier relevance: meta-evolving the skill-evolution strategy at test time addresses a central bottleneck in LLM agents (continual adaptation without costly parameter updates) and generalizes across many downstream agentic settings. Its benchmark-driven framing suggests wider applicability across RL, agent systems, and LLM tooling. Paper 1 is highly practical and important for safety in data-sensitive domains, but the hybrid neuro-symbolic verification concept is more incremental and its evaluation is narrower (one medical reporting system), potentially limiting breadth despite strong real-world value.

    vs. Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems
    gpt-5.2 · 5/28/2026

    Paper 1 has higher potential impact due to its timely focus on deployed LLM-agent improvement via test-time adaptation without costly model updates, a broadly applicable problem across agentic systems. Its hierarchical meta-evolving approach (learning meta-skills and adapting the skill-evolution strategy from traces) is relatively novel and could generalize to many benchmarks and real-world settings where continual improvement is needed. Paper 2 offers a rigorous, insightful analysis of PPO failure modes in cumulative-damage long-horizon tasks, but its immediate applicability and cross-domain breadth are narrower and more domain-structured.

    vs. Natural Language Query to Configuration for Retrieval Agents
    gpt-5.2 · 5/28/2026

    Paper 1 (BRANE) targets a broadly deployed and costly bottleneck—per-query optimization of retrieval-agent configurations—showing large, quantifiable cost savings at matched accuracy across multiple established benchmarks and clear comparisons to strong baselines. Its methodology (predictive routing with explicit cost-quality tradeoff) is concrete, readily implementable, and immediately applicable to production RAG systems, giving high near-term and cross-domain impact. Paper 2’s hierarchical meta-evolving is conceptually interesting but more speculative, with less clearly grounded rigor and adoption path, making its near-term scientific and practical impact harder to gauge.