SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

Zhongyu He, Yuanfan Li, Fei Huang, Tianyu Chen, Siyuan Chen, Xingyang Li, Meng Hsuan Yu, Xiangrong Liu

#1445 of 3355 · Artificial Intelligence
Tournament Score: 1420±41
Win Rate: 65% (15 wins, 8 losses, 23 matches)
Rating: 6.3/10

Abstract

Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during training or persistent skill retrieval at inference, increasing engineering complexity, context length, and deployment latency. We propose Self-Internalizing Reinforcement learning with Intrinsic skills (SIRI), a three-phase framework that enables agents to discover, validate, and internalize skills without external skill generators or inference-time skill banks. SIRI first warms up the policy with GiGPO to acquire basic interaction ability and collect successful skill-free trajectories. It then performs self-skill mining, where the current policy summarizes compact skills from its own successful plain rollouts and validates them through paired skill-augmented and skill-free rollouts. Finally, SIRI distills only beneficial skill-guided action tokens into the plain policy using trajectory-level utility and action-level advantage. At inference, the agent runs with the original prompt only. On ALFWorld and WebShop with Qwen2.5-7B-Instruct, SIRI improves GiGPO from 0.908 to 0.930 on ALFWorld and from 0.728 to 0.813 on WebShop, outperforming prompt-based, RL-based, and memory-augmented baselines. Further analysis shows that our self-mining strategy can achieve performance comparable to distillation from a closed-source large model. Our code is available at https://github.com/kirito618/SIRI.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SIRI

1. Core Contribution

SIRI introduces a three-phase curriculum for LLM agent training that addresses a genuine architectural limitation of prior skill-based RL methods: the dependency on external skill generators during training and persistent skill retrieval at inference time. The framework's central insight is that skills should be *temporary training-time scaffolding* rather than permanent inference-time dependencies.

The three phases are well-motivated: (1) a GiGPO-based warmup to produce high-quality trajectories for skill extraction, (2) self-skill mining where the agent extracts candidate skills from its own successful rollouts and validates them through paired rollouts, and (3) advantage-weighted skill internalization that selectively distills beneficial skill-guided actions into the plain policy. The key novelty lies in the combination of self-mining (no external LLM needed) and selective internalization (advantage-weighted filtering of which tokens to distill), resulting in a clean inference-time model that carries no retrieval overhead.
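
The selective-internalization idea described above can be illustrated with a minimal sketch. This is not the paper's Eq. 15-17: it only shows the general shape of a trajectory-level utility gate combined with an action-level advantage filter. The function names, the `utility_threshold` parameter, and the choice to clip negative advantages to zero are assumptions made for illustration.

```python
from typing import List


def distill_weights(
    skill_utility: float,
    action_advantages: List[float],
    utility_threshold: float = 0.0,  # hypothetical gate value, not from the paper
) -> List[float]:
    """Return one non-negative distillation weight per skill-guided action.

    Trajectory-level gate: if the skill's utility does not exceed the
    threshold, nothing from this trajectory is distilled.
    Action-level filter: only positive-advantage actions receive weight.
    """
    if skill_utility <= utility_threshold:
        return [0.0] * len(action_advantages)
    return [max(a, 0.0) for a in action_advantages]


def weighted_distill_loss(neg_log_probs: List[float], weights: List[float]) -> float:
    """Weighted mean of per-action negative log-likelihoods.

    `neg_log_probs` are assumed to be computed under the plain (skill-free)
    prompt, so minimizing this pushes the plain policy toward the
    selected skill-guided actions.
    """
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * nll for w, nll in zip(weights, neg_log_probs)) / total
```

In this sketch an action with negative advantage contributes nothing to the loss even when its skill passes the gate, which is one simple way to read "distills only beneficial skill-guided action tokens".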

2. Methodological Rigor

The methodology is generally sound and well-specified. The paired rollout validation mechanism (Eq. 10-12) provides a principled approach to skill quality assessment, and the exponential moving average utility tracking with promotion/retirement lifecycle is a sensible design choice. The advantage-weighted internalization (Eq. 15-17) is a thoughtful mechanism that uses both trajectory-level utility gates and action-level GiGPO advantages to filter distillation targets.
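
As a rough illustration of the paired-rollout validation and utility lifecycle mentioned above, the sketch below tracks an exponential moving average of the reward gap between a skill-augmented rollout and its skill-free pair, then promotes or retires the skill. It is not a reproduction of Eq. 10-12. The `SkillRecord` type, the status names, and every default threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SkillRecord:
    text: str
    utility: float = 0.0       # EMA of paired-rollout reward gains
    trials: int = 0            # number of paired rollouts seen so far
    status: str = "candidate"  # "candidate", "active", or "retired"


def update_skill(
    skill: SkillRecord,
    reward_with_skill: float,
    reward_without_skill: float,
    ema_coef: float = 0.9,     # hypothetical smoothing coefficient
    promote_at: float = 0.1,   # hypothetical promotion threshold
    retire_at: float = -0.05,  # hypothetical retirement threshold
    min_trials: int = 3,       # hypothetical minimum evidence before a decision
) -> SkillRecord:
    """Fold one paired rollout into the skill's utility and update its status."""
    gain = reward_with_skill - reward_without_skill
    skill.utility = ema_coef * skill.utility + (1.0 - ema_coef) * gain
    skill.trials += 1
    if skill.trials >= min_trials:
        if skill.utility >= promote_at:
            skill.status = "active"
        elif skill.utility <= retire_at:
            skill.status = "retired"
    return skill
```

A skill whose paired rollouts consistently beat the skill-free rollouts drifts toward `active`, while one that hurts performance drifts toward `retired`. Skills with utility between the two thresholds stay in their current status.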

However, several concerns arise:

  • Benchmark scope is limited. Only ALFWorld and WebShop are evaluated—both are well-established but relatively narrow benchmarks. The gains on ALFWorld (0.908→0.930) are modest in absolute terms. WebShop improvements (0.728→0.813) are more substantial but still limited to a single web interaction domain.
  • Statistical reporting is incomplete. No confidence intervals, standard deviations, or significance tests are reported across runs. With a single random seed (seed=0), it is impossible to assess whether the improvements are statistically robust.
  • Ablation depth. The ablation study (Table 2) is only on WebShop and covers coarse-grained phase removals. More granular ablations—e.g., the effect of the utility gate alone, the advantage weighting alone, the mining frequency, or the EMA coefficient—would strengthen the claims.
  • Phase transitions rely on hand-tuned thresholds (Kwarm, Kmax, Nreq, Kmat, ρhit, ρpos), which may require significant environment-specific tuning, partially undermining the "self-evolving" narrative.
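
The statistical-reporting concern above can be made concrete with a back-of-the-envelope calculation. The sketch below treats the reported numbers as success rates over independent evaluation episodes and asks how many episodes per method are needed before the gap exceeds roughly two standard errors. The independence assumption, the equal episode count for both methods, and the function name `episodes_needed` are assumptions. The calculation covers evaluation sampling noise only, not seed-to-seed training variance, so it is a lower bound on the real uncertainty.

```python
import math


def episodes_needed(p_base: float, p_new: float, z: float = 1.96) -> int:
    """Episodes per method for the gap to exceed z standard errors.

    Assumes independent Bernoulli episodes and the same episode count n
    for both methods, so SE(gap) = sqrt((p1*(1-p1) + p2*(1-p2)) / n).
    """
    gap = abs(p_new - p_base)
    if gap == 0.0:
        raise ValueError("success rates are identical")
    var_sum = p_base * (1.0 - p_base) + p_new * (1.0 - p_new)
    return math.ceil(z * z * var_sum / (gap * gap))


# Success rates taken from the abstract (GiGPO vs. SIRI).
alfworld_n = episodes_needed(0.908, 0.930)
webshop_n = episodes_needed(0.728, 0.813)
```

Under these assumptions the ALFWorld gap needs on the order of 1,200 evaluation episodes per method to clear the threshold, while the WebShop gap needs fewer than 200, which is consistent with the review's view that the ALFWorld margin is the more fragile of the two.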

3. Potential Impact

The practical value of inference-time simplicity is real. Eliminating skill retrieval at deployment reduces latency, context length, and infrastructure complexity, all meaningful advantages for production LLM agents. The self-mining approach also reduces dependency on expensive closed-source models for skill generation, as demonstrated by the convergence experiment with Gemini-3-Flash (Figure 4).

The broader impact could extend to:

  • Hierarchical agent training in other domains (code generation, multi-step tool use)
  • Knowledge distillation paradigms where privileged information is available only at training time
  • Curriculum learning for RL-based LLM fine-tuning

However, the impact is tempered by the fact that the framework is built specifically on GiGPO and its anchor-state mechanism. Generalization to other RL algorithms is partially addressed (GRPO in Appendix B.2), but the performance gap suggests tight coupling with the base optimizer.

4. Timeliness & Relevance

This work is highly timely. The field is actively exploring how to train LLM agents via RL (GRPO, GiGPO, Tree-GRPO), and the question of how to accumulate and reuse experience across episodes is a current bottleneck. The skill internalization idea, which uses privileged context asymmetrically during training, connects to broader trends in self-play, knowledge distillation, and asymmetric actor-critic architectures. The paper positions itself well against a comprehensive set of recent baselines (SkillRL, D2Skill, MemRL, etc.), most from 2025-2026.

5. Strengths & Limitations

Strengths:

  • Clean conceptual framework. The three-phase design is intuitive and well-motivated. The progression from warmup → mining → internalization has a natural curriculum structure.
  • Self-contained pipeline. No external LLMs, no persistent retrieval infrastructure—this is a meaningful engineering simplification.
  • Comprehensive baselines. The comparison covers prompt-based, RL-based, and memory-augmented methods, including very recent work.
  • The Gemini-3-Flash comparison (Figure 4) is a compelling analysis showing that self-mined skills approach the quality of skills generated by a closed-source model.
  • Code availability enhances reproducibility.

Limitations:

  • Only two benchmarks, both relatively saturated in the research community. Testing on AppWorld, SWE-Bench, or other complex agent environments would significantly strengthen claims.
  • Single seed evaluation undermines confidence in reported numbers, especially for ALFWorld where the margin is small (~2%).
  • Scalability to harder tasks is untested. The method assumes the base policy can achieve enough successes during warmup to bootstrap skill mining—this may fail in truly sparse-reward settings.
  • Skill quality evaluation (Appendix B.3) using LLM-as-a-judge is qualitative and circular (LLMs judging LLM-generated skills).
  • The 1.5B results (Table 3) show marginal ALFWorld gains (0.867→0.875), raising questions about the method's effectiveness under constrained model capacity.
  • Training cost is non-trivial (39 hours on 8×A100 for ALFWorld with the 7B model), and the paired rollout design doubles sample requirements during the self-skill mining phase.

Overall Assessment

SIRI presents a well-structured and practically motivated approach to skill-based LLM agent training that makes a genuine contribution by eliminating inference-time skill retrieval. The self-mining and advantage-weighted internalization mechanisms are technically sound. However, the empirical evaluation is limited in scope and statistical rigor, and the improvements, while consistent, are modest on some benchmarks. The work represents a solid incremental advance in the rapidly evolving space of RL for LLM agents, with clear practical benefits but limited evidence of transformative impact.

Rating: 6.3/10
Significance 6 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Generated Jun 2, 2026

    Comparison History (23)

    vs. Formal Semantics for Agentic Tool Protocols: A Process Calculus Approach
    claude-opus-4.6 · 6/5/2026

    Paper 2 provides foundational formal semantics for agentic tool protocols, bridging theory (process calculus, bisimulation) with a rapidly growing practical domain (LLM agent-tool integration via MCP). Its contributions—proving structural bisimilarity between SGD and MCP, identifying expressivity gaps, and proposing type-system extensions—establish a theoretical framework that could influence safety verification, protocol design standards, and formal methods across the entire agent ecosystem. Paper 1, while solid empirically, offers incremental RL improvements on specific benchmarks. Paper 2's breadth of impact across formal methods, AI safety, and industry standards gives it higher long-term scientific significance.

    vs. PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?
    gpt-5.2 · 6/5/2026

    Paper 2 likely has higher impact: it introduces a timely safety benchmark for long-term memory in LLM assistants, a broadly relevant emerging deployment setting. PersistBench targets underexplored risks (cross-domain leakage, memory-induced sycophancy) and provides empirical evidence across 18 major models, making it immediately useful for evaluation, regulation, and future method development. Its applicability spans academia and industry and can shape standards. Paper 1 is a solid algorithmic contribution with strong results but is narrower (agent RL training on specific benchmarks) and less directly tied to urgent safety concerns affecting many deployed systems.

    vs. SciDER: Scientific Data-centric End-to-end Researcher
    gpt-5.2 · 6/5/2026

    Paper 1 likely has higher impact due to broader real-world applicability and cross-field relevance: an end-to-end, data-centric, multimodal multi-agent framework targeting the full scientific workflow (hypothesis→data→code→critique) plus released dataset and a 27B model, enabling adoption and follow-on research across scientific domains. Its novelty lies in integrating raw experimental data processing with executable experimentation synthesis and iterative refinement. Paper 2 is methodologically solid and timely for LLM agent training, but its contributions are more narrowly scoped to skill internalization in RL and evaluated on fewer, task-specific benchmarks.

    vs. Stumbling Into AI Emotional Dependence: How Routine AI Interactions Reshape Human Connection
    gpt-5.2 · 6/5/2026

    Paper 1 offers a concrete, technically novel RL framework (self-mined, validated, and distilled skills) that removes inference-time skill banks, reducing deployment complexity while improving benchmark performance with open models. It is methodologically clearer and more reproducible (code released, quantified gains, controlled comparisons) and can generalize across LLM-agent settings, impacting RL, agentic LLM training, and tool-use. Paper 2 is timely and societally important, but is primarily a synthesis/argument around emerging evidence; impact depends on external datasets and policy uptake, with less methodological novelty in the paper itself.

    vs. Overlaying Governance: A Compositional Authorization Framework for Delegation and Scope in Agentic AI
    claude-opus-4.6 · 6/3/2026

    Paper 1 (SIRI) presents a novel, empirically validated framework for skill internalization in LLM agents with clear quantitative improvements over strong baselines. It addresses a practical and timely problem in LLM agent training with a complete methodology (discovery, validation, internalization) and demonstrates significant gains on established benchmarks. Paper 2 addresses an important governance/authorization problem for agentic AI but is more incremental in nature, extending existing IAM/OAuth concepts to agentic settings. While relevant, its impact is narrower and more domain-specific compared to SIRI's broader applicability to LLM agent training.

    vs. From Long News to Accurate Forecast: Importance-Aware Fusion and PRM-Guided Reflection for Time Series Forecasting
    gemini-3.1 · 6/3/2026

    Paper 1 proposes a fundamental improvement to LLM agent training by internalizing skills and eliminating the need for external skill generators or inference-time retrieval. This addresses critical bottlenecks (latency, context limits) in autonomous agent deployment. Paper 2, while offering a strong methodological approach for time series forecasting with text, addresses a narrower, domain-specific application. Therefore, Paper 1 has a broader potential impact across the rapidly growing field of general-purpose AI agents.

    vs. The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs
    gpt-5.2 · 6/3/2026

    Paper 2 has higher estimated impact due to a more novel and broadly applicable training paradigm: self-mining, validating, and internalizing skills into an LLM agent without external skill generators or inference-time skill banks. This directly reduces deployment complexity/latency and targets a timely bottleneck in long-horizon agent RL. It shows solid empirical gains on standard agent benchmarks and suggests competitiveness with closed-model distillation, improving relevance and adoption potential. Paper 1 is practical for inference budgeting, but its impact is narrower (token allocation policy) and more incremental relative to existing compute-aware routing/scheduling work.

    vs. Tracking the Behavioral Trajectories of Adapting Agents
    claude-opus-4.6 · 6/2/2026

    Paper 2 (SIRI) presents a more complete and practically impactful framework for LLM agent training with self-discovered skills that are internalized into the model, eliminating inference-time overhead. It demonstrates strong empirical results on established benchmarks (ALFWorld, WebShop), offers a novel three-phase training paradigm combining RL with skill mining and distillation, and addresses a broadly relevant problem in LLM agent deployment. Paper 1, while addressing an interesting safety-adjacent problem of tracking agent behavioral drift, is narrower in scope, evaluated on a small dataset (68 pairs), and relies on a relatively straightforward linear probing methodology with limited generalizability demonstrated.

    vs. TrafficRAG: A Multimodal RAG Framework for Traffic Accident Liability Determination
    gemini-3.1 · 6/2/2026

    Paper 2 presents a foundational methodological advancement in LLM agent training (SIRI), addressing broad challenges like long-horizon planning and inference latency without relying on external skill banks. Its generalizability across agentic tasks (demonstrated on ALFWorld and WebShop) gives it a much wider breadth of impact. In contrast, Paper 1 offers a highly domain-specific application of existing RAG and VLM techniques to traffic accident liability, which, while practically useful, has more limited methodological novelty and cross-disciplinary scientific impact.

    vs. FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search
    claude-opus-4.6 · 6/2/2026

    SIRI addresses a fundamental challenge in LLM agent training—skill discovery and internalization—with a novel three-phase framework that eliminates external dependencies at inference time. It demonstrates strong empirical results across multiple benchmarks, offers practical deployment benefits (reduced latency, no skill retrieval), and the self-mining approach matching closed-source model distillation is particularly impactful. FALAT addresses the important but narrower problem of failure attribution in agent trajectories, with moderate absolute accuracy numbers (46% and 29.1%), suggesting the problem remains largely unsolved. SIRI's broader applicability to agent training and its open-source availability give it higher potential impact.

    vs. GuidaPA: Privacy-Preserving Chatbot for Public Administration via Federated Learning
    gemini-3.1 · 6/2/2026

    Paper 2 addresses a fundamental and widely researched problem in LLM agents (long-horizon tasks and skill internalization without external dependencies), offering a novel methodological framework with broad applicability. Paper 1, while practically useful, is a localized application of existing federated learning and fine-tuning techniques to a specific administrative domain, limiting its broader scientific impact.

    vs. Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration
    gpt-5.2 · 6/2/2026

    Paper 2 has higher potential scientific impact because it exposes a broadly relevant, under-specified methodological issue (protocol sensitivity) in LLM confidence calibration that affects many prior and future studies. Its findings generalize across tasks and model families, and it provides actionable guidance (a reporting checklist) that can standardize evaluation practices across fields using LLM uncertainty (QA, alignment, safety, HCI, decision support). While Paper 1 is a solid, practical RL/agent-training contribution, its impact is narrower (specific agent benchmarks and training pipeline) and more incremental relative to fast-moving agent-RL methods.

    vs. Can LLM Agents Sustain Long-Horizon Organizational Dynamics?
    claude-opus-4.6 · 6/2/2026

    Paper 1 (SIRI) presents a novel, well-defined technical framework with clear methodological contributions (self-skill mining, validation, and internalization without external dependencies), demonstrated quantitative improvements on established benchmarks, and broad applicability to LLM agent training. Its approach of eliminating inference-time skill retrieval addresses a practical engineering bottleneck. Paper 2 (TaskWeave) tackles an interesting but narrower problem of organizational simulation with less generalizable contributions, evaluation on a single custom scenario, and weaker methodological rigor in terms of reproducible benchmarks.

    vs. SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning
    gemini-3.1 · 6/2/2026

    Paper 2 addresses a critical and highly timely challenge: the safety and alignment of autonomous LLM agents using tools (via the recent Model Context Protocol). Mitigating power-seeking behaviors and preventing catastrophic failures in agentic systems is a major bottleneck for real-world deployment. While Paper 1 offers a strong algorithmic improvement for agent efficiency, Paper 2's focus on foundational safety and environment-grounded proactive defense promises a broader impact across the rapidly growing field of agentic AI deployment.

    vs. Certificate-Guided Evaluation of Reinforcement Learning Generalization
    gemini-3.1 · 6/2/2026

    Paper 1 addresses a critical bottleneck in the highly active field of LLM agents by eliminating the need for external skill generators and reducing inference-time context length and latency. Its approach to self-internalizing skills offers immediate practical benefits and high potential for widespread adoption in building efficient, long-horizon AI agents. While Paper 2 provides a valuable theoretical benchmarking tool for RL generalization, Paper 1's alignment with current trends in LLM deployment gives it a broader and more immediate potential scientific and practical impact.

    vs. UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents
    claude-opus-4.6 · 6/2/2026

    SIRI presents a more broadly applicable and methodologically novel framework. Its three-phase approach to skill discovery, validation, and internalization without external generators addresses fundamental challenges in LLM agent training across multiple domains. The self-mining strategy matching closed-source model distillation is a significant finding. The method is domain-agnostic, demonstrated on diverse benchmarks, and the internalization concept (no inference overhead) is elegant. UI-KOBE, while practical, is more narrowly focused on mobile GUI agents and relies on a relatively straightforward knowledge graph approach with less generalizable contributions.

    vs. From XXLTraffic to EvoXXLTraffic: Scaling Traffic Forecasting to Sensor-Evolving Networks
    gemini-3.1 · 6/2/2026

    While Paper 1 offers a clever algorithmic improvement for LLM agents, Paper 2 introduces a massive, multi-decade dataset that fundamentally challenges and redefines the assumptions of spatio-temporal forecasting. By exposing the failure of current state-of-the-art methods in realistic, evolving sensor networks, Paper 2 is likely to establish a new standard benchmark and drive extensive follow-up research in continual learning and graph neural networks.

    vs. Tackling the Root of Misinformation by Teaching Laypeople about Logical Fallacies via Socratic Questioning and Critical Argumentation
    gemini-3.1 · 6/2/2026

    While Paper 2 addresses a critical societal issue (misinformation), Paper 1 offers a foundational technical contribution to the rapidly advancing field of LLM agents. By solving significant bottlenecks in skill-based agent training (context length, external dependency, and latency), SIRI provides a highly scalable and reusable framework. This methodological innovation is likely to be widely adopted and built upon within the core AI and machine learning communities, resulting in a higher concentrated scientific impact.

    vs. Rubric-Guided Process Reward for Stepwise Model Routing
    gemini-3.1 · 6/2/2026

    Paper 2 (SIRI) introduces a novel approach for LLM agents to internalize skills without relying on external skill generators or inference-time retrieval, significantly reducing engineering complexity, context length, and deployment latency. This addresses a major bottleneck in long-horizon autonomous agents, giving it broader potential applications in real-world agent deployment compared to Paper 1's focus on stepwise model routing. SIRI's self-mining and distillation methodology demonstrates strong rigor and offers a highly scalable paradigm for training autonomous AI systems.

    vs. AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security
    gpt-5.2 · 6/2/2026

    Paper 2 likely has higher impact due to broader real-world relevance (agent safety/security), wider cross-field applicability (alignment, security, systems, deployment), and timeliness as open-world agents proliferate. Its contributions span taxonomy updates, data engine with purification, efficient SFT/RL environment, and an online guardrail—suggesting an end-to-end framework with practical deployment advantages and open releases. Paper 1 is technically solid and novel for skill internalization in RL-trained LLM agents, but its impact is narrower (performance gains on specific benchmarks) and less societally urgent than scalable safety alignment.