SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training
Zhongyu He, Yuanfan Li, Fei Huang, Tianyu Chen, Siyuan Chen, Xingyang Li, Meng Hsuan Yu, Xiangrong Liu
Abstract
Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during training or persistent skill retrieval at inference, increasing engineering complexity, context length, and deployment latency. We propose Self-Internalizing Reinforcement learning with Intrinsic skills (SIRI), a three-phase framework that enables agents to discover, validate, and internalize skills without external skill generators or inference-time skill banks. SIRI first warms up the policy with GiGPO to acquire basic interaction ability and collect successful skill-free trajectories. It then performs self-skill mining, where the current policy summarizes compact skills from its own successful plain rollouts and validates them through paired skill-augmented and skill-free rollouts. Finally, SIRI distills only beneficial skill-guided action tokens into the plain policy using trajectory-level utility and action-level advantage. At inference, the agent runs with the original prompt only. On ALFWorld and WebShop with Qwen2.5-7B-Instruct, SIRI improves GiGPO from 0.908 to 0.930 on ALFWorld and from 0.728 to 0.813 on WebShop, outperforming prompt-based, RL-based, and memory-augmented baselines. Further analysis shows that our self-mining strategy can achieve performance comparable to distillation with a closed-source large model. Our code is available at https://github.com/kirito618/SIRI.
AI Impact Assessments
Scientific Impact Assessment: SIRI
1. Core Contribution
SIRI introduces a three-phase curriculum for LLM agent training that addresses a genuine architectural limitation of prior skill-based RL methods: the dependency on external skill generators during training and persistent skill retrieval at inference time. The framework's central insight is that skills should be *temporary training-time scaffolding* rather than permanent inference-time dependencies.
The three phases are well-motivated: (1) a GiGPO-based warmup to produce high-quality trajectories for skill extraction, (2) self-skill mining where the agent extracts candidate skills from its own successful rollouts and validates them through paired rollouts, and (3) advantage-weighted skill internalization that selectively distills beneficial skill-guided actions into the plain policy. The key novelty lies in the combination of self-mining (no external LLM needed) and selective internalization (advantage-weighted filtering of which tokens to distill), resulting in a clean inference-time model that carries no retrieval overhead.
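To make the selective internalization in phase (3) concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a trajectory-level utility score and per-action advantages are already available; the names `SkillGuidedStep`, `distillation_weights`, and `utility_gate`, as well as the clipping rule, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SkillGuidedStep:
    """One action taken under the skill-augmented prompt (hypothetical container)."""
    action_tokens: List[int]
    advantage: float  # action-level advantage, assumed precomputed (e.g. by GiGPO)


def distillation_weights(
    steps: List[SkillGuidedStep],
    trajectory_utility: float,
    utility_gate: float = 0.0,  # assumed threshold, not taken from the paper
) -> List[float]:
    """Return one distillation weight per step; 0.0 means the step is not distilled."""
    # Trajectory-level gate: if the skill did not help on this task, distill nothing.
    if trajectory_utility <= utility_gate:
        return [0.0 for _ in steps]
    # Action-level filter: keep only actions with positive advantage,
    # weighted by the size of that advantage (an illustrative choice).
    return [max(step.advantage, 0.0) for step in steps]


steps = [
    SkillGuidedStep([11, 12], 0.5),
    SkillGuidedStep([13], -0.2),
    SkillGuidedStep([14, 15], 1.0),
]
print(distillation_weights(steps, trajectory_utility=0.3))   # beneficial skill
print(distillation_weights(steps, trajectory_utility=-0.1))  # harmful skill
```

In this sketch the weights would multiply a per-token imitation loss on the plain (skill-free) prompt, so that a skill that hurt the rollout, or an individual action with non-positive advantage, contributes nothing to the update.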
2. Methodological Rigor
The methodology is generally sound and well-specified. The paired rollout validation mechanism (Eq. 10-12) provides a principled approach to skill quality assessment, and the exponential moving average utility tracking with promotion/retirement lifecycle is a sensible design choice. The advantage-weighted internalization (Eq. 15-17) is a thoughtful mechanism that uses both trajectory-level utility gates and action-level GiGPO advantages to filter distillation targets.
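The paired-rollout validation and utility lifecycle can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than a reproduction of Eq. 10-12: the gain is taken to be the difference in mean success between skill-augmented and skill-free rollouts, and `SkillRecord`, `update_skill`, `decay`, `promote_at`, and `retire_at` are hypothetical names and thresholds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SkillRecord:
    """A mined skill with a running utility estimate (hypothetical structure)."""
    text: str
    utility: float = 0.0
    status: str = "candidate"  # one of: candidate, active, retired


def update_skill(
    record: SkillRecord,
    success_with_skill: List[int],     # 0/1 outcomes of skill-augmented rollouts
    success_without_skill: List[int],  # 0/1 outcomes of paired skill-free rollouts
    decay: float = 0.9,        # EMA decay, assumed value
    promote_at: float = 0.1,   # assumed promotion threshold
    retire_at: float = -0.1,   # assumed retirement threshold
) -> SkillRecord:
    # Paired comparison: how much does the skill change the success rate?
    gain = (sum(success_with_skill) / len(success_with_skill)
            - sum(success_without_skill) / len(success_without_skill))
    # Exponential moving average smooths the noisy per-batch gain.
    record.utility = decay * record.utility + (1.0 - decay) * gain
    # Lifecycle: promote skills that help, retire skills that hurt.
    if record.utility >= promote_at:
        record.status = "active"
    elif record.utility <= retire_at:
        record.status = "retired"
    return record


skill = SkillRecord("check the receptacle before picking up the object")
update_skill(skill, [1, 1, 1, 0], [0, 1, 0, 0], decay=0.5)
print(skill.status, skill.utility)
```

A larger `decay` makes promotion and retirement slower but less sensitive to a single lucky or unlucky batch of paired rollouts, which is the usual trade-off for this kind of tracker.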
However, several concerns arise:
3. Potential Impact
The practical value of inference-time simplicity is real. Eliminating skill retrieval at deployment reduces latency, context length, and infrastructure complexity—meaningful advantages for production LLM agents. The self-mining approach also reduces dependency on expensive closed-source models for skill generation, as demonstrated by the convergence experiment with Gemini-3-Flash (Figure 4).
The broader impact could extend to:
However, the impact is tempered by the fact that the framework is built specifically on GiGPO and its anchor-state mechanism. Generalization to other RL algorithms is partially addressed (GRPO in Appendix B.2), but the performance gap suggests tight coupling with the base optimizer.
4. Timeliness & Relevance
This work is highly timely. The field is actively exploring how to train LLM agents via RL (GRPO, GiGPO, Tree-GRPO), and the question of how to accumulate and reuse experience across episodes is a current bottleneck. The skill internalization idea—using privileged context asymmetrically during training—connects to broader trends in self-play, knowledge distillation, and asymmetric actor-critic architectures. The paper positions itself well against a comprehensive set of recent baselines (SkillRL, D2Skill, MemRL, etc.), most from 2025-2026.
5. Strengths & Limitations
Strengths:
Limitations:
Overall Assessment
SIRI presents a well-structured and practically motivated approach to skill-based LLM agent training that makes a genuine contribution by eliminating inference-time skill retrieval. The self-mining and advantage-weighted internalization mechanisms are technically sound. However, the empirical evaluation is limited in scope and statistical rigor, and the improvements, while consistent, are modest on some benchmarks. The work represents a solid incremental advance in the rapidly evolving space of RL for LLM agents, with clear practical benefits but limited evidence of transformative impact.
Generated Jun 2, 2026
Comparison History (23)
Paper 2 provides foundational formal semantics for agentic tool protocols, bridging theory (process calculus, bisimulation) with a rapidly growing practical domain (LLM agent-tool integration via MCP). Its contributions—proving structural bisimilarity between SGD and MCP, identifying expressivity gaps, and proposing type-system extensions—establish a theoretical framework that could influence safety verification, protocol design standards, and formal methods across the entire agent ecosystem. Paper 1, while solid empirically, offers incremental RL improvements on specific benchmarks. Paper 2's breadth of impact across formal methods, AI safety, and industry standards gives it higher long-term scientific significance.
Paper 2 likely has higher impact: it introduces a timely safety benchmark for long-term memory in LLM assistants, a broadly relevant emerging deployment setting. PersistBench targets underexplored risks (cross-domain leakage, memory-induced sycophancy) and provides empirical evidence across 18 major models, making it immediately useful for evaluation, regulation, and future method development. Its applicability spans academia and industry and can shape standards. Paper 1 is a solid algorithmic contribution with strong results but is narrower (agent RL training on specific benchmarks) and less directly tied to urgent safety concerns affecting many deployed systems.
Paper 1 likely has higher impact due to broader real-world applicability and cross-field relevance: an end-to-end, data-centric, multimodal multi-agent framework targeting the full scientific workflow (hypothesis→data→code→critique) plus released dataset and a 27B model, enabling adoption and follow-on research across scientific domains. Its novelty lies in integrating raw experimental data processing with executable experimentation synthesis and iterative refinement. Paper 2 is methodologically solid and timely for LLM agent training, but its contributions are more narrowly scoped to skill internalization in RL and evaluated on fewer, task-specific benchmarks.
Paper 1 offers a concrete, technically novel RL framework (self-mined, validated, and distilled skills) that removes inference-time skill banks, reducing deployment complexity while improving benchmark performance with open models. It is methodologically clearer and more reproducible (code released, quantified gains, controlled comparisons) and can generalize across LLM-agent settings, impacting RL, agentic LLM training, and tool-use. Paper 2 is timely and societally important, but is primarily a synthesis/argument around emerging evidence; impact depends on external datasets and policy uptake, with less methodological novelty in the paper itself.
Paper 1 (SIRI) presents a novel, empirically validated framework for skill internalization in LLM agents with clear quantitative improvements over strong baselines. It addresses a practical and timely problem in LLM agent training with a complete methodology (discovery, validation, internalization) and demonstrates significant gains on established benchmarks. Paper 2 addresses an important governance/authorization problem for agentic AI but is more incremental in nature, extending existing IAM/OAuth concepts to agentic settings. While relevant, its impact is narrower and more domain-specific compared to SIRI's broader applicability to LLM agent training.
Paper 1 proposes a fundamental improvement to LLM agent training by internalizing skills and eliminating the need for external skill generators or inference-time retrieval. This addresses critical bottlenecks (latency, context limits) in autonomous agent deployment. Paper 2, while offering a strong methodological approach for time series forecasting with text, addresses a narrower, domain-specific application. Therefore, Paper 1 has a broader potential impact across the rapidly growing field of general-purpose AI agents.
Paper 2 has higher estimated impact due to a more novel and broadly applicable training paradigm: self-mining, validating, and internalizing skills into an LLM agent without external skill generators or inference-time skill banks. This directly reduces deployment complexity/latency and targets a timely bottleneck in long-horizon agent RL. It shows solid empirical gains on standard agent benchmarks and suggests competitiveness with closed-model distillation, improving relevance and adoption potential. Paper 1 is practical for inference budgeting, but its impact is narrower (token allocation policy) and more incremental relative to existing compute-aware routing/scheduling work.
Paper 2 (SIRI) presents a more complete and practically impactful framework for LLM agent training with self-discovered skills that are internalized into the model, eliminating inference-time overhead. It demonstrates strong empirical results on established benchmarks (ALFWorld, WebShop), offers a novel three-phase training paradigm combining RL with skill mining and distillation, and addresses a broadly relevant problem in LLM agent deployment. Paper 1, while addressing an interesting safety-adjacent problem of tracking agent behavioral drift, is narrower in scope, evaluated on a small dataset (68 pairs), and relies on a relatively straightforward linear probing methodology with limited generalizability demonstrated.
Paper 2 presents a foundational methodological advancement in LLM agent training (SIRI), addressing broad challenges like long-horizon planning and inference latency without relying on external skill banks. Its generalizability across agentic tasks (demonstrated on ALFWorld and WebShop) gives it a much wider breadth of impact. In contrast, Paper 1 offers a highly domain-specific application of existing RAG and VLM techniques to traffic accident liability, which, while practically useful, has more limited methodological novelty and cross-disciplinary scientific impact.
SIRI addresses a fundamental challenge in LLM agent training—skill discovery and internalization—with a novel three-phase framework that eliminates external dependencies at inference time. It demonstrates strong empirical results across multiple benchmarks, offers practical deployment benefits (reduced latency, no skill retrieval), and the self-mining approach matching closed-source model distillation is particularly impactful. FALAT addresses the important but narrower problem of failure attribution in agent trajectories, with moderate absolute accuracy numbers (46% and 29.1%), suggesting the problem remains largely unsolved. SIRI's broader applicability to agent training and its open-source availability give it higher potential impact.
Paper 2 addresses a fundamental and widely researched problem in LLM agents (long-horizon tasks and skill internalization without external dependencies), offering a novel methodological framework with broad applicability. Paper 1, while practically useful, is a localized application of existing federated learning and fine-tuning techniques to a specific administrative domain, limiting its broader scientific impact.
Paper 2 has higher potential scientific impact because it exposes a broadly relevant, under-specified methodological issue (protocol sensitivity) in LLM confidence calibration that affects many prior and future studies. Its findings generalize across tasks and model families, and it provides actionable guidance (a reporting checklist) that can standardize evaluation practices across fields using LLM uncertainty (QA, alignment, safety, HCI, decision support). While Paper 1 is a solid, practical RL/agent-training contribution, its impact is narrower (specific agent benchmarks and training pipeline) and more incremental relative to fast-moving agent-RL methods.
Paper 1 (SIRI) presents a novel, well-defined technical framework with clear methodological contributions (self-skill mining, validation, and internalization without external dependencies), demonstrated quantitative improvements on established benchmarks, and broad applicability to LLM agent training. Its approach of eliminating inference-time skill retrieval addresses a practical engineering bottleneck. Paper 2 (TaskWeave) tackles an interesting but narrower problem of organizational simulation with less generalizable contributions, evaluation on a single custom scenario, and weaker methodological rigor in terms of reproducible benchmarks.
Paper 2 addresses a critical and highly timely challenge: the safety and alignment of autonomous LLM agents using tools (via the recent Model Context Protocol). Mitigating power-seeking behaviors and preventing catastrophic failures in agentic systems is a major bottleneck for real-world deployment. While Paper 1 offers a strong algorithmic improvement for agent efficiency, Paper 2's focus on foundational safety and environment-grounded proactive defense promises a broader impact across the rapidly growing field of agentic AI deployment.
Paper 1 addresses a critical bottleneck in the highly active field of LLM agents by eliminating the need for external skill generators and reducing inference-time context length and latency. Its approach to self-internalizing skills offers immediate practical benefits and high potential for widespread adoption in building efficient, long-horizon AI agents. While Paper 2 provides a valuable theoretical benchmarking tool for RL generalization, Paper 1's alignment with current trends in LLM deployment gives it a broader and more immediate potential scientific and practical impact.
SIRI presents a more broadly applicable and methodologically novel framework. Its three-phase approach to skill discovery, validation, and internalization without external generators addresses fundamental challenges in LLM agent training across multiple domains. The self-mining strategy matching closed-source model distillation is a significant finding. The method is domain-agnostic, demonstrated on diverse benchmarks, and the internalization concept (no inference overhead) is elegant. UI-KOBE, while practical, is more narrowly focused on mobile GUI agents and relies on a relatively straightforward knowledge graph approach with less generalizable contributions.
While Paper 1 offers a clever algorithmic improvement for LLM agents, Paper 2 introduces a massive, multi-decade dataset that fundamentally challenges and redefines the assumptions of spatio-temporal forecasting. By exposing the failure of current state-of-the-art methods in realistic, evolving sensor networks, Paper 2 is likely to establish a new standard benchmark and drive extensive follow-up research in continual learning and graph neural networks.
While Paper 2 addresses a critical societal issue (misinformation), Paper 1 offers a foundational technical contribution to the rapidly advancing field of LLM agents. By solving significant bottlenecks in skill-based agent training (context length, external dependency, and latency), SIRI provides a highly scalable and reusable framework. This methodological innovation is likely to be widely adopted and built upon within the core AI and machine learning communities, resulting in a higher concentrated scientific impact.
Paper 2 (SIRI) introduces a novel approach for LLM agents to internalize skills without relying on external skill generators or inference-time retrieval, significantly reducing engineering complexity, context length, and deployment latency. This addresses a major bottleneck in long-horizon autonomous agents, giving it broader potential applications in real-world agent deployment compared to Paper 1's focus on stepwise model routing. SIRI's self-mining and distillation methodology demonstrates strong rigor and offers a highly scalable paradigm for training autonomous AI systems.
Paper 2 likely has higher impact due to broader real-world relevance (agent safety/security), wider cross-field applicability (alignment, security, systems, deployment), and timeliness as open-world agents proliferate. Its contributions span taxonomy updates, data engine with purification, efficient SFT/RL environment, and an online guardrail—suggesting an end-to-end framework with practical deployment advantages and open releases. Paper 1 is technically solid and novel for skill internalization in RL-trained LLM agents, but its impact is narrower (performance gains on specific benchmarks) and less societally urgent than scalable safety alignment.