Harnessing LLM Agents with Skill Programs

Hongjun Liu, Yifei Ming, Shafiq Joty, Chen Zhao

May 18, 2026

arXiv:2605.17734v1

cs.AI (primary)

#574 of 2292 · Artificial Intelligence

Tournament Score: 1463 ± 46

Win Rate: 59%

Rating: 6.5 / 10

Significance 6.5 · Rigor 7 · Novelty 6 · Clarity 6.5

Abstract

Equipping LLM agents with reusable skills derived from past experience has become a popular and successful approach for tackling complex and long-horizon tasks. However, such lessons are often encoded as textual guidance that remains largely advisory, lacking explicit mechanisms for when and how to intervene in the agent loop. To bridge the gap, we introduce HASP (Harnessing LLM Agents with Skill Programs), a new framework that upgrades skills into executable Program Functions (PFs). Rather than offering passive advice, PFs act as executable guardrails that activate on failure-prone states and modify the next action or inject corrective context. HASP is highly modular: it can be applied at inference time for direct agent-loop intervention, during post-training to provide structured supervision, or for self-improvement by evolving validated, teacher-reviewed PFs. Empirically, HASP drives substantial gains compared to both training-free and training-based methods on web-search, math reasoning, and coding tasks. For example, on web-search reasoning, inference-time PFs alone improve the average performance by 25% compared to (multi-loop) ReAct Agent, while post-training and controlled evolution achieve a 30.4% gain over Search-R1. To provide deeper insights into HASP, our mechanism analysis reveals how PFs trigger and intervene, how skills are internalized, and the requirement for stable skill library evolution.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Harnessing LLM Agents with Skill Programs (HASP)

1. Core Contribution

HASP introduces a conceptual shift in how LLM agent skills are represented and deployed. Instead of encoding skills as textual advice injected into prompts (which models frequently ignore), HASP transforms them into executable Program Functions (PFs) — typed Python objects with explicit `should_activate` predicates and `intervene` methods that directly modify agent actions or inject corrective context at runtime. This moves from advisory to interventional skill representation.
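To make the advisory-versus-interventional distinction concrete, here is a minimal sketch of what a PF with a `should_activate` predicate and an `intervene` method might look like. Only those two method names come from the assessment above; `AgentState`, `Intervention`, `RepeatedQueryPF`, and the `search:` action format are hypothetical names chosen for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentState:
    """Hypothetical snapshot of the agent loop at one step."""
    question: str
    history: list[str] = field(default_factory=list)  # past actions, oldest first


@dataclass
class Intervention:
    """Hypothetical result of a PF firing: a replacement action and/or extra context."""
    new_action: Optional[str] = None
    injected_context: Optional[str] = None


class RepeatedQueryPF:
    """Illustrative PF: fires when the agent proposes a search it has already run."""

    name = "avoid_repeated_query"

    def should_activate(self, state: AgentState, proposed_action: str) -> bool:
        # Activation predicate: the failure-prone state is an exact repeat of a past search.
        return proposed_action.startswith("search:") and proposed_action in state.history

    def intervene(self, state: AgentState, proposed_action: str) -> Intervention:
        # This PF does not rewrite the action; it only injects corrective context.
        return Intervention(
            injected_context="This query was already issued; reformulate it or answer."
        )


state = AgentState(question="Who founded X?", history=["search: founder of X"])
pf = RepeatedQueryPF()
fired = pf.should_activate(state, "search: founder of X")
result = pf.intervene(state, "search: founder of X") if fired else None
```

The point of the sketch is that the trigger condition and the correction are both code, so the harness can enforce them rather than hoping the model follows a prompt hint.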

The framework operates at three levels: (1) inference-time guardrails that intercept and repair agent decisions, (2) post-training supervision derived from PF intervention records (supporting SFT, rejection sampling, and on-policy distillation), and (3) self-improving library evolution where residual failures generate new candidate PFs that undergo executable validation and teacher review before admission.
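The third level, admission of new candidate PFs, can be pictured as a two-stage gate. The sketch below is an assumption-laden illustration: the validation criteria (fire on every known failure trace, stay silent on clean traces, do not crash), the `admit` function, and the stubbed `teacher_approves` callback are all hypothetical, standing in for whatever executable validation and teacher review the paper actually uses.

```python
from typing import Callable

# Hypothetical candidate: a predicate over a textual trace of the agent's actions.
Candidate = Callable[[str], bool]


def admit(
    candidate: Candidate,
    failure_traces: list[str],
    clean_traces: list[str],
    teacher_approves: Callable[[Candidate], bool],
) -> bool:
    """Admit a candidate only if it passes executable validation and teacher review."""
    try:
        fires_on_failures = all(candidate(t) for t in failure_traces)
        silent_on_clean = not any(candidate(t) for t in clean_traces)
    except Exception:
        return False  # a candidate that crashes fails executable validation
    return fires_on_failures and silent_on_clean and teacher_approves(candidate)


def good(trace: str) -> bool:
    return "search: same query | search: same query" in trace


def noisy(trace: str) -> bool:
    return True  # fires everywhere, so it should be rejected


failures = ["search: same query | search: same query | answer: ?"]
clean = ["search: a | answer: b"]
# The teacher is stubbed to always approve; a real review would call a teacher model.
library = [c for c in (good, noisy) if admit(c, failures, clean, lambda _pf: True)]
```

Under these assumptions only `good` enters the library, which mirrors the idea that unvalidated candidates are kept out rather than merged by default.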

The key insight is that skills should specify *when* they activate and *how* they change behavior, not merely *what* the agent should do. This is operationalized through a state-action intervention interface that separates the base policy's proposal from the harness's correction.
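The separation between the base policy's proposal and the harness's correction can be sketched as a single loop step. This is a simplified illustration, not the paper's interface: here a PF is reduced to a hypothetical `(predicate, repair)` pair of plain functions, and `harness_step`, `eager_policy`, and the action strings are invented for the example.

```python
from typing import Callable, Optional

# Assumption for brevity: a PF is a pair of functions over (history, action).
Predicate = Callable[[list[str], str], bool]
Repair = Callable[[list[str], str], Optional[str]]
PF = tuple[Predicate, Repair]


def harness_step(
    policy: Callable[[list[str]], str], pfs: list[PF], history: list[str]
) -> tuple[str, str]:
    """Return (proposed, executed): the policy's proposal and the harness's final action."""
    proposed = policy(history)
    executed = proposed
    for predicate, repair in pfs:
        if predicate(history, executed):
            replacement = repair(history, executed)
            if replacement is not None:
                executed = replacement
    return proposed, executed


def eager_policy(history: list[str]) -> str:
    return "answer: unknown"  # toy policy that always answers immediately


def answered_without_evidence(history: list[str], action: str) -> bool:
    return action.startswith("answer:") and not any(h.startswith("search:") for h in history)


def force_search(history: list[str], action: str) -> Optional[str]:
    return "search: gather evidence first"


pfs = [(answered_without_evidence, force_search)]
proposed, executed = harness_step(eager_policy, pfs, [])
_, executed_later = harness_step(eager_policy, pfs, ["search: founder of X"])
```

Keeping both `proposed` and `executed` is what would let intervention records double as training signal: the pair shows what the policy wanted to do and what the harness did instead.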

2. Methodological Rigor

The experimental design is thorough, spanning three diverse domains (web-search reasoning, math, coding) with a systematic 2×3 grid of post-training configurations (fixed/evolving library × SFT/RS/OPD). The ablation studies in Table 5 are particularly well-structured, isolating contributions of individual components (PF vs. teacher), supervision signals (timing, mode, correctness, outcome), and filtering mechanisms.
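For readers who want the 2×3 grid spelled out, the six post-training settings are simply the Cartesian product of the two library modes and the three recipes. The label format below is an arbitrary choice for illustration.

```python
from itertools import product

libraries = ["fixed", "evolving"]
recipes = ["SFT", "RS", "OPD"]

# Every combination of library mode and training recipe.
grid = [f"{lib}-library + {rec}" for lib, rec in product(libraries, recipes)]

for setting in grid:
    print(setting)
```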

Strengths in rigor:

The signal ablation demonstrates that all four scoring dimensions contribute meaningfully (drops of 7.8–15.5 points), validating the multi-dimensional supervision design.

The filtering ablation reveals that unfiltered evolution is actively harmful (36.3% vs. 60.3% with full filtering), providing strong evidence for the controlled evolution thesis.

Training dynamics (Figures 3, 5–8) are tracked across all six settings, enabling mechanistic understanding of recipe differences.

The case studies (Appendix G.4) provide full trajectory-level evidence of how PFs intervene.

Weaknesses in rigor:

All results are single-seed runs (acknowledged by authors). The claimed ±1.5 point standard deviation from "internal smoke tests" is not formally reported.

The auxiliary teacher (GPT-4o) plays a significant role in the strongest variants, making it difficult to fully disentangle PF-design benefits from teacher-model benefits. The "PF-only" ablation partially addresses this but shows notably weaker results (51.0% vs. 56.2%).

The initial PF library is hand-designed from failure patterns, introducing substantial human engineering that may not transfer easily to new domains.

Comparisons with RL-based methods (AgentFlow) show mixed results — AgentFlow outperforms on 2Wiki and substantially outperforms on AIME24 math, suggesting HASP's elicitation-oriented approach has fundamental limitations for tasks requiring novel strategy discovery.

3. Potential Impact

Direct applications: The framework's modularity is genuinely useful — the same PF interface works across inference-time intervention, training data generation, and library evolution. This could enable practical deployment patterns where PFs serve as safety guardrails or quality control mechanisms in production agent systems.

Broader influence: The paper advances the programmatic control of LLM agents, connecting to the growing literature on constrained decoding, structured generation, and agent guardrails. The idea that skills should be executable rather than advisory could influence how agent memory and experience replay systems are designed.

Limitations on impact: The heavy reliance on domain-specific PF design (16 web-search skills, 11 math skills, 12 code skills, each with hand-crafted activation predicates) limits out-of-the-box transferability. The self-improvement loop partially addresses this but still requires careful filtering infrastructure.

4. Timeliness & Relevance

The paper addresses a genuine and timely problem: LLM agents frequently exhibit recognizable failure patterns that textual prompting cannot reliably prevent. The positioning against Search-R1, AgentFlow, and other recent agent training methods (many from 2025) demonstrates awareness of the rapidly moving frontier. The emphasis on structured intervention rather than end-to-end RL provides a complementary approach that could be combined with existing methods.

5. Strengths & Limitations

Key strengths:

The PF abstraction is clean and well-defined, with clear interfaces and intervention semantics

Comprehensive evaluation across three domains with systematic ablations

The filtering analysis (Table 5, bottom block) provides compelling evidence that controlled evolution is essential

Detailed appendices with full PF implementations, prompts, and training configurations enhance reproducibility

The mechanism analysis (Section 4.2) provides genuine insight into which skills internalize and which remain necessary at inference time

Notable limitations:

The "skill internalization" analysis (Section 4.2) reveals that only behavior-correcting PFs are internalized while input-dependent PFs continue to require runtime intervention — this suggests the approach may hit a ceiling where fundamental capability gaps cannot be addressed by PF-style corrections

The coding results, while positive, show relatively modest gains over strong baselines like P-GRPO (+0.1 point), and the PF interface is constrained to audit-only mode (no code rewriting)

The framework's complexity is substantial — the appendices reveal 17 distinct teacher invocation sites, multi-layer skill selection, phase handlers, rate limits, and numerous hyperparameters, raising questions about practical deployability

The paper does not evaluate on truly open-ended or long-horizon tasks where the failure taxonomy might be less predictable

Summary

HASP presents a well-engineered framework with a sound conceptual contribution (executable vs. advisory skills) and strong empirical results, particularly on web-search reasoning. The ablation studies are thorough and informative. However, the heavy engineering overhead, reliance on teacher models for strongest results, and limited novelty in the underlying learning mechanisms (SFT/RS/OPD are standard) temper the fundamental contribution. The work is best understood as a sophisticated systems contribution that demonstrates the value of structured intervention in agent loops, rather than a fundamental advance in agent learning.

Rating: 6.5 / 10

Significance 6.5 · Rigor 7 · Novelty 6 · Clarity 6.5

Generated May 19, 2026

Comparison History (17)

vs. What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

claude-opus-4.6 · 5/20/2026

Paper 1 addresses a fundamental question about why code in pretraining improves reasoning, providing controlled experiments at scale (10T tokens) that challenge a widely-held assumption. Its findings—that structured reasoning traces, not executable code per se, drive reasoning gains—have broad implications for data-centric optimization across the entire foundation model training community. Paper 2 presents a useful but more incremental framework for LLM agents. Paper 1's mechanistic insights and practical guidance for training data composition will likely influence a wider range of future research and industry practices.

vs. Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination

gpt-5.2 · 5/20/2026

Paper 2 likely has higher scientific impact: it introduces a mechanistic, head-level causal account of modality-conflict hallucinations across multiple MLLMs (strong novelty + rigor via path patching and ablations), yielding broadly relevant insights for interpretability and multimodal safety. Its intervention (MACI) is targeted, inference-time, and shown to transfer zero-shot across models and benchmarks, increasing real-world applicability. The findings generalize across architectures and connect to wider work on causal interpretability, making its impact broader than Paper 1’s primarily engineering-oriented agent-skill framework.

vs. Learning Quantifiable Visual Explanations Without Ground-Truth

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact due to timeliness and broad applicability: agentic LLMs are a rapidly expanding area, and a modular framework for executable, reusable skills can transfer across many tasks (web, math, coding) and deployment settings (inference-time guardrails, post-training supervision, self-improvement). The reported large empirical gains on strong baselines suggest practical relevance. Paper 1 addresses an important XAI evaluation gap and proposes a novel metric plus adapter, but XAI impact is often constrained by domain-specific validation needs and slower adoption compared to agent frameworks.

vs. Adaptive auditing of AI systems with anytime-valid guarantees

gemini-3.1 · 5/19/2026

Paper 1 addresses a critical bottleneck in AI safety and evaluation by providing rigorous, anytime-valid statistical guarantees for adaptive auditing. Its strong theoretical foundation and applicability to AI compliance give it broader, more foundational impact compared to Paper 2's architectural improvements for LLM agents.

vs. Actionable World Representation

claude-opus-4.6 · 5/19/2026

WorldString proposes a fundamentally new representation paradigm for physical world models—unified actionable object state manifolds learned from point clouds/RGB-D—addressing a core gap in world modeling. Its differentiable architecture enabling integration with policy learning and neural dynamics has broad implications for robotics, simulation, and digital twins. While Paper 2 (HASP) presents solid engineering contributions for LLM agent skill execution with strong empirical gains, it is more incremental within the crowded LLM agent framework space. Paper 1's foundational contribution to physical world representation has greater long-term scientific impact potential across multiple fields.

vs. OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models

gpt-5.2 · 5/19/2026

Paper 2 (HASP) likely has higher scientific impact due to its broad, timely relevance to LLM agent reliability and long-horizon task performance, with clear, reusable mechanism design (executable skill programs) and multiple integration pathways (inference-time, post-training, self-improvement). It reports substantial empirical gains across diverse benchmarks (web-search, math, coding), suggesting strong real-world applicability and cross-domain impact. Paper 1 is novel for black-box vision interpretability and ontology induction, but its applicability is more specialized to vision/XAI and may see narrower uptake compared to general agent frameworks.

vs. ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact due to broader cross-domain applicability and timeliness: executable skill programs for LLM agents can improve reliability across many tasks (web, math, coding) and can be adopted widely at inference/post-training/self-improvement. Its modular framework and reported large empirical gains suggest immediate real-world utility and influence across AI research and tooling. Paper 1 is innovative and potentially high-impact in clinical decision support, but its impact is narrower (cardiology/ECG), with heavier deployment/regulatory barriers and a smaller affected research community.

vs. Voices in the Loop: Mapping Participatory AI

gemini-3.1 · 5/19/2026

Paper 2 introduces a highly relevant framework for improving LLM agents, a rapidly growing and impactful area in AI. Its method of upgrading skills into executable programs demonstrates significant performance gains across multiple domains (search, math, coding). This technical innovation offers broader and more immediate applicability to the wider AI community compared to Paper 1, which primarily provides a mapping and repository tool with a more specialized impact footprint in policy and ethics.

vs. LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs

claude-opus-4.6 · 5/19/2026

TESSERA presents a more novel and methodologically rigorous contribution by combining LLMs with MCTS over knowledge graphs in a principled neuro-symbolic framework. It addresses the fundamental challenge of compositional reasoning with credit assignment, offering a general paradigm applicable beyond drug-disease mechanisms. Paper 2 (HASP), while showing strong empirical gains, represents a more incremental advance in the crowded LLM agent skill-learning space. TESSERA's interdisciplinary bridging of AI and biomedical discovery, along with its principled integration of neural and symbolic methods, gives it broader and more lasting impact potential.

vs. A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

claude-opus-4.6 · 5/19/2026

HASP introduces a broadly applicable framework that converts learned skills into executable program functions for LLM agents, demonstrating substantial empirical gains (25-30%+) across multiple diverse task domains (web-search, math, coding). Its modularity—applicable at inference, post-training, and self-improvement stages—gives it wide practical utility. While A2RBench offers a valuable benchmark contribution with formal verification guarantees, benchmarks typically have narrower impact than new training/inference frameworks. HASP's approach addresses a fundamental limitation in agentic AI and is likely to influence multiple research directions in LLM agent development.

vs. QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI

gpt-5.2 · 5/19/2026

Paper 2 (HASP) likely has higher scientific impact because it introduces a broadly applicable, executable skill-program abstraction for LLM agents that intervenes directly in the agent loop, and it demonstrates sizable performance gains across multiple core benchmarks (web search, math, coding) with modular use at inference, post-training, and self-improvement. This is timely for agent reliability and long-horizon task execution, and its programmatic guardrail mechanism could influence both research and production agent frameworks. Paper 1 is valuable for evaluation alignment, but may be more incremental and primarily impacts evaluation workflows.

vs. ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact due to broader applicability and timeliness: a general framework for LLM agents that can intervene at inference time, support post-training, and enable self-improvement, spanning web search, math, and coding. This makes it relevant across many domains using agents and could influence system design patterns widely. Paper 1 is innovative and rigorous for chemistry-diagram understanding and introduces a valuable benchmark, but its impact is more domain-specific to cheminformatics and multimodal chemistry rather than general AI agent behavior.

vs. From LLM-Generated Conjectures to Lean Formalizations: Automated Polynomial Inequality Proving via Sum-of-Squares Certificates

gpt-5.2 · 5/19/2026

Paper 2 has higher potential impact due to a clearer leap in methodological rigor (exact SOS refinement plus Lean-certified proofs) and stronger, broadly valuable real-world guarantees (machine-checked correctness). Its neuro-symbolic pipeline addresses a well-defined scalability bottleneck in automated theorem proving and connects LLM conjecturing to formal verification, with relevance across formal methods, symbolic algebra, and AI for mathematics. Paper 1 is timely and useful for agent engineering, but the idea of executable skill/guardrail modules is closer to incremental systems innovation and may generalize less cleanly with weaker correctness guarantees.

vs. LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

claude-opus-4.6 · 5/19/2026

Paper 1 (HASP) introduces a more broadly applicable framework that transforms LLM agent skills into executable program functions, demonstrating substantial empirical gains across multiple diverse domains (web-search, math reasoning, coding). Its modularity—applicable at inference time, post-training, and for self-improvement—gives it wide applicability. The 25-30% improvements over strong baselines and the mechanistic analysis add rigor. Paper 2 (LMAC) addresses a narrower problem (communication in MARL) with a more incremental contribution of using LLMs to design communication protocols. HASP's breadth of impact and practical utility give it higher potential scientific impact.

vs. M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models

gemini-3.1 · 5/19/2026

Paper 2 introduces a highly novel, gradient-free model merging technique that isolates and merges task vectors along the null space of a specific feature subspace. This elegantly solves the misalignment between mathematical and agentic reasoning without costly retraining. Its theoretical rigor and impressive zero-shot performance gains on complex benchmarks like SWE-Bench give it broader foundational implications for model merging and reasoning synergy compared to the more applied programmatic guardrails in Paper 1.

vs. Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution

gemini-3.1 · 5/19/2026

Paper 1 introduces a broadly applicable framework (HASP) that upgrades text-based skills into executable program functions for LLM agents, impacting multiple domains like web search, math, and coding. In contrast, Paper 2 is highly specialized for competitive programming. The generalizability and modularity of Paper 1's approach across diverse reasoning tasks give it a significantly higher potential for broad scientific impact and widespread adoption in general agentic systems.

vs. TopoEvo: A Topology-Aware Self-Evolving Multi-Agent Framework for Root Cause Analysis in Microservices

claude-opus-4.6 · 5/19/2026

Paper 2 (HASP) addresses a fundamental and broadly applicable challenge in LLM agent design—converting experiential knowledge into executable, reusable skill programs. Its modular framework applicable across inference, post-training, and self-improvement stages gives it wide applicability across diverse domains (web search, math, coding). The 25-30% empirical gains are substantial. Paper 1 (TopoEvo) is technically sophisticated but addresses a narrower domain (microservice RCA). While innovative in combining topology-aware reasoning with LLM agents, its impact is more domain-specific. Paper 2's broader applicability and foundational contribution to LLM agent architectures suggests higher scientific impact.