AIP: A Graph Representation for Learning and Governing Agent Skills

Zachary Blumenfeld, Jim Webber

Jun 3, 2026

arXiv:2606.04781v1

cs.AI (primary), cs.LG

#1331 of 3404 · Artificial Intelligence

Tournament Score: 1427 ± 44 (scale 1050 to 1800)

Win Rate: 41%

Rating: 5.5 / 10

Significance 6 · Rigor 4.5 · Novelty 5.5 · Clarity 7.5

Abstract

Agent Skills today consist largely of free-form prose requiring the agent to read, interpret, and re-derive how to act in every session. This imposes two compounding costs: reduced reliability on implementation-heavy tasks, and difficulty in skill creation and improvement, since editing prose is a fragile process that both humans and agents struggle with, particularly for domain-specific procedural knowledge underrepresented in model training. The Agent Instruction Protocol (AIP) addresses both by modeling a skill as a directed execution graph: discrete steps as nodes backed by deterministic scripts or natural-language descriptions, connected by explicit typed input/output edges, and governed by a schema-validated YAML specification. A compiler meta-skill translates existing human-written skills into this form. The benefits are twofold. First, compiling human-written skills to AIP raised Claude Sonnet's mean task reward from 0.60 to 0.71 and pass rate from 53% to 67% across 27 real agent tasks from SkillsBench - a statistically significant gain (Wilcoxon signed-rank p = 0.011), winning 12 tasks to 2 with 13 ties - often in less wall-clock time. The graph delivers vetted, runnable units to the agent rather than asking it to re-derive code, commands, and tool calls from natural language. Second, on creation and improvement, because each skill is schema-validated, functionally testable, and addressable node-by-node, failures can be diagnosed and repaired precisely. Two authored-skill failures were traced to the script level. After adjusting the AIP spec and recompiling, both recovered with zero regressions (one task going from 0/5 to 5/5), turning skill improvement into a measurable tuning loop rather than a prose rewrite. That same graph structure supports corpus-level governance and skill introspection, and provides a natural action space for reinforcement learning over skills.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: AIP: A Graph Representation for Learning and Governing Agent Skills

1. Core Contribution

The paper introduces the Agent Instruction Protocol (AIP), which reframes agent skills—currently stored as free-form Markdown prose—as directed execution graphs with typed nodes (backed by deterministic scripts or natural-language descriptions) and typed input/output edges, all governed by a schema-validated YAML specification. A "compiler meta-skill" translates existing human-written prose skills into this graph representation. The key insight is that much procedural knowledge in agent skills is deterministic and can be pre-compiled into runnable code rather than re-derived by the LLM at each session, while reserving natural-language nodes for steps requiring judgment.
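The graph described above can be pictured with a small data model. The following is a minimal sketch under stated assumptions, not the AIP schema itself: the class names (`Port`, `Node`, `Edge`), the `kind` values, and the type tags are all hypothetical, and the paper's YAML specification may be organized quite differently. It only illustrates what "typed input/output edges" buys: a mismatch between steps can be caught before the agent runs anything.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Port:
    name: str
    type: str  # hypothetical type tag, e.g. "csv" or "text"


@dataclass
class Node:
    id: str
    kind: str  # assumed values: "script" (deterministic) or "prose" (natural language)
    inputs: list[Port] = field(default_factory=list)
    outputs: list[Port] = field(default_factory=list)


@dataclass(frozen=True)
class Edge:
    src: str
    src_port: str
    dst: str
    dst_port: str


def validate(nodes: list[Node], edges: list[Edge]) -> list[str]:
    """Return human-readable problems; an empty list means the graph is consistent."""
    errors = []
    by_id = {n.id: n for n in nodes}
    for n in nodes:
        if n.kind not in ("script", "prose"):
            errors.append(f"{n.id}: unknown kind {n.kind!r}")
    for e in edges:
        if e.src not in by_id or e.dst not in by_id:
            errors.append(f"{e.src} -> {e.dst}: unknown node")
            continue
        out_types = {p.name: p.type for p in by_id[e.src].outputs}
        in_types = {p.name: p.type for p in by_id[e.dst].inputs}
        if e.src_port not in out_types:
            errors.append(f"{e.src}.{e.src_port}: no such output")
        elif e.dst_port not in in_types:
            errors.append(f"{e.dst}.{e.dst_port}: no such input")
        elif out_types[e.src_port] != in_types[e.dst_port]:
            errors.append(
                f"{e.src}.{e.src_port} -> {e.dst}.{e.dst_port}: "
                f"{out_types[e.src_port]} != {in_types[e.dst_port]}"
            )
    return errors


# Hypothetical two-step skill: a script node feeding a judgment (prose) node.
load = Node("load", "script", outputs=[Port("table", "csv")])
review = Node("review", "prose", inputs=[Port("table", "csv")],
              outputs=[Port("notes", "text")])
graph_edges = [Edge("load", "table", "review", "table")]
print(validate([load, review], graph_edges))  # prints []
```

Because every step has an identifier and declared ports, a failing skill can in principle be localized to one node or one edge, which is the property the review later calls addressability.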

This contribution sits at an interesting intersection: it's neither a new model architecture nor a new prompting technique, but rather a representation design for reusable agent knowledge. The problem it addresses—that prose-based skills are unreliable, hard to debug, and resistant to systematic improvement—is genuine and increasingly relevant as agent skill ecosystems grow.

2. Methodological Rigor

The evaluation uses SkillsBench with 27 tasks across 8 domains, running 5 trials per condition with Claude Sonnet as the solver. The primary comparison (human-curated vs. AIP-compiled skills) yields a statistically significant improvement: mean reward 0.60→0.71 (Wilcoxon p=0.011), with 12 wins, 2 losses, and 13 ties.
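As a rough cross-check of the win/loss tally above, an exact two-sided sign test can be computed from the counts alone. This is a different and cruder test than the paper's Wilcoxon signed-rank test: it ignores the size of each per-task difference and discards the 13 ties, so it is not a reproduction of the reported p = 0.011. The function name `sign_test_two_sided` is illustrative.

```python
from math import comb


def sign_test_two_sided(wins: int, losses: int) -> float:
    """Exact two-sided sign test p-value under H0: win and loss equally likely."""
    n = wins + losses            # ties are dropped, leaving the non-tied pairs
    k = min(wins, losses)        # size of the smaller tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)    # double the tail for a two-sided test


p = sign_test_two_sided(12, 2)   # 12 wins, 2 losses from the reported tally
print(f"{p:.4f}")                # prints 0.0129
```

The result, 2 × (1 + 14 + 91) / 2^14 ≈ 0.013, is of the same order as the reported Wilcoxon p-value, which suggests the headline significance does not hinge on a few large per-task differences. It also makes the small effective sample visible: only 14 pairs carry the result.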

Strengths in methodology:

Appropriate use of non-parametric testing (Wilcoxon signed-rank) given small sample sizes

Stratification across structure class × implementation class to span the gradient of expected benefit

Reporting of both 27-task and 24-task subsets for robustness

Honest treatment of wall-clock improvements as descriptive rather than significant

Significant methodological weaknesses:

Format-author confound: The authors themselves acknowledge this is the most critical limitation. The comparison conflates two effects: (1) the graph structure and (2) improvements the compiler agent makes to scripts during compilation. Without a control arm delivering the same converted scripts in plain Markdown, the source of the uplift cannot be isolated. This is a fundamental threat to the paper's central claim.

Single model: All experiments use only Claude Sonnet, limiting generalizability claims.

Small n: 5 trials per cell and only 14 non-tied pairs for the Wilcoxon test provide limited statistical power.

Skill improvement evidence is anecdotal: The two case studies (offer-letter-generator and bike-rebalance) are compelling illustrations but not systematic evidence. The claim about "improvability" rests on two manually-executed fixes.

The `aip-from-instruction` condition was excluded because it performed terribly, which is informative but also conveniently removes a negative result from the headline numbers.

3. Potential Impact

The paper addresses a practical problem in the rapidly growing agent skills ecosystem. If the representation genuinely improves reliability, the impact could be substantial:

Immediate practical value: Organizations deploying agent skills could adopt AIP's graph format to improve task success rates, particularly on implementation-heavy tasks.

Governance and auditability: The queryable graph structure enabling corpus-level auditing (e.g., finding skills missing approval steps) addresses a real enterprise need as agentic systems scale.

RL over skills: The bounded, typed action space for reinforcement learning is an intellectually appealing framing, though it remains entirely speculative in this paper—no RL experiments are conducted.

Tooling ecosystem: The open-source specification and compiler could seed an ecosystem, though adoption depends heavily on whether the community coalesces around this format.
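The corpus-level audit mentioned under governance (finding skills that lack an approval step) can be illustrated with a tiny in-memory stand-in. Everything here is an assumption made for illustration: the `role` field, the skill names, and the dictionary layout are hypothetical, and the paper's own tooling may store and query skill graphs in an entirely different way.

```python
def skills_missing_role(corpus: dict[str, list[dict]], role: str) -> list[str]:
    """Names of skills whose graphs contain no node tagged with `role`."""
    return sorted(
        name
        for name, nodes in corpus.items()
        if not any(node.get("role") == role for node in nodes)
    )


# Hypothetical corpus: skill name -> list of node records.
corpus = {
    "refund-customer": [
        {"id": "lookup", "role": "fetch"},
        {"id": "pay", "role": "action"},
    ],
    "send-contract": [
        {"id": "draft", "role": "action"},
        {"id": "signoff", "role": "approval"},
    ],
}
print(skills_missing_role(corpus, "approval"))  # prints ['refund-customer']
```

The point of the sketch is that such a query is only possible when steps are discrete, labeled nodes; the same question asked of free-form prose skills would require reading and interpreting each document.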

The paper is clearly positioned as an industry contribution (both authors are from Neo4j), and the graph database angle for governance is a natural extension of their expertise, which both adds credibility and raises questions about bias.

4. Timeliness & Relevance

The timing is strong. Anthropic's Agent Skills specification was introduced in 2025, and this paper arrives while the format is still malleable and the community is actively debating how to package agent knowledge. The problems identified (prose brittleness, difficulty of agent self-improvement, lack of addressability for debugging) are real bottlenecks that practitioners encounter. The connection to the broader trend of structured agent workflows (LangGraph, ADK, DSPy) is well-drawn.

5. Strengths & Limitations

Key Strengths:

Identifies a genuine and timely problem with clear articulation of four distinct limitations of prose-based skills

Pragmatic design that extends rather than replaces the existing Agent Skills specification

Open-source implementation with reproducible benchmarking infrastructure

Honest and detailed limitations section that doesn't oversell

The stratification by structure class (prose-only, mixed, script-heavy) provides useful insight into where AIP helps most

Notable Weaknesses:

The format-author confound undermines the central empirical claim. The paper cannot distinguish whether the graph structure or the script improvements drive the gains.

The skill improvement and RL narratives are largely aspirational—supported by two anecdotes and speculation rather than systematic experimentation.

The "protocol" framing is premature: AIP is currently just a specification the agent reads into context. The paper acknowledges this, but the name still creates an expectation mismatch.

The paper doesn't address potential downsides: compilation errors, cases where the graph structure might be overly rigid, or the token cost of loading YAML specifications.

No comparison with other structured approaches (e.g., DSPy-compiled skills, LangGraph workflows used as skills).

Overall Assessment

This is a well-motivated systems paper that introduces a sensible representation for agent skills and provides preliminary evidence of its benefits. The core idea—that skills should be graphs of typed, testable steps rather than prose—is sound and likely to influence how the community thinks about skill packaging. However, the empirical contribution is weakened by the format-author confound, small sample sizes, and single-model evaluation. The most impactful claims (RL over skills, corpus governance) remain undemonstrated. The paper is best understood as introducing a promising framework with suggestive but not conclusive evidence, rather than as a definitive empirical result.

Rating: 5.5 / 10

Significance 6 · Rigor 4.5 · Novelty 5.5 · Clarity 7.5

Generated Jun 5, 2026

Comparison History (17)

vs. Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

claude-opus-4.6 · 6/6/2026

Paper 1 provides the first comprehensive systems characterization of agent memory, a foundational infrastructure concern for scaling LLM agents. Its taxonomy, profiling methodology, and 10 actionable system recommendations have broad applicability across the entire agent ecosystem. Paper 2 presents a useful but narrower contribution—a graph-based skill representation with solid empirical gains on a specific benchmark. While Paper 2 is well-executed, Paper 1 addresses a more fundamental and widely relevant problem, offering reusable frameworks and insights that will likely influence system design across many agent architectures.

vs. Synapse: Federated Tool Routing via Typed Compendium Artifacts

claude-opus-4.6 · 6/6/2026

Paper 2 (AIP) has higher estimated scientific impact because it addresses a more broadly applicable problem—improving agent reliability and skill governance—with clear empirical results on real tasks (statistically significant improvements on SkillsBench). Its graph-based skill representation is intuitive, practical, and immediately applicable to the rapidly growing AI agent ecosystem. It also opens a natural path toward RL over structured skill spaces. Paper 1 (Synapse) tackles an important but narrower federated learning problem with a more complex theoretical framework and less immediately actionable results for the broader community.

vs. Beyond Semantic Organization: Memory as Execution State Management for Long-Horizon Agents

gpt-5.2 · 6/6/2026

Paper 2 (MAGE) has higher likely scientific impact due to broader applicability and timeliness: execution-state management for long-horizon agents is a central bottleneck across many agentic systems. The state-tree framework (grow/compress/maintain/revise) reframes “memory” from semantic retrieval to trajectory-consistent state reconstruction, with clear real-world benefits (higher success rates and large token reductions) and easy integration into diverse agent stacks. Paper 1 is novel and practical for skill authoring/governance, but is more specific (YAML/graph skill spec) and closer to systems/engineering impact than a generalizable agent cognition/memory paradigm.

vs. CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

gemini-3.1 · 6/5/2026

Paper 2 addresses a critical, highly timely issue: system-level security and prompt injection defenses for Computer Use Agents. As frontier models increasingly interact with computer environments, security guarantees are a major blocker for real-world deployment. Paper 2's novel approach of single-shot planning to achieve architectural isolation represents a significant step forward in safe AI agent design, likely impacting both AI safety research and commercial agent architectures more broadly and urgently than the skill representation protocol proposed in Paper 1.

vs. Where Should Knowledge Enter? A Layered Framework for Knowledge Infusion in Multimodal Iterative Generative Mo

claude-opus-4.6 · 6/5/2026

Paper 1 proposes a general theoretical framework for knowledge infusion in iterative generative models, identifying four structurally distinct intervention layers. This framework has broad applicability across multimodal generative AI (diffusion models, potentially others), addresses the critical problem of safety and reliability in generative AI, and provides both theoretical grounding and empirical validation with a 70.97% reduction in knowledge-violating outputs. Paper 2, while practically useful, presents a more narrow engineering contribution (a graph-based skill representation for agents) with evaluation limited to one model on one benchmark. Paper 1's conceptual framework is more likely to influence future research directions across multiple subfields.

vs. Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

claude-opus-4.6 · 6/5/2026

Paper 2 (AIP) introduces a novel representational framework—modeling agent skills as directed execution graphs—that addresses fundamental limitations in how LLM agents execute procedural tasks. Its contribution is more foundational: it proposes a new abstraction (graph-based skill representation) with broad applicability across agent architectures, enables reinforcement learning over skills, and supports governance/introspection. While Paper 1 (CHARM) addresses an important practical problem (cascading hallucinations in RAG), it is more narrowly scoped to a specific failure mode in a specific pipeline type. AIP's potential to reshape how agent skills are created, tested, and improved gives it broader cross-field impact and greater long-term significance.

vs. Towards a Science of AI Agent Reliability

gemini-3.1 · 6/5/2026

Paper 2 proposes a foundational evaluation framework for AI agent reliability, addressing a critical bottleneck in the field. By defining comprehensive metrics for consistency, robustness, predictability, and safety, it has the potential to become a standard evaluation methodology that shapes future research directions. While Paper 1 offers an innovative methodology with strong empirical gains, its focus on specific skill representation is narrower, making Paper 2's broad, field-level evaluative contribution likely to yield higher overall scientific impact.

vs. From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation

claude-opus-4.6 · 6/5/2026

Paper 1 addresses a fundamental structural flaw in knowledge editing for LLMs—Epistemic Dissonance—with a novel causal editing paradigm (CODE) that demonstrates dramatic improvements (95.6% self-refutation reduced to 1.8%). It introduces new conceptual frameworks (causal editing vs. static fact overwriting), rigorous causal analysis, and strong empirical results across multiple models. Paper 2 proposes a practical engineering contribution (graph-based skill representation) with moderate improvements on a specific benchmark. While useful, it is more incremental and narrower in theoretical contribution compared to Paper 1's paradigm-shifting insight into how LLMs internalize knowledge updates.

vs. Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

gemini-3.1 · 6/5/2026

Achieving gold-medal-level performance on IMO and IPhO represents a major milestone in AI. Paper 1's unified scaling recipe addresses critical frontiers in long-horizon reasoning and test-time compute, which are highly influential topics for advanced AI development. While Paper 2 offers a practical engineering framework for agent execution, Paper 1 demonstrates a fundamental leap in core model capabilities with profound implications for complex scientific problem-solving and reasoning.

vs. Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

claude-opus-4.6 · 6/5/2026

Paper 2 demonstrates broader scientific impact through several key advantages: (1) It addresses a more fundamental and generalizable problem—runtime harness adaptation for frozen LLMs—applicable across 18 model backbones and 7 environments, showing remarkable transferability. (2) The 88.5% average relative improvement across 126 settings is substantially more impressive than Paper 1's gains. (3) The insight that harness adaptations transfer across models reveals reusable environment-side structure, a novel conceptual contribution. (4) It offers a complementary paradigm to model-centric training, potentially influencing how the field approaches agent improvement. Paper 1, while practical, is more narrowly focused on skill representation format.

vs. AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

gemini-3.1 · 6/5/2026

Paper 2 addresses a critical bottleneck in AI agent deployment: runtime safety and security. By providing an open-source tool and a new benchmark for multi-step attack chains, its real-world applicability and timeliness are exceptionally high. While Paper 1 offers a novel architectural improvement for agent skills, Paper 2's focus on preventing irreversible harm tackles a broader, more urgent challenge across the entire field of autonomous systems.

vs. Grounded Continuation: A Linear-Time Runtime Verifier for LLM Conversations

claude-opus-4.6 · 6/5/2026

Paper 1 (AIP) introduces a more broadly applicable framework—representing agent skills as directed execution graphs—with clear, statistically significant improvements on a substantial benchmark (27 real tasks). It addresses a fundamental bottleneck in agentic AI (skill reliability, creation, and governance) with a structured, schema-validated approach that naturally supports reinforcement learning and corpus-level management. Paper 2 addresses a narrower problem (conversation grounding verification) with modest accuracy gains (+1.3-6.7pp) on small benchmarks. AIP's combination of practical impact, breadth of applicability to the rapidly growing agent ecosystem, and its enabling of systematic skill improvement gives it higher potential scientific impact.

vs. Characterization of Multi-Model Agentic AI Systems on General Tasks via Trace-Driven Simulation

gpt-5.2 · 6/5/2026

Paper 2 is likely to have higher scientific impact because it proposes a novel, reusable representation (typed execution-graph + schema-validated spec) that directly changes how agent skills are authored, tested, governed, and optimized, with clear empirical gains and statistical testing on real tasks. Its method enables broader downstream work (tooling, RL over skills, safety/governance, debugging) across agent systems and domains. Paper 1 offers valuable infrastructure (trace dataset + simulator) for reproducibility and analysis, but its impact is narrower and more contingent on benchmark/agent specifics and potential constraints around proprietary traces.

vs. HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation

gemini-3.1 · 6/5/2026

Paper 1 proposes a fundamental shift in how agent skills are represented (from fragile prose to robust execution graphs), which has broad implications across all LLM agent applications. While Paper 2 presents a strong, rigorous pipeline for embodied AI, its focus is largely restricted to the smart home domain. Paper 1's generalizable methodology for skill compilation, verification, and governance addresses a critical bottleneck in agent reliability and offers higher potential for wide-ranging impact across the field of artificial intelligence.

vs. R-APS: Compositional Reasoning and In-Context Meta-Learning for Constrained Design via Reflective Adversarial Pareto Search

gemini-3.1 · 6/5/2026

Paper 2 offers a deeper theoretical contribution by decomposing conflicting reasoning modes to address fundamental structural failures in LLM planning. While Paper 1 provides a highly practical engineering protocol for skill representation, Paper 2's R-APS framework tackles core cognitive limitations like robustness and memory invalidation without fine-tuning. Furthermore, its demonstration that structured reasoning protocols allow small 4B models to compete with 70B models has profound, widespread implications for AI efficiency, scalability, and complex constrained design.

vs. Neetyabhas: A Framework for Uncertainty-Aware Public Policy Optimization in Rational Agent-Based Models

gpt-5.2 · 6/5/2026

Paper 1 presents a more novel, broadly applicable contribution: a structured, schema-validated graph representation for agent skills with demonstrated, statistically tested performance gains on a real benchmark and a clear path to governance, debugging, and RL over skills. Its methodology includes concrete ablations (before/after compilation) and significance testing, and the approach generalizes across domains where agents execute procedures. Paper 2 targets an important application, but appears closer to established agent-based epidemic simulation + RL control, with limited evidence of methodological novelty, validation against real data, or generality beyond the pandemic policy setting.

vs. ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents

gemini-3.1 · 6/5/2026

Paper 2 proposes a foundational shift in how agent skills are represented by replacing fragile prose with a directed execution graph (AIP). This not only significantly improves task performance and reliability but also enables precise debugging, governance, and reinforcement learning. Paper 1 offers a valuable but more narrow optimization for token efficiency in tool use, whereas Paper 2's methodological innovation has broader implications for the design and scalability of agentic systems across multiple domains.