AIP: A Graph Representation for Learning and Governing Agent Skills
Zachary Blumenfeld, Jim Webber
Abstract
Agent Skills today consist largely of free-form prose requiring the agent to read, interpret, and re-derive how to act in every session. This imposes two compounding costs: reduced reliability on implementation-heavy tasks, and difficulty in skill creation and improvement, since editing prose is a fragile process that both humans and agents struggle with, particularly for domain-specific procedural knowledge underrepresented in model training. The Agent Instruction Protocol (AIP) addresses both by modeling a skill as a directed execution graph: discrete steps as nodes backed by deterministic scripts or natural-language descriptions, connected by explicit typed input/output edges, and governed by a schema-validated YAML specification. A compiler meta-skill translates existing human-written skills into this form. The benefits are twofold. First, compiling human-written skills to AIP raised Claude Sonnet's mean task reward from 0.60 to 0.71 and pass rate from 53% to 67% across 27 real agent tasks from SkillsBench - a statistically significant gain (Wilcoxon signed-rank p = 0.011), winning 12 tasks to 2 with 13 ties - often in less wall-clock time. The graph delivers vetted, runnable units to the agent rather than asking it to re-derive code, commands, and tool calls from natural language. Second, on creation and improvement, because each skill is schema-validated, functionally testable, and addressable node-by-node, failures can be diagnosed and repaired precisely. Two authored-skill failures were traced to the script level. After adjusting the AIP spec and recompiling, both recovered with zero regressions (one task going from 0/5 to 5/5), turning skill improvement into a measurable tuning loop rather than a prose rewrite. That same graph structure supports corpus-level governance and skill introspection, and provides a natural action space for reinforcement learning over skills.
AI Impact Assessments
Scientific Impact Assessment: AIP: A Graph Representation for Learning and Governing Agent Skills
1. Core Contribution
The paper introduces the Agent Instruction Protocol (AIP), which reframes agent skills—currently stored as free-form Markdown prose—as directed execution graphs with typed nodes (backed by deterministic scripts or natural-language descriptions) and typed input/output edges, all governed by a schema-validated YAML specification. A "compiler meta-skill" translates existing human-written prose skills into this graph representation. The key insight is that much procedural knowledge in agent skills is deterministic and can be pre-compiled into runnable code rather than re-derived by the LLM at each session, while reserving natural-language nodes for steps requiring judgment.
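To make the representation concrete, the following is a minimal sketch of a skill as a typed execution graph. It is not the AIP schema: the class names (`Node`, `Edge`), the `kind` labels (`"script"`, `"prompt"`), the type names, and the example skill are all assumptions made for illustration, and the actual specification is a schema-validated YAML file whose fields are not given in this assessment. The sketch only shows the general idea that typed edges can be checked before any step runs.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Node:
    """One step of a skill (hypothetical structure, not the AIP schema)."""
    name: str
    kind: str                                     # "script" or "prompt" (assumed labels)
    inputs: dict = field(default_factory=dict)    # input name -> type name
    outputs: dict = field(default_factory=dict)   # output name -> type name


@dataclass
class Edge:
    """Typed connection from one node's output to another node's input."""
    src: str
    output: str
    dst: str
    input: str


def validate(nodes, edges):
    """Return (errors, deps); an empty error list means every edge type-checks."""
    by_name = {n.name: n for n in nodes}
    deps = {n.name: set() for n in nodes}
    errors = []
    for e in edges:
        src, dst = by_name.get(e.src), by_name.get(e.dst)
        if src is None or dst is None:
            errors.append(f"unknown node on edge {e.src} -> {e.dst}")
            continue
        out_type, in_type = src.outputs.get(e.output), dst.inputs.get(e.input)
        if out_type is None or in_type is None:
            errors.append(f"unknown port on {e.src}.{e.output} -> {e.dst}.{e.input}")
        elif out_type != in_type:
            errors.append(f"type mismatch on {e.src}.{e.output} -> {e.dst}.{e.input}")
        deps[e.dst].add(e.src)
    return errors, deps


def execution_order(nodes, edges):
    """Validate the graph, then return node names in dependency order."""
    errors, deps = validate(nodes, edges)
    if errors:
        raise ValueError("; ".join(errors))
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(deps).static_order())


# Hypothetical two-step skill: a deterministic script feeding a judgment step.
nodes = [
    Node("load_csv", "script", outputs={"rows": "table"}),
    Node("summarize", "prompt", inputs={"rows": "table"}, outputs={"report": "text"}),
]
edges = [Edge("load_csv", "rows", "summarize", "rows")]
order = execution_order(nodes, edges)
print(order)
```

Because each node has a name and declared ports, a failure can be attributed to a specific step or edge, which is the property the paper relies on for node-by-node diagnosis and repair.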
This contribution sits at an interesting intersection: it's neither a new model architecture nor a new prompting technique, but rather a representation design for reusable agent knowledge. The problem it addresses—that prose-based skills are unreliable, hard to debug, and resistant to systematic improvement—is genuine and increasingly relevant as agent skill ecosystems grow.
2. Methodological Rigor
The evaluation uses SkillsBench with 27 tasks across 8 domains, running 5 trials per condition with Claude Sonnet as the solver. The primary comparison (human-curated vs. AIP-compiled skills) yields a statistically significant improvement: mean reward 0.60→0.71 (Wilcoxon p=0.011), with 12 wins, 2 losses, and 13 ties.
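As a rough sanity check on the reported counts, a two-sided sign test can be computed from the win/loss tally alone. This is not the paper's analysis: the Wilcoxon signed-rank test also uses the magnitude of each per-task difference, which is not available here, so the sketch below is only a coarser cross-check that discards ties and treats each non-tied task as a fair coin flip under the null hypothesis. The function name `sign_test_p` is introduced for illustration.

```python
from math import comb


def sign_test_p(wins: int, losses: int) -> float:
    """Two-sided exact sign test p-value; ties are excluded by the caller."""
    n = wins + losses
    k = max(wins, losses)
    # Probability of a split at least this lopsided in one direction under p = 0.5.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


wins, losses, ties = 12, 2, 13       # counts reported in the paper
total_tasks = wins + losses + ties   # 27, matching the benchmark subset
p_value = sign_test_p(wins, losses)
print(total_tasks, round(p_value, 4))  # 27 0.0129
```

The sign-test value of about 0.013 is in the same range as the reported Wilcoxon p = 0.011, so the win/loss tally and the significance claim are at least mutually consistent.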
Strengths in methodology:
Significant methodological weaknesses:
3. Potential Impact
The paper addresses a practical problem in the rapidly growing agent skills ecosystem. If the representation genuinely improves reliability, the impact could be substantial.
The paper is clearly positioned as an industry contribution (both authors are from Neo4j), and the graph database angle for governance is a natural extension of their expertise, which both adds credibility and raises questions about bias.
4. Timeliness & Relevance
The timing is strong. Anthropic's Agent Skills specification was introduced in 2025, and this paper arrives while the format is still malleable and the community is actively debating how to package agent knowledge. The problems identified (prose brittleness, difficulty of agent self-improvement, lack of addressability for debugging) are real bottlenecks that practitioners encounter. The connection to the broader trend of structured agent workflows (LangGraph, ADK, DSPy) is well-drawn.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Overall Assessment
This is a well-motivated systems paper that introduces a sensible representation for agent skills and provides preliminary evidence of its benefits. The core idea—that skills should be graphs of typed, testable steps rather than prose—is sound and likely to influence how the community thinks about skill packaging. However, the empirical contribution is weakened by the format-author confound, small sample sizes, and single-model evaluation. The most impactful claims (RL over skills, corpus governance) remain undemonstrated. The paper is best understood as introducing a promising framework with suggestive but not conclusive evidence, rather than as a definitive empirical result.
Generated Jun 5, 2026
Comparison History (17)
Paper 1 provides the first comprehensive systems characterization of agent memory, a foundational infrastructure concern for scaling LLM agents. Its taxonomy, profiling methodology, and 10 actionable system recommendations have broad applicability across the entire agent ecosystem. Paper 2 presents a useful but narrower contribution—a graph-based skill representation with solid empirical gains on a specific benchmark. While Paper 2 is well-executed, Paper 1 addresses a more fundamental and widely relevant problem, offering reusable frameworks and insights that will likely influence system design across many agent architectures.
Paper 2 (AIP) has higher estimated scientific impact because it addresses a more broadly applicable problem—improving agent reliability and skill governance—with clear empirical results on real tasks (statistically significant improvements on SkillsBench). Its graph-based skill representation is intuitive, practical, and immediately applicable to the rapidly growing AI agent ecosystem. It also opens a natural path toward RL over structured skill spaces. Paper 1 (Synapse) tackles an important but narrower federated learning problem with a more complex theoretical framework and less immediately actionable results for the broader community.
Paper 2 (MAGE) has higher likely scientific impact due to broader applicability and timeliness: execution-state management for long-horizon agents is a central bottleneck across many agentic systems. The state-tree framework (grow/compress/maintain/revise) reframes “memory” from semantic retrieval to trajectory-consistent state reconstruction, with clear real-world benefits (higher success rates and large token reductions) and easy integration into diverse agent stacks. Paper 1 is novel and practical for skill authoring/governance, but is more specific (YAML/graph skill spec) and closer to systems/engineering impact than a generalizable agent cognition/memory paradigm.
Paper 2 addresses a critical, highly timely issue: system-level security and prompt injection defenses for Computer Use Agents. As frontier models increasingly interact with computer environments, security guarantees are a major blocker for real-world deployment. Paper 2's novel approach of single-shot planning to achieve architectural isolation represents a significant step forward in safe AI agent design, likely impacting both AI safety research and commercial agent architectures more broadly and urgently than the skill representation protocol proposed in Paper 1.
Paper 1 proposes a general theoretical framework for knowledge infusion in iterative generative models, identifying four structurally distinct intervention layers. This framework has broad applicability across multimodal generative AI (diffusion models, potentially others), addresses the critical problem of safety and reliability in generative AI, and provides both theoretical grounding and empirical validation with a 70.97% reduction in knowledge-violating outputs. Paper 2, while practically useful, presents a more narrow engineering contribution (a graph-based skill representation for agents) with evaluation limited to one model on one benchmark. Paper 1's conceptual framework is more likely to influence future research directions across multiple subfields.
Paper 2 (AIP) introduces a novel representational framework—modeling agent skills as directed execution graphs—that addresses fundamental limitations in how LLM agents execute procedural tasks. Its contribution is more foundational: it proposes a new abstraction (graph-based skill representation) with broad applicability across agent architectures, enables reinforcement learning over skills, and supports governance/introspection. While Paper 1 (CHARM) addresses an important practical problem (cascading hallucinations in RAG), it is more narrowly scoped to a specific failure mode in a specific pipeline type. AIP's potential to reshape how agent skills are created, tested, and improved gives it broader cross-field impact and greater long-term significance.
Paper 2 proposes a foundational evaluation framework for AI agent reliability, addressing a critical bottleneck in the field. By defining comprehensive metrics for consistency, robustness, predictability, and safety, it has the potential to become a standard evaluation methodology that shapes future research directions. While Paper 1 offers an innovative methodology with strong empirical gains, its focus on specific skill representation is narrower, making Paper 2's broad, field-level evaluative contribution likely to yield higher overall scientific impact.
Paper 1 addresses a fundamental structural flaw in knowledge editing for LLMs—Epistemic Dissonance—with a novel causal editing paradigm (CODE) that demonstrates dramatic improvements (95.6% self-refutation reduced to 1.8%). It introduces new conceptual frameworks (causal editing vs. static fact overwriting), rigorous causal analysis, and strong empirical results across multiple models. Paper 2 proposes a practical engineering contribution (graph-based skill representation) with moderate improvements on a specific benchmark. While useful, it is more incremental and narrower in theoretical contribution compared to Paper 1's paradigm-shifting insight into how LLMs internalize knowledge updates.
Achieving gold-medal-level performance on IMO and IPhO represents a major milestone in AI. Paper 1's unified scaling recipe addresses critical frontiers in long-horizon reasoning and test-time compute, which are highly influential topics for advanced AI development. While Paper 2 offers a practical engineering framework for agent execution, Paper 1 demonstrates a fundamental leap in core model capabilities with profound implications for complex scientific problem-solving and reasoning.
Paper 2 demonstrates broader scientific impact through several key advantages: (1) It addresses a more fundamental and generalizable problem—runtime harness adaptation for frozen LLMs—applicable across 18 model backbones and 7 environments, showing remarkable transferability. (2) The 88.5% average relative improvement across 126 settings is substantially more impressive than Paper 1's gains. (3) The insight that harness adaptations transfer across models reveals reusable environment-side structure, a novel conceptual contribution. (4) It offers a complementary paradigm to model-centric training, potentially influencing how the field approaches agent improvement. Paper 1, while practical, is more narrowly focused on skill representation format.
Paper 2 addresses a critical bottleneck in AI agent deployment: runtime safety and security. By providing an open-source tool and a new benchmark for multi-step attack chains, its real-world applicability and timeliness are exceptionally high. While Paper 1 offers a novel architectural improvement for agent skills, Paper 2's focus on preventing irreversible harm tackles a broader, more urgent challenge across the entire field of autonomous systems.
Paper 1 (AIP) introduces a more broadly applicable framework, representing agent skills as directed execution graphs, with clear, statistically significant improvements on a substantial benchmark (27 real tasks). It addresses a fundamental bottleneck in agentic AI (skill reliability, creation, and governance) with a structured, schema-validated approach that naturally supports reinforcement learning and corpus-level management. Paper 2 addresses a narrower problem (conversation grounding verification) with modest accuracy gains (+1.3 to +6.7 pp) on small benchmarks. AIP's combination of practical impact, breadth of applicability to the rapidly growing agent ecosystem, and its enabling of systematic skill improvement gives it higher potential scientific impact.
Paper 2 is likely to have higher scientific impact because it proposes a novel, reusable representation (typed execution-graph + schema-validated spec) that directly changes how agent skills are authored, tested, governed, and optimized, with clear empirical gains and statistical testing on real tasks. Its method enables broader downstream work (tooling, RL over skills, safety/governance, debugging) across agent systems and domains. Paper 1 offers valuable infrastructure (trace dataset + simulator) for reproducibility and analysis, but its impact is narrower and more contingent on benchmark/agent specifics and potential constraints around proprietary traces.
Paper 1 proposes a fundamental shift in how agent skills are represented (from fragile prose to robust execution graphs), which has broad implications across all LLM agent applications. While Paper 2 presents a strong, rigorous pipeline for embodied AI, its focus is largely restricted to the smart home domain. Paper 1's generalizable methodology for skill compilation, verification, and governance addresses a critical bottleneck in agent reliability and offers higher potential for wide-ranging impact across the field of artificial intelligence.
Paper 2 offers a deeper theoretical contribution by decomposing conflicting reasoning modes to address fundamental structural failures in LLM planning. While Paper 1 provides a highly practical engineering protocol for skill representation, Paper 2's R-APS framework tackles core cognitive limitations like robustness and memory invalidation without fine-tuning. Furthermore, its demonstration that structured reasoning protocols allow small 4B models to compete with 70B models has profound, widespread implications for AI efficiency, scalability, and complex constrained design.
Paper 1 presents a more novel, broadly applicable contribution: a structured, schema-validated graph representation for agent skills with demonstrated, statistically tested performance gains on a real benchmark and a clear path to governance, debugging, and RL over skills. Its methodology includes concrete ablations (before/after compilation) and significance testing, and the approach generalizes across domains where agents execute procedures. Paper 2 targets an important application, but appears closer to established agent-based epidemic simulation + RL control, with limited evidence of methodological novelty, validation against real data, or generality beyond the pandemic policy setting.
Paper 2 proposes a foundational shift in how agent skills are represented by replacing fragile prose with a directed execution graph (AIP). This not only significantly improves task performance and reliability but also enables precise debugging, governance, and reinforcement learning. Paper 1 offers a valuable but more narrow optimization for token efficiency in tool use, whereas Paper 2's methodological innovation has broader implications for the design and scalability of agentic systems across multiple domains.