
Collaborative Human-Agent Protocol (CHAP)

Arsalan Shahid, Gordon Suttie, Philip Black

cs.AI · cs.CL · cs.HC
#1774 of 3489 · Artificial Intelligence
Tournament Score: 1397 ± 43
Win Rate: 65% (11 wins, 6 losses, 17 matches)
Rating: 4.5/10
Significance 6.5 · Rigor 3.5 · Novelty 5 · Clarity 7

Abstract

Foundation models are moving from response generation into operational roles. They plan across steps, call tools, request human input, coordinate with other agents, and increasingly carry responsibility for work that affects customers, claims, code, contracts, and clinical decisions. Production deployments are no longer one human supervising one model. They are multi-human, multi-agent collaborations that cross teams, time zones, and trust boundaries. The technical surface for this collaboration remains weakly specified. When an agent drafts a response and a human edits it before it ships, the moment of human judgement is the most valuable signal in the system. In current practice it is recorded, if at all, in application code, chat threads, ticket comments, and tribal memory. Two protocol standards address adjacent concerns: MCP standardises agent access to tools and data, and A2A standardises agent-to-agent interoperability. Neither defines the shared workspace in which humans and agents perform accountable work together. This paper presents CHAP, the Collaborative Human-Agent Protocol. Under CHAP, the override that used to vanish into a chat thread becomes a structured event carrying a diff, a rationale, and a content hash. The handoff between shifts becomes a portable envelope rather than a pinned message. The human approval of an agent's draft becomes a non-repudiable signed decision that can be replayed years later. The protocol achieves this through a small Core (workspaces, participants, tasks, artefacts, and an append-only evidence log) together with composable profiles that add review, modes, routing, deliberation, handoff, identity, signatures, and transparency-backed audit as deployments require them. Specification, reference implementation, conformance suite, and worked examples are available at: https://github.com/BrightbeamAI/chap

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Collaborative Human-Agent Protocol (CHAP)

1. Core Contribution

CHAP identifies and addresses a genuine gap in the emerging agentic AI infrastructure stack: the absence of a standardized protocol for the shared workspace where humans and agents perform accountable work together. The paper correctly observes that MCP handles agent-to-tool interactions, A2A handles agent-to-agent interoperability, but neither defines the collaboration semantics for human-agent joint work—particularly the structured recording of human review, override, approval, escalation, abstention, and handoff events.

The core technical contribution is a protocol specification organized as a minimal Core (workspaces, participants, tasks, artefacts, append-only evidence log) plus composable profiles (review, modes, routing, deliberation, handoff, identity, signing, SCITT-backed audit). The most distinctive design element is the structured override: when a human edits an agent's output, the change is recorded as a typed diff with rationale, tags, content hash, and an `intent_preserved` flag distinguishing refinement from substitution.
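
To make the structured override concrete, the following is a minimal sketch of what such an event might look like. It is not taken from the CHAP specification: the `OverrideEvent` class, its field names, the `record_override` helper, and the `sha256:` prefix are all assumptions made for illustration. The canonicalisation step only approximates JCS (RFC 8785) by sorting keys and removing whitespace.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Any


def canonical_json(value: Any) -> bytes:
    # Approximation of canonical JSON: sorted keys, no insignificant whitespace.
    # RFC 8785 (JCS) also fixes number and string serialisation, not done here.
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def content_hash(value: Any) -> str:
    # Hash of the canonical form, so key order does not change the result.
    return "sha256:" + hashlib.sha256(canonical_json(value)).hexdigest()


@dataclass
class OverrideEvent:
    # Hypothetical field names; the real CHAP schema may differ.
    task_id: str
    actor: str
    diff: list          # JSON Patch-style operations (RFC 6902 shape)
    rationale: str
    content_hash: str   # hash of the artefact after the human edit
    tags: list = field(default_factory=list)
    intent_preserved: bool = True  # True = refinement, False = substitution


def record_override(task_id, actor, after, diff, rationale, tags, intent_preserved):
    return OverrideEvent(
        task_id=task_id,
        actor=actor,
        diff=diff,
        rationale=rationale,
        content_hash=content_hash(after),
        tags=list(tags),
        intent_preserved=intent_preserved,
    )


# Illustrative data only: an agent draft and the human's replacement.
draft = {"reply": "Your claim is denied."}
edited = {"reply": "We need one more document before we can decide."}

event = record_override(
    task_id="task-001",
    actor="human:reviewer-1",
    after=edited,
    diff=[{"op": "replace", "path": "/reply", "value": edited["reply"]}],
    rationale="Draft stated a decision that had not been made.",
    tags=["policy", "tone"],
    intent_preserved=False,
)
print(json.dumps(asdict(event), indent=2))
```

The point of the sketch is that the override carries enough structure (what changed, why, and whether the agent's intent survived) to be queried later, instead of living in a chat thread.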

2. Methodological Rigor

This is a protocol design paper, not an empirical research paper. As such, it should be evaluated differently than experimental work. The paper:

  • Clearly identifies requirements (R1–R9) derived from operational needs in regulated industries
  • Explicitly scopes non-goals (N1–N5), which demonstrates disciplined protocol design
  • Provides a threat model covering replay, downgrade, capability confusion, key rotation, chain forking, and compromised coordinator scenarios
  • Defines conformance levels (Minimal, Recommended, Full) with honest acknowledgment that Full cannot yet be claimed due to lack of a second interoperable implementation
  • Provides a reference implementation and conformance scaffolding

However, several rigor concerns stand out:

  • No empirical evaluation whatsoever. There are no user studies, no deployment measurements, no performance benchmarks, no comparison with existing approaches to recording human-agent collaboration. The twelve practice scenarios (Appendix D) are illustrative but hypothetical.
  • Single implementation. The authors acknowledge this; standards typically require two independent interoperable implementations.
  • No formal verification of protocol properties (e.g., safety, liveness, evidence chain integrity under concurrent operations).
  • Conformance claims are aspirational. The test suite is described as "draft test vectors" that are "not exhaustive."
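
For readers unfamiliar with what "evidence chain integrity" refers to, here is a minimal single-writer sketch of a hash-chained append-only log with a verification pass. It is an illustration under stated assumptions, not CHAP's actual log format: `EvidenceLog`, `verify_chain`, the `GENESIS` value, and the entry layout are hypothetical. It deliberately ignores concurrent writers, which is exactly the case the review notes has not been formally analysed.

```python
import copy
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for "no previous entry"


def _digest(prev_hash: str, payload: dict) -> str:
    # Each entry's hash covers its payload and the previous entry's hash.
    body = json.dumps({"prev": prev_hash, "payload": payload},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class EvidenceLog:
    """Append-only log: entries can be added but not edited or removed."""

    def __init__(self):
        self._entries = []

    def append(self, payload: dict) -> str:
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        entry = {"prev": prev,
                 "payload": copy.deepcopy(payload),
                 "hash": _digest(prev, payload)}
        self._entries.append(entry)
        return entry["hash"]

    def entries(self) -> list:
        # Return copies so callers cannot mutate the stored log.
        return copy.deepcopy(self._entries)


def verify_chain(entries: list) -> bool:
    # Recompute every hash and check each entry links to its predecessor.
    prev = GENESIS
    for entry in entries:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != _digest(entry["prev"], entry["payload"]):
            return False
        prev = entry["hash"]
    return True


log = EvidenceLog()
log.append({"type": "draft", "by": "agent:assistant"})
log.append({"type": "override", "by": "human:reviewer-1"})
log.append({"type": "approval", "by": "human:reviewer-2"})

tampered = log.entries()
tampered[1]["payload"]["by"] = "human:someone-else"

print(verify_chain(log.entries()))
print(verify_chain(tampered))
```

A sequential check like this detects edits to past entries, but it says nothing about forks or races between writers; those properties would need the formal treatment the paper does not provide.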

3. Potential Impact

The practical relevance is high. Organizations deploying AI agents in regulated environments (healthcare, finance, insurance, pharma, legal) genuinely need structured audit trails of human-agent collaboration. The paper's worked examples, ranging from a solo developer tracking Cursor overrides to a QP signing batch release in GMP manufacturing, are compelling because they map to real operational pain points.

If adopted, CHAP could:

  • Create a portable audit format for human-agent collaboration that survives across tools, vendors, and time
  • Enable systematic analysis of human override patterns to improve agent behavior
  • Provide regulatory-ready evidence chains for AI governance frameworks (EU AI Act, NIST AI RMF, FCA Consumer Duty)
  • Establish interoperability between different human-in-the-loop implementations
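
As a rough illustration of the second point, analysing override patterns, the sketch below aggregates override events by tag and reports how often the human replaced the agent's intent instead of refining it. The event shape (`tags`, `intent_preserved`), the function name `summarize_overrides`, and the sample data are assumptions for illustration and are not drawn from the CHAP specification or from any real deployment.

```python
from collections import defaultdict


def summarize_overrides(events):
    # Count overrides per tag, and how many of them were substitutions
    # (intent_preserved is False) as opposed to refinements.
    counts = defaultdict(lambda: {"total": 0, "substitutions": 0})
    for event in events:
        for tag in event.get("tags", []):
            counts[tag]["total"] += 1
            if not event.get("intent_preserved", True):
                counts[tag]["substitutions"] += 1
    return {
        tag: {**c, "substitution_rate": c["substitutions"] / c["total"]}
        for tag, c in counts.items()
    }


# Hypothetical events, made up for this example.
sample_events = [
    {"tags": ["tone"], "intent_preserved": True},
    {"tags": ["tone", "policy"], "intent_preserved": False},
    {"tags": [], "intent_preserved": True},
]

summary = summarize_overrides(sample_events)
for tag, stats in sorted(summary.items()):
    print(tag, stats)
```

A high substitution rate for a given tag would suggest the agent is systematically wrong in that area, which is the kind of signal that is hard to recover when overrides are left in chat threads.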

Barriers to impact are significant:

  • Protocol adoption requires ecosystem buy-in from multiple vendors; a single company publishing a spec rarely achieves this without consortium backing or significant market power
  • The paper comes from Brightbeam AI, a company without established market presence in this space
  • Competing approaches may emerge from larger players (Google's A2A team, Anthropic, Microsoft)
  • The overhead of structured override recording may face resistance from practitioners who find it burdensome

4. Timeliness & Relevance

The timing is excellent. The paper addresses a current bottleneck: as agentic AI moves into production, the lack of structured collaboration semantics between humans and agents is a real and growing problem. The EU AI Act's requirements for human oversight documentation, the proliferation of multi-agent systems, and enterprise adoption of AI in regulated domains all create demand for exactly this kind of protocol layer. The paper correctly positions itself relative to MCP and A2A, both of which emerged in 2024-2025.

5. Strengths & Limitations

Key Strengths:

  • Well-identified gap. The observation that human judgment events (overrides, approvals, escalations) are the most valuable signals in human-agent systems, yet are systematically lost, is sharp and correct.
  • Composable design. The Core + Profiles architecture enables incremental adoption, which is critical for real-world protocol uptake.
  • Reuse-first philosophy. Building on JSON-RPC, JCS, JSON Patch, OIDC, SCITT, and W3C VCs rather than inventing new primitives demonstrates mature protocol design thinking.
  • The `intent_preserved` flag is a genuinely useful innovation for distinguishing refinement from substitution in overrides—a distinction with real operational and analytical value.
  • Practical scenarios are unusually concrete and grounded in real operational contexts.

Notable Limitations:

  • No empirical validation of any kind—no deployments, no user studies, no performance data
  • This is primarily a specification document, not a research contribution in the traditional sense; it presents no novel algorithms, no theoretical results, no experimental findings
  • The "Wave III" framing is self-serving and historically imprecise
  • Privacy and GDPR tensions with append-only logs are acknowledged but unresolved
  • Scalability is entirely unaddressed—no discussion of evidence chain size, query performance, or storage costs at scale
  • The paper is extremely long (51 pages) for a v0.2 draft specification, yet contains significant repetition and marketing-adjacent prose

Overall Assessment

CHAP addresses a real and timely problem with a thoughtfully designed protocol. However, as a scientific contribution, it is primarily a specification proposal rather than a research paper. It lacks empirical evaluation, formal analysis, and comparison with alternative approaches. Its impact will be determined entirely by adoption, which depends on ecosystem dynamics beyond the paper's technical merits. The contribution is more akin to an RFC or industry specification than a research advance.

Rating: 4.5/10
Significance 6.5 · Rigor 3.5 · Novelty 5 · Clarity 7

    Generated Jun 9, 2026

    Comparison History (17)

    Won vs. Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation

    Paper 1 introduces a foundational protocol for human-agent collaboration, addressing a critical and universal challenge in AI deployment: accountability, auditability, and structured interaction. Its application spans multiple high-stakes domains (clinical, legal, coding), offering broader and more transformative real-world impact than Paper 2, which provides a valuable but narrower domain-specific benchmark for evaluating VLMs in engineering.

    gemini-3.1-pro-preview · Jun 10, 2026

    Won vs. HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning

    Paper 1 (CHAP) likely has higher impact due to strong real-world applicability and timeliness: it proposes a concrete protocol standard for accountable multi-human/multi-agent work, addressing governance, auditability, provenance, and interoperability gaps not covered by MCP/A2A. If adopted, it could influence tooling and practices across many industries (software, customer support, legal, clinical), giving broad cross-field impact. Paper 2 (HIPIF) is a solid methodological contribution to long-horizon agent learning, but it is more incremental within a crowded area and its impact depends on benchmark/generalization and later adoption.

    gpt-5.2 · Jun 10, 2026

    Won vs. Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning

    Paper 2 proposes a foundational protocol for human-agent collaboration, addressing a critical and widespread bottleneck in agentic AI deployments. Its breadth of application across numerous domains and its potential to standardize future HCI, AI safety, and systems research give it massive interdisciplinary impact, whereas Paper 1 is highly specialized within control engineering.

    gemini-3.1-pro-preview · Jun 10, 2026

    Won vs. Architect-Ant: Editable Automatic Furnishing of Architectural Floor Plans

    CHAP addresses a fundamental infrastructure gap in the rapidly expanding field of human-AI collaboration, proposing a protocol standard for accountable multi-human, multi-agent work. Its breadth of impact is significant: it spans AI safety, governance, enterprise AI deployment, and interoperability—areas of enormous current urgency. While Paper 2 makes a solid contribution to automatic floor plan furnishing with a niche dataset and pipeline, its scope is narrow (interior design/architecture). CHAP's potential to become foundational infrastructure for AI accountability gives it substantially higher cross-domain impact potential.

    claude-opus-4-6 · Jun 10, 2026

    Lost vs. Bellman-Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets

    Paper 2 likely has higher scientific impact: it introduces a broadly applicable RL framework for MDPs with state-dependent feasible action sets—a common, hard setting in operations research—backed by a formal performance guarantee and demonstrated on queueing control. This combines novelty with methodological rigor and strong cross-domain applicability (OR, RL, control, logistics). Paper 1 is timely and practically valuable as a protocol/specification for human-agent collaboration and auditability, but its primary contribution is standardization/engineering rather than a generalizable scientific method with theory, and its impact may be more dependent on adoption.

    gpt-5.2 · Jun 10, 2026

    Won vs. Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

    CHAP addresses a critical and timely infrastructure gap in the rapidly expanding field of AI agent deployments. As foundation models increasingly take on operational roles, the need for structured human-agent collaboration protocols with accountability, auditability, and trust is immense. This paper has broader cross-domain impact (healthcare, legal, software engineering, customer service) and complements existing standards (MCP, A2A). Paper 1, while valuable as a benchmark for chronological reasoning in VLMs, addresses a narrower evaluation problem with more incremental contributions. CHAP's potential to become foundational infrastructure gives it higher long-term scientific and practical impact.

    claude-opus-4-6 · Jun 9, 2026

    Won vs. Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

    Paper 2 (CHAP) has higher estimated scientific impact due to broader cross-domain applicability and timeliness: a standardized protocol for accountable human–agent collaboration could influence many high-stakes deployments (enterprise ops, healthcare, legal, software) and shape interoperability/auditability norms. Its artifact-centric, signed, replayable evidence log targets a major emerging gap not addressed by MCP/A2A, potentially becoming infrastructure-level. Paper 1 is a solid, novel methodological advance for web-agent skill reuse with demonstrated gains, but its impact is narrower (web automation benchmarks) and more incremental within an active research line.

    gpt-5.2 · Jun 9, 2026

    Won vs. An interpretable and trustworthy AI framework for large-scale longitudinal structure-pain association studies using data from the Osteoarthritis Initiative (OAI)

    Paper 1 addresses a critical, timely bottleneck in the era of autonomous AI: human-agent accountability and collaboration. While Paper 2 presents a rigorous and useful framework for osteoarthritis research, its impact is confined to a specific medical domain. In contrast, Paper 1 proposes a foundational protocol (CHAP) with massive breadth, applicable across software, healthcare, law, and business operations. By providing a standardized, auditable framework for human-in-the-loop agentic systems, Paper 1 has the potential to shape the infrastructure of future AI deployments universally, giving it significantly higher overall scientific and practical impact.

    gemini-3.1-pro-preview · Jun 9, 2026

    Won vs. Semantic Partial Grounding via LLMs

    CHAP addresses a fundamental infrastructure gap in the rapidly expanding field of human-AI collaboration, proposing a protocol standard for accountable multi-human, multi-agent work. Its breadth of impact spans AI governance, enterprise AI deployment, compliance, and software engineering. While Paper 2 offers a solid incremental contribution to AI planning efficiency using LLMs for partial grounding, CHAP tackles a more timely and broadly impactful problem—establishing trust, accountability, and interoperability standards as foundation models enter operational roles across industries. Its potential to become foundational infrastructure gives it higher impact potential.

    claude-opus-4-6 · Jun 9, 2026

    Lost vs. Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

    Paper 2 (AARR) introduces a novel benchmark suite that addresses a timely and important gap: evaluating whether AI agents can replicate the nuanced judgment of human researchers. Benchmarks historically drive significant research progress and community adoption. Paper 1 (CHAP) proposes a protocol specification for human-agent collaboration, which is practically useful but more incremental—building on existing protocol standards (MCP, A2A). While CHAP addresses real engineering needs, AARR's empirical findings (e.g., best agents achieving only 68.3%) provide actionable scientific insights that will likely influence agent development research more broadly across the AI community.

    claude-opus-4-6 · Jun 9, 2026