Arsalan Shahid, Gordon Suttie, Philip Black
Foundation models are moving from response generation into operational roles. They plan across steps, call tools, request human input, coordinate with other agents, and increasingly carry responsibility for work that affects customers, claims, code, contracts, and clinical decisions. Production deployments are no longer one human supervising one model. They are multi-human, multi-agent collaborations that cross teams, time zones, and trust boundaries. The technical surface for this collaboration remains weakly specified. When an agent drafts a response and a human edits it before it ships, the moment of human judgement is the most valuable signal in the system. In current practice it is recorded, if at all, in application code, chat threads, ticket comments, and tribal memory. Two protocol standards address adjacent concerns: MCP standardises agent access to tools and data, and A2A standardises agent-to-agent interoperability. Neither defines the shared workspace in which humans and agents perform accountable work together. This paper presents CHAP, the Collaborative Human-Agent Protocol. Under CHAP, the override that used to vanish into a chat thread becomes a structured event carrying a diff, a rationale, and a content hash. The handoff between shifts becomes a portable envelope rather than a pinned message. The human approval of an agent's draft becomes a non-repudiable signed decision that can be replayed years later. The protocol achieves this through a small Core (workspaces, participants, tasks, artefacts, and an append-only evidence log) together with composable profiles that add review, modes, routing, deliberation, handoff, identity, signatures, and transparency-backed audit as deployments require them. Specification, reference implementation, conformance suite, and worked examples are available at: https://github.com/BrightbeamAI/chap
CHAP identifies and addresses a genuine gap in the emerging agentic AI infrastructure stack: the absence of a standardized protocol for the shared workspace where humans and agents perform accountable work together. The paper correctly observes that MCP handles agent-to-tool interactions and A2A handles agent-to-agent interoperability, but that neither defines the collaboration semantics for joint human-agent work, particularly the structured recording of human review, override, approval, escalation, abstention, and handoff events.
The core technical contribution is a protocol specification organized as a minimal Core (workspaces, participants, tasks, artefacts, append-only evidence log) plus composable profiles (review, modes, routing, deliberation, handoff, identity, signing, SCITT-backed audit). The most distinctive design element is the structured override: when a human edits an agent's output, the change is recorded as a typed diff with rationale, tags, content hash, and an `intent_preserved` flag distinguishing refinement from substitution.
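To make the structured override concrete, here is a minimal Python sketch of how such an event might be built and appended to an evidence log. It is an illustration under stated assumptions, not CHAP's actual schema or reference implementation: apart from `intent_preserved`, every name (`OverrideEvent`, `record_override`, `EvidenceLog`, `task_id`, `actor`) is hypothetical, the unified text diff stands in for the protocol's typed diff, and the hash chaining used for tamper evidence is one possible way to realise "append-only", not something the summary above attributes to the specification.

```python
from __future__ import annotations

import difflib
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hex digest of UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class OverrideEvent:
    # Field names are hypothetical, except intent_preserved (named in the review).
    task_id: str
    actor: str
    diff: str                 # plain unified diff, standing in for a typed diff
    rationale: str
    tags: tuple[str, ...]
    content_hash: str         # hash of the text the human actually shipped
    intent_preserved: bool    # True = refinement, False = substitution


def record_override(task_id: str, actor: str, agent_draft: str, human_final: str,
                    rationale: str, tags: tuple[str, ...],
                    intent_preserved: bool) -> OverrideEvent:
    """Build an override event from the agent's draft and the human's edit."""
    diff = "\n".join(difflib.unified_diff(
        agent_draft.splitlines(), human_final.splitlines(),
        fromfile="agent_draft", tofile="human_final", lineterm=""))
    return OverrideEvent(task_id, actor, diff, rationale, tags,
                         sha256_hex(human_final), intent_preserved)


class EvidenceLog:
    """Append-only log; each entry commits to the previous one (an assumption)."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self._entries: list[dict[str, str]] = []

    def append(self, event: OverrideEvent) -> str:
        prev = self._entries[-1]["entry_hash"] if self._entries else self.GENESIS
        body = json.dumps(asdict(event), sort_keys=True)
        entry_hash = sha256_hex(prev + body)
        self._entries.append(
            {"prev_hash": prev, "body": body, "entry_hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = self.GENESIS
        for entry in self._entries:
            if entry["prev_hash"] != prev:
                return False
            if sha256_hex(prev + entry["body"]) != entry["entry_hash"]:
                return False
            prev = entry["entry_hash"]
        return True


if __name__ == "__main__":
    log = EvidenceLog()
    event = record_override(
        task_id="claim-42", actor="reviewer@example.org",
        agent_draft="Refund denied.", human_final="Refund approved.",
        rationale="Policy exception applies", tags=("policy",),
        intent_preserved=False)
    log.append(event)
    print(event.diff)
    print(log.verify())
```

The point of the sketch is the shape of the record: the diff and rationale capture what the human changed and why, the content hash pins the record to the exact shipped text, and `intent_preserved` separates a refinement of the agent's draft from a substitution of it. A signature over each entry, which the signing profile would add, is deliberately left out here.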
This is a protocol design paper, not an empirical research paper. As such, it should be evaluated differently than experimental work. The paper:
However, several rigor concerns stand out:
The practical relevance is high. Organizations deploying AI agents in regulated environments (healthcare, finance, insurance, pharma, legal) genuinely need structured audit trails of human-agent collaboration. The paper's worked examples, which range from a solo developer tracking Cursor overrides to a Qualified Person (QP) signing a batch release in GMP manufacturing, are compelling because they map to real operational pain points.
If adopted, CHAP could:
Barriers to impact are significant:
The timing is excellent. The paper addresses a current bottleneck: as agentic AI moves into production, the lack of structured collaboration semantics between humans and agents is a real and growing problem. The EU AI Act's requirements for human oversight documentation, the proliferation of multi-agent systems, and enterprise adoption of AI in regulated domains all create demand for exactly this kind of protocol layer. The paper correctly positions itself relative to MCP (introduced in late 2024) and A2A (introduced in 2025).
CHAP addresses a real and timely problem with a thoughtfully designed protocol. However, as a scientific contribution, it is primarily a specification proposal rather than a research paper. It lacks empirical evaluation, formal analysis, and comparison with alternative approaches. Its impact will be determined entirely by adoption, which depends on ecosystem dynamics beyond the paper's technical merits. The contribution is more akin to an RFC or industry specification than a research advance.
Generated Jun 9, 2026
Paper 1 introduces a foundational protocol for human-agent collaboration, addressing a critical and universal challenge in AI deployment: accountability, auditability, and structured interaction. Its application spans multiple high-stakes domains (clinical, legal, coding), offering broader and more transformative real-world impact than Paper 2, which provides a valuable but narrower domain-specific benchmark for evaluating VLMs in engineering.
Paper 1 (CHAP) likely has higher impact due to strong real-world applicability and timeliness: it proposes a concrete protocol standard for accountable multi-human/multi-agent work, addressing governance, auditability, provenance, and interoperability gaps not covered by MCP/A2A. If adopted, it could influence tooling and practices across many industries (software, customer support, legal, clinical), giving broad cross-field impact. Paper 2 (HIPIF) is a solid methodological contribution to long-horizon agent learning, but it is more incremental within a crowded area and its impact depends on benchmark/generalization and later adoption.
Paper 2 proposes a foundational protocol for human-agent collaboration, addressing a critical and widespread bottleneck in agentic AI deployments. Its breadth of application across numerous domains and its potential to standardize future HCI, AI safety, and systems research give it massive interdisciplinary impact, whereas Paper 1 is highly specialized within control engineering.
CHAP addresses a fundamental infrastructure gap in the rapidly expanding field of human-AI collaboration, proposing a protocol standard for accountable multi-human, multi-agent work. Its breadth of impact is significant: it spans AI safety, governance, enterprise AI deployment, and interoperability—areas of enormous current urgency. While Paper 2 makes a solid contribution to automatic floor plan furnishing with a niche dataset and pipeline, its scope is narrow (interior design/architecture). CHAP's potential to become foundational infrastructure for AI accountability gives it substantially higher cross-domain impact potential.
Paper 2 likely has higher scientific impact: it introduces a broadly applicable RL framework for MDPs with state-dependent feasible action sets—a common, hard setting in operations research—backed by a formal performance guarantee and demonstrated on queueing control. This combines novelty with methodological rigor and strong cross-domain applicability (OR, RL, control, logistics). Paper 1 is timely and practically valuable as a protocol/specification for human-agent collaboration and auditability, but its primary contribution is standardization/engineering rather than a generalizable scientific method with theory, and its impact may be more dependent on adoption.
CHAP addresses a critical and timely infrastructure gap in the rapidly expanding field of AI agent deployments. As foundation models increasingly take on operational roles, the need for structured human-agent collaboration protocols with accountability, auditability, and trust is immense. This paper has broader cross-domain impact (healthcare, legal, software engineering, customer service) and complements existing standards (MCP, A2A). Paper 1, while valuable as a benchmark for chronological reasoning in VLMs, addresses a narrower evaluation problem with more incremental contributions. CHAP's potential to become foundational infrastructure gives it higher long-term scientific and practical impact.
Paper 2 (CHAP) has higher estimated scientific impact due to broader cross-domain applicability and timeliness: a standardized protocol for accountable human–agent collaboration could influence many high-stakes deployments (enterprise ops, healthcare, legal, software) and shape interoperability/auditability norms. Its artifact-centric, signed, replayable evidence log targets a major emerging gap not addressed by MCP/A2A, potentially becoming infrastructure-level. Paper 1 is a solid, novel methodological advance for web-agent skill reuse with demonstrated gains, but its impact is narrower (web automation benchmarks) and more incremental within an active research line.
Paper 1 addresses a critical, timely bottleneck in the era of autonomous AI: human-agent accountability and collaboration. While Paper 2 presents a rigorous and useful framework for osteoarthritis research, its impact is confined to a specific medical domain. In contrast, Paper 1 proposes a foundational protocol (CHAP) with massive breadth, applicable across software, healthcare, law, and business operations. By providing a standardized, auditable framework for human-in-the-loop agentic systems, Paper 1 has the potential to shape the infrastructure of future AI deployments universally, giving it significantly higher overall scientific and practical impact.
CHAP addresses a fundamental infrastructure gap in the rapidly expanding field of human-AI collaboration, proposing a protocol standard for accountable multi-human, multi-agent work. Its breadth of impact spans AI governance, enterprise AI deployment, compliance, and software engineering. While Paper 2 offers a solid incremental contribution to AI planning efficiency using LLMs for partial grounding, CHAP tackles a more timely and broadly impactful problem—establishing trust, accountability, and interoperability standards as foundation models enter operational roles across industries. Its potential to become foundational infrastructure gives it higher impact potential.
Paper 2 (AARR) introduces a novel benchmark suite that addresses a timely and important gap: evaluating whether AI agents can replicate the nuanced judgment of human researchers. Benchmarks historically drive significant research progress and community adoption. Paper 1 (CHAP) proposes a protocol specification for human-agent collaboration, which is practically useful but more incremental—building on existing protocol standards (MCP, A2A). While CHAP addresses real engineering needs, AARR's empirical findings (e.g., best agents achieving only 68.3%) provide actionable scientific insights that will likely influence agent development research more broadly across the AI community.