AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

Shuaike Shen, Wenduo Cheng, Shike Wang, Mingqian Ma, Jian Ma

May 19, 2026

arXiv:2605.20425v1

cs.AI (primary)

#359 of 2292 · Artificial Intelligence

Tournament Score: 1492 ± 43 (axis range 1050 to 1800)

Win Rate: 70%

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.5 · Novelty 6.5 · Clarity 7

Abstract

Designing multi-agent workflows is especially difficult in open-ended scientific settings where tasks lack curated training sets, reliable scalar evaluation metrics, and standardized interfaces between existing tools and agents. We propose AgentCo-op, a retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable workflows through typed artifact handoffs, then applies bounded self-guided local repair to implicated components when execution evidence indicates failure. In two open-world genomics case studies, AgentCo-op composes independently developed scientific agents and external tool repositories into auditable workflows without redesigning them or running global topology search. It coordinates specialized agents for spatial transcriptomics and gene-set interpretation to enable collaborative discovery from spatial transcriptomics data, and builds a parallel workflow for cross-modality marker analysis on single-cell multiome data. AgentCo-op can also import a searched workflow as a structural prior and improve it by grounding nodes with retrieved components and applying local repair, showing that synthesis and search are complementary. On six coding, math, and question-answering benchmarks, AgentCo-op achieves the best result on four benchmarks and the best average score under a unified backbone setting, while consistently reducing per-task cost relative to multi-agent baselines. Together, these results suggest that retrieval-based synthesis can extend automated agentic workflow design beyond benchmark-optimized agent graphs to open-world workflows built from existing agents, tools, and typed artifacts.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: AgentCo-op

1. Core Contribution

AgentCo-op reframes automated multi-agent workflow design from a search-based optimization problem to a retrieval-based synthesis problem. The key insight is that many real-world tasks—especially in scientific domains—lack curated training sets, scalar evaluation metrics, and standardized interfaces that search-based methods (ADAS, AFlow) require. Instead of iteratively searching over workflow topologies against a benchmark, AgentCo-op retrieves reusable skills, tools, and external agents from libraries, composes them into directed-graph workflows with typed artifact handoffs, and applies bounded local repair when execution fails.
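The retrieve, compose, and repair loop described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Component` type, the word-overlap `retrieve` scorer, the linear chain standing in for a directed graph, and the single global repair budget are hypothetical and are not taken from the AgentCo-op code.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical names throughout; not the AgentCo-op API.
@dataclass
class Component:
    name: str
    description: str
    run: Callable[[dict], dict]  # reads upstream artifacts, returns new ones

def retrieve(role, library):
    """Pick the component whose description shares the most words with the role."""
    role_words = set(role.lower().split())
    return max(library, key=lambda c: len(role_words & set(c.description.lower().split())))

def execute_with_repair(roles, library, repair, max_repairs=2):
    """Run one retrieved component per role; on failure, repair only that node."""
    artifacts = {}
    repairs_used = 0
    for role in roles:
        component = retrieve(role, library)
        while True:
            try:
                artifacts.update(component.run(artifacts))
                break
            except Exception as err:
                # Bounded repair: stop once the shared budget is spent.
                if repairs_used >= max_repairs:
                    raise RuntimeError(f"repair budget exhausted at {role!r}") from err
                repairs_used += 1
                component = repair(component, err)
    return artifacts, repairs_used

# Toy components: the normalizer fails until the repair policy swaps it out.
def load_counts(artifacts):
    return {"counts": [1, 2, 3]}

def broken_normalize(artifacts):
    raise ValueError("scale factor missing")

def fixed_normalize(artifacts):
    total = sum(artifacts["counts"])
    return {"normalized": [c / total for c in artifacts["counts"]]}

library = [
    Component("loader", "load raw counts", load_counts),
    Component("normalizer", "normalize counts table", broken_normalize),
]

def swap_in_fix(component, error):
    """Stand-in repair policy: replace the failing node, leave the rest untouched."""
    return Component(component.name, component.description, fixed_normalize)

artifacts, repairs_used = execute_with_repair(
    ["load counts", "normalize table"], library, swap_in_fix
)
```

The point of the sketch is the locality of repair: only the failing node is replaced, and no search over alternative topologies takes place.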

The framework addresses a genuine gap: composing *independently developed* agents with incompatible environments and interfaces into coherent workflows. The Docker-wrapping mechanism for external repositories and the typed artifact broker for inter-agent communication are practical design choices that enable heterogeneous agent interoperability.
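To make the idea of a typed artifact broker concrete, here is a small sketch of one way such a handoff check could work. The class names, the `type_tag` vocabulary, and the choice of exceptions are assumptions for illustration only; the paper's actual schema enforcement is not described in this review.

```python
from dataclasses import dataclass, field

# Hypothetical types; the paper's actual artifact schema is not shown here.
@dataclass(frozen=True)
class Artifact:
    name: str
    type_tag: str    # e.g. "gene_list"; the tag vocabulary is invented
    payload: object

@dataclass
class ArtifactBroker:
    store: dict = field(default_factory=dict)

    def publish(self, artifact):
        """Register an artifact produced by an upstream agent."""
        self.store[artifact.name] = artifact

    def fetch(self, name, expected_type):
        """Hand an artifact to a downstream agent only if its type tag matches."""
        artifact = self.store.get(name)
        if artifact is None:
            raise KeyError(f"no artifact named {name!r}")
        if artifact.type_tag != expected_type:
            raise TypeError(
                f"{name!r} has type {artifact.type_tag!r}, expected {expected_type!r}"
            )
        return artifact

broker = ArtifactBroker()
broker.publish(Artifact("markers", "gene_list", ["GENE_A", "GENE_B"]))
markers = broker.fetch("markers", "gene_list").payload
```

A mismatch surfaces at the handoff rather than deep inside the downstream agent, which is the kind of execution evidence a local repair step could act on.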

2. Methodological Rigor

Strengths in design: The formalism (Eq. 1-3) is clean, decomposing tasks into goal/context/constraints/resources and workflows into roles/graph/mappings/protocols. The separation of synthesis from repair is well-motivated.
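As a rough illustration of that decomposition, the two tuples could be written as plain data classes. The field names follow the review's wording; their types, the reading of `mappings` as role-to-component and `protocols` as edge-to-artifact-type, and the `execution_order` helper are assumptions, not the paper's Eq. 1-3. The sketch needs Python 3.9 or later for `graphlib`.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter

# Field names mirror the review's wording; types and semantics are assumed.
@dataclass
class Task:
    goal: str
    context: str = ""
    constraints: list = field(default_factory=list)
    resources: list = field(default_factory=list)

@dataclass
class Workflow:
    roles: list       # role names
    graph: dict       # role -> set of upstream roles
    mappings: dict    # role -> name of the retrieved component
    protocols: dict = field(default_factory=dict)  # (upstream, downstream) -> artifact type

    def execution_order(self):
        """Return roles in dependency order, rejecting edges to undeclared roles."""
        mentioned = set(self.graph) | {u for ups in self.graph.values() for u in ups}
        unknown = mentioned - set(self.roles)
        if unknown:
            raise ValueError(f"graph mentions undeclared roles: {sorted(unknown)}")
        full = {role: set(self.graph.get(role, ())) for role in self.roles}
        return list(TopologicalSorter(full).static_order())

task = Task(goal="interpret spatial domains", resources=["spatial agent", "gene-set agent"])
workflow = Workflow(
    roles=["spatial_analysis", "gene_set_interpretation", "report"],
    graph={
        "gene_set_interpretation": {"spatial_analysis"},
        "report": {"gene_set_interpretation"},
    },
    mappings={"spatial_analysis": "spatial agent", "gene_set_interpretation": "gene-set agent"},
    protocols={("spatial_analysis", "gene_set_interpretation"): "gene_list"},
)
order = workflow.execution_order()
```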

Weaknesses in evaluation: The rigor is mixed:

Benchmark results (Table 3): AgentCo-op achieves the best performance on 4/6 benchmarks under the matched GPT-4o-mini backbone, but the comparison is somewhat confounded. AFlow* uses mixed backbones and scores substantially higher on HumanEval and MBPP. In the matched comparison, AFlow with GPT-4o-mini scores significantly lower, raising questions about whether the AFlow reimplementation is faithful. ReConcile and LLM-Debate both beat AgentCo-op on DROP, and LLM-Debate also beats it on HumanEval.

Scientific case studies lack quantitative ground truth. The spatial transcriptomics case study (Section 4.1) demonstrates that the workflow *executes* and produces biologically plausible interpretations, but there is no rigorous validation that the conclusions are correct or superior to manual analysis. The cross-modality study (Section 4.2) has quantitative metrics (precision/recall against marker databases), but the absolute numbers are low (precision ~0.3, recall ~0.12), and the comparison is only RNA-alone vs. ATAC-alone vs. combined—not against alternative workflow design methods.

Ablation study (Table 5) shows modest contributions from local repair and skills/tools on standard benchmarks, with the authors acknowledging these components matter more in scientific settings where they are not quantitatively evaluated.

The "synthesis + search" complementarity (Table 2) is demonstrated on a single benchmark (MBPP) with a marginal improvement (87.1 → 87.5, a gain of 0.4 points), which is too small to be convincing on its own.
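For readers unfamiliar with how the precision and recall figures quoted for the cross-modality study are defined, the following sketch computes both against a reference marker set. The gene identifiers and set sizes are synthetic, chosen only so that the ratios land on the approximate values quoted above (about 0.3 and 0.12); they are not the paper's data.

```python
def precision_recall(predicted, reference):
    """Precision and recall of a predicted marker set against a reference set."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

# Synthetic example: 10 predicted markers, 3 of which appear in a 25-gene reference.
predicted = {f"G{i}" for i in range(10)}
reference = {"G0", "G1", "G2"} | {f"R{i}" for i in range(22)}

precision, recall = precision_recall(predicted, reference)
```

With these toy sets, 7 of every 10 predicted markers are unsupported by the reference and 22 of 25 reference markers are missed, which is why the review describes the absolute numbers as low.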

3. Potential Impact

High potential in scientific workflow automation: The paper identifies a real and underserved need—composing specialized scientific agents without redesigning them. As the scientific agent ecosystem grows (SpatialAgent, GeneAgent, CRISPR-GPT, etc.), interoperability will become increasingly important. AgentCo-op's Docker-wrapping and typed artifact approach provides a practical template.

Moderate impact on benchmark-oriented research: The benchmark gains are incremental rather than transformative. The cost savings (Table 4) are more compelling—AgentCo-op is substantially cheaper than discussion-based methods on most benchmarks.

Broader relevance: The retrieval-based synthesis paradigm could influence how the community thinks about workflow design beyond search-based optimization, particularly for domains where evaluation is expensive or ill-defined.

4. Timeliness & Relevance

The paper is highly timely. The proliferation of specialized LLM agents across scientific domains creates an urgent need for composition frameworks. The paper also connects to the emerging "agent skills" paradigm (Anthropic's MCP, SkillFoundry) and positions itself within a rapidly evolving landscape. The acknowledgment that search and synthesis are complementary rather than competing is a mature perspective.

5. Strengths & Limitations

Key Strengths:

Addresses a genuine gap: workflow design without scalar rewards

Practical Docker-based interoperability mechanism

Demonstrates real-world scientific agent composition (not just toy examples)

Cost-efficient compared to multi-agent baselines

Clean separation of synthesis and repair phases

Code and project website available

Notable Limitations:

Scientific case studies are primarily demonstrations of *feasibility* rather than rigorous evaluations of *superiority*

The "typed artifact handoff" mechanism is described at a high level; the actual schema enforcement and failure modes are not deeply analyzed

Limited analysis of failure cases—when does retrieval-based synthesis produce poor workflows?

The retrieval mechanism itself is underspecified: how are skills scored against roles? What happens with poor retrieval quality?

Only two scientific case studies, both in genomics, limiting generalizability claims

The bounded local repair's convergence properties and repair policy design are not formally analyzed

The ablation shows that on standard benchmarks, the novel components (skills, tools, repair) contribute marginally, while the settings where they should matter most (scientific tasks) lack quantitative evaluation

Missing Comparisons: No comparison against manual expert-designed workflows or simpler baselines (e.g., sequential prompting with tool use) in the scientific settings, making it hard to quantify the actual value added.

Overall Assessment

AgentCo-op presents a well-motivated framework for an important and timely problem. The vision of retrieval-based synthesis for multi-agent workflows is compelling, and the practical mechanisms (Docker wrapping, typed artifacts, local repair) are sound engineering contributions. However, the paper's impact is limited by the gap between its ambitious claims about scientific workflow design and the relatively surface-level evaluation of those capabilities. The benchmark results, while competitive, are incremental. The paper is stronger as a systems/framework contribution than as a scientific discovery paper.

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.5 · Novelty 6.5 · Clarity 7

Generated May 21, 2026

Comparison History (30)

vs. Advancing Mathematics Research with AI-Driven Formal Proof Search

claude-opus-4.6 · 5/22/2026

Paper 2 demonstrates unprecedented real-world scientific impact by autonomously solving 9 open Erdős problems—longstanding challenges in mathematics—using AI-driven formal proof search. This represents a concrete, verifiable advance in mathematical research, not just benchmark improvements. The approach combines LLMs with formal verification (Lean), addressing the critical reliability problem. Its deployment across multiple mathematical subfields (combinatorics, optimization, algebraic geometry, quantum optics) shows broad applicability. Solving open problems is qualitatively different from workflow optimization, representing a paradigm shift in how AI contributes to fundamental research.

vs. Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

claude-opus-4.6 · 5/22/2026

Paper 2 identifies a fundamental and counterintuitive phenomenon—inverse scaling in LLM forecasting for superlinear/tail-risk scenarios—with broad implications across finance, epidemiology, and AI safety. It challenges the prevailing assumption that more capable models are universally better, introduces a contamination-free benchmark, and provides actionable recommendations for evaluation methodology. This finding has high novelty, broad cross-disciplinary relevance, and timely importance as LLMs are increasingly deployed for real-world forecasting. Paper 1, while solid engineering work on multi-agent workflows, is more incremental in its contribution to the existing agentic AI design literature.

vs. Advancing Mathematics Research with AI-Driven Formal Proof Search

gemini-3.1 · 5/22/2026

Paper 2 demonstrates a landmark achievement by using AI to autonomously solve open, historically significant mathematical problems (Erdős problems and OEIS conjectures). While Paper 1 presents a valuable framework for multi-agent workflows, Paper 2's direct contribution to new mathematical knowledge and its deployment across multiple theoretical disciplines signify a broader and more transformative scientific impact.

vs. Forecasting Scientific Progress with Artificial Intelligence

gemini-3.1 · 5/22/2026

Paper 1 introduces a practical, executable framework (AgentCo-op) that immediately enhances scientific discovery by synthesizing interoperable multi-agent workflows. Its proven success in complex, open-world domains like genomics demonstrates high real-world utility and direct application. While Paper 2 provides a valuable metascience benchmark, its primary finding is a negative result regarding current AI forecasting limitations. Paper 1's actionable methodology for automating and repairing research workflows offers a more direct, constructive, and broader technological impact across various scientific disciplines.

vs. PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

gemini-3.1 · 5/21/2026

Paper 2 demonstrates significant cross-disciplinary impact by applying multi-agent workflow synthesis to open-ended scientific domains like genomics. While Paper 1 provides a valuable benchmark for LLM planning, Paper 2 tackles the complex, real-world challenge of tool interoperability and collaborative discovery, offering broader tangible applications and accelerating scientific research.

vs. AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions

gpt-5.2 · 5/21/2026

Paper 2 (AgentCo-op) likely has higher scientific impact due to broader novelty and cross-domain applicability: it introduces retrieval-based synthesis of interoperable multi-agent workflows with typed artifact handoffs and localized repair, addressing a key bottleneck in open-world scientific automation. Its demonstrated use in genomics case studies suggests strong real-world relevance and timeliness for scientific tool/agent integration, and the interoperability/typed interface idea can generalize across fields. Paper 1 is impactful for GUI RPA efficiency, but its scope is narrower (GUI task automation) and more application-specific.

vs. Generative Recursive Reasoning

gpt-5.2 · 5/21/2026

Paper 2 (GRAM) is more conceptually novel and broadly impactful: it generalizes recursive reasoning into a probabilistic latent-trajectory framework, enabling multi-hypothesis computation and inference-time scaling—ideas applicable across reasoning, planning, generation, and uncertainty modeling. Its methodological framing (latent-variable generative model + variational inference) is principled and extensible. Paper 1 is strong and timely for tool/agent orchestration in scientific workflows, but its impact may be narrower and more engineering/system-integration focused, with novelty largely in composition and repair mechanisms rather than a new foundational modeling paradigm.

vs. Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

gemini-3.1 · 5/21/2026

Paper 2 demonstrates higher potential scientific impact due to its broad applicability across both AI systems and domain sciences. While Paper 1 provides valuable insights into LLM alignment and sycophancy mitigation, Paper 2 tackles the critical frontier of automated multi-agent workflows for open-ended scientific discovery. By successfully applying its framework to complex genomics case studies alongside general benchmarks, AgentCo-op directly accelerates actual scientific research. Its interoperable methodology and real-world utility in diverse, cross-disciplinary tasks give it a significantly wider and more transformative reach than Paper 1's narrowly focused alignment intervention.

vs. Uneven Evolution of Cognition Across Generations of Generative AI Models

gemini-3.1 · 5/21/2026

Paper 1 addresses fundamental theoretical questions regarding the trajectory of Artificial General Intelligence (AGI) and identifies critical architectural biases in current models. Its introduction of a psychometric framework to evaluate AI cognition provides broad, foundational insights that could shift AGI research paradigms away from simple scaling. While Paper 2 offers a valuable, practical tool for applied scientific workflows, Paper 1 has higher potential for sweeping theoretical impact across the AI field.

vs. For How Long Should We Be Punching? Learning Action Duration in Fighting Games

gemini-3.1 · 5/21/2026

Paper 2 addresses a highly impactful and timely problem: automating multi-agent workflows for open-ended scientific discovery. Its application to real-world genomics and strong performance across diverse benchmarks (coding, math, QA) demonstrates broad utility and significant methodological innovation. In contrast, Paper 1 focuses on a niche application (fighting game AI), and its findings primarily highlight exploitative behaviors against scripted bots rather than broad, generalizable RL advancements. Therefore, Paper 2 has vastly greater potential for real-world application and cross-disciplinary scientific impact.

vs. VBFDD-Agent for Electric Vehicle Battery Fault Detection and Diagnosis: Descriptive Text Modeling of Battery Digital Signals

gpt-5.2 · 5/21/2026

Paper 1 is more novel and broadly applicable: it proposes a general retrieval-based synthesis framework for composing interoperable multi-agent workflows with typed artifacts and local repair, targeting open-world scientific settings and demonstrating cross-domain benchmark gains plus genomics case studies. Its methodological contribution (workflow synthesis + execution-evidence-driven repair) can impact agent systems, automated science, and tool interoperability across many fields. Paper 2 is timely and practically relevant for EV battery maintenance, but its contribution is more application-specific and appears less generalizable as a scientific framework beyond the battery domain.

vs. Safety Certification is Classification

gemini-3.1 · 5/21/2026

Paper 2 addresses the highly relevant and rapidly growing field of LLM-based multi-agent systems and AI for science. Its approach to composing interoperable workflows has broad, cross-disciplinary applications, demonstrated through genomics and standard benchmarks. While Paper 1 provides a rigorous and novel approach to safety certification in control systems, Paper 2's potential to accelerate open-ended scientific discovery across various domains gives it a significantly wider breadth of impact and timeliness.

vs. Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

claude-opus-4.6 · 5/21/2026

AgentCo-op addresses a broader and more impactful problem—composing multi-agent workflows for open-ended scientific tasks including genomics applications—with demonstrated utility across diverse benchmarks and real scientific case studies. It introduces a novel retrieval-based synthesis framework with typed artifact handoffs and local repair, offering methodological contributions applicable across many fields. Paper 1, while technically solid, is a domain-specific GPU-accelerated simulator for Mahjong RL, which serves a narrower research community and represents more of an engineering contribution than a conceptual advance.

vs. Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

claude-opus-4.6 · 5/21/2026

Paper 2 addresses a fundamental problem in guided sampling for diffusion/flow models—composing multiple constraints while staying on the data manifold. It provides theoretical insight (gradient misalignment causing off-manifold drift) and a principled, lightweight solution (CAR guidance) validated across diverse domains (images, planning, control). This has broad applicability across generative AI. Paper 1, while addressing an important practical problem in multi-agent workflow design with strong genomics applications, is more systems-oriented and narrower in its theoretical contribution. Paper 2's foundational insight into gradient conflicts in compositional guidance is likely to influence a wider range of future work.

vs. Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

gpt-5.2 · 5/21/2026

Paper 2 likely has higher impact: it targets a field-wide bottleneck—benchmark validity under reward hacking—affecting how progress is measured across many agent domains. It contributes a general taxonomy, a practical checklist, and an automated red-teaming/auditing system validated on 10 major benchmarks with extensive empirical findings (219 flaws) and demonstrated mitigation (patching pipelines, large reductions in hackability). This is timely and broadly applicable to AI evaluation, safety, and deployment. Paper 1 is innovative for workflow synthesis but is narrower and more application-specific.

vs. Interaction Locality in Hierarchical Recursive Reasoning

gemini-3.1 · 5/21/2026

Paper 1 presents a highly practical and timely framework for automating and synthesizing multi-agent workflows, with direct, high-impact applications in real-world scientific discovery such as open-world genomics. Its ability to improve interoperability and reduce costs across diverse domains (science, coding, math) gives it broader immediate utility and transformational potential compared to Paper 2, which focuses more narrowly on theoretical interpretability and architectural analysis of spatial reasoning models.

vs. \ECUAS{n}: A family of metrics for principled evaluation of uncertainty-augmented systems

gpt-5.2 · 5/21/2026

Paper 2 likely has higher impact: it introduces a broadly applicable framework for composing interoperable multi-agent workflows with typed artifact handoffs and local repair, addressing a timely, high-demand problem (open-world agent/tool integration). It demonstrates real-world utility via genomics case studies plus competitive benchmark results and cost reductions, suggesting strong practical adoption potential across scientific and engineering domains. Paper 1 is methodologically rigorous and novel for uncertainty-augmented evaluation, but its impact is narrower (evaluation metrics) and more likely confined to ML assessment practice rather than enabling new classes of end-to-end systems.

vs. OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

gemini-3.1 · 5/21/2026

Paper 1 presents a highly versatile framework for multi-agent workflow synthesis with direct applications to open-ended scientific discovery, demonstrated through complex genomics case studies. Its ability to integrate independent agents and tools dynamically offers broader interdisciplinary impact and real-world utility compared to Paper 2, which focuses on a narrower, albeit impressive, improvement in Theory of Mind reasoning for language models.

vs. COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space

gpt-5.2 · 5/21/2026

Paper 1 is likely to have higher scientific impact due to broader cross-domain relevance: a general framework for composing interoperable multi-agent workflows via typed artifacts, retrieval-based component grounding, and localized repair, demonstrated in open-world genomics plus multiple general benchmarks with cost reductions. Its emphasis on interoperability and reuse of existing agents/tools targets a major bottleneck in real scientific automation and could influence LLM-agent systems, workflow languages, and computational science practices. Paper 2 is rigorous and impactful for VRP, but is more domain-specific, with narrower breadth despite strong benchmark gains.

vs. COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space

gpt-5.2 · 5/21/2026

Paper 2 is likely higher impact due to broader cross-domain relevance and timeliness: it targets open-world, tool-integrating multi-agent workflow synthesis with typed artifact interfaces and execution-driven local repair—capabilities central to current scientific automation trends. Its applications span genomics case studies plus multiple general benchmarks, suggesting wider applicability beyond a single optimization domain. Paper 1 is methodologically solid and impactful for VRP/learn-to-search, but its contribution is more domain-specific (routing) and likely narrower in breadth compared to a general framework for interoperable agent workflows.