AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows
Shuaike Shen, Wenduo Cheng, Shike Wang, Mingqian Ma, Jian Ma
Abstract
Designing multi-agent workflows is especially difficult in open-ended scientific settings where tasks lack curated training sets, reliable scalar evaluation metrics, and standardized interfaces between existing tools and agents. We propose AgentCo-op, a retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable workflows through typed artifact handoffs, then applies bounded self-guided local repair to implicated components when execution evidence indicates failure. In two open-world genomics case studies, AgentCo-op composes independently developed scientific agents and external tool repositories into auditable workflows without redesigning them or running global topology search. It coordinates specialized agents for spatial transcriptomics and gene-set interpretation to enable collaborative discovery from spatial transcriptomics data, and builds a parallel workflow for cross-modality marker analysis on single-cell multiome data. AgentCo-op can also import a searched workflow as a structural prior and improve it by grounding nodes with retrieved components and applying local repair, showing that synthesis and search are complementary. On six coding, math, and question-answering benchmarks, AgentCo-op achieves the best result on four benchmarks and the best average score under a unified backbone setting, while consistently reducing per-task cost relative to multi-agent baselines. Together, these results suggest that retrieval-based synthesis can extend automated agentic workflow design beyond benchmark-optimized agent graphs to open-world workflows built from existing agents, tools, and typed artifacts.
AI Impact Assessments
Scientific Impact Assessment: AgentCo-op
1. Core Contribution
AgentCo-op reframes automated multi-agent workflow design from a search-based optimization problem to a retrieval-based synthesis problem. The key insight is that many real-world tasks—especially in scientific domains—lack curated training sets, scalar evaluation metrics, and standardized interfaces that search-based methods (ADAS, AFlow) require. Instead of iteratively searching over workflow topologies against a benchmark, AgentCo-op retrieves reusable skills, tools, and external agents from libraries, composes them into directed-graph workflows with typed artifact handoffs, and applies bounded local repair when execution fails.
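The "bounded local repair" idea can be illustrated with a minimal sketch. It is not the paper's implementation: `Step`, `execute_with_local_repair`, `swap_in_fallback`, and the toy steps are hypothetical names, and the repair strategy shown (swapping in a fallback implementation) is an assumption chosen only to make the control flow concrete. The point is that a failure triggers repair of the failing step alone, with a fixed budget, rather than a re-search of the whole workflow.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple

Artifacts = Dict[str, object]


@dataclass(frozen=True)
class Step:
    """One workflow node (hypothetical structure, for illustration only)."""
    name: str
    run: Callable[[Artifacts], Artifacts]  # reads shared artifacts, returns new ones


def execute_with_local_repair(
    steps: List[Step],
    repair: Callable[[Step, Exception], Step],
    max_repairs: int = 2,
) -> Tuple[Artifacts, List[Tuple[str, str]]]:
    """Run steps in order; on failure, repair only the failing step,
    at most max_repairs times, then re-raise."""
    artifacts: Artifacts = {}
    log: List[Tuple[str, str]] = []
    for step in steps:
        current = step
        for attempt in range(max_repairs + 1):
            try:
                artifacts.update(current.run(artifacts))
                log.append((current.name, "ok"))
                break
            except Exception as exc:
                log.append((current.name, f"failed: {exc}"))
                if attempt == max_repairs:
                    raise  # repair budget exhausted; surface the failure
                current = repair(current, exc)
    return artifacts, log


# Toy steps standing in for retrieved components.
def load(_: Artifacts) -> Artifacts:
    return {"counts": [3, 1, 2]}


def broken_rank(_: Artifacts) -> Artifacts:
    raise ValueError("unsupported input layout")


def fixed_rank(artifacts: Artifacts) -> Artifacts:
    return {"ranking": sorted(artifacts["counts"], reverse=True)}


def swap_in_fallback(step: Step, exc: Exception) -> Step:
    # Assumed repair strategy: replace the implementation of the failing step only.
    return replace(step, run=fixed_rank)


artifacts, log = execute_with_local_repair(
    [Step("load", load), Step("rank", broken_rank)], swap_in_fallback
)
print(log)
```

In this sketch the `load` step is never re-run or modified when `rank` fails, which is the property that distinguishes local repair from global topology search.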
The framework addresses a genuine gap: composing *independently developed* agents with incompatible environments and interfaces into coherent workflows. The Docker-wrapping mechanism for external repositories and the typed artifact broker for inter-agent communication are practical design choices that enable heterogeneous agent interoperability.
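A typed artifact broker of the kind mentioned above could look roughly like the following. This is a sketch under stated assumptions, not AgentCo-op's actual interface: the class `ArtifactBroker`, its `publish` and `fetch` methods, the string type tags, and the placeholder gene names are all hypothetical. It shows the basic contract: a producer registers an artifact with a declared type, and a consumer must state the type it expects before receiving the payload.

```python
from typing import Any, Dict, Tuple


class ArtifactTypeError(TypeError):
    """Raised when a consumer asks for an artifact under the wrong type tag."""


class ArtifactBroker:
    """Hypothetical in-memory broker: artifacts are stored with a type tag,
    and every fetch states the type the consumer expects."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[str, Any]] = {}

    def publish(self, name: str, type_tag: str, payload: Any) -> None:
        self._store[name] = (type_tag, payload)

    def fetch(self, name: str, expected_type: str) -> Any:
        if name not in self._store:
            raise KeyError(f"no artifact named '{name}'")
        type_tag, payload = self._store[name]
        if type_tag != expected_type:
            raise ArtifactTypeError(
                f"artifact '{name}' has type '{type_tag}', expected '{expected_type}'"
            )
        return payload


broker = ArtifactBroker()
# An upstream agent publishes its result (placeholder gene names, not real data).
broker.publish("marker_genes", "gene_list", ["GENE_A", "GENE_B"])
# A downstream agent receives it only if the declared types agree.
genes = broker.fetch("marker_genes", expected_type="gene_list")
print(genes)
```

Checking types at the handoff means an interface mismatch between two independently developed agents fails at the boundary with a named artifact, rather than deep inside the downstream agent.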
2. Methodological Rigor
Strengths in design: The formalism (Eq. 1-3) is clean, decomposing tasks into goal/context/constraints/resources and workflows into roles/graph/mappings/protocols. The separation of synthesis from repair is well-motivated.
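The decomposition described here can be made concrete as two record types. The sketch below follows only the field names given in this assessment (goal, context, constraints, resources; roles, graph, mappings, protocols). The paper's Eq. 1-3 are not reproduced, and the Python types, the class names `TaskSpec` and `WorkflowSpec`, the `problems` check, and the example values are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # directed edge: (upstream role, downstream role)


@dataclass(frozen=True)
class TaskSpec:
    goal: str
    context: str
    constraints: Tuple[str, ...] = ()
    resources: Tuple[str, ...] = ()


@dataclass
class WorkflowSpec:
    roles: List[str]
    graph: List[Edge]
    mappings: Dict[str, str]    # role -> grounded skill, tool, or agent
    protocols: Dict[Edge, str]  # edge -> artifact type handed off along it

    def problems(self) -> List[str]:
        """List structural inconsistencies between the four parts."""
        issues: List[str] = []
        known = set(self.roles)
        for role in self.roles:
            if role not in self.mappings:
                issues.append(f"role '{role}' has no grounded component")
        for src, dst in self.graph:
            for end in (src, dst):
                if end not in known:
                    issues.append(f"edge ({src}, {dst}) uses undeclared role '{end}'")
            if (src, dst) not in self.protocols:
                issues.append(f"edge ({src}, {dst}) has no handoff protocol")
        return issues


task = TaskSpec(
    goal="interpret spatial domains",
    context="spatial transcriptomics dataset",
    constraints=("no redesign of external agents",),
    resources=("spatial agent", "gene-set agent"),
)
workflow = WorkflowSpec(
    roles=["spatial_analyst", "gene_set_interpreter"],
    graph=[("spatial_analyst", "gene_set_interpreter")],
    mappings={
        "spatial_analyst": "spatial agent",
        "gene_set_interpreter": "gene-set agent",
    },
    protocols={("spatial_analyst", "gene_set_interpreter"): "gene_list"},
)
print(workflow.problems())
```

Keeping roles, graph, mappings, and protocols as separate fields lets each be validated or repaired independently, which is consistent with the separation of synthesis from repair noted above.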
Weaknesses in evaluation: The rigor is mixed:
3. Potential Impact
High potential in scientific workflow automation: The paper identifies a real and underserved need—composing specialized scientific agents without redesigning them. As the scientific agent ecosystem grows (SpatialAgent, GeneAgent, CRISPR-GPT, etc.), interoperability will become increasingly important. AgentCo-op's Docker-wrapping and typed artifact approach provides a practical template.
Moderate impact on benchmark-oriented research: The benchmark gains are incremental rather than transformative. The cost savings (Table 4) are more compelling—AgentCo-op is substantially cheaper than discussion-based methods on most benchmarks.
Broader relevance: The retrieval-based synthesis paradigm could influence how the community thinks about workflow design beyond search-based optimization, particularly for domains where evaluation is expensive or ill-defined.
4. Timeliness & Relevance
The paper is highly timely. The proliferation of specialized LLM agents across scientific domains creates an urgent need for composition frameworks. The paper also connects to the emerging "agent skills" paradigm (Anthropic's MCP, SkillFoundry) and positions itself within a rapidly evolving landscape. The acknowledgment that search and synthesis are complementary rather than competing is a mature perspective.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Missing Comparisons: No comparison against manual expert-designed workflows or simpler baselines (e.g., sequential prompting with tool use) in the scientific settings, making it hard to quantify the actual value added.
Overall Assessment
AgentCo-op presents a well-motivated framework for an important and timely problem. The vision of retrieval-based synthesis for multi-agent workflows is compelling, and the practical mechanisms (Docker wrapping, typed artifacts, local repair) are sound engineering contributions. However, the paper's impact is limited by the gap between its ambitious claims about scientific workflow design and the relatively surface-level evaluation of those capabilities. The benchmark results, while competitive, are incremental. The paper is stronger as a systems/framework contribution than as a scientific discovery paper.
Generated May 21, 2026
Comparison History (30)
Paper 2 demonstrates unprecedented real-world scientific impact by autonomously solving 9 open Erdős problems—longstanding challenges in mathematics—using AI-driven formal proof search. This represents a concrete, verifiable advance in mathematical research, not just benchmark improvements. The approach combines LLMs with formal verification (Lean), addressing the critical reliability problem. Its deployment across multiple mathematical subfields (combinatorics, optimization, algebraic geometry, quantum optics) shows broad applicability. Solving open problems is qualitatively different from workflow optimization, representing a paradigm shift in how AI contributes to fundamental research.
Paper 2 identifies a fundamental and counterintuitive phenomenon—inverse scaling in LLM forecasting for superlinear/tail-risk scenarios—with broad implications across finance, epidemiology, and AI safety. It challenges the prevailing assumption that more capable models are universally better, introduces a contamination-free benchmark, and provides actionable recommendations for evaluation methodology. This finding has high novelty, broad cross-disciplinary relevance, and timely importance as LLMs are increasingly deployed for real-world forecasting. Paper 1, while solid engineering work on multi-agent workflows, is more incremental in its contribution to the existing agentic AI design literature.
Paper 2 demonstrates a landmark achievement by using AI to autonomously solve open, historically significant mathematical problems (Erdős problems and OEIS conjectures). While Paper 1 presents a valuable framework for multi-agent workflows, Paper 2's direct contribution to new mathematical knowledge and its deployment across multiple theoretical disciplines signify a broader and more transformative scientific impact.
Paper 1 introduces a practical, executable framework (AgentCo-op) that immediately enhances scientific discovery by synthesizing interoperable multi-agent workflows. Its proven success in complex, open-world domains like genomics demonstrates high real-world utility and direct application. While Paper 2 provides a valuable metascience benchmark, its primary finding is a negative result regarding current AI forecasting limitations. Paper 1's actionable methodology for automating and repairing research workflows offers a more direct, constructive, and broader technological impact across various scientific disciplines.
Paper 2 demonstrates significant cross-disciplinary impact by applying multi-agent workflow synthesis to open-ended scientific domains like genomics. While Paper 1 provides a valuable benchmark for LLM planning, Paper 2 tackles the complex, real-world challenge of tool interoperability and collaborative discovery, offering broader tangible applications and accelerating scientific research.
Paper 2 (AgentCo-op) likely has higher scientific impact due to broader novelty and cross-domain applicability: it introduces retrieval-based synthesis of interoperable multi-agent workflows with typed artifact handoffs and localized repair, addressing a key bottleneck in open-world scientific automation. Its demonstrated use in genomics case studies suggests strong real-world relevance and timeliness for scientific tool/agent integration, and the interoperability/typed interface idea can generalize across fields. Paper 1 is impactful for GUI RPA efficiency, but its scope is narrower (GUI task automation) and more application-specific.
Paper 2 (GRAM) is more conceptually novel and broadly impactful: it generalizes recursive reasoning into a probabilistic latent-trajectory framework, enabling multi-hypothesis computation and inference-time scaling—ideas applicable across reasoning, planning, generation, and uncertainty modeling. Its methodological framing (latent-variable generative model + variational inference) is principled and extensible. Paper 1 is strong and timely for tool/agent orchestration in scientific workflows, but its impact may be narrower and more engineering/system-integration focused, with novelty largely in composition and repair mechanisms rather than a new foundational modeling paradigm.
Paper 2 demonstrates higher potential scientific impact due to its broad applicability across both AI systems and domain sciences. While Paper 1 provides valuable insights into LLM alignment and sycophancy mitigation, Paper 2 tackles the critical frontier of automated multi-agent workflows for open-ended scientific discovery. By successfully applying its framework to complex genomics case studies alongside general benchmarks, AgentCo-op directly accelerates actual scientific research. Its interoperable methodology and real-world utility in diverse, cross-disciplinary tasks give it a significantly wider and more transformative reach than Paper 1's narrowly focused alignment intervention.
Paper 1 addresses fundamental theoretical questions regarding the trajectory of Artificial General Intelligence (AGI) and identifies critical architectural biases in current models. Its introduction of a psychometric framework to evaluate AI cognition provides broad, foundational insights that could shift AGI research paradigms away from simple scaling. While Paper 2 offers a valuable, practical tool for applied scientific workflows, Paper 1 has higher potential for sweeping theoretical impact across the AI field.
Paper 2 addresses a highly impactful and timely problem: automating multi-agent workflows for open-ended scientific discovery. Its application to real-world genomics and strong performance across diverse benchmarks (coding, math, QA) demonstrates broad utility and significant methodological innovation. In contrast, Paper 1 focuses on a niche application (fighting game AI), and its findings primarily highlight exploitative behaviors against scripted bots rather than broad, generalizable RL advancements. Therefore, Paper 2 has vastly greater potential for real-world application and cross-disciplinary scientific impact.
Paper 1 is more novel and broadly applicable: it proposes a general retrieval-based synthesis framework for composing interoperable multi-agent workflows with typed artifacts and local repair, targeting open-world scientific settings and demonstrating cross-domain benchmark gains plus genomics case studies. Its methodological contribution (workflow synthesis + execution-evidence-driven repair) can impact agent systems, automated science, and tool interoperability across many fields. Paper 2 is timely and practically relevant for EV battery maintenance, but its contribution is more application-specific and appears less generalizable as a scientific framework beyond the battery domain.
Paper 2 addresses the highly relevant and rapidly growing field of LLM-based multi-agent systems and AI for science. Its approach to composing interoperable workflows has broad, cross-disciplinary applications, demonstrated through genomics and standard benchmarks. While Paper 1 provides a rigorous and novel approach to safety certification in control systems, Paper 2's potential to accelerate open-ended scientific discovery across various domains gives it a significantly wider breadth of impact and timeliness.
AgentCo-op addresses a broader and more impactful problem—composing multi-agent workflows for open-ended scientific tasks including genomics applications—with demonstrated utility across diverse benchmarks and real scientific case studies. It introduces a novel retrieval-based synthesis framework with typed artifact handoffs and local repair, offering methodological contributions applicable across many fields. Paper 1, while technically solid, is a domain-specific GPU-accelerated simulator for Mahjong RL, which serves a narrower research community and represents more of an engineering contribution than a conceptual advance.
Paper 2 addresses a fundamental problem in guided sampling for diffusion/flow models—composing multiple constraints while staying on the data manifold. It provides theoretical insight (gradient misalignment causing off-manifold drift) and a principled, lightweight solution (CAR guidance) validated across diverse domains (images, planning, control). This has broad applicability across generative AI. Paper 1, while addressing an important practical problem in multi-agent workflow design with strong genomics applications, is more systems-oriented and narrower in its theoretical contribution. Paper 2's foundational insight into gradient conflicts in compositional guidance is likely to influence a wider range of future work.
Paper 2 likely has higher impact: it targets a field-wide bottleneck—benchmark validity under reward hacking—affecting how progress is measured across many agent domains. It contributes a general taxonomy, a practical checklist, and an automated red-teaming/auditing system validated on 10 major benchmarks with extensive empirical findings (219 flaws) and demonstrated mitigation (patching pipelines, large reductions in hackability). This is timely and broadly applicable to AI evaluation, safety, and deployment. Paper 1 is innovative for workflow synthesis but is narrower and more application-specific.
Paper 1 presents a highly practical and timely framework for automating and synthesizing multi-agent workflows, with direct, high-impact applications in real-world scientific discovery such as open-world genomics. Its ability to improve interoperability and reduce costs across diverse domains (science, coding, math) gives it broader immediate utility and transformational potential compared to Paper 2, which focuses more narrowly on theoretical interpretability and architectural analysis of spatial reasoning models.
Paper 2 likely has higher impact: it introduces a broadly applicable framework for composing interoperable multi-agent workflows with typed artifact handoffs and local repair, addressing a timely, high-demand problem (open-world agent/tool integration). It demonstrates real-world utility via genomics case studies plus competitive benchmark results and cost reductions, suggesting strong practical adoption potential across scientific and engineering domains. Paper 1 is methodologically rigorous and novel for uncertainty-augmented evaluation, but its impact is narrower (evaluation metrics) and more likely confined to ML assessment practice rather than enabling new classes of end-to-end systems.
Paper 1 presents a highly versatile framework for multi-agent workflow synthesis with direct applications to open-ended scientific discovery, demonstrated through complex genomics case studies. Its ability to integrate independent agents and tools dynamically offers broader interdisciplinary impact and real-world utility compared to Paper 2, which focuses on a narrower, albeit impressive, improvement in Theory of Mind reasoning for language models.
Paper 1 is likely to have higher scientific impact due to broader cross-domain relevance: a general framework for composing interoperable multi-agent workflows via typed artifacts, retrieval-based component grounding, and localized repair, demonstrated in open-world genomics plus multiple general benchmarks with cost reductions. Its emphasis on interoperability and reuse of existing agents/tools targets a major bottleneck in real scientific automation and could influence LLM-agent systems, workflow languages, and computational science practices. Paper 2 is rigorous and impactful for VRP, but is more domain-specific, with narrower breadth despite strong benchmark gains.
Paper 2 is likely higher impact due to broader cross-domain relevance and timeliness: it targets open-world, tool-integrating multi-agent workflow synthesis with typed artifact interfaces and execution-driven local repair—capabilities central to current scientific automation trends. Its applications span genomics case studies plus multiple general benchmarks, suggesting wider applicability beyond a single optimization domain. Paper 1 is methodologically solid and impactful for VRP/learn-to-search, but its contribution is more domain-specific (routing) and likely narrower in breadth compared to a general framework for interoperable agent workflows.