Beyond Prompt-Based Planning: MCP-Native Graph Planning-based Biomedical Agent System
Zhangtianyi Chen, Florensia Widjaja, Wufei Dai, Xiangjun Zhang, Yuhao Shen, Juexiao Zhou
Abstract
Biomedical agents promise to automate complex biological workflows, yet current systems face two fundamental bottlenecks: bioinformatics tools are highly heterogeneous in interfaces and execution environments, while agent planning still relies on flat prompt-retrieved tool descriptions. As biomedical software ecosystems grow, this coupling between tool coverage and context size leads to tool confusion, unstable planning, and inefficient execution. We introduce BioManus, an MCP-native biomedical agent built on graph-scaffolded planning over structured biological capabilities. BioManus first introduces the BioinfoMCP Compiler, which converts heterogeneous bioinformatics software into standardized MCP servers, yielding a large executable MCP ecosystem. It then organizes this ecosystem as a typed heterogeneous MCP graph over tools, operations, datatypes, and workflow stages. At inference time, BioManus retrieves compact task-specific subgraphs and synthesizes operation-level workflow scaffolds. This design decouples planning complexity from raw tool inventory size, achieving a context compression ratio of Θ(N/(h·m̄)) under high-recall retrieval, where N is the total tool count, h is the workflow horizon, and m̄ (much smaller than N) is the average number of candidate tools per operation. Experiments on BioAgentBench and LAB-Bench show that BioManus improves execution accuracy, workflow validity, and context efficiency over advanced biomedical agent baselines. This work suggests a paradigm shift: scalable biomedical reasoning requires structured executable capability graphs rather than ever-larger prompt-level tool retrieval.
AI Impact Assessments
Scientific Impact Assessment: BioManus – MCP-Native Graph Planning-based Biomedical Agent System
1. Core Contribution
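BioManus addresses two concrete bottlenecks in LLM-based biomedical agents: (1) heterogeneous bioinformatics tool interfaces that fragment execution, and (2) the scaling failure of prompt-based tool retrieval as tool ecosystems grow. The paper proposes two linked innovations: the BioinfoMCP Compiler, which converts heterogeneous bioinformatics software into standardized MCP servers, and graph-scaffolded planning over a typed heterogeneous MCP graph of tools, operations, datatypes, and workflow stages.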
The key conceptual insight is separating *what operations to perform* from *which specific tools to use*, reasoning over an abstract operation graph before binding to concrete executables. This yields a context compression ratio of Θ(N/(h·m̄)), where planning scales with workflow complexity rather than total tool inventory size.
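The two-stage idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the operation graph, the datatype names, the tool-binding table, and the functions `plan_operations` and `bind_tools` are all hypothetical, and a plain breadth-first search stands in for whatever retrieval and scaffold synthesis BioManus actually uses.

```python
from collections import deque

# Hypothetical operation graph: operation -> (input datatype, output datatype).
OPERATIONS = {
    "quality_control": ("fastq", "fastq_clean"),
    "alignment": ("fastq_clean", "bam"),
    "variant_calling": ("bam", "vcf"),
    "quantification": ("bam", "counts"),
}

# Hypothetical binding table: operation -> candidate tools (illustrative only).
TOOLS = {
    "quality_control": ["fastp", "trimmomatic"],
    "alignment": ["bwa", "bowtie2"],
    "variant_calling": ["bcftools", "gatk"],
    "quantification": ["featureCounts"],
}


def plan_operations(source: str, target: str) -> list[str]:
    """Stage 1: breadth-first search over datatypes.

    Returns the shortest chain of operations whose input/output types
    connect `source` to `target`. No tool names are consulted here.
    """
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        dtype, path = queue.popleft()
        if dtype == target:
            return path
        for op, (inp, out) in OPERATIONS.items():
            if inp == dtype and out not in seen:
                seen.add(out)
                queue.append((out, path + [op]))
    raise ValueError(f"no operation chain from {source} to {target}")


def bind_tools(plan: list[str]) -> dict[str, list[str]]:
    """Stage 2: attach candidate tools only for operations in the plan."""
    return {op: TOOLS[op] for op in plan}


plan = plan_operations("fastq", "vcf")
candidates = bind_tools(plan)
print(plan)
print(candidates)
```

In this toy setup the planner never sees the tools bound to `quantification`, because that operation is not on the path from `fastq` to `vcf`; only the candidates for the three planned operations would need to enter the model's context.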
2. Methodological Rigor
Strengths in experimental design:
Methodological concerns:
3. Potential Impact
The paper's strongest potential impact lies in its infrastructure contribution: 910 MCP servers covering 3,500 bioinformatics tools across eight domains is a substantial ecosystem artifact. If released and maintained, this could become a community resource that accelerates biomedical agent research broadly.
The graph-scaffolded planning paradigm is conceptually transferable beyond biomedicine — any domain with structured tool ecosystems (materials science, chemistry, engineering simulations) could benefit from operation-level abstraction over typed capability graphs. The decoupling of planning complexity from tool inventory size addresses a genuine scaling concern that will intensify as tool ecosystems grow.
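As a rough illustration of that decoupling, the sketch below evaluates the stated ratio N/(h·m̄) for one set of inputs. The function name `compression_ratio` and the sample values are assumptions chosen for the arithmetic, not figures reported in the paper, and the calculation ignores the constant factors hidden by the Θ notation.

```python
def compression_ratio(n_tools: int, horizon: int, avg_candidates: float) -> float:
    """Ratio of flat-prompt context to scaffolded context, N / (h * m_bar).

    Assumes context size is proportional to the number of tool
    descriptions included in the prompt.
    """
    if n_tools <= 0 or horizon <= 0 or avg_candidates <= 0:
        raise ValueError("all arguments must be positive")
    return n_tools / (horizon * avg_candidates)


# Hypothetical values: 3000 tools, a 6-step workflow, 5 candidates per step.
ratio = compression_ratio(3000, 6, 5)
print(ratio)  # 100.0
```

Under these assumed values, doubling the tool inventory doubles the flat-prompt cost but leaves the scaffolded cost h·m̄ unchanged, as long as the workflow length and the per-operation candidate count stay fixed.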
The MCP standardization layer addresses a real engineering pain point in bioinformatics: dependency conflicts, runtime incompatibilities, and cross-language execution. Docker-based packaging of MCP servers could improve reproducibility.
4. Timeliness & Relevance
The paper is highly timely. MCP has rapidly emerged as a standard for tool-LLM interaction (2024-2025), and biomedical AI agents are an active frontier. The observation that prompt-based retrieval does not scale is becoming widely recognized, but few systems have proposed structured alternatives. The work directly engages with very recent systems (Biomni, BioAgentBench, STELLA, PoSyMed) published in 2025-2026, positioning itself at the cutting edge.
The growing complexity of bioinformatics pipelines (Nextflow, Snakemake ecosystems) and the proliferation of single-cell, spatial, and multi-omics tools make the scalability problem increasingly urgent.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Additional Observations:
The paper's framing as a "paradigm shift" is somewhat overclaimed given the incremental accuracy improvements. The conceptual contribution (graph-over-tools planning) is sound but not deeply novel — hierarchical planning and typed workflow graphs have precedent in classical AI planning and workflow management systems. The novelty is more in the specific instantiation for MCP-native biomedical agents than in the abstract idea.
Generated Jun 5, 2026
Comparison History (17)
Paper 2 offers broader scientific impact by addressing a critical vulnerability in AI safety: evaluating the hidden (No-CoT) reasoning capabilities of frontier models. While Paper 1 presents an innovative, highly practical architecture for biomedical agents, Paper 2's findings have sweeping implications for AI alignment, model evaluation, and oversight policies across all domains. By quantifying and forecasting no-CoT capabilities against human time horizons and reasoning tokens, Paper 2 establishes a vital metric for the broader AI community, making its potential impact more fundamental and cross-disciplinary.
Paper 2 has higher likely impact: it targets a broadly relevant and timely problem—accurate low-bit (NVFP4) LLM deployment—affecting many domains and production systems. It contributes a clear diagnostic (KL-only QAD hides internal representational drift), links this drift to downstream reasoning/coding performance, and proposes a general, lightweight remedy (CKA-based regularization) demonstrated on multiple model families. Paper 1 is innovative but more domain-specific (biomedical tooling/MCP ecosystem) and depends on platform adoption, narrowing breadth of impact compared to quantization methods applicable across LLM deployments.
BioManus introduces a more structurally novel paradigm shift—replacing prompt-based tool retrieval with graph-scaffolded planning via MCP-native architecture for biomedical agents. It addresses a critical infrastructure bottleneck in bioinformatics tool integration, provides formal complexity analysis (context compression), and demonstrates results on established benchmarks. While Trace2Skill offers a useful general framework for skill distillation with impressive transfer results, BioManus's domain-specific impact in biomedicine (a high-stakes field), its architectural innovation (MCP graph planning), and the creation of a reusable ecosystem (BioinfoMCP Compiler) give it broader and deeper potential impact.
Paper 2 has higher potential impact due to a more novel, scalable architecture (MCP-native tool standardization + typed heterogeneous capability graphs) addressing a widely felt bottleneck in agentic tool-use. It provides clearer methodological contributions (compiler, graph formalism, retrieval/scaffolding, theoretical compression analysis) and empirical validation on established benchmarks, suggesting rigor and near-term applicability in biomedical automation. Its ideas generalize beyond biomedicine (structured capability graphs for tool ecosystems), broadening cross-field impact. Paper 1 is valuable for standards/interoperability and formal protocols, but is narrower and more incremental to existing MAS protocol work.
BioManus introduces a paradigm shift in biomedical agent design through MCP-native graph planning, addressing fundamental scalability bottlenecks in tool heterogeneity and context management. Its structured capability graph approach has broader impact across the rapidly growing biomedical AI field, offering a generalizable architecture pattern. While LLM4Cov presents solid work in hardware verification with novel offline learning techniques, its domain is narrower. BioManus's theoretical framework (context compression), large executable ecosystem, and applicability to diverse biological workflows give it higher potential for widespread adoption and cross-field influence.
BioManus introduces a broader paradigm shift in how biomedical agents handle tool ecosystems through graph-scaffolded planning and MCP standardization, with potential impact across all of biomedical research automation. Its architectural innovation (decoupling planning from tool inventory size) addresses a fundamental scalability bottleneck affecting many agent systems beyond biomedicine. Paper 2, while technically strong with impressive results on VRP benchmarks, addresses a narrower problem (constraint verification for OR modeling) with more limited cross-domain applicability. BioManus's ecosystem-level contribution and generalizable design principles suggest wider and longer-lasting scientific influence.
RHO addresses a fundamental, domain-general challenge in AI agent optimization—improving agent performance without ground-truth labels—making it broadly applicable across software engineering, technical work, and knowledge work. Its self-supervised approach (self-validation, self-consistency, self-preference) is highly novel and practical for real-world deployment where labeled data is scarce. The dramatic improvement on SWE-Bench Pro (59%→78%) without external grading is compelling. While BioManus makes strong contributions to biomedical agent planning with its MCP graph architecture, its impact is more domain-specific. RHO's generality and practicality give it broader potential impact across the AI agent ecosystem.
BioManus introduces a novel architectural paradigm (MCP-native graph planning) that addresses fundamental scalability bottlenecks in biomedical agent systems, with broad implications across bioinformatics and AI agent design. Its contribution—decoupling planning complexity from tool inventory via structured capability graphs—represents a conceptual advance applicable beyond biomedicine. While Drive-KD achieves impressive compression results, multi-teacher knowledge distillation is more incremental. BioManus's creation of a standardized MCP ecosystem and formal complexity analysis (context compression ratio) provides deeper theoretical and practical foundations with wider cross-field impact.
Paper 1 challenges a fundamental assumption in general LLM agent design ('more is better') and demonstrates broad applicability across domains. Its rigorous methodology, including full factorial experiments and exact Shapley values, provides deep theoretical and practical insights into cross-component interference. While Paper 2 presents an innovative and rigorous solution for biomedical workflows, its immediate impact is largely domain-specific. Paper 1's findings will influence the design and evaluation of virtually all future LLM agent architectures, giving it a much wider breadth of impact and higher overall scientific significance.
Paper 2 has higher potential impact due to a more general, scalable methodological contribution: compiling heterogeneous bioinformatics tools into standardized MCP servers and enabling graph-scaffolded planning with formal context-compression properties. This addresses a central bottleneck in agentic biomedical automation (tool heterogeneity and planning instability) and can transfer to other tool-rich scientific domains. Its evaluation on established benchmarks suggests rigor and relevance. Paper 1 is valuable and applied, but is more domain- and vendor-specific (Abaqus/solid mechanics) and the multi-agent LLM interface pattern is less novel and broadly extensible than the structured capability-graph paradigm.
Paper 1 introduces a highly novel approach to agent planning by replacing prompt-based tool retrieval with a structured capability graph using the new MCP standard. This addresses a critical bottleneck in scaling AI agents for complex, real-world scientific workflows. Paper 2, while addressing an important issue (bias mitigation), represents an incremental application of an existing RL algorithm (GRPO) to alignment, making Paper 1's paradigm shift in agent architecture more impactful across both AI and bioinformatics fields.
Paper 2 likely has higher scientific impact: it introduces an MCP-native, graph-based planning architecture plus a compiler that standardizes heterogeneous bioinformatics tools into executable servers—an enabling infrastructure with direct real-world utility for biomedical automation. The approach is timely (tool-augmented agents), broadly reusable across biomedical workflows, and methodologically grounded with benchmark evaluations and a scaling argument (context compression). Paper 1 is novel and potentially influential for AI evaluation/governance, but its impact depends on adoption and external validation of epistemic claims; applications are more indirect than Paper 2’s immediately deployable system.
Paper 2 addresses a critical and timely issue at the intersection of AI ethics and healthcare—the systematic auditing of ethical value pluralism in medical LLMs. Its framework for detecting value biases in clinical AI has broad implications for AI safety, regulation, and deployment policy across healthcare systems worldwide. While Paper 1 presents solid engineering advances in biomedical agent planning, Paper 2's findings about deployment monoculture risks and autonomy underweighting are likely to influence policy discussions, clinical AI governance, and responsible AI research more broadly, giving it higher cross-disciplinary impact.
BioManus addresses a fundamental scalability bottleneck in biomedical AI agents with a novel graph-based planning architecture and MCP-native design. Its formal context compression analysis, large executable ecosystem (BioinfoMCP Compiler), and paradigm-shifting approach to decoupling planning from tool inventory size have broad implications for biomedical automation. Paper 2, while valuable for LLM safety testing via formal methods, addresses a more incremental improvement in an already active area. Paper 1's potential to transform how biomedical workflows are automated gives it higher cross-disciplinary impact and real-world applicability.
Paper 1 introduces a novel, general-purpose geometric decomposition framework for understanding how prompting reshapes internal representations in LLMs/VLMs. It provides fundamental mechanistic insights applicable across all prompted models, with rigorous causal testing methodology. Its breadth—spanning multiple model architectures, modalities, and tasks—gives it wide relevance to the interpretability and alignment communities. Paper 2 is a solid engineering contribution for biomedical agents but is more domain-specific and incremental (combining MCP servers with graph-based planning). Paper 1's foundational insights into prompt mechanisms have broader and longer-lasting scientific impact.
BioManus introduces a novel architectural paradigm (MCP-native graph planning) that addresses fundamental scalability bottlenecks in biomedical agent systems, with theoretical analysis of context compression and demonstrated improvements on benchmarks. It offers both a new system design and a reusable ecosystem (BioinfoMCP Compiler). While DeskCraft is a valuable benchmark contribution for desktop GUI agents with thoughtful human-in-the-loop protocols, benchmarks typically have narrower methodological impact than new architectural paradigms. BioManus's structured capability graph approach could influence agent design across domains beyond biomedicine.
Paper 1 proposes a highly novel architectural paradigm for biomedical AI agents, directly addressing the critical bottleneck of tool scalability in biological workflows. Its potential to accelerate real-world scientific discovery gives it profound cross-disciplinary impact. While Paper 2 provides a valuable benchmark for personalized decision modeling, its focus on prediction markets is narrower in scope compared to advancing automated biomedical research.