Beyond Prompt-Based Planning: MCP-Native Graph Planning-based Biomedical Agent System

Zhangtianyi Chen, Florensia Widjaja, Wufei Dai, Xiangjun Zhang, Yuhao Shen, Juexiao Zhou

Jun 3, 2026

arXiv:2606.04494v1

cs.AI (primary)

#666 of 3404 · Artificial Intelligence

Tournament Score: 1469 ± 46 (scale: 1050 to 1800)

Win Rate: 65%

Rating: 5.8 / 10

Significance 6.5 · Rigor 5 · Novelty 5.5 · Clarity 6.5

Abstract

Biomedical agents promise to automate complex biological workflows, yet current systems face two fundamental bottlenecks: bioinformatics tools are highly heterogeneous in interfaces and execution environments, while agent planning still relies on flat prompt-retrieved tool descriptions. As biomedical software ecosystems grow, this coupling between tool coverage and context size leads to tool confusion, unstable planning, and inefficient execution. We introduce BioManus, an MCP-native biomedical agent built on graph-scaffolded planning over structured biological capabilities. BioManus first introduces the BioinfoMCP Compiler, which converts heterogeneous bioinformatics software into standardized MCP servers, yielding a large executable MCP ecosystem. It then organizes this ecosystem as a typed heterogeneous MCP graph over tools, operations, datatypes, and workflow stages. At inference time, BioManus retrieves compact task-specific subgraphs and synthesizes operation-level workflow scaffolds. This design decouples planning complexity from raw tool inventory size, achieving a context compression ratio of Θ(N/(h·m̄)) under high-recall retrieval, where N is the total tool count, h is the workflow horizon, and m̄ (much smaller than N) is the average number of candidate tools per operation. Experiments on BioAgentBench and LAB-Bench show that BioManus improves execution accuracy, workflow validity, and context efficiency over advanced biomedical agent baselines. This work suggests a paradigm shift: scalable biomedical reasoning requires structured executable capability graphs rather than ever-larger prompt-level tool retrieval.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: BioManus – MCP-Native Graph Planning-based Biomedical Agent System

1. Core Contribution

BioManus addresses two concrete bottlenecks in LLM-based biomedical agents: (1) heterogeneous bioinformatics tool interfaces that fragment execution, and (2) the scaling failure of prompt-based tool retrieval as tool ecosystems grow. The paper proposes two linked innovations:

BioinfoMCP Compiler: An automated pipeline that converts heterogeneous bioinformatics tools (CLIs, Python/R packages, containerized workflows) into standardized Model Context Protocol (MCP) servers, producing an ecosystem of 910 servers and 3,500 callable tools.

Graph-Scaffolded Planning: A typed heterogeneous capability graph over tools, operations, datatypes, and workflow stages, enabling the agent to retrieve compact task-specific subgraphs and plan at the operation level rather than over flat tool descriptions.
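The operation-level planning described above can be illustrated with a toy capability graph. This is a minimal sketch, not the BioManus implementation: the operations, stages, datatypes, operation-to-tool edges, and the `plan` and `bind` helpers are all hypothetical, chosen only to show planning over operations and datatypes first and binding tools second.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    name: str
    stage: str      # workflow stage the operation belongs to
    consumes: str   # input datatype
    produces: str   # output datatype


# Hypothetical operation nodes; none of these are taken from the paper.
OPERATIONS = [
    Operation("trim_reads", "preprocessing", "fastq", "trimmed_fastq"),
    Operation("align_reads", "alignment", "trimmed_fastq", "bam"),
    Operation("call_variants", "variant_calling", "bam", "vcf"),
    Operation("quantify_expression", "quantification", "bam", "counts"),
]

# Hypothetical operation -> candidate tool edges.
TOOLS_FOR = {
    "trim_reads": ["fastp", "trimmomatic"],
    "align_reads": ["bwa", "bowtie2"],
    "call_variants": ["bcftools", "freebayes"],
    "quantify_expression": ["featurecounts"],
}


def plan(source: str, target: str) -> list[Operation]:
    """Breadth-first search over datatypes; returns an operation chain."""
    frontier = [(source, [])]
    seen = {source}
    while frontier:
        dtype, path = frontier.pop(0)
        if dtype == target:
            return path
        for op in OPERATIONS:
            if op.consumes == dtype and op.produces not in seen:
                seen.add(op.produces)
                frontier.append((op.produces, path + [op]))
    raise ValueError(f"no operation chain from {source} to {target}")


def bind(scaffold: list[Operation]) -> dict[str, list[str]]:
    """Attach candidate tools only to operations that are in the scaffold."""
    return {op.name: TOOLS_FOR[op.name] for op in scaffold}


scaffold = plan("fastq", "vcf")
bindings = bind(scaffold)
print([op.name for op in scaffold])
print(bindings)
```

In this toy setup the planner never sees `quantify_expression` or its tool, because that operation is not on the path from `fastq` to `vcf`. The context handed to the model is limited to the operations in the scaffold and their candidate tools.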

The key conceptual insight is separating *what operations to perform* from *which specific tools to use*, reasoning over an abstract operation graph before binding to concrete executables. This yields a context compression ratio of Θ(N/(h·m̄)), where planning scales with workflow complexity rather than total tool inventory size.
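To make the ratio concrete, here is a small worked calculation. Only N = 3,500 comes from the review; the horizon h and the mean candidates per operation m̄ are hypothetical values picked for illustration, and the function treats the Θ(·) bound as an exact ratio with constant 1, which is a simplification.

```python
def compression_ratio(n_tools: int, horizon: int, mean_candidates: float) -> float:
    """Flat context size (N descriptions) divided by scaffold size (h * m_bar)."""
    if horizon <= 0 or mean_candidates <= 0:
        raise ValueError("horizon and mean_candidates must be positive")
    return n_tools / (horizon * mean_candidates)


N = 3500    # callable tools reported for the ecosystem
h = 5       # hypothetical workflow horizon (number of operations)
m_bar = 4   # hypothetical mean number of candidate tools per operation

ratio = compression_ratio(N, h, m_bar)
print(f"{N} flat descriptions vs {h * m_bar} scaffold candidates: {ratio:.0f}x")
```

Under these assumed values the scaffold exposes 20 candidate tools instead of 3,500, a 175-fold reduction. Doubling N doubles the ratio while the scaffold stays at h·m̄ entries, which is the sense in which planning cost tracks workflow complexity rather than inventory size.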

2. Methodological Rigor

Strengths in experimental design:

The paper provides a clear empirical diagnosis (Section 3.3, Figure 2) showing that Biomni's prompt-token consumption grows ~4.2× as MCP tools scale from 0 to 2,000, with increasing variance and long-tail behavior. This motivates the approach well.

Evaluation spans two complementary benchmarks: BioAgentBench (10 end-to-end bioinformatics tasks with LLM-judge scoring) and LAB-Bench (DbQA, SeqQA, CloningScenarios with exact-match accuracy).

The ablation study (Figure 5) cleanly decomposes contributions of MCP infrastructure (+3.0% on BioAgentBench) and graph-scaffolded planning (+4.2% additional), showing both components contribute meaningfully.

BioinfoMCP Compiler is validated across multiple LLM backbones (Gemini 3.1, GPT-4.1-mini, Kimi 2.6), showing near-perfect parse success rates.

Methodological concerns:

BioAgentBench evaluation uses only 10 tasks, which is statistically limited. The LLM-judge evaluation adds another layer of noise, and pass count (4/10 for BioManus vs. 4/10 for some baselines) is not strongly discriminating.

The mean score improvement on BioAgentBench (46.84% vs. 46.16% for Biomni-2k) is marginal: 0.68 percentage points. The claim of superiority rests partly on context efficiency rather than raw accuracy gains.

LAB-Bench results are stronger (90.48% on SeqQA, 81.82% on CloningScenarios), but DbQA performance (67.29%) trails Biomni-100 (75.35%), which the authors acknowledge but somewhat downplay.

The complexity analysis (Section 4.5) is straightforward asymptotic reasoning rather than empirical validation of the compression ratio's practical impact across diverse task types.

All experiments use DeepSeek-V4 as the backbone; generalization to other LLM planners is untested for the full system.
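The concern about the 10-task sample can be quantified with a binomial confidence interval. The sketch below applies the standard Wilson score interval to the 4/10 pass count mentioned above. It treats tasks as independent trials with a common pass probability, which is an assumption of this illustration and not something the paper states.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


low, high = wilson_interval(4, 10)
print(f"4/10 passes: 95% interval {low:.2f} to {high:.2f}")
```

The interval spans roughly 0.17 to 0.69, so a pass count of 4/10 is compatible with a wide range of underlying pass rates. Two systems that both pass 4 of 10 tasks cannot be separated on this metric.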

3. Potential Impact

The paper's strongest potential impact lies in infrastructure contribution: 910 MCP servers covering 3,500 bioinformatics tools across eight domains is a substantial ecosystem artifact. If released and maintained, this could become a community resource that accelerates biomedical agent research broadly.

The graph-scaffolded planning paradigm is conceptually transferable beyond biomedicine — any domain with structured tool ecosystems (materials science, chemistry, engineering simulations) could benefit from operation-level abstraction over typed capability graphs. The decoupling of planning complexity from tool inventory size addresses a genuine scaling concern that will intensify as tool ecosystems grow.

The MCP standardization layer addresses a real engineering pain point in bioinformatics: dependency conflicts, runtime incompatibilities, and cross-language execution. Docker-based packaging of MCP servers could improve reproducibility.

4. Timeliness & Relevance

The paper is highly timely. MCP has rapidly emerged as a standard for tool-LLM interaction (2024-2025), and biomedical AI agents are an active frontier. The observation that prompt-based retrieval doesn't scale is becoming widely recognized, but few systems have proposed structured alternatives. The work directly engages with very recent systems (Biomni, BioAgentBench, STELLA, PoSyMed) published in 2025-2026, positioning itself at the cutting edge.

The growing complexity of bioinformatics pipelines (Nextflow, Snakemake ecosystems) and the proliferation of single-cell, spatial, and multi-omics tools make the scalability problem increasingly urgent.

5. Strengths & Limitations

Key Strengths:

Clear problem diagnosis with empirical evidence of prompt-scaling failures

Large-scale tool ecosystem (910 servers, 3,500 tools) — significant engineering contribution

Principled graph abstraction that elegantly separates operation planning from tool binding

Clean ablation decomposing infrastructure vs. planning contributions

Compiler generalization across multiple LLM backbones

Notable Weaknesses:

Accuracy improvements are modest, particularly on BioAgentBench (< 1 point over Biomni-2k)

Very small evaluation set (10 BioAgentBench tasks); statistical significance is unclear

The paper includes ~17 pages of MCP server catalog (Appendix E) that inflates perceived contribution without adding scientific insight

Graph construction relies on LLM-based annotation of operations, datatypes, and stages — error rates and quality of this annotation are not systematically evaluated

No evaluation of graph retrieval precision/recall or failure modes

The 3,500 tools include many Perl library wrappers of questionable biological utility (e.g., perl-capture-tiny, perl-carp), potentially inflating ecosystem statistics

Reproducibility depends on access to DeepSeek-V4 API and the full MCP ecosystem, which may not be publicly available
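One gap listed above, the missing evaluation of graph retrieval precision and recall, would be simple to measure given a gold tool set per task. A minimal sketch, assuming a hypothetical gold annotation and a hypothetical retrieved subgraph; the tool sets and the function name are illustrative and do not come from the paper.

```python
def precision_recall(retrieved: set[str], gold: set[str]) -> tuple[float, float]:
    """Set-based precision and recall of retrieved tools against a gold set."""
    hits = len(retrieved & gold)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: tools in the retrieved subgraph vs. tools a
# reference workflow actually needs.
retrieved = {"fastp", "bwa", "bowtie2", "bcftools"}
gold = {"fastp", "bwa", "bcftools", "samtools"}

p, r = precision_recall(retrieved, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Recall matters most here, since the abstract states the compression ratio only under high-recall retrieval: a tool missing from the subgraph cannot be bound later, whereas an extra candidate only costs context.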

Additional Observations:

The paper's framing as a "paradigm shift" is somewhat overclaimed given the incremental accuracy improvements. The conceptual contribution (graph-over-tools planning) is sound but not deeply novel — hierarchical planning and typed workflow graphs have precedent in classical AI planning and workflow management systems. The novelty is more in the specific instantiation for MCP-native biomedical agents than in the abstract idea.

Rating: 5.8 / 10

Significance 6.5 · Rigor 5 · Novelty 5.5 · Clarity 6.5

Generated Jun 5, 2026

Comparison History (17)

vs. Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

gemini-3.1 · 6/8/2026

Paper 2 offers broader scientific impact by addressing a critical vulnerability in AI safety: evaluating the hidden (No-CoT) reasoning capabilities of frontier models. While Paper 1 presents an innovative, highly practical architecture for biomedical agents, Paper 2's findings have sweeping implications for AI alignment, model evaluation, and oversight policies across all domains. By quantifying and forecasting no-CoT capabilities against human time horizons and reasoning tokens, Paper 2 establishes a vital metric for the broader AI community, making its potential impact more fundamental and cross-disciplinary.

vs. Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillatio

gpt-5.2 · 6/6/2026

Paper 2 has higher likely impact: it targets a broadly relevant and timely problem—accurate low-bit (NVFP4) LLM deployment—affecting many domains and production systems. It contributes a clear diagnostic (KL-only QAD hides internal representational drift), links this drift to downstream reasoning/coding performance, and proposes a general, lightweight remedy (CKA-based regularization) demonstrated on multiple model families. Paper 1 is innovative but more domain-specific (biomedical tooling/MCP ecosystem) and depends on platform adoption, narrowing breadth of impact compared to quantization methods applicable across LLM deployments.

vs. Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

claude-opus-4.6 · 6/6/2026

BioManus introduces a more structurally novel paradigm shift—replacing prompt-based tool retrieval with graph-scaffolded planning via MCP-native architecture for biomedical agents. It addresses a critical infrastructure bottleneck in bioinformatics tool integration, provides formal complexity analysis (context compression), and demonstrates results on established benchmarks. While Trace2Skill offers a useful general framework for skill distillation with impressive transfer results, BioManus's domain-specific impact in biomedicine (a high-stakes field), its architectural innovation (MCP graph planning), and the creation of a reusable ecosystem (BioinfoMCP Compiler) give it broader and deeper potential impact.

vs. Strabo: Declarative Specification and Implementation of Agentic Interaction Protocols

gpt-5.2 · 6/6/2026

Paper 2 has higher potential impact due to a more novel, scalable architecture (MCP-native tool standardization + typed heterogeneous capability graphs) addressing a widely felt bottleneck in agentic tool-use. It provides clearer methodological contributions (compiler, graph formalism, retrieval/scaffolding, theoretical compression analysis) and empirical validation on established benchmarks, suggesting rigor and near-term applicability in biomedical automation. Its ideas generalize beyond biomedicine (structured capability graphs for tool ecosystems), broadening cross-field impact. Paper 1 is valuable for standards/interoperability and formal protocols, but is narrower and more incremental to existing MAS protocol work.

vs. LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

claude-opus-4.6 · 6/6/2026

BioManus introduces a paradigm shift in biomedical agent design through MCP-native graph planning, addressing fundamental scalability bottlenecks in tool heterogeneity and context management. Its structured capability graph approach has broader impact across the rapidly growing biomedical AI field, offering a generalizable architecture pattern. While LLM4Cov presents solid work in hardware verification with novel offline learning techniques, its domain is narrower. BioManus's theoretical framework (context compression), large executable ecosystem, and applicability to diverse biological workflows give it higher potential for widespread adoption and cross-field influence.

vs. Beyond Objective Equivalence: Constraint Injection for LLM-Based Optimization Modeling on Vehicle Routing Problems

claude-opus-4.6 · 6/6/2026

BioManus introduces a broader paradigm shift in how biomedical agents handle tool ecosystems through graph-scaffolded planning and MCP standardization, with potential impact across all of biomedical research automation. Its architectural innovation (decoupling planning from tool inventory size) addresses a fundamental scalability bottleneck affecting many agent systems beyond biomedicine. Paper 2, while technically strong with impressive results on VRP benchmarks, addresses a narrower problem (constraint verification for OR modeling) with more limited cross-domain applicability. BioManus's ecosystem-level contribution and generalizable design principles suggest wider and longer-lasting scientific influence.

vs. Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts

claude-opus-4.6 · 6/5/2026

RHO addresses a fundamental, domain-general challenge in AI agent optimization—improving agent performance without ground-truth labels—making it broadly applicable across software engineering, technical work, and knowledge work. Its self-supervised approach (self-validation, self-consistency, self-preference) is highly novel and practical for real-world deployment where labeled data is scarce. The dramatic improvement on SWE-Bench Pro (59%→78%) without external grading is compelling. While BioManus makes strong contributions to biomedical agent planning with its MCP graph architecture, its impact is more domain-specific. RHO's generality and practicality give it broader potential impact across the AI agent ecosystem.

vs. Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving

claude-opus-4.6 · 6/5/2026

BioManus introduces a novel architectural paradigm (MCP-native graph planning) that addresses fundamental scalability bottlenecks in biomedical agent systems, with broad implications across bioinformatics and AI agent design. Its contribution—decoupling planning complexity from tool inventory via structured capability graphs—represents a conceptual advance applicable beyond biomedicine. While Drive-KD achieves impressive compression results, multi-teacher knowledge distillation is more incremental. BioManus's creation of a standardized MCP ecosystem and formal complexity analysis (context compression ratio) provides deeper theoretical and practical foundations with wider cross-field impact.

vs. More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding

gemini-3.1 · 6/5/2026

Paper 1 challenges a fundamental assumption in general LLM agent design ('more is better') and demonstrates broad applicability across domains. Its rigorous methodology, including full factorial experiments and exact Shapley values, provides deep theoretical and practical insights into cross-component interference. While Paper 2 presents an innovative and rigorous solution for biomedical workflows, its immediate impact is largely domain-specific. Paper 1's findings will influence the design and evaluation of virtually all future LLM agent architectures, giving it a much wider breadth of impact and higher overall scientific significance.

vs. A Multi-AI-agent Framework Enabling End-to-end Finite Element Analysis for Solid Mechanics Problems

gpt-5.2 · 6/5/2026

Paper 2 has higher potential impact due to a more general, scalable methodological contribution: compiling heterogeneous bioinformatics tools into standardized MCP servers and enabling graph-scaffolded planning with formal context-compression properties. This addresses a central bottleneck in agentic biomedical automation (tool heterogeneity and planning instability) and can transfer to other tool-rich scientific domains. Its evaluation on established benchmarks suggests rigor and relevance. Paper 1 is valuable and applied, but is more domain- and vendor-specific (Abaqus/solid mechanics) and the multi-agent LLM interface pattern is less novel and broadly extensible than the structured capability-graph paradigm.

vs. BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

gemini-3.1 · 6/5/2026

Paper 1 introduces a highly novel approach to agent planning by replacing prompt-based tool retrieval with a structured capability graph using the new MCP standard. This addresses a critical bottleneck in scaling AI agents for complex, real-world scientific workflows. Paper 2, while addressing an important issue (bias mitigation), represents an incremental application of an existing RL algorithm (GRPO) to alignment, making Paper 1's paradigm shift in agent architecture more impactful across both AI and bioinformatics fields.

vs. Emergent Collaborative Deliberation in Multi-Model AI Systems: A BFT-Derived Protocol for Epistemic Synthesis

gpt-5.2 · 6/5/2026

Paper 2 likely has higher scientific impact: it introduces an MCP-native, graph-based planning architecture plus a compiler that standardizes heterogeneous bioinformatics tools into executable servers—an enabling infrastructure with direct real-world utility for biomedical automation. The approach is timely (tool-augmented agents), broadly reusable across biomedical workflows, and methodologically grounded with benchmark evaluations and a scaling argument (context compression). Paper 1 is novel and potentially influential for AI evaluation/governance, but its impact depends on adoption and external validation of epistemic claims; applications are more indirect than Paper 2’s immediately deployable system.

vs. What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models

claude-opus-4.6 · 6/5/2026

Paper 2 addresses a critical and timely issue at the intersection of AI ethics and healthcare—the systematic auditing of ethical value pluralism in medical LLMs. Its framework for detecting value biases in clinical AI has broad implications for AI safety, regulation, and deployment policy across healthcare systems worldwide. While Paper 1 presents solid engineering advances in biomedical agent planning, Paper 2's findings about deployment monoculture risks and autonomy underweighting are likely to influence policy discussions, clinical AI governance, and responsible AI research more broadly, giving it higher cross-disciplinary impact.

vs. Inverting the Shield: Systematically Generating Safety Tests from Policy Specifications

claude-opus-4.6 · 6/5/2026

BioManus addresses a fundamental scalability bottleneck in biomedical AI agents with a novel graph-based planning architecture and MCP-native design. Its formal context compression analysis, large executable ecosystem (BioinfoMCP Compiler), and paradigm-shifting approach to decoupling planning from tool inventory size have broad implications for biomedical automation. Paper 2, while valuable for LLM safety testing via formal methods, addresses a more incremental improvement in an already active area. Paper 1's potential to transform how biomedical workflows are automated gives it higher cross-disciplinary impact and real-world applicability.

vs. Decomposing how prompting steers behavior

claude-opus-4.6 · 6/5/2026

Paper 1 introduces a novel, general-purpose geometric decomposition framework for understanding how prompting reshapes internal representations in LLMs/VLMs. It provides fundamental mechanistic insights applicable across all prompted models, with rigorous causal testing methodology. Its breadth—spanning multiple model architectures, modalities, and tasks—gives it wide relevance to the interpretability and alignment communities. Paper 2 is a solid engineering contribution for biomedical agents but is more domain-specific and incremental (combining MCP servers with graph-based planning). Paper 1's foundational insights into prompt mechanisms have broader and longer-lasting scientific impact.

vs. DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

claude-opus-4.6 · 6/5/2026

BioManus introduces a novel architectural paradigm (MCP-native graph planning) that addresses fundamental scalability bottlenecks in biomedical agent systems, with theoretical analysis of context compression and demonstrated improvements on benchmarks. It offers both a new system design and a reusable ecosystem (BioinfoMCP Compiler). While DeskCraft is a valuable benchmark contribution for desktop GUI agents with thoughtful human-in-the-loop protocols, benchmarks typically have narrower methodological impact than new architectural paradigms. BioManus's structured capability graph approach could influence agent design across domains beyond biomedicine.

vs. BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces

gemini-3.1 · 6/5/2026

Paper 1 proposes a highly novel architectural paradigm for biomedical AI agents, directly addressing the critical bottleneck of tool scalability in biological workflows. Its potential to accelerate real-world scientific discovery gives it profound cross-disciplinary impact. While Paper 2 provides a valuable benchmark for personalized decision modeling, its focus on prediction markets is narrower in scope compared to advancing automated biomedical research.