MCPHunt: An Evaluation Framework for Cross-Boundary Data Propagation in Multi-Server MCP Agents

Haonan Li, Tianjun Sun, Yongqing Wang, Qisheng Zhang

Apr 30, 2026

arXiv:2604.27819v1

cs.AI (primary)

#157 of 2292 · Artificial Intelligence

Tournament Score: 1529 ± 31 (scale 1050 to 1800)

Win Rate: 60%

Rating: 7/10

Significance 7.5 · Rigor 7.5 · Novelty 7 · Clarity 8

Abstract

Multi-server MCP agents create an information-flow control problem: faithful tool composition can turn individually benign read/write permissions into cross-boundary credential propagation -- a structural side effect of workflow topology, not necessarily malicious model behavior. We present MCPHunt, to our knowledge the first controlled benchmark that isolates non-adversarial, verbatim credential propagation across multi-server MCP trust boundaries, with three methodological contributions: (1) canary-based taint tracking that reduces propagation detection to objective string matching; (2) an environment-controlled coverage design with risky, benign, and hard-negative conditions that validates pipeline soundness and controls for credential-format confounds; (3) CRS stratification that disentangles task-mandated propagation (faithful execution of verbatim-transfer instructions) from policy-violating propagation (credentials included despite the option to redact). Across 3,615 main-benchmark traces from 5 models spanning 147 tasks and 9 mechanism families, policy-violating propagation rates reach 11.5--41.3% across all models. This propagation is pathway-specific (25x cross-mechanism range) and concentrated in browser-mediated data flows; hard-negative controls provide evidence that production-format credentials are not necessary -- prompt-directed cross-boundary data flow is sufficient. A prompt-mitigation study across 3 models reduces policy-violating propagation by up to 97% while preserving 80.5% utility, but effectiveness varies with instruction-following capability -- suggesting that prompt-level defenses alone may not suffice. Code, traces, and labeling pipeline are released under MIT and CC BY 4.0.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: MCPHunt

1. Core Contribution

MCPHunt addresses a genuinely novel problem space: non-adversarial credential propagation across trust boundaries in multi-server MCP agent deployments. Unlike the growing body of adversarial MCP security work (jailbreaks, prompt injection, malicious servers), this paper isolates a structural information-flow control problem — credentials leak not because the model is tricked, but because faithful tool composition inherently routes sensitive data across boundaries. This is a meaningful conceptual contribution. The distinction between *task-mandated* propagation (where verbatim transfer is instructed) and *policy-violating* propagation (where credentials appear despite the option to redact) is well-motivated and operationalized through CRS stratification. The canary-based taint tracking reduces a subjective judgment problem to objective string matching, which is a clean methodological choice.
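The string-matching idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `ToolCall` record, the server names, the canary format, and the rule that a hit counts only when the canary appears verbatim in arguments sent to a server other than the one it was planted in. It is not MCPHunt's released labeling pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation in an agent trace (hypothetical schema)."""
    server: str     # MCP server that received the call
    arguments: str  # serialized arguments the agent sent


def find_propagations(trace, canaries):
    """Return (canary, source, destination) for each cross-boundary hit.

    `canaries` maps a unique canary string to the server it was planted in.
    A hit is a verbatim occurrence of the canary in arguments sent to any
    other server; paraphrased or encoded copies are not detected.
    """
    hits = []
    for call in trace:
        for canary, source in canaries.items():
            if call.server != source and canary in call.arguments:
                hits.append((canary, source, call.server))
    return hits


# Hypothetical canary planted in a filesystem server.
canaries = {"CANARY-7f3a9c21": "filesystem"}

# Risky environment: the agent reads the secret, then submits it via the browser.
risky_trace = [
    ToolCall("filesystem", '{"path": "config.env"}'),
    ToolCall("browser", '{"form": "token=CANARY-7f3a9c21"}'),
]

# Benign environment: no canary is ever sent, so no hit should be reported.
benign_trace = [
    ToolCall("filesystem", '{"path": "notes.txt"}'),
    ToolCall("browser", '{"form": "summary=ok"}'),
]

print(find_propagations(risky_trace, canaries))
print(find_propagations(benign_trace, canaries))
```

Because each canary is a unique string, a hit needs no human judgment, and a benign trace that never carries the canary cannot produce a false positive under this rule.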

2. Methodological Rigor

The experimental design is notably careful for a benchmark paper:

Strengths in design: The three-condition environment control (risky/benign/hard-negative) is well-constructed. The benign environments produce 0% false positives, which validates pipeline soundness. The hard-negative condition is particularly clever: it demonstrates that production-format credentials are not necessary for propagation, ruling out an important confound. The 2×2 comparison (Table 9) crossing task type with credential format is a rigorous control.

Statistical approach: Wilson confidence intervals, GEE logistic regression clustered by task, Fisher's exact tests with Bonferroni correction discussion, and Cohen's κ for inter-annotator reliability (0.89) all demonstrate statistical conscientiousness. The deviance decomposition showing mechanism family accounts for 62% of pseudo-R² versus 32% for model identity is a key quantitative finding.
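For readers unfamiliar with the agreement statistic, Cohen's κ compares observed agreement between two annotators with the agreement expected by chance. The sketch below uses made-up labels (`"crs"` and `"mandated"` are hypothetical label names, and the ten items are not the paper's annotation data), so its output is unrelated to the reported 0.89.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical annotations: the annotators disagree on one of ten tasks.
annotator_1 = ["crs"] * 6 + ["mandated"] * 4
annotator_2 = ["crs"] * 5 + ["mandated"] * 5

print(round(cohens_kappa(annotator_1, annotator_2), 2))
```

In this toy case, observed agreement is 0.9 and chance agreement is 0.5, so κ is about 0.8.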

Potential concerns: (1) 147 tasks are researcher-designed, raising ecological validity questions — real enterprise workflows may differ substantially. (2) Per-mechanism sample sizes (n=39) yield wide confidence intervals (±15pp at extremes). (3) The CRS annotation, while showing high inter-annotator agreement, was performed by only two annotators, and boundary cases (4 disagreements) all resolved toward CRS, potentially inflating the policy-violating rate denominator. (4) The "post-hoc simulation" of the taint guard is acknowledged as a limitation but presented alongside empirical results, which could mislead readers about its practical effectiveness.
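The sample-size concern in point (2) can be checked with a short Wilson score interval calculation. This is a generic sketch of the standard formula, not code from the paper; the success counts are hypothetical, and only n = 39 comes from the text above.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts out of n = 39 traces per mechanism family.
for k in (2, 20, 37):
    low, high = wilson_interval(k, 39)
    print(f"{k}/39: [{low:.3f}, {high:.3f}], half-width {(high - low) / 2:.3f}")
```

With n = 39 the half-width is close to 15 percentage points when the observed rate is near 50%, and narrower when the rate is close to 0 or 1.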

3. Potential Impact

Immediate practical relevance: With MCP adoption at 10,000+ servers and 97M monthly SDK downloads, this work addresses a real deployment concern. The finding that policy-violating propagation rates reach 11.5–41.3% across models — with browser-mediated paths dominating — provides actionable guidance for practitioners deploying multi-server MCP configurations.

Framework reusability: The canary-based approach is model-agnostic and extensible. The released code, 6,321 traces, and labeling pipeline under MIT/CC BY 4.0 lower the barrier for follow-up work. The mechanism taxonomy (9 families) provides a shared vocabulary for the community.

Influence on MCP specification: The paper explicitly notes that the MCP 2026 roadmap does not address structural propagation. Empirical evidence of 13.5% policy-violating rates on GPT-5.4 (and up to 41.3% on MiniMax-M2.7) could influence protocol-level design decisions, particularly around data-flow-aware orchestration.

Limitations of impact scope: The work is narrowly focused on verbatim propagation — paraphrased or semantically equivalent leakage is explicitly out of scope. The synthetic task design may not capture the complexity of real enterprise workflows where credential boundaries are more ambiguous.

4. Timeliness & Relevance

This paper is well-timed. MCP is experiencing rapid adoption, real-world incidents have already demonstrated security risks (Asana cross-tenant flaw, GitHub MCP prompt injection), and the community is actively debating safety standards. However, the existing incident reports involve adversarial manipulation — this paper's contribution of showing that *non-adversarial, faithful execution* creates risks is a timely and important complement.

The mitigation analysis showing that prompt-level defenses vary with instruction-following capability (97% reduction for GPT-5.4 vs. 47% for MiniMax-M2.7) is particularly relevant as organizations evaluate which models to deploy in MCP configurations.

5. Strengths & Limitations

Key Strengths:

Clean problem formulation distinguishing compositional propagation from adversarial attacks

Rigorous three-condition experimental design with appropriate controls

The CRS stratification is a genuine methodological contribution that prevents conflation of fundamentally different phenomena

Comprehensive cross-model evaluation (5 models, 3,615 traces) with consistent findings

Strong reproducibility infrastructure (versioned schemas, offline relabeling, checkpoint/resume)

The finding that propagation is pathway-specific (25× cross-mechanism range) rather than model-specific is structurally important

Notable Weaknesses:

Ecological validity: all tasks are synthetic; no validation against real enterprise workflows

Verbatim-only detection misses the potentially more dangerous case of semantically equivalent propagation

Limited server coverage (8 servers; cloud storage, email, CI/CD untested)

The taint guard evaluation is simulated post-hoc, not tested in a live setting where it might alter agent planning

The paper's institutional affiliation (geophysics department) is unusual for this research area, which doesn't affect quality but may affect credibility perception

Some mechanism families have very small effective sample sizes for statistical claims

Minor observations: The paper's positioning table (Table 1) effectively differentiates from prior work. The browser_to_local finding — that HTML's lack of column-level granularity makes selective extraction structurally harder — is an insight with practical design implications. The finding that stronger utility doesn't imply lower propagation (MiniMax: 92.2% utility, 41.3% policy-violating) challenges naive assumptions about model capability and safety.

Overall Assessment

MCPHunt makes a clear, well-executed contribution to an emerging and practically important problem. The methodology is sound, the controls are appropriate, and the findings are actionable. The primary limitation is ecological validity — the gap between synthetic benchmarks and real-world enterprise MCP deployments remains unvalidated. The work would benefit from larger per-mechanism samples and validation on production task logs.

Rating: 7/10

Significance 7.5 · Rigor 7.5 · Novelty 7 · Clarity 8

Generated May 1, 2026

Comparison History (48)

vs. Agentic Discovery of Exchange-Correlation Density Functionals

gemini-3.1 · 5/16/2026

Paper 1 tackles a fundamental and pervasive problem in physics and chemistry (DFT), achieving significant improvements over gold-standard baselines using novel AI agents. Its breakthrough in automated scientific discovery has broad implications for materials science, chemistry, and physics. In contrast, Paper 2 focuses on a narrower, albeit important, security evaluation framework for specific AI agent architectures, which has less fundamental scientific breadth.

vs. SafeAgent: A Runtime Protection Architecture for Agentic Systems

gemini-3 · 5/5/2026

Paper 1 introduces a novel evaluation framework and benchmark for a highly timely and critical issue (cross-boundary data propagation in multi-server MCP agents). Its extensive empirical study, canary-based methodology, and open-source release offer foundational impact for future agent security research, whereas Paper 2 addresses the well-studied prompt injection problem using a more traditional runtime architecture evaluated on existing benchmarks.

vs. SafeAgent: A Runtime Protection Architecture for Agentic Systems

gemini-3 · 5/5/2026

Paper 2 introduces a foundational evaluation framework for a novel, emerging structural vulnerability (cross-boundary data propagation in MCP agents). Its rigorous methodological design, large-scale empirical evaluation across 3,615 traces, and open-source release position it to catalyze further research. While Paper 1 offers a solid defensive architecture, Paper 2 establishes a critical benchmark in a new threat space. Benchmarks that expose systemic vulnerabilities typically drive broader long-term scientific impact by defining the metrics and challenges for future defensive systems.

vs. Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

gemini-3 · 5/5/2026

Paper 2 fundamentally challenges standard AI evaluation metrics (human agreement) in rule-governed environments, offering a widely applicable paradigm shift towards policy-grounded correctness and defensibility. Its large-scale validation on content moderation demonstrates significant real-world utility. Paper 1, while methodologically rigorous, addresses a more niche security vulnerability in multi-server MCP agents, giving Paper 2 a broader potential impact across AI evaluation, alignment, and governance fields.

vs. FinGround: Detecting and Grounding Financial Hallucinations via Atomic Claim Verification

gemini-3 · 5/5/2026

Paper 2 tackles a highly novel and critical security issue (cross-boundary data propagation) in emerging multi-server LLM agents using the recent Model Context Protocol (MCP). Its broad applicability across all agentic AI deployments and rigorous canary-based taint tracking framework offer wider scientific and practical impact than Paper 1, which focuses on a domain-specific (financial) refinement of existing RAG hallucination detection methods.

vs. FinGround: Detecting and Grounding Financial Hallucinations via Atomic Claim Verification

gemini-3 · 5/5/2026

Paper 1 addresses a highly novel and timely security issue (cross-boundary data propagation) in newly emerging multi-server AI agent architectures (MCP). Its fundamental approach to agentic information-flow control offers broader impact across all agent deployments, whereas Paper 2, while highly practical and rigorous, focuses on a domain-specific application (financial RAG) of a well-known problem (hallucination).

vs. Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

gemini-3 · 5/5/2026

Paper 2 challenges a fundamental and pervasive assumption in AI evaluation—that models should be judged solely on human agreement—and introduces a novel, policy-grounded framework. Its application to content moderation and AI governance offers broader impact across multiple disciplines compared to Paper 1, which focuses on a specific security vulnerability within a more niche architecture (MCP agents). Furthermore, Paper 2's massive scale of validation (193,000+ decisions) demonstrates exceptional methodological rigor and immediate real-world applicability for scaling rule-governed AI systems.

vs. Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction

gpt-5.2 · 5/5/2026

Paper 2 likely has higher scientific impact due to its novelty and timeliness in addressing a concrete, cross-field security problem (information-flow control in multi-server agent/tool ecosystems). MCPHunt introduces a rigorous, controlled evaluation methodology (taint tracking, hard negatives, stratification) with clear, actionable findings and mitigation evidence, enabling reproducible auditing across models and infrastructures. Its implications span ML, systems, security, and policy/compliance. Paper 1 is a strong applied multi-agent search system with impressive benchmarks, but it is closer to incremental agentic orchestration advances and may have narrower cross-disciplinary impact.

vs. To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

claude-opus-4.6 · 5/5/2026

Paper 1 addresses a fundamental and broadly applicable problem in agentic AI—optimizing when LLMs should invoke tools—with a principled decision-theoretic framework, empirical analysis across multiple models/tasks, and practical lightweight estimators. This has wide applicability across all tool-augmented LLM systems. Paper 2, while timely and important for MCP security, addresses a narrower problem (credential propagation in multi-server MCP setups) in a still-emerging ecosystem. Paper 1's broader scope, methodological rigor, and practical utility for the rapidly growing tool-use paradigm give it higher potential impact.

vs. KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

gemini-3 · 5/1/2026

Paper 1 pioneers a novel security evaluation framework for a highly anticipated emerging architecture (Multi-Server MCP agents), identifying critical structural vulnerabilities in cross-boundary data flows. Its focus on non-adversarial credential leakage addresses an urgent, underexplored safety risk in agentic systems. While Paper 2 offers a strong methodological improvement for LLM reasoning via RL, it operates in a highly saturated domain. Paper 1's foundational insights into multi-agent workflow security will likely drive broader architectural shifts across the rapidly growing AI agent ecosystem.

vs. KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

claude-opus-4.6 · 5/1/2026

MCPHunt addresses a novel and timely security problem in multi-server MCP agents—cross-boundary credential propagation—which is increasingly relevant as LLM-based tool-use agents proliferate. It introduces a first-of-its-kind benchmark with rigorous methodology (canary-based taint tracking, controlled experimental design, stratified analysis). The findings have immediate practical implications for AI safety and system security. While KnowRL offers solid incremental improvements to RL-based LLM reasoning training, MCPHunt opens a new research direction at the intersection of AI agents and security, likely inspiring broader follow-up work across multiple communities.

vs. Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression

claude-opus-4.6 · 5/1/2026

MCPHunt addresses a novel, timely security problem in multi-server MCP agents—an emerging and rapidly adopted paradigm. It introduces the first controlled benchmark for cross-boundary credential propagation with rigorous methodology (canary taint tracking, hard-negative controls, CRS stratification). The breadth of evaluation (3,615 traces, 5 models, 147 tasks) and the practical security implications for AI agent deployment give it high real-world relevance. Paper 2, while solid, tackles the more established problem of knowledge editing with incremental improvements. MCPHunt opens a new research direction in AI agent security with broader cross-field impact.

vs. Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression

gemini-3 · 5/1/2026

Paper 1 addresses a highly timely and critical security issue in the emerging multi-server Model Context Protocol (MCP) ecosystem. By providing the first controlled benchmark for cross-boundary data propagation, it offers a novel methodological framework with immediate, significant real-world implications for agentic AI security. Paper 2 tackles the important but crowded field of lifelong knowledge editing with an approach that appears more incremental compared to the foundational security framework introduced in Paper 1.

vs. TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

claude-opus-4.6 · 5/1/2026

MCPHunt addresses a timely and critical security problem in the rapidly expanding MCP agent ecosystem, introducing a novel evaluation framework with rigorous methodology (canary-based taint tracking, controlled experimental design, CRS stratification). It identifies a concrete, previously uncharacterized vulnerability class—cross-boundary credential propagation—with broad implications for AI safety and deployment. While TREX is a solid engineering contribution for automating LLM fine-tuning, MCPHunt's findings have more immediate, cross-cutting impact on AI security policy, agent design, and trust boundary enforcement across the field.

vs. TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

gpt-5.2 · 5/1/2026

Paper 1 (MCPHunt) has higher impact potential due to its clearer novelty (a controlled benchmark isolating non-adversarial cross-boundary credential propagation), strong methodological rigor (canary-based taint tracking, hard negatives, stratification), and immediate real-world relevance for secure deployment of multi-tool/agent systems. It releases code/traces and quantifies pathway-specific risks with mitigation tradeoffs, enabling reproducible follow-on work across security, AI agents, and governance. Paper 2 is timely and useful, but similar “agentic AutoML/fine-tuning” systems are crowded and its benchmark scope (10 tasks) suggests narrower, less foundational impact.

vs. A Pattern Language for Resilient Visual Agents

gpt-5.2 · 5/1/2026

Paper 1 has higher likely scientific impact: it introduces a concrete, novel evaluation framework with controlled benchmark design, objective canary-based detection, large-scale empirical results (3,615 traces; 147 tasks; multiple models/mechanisms), and released artifacts enabling replication and follow-on work. The problem—cross-boundary credential propagation in multi-server agent toolchains—is timely and security-critical with clear real-world implications for enterprise deployments. Paper 2 offers useful architectural guidance, but as a pattern language with limited methodological/empirical validation described, its scientific contribution and measurable impact are likely narrower.

vs. D3-Gym: Constructing Real-World Verifiable Environments for Data-Driven Discovery

gemini-3 · 5/1/2026

Paper 2 presents a verifiable environment for scientific data-driven discovery, directly advancing the broad and high-impact field of AI for Science. By providing 565 executable tasks across four disciplines and demonstrating that training on this data substantially improves model performance, it offers foundational infrastructure for future research. While Paper 1 addresses an important security niche in agent architectures, Paper 2's potential to accelerate diverse scientific discoveries utilizing LLM agents gives it a wider and more profound cross-disciplinary scientific impact.

vs. Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems

gemini-3 · 5/1/2026

Paper 1 addresses a highly novel and critical security challenge in the rapidly expanding field of LLM agents: cross-boundary data propagation and credential leakage. Its rigorous benchmarking approach, extensive empirical evaluation, and focus on fundamental information-flow vulnerabilities offer broad implications for AI safety and multi-agent system design. In contrast, Paper 2 focuses on a more applied, engineering-oriented problem of production monitoring for Text-to-SQL systems, which, while practical, has a narrower scientific scope and relies on less fundamentally innovative methodologies.

vs. KellyBench: A Benchmark for Long-Horizon Sequential Decision Making

gpt-5.2 · 5/1/2026

Paper 1 has higher impact potential: it targets a timely, safety-critical problem (credential/data leakage across multi-agent tool boundaries) with clear real-world deployment relevance. It offers methodological innovations (canary tainting, controlled coverage with hard negatives, stratified labeling separating mandated vs policy-violating propagation) and reports large-scale empirical results with open code/traces, supporting rigor and reproducibility. Its findings generalize across agent architectures and security/compliance domains. Paper 2 is a useful benchmark but is narrower (sports betting), relies partly on subjective rubrics, and has more limited cross-field applicability despite relevance to long-horizon decision-making.

vs. Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering

claude-opus-4.6 · 5/1/2026

Paper 2 addresses a novel and timely security concern in the rapidly emerging MCP (Model Context Protocol) multi-server agent ecosystem. It introduces the first benchmark for cross-boundary credential propagation, a critical safety issue as agentic AI systems are deployed in production. The problem is fundamental to AI safety/security and has broad implications across the field. Paper 1, while solid, represents an incremental improvement in medical RAG with modest gains (+5.8 points). Paper 2's novelty, timeliness given the explosion of MCP adoption, and relevance to AI safety give it higher potential impact.