MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

Wenhao Wang, Peizhi Niu, Gongyi Zou, Xiyuan Yang, Jingxing Wang, Haoting Shi, Yaxin Du, Jingyi Chai

#2755 of 3355 · Artificial Intelligence
Tournament Score: 1313±43
Win Rate: 20% (4 wins, 16 losses, 20 matches)
Rating: 6.2/10

Abstract

The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks predominantly focus on generic information-seeking tools and fail to capture the practical challenges posed by personal social applications, where tools interact with individual accounts or local databases. To bridge this critical gap, we introduce MCP-Persona, the first benchmark specifically designed for evaluating agent performance on real-world, personalized MCP tools. MCP-Persona encompasses a diverse set of widely used applications, ranging from social media platforms like Reddit and Xiaohongshu (Rednote) to enterprise collaboration suites such as Lark (Feishu) and Slack. Our extensive experiments on various state-of-the-art (SOTA) agents demonstrate their significant struggles with personalized tool use, thereby highlighting the benchmark's crucial role in identifying and addressing these limitations. MCP-Persona is publicly available at https://github.com/wwh0411/MCP-Persona.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: MCP-Persona

1. Core Contribution

MCP-Persona introduces the first benchmark specifically targeting LLM agent evaluation on personalized, account-bound MCP (Model Context Protocol) tools. The key insight is that existing benchmarks focus on generic, stateless information-seeking tools, while real-world agent deployment increasingly involves personal applications (social media, enterprise collaboration, email) where tools are tightly coupled to user accounts, preferences, and historical state.

The paper contributes three methodological innovations: (1) Tool-Traverse, a traverse-then-simulate paradigm that probes real MCP servers to capture authentic behavioral patterns (including error modes) and synthesizes executable Python simulators; (2) Context-Tree, a hierarchical tree structure for representing user profiles and application state; and (3) Persona-Gen, a pipeline for generating personalized tasks with deliberate instruction fuzzification to mimic real user ambiguity.
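
As a rough illustration of what a hierarchical context representation such as Context-Tree might look like, here is a minimal Python sketch. The class name `ContextNode`, its fields, and the example profile are all hypothetical; none of them are taken from the paper or its repository.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ContextNode:
    """One node of a hypothetical context tree (illustrative only)."""
    name: str
    value: Optional[Any] = None
    children: Dict[str, "ContextNode"] = field(default_factory=dict)

    def add(self, child: "ContextNode") -> "ContextNode":
        # Attach a child under its name and return it so calls can be chained.
        self.children[child.name] = child
        return child

    def find(self, path: List[str]) -> Optional["ContextNode"]:
        # Walk the tree along `path`; return None if any segment is missing.
        node = self
        for key in path:
            node = node.children.get(key)
            if node is None:
                return None
        return node


# Made-up example: a user profile with per-application state beneath it.
root = ContextNode("user")
profile = root.add(ContextNode("profile"))
profile.add(ContextNode("language", "en"))
apps = root.add(ContextNode("apps"))
slack = apps.add(ContextNode("slack"))
slack.add(ContextNode("channels", ["general", "research"]))

channels = root.find(["apps", "slack", "channels"])
missing = root.find(["apps", "reddit"])
```

In this sketch, a lookup such as `root.find(["apps", "slack", "channels"])` returns the node holding the channel list, while a path into an application that was never populated returns `None`.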

The benchmark covers 12 personalized MCP servers across social media (Reddit, Xiaohongshu, Instagram), collaboration platforms (Lark, Slack), content management (Notion, Obsidian), and email, with 173 human-verified tasks.

2. Methodological Rigor

The methodology is generally well-structured, with clear strengths alongside several concerns:

Strengths in rigor:

  • The Tool-Traverse approach is validated quantitatively (Table 4), showing 94% accuracy and 93.8% F1 compared to 58%/53.3% for documentation-only baselines on Lark. The use of context reconstruction to eliminate confounding variables in simulation fidelity evaluation is thoughtful.
  • The adversarial failure induction for FC pool augmentation covers a reasonable taxonomy (type mismatches, schema violations, boundary conditions, semantic conflicts).
  • Human-LLM correlation analysis (Table 7) shows 91.5% alignment across 970 checkpoints, lending credibility to the automated evaluation.

Concerns:

  • The benchmark size (173 tasks) is relatively small. While each task is human-verified and complex, statistical significance of performance differences between models is questionable at this scale.
  • Simulation fidelity is only validated on Lark (one of 12 servers). It is unclear how well Tool-Traverse generalizes to servers with fundamentally different architectures.
  • The reliance on LLM judges for both checkpoint evaluation and simulation code generation introduces circular dependencies that are not fully addressed.
  • The paper does not provide inter-annotator agreement metrics for human verification, despite the complexity of the annotation task.
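
To make the scale concern concrete, the sketch below computes a normal-approximation 95% confidence interval for an accuracy measured on 173 tasks. The 50% accuracy is an illustrative value, not a figure from the paper, and the function name is hypothetical.

```python
import math


def accuracy_ci_half_width(accuracy: float, n_tasks: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_tasks)


# Illustrative value only: an agent scoring 50% on a 173-task benchmark.
half_width = accuracy_ci_half_width(0.50, 173)
print(f"95% CI: 50.0% +/- {100 * half_width:.1f} points")
```

Under this approximation the interval is roughly ±7.5 percentage points, so two models whose scores differ by only a few points would not be clearly separated at this benchmark size.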

3. Potential Impact

    Practical relevance: The benchmark addresses a genuine pain point. With MCP adoption accelerating (Anthropic's Skills, OpenClaw ecosystem), evaluating agents on personalized tools is increasingly important. The finding that even GPT-5 achieves <50% accuracy on these tasks (Table 3) is a striking result that motivates further research.

    Simulation paradigm: The Tool-Traverse approach could be broadly applicable beyond this benchmark — any scenario requiring safe, reproducible testing of account-bound APIs could benefit. The code-as-simulation paradigm is particularly valuable for privacy-preserving evaluation.

    Identified failure modes: The three failure archetypes (under-exploration of environment, skipping dependent steps, over-long context degradation) provide actionable insights for agent developers.

    Limitations on impact: The benchmark covers primarily Chinese and English applications, with a notable skew toward Chinese platforms (Xiaohongshu, Lark, WeCom, Amap, Baidu Maps). This limits global applicability. The 173-task scale may be insufficient for fine-grained model comparison or training data purposes.

    4. Timeliness & Relevance

    This paper is highly timely. MCP was introduced by Anthropic in late 2024 and has seen explosive adoption in 2025. The paper correctly identifies that the evaluation infrastructure has not kept pace with deployment. The inclusion of very recent models (GPT-5, Claude-Sonnet-4.5, Grok-4, Claude-Opus-4.1) and frameworks (OpenClaw) demonstrates strong awareness of the rapidly evolving landscape.

    The focus on personalization aligns with the broader industry trend toward on-device, user-specific AI agents (Apple Intelligence, Doubao Phone). However, this rapid pace also poses a risk: the benchmark may become outdated quickly as MCP servers evolve.

    5. Strengths & Limitations

    Key Strengths:

  • Novel and well-motivated problem framing: Clearly articulates why personalized tool evaluation differs from generic tool benchmarks.
  • End-to-end pipeline: The Tool-Traverse → Context-Tree → Persona-Gen pipeline is cohesive and addresses the full evaluation lifecycle.
  • Privacy preservation: The simulation approach elegantly sidesteps the privacy concerns of sharing real user data.
  • Comprehensive model evaluation: Testing 13+ models including the latest proprietary and open-source options provides broad coverage.
  • Dual evaluation metrics: Both checkpoint-based and execution-based evaluations capture different aspects of agent performance.
  • Skills ablation (Table 5) provides practical guidance on the value of operational documentation.

    Notable Weaknesses:

  • Scale: 173 tasks across 12 servers is modest. Some servers have very few tasks (Reddit: 2.31%, Instagram: 3.47%), making per-server conclusions unreliable.
  • Limited simulation validation: Only Lark is validated for simulation fidelity. Social media simulators (Instagram, Reddit) may have very different error characteristics.
  • Reproducibility concerns: While the code is released, the reliance on specific API behaviors (which change frequently) and LLM-generated simulation code creates fragility.
  • No training signal: The benchmark is evaluation-only; it doesn't provide a path toward improving agent performance beyond the skills ablation.
  • Context-Tree generation involves substantial manual effort (hierarchy definition, content sourcing), which limits scalability to new applications.
  • Comparison to prior work (Table 1) is somewhat superficial — binary checkmarks don't capture the depth of comparison needed (e.g., AppWorld has 457 tasks with richer evaluation).
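
The per-server shares quoted in the Scale item can be converted back to approximate task counts. This sketch assumes the percentages are fractions of the 173 total tasks, rounded to two decimals; the function name is hypothetical.

```python
def tasks_from_share(share_percent: float, total_tasks: int = 173) -> int:
    """Recover an approximate task count from a percentage share of the benchmark."""
    return round(share_percent / 100.0 * total_tasks)


reddit_tasks = tasks_from_share(2.31)     # 2.31% of 173
instagram_tasks = tasks_from_share(3.47)  # 3.47% of 173
print(reddit_tasks, instagram_tasks)
```

Under that assumption, Reddit accounts for about 4 tasks and Instagram for about 6, so a single task flipping from fail to pass would move a per-server accuracy by 25 or roughly 17 percentage points respectively.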

    Additional Observations

    The cost analysis (Figure 3) revealing no clear correlation between spending and performance is practically valuable. The finding that GPT-5 achieves reasonable performance at $0.09/task while some expensive models underperform provides actionable deployment guidance.

    The paper's framing around MCP specifically may be somewhat limiting — the underlying challenges (stateful tool use, implicit context resolution, multi-step planning) are general agent challenges that predate MCP. The MCP framing is timely but the contributions are arguably more about personalized stateful tool simulation than MCP per se.

    Rating: 6.2/10
    Significance 6.5 · Rigor 5.8 · Novelty 6.5 · Clarity 7

    Generated Jun 2, 2026

    Comparison History (20)

    vs. WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation
    claude-opus-4.6 · 6/6/2026

    WorldFly introduces a novel architectural contribution (dual-branch coupled flow matching for joint video prediction and action generation) that advances both world models and embodied AI. It addresses a fundamental challenge in UAV navigation—partial observability in dense environments—with a principled approach integrating spatial imagination into policy learning. Paper 2, while timely given MCP adoption, is primarily a benchmark contribution for evaluating LLM agents on personalized tools. WorldFly has broader methodological impact across robotics, computer vision, and embodied AI, whereas MCP-Persona's impact is more narrowly tied to a specific protocol ecosystem.

    vs. Coordination Graphs for Constrained Multi-Agent Reinforcement Learning
    gpt-5.2 · 6/2/2026

    Paper 1 proposes a novel algorithmic framework (coordination graphs + Lagrangian CMARL) that addresses exponential joint action scaling and explicit constraints, with theoretical convergence guarantees and interpretable error bounds plus strong empirical scaling—features that typically drive durable scientific impact across MARL, control, and operations research. Paper 2 is timely and useful as a benchmark for MCP-based LLM agents, with clear practical relevance, but benchmarks often have narrower and shorter-lived impact unless they become a dominant standard. Overall, Paper 1 has broader methodological novelty and cross-field longevity.

    vs. LLM-Driven Co-Evolutionary Automated Heuristic Design for Bi-Component Coupled Combinatorial Optimization
    claude-opus-4.6 · 6/2/2026

    Paper 2 introduces a novel co-evolutionary framework (CoEvo-AHD) that addresses a fundamental limitation in LLM-driven automated heuristic design—handling coupled combinatorial optimization problems. This represents a methodological innovation with broad applicability across operations research and combinatorial optimization. Paper 1, while addressing a timely gap in benchmarking LLM agents for personalized MCP tools, is primarily a benchmark contribution tied to a specific protocol (MCP) that may have limited longevity. Paper 2's contributions to algorithmic design methodology and its generalizable framework for coupled optimization problems give it stronger long-term scientific impact.

    vs. Brain-Atlas-Guided Generative Counterfactual Attention for Explainable Cognitive Decline Diagnosis Using Multimodal Connectomes
    gemini-3.1 · 6/2/2026

    Paper 1 addresses a critical medical challenge (early Alzheimer's diagnosis) with a novel, explainable AI approach, offering profound real-world healthcare applications and strong methodological rigor. In contrast, Paper 2 presents a software engineering benchmark for LLM agents which, while timely, likely has a narrower scientific impact and shorter shelf-life compared to life-saving medical AI advancements.

    vs. Spatial Representation Learning Beyond Pixels: Unifying Raster Data and Vector Semantics for Human-Centric Geospatial Foundation Models
    gemini-3.1 · 6/2/2026

    Paper 2 introduces a concrete, publicly available benchmark in the highly active field of LLM agents and tool use. AI benchmarks typically drive immediate, measurable progress and attract high citation counts. Paper 1, while conceptually valuable, is a perspective paper calling for future research directions, making its near-term scientific impact less direct and quantifiable compared to an open-source evaluation framework.

    vs. From Noise to Control: Parameterized Diffusion Policies
    claude-opus-4.6 · 6/2/2026

    Paper 2 introduces a novel framework (PDP) that addresses a fundamental challenge in robot learning—controlling and steering diffusion policies via learned behavior manifolds. This offers broad methodological contributions applicable across robotics, control, and generative modeling. The combination of theoretical novelty (behavior manifold construction with semantic distance preservation), practical utility (adaptation without weight updates), and demonstrated results on both simulated and real robots gives it stronger lasting impact. Paper 1, while timely and useful, is primarily a benchmark contribution tied to a specific protocol (MCP) whose longevity is uncertain.

    vs. Absorbing Complexity: An Interaction-Native Knowledge Harness for Financial LLM Agents
    gemini-3.1 · 6/2/2026

    Paper 2 is likely to have higher scientific impact due to its broader applicability and timeliness. While Paper 1 presents a rigorous architecture with strong results, it is constrained to financial AI. In contrast, Paper 2 introduces the first benchmark for the rapidly adopted Model Context Protocol (MCP) across widespread applications like Slack and Reddit. Because benchmarking personalized tool-use in LLMs is a critical, cross-disciplinary challenge, Paper 2 will likely attract a wider audience, drive generalized agent development, and accumulate more citations across the broader AI community.

    vs. COMAP: Co-Evolving World Models and Agent Policies for LLM Agents
    gpt-5.2 · 6/2/2026

    Paper 2 (COMAP) likely has higher scientific impact due to a more novel methodological contribution: a closed-loop co-evolution of world models and agent policies without relying on external rewards/verifiers, applicable across multiple benchmark families (embodied planning, web navigation, tool use). This advances core agent-learning principles and can generalize broadly. Paper 1 (MCP-Persona) is timely and practically valuable as a benchmark for personalized MCP tool use, but its primary contribution is evaluative infrastructure with narrower conceptual novelty and potentially more limited cross-field methodological influence.

    vs. TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding
    claude-opus-4.6 · 6/2/2026

    Paper 1 presents a novel algorithmic contribution (TAPS) with strong theoretical grounding and significant empirical improvements (up to 7.9x speedup) over state-of-the-art methods in speculative decoding, a critical bottleneck for LLM inference efficiency. The methodological innovation of converting marginal probabilities to path-conditioned acceptance estimates is technically deep and broadly applicable. Paper 2 introduces a useful benchmark for personalized MCP tool use, but benchmarks generally have lower lasting impact than algorithmic innovations unless they become widely adopted standards. Paper 1's direct impact on inference efficiency has broader practical implications.

    vs. LLM-Evolved Pattern Generators for Optimal Classical Planning
    gpt-5.2 · 6/2/2026

    Paper 1 is more scientifically novel and methodologically substantive: it proposes the first admissible-by-design learned, domain-dependent heuristics for optimal classical planning by learning abstraction/pattern generators via LLM-guided evolutionary program synthesis, with principled admissible combination (saturated cost partitioning) and measurable efficiency gains. This advances core planning theory/practice and could influence heuristic design, program synthesis for search, and trustworthy ML-for-planning. Paper 2 is timely and useful as an evaluation benchmark for personalized tool-using agents, but benchmarks typically offer narrower scientific novelty and less durable impact unless they become a dominant standard.

    vs. OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation
    gpt-5.2 · 6/2/2026

    Paper 2 (OptSkills) introduces a novel archetype-centric, cluster-based distillation framework that learns reusable optimization “skills” and demonstrates strong in- and out-of-distribution gains on multiple challenging benchmarks (including MIPLIB-NL). The approach is methodological and potentially broadly applicable across automated optimization, operations research, and agentic LLM systems. Paper 1 (MCP-Persona) is timely and useful as a benchmark for personalized tool use, but its primary contribution is evaluation infrastructure in a narrower domain (MCP personal apps). Overall, OptSkills offers higher innovation and wider cross-field impact.

    vs. Formalizing and falsifying causal pathways of rare events
    gemini-3.1 · 6/2/2026

    Paper 1 offers a fundamental theoretical contribution to causal inference by formalizing root cause analysis for rare events. This mathematical foundation has broad, long-lasting applicability across multiple scientific and engineering disciplines. In contrast, while Paper 2 is highly timely, it introduces a specific LLM benchmark that, typical of the fast-paced AI field, may quickly become obsolete. Therefore, Paper 1 has a higher potential for enduring and widespread scientific impact.

    vs. Review Arcade: On the Human Alignment and Gameability of LLM Reviews
    claude-opus-4.6 · 6/2/2026

    Paper 2 addresses a highly timely and consequential issue—the integrity of scientific peer review when both authors and reviewers use LLMs. Its finding that LLM reviews can be 'gamed' by iterative revision has immediate implications for major AI conferences already piloting LLM-assisted review. This touches the foundations of scientific evaluation and affects the entire research community. Paper 1, while useful, is a more incremental benchmark contribution for a specific protocol (MCP) with a narrower scope. Paper 2's broader relevance to scientific integrity gives it higher potential impact across fields.

    vs. Emergent Ordinal Geometry in Transformers Trained on Local Comparisons
    claude-opus-4.6 · 6/2/2026

    Paper 1 offers a fundamentally novel mechanistic account linking transformer representation geometry to a 50-year-old cognitive science phenomenon (symbolic distance effect), bridging deep learning and cognitive science with strong theoretical implications. Its discovery that transformers spontaneously learn ordinal geometry from local comparisons, exhibiting grokking-like dynamics and behavioral signatures matching human/animal cognition, represents a deeper scientific contribution. Paper 2, while practically useful, is an incremental engineering benchmark for LLM tool-use evaluation that will likely be superseded quickly as MCP evolves, offering limited lasting scientific insight.

    vs. SDR: Set-Distance Rewards for Radiology Report Generation
    gemini-3.1 · 6/2/2026

    Paper 2 demonstrates higher scientific impact by introducing a novel methodological advancement (Set-Distance Rewards) for reinforcement learning in vision-language models. While Paper 1 provides a timely benchmark for a trending software protocol (MCP), Paper 2 solves a fundamental algorithmic challenge—rewarding unordered, orthogonal generated facts—which has broad implications for non-causal text generation. Furthermore, its application to medical AI (radiology report generation) offers critical real-world utility. Its rigorous evaluation of both post-training (GRPO) and efficient test-time compute scaling solidifies its methodological superiority and potential to influence future RL research.

    vs. TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation
    gpt-5.2 · 6/2/2026

    Paper 2 (TIGER) likely has higher scientific impact: it introduces a novel, general inference-time framework for mitigating hallucinations in multimodal generation via graph-based evidence routing, includes theoretical convergence analysis, and demonstrates broad empirical gains across multiple modalities and backbones. Hallucination mitigation is a timely, high-priority problem with wide applicability (assistants, search, safety-critical reporting). Paper 1 is valuable as a benchmark for personalized MCP tool-use, but its impact is narrower (evaluation-centric, MCP-specific ecosystem) and less methodologically novel than TIGER’s algorithmic + theoretical contribution.

    vs. The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary
    gemini-3.1 · 6/2/2026

    Paper 2 offers fundamental theoretical insights into LLM reasoning limitations, establishing an Attention Bottleneck Theorem and a quantifiable 'Deterministic Horizon'. While Paper 1 introduces a useful empirical benchmark for a specific protocol (MCP), Paper 2's rigorous mathematical boundaries and extensive multi-model validation provide foundational architectural guidance for agentic systems, giving it a much broader and deeper scientific impact.

    vs. From Capability Models to Automated Planning: An AAS-Native Approach for Automatic PDDL Generation
    claude-opus-4.6 · 6/2/2026

    MCP-Persona addresses a timely and broadly impactful problem—benchmarking LLM agents on personalized real-world applications using the rapidly adopted MCP standard. It targets the large and active LLM/AI agent research community, introduces a novel benchmark filling a clear gap, and has broad applicability across social media and enterprise tools. Paper 2 solves a useful but narrower problem in industrial automated planning by bridging AAS models and PDDL, relevant primarily to the manufacturing/Industry 4.0 niche. The broader audience and timeliness of LLM agent evaluation give Paper 1 higher potential impact.

    vs. S3TS: Stochastic Scenario-Structured Tree Search for Advanced Planning Under Uncertainty
    gpt-5.2 · 6/2/2026

    Paper 2 offers a novel planning algorithm that jointly handles uncertainty (scenario trees) and non-linear dynamics (tree search), addressing a well-known gap between stochastic optimization and MCTS. It demonstrates quantified performance gains (near-optimal in linear cases; large improvements in non-linear settings) and targets a high-stakes, timely application domain (grid/renewables scheduling) with clear real-world deployment potential. Paper 1 is valuable as a benchmark for MCP-based LLM agents, but its impact is more infrastructural and may be narrower/less enduring given fast-moving tool ecosystems and potential benchmark overfitting.

    vs. SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search
    gpt-5.2 · 6/2/2026

    Paper 2 is likely to have higher impact because it introduces a timely, broadly useful benchmark aligned with the rapidly adopted MCP ecosystem, enabling standardized evaluation across many personal and enterprise tool-use settings. Benchmarks often catalyze widespread follow-on research, comparisons, and leaderboards, affecting academia and industry. Its environment simulation for personalized applications has clear real-world relevance and cross-field utility (agents, tool use, HCI, security/privacy). Paper 1 is a solid methodological contribution for efficiency in agentic search, but its impact may be narrower and more incremental, with adoption tied to specific agent/search pipelines.