MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation
Wenhao Wang, Peizhi Niu, Gongyi Zou, Xiyuan Yang, Jingxing Wang, Haoting Shi, Yaxin Du, Jingyi Chai
Abstract
The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks predominantly focus on generic information-seeking tools and fail to capture the practical challenges posed by personal social applications, where tools interact with individual accounts or local databases. To bridge this critical gap, we introduce MCP-Persona, the first benchmark specifically designed for evaluating agent performance on real-world, personalized MCP tools. MCP-Persona encompasses a diverse set of widely-used applications, ranging from social media platforms like Reddit and Xiaohongshu (Rednote) to enterprise collaboration suites such as Lark (Feishu) and Slack. Our extensive experiments on various state-of-the-art (SOTA) agents demonstrate their significant struggles with personalized tool use, thereby highlighting the benchmark's crucial role in identifying and addressing these limitations. MCP-Persona is publicly available at https://github.com/wwh0411/MCP-Persona.
AI Impact Assessments
Scientific Impact Assessment: MCP-Persona
1. Core Contribution
MCP-Persona introduces the first benchmark specifically targeting LLM agent evaluation on personalized, account-bound MCP (Model Context Protocol) tools. The key insight is that existing benchmarks focus on generic, stateless information-seeking tools, while real-world agent deployment increasingly involves personal applications (social media, enterprise collaboration, email) where tools are tightly coupled to user accounts, preferences, and historical state.
The paper contributes three methodological innovations: (1) Tool-Traverse, a traverse-then-simulate paradigm that probes real MCP servers to capture authentic behavioral patterns (including error modes) and synthesizes executable Python simulators; (2) Context-Tree, a hierarchical tree structure for representing user profiles and application state; and (3) Persona-Gen, a pipeline for generating personalized tasks with deliberate instruction fuzzification to mimic real user ambiguity.
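To make the Context-Tree idea concrete, here is a minimal sketch of a hierarchical store for user profile and application state. It is an illustration only: the class name `ContextNode`, its methods, and the example paths are assumptions made for this sketch, not the schema used in the paper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class ContextNode:
    """One node of a hypothetical context tree (not the paper's actual schema)."""
    name: str
    value: Optional[Any] = None
    children: Dict[str, "ContextNode"] = field(default_factory=dict)

    def add(self, path: List[str], value: Any) -> None:
        """Create intermediate nodes as needed and store value at the end of path."""
        node = self
        for key in path:
            node = node.children.setdefault(key, ContextNode(key))
        node.value = value

    def get(self, path: List[str]) -> Optional[Any]:
        """Return the value stored at path, or None if any segment is missing."""
        node: Optional[ContextNode] = self
        for key in path:
            node = node.children.get(key)
            if node is None:
                return None
        return node.value

    def paths(self, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
        """List every path below this node that carries a value."""
        found = [prefix] if prefix and self.value is not None else []
        for key, child in self.children.items():
            found.extend(child.paths(prefix + (key,)))
        return found


# Hypothetical example data: a user with state in two applications.
root = ContextNode("user")
root.add(["profile", "language"], "en")
root.add(["slack", "channels", "general"], {"unread": 3})
```

A tree of this shape lets a task generator or simulator look up state by path (for example `root.get(["slack", "channels", "general"])`) while keeping per-application state separated under its own branch.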
The benchmark covers 12 personalized MCP servers across social media (Reddit, Xiaohongshu, Instagram), collaboration platforms (Lark, Slack), content management (Notion, Obsidian), and email, with 173 human-verified tasks.
2. Methodological Rigor
The methodology is generally well-structured but has several notable aspects:
Strengths in rigor:
Concerns:
3. Potential Impact
Practical relevance: The benchmark addresses a genuine pain point. With MCP adoption accelerating (Anthropic's Skills, OpenClaw ecosystem), evaluating agents on personalized tools is increasingly important. The finding that even GPT-5 achieves <50% accuracy on these tasks (Table 3) is a striking result that motivates further research.
Simulation paradigm: The Tool-Traverse approach could be broadly applicable beyond this benchmark — any scenario requiring safe, reproducible testing of account-bound APIs could benefit. The code-as-simulation paradigm is particularly valuable for privacy-preserving evaluation.
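The traverse-then-simulate idea can be illustrated with a deliberately simpler stand-in: record the responses observed when probing a real tool (including error responses), then replay them offline. Note that this is a lookup-based replay, not the synthesized Python simulators the paper describes; the class `RecordedToolSimulator`, the tool name `send_message`, and the response fields are all hypothetical.

```python
import json
from typing import Any, Dict, Tuple


class RecordedToolSimulator:
    """Replays responses captured from a real tool; a stand-in for a synthesized simulator."""

    def __init__(self) -> None:
        self._traces: Dict[Tuple[str, str], Dict[str, Any]] = {}

    @staticmethod
    def _key(tool: str, args: Dict[str, Any]) -> Tuple[str, str]:
        # Serialize arguments with sorted keys so argument order does not matter.
        return tool, json.dumps(args, sort_keys=True)

    def record(self, tool: str, args: Dict[str, Any], response: Dict[str, Any]) -> None:
        """Store one observed (call, response) pair, success or error alike."""
        self._traces[self._key(tool, args)] = response

    def call(self, tool: str, args: Dict[str, Any]) -> Dict[str, Any]:
        """Return the recorded response, or a generic error for unseen calls."""
        key = self._key(tool, args)
        if key in self._traces:
            return self._traces[key]
        return {"ok": False, "error": f"no recorded behaviour for {tool}"}


# Hypothetical traces: one success and one error mode observed during traversal.
sim = RecordedToolSimulator()
sim.record("send_message", {"channel": "general", "text": "hi"}, {"ok": True})
sim.record("send_message", {"channel": "archived", "text": "hi"},
           {"ok": False, "error": "channel_archived"})
```

Because no real account is touched at replay time, a harness like this keeps evaluation reproducible and avoids exposing personal data, which is the property the assessment highlights.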
Identified failure modes: The three failure archetypes (under-exploration of environment, skipping dependent steps, over-long context degradation) provide actionable insights for agent developers.
Limitations on impact: The benchmark covers primarily Chinese and English applications, with a notable skew toward Chinese platforms (Xiaohongshu, Lark, WeCom, Amap, Baidu Maps). This limits global applicability. The 173-task scale may be insufficient for fine-grained model comparison or training data purposes.
4. Timeliness & Relevance
This paper is highly timely. MCP was introduced by Anthropic in late 2024 and has seen explosive adoption in 2025. The paper correctly identifies that the evaluation infrastructure has not kept pace with deployment. The inclusion of very recent models (GPT-5, Claude-Sonnet-4.5, Grok-4, Claude-Opus-4.1) and frameworks (OpenClaw) demonstrates strong awareness of the rapidly evolving landscape.
The focus on personalization aligns with the broader industry trend toward on-device, user-specific AI agents (Apple Intelligence, Doubao Phone). However, this rapid pace also poses a risk: the benchmark may become outdated quickly as MCP servers evolve.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Additional Observations
The cost analysis (Figure 3) revealing no clear correlation between spending and performance is practically valuable. The finding that GPT-5 achieves reasonable performance at $0.09/task while some expensive models underperform provides actionable deployment guidance.
The paper's framing around MCP specifically may be somewhat limiting — the underlying challenges (stateful tool use, implicit context resolution, multi-step planning) are general agent challenges that predate MCP. The MCP framing is timely but the contributions are arguably more about personalized stateful tool simulation than MCP per se.
Generated Jun 2, 2026
Comparison History (20)
WorldFly introduces a novel architectural contribution (dual-branch coupled flow matching for joint video prediction and action generation) that advances both world models and embodied AI. It addresses a fundamental challenge in UAV navigation—partial observability in dense environments—with a principled approach integrating spatial imagination into policy learning. Paper 2, while timely given MCP adoption, is primarily a benchmark contribution for evaluating LLM agents on personalized tools. WorldFly has broader methodological impact across robotics, computer vision, and embodied AI, whereas MCP-Persona's impact is more narrowly tied to a specific protocol ecosystem.
Paper 1 proposes a novel algorithmic framework (coordination graphs + Lagrangian CMARL) that addresses exponential joint action scaling and explicit constraints, with theoretical convergence guarantees and interpretable error bounds plus strong empirical scaling—features that typically drive durable scientific impact across MARL, control, and operations research. Paper 2 is timely and useful as a benchmark for MCP-based LLM agents, with clear practical relevance, but benchmarks often have narrower and shorter-lived impact unless they become a dominant standard. Overall, Paper 1 has broader methodological novelty and cross-field longevity.
Paper 2 introduces a novel co-evolutionary framework (CoEvo-AHD) that addresses a fundamental limitation in LLM-driven automated heuristic design—handling coupled combinatorial optimization problems. This represents a methodological innovation with broad applicability across operations research and combinatorial optimization. Paper 1, while addressing a timely gap in benchmarking LLM agents for personalized MCP tools, is primarily a benchmark contribution tied to a specific protocol (MCP) that may have limited longevity. Paper 2's contributions to algorithmic design methodology and its generalizable framework for coupled optimization problems give it stronger long-term scientific impact.
Paper 1 addresses a critical medical challenge (early Alzheimer's diagnosis) with a novel, explainable AI approach, offering profound real-world healthcare applications and strong methodological rigor. In contrast, Paper 2 presents a software engineering benchmark for LLM agents which, while timely, likely has a narrower scientific impact and shorter shelf-life compared to life-saving medical AI advancements.
Paper 2 introduces a concrete, publicly available benchmark in the highly active field of LLM agents and tool use. AI benchmarks typically drive immediate, measurable progress and attract high citation counts. Paper 1, while conceptually valuable, is a perspective paper calling for future research directions, making its near-term scientific impact less direct and quantifiable compared to an open-source evaluation framework.
Paper 2 introduces a novel framework (PDP) that addresses a fundamental challenge in robot learning—controlling and steering diffusion policies via learned behavior manifolds. This offers broad methodological contributions applicable across robotics, control, and generative modeling. The combination of theoretical novelty (behavior manifold construction with semantic distance preservation), practical utility (adaptation without weight updates), and demonstrated results on both simulated and real robots gives it stronger lasting impact. Paper 1, while timely and useful, is primarily a benchmark contribution tied to a specific protocol (MCP) whose longevity is uncertain.
Paper 2 is likely to have higher scientific impact due to its broader applicability and timeliness. While Paper 1 presents a rigorous architecture with strong results, it is constrained to financial AI. In contrast, Paper 2 introduces the first benchmark for the rapidly adopted Model Context Protocol (MCP) across widespread applications like Slack and Reddit. Because benchmarking personalized tool-use in LLMs is a critical, cross-disciplinary challenge, Paper 2 will likely attract a wider audience, drive generalized agent development, and accumulate more citations across the broader AI community.
Paper 2 (COMAP) likely has higher scientific impact due to a more novel methodological contribution: a closed-loop co-evolution of world models and agent policies without relying on external rewards/verifiers, applicable across multiple benchmark families (embodied planning, web navigation, tool use). This advances core agent-learning principles and can generalize broadly. Paper 1 (MCP-Persona) is timely and practically valuable as a benchmark for personalized MCP tool use, but its primary contribution is evaluative infrastructure with narrower conceptual novelty and potentially more limited cross-field methodological influence.
Paper 1 presents a novel algorithmic contribution (TAPS) with strong theoretical grounding and significant empirical improvements (up to 7.9x speedup) over state-of-the-art methods in speculative decoding, a critical bottleneck for LLM inference efficiency. The methodological innovation of converting marginal probabilities to path-conditioned acceptance estimates is technically deep and broadly applicable. Paper 2 introduces a useful benchmark for personalized MCP tool use, but benchmarks generally have lower lasting impact than algorithmic innovations unless they become widely adopted standards. Paper 1's direct impact on inference efficiency has broader practical implications.
Paper 1 is more scientifically novel and methodologically substantive: it proposes the first admissible-by-design learned, domain-dependent heuristics for optimal classical planning by learning abstraction/pattern generators via LLM-guided evolutionary program synthesis, with principled admissible combination (saturated cost partitioning) and measurable efficiency gains. This advances core planning theory/practice and could influence heuristic design, program synthesis for search, and trustworthy ML-for-planning. Paper 2 is timely and useful as an evaluation benchmark for personalized tool-using agents, but benchmarks typically offer narrower scientific novelty and less long-term impact unless they become a dominant standard.
Paper 2 (OptSkills) introduces a novel archetype-centric, cluster-based distillation framework that learns reusable optimization “skills” and demonstrates strong in- and out-of-distribution gains on multiple challenging benchmarks (including MIPLIB-NL). The approach is methodological and potentially broadly applicable across automated optimization, operations research, and agentic LLM systems. Paper 1 (MCP-Persona) is timely and useful as a benchmark for personalized tool use, but its primary contribution is evaluation infrastructure in a narrower domain (MCP personal apps). Overall, OptSkills offers higher innovation and wider cross-field impact.
Paper 1 offers a fundamental theoretical contribution to causal inference by formalizing root cause analysis for rare events. This mathematical foundation has broad, long-lasting applicability across multiple scientific and engineering disciplines. In contrast, while Paper 2 is highly timely, it introduces a specific LLM benchmark that, typical of the fast-paced AI field, may quickly become obsolete. Therefore, Paper 1 has a higher potential for enduring and widespread scientific impact.
Paper 2 addresses a highly timely and consequential issue—the integrity of scientific peer review when both authors and reviewers use LLMs. Its finding that LLM reviews can be 'gamed' by iterative revision has immediate implications for major AI conferences already piloting LLM-assisted review. This touches the foundations of scientific evaluation and affects the entire research community. Paper 1, while useful, is a more incremental benchmark contribution for a specific protocol (MCP) with a narrower scope. Paper 2's broader relevance to scientific integrity gives it higher potential impact across fields.
Paper 1 offers a fundamentally novel mechanistic account linking transformer representation geometry to a 50-year-old cognitive science phenomenon (symbolic distance effect), bridging deep learning and cognitive science with strong theoretical implications. Its discovery that transformers spontaneously learn ordinal geometry from local comparisons, exhibiting grokking-like dynamics and behavioral signatures matching human/animal cognition, represents a deeper scientific contribution. Paper 2, while practically useful, is an incremental engineering benchmark for LLM tool-use evaluation that will likely be superseded quickly as MCP evolves, offering limited lasting scientific insight.
Paper 2 demonstrates higher scientific impact by introducing a novel methodological advancement (Set-Distance Rewards) for reinforcement learning in vision-language models. While Paper 1 provides a timely benchmark for a trending software protocol (MCP), Paper 2 solves a fundamental algorithmic challenge—rewarding unordered, orthogonal generated facts—which has broad implications for non-causal text generation. Furthermore, its application to medical AI (radiology report generation) offers critical real-world utility. Its rigorous evaluation of both post-training (GRPO) and efficient test-time compute scaling solidifies its methodological superiority and potential to influence future RL research.
Paper 2 (TIGER) likely has higher scientific impact: it introduces a novel, general inference-time framework for mitigating hallucinations in multimodal generation via graph-based evidence routing, includes theoretical convergence analysis, and demonstrates broad empirical gains across multiple modalities and backbones. Hallucination mitigation is a timely, high-priority problem with wide applicability (assistants, search, safety-critical reporting). Paper 1 is valuable as a benchmark for personalized MCP tool-use, but its impact is narrower (evaluation-centric, MCP-specific ecosystem) and less methodologically novel than TIGER’s algorithmic + theoretical contribution.
Paper 2 offers fundamental theoretical insights into LLM reasoning limitations, establishing an Attention Bottleneck Theorem and a quantifiable 'Deterministic Horizon'. While Paper 1 introduces a useful empirical benchmark for a specific protocol (MCP), Paper 2's rigorous mathematical boundaries and extensive multi-model validation provide foundational architectural guidance for agentic systems, giving it a much broader and deeper scientific impact.
MCP-Persona addresses a timely and broadly impactful problem—benchmarking LLM agents on personalized real-world applications using the rapidly adopted MCP standard. It targets the large and active LLM/AI agent research community, introduces a novel benchmark filling a clear gap, and has broad applicability across social media and enterprise tools. Paper 2 solves a useful but narrower problem in industrial automated planning by bridging AAS models and PDDL, relevant primarily to the manufacturing/Industry 4.0 niche. The broader audience and timeliness of LLM agent evaluation give Paper 1 higher potential impact.
Paper 2 offers a novel planning algorithm that jointly handles uncertainty (scenario trees) and non-linear dynamics (tree search), addressing a well-known gap between stochastic optimization and MCTS. It demonstrates quantified performance gains (near-optimal in linear cases; large improvements in non-linear settings) and targets a high-stakes, timely application domain (grid/renewables scheduling) with clear real-world deployment potential. Paper 1 is valuable as a benchmark for MCP-based LLM agents, but its impact is more infrastructural and may be narrower/less enduring given fast-moving tool ecosystems and potential benchmark overfitting.
Paper 2 is likely to have higher impact because it introduces a timely, broadly useful benchmark aligned with the rapidly adopted MCP ecosystem, enabling standardized evaluation across many personal and enterprise tool-use settings. Benchmarks often catalyze widespread follow-on research, comparisons, and leaderboards, affecting academia and industry. Its environment simulation for personalized applications has clear real-world relevance and cross-field utility (agents, tool use, HCI, security/privacy). Paper 1 is a solid methodological contribution for efficiency in agentic search, but its impact may be narrower and more incremental, with adoption tied to specific agent/search pipelines.