iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Lawrence Keunho Jang, Mareks Woodside, Geronimo Carom, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov

Jun 8, 2026 · arXiv:2606.09764v1

cs.LG · cs.CL

#1887 of 5669 · cs.LG

Tournament Score: 1438±42

Win Rate: 60%

Rating: 7.2 / 10

Significance 7.5 · Rigor 7.5 · Novelty 7 · Clarity 8

Abstract

A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches 52% overall but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: iOSWorld

1. Core Contribution

iOSWorld introduces the first interactive, native iOS simulator benchmark centered on a persistent user identity spanning 26 custom-built iOS applications. The key novelty is threefold: (1) filling the iOS gap in mobile agent benchmarks (prior work targets Android exclusively), (2) embedding a coherent fictional persona ("Jordan Avery") with interconnected data across apps—transactions in one app producing receipts in another, contacts appearing across messaging, payment, and professional apps—and (3) designing memory/personalization tasks that require agents to discover implicit patterns from personal data without explicit instructions on where to look.

The benchmark includes 133 tasks in three categories of increasing difficulty: single-app (27), multi-app (60), and memory/personalization (46). The multi-app and memory categories are the primary differentiators from prior benchmarks, requiring cross-application reasoning and latent pattern discovery.

2. Methodological Rigor

Benchmark construction follows a disciplined pipeline: apps built in SwiftUI with Claude Code assistance but human-verified, tasks generated from source code and seed data then manually executed end-to-end by annotators, with 44/175 candidate tasks corrected or removed during QA. This is a meaningful quality bar.

Evaluation uses an LLM-as-a-judge framework (GPT-5.4 Mini) validated against human annotators with κ=0.77 at task level (89% accuracy, F1=0.86) on 128 trajectories. The paper provides extensive cross-judge robustness checks (Table 12), per-annotator breakdowns (Table 11), generalization to non-Opus agents (Table 15), and same-family bias audits (Table 16). This level of evaluation validation is commendable and exceeds what most benchmark papers provide.
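To make the agreement statistics above concrete, here is a minimal Python sketch of how accuracy, F1, and Cohen's κ are computed from a binary judge-versus-human confusion matrix. The function name and the counts are hypothetical: the counts merely sum to 128 and were chosen so the three statistics land near the reported values. They are not the paper's actual confusion matrix, which is not reproduced here.

```python
def agreement_stats(tp, fp, fn, tn):
    """Accuracy, F1, and Cohen's kappa for binary judge-vs-human verdicts.

    tp: judge pass, human pass    fp: judge pass, human fail
    fn: judge fail, human pass    tn: judge fail, human fail
    """
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    f1 = 2 * tp / (2 * tp + fp + fn)
    # Chance agreement: probability both say "pass" plus both say "fail"
    # if the two raters were independent with their observed pass rates.
    p_pass = ((tp + fp) / n) * ((tp + fn) / n)
    p_fail = ((fn + tn) / n) * ((fp + tn) / n)
    p_chance = p_pass + p_fail
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return accuracy, f1, kappa


# HYPOTHETICAL counts for illustration only (not taken from the paper).
accuracy, f1, kappa = agreement_stats(tp=45, fp=7, fn=7, tn=69)
print(f"accuracy={accuracy:.3f}  F1={f1:.3f}  kappa={kappa:.3f}")
```

Note that κ sits well below raw accuracy because it discounts the agreement two raters would reach by chance given their pass rates.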

Experimental design is careful: two observation modalities (vision-only, vision+XML) tested across six models with equivalent system prompts. The paper correctly frames vision+XML as "privileged access" rather than a realistic deployment setting.

Limitations in rigor: The benchmark uses a single persona, which limits generalizability claims about personalization. Bootstrap CIs (±8-14 pp per category) reveal that with 27-60 tasks per category, individual model comparisons within a category may not be statistically robust. The 50-step budget is somewhat arbitrary, though the step-budget analysis (Fig. 9) partially addresses this.
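The point about per-category statistical power can be illustrated with a percentile bootstrap over per-task pass/fail outcomes. The sketch below is an assumption-laden illustration, not the paper's procedure: the resampling scheme, the 95% level, and the success counts (14 of 27 and 30 of 60) are all hypothetical. It only shows that an interval over 27 tasks is noticeably wider than one over 60 tasks at a similar success rate.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 task outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample tasks with replacement and record each resample's success rate.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# HYPOTHETICAL outcomes: category sizes match the benchmark (27 and 60 tasks),
# but the success counts are made up for illustration.
single_app = [1] * 14 + [0] * 13
multi_app = [1] * 30 + [0] * 30

lo27, hi27 = bootstrap_ci(single_app)
lo60, hi60 = bootstrap_ci(multi_app)
print(f"27 tasks: [{lo27:.2f}, {hi27:.2f}]  width={hi27 - lo27:.2f}")
print(f"60 tasks: [{lo60:.2f}, {hi60:.2f}]  width={hi60 - lo60:.2f}")
```

Under these assumptions, two models whose single-app scores differ by a few tasks would have heavily overlapping intervals, which is the concern raised above.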

3. Potential Impact

Immediate impact: iOSWorld fills a clear gap—iOS covers 58-60% of U.S. mobile OS usage but had no interactive agent benchmark. The open-source release (apps, data, tasks, evaluation code, AWS runner) lowers the barrier significantly. The AWS-runner for non-Mac researchers is a thoughtful inclusion.

Personalization as a benchmark dimension: The most impactful conceptual contribution is arguing that phone agents must be "personally intelligent"—reasoning over user history and preferences rather than executing isolated commands. The memory/personalization task category (46 tasks) operationalizes this, and the finding that even frontier models only reach 54% on these tasks establishes a meaningful challenge. This framing could influence future benchmark design beyond mobile agents.

Practical findings: The vision-only vs. vision+XML gap (up to 26 pp for frontier models) quantifies the cost of iOS's closed accessibility infrastructure. The finding that smaller models degrade with XML context (~3,100 tokens/step overwhelming context budgets) is practically important for deployment decisions. The failure taxonomy (51% budget exhaustion, 26% gave up, 23% premature stop) provides actionable research directions.

Cross-field influence: The connected-data-across-apps design pattern could influence enterprise agent benchmarks, accessibility tools, and personal assistant research more broadly.

4. Timeliness & Relevance

The paper is highly timely. Phone agents are a major focus of commercial AI development (Apple Intelligence, Google's on-device agents), and the gap between Android-only benchmarks and real-world deployment needs is widely recognized. The personalization angle aligns with the emerging shift from general-purpose to user-adapted AI systems. The paper evaluates the latest frontier models (Claude Opus/Sonnet 4.6, GPT-5.4, Gemini 3 Flash), ensuring immediate relevance.

5. Strengths & Limitations

Key Strengths:

Novel and well-motivated benchmark axis: Personalization is genuinely underexplored in agent evaluation. The interconnected data design (e.g., food order → bank charge → receipt email) creates realistic cross-app dependencies.

Thorough evaluation infrastructure: The dual-modality setup, rubric-based scoring, extensive human validation, cross-judge checks, failure taxonomy, and step-budget analysis form a comprehensive evaluation framework.

Actionable findings: The XML benefit analysis, iOS-specific friction quantification (10% coordinate miss rate, 1.1% edge-swipe usage), and smaller-model context overflow are concrete, useful results.

Open-source with accessibility considerations: AWS runner, schema for new tasks/personas, and MCP tool ablation demonstrate extensibility.

Notable Weaknesses:

Single persona: One user limits evaluation of how agents handle diverse user profiles, cultural contexts, or edge cases in personalization. The paper acknowledges this but it constrains claims about "personal intelligence."

Synthetic ecosystem: All 26 apps are custom-built approximations of real apps. While necessary for reproducibility, this means results may not transfer to real iOS apps with their full complexity, bugs, and design idiosyncrasies.

Task scale: 133 tasks (especially 27 single-app) is modest. Per-category statistical power is limited, as evidenced by bootstrap CIs of ±14.8 pp for single-app.

No web browsing: The explicit exclusion of browser tasks is reasonable but means the benchmark captures only part of phone agent capability.

iOS simulator vs. real device: Simulator-based evaluation may miss real-device friction (notifications, system dialogs, performance variations).

Evaluation model dependency: Using GPT-5.4 Mini as judge while evaluating GPT models introduces potential (though audited) conflicts; the cross-judge analysis partially mitigates this.

Overall Assessment

iOSWorld makes a solid contribution by addressing two clear gaps: the absence of iOS agent benchmarks and the absence of personalization in mobile agent evaluation. The benchmark design is thoughtful, the evaluation is unusually thorough for a benchmark paper, and the findings (especially around XML privilege, model scaling, and failure modes) are immediately useful. The single-persona limitation and modest task count constrain the strength of personalization claims, but the open-source release and extensible design position this as a foundation others can build on. The conceptual push toward "personally intelligent" agents is the paper's most lasting contribution.

Rating: 7.2 / 10

Significance 7.5 · Rigor 7.5 · Novelty 7 · Clarity 8

Generated Jun 9, 2026

Comparison History (20)

Won vs. Overcoming Rank Collapse in Feedback Alignment

iOSWorld introduces a novel benchmark addressing a significant gap in mobile agent evaluation—personalization and persistent user identity. This is highly timely given the rapid development of LLM-based agents and has broad practical implications for the AI agent community. Paper 2 makes a solid contribution to understanding feedback alignment's scaling limitations, but it addresses a relatively niche problem in biologically plausible learning with incremental improvements on established benchmarks. iOSWorld's open-source release and practical relevance to the booming agent ecosystem give it higher potential impact.

claude-opus-4-6·Jun 10, 2026

Lost vs. GRAFT: Gain-Recalibrated Adapters for Transformer-Based Neural Population Activity Modeling

GRAFT addresses a fundamental challenge in brain-computer interfaces—cross-day neural recording stability—with a novel architecture that separates reusable temporal dynamics from neuron-specific interfaces. It achieves state-of-the-art results on an established benchmark (NLB'21) and demonstrates practical recalibration with minimal parameter updates, directly advancing long-term BCI usability. While iOSWorld is a solid benchmark contribution for mobile AI agents, benchmarks typically have narrower lasting impact compared to methodological advances. GRAFT's innovation in neural population modeling has deeper scientific implications for both neuroscience and clinical BCI applications.

claude-opus-4-6·Jun 10, 2026

Won vs. CITRAS-FM: Tiny Time Series Foundation Model for Covariate-Informed Zero-Shot Forecasting

iOSWorld introduces a highly timely and novel benchmark for personalized, multimodal autonomous phone agents, a rapidly expanding frontier in AI. By simulating persistent user identities across interconnected apps, it addresses a critical gap in evaluating agentic systems. Benchmarks in nascent but popular areas often generate substantial scientific impact by standardizing evaluation and driving future research directions, whereas the tiny time series model, while practically useful, enters a more established and crowded subfield.

gemini-3.1-pro-preview·Jun 10, 2026

Won vs. Exploring the Design Space of Reward Backpropagation for Flow Matching

iOSWorld introduces a novel benchmark addressing a significant gap in mobile agent evaluation—personalization and persistent user identity. This fills an important need as phone agents become increasingly relevant, and the open-source release of 26 apps, tasks, and evaluation infrastructure will likely drive substantial follow-up research. While FlowBP makes solid technical contributions to reward backpropagation for flow matching models, it represents incremental improvements within a narrower subfield. iOSWorld's broader applicability across AI agent research, human-computer interaction, and mobile computing gives it higher potential impact.

claude-opus-4-6·Jun 10, 2026

Won vs. OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib

Paper 1 introduces a highly timely and novel benchmark for mobile AI agents, a rapidly expanding field, providing a fully functional interactive simulator that will likely see immediate, broad adoption for training and evaluating LLMs. In contrast, while Paper 2 addresses a critical medical domain (lung cancer resistance), its own results demonstrate that the current dataset's input modality fails to perform better than chance, serving more as a negative result that sets up future work. Therefore, Paper 1 promises much higher immediate utility, citation potential, and broader technological impact.

gemini-3.1-pro-preview·Jun 10, 2026

Won vs. EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

Paper 2 (iOSWorld) likely has higher scientific impact because it introduces a broadly usable, open-source benchmark infrastructure for personalized, persistent-identity mobile agents—an area with strong real-world relevance and timely demand. Benchmarks often shape research directions across multiple communities (agents, HCI, multimodal models, evaluation), giving wide cross-field impact and strong practical applicability. Paper 1 is novel and useful (test-time prompt learning across heterogeneous streams), but its impact is more method-specific and may depend on adoption within a narrower subarea, whereas iOSWorld can become a standard evaluation platform.

gpt-5.2·Jun 10, 2026

Won vs. COGENT: Continuous Graph Emulators with Neural Ordinary Differential Equations for Long-Term Physical Forecasting

iOSWorld introduces a novel benchmark addressing a significant gap in mobile agent evaluation—personalization and persistent user identity. This is highly timely given the rapid development of LLM-based agents and has broad implications for AI assistants, HCI, and industry applications. Paper 2 (COGENT) presents a solid technical contribution combining Neural ODEs with graph networks for physical forecasting, but it is more incremental, combining existing techniques (graph networks, Neural ODEs) and is evaluated on a single domain (ice-sheet simulations). iOSWorld's open-source benchmark has higher potential to catalyze research across the AI agent community.

claude-opus-4-6·Jun 10, 2026

Lost vs. Stable Deep Reinforcement Learning via Isotropic Gaussian Representations

Paper 1 addresses a fundamental challenge in deep reinforcement learning—training instability due to non-stationarity—with both theoretical grounding (provable advantages of isotropic Gaussian embeddings) and a practical, computationally inexpensive regularization method. It tackles representation collapse and neuron dormancy, which are broadly relevant problems. Paper 2 introduces a useful benchmark for phone agents with personalization, but benchmarks tend to have more transient impact and are narrower in scope. Paper 1's theoretical contributions and broad applicability across RL domains give it higher long-term scientific impact.

claude-opus-4-6·Jun 9, 2026

Lost vs. Names Don't Matter: Symbol-Invariant Transformer for Open-Vocabulary Learning

Paper 1 addresses a fundamental architectural limitation in Transformers regarding symbol invariance, proposing a theoretically grounded mechanism with provable guarantees. This has broad implications across multiple fields (formal reasoning, program synthesis, mathematics, logic) and introduces a principled inductive bias that could influence future neural architecture design. Paper 2, while practically valuable as a benchmark for phone agents, is more incremental—it extends existing mobile agent evaluation with personalization. Benchmarks have impact but are more transient, whereas architectural innovations with theoretical foundations tend to have deeper, longer-lasting influence.

claude-opus-4-6·Jun 9, 2026

Lost vs. Muon Learns More Robust and Transferable Features than Adam

Paper 1 investigates fundamental optimization mechanisms in deep learning, demonstrating that the new Muon optimizer outperforms the ubiquitous Adam optimizer in feature robustness and transferability. Because optimization is central to all deep learning models (LLMs, CNNs), a proven improvement here has a massive, field-wide impact. Paper 2 introduces a valuable but more narrowly focused benchmark for smartphone AI agents, which, while highly relevant for applied AI, does not match the fundamental breadth and foundational theoretical contribution of Paper 1.

gemini-3.1-pro-preview·Jun 9, 2026
