Lawrence Keunho Jang, Mareks Woodside, Geronimo Carom, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov
A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches 52% overall but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.
iOSWorld introduces the first interactive, native iOS simulator benchmark centered on a persistent user identity spanning 26 custom-built iOS applications. The key novelty is threefold: (1) filling the iOS gap in mobile agent benchmarks (prior work targets Android exclusively), (2) embedding a coherent fictional persona ("Jordan Avery") with interconnected data across apps—transactions in one app producing receipts in another, contacts appearing across messaging, payment, and professional apps—and (3) designing memory/personalization tasks that require agents to discover implicit patterns from personal data without explicit instructions on where to look.
The benchmark includes 133 tasks in three categories of increasing difficulty: single-app (27), multi-app (60), and memory/personalization (46). The multi-app and memory categories are the primary differentiators from prior benchmarks, requiring cross-application reasoning and latent pattern discovery.
Benchmark construction follows a disciplined pipeline. Apps were built in SwiftUI with Claude Code assistance and then human-verified. Tasks were generated from source code and seed data, then manually executed end-to-end by annotators; 44 of 175 candidate tasks were corrected or removed during QA. This is a meaningful quality bar.
Evaluation uses an LLM-as-a-judge framework (GPT-5.4 Mini) validated against human annotators with κ=0.77 at task level (89% accuracy, F1=0.86) on 128 trajectories. The paper provides extensive cross-judge robustness checks (Table 12), per-annotator breakdowns (Table 11), generalization to non-Opus agents (Table 15), and same-family bias audits (Table 16). This level of evaluation validation is commendable and exceeds what most benchmark papers provide.
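For readers less familiar with the agreement statistic quoted above, the following is a minimal sketch of how Cohen's κ can be computed from paired task-level verdicts. The function name and the toy labels are illustrative assumptions; they are not the paper's data or evaluation code.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's marginals.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[k] * counts_b[k] for k in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical pass/fail verdicts (1 = task judged successful), NOT from the paper.
human_verdicts = [1, 1, 0, 0, 1, 0, 1, 0]
judge_verdicts = [1, 1, 0, 0, 1, 0, 0, 1]
kappa = cohen_kappa(human_verdicts, judge_verdicts)
```

In this toy case the raters agree on 6 of 8 items (75%), but half of that agreement is expected by chance, which is why κ is lower than raw accuracy. The same relationship explains how the reported 89% accuracy corresponds to κ=0.77.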
Experimental design is careful: two observation modalities (vision-only, vision+XML) tested across six models with equivalent system prompts. The paper correctly frames vision+XML as "privileged access" rather than a realistic deployment setting.
Limitations in rigor: The benchmark uses a single persona, which limits generalizability claims about personalization. Bootstrap CIs (±8-14 pp per category) reveal that with 27-60 tasks per category, individual model comparisons within a category may not be statistically robust. The 50-step budget is somewhat arbitrary, though the step-budget analysis (Fig. 9) partially addresses this.
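To illustrate why intervals are this wide at such sample sizes, here is a minimal sketch of a percentile bootstrap over per-task success outcomes. The review does not state which bootstrap variant the paper uses, so the percentile method, the function name, and the example outcomes below are all assumptions made for illustration.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a success rate.

    `outcomes` is a list of 0/1 task results. Tasks are resampled with
    replacement, and the alpha/2 and 1 - alpha/2 quantiles of the
    resampled means are returned as (low, high).
    """
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical category with 27 tasks and 10 successes, NOT the paper's results.
example_outcomes = [1] * 10 + [0] * 17
low, high = bootstrap_ci(example_outcomes)
half_width_pp = 100 * (high - low) / 2
```

With only a few dozen binary outcomes, the resulting half-width is many percentage points, so two models whose category scores differ by a few points can easily have overlapping intervals.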
Immediate impact: iOSWorld fills a clear gap. iOS accounts for 58-60% of U.S. mobile OS usage but had no interactive agent benchmark. The open-source release (apps, data, tasks, evaluation code, AWS runner) significantly lowers the barrier to entry, and the AWS runner for researchers without Macs is a thoughtful inclusion.
Personalization as a benchmark dimension: The most impactful conceptual contribution is arguing that phone agents must be "personally intelligent"—reasoning over user history and preferences rather than executing isolated commands. The memory/personalization task category (46 tasks) operationalizes this, and the finding that even frontier models only reach 54% on these tasks establishes a meaningful challenge. This framing could influence future benchmark design beyond mobile agents.
Practical findings: The vision-only vs. vision+XML gap (up to 26 pp for frontier models) quantifies the cost of iOS's closed accessibility infrastructure. The finding that smaller models degrade with XML context (~3,100 tokens/step overwhelming context budgets) is practically important for deployment decisions. The failure taxonomy (51% budget exhaustion, 26% gave up, 23% premature stop) provides actionable research directions.
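A rough back-of-the-envelope sketch shows how the XML overhead can accumulate over an episode. It uses the ~3,100 tokens/step figure and the 50-step budget mentioned in this assessment; whether past observations stay in context is an assumption here, since the agent's history policy is not described, and the function and parameter names are hypothetical.

```python
TOKENS_PER_STEP = 3_100  # approximate XML overhead per step, as quoted above
STEP_BUDGET = 50         # per-task step budget, as quoted above


def cumulative_xml_tokens(steps, tokens_per_step=TOKENS_PER_STEP, keep_last=None):
    """Estimate XML tokens held in context after `steps` steps.

    keep_last=None assumes every past XML observation is retained
    (an assumption, not a documented behaviour of the benchmark harness).
    keep_last=k assumes only the k most recent observations are kept.
    """
    retained = steps if keep_last is None else min(steps, keep_last)
    return retained * tokens_per_step


full_history = cumulative_xml_tokens(STEP_BUDGET)               # all 50 kept
recent_only = cumulative_xml_tokens(STEP_BUDGET, keep_last=3)   # last 3 kept
```

Under the full-retention assumption, the XML alone reaches 155,000 tokens by the final step, which would exceed or crowd the context window of many smaller models; truncating to a few recent observations keeps the overhead under 10,000 tokens.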
Cross-field influence: The connected-data-across-apps design pattern could influence enterprise agent benchmarks, accessibility tools, and personal assistant research more broadly.
The paper is highly timely. Phone agents are a major focus of commercial AI development (Apple Intelligence, Google's on-device agents), and the gap between Android-only benchmarks and real-world deployment needs is widely recognized. The personalization angle aligns with the emerging shift from general-purpose to user-adapted AI systems. The paper evaluates the latest frontier models (Claude Opus/Sonnet 4.6, GPT-5.4, Gemini 3 Flash), ensuring immediate relevance.
iOSWorld makes a solid contribution by addressing two clear gaps: the absence of iOS agent benchmarks and the absence of personalization in mobile agent evaluation. The benchmark design is thoughtful, the evaluation is unusually thorough for a benchmark paper, and the findings (especially around XML privilege, model scaling, and failure modes) are immediately useful. The single-persona limitation and modest task count constrain the strength of personalization claims, but the open-source release and extensible design position this as a foundation others can build on. The conceptual push toward "personally intelligent" agents is the paper's most lasting contribution.
Generated Jun 9, 2026
iOSWorld introduces a novel benchmark addressing a significant gap in mobile agent evaluation—personalization and persistent user identity. This is highly timely given the rapid development of LLM-based agents and has broad practical implications for the AI agent community. Paper 2 makes a solid contribution to understanding feedback alignment's scaling limitations, but it addresses a relatively niche problem in biologically plausible learning with incremental improvements on established benchmarks. iOSWorld's open-source release and practical relevance to the booming agent ecosystem give it higher potential impact.
GRAFT addresses a fundamental challenge in brain-computer interfaces—cross-day neural recording stability—with a novel architecture that separates reusable temporal dynamics from neuron-specific interfaces. It achieves state-of-the-art results on an established benchmark (NLB'21) and demonstrates practical recalibration with minimal parameter updates, directly advancing long-term BCI usability. While iOSWorld is a solid benchmark contribution for mobile AI agents, benchmarks typically have narrower lasting impact compared to methodological advances. GRAFT's innovation in neural population modeling has deeper scientific implications for both neuroscience and clinical BCI applications.
iOSWorld introduces a highly timely and novel benchmark for personalized, multimodal autonomous phone agents, a rapidly expanding frontier in AI. By simulating persistent user identities across interconnected apps, it addresses a critical gap in evaluating agentic systems. Benchmarks in nascent but popular areas often generate substantial scientific impact by standardizing evaluation and driving future research directions, whereas the tiny time series model, while practically useful, enters a more established and crowded subfield.
iOSWorld introduces a novel benchmark addressing a significant gap in mobile agent evaluation—personalization and persistent user identity. This fills an important need as phone agents become increasingly relevant, and the open-source release of 26 apps, tasks, and evaluation infrastructure will likely drive substantial follow-up research. While FlowBP makes solid technical contributions to reward backpropagation for flow matching models, it represents incremental improvements within a narrower subfield. iOSWorld's broader applicability across AI agent research, human-computer interaction, and mobile computing gives it higher potential impact.
Paper 1 (iOSWorld) introduces a highly timely and novel benchmark for mobile AI agents, a rapidly expanding field, providing a fully functional interactive simulator that will likely see immediate, broad adoption for training and evaluating LLMs. In contrast, while Paper 2 addresses a critical medical domain (lung cancer resistance), its own results demonstrate that the current dataset's input modality fails to perform better than chance, so it serves more as a negative result that sets up future work. Therefore, Paper 1 promises much higher immediate utility, citation potential, and broader technological impact.
Paper 2 (iOSWorld) likely has higher scientific impact because it introduces a broadly usable, open-source benchmark infrastructure for personalized, persistent-identity mobile agents—an area with strong real-world relevance and timely demand. Benchmarks often shape research directions across multiple communities (agents, HCI, multimodal models, evaluation), giving wide cross-field impact and strong practical applicability. Paper 1 is novel and useful (test-time prompt learning across heterogeneous streams), but its impact is more method-specific and may depend on adoption within a narrower subarea, whereas iOSWorld can become a standard evaluation platform.
iOSWorld introduces a novel benchmark addressing a significant gap in mobile agent evaluation—personalization and persistent user identity. This is highly timely given the rapid development of LLM-based agents and has broad implications for AI assistants, HCI, and industry applications. Paper 2 (COGENT) presents a solid technical contribution combining Neural ODEs with graph networks for physical forecasting, but it is more incremental, combining existing techniques (graph networks, Neural ODEs) and is evaluated on a single domain (ice-sheet simulations). iOSWorld's open-source benchmark has higher potential to catalyze research across the AI agent community.
Paper 1 addresses a fundamental challenge in deep reinforcement learning, training instability due to non-stationarity, with both theoretical grounding (provable advantages of isotropic Gaussian embeddings) and a practical, computationally inexpensive regularization method. It tackles representation collapse and neuron dormancy, which are broadly relevant problems. Paper 2 (iOSWorld) introduces a useful benchmark for phone agents with personalization, but benchmarks tend to have more transient impact and are narrower in scope. Paper 1's theoretical contributions and broad applicability across RL domains give it higher long-term scientific impact.
Paper 1 addresses a fundamental architectural limitation in Transformers regarding symbol invariance, proposing a theoretically grounded mechanism with provable guarantees. This has broad implications across multiple fields (formal reasoning, program synthesis, mathematics, logic) and introduces a principled inductive bias that could influence future neural architecture design. Paper 2 (iOSWorld), while practically valuable as a benchmark for phone agents, is more incremental: it extends existing mobile agent evaluation with personalization. Benchmarks have impact but are more transient, whereas architectural innovations with theoretical foundations tend to have deeper, longer-lasting influence.
Paper 1 investigates fundamental optimization mechanisms in deep learning, demonstrating that the new Muon optimizer outperforms the ubiquitous Adam optimizer in feature robustness and transferability. Because optimization is central to all deep learning models (LLMs, CNNs), a proven improvement here has a massive, field-wide impact. Paper 2 (iOSWorld) introduces a valuable but more narrowly focused benchmark for smartphone AI agents, which, while highly relevant for applied AI, does not match the breadth and foundational theoretical contribution of Paper 1.