MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models

Zhichao Yang, Yuanze Hu, Haojie Hao, Longkun Hao, Dongshuo Huang, Hongyu Lin, Gen Li, Lanqing Hong

Jun 3, 2026

arXiv:2606.04627v1

cs.AI (primary)

#354 of 3355 · Artificial Intelligence

Tournament Score: 1501 ± 44

Win Rate: 74%

Rating: 7.2 / 10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 7.8

Abstract

Mobile agents are increasingly expected to operate everyday applications from screenshots and language goals, where reliable control requires reasoning over screen affordances, multi-step navigation, and future state changes. However, many agents externalize this computation as long textual chains of thought, which slows interaction, increases supervision cost, and complicates deployment. We introduce MIRAGE, a framework that learns continuous latent reasoning representations from visible textual reasoning traces. MIRAGE transfers explicit reasoning into compact hidden states, enabling the agent to reason internally without decoding long rationales. It also incorporates a generative world-model objective: latent reasoning vectors are aligned with future screenshots, encouraging the agent to anticipate upcoming interface states before acting. This turns hidden computation into both a compressed thought representation and a forward-looking model of environment dynamics. At inference time, MIRAGE reasons in continuous latent space, reducing token generation while improving execution efficiency. On AndroidWorld, MIRAGE matches explicit chain-of-thought supervised fine-tuning in the 4B ablation with a 3-5x lower decoded-token budget and improves a comparable instruction-tuned baseline by 10.2 points; on AndroidControl, it improves action grounding while generating over 75% fewer tokens.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: MIRAGE

1. Core Contribution

MIRAGE addresses a genuine tension in mobile GUI agents: explicit chain-of-thought (CoT) reasoning improves decision quality but inflates token generation, increasing latency and deployment cost. The paper proposes three interlocking ideas:

1. Latent reasoning slots: Replace verbose textual reasoning traces with N continuous latent vectors that occupy the same position in the decoder sequence but are never decoded to text.

2. Approximate Parallel Latent Refinement (APLR): A Jacobi-style parallel iterative scheme that refines all latent slots simultaneously across K rounds, rather than requiring N sequential forward passes as in Coconut-style serial latent CoT. The paper proves the first K slots exactly match the serial solution, with bounded tail error.

3. Q-Former world-model head: Aligns latent reasoning states with future screenshot features (extracted from the frozen vision encoder), encouraging the agent to anticipate GUI transitions. This provides auxiliary supervision that specifically targets tail-slot errors left by APLR.

The combination is well-motivated: APLR introduces approximation error in later latent slots, and these later slots correspond to the "predict" component of structured thought. The world-model loss directly supervises exactly these transition-predictive representations.
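The exact-recovery property claimed for APLR can be illustrated on a toy problem. The sketch below is not the paper's implementation: `slot_update`, the scalar slots, and the fixed weights are hypothetical stand-ins for a causal decoder, chosen only so that each slot depends strictly on earlier slots. Under that assumption, K parallel (Jacobi-style) rounds reproduce the first K slots of the serial solution, while later slots may still carry error.

```python
import math

N_SLOTS = 8  # hypothetical number of latent slots

# Hypothetical fixed parameters; only entries with j < i are ever read.
WEIGHTS = [[0.5 * math.sin(i + 2 * j + 1) for j in range(N_SLOTS)]
           for i in range(N_SLOTS)]
BIAS = [0.1 * (i + 1) for i in range(N_SLOTS)]


def slot_update(i, slots):
    """Toy causal update: slot i reads only slots 0..i-1."""
    return math.tanh(BIAS[i] + sum(WEIGHTS[i][j] * slots[j] for j in range(i)))


def serial_solve():
    """Reference solution: one slot per step, N_SLOTS sequential steps."""
    slots = [0.0] * N_SLOTS
    for i in range(N_SLOTS):
        slots[i] = slot_update(i, slots)
    return slots


def jacobi_refine(rounds):
    """Parallel refinement: every slot is recomputed from the previous round."""
    slots = [0.0] * N_SLOTS
    for _ in range(rounds):
        slots = [slot_update(i, slots) for i in range(N_SLOTS)]
    return slots


def max_abs_error(a, b):
    """Largest elementwise gap between two slot lists."""
    return max(abs(x - y) for x, y in zip(a, b))
```

Comparing `jacobi_refine(3)` with `serial_solve()` shows agreement on the first three slots, and running `N_SLOTS` rounds recovers the full serial solution. This is why the choice of K trades compute against error in the later slots.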

2. Methodological Rigor

Strengths:

The APLR formalization is clean. Proposition 1 (exact recovery of first K slots) follows straightforwardly from the strictly triangular causal structure, but the proof is complete and the tail-error bound via the block-lower-triangular Jacobian (Eq. 7, Appendix F) is informative.

The two-stage curriculum (explicit CoT warmup → latent distillation) is a principled knowledge-distillation approach rather than asking the model to discover latent reasoning from scratch.

The ablation study (Table 3) is well-designed: it isolates contributions of latent slots, APLR, and the world model. The finding that action-only SFT *degrades* performance below the base model is important—it confirms that some form of intermediate computation is necessary.

Sensitivity analysis over slot count, refinement passes, and loss balance (Table 4) provides useful practical guidance.

Weaknesses:

The theoretical tail-error analysis (Proposition 2, Appendix G) assumes the residual-gradient term vanishes and relies on local second-order arguments. While pedagogically useful, it doesn't provide quantitative guarantees about how much the world-model loss actually reduces task-relevant error in practice.

The claim "matches explicit CoT SFT" is carefully scoped to the 4B ablation on AndroidWorld (both at 52.6 SR), but MIRAGE-8B at 57.8 SR has no explicit CoT SFT 8B comparison point, making it hard to assess whether the approach truly matches or exceeds CoT at larger scale.

Training data details are somewhat sparse. The "self-explored trajectories" on AndroidWorld are mentioned but not characterized (how many, what exploration policy, how they complement AMEX).

The structured thought format (observation/rationale/predict) requires human design of these reasoning dimensions. How sensitive results are to this decomposition is not explored.

3. Potential Impact

Practical impact is potentially significant. Mobile agents face hard latency constraints: users expect sub-second responsiveness. A 3-5× reduction in decoded tokens (from ~100+ to ~20) and a 1.7× latency reduction translate directly to better user experience. The framework is backbone-agnostic and demonstrated on two model sizes.
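One way to read the gap between the token reduction and the latency reduction is a simple linear latency model. This is an assumption made here for illustration, not something the paper or the assessment states: per-step latency is taken to be a fixed cost (for example screenshot encoding and prefill) plus a constant cost per decoded token. The function names and the rounded inputs (100 and 20 tokens, 1.7× speedup) are illustrative only.

```python
def fixed_cost_in_token_units(tokens_before, tokens_after, observed_speedup):
    """Solve (F + tokens_before) / (F + tokens_after) = observed_speedup for F.

    F is the assumed fixed per-step cost, measured in units of one
    decoded token's latency.
    """
    if observed_speedup <= 1.0:
        raise ValueError("observed_speedup must exceed 1")
    return (tokens_before - observed_speedup * tokens_after) / (observed_speedup - 1.0)


def implied_speedup(fixed_cost, tokens_before, tokens_after):
    """Latency ratio predicted by the linear model for a given fixed cost."""
    return (fixed_cost + tokens_before) / (fixed_cost + tokens_after)
```

Under this model, a fivefold token cut yielding a 1.7× speedup implies a fixed cost comparable to decoding on the order of a hundred tokens, so further token savings would give diminishing latency returns. The model ignores the extra forward passes spent on latent refinement, so it should be read as a rough guide only.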

Methodological influence could extend beyond mobile agents. The APLR technique for parallel latent refinement with provable early-slot recovery is generalizable to any latent-CoT application. The idea of using a world-model objective to compensate for tail-slot approximation error is architecturally elegant and could apply to other sequential decision-making domains.

Limitations to impact: The approach is supervised-only. No reinforcement learning or online adaptation is used. The world model operates in feature space rather than pixel space, limiting its use as an actual environment simulator. The Q-Former head is discarded at inference, so the model cannot generate predicted future screens for user verification or planning.
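To make "feature space rather than pixel space" concrete, the following is a minimal sketch of a feature-alignment auxiliary loss. It assumes cosine distance between predicted and target feature tokens and a scalar weight on the auxiliary term; the assessment does not specify the paper's actual objective, and every name here (`cosine_distance`, `world_model_loss`, `total_loss`, `wm_weight`) is hypothetical.

```python
import math


def cosine_distance(pred, target):
    """1 - cosine similarity between two equal-length feature vectors."""
    if len(pred) != len(target):
        raise ValueError("feature vectors must have equal length")
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    if norm == 0.0:
        raise ValueError("zero-norm feature vector")
    return 1.0 - dot / norm


def world_model_loss(pred_tokens, target_tokens):
    """Mean cosine distance over paired feature tokens.

    pred_tokens: features predicted from latent reasoning states (assumed).
    target_tokens: frozen-encoder features of the next screenshot (assumed).
    """
    if not pred_tokens or len(pred_tokens) != len(target_tokens):
        raise ValueError("need the same non-zero number of tokens on both sides")
    pairs = zip(pred_tokens, target_tokens)
    return sum(cosine_distance(p, t) for p, t in pairs) / len(pred_tokens)


def total_loss(action_loss, pred_tokens, target_tokens, wm_weight):
    """Training-time objective: action loss plus weighted auxiliary term."""
    return action_loss + wm_weight * world_model_loss(pred_tokens, target_tokens)
```

Because a loss of this kind compares feature vectors only, nothing in it produces an image, which is consistent with the limitation noted above: once the prediction head is dropped at inference, only the action path remains.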

4. Timeliness & Relevance

This paper is well-timed. The deployment of VLM-based agents on mobile devices is an active industry and research priority. The tension between reasoning quality and inference efficiency is a current bottleneck, especially as models like Qwen3-VL, GPT-4o, and Claude are being applied to agentic tasks. The latent reasoning direction (Coconut, Quiet-STaR, pause tokens) is a nascent but rapidly growing area, and MIRAGE is among the first to apply it to a concrete agentic domain with practical efficiency constraints.

5. Strengths & Limitations

Key Strengths:

Clean architectural design where three components (latent slots, APLR, world model) address distinct and well-identified problems

Strong empirical results: +10.2 points on AndroidWorld over instruction-tuned baseline, 75%+ token reduction on AndroidControl

The latent visualization analysis (Figures 5-8) provides genuine interpretability evidence—slot groups specialize positionally, and action semantics are recoverable after centering

Comprehensive ablation isolating each component's contribution

Notable Weaknesses:

No comparison against other latent reasoning methods (e.g., Coconut, Quiet-STaR) adapted to the mobile domain

The serial latent CoT baseline reaches 50.9 SR versus MIRAGE's 52.6, but the training-cost comparison between serial refinement and APLR is not quantified

Limited to supervised fine-tuning; no exploration of RL-based training which might further leverage the world model

AndroidWorld evaluation uses only 116 task instances, raising questions about statistical significance of the ~5-10 point improvements

The world-model prediction quality is never directly evaluated (e.g., retrieval accuracy of next-frame features)
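The concern about 116 task instances can be made concrete with a confidence interval. The sketch below computes a Wilson score interval under the assumptions of a single evaluation run and independent tasks. The success counts used in the note that follows (61 and 59 out of 116) are inferred here by rounding the reported 52.6 and 50.9 SR; they are not reported in the assessment.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    return center - half, center + half
```

Evaluating `wilson_interval(61, 116)` gives an interval of roughly plus or minus nine points around the observed rate, and the interval for 59 successes overlaps it almost entirely. Under these assumptions, differences of a few points between ablation rows would not be distinguishable from sampling noise without repeated runs.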

Overall Assessment

MIRAGE presents a well-conceived and technically sound framework that addresses a real deployment bottleneck in mobile GUI agents. The combination of parallel latent refinement with provable approximation guarantees and a world-model auxiliary objective is novel and well-motivated. Empirical results are strong though somewhat limited in scale. The work makes a meaningful contribution at the intersection of efficient reasoning and embodied agent deployment, though its impact would be strengthened by RL integration, broader benchmarking, and more rigorous statistical analysis.

Rating: 7.2 / 10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 7.8

Generated Jun 5, 2026

Comparison History (19)

vs. Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

gemini-3.1 · 6/6/2026

Paper 2 introduces a foundational shift in agentic AI by replacing explicit chain-of-thought with continuous latent reasoning and generative world models. This addresses critical bottlenecks in latency and supervision for autonomous agents. While Paper 1 offers highly impressive engineering for LLM quantization, Paper 2's methodological innovation in internalizing reasoning steps presents a broader paradigm shift for the rapidly growing field of multimodal agents, likely inspiring widespread future research in implicit reasoning architectures.

vs. The Self-Correction Illusion: LLMs Correct Others but Not Themselves

gpt-5.2 · 6/6/2026

Paper 1 offers a more technically novel and broadly enabling approach: compressing explicit reasoning into latent states and coupling it with a generative world-model objective for anticipating future UI states. This can materially improve efficiency and reliability of real-world mobile/UI agents, with quantitative gains and clear deployment relevance. While Paper 2 is timely and rigorous, its main contribution is a diagnostic/measurement finding plus a prompt-structure workaround; impactful for evaluation and prompting practice but narrower in downstream capability advances than MIRAGE’s method and application scope.

vs. Where does Absolute Position come from in decoder-only Transformers?

gpt-5.2 · 6/5/2026

Paper 2 has higher estimated impact: it proposes a broadly applicable framework (latent internal reasoning + generative world-model alignment) with clear, timely real-world utility for efficient mobile UI agents and measurable gains (token reduction, performance improvements) on established benchmarks. Its contribution could transfer across agentic RL, multimodal modeling, and deployment-constrained settings. Paper 1 is novel and methodologically insightful for transformer interpretability/architecture, but is more specialized and primarily diagnostic, with less immediate application breadth and product-facing impact than MIRAGE.

vs. Dynamic Thinking-Token Selection for Efficient Reasoning in Large Reasoning Models

gpt-5.2 · 6/5/2026

Paper 1 (MIRAGE) is more likely to have higher scientific impact: it introduces a novel latent-reasoning transfer framework plus a generative world-model objective that ties internal reasoning states to future observations, enabling anticipatory control in real UI environments. The demonstrated gains on AndroidWorld/AndroidControl suggest strong real-world applicability to autonomous UI agents and embodied/interactive systems, with potential cross-field impact (reasoning, world models, agentic interaction, efficiency). Paper 2 is timely and useful but is primarily an inference-time systems optimization (KV-cache pruning) with narrower conceptual novelty and application scope.

vs. Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

claude-opus-4.6 · 6/5/2026

Paper 2 (MRAgent) introduces a more broadly applicable and novel conceptual framework—active memory reconstruction inspired by cognitive science—that addresses a fundamental challenge (long-horizon memory reasoning) relevant across virtually all LLM agent applications. Its associative graph memory with dynamic reconstruction is a paradigm shift from static retrieve-then-reason approaches, with strong empirical gains (up to 23%). Paper 1 (MIRAGE) is well-executed but more narrowly focused on mobile UI agents and latent reasoning compression, which, while valuable, represents a more incremental advance in a specific application domain.

vs. LLM Self-Recognition: Steering and Retrieving Activation Signatures

gpt-5.2 · 6/5/2026

Paper 1 (MIRAGE) is likely to have higher scientific impact: it proposes a broadly useful training/inference framework (latent internal reasoning + world-model alignment) that directly improves efficiency and performance for UI-operating agents, a timely and high-demand application area. The approach has clear real-world deployment benefits (lower token/latency/cost) and can generalize to other embodied/interactive agent settings. Paper 2 is novel and relevant for provenance, but depends on access to internal activations/steering and may face adoption and threat-model limitations, narrowing practical impact.

vs. Towards a Science of AI Agent Reliability

gemini-3.1 · 6/5/2026

Paper 1 establishes a foundational evaluation framework for AI agent reliability, addressing a critical bottleneck for real-world deployment. By proposing comprehensive metrics for safety, robustness, and consistency, it has the potential to broadly influence how the entire field evaluates and develops agents. While Paper 2 offers a strong technical innovation for mobile agent efficiency, Paper 1's conceptual contribution offers wider applicability, cross-field relevance, and long-term impact on AI safety and engineering.

vs. SciDER: Scientific Data-centric End-to-end Researcher

claude-opus-4.6 · 6/5/2026

SciDER addresses the broader challenge of automating the entire scientific research lifecycle with a multi-agent system, releasing open-source datasets and models (OpenSciDER-SFT-8K, OpenSciDER-27B). Its breadth of impact spans multiple scientific domains and benchmarks, and it tackles fundamental limitations in AI-driven scientific discovery. While MIRAGE presents a technically elegant approach to latent reasoning for mobile agents with clear efficiency gains, its scope is narrower (mobile UI navigation). SciDER's potential to democratize and accelerate scientific research across disciplines gives it higher estimated impact.

vs. LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition

gemini-3.1 · 6/5/2026

Paper 1 addresses a fundamental bottleneck in foundation model development: the self-evolution of reasoning capabilities without relying on scarce human process data. By tackling mimetic bias and reward decomposition, its theoretical contributions to latent logic mining and self-alignment have broader implications for general AI development compared to Paper 2, which focuses on a more specific application (mobile agents) and inference efficiency.

vs. Rashomon Memory: Towards Argumentation-Driven Retrieval for Multi-Perspective Agent Memory

gemini-3.1 · 6/5/2026

Paper 1 addresses a critical bottleneck in AI agents—the latency and cost of explicit Chain-of-Thought—by introducing continuous latent reasoning and generative world models. Its demonstrated 3-5x efficiency gain on standard benchmarks (AndroidWorld) offers immediate, widespread real-world utility. While Paper 2 presents a highly novel theoretical framework for agent memory, it currently remains a proof-of-concept. Paper 1's methodological rigor, timeliness in the fast-growing agentic AI space, and practical scalability give it a significantly higher potential for immediate scientific and industrial impact.

vs. LACUNA: Safe Agents as Recursive Program Holes

gemini-3.1 · 6/5/2026

Paper 2 tackles the critical bottlenecks of token generation efficiency and environment modeling in LLM agents. By moving explicit chain-of-thought to continuous latent representations and integrating a generative world model, it offers fundamental algorithmic advancements in implicit reasoning. These innovations have broad applicability across AI domains, likely spurring more follow-up research than Paper 1's programming language and safety-oriented framework.

vs. R-APS: Compositional Reasoning and In-Context Meta-Learning for Constrained Design via Reflective Adversarial Pareto Search

gemini-3.1 · 6/5/2026

Paper 1 introduces a fundamental methodological shift by internalizing explicit Chain-of-Thought into continuous latent representations paired with a generative world model. This elegantly solves critical latency and cost bottlenecks for autonomous agents. Its broad applicability to everyday UI control and general embodied AI promises wider real-world deployment and cross-field impact compared to Paper 2's highly complex, domain-specific orchestration protocol for mechanical design.

vs. PolarMem: A Training-Free Polarized Latent Graph Memory for Verifiable Vision-Language Models

claude-opus-4.6 · 6/5/2026

MIRAGE addresses a fundamental challenge in mobile AI agents by compressing explicit chain-of-thought reasoning into continuous latent representations combined with a generative world model, achieving comparable performance with 3-5x fewer tokens. This has significant practical impact for on-device deployment and advances the important research direction of latent reasoning. PolarMem introduces an interesting negative memory concept for VLMs but is training-free and more incremental. MIRAGE's broader applicability to efficient reasoning and mobile agents, combined with its methodological novelty in latent reasoning distillation and world modeling, gives it higher potential impact.

vs. GITCO: Gated Inference-Time Context Optimization in TSFMs

gemini-3.1 · 6/5/2026

Paper 2 addresses a critical bottleneck in autonomous agents (latency and cost of explicit chain-of-thought) by introducing latent reasoning and a generative world model. This approach has broad implications for efficient, real-time multimodal agents and implicit reasoning, offering significant efficiency gains (75% fewer tokens). Paper 1 offers a useful but more niche inference-time optimization for time series models with relatively modest performance improvements, making Paper 2's methodological innovation and potential cross-field impact substantially higher.

vs. A Normative Intermediate Representation for ASP-Based Compliance Reasoning

gpt-5.2 · 6/5/2026

Paper 1 (MIRAGE) is likely higher impact: it introduces a broadly applicable approach (latent/internalized reasoning + generative world-model alignment) that addresses a central, timely bottleneck for multimodal UI agents—token inefficiency and deployment cost—while demonstrating sizable gains on established benchmarks. The method is relevant across agentic LMs, embodied/UI control, efficiency, and representation learning, with clear real-world applications. Paper 2 (MONIR) is valuable but more domain- and paradigm-specific (ASP compliance), with narrower cross-field reach despite solid rigor and practical relevance to regulatory reasoning.

vs. Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

gpt-5.2 · 6/5/2026

Paper 2 (MIRAGE) has higher likely impact: it introduces a broadly applicable paradigm—implicit latent reasoning plus a generative world-model objective—for efficient, deployable mobile/UI agents, with clear real-world utility (faster, cheaper inference) and strong empirical gains on established benchmarks. Its ideas can transfer across agentic settings (web, robotics, HCI, multimodal planning). Paper 1 is timely and valuable for LLM safety, but is more specialized to robustness against inference-time token injection and may have narrower application scope than MIRAGE’s efficiency and agent-control contributions.

vs. HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation

gemini-3.1 · 6/5/2026

Paper 2 introduces a highly novel methodological approach by internalizing explicit Chain-of-Thought into continuous latent representations and integrating a generative world model. This tackles critical bottlenecks in LLM agents (latency and token cost) and offers a paradigm shift applicable beyond mobile agents. While Paper 1 provides a robust, practical data pipeline for smart homes, Paper 2's fundamental algorithmic innovation in latent reasoning holds greater potential for broad scientific impact across the broader AI and agentic research communities.

vs. Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

claude-opus-4.6 · 6/5/2026

Paper 1 introduces a more novel conceptual contribution—Imaginative Perception Tokens that externalize spatial reasoning as intermediate perceptual representations rather than text, revealing a fundamental modality mismatch in spatial reasoning via language. This has broader theoretical implications across VLM research, cognitive science connections, and multiple spatial reasoning tasks. Paper 2's latent reasoning distillation for mobile agents is impactful but more application-specific and incremental (efficiency gains via reasoning compression). Paper 1's finding that textual CoT degrades spatial performance challenges prevailing assumptions and could redirect research on multimodal reasoning more broadly.

vs. Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization

gemini-3.1 · 6/5/2026

Paper 2 tackles a major bottleneck in autonomous agents (high latency and token costs of explicit Chain-of-Thought) by introducing implicit reasoning in latent space combined with a generative world model. This approach offers significant advancements in efficiency and planning for interactive AI systems. Paper 1 offers valuable insights into LLM priors for Bayesian optimization, but Paper 2's methodology has broader applicability and addresses a more pressing challenge in the rapidly growing field of LLM agents.