MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
Zhichao Yang, Yuanze Hu, Haojie Hao, Longkun Hao, Dongshuo Huang, Hongyu Lin, Gen Li, Lanqing Hong
Abstract
Mobile agents are increasingly expected to operate everyday applications from screenshots and language goals, where reliable control requires reasoning over screen affordances, multi-step navigation, and future state changes. However, many agents externalize this computation as long textual chains of thought, which slows interaction, increases supervision cost, and complicates deployment. We introduce MIRAGE, a framework that learns continuous latent reasoning representations from visible textual reasoning traces. MIRAGE transfers explicit reasoning into compact hidden states, enabling the agent to reason internally without decoding long rationales. It also incorporates a generative world-model objective: latent reasoning vectors are aligned with future screenshots, encouraging the agent to anticipate upcoming interface states before acting. This turns hidden computation into both a compressed thought representation and a forward-looking model of environment dynamics. At inference time, MIRAGE reasons in continuous latent space, reducing token generation while improving execution efficiency. On AndroidWorld, MIRAGE matches explicit chain-of-thought supervised fine-tuning in the 4B ablation with a 3-5x lower decoded-token budget and improves a comparable instruction-tuned baseline by 10.2 points; on AndroidControl, it improves action grounding while generating over 75% fewer tokens.
AI Impact Assessments
Scientific Impact Assessment: MIRAGE
1. Core Contribution
MIRAGE addresses a genuine tension in mobile GUI agents: explicit chain-of-thought (CoT) reasoning improves decision quality but inflates token generation, increasing latency and deployment cost. The paper proposes three interlocking ideas:
1. Latent reasoning slots: Replace verbose textual reasoning traces with N continuous latent vectors that occupy the same position in the decoder sequence but are never decoded to text.
2. Approximate Parallel Latent Refinement (APLR): A Jacobi-style parallel iterative scheme that refines all latent slots simultaneously across K rounds, rather than requiring N sequential forward passes as in Coconut-style serial latent CoT. The paper proves the first K slots exactly match the serial solution, with bounded tail error.
3. Q-Former world-model head: Aligns latent reasoning states with future screenshot features (extracted from the frozen vision encoder), encouraging the agent to anticipate GUI transitions. This provides auxiliary supervision that specifically targets tail-slot errors left by APLR.
The combination is well-motivated: APLR introduces approximation error in later latent slots, and these later slots correspond to the "predict" component of structured thought. The world-model loss directly supervises exactly these transition-predictive representations.
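The Jacobi-style refinement in item 2 can be illustrated with a toy sketch. This is not the paper's implementation: the `step` function below is a hypothetical stand-in for one causal decoder pass, and the slot count, round count, and context value are arbitrary. It only demonstrates the property the review cites, namely that after K parallel rounds the first K slots agree with the serial solution while later slots may still carry error.

```python
import math

def step(prefix, context):
    """Hypothetical causal update: a slot depends only on earlier slots and the context."""
    return math.tanh(context + 0.5 * sum(prefix) + 0.1 * len(prefix))

def serial_latents(n, context):
    """Serial latent CoT: n sequential passes, each slot computed from the final earlier slots."""
    z = []
    for _ in range(n):
        z.append(step(z, context))
    return z

def jacobi_latents(n, k, context, init=0.0):
    """Jacobi-style refinement: k rounds, each updating all n slots from the previous round."""
    z = [init] * n
    for _ in range(k):
        # Every slot reads only the previous round's values, so the n updates are independent
        # and could run in parallel within a round.
        z = [step(z[:i], context) for i in range(n)]
    return z

n, k = 8, 3
serial = serial_latents(n, 0.2)
parallel = jacobi_latents(n, k, 0.2)

# Slot i is fixed once slots 0..i-1 are fixed, so round r pins down slot r-1.
prefix_matches = all(serial[i] == parallel[i] for i in range(k))
# Slots beyond the first k have not converged yet in this toy setting.
tail_error = max(abs(s - p) for s, p in zip(serial[k:], parallel[k:]))

print("first k slots match serial:", prefix_matches)
print("max tail error:", tail_error)
```

In this sketch the sequential depth drops from N passes to K rounds, at the cost of residual error in slots K+1 through N. Those are the slots the review says the world-model loss is meant to supervise.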
2. Methodological Rigor
Strengths:
Weaknesses:
3. Potential Impact
Practical impact is potentially significant. Mobile agents face hard latency constraints: users expect sub-second responsiveness. A 3-5× reduction in decoded tokens (from ~100+ to ~20) and a 1.7× latency reduction translate directly into a better user experience. The framework is backbone-agnostic and demonstrated on two model sizes.
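The token figures quoted here and in the abstract mix fold reductions ("3-5×") with percentage savings ("over 75% fewer"). A small conversion sketch makes them comparable. The helper names are illustrative, and the inputs are the approximate values quoted in the text rather than measured data.

```python
def fraction_saved(fold):
    """Share of tokens removed by a fold-times reduction, e.g. 4x -> 0.75."""
    return 1.0 - 1.0 / fold

def fold_reduction(saved):
    """Fold reduction implied by removing a given share of tokens, e.g. 0.75 -> 4x."""
    return 1.0 / (1.0 - saved)

approx_fold = 100 / 20               # the "~100+ to ~20" figure, taken at face value
saved_at_3x = fraction_saved(3.0)    # lower end of the 3-5x range
saved_at_5x = fraction_saved(5.0)    # upper end of the 3-5x range
fold_at_75_percent = fold_reduction(0.75)

print(approx_fold, saved_at_3x, saved_at_5x, fold_at_75_percent)
```

Under these conversions, "over 75% fewer tokens" corresponds to a reduction of more than 4×, which falls inside the 3-5× range reported for AndroidWorld.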
Methodological influence could extend beyond mobile agents. The APLR technique for parallel latent refinement with provable early-slot recovery is generalizable to any latent-CoT application. The idea of using a world-model objective to compensate for tail-slot approximation error is architecturally elegant and could apply to other sequential decision-making domains.
Limitations to impact: The approach is supervised-only. No reinforcement learning or online adaptation is used. The world model operates in feature space rather than pixel space, limiting its use as an actual environment simulator. The Q-Former head is discarded at inference, so the model cannot generate predicted future screens for user verification or planning.
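To make "feature space rather than pixel space" concrete, the sketch below scores predicted feature vectors against target features of the next screenshot. The review does not state the loss MIRAGE uses, so the cosine-distance form, the function names, and the toy vectors are all assumptions chosen for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(predicted, target):
    """Assumed loss: mean (1 - cosine similarity) over paired feature vectors."""
    assert len(predicted) == len(target)
    total = sum(1.0 - cosine_similarity(p, t) for p, t in zip(predicted, target))
    return total / len(predicted)

# Toy stand-ins for frozen vision-encoder features of the future screenshot.
target = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
# Predictions pointing the same way as the targets (scale differs, direction matches).
aligned = [[3.0, 0.0, 0.0], [0.0, 0.5, 0.0]]
# Predictions orthogonal to the targets.
orthogonal = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

loss_aligned = alignment_loss(aligned, target)
loss_orthogonal = alignment_loss(orthogonal, target)
print(loss_aligned, loss_orthogonal)
```

A loss of this kind yields a training signal but no image: nothing in it can be rendered as a predicted screen, which is consistent with the limitation noted above.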
4. Timeliness & Relevance
This paper is well-timed. The deployment of VLM-based agents on mobile devices is an active industry and research priority. The tension between reasoning quality and inference efficiency is a current bottleneck, especially as models like Qwen3-VL, GPT-4o, and Claude are being applied to agentic tasks. The latent reasoning direction (Coconut, Quiet-STaR, pause tokens) is a nascent but rapidly growing area, and MIRAGE is among the first to apply it to a concrete agentic domain with practical efficiency constraints.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Overall Assessment
MIRAGE presents a well-conceived and technically sound framework that addresses a real deployment bottleneck in mobile GUI agents. The combination of parallel latent refinement with provable approximation guarantees and a world-model auxiliary objective is novel and well-motivated. Empirical results are strong though somewhat limited in scale. The work makes a meaningful contribution at the intersection of efficient reasoning and embodied agent deployment, though its impact would be strengthened by RL integration, broader benchmarking, and more rigorous statistical analysis.
Generated Jun 5, 2026
Comparison History (19)
Paper 2 introduces a foundational shift in agentic AI by replacing explicit chain-of-thought with continuous latent reasoning and generative world models. This addresses critical bottlenecks in latency and supervision for autonomous agents. While Paper 1 offers highly impressive engineering for LLM quantization, Paper 2's methodological innovation in internalizing reasoning steps presents a broader paradigm shift for the rapidly growing field of multimodal agents, likely inspiring widespread future research in implicit reasoning architectures.
Paper 1 offers a more technically novel and broadly enabling approach: compressing explicit reasoning into latent states and coupling it with a generative world-model objective for anticipating future UI states. This can materially improve efficiency and reliability of real-world mobile/UI agents, with quantitative gains and clear deployment relevance. While Paper 2 is timely and rigorous, its main contribution is a diagnostic/measurement finding plus a prompt-structure workaround; impactful for evaluation and prompting practice but narrower in downstream capability advances than MIRAGE’s method and application scope.
Paper 2 has higher estimated impact: it proposes a broadly applicable framework (latent internal reasoning + generative world-model alignment) with clear, timely real-world utility for efficient mobile UI agents and measurable gains (token reduction, performance improvements) on established benchmarks. Its contribution could transfer across agentic RL, multimodal modeling, and deployment-constrained settings. Paper 1 is novel and methodologically insightful for transformer interpretability/architecture, but is more specialized and primarily diagnostic, with less immediate application breadth and product-facing impact than MIRAGE.
Paper 1 (MIRAGE) is more likely to have higher scientific impact: it introduces a novel latent-reasoning transfer framework plus a generative world-model objective that ties internal reasoning states to future observations, enabling anticipatory control in real UI environments. The demonstrated gains on AndroidWorld/AndroidControl suggest strong real-world applicability to autonomous UI agents and embodied/interactive systems, with potential cross-field impact (reasoning, world models, agentic interaction, efficiency). Paper 2 is timely and useful but is primarily an inference-time systems optimization (KV-cache pruning) with narrower conceptual novelty and application scope.
Paper 2 (MRAgent) introduces a more broadly applicable and novel conceptual framework—active memory reconstruction inspired by cognitive science—that addresses a fundamental challenge (long-horizon memory reasoning) relevant across virtually all LLM agent applications. Its associative graph memory with dynamic reconstruction is a paradigm shift from static retrieve-then-reason approaches, with strong empirical gains (up to 23%). Paper 1 (MIRAGE) is well-executed but more narrowly focused on mobile UI agents and latent reasoning compression, which, while valuable, represents a more incremental advance in a specific application domain.
Paper 1 (MIRAGE) is likely to have higher scientific impact: it proposes a broadly useful training/inference framework (latent internal reasoning + world-model alignment) that directly improves efficiency and performance for UI-operating agents, a timely and high-demand application area. The approach has clear real-world deployment benefits (lower token/latency/cost) and can generalize to other embodied/interactive agent settings. Paper 2 is novel and relevant for provenance, but depends on access to internal activations/steering and may face adoption and threat-model limitations, narrowing practical impact.
Paper 1 establishes a foundational evaluation framework for AI agent reliability, addressing a critical bottleneck for real-world deployment. By proposing comprehensive metrics for safety, robustness, and consistency, it has the potential to broadly influence how the entire field evaluates and develops agents. While Paper 2 offers a strong technical innovation for mobile agent efficiency, Paper 1's conceptual contribution offers wider applicability, cross-field relevance, and long-term impact on AI safety and engineering.
SciDER addresses the broader challenge of automating the entire scientific research lifecycle with a multi-agent system, releasing open-source datasets and models (OpenSciDER-SFT-8K, OpenSciDER-27B). Its breadth of impact spans multiple scientific domains and benchmarks, and it tackles fundamental limitations in AI-driven scientific discovery. While MIRAGE presents a technically elegant approach to latent reasoning for mobile agents with clear efficiency gains, its scope is narrower (mobile UI navigation). SciDER's potential to democratize and accelerate scientific research across disciplines gives it higher estimated impact.
Paper 1 addresses a fundamental bottleneck in foundation model development: the self-evolution of reasoning capabilities without relying on scarce human process data. By tackling mimetic bias and reward decomposition, its theoretical contributions to latent logic mining and self-alignment have broader implications for general AI development compared to Paper 2, which focuses on a more specific application (mobile agents) and inference efficiency.
Paper 1 addresses a critical bottleneck in AI agents—the latency and cost of explicit Chain-of-Thought—by introducing continuous latent reasoning and generative world models. Its demonstrated 3-5x efficiency gain on standard benchmarks (AndroidWorld) offers immediate, widespread real-world utility. While Paper 2 presents a highly novel theoretical framework for agent memory, it currently remains a proof-of-concept. Paper 1's methodological rigor, timeliness in the fast-growing agentic AI space, and practical scalability give it a significantly higher potential for immediate scientific and industrial impact.
Paper 2 tackles the critical bottlenecks of token generation efficiency and environment modeling in LLM agents. By moving explicit chain-of-thought to continuous latent representations and integrating a generative world model, it offers fundamental algorithmic advancements in implicit reasoning. These innovations have broad applicability across AI domains, likely spurring more follow-up research than Paper 1's programming language and safety-oriented framework.
Paper 1 introduces a fundamental methodological shift by internalizing explicit Chain-of-Thought into continuous latent representations paired with a generative world model. This elegantly solves critical latency and cost bottlenecks for autonomous agents. Its broad applicability to everyday UI control and general embodied AI promises wider real-world deployment and cross-field impact compared to Paper 2's highly complex, domain-specific orchestration protocol for mechanical design.
MIRAGE addresses a fundamental challenge in mobile AI agents by compressing explicit chain-of-thought reasoning into continuous latent representations combined with a generative world model, achieving comparable performance with 3-5x fewer tokens. This has significant practical impact for on-device deployment and advances the important research direction of latent reasoning. PolarMem introduces an interesting negative memory concept for VLMs but is training-free and more incremental. MIRAGE's broader applicability to efficient reasoning and mobile agents, combined with its methodological novelty in latent reasoning distillation and world modeling, gives it higher potential impact.
Paper 2 addresses a critical bottleneck in autonomous agents (latency and cost of explicit chain-of-thought) by introducing latent reasoning and a generative world model. This approach has broad implications for efficient, real-time multimodal agents and implicit reasoning, offering significant efficiency gains (75% fewer tokens). Paper 1 offers a useful but more niche inference-time optimization for time series models with relatively modest performance improvements, making Paper 2's methodological innovation and potential cross-field impact substantially higher.
Paper 1 (MIRAGE) is likely higher impact: it introduces a broadly applicable approach (latent/internalized reasoning + generative world-model alignment) that addresses a central, timely bottleneck for multimodal UI agents—token inefficiency and deployment cost—while demonstrating sizable gains on established benchmarks. The method is relevant across agentic LMs, embodied/UI control, efficiency, and representation learning, with clear real-world applications. Paper 2 (MONIR) is valuable but more domain- and paradigm-specific (ASP compliance), with narrower cross-field reach despite solid rigor and practical relevance to regulatory reasoning.
Paper 2 (MIRAGE) has higher likely impact: it introduces a broadly applicable paradigm—implicit latent reasoning plus a generative world-model objective—for efficient, deployable mobile/UI agents, with clear real-world utility (faster, cheaper inference) and strong empirical gains on established benchmarks. Its ideas can transfer across agentic settings (web, robotics, HCI, multimodal planning). Paper 1 is timely and valuable for LLM safety, but is more specialized to robustness against inference-time token injection and may have narrower application scope than MIRAGE’s efficiency and agent-control contributions.
Paper 2 introduces a highly novel methodological approach by internalizing explicit Chain-of-Thought into continuous latent representations and integrating a generative world model. This tackles critical bottlenecks in LLM agents (latency and token cost) and offers a paradigm shift applicable beyond mobile agents. While Paper 1 provides a robust, practical data pipeline for smart homes, Paper 2's fundamental algorithmic innovation in latent reasoning holds greater potential for broad scientific impact across the broader AI and agentic research communities.
Paper 1 introduces a more novel conceptual contribution—Imaginative Perception Tokens that externalize spatial reasoning as intermediate perceptual representations rather than text, revealing a fundamental modality mismatch in spatial reasoning via language. This has broader theoretical implications across VLM research, cognitive science connections, and multiple spatial reasoning tasks. Paper 2's latent reasoning distillation for mobile agents is impactful but more application-specific and incremental (efficiency gains via reasoning compression). Paper 1's finding that textual CoT degrades spatial performance challenges prevailing assumptions and could redirect research on multimodal reasoning more broadly.
Paper 2 tackles a major bottleneck in autonomous agents (high latency and token costs of explicit Chain-of-Thought) by introducing implicit reasoning in latent space combined with a generative world model. This approach offers significant advancements in efficiency and planning for interactive AI systems. Paper 1 offers valuable insights into LLM priors for Bayesian optimization, but Paper 2's methodology has broader applicability and addresses a more pressing challenge in the rapidly growing field of LLM agents.