Yutong Bian, Dongjie Cheng, Heming Xia, Yongqi Li, Wenjie Li
Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.
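To make the reported reduction figures concrete, the sketch below computes the relative saving in reasoning tokens against a text baseline. The function name `token_reduction` and the token counts are hypothetical, chosen only so the arithmetic reproduces the 28.57% and 16% figures quoted in the abstract; they are not numbers from the paper, and this is not the paper's Marginal Accuracy Gain metric.

```python
def token_reduction(text_tokens: int, optical_tokens: int) -> float:
    """Fraction of reasoning tokens saved relative to the text baseline."""
    if text_tokens <= 0:
        raise ValueError("text_tokens must be positive")
    return (text_tokens - optical_tokens) / text_tokens


# Hypothetical counts (not from the paper), picked to match the quoted averages.
print(f"{token_reduction(700, 500):.2%}")  # language-task style example
print(f"{token_reduction(100, 84):.2%}")   # multimodal-task style example
```

A 28.57% reduction corresponds to a ratio of 2/7, i.e. the optical rationale uses roughly five tokens for every seven used by the text rationale.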
This paper proposes "optical reasoning," a paradigm where images serve as the sole medium for chain-of-thought reasoning in both language and multimodal tasks. Rather than generating textual rationales, the approach renders reasoning steps into images and feeds them back to MLLMs. Two variants are introduced: (1) Typographic-based Optical Reasoning (T-OR), which optimizes visual layout parameters (font size, text width, line spacing) to maximize information density within a controllable visual token budget; and (2) Graphical-based Optical Reasoning (G-OR), which decomposes rationales into step-aligned visual panels combining text, diagrams, and spatial layouts.
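The T-OR idea of tuning font size, text width, and line spacing under a visual token budget can be illustrated with a small grid search. This is a minimal sketch under stated assumptions, not the authors' procedure: the patch size `PATCH_PX`, the glyph-width ratio, the candidate grids, the legibility floor `min_font_px`, and the helper names `estimate_layout` and `pick_layout` are all hypothetical, and the canvas size is estimated arithmetically rather than by actually rendering an image.

```python
import math
import textwrap
from itertools import product

PATCH_PX = 28           # assumed side length (pixels) covered by one visual token
GLYPH_RATIO = (6, 10)   # assumed glyph width = 0.6 * font size (monospace-like)


def estimate_layout(text, font_px, chars_per_line, line_spacing):
    """Return (width_px, height_px, visual_tokens) for one candidate layout."""
    lines = textwrap.wrap(text, width=chars_per_line) or [""]
    num, den = GLYPH_RATIO
    width_px = math.ceil(chars_per_line * font_px * num / den)
    height_px = math.ceil(len(lines) * font_px * line_spacing)
    tokens = math.ceil(width_px / PATCH_PX) * math.ceil(height_px / PATCH_PX)
    return width_px, height_px, tokens


def pick_layout(text, token_budget, min_font_px=8):
    """Grid-search the densest layout (characters per visual token) in budget.

    Ties in density are broken in favour of the larger, more legible font.
    Returns None when no candidate fits the budget.
    """
    fonts = range(min_font_px, 25, 2)   # hypothetical candidate font sizes
    widths = (40, 60, 80, 120)          # hypothetical characters per line
    spacings = (1.0, 1.2, 1.5)          # hypothetical line-spacing factors
    best_key, best = None, None
    for font_px, cpl, spacing in product(fonts, widths, spacings):
        _, _, tokens = estimate_layout(text, font_px, cpl, spacing)
        if tokens > token_budget:
            continue
        key = (len(text) / tokens, font_px)
        if best_key is None or key > best_key:
            best_key = key
            best = {"font_px": font_px, "chars_per_line": cpl,
                    "line_spacing": spacing, "tokens": tokens}
    return best


if __name__ == "__main__":
    rationale = "step " * 200  # placeholder rationale text
    print(pick_layout(rationale, token_budget=200))
```

With a purely geometric objective like this one, the search tends to drift toward the smallest allowed font, which is why a legibility floor is included. A more faithful version would presumably score candidate layouts by downstream model accuracy, which this sketch does not attempt.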
The key claim is that images can serve as effective and more token-efficient reasoning media compared to text, achieving 1.96× token efficiency as measured by their Marginal Accuracy Gain metric.
The experimental design has several strengths but also notable concerns.
The paper addresses a legitimate and growing concern: reasoning token efficiency. As LRMs produce increasingly long reasoning traces, methods to compress these without performance loss are valuable. However, the practical impact is uncertain.
The most useful finding may be that different MLLMs respond very differently to visual information density (e.g., Gemini performs well under aggressive compression while others need more tokens), which is practical empirical knowledge for the community.
The paper is timely, situated at the intersection of several active research directions: reasoning efficiency, multimodal reasoning, and interleaved-modal generation. The emergence of long-form reasoning models (DeepSeek-R1, o1-style models) makes token efficiency increasingly relevant. The optical compression literature is recent (2025-2026), and this paper extends it to reasoning contexts.
However, the framing as "images as standalone reasoning media" is somewhat misleading—the approach still depends on textual rationales being generated first (or provided), then rendered into images. True standalone visual reasoning would involve models that think natively in visual space.
The paper's most provocative finding—that extremely compressed rationale images still help reasoning—deserves deeper investigation. It raises the possibility that the image is serving more as a *prompt* or *activation trigger* than as a faithfully decoded rationale, which would reframe the contribution significantly. The ablation showing that red text and specific fonts matter also suggests the mechanism may be more about visual attention patterns than information content.
Generated Jun 9, 2026
Paper 2 introduces a fundamentally novel paradigm—using images as the primary reasoning medium instead of text—which challenges deep assumptions about how LLMs reason. This conceptual boldness has broader implications across AI, cognitive science, and multimodal understanding. While Paper 1 offers a solid engineering contribution (compressing memory tokens for resource-constrained QA), it is more incremental within the established RAG paradigm. Paper 2's potential to inspire new research directions in visual reasoning, token efficiency, and unified multimodal representations gives it higher estimated scientific impact.
Paper 2 introduces a fundamentally novel paradigm, using images as the primary reasoning medium instead of text, which challenges a core assumption in the field. This has broader implications across LLMs, multimodal AI, and cognitive science, with demonstrated efficiency gains (an average 28.57% reduction in reasoning tokens on language tasks and 16% on multimodal tasks) and applicability across mathematical, scientific, and multimodal tasks. While Paper 1 makes a solid contribution to AI safety diagnostics with its CoT-Output safety matrix, it addresses a narrower problem within multi-turn safety evaluation. Paper 2's conceptual novelty and breadth of potential applications give it higher impact potential.
Paper 1 likely has higher scientific impact: it identifies and quantifies a safety-critical failure mode (memory-amplified sycophancy) in a rapidly emerging class of deployed systems, introduces a dedicated benchmark (MIST) spanning high-stakes domains, evaluates across multiple memory systems and model families, provides causal evidence (memory extraction/compression), and offers lightweight mitigations with preserved utility. This combines novelty with strong real-world relevance, methodological breadth, and timeliness for alignment, personalization, and agentic/memory LLM deployments. Paper 2 is novel but its impact may be narrower and more dependent on practical integration/cost tradeoffs.
Paper 1 proposes a fundamental paradigm shift in AI by using images as a standalone reasoning medium, potentially revolutionizing how multimodal models process information. Its ability to cut reasoning tokens by an average of 28.57% on language tasks while matching text reasoning performance gives it broad applicability across the entire AI field. In contrast, while Paper 2 offers significant practical value and time savings for scientific simulations, its impact is more domain-specific and applied.
Paper 1 introduces a fundamentally novel paradigm—using images as the primary reasoning medium for LLMs—which challenges core assumptions about how reasoning should be represented. This concept of 'optical reasoning' is highly innovative, broadly applicable across mathematical, scientific, and multimodal tasks, and demonstrates both improved performance and significant token efficiency gains. Its breadth of impact spans AI reasoning, multimodal learning, and cognitive science. Paper 2, while practically useful for traffic prediction, addresses a more incremental and domain-specific problem with narrower impact potential.
Paper 2 proposes a fundamentally novel paradigm, using images as the primary reasoning medium instead of text, which challenges core assumptions about how LLMs reason. This conceptual innovation (optical reasoning) has broader implications across AI reasoning, efficiency, and multimodal understanding. It demonstrates practical benefits (an average 28.57% reduction in reasoning tokens on language tasks) and opens entirely new research directions. Paper 1, while methodologically sound and useful, is primarily a benchmarking contribution that evaluates known models on table formats: important but incremental. Paper 2's paradigm-shifting nature gives it higher potential for cross-field impact and future citations.
Paper 2 likely has higher scientific impact: it introduces a general analytical framework (GAMBLe) for understanding and designing AI-driven research systems across domains, backed by substantial replicated experiments and revealing non-obvious interaction effects and failure of standard guarantees. This breadth (methodology + empirical evidence) can influence how ADRS are built and evaluated in many fields, and is timely given rapid adoption of automated discovery pipelines. Paper 1 is novel and practical for multimodal reasoning efficiency, but its impact is narrower and more contingent on specific model/tooling support for image-based rationales.
Paper 1 proposes a fundamentally novel paradigm, using images as the sole reasoning medium for LLMs, which challenges a core assumption in the field (that reasoning must be text-based). This concept of 'optical reasoning' opens entirely new research directions for multimodal AI, demonstrates practical benefits (an average 28.57% reduction in reasoning tokens on language tasks), and applies broadly across mathematical, scientific, and multimodal tasks. Paper 2, while rigorous and valuable for LLM evaluation, is more incremental as an analysis/benchmarking framework. Paper 1's paradigm-shifting nature and broad applicability give it higher potential for transformative impact.
Paper 1 proposes a fundamental paradigm shift by using images as a standalone reasoning medium instead of text, challenging traditional Chain-of-Thought methods. This high novelty could spark an entirely new direction in multimodal model architecture and reasoning efficiency. Paper 2, while highly practical and effective for agentic workflows, represents a more incremental architectural enhancement over existing RAG systems.
Paper 1 introduces a paradigm-shifting concept by proposing images as a standalone reasoning medium, challenging the text-centric status quo of Chain-of-Thought reasoning. This highly novel approach opens entirely new research directions in multimodal foundation models and knowledge representation. While Paper 2 offers a highly practical and rigorous system-level optimization for RAG efficiency, Paper 1's conceptual innovation is more likely to inspire a broader range of follow-up scientific research across the AI community, giving it higher potential for long-term scientific impact.