Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning
Yang Zhang, Xiaoshuai Sun, Rui Zhao, Wujin Sun, Yidong Chen, Jiayi Ji, Qian Chen, Rongrong Ji
Abstract
Existing multimodal reasoning approaches predominantly follow two paradigms: converting visual inputs into text prior to reasoning, or performing end-to-end reasoning within a unified vision-language representation space. Despite their empirical progress, both paradigms suffer from fundamental structural limitations. The former relies on static visual-to-text conversion, which tends to compress and lose fine-grained visual details. The latter is prone to linguistic dominance induced by joint optimization and attention mechanisms, leading to systematically weakened faithfulness to visual evidence during reasoning. In this work, we argue that a central challenge is how and when visual evidence is introduced into the reasoning process. Motivated by this insight, we propose CSMR, a multimodal reasoning framework in which a language model controls the reasoning process by deciding when to invoke an independent visual perception module to acquire task-relevant visual evidence. Experiments across multiple multimodal reasoning benchmarks show that CSMR consistently outperforms representative baseline methods in accuracy under a zero-shot setting. Further experimental analysis confirms that these advantages primarily arise from the proposed cognitive scheduling mechanism.
AI Impact Assessments
Scientific Impact Assessment: "Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning"
1. Core Contribution
The paper identifies a tension in multimodal reasoning between two dominant paradigms: (a) pre-reasoning visual-to-text conversion that loses fine-grained details, and (b) unified vision-language representation spaces that suffer from linguistic dominance over visual tokens. The proposed solution, CSMR, decouples perception from reasoning by using an LLM as a "Cognitive Reasoning Core" (CRC) that dynamically decides when to query an independent VLM-based "Primary Visual Perception" (PVP) module for visual evidence. The CRC maintains an explicit reasoning state and iteratively issues targeted visual queries, integrating returned textual evidence until it determines sufficient information has been gathered.
The key insight is that visual evidence acquisition should be demand-driven by the evolving reasoning state rather than performed as a one-shot conversion or continuously fused in a joint embedding space. This is a clean architectural idea motivated by Baddeley's working memory theory.
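The demand-driven loop described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ReasoningState`, `run_scheduler`, `crc_step`, `pvp_query`, and the `max_queries` budget are all hypothetical, and the two toy functions stand in for real LLM and VLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ReasoningState:
    """Explicit reasoning state kept by the language-model controller (hypothetical)."""
    question: str
    evidence: List[Tuple[str, str]] = field(default_factory=list)  # (query, answer) pairs


# A decision is either ("query", <visual question>) or ("answer", <final answer>).
Decision = Tuple[str, str]


def run_scheduler(
    question: str,
    crc_step: Callable[[ReasoningState], Decision],
    pvp_query: Callable[[str], str],
    max_queries: int = 4,  # assumed safety budget, not taken from the paper
) -> Tuple[str, ReasoningState]:
    """Alternate between reasoning and perception until the controller answers."""
    state = ReasoningState(question)
    for _ in range(max_queries):
        action, text = crc_step(state)
        if action == "answer":
            return text, state
        # The controller asked for visual evidence: call the perception module
        # and store the returned text in the reasoning state.
        state.evidence.append((text, pvp_query(text)))
    # Budget exhausted: give the controller one last chance to answer.
    action, text = crc_step(state)
    return (text if action == "answer" else "undetermined"), state


def toy_crc(state: ReasoningState) -> Decision:
    """Stand-in for the LLM: ask one visual question, then answer from it."""
    if not state.evidence:
        return ("query", "What color is the traffic light?")
    return ("answer", state.evidence[-1][1])


def toy_pvp(query: str) -> str:
    """Stand-in for the VLM perception module."""
    return "red"


answer, final_state = run_scheduler("Should the car stop?", toy_crc, toy_pvp)
```

The design point the sketch tries to capture is that the perception module is only called when the controller requests it, and that its output enters the reasoning state as text rather than as visual tokens.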
2. Methodological Rigor
Strengths in analysis: The paper provides a thoughtful empirical analysis (Section 4) of why unified VLMs exhibit linguistic dominance. The attention distribution analysis across multiple architectures (Qwen3-VL-8B, LLaVA-1.6-7B) convincingly demonstrates that text tokens systematically receive higher attention weights than visual tokens, and that CoT-style long chains further dilute visual attention. The controlled ScienceQA experiment showing 57-68% accuracy without images effectively illustrates reliance on linguistic priors.
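One simple way to quantify the attention imbalance discussed above is the fraction of a query token's attention mass that lands on visual positions. The sketch below is an assumption about how such a measurement could be made, not the paper's actual procedure; the function name `visual_attention_share` and the example weights are invented for illustration.

```python
from typing import Sequence


def visual_attention_share(weights: Sequence[float], is_visual: Sequence[bool]) -> float:
    """Fraction of total attention mass assigned to visual token positions.

    weights:   attention weights from one query token over all key positions
    is_visual: True where the key position holds a visual token
    """
    if len(weights) != len(is_visual):
        raise ValueError("weights and is_visual must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    visual = sum(w for w, v in zip(weights, is_visual) if v)
    return visual / total


# Made-up example: three visual tokens followed by two text tokens.
example_weights = [0.05, 0.05, 0.10, 0.30, 0.50]
example_mask = [True, True, True, False, False]
share = visual_attention_share(example_weights, example_mask)
```

In this invented example the visual tokens outnumber the text tokens but receive only about a fifth of the attention mass, which is the kind of pattern the assessment refers to as linguistic dominance.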
Weaknesses in evaluation: The experimental evaluation has several notable limitations:
3. Potential Impact
The framework introduces a modular reasoning architecture that could have several practical implications:
However, the approach is a structured prompting framework over existing models, with no training involved. This limits both its novelty and its potential performance ceiling. The "cognitive scheduling" terminology, while appealing, reduces to an LLM deciding whether to ask another question or output an answer, a mechanism already explored in tool-augmented LLM reasoning (e.g., ReAct, Chameleon, Visual Programming).
4. Timeliness & Relevance
The paper addresses a timely concern: as VLMs scale and are applied to more complex reasoning tasks, the faithfulness of visual grounding becomes increasingly important. The attention dilution analysis and linguistic dominance argument are relevant to the growing literature on VLM hallucination. The zero-shot, training-free nature of CSMR is practically appealing given the cost of VLM training.
However, the landscape is rapidly evolving. Recent thinking/reasoning models (e.g., QwQ, Gemini 2.5) are already incorporating dynamic visual re-examination mechanisms within unified architectures. The paper's fundamental claim—that decoupled perception-reasoning is superior to unified—may become less compelling as unified models improve their visual grounding through better training objectives or architectural innovations.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
Overall Assessment
This paper presents a clean and well-motivated framework for demand-driven visual evidence acquisition in multimodal reasoning. The empirical analysis of linguistic dominance in VLMs is its strongest contribution. However, the technical novelty is limited—the core mechanism is structured prompting for iterative VLM querying—and the experimental evaluation, while showing consistent gains, lacks the depth and rigor needed to establish this as a fundamentally superior paradigm. The work is a solid contribution to the multimodal reasoning literature but falls short of being transformative.
Generated May 28, 2026
Comparison History (13)
Paper 2 addresses a critical, highly timely challenge with broad implications across AI development, cognitive science, and governance: measuring AGI progress. While Paper 1 offers a strong, empirically validated technical contribution to multimodal reasoning, Paper 2 provides a foundational taxonomy and evaluation protocol that could shape future benchmarking standards, policy-making, and interdisciplinary research, giving it a higher potential for widespread, paradigm-shifting scientific impact.
Paper 2 is likely to have higher scientific impact due to stronger novelty and broader cross-field relevance: it introduces a general cognitive scheduling principle for when to acquire visual evidence, addressing structural limitations of both caption-then-reason and end-to-end VLMs. This idea is timely for multimodal agents and has clear real-world applications (interactive perception, robotics, document/UI understanding) and broader impact spanning vision, NLP, and agentic planning. Paper 1 is methodologically solid and useful, but confidence-weighted RL updates/replay are more incremental and narrower in scope, with impact mainly within self-training/RLHF-style LLM training.
Paper 1 addresses a more practical and broadly impactful problem—multimodal reasoning with a novel cognitive scheduling framework that determines when to invoke visual perception during reasoning. This has immediate applications across vision-language tasks and introduces an architecturally novel paradigm beyond existing approaches. Paper 2, while offering valuable mechanistic interpretability insights into LLM reasoning circuits, is more analytical/explanatory in nature with narrower scope (logical reasoning only). Paper 1's framework-level contribution with demonstrated benchmark improvements has greater potential to influence future system design across the multimodal AI community.
Paper 2 addresses a highly critical and timely bottleneck in AI deployment: the safety and governance of autonomous agents. By employing formal methods (Petri nets) to guarantee safety bounds and escalation protocols, it offers rigorous, cross-domain applications in high-stakes fields like healthcare and robotics. Paper 1 provides a valuable but more narrowly focused architectural improvement for multimodal reasoning, whereas Paper 2's theoretical framework for 'managed autonomy' has a broader potential impact across AI safety, policy, and human-machine interaction.
Paper 1 addresses a fundamental challenge in multimodal reasoning—when and how to integrate visual evidence—proposing a novel cognitive scheduling framework (CSMR) that rethinks the paradigm of vision-language integration. This has broad impact across the rapidly growing multimodal AI field, touching numerous applications (VQA, visual reasoning, embodied AI). Paper 2, while innovative in extracting optimization skills from expert GPU kernels, targets a narrower domain (GPU kernel optimization) with more limited cross-field applicability. Paper 1's architectural insight about dynamic visual evidence acquisition is more likely to influence diverse research directions.
Paper 2 addresses a fundamental architectural limitation in multimodal AI (linguistic dominance and visual detail loss) by introducing a novel cognitive scheduling mechanism. This advances core AI reasoning research and has broad theoretical implications. In contrast, Paper 1 offers a highly innovative and practical systems engineering solution for local AI agents, but its impact is more confined to software architecture and database engineering rather than foundational scientific discovery.
Paper 2 likely has higher scientific impact due to broader applicability and timeliness: cognitively scheduled visual evidence acquisition targets a central failure mode in multimodal reasoning and can influence many tasks (VQA, chart/diagram understanding, embodied agents, tool-using LLMs). The modular “on-demand perception” idea is novel and aligns with current trends toward agentic, tool-invoking models, making it widely reusable across fields. Paper 1 is rigorous and valuable for ad auctions, but its domain specificity (auto-bidding with a bid multiplier controller) narrows cross-field impact despite strong theoretical guarantees.
Paper 1 addresses a critical bottleneck in deploying Vision-Language Models (VLMs) by introducing a highly novel structured pruning method that preserves Chain-of-Thought reasoning. Its methodological rigor in identifying 'pivot tokens' and addressing cross-modal activation differences provides deep insights into VLM internals. While Paper 2 offers an interesting agentic framework for visual reasoning, Paper 1's approach enables significant real-world applications by reducing computational costs without sacrificing complex reasoning capabilities, likely driving broader adoption and follow-up research in model efficiency.
Paper 1 addresses a critical frontier in AI: autonomous scientific discovery and long-horizon reasoning. By providing a rigorous, multimodal benchmark for spatial biology, it enables the evaluation of AI agents on complex, real-world scientific tasks rather than isolated steps. This has immense potential to accelerate biological research and drug discovery. While Paper 2 presents a solid methodological improvement for multimodal reasoning, Paper 1's focus on end-to-end scientific reasoning in a high-impact applied domain offers greater potential for transformative, real-world scientific advancements.
Paper 1 (Hera) addresses a highly practical and timely problem—efficient device-cloud collaboration for LLM agents—with a rigorous two-stage training paradigm combining imitation and reinforcement learning. It demonstrates strong empirical results across three diverse benchmarks, achieving 92.5% of cloud performance at 46.3% cloud usage. The approach has broad real-world applicability as LLM deployment scales. Paper 2 (CSMR) proposes an interesting cognitive scheduling mechanism for multimodal reasoning, but the problem scope is narrower and the zero-shot evaluation setting, while notable, limits demonstrated impact. Hera's cost-efficiency contributions are more immediately impactful for the field.
Paper 1 addresses a fundamental and highly timely issue in LLM deployment—verifying attribution in retrieval-augmented generation. By introducing a novel, cognitively-inspired method to distinguish parametric memory from retrieved context via internal representations, it provides a critical step toward safe, high-stakes AI deployment. Paper 2 offers a valuable architectural framework for multimodal reasoning, but Paper 1's focus on trust, interpretability, and solving a major blind spot in widely used RAG systems gives it broader and more immediate scientific and real-world impact.
Paper 2 proposes a novel, actionable framework (cognitive scheduling of visual evidence acquisition) that changes the multimodal reasoning pipeline and shows consistent zero-shot gains across benchmarks, suggesting clearer methodological contribution and nearer-term applicability to real systems needing faithful visual grounding. Its modular “invoke perception on demand” idea is broadly relevant across VLMs, agents, and interactive perception. Paper 1 is a valuable critique with careful controls that improves evaluation rigor for LLM metacognition, but it is primarily corrective/diagnostic and may have narrower immediate downstream application than a new performant architecture-level approach.
Paper 2 has higher estimated impact due to broader applicability and timeliness: a general framework for multimodal reasoning that improves visual faithfulness via on-demand evidence acquisition is relevant across VQA, embodied/agentic perception, document understanding, and reliability/interpretability. The cognitive scheduling idea is a clear architectural contribution that can be reused with different LMs and vision modules, and strong zero-shot benchmark gains suggest immediate practical value. Paper 1 is innovative and rigorous (notably with guarantees) but targets a narrower community (optimization modeling) and depends on human-in-the-loop workflows, likely limiting breadth of adoption.