MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning
Aritra Dutta, Somak Aditya
Abstract
Vision-language models (VLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex multimodal tasks, but their large parameter sizes make deployment expensive. Structured pruning offers a natural solution; however, existing methods fail to preserve CoT reasoning accuracy in VLMs. We identify two key reasons: (1) CoT consistency depends on sparse transition points (pivot tokens) in the generation trajectory, while existing pruning methods are CoT-agnostic; and (2) pruning methods designed for unimodal LLMs do not account for activation-distribution differences across visual and textual modalities. Motivated by these observations, we propose MuCRASP, a structured pruning framework that targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget. Experiments on four VLMs across three reasoning benchmarks show that MuCRASP consistently preserves reasoning quality under increasing compression. At 30% pruning on Qwen2.5-VL-7B, MuCRASP achieves an LLM-as-a-Judge score of 8.87 versus 7.32 for the strongest baseline on physical reasoning tasks. Furthermore, MuCRASP maintains high reasoning consistency up to 50% pruning, significantly outperforming prior pruning approaches while exhibiting lower perplexity degradation.
AI Impact Assessments
Scientific Impact Assessment: MuCRASP
1. Core Contribution
MuCRASP addresses a genuinely underexplored problem: how to structurally prune Vision-Language Models while preserving chain-of-thought reasoning quality. The paper identifies two concrete failure modes of existing pruning methods applied to VLMs: (1) CoT reasoning depends on sparse "pivot tokens" at reasoning-step boundaries whose importance is diluted under uniform token aggregation, and (2) existing LLM pruning methods ignore cross-modal activation distribution differences between vision and language components.
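The dilution effect in failure mode (1) can be illustrated with a toy calculation. The sketch below is not the paper's attribution formula. The per-token scores, the pivot positions, and the helper names `uniform_score` and `pivot_window_score` are all hypothetical, chosen only to show how averaging over every token can rank a pivot-critical unit below a uniformly mediocre one.

```python
# Toy illustration (hypothetical numbers, not the paper's method):
# per-token importance scores for two prunable units over a 20-token CoT trace.

def uniform_score(token_scores):
    """Average a unit's per-token importance over every generated token."""
    return sum(token_scores) / len(token_scores)

def pivot_window_score(token_scores, pivot_positions, window=1):
    """Average importance only over tokens within `window` of a pivot position."""
    selected = set()
    for p in pivot_positions:
        lo = max(0, p - window)
        hi = min(len(token_scores), p + window + 1)
        selected.update(range(lo, hi))
    if not selected:
        return 0.0
    return sum(token_scores[i] for i in selected) / len(selected)

# Assumed pivot tokens at positions 5 and 15 (e.g. reasoning-step boundaries).
pivots = [5, 15]

# Unit A matters only at the pivots; unit B matters a little everywhere.
unit_a = [0.0] * 20
for p in pivots:
    unit_a[p] = 1.0
unit_b = [0.15] * 20

# Uniform aggregation ranks A below B, so A would be pruned first.
print(uniform_score(unit_a), uniform_score(unit_b))
# Restricting to the pivot positions reverses the ranking.
print(pivot_window_score(unit_a, pivots, window=0),
      pivot_window_score(unit_b, pivots, window=0))
```

Under these assumed scores, uniform averaging would prune unit A first even though it carries all of the signal at the reasoning transitions, which is the behavior the review attributes to CoT-agnostic methods.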
The proposed framework combines four components: global Taylor-based attribution, trajectory pivot attribution restricted to reasoning transition windows, a Cross-Modal Dependency Score (CMDS) based on MMD to identify vision-language integration layers, and a global knapsack allocation that handles the extreme heterogeneity of structural unit sizes in VLMs (~250× cost difference between GQA groups and MLP neurons). This is a training-free method, which is practically valuable.
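To see why heterogeneous unit sizes call for a knapsack-style allocation, consider a minimal sketch that ranks units by importance per parameter rather than by raw importance. This is a greedy approximation written for illustration, not the paper's allocation procedure. The `Unit` type, the function `select_units_to_prune`, and all importance values are hypothetical, and the 250:1 cost ratio simply mirrors the approximate figure quoted above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    name: str          # hypothetical identifier for a prunable structure
    importance: float  # assumed attribution score (higher = more critical)
    cost: int          # parameters freed if the unit is removed

def select_units_to_prune(units, prune_budget):
    """Greedy knapsack heuristic: remove the units with the lowest
    importance per parameter until the parameter budget is used up."""
    ranked = sorted(units, key=lambda u: u.importance / u.cost)
    pruned, removed = [], 0
    for u in ranked:
        if removed + u.cost > prune_budget:
            continue  # unit does not fit in the remaining budget; try smaller ones
        pruned.append(u)
        removed += u.cost
    return pruned, removed

# Two attention groups (cost 250 each) and 100 MLP neurons (cost 1 each).
units = [
    Unit("gqa_0", importance=50.0, cost=250),    # 0.2 importance per parameter
    Unit("gqa_1", importance=500.0, cost=250),   # 2.0 importance per parameter
] + [Unit(f"mlp_{i}", importance=0.5, cost=1) for i in range(100)]

pruned, removed = select_units_to_prune(units, prune_budget=300)
print(removed, [u.name for u in pruned][:3])
```

Ranking by raw importance alone would treat every MLP neuron (0.5) as less important than `gqa_0` (50.0), even though removing `gqa_0` frees 250 parameters at a lower importance cost per parameter. Normalizing by cost is one simple way to make units of very different sizes comparable under a single global budget.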
2. Methodological Rigor
Strengths in experimental design:
Weaknesses:
3. Potential Impact
Practical applications: Enabling deployment of reasoning-capable VLMs on resource-constrained hardware is directly relevant for edge deployment, mobile applications, and reducing inference costs. The training-free nature makes the approach accessible.
Methodological contributions: The observation that reasoning coherence and answer-extraction accuracy degrade along fundamentally different trajectories is an important insight that could influence how compressed generative models are evaluated more broadly. The CMDS concept for identifying cross-modal bottleneck layers could be adopted in other VLM optimization work beyond pruning.
Influence on evaluation practices: The paper makes a compelling case that perplexity is a misleading proxy for reasoning quality. For example, at 30% pruning on Qwen2.5-VL-7B, Attribution Pruning achieves a lower perplexity than MuCRASP (1.59 versus 1.74), yet its LLM-as-a-Judge score collapses to 1.33 against MuCRASP's 8.87. This could shift evaluation norms in model compression research.
4. Timeliness & Relevance
The paper addresses a genuine bottleneck: VLMs with CoT reasoning are increasingly deployed but computationally expensive. The intersection of structured pruning, multimodal reasoning, and CoT preservation is timely and virtually unexplored—the paper convincingly argues that no prior work studies structured pruning of VLMs with explicit CoT preservation objectives.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
Missing comparisons: No comparison with recent VLM-specific compression methods beyond ECoFLaP and TAMP, and no comparison with quantization, the dominant practical compression technique.
Generated May 26, 2026
Comparison History (21)
Paper 2 (HiSME) likely has higher scientific impact due to broader applicability and timeliness: hierarchical meta-evolving of skills and the evolving strategy targets a central bottleneck in deployed agentic systems (continual, test-time improvement) and can generalize across many tasks, domains, and LLM backends without costly parameter updates. This paradigm could influence agent design, lifelong learning, and meta-learning communities. Paper 1 is methodologically solid and useful for efficient VLM deployment, but its contribution is more specialized (structured pruning for CoT in VLMs) with narrower cross-field reach.
Paper 1 addresses a critical bottleneck in deploying Vision-Language Models (VLMs) by introducing a highly novel structured pruning method that preserves Chain-of-Thought reasoning. Its methodological rigor in identifying 'pivot tokens' and addressing cross-modal activation differences provides deep insights into VLM internals. While Paper 2 offers an interesting agentic framework for visual reasoning, Paper 1's approach enables significant real-world applications by reducing computational costs without sacrificing complex reasoning capabilities, likely driving broader adoption and follow-up research in model efficiency.
Paper 2 addresses a fundamental and highly debated question in AI (whether LLMs genuinely reason) by rigorously critiquing a high-profile benchmark. By exposing statistical flaws and dataset artifacts, it corrects the scientific record and promotes better evaluation methodologies. While Paper 1 offers a useful, practical pruning tool for VLMs, Paper 2's broader theoretical implications and critical methodological corrections will likely have a wider, paradigm-shifting impact across the AI research community.
MuCRASP addresses a fundamental and broadly applicable problem—preserving chain-of-thought reasoning during structured pruning of VLMs. Its novelty lies in identifying pivot tokens and cross-modal activation differences, providing a principled pruning framework with strong empirical results across multiple models and benchmarks. This has wide applicability to model compression across the VLM community. Paper 2, while valuable for its dataset and benchmark contributions to mobile GUI navigation, addresses a narrower application domain (Chinese mobile apps) and is more incremental in its technical contributions (scaling analysis, benchmarking toolkit).
Paper 2 has higher likely scientific impact due to broader applicability and timeliness: per-query optimization of retrieval-agent configurations addresses a widespread, practical bottleneck in deploying RAG systems across domains, with immediate cost/latency implications. The approach is modular (works over a pipeline catalog), aligns with current industry trends toward agentic systems, and can influence both systems and ML communities. Paper 1 is novel within VLM compression and CoT-aware pruning, but its impact is narrower (structured pruning for specific VLM classes) and less directly transferable beyond multimodal model compression.
Paper 2 likely has higher scientific impact due to a clearer, broadly applicable technical contribution with immediate deployment value: a structured pruning method tailored to multimodal chain-of-thought reasoning, validated across multiple VLMs and benchmarks with sizable gains at high pruning rates. Its relevance is timely given widespread VLM deployment constraints and the community focus on efficient inference. Paper 1 is novel and insightful for KG-guided scientific hypothesis generation, but its impact may be narrower (battery KG setting) and more diagnostic/empirical than enabling a widely reusable algorithmic advance.
MuCRASP addresses a fundamental and broadly applicable problem—efficiently compressing vision-language models while preserving chain-of-thought reasoning. It introduces novel concepts (pivot tokens, cross-modal pruning sensitivity) applicable across many VLMs and reasoning tasks, with rigorous experiments on multiple models and benchmarks. Paper 2, while valuable, is a domain-specific application (steel industry VOCs) combining existing techniques (KGs, RAG, multi-agent systems) with narrower impact scope. Paper 1's contributions to model compression methodology have broader relevance to the rapidly growing VLM community.
Paper 2 likely has higher scientific impact due to stronger real-world applicability and broader relevance: it addresses deployment-critical efficiency for large multimodal VLMs while explicitly preserving CoT reasoning, a central capability in current systems. Its contributions (pivot-token awareness, cross-modal activation considerations, global budgeted structured pruning) are timely and useful across many VLM deployments and tasks. Paper 1 is novel in combining mechanistic interpretability with unfaithfulness detection, but its immediate applications are narrower and depend on circuit-tracing assumptions and benchmarks.
Paper 1 demonstrates that agentic AI systems can achieve near-complete success on program verification benchmarks, fundamentally challenging the field's evaluation methodology and establishing a new paradigm (compiler-in-the-loop agentic proving) for formal verification. This has broad implications for software engineering, formal methods, and AI safety. Paper 2, while technically solid, presents an incremental improvement in model compression for VLMs—a narrower contribution in a crowded space. Paper 1's finding that benchmarks are already saturated by current AI capabilities is a significant wake-up call with wider reverberations across multiple research communities.
Paper 1 offers higher potential impact due to its immediate real-world applicability in deploying resource-intensive Vision-Language Models. By introducing a structured pruning method that preserves Chain-of-Thought reasoning, it solves a major bottleneck in AI efficiency. While Paper 2 provides valuable theoretical insights into LLM interpretability and behavioral simulation, Paper 1 addresses a more pressing, widespread engineering challenge. The ability to compress state-of-the-art VLMs by 30-50% without losing reasoning consistency will directly benefit researchers and industry practitioners, ensuring broader and more immediate technological adoption.
Paper 1 is more novel and broadly impactful: it introduces a new geometric framing (hyperbolic guidance) for multi-step reasoning that could generalize across LLM architectures, tasks, and even search/verification methods. If robust, it offers a lightweight, efficient alternative to expensive tree-search while improving deeper reasoning, a timely core problem. Paper 2 is strong and practical for deploying VLMs, but structured pruning is a more incremental area and its impact is narrower (primarily compression of multimodal CoT) and potentially sensitive to model/task specifics and evaluation via LLM-judge.
Paper 1 addresses a novel interdisciplinary question about collective creativity dynamics in human-AI systems, leveraging a large-scale naturalistic dataset to reveal fundamental mechanisms of cultural evolution. Its findings about attractor states, the paradox between novelty appreciation and remix preferences, and group-size effects have broad implications across cultural evolution, computational social science, creativity research, and AI-assisted design. Paper 2, while technically solid, represents an incremental improvement in model compression—a crowded subfield with rapid turnover. Paper 1's unique dataset, cross-disciplinary relevance, and insights into human-AI co-creation give it broader and more lasting impact.
Paper 1 is more novel and broadly impactful: it introduces “premature confidence” as a scalable, label-free signal for reasoning failures and proposes a general RL objective (progressive confidence shaping) that improves reasoning quality and faithfulness across multiple tasks and model sizes. This targets a central, timely limitation of LLM test-time compute and CoT reliability, with implications for alignment and safety. Paper 2 is valuable and applicable for efficient multimodal deployment, but pruning methods are more incremental and narrower in scope, with impact mainly in VLM compression rather than core reasoning improvements.
While Paper 1 presents an innovative application of masked diffusion for medical reporting, Paper 2 addresses a fundamental and widely applicable challenge in modern AI: deploying large Vision-Language Models efficiently without sacrificing complex reasoning. Its structured pruning framework for multimodal Chain-of-Thought preservation has a broader potential impact across numerous domains and tasks, making it highly relevant to a larger segment of the AI research community.
Paper 2 addresses a highly timely and widely relevant problem: compressing large Vision-Language Models (VLMs) while preserving chain-of-thought reasoning capabilities. Given the explosion in VLM usage and the critical need for efficient deployment in real-world applications, a successful pruning method for these models has broader immediate impact across AI domains. In contrast, Paper 1, while demonstrating strong methodological rigor and advancing the state-of-the-art, focuses on classical planning, which has a comparatively narrower scope and audience in the current research landscape.
Paper 1 likely has higher impact due to a more general, timelier contribution: an inference-time, model-agnostic protocol for selective prediction and calibrated abstention—critical for safe deployment across domains. Framing via interactive proof theory is novel and broadly relevant (AI safety, reliability, HCI, evaluation). It also surfaces concrete failure modes and transfer across model families, increasing practical value. Paper 2 is valuable for efficient VLM deployment, but structured pruning is a more incremental line with narrower scope; gains may depend on specific architectures/benchmarks and pruning/eval choices (e.g., LLM-judge).
Paper 2 (MuCRASP) likely has higher scientific impact due to a more actionable contribution: a structured pruning method that preserves multimodal chain-of-thought reasoning, directly addressing deployment cost—an urgent bottleneck for VLM adoption. It proposes specific mechanisms (pivot-token awareness, modality-sensitive activation handling, global-budget layer sensitivity) and reports strong quantitative gains across multiple models/benchmarks, suggesting methodological rigor and reproducibility. Paper 1 is valuable diagnostically, but mainly reveals limitations and introduces evaluation tasks; its real-world impact depends on downstream fixes, while Paper 2 offers an immediately usable technique with broader practical relevance.
ImProver 2 addresses a fundamental challenge in formal mathematics—scalable proof optimization—with a novel neurosymbolic framework that demonstrates small models can match frontier models. This has broader impact across formal verification, mathematical AI, and software engineering. The finding that a 7B model outperforms much larger models through proper scaffolding and training has significant implications for efficient AI. MuCRASP, while technically sound, is more incremental—applying structured pruning to VLMs with CoT-awareness. Paper 2 opens a new research direction with lasting impact on formal reasoning and proof engineering.
Paper 1 addresses a critical and universally relevant bottleneck in modern AI: the high deployment costs of Vision-Language Models. By proposing a novel structured pruning method that preserves Chain-of-Thought reasoning, it offers immense potential for real-world applications across various domains requiring efficient multimodal AI. Paper 2, while methodologically sound and valuable for aerospace/defense, has a much narrower scope of application and breadth of impact.
MuCRASP addresses a fundamental challenge in efficiently deploying vision-language models while preserving chain-of-thought reasoning—a problem with broad applicability across the rapidly growing VLM field. It introduces novel concepts (pivot tokens, cross-modal pruning awareness) with rigorous evaluation across multiple models and benchmarks. Paper 1, while clinically useful, is a narrower application study using an existing language model on a specific rare disease, with limited sample size (30 patients) and incremental contribution. Paper 2's methodological innovation has broader impact potential across AI/ML and downstream applications.