MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

Aritra Dutta, Somak Aditya

May 25, 2026

arXiv:2605.25842v1

cs.AI (primary), cs.CL

#1161 of 2682 · Artificial Intelligence

Tournament Score: 1425±41 (scale shown: 1050 to 1800)

Win Rate: 52%

Rating: 6.8/10

Significance 7 · Rigor 7 · Novelty 7.5 · Clarity 5.5

Abstract

Vision-language models (VLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex multimodal tasks, but their large parameter sizes make deployment expensive. Structured pruning offers a natural solution; however, existing methods fail to preserve CoT reasoning accuracy in VLMs. We identify two key reasons: (1) CoT consistency depends on sparse transition points (pivot tokens) in the generation trajectory, while existing pruning methods are CoT-agnostic; and (2) pruning methods designed for unimodal LLMs do not account for activation-distribution differences across visual and textual modalities. Motivated by these observations, we propose MuCRASP, a structured pruning framework that targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget. Experiments on four VLMs across three reasoning benchmarks show that MuCRASP consistently preserves reasoning quality under increasing compression. At 30% pruning on Qwen2.5-VL-7B, MuCRASP achieves an LLM-as-a-Judge score of 8.87 versus 7.32 for the strongest baseline on physical reasoning tasks. Furthermore, MuCRASP maintains high reasoning consistency up to 50% pruning, significantly outperforming prior pruning approaches while exhibiting lower perplexity degradation.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: MuCRASP

1. Core Contribution

MuCRASP addresses a genuinely underexplored problem: how to structurally prune Vision-Language Models while preserving chain-of-thought reasoning quality. The paper identifies two concrete failure modes of existing pruning methods applied to VLMs: (1) CoT reasoning depends on sparse "pivot tokens" at reasoning-step boundaries whose importance is diluted under uniform token aggregation, and (2) existing LLM pruning methods ignore cross-modal activation distribution differences between vision and language components.

The proposed framework combines four components: global Taylor-based attribution, trajectory pivot attribution restricted to reasoning transition windows, a Cross-Modal Dependency Score (CMDS) based on MMD to identify vision-language integration layers, and a global knapsack allocation that handles the extreme heterogeneity of structural unit sizes in VLMs (~250× cost difference between GQA groups and MLP neurons). This is a training-free method, which is practically valuable.
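To make the budget-allocation idea concrete, here is a minimal sketch of selecting units to prune under a global parameter budget when unit sizes differ widely. It assumes a simple greedy rule (lowest importance per parameter first); the paper's actual solver is not described in this assessment, and the `Unit` type, field names, and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str          # hypothetical identifier for a prunable structure
    importance: float  # attribution score; higher means more important
    cost: int          # parameters removed if this unit is pruned


def select_units_to_prune(units, prune_budget):
    """Greedily choose units to prune until prune_budget parameters are removed.

    Assumption: units are ranked by importance per parameter, so a large
    unit is pruned only when its importance is small relative to its size.
    Units that would overshoot the budget are skipped.
    """
    ranked = sorted(units, key=lambda u: u.importance / u.cost)
    pruned, removed = [], 0
    for unit in ranked:
        if removed >= prune_budget:
            break
        if removed + unit.cost <= prune_budget:
            pruned.append(unit)
            removed += unit.cost
    return pruned, removed


# Toy example with a 250x cost gap between one large group and small neurons.
units = [
    Unit("gqa0", 5.0, 2500),
    Unit("n0", 0.01, 10),
    Unit("n1", 0.05, 10),
    Unit("n2", 0.5, 10),
]
pruned, removed = select_units_to_prune(units, prune_budget=2520)
print([u.name for u in pruned], removed)
```

Normalizing by cost is the design choice that matters here: ranking by raw importance alone would compare a whole attention group against a single neuron as if they were the same size.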

2. Methodological Rigor

Strengths in experimental design:

Evaluation across 5 VLMs (2B–11B parameters), 3 architectural families, and 3 reasoning domains provides strong evidence of generalizability.

The introduction of LLM-as-Judge (LLM-J) as a primary metric is well-motivated—the paper convincingly argues that perplexity and exact match are insufficient for evaluating reasoning preservation. The demonstration that LLM-J and EM_a follow visibly different degradation trajectories under MuCRASP is a notable finding.

The ablation study (Table A) systematically removes each component, demonstrating that all four are necessary. The random pivot control experiment (Table B) is particularly well-designed, confirming that correct pivot identification matters.

The sliding window MLP zero-out experiment (Figure 3b) provides causal evidence that CMDS correctly identifies cross-modal bottleneck layers.

KL divergence analysis (Figure 4) provides distributional-level evidence beyond surface metrics.
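As a rough illustration of what a distribution-level comparison of this kind measures, the sketch below computes the mean per-position KL divergence between the next-token distributions of a dense and a pruned model. The direction KL(dense || pruned), the averaging, and the function names are assumptions; the assessment does not say how Figure 4 is computed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions on the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_token_kl(dense_logits, pruned_logits):
    """Average KL(dense || pruned) over aligned sequence positions."""
    kls = [
        kl_divergence(softmax(d), softmax(p))
        for d, p in zip(dense_logits, pruned_logits)
    ]
    return sum(kls) / len(kls)


# Identical logits give zero divergence; shifted logits give a positive value.
same = mean_token_kl([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]])
shifted = mean_token_kl([[0.0, 0.0]], [[0.0, 2.0]])
print(same, shifted)
```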

Weaknesses:

The pivot token detection relies on heuristic pattern matching (structural delimiters and logical connectives). While the paper argues robustness through windowing and random-pivot ablations, this approach may not generalize to less structured CoT formats or non-English languages.

The CMDS formulation uses a linear kernel MMD, which is the simplest possible instantiation. The paper doesn't explore whether richer kernels would improve layer identification.

The calibration set uses GPT-4o-generated synthetic CoT traces, introducing a dependency on a proprietary model. The paper acknowledges this but doesn't explore alternatives.

Hyperparameters (γ_base, ρ, α, β, window W) are manually tuned. While ablations show reasonable robustness, the interaction effects among these parameters are not explored.

No post-pruning fine-tuning or recovery is applied, which the authors acknowledge. While this makes the method more practical, it also means potential gains from recovery methods remain unknown.
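The remark above that a linear-kernel MMD is the simplest instantiation can be seen in a few lines: with k(x, y) = <x, y>, the biased MMD² estimate reduces to the squared distance between the two sample means. The sketch below is only that textbook quantity; how the paper turns it into CMDS is not given in this assessment, and the function names are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def linear_mmd_squared(visual_acts, text_acts):
    """Biased MMD^2 estimate with a linear kernel k(x, y) = <x, y>.

    With this kernel the estimate equals the squared Euclidean distance
    between the mean visual activation and the mean text activation.
    """
    mu_v = mean_vector(visual_acts)
    mu_t = mean_vector(text_acts)
    return sum((a - b) ** 2 for a, b in zip(mu_v, mu_t))


# Different means: the statistic is positive.
print(linear_mmd_squared([[1.0, 0.0], [3.0, 0.0]], [[0.0, 1.0], [0.0, 3.0]]))

# Same mean, very different spread: the statistic is zero.
print(linear_mmd_squared([[-1.0], [1.0]], [[-5.0], [5.0]]))
```

The second call shows the limitation: a linear kernel cannot see differences in variance or higher moments, which is one reason a richer kernel might identify layers differently.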

3. Potential Impact

Practical applications: Enabling deployment of reasoning-capable VLMs on resource-constrained hardware is directly relevant for edge deployment, mobile applications, and reducing inference costs. The training-free nature makes the approach accessible.

Methodological contributions: The observation that reasoning coherence and answer-extraction accuracy degrade along fundamentally different trajectories is an important insight that could influence how compressed generative models are evaluated more broadly. The CMDS concept for identifying cross-modal bottleneck layers could be adopted in other VLM optimization work beyond pruning.

Influence on evaluation practices: The paper makes a compelling case that perplexity is a misleading proxy for reasoning quality (e.g., Qwen2.5-VL-7B at 30%: Attribution Pruning achieves lower PPL=1.59 than MuCRASP's 1.74, yet its LLM-J collapses to 1.33 vs 8.87). This could shift evaluation norms in model compression research.

4. Timeliness & Relevance

The paper addresses a genuine bottleneck: VLMs with CoT reasoning are increasingly deployed but computationally expensive. The intersection of structured pruning, multimodal reasoning, and CoT preservation is timely and virtually unexplored—the paper convincingly argues that no prior work studies structured pruning of VLMs with explicit CoT preservation objectives.

5. Strengths & Limitations

Key strengths:

Novel and well-motivated problem formulation at the intersection of pruning, multimodal learning, and reasoning

Strong empirical results: at 30% pruning, MuCRASP achieves LLM-J 8.87 vs 7.32 for the best baseline on Qwen2.5-VL-7B Physical reasoning

The method sustains performance up to 50% pruning where baselines collapse (mean LLM-J 3.90 vs 1.56)

Comprehensive evaluation with multiple complementary metrics and extensive ablations

The decoupling insight (reasoning coherence preserved while verbatim precision degrades) is scientifically interesting

Notable weaknesses:

The paper is extremely long with crucial algorithmic details relegated to the appendix, making the main paper harder to follow

The reliance on GPT-4o for calibration data and GPT-3.5 for evaluation introduces dependencies on proprietary models

No comparison with knowledge distillation or quantization approaches that could serve as alternative compression strategies

Actual inference speedup numbers are never reported—only parameter counts are discussed

The pivot detection heuristic, while functional, feels ad-hoc compared to the principled CMDS formulation

Limited to English; the heuristic pivot detection is language-dependent
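The language dependence noted in the points above is easy to see in a toy version of pattern-based pivot detection. This is a sketch under stated assumptions: the connective list, the delimiter pattern, whitespace tokenization, and the symmetric window are all hypothetical stand-ins, since the paper's actual patterns are not reproduced in this assessment.

```python
import re

# Hypothetical English-only logical connectives.
CONNECTIVES = {"therefore", "thus", "hence", "because", "however", "so", "then"}

# Hypothetical structural delimiters: the word "Step" or markers like "1:", "2.", "3)".
DELIMITER = re.compile(r"^(?:step|\d+[.):])$", re.IGNORECASE)


def find_pivots(tokens):
    """Return indices of tokens that look like reasoning-step boundaries."""
    pivots = []
    for i, tok in enumerate(tokens):
        word = tok.strip(",.:;").lower()
        if word in CONNECTIVES or DELIMITER.match(tok):
            pivots.append(i)
    return pivots


def pivot_window_mask(num_tokens, pivots, window):
    """Mark every position within `window` tokens of any pivot."""
    mask = [False] * num_tokens
    for p in pivots:
        for j in range(max(0, p - window), min(num_tokens, p + window + 1)):
            mask[j] = True
    return mask


english = "Step 1: the ball is heavier. Therefore, it falls faster.".split()
pivots = find_pivots(english)
print(pivots, pivot_window_mask(len(english), pivots, window=1))

# A German sentence with the same logical structure yields no pivots.
german = "Daher fällt er schneller.".split()
print(find_pivots(german))
```

Attribution restricted to the masked positions would then ignore the German trace entirely, which is the failure mode the weakness describes.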

Missing comparisons: No comparison with recent VLM-specific compression methods beyond ECoFLaP and TAMP, and no comparison with quantization approaches that are the dominant practical compression technique.

Rating: 6.8/10

Significance 7 · Rigor 7 · Novelty 7.5 · Clarity 5.5

Generated May 26, 2026

Comparison History (21)

vs. You Live More Than Once: Towards Hierarchical Skill Meta-Evolving

gpt-5.2 · 5/28/2026

Paper 2 (HiSME) likely has higher scientific impact due to broader applicability and timeliness: hierarchical meta-evolving of skills and the evolving strategy targets a central bottleneck in deployed agentic systems (continual, test-time improvement) and can generalize across many tasks, domains, and LLM backends without costly parameter updates. This paradigm could influence agent design, lifelong learning, and meta-learning communities. Paper 1 is methodologically solid and useful for efficient VLM deployment, but its contribution is more specialized (structured pruning for CoT in VLMs) with narrower cross-field reach.

vs. Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning

gemini-3.1 · 5/28/2026

Paper 1 addresses a critical bottleneck in deploying Vision-Language Models (VLMs) by introducing a highly novel structured pruning method that preserves Chain-of-Thought reasoning. Its methodological rigor in identifying 'pivot tokens' and addressing cross-modal activation differences provides deep insights into VLM internals. While Paper 2 offers an interesting agentic framework for visual reasoning, Paper 1's approach enables significant real-world applications by reducing computational costs without sacrificing complex reasoning capabilities, likely driving broader adoption and follow-up research in model efficiency.

vs. The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

gemini-3.1 · 5/28/2026

Paper 2 addresses a fundamental and highly debated question in AI (whether LLMs genuinely reason) by rigorously critiquing a high-profile benchmark. By exposing statistical flaws and dataset artifacts, it corrects the scientific record and promotes better evaluation methodologies. While Paper 1 offers a useful, practical pruning tool for VLMs, Paper 2's broader theoretical implications and critical methodological corrections will likely have a wider, paradigm-shifting impact across the AI research community.

vs. Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

claude-opus-4.6 · 5/27/2026

MuCRASP addresses a fundamental and broadly applicable problem—preserving chain-of-thought reasoning during structured pruning of VLMs. Its novelty lies in identifying pivot tokens and cross-modal activation differences, providing a principled pruning framework with strong empirical results across multiple models and benchmarks. This has wide applicability to model compression across the VLM community. Paper 2, while valuable for its dataset and benchmark contributions to mobile GUI navigation, addresses a narrower application domain (Chinese mobile apps) and is more incremental in its technical contributions (scaling analysis, benchmarking toolkit).

vs. Natural Language Query to Configuration for Retrieval Agents

gpt-5.2 · 5/27/2026

Paper 2 has higher likely scientific impact due to broader applicability and timeliness: per-query optimization of retrieval-agent configurations addresses a widespread, practical bottleneck in deploying RAG systems across domains, with immediate cost/latency implications. The approach is modular (works over a pipeline catalog), aligns with current industry trends toward agentic systems, and can influence both systems and ML communities. Paper 1 is novel within VLM compression and CoT-aware pruning, but its impact is narrower (structured pruning for specific VLM classes) and less directly transferable beyond multimodal model compression.

vs. The Compressive Knowledge Graph Hypothesis: Which Graph Facts Matter for Scientific Hypothesis Generation?

gpt-5.2 · 5/27/2026

Paper 2 likely has higher scientific impact due to a clearer, broadly applicable technical contribution with immediate deployment value: a structured pruning method tailored to multimodal chain-of-thought reasoning, validated across multiple VLMs and benchmarks with sizable gains at high pruning rates. Its relevance is timely given widespread VLM deployment constraints and the community focus on efficient inference. Paper 1 is novel and insightful for KG-guided scientific hypothesis generation, but its impact may be narrower (battery KG setting) and more diagnostic/empirical than enabling a widely reusable algorithmic advance.

vs. Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry

claude-opus-4.6 · 5/27/2026

MuCRASP addresses a fundamental and broadly applicable problem—efficiently compressing vision-language models while preserving chain-of-thought reasoning. It introduces novel concepts (pivot tokens, cross-modal pruning sensitivity) applicable across many VLMs and reasoning tasks, with rigorous experiments on multiple models and benchmarks. Paper 2, while valuable, is a domain-specific application (steel industry VOCs) combining existing techniques (KGs, RAG, multi-agent systems) with narrower impact scope. Paper 1's contributions to model compression methodology have broader relevance to the rapidly growing VLM community.

vs. Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact due to stronger real-world applicability and broader relevance: it addresses deployment-critical efficiency for large multimodal VLMs while explicitly preserving CoT reasoning, a central capability in current systems. Its contributions (pivot-token awareness, cross-modal activation considerations, global budgeted structured pruning) are timely and useful across many VLM deployments and tasks. Paper 1 is novel in combining mechanistic interpretability with unfaithfulness detection, but its immediate applications are narrower and depend on circuit-tracing assumptions and benchmarks.

vs. Agentic Proving for Program Verification

claude-opus-4.6 · 5/26/2026

Paper 1 demonstrates that agentic AI systems can achieve near-complete success on program verification benchmarks, fundamentally challenging the field's evaluation methodology and establishing a new paradigm (compiler-in-the-loop agentic proving) for formal verification. This has broad implications for software engineering, formal methods, and AI safety. Paper 2, while technically solid, presents an incremental improvement in model compression for VLMs—a narrower contribution in a crowded space. Paper 1's finding that benchmarks are already saturated by current AI capabilities is a significant wake-up call with wider reverberations across multiple research communities.

vs. Representation Without Control: Testing the Realization Effect in Language Models

gemini-3.1 · 5/26/2026

Paper 1 offers higher potential impact due to its immediate real-world applicability in deploying resource-intensive Vision-Language Models. By introducing a structured pruning method that preserves Chain-of-Thought reasoning, it solves a major bottleneck in AI efficiency. While Paper 2 provides valuable theoretical insights into LLM interpretability and behavioral simulation, Paper 1 addresses a more pressing, widespread engineering challenge. The ability to compress state-of-the-art VLMs by up to 30-50% without losing reasoning consistency will directly benefit researchers and industry practitioners, ensuring broader and more immediate technological adoption.

vs. HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

gpt-5.2 · 5/26/2026

Paper 1 is more novel and broadly impactful: it introduces a new geometric framing (hyperbolic guidance) for multi-step reasoning that could generalize across LLM architectures, tasks, and even search/verification methods. If robust, it offers a lightweight, efficient alternative to expensive tree-search while improving deeper reasoning, a timely core problem. Paper 2 is strong and practical for deploying VLMs, but structured pruning is a more incremental area and its impact is narrower (primarily compression of multimodal CoT) and potentially sensitive to model/task specifics and evaluation via LLM-judge.

vs. Dynamics of collective creativity in AI art competitions

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a novel interdisciplinary question about collective creativity dynamics in human-AI systems, leveraging a large-scale naturalistic dataset to reveal fundamental mechanisms of cultural evolution. Its findings about attractor states, the paradox between novelty appreciation and remix preferences, and group-size effects have broad implications across cultural evolution, computational social science, creativity research, and AI-assisted design. Paper 2, while technically solid, represents an incremental improvement in model compression—a crowded subfield with rapid turnover. Paper 1's unique dataset, cross-disciplinary relevance, and insights into human-AI co-creation give it broader and more lasting impact.

vs. Understanding and Mitigating Premature Confidence for Better LLM Reasoning

gpt-5.2 · 5/26/2026

Paper 1 is more novel and broadly impactful: it introduces “premature confidence” as a scalable, label-free signal for reasoning failures and proposes a general RL objective (progressive confidence shaping) that improves reasoning quality and faithfulness across multiple tasks and model sizes. This targets a central, timely limitation of LLM test-time compute and CoT reliability, with implications for alignment and safety. Paper 2 is valuable and applicable for efficient multimodal deployment, but pruning methods are more incremental and narrower in scope, with impact mainly in VLM compression rather than core reasoning improvements.

vs. AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation

gemini-3.1 · 5/26/2026

While Paper 1 presents an innovative application of masked diffusion for medical reporting, Paper 2 addresses a fundamental and widely applicable challenge in modern AI: deploying large Vision-Language Models efficiently without sacrificing complex reasoning. Its structured pruning framework for multimodal Chain-of-Thought preservation has a broader potential impact across numerous domains and tasks, making it highly relevant to a larger segment of the AI research community.

vs. Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning

gemini-3.1 · 5/26/2026

Paper 2 addresses a highly timely and widely relevant problem: compressing large Vision-Language Models (VLMs) while preserving chain-of-thought reasoning capabilities. Given the explosion in VLM usage and the critical need for efficient deployment in real-world applications, a successful pruning method for these models has broader immediate impact across AI domains. In contrast, Paper 1, while demonstrating strong methodological rigor and advancing the state-of-the-art, focuses on classical planning, which has a comparatively narrower scope and audience in the current research landscape.

vs. Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction

gpt-5.2 · 5/26/2026

Paper 1 likely has higher impact due to a more general, timelier contribution: an inference-time, model-agnostic protocol for selective prediction and calibrated abstention—critical for safe deployment across domains. Framing via interactive proof theory is novel and broadly relevant (AI safety, reliability, HCI, evaluation). It also surfaces concrete failure modes and transfer across model families, increasing practical value. Paper 2 is valuable for efficient VLM deployment, but structured pruning is a more incremental line with narrower scope; gains may depend on specific architectures/benchmarks and pruning/eval choices (e.g., LLM-judge).

vs. SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

gpt-5.2 · 5/26/2026

Paper 2 (MuCRASP) likely has higher scientific impact due to a more actionable contribution: a structured pruning method that preserves multimodal chain-of-thought reasoning, directly addressing deployment cost—an urgent bottleneck for VLM adoption. It proposes specific mechanisms (pivot-token awareness, modality-sensitive activation handling, global-budget layer sensitivity) and reports strong quantitative gains across multiple models/benchmarks, suggesting methodological rigor and reproducibility. Paper 1 is valuable diagnostically, but mainly reveals limitations and introduces evaluation tasks; its real-world impact depends on downstream fixes, while Paper 2 offers an immediately usable technique with broader practical relevance.

vs. ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

claude-opus-4.6 · 5/26/2026

ImProver 2 addresses a fundamental challenge in formal mathematics, scalable proof optimization, with a novel neurosymbolic framework that demonstrates small models can match frontier models. This has broader impact across formal verification, mathematical AI, and software engineering. The finding that a 7B model outperforms much larger models through proper scaffolding and training has significant implications for efficient AI. MuCRASP, while technically sound, is more incremental: it applies structured pruning to VLMs with CoT-awareness. ImProver 2 opens a new research direction with lasting impact on formal reasoning and proof engineering.

vs. Evolutionary Enhanced Multi-Agent Reinforcement Learning for Cooperative Air Combat

gemini-3.1 · 5/26/2026

Paper 1 addresses a critical and universally relevant bottleneck in modern AI: the high deployment costs of Vision-Language Models. By proposing a novel structured pruning method that preserves Chain-of-Thought reasoning, it offers immense potential for real-world applications across various domains requiring efficient multimodal AI. Paper 2, while methodologically sound and valuable for aerospace/defense, has a much narrower scope of application and breadth of impact.

vs. Privacy-Preserving Local Language Models for Longitudinal Data Retrieval in Chronic Dermatologic Disease: Implementation in Pemphigus Patients

claude-opus-4.6 · 5/26/2026

MuCRASP addresses a fundamental challenge in efficiently deploying vision-language models while preserving chain-of-thought reasoning—a problem with broad applicability across the rapidly growing VLM field. It introduces novel concepts (pivot tokens, cross-modal pruning awareness) with rigorous evaluation across multiple models and benchmarks. Paper 1, while clinically useful, is a narrower application study using an existing language model on a specific rare disease, with limited sample size (30 patients) and incremental contribution. Paper 2's methodological innovation has broader impact potential across AI/ML and downstream applications.