Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability

Joël Roman Ky, Salah Ghamizi, Maxime Cordy

May 21, 2026

arXiv:2605.22168v1

cs.AI (primary), cs.LG

#547 of 2292 · Artificial Intelligence

Tournament Score: 1466 ± 49

Win Rate: 75%

Rating: 7.2 / 10

Significance 7.5 · Rigor 7.5 · Novelty 7 · Clarity 8

Abstract

Vision-Language Models (VLMs) map complex visual inputs to semantic spaces, but interpreting the cross-modal reasoning of VLMs currently relies on post-hoc explainers evaluated via unimodal perturbation metrics. We expose a limitation in this paradigm: because multimodal datasets contain language priors and modality biases, VLMs frequently exhibit cross-modal redundancy, allowing them to answer visual queries using text alone. Consequently, unimodal metrics penalize faithful explainers, triggering an evaluation collapse where visual and textual rankings fundamentally contradict each other (Kendall's $\tau = -0.06$). To resolve this, we introduce Synergistic Faithfulness ($\mathcal{F}_{syn}$), a scalable metric rooted in the Shapley Interaction Index that strictly isolates the joint Harsanyi dividend between modalities, serving as a highly accurate surrogate ($\rho = 0.92$) while achieving a $24\times$ computational speedup. An evaluation of 8 distinct XAI methods across 3 VLM architectures and 3 benchmark datasets reveals that explainers proposed for VLMs heavily over-index on visual salience and significantly underperform adapted attention-based methods in capturing true cross-modal synergy. By decoupling visual plausibility from cross-modal faithfulness, this work provides a rigorous evaluation framework required to safely audit VLM reasoning in high-stakes deployments.

AI Impact Assessments (1 model)

Scientific Impact Assessment: "Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability"

1. Core Contribution

This paper identifies a fundamental flaw in how explainability methods for Vision-Language Models (VLMs) are evaluated and proposes a principled solution. The core argument is that standard unimodal perturbation metrics (Insertion/Deletion AUC) fail in multimodal settings because of cross-modal redundancy: when VLMs can answer questions using one modality alone (due to language priors and dataset biases), perturbing a single modality yields uninformative faithfulness scores. The authors demonstrate this causes an "evaluation collapse" where visual and textual rankings of the same explainers fundamentally contradict each other (Kendall's τ = −0.06).
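The redundancy argument can be illustrated with a toy example. The sketch below is not taken from the paper: `redundant_model` is a hypothetical stand-in for a VLM that stays fully confident as long as either modality still carries its one informative feature, and the two "explainers" are simply fixed feature rankings. Under these assumptions, the visual deletion score is identical for a faithful and an unfaithful ranking whenever the text is left intact.

```python
def redundant_model(image_mask, text_mask):
    """Hypothetical fully redundant model: feature 0 of either modality suffices."""
    return 1.0 if (image_mask[0] or text_mask[0]) else 0.0

def visual_deletion_auc(ranking, n_features, text_mask):
    """Delete image features in ranked order with the text fixed.

    Returns the mean confidence over the deletion trajectory
    (lower means the ranking found the important features sooner).
    """
    mask = [True] * n_features
    scores = [redundant_model(mask, text_mask)]
    for idx in ranking:
        mask[idx] = False
        scores.append(redundant_model(mask, text_mask))
    return sum(scores) / len(scores)

N = 8
faithful = list(range(N))              # deletes the informative feature first
unfaithful = list(reversed(range(N)))  # deletes the informative feature last

text_on = [True] * N
text_off = [False] * N

# With text intact, the score cannot separate the two rankings.
auc_faithful_on = visual_deletion_auc(faithful, N, text_on)
auc_unfaithful_on = visual_deletion_auc(unfaithful, N, text_on)

# Only when the redundant modality is removed does the difference appear.
auc_faithful_off = visual_deletion_auc(faithful, N, text_off)
auc_unfaithful_off = visual_deletion_auc(unfaithful, N, text_off)
```

In this toy setting the unimodal score is uninformative precisely because the text alone carries the answer, which is the failure mode the review attributes to Insertion/Deletion AUC.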

The proposed solution, Synergistic Faithfulness (F_syn), is grounded in cooperative game theory. It approximates the Shapley Interaction Index by computing the 2-way Harsanyi dividend along continuous perturbation trajectories, isolating the pure joint contribution of visual and textual features. This reduces the intractable O(2^{m+n}) exact computation to O(K) forward passes (6K+2 per sample), achieving a 24× speedup over macro-coalitional exact SII while maintaining ρ = 0.92 Spearman correlation.
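For readers unfamiliar with the 2-way Harsanyi dividend, a minimal sketch follows. It treats the image and the text as two players and uses the standard two-player dividend v(I,T) − v(I) − v(T) + v(∅). Everything else is an assumption made for illustration: the value functions are toy stand-ins for a model's output, the trajectory reveals both modalities at the same rate, and the plain average over K points is not claimed to match the paper's normalization or its 6K+2 forward-pass accounting.

```python
def harsanyi_dividend(v, a, b):
    """Two-player dividend at reveal levels a (image) and b (text), both in [0, 1]."""
    return v(a, b) - v(a, 0.0) - v(0.0, b) + v(0.0, 0.0)

def mean_dividend(v, k_steps=11):
    """Average the dividend over k_steps evenly spaced points of a diagonal trajectory."""
    levels = [k / (k_steps - 1) for k in range(k_steps)]
    return sum(harsanyi_dividend(v, t, t) for t in levels) / k_steps

def additive(a, b):
    """Toy value function: each modality contributes independently."""
    return 0.5 * a + 0.5 * b

def synergistic(a, b):
    """Toy value function: output requires both modalities together."""
    return a * b

additive_score = mean_dividend(additive)
synergistic_score = mean_dividend(synergistic)
```

The additive toy model yields a dividend of zero at every step, while the multiplicative one yields a positive score, which is the sense in which the dividend isolates joint contribution from unimodal contribution.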

2. Methodological Rigor

The paper demonstrates strong methodological discipline across several dimensions:

Theoretical grounding: The metric is formally derived from the Harsanyi dividend structure of cooperative game theory. The authors clearly define the mathematical formulation distinguishing unimodal metrics µ_I, µ_T from the multimodal metric µ_{I×T}, and prove the boundary-condition failure of unimodal metrics under perfect cross-modal redundancy (Section 3.2).

Validation against ground truth: The authors construct a clever macro-coalitional game (8 players, 2^8 = 256 states) that permits exact, zero-variance SII computation for N=200 instances. The ρ = 0.92 correlation with this ground truth is convincing, and critically, the correlation is stable across different explainer types (Random, Input×Grad, Rollout, TAM), demonstrating explainer-agnostic validity.
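To make the cost of an exact computation concrete, the sketch below evaluates the textbook pairwise Shapley Interaction Index by brute-force enumeration on an 8-player game. It is a generic illustration, not the paper's construction: how image and text regions are mapped to macro-players, and how pairs are aggregated into a cross-modal score, are not described in this review, and the two value functions are toy games chosen so the answer is known in advance.

```python
from itertools import combinations
from math import factorial

def pairwise_sii(v, n, i, j):
    """Exact Shapley Interaction Index of players i and j in an n-player game.

    v maps a frozenset of player indices to a real value.
    """
    others = [p for p in range(n) if p not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Discrete second difference: joint effect beyond the two solo effects.
            delta = v(s | {i, j}) - v(s | {i}) - v(s | {j}) + v(s)
            total += weight * delta
    return total

def pair_game(s):
    """Toy game: pays 1 only when players 0 and 1 are both present."""
    return 1.0 if {0, 1} <= s else 0.0

def additive_game(s):
    """Toy game: every player adds 1 independently."""
    return float(len(s))

N_PLAYERS = 8
sii_pair = pairwise_sii(pair_game, N_PLAYERS, 0, 1)
sii_additive = pairwise_sii(additive_game, N_PLAYERS, 0, 1)
```

For n = 8 the loop visits the 64 subsets of the six remaining players and evaluates four coalitions for each, which covers all 2^8 = 256 states once. The count doubles with every added player, which is why exact computation is only feasible at the macro-coalition level.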

Statistical analysis: The use of Linear Mixed-Effects Models (LMM) to disentangle explainer performance from dataset difficulty and model architecture is sophisticated and appropriate. Treating explainer as a fixed effect while modeling dataset, VLM architecture, and instance as random effects addresses the nested/repeated-measures structure that simpler tests (ANOVA) would mishandle. The comprehensive reporting of β coefficients, standard errors, and p-values across datasets strengthens confidence.
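One plausible way to write the model structure described above is given below. The notation is an assumption of this sketch rather than the paper's: $e$ indexes the explainer, $d$ the dataset, $m$ the VLM architecture, and $i$ the instance, with the Random explainer taken as the reference level so that $\beta_{\text{Random}} = 0$.

```latex
\[
  \mathcal{F}_{syn}^{(e,d,m,i)}
    = \beta_0 + \beta_e + u_d + u_m + u_i + \varepsilon_{e,d,m,i},
\]
\[
  u_d \sim \mathcal{N}(0, \sigma_d^2), \quad
  u_m \sim \mathcal{N}(0, \sigma_m^2), \quad
  u_i \sim \mathcal{N}(0, \sigma_i^2), \quad
  \varepsilon_{e,d,m,i} \sim \mathcal{N}(0, \sigma^2).
\]
```

Under this reading, each reported $\beta_e$ is an explainer's average gain in $\mathcal{F}_{syn}$ over Random after dataset, architecture, and instance variation have been absorbed by the random intercepts.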

Benchmark scale: 8 explainers × 3 VLM architectures × 3 datasets, evaluated on complete dataset splits (~300 GPU hours), provides substantial empirical coverage.

However, several concerns emerge. The macro-coalitional ground truth, while cleverly constructed, involves arbitrary choices (C=6 background coalitions, coupled cross-modal partitioning) that could introduce aggregation bias. The authors acknowledge this trade-off but do not quantify sensitivity to C. Additionally, the choice of K=11 discrete steps for the Riemann approximation is relatively coarse, and a sensitivity analysis over K is absent.
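The concern about K can be illustrated with a small numerical experiment. The sketch below is not based on the paper's data: `sharp_curve` is a hypothetical perturbation curve with an abrupt transition, and the trapezoidal rule stands in for whichever quadrature the authors use. It only shows how one might check whether an area estimate has stabilized as K grows.

```python
import math

def sharp_curve(t):
    """Hypothetical model-confidence curve with a steep transition near t = 0.33."""
    return 1.0 / (1.0 + math.exp(-50.0 * (t - 0.33)))

def trapezoid_auc(curve, k_steps):
    """Trapezoidal estimate of the area under curve on [0, 1] using k_steps points."""
    h = 1.0 / (k_steps - 1)
    values = [curve(k * h) for k in range(k_steps)]
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))

# A very fine grid serves as the reference value.
reference = trapezoid_auc(sharp_curve, 10001)

# Absolute error of coarser grids relative to the reference.
errors = {k: abs(trapezoid_auc(sharp_curve, k) - reference) for k in (11, 21, 51, 101)}
```

Reporting a table like `errors` for the real perturbation curves, alongside the forward-pass cost of each K, would be one way to address the missing sensitivity analysis.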

3. Potential Impact

Immediate impact on XAI evaluation: The paper's most actionable contribution is demonstrating that the current evaluation paradigm for multimodal XAI is fundamentally broken. This could redirect the community away from unimodal perturbation metrics toward interaction-based evaluation, which is a meaningful paradigm shift.

Practical implications for VLM auditing: The finding that VLM-native explainers (LLaVA-CAM, TAM) over-index on visual salience while underperforming attention-based methods (AttnLRP, Rollout) at capturing true cross-modal synergy is practically important. It challenges the assumption that architecture-specific explainers are inherently superior and suggests that deployed audit systems may be producing misleading explanations.

Regulatory relevance: With the EU AI Act mandating explainability for high-risk AI systems, having metrics that distinguish visually plausible but unfaithful explanations from genuinely faithful ones has direct regulatory utility.

Limitations on breadth: The restriction to VQA-format tasks, mid-scale open-weight models (2B-7B), and binary/multiple-choice outputs limits immediate applicability to the broader VLM ecosystem (open-ended generation, proprietary models, video/audio modalities). The authors acknowledge this transparently.

4. Timeliness & Relevance

The paper is highly timely. VLMs are being rapidly deployed in high-stakes applications (medical imaging, autonomous driving, robotics), and the gap between deployment pace and evaluation rigor for explainability is widening. The multimodal XAI evaluation literature has not kept pace with the transition from dual-encoder models (CLIP, ViLT) to autoregressive VLMs, and this paper directly addresses that gap.

The identification of cross-modal redundancy as a systematic confound in evaluation is particularly relevant given mounting evidence of modality biases in VLM benchmarks (citations [10-15] in the paper). This connects the XAI evaluation problem to the broader VLM evaluation crisis.

5. Strengths & Limitations

Key Strengths:

Clean theoretical identification of the evaluation collapse problem with formal proof under boundary conditions

Game-theoretic foundation provides mathematical legitimacy beyond ad-hoc metric design

Strong empirical validation against exact SII ground truth

Proper use of LMM for statistical analysis, avoiding common benchmarking pitfalls

Counterintuitive finding (VLM-native < attention-based) adds genuine scientific value

Comprehensive appendix with reproducibility details

Notable Weaknesses:

The absolute F_syn scores are quite low across all methods (0.07-0.10 range) with high variance, raising questions about whether any current explainer captures meaningful cross-modal synergy

The gap between best and worst methods (β = 0.029 vs 0.008, compared to Random at 0) is statistically significant but practically small

No analysis of how F_syn behaves when the VLM is actually wrong — explainability of failures may have different dynamics

The restriction to VQA formats with fixed-vocabulary outputs (Yes/No, A/B/C/D) constrains generalizability

The paper doesn't explore whether F_syn could serve as a training signal for better explainers, limiting actionability beyond benchmarking

Sensitivity analysis to hyperparameters (K, blur radius, number of macro-coalitions) is insufficient

Overall Assessment

This is a well-executed paper that identifies a genuine and important problem in multimodal XAI evaluation, proposes a theoretically principled and computationally practical solution, and delivers surprising empirical findings. The main contribution — shifting from unimodal to synergistic evaluation of multimodal explanations — represents a meaningful conceptual advance. While the absolute effect sizes are modest and the scope is restricted to VQA, the paper establishes a foundation that should influence how the community evaluates and develops XAI methods for VLMs.

Rating: 7.2 / 10

Significance 7.5 · Rigor 7.5 · Novelty 7 · Clarity 8

Generated May 22, 2026

Comparison History (16)

vs. Latent-space Attacks for Refusal Evasion in Language Models

gemini-3.1 · 5/22/2026

Paper 2 addresses a critical and highly active area of AI safety: model alignment and jailbreaking. By providing a principled framework for latent-space attacks and demonstrating state-of-the-art success rates across 15 diverse models, it exposes severe vulnerabilities in current safety mechanisms. This will likely drive significant follow-up research in both mechanistic interpretability and robust alignment, giving it a broader and more urgent impact than Paper 1's focus on XAI evaluation metrics.

vs. AI-Enabled Serious Games: Integrating Intelligence and Adaptivity in Training Systems

gpt-5.2 · 5/22/2026

Paper 2 offers a more novel, technically specific contribution: it identifies a concrete failure mode in current VLM explainability evaluation (cross-modal redundancy causing metric collapse) and proposes a new, principled metric grounded in Shapley interactions with strong empirical validation (high correlation, large speedup) across models, datasets, and XAI methods. This yields immediate real-world relevance for auditing multimodal systems and is timely given widespread VLM deployment. Paper 1 is a broad synthesis/positioning chapter with important applications but less methodological novelty and weaker direct, testable contributions.

vs. Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support

gpt-5.2 · 5/22/2026

Paper 2 introduces a principled new metric (Synergistic Faithfulness) grounded in Shapley interaction to isolate true cross-modal contributions, addressing a clear failure mode (evaluation collapse) in VLM explainability. It offers strong methodological rigor (theory-backed metric, high surrogate correlation, large speedup, multi-model/multi-dataset evaluation) and broad relevance across multimodal ML, XAI, safety, and auditing. Paper 1 provides an important benchmark and negative result for clinical LLM interaction, but its impact is narrower (clinical decision support) and more incremental relative to existing interactive evaluation efforts.

vs. Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

claude-opus-4.6 · 5/22/2026

Gated DeltaNet-2 introduces a fundamental architectural improvement to linear attention mechanisms by decoupling erase and write operations, with strong empirical results across multiple benchmarks at 1.3B scale. This addresses a core limitation in efficient sequence modeling—a rapidly growing field with broad impact on LLM efficiency, long-context modeling, and inference cost. Paper 2 contributes a valuable evaluation metric for VLM explainability, but its scope is narrower (benchmarking/evaluation rather than model architecture), and its impact is primarily within the XAI subcommunity rather than the broader deep learning ecosystem.

vs. What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct

gpt-5.2 · 5/22/2026

Paper 1 offers a novel, technical evaluation metric (Synergistic Faithfulness) grounded in Shapley interactions to fix a concrete failure mode in VLM explainability evaluation, with strong quantitative evidence (high surrogate correlation, large speedup) and broad applicability to auditing multimodal systems in high-stakes settings. Its methodological contribution is directly actionable for benchmarking and improving XAI methods across architectures/datasets, likely influencing subsequent empirical work. Paper 2 is timely and useful for conceptual clarity and governance, but is primarily taxonomic/survey-based with less direct methodological leverage for model development and evaluation.

vs. ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

gemini-3.1 · 5/22/2026

Paper 2 addresses a fundamental flaw in current VLM explainability evaluation by introducing a rigorous metric based on game theory (Shapley Interaction Index). Benchmarks and evaluation frameworks that expose and correct evaluation collapse typically have broad, long-lasting impact across the field, especially for high-stakes applications requiring trustworthy AI. While Paper 1 offers a practical efficiency improvement for video MLLMs, Paper 2's methodological rigor and contribution to safe AI deployment give it a higher potential for foundational scientific impact.

vs. Beyond Rational Illusion: Behaviorally Realistic Strategic Classification

gemini-3.1 · 5/22/2026

Paper 1 addresses a highly relevant and timely challenge in the booming field of Vision-Language Models (VLMs). By introducing a computationally efficient and rigorous metric for cross-modal explainability, it directly impacts the safe deployment of modern AI systems. While Paper 2 offers an innovative interdisciplinary approach by integrating behavioral economics into strategic classification, Paper 1's methodological rigor and immediate applicability to high-stakes, state-of-the-art multimodal AI give it a higher potential for broad and immediate scientific impact.

vs. Beyond Rational Illusion: Behaviorally Realistic Strategic Classification

claude-opus-4.6 · 5/22/2026

Paper 2 addresses a fundamental evaluation gap in VLM explainability—a rapidly growing field with broad relevance to AI safety and trustworthiness. It introduces a principled metric (Synergistic Faithfulness) grounded in game theory (Shapley Interaction Index), exposes a concrete failure mode (evaluation collapse), and provides comprehensive empirical validation across multiple architectures and datasets. Its impact spans XAI, multimodal learning, and AI auditing. Paper 1 is a solid contribution bridging behavioral economics and ML, but its scope is narrower (strategic classification) and the practical adoption barrier is higher given the need to model specific cognitive biases.

vs. Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models

claude-opus-4.6 · 5/22/2026

Paper 1 introduces a novel evaluation framework (Synergistic Faithfulness) addressing a fundamental methodological gap in VLM explainability—cross-modal redundancy causing evaluation collapse. It provides rigorous theoretical grounding (Shapley Interaction Index), comprehensive benchmarking (8 methods, 3 architectures, 3 datasets), and practical utility (24× speedup). Paper 2 is a case study on a single speech with limited generalizability (n=51 segments, one speaker). Paper 1's broader applicability to XAI evaluation, methodological rigor, and relevance to high-stakes AI auditing give it substantially higher impact potential across multiple research communities.

vs. Scaling Observation-aware Planning in Uncertain Domains

gpt-5.2 · 5/22/2026

Paper 2 likely has higher impact due to strong novelty in identifying an evaluation failure mode for VLM explainability and proposing a principled, scalable cross-modal synergy metric with strong empirical validation (high correlation, major speedup) across multiple models/datasets/methods. Its applications (auditing multimodal reasoning, safety in high-stakes deployments) are broad and timely given rapid VLM adoption. Paper 1 offers substantial performance gains in a specialized POMDP observability/sensor selection niche, but its breadth and immediate relevance across fields are narrower than the VLM XAI benchmarking contribution.

vs. MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

gemini-3.1 · 5/22/2026

Paper 1 introduces a paradigm-shifting concept of autonomous agents rewriting their own source code, enabling self-evolution beyond static text artifacts. This Turing-complete adaptation offers highly novel capabilities and broader potential impact for autonomous AI systems compared to Paper 2, which, while methodologically rigorous and important for AI safety, represents a more domain-specific evaluation metric improvement.

vs. WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

claude-opus-4.6 · 5/22/2026

Paper 1 introduces a novel theoretical framework (Synergistic Faithfulness metric) addressing a fundamental limitation in VLM explainability evaluation. It identifies a previously unrecognized 'evaluation collapse' problem, proposes a principled solution grounded in game-theoretic concepts (Shapley Interaction Index), and demonstrates broad applicability across multiple architectures and methods. This has significant implications for AI safety and trustworthiness in high-stakes domains. Paper 2, while practically useful, is primarily an empirical benchmark for a specific application (finance spreadsheets) with more limited methodological novelty and narrower scientific scope.

vs. Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a fundamental and broadly applicable problem in guided generation across diffusion/flow models, proposing a principled solution (conflict-aware gradient correction) validated across diverse domains (images, planning, control). Its practical utility is high given the widespread adoption of diffusion models. Paper 2 introduces a valuable evaluation metric for VLM explainability, but its impact is more niche—focused on XAI evaluation methodology rather than enabling new capabilities. Paper 1's cross-domain applicability, actionable method, and alignment with the rapidly growing generative modeling field give it broader potential impact.

vs. Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

gemini-3.1 · 5/22/2026

Paper 2 exposes a fundamental flaw in current VLM explainability evaluation and introduces a theoretically grounded, scalable metric to resolve it. Because reliable benchmarks and evaluation frameworks often redirect community efforts and are essential for high-stakes AI safety, Paper 2 is likely to have a broader foundational impact across the rapidly growing field of multimodal models compared to the algorithmic improvements in generative guidance proposed in Paper 1.

vs. Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

claude-opus-4.6 · 5/22/2026

Paper 2 introduces a novel evaluation framework (Synergistic Faithfulness) addressing a fundamental gap in VLM explainability—cross-modal reasoning assessment. It identifies a critical limitation (evaluation collapse) in existing paradigms, proposes a theoretically grounded metric rooted in Shapley Interaction Index, and provides comprehensive evaluation across multiple architectures and datasets. Its broader applicability to high-stakes VLM auditing, methodological rigor, and relevance to the rapidly growing VLM field give it higher impact potential. Paper 1, while interesting, offers incremental insights on sycophancy mitigation with a narrower scope.

vs. Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

gpt-5.2 · 5/22/2026

Paper 1 introduces a new, principled evaluation metric (Synergistic Faithfulness) addressing a fundamental flaw in current VLM explainability evaluation (cross-modal redundancy causing metric collapse). It is methodologically grounded (Shapley/Harsanyi interaction), demonstrates strong surrogate validity and large compute gains, and is evaluated broadly (multiple explainers, architectures, datasets). Its impact spans VLM auditing, XAI methodology, and safety-critical deployment practices—highly timely given rapid VLM adoption. Paper 2 is useful and relevant to LLM safety/steering, but is more incremental (repurposing persona vectors) with narrower generality and fewer systems tested.