Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination

Jinrui Jiang, Zhangtai Wu, Zhen Wu, Xinyu Dai

May 19, 2026

arXiv:2605.19250v1

cs.AI (primary)

#464 of 2292 · Artificial Intelligence

Tournament Score: 1475 ± 44 (scale 1050 to 1800)

Win Rate: 67%

Rating: 6.5 / 10

Significance 6.5 · Rigor 7 · Novelty 6.5 · Clarity 7.5

Abstract

Modality-conflict hallucination occurs when multimodal large language models (MLLMs) prioritize erroneous textual premises over contradictory visual evidence. To understand why visual evidence fails to prevail during generation, we take a mechanistic perspective and examine which internal components drive or resist this failure. We perform head-level causal analysis using path patching across five open-source MLLMs and identify two groups of attention heads with opposing causal roles: hallucination-driving heads and hallucination-resisting heads. We find a consistent asymmetry: driving effects are more broadly distributed and carry greater aggregate weight, whereas resisting effects concentrate in a small number of high-importance heads. Ablation experiments further confirm that these groups exert opposing effects during generation: distributed driving influence and localized resistance together form an imbalanced routing structure that biases generation toward the erroneous premise. Motivated by this finding, we propose MACI (Modality-conflict-Aware Causal Intervention), a conditional intervention that suppresses causally identified hallucination-driving heads only when conflict is detected. Across five MLLMs, MACI achieves the largest hallucination reduction among compared inference-time baselines on the MMMC benchmark with a favorable hallucination-accuracy trade-off, and transfers zero-shot to the SCI-SemanticConflict test.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper addresses a specific and well-defined failure mode of multimodal large language models (MLLMs): modality-conflict hallucination, where models follow erroneous textual premises rather than contradictory visual evidence. The core contribution is twofold: (1) a mechanistic, head-level causal analysis using path patching that reveals an asymmetric internal structure—hallucination-driving heads are broadly distributed while hallucination-resisting heads are concentrated in a few high-importance positions—and (2) MACI, a conditional inference-time intervention that leverages this asymmetry to selectively suppress driving heads when conflict is detected.

The identification of two functionally opposing groups of attention heads with a consistent structural asymmetry across five architecturally diverse MLLMs is a genuinely informative finding. It moves beyond prior work (e.g., Nguyen et al.) that showed conflict signals are linearly decodable but did not attribute causal roles to specific components.

Methodological Rigor

The methodology is generally sound, with several notable strengths:

Path patching design: The use of paired conflict/clean inputs sharing the same image provides a well-controlled counterfactual. The importance score formulation (Eq. 2) is clean and interpretable, and the sign-based separation into driving/resisting groups is principled.

Validation controls: The inclusion of random-head ablation as a size-matched control is important and demonstrates that hallucination reduction is not simply an artifact of removing capacity. The five-condition ablation (Base, Prune-D, Prune-R, Prune-Both, Prune-Random) across five models provides convincing evidence that the identified heads have genuine opposing causal roles.

Cross-type and cross-benchmark generalization: Testing object-identified driving heads on attribute/relation conflicts and SCI-SemanticConflict strengthens the claim that these heads capture a broader premise-following tendency.

Potential concerns: The prototype set of 256 samples for head identification is relatively small, though the reported split-half overlap (79.6%/64.7% for driving/resisting) provides some stability evidence. The choice of zero ablation over mean activation replacement is pragmatically motivated but may introduce distributional artifacts. The varying k+ values across models (30-64) and the sensitivity analysis, while honest, suggest that the method requires model-specific tuning. The reliance on single-token hallucinated/factual answers for the causal analysis limits the scope to cases where answers can be cleanly compared at the token level.
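The sign-based grouping and the split-half stability check discussed above can be illustrated with a minimal sketch. It is not the paper's implementation: the sign convention (positive means driving), the overlap definition (shared top-k heads divided by k), the helper names, and the toy scores are all assumptions made for illustration.

```python
def split_by_sign(scores):
    """Split {head: signed importance} into two {head: magnitude} groups.

    Assumption: positive scores mark hallucination-driving heads and
    negative scores mark hallucination-resisting heads.
    """
    driving = {h: s for h, s in scores.items() if s > 0}
    resisting = {h: -s for h, s in scores.items() if s < 0}
    return driving, resisting


def top_k(group, k):
    """Return the k heads with the largest magnitude as a set."""
    return set(sorted(group, key=group.get, reverse=True)[:k])


def split_half_overlap(scores_a, scores_b, k):
    """Fraction of top-k heads shared by two disjoint halves, per group."""
    drive_a, resist_a = split_by_sign(scores_a)
    drive_b, resist_b = split_by_sign(scores_b)
    shared_drive = top_k(drive_a, k) & top_k(drive_b, k)
    shared_resist = top_k(resist_a, k) & top_k(resist_b, k)
    return len(shared_drive) / k, len(shared_resist) / k


# Toy scores keyed by (layer, head); the values are made up for illustration.
half_a = {(0, 0): 0.9, (0, 1): 0.5, (1, 0): 0.2, (1, 1): -0.8, (2, 0): -0.1}
half_b = {(0, 0): 0.7, (0, 1): 0.1, (1, 0): 0.4, (1, 1): -0.6, (2, 0): 0.05}

drive_overlap, resist_overlap = split_half_overlap(half_a, half_b, k=2)
print(drive_overlap, resist_overlap)
```

Under this definition, a head whose sign flips between halves (such as `(2, 0)` in the toy data) counts against the overlap of the group it left, which is one plausible reason a smaller group could show lower stability.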

Potential Impact

Mechanistic understanding: The distributed-driving/concentrated-resisting asymmetry is an interpretable structural insight that could inform future architectural designs or training procedures for MLLMs. The observation that resisting heads carry disproportionate per-head importance despite smaller aggregate weight is particularly interesting—it suggests that the model does develop visual-fidelity mechanisms, but they are outnumbered.

Practical mitigation: MACI demonstrates that mechanistic insights can be translated into practical interventions. The conditional nature of the intervention (only activating when conflict is detected) is an important design choice that preserves non-conflict performance. The zero-shot transfer to SCI-SemanticConflict is encouraging for practical deployability.

Broader influence: The approach could be extended to other types of model failures (e.g., context-parametric conflict, visual illusions) and could inspire similar causal analyses in other multimodal settings. The framework of identifying opposing component groups and exploiting their asymmetry for targeted intervention is generalizable.
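The conditional design noted under practical mitigation, detect conflict first and suppress driving heads only when it fires, can be sketched as follows. This is a hedged illustration rather than MACI itself: the probe features, weights, threshold, and the dictionary representation of per-head outputs are hypothetical, and zeroing the head output stands in for whatever suppression the paper applies.

```python
import math


def conflict_probability(features, weights, bias):
    """Logistic probe over detection features (assumed to come from resisting heads)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def conditional_suppress(head_outputs, driving_heads, p_conflict,
                         threshold=0.5, scale=0.0):
    """Scale the outputs of driving heads only when conflict is detected.

    head_outputs maps (layer, head) to a list of floats. With scale=0.0 this
    is zero ablation; threshold and scale are illustrative assumptions.
    """
    if p_conflict < threshold:
        # No conflict detected: return an unchanged copy.
        return {h: list(v) for h, v in head_outputs.items()}
    return {
        h: [scale * x for x in v] if h in driving_heads else list(v)
        for h, v in head_outputs.items()
    }


# Toy example with made-up activations.
outputs = {(0, 0): [1.0, 2.0], (1, 3): [3.0, 4.0]}
driving = {(1, 3)}
print(conditional_suppress(outputs, driving, p_conflict=0.9))
print(conditional_suppress(outputs, driving, p_conflict=0.1))
```

Gating on the probe output is what keeps non-conflict inputs untouched; the accuracy drops the review notes for some models would, in this framing, come from false positives of the detector or from driving heads that also serve non-conflict behavior.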

Timeliness & Relevance

Modality-conflict hallucination is a current and pressing concern as MLLMs are deployed in safety-critical applications. The mechanistic interpretability angle is timely, given growing interest in understanding transformer internals beyond behavioral evaluation. The paper sits at an active intersection of MLLM reliability and mechanistic interpretability, both rapidly growing fields.

Strengths

1. Breadth of validation: Testing across five architecturally diverse models (spanning dynamic-resolution tiling, MLP projection, and cross-attention) substantially strengthens the generality claim.

2. Clear causal framework: The path patching methodology provides genuine causal evidence rather than correlational observations, distinguishing this work from attention-weight-based analyses.

3. Principled intervention design: Separating detection (resisting heads) from action (suppressing driving heads) avoids entanglement and is methodologically clean.

4. Honest limitations: The paper acknowledges accuracy drops on InternVL3 and LLaVA, varying transfer magnitudes, and the reliance on object-conflict data for head identification.

5. Favorable comparison: MACI consistently outperforms baselines (VCD, ICD, OPERA, ASCD) on hallucination reduction while maintaining better accuracy trade-offs.

Limitations

1. Narrow evaluation scope: The evaluation is primarily on MMMC (one benchmark) with SCI-SemanticConflict as secondary validation. The setting assumes visual evidence is ground truth, excluding visual-illusion scenarios.

2. Probe supervision requirement: The Lasso logistic regression probe requires labeled conflict/non-conflict samples, limiting out-of-the-box applicability.

3. Accuracy trade-offs: On InternVL3 and LLaVA, non-conflict accuracy drops by roughly 5-7 percentage points, which suggests the driving heads are not purely conflict-specific and weakens the clean mechanistic narrative.

4. Scale limitations: All models are 7-8B parameters; it remains unclear whether the asymmetry pattern holds at larger scales.

5. Prefill-only analysis: The causal analysis uses prefill activations only, potentially missing dynamics that emerge during autoregressive generation.

6. Limited theoretical grounding: The paper identifies the asymmetry empirically but offers no theoretical explanation for why it arises, limiting deeper understanding.
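To make the probe supervision requirement in limitation 2 concrete, here is a minimal sketch of an L1-penalized (Lasso) logistic regression trained by proximal gradient descent. It is a generic textbook procedure, not the paper's training code: the features, labels, penalty strength, and step size are invented for illustration, and the point is only that fitting needs labeled conflict (1) and non-conflict (0) examples.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, amount):
    """Shrink value toward zero by amount; the proximal step for an L1 penalty."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def fit_l1_logistic(X, y, lam=0.01, lr=0.5, steps=500):
    """Proximal gradient descent for mean logistic loss plus lam * ||w||_1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        errors = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - label
                  for row, label in zip(X, y)]
        grad_w = [sum(e * row[j] for e, row in zip(errors, X)) / n
                  for j in range(d)]
        grad_b = sum(errors) / n
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b  # the bias is left unpenalized
    return w, b


# Toy data: feature 0 separates the classes, feature 1 is uninformative.
X = [[1.0, 0.1], [0.9, -0.1], [-1.0, 0.1], [-0.8, -0.1]]
y = [1, 1, 0, 0]  # hypothetical labels: 1 = conflict, 0 = non-conflict

weights, bias = fit_l1_logistic(X, y)
print(weights, bias)
```

The L1 penalty drives weights of uninformative features toward exactly zero, which is the usual motivation for choosing a Lasso probe when only a few components are expected to carry the signal.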

Additional Observations

The paper's framing as providing "causal evidence" is appropriate given the path-patching methodology, though the causal claims are about component-level contributions rather than full circuit-level understanding. The concentration curves and ranked-head analyses provide effective visualization of the asymmetry. The paper would benefit from exploring whether the identified heads have interpretable attention patterns (e.g., attending to text vs. image tokens), which could strengthen the mechanistic narrative.
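A concentration curve of the kind mentioned above can be computed as the cumulative share of a group's total importance captured by its top-ranked heads. The sketch below uses made-up magnitudes chosen to mimic the reported pattern (many moderate driving heads, few strong resisting heads); it does not reproduce the paper's numbers or its exact curve definition.

```python
def concentration_curve(magnitudes):
    """Cumulative share of total importance held by the top-1, top-2, ... heads."""
    ranked = sorted(magnitudes, reverse=True)
    total = sum(ranked)
    if total == 0:
        return []
    curve, running = [], 0.0
    for m in ranked:
        running += m
        curve.append(running / total)
    return curve


# Illustrative magnitudes only.
driving = [1.0] * 8        # broad: eight heads of equal weight, aggregate 8.0
resisting = [4.0, 1.0]     # concentrated: one dominant head, aggregate 5.0

print(concentration_curve(driving)[:2])
print(concentration_curve(resisting))
```

In this toy case the driving group has the larger aggregate weight while the resisting group has the larger per-head maximum and a curve that rises much faster, which is the shape of asymmetry the review describes.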

Rating: 6.5 / 10

Significance 6.5 · Rigor 7 · Novelty 6.5 · Clarity 7.5

Generated May 20, 2026

Comparison History (24)

vs. PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

gpt-5.2 · 5/21/2026

Paper 2 likely has higher impact: it introduces a broadly reusable benchmark-generation framework with scalable, verifiable data and a taxonomy that can become community infrastructure for both evaluation and training. Its applications span many domains requiring planning and constraint satisfaction, and it supports systematic diagnosis plus RL training improvements, increasing downstream adoption. Paper 1 is novel and mechanistically rigorous for multimodal hallucination mitigation, but its scope is narrower (modality-conflict in MLLMs) and interventions are more model/component-specific, potentially limiting cross-field breadth and standardization impact compared to a widely applicable benchmark framework.

vs. Open-World Evaluations for Measuring Frontier AI Capabilities

gpt-5.2 · 5/21/2026

Paper 2 likely has higher scientific impact due to breadth, timeliness, and real-world relevance: it argues for a new evaluation paradigm (open-world evals) applicable across models, domains, and safety/governance contexts, and proposes an ongoing program (CRUX) that could shape how frontier capabilities are measured. While Paper 1 is more mechanistically novel and methodologically rigorous within MLLM hallucination, its impact is narrower (specific failure mode + intervention) and may be less field-wide than an evaluation framework influencing research, deployment, and policy.

vs. Interaction Locality in Hierarchical Recursive Reasoning

gpt-5.2 · 5/21/2026

Paper 2 likely has higher impact due to direct relevance to a widely observed, high-stakes failure mode in deployed multimodal LLMs (hallucinations under modality conflict), broad applicability across multiple open-source MLLMs, and a concrete, actionable mitigation (MACI) with strong benchmark gains and transfer. Its causal head-level analysis plus intervention provides a clear mechanistic story and an immediately usable inference-time method. Paper 1 offers a valuable measurement framework for locality in recursive/spatial reasoning, but its applications are narrower and more interpretability-focused with less immediate real-world payoff.

vs. HaorFloodAlert: Deseasonalized ML Ensemble for 72-Hour Flood Prediction in Bangladesh Haor Wetlands

claude-opus-4.6 · 5/20/2026

Paper 1 addresses a fundamental mechanistic question in multimodal LLMs—why hallucinations occur due to modality conflict—using rigorous causal analysis across multiple models. It identifies specific attention head roles, proposes a principled intervention (MACI), and demonstrates generalizability. The breadth of impact is larger given the centrality of MLLMs in AI research. Paper 2 solves an important but geographically narrow applied problem (haor flood prediction in Bangladesh) with standard ML methods (RF+XGBoost). While valuable for disaster preparedness, its methodological novelty and cross-field impact are more limited.

vs. What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

gpt-5.2 · 5/20/2026

Paper 1 likely has higher scientific impact: it addresses a central, timely question in foundation-model training (whether code causally improves general reasoning) using large-scale controlled pretraining with domain-separated data, yielding actionable guidance for data-centric optimization and revealing trade-offs across capabilities. Its conclusions affect broad LM development (math, knowledge, programming) and can influence corpus design and training strategies across many labs. Paper 2 is rigorous and useful for MLLM hallucination mitigation, but its scope is narrower (modality-conflict cases, inference-time head interventions) and may generalize less broadly than Paper 1’s data-composition findings.

vs. Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem

gpt-5.2 · 5/20/2026

Paper 2 has higher potential impact: it offers a novel mechanistic, head-level causal account of a timely, widely relevant MLLM failure mode (modality-conflict hallucination), validates it across multiple models with causal interventions and ablations, and proposes an actionable inference-time method (MACI) with demonstrated benchmark gains and transfer. This combines methodological rigor with clear real-world applicability and broad relevance to interpretability, multimodal reasoning, and safety. Paper 1 is a valuable diagnostic case study, but its scope is narrower and the main theorem remains unverified, limiting practical and cross-field impact.

vs. GroupAffect-4: A Multimodal Dataset of Four-Person Collaborative Interaction

gemini-3.1 · 5/20/2026

Paper 2 addresses a critical and highly timely issue in AI—multimodal large language model hallucinations. By providing a mechanistic understanding of attention heads and proposing a zero-shot causal intervention to reduce hallucinations, it directly impacts the reliability and safety of state-of-the-art AI systems. While Paper 1 offers a valuable multimodal dataset for affective computing, the broader applicability, rapid adoption potential, and high relevance of fixing LLM hallucinations give Paper 2 a significantly higher potential for immediate and widespread scientific impact.

vs. Swimming with Whales: Analysis of Power Imbalances in Stake-Weighted Governance

gpt-5.2 · 5/20/2026

Paper 2 has higher potential impact due to strong timeliness (MLLM hallucinations), broad applicability across multimodal AI systems, and a mechanistic-causal methodology (path patching, head-level causal roles) that can generalize to interpretability and safety research. It also delivers an actionable intervention (MACI) with validated performance across multiple models and benchmarks, increasing real-world relevance. Paper 1 is novel and rigorous within PoS governance and computational social choice, but its application domain is narrower and likely affects fewer adjacent fields than advances in multimodal model reliability.

vs. Generative Recursive Reasoning

gpt-5.2 · 5/20/2026

Paper 2 (GRAM) is likely higher impact due to greater conceptual novelty and breadth: it reframes recursive/iterative reasoning as a probabilistic latent-trajectory generative model, enabling multi-hypothesis computation, inference-time scaling via depth and sampling, and both conditional and unconditional generation. This is a broadly applicable modeling paradigm relevant to reasoning, generative modeling, and scalable inference across domains. Paper 1 is rigorous and practically useful for mitigating a specific MLLM hallucination mode, but its contribution is more targeted (mechanistic diagnosis + intervention on attention heads) and may generalize less widely.

vs. EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

gemini-3.1 · 5/20/2026

Paper 2 addresses a fundamental and highly relevant issue in Multimodal Large Language Models (MLLMs)—modality-conflict hallucinations—using rigorous mechanistic interpretability (causal analysis). Its findings on attention head imbalance offer deep architectural insights, and the proposed MACI intervention demonstrates strong zero-shot transferability across multiple models. While Paper 1 addresses an important application (deepfake detection), its approach is more incremental and its evaluation narrower. Paper 2's foundational insights into MLLM behavior give it a broader and more significant potential scientific impact.

vs. Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

gpt-5.2 · 5/20/2026

Paper 1 has higher potential impact due to a more fundamental, broadly applicable theoretical contribution: a new IC-SMDP formalization of decentralized handoff-based workflows and the first finite-sample guarantee for neural Q-learning under decentralized partial observability, with a decomposable error bound and methodological novelty (AIS lifted to multi-agent SMDPs). This could influence multi-agent RL, distributed learning, and multi-LLM pipeline design across trust boundaries. Paper 2 is timely and useful (mechanistic interpretability + intervention for MLLM hallucinations), but is narrower in scope and more benchmark/architecture-dependent.

vs. Useful Memories Become Faulty When Continuously Updated by LLMs

gpt-5.2 · 5/20/2026

Paper 2 has higher likely impact due to a more mechanistically novel contribution (head-level causal roles via path patching across multiple MLLMs) and a concrete, generalizable mitigation (MACI) that improves multimodal hallucination at inference time with good trade-offs and transfer. Its applications are immediate for safety/reliability in deployed multimodal systems and the findings may influence interpretability and architecture design broadly. Paper 1 is timely and important for agent memory practice, but its main contribution is largely diagnostic/negative (consolidation can degrade), with narrower cross-field reach and less direct intervention beyond gating/episodic retention.

vs. Why Retrying Fails: Context Contamination in LLM Agent Pipelines

gpt-5.2 · 5/20/2026

Paper 2 has higher estimated impact due to a clearer, more generalizable theoretical contribution: it formalizes a widely observed failure mode in LLM agent retries (context contamination), derives multiple closed-form results (success probability, overhead, optimal budget allocation), and provides an information-theoretic tightness bound plus empirical validation on SWE-bench Verified. This offers actionable guidance for designing agent pipelines across many domains and tools, and is timely given rapid adoption of tool-using agents. Paper 1 is strong and useful, but is narrower (multimodal conflict hallucination, head-level interventions) and more model- and benchmark-specific.

vs. Agentic Trading: When LLM Agents Meet Financial Markets

claude-opus-4.6 · 5/20/2026

Paper 2 presents a novel mechanistic finding about attention head imbalance in multimodal LLMs with causal evidence, proposes a practical intervention (MACI), and demonstrates effectiveness across multiple models and benchmarks. It combines mechanistic interpretability with a concrete solution, offering both scientific insight and practical utility. Paper 1 is a valuable survey identifying reproducibility gaps in LLM trading agents, but its contributions are primarily organizational and diagnostic rather than introducing new methods or discoveries. Paper 2's mechanistic insights into hallucination have broader implications for MLLM reliability and alignment.

vs. Agentic Systems as Boosting Weak Reasoning Models

gpt-5.2 · 5/20/2026

Paper 1 has higher likely impact due to a more general, timely contribution: a formal framework and bounds for inference-time boosting via agentic/committee search, clarifying when weak-model orchestration can reliably match stronger models and what signals are required (execution/tests/proof checking). It also demonstrates strong applied gains on SWE-bench Verified, a high-stakes benchmark for real-world coding agents. The ideas plausibly transfer across domains (reasoning, tool use, verification, agent design). Paper 2 is rigorous and valuable but narrower (modality-conflict hallucinations in MLLMs) and more model-internals-specific.

vs. Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

claude-opus-4.6 · 5/20/2026

Paper 1 offers a novel mechanistic explanation for modality-conflict hallucination in MLLMs through causal head-level analysis, identifying specific driving/resisting attention heads, and proposes MACI—a principled intervention method validated across five models and multiple benchmarks. This provides both fundamental understanding and a practical solution to a critical problem in multimodal AI. Paper 2 is a useful empirical study of multi-model LLM scheduling but primarily characterizes known trade-offs (offloading, preemption) without proposing new scheduling algorithms. Paper 1's methodological novelty, mechanistic insights, and actionable intervention give it broader and deeper scientific impact.

vs. OpenComputer: Verifiable Software Worlds for Computer-Use Agents

gpt-5.2 · 5/20/2026

Paper 2 likely has higher impact because it introduces a broadly useful, verifiable evaluation/training infrastructure for computer-use agents across 33 real applications and 1,000 tasks, addressing a timely bottleneck: reliable, auditable benchmarking beyond LLM-as-judge. Its framework (verifiers, self-improving verification, task generation, partial-credit rewards) can enable reproducible research and accelerate progress across agent learning, HCI, software engineering, and safety. Paper 1 is novel and mechanistically rigorous for MLLM hallucinations, but its intervention is narrower in scope and primarily impacts multimodal generation robustness rather than enabling a cross-field platform.

vs. Ex Ante Evaluation of AI-Induced Idea Diversity Collapse

gpt-5.2 · 5/20/2026

Paper 2 likely has higher impact: it provides mechanistic, causal evidence at the attention-head level across multiple MLLMs, backed by interventions and ablations, and proposes a practical inference-time method (MACI) that improves hallucination behavior and transfers zero-shot. This combines strong methodological rigor, clear real-world applicability (safety/reliability of multimodal systems), and timeliness in a high-priority failure mode. Paper 1 is novel and valuable for evaluation of population-level diversity, but its immediate downstream leverage and cross-system adoption may be less direct than a concrete mitigation technique for hallucinations.

vs. DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

gemini-3.1 · 5/20/2026

Paper 1 advances fundamental understanding of Multimodal Large Language Models through mechanistic interpretability, pinpointing the exact internal dynamics causing modality-conflict hallucinations. Its proposed causal intervention directly addresses a critical AI safety and reliability issue without requiring retraining. While Paper 2 offers a valuable benchmark for multi-agent systems, Paper 1 provides deeper methodological innovation and broader immediate impact by uncovering the architectural root causes of hallucinations and offering a scientifically grounded, inference-time solution.

vs. Distribution-Free Uncertainty Quantification for Continuous AI Agent Evaluation

claude-opus-4.6 · 5/20/2026

Paper 1 offers deeper mechanistic insight into a fundamental problem (hallucination in multimodal LLMs) with broad implications. It identifies causal mechanisms via path patching across five models, revealing a novel asymmetry between hallucination-driving and resisting attention heads, and proposes an effective intervention (MACI). This addresses a core challenge in the rapidly growing MLLM field. Paper 2 applies existing conformal prediction methods to AI agent evaluation—useful but more incremental, adapting known statistical tools to a narrower application domain with less fundamental novelty.