Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

Jiawei Kong, Hao Fang, Shunxiang Liao, Jinyu Li, Bin Chen, Hao Wu, Shu-Tao Xia, Min Zhang

May 27, 2026

arXiv:2605.27906v1

cs.AI (primary)

#884 of 2682 · Artificial Intelligence

Tournament Score: 1446±50 (scale 1050 to 1800)

Win Rate: 67%

Rating: 7.2 / 10

Significance 7.5 · Rigor 7 · Novelty 7 · Clarity 8


Abstract

Multimodal Large Reasoning Models introduce the reasoning paradigm, demonstrating strong capabilities on complex vision-language tasks. However, they still suffer from severe hallucinations. Existing training-based methods typically mitigate hallucinations through response-level direct preference optimization (DPO), where the Chain-of-Thought (CoT) and the final answer are treated as a monolithic output and optimized jointly. We reveal that this formulation performs similarly to answer-only optimization, suggesting that it primarily learns answer-level preference, while leaving CoT-level supervision insufficiently exploited. To address this issue, we explicitly formulate a CoT-oriented preference term and derive Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which models the CoT as a condition for answer generation and contrasts the preference for the same preferred answer under different CoT conditions, promoting answer-supportive reasoning chain alignment. To further improve optimization, we introduce a reasoning-enhanced preference data generation strategy that employs Monte Carlo Tree Search to discover visually grounded and logically consistent CoTs as positive samples, and attention-guided CoT token pruning to construct negative ones. Extensive experiments across various models and benchmarks show that RC-DPO effectively mitigates hallucinations and improves the reliability of the multimodal reasoning process.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Reasoning-Conditioned Preference Optimization for Hallucination Mitigation in MLRMs

1. Core Contribution

The paper identifies a specific and actionable problem: when standard DPO is applied to Multimodal Large Reasoning Models (MLRMs) that generate Chain-of-Thought (CoT) reasoning before answers, the optimization is biased toward answer-level preferences while leaving reasoning chains insufficiently aligned. The core novelty lies in RC-DPO, which decomposes the preference optimization by treating CoT as a *condition* for answer generation rather than part of a monolithic response. The key mathematical insight is elegant: by fixing the preferred answer and varying only the CoT condition, the framework isolates a reasoning-conditioned preference term (Equation 8) that directly supervises CoT quality through its effect on answer generation likelihood.

The contribution has three layers: (1) a diagnostic analysis showing response-level DPO suffers from answer-level shortcuts, (2) the RC-DPO objective that adds an explicit CoT-conditioned term, and (3) a data construction pipeline using MCTS for positive samples and attention-guided token pruning for negatives.
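To make the shape of the objective concrete, here is a minimal Python sketch of a response-level preference term combined with a reasoning-conditioned term. It is an illustration under assumptions, not the paper's Equation 8, which this assessment does not reproduce: the function names, the tuple layout, and the value of beta are hypothetical, while the weight lam=0.1 mirrors the λ=0.1 mentioned in the weaknesses below. The reasoning-conditioned term scores the same preferred answer under the preferred CoT versus the rejected CoT.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_loss(pol_pos, ref_pos, pol_neg, ref_neg, beta):
    """DPO-style loss: -log sigmoid(beta * policy-vs-reference margin)."""
    margin = (pol_pos - ref_pos) - (pol_neg - ref_neg)
    return -log_sigmoid(beta * margin)


def rc_dpo_loss(resp, ans_given_cot, beta=0.1, lam=0.1):
    """Assumed combination of the two terms (hypothetical helper).

    resp: log-probs of whole responses (CoT + answer),
        as (policy_preferred, ref_preferred, policy_rejected, ref_rejected).
    ans_given_cot: log-probs of the SAME preferred answer, conditioned on
        the preferred CoT (first pair) vs. the rejected CoT (second pair).
    """
    response_term = preference_loss(*resp, beta=beta)
    reasoning_term = preference_loss(*ans_given_cot, beta=beta)
    return response_term + lam * reasoning_term


# Toy log-probabilities: the answer is more likely under the preferred CoT.
loss = rc_dpo_loss(resp=(-10.0, -11.0, -12.0, -11.0),
                   ans_given_cot=(-1.0, -2.0, -4.0, -2.0))
```

Because only the CoT condition changes between the two sides of the second term, its gradient cannot be satisfied by raising the answer's likelihood alone, which is the shortcut the assessment attributes to response-level DPO.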

2. Methodological Rigor

Motivating analysis is well-structured. The three observations in Section 2 — loss-ratio dynamics showing faster answer optimization, near-equivalent performance of answer-only DPO, and conditional dependency between CoT and answer hallucinations — provide convincing evidence for the claimed problem. The answer-only ablation (Figure 2b) is particularly telling: if response-level DPO performs similarly to answer-only DPO, the CoT supervision is indeed being wasted.

Derivation is mathematically clean. The chain-rule decomposition (Equations 6-7) clearly shows how the reasoning-conditioned term relates to standard DPO, and the complementary nature of the two objectives is well-justified. The formulation avoids unnecessary complexity.
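For readers without the paper at hand, the chain-rule step can be written out as follows. The notation is assumed for illustration (x for the image and prompt, c for the CoT, a for the answer, β for the DPO temperature) and may differ from the paper's Equations 6-7.

```latex
% Chain rule on the joint likelihood of CoT and answer
\log \pi_\theta(c, a \mid x) = \log \pi_\theta(c \mid x) + \log \pi_\theta(a \mid x, c)

% The DPO implicit reward therefore splits into a CoT part and an answer-given-CoT part
\beta \log \frac{\pi_\theta(c, a \mid x)}{\pi_{\mathrm{ref}}(c, a \mid x)}
  = \beta \log \frac{\pi_\theta(c \mid x)}{\pi_{\mathrm{ref}}(c \mid x)}
  + \beta \log \frac{\pi_\theta(a \mid x, c)}{\pi_{\mathrm{ref}}(a \mid x, c)}
```

Holding a fixed at the preferred answer and comparing the second term across two different CoTs gives a quantity that depends on the reasoning chain only through how well it supports that answer.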

Experimental evaluation is comprehensive — four MLRMs (R1-Onevision, MM-Eureka, ThinkLite-VL, OpenVLThinker) across nine benchmarks covering hallucination-specific and general multimodal tasks. The inclusion of segment-level CHAIR analysis (Figure 6) provides unique insight showing CoT hallucination reduction specifically.
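The segment-level idea can be illustrated with a small Python sketch that computes an instance-level CHAIR score (hallucinated object mentions divided by all object mentions) separately for the CoT and the answer. This is a simplification under assumptions: object mentions are taken as already extracted, the function names are hypothetical, and how the paper actually splits and scores segments is not detailed in this assessment.

```python
def chair_i(mentioned, ground_truth):
    """Fraction of mentioned objects that are absent from the image."""
    if not mentioned:
        return 0.0
    hallucinated = [obj for obj in mentioned if obj not in ground_truth]
    return len(hallucinated) / len(mentioned)


def segment_chair(cot_objects, answer_objects, ground_truth):
    """Score the CoT segment and the answer segment independently."""
    return {
        "cot": chair_i(cot_objects, ground_truth),
        "answer": chair_i(answer_objects, ground_truth),
    }


# Toy example: the CoT invents two objects, the answer stays grounded.
scores = segment_chair(
    cot_objects=["dog", "frisbee", "bench", "tree"],
    answer_objects=["dog"],
    ground_truth={"dog", "frisbee"},
)
```

Reporting the two numbers side by side is what lets a reader see whether a method reduces hallucination in the reasoning itself or only in the final answer.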

Potential concerns: The MCTS-based positive sample construction uses Qwen3-VL as a verifier, introducing dependence on an external model's quality. The paper does not thoroughly analyze cases where MCTS fails to find high-quality trajectories. The pruning-based negative construction, while intuitive, is somewhat simplistic — removing visually salient tokens is a coarse proxy for generating hallucinated reasoning. The paper acknowledges this limitation.
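A rough picture of the pruning-based negative construction, as a Python sketch: drop the fraction of CoT tokens with the highest visual-attention scores. The 20% ratio matches the figure cited in the weaknesses below; everything else (the function name, the per-token score list, and how saliency is computed) is assumed here for illustration and is not taken from the paper.

```python
def prune_salient_tokens(tokens, visual_attention, ratio=0.2):
    """Remove the top `ratio` fraction of tokens ranked by visual attention.

    tokens: list of CoT tokens.
    visual_attention: one hypothetical saliency score per token.
    """
    if len(tokens) != len(visual_attention):
        raise ValueError("tokens and visual_attention must have equal length")
    k = int(len(tokens) * ratio)  # number of tokens to drop
    ranked = sorted(range(len(tokens)),
                    key=lambda i: visual_attention[i],
                    reverse=True)
    drop = set(ranked[:k])
    # Keep the remaining tokens in their original order.
    return [tok for i, tok in enumerate(tokens) if i not in drop]


cot = ["the", "red", "car", "is", "parked", "near", "a", "tall", "tree", "today"]
attn = [0.01, 0.30, 0.40, 0.02, 0.05, 0.03, 0.01, 0.10, 0.08, 0.00]
negative_cot = prune_salient_tokens(cot, attn)
```

The toy output loses "red" and "car", which shows why the reviewer calls this a coarse proxy: the pruned chain is less visually grounded, but it does not contain an actively wrong claim.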

3. Potential Impact

Direct impact on MLRM alignment. As reasoning models become the default paradigm for complex multimodal tasks, the insight that response-level DPO is insufficient for CoT alignment is broadly relevant. RC-DPO could become a standard component in MLRM post-training pipelines.

Methodological template. The idea of conditioning on intermediate outputs for preference optimization could generalize beyond hallucination mitigation — to code generation (conditioning on plans), multi-step mathematical reasoning, or agent trajectories where intermediate steps matter.

Practical value. The method is relatively lightweight: it uses LoRA-based SFT, requires only 10K SFT + 5K DPO samples, and trains in ~5.5 hours on four L20 GPUs. This accessibility increases adoption potential.

Limitations on impact scope: The current evaluation focuses heavily on object hallucination (CHAIR, POPE), which is a well-studied but somewhat narrow failure mode. The method's effectiveness on more subtle hallucinations (spatial relations, temporal reasoning, abstract concepts) is less established.

4. Timeliness & Relevance

This paper addresses a critical gap. The rapid deployment of reasoning models (DeepSeek-R1, OpenAI o3) into multimodal settings has created an urgent need for alignment methods that respect the CoT-answer structure. Existing DPO methods designed for conventional MLLMs are being naively applied to MLRMs without accounting for this structure. The timing is excellent — the problem is emerging but solutions are nascent.

The observation that extended reasoning can *amplify* hallucinations (by propagating unfounded visual claims through reasoning steps) is particularly timely and connects to concurrent work on reasoning chain evaluation (MIRAGE, cited in the paper).

5. Strengths & Limitations

Key Strengths:

Clean problem formulation: The answer-shortcut diagnosis is convincing and the solution follows naturally from the analysis.

Principled decomposition: The mathematical separation of CoT and answer preference terms is well-motivated and interpretable.

Consistent improvements: RC-DPO improves across all four models and virtually all benchmarks, suggesting generalizability rather than model-specific tuning.

No degradation on general tasks: Results on MME, MMBench, VMCBench, and MMVP show that hallucination mitigation does not sacrifice general capability.

Segment-level analysis: The decomposed CHAIR evaluation provides unique evidence that CoT hallucination is specifically reduced.

Notable Weaknesses:

Negative sample construction is somewhat ad hoc: Pruning top-20% visually salient tokens is a heuristic with limited theoretical justification. The sensitivity analysis (Figure 5c) shows only modest variation, but the approach doesn't capture logical inconsistencies or commonsense errors.

Limited analysis of failure modes: When does RC-DPO fail? Are there cases where CoT alignment hurts answer quality or introduces new biases?

Scale of experiments: All models are 7B parameters. Whether findings hold for larger models (70B+) is unknown.

Single training dataset: All experiments use RLAIF-V data. Domain generalization is untested.

The λ=0.1 weighting suggests the RC term is a minor addition; the paper could better analyze why larger weights degrade performance.

Additional Observations

The paper's framing positions CoT as a hallucination bottleneck, but the conditional analysis (Figure 2c) shows that when CoT is hallucination-free, answers are also clean. This raises an interesting question: would simply filtering out hallucinated CoTs at inference time (via self-consistency or verification) achieve similar benefits without training? The paper does not discuss this inference-time alternative.
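The inference-time alternative raised here could look roughly like the following Python sketch: sample several (CoT, answer) pairs, discard those whose CoT fails a grounding check, and take a majority vote over the surviving answers. The verifier is a hypothetical callable supplied by the user; nothing in this sketch comes from the paper, which does not evaluate such a baseline.

```python
from collections import Counter


def filter_and_vote(samples, cot_is_grounded):
    """Majority vote over answers whose CoT passes a grounding check.

    samples: list of (cot, answer) pairs from repeated sampling.
    cot_is_grounded: hypothetical verifier, returns True for an acceptable CoT.
    Returns None when every sample is rejected.
    """
    kept = [answer for cot, answer in samples if cot_is_grounded(cot)]
    if not kept:
        return None
    return Counter(kept).most_common(1)[0][0]


# Toy verifier: reject any CoT that mentions an object not in the image.
image_objects = {"dog", "frisbee"}
samples = [
    ("the dog chases a frisbee", "yes"),
    ("the dog sits on a bench", "no"),
    ("a dog and a frisbee are visible", "yes"),
]
grounded = lambda cot: "bench" not in cot
answer = filter_and_vote(samples, grounded)
```

Such a filter adds sampling and verification cost at every query, whereas a training-time method pays that cost once.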

The connection to process reward models (PRMs) in mathematical reasoning is underexplored — RC-DPO essentially provides process-level supervision through preference optimization rather than explicit step-level rewards.

Rating: 7.2 / 10

Significance 7.5 · Rigor 7 · Novelty 7 · Clarity 8

Generated May 28, 2026

Comparison History (15)

vs. Auditable Decision Models with Learned Abstention and Real-Time Steering

gpt-5.2 · 5/28/2026

Paper 1 likely has higher scientific impact: it introduces a novel training objective (RC-DPO) that explicitly conditions answer preference on chain-of-thought quality, plus an MCTS-based preference data strategy, directly targeting multimodal hallucination—a timely, high-priority problem with broad relevance across vision-language systems. The methodological contribution is more generalizable to frontier multimodal reasoning models and could influence future preference-optimization and alignment work. Paper 2 is practically valuable for auditable deployment and abstention, but appears more application/interface-oriented with less foundational algorithmic novelty and narrower research spillover.

vs. CaMBRAIN: Real-time, Continuous EEG Inference with Causal State Space Models

claude-opus-4.6 · 5/28/2026

CaMBRAIN introduces a fundamentally new architecture paradigm for EEG processing—causal SSMs enabling real-time continuous inference—which addresses a critical bottleneck in clinical neuroscience. Its contributions span architectural innovation (first causal Mamba-based EEG model), a novel multi-stage self-supervised training pipeline for long-range memory retention, and practical clinical applicability with >10x throughput gains. The breadth of impact across neuroscience, clinical monitoring, and deep learning for time-series is substantial. Paper 1, while solid, is more incremental—refining DPO for multimodal reasoning hallucination—within an already crowded space of LLM alignment methods.

vs. Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

gpt-5.2 · 5/28/2026

Paper 2 likely has higher scientific impact due to a broadly useful, multilingual diagnostic benchmark (MentalMap) that can become a standard evaluation tool across the field, informing model development, cognition-inspired analysis, and cross-lingual NLP. Its methodological contribution (capability hierarchy, multiple diagnostic axes, structured-text control, many models, plus human comparison) supports strong, generalizable claims about a persistent spatial reasoning bottleneck. Paper 1 is practically valuable for reducing multimodal hallucinations, but its impact is more specialized to a training recipe and may be superseded faster than a widely adopted benchmark and reframing result.

vs. SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models

gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact due to its direct real-world applicability (clinical deployment), strong timeliness (LLM governance/safety), and broad relevance to safety, auditing, and regulatory evidence across high-stakes AI domains. Its clinician-audited provenance pipeline and red-team evaluation address concrete barriers to adoption beyond accuracy. Paper 1 is methodologically novel and valuable for multimodal hallucination reduction, but the contribution is more incremental within ML training (a refined preference-optimization objective plus data generation) and may have narrower immediate societal impact than a governance-oriented medical alignment framework.

vs. RULER: Representation-Level Verification of Machine Unlearning

claude-opus-4.6 · 5/28/2026

Paper 1 addresses the critical and timely problem of hallucination in multimodal large reasoning models, which is central to the rapidly growing field of LLMs/MLLMs. Its novel decomposition of CoT reasoning from answer optimization (RC-DPO) and the MCTS-based data generation strategy represent meaningful methodological contributions with broad applicability across the fast-expanding multimodal AI ecosystem. Paper 2 makes a solid contribution to machine unlearning verification, but targets a narrower community. The sheer scale of interest in reasoning model reliability and hallucination mitigation gives Paper 1 greater potential for citations and real-world impact.

vs. Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction

gemini-3.1 · 5/28/2026

Paper 2 addresses a critical, widely recognized problem in AI (hallucinations in multimodal models) using a mathematically grounded and empirically validated methodology. Its introduction of RC-DPO offers immediate, practical applications for improving state-of-the-art models. In contrast, while Paper 1 presents a highly novel perspective, its reliance on subjective auto-ethnography and AI 'first-person' self-reporting lacks the methodological rigor and reproducibility required for broad impact in the core machine learning community.

vs. Retrying vs Resampling in AI Control

claude-opus-4.6 · 5/28/2026

Paper 2 introduces a novel optimization framework (RC-DPO) that addresses a fundamental problem—hallucinations in multimodal reasoning models—with broad applicability. It provides a principled theoretical derivation showing why standard DPO fails to leverage CoT supervision, and proposes concrete solutions (MCTS-based data generation, attention-guided pruning). This has wider impact across the rapidly growing multimodal AI field. Paper 1 makes valuable but narrower contributions to AI control/safety protocols, with findings that are setting-specific and sometimes contradictory to prior work, suggesting limited generalizability.

vs. Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning

gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact: it proposes a principled training objective (RC-DPO) targeting a core, widely recognized failure mode—multimodal hallucination—by separating CoT-conditioned alignment from answer preference, plus a scalable data-generation pipeline (MCTS positives, attention-pruned negatives). This is methodologically substantial and broadly applicable across multimodal reasoning models and tasks, with clear real-world relevance (reliability/safety). Paper 1 contributes a valuable benchmark and an agentic inference framework, but its scope is narrower (audio-visual multi-hop video) and may be less broadly reusable than a general alignment/optimization method.

vs. Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

claude-opus-4.6 · 5/28/2026

Paper 2 presents a novel, actionable method (RC-DPO) that directly addresses hallucination in multimodal reasoning models with a concrete algorithmic contribution—decomposing CoT-level and answer-level preference optimization—backed by MCTS-based data generation. This offers immediate practical impact for improving model reliability. Paper 1 proposes a useful evaluation framework but is primarily diagnostic rather than prescriptive, and multi-dimensional evaluation frameworks, while valuable, tend to have slower adoption. Paper 2's methodological innovation in training optimization has broader applicability and addresses a critical, timely problem (hallucination mitigation) with a reproducible solution.

vs. Neural Scalable Symbolic Search Framework for Complex Logical Queries with Multiple Free Variables

gemini-3.1 · 5/28/2026

Paper 2 addresses hallucination mitigation in Multimodal Large Reasoning Models, a highly timely and critical issue with broad implications across AI applications. Its focus on reasoning-conditioned preference optimization aligns with the current frontier of enhancing Chain-of-Thought in foundation models. While Paper 1 presents a solid methodological advance for Knowledge Graphs, Paper 2's focus on large multimodal models promises wider applicability, broader cross-disciplinary impact, and more immediate real-world relevance.

vs. Operational AI Deployment Assurance: Governance-State Orchestration Under Threshold-Sensitive Deployment Conditions -- A Governance Framework for High-Stakes AI Systems

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a critical, timely problem (hallucinations in multimodal large reasoning models) with a novel, well-defined technical contribution (RC-DPO) backed by extensive experiments. It introduces a concrete algorithmic innovation—decomposing CoT-level and answer-level preference optimization—with broad applicability across multiple models and benchmarks. Paper 1 proposes a governance framework (OADA) with useful conceptual constructs but is more incremental, less experimentally rigorous, and targets a narrower audience. Paper 2's methodological contribution is more likely to be adopted and cited given the rapid growth of multimodal LLM research.

vs. Agent Manufacturing: Foundation-Model Agents as First-Class Industrial Entities

gemini-3.1 · 5/28/2026

Paper 2 addresses a critical, highly active bottleneck in AI (hallucinations in multimodal reasoning models) with a concrete, rigorous algorithmic solution (RC-DPO). Methodological advancements in preference optimization currently drive rapid, widespread adoption and high citation rates across multiple domains. In contrast, Paper 1 is a conceptual vision piece proposing a new manufacturing paradigm. While valuable for its specific industry, conceptual frameworks generally have a slower, narrower scientific impact compared to core algorithmic breakthroughs in AI.

vs. Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

gemini-3.1 · 5/28/2026

Paper 2 provides a fundamental theoretical proof (kernel obstruction theorem) explaining why current LLM paradigms fail at causal discovery, a cornerstone of scientific reasoning. It then offers a novel, provably convergent agentic solution (A-CBO) that circumvents this limitation. This represents a significant paradigm shift and foundational contribution. Paper 1, while practically valuable for mitigating hallucinations via improved preference optimization, represents a more incremental methodological refinement within an existing framework.

vs. LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact: LaneRoPE introduces a broadly applicable inference-time mechanism for collaborative parallel generation, improving test-time scaling with minimal architectural changes and negligible overhead. This is timely given widespread use of best-of-N/parallel decoding and could generalize across domains beyond math (e.g., planning, code, multimodal) and across many existing LLM deployments. Paper 1 is valuable for reducing multimodal hallucinations and improves training methodology, but its scope is more specialized (multimodal CoT/DPO training, data generation) and may be harder to adopt widely than an inference-time positional/attention modification.

vs. AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

gemini-3.1 · 5/28/2026

Paper 2 addresses a critical and fundamental issue (hallucinations in multimodal reasoning) by introducing a novel algorithmic training method (RC-DPO) and an innovative MCTS-based data generation strategy. Advances in foundational training methodologies typically yield broader scientific impact and higher adoption across various model architectures compared to specific benchmarking frameworks like the one proposed in Paper 1.