Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, Yike Guo

May 20, 2026

arXiv:2605.20834v1

cs.AI (primary), cs.LG

#23 of 2292 · Artificial Intelligence

Tournament Score: 1589 ± 38

Win Rate: 62%

Rating: 6.5 / 10

Significance 7 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Abstract

Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPO's guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at: https://github.com/visitworld123/CPO.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper identifies a critical implicit assumption in DPO's derivation: the RLHF-optimal policy must prefer human-preferred responses. The authors prove that DPO-RLHF equivalence is *conditional* on this assumption, which depends on reference policy quality. When the reference policy is sufficiently misaligned (δ_πref ≤ -Δr*/β), the RLHF-optimal policy can prefer dispreferred responses due to KL regularization dominance, causing DPO to optimize "relative advantage over the reference policy" rather than "absolute alignment with human preferences." The authors characterize an "undesirable solution space" U where policies decrease DPO loss while violating human preferences, and propose two fixes: CPO (soft constraint) and E-CPOC (hard constraint with conservative bounds).
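The threshold above can be read off the standard closed form of the KL-regularized RLHF optimum. The following is a sketch, assuming that δ_πref denotes the reference log-ratio log π_ref(y_w|x) − log π_ref(y_l|x) and that Δr* denotes the reward gap r*(x, y_w) − r*(x, y_l); the paper's exact notation may differ.

```latex
\begin{align*}
\pi^*(y \mid x) &= \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(r^*(x,y)/\beta\big), \\
\log\frac{\pi^*(y_w \mid x)}{\pi^*(y_l \mid x)}
  &= \underbrace{\log\frac{\pi_{\mathrm{ref}}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}}_{\delta_{\pi_{\mathrm{ref}}}}
   + \frac{\Delta r^*}{\beta}, \\
\pi^*(y_w \mid x) \le \pi^*(y_l \mid x)
  &\iff \delta_{\pi_{\mathrm{ref}}} \le -\frac{\Delta r^*}{\beta}.
\end{align*}
```

As an illustrative reading with made-up numbers: for β = 0.1 and Δr* = 0.2 the threshold is −2, so the optimum prefers the dispreferred response whenever the reference assigns it at least e² ≈ 7.4 times the probability of the preferred one.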

Methodological Rigor

The theoretical development is thorough and well-structured. The core insight—that substituting π* into the Bradley-Terry model requires π* to respect preference ordering—is clearly articulated and formalized (Assumption 3.1, Proposition 3.2). The characterization of the undesirable solution space (Definition 3.3) and the gradient vanishing analysis (Proposition 3.4) provide concrete mechanisms for how DPO fails.
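A small numeric check can make the failure mechanism concrete. The sketch below uses the standard per-pair DPO loss and a simplified, hypothetical membership test for the undesirable region (the policy beats the reference log-ratio yet still prefers y_l); it is a stand-in for Definition 3.3, not a reproduction of it, and all numbers are illustrative.

```python
import math

def dpo_loss(policy_logratio, ref_logratio, beta):
    """Standard DPO loss for one pair: -log sigmoid(beta * (policy - ref)).

    Each log-ratio is log pi(y_w|x) - log pi(y_l|x).
    """
    margin = beta * (policy_logratio - ref_logratio)
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def in_undesirable_region(policy_logratio, ref_logratio):
    """Hypothetical test: improves on the reference but still prefers y_l."""
    return ref_logratio < policy_logratio < 0.0

beta = 0.1
ref_logratio = -3.0     # reference strongly prefers the dispreferred response
policy_logratio = -1.0  # policy moved toward y_w but still prefers y_l

loss_at_init = dpo_loss(ref_logratio, ref_logratio, beta)  # policy == reference
loss_now = dpo_loss(policy_logratio, ref_logratio, beta)

print(f"loss at init: {loss_at_init:.4f}, loss now: {loss_now:.4f}")
print("still prefers y_l:", policy_logratio < 0.0)
```

The loss falls below its initial value of log 2 even though the policy continues to rank the dispreferred response higher, which is the behavior the review describes for policies in U.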

The paper provides a comprehensive set of proofs across extensive appendices. The CPO derivation from constrained RLHF (Definition 4.7, Theorem 4.8) is clean, and the approximation of 1/π* with 1/π_ref to achieve a stationary objective is well-justified with explicit error bounds (Proposition H.2). The E-CPOC development using conservative bounds (Φ_cons) and KKT conditions is mathematically sophisticated.

However, several concerns arise regarding rigor:

1. The assumption set for E-CPOC equivalence is extensive (Assumptions 4.1-4.5, Condition 4.6), and the ℓ2-to-ℓ∞ conversion via Lemma L.1 introduces a √N factor that could be problematic for large datasets, potentially making the pointwise bound vacuous in practice.

2. The approximation 1/π* ≈ 1/π_ref in CPO's margin term is weakest precisely where it matters most: when π_ref is misaligned (low probability on preferred responses). The authors acknowledge this and point to a self-correcting property (Remark H.3), but the theoretical guarantee degrades in the regime where the method is most needed.

3. Experimental validation is limited. Only one model size (Llama-3-8B-Instruct) is tested, and the improvements over DPO are modest (0.55% WR on AlpacaEval 2, 3.7% on Arena-Hard). The paper acknowledges that E-CPOC lacks experimental validation entirely—a significant gap given the extensive theoretical development devoted to it.
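The √N concern in the first item can be illustrated with the standard norm inequality. This is a sketch under the assumption that the ℓ2 bound controls a root-mean-square error over N preference pairs; the precise statement of Lemma L.1 may differ.

```latex
\begin{align*}
\|v\|_\infty \;\le\; \|v\|_2 \;\le\; \sqrt{N}\,\|v\|_\infty
  &\qquad \text{for any } v \in \mathbb{R}^N, \\
\Big(\tfrac{1}{N}\textstyle\sum_{i=1}^{N} v_i^2\Big)^{1/2} \le \varepsilon
  &\;\Longrightarrow\; \max_i |v_i| \;\le\; \|v\|_2 \;\le\; \sqrt{N}\,\varepsilon .
\end{align*}
```

The bound is tight when all of the error sits on a single pair. For a hypothetical dataset of N = 60,000 pairs, √N ≈ 245, so an average-case error of ε only guarantees a pointwise error of about 245ε.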

Potential Impact

The theoretical insight about DPO's conditional equivalence addresses a genuine concern in the alignment community. As DPO has become widely adopted, understanding its failure modes is valuable. The 45.5% violation rate measured on a standard setup (Figure 1, Appendix A.1) demonstrates this is not a corner case.

The geometric interpretation through margin ranking loss (Section 5) is particularly elegant—showing DPO implements margin ranking with potentially negative targets provides immediate intuition for practitioners. The connection to the learning-to-rank literature could enable cross-pollination of techniques.
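The margin-ranking reading can be checked directly: the standard DPO loss coincides with a softplus (soft hinge) ranking loss whose target margin is the reference log-ratio. The sketch below assumes that form of soft margin ranking; the function names and the numbers are illustrative, and the paper's Section 5 may parameterize the loss differently.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def soft_margin_ranking_loss(score_gap, target_margin, beta):
    """Smooth analogue of the hinge loss max(0, target_margin - score_gap)."""
    return softplus(beta * (target_margin - score_gap))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Standard DPO loss written with the sigmoid, for comparison."""
    z = beta * ((logp_w - logp_l) - (ref_logp_w - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-z)))

beta = 0.1
logp_w, logp_l = -12.0, -11.0          # policy log-probs (illustrative)
ref_logp_w, ref_logp_l = -14.0, -10.5  # reference log-probs (illustrative)

score_gap = logp_w - logp_l            # what the policy is asked to rank
target_margin = ref_logp_w - ref_logp_l  # negative: reference prefers y_l

ranking = soft_margin_ranking_loss(score_gap, target_margin, beta)
dpo = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
print(f"target margin: {target_margin}, ranking loss: {ranking:.6f}, DPO loss: {dpo:.6f}")
```

With a negative target margin, the loss is already small once the policy's gap exceeds that negative target, so nothing in the objective forces the gap to become positive.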

However, practical impact may be limited by several factors: (1) the margin correction in CPO requires precomputing 1/π_ref(y|x), which can be numerically unstable for low-probability responses; (2) the hyperparameter γ requires tuning (sensitivity analysis in Table 5 shows significant performance variation); (3) reference-free methods like SimPO already sidestep the identified issue entirely, and the paper's positioning relative to such methods is primarily theoretical rather than empirical.
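The numerical concern in point (1) is easy to reproduce in isolation. The sketch below only shows how 1/π_ref(y|x) behaves when computed from a summed token log-probability; it does not reproduce how the paper's margin term uses this quantity, and the log-probability values are illustrative.

```python
import math

def inverse_prob(seq_logprob):
    """Return 1 / pi_ref(y|x) given the summed token log-probability.

    Raises OverflowError when the result exceeds the float64 range.
    """
    return math.exp(-seq_logprob)

short_response = -40.0   # e.g. a few dozen tokens
long_response = -800.0   # e.g. several hundred tokens at about 2 nats each

inv_short = inverse_prob(short_response)  # large but finite
try:
    inv_long = inverse_prob(long_response)
    overflowed = False
except OverflowError:
    inv_long = None
    overflowed = True

print(f"1/pi for short response: {inv_short:.3e}")
print("long response overflowed float64:", overflowed)
```

float64 overflows once the exponent passes roughly 709, and float32 does so near 88.7, so long responses exceed the representable range well before they become unusual.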

Timeliness & Relevance

The paper addresses a timely question as DPO adoption accelerates in production systems. Understanding when and why DPO fails is crucial for deploying aligned language models safely. The concurrent work by Shi et al. (2025) and others examining DPO's theoretical properties indicates growing community interest in this direction.

Strengths

1. Clear identification of a precise failure condition (Equation 11) with constructive characterization of when it occurs.

2. The margin ranking perspective (Section 5) provides accessible geometric intuition that complements the formal theory.

3. The misaligned reference experiment (Appendix A.2) directly validates the theory: DPO degrades while CPO remains robust under increasing corruption ratios, and the fraction-in-U training dynamics (Figure 2) beautifully illustrate the theoretical mechanism.

4. The assumption dependency map (Table 1) provides unusual clarity about what each theoretical result requires.

5. CPO's implementation simplicity—identical computational cost to DPO with only precomputed margin subtraction.

Limitations

1. Scale limitations: Only 8B model tested; no evidence the findings hold at larger scales where reference policies may be better calibrated.

2. E-CPOC remains purely theoretical with no experimental validation despite being presented as the primary theoretical contribution.

3. The relationship to existing margin-based methods (IPO, SimPO) could be more carefully delineated; the paper notes distinctions but doesn't provide systematic ablations.

4. The γ* formula (Equation 27) requires knowledge of r*(y_w) - r*(y_l), which is unknown in practice, creating a gap between theory and implementation.

5. Missing training dynamics analysis acknowledged by authors—loss curves, preference accuracy evolution, and fraction-in-U tracking on the main benchmarks would strengthen the empirical story.

Overall Assessment

This paper makes a genuine theoretical contribution by precisely characterizing when DPO's RLHF equivalence breaks down. The theory is well-developed, the geometric interpretation is insightful, and the proposed fixes are principled. However, the experimental evidence is thin relative to the theoretical claims, E-CPOC lacks any empirical validation, and the practical improvements are modest. The work is primarily a theoretical contribution with preliminary experimental support.

Rating: 6.5 / 10

Significance 7 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Generated May 21, 2026

Comparison History (34)

vs. Towards a General Intelligence and Interface for Wearable Health Data

claude-opus-4.6 · 5/22/2026

Paper 1 presents a foundation model for wearable health pretrained on an unprecedented scale (1 trillion minutes, 5 million participants), demonstrating broad applicability across 35 health tasks with few-shot learning, LLM agent integration, and clinician-validated Personal Health Agent. Its novelty in scale, cross-domain health impact, and practical clinical relevance give it higher potential impact. Paper 2 makes important theoretical contributions to DPO/RLHF understanding, but addresses a narrower technical issue within the LLM alignment community. Paper 1's breadth across healthcare, AI, and wearable computing, combined with its massive scale and real-world deployment potential, suggests greater overall scientific impact.

vs. Towards a General Intelligence and Interface for Wearable Health Data

claude-opus-4.6 · 5/22/2026

Paper 2 presents a foundation model for wearable health pretrained on an unprecedented scale (1 trillion minutes, 5 million participants), addressing a fundamental challenge in digital health. Its breadth of impact spans 35 health prediction tasks across multiple domains, introduces novel LLM-agent integration for automated model search, and demonstrates clinical validation. The scale of data, generalizability across health domains, and practical applicability to personalized health monitoring give it transformative potential across healthcare, AI, and wearable technology. Paper 1, while theoretically rigorous in identifying DPO failure modes, addresses a narrower problem within the RLHF/alignment community.

vs. Generative structure search for efficient and diverse discovery of molecular and crystal structures

gemini-3.1 · 5/21/2026

While Paper 1 offers a strong advance in materials discovery, Paper 2 has a higher potential for immediate, widespread scientific impact. DPO is currently the dominant methodology for aligning Large Language Models. By mathematically proving the failure modes of DPO and introducing a state-of-the-art solution (CPO), Paper 2 addresses a critical bottleneck in AI alignment. Its fundamental theoretical contributions combined with practical improvements ensure broad adoption and high citation rates across the rapidly expanding and highly resourced field of artificial intelligence.

vs. MIMIC: A Generative Multimodal Foundation Model for Biomolecules

claude-opus-4.6 · 5/21/2026

MIMIC represents a significantly more ambitious and novel contribution—a unified generative multimodal foundation model for biomolecules spanning sequence, structure, regulation, and evolution. It introduces a new dataset (LORE), a novel architecture, and demonstrates broad applications from splicing prediction to RNA editing to protein design, with state-of-the-art results across multiple tasks. Its breadth of impact across computational biology, drug design, and genomics is substantial. Paper 2, while theoretically rigorous in characterizing DPO/RLHF equivalence conditions, addresses a more incremental issue in LLM alignment with narrower scope of impact.

vs. Machine Collective Intelligence for Explainable Scientific Discovery

gemini-3.1 · 5/21/2026

While Paper 1 offers crucial insights for LLM alignment, Paper 2 presents a broader paradigm shift for AI-driven scientific discovery. By enabling autonomous, explainable equation discovery across various scientific domains with drastically reduced extrapolation errors, Paper 2 has a much wider potential impact across all empirical sciences, addressing a fundamental limitation of deep learning in scientific applications.

vs. Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

claude-opus-4.6 · 5/21/2026

ReClaim represents a major advance in healthcare AI by training a foundation model on 43.8 billion medical events from 200M+ patients, demonstrating strong performance across 1,000+ prediction tasks, expenditure forecasting, and causal inference. Its scale, breadth of validation, and direct applicability to regulatory decision-making and real-world evidence generation give it enormous practical impact. Paper 2 makes a solid theoretical contribution clarifying DPO/RLHF equivalence conditions, but its scope is narrower and incremental relative to the rapidly evolving LLM alignment literature, where methods are frequently superseded.

vs. End-to-end autonomous scientific discovery on a real optical platform

gpt-5.2 · 5/21/2026

Paper 2 likely has higher scientific impact: it demonstrates end-to-end autonomous discovery in a real physical system, including experimental validation of a previously unreported optical mechanism with potential hardware implications (optical pairwise computation analogous to attention). This is highly novel, timely for autonomous agents, and broad-impact across AI, optics, and scientific automation, with clear real-world application potential. Paper 1 is rigorous and valuable for alignment theory and practice, but its impact is more specialized to LLM training methodology and less cross-disciplinary than an experimentally grounded autonomous discovery milestone.

vs. A Collective Variational Principle Unifying Bayesian Inference, Game Theory, and Thermodynamics

gpt-5.2 · 5/21/2026

Paper 1 is likely to have higher near-term scientific impact: it addresses a timely, widely used alignment method (DPO) in LLM training, identifies concrete failure modes with clear assumptions, and proposes an actionable fix (CPO) with proofs plus benchmarked SOTA results and code—supporting methodological rigor and immediate adoption. Paper 2 is broader and conceptually ambitious, potentially high long-term impact, but such unification frameworks often face higher validation/acceptance barriers and less direct, reproducible engineering uptake compared to a practical alignment improvement in mainstream ML.

vs. AI scientists produce results without reasoning scientifically

gpt-5.2 · 5/21/2026

Paper 2 is likely higher impact: it delivers a clear theoretical contribution (conditional—not universal—equivalence of DPO and RLHF), identifies concrete failure modes with formal characterization, and proposes a new method (CPO) with provable alignment and benchmarked SOTA results—high methodological rigor and immediate applicability to widely used LLM alignment pipelines. Paper 1 is timely and broadly relevant as an empirical/behavioral critique of “AI scientist” agents, but it is primarily diagnostic and may be less directly actionable than a new alignment objective with proofs and code.

vs. Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

claude-opus-4.6 · 5/21/2026

ReClaim represents a major advance in healthcare AI by building the first large-scale foundation model on administrative claims data (43.8B events, 200M+ patients). Its demonstrated improvements across 1,000+ prediction tasks, expenditure forecasting, and bias reduction in trial emulations address critical real-world needs in healthcare and regulatory decision-making. The breadth of applications (disease surveillance, cost forecasting, RWE generation) and rigorous external validation suggest transformative potential. Paper 2, while theoretically insightful regarding DPO/RLHF equivalence, addresses a more incremental concern within AI alignment methodology with narrower practical impact.

vs. Generative structure search for efficient and diverse discovery of molecular and crystal structures

gemini-3.1 · 5/21/2026

Paper 2 addresses a critical theoretical flaw in DPO, a dominant algorithm for LLM alignment. Given the ubiquitous use of LLMs, providing provable alignment guarantees and fixing fundamental failure modes offers immense and immediate impact across the AI community. While Paper 1 is highly significant for materials science, Paper 2's foundational relevance to core AI models gives it a wider and more immediate scientific footprint.

vs. MIMIC: A Generative Multimodal Foundation Model for Biomolecules

claude-opus-4.6 · 5/21/2026

MIMIC represents a more transformative contribution by introducing a unified generative multimodal foundation model for biomolecules that bridges sequence, structure, regulation, evolution, and cellular context. Its breadth of impact spans genomics, transcriptomics, proteomics, and drug/biomolecular design, with demonstrated applications in clinically relevant mutation correction and protein design. While Paper 1 makes a solid theoretical contribution clarifying DPO/RLHF equivalence conditions, it addresses a narrower problem within AI alignment methodology. MIMIC's novel dataset (LORE), architecture, and cross-modal generative capabilities open fundamentally new research directions in computational biology.

vs. A Collective Variational Principle Unifying Bayesian Inference, Game Theory, and Thermodynamics

gpt-5.2 · 5/21/2026

Paper 2 has higher likely impact due to timeliness and direct relevance to widely used LLM alignment methods (DPO/RLHF), clear practical failure modes, and a concrete, testable fix (CPO) with benchmarked results and released code—supporting adoption and downstream work. Its claims are tightly scoped and falsifiable, with strong methodological rigor (conditional equivalence proofs + empirical validation) and immediate real-world applications across industry. Paper 1 is ambitious and cross-disciplinary, but broader unification frameworks typically face higher validation and adoption barriers, making near-term impact less certain.

vs. AI scientists produce results without reasoning scientifically

gpt-5.2 · 5/21/2026

Paper 2 likely has higher scientific impact: it tackles a timely, cross-cutting concern—whether autonomous LLM “scientists” meet epistemic norms—using large-scale empirical evaluation (25k+ runs) across eight domains and introduces behavioral/epistemic diagnostics that could reshape how agentic systems are evaluated and trained. Its conclusions affect AI safety, ML, scientific automation, and research policy. Paper 1 is rigorous and valuable for alignment theory/practice (DPO vs RLHF, CPO), but its scope is narrower (preference optimization methods) and primarily impacts a subarea of alignment/finetuning rather than broad scientific practice and evaluation.

vs. Machine Collective Intelligence for Explainable Scientific Discovery

gpt-5.2 · 5/21/2026

Paper 2 likely has higher scientific impact due to broader cross-disciplinary reach and real-world applicability: an autonomous, explainable equation-discovery paradigm can affect many sciences (physics, chemistry, biology, climate, engineering) and addresses a timely bottleneck—interpretability and extrapolation in AI for science. If results generalize, the reported large extrapolation gains and compact symbolic models are highly consequential. Paper 1 is novel and rigorous for alignment theory and could influence LLM training practice, but its impact is more concentrated within RLHF/DPO methodology and safety research rather than across scientific domains.

vs. End-to-end autonomous scientific discovery on a real optical platform

claude-opus-4.6 · 5/21/2026

Paper 1 demonstrates a fundamentally new capability: end-to-end autonomous scientific discovery by an AI system on a real physical platform, culminating in the identification and experimental validation of a previously unreported physical mechanism. This represents a paradigm shift in how science can be conducted, with broad implications across all experimental sciences. While Paper 2 makes valuable theoretical contributions clarifying DPO/RLHF equivalence conditions and proposes CPO, it is an incremental advance within the well-studied alignment optimization space. Paper 1's novelty, breadth of impact, and transformative potential far exceed Paper 2's contributions.

vs. DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

gemini-3.1 · 5/21/2026

Paper 1 challenges a fundamental theoretical assumption in DPO, a cornerstone of modern LLM alignment, and provides a provably correct alternative (CPO). Its rigorous theoretical analysis of failure modes in existing alignment techniques offers profound, long-lasting implications for AI safety and training. While Paper 2 provides a valuable benchmark for evaluating research agents, Paper 1 introduces foundational methodological innovations that directly impact how frontier models are optimized, granting it higher potential scientific impact.

vs. Simulating clinical interventions with a generative multimodal model of human physiology

claude-opus-4.6 · 5/21/2026

HealthFormer represents a paradigm-shifting contribution to precision medicine by creating a generative 'health world model' trained on deeply phenotyped longitudinal data from 15,000+ individuals across 667 measurements and seven physiological domains. Its ability to simulate clinical interventions in silico, validated against published RCTs (41/41 correct direction of effect), and transfer to four independent cohorts for disease/mortality prediction without task-specific training, establishes a new framework for clinical digital twins. This has transformative potential across healthcare. Paper 2, while theoretically rigorous in characterizing DPO/RLHF equivalence conditions, addresses a more incremental technical issue within LLM alignment.

vs. Simulating clinical interventions with a generative multimodal model of human physiology

claude-opus-4.6 · 5/21/2026

HealthFormer represents a paradigm-shifting contribution to precision medicine by creating a generative 'health world model' that integrates seven physiological domains, transfers across independent cohorts, outperforms established clinical risk scores, and simulates clinical interventions with remarkable accuracy. Its breadth of impact spans medicine, clinical trials, digital twins, and personalized healthcare. While Paper 1 makes solid theoretical contributions to preference optimization in LLM alignment, Paper 2 addresses a far more consequential real-world problem with broader interdisciplinary impact and transformative potential for how clinical interventions are designed and personalized.

vs. DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

gemini-3.1 · 5/21/2026

Paper 2 challenges the theoretical foundations of DPO, a cornerstone algorithm in modern LLM alignment, by proving its equivalence to RLHF is conditional. Introducing a theoretically sound and empirically superior alternative (CPO) offers profound methodological advancements. Paper 1 introduces a useful benchmark, but Paper 2's fundamental insights into alignment algorithms have a broader and longer-lasting impact across the AI field.