FIDES: Faithful Inference via Deep Evidence Signals for Retrieval-Memory Conflict in RAG

Zhe Yu, Wenpeng Xing, Tiancheng Zhao, Mohan Li, Changting Lin, Meng Han

Jun 4, 2026

arXiv:2606.05644v1

cs.AI (primary)

#1565 of 3404 · Artificial Intelligence

Tournament Score: 1412±46 (scale 1050–1800)

Win Rate: 56%

Rating: 7/10

Significance 7 · Rigor 7.5 · Novelty 6.5 · Clarity 7.5

Abstract

When retrieved evidence contradicts parametric memory, language models frequently ignore context and default to memorized priors -- a failure that undermines the core purpose of retrieval augmentation. Contrastive decoding amplifies the context-conditioned output to suppress parametric bias, but existing methods rest on an implicit assumption that this bias is uniform across tokens. A single global contrastive weight over-penalizes safe tokens while leaving genuinely conflicted ones insufficiently corrected. We identify token-level conflict concentration: retrieval-memory tension is sharply heterogeneous, concentrated on a small fraction of answer-critical decoding steps. This reframes contrastive decoding from how much contrast to apply to where to apply it. We propose FIDES (Faithful Inference via Deep Evidence Signals), a training-free decoder that reads three internal signals probing retrieval-memory conflict at complementary depths -- output surface, hidden representations, and prediction trajectory -- and fuses them to govern intervention strength at each decoding step. Across three benchmarks and six backbones -- four primary 7B/8B models and two scaling backbones up to 70B -- FIDES achieves the best context fidelity in all 18 settings, outperforming the strongest training-free baseline by +3 to +13 points. On the 70B scale, fidelity reaches 92-94% while F1 surges to 62-63%, demonstrating that token-level selectivity unlocks generation capability that coarse contrastive rules suppress.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: FIDES

1. Core Contribution

FIDES addresses a well-recognized problem in RAG: when retrieved evidence contradicts a model's parametric memory, LLMs often ignore the context and hallucinate from memory. The paper's key conceptual insight is token-level conflict concentration — the observation that retrieval-memory tension is not uniformly distributed across generated tokens but concentrated on a small fraction of answer-critical decoding steps. This reframes contrastive decoding from a "how much contrast" question to a "where to apply contrast" question.

The method fuses three internal signals at complementary depths: (1) Opposition (JSD between context/no-context output distributions), (2) Shift (hidden-state trajectory divergence across layers), and (3) Noise (internal prediction instability via Logit Lens midpoint-to-final KL). These are combined via fixed inverse-scale calibrated weights to produce a per-token contrastive coefficient α_t. The approach is training-free and requires no per-setting tuning.
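As a rough illustration of how a per-token coefficient could gate contrastive decoding, here is a minimal Python sketch. It is not the authors' implementation: the function names, the clipping of each signal to [0, 1], and the way the constants 0.5/0.3/0.2, λ=1.5 and α_min=0.1 are combined are all assumptions made for illustration. Only the Jensen-Shannon divergence and the CAD-style logit adjustment follow their standard forms, and the Shift and Noise signals are passed in as precomputed numbers.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q, eps=1e-12):
    # KL(p || q) in nats, with clipping to avoid log(0).
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    # Jensen-Shannon divergence in nats; bounded above by ln 2.
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def alpha_t(opposition, shift, noise,
            weights=(0.5, 0.3, 0.2), lam=1.5, alpha_min=0.1):
    # ASSUMPTION: hypothetical fusion rule. Each signal is clipped to [0, 1],
    # combined as a weighted sum, scaled by lam, and floored at alpha_min.
    signals = np.clip([opposition, shift, noise], 0.0, 1.0)
    score = float(np.dot(weights, signals))
    return max(alpha_min, lam * score)

def contrastive_logits(logits_ctx, logits_noctx, alpha):
    # CAD-style adjustment: amplify the context-conditioned logits and
    # subtract the context-free ones, scaled by the per-token alpha.
    logits_ctx = np.asarray(logits_ctx, dtype=float)
    logits_noctx = np.asarray(logits_noctx, dtype=float)
    return (1.0 + alpha) * logits_ctx - alpha * logits_noctx

# Toy decoding step: the two passes disagree on the top token.
ctx = np.array([2.0, 0.5, 0.1])
noctx = np.array([0.2, 2.5, 0.1])
opposition = jsd(softmax(ctx), softmax(noctx)) / np.log(2)  # rescale to [0, 1]
a = alpha_t(opposition, shift=0.4, noise=0.2)  # shift/noise values are made up
adjusted = contrastive_logits(ctx, noctx, a)
```

In this toy step the context-conditioned pass already prefers token 0 and the context-free pass prefers token 1, so the adjustment widens the margin in favour of token 0. A token where both passes agree would yield a small Opposition value and, under this assumed rule, a coefficient close to the floor.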

2. Methodological Rigor

Strengths in evaluation design:

The paper evaluates across 3 benchmarks × 6 backbones = 18 settings, providing comprehensive coverage.

Counterfactual evaluation is well-justified as it isolates decoder faithfulness from retrieval quality.

Statistical significance testing is thorough: paired bootstrap tests with Bonferroni correction across all 48 pairwise comparisons (all p < 0.05; 44/48 at p < 0.01).

The paper includes non-conflict control experiments, noisy retrieval robustness tests, and conflict severity stratification — these go beyond standard benchmarking.
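For readers unfamiliar with the significance procedure mentioned above, the following Python sketch shows one common form of a paired bootstrap test together with a Bonferroni-adjusted threshold. It is an assumption-laden illustration: the function names, the one-sided formulation, the resample count, and the toy scores are hypothetical and are not taken from the paper.

```python
import numpy as np

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    # One-sided paired bootstrap: resample per-example score differences with
    # replacement and report how often system A fails to beat system B.
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    resampled_means = diffs[idx].mean(axis=1)
    return float((resampled_means <= 0.0).mean())

def bonferroni_threshold(alpha, n_comparisons):
    # Per-comparison threshold that keeps the family-wise error rate at alpha.
    return alpha / n_comparisons

# Toy per-example correctness (1 = faithful answer, 0 = not); made-up data.
system_a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1] * 20
system_b = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1] * 20
p = paired_bootstrap_pvalue(system_a, system_b)
threshold = bonferroni_threshold(0.05, 48)
significant = p < threshold
```

With 48 comparisons and a family-wise level of 0.05, the per-comparison threshold is about 0.00104, which is stricter than the 0.01 level quoted in the review.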

Concerns about rigor:

The three signals and their fusion mechanism, while intuitive, involve several fixed constants (1/10, 1/5, 0.5/0.3/0.2, λ=1.5, α_min=0.1). Although the authors argue these are derived from label-free calibration and show robustness sweeps, the number of such constants is notable. The claim of "no tuning" is somewhat softened by the fact that a 2,000-sample calibration pool per benchmark is used.

The counterfactual datasets are constructed using GPT-4 rewriting, which introduces a dependency on another model's capabilities. The paper argues that the construction is method-agnostic, but the dependency remains a caveat.

The AUROC of 0.923 for answer-token discrimination is impressive but evaluated only on one model-dataset pair (LLaMA3-8B/NQ-Swap). Broader verification would strengthen the token-level concentration claim.

Some results appear remarkably clean — e.g., Table 9 shows identical CF/EM for ratios 0.3 through 0.6, which seems unusually stable and may reflect the small evaluation subset (n=400).
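To make the AUROC figure concrete, here is a small Python sketch of how answer-token discrimination could be scored: each token gets a conflict score, each token is labelled as answer-critical or not, and AUROC is the probability that a randomly chosen answer token outscores a randomly chosen non-answer token. The function name and the toy scores are hypothetical; the paper's actual evaluation protocol is not described in this review.

```python
def auroc(scores, labels):
    # Pairwise (Mann-Whitney) estimate of AUROC: the fraction of
    # positive/negative pairs ranked correctly, counting ties as half.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up per-token conflict scores; label 1 marks an answer-critical token.
token_scores = [0.9, 0.4, 0.35, 0.1]
token_labels = [1, 0, 1, 0]
value = auroc(token_scores, token_labels)  # 3 of 4 pairs ranked correctly
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so a reported 0.923 would mean that roughly 92% of answer/non-answer token pairs are ordered correctly by the signal.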

3. Potential Impact

Practical relevance: RAG faithfulness under knowledge conflict is a critical deployment concern. A training-free method that works across multiple model families (LLaMA, Mistral, Qwen) at varying scales (7B–70B) has immediate practical applicability. The +8–11% overhead over CAD is modest.

Conceptual contribution: The token-level conflict concentration insight is the paper's most transferable contribution. This framing could influence how other researchers think about intervention granularity in decoding-time methods beyond RAG — e.g., in factuality enhancement, safety filtering, or style control.

Limitations on impact: The method is specifically designed for single-document English QA with entity-level conflicts. Multi-document, cross-lingual, and multimodal RAG remain untested. The honest scope statement — FIDES faithfully follows wrong evidence if retrieval errs — is important but limits end-to-end deployment without additional verification layers.

4. Timeliness & Relevance

The paper addresses a timely bottleneck. As RAG becomes standard in production LLM systems, the reliability of context-following behavior is a first-order concern. The training-free constraint is particularly relevant given that many deployments use frozen, API-served models. The scaling results to 70B are valuable as the field moves toward larger models.

The competitive landscape is active (CAD, AdaCAD, DeCoRe, DVD, COIECD, CoCoA, CLEAR), and the paper positions itself well within this space. The consistent gains across all 18 settings over the strongest training-free baseline are convincing.

5. Strengths & Limitations

Key strengths:

The token-level conflict concentration insight is well-motivated and empirically validated (3.3× weight gap, AUROC 0.923, monotonic severity scaling).

Comprehensive evaluation with strong statistical testing across diverse settings.

The multi-depth signal design is principled — each signal captures a distinct aspect of conflict.

Transparent about scope limitations and honest about failure modes (faithfully following wrong evidence).

Extensive appendices covering reproducibility, ablations, and robustness.

Notable weaknesses:

The evaluation is restricted to counterfactual entity-swap QA — a relatively narrow, albeit well-motivated, evaluation paradigm. Real-world conflicts are often more nuanced.

While individual signal contributions are ablated, the paper doesn't deeply analyze failure cases where FIDES still gets things wrong.

The paper claims training-free, but the calibration pool requirement (2,000 examples per benchmark) blurs this boundary slightly.

The paper is dense with results but the theoretical grounding for why these three specific signals should be sufficient (or optimal) is somewhat thin — it's primarily empirical.

Some of the reported gains are very large (e.g., +12.8 on PopQA for LLaMA2-7B over AdaCAD), which warrants careful scrutiny of whether the counterfactual construction might inadvertently favor the signal types FIDES detects.

Summary

FIDES makes a solid contribution to the active area of RAG faithfulness through a well-motivated conceptual insight (token-level conflict concentration) and a practical, training-free implementation. The comprehensive evaluation across 18 settings with consistent improvements is the paper's strongest selling point. The main limitations are the narrow evaluation domain (entity-swap QA) and the moderate theoretical depth. This is a well-executed engineering contribution with a useful conceptual framing, likely to influence follow-up work on adaptive decoding strategies.

Rating: 7/10

Significance 7 · Rigor 7.5 · Novelty 6.5 · Clarity 7.5

Generated Jun 5, 2026

Comparison History (16)

vs. Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

claude-opus-4.6 · 6/8/2026

FIDES addresses a fundamental and widely-recognized problem in RAG systems (retrieval-memory conflict) with a training-free approach that works across multiple model scales and architectures. Its key insight about token-level conflict concentration is novel and reframes contrastive decoding in a principled way. The training-free nature makes it immediately applicable, and RAG is a broadly adopted paradigm. Paper 2, while solid, addresses a more niche intersection (RLVR for multimodal reasoning) with a more complex framework. FIDES's broader applicability, stronger empirical gains across 18 settings, and foundational insight give it higher impact potential.

vs. Stumbling Into AI Emotional Dependence: How Routine AI Interactions Reshape Human Connection

claude-opus-4.6 · 6/6/2026

Paper 1 addresses a timely, broadly impactful issue at the intersection of AI, psychology, and policy. Its findings—that routine AI interactions incidentally reshape human emotional support preferences—have profound implications for regulation, mental health, and society. The large-scale longitudinal study with OpenAI provides compelling empirical evidence. Its cross-disciplinary relevance (psychology, HCI, policy, ethics) and timeliness amid rapid AI adoption give it exceptionally broad impact potential. Paper 2, while technically rigorous and valuable for the NLP community, addresses a narrower technical problem (RAG faithfulness) with more limited audience and societal implications.

vs. Vision Language Models Cannot Reason About Physical Transformation

gemini-3.1 · 6/6/2026

Paper 2 identifies a fundamental and systemic limitation in current Vision Language Models regarding physical reasoning, evaluating over 100 models. By introducing a comprehensive benchmark that exposes a deep-seated flaw (reliance on textual priors over actual visual physics), it is highly likely to steer broad future research directions and inspire new model architectures. While Paper 1 offers a valuable, practical solution to a specific RAG issue, Paper 2's exposure of a core capability deficit in foundational models promises a broader and more transformative impact across the AI community.

vs. Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints

gemini-3.1 · 6/6/2026

Paper 1 addresses a fundamental and widespread problem in Large Language Models (retrieval-memory conflict in RAG) with a novel, training-free approach applicable across various LLM backbones. Its broad applicability in the rapidly growing generative AI field promises wider impact than Paper 2, which, while methodologically sound and highly relevant to healthcare, focuses on a domain-specific application (EHR data), limiting its breadth of impact across diverse fields.

vs. MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

claude-opus-4.6 · 6/6/2026

FIDES addresses a fundamental and broadly relevant problem in RAG-based LLMs—retrieval-memory conflict—with a novel, training-free approach offering strong theoretical insight (token-level conflict concentration) and rigorous evaluation across multiple scales and benchmarks. Its breadth of applicability across all LLM-based RAG systems gives it wider cross-field impact. While MapAgent demonstrates impressive industrial deployment, it is more narrowly focused on autonomous driving map generation and represents more of an engineering integration than a fundamental methodological advance.

vs. Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution

gpt-5.2 · 6/5/2026

Paper 1 introduces a broadly applicable, training-free decoding method (FIDES) that targets a central, timely failure mode in RAG—retrieval vs. parametric-memory conflict—using token-level adaptive intervention from multiple internal signals, and demonstrates consistent gains across many benchmarks and model scales up to 70B. This combination of methodological novelty, immediate deployability, and wide relevance to LLM reliability makes its likely impact higher. Paper 2 offers an important diagnostic finding about convergence in LLM-driven program evolution, but it is more domain-specific and primarily descriptive, with less direct, general-purpose intervention.

vs. LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

gpt-5.2 · 6/5/2026

Paper 2 (FIDES) likely has higher scientific impact: it introduces a novel, broadly applicable, training-free decoding method addressing a central failure mode in RAG (retrieval–parametric memory conflict) with a clear conceptual reframing (token-level conflict concentration) and strong cross-model, multi-benchmark gains. This has immediate real-world applicability for improving faithfulness in deployed LLM systems and can influence decoding/control research broadly. Paper 1 is a useful benchmark highlighting limitations, but benchmarks alone typically have narrower downstream impact unless they drive new methods; its application scope is more evaluative than solution-oriented.

vs. TRACE: A Temporal Conditional Estimation for Multimodal Time Series Foundation Models

claude-opus-4.6 · 6/5/2026

TRACE addresses a fundamental challenge in multimodal time series foundation models—temporal misalignment and modality missingness—which is pervasive across healthcare, affective computing, and many other domains. Its contribution to the rapidly growing field of foundation models for time series, combined with its broad applicability across modalities and domains, gives it wider potential impact. FIDES, while technically strong and addressing an important RAG faithfulness problem, targets a more specific issue (retrieval-memory conflict in LLM decoding) with a training-free inference-time fix that may be superseded as models improve. TRACE's paradigm for conditional estimation under missingness has more foundational, cross-disciplinary relevance.

vs. Retry Policy Gradients in Continuous Action Spaces

claude-opus-4.6 · 6/5/2026

FIDES addresses a fundamental and widely recognized problem in RAG systems—retrieval-memory conflict—with a novel insight (token-level conflict concentration) that reframes contrastive decoding. It demonstrates strong empirical results across multiple benchmarks and model scales, is training-free (enabling broad adoption), and is highly timely given the explosion of RAG applications. Paper 1 extends ReMax to continuous action spaces with solid theoretical analysis but offers more incremental contributions (comparable to SAC performance) in a narrower RL subfield. Paper 2's broader applicability to LLM deployment gives it higher impact potential.

vs. Synapse: Federated Tool Routing via Typed Compendium Artifacts

gpt-5.2 · 6/5/2026

Paper 2 has higher potential impact due to a more novel framing (typed federated artifacts as the unit of collaboration) that enables new guarantees and operations (schema-aware merging, per-field DP, cross-architecture transfer) in an important, timely setting: federation across heterogeneous, frozen LLMs without sharing data/weights. This could influence privacy-preserving ML systems, federated learning theory/practice, and tool-using LLM infrastructure broadly. Paper 1 is strong and practical for RAG fidelity, but is more incremental within decoding/control methods and narrower in cross-field reach.

vs. Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

gemini-3.1 · 6/5/2026

Paper 2 addresses a critical and widespread issue in modern LLMs (retrieval-memory conflict in RAG). Its training-free decoding approach has immediate, broad applicability across NLP systems. Furthermore, its methodological rigor is stronger, testing across 18 settings and scaling up to 70B models, giving it a broader and more immediate real-world impact compared to the niche focus on visual spatial planning in Paper 1.

vs. Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

gpt-5.2 · 6/5/2026

Paper 2 has higher likely impact: it is the first systematic, tool-validated evaluation of NL-to-TLA+ synthesis across 30 LLMs with a released dataset/framework, enabling reproducible benchmarking and follow-on research in formal methods, program synthesis, and LLM reliability. Its negative results and identified hallucination modes are broadly actionable for both academia and industry verification workflows. Paper 1 is a strong, practical decoding innovation for RAG faithfulness, but it is more incremental within an active subarea and is less likely to reshape evaluation standards or cross-field practice than a foundational benchmark study in formal specification generation.

vs. Agents' Last Exam

gpt-5.2 · 6/5/2026

Paper 2 (ALE) likely has higher impact: it introduces a large, industry-validated, long-horizon benchmark with verifiable outcomes spanning 1K+ tasks across 13 clusters, addressing a central bottleneck in AI-to-economy translation. Its breadth enables cross-field influence (agents, evaluation, economics, HCI, software engineering) and timeliness given rapid agent deployment. Methodologically, expert collaboration and a living, continuously expanding task pool increase relevance and longevity. Paper 1 is a strong, novel decoding method for RAG faithfulness, but its impact is narrower and likely incremental relative to broader evaluation infrastructure.

vs. Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking

claude-opus-4.6 · 6/5/2026

FIDES addresses a widely recognized and practical problem in RAG systems—retrieval-memory conflicts—with a principled, training-free solution that demonstrates strong empirical results across 18 settings and scales to 70B models. The insight about token-level conflict concentration is novel and actionable, with broad applicability to the rapidly growing RAG ecosystem. Paper 1 introduces an interesting diagnostic probe but addresses a narrower problem (detecting implicit reward hacking) with limited scale (single 3B model, one dataset), making it more of a proof-of-concept with less immediate breadth of impact.

vs. Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack

gemini-3.1 · 6/5/2026

Paper 1 exposes a fundamental paradox in LLM alignment, demonstrating that enhanced safety awareness inherently introduces new vulnerabilities. This theoretical and empirical breakthrough challenges core assumptions in AI safety, offering broad implications that could force a critical rethinking of defense mechanisms across the field. In contrast, Paper 2 provides a highly effective but more specialized algorithmic improvement for RAG systems.

vs. Assessing the Carbon Emissions and Energy Consumption of U.S. Hyperscale Data Centers

gpt-5.2 · 6/5/2026

Paper 2 is more novel methodologically, proposing a new training-free, token-selective decoding approach (FIDES) using internal model signals to resolve retrieval–memory conflict in RAG, with broad applicability to many LLM systems and tasks. It reports extensive benchmarking across multiple datasets and model scales up to 70B, suggesting strong rigor and immediate relevance to a fast-moving area with wide cross-field impact (NLP, IR, trustworthy AI). Paper 1 is timely and societally important but is primarily an attributional measurement study with narrower methodological innovation and more limited transferability.