Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

Aryan Khurana, Aravind Ramana RN, Dhruv Kumar

Jun 11, 2026 · arXiv:2606.13104v1

cs.LG

#1524 of 5669 · cs.LG

Tournament Score: 1451 ± 48

Win Rate: 71%

Rating: 6.8 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Abstract

Large language models are increasingly deployed in citation-augmented settings, yet the effect of citation presence on model behavior independent of factual content remains poorly understood. We introduce AuthorityBench, a 220,564-prompt multi-domain benchmark that isolates how citation-based authority signals influence epistemic behavior in LLMs. The benchmark uses a fully balanced 2x2 factorial design crossing claim veracity with citation veracity, the first to do so, across four domains (general knowledge, science, law, and medicine), with controlled variation over 40 prompt templates, four venue prestige tiers, and a country-coded author name dataset. Evaluating seven models on 12 structured research questions, we find that citation presence, whether real or fabricated, consistently increases hallucination rates relative to a no-citation baseline. The effect is strongest when fabricated citations accompany true claims, raising hallucination rates by 3 to 22 percentage points and reaching 35 to 77% in the general knowledge domain, while legal claims are comparatively robust and venue prestige and author demographics show negligible impact. All datasets and evaluation code are available at: https://github.com/floating-reeds/AuthorityBench

AI Impact Assessments

(1 model)

Scientific Impact Assessment: AuthorityBench

1. Core Contribution

AuthorityBench introduces a 220,564-prompt benchmark with a fully balanced 2×2 factorial design crossing claim veracity (true/false) with citation veracity (real/fabricated) across four domains. The key novelty is the true claim × fabricated citation condition, which tests whether models can be induced to deny correct facts when presented with authoritative-looking but fabricated citations. This condition is absent from prior work, most notably FalseCite (Mao et al., 2025), which only examined false claims with fabricated citations. The benchmark additionally controls for venue prestige (4 tiers), author demographics (country-coded surnames), citation placement (40 templates), and temporal framing.
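The factorial structure described above can be made concrete with a short sketch. This is an illustration only, not the benchmark's released code: the names `factorial_cells`, `CLAIM_VERACITY`, `CITATION_VERACITY`, and `DOMAINS` are hypothetical, and the no-citation baseline and the template, venue, and author factors are left out for brevity.

```python
from itertools import product

# Hypothetical labels for the two crossed factors and the four domains
# named in the abstract. These are not identifiers from the released code.
CLAIM_VERACITY = ("true", "false")
CITATION_VERACITY = ("real", "fabricated")
DOMAINS = ("general knowledge", "science", "law", "medicine")


def factorial_cells():
    """Enumerate every (domain, claim, citation) cell of the 2x2 design."""
    return [
        {"domain": d, "claim": c, "citation": s}
        for d, c, s in product(DOMAINS, CLAIM_VERACITY, CITATION_VERACITY)
    ]


cells = factorial_cells()

# The condition the review singles out as new relative to prior work:
# a true claim paired with a fabricated citation.
novel = [
    c for c in cells
    if c["claim"] == "true" and c["citation"] == "fabricated"
]

for cell in novel:
    print(cell["domain"], "|", cell["claim"], "claim |", cell["citation"], "citation")
```

Crossing both factors in every domain is what allows a citation effect to be separated from a claim-content effect: each claim type is observed under both citation types.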

The central finding — that citation presence increases hallucination across all seven models tested, most severely when fabricated citations accompany true claims — is both surprising and consequential. This reframes the problem from "citations amplify false claims" to "citations degrade epistemic reasoning generally," which has direct implications for RAG system design.

2. Methodological Rigor

Strengths: The 2×2 factorial design is well-motivated and cleanly executed, allowing causal attribution of effects to citation presence rather than content. The use of Cohen's d over p-values given the large sample sizes reflects good statistical practice. The 40 prompt templates mitigate concerns about template-specific artifacts. Human validation of the judge model (Cohen's κ = 0.83, 90.7% inter-annotator agreement on 1,500 samples) provides reasonable confidence in automated evaluation.
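For readers less familiar with the two statistics named above, here is a minimal sketch of their textbook definitions: Cohen's d with a pooled standard deviation, and Cohen's κ for two raters. It is not the paper's evaluation code; the function names and the toy inputs are assumptions made for illustration.

```python
import math
from collections import Counter


def cohens_d(a, b):
    """Standardized mean difference between two samples (pooled SD)."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    # Unbiased sample variances.
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (mean_a - mean_b) / pooled_sd


def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters on the same items."""
    n = len(rater1)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(x == y for x, y in zip(rater1, rater2)) / n
    # Expected agreement if both raters labeled independently
    # according to their own label frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data: 1 = hallucination, 0 = no hallucination (made-up values).
with_citation = [1, 1, 1, 0, 1, 0, 1, 1]
no_citation = [0, 1, 0, 0, 1, 0, 0, 1]
print("Cohen's d:", round(cohens_d(with_citation, no_citation), 3))

judge = [1, 1, 0, 0, 1, 0]
human = [1, 1, 0, 1, 1, 0]
print("Cohen's kappa:", round(cohens_kappa(judge, human), 3))
```

Unlike a p-value, d does not grow with the number of prompts, which is why it is the more informative summary at this scale. Likewise, κ discounts the agreement two raters would reach by chance, so it is reported alongside the raw agreement rate.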

Weaknesses: Several methodological concerns deserve scrutiny:

Judge model limitations: Using Qwen3-8B as judge without retrieval access means it cannot verify citation existence. While ground truth labels are supplied, the judge's own susceptibility to citation authority signals is not examined — a potential confound given the paper's thesis.

FEVER true citation caveat: The general knowledge domain's "true citations" use back-filled metadata from other domains, so this condition does not actually test real citations. Yet the general knowledge domain produces the most dramatic headline numbers (35–77%). The authors acknowledge the limitation but still foreground these results, which risks misleading readers.

Subset evaluation: Four of seven models are evaluated on only 15K of 220K prompts. While the authors validate subset-full concordance for two models, the subset-evaluated models include the most capable ones (Claude, GPT 5.4 mini, Gemma 4 31B), limiting confidence in cross-model comparisons.

Claim construction: For MCQ datasets (SciQ, CaseHOLD, MedMCQA), false claims are constructed by pairing questions with distractors using a fixed template. This artificial construction may not reflect natural false claims and could interact with citation effects in ways not captured by the design.

3. Potential Impact

The practical implications are significant for the RAG ecosystem. The finding that citation presence, even from real sources, degrades model accuracy on claims that models would otherwise answer correctly is a direct warning for systems that naively augment prompts with retrieved citations. This challenges the implicit assumption in retrieval-augmented generation that grounding in sources improves reliability.

The benchmark could serve as a testbed for developing citation-robust models, similar to how adversarial benchmarks drive robustness research. The open release of all data and code enhances this potential.

The discovery of a suppression/amplification split across model families on false claims (Llama/Claude/GPT suppress hallucination; Gemma/Phi amplify) is an intriguing finding that points toward architectural or training-regime differences in how authority signals are processed, though no mechanistic explanation is offered.

4. Timeliness & Relevance

This work is highly timely. RAG systems are being deployed at scale in high-stakes domains (legal, medical, scientific), and understanding how citation signals influence model behavior is a practical necessity. The paper addresses a gap between hallucination benchmarks (which ignore in-context authority) and citation faithfulness work (which examines whether models cite correctly, not how citations influence reasoning).

5. Strengths & Limitations

Key Strengths:

First benchmark to include the true claim × fabricated citation condition, revealing a previously uncharacterized failure mode

Scale (220K prompts) and breadth (4 domains, 7 models, 40 templates) substantially exceed prior work

Clean null results on venue prestige and author demographics are informative — they narrow the space of relevant variables

The suppression/amplification split on false claims is a genuinely novel observation that invites mechanistic follow-up

Well-structured research questions provide a reusable evaluation framework

Notable Weaknesses:

The FEVER true-citation limitation undermines the strongest headline claims, as the domain producing the most dramatic results (general knowledge, 35–77%) has compromised citation validity

No mechanistic analysis — the paper identifies patterns but cannot explain why citation signals override factual knowledge

The "hallucination" framing conflates multiple phenomena: a model agreeing with a fabricated citation on a true claim might be performing citation-faithful reasoning (trusting the provided source) rather than "hallucinating" in the conventional sense

The benchmark assumes binary veracity, which may not generalize to nuanced or partially true claims common in real-world RAG settings

Template diversity, while improved over FalseCite, still excludes web-native formats (hyperlinks, informal citations) that dominate real deployment contexts

Additional Observations:

The finding that model size/capability doesn't predict robustness is important but based on a limited model sample; frontier models (GPT-4o, Claude Opus, Gemini Ultra) are absent

The paper would benefit from analyzing the interaction between a model's baseline calibration and its citation susceptibility — the inverted baseline models (Claude, Phi-4, Gemma 4) hint at an important confound

The conceptual distinction between "epistemic susceptibility" and "context faithfulness" deserves deeper treatment, as RAG systems are explicitly designed to trust retrieved content

Overall Assessment

AuthorityBench makes a solid empirical contribution by systematically characterizing how citation signals influence LLM behavior across a well-designed factorial experiment. The true claim × fabricated citation condition is a genuine innovation that reveals a practically important failure mode. However, the work is primarily descriptive, the headline numbers are partially compromised by the FEVER citation limitation, and the conceptual framing could be sharpened. The benchmark's value as a community resource is substantial, contingent on uptake.

Rating: 6.8 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Generated Jun 12, 2026

Comparison History (21)

Won vs. Existence Precedes Value: Joint Modeling of Observational Existence and Evolving States in Time Series Forecasting

AuthorityBench addresses a critical and timely problem—how citation-based authority signals influence LLM hallucination—with a rigorous large-scale factorial benchmark design. As LLMs are increasingly deployed in citation-augmented settings (RAG, research assistants), understanding epistemic susceptibility has broad implications for AI safety, trust, and deployment across multiple high-stakes domains. The finding that even fabricated citations increase hallucination rates is highly actionable. Paper 2, while technically solid, addresses a more niche problem in time series forecasting with incremental methodological contributions. Paper 1's breadth of impact across AI safety, NLP, and responsible AI gives it higher potential influence.

claude-opus-4-6 · Jun 12, 2026

Won vs. Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents

Paper 1 addresses a fundamental and widespread issue in LLM deployment—how citations affect hallucination and epistemic trust—which is critical for widely used RAG systems. Its large-scale benchmark and findings have broad implications for AI safety, reliability, and human-AI interaction across multiple disciplines. While Paper 2 presents a strong technical contribution for offline agent evaluation, its impact is more confined to the RL and agentic LLM subfields, giving Paper 1 a broader and more immediate scientific impact.

gemini-3.1-pro-preview · Jun 12, 2026

Lost vs. Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models

Paper 1 provides fundamental mechanistic insights into Chain-of-Thought reasoning, identifying a 'commitment boundary' that allows for early exiting. This not only advances LLM interpretability but also offers a highly practical method to reduce inference costs by up to 55% without performance degradation. Given the massive scale of LLM deployment, this efficiency gain and theoretical contribution give it broader and more immediate scientific and real-world impact compared to the benchmarking of citation bias in Paper 2.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. Uncertainty Estimation for Molecular Diffusion Models

AuthorityBench addresses a fundamental and timely problem—how citation-based authority signals cause LLMs to hallucinate—with a large-scale, rigorously designed benchmark (220K prompts, factorial design, multiple domains). This has broad implications for AI safety, misinformation, and the deployment of LLMs in high-stakes domains like law and medicine. Paper 2, while technically sound, addresses a narrower problem (uncertainty estimation for molecular diffusion models) with more incremental contributions (post-hoc Laplace approximation for filtering). Paper 1's breadth of impact, novelty of the benchmark design, and relevance to the rapidly growing LLM deployment ecosystem give it higher potential impact.

claude-opus-4-6 · Jun 12, 2026

Lost vs. On Subquadratic Architectures: From Applications to Principles

Paper 1 addresses a fundamental bottleneck in AI: the quadratic computational cost of Transformers. By systematically evaluating and theorizing why specific subquadratic architectures like xLSTM succeed in complex tasks, it directly informs the design of next-generation foundation models. This architectural advancement has massive breadth of impact across sequence modeling domains. While Paper 2 offers a highly rigorous and valuable benchmark for evaluating LLM hallucinations in citation contexts, Paper 1 provides foundational algorithmic principles that are more likely to shape the future trajectory of core deep learning architectures.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection

Paper 2 addresses a critical, timely issue in modern AI: how large language models are influenced by citation authority, impacting Retrieval-Augmented Generation (RAG) and AI safety across multiple domains (medicine, law, science). By releasing a large-scale, multi-domain benchmark (AuthorityBench), it provides a foundational resource likely to attract widespread attention and citations from the NLP and AI safety communities. In contrast, Paper 1 offers a highly specialized, though methodologically sound, approach to maritime anomaly detection, limiting its immediate breadth and overall scientific impact compared to the universally relevant LLM research in Paper 2.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. Adjusted Cup-Product Neural Layer

Paper 2 has higher likely impact: it introduces a large, publicly released benchmark with a rigorous balanced factorial design isolating citation authority effects across multiple high-stakes domains, producing actionable findings for LLM reliability and deployment. Its applications (citation-augmented QA, safety, evaluation, and policy) are immediate and broad across NLP, HCI, and AI governance. Paper 1 is novel and mathematically elegant, but appears specialized with narrower near-term applicability and impact mainly within niche intersections of geometric topology, gauge theory, and physics-informed ML.

gpt-5.2 · Jun 12, 2026

Lost vs. Extracting Governing Equations from Latent Dynamics via Multi-View Contrastive Learning

Paper 1 likely has higher scientific impact due to a more novel methodological contribution: a principled multi-view contrastive framework that enables identifiable latent dynamical system recovery from noisy nonlinear observations, plus structured equation discovery with theoretical guarantees. This combination advances scientific machine learning and system identification and can transfer across many physical/biological domains (e.g., neuroscience, fluid dynamics, climate). Paper 2 provides a valuable benchmark for LLM epistemic behavior, but its primary contribution is empirical measurement/dataset creation with narrower methodological innovation and faster risk of being superseded by evolving model and retrieval setups.

gpt-5.2 · Jun 12, 2026

Won vs. How Much Memory Do We Need? Adaptive Memory Gate for Neural Operators

Paper 2 addresses a highly timely and critical issue in AI safety—LLM hallucinations induced by citations. Its large-scale, multi-domain benchmark has broad implications for researchers and practitioners deploying LLMs in sensitive areas like medicine and law. In contrast, Paper 1 offers a valuable but more niche methodological improvement for neural operators solving PDEs, likely resulting in a narrower scope of scientific impact.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. Quantizing Time-Series Models As Dynamical Systems: Trajectory-Based Quantization Sensitivity Score

AuthorityBench addresses a critical and timely problem—how citation-based authority signals cause LLMs to hallucinate—with a large-scale, rigorously designed benchmark (220K prompts, factorial design, multiple domains). This has broad impact across AI safety, NLP, information retrieval, and policy, and is highly relevant as LLMs are increasingly deployed in citation-augmented settings. Paper 2 presents a novel quantization metric using dynamical systems theory, which is technically interesting but more niche in scope, primarily benefiting model compression practitioners. Paper 1's findings on epistemic vulnerabilities have wider interdisciplinary relevance and societal implications.

claude-opus-4-6 · Jun 12, 2026
