Aryan Khurana, Aravind Ramana RN, Dhruv Kumar
Large language models are increasingly deployed in citation-augmented settings, yet the effect of citation presence on model behavior independent of factual content remains poorly understood. We introduce AuthorityBench, a 220,564-prompt multi-domain benchmark that isolates how citation-based authority signals influence epistemic behavior in LLMs. The benchmark uses a fully balanced 2x2 factorial design crossing claim veracity with citation veracity, the first to do so, across four domains (general knowledge, science, law, and medicine), with controlled variation over 40 prompt templates, four venue prestige tiers, and a country-coded author name dataset. Evaluating seven models on 12 structured research questions, we find that citation presence, whether real or fabricated, consistently increases hallucination rates relative to a no-citation baseline. The effect is strongest when fabricated citations accompany true claims, raising hallucination rates by 3 to 22 percentage points and reaching 35 to 77% in the general knowledge domain, while legal claims are comparatively robust and venue prestige and author demographics show negligible impact. All datasets and evaluation code are available at: https://github.com/floating-reeds/AuthorityBench
AuthorityBench introduces a 220,564-prompt benchmark with a fully balanced 2×2 factorial design crossing claim veracity (true/false) with citation veracity (real/fabricated) across four domains. The key novelty is the true claim × fabricated citation condition, which tests whether models can be induced to deny correct facts when presented with authoritative-looking but fabricated citations. This condition is absent from prior work, most notably FalseCite (Mao et al., 2025), which examined only false claims paired with fabricated citations. The benchmark additionally controls for venue prestige (4 tiers), author demographics (country-coded surnames), citation placement (40 templates), and temporal framing.
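To make the crossed design concrete, the following is a minimal sketch of how the 2×2 conditions could be enumerated per domain and checked for balance. It is an illustration only, not the authors' generation code: the function names `build_cells` and `is_balanced` and the dictionary layout are assumptions, and the no-citation baseline, templates, venue tiers, and author names are deliberately left out.

```python
from collections import Counter
from itertools import product

# Factor levels as described in the review; the identifiers are illustrative.
CLAIM_VERACITY = ("true", "false")
CITATION_VERACITY = ("real", "fabricated")
DOMAINS = ("general knowledge", "science", "law", "medicine")


def build_cells():
    """Enumerate every (domain, claim, citation) cell of the crossed design."""
    return [
        {"domain": d, "claim": c, "citation": s}
        for d, c, s in product(DOMAINS, CLAIM_VERACITY, CITATION_VERACITY)
    ]


def is_balanced(items):
    """Return True when all four claim x citation conditions occur equally often."""
    counts = Counter((i["claim"], i["citation"]) for i in items)
    return len(counts) == 4 and len(set(counts.values())) == 1


cells = build_cells()

# The condition the review highlights as new: a true claim with a fabricated citation.
novel_cells = [
    c for c in cells if c["claim"] == "true" and c["citation"] == "fabricated"
]

print(len(cells), is_balanced(cells), len(novel_cells))
```

Keeping each factor as an explicit field is what allows an effect to be attributed to citation veracity while claim veracity is held fixed, which is the property the factorial design is meant to provide.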
The central finding — that citation presence increases hallucination across all seven models tested, most severely when fabricated citations accompany true claims — is both surprising and consequential. This reframes the problem from "citations amplify false claims" to "citations degrade epistemic reasoning generally," which has direct implications for RAG system design.
Strengths: The 2×2 factorial design is well-motivated and cleanly executed, allowing causal attribution of effects to citation presence rather than content. The use of Cohen's d over p-values given the large sample sizes reflects good statistical practice. The 40 prompt templates mitigate concerns about template-specific artifacts. Human validation of the judge model (Cohen's κ = 0.83, 90.7% inter-annotator agreement on 1,500 samples) provides reasonable confidence in automated evaluation.
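For readers less familiar with the two statistics named above, here is a small sketch of the standard textbook definitions of Cohen's d (pooled standard deviation form) and Cohen's κ for two raters. The toy inputs are made up for illustration and are not data from the paper; whether the authors used exactly these variants is an assumption.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation.

    Raises ZeroDivisionError if both groups have zero variance.
    """
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters over the same items.

    Raises ZeroDivisionError if expected agreement is exactly 1.
    """
    n = len(rater1)
    observed = sum(x == y for x, y in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Agreement expected by chance, from each rater's marginal label frequencies.
    expected = sum(c1[label] * c2[label] for label in c1) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy inputs only (hypothetical, not benchmark data).
d_example = cohens_d([2, 4, 6], [1, 3, 5])
kappa_example = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
print(d_example, kappa_example)
```

The point of preferring d at this scale is that with hundreds of thousands of prompts almost any difference reaches statistical significance, whereas d reports how large the difference is relative to the spread of the data.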
Weaknesses: Several methodological concerns deserve scrutiny:
The practical implications are significant for the RAG ecosystem. The finding that citation presence, even when the cited source is real, degrades accuracy on claims that models would otherwise answer correctly is a direct warning for systems that naively augment prompts with retrieved citations. This challenges the implicit assumption in retrieval-augmented generation that grounding in sources improves reliability.
The benchmark could serve as a testbed for developing citation-robust models, similar to how adversarial benchmarks drive robustness research. The open release of all data and code enhances this potential.
The discovery of a suppression/amplification split across model families on false claims (Llama/Claude/GPT suppress hallucination; Gemma/Phi amplify) is an intriguing finding that points toward architectural or training-regime differences in how authority signals are processed, though no mechanistic explanation is offered.
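One way such a split could be operationalized is by the sign of the percentage-point change in hallucination rate relative to the no-citation baseline. The sketch below is an assumption about how one might label models, not the paper's procedure: `authority_shift`, `classify`, the 1-point tolerance, and the model names and rates are all hypothetical placeholders.

```python
def authority_shift(baseline_rate, cited_rate):
    """Percentage-point change in hallucination rate when a citation is added."""
    return round((cited_rate - baseline_rate) * 100, 1)


def classify(shift_pp, tolerance_pp=1.0):
    """Label a shift as amplifying, suppressing, or neutral (tolerance is arbitrary)."""
    if shift_pp > tolerance_pp:
        return "amplify"
    if shift_pp < -tolerance_pp:
        return "suppress"
    return "neutral"


# Placeholder rates for two hypothetical models: (baseline, with citation).
# These numbers are NOT results from the paper.
toy_rates = {"model_a": (0.30, 0.22), "model_b": (0.30, 0.41)}

labels = {
    name: classify(authority_shift(baseline, cited))
    for name, (baseline, cited) in toy_rates.items()
}
print(labels)
```

A sign-based label like this only describes the behavior; it does not by itself say anything about the architectural or training-regime cause that the review notes is left unexplained.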
This work is highly timely. RAG systems are being deployed at scale in high-stakes domains (legal, medical, scientific), and understanding how citation signals influence model behavior is a practical necessity. The paper addresses a gap between hallucination benchmarks (which ignore in-context authority) and citation faithfulness work (which examines whether models cite correctly, not how citations influence reasoning).
AuthorityBench makes a solid empirical contribution by systematically characterizing how citation signals influence LLM behavior across a well-designed factorial experiment. The true claim × fabricated citation condition is a genuine innovation that reveals a practically important failure mode. However, the work is primarily descriptive, the headline numbers are partially compromised by the FEVER citation limitation, and the conceptual framing could be sharpened. The benchmark's value as a community resource is substantial, contingent on uptake.
Generated Jun 12, 2026
AuthorityBench addresses a critical and timely problem—how citation-based authority signals influence LLM hallucination—with a rigorous large-scale factorial benchmark design. As LLMs are increasingly deployed in citation-augmented settings (RAG, research assistants), understanding epistemic susceptibility has broad implications for AI safety, trust, and deployment across multiple high-stakes domains. The finding that even fabricated citations increase hallucination rates is highly actionable. Paper 2, while technically solid, addresses a more niche problem in time series forecasting with incremental methodological contributions. Paper 1's breadth of impact across AI safety, NLP, and responsible AI gives it higher potential influence.
Paper 1 addresses a fundamental and widespread issue in LLM deployment—how citations affect hallucination and epistemic trust—which is critical for widely used RAG systems. Its large-scale benchmark and findings have broad implications for AI safety, reliability, and human-AI interaction across multiple disciplines. While Paper 2 presents a strong technical contribution for offline agent evaluation, its impact is more confined to the RL and agentic LLM subfields, giving Paper 1 a broader and more immediate scientific impact.
Paper 1 provides fundamental mechanistic insights into Chain-of-Thought reasoning, identifying a 'commitment boundary' that allows for early exiting. This not only advances LLM interpretability but also offers a highly practical method to reduce inference costs by up to 55% without performance degradation. Given the massive scale of LLM deployment, this efficiency gain and theoretical contribution give it broader and more immediate scientific and real-world impact compared to the benchmarking of citation bias in Paper 2.
AuthorityBench addresses a fundamental and timely problem—how citation-based authority signals cause LLMs to hallucinate—with a large-scale, rigorously designed benchmark (220K prompts, factorial design, multiple domains). This has broad implications for AI safety, misinformation, and the deployment of LLMs in high-stakes domains like law and medicine. Paper 2, while technically sound, addresses a narrower problem (uncertainty estimation for molecular diffusion models) with more incremental contributions (post-hoc Laplace approximation for filtering). Paper 1's breadth of impact, novelty of the benchmark design, and relevance to the rapidly growing LLM deployment ecosystem give it higher potential impact.
Paper 1 addresses a fundamental bottleneck in AI: the quadratic computational cost of Transformers. By systematically evaluating and theorizing why specific subquadratic architectures like xLSTM succeed in complex tasks, it directly informs the design of next-generation foundation models. This architectural advancement has massive breadth of impact across sequence modeling domains. While Paper 2 offers a highly rigorous and valuable benchmark for evaluating LLM hallucinations in citation contexts, Paper 1 provides foundational algorithmic principles that are more likely to shape the future trajectory of core deep learning architectures.
Paper 2 addresses a critical, timely issue in modern AI: how large language models are influenced by citation authority, impacting Retrieval-Augmented Generation (RAG) and AI safety across multiple domains (medicine, law, science). By releasing a large-scale, multi-domain benchmark (AuthorityBench), it provides a foundational resource likely to attract widespread attention and citations from the NLP and AI safety communities. In contrast, Paper 1 offers a highly specialized, though methodologically sound, approach to maritime anomaly detection, limiting its immediate breadth and overall scientific impact compared to the universally relevant LLM research in Paper 2.
Paper 2 has higher likely impact: it introduces a large, publicly released benchmark with a rigorous balanced factorial design isolating citation authority effects across multiple high-stakes domains, producing actionable findings for LLM reliability and deployment. Its applications (citation-augmented QA, safety, evaluation, and policy) are immediate and broad across NLP, HCI, and AI governance. Paper 1 is novel and mathematically elegant, but appears specialized with narrower near-term applicability and impact mainly within niche intersections of geometric topology, gauge theory, and physics-informed ML.
Paper 1 likely has higher scientific impact due to a more novel methodological contribution: a principled multi-view contrastive framework that enables identifiable latent dynamical system recovery from noisy nonlinear observations, plus structured equation discovery with theoretical guarantees. This combination advances scientific machine learning and system identification and can transfer across many physical/biological domains (e.g., neuroscience, fluid dynamics, climate). Paper 2 provides a valuable benchmark for LLM epistemic behavior, but its primary contribution is empirical measurement/dataset creation with narrower methodological innovation and faster risk of being superseded by evolving model and retrieval setups.
Paper 2 addresses a highly timely and critical issue in AI safety—LLM hallucinations induced by citations. Its large-scale, multi-domain benchmark has broad implications for researchers and practitioners deploying LLMs in sensitive areas like medicine and law. In contrast, Paper 1 offers a valuable but more niche methodological improvement for neural operators solving PDEs, likely resulting in a narrower scope of scientific impact.
AuthorityBench addresses a critical and timely problem—how citation-based authority signals cause LLMs to hallucinate—with a large-scale, rigorously designed benchmark (220K prompts, factorial design, multiple domains). This has broad impact across AI safety, NLP, information retrieval, and policy, and is highly relevant as LLMs are increasingly deployed in citation-augmented settings. Paper 2 presents a novel quantization metric using dynamical systems theory, which is technically interesting but more niche in scope, primarily benefiting model compression practitioners. Paper 1's findings on epistemic vulnerabilities have wider interdisciplinary relevance and societal implications.