The LLM Effect on IR Benchmarks: A Meta-Analysis of Effectiveness, Baselines, and Contamination

Moritz Staudinger, Wojciech Kusa, Allan Hanbury

Apr 7, 2026 · arXiv:2604.05766v1

cs.IR

#32 of 620 · cs.IR

Tournament Score: 1541 ± 24 (scale: 1100 to 1750)

Win Rate: 72%

Rating: 5.5 / 10

Significance 6.5 · Rigor 4.5 · Novelty 5.5 · Clarity 7

Abstract

Benchmark collections have long enabled controlled comparison and cumulative progress in Information Retrieval (IR). However, prior meta-analyses have shown that reported effectiveness gains often fail to accumulate, in part due to the use of weak or outdated baselines. While large language models are increasingly used in retrieval pipelines, their impact on established IR benchmarks has not been systematically analyzed. In this study, we analyze 143 publications reporting results on the TREC Robust04 collection and the TREC Deep Learning 2020 (DL20) passage retrieval benchmark to examine longitudinal trends in retrieval effectiveness and baseline strength. We observe what we term an "LLM effect": recent systems incorporating LLM components achieve 8.8% higher nDCG@10 on DL20 compared to the best result from TREC 2020 and approximately 20% higher on Robust04 since 2023. However, adapting a data contamination detection approach to reranking reveals measurable contamination in both benchmarks. While excluding contaminated topics reduces effectiveness, confidence intervals remain wide, making it difficult to determine whether the LLM effect reflects genuine methodological advances or memorization from pretraining data.

AI Impact Assessments

(3 models)

Scientific Impact Assessment

Core Contribution

This paper introduces the concept of the "LLM effect" — a descriptive characterization of how large language model components have shifted effectiveness trends on established IR benchmarks. The authors analyze 143 publications reporting results on TREC Robust04 and TREC Deep Learning 2020 (DL20) passage retrieval, extending prior meta-analyses by Armstrong et al. (2009) and Yang et al. (2019) into the LLM era. The paper makes three interrelated contributions: (1) a longitudinal meta-analysis documenting effectiveness trends, (2) an analysis of evaluation practice shifts (particularly the MAP-to-nDCG@10 transition), and (3) an adaptation of the Data Contamination Quiz (DCQ) methodology to assess data contamination in reranking settings.

The contamination analysis is the most novel element. By adapting DCQ to the reranking setting (generating paraphrased passage variants and testing whether models can identify the originals), the authors provide the first systematic contamination estimates for widely used LLM rerankers on standard IR benchmarks. The finding that RankGPT shows ~41% contamination on DL20 and RankZephyr shows 26-32% is noteworthy and raises legitimate concerns about benchmark validity.
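To make the quiz idea concrete, here is a minimal sketch of how a chance-corrected contamination estimate could be computed from quiz outcomes. It assumes a four-option quiz (one original passage plus three paraphrases), so the guessing baseline is 0.25. The function name `contamination_estimate`, the option count, the correction-for-guessing formula, and the outcome list are all illustrative assumptions, not the authors' implementation or data.

```python
def contamination_estimate(picked_original, n_options=4):
    """Chance-corrected share of quiz items where the model picked the original.

    picked_original: list of booleans, one per quiz item.
    n_options: number of answer options per item (assumed, not from the paper).
    """
    if not picked_original:
        raise ValueError("need at least one quiz item")
    if n_options < 2:
        raise ValueError("a quiz needs at least two options")
    accuracy = sum(picked_original) / len(picked_original)
    chance = 1.0 / n_options
    # Standard correction for guessing, clipped at zero so that
    # below-chance accuracy is not reported as negative contamination.
    return max(0.0, (accuracy - chance) / (1.0 - chance))


# Hypothetical outcomes for eight quiz items (placeholder values).
outcomes = [True, False, True, True, False, False, True, False]
print(round(contamination_estimate(outcomes), 3))
```

With a raw accuracy of 0.5 and a guessing baseline of 0.25, the corrected estimate is one third: only the accuracy above chance counts as evidence of memorization.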

Methodological Rigor

The meta-analysis methodology is straightforward and follows established precedent, but has notable limitations that the authors partially acknowledge:

Literature search scope: Restricting to ACM Digital Library is a significant limitation. Major LLM-based retrieval work appears at EMNLP, ACL, NeurIPS, and ECIR. This restriction, while justified by consistency with prior meta-analyses, likely introduces systematic bias — particularly for LLM-based systems which are more frequently published at NLP venues. This could substantially undercount both strong baselines and state-of-the-art results.

Model categorization: The threshold of "more than 7B parameters or explicitly containing 'LLM' in their name" for the LLM category is somewhat ad hoc. Models like monoT5-3B or cross-encoders based on DeBERTa blur this boundary. The paper does not discuss how borderline cases were handled.

Contamination analysis: The DCQ adaptation is creative but has methodological uncertainties. Using gemini-2.5-flash to generate paraphrases makes the results depend on the quality of another LLM's output. The filtering approach (removing all topics where the model correctly identified at least one passage) is acknowledged as conservative, but it also conflates genuine passage recognition with random chance: a single lucky guess is enough to exclude a topic, even when baseline guessing rates are accounted for in the aggregate. The resulting sample sizes after filtering (6 DL20 topics for RankZephyr, 4 for RankGPT) are too small for meaningful statistical comparison, and the authors appropriately note this.

Statistical analysis: The confidence intervals are appropriately bootstrapped, but the paper lacks formal statistical tests for many claims. Regression lines in figures are not accompanied by R² values or significance tests.
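The filtering and interval concerns above can be illustrated with a short sketch. It drops every topic where the model identified at least one original passage, then computes a percentile bootstrap interval over the per-topic scores that remain. All topic ids, quiz outcomes, and nDCG@10 values are made-up placeholders, and `uncontaminated_topics` and `percentile_bootstrap_ci` are hypothetical helpers. The paper's exact bootstrap procedure is not described in this review, so the percentile method is an assumption.

```python
import random
import statistics


def uncontaminated_topics(quiz_hits):
    """Keep only topics where the model never picked an original passage."""
    return [topic for topic, hits in quiz_hits.items() if not any(hits)]


def percentile_bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-topic scores."""
    if not scores:
        raise ValueError("need at least one score")
    rng = random.Random(seed)
    # Resample topics with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder data: per-topic quiz outcomes and nDCG@10 scores (not from the paper).
quiz_hits = {
    "t1": [False, False, False],
    "t2": [True, False, False],
    "t3": [False, False, False],
    "t4": [False, True, True],
    "t5": [False, False, False],
    "t6": [False, False, False],
}
ndcg10 = {"t1": 0.82, "t2": 0.64, "t3": 0.41, "t4": 0.77, "t5": 0.58, "t6": 0.35}

kept = uncontaminated_topics(quiz_hits)
low, high = percentile_bootstrap_ci([ndcg10[t] for t in kept])
print(kept, round(low, 3), round(high, 3))
```

With only four topics left after filtering, the interval covers a large part of the observed score range, which is the practical reason such small samples cannot support a firm comparison.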

Potential Impact

The paper addresses a genuinely important concern for the IR community. If benchmark results are inflated by data contamination, this undermines the field's ability to measure progress. The practical implications include:

1. Benchmark validity: The contamination findings, even if preliminary, should motivate the community to develop contamination-aware evaluation protocols.

2. Evaluation standardization: The documentation of metric heterogeneity (19 different metrics across 72 Robust04 papers) highlights a real obstacle to progress measurement.

3. Community practices: The observation that the apparent "LLM effect" may partly reflect metric selection bias (MAP vs. nDCG@10) is an important methodological insight.

However, the inconclusive nature of the contamination analysis limits immediate actionability. The paper raises the alarm but cannot definitively answer whether observed gains are real or artifactual.

Timeliness & Relevance

This paper is highly timely. The IR community is at an inflection point where LLM-based systems dominate leaderboards, and questions about benchmark integrity are urgent. The concern about data contamination in LLMs is broadly recognized in NLP but has received insufficient attention in IR evaluation contexts. The paper fills a gap between the general NLP contamination literature and IR-specific evaluation practices.

Strengths

Important research question: Systematically examining whether LLM-era benchmark improvements are genuine is crucial for the field's scientific integrity.

Continuity with prior work: Extending Armstrong et al. and Yang et al.'s analyses provides valuable longitudinal perspective spanning two decades.

Novel contamination adaptation: Applying DCQ to reranking is a meaningful methodological contribution, even if results are inconclusive.

Honest reporting: The authors are commendably transparent about limitations and avoid overclaiming, particularly regarding the contamination analysis.

Observation about metric shifts: The insight that the MAP→nDCG@10 transition coincides with LLM adoption, potentially confounding progress measurement, is subtle and important.
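To see why a shift from MAP to nDCG@10 can confound trend comparisons, the following minimal sketch computes both metrics on two invented rankings. The relevance labels are placeholders. The nDCG variant (linear gain, log2 discount) is one common formulation and not necessarily the one used in every surveyed paper. As a simplification, both metrics treat the judged list itself as the complete set of relevant documents rather than consulting full qrels.

```python
import math


def average_precision(rels):
    """AP over a ranked list of binary labels (1 = relevant).

    Simplification: normalizes by the relevant documents present in the list.
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def dcg(gains, k):
    """Discounted cumulative gain over the top-k positions."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k=10):
    """nDCG@k with the ideal ranking built from the same list of gains."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal else 0.0


# Two hypothetical runs over 20 ranked documents, four relevant in each.
run_a = [1, 1] + [0] * 12 + [1] + [0] * 4 + [1]  # relevant at ranks 1, 2, 15, 20
run_b = [0, 0, 1, 1, 1, 1] + [0] * 14            # relevant at ranks 3, 4, 5, 6

for name, run in (("A", run_a), ("B", run_b)):
    print(name, round(average_precision(run), 3), round(ndcg_at_k(run), 3))
```

In this toy case run A has the higher AP while run B has the higher nDCG@10, so the choice of metric alone changes which run looks better. The example is purely illustrative and says nothing about the size of this effect in the surveyed papers.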

Limitations

Scope of literature search: ACM-only coverage systematically misses important work, potentially biasing the "LLM effect" characterization.

Small sample sizes in contamination experiments: Only two reranking models tested, with post-filtering topic counts as low as 4, severely limiting generalizability.

Limited causal analysis: The paper documents correlations (LLM adoption ↔ higher scores) but cannot disentangle contributions from model architecture improvements, better training procedures, metric selection, and contamination.

No proposed solutions: Beyond calling for standardized metrics and contamination testing, the paper offers limited concrete remediation strategies.

Short paper format: At 5 pages, the analysis necessarily remains surface-level in places. The contamination methodology, in particular, would benefit from more thorough validation.

Cross-validation confound: The inclusion of CV-based results on Robust04 complicates interpretation, and while the authors note this, the visualization could be clearer in separating these evaluation regimes.

Overall Assessment

This is a timely and relevant contribution that raises important questions about IR benchmark validity in the LLM era. The meta-analysis provides useful longitudinal data, and the contamination analysis, while preliminary, identifies a genuine concern. However, the paper's impact is limited by its inconclusive findings — it effectively poses the question but cannot answer it. The restricted literature scope and small experimental scale in the contamination study reduce confidence in the findings. It is best viewed as a position-establishing paper that should motivate more rigorous follow-up work.

Rating: 5.5 / 10

Significance 6.5 · Rigor 4.5 · Novelty 5.5 · Clarity 7

Generated Apr 13, 2026

Comparison History (89)

Lost vs. Topic Is Not Agenda: A Citation-Community Audit of Text Embeddings

Paper 2 identifies a fundamental and previously underexplored limitation of text embeddings—their inability to distinguish research agendas from broader topics—which has broad implications for RAG systems, scientific search, and embedding model design across many fields. It proposes a novel diagnostic methodology using citation graphs at multiple granularities and demonstrates a concrete, actionable retrieval signal that embeddings miss. Paper 1, while valuable as a meta-analysis of IR benchmarks and LLM contamination, is more narrowly scoped to the IR evaluation community and primarily raises concerns without resolving them. Paper 2's findings are more likely to influence embedding model development and RAG system design at scale.

claude-opus-4-6 · May 11, 2026

Lost vs. Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG

Paper 1 (FES-RAG) introduces a novel framework with a concrete technical contribution—fragment-level evidence selection for multimodal RAG with a principled metric (FIG) and distillation approach—demonstrating significant empirical gains (27% CIDEr improvement). It addresses a practical limitation in a rapidly growing field (MRAG) with broad applications. Paper 2 provides valuable meta-analysis of LLM contamination effects on IR benchmarks, raising important methodological concerns, but is primarily observational and diagnostic rather than proposing new methods. Paper 1's actionable framework and strong empirical results suggest broader adoption and higher impact.

claude-opus-4-6 · May 6, 2026

Lost vs. RAG over Thinking Traces Can Improve Reasoning Tasks

Paper 2 introduces a novel and actionable idea—retrieving thinking traces instead of documents for reasoning tasks—that challenges a widely held assumption about RAG's limitations. It demonstrates strong empirical gains across multiple benchmarks and state-of-the-art models, with broad applicability to math, code, and scientific reasoning. Paper 1 provides a valuable meta-analysis of LLM contamination effects on IR benchmarks but is primarily diagnostic and retrospective. Paper 2's paradigm shift in what constitutes a useful retrieval corpus has greater potential to influence future research directions across multiple fields.

claude-opus-4-6 · May 6, 2026

Won vs. Aspect-Aware Content-Based Recommendations for Mathematical Research Papers

Paper 1 addresses a critical, field-wide methodological issue: data contamination and baseline validity in LLM-based Information Retrieval evaluations. By questioning whether recent performance gains are genuine or due to memorization, its findings have profound implications for how the entire AI and IR community benchmarks future models. Paper 2 presents a valuable but narrower contribution focused on a specific application domain (math paper recommendations), making Paper 1's potential breadth of impact and timely relevance significantly higher.

gemini-3-pro-preview · May 6, 2026

Lost vs. RAG over Thinking Traces Can Improve Reasoning Tasks

Paper 1 introduces a novel and counterintuitive idea—using thinking traces as a retrieval corpus for reasoning tasks—with strong empirical results across multiple challenging benchmarks and state-of-the-art models. It challenges a widely-held assumption about RAG's limitations, offers practical benefits (improved performance with reduced inference cost), and has broad applicability across math, code, and science reasoning. Paper 2 provides valuable meta-analytical insights about LLM contamination in IR benchmarks but is more diagnostic in nature, identifying problems rather than proposing transformative solutions, limiting its broader impact.

claude-opus-4-6 · May 6, 2026

Won vs. Aspect-Aware Content-Based Recommendations for Mathematical Research Papers

Paper 1 addresses a fundamental crisis in AI and IR research: data contamination in LLM evaluations. By questioning whether benchmark improvements are genuine or due to memorization, its findings have widespread implications for how AI systems are evaluated across multiple fields. While Paper 2 offers a valuable domain-specific recommendation system and dataset, Paper 1's critical examination of evaluation methodology ensures a broader, more foundational scientific impact.

gemini-3-pro-preview · May 6, 2026

Won vs. Hierarchical Long-Term Semantic Memory for LinkedIn's Hiring Agent

Paper 2 addresses a fundamental methodological concern affecting the entire IR/NLP research community: whether LLM-driven improvements on established benchmarks reflect genuine advances or data contamination. This meta-analysis of 143 publications has broad implications for how the field evaluates progress, potentially influencing benchmark design, evaluation methodology, and scientific rigor across many subfields. Paper 1, while practically valuable as an engineering contribution deployed at LinkedIn, is more narrowly scoped to a specific industrial application with limited generalizable scientific insights beyond the system design.

claude-opus-4-6 · Apr 30, 2026

Lost vs. When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models

Paper 2 introduces a novel framework (ReaLM-Retrieve) that addresses a timely and practical problem—integrating retrieval with large reasoning models during inference. It proposes concrete innovations (step-level uncertainty detection, learned retrieval policy, efficiency optimization) with strong empirical results across multiple benchmarks, showing significant improvements in both accuracy and efficiency. Paper 1 is a valuable meta-analysis identifying the 'LLM effect' and contamination concerns in IR benchmarks, but its conclusions are largely observational with wide confidence intervals, limiting actionable impact. Paper 2's methodological contributions are more likely to influence future system design across NLP and IR.

claude-opus-4-6 · Apr 30, 2026

Lost vs. Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation

Paper 2 introduces a novel methodological framework that reframes retrieval evaluation as a statistical estimation problem, offering semantic coverage guarantees. This constructive approach provides a foundational solution to the critical bottleneck of trustworthy RAG evaluation. While Paper 1 offers a valuable meta-analysis on benchmark contamination, Paper 2's introduction of semantic stratification is likely to have a broader and more lasting impact by changing how future retrieval systems and RAG pipelines are systematically evaluated across the field.

gemini-3-pro-preview · Apr 29, 2026

Lost vs. MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

Paper 1 introduces a novel benchmark for omni-modality embeddings, addressing a critical gap in evaluating multi-sensory AI models. Its focus on text, image, video, audio, and agent scenarios pushes the boundaries of multimodal learning, offering foundational tools for next-generation AI. While Paper 2 provides a valuable meta-analysis on IR benchmarks and data contamination, Paper 1 has broader applicability across diverse AI fields and establishes a new standard for evaluating and diagnosing future full-modality embedding models, likely leading to higher scientific impact.

gemini-3-pro-preview · Apr 28, 2026
