The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

Wojciech Zarzecki, Jan Dubiński, Sebastian Cygert

Jun 2, 2026

arXiv:2606.03305v1

cs.AI (primary)

#363 of 3355 · Artificial Intelligence

Tournament Score: 1500 ± 46 (axis range 1050 to 1800)

Win Rate: 72%

Rating: 6.5 / 10

Significance 7 · Rigor 6.5 · Novelty 5.5 · Clarity 7.5

Abstract

Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical tools for detecting training-data membership exist, but have been validated almost exclusively in controlled academic regimes: large, homogeneous pre-training corpora and transparent, single-stage training pipelines. Whether these methods remain reliable in realistic auditing scenarios remains unclear. We identify two under-studied failure modes: distribution shift, which arises when suspect and validation sets violate the IID assumption, and scale constraints, which arise because benchmarks are orders of magnitude smaller than pre-training corpora. We systematically evaluate three leading paradigms (LLM Dataset Inference, Post-Hoc Dataset Inference, and CoDeC) across 27 models from multiple families (including Pythia, OLMo 2, and specialised cultural and medical LLMs) and scales (up to 27B). We then further extend our analysis to frontier industry models. Across 335 evaluations, only 199 yield correct outcomes. LLM Dataset Inference results in false positives under distribution shift, Post-Hoc Dataset Inference is underpowered at benchmark scale, and CoDeC provides only coarse provenance signals that are insufficient to verify individual benchmark splits. Our results reveal a systematic reliability gap between controlled validation and practical benchmark auditing, and show that statistical detection cannot yet replace transparent data provenance. We open-source our benchmark for further research.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper identifies and systematically demonstrates a "reliability gap" between the controlled settings where benchmark contamination detection methods are validated and the realistic conditions where they would actually be deployed. The authors pinpoint two specific failure modes: distribution shift (when suspect and validation sets violate IID assumptions, as commonly occurs with benchmark train/test splits) and scale constraints (benchmarks containing thousands rather than millions of examples). They evaluate three leading detection paradigms (LLM Dataset Inference, Post-Hoc Dataset Inference, and CoDeC) across 335 evaluations spanning 27 models, finding that only 199 of the 335 outcomes (about 59%) are correct. The central claim is that statistical contamination detection cannot yet serve as a reliable replacement for transparent data provenance.

This is a valuable "stress-testing" contribution rather than a methodological advance. The paper does not propose new detection methods but instead provides a much-needed reality check on existing ones, which is arguably more impactful at this stage of the field's development.

2. Methodological Rigor

The experimental design is thoughtful and well-structured across four progressively more realistic tasks:

Task 1 (controlled Pythia/Pile with limited data) establishes baseline behavior under scale constraints

Task 2 (OLMo 2 split-level detection) tests IID violations with known ground truth

Task 3 (specialized medical/Polish LLMs) probes domain-specific settings

Task 4 (industry models) extends to practical frontier scenarios

The use of OLMo 2 with documented training data is a strength, providing reliable ground truth for Tasks 2-3. However, there are some concerns:

Threshold sensitivity: The authors increase Post-Hoc DI's threshold from p<0.05 to p<0.14, which they justify but which could be seen as favorable to the method. The CoDeC thresholds (0.6/0.8) are taken from the original work but the "inconclusive" zone introduces ambiguity in counting outcomes.

Aggregation of outcomes: The headline "199/335 correct" figure aggregates across very heterogeneous settings. The counting methodology treats all evaluations equally regardless of difficulty or practical importance.

Ground truth for Task 4: The industry model analysis lacks ground truth by construction, so its conclusions are speculative (as the authors acknowledge).

Limited exploration of fixes: The paper diagnoses problems thoroughly but offers minimal guidance on how to improve these methods or what alternative approaches might work.
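
The counting ambiguity raised under threshold sensitivity can be made concrete with a small sketch. It is illustrative only: the function names, the example scores, and the assumption that a higher CoDeC score points toward contamination are hypothetical and not taken from the paper. Only the 0.6/0.8 thresholds come from the text above.

```python
from typing import List, Optional, Tuple

# Thresholds quoted in the review; the score direction is an assumption.
LOW, HIGH = 0.6, 0.8


def codec_verdict(score: float) -> Optional[bool]:
    """Return True (contaminated), False (clean), or None (inconclusive)."""
    if score >= HIGH:
        return True
    if score < LOW:
        return False
    return None


def accuracy(cases: List[Tuple[float, bool]],
             inconclusive_counts_as_wrong: bool) -> float:
    """Fraction of correct verdicts under one of two counting conventions."""
    correct = 0
    total = 0
    for score, truth in cases:
        verdict = codec_verdict(score)
        if verdict is None:
            # Inconclusive: either count as an error or drop from the tally.
            if inconclusive_counts_as_wrong:
                total += 1
            continue
        total += 1
        correct += int(verdict == truth)
    return correct / total if total else float("nan")


# Made-up (score, ground truth) pairs, for illustration only.
cases = [(0.9, True), (0.7, True), (0.65, False), (0.3, False), (0.85, False)]
strict = accuracy(cases, inconclusive_counts_as_wrong=True)
lenient = accuracy(cases, inconclusive_counts_as_wrong=False)
print(f"strict={strict:.2f} lenient={lenient:.2f}")
```

On this toy data the same five scores give 2 of 5 correct under the strict convention and 2 of 3 under the lenient one, which is why an aggregate such as 199/335 depends on how inconclusive results are tallied.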

3. Potential Impact

Practical impact: This work is directly relevant to the growing community of benchmark maintainers, leaderboard operators, and model auditors who might rely on these statistical tools. The finding that LLM DI produces false positives when benchmark splits differ in difficulty is particularly important, as this is precisely how most practitioners would attempt to use the method (treating a test split as validation for the train split). This could prevent premature deployment of unreliable auditing pipelines.

Research impact: The paper provides a clear benchmark (open-sourced) for future contamination detection research. It establishes concrete desiderata that new methods must satisfy: robustness to distribution shift, functionality at benchmark scale, and split-level discrimination capability. This should redirect research effort toward addressing these specific failure modes.

Policy implications: The conclusion that "statistical detection cannot yet replace transparent data provenance" has implications for AI governance and evaluation standards, supporting calls for mandatory training data disclosure.

4. Timeliness & Relevance

This paper addresses an acute need. Benchmark contamination is arguably the most pressing validity threat to LLM evaluation. Multiple high-profile contamination incidents have been documented, and the community increasingly relies on statistical tools to audit models with opaque training data. The timing is particularly relevant given:

The proliferation of instruction-tuned models with complex, multi-stage training pipelines

Growing benchmark saturation that makes distinguishing capability from memorization critical

Increasing regulatory interest in AI evaluation integrity

5. Strengths & Limitations

Key Strengths:

Comprehensive evaluation scope: 27 models, multiple families (Pythia, OLMo 2, PLLuM, medical LLMs, industry models), scales up to 27B

Well-defined failure taxonomy: The distribution shift and scale constraint framework provides clean conceptual handles

Diagnostic depth: Section 5.5's analysis of *why* methods fail (e.g., Post-Hoc DI's text-only classifier achieving AUC>0.78, demonstrating distributional artifacts dominate) is particularly valuable

Open-source benchmark: Enables reproducibility and future benchmarking

Honest reporting: The paper doesn't cherry-pick; it reports all outcomes including ambiguous ones
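
The text-only diagnostic praised under diagnostic depth can be approximated without any model access. The sketch below is not the paper's classifier: it uses one hypothetical feature (word count) and made-up example texts, and computes AUC with the rank-based Mann-Whitney formulation. An AUC well above 0.5 from text alone would suggest that the suspect and validation sets differ in distribution, so an apparent membership signal could be confounded.

```python
from typing import Callable, List


def auc(pos: List[float], neg: List[float]) -> float:
    """Probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def word_count(text: str) -> float:
    """Hypothetical stand-in feature: number of whitespace-separated tokens."""
    return float(len(text.split()))


def text_only_auc(suspect: List[str], validation: List[str],
                  feature: Callable[[str], float] = word_count) -> float:
    """AUC of a single text feature at separating suspect from validation."""
    return auc([feature(t) for t in suspect],
               [feature(t) for t in validation])


# Made-up texts: the "suspect" split is systematically longer.
suspect = [
    "Explain why the sky appears blue during the day.",
    "Derive the quadratic formula step by step please.",
    "What is 2 + 2?",
]
validation = ["Name a colour.", "Define entropy.", "What is 2 + 2?"]

score = text_only_auc(suspect, validation)
print(f"text-only AUC = {score:.3f}")
```

A realistic check would use a richer classifier and held-out data; the point of the sketch is only that the test needs no access to the audited model.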

Notable Limitations:

No new solutions proposed: The paper is purely diagnostic. While recommendations are given (use LLM DI only with verified IID validation sets, treat CoDeC as comparative), no algorithmic improvements are offered

Missing methods: Other contamination detection approaches (retrieval-based, behavioral, performance-based like ConStat) are mentioned but not evaluated

Statistical analysis of the meta-results: The 199/335 figure would benefit from confidence intervals or statistical analysis of what drives success/failure

Limited analysis of false negative costs vs. false positive costs: In practice, these have very different implications for auditing

The paper could more explicitly discuss what "correct" means in edge cases, particularly for CoDeC's inconclusive zone
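
The confidence-interval suggestion above can be sketched with a Wilson score interval for the headline 199/335 figure. This is a rough illustration, not an analysis from the paper: it treats the 335 evaluations as independent Bernoulli trials, which is a strong assumption given that evaluations share models and methods, and the function name is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p_hat + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)
    ) / denom
    return centre - half, centre + half


low, high = wilson_interval(199, 335)
print(f"point estimate {199 / 335:.3f}, interval [{low:.3f}, {high:.3f}]")
```

Under the independence assumption the interval runs from roughly 0.54 to 0.65. A per-method or per-task breakdown would be more informative than the pooled figure.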

Additional Observations

The paper's finding that CoDeC scores decrease with model size for OLMo 2 (Figure 2) regardless of training membership is intriguing and deserves further investigation—it suggests the method may be capturing something about model capability rather than contamination. The computational analysis (Table 14) showing CoDeC is ~36x faster than Post-Hoc DI is practically useful but underexplored.

The work would have been strengthened by including at least a preliminary attempt at improving one of the methods (e.g., distribution-shift-aware LLM DI, or scale-adaptive Post-Hoc DI), transforming it from a purely negative result into a constructive one.

Rating: 6.5 / 10

Significance 7 · Rigor 6.5 · Novelty 5.5 · Clarity 7.5

Generated Jun 3, 2026

Comparison History (18)

vs. Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?

claude-opus-4.6 · 6/6/2026

Paper 1 addresses a critical and timely security concern—AI agent sabotage in human-developer workflows—with a novel large-scale human study (100+ participants, frontier models, 5-hour realistic tasks). The finding that 94% of developers fail to detect sabotage is striking and actionable, with immediate implications for AI safety policy, tool design, and software engineering practices. Paper 2 makes a solid methodological contribution to contamination detection reliability, but its scope is narrower and more incremental. Paper 1's broader cross-disciplinary impact (AI safety, HCI, security, software engineering) and urgency given rapid AI agent adoption give it higher potential impact.

vs. Towards World Models in Biomedical Research

gpt-5.2 · 6/6/2026

Paper 1 has higher near-term scientific impact due to concrete novelty (identifying and empirically validating two key failure modes in contamination detection), strong methodological rigor (335 evaluations across 27 models plus frontier models), and immediately actionable implications for LLM evaluation practices and data governance. Its open-sourced benchmark can seed follow-on work across the broader ML community. Paper 2 is timely and potentially transformative for biomedicine, but is primarily a conceptual agenda without demonstrated methods or empirical validation, making its impact more speculative and longer-horizon.

vs. Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency

claude-opus-4.6 · 6/3/2026

Paper 1 demonstrates a novel, well-quantified finding—systematic gender-dependent diagnostic substitution in LLM medical triage—with immediate real-world patient safety implications. It reveals a concrete, actionable bias mechanism (epidemiological priors suppressing urgency) across multiple frontier models, directly relevant to the rapidly growing deployment of AI in healthcare. While Paper 2 makes a solid methodological contribution regarding contamination detection reliability, it primarily reveals limitations of existing tools without proposing solutions. Paper 1's findings are more likely to drive policy changes, model design improvements, and cross-disciplinary attention from both AI and medical communities.

vs. Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact due to its direct relevance to a core, timely problem in LLM evaluation (benchmark contamination) with broad implications for academia and industry. It offers a systematic, large-scale empirical study across many model families (335 evaluations, incl. frontier models), identifies concrete failure modes (distribution shift, scale), and provides actionable conclusions (limits of current auditing; need for provenance) plus an open-source benchmark. Paper 1 is novel for mechanistic multi-agent simulation, but its applications and validation appear narrower and more exploratory.

vs. Effect of Demographic Bias on Skin Lesion Classification

claude-opus-4.6 · 6/3/2026

Paper 2 addresses a critical and timely problem in LLM evaluation—benchmark contamination detection—with broad implications for the entire AI community. Its systematic evaluation across 335 evaluations, 27 models, and multiple detection paradigms reveals fundamental reliability gaps that challenge widely-used auditing tools. The identification of specific failure modes (distribution shift and scale constraints) provides actionable insights for the field. Paper 1, while methodologically sound, addresses a more incremental question about demographic bias in skin lesion classification with relatively expected findings. Paper 2's open-sourced benchmark and relevance to the rapidly growing LLM ecosystem give it substantially broader impact potential.

vs. DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts

claude-opus-4.6 · 6/3/2026

DAG-MoE introduces a novel structural aggregation mechanism for MoE models that expands the expert-combination space through DAG-based aggregation, backed by theoretical analysis and extensive experiments. This addresses a core scalability challenge in LLM architecture design with broad applicability. While Paper 2 provides valuable empirical insights about contamination detection reliability, it primarily identifies limitations of existing methods without proposing solutions. Paper 1's architectural innovation has higher potential for widespread adoption and follow-on research in the rapidly growing MoE landscape.

vs. TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation

gemini-3.1 · 6/3/2026

Paper 1 addresses a critical foundational issue in LLM research: benchmark contamination. By exposing the failure modes of current detection methods across numerous models, it impacts how the entire field evaluates LLMs, making it highly relevant and broadly applicable. While Paper 2 presents a solid technical solution for multimodal hallucinations, Paper 1's findings on the fundamental limits of current evaluation paradigms have broader and more urgent implications for AI safety, benchmarking validity, and training transparency.

vs. Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight

gemini-3.1 · 6/3/2026

Paper 2 addresses benchmark contamination, a foundational crisis threatening the validity of LLM evaluation field-wide. By exposing severe limitations in current contamination detection tools under realistic conditions, it has broad and immediate implications for how the community measures model capabilities. Paper 1 offers a valuable methodological advance in scalable oversight, but Paper 2's focus on the integrity of the evaluation paradigm itself gives it higher potential for widespread scientific impact and necessitates urgent methodological shifts.

vs. Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact due to timeliness and broad relevance: reliable benchmark auditing and contamination detection affect nearly all LLM evaluation across academia and industry. It identifies concrete, under-studied failure modes (distribution shift, scale), provides large systematic evidence across many model families including frontier models, and releases an open benchmark—supporting reproducibility and follow-on work. Paper 1 is novel and valuable for understanding multimodal interaction, but its impact is narrower (focused on MLLMs) and the PID-guided performance gains appear preliminary compared to Paper 2’s immediate implications for evaluation validity.

vs. DELTAMEM: Incremental Experience Memory for LLM Agents via Residual Trees

gemini-3.1 · 6/3/2026

Paper 1 addresses a critical and fundamental issue in the LLM field: benchmark contamination and the unreliability of current detection methods. Because accurate evaluation is the bedrock of LLM progress, exposing these failure modes impacts the entire NLP community. Paper 2 presents a solid architectural contribution for LLM agents, but its scope is limited to a specific subfield, making Paper 1's findings broader and more foundational.

vs. Decomposing how prompting steers behavior

gpt-5.2 · 6/3/2026

Paper 2 offers a novel, broadly applicable mechanistic framework for how prompting changes internal representations, with causal tests across multiple LLMs/VLMs and tasks. Its decomposition (from translation to nonlinear maps) yields interpretable, general insights (e.g., affine mixing as a key mechanism) that can influence prompting theory, representation learning, interpretability, and model steering—high breadth and timeliness. Paper 1 is important and timely for evaluation integrity, but is primarily a diagnostic/limitations study of existing contamination detectors; its impact is more specialized and less conceptually generative than Paper 2’s mechanistic, cross-domain methodology.

vs. Subliminal Learning Is Steering Vector Distillation

gpt-5.2 · 6/3/2026

Paper 2 offers a more novel, mechanistic explanation of a surprising alignment/safety phenomenon (subliminal learning) by reducing it to steering-vector distillation, plus optimizer-dependent gradient evidence. This creates actionable implications for model editing, distillation safety, and interpretability, with potential broad impact across alignment, transfer learning, and mechanistic interpretability. Paper 1 is timely and valuable as a large-scale audit showing current contamination-detection methods fail under distribution shift/scale, but it is primarily diagnostic/benchmarking and may have narrower conceptual novelty than Paper 2’s unifying mechanism.

vs. PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models

gpt-5.2 · 6/3/2026

Paper 1 has higher likely impact due to strong timeliness and broad relevance: it interrogates the validity of LLM evaluation itself, identifying concrete failure modes (distribution shift, scale) that affect many existing contamination-detection methods and thus the credibility of benchmarks across domains. Its multi-model, multi-paradigm empirical study (335 evaluations, including frontier models) and open-sourced benchmark support methodological rigor and reuse. Paper 2 is useful for math evaluation and incremental performance gains, but resembles many benchmark-plus-training-module works and has narrower cross-field implications.

vs. Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection

gpt-5.2 · 6/3/2026

Paper 2 likely has higher scientific impact: it targets an urgent, widely relevant problem (LLM benchmark validity), identifies concrete failure modes (distribution shift, scale), and backs claims with broad, systematic evaluation (27 models, 335 runs) plus an open-source benchmark, enabling follow-on work. Its conclusions affect evaluation methodology across many LLM subfields and industry practice. Paper 1 is novel in probing “natural experiments” in datasets via causal feature selection, but appears more preliminary, with narrower immediate applications and heavier dependence on causal discovery assumptions that can limit robustness and adoption.

vs. ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning

gemini-3.1 · 6/3/2026

Paper 1 addresses a foundational issue in LLM evaluation—benchmark contamination—which impacts the validity of research across the entire field of AI. Its comprehensive evaluation exposing critical flaws in current detection methods provides an essential baseline for future work. Paper 2 presents a valuable applied contribution to healthcare AI by combining EHRs with LLMs, but its impact is much narrower in scope compared to the field-wide relevance of ensuring valid LLM assessment presented in Paper 1.

vs. Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

gpt-5.2 · 6/3/2026

Paper 2 is likely higher impact because it addresses a cross-cutting, timely problem—benchmark contamination auditing—that affects the validity of results across essentially all LLM research and deployment. It contributes broad empirical evidence (335 evaluations across 27 models plus frontier models) identifying concrete failure modes (distribution shift, benchmark-scale limits) and limitations of widely used detection paradigms, with an open-source benchmark enabling follow-on work. Paper 1 is innovative and practically useful for retrieval agents, but its impact is more domain-specific and dependent on adoption of a particular harness/RL framework.

vs. Subliminal Learning is a LoRA Artifact

gemini-3.1 · 6/3/2026

Paper 1 addresses a widespread, critical issue in AI—benchmark contamination and the validity of LLM evaluation. By demonstrating that current statistical detection methods fail in realistic scenarios across hundreds of evaluations, it has profound implications for data provenance and model assessment. Paper 2, while methodologically rigorous, serves primarily as a corrective note debunking a specific, niche phenomenon (subliminal learning) as a LoRA artifact, resulting in a narrower overall scientific impact.

vs. Leveraging BART to Assess CS1 C++ Programming Assignments using Rubric-based Criteria

gpt-5.2 · 6/3/2026

Paper 2 has higher impact potential due to its broad, timely relevance to LLM evaluation integrity and policy: it exposes systematic failure modes in contamination detection under realistic auditing constraints (distribution shift, benchmark scale). It evaluates multiple leading methods across many model families and includes frontier models, providing strong empirical rigor and actionable conclusions (limits of current statistical auditing, need for provenance). Its implications span ML evaluation, benchmarking, safety, and governance. Paper 1 is useful and applied, but its impact is narrower (CS1 autograding) and more incremental methodologically.