LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

HuiMing Fan, Xiao Wang, Zheng Chu, Qianyu Wang, Zhuoyao Wang, Ming Liu, Bing Qin, XingYu

May 27, 2026

arXiv:2605.28721v1

cs.AI (primary)

#463 of 2682 · Artificial Intelligence

Tournament Score: 1484 ± 48 (scale 1050 to 1800)

Win Rate: 64%

Rating: 7.5 / 10

Significance 8 · Rigor 7.5 · Novelty 7.5 · Clarity 8.5

Abstract

Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp with three diagnostics. Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge -- information encoded in the model before retrieval -- rather than on external evidence. Agents answer up to 44.5% of BrowseComp questions without tools, generate more than half of their search queries from internally produced hypotheses rather than retrieved leads, and perform worse than closed-book baselines when answer-supporting evidence is removed. These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery, conflating what agents already know with what they can find. We then introduce LiveBrowseComp, a deep-search benchmark designed to evaluate agents beyond intrinsic coverage. It contains 335 human-authored questions whose answers depend on facts published within the 90 days preceding benchmark construction, drawn from six updated sources and filtered to exclude globally salient events. On LiveBrowseComp, all evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25-40 points relative to BrowseComp, and prior model rankings no longer reliably predict performance. LiveBrowseComp is available at https://huggingface.co/datasets/Forival/LiveBrowseComp.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: LiveBrowseComp

1. Core Contribution

This paper makes two interrelated contributions. First, it identifies and formally characterizes Intrinsic Knowledge Dependence (IKD) — a systematic confound in search-agent benchmarks where agents succeed by generating hypotheses from parametric memory and using search merely for confirmation, rather than genuinely discovering new information. Second, it introduces LiveBrowseComp, a 335-question benchmark designed to suppress IKD by requiring answers grounded in facts published within 90 days of construction, drawn from long-tail events across six structured data sources.

The IKD concept is the paper's most intellectually significant contribution. It goes beyond standard data contamination concerns: even without literal leakage, models with broad world knowledge can answer many "search" benchmark questions closed-book (up to 44.5% pass@4 on BrowseComp). The paper distinguishes between *knowing what to search for* and *discovering what is not already known* — a distinction that has been insufficiently appreciated in the search-agent evaluation literature.

2. Methodological Rigor

The diagnostic framework is well-designed with three complementary experiments that progressively isolate the role of retrieval:

Closed-book coverage (Q1) provides a clean lower bound on parametric knowledge overlap with benchmarks. The results across 24 model-benchmark pairs (average pass@4 of 38.9) are striking and convincing.
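
As a rough illustration of the closed-book metric, the sketch below computes pass@k under the assumption that a question counts as solved when at least one of its first k sampled answers is judged correct. The function name, the data layout, and the toy outcomes are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int = 4) -> float:
    """Fraction of questions with at least one correct answer in the first k attempts."""
    if not attempts:
        raise ValueError("no questions supplied")
    solved = 0
    for question_id, outcomes in attempts.items():
        if len(outcomes) < k:
            raise ValueError(f"{question_id} has fewer than {k} attempts")
        # A question is solved if any of its first k attempts was judged correct.
        if any(outcomes[:k]):
            solved += 1
    return solved / len(attempts)


# Made-up closed-book outcomes (no tools), four samples per question.
closed_book = {
    "q1": [False, True, False, False],
    "q2": [False, False, False, False],
    "q3": [True, True, False, True],
    "q4": [False, False, False, False],
}

print(pass_at_k(closed_book, k=4))  # 0.5
```

Because the any-of-k definition can only rise as k grows, a pass@4 figure is an optimistic estimate of what the model can recall compared with single-sample accuracy.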

Evidence-blocked search (Q2) uses BrowseComp-Plus's annotated document library to surgically remove answer-supporting evidence while preserving the search interface. The consistent reversal — every model performs *worse* than closed-book when evidence is blocked — is a powerful finding that demonstrates agents cannot gracefully degrade when confirmation fails.
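
The evidence-blocking idea can be sketched as a filter applied to the corpus before retrieval, so the agent keeps the same search interface but can no longer reach the annotated supporting documents. Everything below is an assumption for illustration: the `Document` type, the toy word-overlap scorer (standing in for the dense retriever), and the made-up corpus.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str


def block_evidence(corpus: List[Document], evidence_ids: Set[str]) -> List[Document]:
    """Return the corpus without the documents annotated as answer-supporting."""
    return [doc for doc in corpus if doc.doc_id not in evidence_ids]


def search(corpus: List[Document], query: str, top_k: int = 3) -> List[Document]:
    """Toy retriever: rank documents by the number of shared lowercase words."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.text.lower().split())), doc) for doc in corpus]
    scored = [(score, doc) for score, doc in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1].doc_id))
    return [doc for _, doc in scored[:top_k]]


# Made-up corpus; d1 plays the role of the annotated evidence document.
corpus = [
    Document("d1", "the festival award went to director ana ruiz"),
    Document("d2", "festival schedule and venue information"),
    Document("d3", "ana ruiz biography early films"),
]
query = "festival award director"

full_hits = [doc.doc_id for doc in search(corpus, query)]
blocked_hits = [doc.doc_id for doc in search(block_evidence(corpus, {"d1"}), query)]
print(full_hits)     # ['d1', 'd2']
print(blocked_hits)  # ['d2']
```

The search function is untouched in both conditions; only the corpus differs, which is what lets the comparison isolate the contribution of the supporting evidence.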

Trajectory grounding analysis (Q3) traces query provenance, showing >50% of queries are model-originated and evidence utilization rates remain below 33%. This mechanistic analysis strengthens the causal story.
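
One way to picture query-provenance labeling is the token-overlap heuristic below. It is only a sketch under stated assumptions and is not the paper's annotation procedure: a query whose new terms all appeared in earlier retrieved text is called evidence-derived, and a query that introduces terms seen neither in the question nor in earlier results is called model-originated. The label names, stopword list, and example trajectory are hypothetical.

```python
import re
from collections import Counter
from typing import List, Set, Tuple

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or",
             "which", "who", "what", "was", "is", "to", "as"}


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens with a few stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def label_queries(question: str, steps: List[Tuple[str, str]]) -> List[str]:
    """Label each (query, retrieved_text) step using only text seen before that step."""
    question_tokens = tokens(question)
    seen_evidence: Set[str] = set()
    labels = []
    for query, retrieved_text in steps:
        novel = tokens(query) - question_tokens
        if not novel:
            labels.append("question-derived")
        elif novel <= seen_evidence:
            labels.append("evidence-derived")
        else:
            labels.append("model-originated")
        # Results of this step become available evidence for later steps.
        seen_evidence |= tokens(retrieved_text)
    return labels


# Made-up trajectory with fictional names.
question = "Which director won the regional festival award?"
steps = [
    ("regional festival award director", "The jury named Ana Ruiz as the winner."),
    ("Ana Ruiz festival award", "Ana Ruiz accepted the prize in person."),
    ("Marta Leon festival award", "No matching results."),
]

labels = label_queries(question, steps)
print(labels)
print(Counter(labels)["model-originated"] / len(labels))
```

A lexical heuristic like this misses paraphrases and aliases, so a real analysis would likely need entity linking or human or model-based annotation.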

The benchmark construction pipeline is thorough, with temporal filtering, long-tail scoring, answer stability checks, multi-stage human verification (including uniqueness testing via multi-model rollouts), and difficulty calibration. The five-stage pipeline with three independent verifiers per check plus a fourth cross-checker is rigorous, though expensive.
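
The temporal filtering step can be sketched as a simple date-window check. This is a minimal illustration that assumes the 90-day window is inclusive at both ends; the construction date and candidate dates are invented for the example and do not come from the paper.

```python
from datetime import date, timedelta


def within_freshness_window(published: date, construction: date,
                            window_days: int = 90) -> bool:
    """True if `published` falls in the window_days preceding `construction` (inclusive)."""
    earliest = construction - timedelta(days=window_days)
    return earliest <= published <= construction


# Hypothetical construction date; 90 days earlier is 2026-01-31.
construction_date = date(2026, 5, 1)
candidates = {
    "recent fact": date(2026, 4, 10),
    "boundary fact": date(2026, 1, 31),
    "stale fact": date(2026, 1, 30),
    "future-dated fact": date(2026, 5, 2),
}

kept = [name for name, published in candidates.items()
        if within_freshness_window(published, construction_date)]
print(kept)  # ['recent fact', 'boundary fact']
```

Recency alone does not guarantee that a fact is outside a model's knowledge, which is presumably why the pipeline pairs this filter with long-tail scoring and closed-book checks.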

One methodological strength is the human calibration study: human searchers achieve nearly identical solve rates on BrowseComp (30%) and LiveBrowseComp (31%), with matching time distributions. This elegantly controls for the alternative explanation that LiveBrowseComp is simply harder.

However, several limitations exist. The dense retrieval setup for evidence-blocking uses a single embedding model (Qwen3-8B), and retrieval quality could affect results. The unified scaffold approach, while ensuring fairness, may disadvantage models optimized for different interaction protocols. The 256k context limit without summarization strategies likely depresses absolute scores.

3. Potential Impact

Benchmark design influence: The paper's central argument — that static benchmarks conflate memory with search capability — could reshape how the community designs and interprets search-agent evaluations. The temporal anchoring methodology provides a concrete template for "living" benchmarks that resist knowledge absorption.

Training signal implications: The finding that agents generate >50% of queries from internal hypotheses and use retrieved evidence <33% of the time has direct implications for search-agent training. It suggests current RL/SFT pipelines may inadvertently reward guess-and-verify strategies over evidence-driven exploration.

Broader evaluation methodology: IKD extends beyond search agents to any evaluation where models might already know the answer through parametric knowledge — a growing concern as foundation models become more capable. The diagnostic toolkit (closed-book baselines, evidence blocking, trajectory analysis) is transferable.

Practical limitations on impact: The benchmark's 335-question size is modest, the 90-day freshness window means it requires periodic reconstruction, and the labor-intensive construction process (professional annotators at ~$9.60/hr with multi-stage verification) limits scalability.

4. Timeliness & Relevance

This paper addresses a critical and timely problem. As frontier models rapidly improve on BrowseComp (scores now reaching 70-80%), the community needs to understand *why*. The paper provides evidence that a substantial portion of this improvement reflects growing parametric knowledge rather than improving search capabilities. With major companies deploying "Deep Research" products (OpenAI, Google, etc.), the distinction between genuine search and knowledge verification has direct commercial and scientific relevance.

The paper is also well-positioned relative to concurrent work: BrowseComp-Plus provides the annotated evidence library that enables the evidence-blocking experiments, and the growing ecosystem of live benchmarks (LiveBench, LiveCodeBench) validates the temporal refresh paradigm.

5. Strengths & Limitations

Key Strengths:

The IKD concept is clearly articulated and supported by converging evidence from three independent diagnostics

Human calibration study elegantly controls for difficulty confounds

Comprehensive model coverage (11 models spanning open and closed-source families)

The rank correlation analysis (Spearman dropping from 0.87 to 0.74) demonstrates that static rankings are misleading

Turn distribution analysis (Figure 9) provides intuitive behavioral evidence for IKD
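
For readers unfamiliar with the rank correlation cited in the strengths above, the sketch below computes a Spearman coefficient as the Pearson correlation of average ranks, using only the standard library. The two score lists are made up and do not reproduce the 0.87 or 0.74 values; they only show how a ranking on one benchmark can partially disagree with a ranking on another.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Made-up scores for five hypothetical models on a static and a live benchmark.
static_scores = [70.0, 60.0, 50.0, 40.0, 30.0]
live_scores = [35.0, 38.0, 30.0, 22.0, 25.0]

print(round(spearman(static_scores, live_scores), 3))  # 0.8
```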

Notable Weaknesses:

The 335-question dataset is relatively small and domain-skewed (52% movies/entertainment)

The paper does not propose solutions for IKD beyond the benchmark itself — no training interventions or architectural changes are explored

Temporal freshness is a moving target; the benchmark requires continuous reconstruction, and the paper doesn't discuss sustainability

The per-domain analysis (Appendix G) reveals substantial variance that the aggregate numbers obscure

Some model references (GPT-5.4, Claude Sonnet 4.6, DeepSeek-V4-Pro) appear to be from 2026, suggesting the paper evaluates models not yet publicly available at time of assessment, which creates reproducibility concerns

The causal mechanism behind IKD — why agents fail to pivot from failed hypotheses — is described but not deeply analyzed

Additional Observations:

The compression of inter-model gaps on LiveBrowseComp (16.6→10.3 points) is noteworthy: it suggests that once the memory advantage is removed, current search agents are more similar in capability than leaderboards suggest. This has implications for how the community assesses progress.

Rating: 7.5 / 10

Significance 8 · Rigor 7.5 · Novelty 7.5 · Clarity 8.5

Generated May 28, 2026

Comparison History (14)

vs. LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

claude-opus-4.6 · 5/28/2026

Paper 1 makes a fundamental contribution to how we evaluate LLM-based search agents by identifying a critical flaw (Intrinsic Knowledge Dependence) in existing benchmarks and proposing a principled solution (LiveBrowseComp). This has broad implications for the entire field of agent evaluation, affecting how researchers measure genuine retrieval capabilities versus memorization. Paper 2 presents a useful technical contribution (LaneRoPE) for parallel reasoning, but it is more incremental—extending RoPE for inter-sequence attention in a narrower application domain. Paper 1's diagnostic insights and new benchmark methodology are likely to influence evaluation practices across multiple research communities.

vs. AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models

gemini-3.1 · 5/28/2026

Paper 2 directly addresses a critical bottleneck in interdisciplinary research by enabling non-experts to automatically build high-performing AI models. Its evolving knowledge system and state-of-the-art results demonstrate strong practical utility. While Paper 1 provides valuable evaluation insights and a novel benchmark for search agents, Paper 2's potential to accelerate AI application and scientific discovery across diverse domains (biology, physics, chemistry) gives it broader and more immediate real-world scientific impact.

vs. DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

gpt-5.2 · 5/28/2026

Paper 2 has higher likely impact due to broader relevance and timeliness: it targets a widely deployed capability (web/search agents), introduces clear diagnostics (Intrinsic Knowledge Dependence), and releases a benchmark explicitly designed to avoid contamination via recent, non-salient facts—addressing a central evaluation failure mode for LLM tooling. Its implications span IR, NLP, agent evaluation, and AI safety/robustness. Paper 1 is methodologically strong but narrower (DFJSP scheduling) and its benchmark impact is more domain-specific, limiting cross-field adoption.

vs. Not all uncertainty is alike: volatility, stochasticity, and exploration

gpt-5.2 · 5/28/2026

Paper 2 has higher potential scientific impact due to a more fundamental, generalizable theoretical contribution: it formally distinguishes volatility vs. stochasticity and proves they have opposite effects on optimal exploration, extending Gittins-index ideas to Gaussian state-space (restless) bandits and deriving a principled closed-form bonus (CAUSE). This spans machine learning, control, neuroscience, and computational psychiatry, with clear downstream algorithmic and empirical implications. Paper 1 is timely and useful for evaluating LLM search agents, but its impact is more benchmark-/domain-specific and may age with rapid tooling and model changes.

vs. Calibrating Conservatism for Scalable Oversight

claude-opus-4.6 · 5/28/2026

Paper 1 introduces a novel theoretical framework (CCO) combining Conformal Decision Theory with conservatism calibration for AI oversight—a critical safety problem. It provides finite-time statistical guarantees without distributional assumptions and demonstrates practical effectiveness across multiple benchmarks. This addresses fundamental AI alignment/control challenges with broad implications. Paper 2 makes a valuable diagnostic contribution revealing LLM search agents' reliance on intrinsic knowledge and introduces a useful benchmark, but its impact is more narrowly focused on evaluation methodology and is more incremental. Paper 1's theoretical depth and safety relevance give it higher long-term impact potential.

vs. Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

claude-opus-4.6 · 5/28/2026

Both papers expose critical flaws in how AI benchmarks conflate distinct capabilities. Paper 2 has higher impact because: (1) it addresses the rapidly growing and commercially important area of LLM search agents, with broader relevance; (2) it provides both a diagnostic framework AND a concrete, publicly available benchmark (LiveBrowseComp) that can be immediately adopted; (3) the finding that search agents rely on intrinsic knowledge rather than genuine retrieval has immediate practical implications for product development and safety; (4) the timeliness is stronger given the explosion of agentic AI systems. Paper 1's composition collapse finding is valuable but more niche in scope.

vs. Measuring Progress Toward AGI: A Cognitive Framework

gemini-3.1 · 5/28/2026

Paper 2 addresses a critical and immediate flaw in LLM evaluation: agents relying on intrinsic knowledge rather than actual search. By providing rigorous empirical diagnostics and introducing a dynamic, actionable benchmark (LiveBrowseComp), it offers immediate utility to the highly active field of AI agents. While Paper 1 presents an interesting conceptual framework for AGI, it is primarily theoretical. Paper 2's concrete methodology, dataset, and relevance to current evaluation bottlenecks make it highly likely to see rapid adoption and citations.

vs. Reward Hacking in Rubric-Based Reinforcement Learning

gemini-3.1 · 5/28/2026

Paper 1 addresses reward hacking, a fundamental and critical bottleneck in reinforcement learning for language models. By dissecting the causes of divergence (verifier failure vs. rubric limitations) and introducing a novel verifier-free diagnostic, it provides deep theoretical and practical insights into model alignment. While Paper 2 offers a valuable benchmark for search agents, Paper 1 tackles core training dynamics that impact the broader development and safety of frontier AI models across multiple domains.

vs. CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

claude-opus-4.6 · 5/28/2026

Paper 1 addresses a fundamental and timely question about whether LLM search agents genuinely search or merely verify intrinsic knowledge. It introduces a concrete diagnostic (IKD) and a practical benchmark (LiveBrowseComp) that exposes critical limitations in existing evaluation paradigms. Its findings have broad implications for the rapidly growing field of LLM-based agents and search evaluation, affecting how the community benchmarks and develops these systems. Paper 2 (CausaLab) is rigorous and innovative in evaluating causal reasoning, but targets a narrower audience. Paper 1's relevance to the widely-deployed search agent ecosystem gives it greater immediate and broad impact.

vs. Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

gpt-5.2 · 5/28/2026

Paper 1 has higher potential impact: it introduces a general, mechanistic framework (agent “aging”) and a longitudinal benchmark (AgingBench) applicable to many deployed agent systems beyond web search, directly targeting reliability over time—a key real-world deployment bottleneck. Its taxonomy (compression/interference/revision/maintenance aging) plus diagnostic tooling (temporal dependency graphs, counterfactual probes) suggests actionable, stage-targeted repairs, indicating strong methodological contribution and broad relevance across memory-augmented agents, continual operation, and MLOps. Paper 2 is timely and useful but narrower (search/browsing) and primarily benchmark-refresh oriented.

vs. Can LLMs Introspect? A Reality Check

gemini-3.1 · 5/28/2026

While both papers critically evaluate current LLM capabilities, Paper 2 introduces a novel benchmark (LiveBrowseComp) that addresses a critical flaw in how search agents are evaluated. Given the immense current focus on RAG and autonomous web agents, a rigorous dataset to separate intrinsic knowledge from actual search capabilities offers immediate, highly practical utility for researchers. This benchmark is likely to see broader adoption and citation than the more theoretical cognitive arguments presented in Paper 1.

vs. Let Relations Speak: An End-to-End LLM-GNN Soft Prompt Framework for Fraud Detection

claude-opus-4.6 · 5/28/2026

LiveBrowseComp addresses a fundamental evaluation flaw in LLM-based search agents—revealing that agents verify intrinsic knowledge rather than genuinely searching. This insight has broad implications for the entire LLM evaluation community, challenging assumptions underlying popular benchmarks. The paper introduces a principled, reproducible benchmark with a clear methodology (temporal filtering, recency constraints) that can reshape how search-augmented LLMs are evaluated. Paper 1 contributes incrementally to fraud detection with a soft-prompt LLM-GNN framework, but operates in a narrower domain with less paradigm-shifting potential.

vs. Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

gpt-5.2 · 5/28/2026

Paper 1 is more likely to have higher scientific impact due to a concrete, novel diagnostic framing (Intrinsic Knowledge Dependence) backed by quantitative experiments, and a new benchmark (LiveBrowseComp) that directly addresses a timely evaluation failure mode in LLM search/RAG agents. It provides an immediately usable dataset and shows rank-order reversals, which can reshape how the field evaluates “web-enabled” agents. Paper 2 is important and broad, especially for deployment ethics and policy, but is primarily a conceptual/reporting framework with less methodological novelty and fewer empirically testable artifacts.

vs. Multi-Stakeholder LLM Alignment: Decomposing Estimation from Aggregation

gpt-5.2 · 5/28/2026

Paper 1 likely has higher impact: it introduces a new benchmark (LiveBrowseComp) addressing a timely and widely relevant failure mode in LLM search agents (intrinsic-knowledge verification vs evidence-driven search), with clear diagnostics and actionable evaluation implications. Benchmarks tend to shape research directions and provide durable infrastructure used across labs and products. Its applications span retrieval-augmented generation, agentic systems, and evaluation methodology. Paper 2 presents a valuable conceptual decomposition for multi-stakeholder alignment, but appears narrower in immediate ecosystem leverage without a comparable community-wide artifact or demonstrated broad downstream adoption vector.