LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?
HuiMing Fan, Xiao Wang, Zheng Chu, Qianyu Wang, Zhuoyao Wang, Ming Liu, Bing Qin, XingYu
Abstract
Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp with three diagnostics. Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge -- information encoded in the model before retrieval -- rather than on external evidence. Agents answer up to 44.5% of BrowseComp questions without tools, generate more than half of their search queries from internally produced hypotheses rather than retrieved leads, and perform worse than closed-book baselines when answer-supporting evidence is removed. These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery, conflating what agents already know with what they can find. We then introduce LiveBrowseComp, a deep-search benchmark designed to evaluate agents beyond intrinsic coverage. It contains 335 human-authored questions whose answers depend on facts published within the 90 days preceding benchmark construction, drawn from six updated sources and filtered to exclude globally salient events. On LiveBrowseComp, all evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25-40 points relative to BrowseComp, and prior model rankings no longer reliably predict performance. LiveBrowseComp is available at https://huggingface.co/datasets/Forival/LiveBrowseComp.
AI Impact Assessments
Scientific Impact Assessment: LiveBrowseComp
1. Core Contribution
This paper makes two interrelated contributions. First, it identifies and formally characterizes Intrinsic Knowledge Dependence (IKD) — a systematic confound in search-agent benchmarks where agents succeed by generating hypotheses from parametric memory and using search merely for confirmation, rather than genuinely discovering new information. Second, it introduces LiveBrowseComp, a 335-question benchmark designed to suppress IKD by requiring answers grounded in facts published within 90 days of construction, drawn from long-tail events across six structured data sources.
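As a rough illustration of the 90-day freshness constraint described above, the following sketch keeps only facts published inside the window. The function names, the inclusive boundaries, and the construction date are assumptions made for the example; the paper's actual filtering pipeline is not shown in this assessment.

```python
from datetime import date, timedelta

# Window length taken from the text; everything else here is hypothetical.
FRESHNESS_WINDOW_DAYS = 90


def is_fresh(published: date, construction: date,
             window_days: int = FRESHNESS_WINDOW_DAYS) -> bool:
    """Return True if `published` lies in the window before `construction`.

    Both ends of the window are treated as inclusive (an assumption).
    """
    earliest = construction - timedelta(days=window_days)
    return earliest <= published <= construction


def filter_fresh(facts, construction):
    """Keep the ids of (fact_id, published_date) pairs inside the window."""
    return [fid for fid, published in facts if is_fresh(published, construction)]


# Illustrative construction date and facts, not taken from the paper.
construction_date = date(2026, 5, 1)
facts = [
    ("a", date(2026, 4, 20)),   # 11 days old: kept
    ("b", date(2025, 12, 1)),   # well outside the window: dropped
    ("c", date(2026, 2, 1)),    # 89 days old: kept
]
fresh_ids = filter_fresh(facts, construction_date)
print(fresh_ids)
```

A date filter of this kind only enforces recency; the long-tail and salience filtering the paper describes would need separate checks.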
The IKD concept is the paper's most intellectually significant contribution. It goes beyond standard data contamination concerns: even without literal leakage, models with broad world knowledge can answer many "search" benchmark questions closed-book (up to 44.5% pass@4 on BrowseComp). The paper distinguishes between *knowing what to search for* and *discovering what is not already known* — a distinction that has been insufficiently appreciated in the search-agent evaluation literature.
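To make the closed-book pass@4 figure concrete, here is a minimal sketch of one common way to compute pass@k: a question counts as solved if any of its first k tool-free attempts is judged correct. Whether the paper uses this "any of k" definition or an unbiased estimator is not stated here, so the definition, the function name, and the toy data are all assumptions.

```python
def pass_at_k(attempts_per_question, k=4):
    """Fraction of questions with at least one correct answer in k attempts.

    `attempts_per_question` is a list of lists of booleans, one inner list
    per question, where True marks an attempt judged correct.
    """
    if not attempts_per_question:
        raise ValueError("need at least one question")
    solved = 0
    for attempts in attempts_per_question:
        if len(attempts) < k:
            raise ValueError("every question needs at least k attempts")
        if any(attempts[:k]):
            solved += 1
    return solved / len(attempts_per_question)


# Toy closed-book results for four questions (illustrative only).
closed_book = [
    [False, False, True, False],   # solved on the third attempt
    [False, False, False, False],  # never solved
    [True, True, True, True],      # always solved
    [False, False, False, False],  # never solved
]
score = pass_at_k(closed_book, k=4)
print(f"closed-book pass@4 = {score:.1%}")
```

Under this definition a high closed-book pass@4 means the model can reach the answer from memory in at least one of four tries, which is the signal the IKD argument relies on.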
2. Methodological Rigor
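The diagnostic framework is well designed, with three complementary experiments that progressively isolate the role of retrieval: closed-book baselines, trajectory analysis of where search queries originate, and evidence blocking.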
The benchmark construction pipeline is thorough, with temporal filtering, long-tail scoring, answer stability checks, multi-stage human verification (including uniqueness testing via multi-model rollouts), and difficulty calibration. The five-stage pipeline with three independent verifiers per check plus a fourth cross-checker is rigorous, though expensive.
One methodological strength is the human calibration study: human searchers achieve nearly identical solve rates on BrowseComp (30%) and LiveBrowseComp (31%), with matching time distributions. This elegantly controls for the alternative explanation that LiveBrowseComp is simply harder.
However, several limitations exist. The dense retrieval setup for evidence-blocking uses a single embedding model (Qwen3-8B), and retrieval quality could affect results. The unified scaffold approach, while ensuring fairness, may disadvantage models optimized for different interaction protocols. The 256k context limit without summarization strategies likely depresses absolute scores.
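The evidence-blocking setup can be pictured as a wrapper around the retriever that hides documents annotated as answer-supporting before results reach the agent. The sketch below assumes a ranked list of document ids and a set of annotated evidence ids; the names are hypothetical and the paper's actual retrieval stack is not reproduced.

```python
def blocked_search(ranked_doc_ids, evidence_ids, top_k=5):
    """Return the top_k ranked documents after removing blocked evidence.

    `ranked_doc_ids` is assumed to be ordered best-first by the retriever;
    `evidence_ids` holds the documents annotated as supporting the answer.
    """
    blocked = set(evidence_ids)
    visible = [doc_id for doc_id in ranked_doc_ids if doc_id not in blocked]
    return visible[:top_k]


# Illustrative ranking in which d2 and d4 carry the answer.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
results = blocked_search(ranking, evidence_ids={"d2", "d4"}, top_k=3)
print(results)
```

Because the filter runs after ranking, the agent still sees plausible but non-supporting documents, so any remaining correct answers have to come from somewhere other than the blocked evidence.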
3. Potential Impact
Benchmark design influence: The paper's central argument — that static benchmarks conflate memory with search capability — could reshape how the community designs and interprets search-agent evaluations. The temporal anchoring methodology provides a concrete template for "living" benchmarks that resist knowledge absorption.
Training signal implications: The finding that agents generate >50% of queries from internal hypotheses and use retrieved evidence <33% of the time has direct implications for search-agent training. It suggests current RL/SFT pipelines may inadvertently reward guess-and-verify strategies over evidence-driven exploration.
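A share such as "more than half of queries come from internal hypotheses" can be computed from per-query provenance labels. The sketch below assumes each search query in a trajectory has already been labeled as "internal" (derived from a model-generated hypothesis) or "retrieved" (derived from a lead found in earlier results); the label names and the toy trajectory are hypothetical, and how the paper assigns labels is not covered here.

```python
from collections import Counter


def provenance_shares(labels):
    """Return the fraction of queries carrying each provenance label."""
    if not labels:
        raise ValueError("need at least one labeled query")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy trajectory of four labeled queries (illustrative only).
query_labels = ["internal", "internal", "retrieved", "internal"]
shares = provenance_shares(query_labels)
print(shares)
```

In this toy trajectory 75% of queries are hypothesis-driven, which is the kind of pattern the guess-and-verify concern refers to.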
Broader evaluation methodology: IKD extends beyond search agents to any evaluation where models might already know the answer through parametric knowledge — a growing concern as foundation models become more capable. The diagnostic toolkit (closed-book baselines, evidence blocking, trajectory analysis) is transferable.
Practical limitations on impact: The benchmark's 335-question size is modest, the 90-day freshness window means it requires periodic reconstruction, and the labor-intensive construction process (professional annotators at ~$9.60/hr with multi-stage verification) limits scalability.
4. Timeliness & Relevance
This paper addresses a critical and timely problem. As frontier models rapidly improve on BrowseComp (scores now reaching 70-80%), the community needs to understand *why*. The paper provides evidence that a substantial portion of this improvement reflects growing parametric knowledge rather than improving search capabilities. With major companies deploying "Deep Research" products (OpenAI, Google, etc.), the distinction between genuine search and knowledge verification has direct commercial and scientific relevance.
The paper is also well-positioned relative to concurrent work: BrowseComp-Plus provides the annotated evidence library that enables the evidence-blocking experiments, and the growing ecosystem of live benchmarks (LiveBench, LiveCodeBench) validates the temporal refresh paradigm.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Additional Observations:
The compression of inter-model gaps on LiveBrowseComp (16.6→10.3 points) is noteworthy: it suggests that once the memory advantage is removed, current search agents are more similar in capability than leaderboards suggest. This has implications for how the community assesses progress.
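The size of that compression follows directly from the two quoted gaps: the spread shrinks by about 6.3 points, or roughly 38% of its BrowseComp value. The short check below uses only the figures quoted above; the variable names are illustrative.

```python
# Inter-model gaps in accuracy points, as quoted in the assessment.
GAP_BROWSECOMP = 16.6
GAP_LIVEBROWSECOMP = 10.3

absolute_drop = GAP_BROWSECOMP - GAP_LIVEBROWSECOMP
relative_drop = absolute_drop / GAP_BROWSECOMP

print(f"absolute compression: {absolute_drop:.1f} points")
print(f"relative compression: {relative_drop:.0%}")
```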
Generated May 28, 2026
Comparison History (14)
Paper 1 makes a fundamental contribution to how we evaluate LLM-based search agents by identifying a critical flaw (Intrinsic Knowledge Dependence) in existing benchmarks and proposing a principled solution (LiveBrowseComp). This has broad implications for the entire field of agent evaluation, affecting how researchers measure genuine retrieval capabilities versus memorization. Paper 2 presents a useful technical contribution (LaneRoPE) for parallel reasoning, but it is more incremental—extending RoPE for inter-sequence attention in a narrower application domain. Paper 1's diagnostic insights and new benchmark methodology are likely to influence evaluation practices across multiple research communities.
Paper 2 directly addresses a critical bottleneck in interdisciplinary research by enabling non-experts to automatically build high-performing AI models. Its evolving knowledge system and state-of-the-art results demonstrate strong practical utility. While Paper 1 provides valuable evaluation insights and a novel benchmark for search agents, Paper 2's potential to accelerate AI application and scientific discovery across diverse domains (biology, physics, chemistry) gives it broader and more immediate real-world scientific impact.
Paper 2 has higher likely impact due to broader relevance and timeliness: it targets a widely deployed capability (web/search agents), introduces clear diagnostics (Intrinsic Knowledge Dependence), and releases a benchmark explicitly designed to avoid contamination via recent, non-salient facts—addressing a central evaluation failure mode for LLM tooling. Its implications span IR, NLP, agent evaluation, and AI safety/robustness. Paper 1 is methodologically strong but narrower (DFJSP scheduling) and its benchmark impact is more domain-specific, limiting cross-field adoption.
Paper 2 has higher potential scientific impact due to a more fundamental, generalizable theoretical contribution: it formally distinguishes volatility vs. stochasticity and proves they have opposite effects on optimal exploration, extending Gittins-index ideas to Gaussian state-space (restless) bandits and deriving a principled closed-form bonus (CAUSE). This spans machine learning, control, neuroscience, and computational psychiatry, with clear downstream algorithmic and empirical implications. Paper 1 is timely and useful for evaluating LLM search agents, but its impact is more benchmark-/domain-specific and may age with rapid tooling and model changes.
Paper 1 introduces a novel theoretical framework (CCO) combining Conformal Decision Theory with conservatism calibration for AI oversight—a critical safety problem. It provides finite-time statistical guarantees without distributional assumptions and demonstrates practical effectiveness across multiple benchmarks. This addresses fundamental AI alignment/control challenges with broad implications. Paper 2 makes a valuable diagnostic contribution revealing LLM search agents' reliance on intrinsic knowledge and introduces a useful benchmark, but its impact is more narrowly focused on evaluation methodology and is more incremental. Paper 1's theoretical depth and safety relevance give it higher long-term impact potential.
Both papers expose critical flaws in how AI benchmarks conflate distinct capabilities. Paper 2 has higher impact because: (1) it addresses the rapidly growing and commercially important area of LLM search agents, with broader relevance; (2) it provides both a diagnostic framework AND a concrete, publicly available benchmark (LiveBrowseComp) that can be immediately adopted; (3) the finding that search agents rely on intrinsic knowledge rather than genuine retrieval has immediate practical implications for product development and safety; (4) the timeliness is stronger given the explosion of agentic AI systems. Paper 1's composition collapse finding is valuable but more niche in scope.
Paper 2 addresses a critical and immediate flaw in LLM evaluation: agents relying on intrinsic knowledge rather than actual search. By providing rigorous empirical diagnostics and introducing a dynamic, actionable benchmark (LiveBrowseComp), it offers immediate utility to the highly active field of AI agents. While Paper 1 presents an interesting conceptual framework for AGI, it is primarily theoretical. Paper 2's concrete methodology, dataset, and relevance to current evaluation bottlenecks make it highly likely to see rapid adoption and citations.
Paper 1 addresses reward hacking, a fundamental and critical bottleneck in reinforcement learning for language models. By dissecting the causes of divergence (verifier failure vs. rubric limitations) and introducing a novel verifier-free diagnostic, it provides deep theoretical and practical insights into model alignment. While Paper 2 offers a valuable benchmark for search agents, Paper 1 tackles core training dynamics that impact the broader development and safety of frontier AI models across multiple domains.
Paper 1 addresses a fundamental and timely question about whether LLM search agents genuinely search or merely verify intrinsic knowledge. It introduces a concrete diagnostic (IKD) and a practical benchmark (LiveBrowseComp) that exposes critical limitations in existing evaluation paradigms. Its findings have broad implications for the rapidly growing field of LLM-based agents and search evaluation, affecting how the community benchmarks and develops these systems. Paper 2 (CausaLab) is rigorous and innovative in evaluating causal reasoning, but targets a narrower audience. Paper 1's relevance to the widely-deployed search agent ecosystem gives it greater immediate and broad impact.
Paper 1 has higher potential impact: it introduces a general, mechanistic framework (agent “aging”) and a longitudinal benchmark (AgingBench) applicable to many deployed agent systems beyond web search, directly targeting reliability over time—a key real-world deployment bottleneck. Its taxonomy (compression/interference/revision/maintenance aging) plus diagnostic tooling (temporal dependency graphs, counterfactual probes) suggests actionable, stage-targeted repairs, indicating strong methodological contribution and broad relevance across memory-augmented agents, continual operation, and MLOps. Paper 2 is timely and useful but narrower (search/browsing) and primarily benchmark-refresh oriented.
While both papers critically evaluate current LLM capabilities, Paper 2 introduces a novel benchmark (LiveBrowseComp) that addresses a critical flaw in how search agents are evaluated. Given the immense current focus on RAG and autonomous web agents, a rigorous dataset to separate intrinsic knowledge from actual search capabilities offers immediate, highly practical utility for researchers. This benchmark is likely to see broader adoption and citation than the more theoretical cognitive arguments presented in Paper 1.
LiveBrowseComp addresses a fundamental evaluation flaw in LLM-based search agents—revealing that agents verify intrinsic knowledge rather than genuinely searching. This insight has broad implications for the entire LLM evaluation community, challenging assumptions underlying popular benchmarks. The paper introduces a principled, reproducible benchmark with a clear methodology (temporal filtering, recency constraints) that can reshape how search-augmented LLMs are evaluated. Paper 1 contributes incrementally to fraud detection with a soft-prompt LLM-GNN framework, but operates in a narrower domain with less paradigm-shifting potential.
Paper 1 is more likely to have higher scientific impact due to a concrete, novel diagnostic framing (Intrinsic Knowledge Dependence) backed by quantitative experiments, and a new benchmark (LiveBrowseComp) that directly addresses a timely evaluation failure mode in LLM search/RAG agents. It provides an immediately usable dataset and shows rank-order reversals, which can reshape how the field evaluates “web-enabled” agents. Paper 2 is important and broad, especially for deployment ethics and policy, but is primarily a conceptual/reporting framework with less methodological novelty and fewer empirically testable artifacts.
Paper 1 likely has higher impact: it introduces a new benchmark (LiveBrowseComp) addressing a timely and widely relevant failure mode in LLM search agents (intrinsic-knowledge verification vs evidence-driven search), with clear diagnostics and actionable evaluation implications. Benchmarks tend to shape research directions and provide durable infrastructure used across labs and products. Its applications span retrieval-augmented generation, agentic systems, and evaluation methodology. Paper 2 presents a valuable conceptual decomposition for multi-stakeholder alignment, but appears narrower in immediate ecosystem leverage without a comparable community-wide artifact or demonstrated broad downstream adoption vector.