DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
Sixiong Xie, Zhuofan Shi, Haiyang Shen, Jiuzheng Wang, Siqi Zhong, Mugeng Liu, Chongyang Pan, Peilun Jia
Abstract
Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities from current evaluation data alone. We introduce DeepWeb-Bench, a deep research benchmark that is substantially harder than existing benchmarks for the current frontier. Difficulty comes from three properties of the data itself: each task requires massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. We represent these three sources of difficulty as four capability families (Retrieval, Derivation, Reasoning, and Calibration) and report results sliced by family. Every reference answer is accompanied by a source-provenance record with four disclosure levels and cross-source checks where available, making scores easier to audit against the underlying evidence. We evaluate DeepWeb-Bench on nine frontier models and report three findings: (1) retrieval is not the bottleneck, as retrieval failures account for only 12-14% of errors while derivation and calibration failures account for over 70%; (2) strong and weak models fail in qualitatively different ways, with strong models' errors dominated by incomplete derivation and weak models' by hallucinated precision; and (3) models exhibit genuine specialization across domains, with cross-model agreement of only rho = 0.61 and per-case disagreement reaching 18.8 percentage points. The public benchmark release includes the data, rubrics, and evaluation code.
AI Impact Assessments
Scientific Impact Assessment: DeepWeb-Bench
1. Core Contribution
DeepWeb-Bench introduces a benchmark for evaluating "deep research" capabilities of frontier LLM agents — tasks requiring extensive web evidence gathering, cross-source reconciliation, and multi-step quantitative derivation. The key design innovation is an 8×8 matrix format (entities × analytical dimensions) per task, yielding 6,400 independently scored cells across 100 tasks in six industry domains. Each cell is annotated with a four-tier source-provenance record (T1–T4) and cross-source agreement labels. The benchmark decomposes performance into four capability families: Retrieval, Derivation, Reasoning, and Calibration. The benchmark targets a genuine gap: frontier deep research products reportedly saturate existing benchmarks, so a harder, more discriminative evaluation is needed.
2. Methodological Rigor
Strengths in design: The structured matrix format enables cell-level analysis rather than holistic judgments, which is a meaningful advance over free-form report evaluation. The four-tier rubric ({0, 0.25, 0.5, 1}) with explicit rules for precise values, ranges, and "not available" answers provides reproducible scoring. The three answer types (precise, range, not available) capture important gradations in agent behavior, particularly around calibrated abstention.
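As an illustration of the cell-level scoring described above, the following is a minimal Python sketch of how one 8×8 task matrix of rubric scores might be rolled up into a task-level percentage. The function name task_score and the equal-weight mean are assumptions made for this example; the assessment does not say how the benchmark combines cell scores, and the rules that map an answer to a tier are not reproduced here.

```python
from typing import Sequence

# Rubric tiers as listed in the assessment: {0, 0.25, 0.5, 1}.
RUBRIC_TIERS = (0.0, 0.25, 0.5, 1.0)


def task_score(matrix: Sequence[Sequence[float]], size: int = 8) -> float:
    """Return the mean cell score of a size x size matrix as a percentage.

    Assumption: every cell carries equal weight. The real benchmark may
    aggregate differently.
    """
    if len(matrix) != size or any(len(row) != size for row in matrix):
        raise ValueError(f"expected a {size}x{size} matrix")
    cells = [score for row in matrix for score in row]
    for score in cells:
        if score not in RUBRIC_TIERS:
            raise ValueError(f"score {score!r} is not a rubric tier")
    return 100.0 * sum(cells) / len(cells)


# Hypothetical task: one entity fully correct, one at half credit, six at zero.
example = [[1.0] * 8, [0.5] * 8] + [[0.0] * 8 for _ in range(6)]
print(task_score(example))  # 18.75
```

Rejecting any value outside the four tiers keeps a grader bug (for example, a stray 0.3) from silently shifting the average.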
Concerns: The automated grading relies on GPT-5.5, and while the reported inter-grader agreement (κ=0.82) is reasonable, this raises circularity issues since Codex CLI + GPT-5.5 is also the top-performing evaluated model. The authors acknowledge this partially (Appendix Figure 5b) but the potential for systematic bias — even if unintentional — is a methodological vulnerability. The human validation sample of 200 cells out of 55,936 (0.36%) is quite small. Additionally, the fixed capability-family split (1-4-1-2 per task) is somewhat rigid and may not reflect natural variation across domains.
The failure-mode analysis (Table 2) is based on human annotation of 500 cells, which provides useful qualitative insights but is a relatively small sample for making strong quantitative claims about error distributions. The binary grouping into "top four" and "other five" models is coarse.
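The agreement and sampling figures cited in these concerns can be checked with a short calculation. The sketch below assumes the reported κ is an unweighted Cohen's kappa between a human grader and the automated grader; the assessment does not say which kappa variant was used, and the two label lists are invented toy data, not benchmark results.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two graders labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: sum over labels of the product of marginal rates.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Toy rubric scores for eight cells (hypothetical, for illustration only).
human = [1.0, 1.0, 0.0, 0.5, 0.0, 1.0, 0.25, 0.0]
auto = [1.0, 0.5, 0.0, 0.5, 0.0, 1.0, 0.25, 0.25]
print(round(cohens_kappa(human, auto), 3))  # 0.667

# Share of graded cells covered by the human validation sample.
print(f"{200 / 55_936:.2%}")  # 0.36%
```

Because the rubric tiers are ordered, a weighted kappa that penalizes a 0 versus 1 disagreement more than a 0.25 versus 0.5 one would be a natural alternative; the unweighted form is used here only because it is the simplest to verify by hand.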
3. Potential Impact
Immediate utility: The benchmark fills a real need for discriminating among frontier deep research agents. The 16.58-point spread between the best (33.37%) and worst (16.79%) models, with substantial headroom to 100%, suggests long-term utility before saturation.
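Diagnostic value: The three key findings are genuinely informative for the field: retrieval is not the bottleneck (retrieval failures account for only 12-14% of errors, while derivation and calibration failures account for over 70%); strong and weak models fail in qualitatively different ways (incomplete derivation versus hallucinated precision); and models exhibit genuine specialization across domains.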
Broader influence: The source-provenance framework (T1-T4) and the emphasis on auditability could influence how future benchmarks handle evidence verification. The structured matrix format could be adopted by other evaluation efforts.
4. Timeliness & Relevance
This is highly timely. Deep research products from OpenAI, Anthropic, Google, and others have launched in rapid succession (2024-2026), and existing benchmarks (BrowseComp, GAIA, DeepSearchQA) are reportedly approaching saturation. The benchmark evaluates models announced as recently as April 2026 (GPT-5.5, Claude Opus 4.7, DeepSeek V4), making it one of the most current evaluations available. The focus on quantitative financial/industry analysis reflects a high-value commercial use case where accuracy matters enormously.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Additional Observations
The paper's positioning within the benchmark evolution (web QA → deep search → deep research) is well-articulated and historically grounded. The case studies (BYD per-vehicle gross profit, Qualcomm Cloud AI 100 margin) are illustrative and concrete. The finding that models exhibit genuine domain specialization has implications for ensemble and routing strategies in production systems.
The benchmark's emphasis on calibrated abstention (the "not available" mechanism) addresses a critical but often overlooked dimension of LLM reliability. However, capping the number of not-available cells per task to prevent the benchmark from becoming "an abstention benchmark" reveals a tension between measuring calibration and maintaining score variance.
Generated May 21, 2026
Comparison History (33)
DeepWeb-Bench provides a concrete, reproducible benchmark with detailed evaluation of 9 frontier models, revealing actionable insights about failure modes (derivation vs. retrieval bottlenecks, model specialization). It offers immediately usable artifacts (data, rubrics, code) that the research community can adopt. Paper 2 introduces valuable conceptual framing for open-world evaluations but is more of a position/survey paper with a single case study (iOS app deployment). DeepWeb-Bench's empirical rigor, granular error taxonomy, and practical benchmark release give it broader and more immediate scientific utility for advancing deep research systems.
Paper 2 likely has higher scientific impact: a well-designed benchmark can rapidly shape an entire field by standardizing evaluation, exposing failure modes, and driving new methods. DeepWeb-Bench targets a timely, high-stakes capability (web-based deep research), offers auditable provenance and error taxonomies, and is broadly applicable across LLM agents, retrieval, reasoning, and safety/calibration research. Paper 1 is methodologically innovative and useful for controlled generative modeling, but its impact is narrower (guided diffusion/flow sampling under compositional rewards) and may compete within a crowded guidance-method space.
Paper 2 has higher potential scientific impact due to its critical application in high-stakes healthcare. While Paper 1 introduces a valuable benchmark for general LLM web research, Paper 2 addresses a fundamental flaw in clinical AI evaluation by moving beyond flawed behavior imitation (treating suboptimal real-time physician actions as ground truth). By utilizing hindsight-annotated labels validated by senior physicians, RealICU provides a highly rigorous framework for evaluating AI safety and reasoning. Its focus on patient outcomes, red-flag actions, and long-horizon clinical contexts addresses an urgent, life-saving need with profound interdisciplinary impact.
Paper 2 likely has higher scientific impact because a public, auditable benchmark for deep web research can become a widely adopted standard across labs, driving measurable progress and enabling comparative evaluation over time. Its methodology (capability taxonomy, provenance records, disclosure levels, cross-source checks, multi-model analysis) directly addresses a timely gap: existing benchmarks saturating for frontier systems. While Paper 1 is novel and practically useful for recurring-context agents, it is more system-specific and may see narrower adoption than a broadly applicable benchmark shaping evaluation and research agendas.
Paper 1 introduces a highly timely and rigorous benchmark for 'deep research,' a critical frontier in LLM capabilities. By identifying that derivation, rather than retrieval, is the primary bottleneck, it provides clear, actionable directions for future model development. Benchmarks that successfully differentiate frontier models typically drive significant follow-on research and widespread adoption across the field.
Paper 2 presents a population-scale foundation model trained on massive real-world healthcare data (200 million patients). Its ability to significantly improve disease prediction, healthcare expenditure forecasting, and clinical trial emulation demonstrates immense real-world utility and methodological rigor. While Paper 1 introduces a valuable AI benchmark, Paper 2's direct application to critical healthcare outcomes, its massive scale, and its breadth of impact across medicine, epidemiology, and AI give it a substantially higher potential scientific and societal impact.
Paper 2 presents a novel theoretical framework unifying three major fields—Bayesian inference, game theory, and thermodynamics—under a single variational principle. This kind of cross-disciplinary unification has enormous potential impact across physics, biology, AI, and economics. It introduces new mathematical results (connecting collective free energy to Nash equilibria, variational Harsanyi dividends) with falsifiable predictions validated across multiple systems. While Paper 1 is a solid benchmarking contribution for AI evaluation, benchmarks have limited longevity and narrower scope. Paper 2's foundational theoretical contribution has broader and more lasting scientific impact.
Paper 2 has higher potential scientific impact because it addresses a fundamental bottleneck in AI-driven science: discovering explainable, extrapolatable governing equations from data. While Paper 1 provides a valuable benchmark for LLM agents, Paper 2 offers a novel paradigm (machine collective intelligence) that significantly outperforms traditional deep neural networks in extrapolation and interpretability. Its ability to autonomously recover scientific laws without hand-crafted knowledge has immense, broad-reaching applications across physics, chemistry, biology, and other quantitative sciences, marking a paradigm shift rather than just an evaluation tool.
MIMIC represents a fundamental advance in computational biology by unifying multiple biological modalities (sequence, structure, regulation, evolution, context) into a single generative foundation model. Its applications span RNA and protein design with clinically relevant demonstrations (HBB splice correction, PD-L1/hACE2 binder design), achieving state-of-the-art across multiple tasks. This has broad impact across drug design, synthetic biology, and genomics. DeepWeb-Bench, while valuable for AI evaluation, is primarily a benchmark contribution with more limited scope—it characterizes existing model failures rather than enabling new scientific capabilities. MIMIC's methodological novelty and real-world biological applications give it substantially higher potential impact.
Paper 2 proposes a fundamental theoretical framework unifying thermodynamics, Bayesian inference, and game theory. Its breadth of impact across physics, biology, economics, and AI gives it profound, long-lasting scientific potential. While Paper 1 provides a highly useful and timely benchmark for evaluating current AI systems, benchmarks tend to be transient in fast-moving fields, whereas Paper 2 offers a foundational scientific theory with broad explanatory power for collective intelligence.
Paper 2 has higher potential impact: it addresses a timely, high-stakes question (validity of autonomous AI science) with broad relevance across AI, metascience, and research policy. Its large-scale empirical design (25,000+ runs, eight domains) and decomposition of model vs scaffold effects provide actionable conclusions (outcome metrics miss epistemic failures; scaffold tweaks insufficient; reasoning must be trained). Paper 1 is valuable as a harder deep-research benchmark, but its impact is more scoped to evaluation of web-based LLM research agents rather than the foundational epistemic reliability of AI-generated science.
Paper 1 likely has higher scientific impact: it introduces a novel, physically grounded framework unifying diffusion generation and random structure search, directly addressing a core bottleneck in materials/molecular discovery (efficient exploration of high-dimensional energy landscapes). The claimed >10× sampling efficiency and out-of-distribution effectiveness suggest strong real-world applicability to drug/materials design and broad relevance across chemistry, physics, and materials science. Paper 2 is timely and useful for LLM evaluation, but benchmarks tend to have narrower downstream scientific impact than methods enabling new molecular/crystal discoveries, and can be superseded quickly.
Paper 1 challenges a fundamental theoretical assumption in DPO, a cornerstone of modern LLM alignment, and provides a provably correct alternative (CPO). Its rigorous theoretical analysis of failure modes in existing alignment techniques offers profound, long-lasting implications for AI safety and training. While Paper 2 provides a valuable benchmark for evaluating research agents, Paper 1 introduces foundational methodological innovations that directly impact how frontier models are optimized, granting it higher potential scientific impact.
Paper 2 likely has higher scientific impact: it claims the first end-to-end autonomous discovery with experimental validation on a real optical platform, including a previously unreported physical mechanism (optical bilinear interaction) with potential implications for optical computing hardware—high novelty, clear real-world applications, and broad relevance across AI, automation, and photonics. Paper 1 is valuable and timely for evaluation methodology, but benchmarks typically have narrower downstream impact than a validated new physical mechanism and demonstrated autonomous experimentation, assuming the results are rigorous and reproducible.
Paper 2 has higher potential impact: it introduces a generative “health world model” trained on large, longitudinal, multimodal human data and demonstrates broad, clinically relevant capabilities (forecasting, risk prediction, cross-cohort transfer, and intervention simulation) with quantitative validation against endpoints and RCTs—supporting real-world applications like digital twins and decision support. Methodological scope and cross-domain biomedical relevance are wide and timely. Paper 1 is valuable for LLM evaluation rigor and benchmarking, but its primary impact is narrower (AI eval) and less directly translational than a validated physiology model.
Paper 1 presents a fundamental advance in AI-driven scientific discovery, enabling the autonomous derivation of interpretable and extrapolatable governing equations. Its impact spans across all natural sciences, addressing a core limitation of current AI models. While Paper 2 offers a valuable benchmark for LLM evaluation, Paper 1's profound implications for accelerating cross-disciplinary scientific breakthroughs give it significantly higher potential scientific impact.
Paper 2 introduces a fundamentally novel framework (GSS) that unifies two paradigms—generative diffusion models and random structure search—into a principled sampling process. This has broad, lasting impact across materials science, chemistry, and drug discovery by enabling efficient exploration of energy landscapes with >10x cost reduction. It addresses a core bottleneck in molecular/materials discovery with strong methodological innovation. Paper 1, while useful as a benchmark for AI evaluation, is more incremental—benchmarks have shorter lifespans and narrower impact, primarily within the AI/NLP community.
Paper 1 has higher potential impact: it tackles a foundational question about whether LLM scientific agents satisfy epistemic norms, using large-scale evaluation (25k+ runs) across eight domains and introducing process-level behavioral diagnostics that expose failure modes invisible to outcome-only metrics. Its conclusions directly affect how autonomous science agents should be trained, evaluated, and trusted, with broad implications for AI, philosophy of science, and scientific practice. Paper 2 is valuable and timely as an auditing-friendly benchmark, but its impact is more incremental and primarily methodological within evaluation.
MIMIC presents a fundamentally new generative multimodal foundation model for biomolecules that unifies sequence, structure, evolution, regulation, and context across nucleic acids and proteins. Its breadth of demonstrated applications—splicing prediction, RNA editing design, protein binder design, and context-dependent probing—addresses central challenges in computational biology with a single framework. The novelty of jointly modeling partially observed multimodal biological states and enabling constrained design has transformative potential across drug discovery, synthetic biology, and genomics. Paper 2, while valuable for AI benchmarking, is incremental in scope—a harder evaluation dataset for deep research agents—with narrower long-term scientific impact.
Paper 2 likely has higher impact: it introduces a large-scale foundation model trained on unprecedented nationwide claims data and demonstrates strong, broadly useful improvements across >1,000 clinical prediction tasks, external validations, expenditure forecasting, and reduced bias in target trial emulation—directly enabling real-world evidence generation with clear healthcare and regulatory applications. Methodology appears rigorous (scale studies, prospective/retrospective tests, external datasets). Paper 1 is novel and timely as an evaluation benchmark, but its real-world impact is more indirect and primarily within LLM evaluation, whereas Paper 2 can influence clinical research, health economics, epidemiology, and policy.