Zhuofan Shi, Mingzhe Ma, Lu Wang, Fangkai Yang, Pu Zhao, Yiming Guan, Youling Huang, Wei Zhang
Deep search requires agents to answer complex questions through multi-step web search, browsing, evidence comparison, and synthesis. A central challenge is deciding how to search when several directions look plausible but only some will later lead to reliable evidence. If an agent greedily follows the current best-looking direction, it may keep extending a weak continuation. If it explores without discipline, it may waste budget on disconnected trials. We propose TreeSeeker, an inference-time framework for controlled trial-and-error in deep search. TreeSeeker organizes search as branch-and-return search over tree-structured states, where each branch is a tentative direction for a sub-goal. At each round, TreeSearch reads all sub-goal trees, identifies active goals, and uses textual UCB signals of value, uncertainty, and risk to select among exploiting a promising branch, exploring an uncertain alternative, or pruning an unproductive continuation and returning to an earlier branch point. TreeMem supports this control loop by keeping evidence, uncertainty, conflicts, progress, and failure cues attached to the branches that produced them, so trial outcomes can guide later decisions. Experiments on XBench-DeepSearch, BrowseComp, and BrowseComp-ZH show that TreeSeeker consistently outperforms strong open-source baselines, suggesting that explicit branch-and-return control complements stronger reasoning and tool execution.
TreeSeeker addresses a genuine and well-motivated problem in deep search: how should an agent allocate its search budget across multiple plausible but uncertain directions during long-horizon web information seeking? The paper frames this as a branch-and-return search problem and proposes two tightly coupled components: TreeSearch, which uses a textual UCB-style rule to decide between exploiting, exploring, or pruning search branches; and TreeMem, which maintains structured, branch-local evidence states including uncertainty, conflicts, and failure cues.
The key conceptual insight is the move from flat or linearly compressed search histories to tree-structured states, where each branch is a "decision object" that can be independently evaluated, continued, or abandoned. This is a clean abstraction that bridges classical tree search (MCTS-style) and the messier reality of semantic, open-ended web search, where "branch quality" is not a scalar reward but a complex evidence state.
Strengths in experimental design: The paper evaluates on three benchmarks (XBench-DeepSearch, BrowseComp, BrowseComp-ZH) spanning English and Chinese, and reports averages over three independent runs. The controlled comparison with Flash-Searcher under both gpt-4.1 and gpt-5.2 backends is particularly valuable, as it isolates framework-level contributions from backbone model effects. The ablation study systematically removes the three core components, and the cost analysis addresses a natural concern about whether gains come simply from spending more compute.
Weaknesses: The evaluation samples only 100 instances from BrowseComp and BrowseComp-ZH due to resource constraints, which limits statistical power. No confidence intervals or significance tests are reported, only means over three runs, and some margins (e.g., 1.7–2.6 points under gpt-4.1) are small enough that their statistical significance is unclear. The textual UCB formulation, while intuitively appealing, is essentially a heuristic: it maps ordinal labels {LOW, MEDIUM, HIGH} to {0, 1, 2} and computes ψ(a) = V̂ + Û − R̂ without theoretical grounding. The connection to actual UCB theory is metaphorical rather than formal: there are no regret bounds or convergence guarantees, and the "uncertainty" signal is itself an LLM judgment rather than a statistically grounded quantity. The paper acknowledges this implicitly by calling the rule "textual UCB", but the theoretical framing may still overstate the formalism.
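To make the heuristic nature of the score concrete, here is a minimal Python sketch of the rule as this review describes it: ordinal labels mapped to {0, 1, 2} and combined as ψ(a) = V̂ + Û − R̂. The `Candidate` class, the action names, and the tie-breaking rule are assumptions made for illustration; the paper's actual prompts and selection logic may differ.

```python
from dataclasses import dataclass

# Ordinal encoding described in the review: LOW -> 0, MEDIUM -> 1, HIGH -> 2.
LEVELS = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


@dataclass
class Candidate:
    """A hypothetical candidate action with LLM-judged ordinal labels."""
    name: str
    value: str
    uncertainty: str
    risk: str


def score(c: Candidate) -> int:
    """psi(a) = V + U - R over the ordinal encodings."""
    return LEVELS[c.value] + LEVELS[c.uncertainty] - LEVELS[c.risk]


def select(candidates: list[Candidate]) -> Candidate:
    """Return the highest-scoring candidate; ties go to the earliest one (assumed)."""
    return max(candidates, key=score)


# Illustrative labels only, not taken from the paper.
candidates = [
    Candidate("continue branch A", "HIGH", "LOW", "MEDIUM"),  # 2 + 0 - 1 = 1
    Candidate("try branch B", "MEDIUM", "HIGH", "LOW"),       # 1 + 2 - 0 = 3
    Candidate("continue branch C", "LOW", "MEDIUM", "HIGH"),  # 0 + 1 - 2 = -1
]
best = select(candidates)
```

Under this encoding the 27 possible label triples collapse to only seven distinct scores (-2 through 4), so ties are frequent and the outcome depends heavily on how the backbone model assigns labels.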
The paper addresses a practical bottleneck in deploying deep-search agents. As LLM-based research assistants become more prevalent (OpenAI DeepResearch, Gemini DeepResearch), the question of how to control multi-step search under budget constraints becomes increasingly important. The branch-and-return abstraction is general enough to be applied beyond the specific benchmarks tested — it could extend to multi-hop question answering, investigative journalism agents, scientific literature review, and competitive intelligence gathering.
The TreeMem design, which attaches failure cues and evidence to specific branches rather than flattening into a single context, is a useful architectural pattern that could influence how future agent memory systems are designed. The operation-level decision framework (exploit/explore/prune) provides a vocabulary and control interface that other systems could adopt.
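The branch-local memory pattern can be sketched as a small tree of records. This is an illustration of the idea only: the class name `BranchNode`, its fields, and the `prune` behavior are hypothetical and are not taken from the paper's implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)  # compare nodes by identity, since parent/child links form cycles
class BranchNode:
    """One tentative search direction; all field names are hypothetical."""
    goal: str
    parent: BranchNode | None = None
    children: list[BranchNode] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)
    conflicts: list[str] = field(default_factory=list)
    failure_cues: list[str] = field(default_factory=list)
    pruned: bool = False

    def branch(self, goal: str) -> BranchNode:
        """Open a new tentative direction under this node."""
        child = BranchNode(goal=goal, parent=self)
        self.children.append(child)
        return child

    def prune(self, cue: str) -> BranchNode | None:
        """Abandon this branch, keep the reason on it, and return to the branch point."""
        self.pruned = True
        self.failure_cues.append(cue)
        return self.parent

    def open_alternatives(self) -> list[BranchNode]:
        """Children that have not been pruned."""
        return [c for c in self.children if not c.pruned]


# Made-up example content.
root = BranchNode(goal="identify the author")
a = root.branch("search by book title")
b = root.branch("search by publisher")
a.evidence.append("title matches two different editions")
back = a.prune("editions disagree on author; no primary source found")
```

Because the failure cue stays on branch `a` instead of being merged into a flat history, a later decision at `root` can see both that `a` failed and why, while `b` remains available as an open alternative.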
However, the impact is somewhat bounded by several factors: the framework is inference-time only and doesn't involve training, the improvements over the strongest baseline (Flash-Searcher) are moderate (5.6 points on XBench-DS with gpt-5.2), and the approach relies heavily on the backbone LLM's ability to produce meaningful ordinal assessments of value, uncertainty, and risk.
This paper is highly timely. Deep search is an active research frontier in 2025–2026, with major labs releasing competing deep-research products. The paper positions itself well against concurrent work (IterResearch, Flash-Searcher, WebSailor, etc.) and addresses a specific gap: evidence-driven budget reallocation during search. The choice of benchmarks is current and appropriate.
TreeSeeker makes a solid engineering and systems contribution to the active area of deep-search agents. The branch-and-return abstraction is clean and the experimental evaluation is reasonably thorough. However, the theoretical framing overstates the formalism of the approach, the improvements are moderate and not always clearly significant, and the reliance on LLM-generated quality signals for "UCB" makes the method's reliability hard to characterize. It represents a useful step forward in inference-time search control but is more of an incremental systems contribution than a fundamental advance.
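The significance concern can be made concrete with a back-of-the-envelope calculation. The sketch below uses the standard binomial standard error for an accuracy measured on a 100-item subset. The 50% accuracy is a placeholder, not a number from the paper, and treating the two systems as independent is a simplifying assumption that ignores both pairing on shared questions and averaging over three runs.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of an accuracy p estimated from n independent items."""
    return math.sqrt(p * (1.0 - p) / n)


def diff_se_unpaired(p1: float, p2: float, n: int) -> float:
    """Standard error of the gap between two accuracies, each measured on n items,
    assuming the two estimates are independent (a simplification)."""
    return math.sqrt(binomial_se(p1, n) ** 2 + binomial_se(p2, n) ** 2)


# Placeholder accuracy of 0.5 on a 100-item subset.
se_single = binomial_se(0.5, 100)          # 0.05, i.e. 5 points
se_gap = diff_se_unpaired(0.5, 0.5, 100)   # about 0.071, i.e. roughly 7 points
```

Under these assumptions a gap of 2 to 3 points is well inside one standard error, which is why paired tests or confidence intervals would be needed before reading the smaller margins as real differences.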
Generated Jun 11, 2026
Paper 2 (TreeSeeker) likely has higher scientific impact due to broader applicability and timeliness: inference-time branch-and-return control for web/deep search is a core capability for LLM agents across domains (QA, research assistants, enterprise search). Its tree-structured UCB-style selection with memory for uncertainty/conflicts generalizes beyond education and could transfer to other tool-using settings. The evaluation spans multiple established deep-search benchmarks with consistent gains, suggesting stronger methodological breadth. Paper 1 is innovative for creativity assessment but is more niche and depends on human-subject context, limiting cross-field uptake.
Paper 1 offers a concrete, technically grounded inference-time framework (tree-structured branch-and-return control with UCB-style selection and memory) validated on multiple established deep-search benchmarks with consistent gains over strong baselines. It is timely for LLM tool-use/search agents, methodologically more rigorous, and has clear real-world applicability (web research, enterprise search, autonomous assistants) with impact across IR, planning, and agentic LLM systems. Paper 2 is more speculative and philosophically framed, with limited empirical grounding and higher risk of unclear operationalization/generalizability despite provocative alignment relevance.
TreeSeeker addresses the broadly impactful problem of agentic deep search with a novel tree-structured trial-and-error framework incorporating UCB-based exploration-exploitation tradeoffs. Its approach is applicable across many AI domains (web search, reasoning, multi-step decision-making) and builds on timely trends in inference-time compute and LLM agents. Paper 2, while solid, addresses a narrower domain (BIM compliance checking in AEC) with more limited cross-field applicability. TreeSeeker's methodological innovations—branch-and-return control, textual UCB signals, TreeMem—have broader potential to influence future research in agentic AI systems.
Paper 2 addresses a highly timely and widely applicable problem: enhancing LLM agents' deep search capabilities through structured inference-time reasoning. Its tree-structured trial-and-error framework has broad implications for autonomous agents, information retrieval, and complex question answering. While Paper 1 offers a rigorous theoretical contribution to adversarial robustness in data summarization, Paper 2 aligns with rapidly growing trends in agentic AI and inference-time search, suggesting a broader and more immediate scientific and practical impact across multiple domains.
TreeSeeker addresses the fundamental challenge of search strategy in complex multi-step reasoning, proposing a novel tree-structured trial-and-error framework with textual UCB signals. It demonstrates stronger methodological innovation by combining exploration-exploitation tradeoffs from MCTS with deep web search, and shows consistent improvements across multiple benchmarks. Its impact spans AI search, reasoning, and decision-making. Paper 2, while addressing an important memory management problem, presents a more incremental contribution with moderate benchmark results (64.7%) and narrower scope focused on long-term memory maintenance.
Paper 1 exposes emergent metaprogramming strategies in frontier LLMs and introduces a novel evaluation paradigm using unfamiliar languages. This provides fundamental insights into agentic adaptation and reasoning, likely sparking broader research across LLM evaluation and capability discovery compared to Paper 2's more incremental methodological improvement to search frameworks.
Paper 2 (HIPIF) targets a broadly limiting failure mode for LLM agents—long-context interference in long-horizon tasks—via an end-to-end learning framework that couples hierarchical planning with history “folding,” plus reflection and process rewards without extra models or expert trajectories. This is more general and likely to transfer across many agent settings beyond web search, increasing breadth of impact and real-world applicability. Paper 1 is novel and useful for deep web search, but is more domain-specific (inference-time control for search trees) and thus likely narrower in cross-field impact.
Paper 2 (STAGE-Claw) likely has higher scientific impact: it introduces a scalable, automated benchmarking framework for realistic, state-based personal-computing environments with programmatic verification, addressing a major evaluation bottleneck for agent research. Its applicability spans many agent types, tools, and domains, enabling reproducible comparisons and guiding progress across the field. Paper 1 (TreeSeeker) is a solid inference-time control contribution for deep search, but is narrower in scope and may be more quickly subsumed by broader agent frameworks; benchmarks/infrastructure tend to have wider, longer-lasting cross-field influence.
Paper 1 has higher potential scientific impact due to its foundational, theory-driven contribution: a formalization of ELK with Causal Influence Diagrams and an impossibility theorem about feedback-based training for honesty under latent variables. This targets a central, timely problem in AI alignment/safety with broad implications for training paradigms, evaluation, and deployment of advanced systems across domains. While Paper 2 appears practically useful for deep-search performance, its impact is narrower (an inference-time control heuristic) and more incremental relative to ongoing work on search/planning and tool-using agents.
TreeSeeker addresses a broader and more practically impactful problem—deep web search with multi-step reasoning—which has wide applicability across information retrieval, autonomous agents, and AI systems. Its tree-structured trial-and-error framework with UCB-based exploration/exploitation is methodologically novel, combining ideas from MCTS with LLM-based search in a principled way. While RecToM achieves impressive results on ToM benchmarks (including 100% on Hi-ToM), Theory of Mind reasoning is a narrower subfield. TreeSeeker's framework has greater potential for real-world deployment and cross-domain influence in the rapidly growing area of agentic AI systems.