Plan Before Search: Search Agents Need Plan

Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen, Huangyu Dai, Jiayi Ji, Chenyi Lei, Wenwu Ou

#1211 of 2682 · Artificial Intelligence
Tournament Score: 1420±50
Win Rate: 58% (11 wins, 8 losses, 19 matches)
Rating: 6.5/10

Abstract

Training large language models as retrieval-augmented reasoning agents typically combines reinforcement learning with an SFT cold start distilled from a stronger model. However, this paradigm overlooks two fundamental factors: the dependency structure among sub-skills, and the possibility that distillation is not the only route to capability acquisition. We study this through Plan, a structured agentic behavior for multi-hop retrieval that decomposes a question into ordered sub-questions before any retrieval is performed, so that each search step can be anchored to a pre-designed sub-question instead of drifting under the influence of partially relevant documents retrieved earlier. However, across three model families spanning 3B to 14B parameters, we find that an identical reward signal induces qualitatively different RL failure modes. This phenomenon indicates that successful training hinges not only on reward design but also on model-specific feasibility conditions: sufficient initial entropy, training stability, and prerequisite sub-skills. Motivated by this, we propose a self-bootstrapping paradigm in which a small-scale seed model generates filtered trajectories that activate Plan in any target model, eliminating the need for distillation from an external stronger model. Our pipeline activates Plan across every tested model and consistently outperforms competitive baselines on multi-hop QA benchmarks.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Plan Before Search: Search Agents Need Plan"

1. Core Contribution

This paper makes two intertwined contributions to the training of retrieval-augmented reasoning agents. First, it formalizes Plan as a structured agentic behavior where a model decomposes a multi-hop question into ordered sub-questions *before* any retrieval occurs, anchoring each subsequent search step to a pre-designed sub-question. This prevents the "query drift" problem where reactive agents progressively deviate from the original question due to noise accumulation from partially relevant retrieved documents.
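The plan-before-search behavior described above can be illustrated with a minimal sketch. It is not the paper's implementation: `plan_then_search`, `toy_plan`, and `toy_retrieve` are hypothetical names, the planner and retriever are toy stand-ins, and the sketch ignores the fact that later sub-questions usually depend on earlier answers.

```python
from typing import Callable, Dict, List


def plan_then_search(
    question: str,
    make_plan: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
) -> List[Dict[str, object]]:
    """Fix the full plan first, then issue one search per planned sub-question."""
    plan = make_plan(question)  # produced before any retrieval happens
    steps: List[Dict[str, object]] = []
    for sub_question in plan:
        # The query is anchored to the planned sub-question rather than to
        # whatever the previously retrieved documents happened to mention.
        docs = retrieve(sub_question)
        steps.append({"sub_question": sub_question, "docs": docs})
    return steps


# Toy stand-ins (hypothetical): a fixed two-step plan and a dictionary retriever.
def toy_plan(question: str) -> List[str]:
    return ["sub-question 1", "sub-question 2"]


TOY_CORPUS = {
    "sub-question 1": ["document A"],
    "sub-question 2": ["document B"],
}


def toy_retrieve(query: str) -> List[str]:
    return TOY_CORPUS.get(query, [])


steps = plan_then_search("a two-hop question", toy_plan, toy_retrieve)
for step in steps:
    print(step["sub_question"], "->", step["docs"])
```

A reactive agent would instead build each query from the documents returned so far, which is where the drift described above can enter.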

Second, and more analytically interesting, the paper identifies three feasibility conditions for successful RL training of complex composite behaviors: sufficient initial entropy, training stability, and prerequisite sub-skills. When these conditions are violated, qualitatively distinct failure modes emerge across model families—prior collapse (instruction-tuned models with compressed behavioral distributions), late-stage entropy explosion, and instability from missing prerequisite capabilities. This diagnostic framework motivates a self-bootstrapping paradigm where a small seed model (Qwen2.5-3B-Base) that satisfies these conditions generates filtered trajectories to activate Plan in arbitrary target models, eliminating dependence on distillation from stronger external models.
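To make the "sufficient initial entropy" condition concrete, the sketch below shows one common way to measure policy entropy: the Shannon entropy of each next-token distribution, averaged over a rollout. Whether the paper uses this exact measure is an assumption, and the function names and toy distributions are illustrative only.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(step_distributions: List[List[float]]) -> float:
    """Average per-token entropy over the steps of one sampled rollout."""
    if not step_distributions:
        return 0.0
    total = sum(token_entropy(dist) for dist in step_distributions)
    return total / len(step_distributions)


# Toy distributions (made up): a sharply peaked policy versus a uniform one.
peaked = [[0.97, 0.01, 0.01, 0.01]] * 3
spread = [[0.25, 0.25, 0.25, 0.25]] * 3

print("peaked:", mean_policy_entropy(peaked))
print("spread:", mean_policy_entropy(spread))
```

Under this reading, a policy that starts out like `peaked` has little room to explore new behaviors during RL, which is consistent with the prior-collapse failure mode the review attributes to instruction-tuned models.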

2. Methodological Rigor

The experimental design is notably thorough. The paper tests across three model families (Qwen2.5, Llama3.2, Qwen3) and scales from 3B to 14B, covering 12 model configurations. This breadth is unusual and strengthens the generalizability claims.

The reward design is well-motivated: the plan-aware reward (Rplan) uses thresholded F1 alignment between planned sub-questions and think blocks, with the threshold preventing reward hacking through verbatim copying—demonstrated convincingly in Figure 3 where removing the threshold causes training collapse around step 90. The auxiliary rewards are only activated when the answer score is zero, a sensible design choice that avoids interfering with already-successful trajectories.
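The reward logic described above can be sketched as follows. This is one plausible reading, not the paper's formula: it assumes a binary threshold on token-level F1, so a verbatim copy of a sub-question earns no more than a paraphrase that clears the threshold. The names `plan_reward` and `total_reward` and the values of `tau` and `aux_weight` are hypothetical placeholders.

```python
from collections import Counter
from typing import List


def token_f1(a: str, b: str) -> float:
    """Token-level F1 between two whitespace-tokenized strings."""
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    overlap = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(tokens_a)
    recall = overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def plan_reward(sub_questions: List[str], think_blocks: List[str], tau: float = 0.5) -> float:
    """Fraction of planned sub-questions matched by some think block at F1 >= tau."""
    if not sub_questions or not think_blocks:
        return 0.0
    hits = sum(
        1
        for sq in sub_questions
        if max(token_f1(sq, tb) for tb in think_blocks) >= tau
    )
    return hits / len(sub_questions)


def total_reward(
    answer_score: float,
    sub_questions: List[str],
    think_blocks: List[str],
    aux_weight: float = 0.1,  # assumed weight, not taken from the paper
    tau: float = 0.5,         # assumed threshold, not taken from the paper
) -> float:
    """Auxiliary plan reward applies only when the answer score is zero."""
    if answer_score > 0:
        return answer_score
    return aux_weight * plan_reward(sub_questions, think_blocks, tau)


plan = ["who directed the film"]
copied = ["who directed the film"]
paraphrased = ["i need to find who directed the film first"]
print(total_reward(0.0, plan, copied), total_reward(0.0, plan, paraphrased))
```

In this sketch the copied and paraphrased think blocks receive the same auxiliary reward, so copying the plan verbatim buys nothing extra, and a trajectory with a correct answer is left untouched by the plan term.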

The ablation studies are comprehensive: plan reward necessity (Table 2), plan-and-refine vs. refine-only across all 12 configurations (Table 3), self-bootstrapping vs. strong-model distillation (Table 4), SFT data scaling (Table 7), and threshold sensitivity (Table 8). The training order ablation (R→P vs. P→R on the same model) is particularly valuable as it isolates the sub-skill dependency from model-intrinsic properties.

However, there are methodological concerns. The seed model selection (Qwen2.5-3B-Base) appears somewhat ad hoc—the paper doesn't provide a systematic way to predict which models will satisfy the feasibility conditions *a priori*. The three conditions are identified post hoc from observed failures rather than derived from a theoretical framework. Additionally, the static Wikipedia corpus limits ecological validity for real deployment scenarios.

3. Potential Impact

Immediate practical impact: The self-bootstrapping paradigm is directly useful for practitioners training search agents at moderate scales who lack access to large teacher models. The consistent improvements across model families (average MH-Avg. improvement of +0.053 over AutoRefine) suggest broad applicability.

Conceptual impact: The paper's most valuable contribution may be the analysis of RL failure modes and sub-skill dependencies. The finding that identical reward signals produce qualitatively different failures across models challenges the prevailing focus on reward engineering and redirects attention to the initial conditions of the policy. This insight—that the bottleneck is often the starting point rather than the reward—has implications beyond search agents, potentially applying to any RL training of multi-skill agentic behaviors.

Methodological precedent: The self-bootstrapping approach (small seed → filtered trajectories → SFT activation → RL refinement) provides a template that could generalize to other composite agentic behaviors. The finding that self-bootstrapped data outperforms 72B-distilled data (Table 4) due to distributional alignment is particularly noteworthy and challenges the assumption that stronger teachers always produce better training data.
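The filtering step in that template might look like the sketch below. The review does not state the filtering criteria, so the two used here (an explicit plan of at least two sub-questions and an exact-match correct answer) are assumptions, and `Trajectory` and `filter_seed_trajectories` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    question: str
    plan: List[str]          # ordered sub-questions emitted before any search
    predicted_answer: str
    gold_answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a simple exact-match check."""
    return " ".join(text.lower().split())


def filter_seed_trajectories(
    trajectories: List[Trajectory], min_plan_steps: int = 2
) -> List[Trajectory]:
    """Keep seed-model rollouts that contain a plan and reach the gold answer."""
    kept = []
    for traj in trajectories:
        has_plan = len(traj.plan) >= min_plan_steps
        is_correct = normalize(traj.predicted_answer) == normalize(traj.gold_answer)
        if has_plan and is_correct:
            kept.append(traj)
    return kept


# Toy rollouts (made up) standing in for seed-model samples.
rollouts = [
    Trajectory("q1", ["sub 1", "sub 2"], "Answer A", "answer a"),
    Trajectory("q2", [], "Answer B", "answer b"),            # no plan
    Trajectory("q3", ["sub 1", "sub 2"], "wrong", "answer c"),  # wrong answer
]
sft_data = filter_seed_trajectories(rollouts)
print([t.question for t in sft_data])
```

The retained trajectories would then serve as SFT data for the target model before RL refinement, following the seed, filter, SFT, RL order named above.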

4. Timeliness & Relevance

This paper addresses a current bottleneck in the rapidly growing field of RL-trained agentic LLMs. The Search-R1 → AutoRefine → PL-Search progression represents a natural evolution, and the timing is appropriate. The question of when and why SFT cold-start is necessary for RL training is practically urgent given the proliferation of search agent training pipelines. The paper's analysis of why certain models fail under direct RL fills a genuine gap in understanding.

5. Strengths & Limitations

Key Strengths:

  • Exceptional breadth of evaluation across model families and scales (12 configurations)
  • The failure mode taxonomy (prior collapse, entropy explosion, missing prerequisites) provides actionable diagnostic categories
  • Self-bootstrapping eliminates the dependency on strong external models, a meaningful practical contribution
  • The distributional alignment insight (why small-model trajectories outperform 72B trajectories) is counter-intuitive and valuable
  • Token budget analysis (Appendix C.2) preempts the concern that gains come from increased generation budget

Notable Limitations:

  • The plan decomposition is relatively straightforward (ordered sub-questions); it's unclear how this extends to questions requiring non-linear reasoning dependencies
  • The feasibility conditions are descriptive rather than predictive—no metric is provided to determine *a priori* whether a model will succeed under direct RL
  • Evaluation uses static Wikipedia rather than live web search, limiting practical relevance
  • The plan behavior is studied only in multi-hop QA; generalization to other agentic tasks (tool use, long-horizon planning) is speculative
  • The seed model choice appears contingent—what if no model in the available pool satisfies the feasibility conditions?
  • Some baselines report numbers from their original papers rather than being reproduced under identical conditions

Additional Observations:

    The paper's framing around "sub-skill dependencies" is more theoretically suggestive than rigorously developed. A formalization of what constitutes a "sub-skill" and how dependencies could be systematically identified would significantly strengthen the conceptual contribution. The entropy-based analysis, while informative, remains correlational rather than causal.

    The improvements, while consistent, are modest in absolute terms (+2.5 average EM over the strongest baseline). The practical significance depends heavily on the application context.

    Rating: 6.5/10
    Significance 6.5 · Rigor 7 · Novelty 6.5 · Clarity 7

    Generated May 28, 2026

    Comparison History (19)

    vs. On the Origin of Synthetic Information by Means of Steganographic Inheritance
    claude-opus-4.6 · 5/28/2026

    Paper 1 addresses a practical, timely problem in LLM-based retrieval-augmented reasoning with concrete methodological contributions (self-bootstrapping paradigm, analysis of RL failure modes across model families) and empirical results on multi-hop QA benchmarks. Paper 2, while creative in its biological analogy for tracing synthetic information provenance via steganography, addresses a narrower problem with less immediate practical applicability and relies heavily on conceptual framing. Paper 1's contributions to training methodology for reasoning agents have broader, more immediate impact given the current focus on improving LLM capabilities.

    vs. Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation
    claude-opus-4.6 · 5/28/2026

    Paper 1 addresses a critical and underexplored failure mode of chain-of-thought distillation: that improved answer accuracy can mask degraded reasoning quality. This finding has broad implications for AI safety, trustworthiness, and deployment in high-stakes domains like medicine. The rigorous multi-dimensional evaluation (multiple evaluators, scales, benchmarks, clinical expert validation) and the counterintuitive finding that accuracy and reasoning quality diverge makes it highly impactful. Paper 2, while solid, offers a more incremental contribution to retrieval-augmented reasoning with a planning mechanism. Paper 1's warning about relying on answer-level metrics alone is more broadly consequential.

    vs. CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning
    gemini-3.1 · 5/28/2026

    Paper 2 addresses a major bottleneck in LLM reasoning improvement—the high computational cost of parametric RL and prompt optimization. By proposing a non-parametric method that achieves significant gains with as few as five samples, it offers a highly accessible, sample-efficient, and interpretable solution. This rapid adaptation capability is likely to have a broader and more immediate impact across various applications compared to Paper 1's focus on multi-hop retrieval and self-bootstrapping, which, while valuable, is more specialized.

    vs. AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models
    claude-opus-4.6 · 5/28/2026

    AIBuildAI-2 addresses the broader and more transformative problem of democratizing AI model development for non-experts, particularly scientists. Its knowledge-enhanced agent with an evolving knowledge system is highly novel and demonstrates strong empirical results (first on MLE-Bench, top 6.6% against human experts). The potential real-world impact spans multiple scientific fields (biology, physics, chemistry). While Paper 1 makes solid contributions to retrieval-augmented reasoning with its self-bootstrapping paradigm, its scope is narrower (multi-hop QA). Paper 2's breadth of application and potential to accelerate scientific discovery gives it higher estimated impact.

    vs. AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation
    claude-opus-4.6 · 5/28/2026

    AutoScientists presents a more broadly impactful contribution: a general-purpose framework for autonomous scientific experimentation that demonstrates strong results across multiple diverse domains (biomedical ML, LLM training, protein fitness prediction). Its decentralized multi-agent architecture for sustained scientific exploration is highly novel, and the demonstrated improvements over state-of-the-art in protein engineering (+12.5% on ACE2-Spike binding) represent tangible scientific discoveries. Paper 1, while technically solid, addresses a narrower problem (multi-hop QA training strategies) with more incremental contributions to the retrieval-augmented reasoning community.

    vs. From Learning Resources to Competencies: LLM-Based Tagging with Evidence and Graph Constraints
    claude-opus-4.6 · 5/28/2026

    Paper 2 addresses fundamental challenges in training LLM-based reasoning agents with reinforcement learning, offering broadly applicable insights about RL failure modes across model families and a novel self-bootstrapping paradigm that eliminates dependency on stronger teacher models. Its findings about model-specific feasibility conditions for RL and the structured planning approach for multi-hop retrieval have wider applicability across NLP and AI. Paper 1, while methodologically sound, addresses a narrower educational technology application with modest performance metrics and limited generalizability beyond competency tagging in LMS systems.

    vs. SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks
    claude-opus-4.6 · 5/28/2026

    Paper 2 addresses a concrete, well-defined problem (multi-hop retrieval-augmented reasoning) with rigorous empirical methodology across multiple model families and benchmarks. It offers novel insights into RL failure modes, identifies model-specific feasibility conditions, and proposes a self-bootstrapping paradigm that eliminates dependence on stronger teacher models—a broadly applicable contribution. Paper 1, while addressing an interesting decentralized compute problem, is primarily a protocol/architecture proposal with concepts (Shapley-value credit, DHT routing, pheromone metaphors) that are largely recombinations of existing ideas, and lacks empirical validation of its claims.

    vs. MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation
    claude-opus-4.6 · 5/28/2026

    Paper 2 addresses a fundamental challenge in training LLM-based reasoning agents—how to structure planning before retrieval in multi-hop QA—with a novel self-bootstrapping paradigm that eliminates dependence on stronger teacher models. It offers broad methodological contributions (analysis of RL failure modes across model families, self-bootstrapping training) applicable across many NLP tasks. Paper 1, while addressing a valid gap in cinematic video generation evaluation, is a narrower benchmark contribution for a still-emerging subfield. Paper 2's insights into RL training dynamics and practical training pipeline have wider applicability and timeliness given the current focus on agentic LLMs.

    vs. Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
    gemini-3.1 · 5/28/2026

    Paper 1 offers a deeper scientific contribution by merging mechanistic interpretability (using Temporal Sparse Autoencoders) with Reinforcement Learning with Verifiable Reward (RLVR). It provides fundamental insights into how sample difficulty affects internal model representations and optimization dynamics. While Paper 2 presents a practical approach to multi-hop retrieval agents, the 'plan-before-search' paradigm is less fundamentally novel. Paper 1's rigorous internal analysis of LLM behavior during RL has broader implications for understanding and improving alignment and reasoning training.

    vs. GONDOR to the Rescue: Satisficing Planning with Low Memory
    gpt-5.2 · 5/28/2026

    Paper 2 likely has higher impact: it targets timely, widely relevant LLM agent training for multi-hop retrieval, identifies model-dependent RL failure modes, and proposes a self-bootstrapping alternative to costly distillation—potentially broadly applicable across agentic RAG systems and model families. Its implications span RL, evaluation of agent behaviors, and practical deployment of smaller models. Paper 1 is methodologically solid and useful for memory-constrained planning/search, but its impact is more specialized to heuristic search/planning communities with narrower cross-field reach.

    vs. Entropy-aware Masking for Masked Language Modeling
    gpt-5.2 · 5/28/2026

    Paper 2 likely has higher impact due to its broader relevance to retrieval-augmented agents and RL training, identifying model-dependent RL failure modes and proposing a distillation-free self-bootstrapping pipeline that generalizes across model families. The approach targets timely, high-value applications (multi-hop QA, tool/retrieval use) and contributes methodological insights about feasibility conditions beyond reward design. Paper 1 is a solid optimization to MLM pretraining, but entropy-based masking is a narrower increment with more limited cross-field reach compared to agent planning and training paradigms.

    vs. Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration
    gpt-5.2 · 5/28/2026

    Paper 2 likely has higher impact due to broader applicability and conceptual contribution: it identifies model-specific feasibility conditions for RL training of retrieval agents, introduces a general “plan-before-search” behavior, and proposes a self-bootstrapping alternative to expensive teacher distillation. These ideas can influence agent training, RLHF/RLAIF, retrieval-augmented reasoning, and practical QA systems across many model sizes. Paper 1 is timely and useful but more specialized to LVLM hallucination mitigation and limited to an inference-time attention recalibration technique with narrower cross-field reach.

    vs. Diffusion Large Language Models for Visual Speech Recognition
    claude-opus-4.6 · 5/28/2026

    Paper 1 addresses a fundamental challenge in training LLM-based reasoning agents with RL, offering broadly applicable insights about model-specific failure modes and a novel self-bootstrapping paradigm that eliminates dependence on stronger teacher models. Its findings about dependency structures among sub-skills and feasibility conditions for RL training are generalizable across many domains. Paper 2, while achieving state-of-the-art on a specific benchmark and introducing a novel application of diffusion LLMs to VSR, has narrower impact scope limited primarily to the visual speech recognition community. Paper 1's methodological contributions are more likely to influence the larger and rapidly growing field of LLM agent training.

    vs. Generating Robust Portfolios of Optimization Models using Large Language Models
    gemini-3.1 · 5/28/2026

    Paper 2 addresses fundamental challenges in training LLM agents, specifically tackling RL failure modes and dependency structures in multi-hop reasoning. Its proposed self-bootstrapping paradigm eliminates the need for distillation from stronger, often proprietary, models. This contributes significantly to the highly active area of autonomous AI agents and open-source model development, offering broader implications and applicability across the AI field compared to Paper 1's more specialized focus on generating optimization models.

    vs. Revealing Algorithmic Deductive Circuits for Logical Reasoning
    gemini-3.1 · 5/28/2026

    Paper 2 addresses a fundamental black-box problem in AI by investigating the mechanistic interpretability of logical reasoning in LLMs. By identifying specific deductive circuits and attention heads, it offers broad theoretical implications for AI safety, transparency, and future architecture design. While Paper 1 provides a valuable, practical training paradigm for search agents, Paper 2's insights into how reasoning capabilities structurally emerge within models have a deeper and potentially more foundational scientific impact across the broader AI research community.

    vs. Picid: A Modular Evaluation Infrastructure for Reproducible PHM Across Tasks and Domains
    gpt-5.2 · 5/28/2026

    Paper 2 is more novel and timely, addressing core limitations in training retrieval-augmented LLM agents (planning before search, model-specific RL feasibility conditions) and proposing a self-bootstrapping alternative to distillation. Its applications span many LLM systems (multi-hop QA, agentic search) with broad cross-field impact across NLP, RL, and information retrieval. Paper 1 improves rigor and reproducibility within PHM evaluation—valuable but more infrastructure-focused and domain-bounded, with narrower immediate breadth than advances in scalable LLM agent training.

    vs. JobBench: Aligning Agent Work With Human Will
    gemini-3.1 · 5/28/2026

    JobBench introduces a paradigm-shifting benchmark that redefines occupational AI evaluation from economic replacement to human enhancement. By aligning AI capabilities with human delegation preferences, it offers profound societal relevance and broad multidisciplinary impact. While Paper 1 provides a valuable methodological improvement for agent reasoning, Paper 2 addresses urgent ethical and economic questions, likely driving extensive future research in AI alignment, HCI, and agent evaluation.

    vs. When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models
    claude-opus-4.6 · 5/28/2026

    Paper 1 identifies a fundamental and previously underexplored vulnerability in aligned LLMs—brittle safety under context flips—with systematic empirical evidence across 12 models and actionable architectural recommendations (state-aware validators). This has broad implications for AI safety, deployment practices, and alignment research. Paper 2 makes a solid contribution to retrieval-augmented reasoning with a self-bootstrapping training paradigm, but addresses a more incremental improvement in a narrower subfield. Paper 1's findings on the inadequacy of current safety guardrails are more likely to influence policy, benchmarking standards, and safety-critical deployment decisions across the field.

    vs. The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic
    claude-opus-4.6 · 5/28/2026

    Paper 1 introduces a novel training paradigm (self-bootstrapping for planning in retrieval-augmented agents) with concrete methodological contributions: structured planning before search, analysis of RL failure modes across model families, and a distillation-free capability activation pipeline. It addresses a fundamental challenge in training agentic LLMs with broad applicability. Paper 2, while methodologically rigorous and valuable as a statistical critique of GSM-Symbolic, is primarily a re-evaluation/rebuttal paper with narrower scope—it corrects existing claims rather than opening new research directions. Paper 1's contributions are more constructive and likely to influence future work in agentic AI systems.