Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang, Lizhou Cai, Yixiu Mao, Ru Peng, Xin Xu
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.
TRACE introduces a unified framework for allocating rollout budgets in multi-turn agentic reinforcement learning. The key insight is that prompt-level selection and prefix-level branching within rollouts are instances of the same optimization problem: allocating sampling budget to "anchors" (either prompt roots or intermediate prefixes) that are most likely to yield mixed terminal rewards (both successes and failures). This unification is conceptually elegant: it reframes prompt filtering, rollout-count allocation, and prefix branching as budget decisions over a rollout tree, guided by a shared predictor of conditional success probability.
The framework operates in two stages: (1) global root allocation over a candidate prompt pool, solving a knapsack-style optimization to assign rollout counts to prompts with intermediate difficulty; (2) local prefix expansion, where continuation budget is allocated to visited prefixes within completed rollouts that are predicted to generate counterfactual outcomes. A lightweight Qwen3-0.6B predictor estimates success probability at both levels.
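The allocation idea can be illustrated with a small sketch. It is not the paper's knapsack solver: it swaps in a simple greedy heuristic and assumes rollouts from an anchor are independent Bernoulli draws with the predicted success probability. The names `mixed_outcome_prob`, `allocate_rollouts`, and `max_per_anchor` are hypothetical.

```python
import heapq


def mixed_outcome_prob(p: float, n: int) -> float:
    """Chance that n independent Bernoulli(p) rollouts contain both a success and a failure."""
    if n < 2:
        return 0.0
    return 1.0 - p**n - (1.0 - p) ** n


def allocate_rollouts(probs, budget, max_per_anchor=8):
    """Greedy stand-in for a knapsack-style allocation (illustrative assumption).

    probs: predicted success probability per anchor (prompt root or prefix).
    Returns the number of rollouts assigned to each anchor.
    """
    counts = [0] * len(probs)
    heap = []
    # Opening an anchor costs two rollouts, since one rollout alone cannot be mixed.
    for i, p in enumerate(probs):
        gain_per_rollout = mixed_outcome_prob(p, 2) / 2.0
        heapq.heappush(heap, (-gain_per_rollout, i, 2))

    remaining = budget
    while heap and remaining > 0:
        neg_gain, i, step = heapq.heappop(heap)
        # Skip anchors with no expected contrast or steps that do not fit.
        if -neg_gain <= 0.0 or step > remaining or counts[i] + step > max_per_anchor:
            continue
        counts[i] += step
        remaining -= step
        n = counts[i]
        if n < max_per_anchor:
            # Marginal gain of one more rollout at this anchor.
            nxt = mixed_outcome_prob(probs[i], n + 1) - mixed_outcome_prob(probs[i], n)
            heapq.heappush(heap, (-nxt, i, 1))
    return counts


if __name__ == "__main__":
    # Intermediate-difficulty anchor (0.5) is served first; the hopeless one (0.0) gets nothing.
    print(allocate_rollouts([0.5, 0.99, 0.0], budget=6, max_per_anchor=4))
```

Under these assumptions, anchors with predicted success near 0.5 absorb budget first, while anchors predicted to always succeed or always fail receive none, which matches the intermediate-difficulty preference described above.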
The theoretical foundations are well-developed. Proposition 1 (prefix information improves difficulty prediction) leverages the law of total variance to show that conditioning on a deeper prefix history cannot worsen the best achievable prediction of the terminal outcome, a clean and intuitive result. Proposition 2 connects the Bernoulli variance V(1-V) to the expected quadratic variation of the conditional success martingale, providing theoretical justification for using this quantity as an allocation criterion. Proposition 3 establishes that TRACE's activation-maximizing allocation dominates uniform allocation in expected squared gradient norm, though the normalized conditional gradient energy assumption is strong and may not hold uniformly in practice.
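The standard identities behind Propositions 1 and 2 can be sketched as follows. The notation is assumed here and may differ from the paper's: Y in {0,1} is the terminal reward, F_t is the history up to turn t, V_t = E[Y | F_t] is the conditional success probability, and V_T = Y at the terminal turn T.

```latex
\begin{align}
\operatorname{Var}(Y \mid \mathcal{F}_t)
  &= \mathbb{E}\big[\operatorname{Var}(Y \mid \mathcal{F}_{t+1}) \mid \mathcal{F}_t\big]
   + \operatorname{Var}\big(V_{t+1} \mid \mathcal{F}_t\big)
   \;\ge\; \mathbb{E}\big[\operatorname{Var}(Y \mid \mathcal{F}_{t+1}) \mid \mathcal{F}_t\big], \\
V_t\,(1 - V_t)
  &= \mathbb{E}\big[(Y - V_t)^2 \mid \mathcal{F}_t\big]
   = \mathbb{E}\Big[\sum_{s=t}^{T-1} (V_{s+1} - V_s)^2 \,\Big|\, \mathcal{F}_t\Big].
\end{align}
```

The first line is the law of total variance: the expected residual uncertainty about Y cannot grow as the prefix deepens. The second line uses the fact that Y is binary and that martingale increments are orthogonal, so the Bernoulli variance at a prefix equals the expected sum of squared future changes in the success probability.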
The experimental evaluation covers three distinct agentic settings (Mathematical Reasoning, Multi-Hop QA, Function Calling) across two model scales (Qwen3-8B, Qwen3-14B) with an additional backbone (Llama-3.2-3B). The baselines are appropriate: GRPO (flat uniform), PCL (prompt-level selection), and TreePO (tree structure with random branching). The controlled comparison with TreePO is particularly valuable as it isolates the effect of learned allocation from the tree structure itself. Budget accounting is carefully handled, with continuations counted as half-trajectories.
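The half-trajectory accounting can be made concrete with a one-line cost function. This is only an illustration of the convention stated above; the function name and the example counts are hypothetical and not taken from the paper.

```python
def trajectory_equivalents(num_full_rollouts: int,
                           num_continuations: int,
                           continuation_weight: float = 0.5) -> float:
    """Sampling cost in full-trajectory units, counting each continuation as half a trajectory."""
    return num_full_rollouts + continuation_weight * num_continuations


if __name__ == "__main__":
    # Hypothetical counts: 512 full rollouts plus 1024 prefix continuations.
    print(trajectory_equivalents(512, 1024))
```

Under this convention, a tree-structured run and a flat run are compared at equal cost when their trajectory-equivalent totals match.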
However, improvements are moderate in absolute terms. The headline result is a 2.8-point improvement in average Multi-Hop QA accuracy for Qwen3-14B. On Mathematical Reasoning, gains are approximately 1 point on average. The variance across benchmarks is notable: TRACE sometimes underperforms individual baselines on specific benchmarks (e.g., AIME24 for Qwen3-8B). The paper does not report confidence intervals or statistical significance for the final accuracy numbers.
Practical applicability: TRACE addresses a genuine bottleneck in agentic RL — the sparsity and inefficiency of outcome-only rewards in multi-turn settings. The computational overhead is minimal (2-3% of total training time for predictor operations), making it a practical add-on to existing RLVR pipelines.
Framework generality: The separation of allocation from optimization is a clean design principle. TRACE is optimizer-agnostic and compatible with different tree-aware policy objectives (TreeRPO, Tree-GRPO). This modularity increases adoption potential.
Broader influence: The "contrast allocation" perspective could influence how the community thinks about sample efficiency in RLVR more broadly. The idea that budget allocation should be viewed as a structural decision within the rollout tree (not just a prompt-level filtering decision) is a useful conceptual contribution.
Limitations on impact: The improvements, while consistent, are modest. The framework is specifically designed for outcome-only RLVR with binary rewards, limiting applicability to tasks with continuous or non-verifiable rewards. The predictor's effectiveness depends on having sufficient training signal, which may be less available in early training or for novel task distributions.
This work is highly timely. RLVR for LLMs is among the most active research areas in 2025-2026, driven by DeepSeek-R1 and related successes. The shift toward multi-turn agentic settings (tool use, multi-hop reasoning, function calling) creates exactly the sparse credit assignment challenges TRACE addresses. The paper also connects to the growing literature on sample-efficient RLVR, including prompt curriculum learning and adaptive sampling, positioning itself at the intersection of several active research threads.
The budget shape analysis (Table 2) revealing that (1024, 2) outperforms (512, 6) at the same total budget is an interesting practical finding suggesting that root coverage matters more than continuation depth. The paper's system-aware design (per-prompt Stage 2 triggering without inter-prompt synchronization) shows engineering awareness relevant to practical deployment.
Generated Jun 10, 2026
Paper 2 addresses a critical, fundamental bottleneck in mechanistic interpretability: the reproducibility and reliability of Sparse Autoencoders (SAEs). By theoretically and empirically explaining seed dependence and feature stability, it fundamentally shifts how researchers interpret neural network representations. While Paper 1 offers a valuable algorithmic efficiency improvement for agentic RL, Paper 2 has broader implications for AI safety, alignment, and our foundational understanding of model interpretability tools.
Paper 2 (TRACE) likely has higher impact: it targets a timely, high-interest problem—efficient RL for agentic LLMs under costly rollouts—directly affecting real-world training compute and capability scaling. The turn/prefix-level budget allocation over tree-structured rollouts is a practical, broadly applicable framework across RLVR, agentic reasoning, and LLM alignment, with clear efficiency/accuracy gains. Paper 1 is theoretically elegant and useful for multimodal transformers, but positional-embedding advances tend to be more incremental and narrower in downstream leverage than methods that reduce RL training cost and improve agentic performance.
Paper 1 challenges the resource-intensive Sparse Autoencoder (SAE) paradigm in mechanistic interpretability by reviving Independent Component Analysis (ICA). By providing a stable, gradient-free alternative that is computationally highly efficient, it democratizes LLM interpretability research and accelerates alignment efforts. While Paper 2 offers valuable efficiency gains in agentic RL, Paper 1 has the potential to fundamentally shift the foundational methodology used across the rapidly growing field of model representation analysis.
Paper 2 introduces a fundamentally new surrogate-modeling method (FTM) with broad applicability across stochastic dynamical systems, turbulence, and chaotic systems. It addresses a foundational problem in computational physics and applied mathematics—efficiently predicting ensemble statistics of complex stochastic systems—with strong theoretical grounding (stability analysis) and wide cross-disciplinary relevance (climate, fluid dynamics, molecular dynamics). Paper 1, while useful, presents an incremental optimization framework for agentic RL with moderate empirical gains (2.8 points) on specific benchmarks, targeting a narrower problem within LLM training methodology.
Paper 1 likely has higher impact: it reframes PEFT from cost-saving to a scalable paradigm for persistent, personalized “local state” atop shared trillion-parameter models, addressing major practical needs (personalization, memory, multi-tenant serving) and introducing systems implications (identity/provenance/eval/serving via MinT). This has broad applicability across personalization, deployment, privacy, and model management. Paper 2 is timely and useful for agentic RL efficiency, but is more incremental (budget allocation/predictor-guided rollouts) and narrower in scope and downstream applicability.
Paper 2 addresses a critical and universal bottleneck in LLM deployment: KV cache memory in long-context inference. By demonstrating that selective, learnable KV eviction not only reduces memory but actually improves reasoning performance by minimizing attention dilution, it challenges the assumption that full-cache attention is optimal. This insight offers immense practical applications across diverse language and vision-language models, granting it broader potential impact than Paper 1's more specialized focus on agentic RL training pipelines.
Paper 1 addresses a highly timely and applied problem in optimizing reinforcement learning for large language models and agentic workflows. Given the current explosive interest in LLM reasoning and efficient RL processes, TRACE's practical empirical gains are likely to see rapid adoption and high citation counts. While Paper 2 offers strong theoretical unification for kernel bandits, Paper 1's immediate relevance to the booming field of generative AI agents gives it a broader and more immediate potential scientific impact.
Paper 2 offers a unifying theoretical framework with provable regime boundaries (phase diagram) for two core multimodal paradigms, plus a practical diagnostic to choose objectives before training. This combination of theory + actionable guidance is broadly applicable across domains (vision-language, scientific multimodal data) and can change how practitioners design multimodal learning pipelines, including identifying harmful settings. Paper 1 is a useful algorithmic advance for RLVR efficiency in agentic LLMs, but its impact is narrower and more engineering-/benchmark-driven with less general cross-field influence.
Paper 2 addresses a critical bottleneck in computational biology and drug discovery by significantly reducing the computational cost of all-atom generative modeling for protein-ligand complexes. Achieving state-of-the-art accuracy with 5x fewer inference steps enables scalable deployment and accelerates biological research. While Paper 1 offers a valuable efficiency improvement for LLM agents, Paper 2's direct application to accelerating biomolecular structure prediction promises broader and more immediate real-world scientific impact across life sciences and medicine.
Paper 1 proposes a broadly applicable, novel rollout-budget allocation framework (TRACE) for multi-turn agentic RL that operates at both prompt and prefix levels, directly addressing a key bottleneck in RLVR (low reward contrast under fixed sampling budgets). It introduces a generalizable success-probability predictor and tree-structured adaptive sampling with demonstrated efficiency/accuracy gains on standard agentic benchmarks, making it timely and likely reusable across LLM-based RL systems. Paper 2 is methodologically rigorous and insightful but is primarily a controlled diagnostic study on a narrower synthetic seismogram setting, with more limited immediate cross-domain impact.