Yijin Zhou, Linqian Zeng, Xiaoya Lu, Wenyuan Xie, Dongrui Liu, Junchi Yan, Jing Shao
Large language model (LLM)-based agents often make suboptimal tool-use decisions, including unsupported tool invocation and hallucinated direct responses, which may accumulate errors throughout multi-step interactions. Existing approaches mainly improve these behaviors through inference-time correction or coarse-grained reward signals based on decision outcomes and structured checklists, leaving the uncertainty characteristics of agent decisions underexplored. We observe that decision-oriented reinforcement learning tends to weaken the uncertainty separation between correct and incorrect actions, resulting in overconfident mistakes and weaker exploration signals. Therefore, we propose TRUST, which incorporates uncertainty quantification into reward design as a repulsive force for maintaining uncertainty separation, and labels lightweight key-turn annotations for unified post-training of multi-turn trajectories. Experimental results across diverse tool-use benchmarks show that TRUST consistently enhances both decision quality and agent performance while maintaining more reliable uncertainty estimates during optimization.
TRUST addresses a specific and practically important problem: LLM-based agents make suboptimal tool-calling decisions (unnecessary tool invocations, hallucinated direct answers) that propagate errors in multi-turn interactions. The paper's central insight is that standard decision-oriented RL collapses the uncertainty separation between correct and incorrect actions: the IoU between the perplexity (PPL) distributions of correct and wrong decisions rises from 34.50% to 70.21% after vanilla GRPO. This is a meaningful and empirically grounded observation.
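To make the separation metric concrete, the following is a minimal sketch of how an IoU between two PPL distributions could be computed. It assumes a histogram-overlap definition with a shared bin range; the paper's exact binning and IoU definition are not given in this review, so `histogram_iou`, `perplexity`, and the `bins` parameter are illustrative names, not the authors' code.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one decision from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def histogram_iou(a, b, bins=20):
    """IoU of two normalized histograms built over a shared value range.

    Assumption: overlap is measured bin by bin as sum(min) / sum(max).
    """
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def hist(xs):
        counts = [0] * bins
        for x in xs:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        return [c / len(xs) for c in counts]

    ha, hb = hist(a), hist(b)
    inter = sum(min(p, q) for p, q in zip(ha, hb))
    union = sum(max(p, q) for p, q in zip(ha, hb))
    return inter / union


# Hypothetical PPL samples: well-separated groups give a low IoU.
correct_ppl = [1.0, 1.1, 1.2]
wrong_ppl = [5.0, 5.1, 5.2]
print(histogram_iou(correct_ppl, wrong_ppl))
```

Under this definition, a higher IoU means the model's uncertainty no longer distinguishes correct from wrong decisions, which is the collapse the review describes.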
The proposed solution integrates uncertainty quantification (via perplexity margins) as a "repulsive force" in the reward function, encouraging the model to maintain high uncertainty on incorrect decisions while being confident on correct ones. The reward design (Eq. 6) combines format validity, answer correctness, and an uncertainty-modulated classification reward. Additionally, TRUST introduces lightweight key-turn annotations for trajectory-level training, avoiding the need to relabel entire conversations.
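A rough sketch of how such a three-part reward might be assembled is given below. Eq. 6 itself is not reproduced in this review, so everything here is an assumption: the additive form, the weights, and the choice of `1 / ppl` as a stand-in for the uncertainty term c(s). It only illustrates the qualitative behavior described above, namely that confident mistakes are penalized more than uncertain ones.

```python
def confidence(ppl):
    """Hypothetical stand-in for c(s): maps perplexity in [1, inf) to (0, 1]."""
    return 1.0 / ppl


def trust_style_reward(format_ok, answer_ok, decision_ok, ppl,
                       w_format=0.1, w_answer=1.0, w_decision=1.0):
    """Illustrative reward: format + correctness + uncertainty-scaled decision term.

    The weights and the additive form are assumptions, not the paper's Eq. 6.
    """
    sign = 1.0 if decision_ok else -1.0
    return (w_format * float(format_ok)
            + w_answer * float(answer_ok)
            + w_decision * confidence(ppl) * sign)


# A confident wrong decision (low PPL) is penalized more than an uncertain one.
print(trust_style_reward(True, False, False, ppl=1.1))
print(trust_style_reward(True, False, False, ppl=4.0))
```

In this sketch the uncertainty term pushes wrong decisions toward higher perplexity and correct decisions toward lower perplexity, which is one way to read the "repulsive force" description.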
Practical applications: Tool-calling reliability is a genuine bottleneck in deployed LLM agents. Reducing hallucinated tool calls and missed necessary invocations directly lowers financial costs and reduces execution failures and information leakage. TRUST adds zero inference latency because its improvements are baked into the weights, which is a meaningful practical advantage over inference-time intervention methods.
Broader influence: The idea of using uncertainty as a repulsive reward signal during RL training could generalize beyond tool-calling to other agentic decision points—planning, memory management, or API selection. The trajectory annotation methodology (annotating only key turns) provides a scalable blueprint for multi-turn RL training.
Limitations of impact scope: The action space is fixed to four categories (DIRECT, TOOL, ASK, UNABLE), which may not capture the nuance of real-world agent decisions. The benchmarks, while diverse, are all text-based with predefined tool sets—the gap to dynamic, open-world tool ecosystems is significant.
This work is highly timely. The proliferation of LLM-based agents in production settings (coding assistants, customer service, research agents) has made tool-calling reliability a first-order concern. The paper addresses a current bottleneck at the intersection of two active research areas: RL for LLM post-training (GRPO, DeepSeek-style) and agent reliability/safety. The integration of uncertainty quantification into RL rewards, rather than treating it as a post-hoc diagnostic, represents a conceptually appealing direction that aligns with growing interest in calibrated and trustworthy AI systems.
The "repulsive force" framing is intuitive, but mathematically c(s) simply scales the classification reward, so it is unclear whether this constitutes a fundamentally new reward paradigm or a well-designed scaling factor. The connection to exploration (the claim that uncertainty alignment provides "stronger exploration signals") deserves more theoretical grounding. The paper would benefit from an analysis of how the reward landscape changes under TRUST versus vanilla GRPO.
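One way to probe the reward-landscape question is to compare group-relative advantages under a binary reward and under an uncertainty-scaled one. The sketch below uses the standard GRPO-style normalization (reward minus group mean, divided by group standard deviation); the use of the population standard deviation, the `eps` constant, and the example reward values are assumptions chosen for illustration, not numbers from the paper.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps) within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Binary decision reward: both wrong samples receive the same advantage.
binary = group_advantages([1.0, 1.0, -1.0, -1.0])

# Hypothetical uncertainty-scaled rewards: the confident mistake (-0.8)
# is pushed down harder than the uncertain one (-0.2).
scaled = group_advantages([0.9, 0.5, -0.2, -0.8])

print(binary)
print(scaled)
```

If the scaling only changes the magnitude of each sample's reward, its effect on training comes entirely through how it reorders and spreads advantages within a group, which is the kind of analysis the review asks for.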
Generated Jun 8, 2026
Paper 1 likely has higher impact because it introduces a new, timely benchmark targeting long-horizon, economically valuable, domain-specific GUI workflows—an evaluation gap with broad relevance to agent research, HCI, and real-world deployment. Benchmarks often become shared infrastructure, shaping research directions and enabling standardized comparisons across models and methods. While Paper 2’s uncertainty-aligned RL for tool-calling is innovative and useful, it is a more incremental algorithmic contribution whose impact depends on adoption and generalization, whereas Workflow-GYM can catalyze a wider ecosystem of methods and evaluations.
Paper 2 addresses a critical gap in AI safety and interpretability by recovering hidden instructions and prompt injections directly from model activations. Its novel framing of instruction set retrieval provides a vital tool for monitoring deployed LLM agents, offering broader and more profound implications for AI security and alignment compared to Paper 1's performance optimization for tool-calling.
Paper 1 addresses a fundamental challenge in LLM-based agents—tool-use decision quality—by introducing a novel uncertainty-aligned reinforcement learning framework (TRUST). This combines uncertainty quantification with reward design in a principled way, offering both methodological innovation and broad applicability across agentic AI systems. Paper 2, while providing a useful benchmark for table understanding across formats, is primarily an evaluation/benchmarking contribution with narrower scope. Paper 1's integration of uncertainty into RL training for agents has deeper methodological implications and broader impact potential given the rapid growth of agentic AI systems.
Paper 2 likely has higher impact due to its immediate real-world applicability to long-context LLM inference efficiency (a major deployment bottleneck), broad relevance across models and systems, and training-free, fine-grained adaptive compute allocation that can be adopted widely. The reported large speedups at 100k+ tokens with minimal quality loss and released code further increase practical and scientific uptake. Paper 1 is novel for uncertainty-aligned RL in tool-calling, but is narrower in scope and depends on post-training/RL setups, potentially limiting immediate, cross-field adoption compared to inference-time acceleration.
Paper 1 presents a concrete, novel training method (uncertainty-aligned RL reward shaping plus lightweight annotations) with benchmarked improvements in tool-use decisions and calibrated uncertainty—likely to be adopted and extended in current agentic LLM pipelines. Its methodological rigor and near-term applicability to widely deployed tool-calling agents support strong impact. Paper 2 is timely and potentially broad (governance/accountability) but is primarily conceptual/architectural with limited demonstrated implementation, making near-term scientific uptake and measurable influence less certain despite high-level importance.
Paper 1 tackles a foundational challenge in agentic AI: automatic skill acquisition from heterogeneous traces. Its structured RWSA decomposition offers a highly novel methodological framework to transition from unstructured logs to robust, executable specifications. This has wide-ranging applications for scalable, self-improving agents. While Paper 2 presents a solid improvement for RL-based tool calling, Paper 1's architectural innovation in procedural knowledge encoding gives it a broader potential impact across multiple domains.
Paper 2 addresses a highly timely and critical issue in modern Large Reasoning Models—inference inefficiency or 'overthinking'. Its training-free approach using dynamic difficulty modeling from latent representations offers an elegant, computationally efficient solution. This broadens its applicability across various domains like math, QA, and coding. While Paper 1 presents a solid methodological improvement for tool-use agents, Paper 2's potential to significantly reduce inference costs for state-of-the-art reasoning models without sacrificing accuracy gives it a wider and more immediate scientific and practical impact.
Paper 1 introduces a more novel, generalizable methodological contribution: integrating uncertainty quantification directly into RL reward shaping to preserve uncertainty separation and reduce overconfident tool-use errors. This targets a foundational learning dynamics issue likely applicable beyond tool-calling (e.g., decision-making, exploration, calibration), offering broad cross-field impact and stronger scientific novelty. Paper 2 is a capable systems/engineering advance with benchmark gains and auditability, but its contributions are more architectural and platform-specific, with less clear methodological generality and rigor compared to a principled learning objective.
Paper 1 addresses a fundamental challenge in LLM-based agents—uncertainty and hallucination in tool-calling—by integrating uncertainty quantification into reinforcement learning. This theoretical and methodological innovation has broad applicability across various domains relying on agentic AI. In contrast, while Paper 2 offers significant practical improvements for automated software development, its impact is more narrowly focused on code localization efficiency, making Paper 1 more scientifically foundational and widely impactful.
Paper 2 addresses a critical and broadly applicable challenge in modern AI: improving tool-use decisions and reducing hallucinations in LLM agents. By incorporating uncertainty quantification into reinforcement learning, it offers a novel methodological improvement that impacts the rapidly growing field of autonomous AI agents. While Paper 1 presents a strong framework for VQA, the ubiquitous need for reliable agentic tool-use gives Paper 2 a wider potential real-world application and broader impact across various domains.