PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization
Wonjoong Kim, Yeonjun In, Sangwu Park, Dongha Lee, Chanyoung Park
Abstract
A significant hurdle for current LLMs is the execution of complex, multi-stage tasks. Group Relative Policy Optimization (GRPO) has emerged as a leading choice, but its reliance on sparse outcome rewards severely limits credit assignment across intermediate steps. Existing remedies, such as running full rollouts to assign step-level advantages, calling external LLM judges at each step, or computing intrinsic rewards that require ground-truth answers at every evaluation, introduce significant costs or practical constraints. We hypothesize that internal correctness probing over LLM hidden states can be repurposed as a step-level reward signal, potentially addressing all of these limitations at once. However, existing probing research assumes clean inputs, and we first show that this assumption breaks down in multi-step settings: hidden-state probes degrade severely under prefix contamination, tracking coherence with the (possibly corrupted) prefix rather than grounded correctness, while attention-based features remain robust to contamination but underperform on clean prefixes. Building on this complementary relationship, we propose the Prefix-Aware Internal Reward (PAIR), a two-stage model with a frozen hidden-state probe estimating belief-consistency and a lightweight attention-based head correcting it toward grounded correctness. Experimental results show that PAIR achieves the highest AUROC on contaminated trajectories while operating at negligible inference cost, enabling dense step-level reward signals for GRPO training without external model calls, ground-truth dependencies, or full-trajectory rollouts.
AI Impact Assessments
Scientific Impact Assessment of PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization
1. Core Contribution
PAIR addresses the credit assignment problem in multi-turn reinforcement learning for LLM agents. The key insight is that hidden-state probes trained on LLM internal representations track *belief-consistency* (coherence with the prefix) rather than *grounded correctness* (actual task utility), and this distinction becomes critical when earlier steps in a trajectory contain errors—a condition termed "prefix contamination." The paper identifies that attention-based features are comparatively robust to prefix contamination while hidden-state features excel under clean conditions, then proposes a two-stage architecture: a frozen hidden-state probe provides a belief-consistency baseline score, and a lightweight attention-based correction head adjusts it toward grounded correctness. This provides dense step-level rewards for GRPO without external LLM calls, ground-truth dependencies, full-trajectory rollouts, or separate reward models.
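The two-stage design described above can be illustrated with a minimal sketch. It assumes that both stages are logistic models and that the correction is added in logit space; the combination rule, the feature dimensions, and every name and weight below are hypothetical choices for illustration, not the authors' implementation.

```python
import math

def sigmoid(z: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def base_logit(hidden, w_h, b_h):
    """Stage 1: frozen hidden-state probe producing a belief-consistency logit."""
    return dot(hidden, w_h) + b_h

def pair_score(hidden, attn_feats, w_h, b_h, w_a, b_a):
    """Stage 2: shift the frozen logit by an attention-based correction.

    Adding the correction in logit space is an assumption of this sketch.
    """
    correction = dot(attn_feats, w_a) + b_a
    return sigmoid(base_logit(hidden, w_h, b_h) + correction)

# Toy step representation (hypothetical values).
hidden = [0.5, -1.0, 2.0]          # hidden-state features for one step
attn_feats = [0.2, 0.8]            # attention-derived features for the same step
w_h, b_h = [1.0, 0.5, 0.25], 0.0   # frozen stage-1 weights
w_a, b_a = [-2.0, -1.0], 0.0       # trainable stage-2 weights

belief = sigmoid(base_logit(hidden, w_h, b_h))
reward = pair_score(hidden, attn_feats, w_h, b_h, w_a, b_a)
print(f"belief-consistency score: {belief:.4f}")
print(f"corrected step reward:    {reward:.4f}")
```

In this toy case the attention features pull the score down, which mirrors the situation where a step is coherent with a contaminated prefix but not actually correct. With zero correction weights the output reduces to the frozen probe's score.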
2. Methodological Rigor
Strengths in experimental design: The motivation experiments are well-constructed. The matched clean/contaminated trajectory pairs with identical evaluation turns enable causal identification of prefix contamination effects. The adversarial diagnostic set (Section 3.1.3), where belief-consistency and grounded correctness are deliberately anti-correlated, provides compelling mechanistic evidence—hidden-state probes score at or below chance while attention probes remain above chance.
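To make "at or below chance" concrete, the sketch below computes AUROC as the fraction of (positive, negative) pairs that a probe ranks correctly, with ties counted as half. A value of 0.5 is chance level. The scores and labels are hypothetical toy values, not data from the paper.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# 1 = grounded-correct step, 0 = incorrect step (toy labels).
labels = [1, 1, 0, 0]

# A probe whose scores are anti-correlated with correctness,
# as on an adversarial set where coherence and correctness disagree.
anti_correlated = [0.2, 0.4, 0.7, 0.9]

# A probe that still ranks most correct steps above incorrect ones.
partly_robust = [0.8, 0.6, 0.7, 0.1]

print(auroc(anti_correlated, labels))  # below chance
print(auroc(partly_robust, labels))    # above chance
```

A probe that tracks coherence with a corrupted prefix will rank incorrect steps above correct ones on such a set, which pushes its AUROC below 0.5.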
Concerns: The probing architecture is remarkably simple—two logistic regression models. While simplicity is a strength for deployment, it raises questions about whether the method captures sufficient complexity for diverse contamination patterns. The paper uses a single backbone model (Qwen-2.5-7B-Instruct) throughout, limiting generalizability claims. The contaminated trajectories are generated by GPT-4o-mini using prescribed templates, introducing potential distribution mismatch with naturally occurring errors during actual RL exploration.
The downstream RL results (Table 3) show PAIR achieving the best results on both benchmarks, but the improvements over strong baselines are modest: 0.2489 vs. 0.2387 (LLM-as-judge) on GTA and 0.4498 vs. 0.4203 on ToolBench. On GTA both methods report a standard deviation of ±0.042, roughly four times the 0.0102 gap between them, so the GTA improvement may not be statistically significant. The ToolBench improvement is more convincing. The evaluation protocol on ToolBench uses LLM-judge scoring with partial credit rather than the standard evaluation, which the authors justify but which complicates comparison with other work.
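The size of these gains can be checked with a short calculation. The scores and the GTA standard deviation are the ones quoted above; expressing the gap in units of the standard deviation is only a rough indicator chosen for this sketch, not a significance test, and no ToolBench deviation is used because none is quoted.

```python
def gain(new: float, old: float):
    """Return the absolute and relative improvement of new over old."""
    absolute = new - old
    return absolute, absolute / old

# Scores quoted in the assessment (PAIR vs. the compared baseline).
gta_abs, gta_rel = gain(0.2489, 0.2387)
tb_abs, tb_rel = gain(0.4498, 0.4203)

# Standard deviation reported for both methods on GTA.
gta_std = 0.042
gta_gap_in_std = gta_abs / gta_std

print(f"GTA:       +{gta_abs:.4f} ({gta_rel:.1%}), gap = {gta_gap_in_std:.2f} std")
print(f"ToolBench: +{tb_abs:.4f} ({tb_rel:.1%})")
```

The GTA gap is about a quarter of one standard deviation, which is why the two results are hard to separate without more seeds or a formal test.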
3. Potential Impact
Practical significance: The computational efficiency argument is strong. At ~0.2ms per step versus ~500ms for LLM-as-judge or seconds for tree-based methods, PAIR enables dense reward computation that is genuinely practical for online RL training at scale. This addresses a real bottleneck—the prohibitive cost of step-level reward computation in multi-turn agent training.
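A back-of-the-envelope calculation shows what the per-step figures imply at scale. The two latencies are the approximate values quoted above; the step count is a hypothetical workload chosen for illustration, and the estimate ignores batching and parallelism.

```python
PAIR_MS_PER_STEP = 0.2     # approximate figure quoted in the assessment
JUDGE_MS_PER_STEP = 500.0  # approximate figure quoted in the assessment

def total_seconds(per_step_ms: float, n_steps: int) -> float:
    """Sequential wall-clock time to score n_steps steps."""
    return per_step_ms * n_steps / 1000.0

# Hypothetical workload: one million scored steps over a training run.
n_steps = 1_000_000

speedup = JUDGE_MS_PER_STEP / PAIR_MS_PER_STEP
pair_total = total_seconds(PAIR_MS_PER_STEP, n_steps)
judge_total = total_seconds(JUDGE_MS_PER_STEP, n_steps)

print(f"speedup:      ~{speedup:.0f}x")
print(f"PAIR total:   ~{pair_total / 60:.1f} minutes")
print(f"judge total:  ~{judge_total / 3600:.0f} hours")
```

Under these assumptions the reward computation drops from days of sequential judge calls to a few minutes, which is what makes per-step scoring practical inside an online training loop.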
Conceptual contribution: The belief-consistency vs. grounded-correctness distinction is a genuinely useful conceptual framework. This insight extends beyond the specific PAIR architecture and could inform future work on internal state probing, reward modeling, and mechanistic interpretability. The finding that attention features are more robust to distributional shift (cross-domain transfer experiment) than hidden-state features adds to our understanding of what different internal representations encode.
Broader applicability: The approach could potentially extend to any setting where LLMs must evaluate intermediate steps in sequential decision-making—code generation, multi-hop reasoning, planning—though the paper only demonstrates this on tool-use benchmarks.
4. Timeliness & Relevance
The paper is highly timely. Multi-turn agent training with RL (particularly GRPO) is an active frontier, and the sparse reward problem is widely recognized. The intersection of mechanistic interpretability (probing internal states) and practical RL training (reward design) is an emerging area with significant potential. The comparison table (Table 1) effectively positions PAIR against concurrent work (AT2PO, Tree-GRPO, IGPO, TIPS, SWEET-RL, AgentPRM) and identifies a genuine gap.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Missing comparisons: The paper does not compare against recently proposed process reward models adapted for agentic settings (beyond citing them), nor does it compare against self-consistency or majority voting approaches that also require no external supervision.
6. Additional Observations
The paper is well-written with clear exposition. The framing around four simultaneous desiderata (dense rewards, no LLM calls, no full rollouts, no ground truth) is effective. The appendices are thorough, with detailed ablations and reproducibility information. The sensitivity analysis regarding contamination distance (Appendix B) strengthens the mechanistic claims. However, the practical impact depends heavily on whether the offline probe training generalizes to the evolving policy distribution during RL training—a question not fully addressed.
Generated May 19, 2026
Comparison History (20)
Paper 2 (PAIR) is likely higher impact due to its broadly useful, low-cost mechanism for dense step-level rewards in multi-turn agent RL without external judges, ground-truth per-step labels, or full rollouts—constraints that commonly block real deployments. The prefix-contamination framing is timely and clarifies why prior probing fails, and the proposed hybrid (hidden-state + attention correction) seems methodologically testable and reusable across models/tasks. Paper 1 is innovative for token-level credit assignment and stability, but relies more on complex training heuristics and may be narrower to RLVR/self-distillation setups.
PAIR addresses a fundamental challenge in LLM agent training—credit assignment in multi-turn tasks—with a novel internal reward mechanism that avoids costly external judges or rollouts. The discovery about prefix contamination degrading hidden-state probes is a genuinely new insight with broad implications for the RL-from-human-feedback and agent optimization communities. Its applicability spans any multi-step LLM agent task, giving it wider impact. PRISMat, while valuable for materials science, addresses a narrower domain with incremental improvements over existing methods. PAIR's methodological contributions are more likely to influence a larger research community.
Paper 2 addresses a critical bottleneck in LLM multi-turn agent optimization by improving reward assignment for GRPO, a highly relevant and timely topic in modern AI. Its approach operates without external model calls or ground-truth dependencies, offering immediate practical utility for scaling LLM reasoning. In contrast, Paper 1 focuses on executable world models in a specific puzzle game environment, which, while theoretically novel, has a narrower scope and less immediate real-world applicability compared to LLM reinforcement learning.
Paper 1 presents a significantly more novel contribution by addressing a fundamental limitation of GRPO in multi-turn agent optimization through internal reward modeling. It identifies a new problem (prefix contamination degrading hidden-state probes), proposes an innovative two-stage solution (PAIR), and eliminates the need for external LLM judges or ground-truth dependencies. This has broad applicability across LLM agent tasks. Paper 2 proposes a relatively incremental modification (shared backbone PPO) applied to a narrow UAV coverage domain, with limited novelty beyond standard architectural parameter sharing.
PAIR addresses a fundamental challenge in multi-turn RL training for LLMs—dense credit assignment without external supervision—by discovering and exploiting a novel phenomenon (prefix contamination degrading hidden-state probes) and proposing an elegant two-stage internal reward model. This has broader impact: it enables scalable GRPO training across many multi-step agent tasks without costly external judges or ground-truth requirements. Paper 1, while addressing a real problem (stale-premise grounding), shows modest empirical gains (+1.3-2.6pp on main benchmarks) and addresses a narrower verification use case. PAIR's methodological insight about internal representations is more generalizable across the field.
Paper 1 is likely to have higher scientific impact due to its novel, generally applicable method for dense step-level reward shaping in multi-turn agent optimization without external judges or ground-truth, addressing a core bottleneck in RLHF/agent training. It introduces a clear technical insight (prefix contamination) and a new model design (probe + attention correction) that could transfer across tasks, models, and training algorithms, potentially influencing future agent-RL methods broadly. Paper 2 demonstrates strong industrial value and scale, but its contributions (FSM augmentation, selective generation, dual-evaluator iteration) are more application-specific and less likely to generalize as a new foundational technique.
Paper 1 likely has higher impact: it identifies a timely, under-addressed safety failure mode unique to exposed reasoning traces, provides a large-scale multi-model evaluation (15 models, 41K prompts/model, OOD sources), and proposes a practical white-box mitigation with quantified safety gains while preserving task accuracy. Its implications span AI safety, interpretability, evaluation standards, and deployment policy. Paper 2 is innovative for dense internal rewards in multi-turn RL and may be influential for agent training, but its scope is narrower and impact depends more on downstream adoption and validation across tasks/models.
Paper 2 addresses a fundamental bottleneck in training LLM agents: efficient credit assignment in multi-step reasoning without relying on expensive external judges or full rollouts. By repurposing internal hidden states and attention features into a dense reward signal (PAIR), it offers a scalable solution to a highly critical and timely problem in LLM optimization. While Paper 1 is valuable for domain-specific scientific reasoning, Paper 2 presents a broader, more fundamental methodological innovation with widespread implications across all complex LLM agent tasks.
Paper 2 has higher potential scientific impact because it targets a foundational, timely problem—making responsibility in agentic AI computable via explicit provenance—relevant across many domains (software agents, governance, safety, auditing, law/policy). Its framing (responsibility gaps, causal attribution, responsibility tensor, lifecycle-layer computability, intervention) could set a broad research and standards agenda with significant real-world adoption pressure. Paper 1 is technically novel and useful for RLHF/GRPO efficiency, but its impact is narrower to LLM training and depends on robustness/generalization of probing-based rewards.
Paper 1 is more novel and timely: it introduces a new internal, dense step-level reward signal for multi-turn LLM agent training, addressing a major bottleneck in RLHF/GRPO credit assignment without costly rollouts, external judges, or ground-truth step labels. The prefix-contamination analysis plus a two-stage correction mechanism suggests methodological depth and broad applicability across agentic LLM optimization and reinforcement learning. Paper 2 is an incremental architecture variation within a mature traffic-forecasting GNN space, with narrower domain impact despite practical relevance.
PAIR addresses a fundamental challenge in multi-turn LLM agent optimization—credit assignment with dense step-level rewards—offering a novel, efficient solution using internal hidden-state probing. Its methodological contribution (prefix-aware reward model combining hidden-state and attention-based features) is broadly applicable across diverse multi-step reasoning and agent tasks, not limited to a specific domain. SVFSearch, while valuable as a benchmark for gaming-domain video search, is narrower in scope (Chinese gaming vertical) and primarily evaluative rather than methodologically innovative. PAIR's potential to improve GRPO training across many agent applications gives it broader impact.
Paper 1 presents a highly novel approach by repurposing internal hidden states as step-level rewards, addressing the critical bottleneck of credit assignment in multi-step reasoning without relying on expensive external models or ground-truth data. This methodological innovation offers profound implications for efficient agent optimization, potentially transforming how LLMs are trained for complex tasks, giving it a broader and more fundamental scientific impact compared to the exploration techniques in Paper 2.
Paper 2 addresses a fundamental and broadly applicable problem—capability erosion in self-evolving LLM agents—that affects all major evolution channels (workflow, skill, model, memory). This identifies a new phenomenon with clear parallels to catastrophic forgetting but in a novel agent context, making it highly relevant as autonomous agents become widespread. Paper 1, while technically rigorous with its prefix-aware internal reward model, addresses a more specific optimization challenge (credit assignment in GRPO training). Paper 2's broader scope, timeliness given the rise of autonomous agents, and cross-cutting implications give it higher potential impact.
Paper 2 has higher likely scientific impact due to its strong real-world applicability (clinical CDSS), broad community value (fully open, auditable end-to-end pipeline), and timeliness amid demands for transparency, reproducibility, and safety in medical AI. Methodologically, it emphasizes decontamination, clinician auditing, and human-calibrated evaluation, enabling adoption and extension across institutions and domains. Paper 1 is novel and potentially important for RL/agent training, but its impact is more specialized and depends on wider validation across tasks/models and downstream performance gains beyond reward-model AUROC.
Paper 2 is more likely to have higher scientific impact: it introduces a novel, scalable mechanism (PAIR) for dense, low-cost step-level rewards in multi-turn agent RL, directly addressing a core bottleneck in agent optimization (credit assignment without expensive judges/rollouts). If validated, it can broadly improve training pipelines across tasks and domains, with clear real-world applicability and timeliness given current interest in RL for LLM agents. Paper 1 is important for evaluation transparency and rigor, but is more incremental/diagnostic and its impact may be narrower than a training method that can materially improve agent performance.
Paper 2 likely has higher impact due to broader applicability: improving multi-turn agent optimization and dense credit assignment is relevant across many LLM/agent domains (coding, tool use, planning, robotics), not just chemistry. The approach (internal reward modeling robust to prefix contamination) is novel and timely for RLHF/RLAIF alternatives, potentially reducing reliance on costly external judges and rollouts. Paper 1 is strong and rigorous, with clear real-world utility in chem-informatics and a new benchmark, but its impact is more field-specific compared to general agent training infrastructure advances.
PAIR addresses a fundamental challenge in multi-turn LLM agent optimization—credit assignment in reinforcement learning—with a novel, principled approach that repurposes internal hidden states as reward signals. It identifies a new phenomenon (prefix contamination degrading probes), proposes an elegant two-stage solution, and enables practical dense reward signals without external dependencies. This has broad applicability across agentic AI systems. Paper 2, while valuable for challenging claims about chess LLMs and demonstrating memorization vs. generalization, is more domain-specific (chess) and primarily a critical/empirical contribution rather than introducing a broadly applicable new method.
Paper 2 likely has higher scientific impact due to its methodological novelty (prefix-aware internal reward modeling to enable dense step-level signals), broad applicability to multi-turn agent optimization across domains, and strong timeliness given current interest in RL for LLM agents and GRPO-like methods. It targets a core bottleneck—credit assignment without expensive rollouts, judges, or ground truth—offering practical scalability. Paper 1 is valuable and application-relevant, but its impact is narrower (CBT distress estimation) and more dataset/evaluation-centric with constrained generalization due to data privacy and domain specificity.
Paper 2 introduces a novel, broadly applicable training component (PAIR) that provides dense step-level rewards for multi-turn agent optimization without costly rollouts, external judges, or per-step ground truth—addressing a key bottleneck in RL/RLAIF for LLM agents. The method is technically innovative (prefix-contamination analysis + hybrid probe/attention correction), likely to be reusable across tasks and labs, and highly timely given current focus on agentic multi-step reasoning and efficient RL. Paper 1 offers important diagnostic insights about LLM negotiation limits, but is more domain-specific and primarily descriptive rather than enabling a general new capability.
Paper 2 is likely higher impact: it introduces a novel, generally applicable training signal (dense step-level rewards) for multi-turn agent optimization without external judges or ground-truth, directly addressing a key bottleneck in RLHF-style methods (credit assignment) with low inference cost. The approach is methodologically grounded (diagnosis of prefix contamination, then a two-stage corrective model) and broadly relevant across LLM agent training, alignment, and RL. Paper 1 is valuable infrastructure for web-agent evaluation, but its impact is more domain-specific (e-commerce) and benchmark-focused rather than a widely reusable learning method.