Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

Tianyang Han, Hengyu Shi, Junjie Hu, Xu Yang, Zhiling Wang, Junhao Su

May 5, 2026

arXiv:2605.03862v2

cs.AI (primary), cs.CL

#183 of 2292 · Artificial Intelligence

Tournament Score: 1523 ± 46 (scale 1050 to 1800)

Win Rate: 80%

Rating: 6.2 / 10

Significance 6.5 · Rigor 6.5 · Novelty 6 · Clarity 6.5

Abstract

Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it. Our code is available at: https://github.com/MasaiahHan/TraceLift

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards"

1. Core Contribution

The paper addresses a genuine credit-assignment problem in reinforcement learning for LLM reasoning: outcome-only rewards (e.g., final-answer correctness) can reinforce reasoning traces that are "right for the wrong reasons." The proposed solution, TraceLift, introduces a planner-executor decomposition where a trainable planner produces reasoning traces consumed by a frozen executor. The reward combines: (a) verifier feedback on executor outputs, (b) a rubric-based Reasoning Reward Model (RM) score, and (c) measured executor uplift—whether the trace actually improves executor performance over a no-reasoning baseline. The multiplicative coupling of RM quality and executor uplift is the key design choice: traces receive credit only when they are both high-quality *and* useful to the downstream executor.
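The multiplicative coupling described above can be illustrated with a minimal sketch. The exact formula used in the paper is not reproduced on this page, so the additive verifier term, the clipping of negative uplift, and the names `executor_grounded_reward` and `bonus_weight` are assumptions made purely for illustration.

```python
def executor_grounded_reward(verifier_score: float,
                             rm_score: float,
                             uplift: float,
                             bonus_weight: float = 1.0) -> float:
    """Hypothetical reward: verifier term plus (RM score x executor uplift).

    verifier_score: verifier feedback on the frozen executor's output.
    rm_score: rubric-based Reasoning RM score, assumed to lie in [0, 1].
    uplift: executor performance with the trace minus the no-reasoning baseline.
    """
    if not 0.0 <= rm_score <= 1.0:
        raise ValueError("rm_score is assumed to lie in [0, 1]")
    # Assumption: a trace that hurts the executor earns no bonus at all.
    useful_uplift = max(uplift, 0.0)
    # The product vanishes when either factor is zero, so a trace is credited
    # only when it is both high-quality and useful to the executor.
    return verifier_score + bonus_weight * rm_score * useful_uplift
```

Under these assumptions, a trace with a perfect rubric score but zero uplift receives only the verifier term, which is the behavior the review attributes to the multiplicative design.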

A secondary contribution is TraceLift-Groups, a dataset of 6,000 reasoning groups (math + code) where each group contains a reference trace and multiple targeted perturbations with rubric annotations. This enables training a reasoning-quality RM without reliance on final-answer labels.
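To make the same-problem group structure concrete, here is a small sketch of how one such group might be represented and turned into preference pairs for RM training. The released dataset schema is not shown on this page; every field name below (`rubric_score`, `perturbation`, and so on) and the pairing rule are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Trace:
    text: str
    rubric_score: float          # hypothetical aggregate rubric score
    perturbation: str = "none"   # hypothetical label for the injected flaw


@dataclass
class TraceGroup:
    problem: str
    domain: str                  # "math" or "code"
    reference: Trace
    flawed: List[Trace] = field(default_factory=list)


def preference_pairs(group: TraceGroup) -> List[Tuple[str, str]]:
    """Return (preferred, rejected) pairs: the reference against each flawed
    trace that scores strictly lower on the rubric."""
    return [
        (group.reference.text, t.text)
        for t in group.flawed
        if t.rubric_score < group.reference.rubric_score
    ]
```

Because both traces in each pair address the same problem, a model trained on such pairs has to separate them by reasoning quality rather than by topic or final-answer label.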

2. Methodological Rigor

Strengths in experimental design:

The frozen-executor evaluation protocol is well-designed: since only the planner changes, improvements must come from better reasoning traces rather than a stronger answer generator.

The paper tests across three model families (Qwen2.5-7B, Llama3.1-8B, Qwen3-4B) and two domains (code, math), with results averaged over three seeds.

Ablations are thorough: reward component ablations (no-uplift, RM-uplift only, LLM-as-judge), executor comparison count K, LoRA vs. full-parameter training, and reasoning length analysis all provide useful diagnostics.

The theoretical analysis in Section 10 is extensive, formalizing the reward as a quality-weighted conditional treatment effect and proving properties about credit assignment.

Weaknesses and concerns:

The improvements, while consistent, are moderate: ~2-4 percentage points on micro-averages. For math on Qwen3-4B, the gain is only 0.84 points. While the frozen-executor constraint makes any gain meaningful, the practical significance could be questioned.

The Reasoning RM is validated only on held-out perturbation groups from the same synthetic pipeline (Table 13). It's unclear how well this RM generalizes to naturally occurring reasoning errors versus the seven predefined perturbation types.

The rubric annotation relies on LLM judges (likely GPT-4 or similar), which the paper itself notes can have calibration issues. The paper somewhat addresses this by showing the trained RM outperforms direct LLM-as-judge, but the training data quality still depends on the initial LLM judge.

The uplift estimation requires running the executor multiple times per trace during training (K=3 comparisons), significantly increasing computational cost. The paper doesn't discuss wall-clock training time overhead.

The theoretical analysis, while thorough, is largely descriptive—it formalizes properties of the chosen reward rather than proving anything about convergence or optimality.
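The cost concern about uplift estimation raised in the list above can be made concrete with a rough sketch. It assumes that each of the K comparisons is one executor run with the trace paired with one run without it; the paper's actual protocol may differ, and `estimate_uplift` and `CountingExecutor` are illustrative names, not part of the released code.

```python
from typing import Optional


class CountingExecutor:
    """Stub standing in for a frozen executor plus verifier; counts calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, problem: str, trace: Optional[str]) -> float:
        self.calls += 1
        # Toy behavior: the executor succeeds only when given a trace.
        return 1.0 if trace is not None else 0.0


def estimate_uplift(executor, problem: str, trace: str, k: int = 3) -> float:
    """Mean verified score with the trace minus mean score without it."""
    with_trace = sum(executor(problem, trace) for _ in range(k)) / k
    without_trace = sum(executor(problem, None) for _ in range(k)) / k
    return with_trace - without_trace


executor = CountingExecutor()
uplift = estimate_uplift(executor, "toy problem", "toy trace", k=3)
```

Under this assumed protocol each scored trace costs 2K executor calls, although the no-reasoning baseline could be cached once per problem, which would bring the marginal cost down to K calls per trace.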

3. Potential Impact

The core insight—that reasoning traces should be evaluated by their downstream utility to a consumer model, not just intrinsic quality—is valuable and generalizable. This framing could influence:

Multi-agent LLM systems: Any pipeline where one model's output is consumed by another could benefit from executor-grounded training.

Process supervision: The work complements existing process reward models by adding a utility dimension.

Modular AI systems: The planner-executor decomposition aligns with trends toward compositional, tool-using AI systems.

However, the specific implementation choices (GRPO, the particular rubric dimensions, the perturbation types) are somewhat narrowly tailored, limiting immediate broader applicability.

4. Timeliness & Relevance

This work is highly timely. The RL-with-verifiable-rewards paradigm (exemplified by DeepSeek-R1) has become dominant for training reasoning models, and the paper correctly identifies that outcome-only rewards create a gap between reward signal and reasoning quality. The planner-executor framing also aligns with the growing use of agentic LLM pipelines. The problem of "right for wrong reasons" is well-recognized but under-addressed with principled solutions.

5. Strengths & Limitations

Key Strengths:

Clean problem formulation: The idea that reasoning is a "consumable intermediate artifact" is intuitive and well-motivated.

Principled reward design: The multiplicative coupling of quality and uplift is elegant—zero uplift zeroes out the RM bonus regardless of rubric score.

Comprehensive ablations: Each component is justified empirically and theoretically. The LLM-as-judge saturation analysis (95.5% at score 1.0) is particularly revealing.

Length analysis: Demonstrating that TraceLift produces *shorter* but more effective traces (~30 tokens shorter than Exec-only) effectively rules out the verbosity confound.

Code availability: Promised release supports reproducibility.

Notable Limitations:

Scale: All experiments use 7B/8B/4B models. It's unclear whether the gains persist at larger scales where reasoning may already be more robust.

Dataset construction: The seven perturbation types per domain are manually designed. Natural reasoning failures may not follow these patterns, potentially limiting RM generalization.

Evaluation benchmarks: The benchmarks (HumanEval, GSM8K, MATH500) are relatively standard. Harder reasoning benchmarks (e.g., competition math, complex software engineering) would better test the claim.

No comparison to process reward models: The paper positions against outcome-only rewards but doesn't compare against existing process supervision methods (e.g., Math-Shepherd, PRM800K-style approaches).

Executor diversity: All experiments use same-family executors. Cross-family executor transfer is unexplored.

6. Additional Observations

The paper is notably long (36 pages) with extensive appendices including full prompts, theoretical proofs, and qualitative case studies. While thorough, the main contribution could be communicated more concisely. The qualitative case studies in Section 9 are genuinely informative, showing specific mechanisms by which TraceLift traces differ (e.g., regex anchoring, type guards, entity tracking).

The theoretical contribution, while technically sound, primarily serves to formalize intuitions rather than provide non-obvious insights. The most useful theoretical result is arguably Corollary 4 (saturated judges reduce to execution-only ranking), which directly explains an empirical finding.
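The intuition behind the saturated-judge result can be checked on toy numbers. The sketch below is not the paper's proof or its exact reward; it assumes a reward of the form execution score plus judge score times clipped uplift, with uplift measured against a shared per-problem baseline, and all values are made up for illustration.

```python
def reward(exec_score: float, baseline: float, judge_score: float) -> float:
    # Assumed form: execution term plus judge score times clipped uplift.
    uplift = max(exec_score - baseline, 0.0)
    return exec_score + judge_score * uplift


def ranking(scores):
    """Indices of traces sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


exec_scores = [0.2, 0.9, 0.5, 0.7]   # illustrative per-trace execution scores
baseline = 0.4                       # illustrative no-reasoning baseline
saturated = [1.0, 1.0, 1.0, 1.0]     # judge gives every trace the top score
informative = [0.9, 0.0, 1.0, 1.0]   # judge that discriminates between traces

exec_only_rank = ranking(exec_scores)
saturated_rank = ranking(
    [reward(e, baseline, j) for e, j in zip(exec_scores, saturated)])
informative_rank = ranking(
    [reward(e, baseline, j) for e, j in zip(exec_scores, informative)])
```

With a constant judge score the reward is a strictly increasing function of the execution score alone, so `saturated_rank` matches `exec_only_rank`, whereas the discriminating judge reorders the top two traces in this toy example.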

Rating: 6.2 / 10

Significance 6.5 · Rigor 6.5 · Novelty 6 · Clarity 6.5

Generated May 7, 2026

Comparison History (20)

vs. Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

gemini-3.1 · 5/22/2026

Paper 2 addresses a critical and highly timely challenge in LLM reasoning: reward hacking and unfaithful reasoning traces in RL training. By introducing executor-grounded rewards, it ensures reasoning is both high-quality and practically useful for the consuming model. This has broad, immediate applications across agentic AI, complex problem solving, and verifiable reasoning frameworks, potentially offering a wider real-world impact than the architectural improvements to linear attention presented in Paper 1.

vs. Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

gpt-5.2 · 5/18/2026

Paper 1 has higher potential impact due to a more novel and broadly relevant framing: reasoning traces as consumable intermediate artifacts with executor-grounded rewards that explicitly target faithfulness/usefulness, not just outcome correctness. It contributes a concrete training framework (planner–executor with uplift-weighted rubric scoring) plus a new rubric-annotated dataset of contrastive flawed traces, which can be reused by the community. This addresses a central limitation in RLVR (rewarding “right for wrong reasons”) with implications for multi-step tool use and agentic systems. Paper 2 is timely and useful for efficiency, but is more incremental within exploration/rollout optimization.

vs. Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

gemini-3.1 · 5/16/2026

Paper 2 addresses a critical bottleneck in Vision-Language Models by disentangling perceptual errors ('bad seeing') from logical errors ('bad thinking'). Its novel Modality-Aware Credit Assignment and 'blindfolded reasoning' proxy offer a fundamental solution to the perception-reasoning trade-off in VLMs. While Paper 1 presents a strong approach for LLM reasoning, Paper 2's methodology tackles a more complex multi-modal challenge, likely yielding broader impact across the rapidly growing field of spatial and visual reasoning.

vs. $δ$-mem: Efficient Online Memory for Large Language Models

gpt-5.2 · 5/16/2026

Paper 2 ($δ$-mem) likely has higher impact due to broader applicability and timeliness: efficient long-term memory for LLM assistants/agents is a core bottleneck across many domains, and its lightweight, frozen-backbone online update mechanism is broadly deployable without retraining or context extension. The method is simple, scalable, and could influence architectures and systems work widely. Paper 1 addresses an important reliability issue in RL-for-reasoning, but is more specialized to planner–executor setups and trace-supervision paradigms, with narrower cross-field reach.

vs. Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

gemini-3.1 · 5/16/2026

Paper 2 addresses a critical bottleneck in the current frontier of AI: training reliable reasoning models. By proposing a method to reward faithful and useful reasoning traces rather than just correct final answers, it directly tackles reward hacking in RL-based reasoning systems. While Paper 1 offers a valuable diagnostic benchmark for multimodal grounding, Paper 2's methodological innovation in planner-executor frameworks has broader implications for scaling test-time compute and improving explicit reasoning capabilities across diverse domains like math and code.

vs. Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

gemini-3.1 · 5/16/2026

Paper 2 offers higher scientific impact because it addresses a critical bottleneck in AI evaluation: data contamination. By providing a living benchmark of open research conjectures formalized in Lean 4, it guarantees zero contamination while directly bridging AI and pure mathematics. The benchmark has already facilitated new mathematical discoveries, demonstrating immediate real-world utility. While Paper 1 presents a strong methodological improvement for training LLM reasoning, Paper 2 provides essential infrastructure for the next frontier of automated reasoning, fostering cross-disciplinary collaboration and verifiable scientific discovery.

vs. RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

gemini-3.1 · 5/16/2026

Paper 1 offers higher potential scientific impact due to its broad methodological applicability across the entire field of LLM reasoning. By addressing the critical 'right for the wrong reasons' flaw in outcome-based reinforcement learning, TraceLift introduces a fundamental improvement to how multi-step reasoning models are trained. While Paper 2 provides an excellent and rigorous benchmark for a specific, high-stakes domain (ICU healthcare), Paper 1's executor-grounded reward mechanism tackles a core bottleneck in general AI capabilities, promising widespread algorithmic adoption across numerous downstream fields and applications, including coding, math, and general agentic workflows.

vs. WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models

claude-opus-4.6 · 5/16/2026

Paper 1 (TraceLift) addresses a fundamental and broadly applicable problem in LLM reasoning—ensuring reasoning traces are not just correct but genuinely useful to downstream consumers. This has wide implications across all multi-step LLM systems (code, math, agents). The executor-grounded reward framework and rubric-annotated dataset represent novel, rigorous contributions with potential to reshape how reasoning is supervised in RL-trained LLMs. Paper 2 (WorldMAP) makes a solid contribution to embodied navigation but targets a narrower domain. TraceLift's broader applicability to the rapidly growing LLM reasoning field gives it higher impact potential.

vs. Reasoning Compression with Mixed-Policy Distillation

claude-opus-4.6 · 5/16/2026

TraceLift introduces a fundamentally new paradigm for training reasoning models by evaluating whether reasoning traces are actually useful to downstream consumers, not just correct. This addresses a deeper and more broadly applicable problem—faithful and functional reasoning—with a novel executor-grounded reward framework and a new dataset construction methodology. Paper 2 addresses the important but more incremental problem of reasoning compression via distillation, a well-explored direction. TraceLift's insight that reasoning should be evaluated by its utility to consuming models has broader implications for multi-agent systems, tool use, and compositional AI pipelines.

vs. Attributing Emergence in Million-Agent Systems

gemini-3.1 · 5/16/2026

Paper 2 addresses a fundamental scalability bottleneck in multi-agent systems, extending attribution from thousands to millions of agents. Its theoretical proof of Attribution Scaling Bias invalidates common small-scale proxy methods, promising broad, paradigm-shifting impact across AI, complex systems, and computational social science. Paper 1 offers a valuable but more narrow algorithmic improvement for LLM reasoning traces.

vs. Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

claude-opus-4.6 · 5/16/2026

Paper 1 addresses a fundamental and broadly applicable problem in RL-based reasoning for LLMs: ensuring reasoning traces are not just correct but actually useful to downstream consumers. The TraceLift framework introduces a novel executor-grounded reward mechanism with broad implications for multi-step AI systems, code generation, and math reasoning. Paper 2 tackles a more niche problem (visual semantic arithmetic) with narrower applicability. Paper 1's methodology—separating reasoning quality from outcome correctness—represents a more generalizable insight that could influence how the field trains reasoning models at scale.

vs. Visual Perceptual to Conceptual First-Order Rule Learning Networks

gemini-3 · 5/7/2026

Paper 1 addresses a critical and highly timely challenge in large language models: ensuring the faithfulness and utility of intermediate reasoning steps rather than just final-answer correctness. Given the massive current focus on reasoning models (e.g., test-time compute, RL for reasoning), this executor-grounded reward approach has immense immediate applicability and relevance. Paper 2's contribution to neuro-symbolic AI is strong, but Paper 1's alignment with the forefront of LLM development gives it a higher potential for broad, immediate scientific impact.

vs. Actionable Real-Time Modeling of Surgical Team Dynamics via Time-Expanded Interaction Graphs

claude-opus-4.6 · 5/7/2026

Paper 1 addresses a fundamental and broadly impactful problem in LLM reasoning—ensuring that reinforcement learning rewards faithful, useful reasoning traces rather than shortcuts. It introduces a novel framework (TraceLift) with a concrete training methodology, dataset, and open-source code applicable across math and code domains. The approach has wide relevance given the massive interest in LLM reasoning. Paper 2 presents a valuable but more niche contribution in surgical AI, with a narrower target community and application scope. Paper 1's methodological innovation in reward design for reasoning has broader potential to influence the rapidly growing field of LLM training.

vs. Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing

claude-opus-4.6 · 5/7/2026

Paper 2 (TraceLift) addresses a fundamental and broadly applicable problem in LLM reasoning: ensuring that reasoning traces are not just superficially correct but genuinely useful to downstream consumers. This has wider impact across math, code, and any multi-step reasoning system. The executor-grounded reward concept is novel and methodologically rigorous, with clear generalizability. Paper 1 (EBM-RL) is creative but targets a narrower domain (video role-playing dialogue), limiting its breadth of impact despite showing interesting zero-shot generalization results.

vs. SensingAgents: A Multi-Agent Collaborative Framework for Robust IMU Activity Recognition

gpt-5.2 · 5/7/2026

Paper 2 has higher potential impact due to a more generally applicable and timely contribution: improving faithfulness and utility of LLM reasoning via executor-grounded rewards and a curated perturbation-based dataset. The approach targets a central, cross-domain problem in modern AI (reliable reasoning traces for multi-step systems) with clear methodological rigor (planner–executor setup, reward shaping tied to measured executor uplift, benchmark evaluations, code release). Paper 1 is novel in applying multi-agent LLMs to IMU HAR, but its scope is narrower (a specific sensing task/dataset) and may face deployment constraints (LLM cost/latency) limiting breadth.

vs. SensingAgents: A Multi-Agent Collaborative Framework for Robust IMU Activity Recognition

gpt-5.2 · 5/7/2026

Paper 2 introduces a broadly applicable training paradigm (executor-grounded rewards) that addresses a timely, central issue in LLM reasoning: faithfulness and utility of intermediate traces beyond final-answer correctness. It contributes a new framework (TraceLift), a principled reward design tied to downstream consumption, and a purpose-built dataset with controlled trace perturbations, supporting methodological rigor and reproducibility (code released). Its potential impact spans RLHF/RLAIF, tool-using agents, program synthesis, and multi-step AI systems. Paper 1 is interesting but narrower (IMU HAR) and relies on LLM-agent scaffolding with less generalizable methodological contribution.

vs. How Does Thinking Mode Change LLM Moral Judgments? A Controlled Instant-vs-Thinking Comparison Across Five Frontier Models

gemini-3 · 5/7/2026

Paper 1 introduces a novel training framework (TraceLift) and dataset to address a fundamental flaw in current outcome-only RLVR methods for LLM reasoning (reward hacking and unfaithful traces). Methodological contributions that improve model training generally yield higher and broader scientific impact than observational evaluation studies, such as Paper 2, which focuses on evaluating existing models on the niche topic of moral judgments.

vs. Context-Aware Hospitalization Forecasting Evaluations for Decision Support using LLMs

gemini-3 · 5/7/2026

Paper 1 addresses a fundamental and highly active problem in AI—improving the faithfulness and utility of LLM reasoning traces beyond mere final-answer correctness. Its novel planner-executor framework and executor-grounded reward system have broad implications for AI alignment, multi-step reasoning, and reinforcement learning. While Paper 2 offers valuable applied insights for healthcare forecasting, Paper 1 introduces core methodological advancements likely to influence a wider range of foundational AI research and applications.

vs. On-line Learning in Tree MDPs by Treating Policies as Bandit Arms

gemini-3 · 5/7/2026

Paper 2 addresses a highly timely and critical challenge in large language models—improving the faithfulness and utility of reasoning traces via reinforcement learning. Given the massive current interest in LLM reasoning capabilities and multi-agent frameworks, this executor-grounded approach has broad applicability and high potential for widespread impact. Paper 1 offers solid theoretical contributions to reinforcement learning in Tree MDPs, but its scope and potential audience are much narrower compared to the ubiquitous applications of LLM reasoning improvements.

vs. On-line Learning in Tree MDPs by Treating Policies as Bandit Arms

gemini-3 · 5/7/2026

Paper 1 addresses a critical bottleneck in modern Large Language Models (LLMs)—process supervision and faithful reasoning traces—which is highly relevant to the rapidly expanding fields of generative AI and reinforcement learning from human feedback (RLHF). Its approach to executor-grounded rewards directly impacts real-world LLM capabilities in math and coding. While Paper 2 offers a strong theoretical contribution to Tree MDPs and bandit algorithms, Paper 1 has significantly higher potential for broad, immediate scientific and industrial impact due to the massive current focus on improving LLM reasoning.