Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
Tianyang Han, Hengyu Shi, Junjie Hu, Xu Yang, Zhiling Wang, Junhao Su
Abstract
Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it. Our code is available at: https://github.com/MasaiahHan/TraceLift
AI Impact Assessments
Scientific Impact Assessment: "Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards"
1. Core Contribution
The paper addresses a genuine credit-assignment problem in reinforcement learning for LLM reasoning: outcome-only rewards (e.g., final-answer correctness) can reinforce reasoning traces that are "right for the wrong reasons." The proposed solution, TraceLift, introduces a planner-executor decomposition where a trainable planner produces reasoning traces consumed by a frozen executor. The reward combines: (a) verifier feedback on executor outputs, (b) a rubric-based Reasoning Reward Model (RM) score, and (c) measured executor uplift—whether the trace actually improves executor performance over a no-reasoning baseline. The multiplicative coupling of RM quality and executor uplift is the key design choice: traces receive credit only when they are both high-quality *and* useful to the downstream executor.
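The reward structure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the field names, the clipping of uplift at zero, the additive combination with verifier feedback, and the `weight` parameter are all assumptions made for the example. Only the product of RM score and executor uplift comes from the description of the method.

```python
from dataclasses import dataclass


@dataclass
class TraceOutcome:
    """Measurements for one planner trace (all field names are hypothetical)."""
    rm_score: float         # rubric-based reasoning RM score, assumed to lie in [0, 1]
    exec_with_trace: float  # verifier score of the frozen executor given the trace
    exec_baseline: float    # verifier score of the same executor with no reasoning


def executor_uplift(outcome: TraceOutcome) -> float:
    """Improvement over the no-reasoning baseline, clipped at zero (an assumption)."""
    return max(0.0, outcome.exec_with_trace - outcome.exec_baseline)


def grounded_reward(outcome: TraceOutcome, weight: float = 1.0) -> float:
    """Verifier feedback plus a multiplicative RM-times-uplift shaping term.

    The additive form and `weight` are illustrative choices, not taken from
    the paper; the product rm_score * uplift is the coupling described above.
    """
    shaped = outcome.rm_score * executor_uplift(outcome)
    return outcome.exec_with_trace + weight * shaped


# A polished trace that does not help the executor earns no shaping credit.
polished_but_useless = TraceOutcome(rm_score=0.9, exec_with_trace=0.0, exec_baseline=0.0)
# A trace that is both well-rated and helpful earns the full product term.
useful_and_sound = TraceOutcome(rm_score=0.8, exec_with_trace=1.0, exec_baseline=0.0)
```

Under these assumptions, the product term vanishes whenever either factor is zero, which is the behavior the multiplicative coupling is meant to enforce.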
A secondary contribution is TraceLift-Groups, a dataset of 6,000 reasoning groups (math + code) where each group contains a reference trace and multiple targeted perturbations with rubric annotations. This enables training a reasoning-quality RM without reliance on final-answer labels.
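One way to picture a same-problem group is as a small record holding a reference trace and its perturbed variants. The sketch below is hypothetical: the class names, field names, rubric dimension, and the idea of deriving (preferred, rejected) pairs for RM training are assumptions for illustration, not the released dataset schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Trace:
    text: str
    rubric: Dict[str, float]  # rubric dimension -> score; dimension names are made up
    perturbation: str = ""    # empty for the reference trace


@dataclass
class ReasoningGroup:
    problem: str
    reference: Trace
    flawed: List[Trace] = field(default_factory=list)


def preference_pairs(group: ReasoningGroup) -> List[Tuple[str, str]]:
    """Return (preferred, rejected) trace texts: the reference over each flawed trace."""
    return [(group.reference.text, bad.text) for bad in group.flawed]


# Toy group with one reference and two locally perturbed traces.
group = ReasoningGroup(
    problem="Sum the even numbers in a list.",
    reference=Trace("Filter evens, then sum them.", {"correctness": 1.0}),
    flawed=[
        Trace("Filter odds, then sum them.", {"correctness": 0.0}, "wrong predicate"),
        Trace("Filter evens, then count them.", {"correctness": 0.2}, "wrong aggregate"),
    ],
)
pairs = preference_pairs(group)
```

Because every pair compares traces for the same problem, no final-answer label is needed to say which trace should score higher.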
2. Methodological Rigor
Strengths in experimental design:
Weaknesses and concerns:
3. Potential Impact
The core insight—that reasoning traces should be evaluated by their downstream utility to a consumer model, not just intrinsic quality—is valuable and generalizable. This framing could influence:
However, the specific implementation choices (GRPO, the particular rubric dimensions, the perturbation types) are somewhat narrowly tailored, limiting immediate broader applicability.
4. Timeliness & Relevance
This work is highly timely. The RL-with-verifiable-rewards paradigm (exemplified by DeepSeek-R1) has become dominant for training reasoning models, and the paper correctly identifies that outcome-only rewards create a gap between reward signal and reasoning quality. The planner-executor framing also aligns with the growing use of agentic LLM pipelines. The problem of "right for wrong reasons" is well-recognized but under-addressed with principled solutions.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
6. Additional Observations
The paper is long (36 pages), with extensive appendices that include full prompts, theoretical proofs, and qualitative case studies. This is thorough, but the main contribution could be communicated more concisely. The qualitative case studies in Section 9 are informative: they show specific mechanisms by which TraceLift traces differ (e.g., regex anchoring, type guards, entity tracking).
The theoretical contribution, while technically sound, primarily serves to formalize intuitions rather than provide non-obvious insights. The most useful theoretical result is arguably Corollary 4 (saturated judges reduce to execution-only ranking), which directly explains an empirical finding.
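The intuition behind the saturated-judge result can be checked numerically. The sketch below is based only on the parenthetical summary above, not on the paper's formal statement, and all scores are invented: if the RM assigns the same score to every trace, multiplying by uplift leaves the uplift ordering unchanged.

```python
from typing import List, Sequence


def rank_by(scores: Sequence[float]) -> List[int]:
    """Indices of traces sorted from highest to lowest score (stable on ties)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def product_reward(rm_scores: Sequence[float], uplifts: Sequence[float]) -> List[float]:
    """Multiplicative coupling of RM score and executor uplift, per trace."""
    return [r * u for r, u in zip(rm_scores, uplifts)]


uplifts = [0.1, 0.6, 0.3]            # made-up executor uplift for three traces
saturated_rm = [0.95, 0.95, 0.95]    # a judge that no longer discriminates
discriminating_rm = [0.9, 0.2, 0.8]  # a judge that still separates traces

execution_rank = rank_by(uplifts)
saturated_rank = rank_by(product_reward(saturated_rm, uplifts))
mixed_rank = rank_by(product_reward(discriminating_rm, uplifts))
```

Here `saturated_rank` matches `execution_rank`, while `mixed_rank` does not: the RM only changes which trace is preferred when its scores actually vary across traces.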
Generated May 7, 2026
Comparison History (20)
Paper 2 addresses a critical and highly timely challenge in LLM reasoning: reward hacking and unfaithful reasoning traces in RL training. By introducing executor-grounded rewards, it ensures reasoning is both high-quality and practically useful for the consuming model. This has broad, immediate applications across agentic AI, complex problem solving, and verifiable reasoning frameworks, potentially offering a wider real-world impact than the architectural improvements to linear attention presented in Paper 1.
Paper 1 has higher potential impact due to a more novel and broadly relevant framing: reasoning traces as consumable intermediate artifacts with executor-grounded rewards that explicitly target faithfulness/usefulness, not just outcome correctness. It contributes a concrete training framework (planner–executor with uplift-weighted rubric scoring) plus a new rubric-annotated dataset of contrastive flawed traces, which can be reused by the community. This addresses a central limitation in RLVR (rewarding “right for wrong reasons”) with implications for multi-step tool use and agentic systems. Paper 2 is timely and useful for efficiency, but is more incremental within exploration/rollout optimization.
Paper 2 addresses a critical bottleneck in Vision-Language Models by disentangling perceptual errors ('bad seeing') from logical errors ('bad thinking'). Its novel Modality-Aware Credit Assignment and 'blindfolded reasoning' proxy offer a fundamental solution to the perception-reasoning trade-off in VLMs. While Paper 1 presents a strong approach for LLM reasoning, Paper 2's methodology tackles a more complex multi-modal challenge, likely yielding broader impact across the rapidly growing field of spatial and visual reasoning.
Paper 2 (δ-mem) likely has higher impact due to broader applicability and timeliness: efficient long-term memory for LLM assistants/agents is a core bottleneck across many domains, and its lightweight, frozen-backbone online update mechanism is broadly deployable without retraining or context extension. The method is simple, scalable, and could influence architectures and systems work widely. Paper 1 addresses an important reliability issue in RL-for-reasoning, but is more specialized to planner–executor setups and trace-supervision paradigms, with narrower cross-field reach.
Paper 2 addresses a critical bottleneck in the current frontier of AI: training reliable reasoning models. By proposing a method to reward faithful and useful reasoning traces rather than just correct final answers, it directly tackles reward hacking in RL-based reasoning systems. While Paper 1 offers a valuable diagnostic benchmark for multimodal grounding, Paper 2's methodological innovation in planner-executor frameworks has broader implications for scaling test-time compute and improving explicit reasoning capabilities across diverse domains like math and code.
Paper 2 offers higher scientific impact because it addresses a critical bottleneck in AI evaluation: data contamination. By providing a living benchmark of open research conjectures formalized in Lean 4, it guarantees zero contamination while directly bridging AI and pure mathematics. The benchmark has already facilitated new mathematical discoveries, demonstrating immediate real-world utility. While Paper 1 presents a strong methodological improvement for training LLM reasoning, Paper 2 provides essential infrastructure for the next frontier of automated reasoning, fostering cross-disciplinary collaboration and verifiable scientific discovery.
Paper 1 offers higher potential scientific impact due to its broad methodological applicability across the entire field of LLM reasoning. By addressing the critical 'right for the wrong reasons' flaw in outcome-based reinforcement learning, TraceLift introduces a fundamental improvement to how multi-step reasoning models are trained. While Paper 2 provides an excellent and rigorous benchmark for a specific, high-stakes domain (ICU healthcare), Paper 1's executor-grounded reward mechanism tackles a core bottleneck in general AI capabilities, promising widespread algorithmic adoption across numerous downstream fields and applications, including coding, math, and general agentic workflows.
Paper 1 (TraceLift) addresses a fundamental and broadly applicable problem in LLM reasoning—ensuring reasoning traces are not just correct but genuinely useful to downstream consumers. This has wide implications across all multi-step LLM systems (code, math, agents). The executor-grounded reward framework and rubric-annotated dataset represent novel, rigorous contributions with potential to reshape how reasoning is supervised in RL-trained LLMs. Paper 2 (WorldMAP) makes a solid contribution to embodied navigation but targets a narrower domain. TraceLift's broader applicability to the rapidly growing LLM reasoning field gives it higher impact potential.
TraceLift introduces a fundamentally new paradigm for training reasoning models by evaluating whether reasoning traces are actually useful to downstream consumers, not just correct. This addresses a deeper and more broadly applicable problem—faithful and functional reasoning—with a novel executor-grounded reward framework and a new dataset construction methodology. Paper 2 addresses the important but more incremental problem of reasoning compression via distillation, a well-explored direction. TraceLift's insight that reasoning should be evaluated by its utility to consuming models has broader implications for multi-agent systems, tool use, and compositional AI pipelines.
Paper 2 addresses a fundamental scalability bottleneck in multi-agent systems, extending attribution from thousands to millions of agents. Its theoretical proof of Attribution Scaling Bias invalidates common small-scale proxy methods, promising broad, paradigm-shifting impact across AI, complex systems, and computational social science. Paper 1 offers a valuable but more narrow algorithmic improvement for LLM reasoning traces.
Paper 1 addresses a fundamental and broadly applicable problem in RL-based reasoning for LLMs: ensuring reasoning traces are not just correct but actually useful to downstream consumers. The TraceLift framework introduces a novel executor-grounded reward mechanism with broad implications for multi-step AI systems, code generation, and math reasoning. Paper 2 tackles a more niche problem (visual semantic arithmetic) with narrower applicability. Paper 1's methodology—separating reasoning quality from outcome correctness—represents a more generalizable insight that could influence how the field trains reasoning models at scale.
Paper 1 addresses a critical and highly timely challenge in large language models: ensuring the faithfulness and utility of intermediate reasoning steps rather than just final-answer correctness. Given the massive current focus on reasoning models (e.g., test-time compute, RL for reasoning), this executor-grounded reward approach has immense immediate applicability and relevance. Paper 2's contribution to neuro-symbolic AI is strong, but Paper 1's alignment with the forefront of LLM development gives it a higher potential for broad, immediate scientific impact.
Paper 1 addresses a fundamental and broadly impactful problem in LLM reasoning—ensuring that reinforcement learning rewards faithful, useful reasoning traces rather than shortcuts. It introduces a novel framework (TraceLift) with a concrete training methodology, dataset, and open-source code applicable across math and code domains. The approach has wide relevance given the massive interest in LLM reasoning. Paper 2 presents a valuable but more niche contribution in surgical AI, with a narrower target community and application scope. Paper 1's methodological innovation in reward design for reasoning has broader potential to influence the rapidly growing field of LLM training.
Paper 2 (TraceLift) addresses a fundamental and broadly applicable problem in LLM reasoning: ensuring that reasoning traces are not just superficially correct but genuinely useful to downstream consumers. This has wider impact across math, code, and any multi-step reasoning system. The executor-grounded reward concept is novel and methodologically rigorous, with clear generalizability. Paper 1 (EBM-RL) is creative but targets a narrower domain (video role-playing dialogue), limiting its breadth of impact despite showing interesting zero-shot generalization results.
Paper 2 has higher potential impact due to a more generally applicable and timely contribution: improving faithfulness and utility of LLM reasoning via executor-grounded rewards and a curated perturbation-based dataset. The approach targets a central, cross-domain problem in modern AI (reliable reasoning traces for multi-step systems) with clear methodological rigor (planner–executor setup, reward shaping tied to measured executor uplift, benchmark evaluations, code release). Paper 1 is novel in applying multi-agent LLMs to IMU HAR, but its scope is narrower (a specific sensing task/dataset) and may face deployment constraints (LLM cost/latency) limiting breadth.
Paper 2 introduces a broadly applicable training paradigm (executor-grounded rewards) that addresses a timely, central issue in LLM reasoning: faithfulness and utility of intermediate traces beyond final-answer correctness. It contributes a new framework (TraceLift), a principled reward design tied to downstream consumption, and a purpose-built dataset with controlled trace perturbations, supporting methodological rigor and reproducibility (code released). Its potential impact spans RLHF/RLAIF, tool-using agents, program synthesis, and multi-step AI systems. Paper 1 is interesting but narrower (IMU HAR) and relies on LLM-agent scaffolding with less generalizable methodological contribution.
Paper 1 introduces a novel training framework (TraceLift) and dataset to address a fundamental flaw in current outcome-only RL methods for LLM reasoning (reward hacking and unfaithful traces). Methodological contributions that improve model training generally yield higher and broader scientific impact than observational evaluation studies, such as Paper 2, which focuses on evaluating existing models on the niche topic of moral judgments.
Paper 1 addresses a fundamental and highly active problem in AI—improving the faithfulness and utility of LLM reasoning traces beyond mere final-answer correctness. Its novel planner-executor framework and executor-grounded reward system have broad implications for AI alignment, multi-step reasoning, and reinforcement learning. While Paper 2 offers valuable applied insights for healthcare forecasting, Paper 1 introduces core methodological advancements likely to influence a wider range of foundational AI research and applications.
Paper 2 addresses a highly timely and critical challenge in large language models—improving the faithfulness and utility of reasoning traces via reinforcement learning. Given the massive current interest in LLM reasoning capabilities and multi-agent frameworks, this executor-grounded approach has broad applicability and high potential for widespread impact. Paper 1 offers solid theoretical contributions to reinforcement learning in Tree MDPs, but its scope and potential audience are much narrower compared to the ubiquitous applications of LLM reasoning improvements.
Paper 1 addresses a critical bottleneck in modern large language models (LLMs), namely process supervision and faithful reasoning traces, which is highly relevant to the rapidly expanding fields of generative AI and reinforcement learning with verifiable rewards (RLVR). Its approach to executor-grounded rewards directly impacts real-world LLM capabilities in math and coding. While Paper 2 offers a strong theoretical contribution to Tree MDPs and bandit algorithms, Paper 1 has significantly higher potential for broad, immediate scientific and industrial impact due to the massive current focus on improving LLM reasoning.