Jiangnan Xia, Yucheng Shi, Yu Yang, Kishan Panaganti, Zhenwen Liang, Ninghao Liu
Reinforcement learning has become a key paradigm for eliciting reasoning abilities in large language models, where exploration is crucial for discovering effective solution trajectories. Existing exploration methods typically encourage diversity in semantic or gradient spaces, without distinguishing what drives this diversity. A trajectory may appear novel because it follows a new reasoning process, or because it varies memorized patterns and shortcuts. Rewarding both cases equally may steer exploration toward memorization rather than genuine reasoning improvement. In this paper, we propose DiRL, a Direction-Aware Reinforcement Learning framework that anchors exploration to an internal reasoning-memorization direction of the policy. Specifically, DiRL extracts this direction from model representations, constructs direction-weighted gradient features to characterize rollout updates, and shapes rewards to amplify reasoning-aligned exploration while suppressing memorization-aligned variations. DiRL integrates seamlessly into standard Group Relative Policy Optimization (GRPO). Extensive experiments on mathematical and general reasoning benchmarks demonstrate the effectiveness of DiRL, showing significant improvements over various existing exploration methods.
DiRL introduces a principled mechanism for distinguishing reasoning-driven exploration from memorization-driven exploration during reinforcement learning of LLMs. The key insight is that not all diversity is equally valuable: a trajectory can appear novel simply by varying memorized patterns rather than genuinely exploring new reasoning paths. DiRL operationalizes this distinction by (1) extracting a reasoning-memorization direction from the model's residual stream using contrastive activation analysis, (2) constructing direction-weighted gradient features that emphasize reasoning-relevant token updates, (3) partitioning rollouts into reasoning-aligned and memorization-aligned subgroups, and (4) shaping rewards to amplify the former while suppressing the latter. The framework plugs directly into GRPO without modifying the core optimization loop.
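The four steps above can be illustrated with a minimal sketch. It is not the paper's implementation: the difference-of-means estimator, the sign-based partition, the bonus coefficient `beta`, and all function names are assumptions chosen for illustration, and the per-rollout features are taken as given rather than computed from gradients.

```python
import numpy as np

def extract_direction(h_pos, h_neg):
    """Unit difference-of-means vector between two activation sets.

    h_pos / h_neg: (n, d) arrays of residual-stream activations from the
    contrastive sets (assumed estimator, not confirmed by the source).
    """
    d = h_pos.mean(axis=0) - h_neg.mean(axis=0)
    return d / np.linalg.norm(d)

def partition_rollouts(features, direction):
    """Project per-rollout features onto the direction and split by sign."""
    scores = features @ direction
    return scores, scores > 0.0  # True = reasoning-aligned (hypothetical rule)

def shape_rewards(rewards, scores, is_reasoning, beta=0.1):
    """Add a bonus to reasoning-aligned rollouts and a penalty to the rest."""
    magnitude = np.abs(scores) / (np.abs(scores).max() + 1e-8)
    return rewards + beta * np.where(is_reasoning, magnitude, -magnitude)

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

In this sketch the shaped rewards feed the usual group-normalized advantage, which is one way a shaping term can plug into GRPO without touching the optimization loop itself.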
The conceptual contribution—that exploration quality matters more than exploration quantity—is valuable and somewhat underexplored. While the idea that LLMs separate reasoning and memorization along linear directions in representation space comes from prior work (Hong et al., 2025), the novelty lies in incorporating this geometric distinction into the RL training loop itself, rather than using it solely as a diagnostic tool.
The approach is technically sound with several well-motivated design choices. The gradient factorization through final-layer features (Appendix A) provides mathematical justification for using Φ vectors as proxies for parameter update directions. The direction-weighted aggregation (Eq. 7) is a natural way to focus on reasoning-relevant updates, and the subgroup partitioning with asymmetric reference sets (Section 3.4) is a clever mechanism to ensure memorization responses are penalized relative to reasoning baselines.
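To make the factorization argument concrete, here is a short derivation under an assumed setup: a linear output head with logits z_t = W h_t and softmax probabilities p_t. The paper's Appendix A and Eq. 7 are not reproduced here, so the weights w_t and the aggregate Φ in the last line are hypothetical notation for illustration only.

```latex
% Assumed setup: logits z_t = W h_t, p_t = softmax(z_t), sampled token y_t,
% and e_{y_t} the one-hot vector for y_t.
\begin{align}
\frac{\partial \log \pi(y_t)}{\partial z_t} &= e_{y_t} - p_t, \\
\frac{\partial \log \pi(y_t)}{\partial W}   &= (e_{y_t} - p_t)\, h_t^{\top}, \\
\frac{\partial \log \pi(y_t)}{\partial h_t} &= W^{\top} (e_{y_t} - p_t).
\end{align}
% Hypothetical direction-weighted aggregate (illustrative, not the paper's Eq. 7):
\begin{equation}
\Phi = \sum_{t=1}^{T} w_t \, W^{\top} (e_{y_t} - p_t),
\qquad w_t \ge 0, \quad \sum_{t=1}^{T} w_t = 1 .
\end{equation}
```

Because the head gradient is a rank-one outer product of the residual (e_{y_t} - p_t) and the feature h_t, a feature-space vector can summarize each token's contribution to the update; the weights w_t would then be larger for tokens judged more reasoning-relevant.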
However, several aspects warrant scrutiny:
Immediate applications: DiRL directly benefits anyone training LLMs for reasoning via RL. The computational overhead is modest (13-18% per step), making it practical for adoption. The framework's compatibility with GRPO is valuable given GRPO's widespread use.
Broader implications: The paper advances an important conceptual shift in how we think about exploration in LLM RL—from "more diversity is better" to "the right kind of diversity matters." This principle could influence future exploration strategies beyond the specific implementation proposed.
Limitations on impact: The reliance on a pre-computed linear direction may limit applicability to domains where the reasoning-memorization distinction is less clear-cut. The method also requires curating contrastive datasets (D+, D−), which introduces domain-specific engineering.
This paper is highly timely. RL for LLM reasoning (DeepSeek-R1, GRPO-based training) is a dominant paradigm in 2025-2026, and exploration remains a recognized bottleneck. The paper directly builds on very recent work (G2RL, EVOL-RL) and addresses a limitation that practitioners have intuitively recognized but not formally addressed. The connection between mechanistic interpretability and training-time optimization is a growing frontier that this work meaningfully advances.
The paper's framing around "reasoning vs. memorization" is compelling but somewhat imprecise. The distinction is operationalized through external labels and linear probes, which may capture a proxy rather than the true phenomenon. Nevertheless, the empirical results suggest this proxy is useful enough to improve training outcomes meaningfully. The consistent gains across pass@1, maj@16, and pass@16 suggest the method genuinely improves the policy rather than just shifting probability mass.
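For readers unfamiliar with the metrics named above, the sketch below shows how pass@k and maj@k are commonly computed. The source does not state the paper's exact evaluation protocol, so this uses the widely used unbiased pass@k estimator and a plain majority vote as assumptions; the function names are illustrative.

```python
from collections import Counter
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct.

    Equals the probability that a random size-k subset of the n samples
    contains at least one correct sample.
    """
    if n - c < k:
        return 1.0  # every size-k subset must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def maj_at_k(answers, reference):
    """True if the most frequent answer among the k samples is correct."""
    top_answer, _ = Counter(answers).most_common(1)[0]
    return top_answer == reference
```

A method that only concentrates probability mass on answers the model already finds can raise pass@1 while leaving pass@16 flat, which is why gains on all three metrics together are read as a real policy improvement.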
Generated Jun 10, 2026
Paper 1 tackles a fundamental challenge in LLM training—distinguishing genuine reasoning from memorization during reinforcement learning. By steering exploration along an internal reasoning direction, it offers a novel approach with broad applicability to foundation model training. While Paper 2 presents a strong, specialized framework for spatial reasoning in MLLMs, Paper 1's focus on core reasoning mechanisms is likely to yield a wider impact across numerous NLP domains and general model development.
Paper 2 (INFRAMIND) likely has higher scientific impact due to stronger real-world applicability and timeliness: it directly addresses deployment-critical latency/SLO issues in shared GPU clusters for multi-agent LLM systems, offering a broadly useful infrastructure-aware control framework (planning, routing, scheduling) cast as a hierarchical constrained MDP. Its claimed gains span both quality and systems performance under varying load, with potential impact across ML systems, RL, and agentic AI. Paper 1 is novel for RLHF-style reasoning exploration, but is narrower and harder to translate into immediate production benefits.
Paper 1 addresses a critical, high-stakes problem (AI synthesis of health/scientific conclusions) and introduces a rigorous benchmark (SciConBench) along with a clean-room evaluation harness to mitigate data leakage. Its audit of consumer-facing agents provides immediate real-world relevance. While Paper 2 offers a valuable methodological improvement in LLM reinforcement learning, Paper 1's focus on AI reliability in consequential domains gives it broader interdisciplinary and societal impact.
Paper 1 targets an urgent, under-addressed safety gap: emergent misalignment in multi-agent LLM systems, proposing a continual, budget-aware auditor that actively inspects conversations. This is both novel and timely as agentic systems proliferate, with clear real-world applicability (deployment monitoring, governance, incident detection) and potential cross-field impact (AI safety, HCI, security, multi-agent systems). The evaluation spans multiple adversarial conditions and tool configurations, suggesting solid rigor. Paper 2 is a meaningful RL exploration refinement, but is narrower in scope and likely incremental relative to the broader, high-stakes monitoring framework in Paper 1.
Paper 1 proposes a novel algorithmic framework (DiRL) to address a fundamental challenge in LLM reinforcement learning: distinguishing genuine reasoning from memorization during exploration. This addresses a critical bottleneck in advancing LLM reasoning capabilities, offering broad applicability across various models and tasks. In contrast, Paper 2 is an application-focused study that applies existing techniques (LoRA, NEFTune) to a specific domain (Financial NER). Therefore, Paper 1 has significantly higher methodological innovation and potential for widespread impact across the broader AI research community.
Paper 1 tackles a critical bottleneck in embodied AI: the lack of scalable, interactive 3D training environments. By introducing an open-source platform that co-evolves generated environments with agent capabilities, it provides a foundational tool that could become a standard for robotics and RL research. While Paper 2 offers a valuable algorithmic optimization for LLM reasoning, Paper 1's creation of a deployable generative simulation ecosystem has a higher potential to trigger widespread methodological shifts and practical applications across physical AI fields.
DiRL addresses a fundamental challenge in LLM reinforcement learning—distinguishing reasoning from memorization during exploration—which is broadly applicable across all LLM training. Its novel direction-aware framework offers methodological innovation with wide impact across the rapidly growing RL-for-LLM-reasoning field. Paper 2, while valuable for biomedical automation, targets a narrower domain-specific problem (biomedical agent tool orchestration via MCP graphs). Paper 1's contributions to understanding and improving how LLMs learn to reason have broader implications for the entire AI community.
Paper 1 is likely to have higher scientific impact due to broader relevance and transferability: it targets a general problem in LLM RL (steering exploration toward reasoning vs memorization) applicable across many tasks, models, and RLHF-style pipelines. The proposed direction-aware mechanism is a methodological contribution that can be adopted widely and evaluated on standard reasoning benchmarks. Paper 2 is innovative and application-driven, but its impact is more domain-specific (supply chains) and depends heavily on the fidelity and adoption of its bespoke simulator/benchmark, potentially narrowing breadth and reproducibility.
Paper 2 has higher potential impact due to strong real-world applicability (personalized cancer treatment) and timeliness in clinical decision support under constraints. It frames treatment as belief-space control with active inference, addressing partial observability, patient heterogeneity, and measurement budgets, and evaluates on real clinical data (AACR GENIE), increasing methodological and translational credibility. Its ideas can generalize to other healthcare and constrained POMDP domains. Paper 1 is innovative for RL exploration in LLMs, but its impact is more specialized to model training and hinges on internal representation heuristics and benchmark improvements.
Paper 1 (DiRL) addresses a fundamental and timely problem in LLM reasoning—distinguishing genuine reasoning from memorization during RL-based training. This touches a core challenge in the rapidly growing field of LLM reasoning enhancement. The conceptual insight of decomposing exploration into reasoning vs. memorization directions is novel and broadly applicable. Paper 2 proposes an efficient memory compression technique for QA, which is useful but more incremental—compressing evidence into latent tokens is a natural extension of existing retrieval-augmented generation work. Paper 1's potential to reshape how RL training for LLMs is conducted gives it broader impact.