Akshay Krishnamurthy, Audrey Huang, Nived Rajaraman
Reinforcement learning has rapidly emerged as a key component in the training of reasoning and coding models, yet it remains poorly understood from a mechanistic perspective. We study how and through what underlying processes capabilities are acquired or enhanced via reinforcement learning post-training. Our analysis, based on controlled math reasoning experiments with Qwen-2.5-1.5B, reveals two core mechanisms: strategy selection and strategy improvement. Our results highlight the role of SFT data and reinforcement learning data in activating these mechanisms, in particular showing how supervising the model on diverse reasoning strategies can enable strategy selection and how increasing difficulty in reinforcement learning data can enable strategy improvement. Taken together, our results provide mechanistic insight into RL training and suggest practical interventions to continue scaling reasoning capabilities.
This paper proposes a mechanistic taxonomy for how RL post-training improves reasoning in language models, identifying two core mechanisms: strategy selection (routing problems to appropriate pre-existing reasoning strategies) and strategy improvement (enhancing existing reasoning patterns on harder problems). The authors further argue that two previously observed phenomena—strategy amplification and strategy composition—are not independent mechanisms but rather observable consequences of selection and improvement, respectively.
The key experimental insight is that a model trained via SFT on *both* forward and backward reasoning strategies (the "FB" model) achieves ~95% accuracy after RL, dramatically outperforming models trained on a single strategy (~50-60%), despite all models starting from comparable ~80% SFT accuracy. This cleanly demonstrates that RL's primary role is selecting among pre-existing strategies rather than generating novel ones.
The experimental design is thoughtfully constructed. The finite-field arithmetic task provides excellent experimental control: it uses abstract symbols to break away from pre-trained math knowledge, has tunable difficulty (number of arithmetic steps), natural strategy dichotomy (forward vs. backward reasoning), and two problem types (evaluation vs. inversion) with clear strategy-problem alignment.
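To make the task structure concrete, here is a minimal sketch of what such a finite-field problem family could look like. It is an illustration under stated assumptions, not the authors' actual generator: the modulus `P`, the step format, and the function names `make_chain`, `evaluate`, and `invert` are all hypothetical, and the abstract-symbol encoding is omitted. It shows only how the number of steps can act as a difficulty knob, and how evaluation lines up with forward reasoning while inversion lines up with backward reasoning.

```python
import random

P = 7  # hypothetical prime modulus; the field size used in the paper is not stated here


def make_chain(num_steps, rng):
    """Sample a chain of (op, constant) steps; num_steps is the difficulty knob."""
    chain = []
    for _ in range(num_steps):
        op = rng.choice(["add", "mul"])
        # Multiplication constants must be nonzero so every step stays invertible.
        c = rng.randrange(1, P) if op == "mul" else rng.randrange(P)
        chain.append((op, c))
    return chain


def evaluate(chain, x):
    """Evaluation problem, forward strategy: apply each step in order from the input."""
    for op, c in chain:
        x = (x + c) % P if op == "add" else (x * c) % P
    return x


def invert(chain, y):
    """Inversion problem, backward strategy: undo each step in reverse from the output."""
    for op, c in reversed(chain):
        if op == "add":
            y = (y - c) % P
        else:
            y = (y * pow(c, -1, P)) % P  # modular inverse (Python 3.8+)
    return y
```

In this sketch, a longer chain gives a harder problem of the same type, and running `invert` on the output of `evaluate` recovers the original input for any chain.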
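The paper's practical implications are clear and actionable: supervising the model on diverse reasoning strategies during SFT can enable strategy selection, and increasing the difficulty of the RL data can enable strategy improvement.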
These findings connect to the broader debate about whether RL creates new capabilities or merely refines existing ones, providing controlled evidence for the latter view. This has implications for how the community allocates resources between pre-training data curation and RL pipeline engineering.
This paper is highly timely. The rapid deployment of RL post-training (DeepSeek-R1, OpenAI o-series) has created an urgent need for mechanistic understanding. The field is currently in a phase where many groups are applying RL recipes without understanding why they work, leading to inefficient iteration. Papers like this help build the theoretical foundation that can guide principled improvements.
The paper directly addresses active debates in the community: whether RL creates "aha moments" or merely amplifies pre-existing capabilities, whether R1-Zero-style training truly learns reasoning from scratch, and what role data plays in RL success. The strategy selection finding provides a clean explanation for why models with richer pre-training tend to benefit more from RL.
The paper is well-written with clear figures and a logical structure. The distinction between mechanisms (selection, improvement) and phenomena (amplification, composition) is a useful conceptual framework. However, the paper would benefit from discussion of when these mechanisms might break down or interact in more complex ways, and from experiments on at least one natural language reasoning task to bridge the synthetic-to-real gap.
The work is best viewed as providing useful hypotheses and a clean experimental framework rather than definitive answers about RL post-training at scale.
Generated Jun 12, 2026
Paper 2 addresses the highly timely challenge of understanding reinforcement learning post-training for large language models, a rapidly expanding frontier in AI. By revealing the underlying mechanics of strategy selection and improvement, it offers broad, immediate applicability for scaling reasoning capabilities across foundational models. While Paper 1 tackles important ethical and privacy issues in synthetic data generation, Paper 2 has a wider potential scientific impact across the machine learning community due to the current intense industry and academic focus on advancing LLM reasoning.
Paper 1 introduces a highly novel, tangible framework (SWITCH) that solves critical optimization and interpretability bottlenecks in latent chain-of-thought reasoning using on-policy RL. While Paper 2 provides valuable empirical insights into RL post-training, Paper 1 proposes a fundamental methodological innovation (Switch-GRPO) that enables new capabilities in hidden-state recurrence, which is currently at the frontier of LLM reasoning research. Its combination of a new architectural approach with mechanistic interpretability gives it higher potential for widespread adoption and subsequent follow-up research.
Paper 1 addresses a highly timely and broadly impactful question: understanding the mechanics of RL post-training for reasoning in LLMs. Given the explosive interest in reasoning models (e.g., OpenAI o1, DeepSeek-R1), its mechanistic insights into how RL training works, which identify strategy selection and strategy improvement as core mechanisms, provide both theoretical understanding and practical guidance for scaling. This has broad implications across the entire LLM community. Paper 2 addresses a more niche problem (uncertainty in molecular diffusion models) with solid but more incremental contributions and a narrower audience.
Paper 1 addresses a fundamental and timely question about the mechanisms underlying reinforcement learning post-training for reasoning models, which is at the forefront of AI research. Its findings on strategy selection and improvement provide broadly applicable mechanistic insights and practical guidance for scaling reasoning capabilities. Paper 2 addresses a niche hardware-specific optimization for memristor-based analog computation in speech recognition, which, while technically useful, has a narrower scope of impact and audience. The breadth, timeliness, and relevance of Paper 1 to the rapidly growing field of LLM reasoning give it substantially higher potential impact.
Paper 2 addresses the highly timely and impactful area of reinforcement learning for LLM reasoning. By providing mechanistic insights and practical interventions for scaling reasoning capabilities, it has immediate, broad applicability in current AI research. While Paper 1 offers a novel theoretical framework for transfer learning, Paper 2's direct relevance to the rapid development of advanced reasoning models gives it higher potential for widespread scientific and real-world impact.
Paper 1 addresses a highly timely and broadly impactful question: understanding the mechanics of RL post-training for reasoning models, which is central to current LLM development. Its mechanistic insights (strategy selection and improvement) offer practical guidance for scaling reasoning capabilities, relevant to a massive and growing research community. Paper 2 makes solid theoretical contributions on truncated positional encodings for GNNs, but addresses a more niche topic with a narrower audience. The timeliness and breadth of impact of Paper 1, given the current explosion in reasoning model research, give it higher estimated scientific impact.
Paper 1 likely has higher impact due to its timeliness and broad relevance to current LLM post-training with RL, a central bottleneck for reasoning/coding systems. Its mechanistic framing (strategy selection vs. strategy improvement) offers a potentially general conceptual model that can guide data design and scaling interventions across many tasks and labs, with immediate practical implications. Paper 2 presents a solid, incremental architectural advance for multimodal VAEs (exact Hölder pooling + shared/private hierarchy) with clearer but narrower applicability, in a subarea that currently has less field-wide momentum than RL post-training for LLMs.
Paper 2 addresses the highly timely and rapidly growing field of reasoning LLMs and RL post-training, which is currently one of the most active areas in AI research. Its mechanistic insights into strategy selection and strategy improvement provide foundational understanding that could influence how the entire field approaches training reasoning models. Paper 1, while technically solid and novel in combining spectral methods with diffusion models for PDEs, addresses a more niche intersection of physics-informed ML and generative modeling with narrower immediate impact. Paper 2's practical implications for scaling reasoning capabilities give it broader influence.
Paper 1 addresses a highly timely and impactful question—understanding the mechanics of reinforcement learning post-training for reasoning models, which is central to current LLM development. It provides actionable insights (strategy selection vs. improvement, role of data diversity and difficulty) with direct practical implications for scaling reasoning capabilities. Paper 2 offers interesting findings about module-specific optimization geometry but addresses a more niche topic (weight-space manifold constraints) with narrower applicability, limited to specific optimizer variants and small-scale GPT-2 experiments. Paper 1's broader relevance to the rapidly growing RL-for-reasoning field gives it higher potential impact.
Paper 2 addresses a highly critical and timely topic: the mechanics of reinforcement learning post-training for LLM reasoning capabilities. Given the current focus on scaling reasoning in foundation models, its insights into strategy selection and improvement offer profound implications for advancing AI capabilities globally. While Paper 1 provides a strong, rigorous method for constrained generative modeling in physical systems, Paper 2's potential to influence the broader, rapidly evolving field of LLM training gives it a significantly higher overall scientific impact.