Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

Akshay Krishnamurthy, Audrey Huang, Nived Rajaraman

Jun 11, 2026 · arXiv:2606.13125v1

cs.LG · cs.AI

#338 of 5669 · cs.LG

Tournament Score: 1519 ± 40

Win Rate: 66%

Rating: 6.5 / 10

Significance 6.5 · Rigor 7 · Novelty 6 · Clarity 8

Abstract

Reinforcement learning has rapidly emerged as a key component in the training of reasoning and coding models, yet it remains poorly understood from a mechanistic perspective. We study how and through what underlying processes capabilities are acquired or enhanced via reinforcement learning post-training. Our analysis, based on controlled math reasoning experiments with Qwen-2.5-1.5B, reveals two core mechanisms: strategy selection and strategy improvement. Our results highlight the role of SFT data and reinforcement learning data in activating these mechanisms, in particular showing how supervising the model on diverse reasoning strategies can enable strategy selection and how increasing difficulty in reinforcement learning data can enable strategy improvement. Taken together, our results provide mechanistic insight into RL training and suggest practical interventions to continue scaling reasoning capabilities.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Select and Improve: Understanding the Mechanics of Post-Training for Reasoning"

1. Core Contribution

This paper proposes a mechanistic taxonomy for how RL post-training improves reasoning in language models, identifying two core mechanisms: strategy selection (routing problems to appropriate pre-existing reasoning strategies) and strategy improvement (enhancing existing reasoning patterns on harder problems). The authors further argue that two previously observed phenomena—strategy amplification and strategy composition—are not independent mechanisms but rather observable consequences of selection and improvement, respectively.

The key experimental insight is that a model trained via SFT on *both* forward and backward reasoning strategies (the "FB" model) achieves ~95% accuracy after RL, dramatically outperforming models trained on a single strategy (~50-60%), despite all models starting from comparable ~80% SFT accuracy. This cleanly demonstrates that RL's primary role is selecting among pre-existing strategies rather than generating novel ones.

2. Methodological Rigor

The experimental design is thoughtfully constructed. The finite-field arithmetic task provides excellent experimental control: it uses abstract symbols to break away from pre-trained math knowledge, has tunable difficulty (the number of arithmetic steps), offers a natural strategy dichotomy (forward vs. backward reasoning), and includes two problem types (evaluation vs. inversion) with clear strategy-problem alignment.
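To make the task structure concrete, the following is a minimal sketch, assuming each problem is a chain of additions and multiplications modulo a prime. The paper's actual prompt format and abstract-symbol encoding are not described in this assessment, so plain integers stand in for symbols here, and `make_chain`, `evaluate_forward`, and `invert_backward` are hypothetical names chosen for illustration.

```python
import random

P = 11  # field size; the assessment mentions ablations over GF(11) and GF(13)

def make_chain(num_steps, rng):
    """Sample a chain of (op, constant) steps; num_steps acts as the difficulty knob."""
    chain = []
    for _ in range(num_steps):
        op = rng.choice(["add", "mul"])
        # Multiplication constants must be nonzero so that every step is invertible.
        c = rng.randrange(1, P) if op == "mul" else rng.randrange(P)
        chain.append((op, c))
    return chain

def evaluate_forward(x, chain):
    """Evaluation problem: apply each step in order, starting from the input."""
    for op, c in chain:
        x = (x + c) % P if op == "add" else (x * c) % P
    return x

def invert_backward(y, chain):
    """Inversion problem: undo each step in reverse order, starting from the output."""
    for op, c in reversed(chain):
        if op == "add":
            y = (y - c) % P
        else:
            y = (y * pow(c, -1, P)) % P  # modular inverse, Python 3.8+
    return y

rng = random.Random(0)
chain = make_chain(num_steps=5, rng=rng)
x = 7
y = evaluate_forward(x, chain)
assert invert_backward(y, chain) == x  # backward reasoning recovers the input
```

In this reading, forward reasoning matches evaluation problems and backward reasoning matches inversion problems, which is the strategy-problem alignment the assessment refers to.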

Strengths in rigor:

Three SFT seeds with independent RL runs per seed

Comprehensive ablations over learning rates, KL coefficients, group sizes, and field sizes (GF(11), GF(13))

Pass@k analysis alongside pass@1, providing distributional insights

Disaggregated analysis by problem type and difficulty level

Rule-based strategy classifier enabling quantitative routing analysis

Time-scale separation analysis distinguishing selection (~40 steps) from improvement (hundreds of steps)
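For reference on the pass@k item, a commonly used unbiased estimator computes pass@k from n sampled completions of which c are correct, as 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; this assessment does not say which estimator the authors used, and `pass_at_k` is a hypothetical helper name.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct,
    given that c of n sampled completions were correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=1))
print(pass_at_k(n=16, c=4, k=8))
```

With k = 1 the estimator reduces to the fraction of correct samples, so comparing it against larger k indicates how much correct behavior is present in the sampling distribution but not yet reliably chosen.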

Concerns:

The experiments use a single, relatively small model (Qwen-2.5-1.5B) with LoRA, raising questions about generalizability to larger models or full fine-tuning

The synthetic task, while well-controlled, has only two strategies and two problem types—real-world reasoning involves a much richer strategy space

SFT datasets are small (2048 examples) and RL datasets are also modest (1024 prompts), which may not reflect dynamics at scale

Only GRPO is tested; other RL algorithms (PPO, REINFORCE variants) might yield different mechanistic behavior

The claim that "RL does not induce novel reasoning capabilities" is strong given the limited experimental scope
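For readers unfamiliar with GRPO, the "group size" in the ablations above refers to the number of completions sampled per prompt, whose rewards are normalized against each other to form advantages. The following is a minimal sketch of that normalization as it is commonly described, not the authors' implementation. The function name, the `eps` value, and the use of the population standard deviation are assumptions, and implementations differ on these details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sampled group (non-empty list)
    to zero mean and roughly unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]

# One correct completion out of four: it gets a positive advantage,
# and the incorrect ones get negative advantages.
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))

# Every completion correct: all advantages are zero, so this group
# contributes no policy-gradient signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

The second case shows why prompt difficulty can matter for GRPO-style training in general: groups with uniform rewards carry no learning signal. Whether this accounts for the paper's difficulty results is not something this assessment establishes.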

3. Potential Impact

The paper's practical implications are clear and actionable:

Data diversity in SFT matters more than the RL recipe: If strategy selection is the primary driver, practitioners should invest in diverse solution strategies during SFT/pre-training

Difficulty curriculum for RL: Strategy improvement requires RL problems harder than SFT problems; same-difficulty RL yields no generalization benefit (Table 1 is particularly compelling here)

Composition requires seeding: RL cannot learn composition from scratch but can extend compositional abilities seeded in SFT data

These findings connect to the broader debate about whether RL creates new capabilities or merely refines existing ones, providing controlled evidence for the latter view. This has implications for how the community allocates resources between pre-training data curation and RL pipeline engineering.

4. Timeliness & Relevance

This paper is highly timely. The rapid deployment of RL post-training (DeepSeek-R1, OpenAI o-series) has created an urgent need for mechanistic understanding. The field is currently in a phase where many groups are applying RL recipes without understanding why they work, leading to inefficient iteration. Papers like this help build the theoretical foundation that can guide principled improvements.

The paper directly addresses active debates in the community: whether RL creates "aha moments" or merely amplifies pre-existing capabilities, whether R1-Zero-style training truly learns reasoning from scratch, and what role data plays in RL success. The strategy selection finding provides a clean explanation for why models with richer pre-training tend to benefit more from RL.

5. Strengths & Limitations

Key Strengths:

Clean experimental design that isolates mechanisms effectively

The strategy selection finding is both novel and actionable—prior work showed convergence to a single strategy, but routing across strategies is a richer and more practically relevant phenomenon

The temporal decomposition showing selection happens fast (~40 steps) while improvement is slower provides useful training diagnostics

Table 1 is a highlight: it cleanly shows that same-difficulty RL enables only selection (for FB models) while harder RL enables both selection and improvement

The unification of amplification under selection is elegant and well-supported

Notable Limitations:

Scale gap: The gap between 1.5B parameter models on synthetic arithmetic and frontier-scale models on real math/coding is enormous. The mechanisms identified may not transfer

Binary strategy space: Real reasoning involves a continuum of strategies with partial overlaps; the clean dichotomy may not reflect realistic settings

The paper's strongest claim, that RL doesn't create novel capabilities, is essentially a negative result bounded by its experimental scope. The authors acknowledge this, but the framing could overstate confidence

No analysis of what happens mechanistically at the representation level (attention patterns, feature emergence, etc.)—the "mechanistic" framing is at the behavioral rather than computational level

The composition experiments (Section 3.2.1) contradict Yuan et al. [YCZ+25], but the resolution of this tension is not deeply explored

6. Additional Observations

The paper is well-written with clear figures and a logical structure. The distinction between mechanisms (selection, improvement) and phenomena (amplification, composition) is a useful conceptual framework. However, the paper would benefit from discussion of when these mechanisms might break down or interact in more complex ways, and from experiments on at least one natural language reasoning task to bridge the synthetic-to-real gap.

The work is best viewed as providing useful hypotheses and a clean experimental framework rather than definitive answers about RL post-training at scale.

Rating: 6.5 / 10

Significance 6.5 · Rigor 7 · Novelty 6 · Clarity 8

Generated Jun 12, 2026

Comparison History (38)

Won vs. Disparate Impact in Synthetic Data Generation

Paper 2 addresses the highly timely challenge of understanding reinforcement learning post-training for large language models, a rapidly expanding frontier in AI. By revealing the underlying mechanics of strategy selection and improvement, it offers broad, immediate applicability for scaling reasoning capabilities across foundational models. While Paper 1 tackles important ethical and privacy issues in synthetic data generation, Paper 2 has a wider potential scientific impact across the machine learning community due to the current intense industry and academic focus on advancing LLM reasoning.

gemini-3.1-pro-preview · Jun 12, 2026

Lost vs. Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Paper 1 introduces a highly novel, tangible framework (SWITCH) that solves critical optimization and interpretability bottlenecks in latent chain-of-thought reasoning using on-policy RL. While Paper 2 provides valuable empirical insights into RL post-training, Paper 1 proposes a fundamental methodological innovation (Switch-GRPO) that enables new capabilities in hidden-state recurrence, which is currently at the frontier of LLM reasoning research. Its combination of a new architectural approach with mechanistic interpretability gives it higher potential for widespread adoption and subsequent follow-up research.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. Uncertainty Estimation for Molecular Diffusion Models

Paper 1 addresses a highly timely and broadly impactful question: understanding the mechanics of RL post-training for reasoning in LLMs. Given the explosive interest in reasoning models (e.g., OpenAI o1, DeepSeek-R1), mechanistic insights into how RL training works—identifying strategy selection and strategy improvement as core mechanisms—provides both theoretical understanding and practical guidance for scaling. This has broad implications across the entire LLM community. Paper 2 addresses a more niche problem (uncertainty in molecular diffusion models) with solid but more incremental contributions and a narrower audience.

claude-opus-4-6 · Jun 12, 2026

Won vs. Positional Encoding in the Context of Memristor-Based Analog Computation for Automatic Speech Recognition

Paper 1 addresses a fundamental and timely question about the mechanisms underlying reinforcement learning post-training for reasoning models, which is at the forefront of AI research. Its findings on strategy selection and improvement provide broadly applicable mechanistic insights and practical guidance for scaling reasoning capabilities. Paper 2 addresses a niche hardware-specific optimization for memristor-based analog computation in speech recognition, which, while technically useful, has a narrower scope of impact and audience. The breadth, timeliness, and relevance of Paper 1 to the rapidly growing field of LLM reasoning give it substantially higher potential impact.

claude-opus-4-6 · Jun 12, 2026

Won vs. Loss-Shift Transfer via Bayes Quotients

Paper 2 addresses the highly timely and impactful area of reinforcement learning for LLM reasoning. By providing mechanistic insights and practical interventions for scaling reasoning capabilities, it has immediate, broad applicability in current AI research. While Paper 1 offers a novel theoretical framework for transfer learning, Paper 2's direct relevance to the rapid development of advanced reasoning models gives it higher potential for widespread scientific and real-world impact.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. Understanding Truncated Positional Encodings for Graph Neural Networks

Paper 1 addresses a highly timely and broadly impactful question—understanding the mechanics of RL post-training for reasoning models, which is central to current LLM development. Its mechanistic insights (strategy selection and improvement) offer practical guidance for scaling reasoning capabilities, relevant to a massive and growing research community. Paper 2 makes solid theoretical contributions on truncated positional encodings for GNNs, but addresses a more niche topic with narrower audience. The timeliness and breadth of impact of Paper 1, given the current explosion in reasoning model research, gives it higher estimated scientific impact.

claude-opus-4-6 · Jun 12, 2026

Won vs. Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs

Paper 1 likely has higher impact due to its timeliness and broad relevance to current LLM post-training with RL, a central bottleneck for reasoning/coding systems. Its mechanistic framing (strategy selection vs. strategy improvement) offers a potentially general conceptual model that can guide data design and scaling interventions across many tasks and labs, with immediate practical implications. Paper 2 presents a solid, incremental architectural advance for multimodal VAEs (exact Hölder pooling + shared/private hierarchy) with clearer but narrower applicability, in a subarea that currently has less field-wide momentum than RL post-training for LLMs.

gpt-5.2 · Jun 12, 2026

Won vs. Physics-informed diffusion models in spectral space

Paper 2 addresses the highly timely and rapidly growing field of reasoning LLMs and RL post-training, which is currently one of the most active areas in AI research. Its mechanistic insights into strategy selection and strategy improvement provide foundational understanding that could influence how the entire field approaches training reasoning models. Paper 1, while technically solid and novel in combining spectral methods with diffusion models for PDEs, addresses a more niche intersection of physics-informed ML and generative modeling with narrower immediate impact. Paper 2's practical implications for scaling reasoning capabilities give it broader influence.

claude-opus-4-6 · Jun 12, 2026

Won vs. Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization

Paper 1 addresses a highly timely and impactful question—understanding the mechanics of reinforcement learning post-training for reasoning models, which is central to current LLM development. It provides actionable insights (strategy selection vs. improvement, role of data diversity and difficulty) with direct practical implications for scaling reasoning capabilities. Paper 2 offers interesting findings about module-specific optimization geometry but addresses a more niche topic (weight-space manifold constraints) with narrower applicability, limited to specific optimizer variants and small-scale GPT-2 experiments. Paper 1's broader relevance to the rapidly growing RL-for-reasoning field gives it higher potential impact.

claude-opus-4-6 · Jun 12, 2026

Won vs. PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update

Paper 2 addresses a highly critical and timely topic: the mechanics of reinforcement learning post-training for LLM reasoning capabilities. Given the current focus on scaling reasoning in foundation models, its insights into strategy selection and improvement offer profound implications for advancing AI capabilities globally. While Paper 1 provides a strong, rigorous method for constrained generative modeling in physical systems, Paper 2's potential to influence the broader, rapidly evolving field of LLM training gives it a significantly higher overall scientific impact.

gemini-3.1-pro-preview · Jun 12, 2026
