Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

Qihan Ren, Peng Wang, Ruikun Cai, Shuai Shao, Dadi Guo, Yuejin Xie, Yafu Li, Quanshi Zhang

#155 of 2292 · Artificial Intelligence
Tournament Score: 1529±26
Win Rate: 66% (35 wins, 18 losses, 53 matches)
Rating: 7.5/10

Abstract

A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.

AI Impact Assessments

(3 models)

Scientific Impact Assessment

Core Contribution

This paper challenges the influential "SFT memorizes, RL generalizes" narrative (Chu et al., 2025) by demonstrating that cross-domain generalization in reasoning SFT is not absent but conditional, shaped by three interacting factors: optimization dynamics, training data quality/structure, and base model capability. The central insight is that many previously reported generalization failures of SFT are artifacts of under-optimization, low-quality data, or insufficient model capability rather than intrinsic limitations of the SFT objective.

The most novel finding is the dip-and-recovery pattern: out-of-domain performance first degrades before recovering and surpassing the base model with extended training. This directly explains why short-epoch protocols (common in prior work) systematically underestimate SFT's generalization potential. The paper also reveals that procedural patterns (backtracking, verification) in long-CoT traces—not domain content—drive cross-domain transfer, evidenced by a toy Countdown arithmetic game improving performance on math, code, and science benchmarks. The asymmetric finding that reasoning improves while safety degrades adds nuance and practical importance.
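The dip-and-recovery pattern described above can be stated as a simple check on a checkpoint curve. The sketch below is one possible formalization, not the paper's definition: the function name, the `tolerance` parameter, and the example scores are all hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def find_dip_and_recovery(
    baseline: float,
    scores: Sequence[float],
    tolerance: float = 0.0,
) -> Optional[dict]:
    """Return dip/recovery indices if the curve dips below baseline and later exceeds it.

    baseline:  out-of-domain score of the base model before SFT.
    scores:    out-of-domain scores at successive training checkpoints.
    tolerance: margin (in score units) a change must exceed to count.
    """
    if not scores:
        return None
    # The dip is the lowest checkpoint; it must fall below the baseline.
    dip_idx = min(range(len(scores)), key=lambda i: scores[i])
    if scores[dip_idx] >= baseline - tolerance:
        return None
    # Recovery is the first later checkpoint that surpasses the baseline.
    for i in range(dip_idx + 1, len(scores)):
        if scores[i] > baseline + tolerance:
            return {
                "dip_index": dip_idx,
                "recovery_index": i,
                "dip_depth": baseline - scores[dip_idx],
            }
    return None


# Illustrative numbers only (not taken from the paper).
print(find_dip_and_recovery(40.0, [38.0, 35.5, 41.0, 44.2]))
```

Under this definition, a run stopped at the second checkpoint would report only the dip, which is the failure mode the review attributes to short-epoch protocols.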

Methodological Rigor

The experimental design is notably systematic. The authors hold two factors constant while varying the third, enabling clearer causal attribution than typical multi-factor studies. Key design strengths include:

  • Starting from pretrained base models (not instruction-tuned), eliminating confounds from prior alignment
  • Multiple model families (Qwen3, Qwen2.5, InternLM2.5) across scales (1.7B–20B), demonstrating robustness
  • Controlled data comparisons: Math-CoT vs. Math-NoCoT (same queries/answers, isolating CoT effect), Math-NoCoT vs. NuminaMath (isolating quality), Countdown-CoT (isolating procedural patterns from domain content)
  • Comprehensive evaluation suite: 9+ benchmarks spanning in-domain reasoning, OOD reasoning, general capabilities, and safety

However, there are methodological caveats. The overfitting stress test (Section 3.4) covers only a limited set of hyperparameter combinations. The "dip-and-recovery" pattern, while compelling, lacks formal characterization: it is identified visually from training curves rather than rigorously defined. The repeated exposure experiment (Section 3.3) confounds batch size with epoch count, though the authors acknowledge this. The safety evaluation relies on a single benchmark (HEx-PHI) with GPT-4.1 as judge, which introduces potential judge bias.
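The hold-two-vary-one design described in this section can be sketched as a small configuration generator. This is an illustration under assumptions, not the authors' code: `RunConfig`, `one_factor_at_a_time`, and the epoch values are hypothetical, and the model and dataset strings are just the family and dataset names mentioned in the review, used as labels.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class RunConfig:
    model: str    # base-model family label
    dataset: str  # training-data label
    epochs: int   # optimization budget (hypothetical values below)


def one_factor_at_a_time(base: RunConfig, options: Dict[str, Sequence]) -> List[RunConfig]:
    """Enumerate runs that differ from `base` in exactly one field."""
    runs = [base]
    for field_name, values in options.items():
        for value in values:
            # Skip the base value so the base run is not duplicated.
            if value != getattr(base, field_name):
                runs.append(replace(base, **{field_name: value}))
    return runs


base = RunConfig(model="Qwen3", dataset="Math-CoT", epochs=2)
options = {
    "model": ["Qwen3", "Qwen2.5", "InternLM2.5"],
    "dataset": ["Math-CoT", "Math-NoCoT", "Countdown-CoT"],
    "epochs": [2, 8],
}
for run in one_factor_at_a_time(base, options):
    print(run)
```

Because every non-base run changes a single field, any difference from the base run can be attributed to that one factor, which is the property the review credits for clearer causal attribution.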

    Potential Impact

    Immediate practical impact: The findings directly inform practitioners about training long-CoT reasoning models. The key actionable insights—train longer than you think, prioritize data quality over quantity, use repeated exposure over one-pass coverage, and ensure sufficient model capability—are immediately applicable. The response length diagnostic as a proxy for optimization progress is a pragmatic contribution.

    Reframing the SFT vs. RL debate: By demonstrating that SFT generalization failures are often conditional rather than intrinsic, this work challenges a narrative that has motivated significant research into RL-based alternatives and modified SFT objectives. This could redirect community attention toward understanding *when* each approach is appropriate rather than assuming RL's superiority.

    Safety implications: The finding that long-CoT SFT systematically degrades safety—with models learning to self-rationalize around safety guardrails—has important implications for the deployment of reasoning models. The controlled CoT vs. no-CoT comparison provides causal evidence that procedural patterns, not domain content, drive safety degradation.

    Broader influence: The model capability finding (weaker models imitate verbosity while stronger models internalize transferable patterns) connects to fundamental questions about what models learn during fine-tuning and has implications for distillation and knowledge transfer research.

    Timeliness & Relevance

    This paper is highly timely. The "SFT memorizes, RL generalizes" claim has become near-axiomatic in the post-training community, driving substantial investment in RL-based methods (RLHF, RLVR, etc.). Several concurrent works modify the SFT objective specifically to address perceived generalization limitations. By showing these limitations are often artifactual, the paper provides a necessary corrective at a moment when the field risks over-indexing on RL-based solutions. The focus on long-CoT reasoning SFT is directly relevant to the current wave of reasoning-focused models (DeepSeek-R1, Qwen3, etc.).

    Strengths

    1. Systematic factorial design with controlled comparisons across optimization, data, and model capability

    2. Strong replicability: the authors promise to release all code, data, models, and intermediate checkpoints

    3. Novel diagnostic insight: response length as a proxy for optimization stage

    4. Surprising and counterintuitive findings: Countdown (a toy game) outperforming diverse math data without CoT on math benchmarks

    5. Comprehensive appendix with full training dynamics tables across all configurations (40+ pages of detailed results)

    6. The token-level analysis (Section C.7) showing that larger models' advantages concentrate on reasoning transition words ("therefore," "alternatively," "wait") is a compelling mechanistic insight
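A token-level comparison of the kind mentioned in item 6 could look like the sketch below. The review does not say how Section C.7 computes its result, so the metric here (mean per-token log-probability gap between a larger and a smaller model, grouped by whether the token is a transition word) is an assumption, and the function name and all numbers are hypothetical.

```python
from typing import Dict, List, Set

# Transition words quoted in the review; the full list used in the paper is not given.
TRANSITION_WORDS: Set[str] = {"therefore", "alternatively", "wait"}


def mean_advantage_by_group(
    tokens: List[str],
    logp_large: List[float],
    logp_small: List[float],
    transition_words: Set[str] = TRANSITION_WORDS,
) -> Dict[str, float]:
    """Average (large - small) log-probability gap for transition vs. other tokens."""
    if not (len(tokens) == len(logp_large) == len(logp_small)):
        raise ValueError("tokens and log-prob lists must have equal length")
    sums = {"transition": 0.0, "other": 0.0}
    counts = {"transition": 0, "other": 0}
    for tok, lp_l, lp_s in zip(tokens, logp_large, logp_small):
        group = "transition" if tok.strip().lower() in transition_words else "other"
        sums[group] += lp_l - lp_s
        counts[group] += 1
    # An empty group yields NaN rather than a division error.
    return {g: (sums[g] / counts[g] if counts[g] else float("nan")) for g in sums}


# Made-up log-probabilities for a five-token reference trace.
tokens = ["Wait", "the", "sum", "therefore", "is"]
logp_large = [-0.5, -1.0, -2.0, -0.25, -1.0]
logp_small = [-2.5, -1.0, -2.5, -1.25, -1.5]
print(mean_advantage_by_group(tokens, logp_large, logp_small))
```

A larger gap in the "transition" group than in the "other" group would match the pattern the review describes, though a real analysis would need tokenizer-aware matching and many traces.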

    Limitations

    1. No direct SFT vs. RL comparison: The paper deliberately avoids this, arguing that condition-dependence makes a fair comparison difficult, but this limits the ability to contextualize findings relative to the motivating claim

    2. Math-only training data: Whether findings transfer to code, scientific, or multi-modal reasoning SFT remains unvalidated

    3. Scale ceiling at 20B parameters: Dynamics may differ substantially at 70B+ or MoE scales

    4. Lack of theoretical grounding: The dip-and-recovery phenomenon is empirically documented but not mechanistically explained beyond qualitative speculation about surface pattern imitation followed by deeper learning

    5. Limited safety evaluation: Single benchmark, single judge model

    Overall Assessment

    This is a well-executed empirical study that challenges an influential but potentially oversimplified narrative. Its systematic approach to disentangling confounded factors provides actionable insights for the reasoning model training community. While it doesn't provide theoretical explanations for the observed phenomena and lacks direct RL comparisons, the breadth of experiments and controlled design make the conditional generalization framework convincing. The paper's impact will likely be substantial in shifting the discourse from "does SFT generalize?" to "under what conditions?"

    Rating: 7.5/10
    Significance 8 · Rigor 7.5 · Novelty 7 · Clarity 8.5

    Generated Apr 9, 2026

    Comparison History (53)

    vs. Conditional Attribute Estimation with Autoregressive Sequence Models
    claude-opus-4.6 · 5/16/2026

    Paper 1 introduces a novel architectural method (Conditional Attribute Transformers) that enables per-token credit assignment, counterfactual analysis, and steerable generation in a single forward pass—capabilities with broad applications across language modeling, molecular design, and RL. Its methodological novelty and practical utility across multiple domains give it higher impact potential. Paper 2 provides valuable empirical insights on SFT vs. RL generalization, but is primarily an analytical/empirical study that refines existing understanding rather than introducing a new technical capability. Paper 1's framework is more likely to spawn follow-up work and adoption.

    vs. Prompt Optimization Enables Stable Algorithmic Collusion in LLM Agents
    gemini-3 · 4/21/2026

    Paper 1 challenges a fundamental assumption in LLM post-training (that SFT only memorizes) and provides actionable insights into optimization dynamics, data quality, and model capabilities. Its findings have foundational implications for the generative AI community, directly impacting how future models are aligned and trained. Paper 2, while highly relevant for AI safety and economics, has a more specialized focus on multi-agent market simulations, giving Paper 1 a broader and more ubiquitous scientific impact across the field.

    vs. Sequence Search: Automated Sequence Design using Neural Architecture Search
    claude-opus-4.6 · 4/17/2026

    Paper 2 addresses a fundamental and timely question in LLM post-training—whether SFT can generalize for reasoning—challenging a widely held narrative with systematic empirical analysis. Given the massive investment in LLM reasoning (e.g., o1, DeepSeek-R1), these findings on optimization dynamics, data quality, and model capability conditions have immediate broad impact across the AI community. Paper 1, while novel in applying NAS to MRI sequence design, targets a narrower domain (MRI physics) with limited cross-field applicability. Paper 2's insights on the safety-reasoning tradeoff also raise important alignment implications.

    vs. TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models
    gemini-3 · 4/17/2026

    Paper 1 offers fundamental insights into LLM learning dynamics, challenging the prevailing narrative that SFT only memorizes while RL generalizes. By uncovering the conditional nature of reasoning generalization, it broadly impacts foundational research in model training, scaling, and alignment. Paper 2, while offering a highly practical and efficient system for model collaboration, is more applied and focuses on latency/cost optimization rather than shifting core theoretical paradigms.

    vs. Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training
    claude-opus-4.6 · 4/15/2026

    Paper 2 challenges a widely-held belief in the LLM community ('SFT memorizes, RL generalizes') with nuanced empirical analysis, offering broadly applicable insights about optimization dynamics, data quality, and model capability that affect virtually all LLM post-training research. Its findings (dip-and-recovery pattern, asymmetric generalization, capability-dependent transfer) have immediate implications for how practitioners train reasoning models. Paper 1, while novel in applying cycle-consistency to search agents, addresses a narrower problem (gold-supervision-free search training) with more limited cross-field impact.

    vs. Evaluating Relational Reasoning in LLMs with REL
    gemini-3 · 4/15/2026

    Paper 2 challenges a prevailing narrative regarding SFT and RL in LLM post-training, offering foundational insights into generalization dynamics, data quality, and model capability. Because it addresses core training methodologies used universally across the AI field, its findings are highly actionable and likely to shift paradigms in how models are fine-tuned, resulting in a broader and more immediate scientific impact compared to the specific benchmarking framework introduced in Paper 1.

    vs. Beyond Scores: Diagnostic LLM Evaluation via Fine-Grained Abilities
    gpt-5.2 · 4/15/2026

    Paper 2 has higher potential impact because it introduces a broadly applicable, principled evaluation methodology (multidimensional IRT with fine-grained ability taxonomies) validated across many models and multiple STEM domains, enabling downstream uses like targeted training, model selection, and benchmark design. Its methodological rigor (predictive validity on unseen items, cross-benchmark generalization, clear metrics) and breadth/timeliness in LLM evaluation make it likely to influence both research and practice. Paper 1 is valuable but more specific to SFT/CoT dynamics and offers narrower, less immediately general tooling.

    vs. Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories
    gemini-3 · 4/14/2026

    Paper 1 introduces a highly novel and practical methodology (CRPS) that directly addresses a major bottleneck in LLM reasoning training. By achieving a 20x reduction in dataset size while improving out-of-domain generalization, it offers immediate, high-impact applications for developing advanced reasoning models. While Paper 2 provides valuable theoretical insights into SFT generalization, Paper 1's algorithmic innovation and dramatic efficiency gains are likely to drive broader and more immediate adoption across the field.

    vs. Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems
    claude-opus-4.6 · 4/14/2026

    Paper 1 addresses a fundamental and timely question in LLM training methodology—whether reasoning SFT generalizes—with nuanced conditional analysis across optimization, data, and model capability. It challenges a widely-held narrative ('SFT memorizes, RL generalizes') with rigorous empirical evidence, identifying the dip-and-recovery pattern and asymmetric generalization (reasoning vs. safety). This has broad implications for the massive LLM post-training community. Paper 2 presents an impressive engineering contribution (reduction library via AI agents), but its impact is more niche, targeting computational complexity practitioners, and the core contribution is more engineering than scientific insight.

    vs. SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
    claude-opus-4.6 · 4/13/2026

    Paper 1 challenges a widely held belief ('SFT memorizes, RL generalizes') with nuanced conditional analysis across optimization, data, and model capability dimensions. Its findings—dip-and-recovery patterns, asymmetric generalization (reasoning improves but safety degrades), and capability-dependent transfer—provide foundational insights that reshape how the field thinks about post-training. Paper 2 offers a solid engineering contribution (sequence-level PPO) but is more incremental, optimizing an existing paradigm. Paper 1's broader conceptual impact across training methodology, safety, and generalization theory gives it wider influence.

    vs. SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
    gemini-3 · 4/10/2026

    Paper 1 challenges a prevailing core assumption in LLM post-training (that SFT memorizes while RL generalizes) and uncovers fundamental training dynamics like the 'dip-and-recovery' pattern. Such paradigm-shifting insights into optimization and generalization typically have a broader and longer-lasting scientific impact across the field than the practical data curation framework proposed in Paper 2, despite Paper 2's strong empirical results.

    vs. KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
    gpt-5.2 · 4/10/2026

    Paper 1 likely has higher impact due to a concrete, reusable benchmark and evaluation protocol that enables standardized progress on personalized/proactive mobile agents—an area with clear real-world deployment relevance. Its interactive Android emulation, hidden user profiles, and consent/proactiveness evaluation address a timely capability gap beyond static preference modeling, with potential broad influence across HCI, agentic LLMs, and safety/UX evaluation. Paper 2 offers valuable insight into SFT generalization conditions, but is more diagnostic/nuanced within post-training research and may translate less directly into widely adopted artifacts or application-facing evaluation standards.

    vs. Hidden Biases in Conditioning Autoregressive Models
    gpt-5.2 · 4/10/2026

    Paper 2 has higher potential impact: it provides formal problem definitions and complexity-theoretic hardness results (NP-hard/#P-hard) for exact decoding and exact conditioning of autoregressive models under global constraints. These are broadly applicable, durable theoretical contributions spanning NLP, music generation, constrained decoding, and probabilistic inference, and they clarify fundamental limits underlying many practical “conditioning” heuristics. Paper 1 is timely and useful for LLM post-training practice, but its contributions are more empirical and conditional on specific optimization/data/model regimes, making the impact likely narrower and less foundational.

    vs. From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation
    gpt-5.2 · 4/10/2026

    Paper 2 is more novel and broadly impactful: it introduces a principled, distribution-free conformal decision layer for multi-agent deliberation that turns debate into calibrated act-vs-escalate policies with formal coverage guarantees, directly addressing deployment safety. The method is immediately applicable to real-world automated decision systems and is timely given interest in agentic LLMs. Paper 1 provides valuable diagnostic insights into SFT generalization dynamics, but is more incremental/interpretive and narrower in application compared to a reusable, safety-oriented framework with measurable guarantees.

    vs. ProMedical: Hierarchical Fine-Grained Criteria Modeling for Medical LLM Alignment via Explicit Injection
    claude-opus-4.6 · 4/10/2026

    Paper 2 challenges a widely-held assumption in the LLM community ('SFT memorizes, RL generalizes') with nuanced conditional analysis, offering broadly applicable insights about optimization dynamics, data quality, and model capability that transcend any single domain. Its findings about asymmetric generalization (reasoning improves but safety degrades) have fundamental implications for all LLM post-training research. Paper 1, while valuable for medical AI alignment, is more domain-specific and incremental in its contributions (dataset + reward model + benchmark). Paper 2's conceptual reframing will likely influence a wider range of future work.

    vs. Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution
    gemini-3 · 4/10/2026

    Paper 1 fundamentally challenges a prevailing narrative in LLM training, offering foundational insights into optimization, data, and model dynamics in reasoning SFT. By explaining 'why' and 'how' generalization occurs, it has the potential to broadly influence theoretical understanding and future training paradigms across the field. While Paper 2 offers a highly practical and efficient orchestration framework, Paper 1's theoretical contributions to the mechanics of learning and generalization present a deeper, longer-lasting scientific impact.

    vs. How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles
    gemini-3 · 4/10/2026

    Paper 2 addresses a critical, widespread issue in LLM evaluation and ensembling (behavioral entanglement) with a rigorous statistical framework and actionable metrics. Its proposed de-entangled reweighting directly improves multi-model systems like LLM-as-a-judge. While Paper 1 provides valuable empirical insights into SFT generalization, Paper 2's methodology has broader implications for benchmarking, auditing, and safely deploying LLM ensembles, offering a highly practical solution to synchronized failures.

    vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
    gpt-5.2 · 4/10/2026

    Paper 1 has higher potential impact due to its novel, pre-registered benchmark revealing identity-contingent safety withholding that can cause real clinical harm, with validated physician-linked scoring and clear, actionable failure-mode taxonomy (trained withholding vs incompetence vs filtering). Its real-world applicability (medical advice, safety policy, evaluation tooling) and timeliness for AI governance are strong, and it can influence multiple communities (AI safety, healthcare, evaluation methodology, policy). Paper 2 is valuable for post-training science, but its contributions are more incremental and primarily within ML optimization/generalization.

    vs. Reasoning Graphs: Deterministic Agent Accuracy through Evidence-Centric Chain-of-Thought Feedback
    gemini-3 · 4/10/2026

    Paper 1 introduces a novel, structurally innovative methodology (Reasoning Graphs) that directly addresses major bottlenecks in LLM agent deployment: unpredictability and high variance. By enabling evidence-centric, persistent memory without retraining, it offers a highly practical and scalable solution with broad real-world applicability. While Paper 2 provides valuable empirical insights into SFT generalization, Paper 1's concrete architectural contribution is likely to drive more immediate adoption, tooling development, and follow-up research in autonomous agent design.

    vs. Emotion Concepts and their Function in a Large Language Model
    gemini-3 · 4/10/2026

    Paper 1 challenges a core assumption in LLM post-training by demonstrating that SFT can achieve cross-domain generalization. Its discovery of the 'dip-and-recovery' training pattern and the necessity of long-CoT data provides highly actionable, immediate implications for how the entire AI industry trains reasoning models, likely impacting foundational model development more broadly than Paper 2's specific, albeit fascinating, interpretability findings on functional emotions.