Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective
Junpeng Zhang, Lei Cheng, Guoxi Zhang, Hua Cai, Qing Xu, Quanshi Zhang
Abstract
This paper explores a scientific question in supervised fine-tuning (SFT): why SFT is broadly effective for small-scale deep neural networks, yet can produce inconsistent or even detrimental effects when applied to large language models (LLMs). Recent advances in interaction-based explanations suggest that interactions between words/tokens provide a faithful metric for quantifying the inference patterns encoded by LLMs. We find that the evolution of interactions during SFT can effectively explain the inconsistent effectiveness of SFT for LLMs. Specifically, we find that (1) SFT primarily removes noise-like interactions, while rarely acquiring reliable new interactions. (2) This denoising stage is extremely brief, after which continued fine-tuning tends to introduce overfitted interactions. We validate these findings across multiple LLMs and datasets. Our findings provide new insights into early stopping and offer practical guidance for LLM training.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper investigates why supervised fine-tuning (SFT) yields inconsistent results for LLMs, sometimes improving and sometimes degrading performance, using AND-OR interactions between input tokens as a diagnostic lens. The key findings are: (1) SFT primarily removes noise-like interactions (high-order, mutually canceling, non-generalizable patterns) rather than learning new reliable ones; (2) this denoising phase is extremely brief (often within ~1000 training steps), after which continued training introduces overfitted interactions. The paper frames SFT as two distinct phases, a short beneficial denoising stage followed by a prolonged harmful overfitting stage, and uses this framing to reconcile seemingly contradictory views in the literature about SFT's effectiveness.
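To make the notion of an interaction concrete, here is a minimal sketch of the AND half of the framework, assuming the standard Harsanyi form I(S) = sum over T ⊆ S of (-1)^(|S|-|T|) · v(T), where v(T) is the model's scalar output when only the input variables in T are kept and the rest are masked. The names `and_interactions` and `toy_model` are hypothetical, the toy scoring function stands in for an LLM, and the OR interactions used in the paper are not covered here.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset (2**n subsets in total)."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def and_interactions(v, n):
    """Harsanyi AND interaction for every subset S of n input variables.

    I(S) = sum_{T subseteq S} (-1)^(|S| - |T|) * v(T)
    """
    players = range(n)
    # One model evaluation per masked input: 2**n calls, hence the
    # exponential worst-case cost.
    outputs = {T: v(T) for T in subsets(players)}
    return {
        S: sum((-1) ** (len(S) - len(T)) * outputs[T] for T in subsets(S))
        for S in subsets(players)
    }


def toy_model(kept):
    """Hypothetical scalar score given the set of unmasked variable indices."""
    score = 0.5                      # bias when everything is masked
    if 0 in kept:
        score += 1.0                 # order-1 pattern on variable 0
    if {1, 2} <= kept:
        score += 2.0                 # order-2 AND pattern on variables 1 and 2
    return score


interactions = and_interactions(toy_model, 3)

# Universal matching: the output on any masked input equals the sum of the
# interactions whose variables are all present in that input.
for T in subsets(range(3)):
    reconstructed = sum(val for S, val in interactions.items() if S <= T)
    assert abs(reconstructed - toy_model(T)) < 1e-9
```

In this toy case only three interactions are nonzero (the empty set, {0}, and {1, 2}), which illustrates the sparsity the framework relies on.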
Methodological Rigor
The paper builds on the AND-OR interaction framework of Chen et al. (2024), which has theoretical guarantees (universal matching property) and empirical validation of sparsity. The methodology for categorizing interactions into removed, preserved, and newly emerged types is clean and well-defined. The authors validate findings across seven model-dataset combinations (Qwen2.5-3B/7B, Llama-2-7B, Llama-3-8B, Gemma-3-4B across GoEmotions, Unilaw-R1-Data, and Databricks-Dolly-15k), providing reasonable breadth.
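The three-way categorization can be sketched as a set comparison between the salient interactions before and after fine-tuning. This is an illustration under assumptions rather than the authors' procedure: the salience threshold `tau`, the function names `categorize` and `mean_order`, and the example values are all hypothetical.

```python
def categorize(before, after, tau=0.1):
    """Split interactions into removed, preserved, and newly emerged sets.

    `before` and `after` map frozensets of variable indices to interaction
    strengths; an interaction counts as salient when |strength| > tau
    (an assumed threshold). Missing keys are treated as non-salient.
    """
    salient_before = {S for S, val in before.items() if abs(val) > tau}
    salient_after = {S for S, val in after.items() if abs(val) > tau}
    return {
        "removed": salient_before - salient_after,
        "preserved": salient_before & salient_after,
        "new": salient_after - salient_before,
    }


def mean_order(group):
    """Average number of variables per interaction (0.0 for an empty group)."""
    return sum(len(S) for S in group) / len(group) if group else 0.0


# Hypothetical interaction strengths before and after SFT.
before = {
    frozenset({0}): 1.0,
    frozenset({1, 2}): 0.5,
    frozenset({0, 1, 2}): -0.4,
}
after = {
    frozenset({0}): 0.9,
    frozenset({1, 2}): 0.05,
    frozenset({2}): 0.3,
}

groups = categorize(before, after)
orders = {name: mean_order(group) for name, group in groups.items()}
```

Comparing `orders` across groups is one simple way to check whether the removed interactions are of higher order than the preserved ones, as the review describes.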
However, there are notable methodological concerns:
1. Input variable selection: The analysis operates on a small number of sampled words (~10 per input), which significantly reduces the input space. While this is justified for computational tractability, it raises questions about whether the observed interaction dynamics truly capture the full picture of SFT's effects on the model.
2. LoRA-only fine-tuning: All experiments use LoRA, a parameter-efficient method that updates a low-rank subspace. It remains unclear whether the findings generalize to full fine-tuning, which is the more classical SFT setup and arguably the setting where overfitting concerns are most acute.
3. Limited task diversity: The datasets span emotion classification, legal QA, and instruction-following, but the evaluation focuses primarily on token-level prediction scores rather than downstream task metrics. The connection between interaction quality and actual task performance could be more explicitly established.
4. Generalizability metric: Measuring interaction generalizability against a single baseline LLM with a different architecture is somewhat brittle, because the choice of baseline model could influence the generalizability scores.
Potential Impact
The paper offers several potentially valuable contributions:
However, the practical impact is tempered by the computational cost of interaction extraction (exponential in the worst case, though approximations exist). The paper acknowledges this limitation but doesn't demonstrate a practical early-stopping pipeline that would be usable at scale.
Timeliness & Relevance
The paper addresses a timely question as the field debates the relative merits of SFT versus RLHF and other post-training paradigms. The finding that SFT's primary value is denoising rather than learning new capabilities resonates with recent observations (e.g., LIMA's "less is more" thesis) and the shift toward reinforcement learning-based training (DeepSeek-R1). The interaction-based explanation framework is gaining traction, and applying it to understand training dynamics is a natural and timely direction.
Strengths
1. Clear narrative: The two-phase characterization (denoising then overfitting) is intuitive and well-supported by multiple complementary metrics (order distribution, generalizability, cancellation ratio).
2. Theoretical grounding: The universal matching property provides formal justification for using interactions as faithful representations of model behavior.
3. Comprehensive visualization: The interaction distribution plots across training steps effectively communicate the temporal dynamics.
4. Multiple validation perspectives: Section 3.3 validates that preserved interactions form the inference backbone from three independent angles (cancellation, individual contribution, sufficiency for prediction).
5. Breadth of models: Testing across four model families strengthens the generality claims.
Limitations & Weaknesses
1. Causal claims are overstated: The paper describes correlations between interaction quality metrics and SFT stages but doesn't establish causation. The claim that "SFT primarily removes noise" is an interpretation of the interaction dynamics, not a mechanistic explanation.
2. No actionable early-stopping algorithm: Despite claiming practical guidance, the paper doesn't propose or evaluate a concrete early-stopping criterion. When exactly should training stop? How sensitive is this to hyperparameters?
3. Missing downstream evaluation: The analysis focuses on interaction-level metrics without systematically connecting them to actual task performance (accuracy, F1, etc.) across the training trajectory.
4. Scale limitations: The largest model tested has 8B parameters. Whether these findings hold for models at 70B+ scale, where the SFT debate is most contentious, is unknown.
5. Single fine-tuning method: Only LoRA is tested, limiting generalizability to full fine-tuning or other PEFT methods.
6. Computational cost: The exponential cost of exact interaction extraction limits the scalability and reproducibility of the approach, despite available approximations.
Overall Assessment
This paper provides an interesting and well-executed empirical study using interactions to characterize SFT dynamics. The two-phase narrative is compelling and consistent across experiments. However, the work is primarily descriptive rather than prescriptive: it offers explanations but limited actionable tools. The gap between the diagnostic insights and practical utility (e.g., a working early-stopping algorithm with demonstrated performance improvements) reduces the immediate impact. The restriction to LoRA and moderate-scale models also limits the strength of the conclusions for the broader SFT debate.
Generated May 19, 2026
Comparison History (23)
Paper 2 is more likely to have higher scientific impact: it addresses a timely, widely relevant question about supervised fine-tuning behavior in LLMs and proposes an explanatory mechanism (interaction evolution) with empirical validation across multiple models/datasets, yielding actionable guidance (early stopping, avoiding overfitted interactions). This combination of novel insight + methodological evidence + broad applicability to LLM training and alignment suggests impact across ML research and practice. Paper 1 is a useful systems/platform contribution, but its novelty is more engineering-oriented and evaluations appear narrower (internal case studies, preference judgments).
Paper 1 addresses a fundamental scientific question about why SFT behaves differently for LLMs versus small networks, providing mechanistic insights through interaction-based analysis. Its findings on denoising dynamics and overfitting have broad implications for LLM training practices and early stopping strategies. Paper 2, while practically useful, is primarily an engineering/systems contribution presenting an autonomous research platform with limited internal evaluation. Paper 1 offers deeper theoretical understanding with validated findings across multiple models, contributing more lasting scientific knowledge to the field.
Paper 1 likely has higher impact due to a concrete algorithmic contribution (SERL) that improves long-horizon credit assignment for multi-turn LLM agents using environment feedback, with strong empirical gains on widely used agent benchmarks (ALFWorld, WebShop). This is timely for tool-using agents and has clear real-world applicability and potential to generalize across interactive settings. Paper 2 offers valuable interpretability insights into SFT dynamics and practical guidance (early stopping), but is more explanatory/diagnostic and may translate less directly into broadly adopted training methods.
Paper 2 introduces a highly innovative, interdisciplinary approach by formalizing LLM self-correction through control theory. By replacing ad-hoc prompting with a systematic, closed-loop framework and introducing rigorous new metrics (convergence, overshoot), it establishes a strong mathematical foundation for a critical area of LLM research. While Paper 1 provides valuable mechanistic insights into SFT, Paper 2's methodological rigor and potential to create a new paradigm for evaluating and improving LLM reasoning give it a higher potential for broad scientific impact.
Paper 2 addresses LLM hallucinations, a critical bottleneck for real-world deployment. Its training-free, inference-time algorithm (TRACE) demonstrates massive empirical gains across a wide variety of models without requiring labels or fine-tuning. While Paper 1 provides valuable theoretical insights into SFT dynamics, Paper 2 offers an immediate, highly scalable, and universally applicable solution to a more pressing problem, likely leading to broader adoption and higher scientific impact.
Paper 2 likely has higher scientific impact: it tackles a core, timely scientific discrepancy in LLM training (why SFT can fail), proposes a mechanistic interaction-based explanation, and reports validation across multiple models/datasets, offering actionable guidance (early stopping) relevant to broad ML/LLM research and practice. Paper 1 introduces a developer framework to balance SDK usability and vendor neutrality; while useful and applicable, it is more engineering/tooling-focused with narrower scientific novelty and cross-field impact compared to Paper 2’s generalizable insights into learning dynamics.
Paper 2 addresses a fundamental theoretical and practical issue in Supervised Fine-Tuning (SFT) for LLMs. Its insights into how SFT affects token interactions and its guidance on early stopping have broad implications across the entire AI and NLP community. While Paper 1 is an innovative and highly useful application of AI to materials science, Paper 2's contributions impact the core methodologies used to train foundation models, yielding a wider breadth of impact.
Paper 2 addresses a fundamental question about LLM training dynamics that affects the rapidly growing field of large language models. Its finding that SFT primarily removes noise-like interactions rather than learning new ones, and that overfitting quickly follows, provides actionable insights for the massive community working on LLM fine-tuning. The breadth of impact across NLP, AI safety, and practical LLM deployment is substantial. Paper 1, while practically useful, addresses a more niche application in Raman spectroscopy with incremental methodological contributions (applying existing Noise2Noise to 1D spectra).
Paper 1 addresses a fundamental and practically important question about SFT effectiveness in LLMs with concrete empirical findings across multiple models and datasets. Its insights into interaction dynamics, denoising stages, and overfitting provide actionable guidance (e.g., early stopping) for the large LLM training community. Paper 2, while intellectually interesting, is more of a position/agenda paper proposing a framework ('Token Economics Trilemma') without concrete solutions or empirical validation. Paper 1's methodological rigor and direct applicability to widespread LLM training practices give it higher near-term scientific impact.
Paper 1 likely has higher impact due to a more novel, actionable method: an RL framework that learns prompting policies for black-box LLMs via iterative distillation with a critique-augmented experience buffer, showing large empirical gains on established multi-step reasoning/tool-use benchmarks and improved sample efficiency vs baselines. Its real-world applicability is high (works with frozen proprietary models) and spans tool-use, reasoning, and prompt optimization. Paper 2 offers valuable mechanistic insight into SFT dynamics and early stopping, but is more explanatory/diagnostic and may yield narrower immediate performance improvements.
AutoRubric-T2I introduces a novel framework for automatic rubric learning in T2I alignment that is both practical and impactful. It addresses key limitations of existing reward models (cost, opacity, adaptability) with a principled approach requiring only 0.01% of annotated data, demonstrating strong results across multiple benchmarks. Paper 2 provides interesting theoretical insights into SFT dynamics via interaction-based explanations, but its contributions are more analytical/explanatory rather than introducing a new actionable methodology. Paper 1's broader applicability to the rapidly growing T2I generation field and its practical utility for reward model training give it higher potential impact.
Paper 2 likely has higher impact due to strong novelty and timeliness: it identifies a concrete, underexplored safety failure mode in memory-augmented LLM agents (memory laundering/state contamination) that directly affects real deployments. It proposes a new metric (SPG), uses counterfactual multi-agent rollouts for causal-style comparisons, and yields actionable guidance on where to place mitigations in agent pipelines. Its implications span safety, agent architectures, monitoring, and governance. Paper 1 offers useful training insight on SFT dynamics, but is narrower in application and closer to incremental interpretability/training diagnostics.
Paper 1 addresses a practically important and timely question about SFT in LLMs, which is central to current AI research and deployment. Its findings on interaction dynamics during fine-tuning offer actionable insights (e.g., early stopping guidance) with broad applicability across the LLM community. The work is empirically validated across multiple models and datasets. Paper 2, while intellectually interesting in bridging phenomenology and AI, addresses a niche topic (artificial subjectivity via embodied cognition) with limited immediate practical impact, tested only in a simple reward-free gridworld, and appeals to a much narrower audience.
Paper 1 has higher likely scientific impact due to broader relevance and timeliness: it addresses a central, widely-used procedure in LLM development (SFT) and offers an interaction-based explanation for inconsistent outcomes, with actionable guidance (early stopping) applicable across many models and tasks. Its novelty lies in using interaction evolution to reconcile contradictory empirical findings, potentially influencing training protocols and interpretability research across NLP/ML. Paper 2 is methodologically solid and useful for a high-value clinical application, but its impact is narrower (sleep staging) and more domain-specific.
Paper 1 likely has higher impact due to a more novel, actionable framework (agentic, self-evolving reward modeling via context/tool evolution rather than training), strong timeliness for multimodal RLHF/RLAIF, and clear real-world applicability to instruction-guided image editing evaluation and RL fine-tuning with large annotation savings. It also suggests a generally reusable paradigm for data-efficient reward design across tasks. Paper 2 offers valuable interpretability-driven insight into SFT dynamics and practical early-stopping guidance, but is more incremental and primarily diagnostic rather than enabling a new capability.
Paper 2 addresses a fundamental question about SFT in LLMs that affects the entire deep learning community, providing theoretical insights into why SFT behaves differently at scale. Its findings about interaction dynamics, denoising stages, and overfitting have broad implications for LLM training practices, early stopping, and fine-tuning strategies across many domains. Paper 1, while technically solid and practically useful, addresses a narrower industrial application (music search at Amazon) with domain-specific optimizations that have limited generalizability beyond information retrieval for media content.
Paper 2 addresses a fundamental and widely debated theoretical question in AI regarding the dynamics of Supervised Fine-Tuning (SFT) in Large Language Models. By providing a novel interaction-based explanation for SFT's effectiveness, its findings have broad implications across all domains utilizing LLMs, offering practical guidance for early stopping and training. In contrast, while Paper 1 presents a highly valuable, interpretable framework for clinical ECG classification, its impact is largely confined to the specific domain of medical AI and cardiology.
Paper 2 addresses a fundamental question regarding Supervised Fine-Tuning (SFT) in LLMs, offering broad implications for model training across the AI field. While Paper 1 provides a valuable practical framework for scientific reproducibility, Paper 2's insights into interaction dynamics and early stopping will likely influence a wider range of foundational AI research and development.
Paper 2 is likely higher impact due to its concrete, open-sourced benchmark addressing a pressing, widely relevant problem (hallucination detection in RAG with long contexts) and introducing realistic label-noise stress tests. Benchmarks often become community standards, enabling broad, reproducible comparisons and accelerating progress across many methods and applications. Its desiderata framework plus empirical findings (performance ceilings, LLM-judge competitiveness, noise sensitivity) are timely and actionable. Paper 1 offers valuable mechanistic insight into SFT dynamics, but its impact may be narrower and more dependent on adoption of the interaction analysis methodology.
Paper 2 likely has higher impact due to timeliness and broad relevance: it addresses a central, widely encountered issue in LLM training (when/why SFT helps or hurts) and proposes a mechanistic, interaction-based explanation validated across multiple models and datasets, yielding actionable guidance (early stopping) applicable across NLP and ML. Paper 1 is innovative and useful for nanomedicine discovery support, but its impact is more domain-specific and relies on complex LLM-based workflows with modest human agreement and limited benchmark scale, which may constrain general adoption.