Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models
Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng, Tong Xu, Yi Zheng, Zhefeng Wang, Enhong Chen
Abstract
Supervised fine-tuning (SFT) followed by reinforcement learning (RL) has become a standard post-training paradigm for large language models. This paradigm provides a cold start for RL exploration, avoiding the inefficiency of pure RL, where on-policy sampling yields too few positive samples. In practice, however, existing approaches often use far less data for SFT initialization than for the RL phase, which can cause the model to overfit the limited samples and shift away from its pre-trained distribution. This distribution shift impedes the model's ability to explore effectively during subsequent RL training. To address this challenge, we argue that in low-data regimes, SFT should prioritize activating task-relevant capabilities rather than memorizing specific content. Building on this, we propose EKSFT (Entropy-KL Selective Fine-Tuning), which selectively masks tokens that exhibit either high entropy or high KL divergence from a reference model. By excluding these high-uncertainty, distribution-shifting tokens from imitation, EKSFT injects task-specific knowledge while preserving the integrity of the model's pre-trained distribution. Empirical evaluations on mathematical reasoning benchmarks demonstrate that EKSFT consistently outperforms standard SFT. Further RL fine-tuning from the EKSFT model yields consistently better post-RL performance, indicating improved exploration in the RL stage. Our code and datasets are available at https://github.com/MINE-USTC/EKSFT.
AI Impact Assessments
Scientific Impact Assessment: EKSFT (Entropy-KL Divergence-based Token Masking for Selective Fine-tuning)
1. Core Contribution
EKSFT addresses a recognized problem in the SFT-then-RL pipeline: standard SFT on limited data causes distribution sharpening and parameter drift, which impairs exploration during subsequent RL training. The key idea is to exclude from the cross-entropy loss those tokens that exhibit high entropy (model uncertainty) or high KL divergence (distributional drift from the reference model), while applying entropy and KL regularization to the masked tokens. The conceptual framing, that SFT should "activate" task-relevant capabilities rather than "memorize" specific content, is intuitive and well-articulated, though not entirely novel as a perspective (the authors cite Chu et al., 2025 and Xie et al., 2024, which express similar ideas).
The method is simple to implement: compute per-token entropy and KL divergence, select the top-ρ fraction for masking via the union of both sets, apply standard CE to the unmasked tokens, and apply entropy/KL regularization to the masked tokens. This simplicity is a practical strength.
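The steps listed above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the function names, taking the top-ρ fraction separately under each criterion with `math.ceil` rounding, the KL direction (policy relative to reference), and the sign and weights (`lam_ent`, `lam_kl`) of the regularizer on masked tokens.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token position.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    # Shannon entropy of a probability vector, in nats.
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl_divergence(p, q):
    # KL(p || q); the direction (policy vs. reference) is an assumption here.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def top_fraction(scores, rho):
    # Indices of the ceil(rho * n) largest scores (rounding rule is assumed).
    k = math.ceil(rho * len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def selective_sft_loss(policy_logits, ref_logits, targets,
                       rho=0.25, lam_ent=0.01, lam_kl=0.01):
    probs = [softmax(l) for l in policy_logits]
    ref_probs = [softmax(l) for l in ref_logits]
    ent = [entropy(p) for p in probs]
    kl = [kl_divergence(p, q) for p, q in zip(probs, ref_probs)]

    # Mask the union of the high-entropy set and the high-KL set.
    masked = top_fraction(ent, rho) | top_fraction(kl, rho)
    kept = [t for t in range(len(targets)) if t not in masked]

    # Standard cross-entropy on the unmasked tokens only.
    ce = sum(-math.log(probs[t][targets[t]]) for t in kept) / max(len(kept), 1)

    # Hypothetical regularizer on masked tokens: penalize KL from the
    # reference and reward entropy. The exact form in the paper may differ.
    reg = sum(lam_kl * kl[t] - lam_ent * ent[t] for t in masked) / max(len(masked), 1)
    return ce + reg, masked

# Toy input: 4 token positions, vocabulary of size 3.
policy_logits = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
ref_logits    = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0], [4.0, 0.0, 0.0]]
targets = [0, 1, 0, 0]
loss, masked = selective_sft_loss(policy_logits, ref_logits, targets)
```

In this toy input, position 1 (a uniform policy distribution) has the highest entropy and position 2 (where policy and reference disagree) has the highest KL divergence, so both are masked and cross-entropy is computed only on positions 0 and 3. A real training loop would compute the same quantities on batched tensors with automatic differentiation.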
2. Methodological Rigor
Strengths:
Weaknesses:
3. Potential Impact
The paper targets a practically important problem: how to best initialize models for RL in low-data SFT regimes. This is directly relevant to the growing ecosystem of SFT-then-RLVR training for reasoning models. The method is lightweight (no additional models needed beyond the reference, which is just the base model) and can be integrated into standard training pipelines.
However, the impact is somewhat limited by:
4. Timeliness & Relevance
The paper is well-timed. The SFT-then-RL paradigm has become standard practice (DeepSeek-R1, Qwen3, etc.), and optimizing the SFT stage for better RL exploration is an active area. The observation that SFT narrows output distributions is well-documented (Figure 1 is compelling), and practical solutions are in demand. The paper engages with very recent concurrent work (many 2025 citations), placing it firmly in the current discourse.
5. Key Strengths & Limitations
Strengths:
Limitations:
Overall Assessment
EKSFT is a well-executed incremental contribution to the SFT-then-RL optimization literature. The token-level selective masking approach is intuitive, practically simple, and consistently (if modestly) beneficial. The paper is clearly written with good experimental methodology, though limited in scope. It addresses a timely problem but operates in a crowded space where differentiation is challenging. The contribution is primarily empirical-engineering rather than conceptually transformative.
Generated May 29, 2026
Comparison History (16)
Paper 1 addresses a fundamental bottleneck in the standard SFT-RL pipeline for LLMs (distribution shift in low-data regimes). By proposing a principled, entropy-KL based token masking approach, it improves core training efficiency and reasoning capabilities. While Paper 2 offers a valuable system-level architecture for LLM memory, Paper 1's contribution to foundational model alignment and training paradigms gives it broader potential impact across the machine learning community.
Paper 2 likely has higher impact due to strong real-world applicability and timeliness: KV-cache memory is a key deployment bottleneck for long-context generation, and decode-time compression directly improves serving efficiency with minimal quality loss. The momentum-based temporal attention aggregation is a broadly applicable systems/algorithm idea that could transfer across models and inference stacks, affecting many downstream applications. Paper 1 is novel for post-training stability in low-data SFT→RL, but its impact is narrower (mainly RLHF/RLAIF pipelines) and more benchmark-dependent.
Paper 2 has higher estimated impact due to its broader and timely problem—robustness to semantically neutral prompt variations—affecting many LLM applications and evaluation settings. It offers a theoretically motivated, lightweight fine-tuning method (debiasing) with stated conditions for success/failure and claims of certification, suggesting stronger methodological rigor and generality. Paper 1 is a useful, novel training tweak for low-data SFT→RL pipelines and shows gains on math reasoning, but its scope is narrower (specific masking heuristic, primarily post-training/RL initialization) and likely less cross-domain than robustness improvements.
Paper 2 addresses a fundamental challenge in the dominant LLM post-training paradigm (SFT+RL), proposing a principled, broadly applicable token-masking strategy grounded in information-theoretic measures. Its impact spans the entire LLM training community, as distribution shift during SFT is a widely encountered problem. The method is simple, general, and directly improves downstream RL performance. Paper 1, while technically solid, addresses a narrower safety-steering problem specific to diffusion transformers, with a more complex architecture-dependent solution and more limited applicability beyond T2I safety.
Paper 2 demonstrates a profound interdisciplinary impact by bridging AI and geoscience to solve a massive data-silo problem. By producing the largest integrated marine lead database, it delivers an immediate, tangible resource for oceanographic and climate research. This concrete scientific artifact and the scalable 'AI for Science' methodology offer broader real-world applications and scientific value than the incremental LLM fine-tuning optimization proposed in Paper 1.
Paper 1 addresses a fundamental challenge in the dominant LLM training paradigm (SFT+RL), proposing a principled token-level masking strategy that preserves pre-trained distributions during fine-tuning. This has broad applicability across all LLM post-training scenarios, particularly in low-data regimes. The theoretical grounding in information theory (entropy, KL divergence) and demonstrated improvements in downstream RL performance make it highly impactful. Paper 2 addresses multi-agent security, which is important but more niche, and the cooperative attack/defense framework, while novel, targets a narrower application domain with less fundamental methodological contribution.
Paper 2 addresses a fundamental challenge in LLM post-training (distribution shift during SFT before RL) with a novel, principled method (entropy-KL token masking). It offers a concrete algorithmic contribution with theoretical motivation, empirical validation on mathematical reasoning benchmarks, and open-source code. Its impact spans the broad and active research area of LLM alignment and training pipelines. Paper 1, while timely, is primarily observational—documenting tone sensitivity in LLMs without proposing a solution—and its contributions are more incremental and application-specific.
CIVIC addresses a critical and broadly relevant bottleneck in Vision-Language Models—efficient inference with high-resolution visual tokens—offering genuine wall-clock speedups and memory savings, which has immediate practical impact for deploying VLMs at scale. Its end-to-end framework spanning vision encoder to KV-cache is architecturally novel and applicable across many VLM applications. While Paper 1 (EKSFT) presents a useful token masking strategy for SFT in low-data regimes, it targets a narrower problem (SFT initialization for RL) with incremental improvements on math reasoning benchmarks. Paper 2's broader applicability and efficiency gains give it higher impact potential.
Paper 2 addresses a fundamental challenge in the widely used SFT+RL paradigm for Large Language Models. By mitigating distribution shift during fine-tuning, it offers significant methodological improvements for LLM training. This fundamental contribution to AI foundation models gives it a much wider potential impact across numerous fields and applications compared to Paper 1, which focuses on the specific, albeit important, applied domain of traffic signal control.
Paper 1 introduces a more fundamental and broadly applicable concept (Data-Model Compatibility metric) for reasoning distillation that addresses a core challenge across multiple models and tasks. Its dynamic curriculum approach is novel and demonstrates consistent improvements. Paper 2, while addressing a valid concern about distribution shift during SFT, tackles a narrower problem (low-data SFT as RL initialization) with a more incremental contribution (token masking via entropy/KL). Paper 1's DMC metric has broader potential adoption across the distillation community and offers deeper theoretical insights into data-model alignment.
Paper 1 likely has higher impact: it demonstrates a large-scale, tool-integrated multi-agent pipeline that produces a major new public artifact (a 45k-declaration Lean library) with broad applicability to formal verification, mathematical knowledge management, and trustworthy AI. The combination of methodological engineering (scheduling, version control, verification) and a substantial released corpus makes it a platform-level contribution with cross-field spillovers. Paper 2 is a neat, timely fine-tuning technique with practical value, but it is narrower in scope and more incremental within post-training methods.
Paper 2 has higher estimated impact due to stronger novelty and broader relevance: it introduces a new preference-optimization formulation (RC-DPO) that explicitly leverages chain-of-thought conditioning, plus a scalable data generation pipeline (MCTS positives, attention-guided negatives) targeting a widely recognized failure mode—multimodal hallucination. This is timely and applicable across many vision-language reasoning systems and safety/reliability settings. Paper 1 is a clever, likely useful SFT regularization technique, but its scope is narrower (mainly low-data SFT→RL pipelines) and may offer more incremental gains.
Paper 2 presents a concrete, novel technical contribution (EKSFT) with empirical validation, code availability, and broad applicability to the widely-studied LLM post-training pipeline. It addresses a fundamental challenge in the SFT-then-RL paradigm—distribution shift in low-data regimes—with a principled information-theoretic approach. Paper 1, while addressing an important topic in AI for education, is primarily a conceptual/architectural proposal without empirical validation, limiting its immediate scientific impact. Paper 2's method is more likely to be adopted and cited across the LLM research community.
Paper 1 addresses a fundamental challenge in the universally adopted LLM post-training pipeline (SFT followed by RL). By proposing a novel token-masking approach to prevent distribution shift, it directly improves model reasoning and RL exploration. Paper 2, while offering valuable methodological rigor for RAG evaluation, has a narrower scope compared to fundamentally improving core LLM training dynamics and alignment.
Paper 2 introduces a clearly novel, generalizable algorithmic idea (entropy/KL-based token masking for selective SFT) that targets a widely relevant and timely problem—distribution shift and low-data post-training—showing consistent gains and providing code/data for reproducibility. Its methodological contribution can be adopted across tasks and model families, potentially impacting RLHF/post-training practice broadly. Paper 1 is valuable engineering and an open model release, but its scientific novelty is more incremental (system integration + competitive scaling) and its impact is more tied to specific released weights and benchmarks.
Paper 1 is more novel and broadly impactful: it introduces an inductive paradigm for domain-specific data synthesis from reference examples (no explicit domain description), with a concrete framework (DOMINO) combining minimal sufficient representation learning, contrastive disentanglement, and prompt tuning, plus theoretical support-expansion guarantees and demonstrated gains on implicit-domain coding tasks. This addresses a central bottleneck (domain data acquisition) with clear real-world applicability across many domains. Paper 2 is useful but more incremental (a masking heuristic for SFT stability in low-data regimes) and narrower in scope (mainly SFT-to-RL pipelines on math benchmarks).