Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models

Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng, Tong Xu, Yi Zheng, Zhefeng Wang, Enhong Chen

Rank: #1553 of 2821 · Artificial Intelligence
Tournament score: 1397 ± 45
Win rate: 50% (8 wins, 8 losses, 16 matches)
Rating: 5.5 / 10

Abstract

Supervised fine-tuning (SFT) followed by reinforcement learning (RL) has become a standard post-training paradigm for large language models. This paradigm provides a cold-start for RL exploration, avoiding the inefficiency of pure RL where on-policy sampling yields insufficient positive samples. However, in practice, existing approaches often use a small amount of data for SFT initialization compared to the RL phase, which can cause the model to fit the limited samples and shift away from its pre-trained distribution. This distribution shift impedes the model's ability to effectively explore during subsequent RL training. To address this challenge, we propose that in low-data regimes, SFT should prioritize activating task-relevant capabilities rather than memorizing specific content. Along this line, we propose EKSFT (Entropy-KL Selective Fine-Tuning), which selectively masks tokens that exhibit either high entropy or high KL divergence from a reference model. By excluding these high-uncertainty, distribution-shifting tokens from imitation, EKSFT injects task-specific knowledge while preserving the integrity of the model's pre-trained distribution. Empirical evaluations on mathematical reasoning benchmarks demonstrate that EKSFT consistently outperforms standard SFT. Further RL fine-tuning from the EKSFT model yields consistently better post-RL performance, indicating improved exploration for the RL stage. Our codes and datasets are available at https://github.com/MINE-USTC/EKSFT.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: EKSFT — Entropy-KL Divergence-based Token Masking for Selective Fine-tuning

1. Core Contribution

EKSFT addresses a recognized problem in the SFT-then-RL pipeline: standard SFT on limited data causes distribution sharpening and parameter drift, which impairs exploration during subsequent RL training. The key idea is to selectively mask tokens exhibiting high entropy (model uncertainty) or high KL divergence (distributional drift from the reference model) from the cross-entropy loss, while applying entropy and KL regularization on those masked tokens. The conceptual framing—SFT should "activate" task-relevant capabilities rather than "memorize" specific content—is intuitive and well-articulated, though not entirely novel as a perspective (the authors cite Chu et al., 2025 and Xie et al., 2024 who express similar ideas).

The method is simple to implement: compute per-token entropy and KL divergence, take the top-ρ fraction of tokens under each score, mask the union of the two sets, apply standard CE on the unmasked tokens, and apply entropy/KL regularization on the masked tokens. This simplicity is a practical strength.
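
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the KL direction (policy relative to reference), the per-sequence top-ρ selection, and the function names (`selective_mask`, `masked_cross_entropy`) are hypothetical choices made for this example. The regularization terms on masked tokens are left out because the review does not give their exact form.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one vocabulary row.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    # Shannon entropy (nats) of one next-token distribution.
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q); the direction is an assumption of this sketch.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def top_fraction(scores, rho):
    # Indices of the ceil(rho * T) highest-scoring positions.
    k = math.ceil(rho * len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def selective_mask(policy_logits, ref_logits, rho=0.2):
    # Mask the union of the high-entropy and high-KL positions.
    policy = [softmax(row) for row in policy_logits]
    ref = [softmax(row) for row in ref_logits]
    ent = [entropy(p) for p in policy]
    kl = [kl_divergence(p, q) for p, q in zip(policy, ref)]
    return top_fraction(ent, rho) | top_fraction(kl, rho)

def masked_cross_entropy(policy_logits, targets, masked):
    # Mean cross-entropy over the positions that were NOT masked.
    kept = [t for t in range(len(targets)) if t not in masked]
    if not kept:
        return 0.0
    losses = [-math.log(softmax(policy_logits[t])[targets[t]]) for t in kept]
    return sum(losses) / len(losses)

# Toy sequence: 5 positions, vocabulary of 3 tokens (hypothetical values).
policy_logits = [[5, 0, 0], [0, 0, 0], [4, 0, 0], [0, 5, 0], [3, 0, 0]]
ref_logits    = [[5, 0, 0], [0, 0, 0], [4, 0, 0], [5, 0, 0], [3, 0, 0]]
targets = [0, 0, 0, 1, 0]

masked = selective_mask(policy_logits, ref_logits, rho=0.2)
loss = masked_cross_entropy(policy_logits, targets, masked)
print(sorted(masked), round(loss, 4))
```

In this toy sequence, position 1 has a uniform (maximum-entropy) distribution and position 3 is the only one that disagrees with the reference, so those two positions are excluded from the CE loss.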

2. Methodological Rigor

Strengths:

  • The experimental setup is relatively controlled: all methods use the same 3k SFT / 43k RL data split from OpenR1-Math-46k, same hyperparameters, and two model scales (Qwen3-4B, 8B).
  • The paper reports both pass@1 (performance) and pass@32 (exploration/diversity), which is appropriate given the stated goals.
  • Ablation studies systematically remove components (entropy regularization, KL regularization, masking mechanism) and include informative baselines (random masking, global regularization).
  • The IoU analysis between entropy and KL token sets (avg ~0.50) demonstrates the two signals are complementary, justifying the union strategy.
  • The parameter drift analysis (Appendix C) provides supporting evidence for the "effective drift" narrative.
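
For reference, the IoU figure cited in the strengths above is plain set overlap. The sketch below uses made-up token indices, and the names `entropy_top` and `kl_top` are hypothetical; it only shows how an IoU of 0.5 arises and what it implies for the size of the union.

```python
def iou(a, b):
    # Intersection over union of two index sets; 0.0 if both are empty.
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical top-rho token positions selected by each signal.
entropy_top = {1, 4, 7}
kl_top = {4, 7, 12}

overlap = iou(entropy_top, kl_top)       # 2 shared positions / 4 distinct positions
union_size = len(entropy_top | kl_top)   # positions that would be masked
print(overlap, union_size)
```

One consequence of taking the union is that the masked fraction lies between ρ and 2ρ: with an IoU near 0.5, each signal contributes tokens that the other misses.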

Weaknesses:

  • The evaluation is limited to a single training dataset (OpenR1-Math-46k-8192) and predominantly mathematical reasoning benchmarks (AIME24, AIME25, AMC, HMMT25). The tool-use experiment (Appendix I) is a welcome addition but limited in scope.
  • Only two model scales (4B, 8B) from one model family (Qwen3) are tested. Generalizability to other architectures (LLaMA, Mistral) or larger scales (32B+) is unknown.
  • The improvements, while consistent, are often modest. For Qwen3-4B Stage 1, the average pass@1 improvement over SFT is +0.7%, which is within typical variance for these benchmarks. The pass@32 improvements are more convincing (+5.1%).
  • The theoretical analysis in Appendix E, while providing useful intuition about gradient norms, is more of a heuristic argument than a formal proof. The claim that high-entropy tokens "dominate" gradients is well-known and not uniquely insightful.
  • Variance reporting is absent. AIME-style benchmarks have only 30 problems, so a single problem moves pass@32 by about 3.3 percentage points. Many reported differences amount to one or two problems.
  • The sensitivity analysis (Appendix G) shows ρ=0.2 works well but only tests one model size; the optimal ρ may vary across models and datasets.
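
The granularity concern in the weaknesses above can be made concrete with a short calculation. The sketch assumes the widely used unbiased pass@k estimator and exactly 32 samples per problem; the review does not state which estimator or sample count the paper uses, and the solved-problem counts below are invented for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased pass@k for one problem: n samples drawn, c of them correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_score(correct_counts, n, k):
    # Benchmark-level pass@k in percent, averaged over problems.
    per_problem = [pass_at_k(n, c, k) for c in correct_counts]
    return 100.0 * sum(per_problem) / len(per_problem)

# With n = k = 32, each problem contributes either 0 or 1.
# Hypothetical 30-problem benchmark: 12 solved vs. 13 solved.
score_a = benchmark_score([1] * 12 + [0] * 18, n=32, k=32)
score_b = benchmark_score([1] * 13 + [0] * 17, n=32, k=32)
delta = score_b - score_a
print(round(score_a, 2), round(score_b, 2), round(delta, 2))
```

Under these assumptions, solving one additional problem shifts the benchmark score by 100/30, or about 3.3 points, which is the scale against which the reported gains should be read.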

3. Potential Impact

The paper targets a practically important problem: how to best initialize models for RL in low-data SFT regimes. This is directly relevant to the growing ecosystem of SFT-then-RLVR training for reasoning models. The method is lightweight (no additional models needed beyond the reference, which is just the base model) and can be integrated into standard training pipelines.

However, several factors limit the impact:

  • The method is specifically designed for the "cold-start" SFT scenario with limited data. As training data scales increase, the benefits may diminish.
  • The competitive landscape is crowded: PSFT, IW-SFT, DFT, CHORD, BRIDGE, AMFT, and others all address similar problems with different mechanisms. EKSFT's advantages over IW-SFT, its strongest competitor, are relatively small.
  • The method adds computational overhead (computing full-vocabulary entropy and KL for every token at every step), though this is likely minor relative to the overall training cost.

4. Timeliness & Relevance

The paper is well-timed. The SFT-then-RL paradigm has become standard practice (DeepSeek-R1, Qwen3, etc.), and optimizing the SFT stage for better RL exploration is an active area. The observation that SFT narrows output distributions is well-documented (Figure 1 is compelling), and practical solutions are in demand. The paper engages with very recent concurrent work (many 2025 citations), placing it firmly in the current discourse.

5. Key Strengths & Limitations

Strengths:

  • Clean, well-motivated approach with intuitive design choices
  • Comprehensive comparison against multiple recent baselines
  • Consistent improvements across both SFT and post-RL stages
  • Thorough ablations and supporting analyses (IoU, parameter drift, random masking, global regularization)
  • Code availability

Limitations:

  • Narrow experimental scope (one dataset, one model family, two scales, primarily math)
  • Modest absolute improvements that may not be statistically significant given benchmark granularity
  • The theoretical analysis provides intuition but lacks formal guarantees
  • The method introduces three hyperparameters (ρ, λ_H, λ_KL) with limited sensitivity analysis
  • No wall-clock time or computational overhead analysis
  • The "activation vs. memorization" framing, while appealing, is not rigorously operationalized beyond the masking heuristic

Overall Assessment

EKSFT is a well-executed incremental contribution to the SFT-then-RL optimization literature. The token-level selective masking approach is intuitive, practically simple, and consistently (if modestly) beneficial. The paper is clearly written with good experimental methodology, though limited in scope. It addresses a timely problem but operates in a crowded space where differentiation is challenging. The contribution is primarily empirical-engineering rather than conceptually transformative.

Rating: 5.5 / 10
Significance 5.5 · Rigor 5.5 · Novelty 5 · Clarity 7

Generated May 29, 2026

Comparison History (16)

    vs. VikingMem: A Memory Base Management System for Stateful LLM-based Applications
    gemini-3.1 · 5/29/2026

    Paper 1 addresses a fundamental bottleneck in the standard SFT-RL pipeline for LLMs (distribution shift in low-data regimes). By proposing a principled, entropy-KL based token masking approach, it improves core training efficiency and reasoning capabilities. While Paper 2 offers a valuable system-level architecture for LLM memory, Paper 1's contribution to foundational model alignment and training paradigms gives it broader potential impact across the machine learning community.

    vs. Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation
    gpt-5.2 · 5/29/2026

    Paper 2 likely has higher impact due to strong real-world applicability and timeliness: KV-cache memory is a key deployment bottleneck for long-context generation, and decode-time compression directly improves serving efficiency with minimal quality loss. The momentum-based temporal attention aggregation is a broadly applicable systems/algorithm idea that could transfer across models and inference stacks, affecting many downstream applications. Paper 1 is novel for post-training stability in low-data SFT→RL, but its impact is narrower (mainly RLHF/RLAIF pipelines) and more benchmark-dependent.

    vs. Harnessing non-adversarial robustness in large language models
    gpt-5.2 · 5/29/2026

    Paper 2 has higher estimated impact due to its broader and timely problem—robustness to semantically neutral prompt variations—affecting many LLM applications and evaluation settings. It offers a theoretically motivated, lightweight fine-tuning method (debiasing) with stated conditions for success/failure and claims of certification, suggesting stronger methodological rigor and generality. Paper 1 is a useful, novel training tweak for low-data SFT→RL pipelines and shows gains on math reasoning, but its scope is narrower (specific masking heuristic, primarily post-training/RL initialization) and likely less cross-domain than robustness improvements.

    vs. Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers
    claude-opus-4.6 · 5/29/2026

    Paper 2 addresses a fundamental challenge in the dominant LLM post-training paradigm (SFT+RL), proposing a principled, broadly applicable token-masking strategy grounded in information-theoretic measures. Its impact spans the entire LLM training community, as distribution shift during SFT is a widely encountered problem. The method is simple, general, and directly improves downstream RL performance. Paper 1, while technically solid, addresses a narrower safety-steering problem specific to diffusion transformers, with a more complex architecture-dependent solution and more limited applicability beyond T2I safety.

    vs. Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent
    gemini-3.1 · 5/29/2026

    Paper 2 demonstrates a profound interdisciplinary impact by bridging AI and geoscience to solve a massive data-silo problem. By producing the largest integrated marine lead database, it delivers an immediate, tangible resource for oceanographic and climate research. This concrete scientific artifact and the scalable 'AI for Science' methodology offer broader real-world applications and scientific value than the incremental LLM fine-tuning optimization proposed in Paper 1.

    vs. Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification
    claude-opus-4.6 · 5/29/2026

    Paper 1 addresses a fundamental challenge in the dominant LLM training paradigm (SFT+RL), proposing a principled token-level masking strategy that preserves pre-trained distributions during fine-tuning. This has broad applicability across all LLM post-training scenarios, particularly in low-data regimes. The theoretical grounding in information theory (entropy, KL divergence) and demonstrated improvements in downstream RL performance make it highly impactful. Paper 2 addresses multi-agent security, which is important but more niche, and the cooperative attack/defense framework, while novel, targets a narrower application domain with less fundamental methodological contribution.

    vs. Mind Your Tone: Does Tone Alter LLM Performance?
    claude-opus-4.6 · 5/29/2026

    Paper 2 addresses a fundamental challenge in LLM post-training (distribution shift during SFT before RL) with a novel, principled method (entropy-KL token masking). It offers a concrete algorithmic contribution with theoretical motivation, empirical validation on mathematical reasoning benchmarks, and open-source code. Its impact spans the broad and active research area of LLM alignment and training pipelines. Paper 1, while timely, is primarily observational—documenting tone sensitivity in LLMs without proposing a solution—and its contributions are more incremental and application-specific.

    vs. CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models
    claude-opus-4.6 · 5/29/2026

    CIVIC addresses a critical and broadly relevant bottleneck in Vision-Language Models—efficient inference with high-resolution visual tokens—offering genuine wall-clock speedups and memory savings, which has immediate practical impact for deploying VLMs at scale. Its end-to-end framework spanning vision encoder to KV-cache is architecturally novel and applicable across many VLM applications. While Paper 1 (EKSFT) presents a useful token masking strategy for SFT in low-data regimes, it targets a narrower problem (SFT initialization for RL) with incremental improvements on math reasoning benchmarks. Paper 2's broader applicability and efficiency gains give it higher impact potential.

    vs. ReasonLight: A Multimodal Foundation Model-Enhanced Reinforcement Learning Framework for Zero-Shot Traffic Signal Control
    gemini-3.1 · 5/29/2026

    Paper 2 addresses a fundamental challenge in the widely used SFT+RL paradigm for Large Language Models. By mitigating distribution shift during fine-tuning, it offers significant methodological improvements for LLM training. This fundamental contribution to AI foundation models gives it a much wider potential impact across numerous fields and applications compared to Paper 1, which focuses on the specific, albeit important, applied domain of traffic signal control.

    vs. Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility
    claude-opus-4.6 · 5/29/2026

    Paper 1 introduces a more fundamental and broadly applicable concept (Data-Model Compatibility metric) for reasoning distillation that addresses a core challenge across multiple models and tasks. Its dynamic curriculum approach is novel and demonstrates consistent improvements. Paper 2, while addressing a valid concern about distribution shift during SFT, tackles a narrower problem (low-data SFT as RL initialization) with a more incremental contribution (token masking via entropy/KL). Paper 1's DMC metric has broader potential adoption across the distillation community and offers deeper theoretical insights into data-model alignment.

    vs. Formalizing Mathematics at Scale
    gpt-5.2 · 5/29/2026

    Paper 1 likely has higher impact: it demonstrates a large-scale, tool-integrated multi-agent pipeline that produces a major new public artifact (a 45k-declaration Lean library) with broad applicability to formal verification, mathematical knowledge management, and trustworthy AI. The combination of methodological engineering (scheduling, version control, verification) and a substantial released corpus makes it a platform-level contribution with cross-field spillovers. Paper 2 is a neat, timely fine-tuning technique with practical value, but it is narrower in scope and more incremental within post-training methods.

    vs. Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization
    gpt-5.2 · 5/29/2026

    Paper 2 has higher estimated impact due to stronger novelty and broader relevance: it introduces a new preference-optimization formulation (RC-DPO) that explicitly leverages chain-of-thought conditioning, plus a scalable data generation pipeline (MCTS positives, attention-guided negatives) targeting a widely recognized failure mode—multimodal hallucination. This is timely and applicable across many vision-language reasoning systems and safety/reliability settings. Paper 1 is a clever, likely useful SFT regularization technique, but its scope is narrower (mainly low-data SFT→RL pipelines) and may offer more incremental gains.

    vs. Modularizing Educational LLM-Agency for Fostering Responsible Learning Assistance
    claude-opus-4.6 · 5/29/2026

    Paper 2 presents a concrete, novel technical contribution (EKSFT) with empirical validation, code availability, and broad applicability to the widely-studied LLM post-training pipeline. It addresses a fundamental challenge in the SFT-then-RL paradigm—distribution shift in low-data regimes—with a principled information-theoretic approach. Paper 1, while addressing an important topic in AI for education, is primarily a conceptual/architectural proposal without empirical validation, limiting its immediate scientific impact. Paper 2's method is more likely to be adopted and cited across the LLM research community.

    vs. A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test
    gemini-3.1 · 5/29/2026

    Paper 1 addresses a fundamental challenge in the universally adopted LLM post-training pipeline (SFT followed by RL). By proposing a novel token-masking approach to prevent distribution shift, it directly improves model reasoning and RL exploration. Paper 2, while offering valuable methodological rigor for RAG evaluation, has a narrower scope compared to fundamentally improving core LLM training dynamics and alignment.

    vs. Laguna M.1/XS.2 Technical Report
    gpt-5.2 · 5/29/2026

    Paper 2 introduces a clearly novel, generalizable algorithmic idea (entropy/KL-based token masking for selective SFT) that targets a widely relevant and timely problem—distribution shift and low-data post-training—showing consistent gains and providing code/data for reproducibility. Its methodological contribution can be adopted across tasks and model families, potentially impacting RLHF/post-training practice broadly. Paper 1 is valuable engineering and an open model release, but its scientific novelty is more incremental (system integration + competitive scaling) and its impact is more tied to specific released weights and benchmarks.

    vs. Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning
    gpt-5.2 · 5/29/2026

    Paper 1 is more novel and broadly impactful: it introduces an inductive paradigm for domain-specific data synthesis from reference examples (no explicit domain description), with a concrete framework (DOMINO) combining minimal sufficient representation learning, contrastive disentanglement, and prompt tuning, plus theoretical support-expansion guarantees and demonstrated gains on implicit-domain coding tasks. This addresses a central bottleneck (domain data acquisition) with clear real-world applicability across many domains. Paper 2 is useful but more incremental (a masking heuristic for SFT stability in low-data regimes) and narrower in scope (mainly SFT-to-RL pipelines on math benchmarks).