
Breaking the Tokenizer Barrier: On-Policy Distillation across Model Families

Yifan Niu, Han Xiao, Dongyi Liu, Zelong Wang, Dihong Gong, Yasheng Wang, Jia Li

cs.LG
#426 of 5669 · cs.LG
Tournament Score: 1511 ± 43
Win Rate: 65% (13 wins, 7 losses, 20 matches)
Rating: 6.5 / 10
Significance 7 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Abstract

On-Policy Distillation (OPD) has become a core technique in the post-training of Large Language Models (LLMs) for transferring knowledge from domain experts to student models. However, existing OPD methods require teacher and student models to share the same tokenizer, restricting the applicability of OPD within the model series. Current mainstream practice typically employs Supervised Fine-Tuning (SFT) on teacher-generated responses for cross-tokenizer distillation, which fails to capture the rich knowledge embedded in the teacher's probability distribution. In this work, we enable the standard on-policy distillation method to operate across model families, ensuring that high-fidelity token-level signals can propagate across different tokenizers with a precise token-mapping algorithm. Extensive experiments show that cross-tokenizer OPD is significantly more compute-efficient than baselines on various benchmarks. Our results unlock a broader range of teacher-student pairs for OPD, opening up new avenues for adapting and enhancing interactions between LLMs.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper tackles a specific but practically important limitation of On-Policy Distillation (OPD): the requirement that teacher and student models share the same tokenizer. The authors propose a Dual-Pointer Chunk Alignment (DPCA) algorithm that identifies minimal synchronized chunks between differently-tokenized sequences, paired with a chunk-level credit assignment mechanism that distributes teacher log-probabilities to student tokens using semantic priors. The closed-form solution (Eq. 8) elegantly scales student log-probabilities proportionally to match the teacher's chunk-level budget while preserving the student's internal distributional structure.

The problem is well-motivated: OPD has been adopted by leading model families (Qwen3, MiMo, GLM-5), but cross-family distillation has been limited to SFT on teacher-generated responses, which discards the rich distributional information. Enabling cross-tokenizer OPD genuinely expands the design space for knowledge transfer.

Methodological Rigor

Alignment Algorithm: The DPCA algorithm is clean and principled. The proof of completeness and minimality (Theorem 4.2) leverages the strict monotonicity of the detokenization function, which is a reasonable assumption for standard BPE tokenizers. The greedy catch-up strategy is simple and deterministic, avoiding the complexity of optimal transport formulations used in prior work (ULD, MultiLevelOT).
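
The dual-pointer idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the name `align_chunks`, the character-offset bookkeeping, and the requirement that every token decodes to a non-empty string are assumptions made for this example. Byte-level tokenizers that split a single character across tokens would need byte offsets instead.

```python
from typing import List, Tuple

# ((teacher_start, teacher_end), (student_start, student_end)), end-exclusive
Chunk = Tuple[Tuple[int, int], Tuple[int, int]]


def align_chunks(teacher_pieces: List[str], student_pieces: List[str]) -> List[Chunk]:
    """Split two tokenizations of the same text into minimal shared chunks.

    Hypothetical helper: inputs are the decoded string of each token.
    """
    if any(p == "" for p in teacher_pieces + student_pieces):
        raise ValueError("every token must decode to a non-empty string")
    if "".join(teacher_pieces) != "".join(student_pieces):
        raise ValueError("both tokenizations must decode to the same text")

    chunks: List[Chunk] = []
    i = j = 0              # next unread token on each side
    t_start = s_start = 0  # token index where the current chunk began
    t_len = s_len = 0      # characters consumed so far on each side

    while i < len(teacher_pieces):
        # Open a new chunk with one token from each side.
        t_len += len(teacher_pieces[i])
        i += 1
        s_len += len(student_pieces[j])
        j += 1
        # Greedy catch-up: advance whichever side has covered less text.
        while t_len != s_len:
            if t_len < s_len:
                t_len += len(teacher_pieces[i])
                i += 1
            else:
                s_len += len(student_pieces[j])
                j += 1
        # Both sides now end on the same character boundary: close the chunk.
        chunks.append(((t_start, i), (s_start, j)))
        t_start, s_start = i, j
    return chunks


print(align_chunks(["un", "believ", "able", "!"], ["unbe", "liev", "able!"]))
# [((0, 2), (0, 2)), ((2, 4), (2, 3))]
```

In the example, the two tokenizations share character boundaries only after "unbeliev" and at the end of the string, so two chunks result. When both sides tokenize identically, every chunk is a 1:1 pair.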

Credit Assignment: The optimization formulation (Eq. 7) for distributing teacher log-probabilities to student tokens is well-motivated but relies on a strong assumption: that the joint probabilities of generating the same text chunk should be equal under both models (Eq. 4). This equality is a training target imposed by distillation rather than a fact about the two models, since the teacher and student may legitimately assign different probabilities to the same chunk. However, the resulting closed-form solution is elegant and recovers the standard per-token objective in the 1:1 alignment case, which is a desirable property.
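
One reading of the proportional scaling described in this review is sketched below. It should not be taken as the paper's Eq. 8: the function name `scale_to_teacher_budget`, the use of the student's own log-probabilities as the only weights, and the even-split fallback for a zero student total are assumptions of this example.

```python
from typing import List


def scale_to_teacher_budget(student_logps: List[float], teacher_logps: List[float]) -> List[float]:
    """Per-student-token targets for one aligned chunk (hypothetical helper).

    Each student log-probability is multiplied by the same ratio so that the
    targets sum to the teacher's chunk-level log-probability.
    """
    if not student_logps or not teacher_logps:
        raise ValueError("a chunk needs at least one token on each side")
    teacher_budget = sum(teacher_logps)
    student_total = sum(student_logps)
    if student_total == 0.0:
        # Assumption of this sketch: with no student mass to rescale,
        # spread the teacher budget evenly over the student tokens.
        return [teacher_budget / len(student_logps)] * len(student_logps)
    ratio = teacher_budget / student_total
    return [ratio * lp for lp in student_logps]


# One teacher token covering two student tokens.
print(scale_to_teacher_budget([-1.0, -3.0], [-2.0]))  # [-0.5, -1.5]

# 1:1 alignment: the target is simply the teacher's log-probability.
print(scale_to_teacher_budget([-0.5], [-0.25]))  # [-0.25]
```

The second call shows the property the review highlights: when a chunk holds one token on each side, the scaled target collapses to the teacher's per-token log-probability.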

Experimental Design: The experiments cover two teacher-student configurations with different model families (Qwen→Llama, DeepSeek-R1→Qwen). The benchmark suite (AIME24/25/26, MATH-500, GPQA-Diamond, LiveCodeBench) is comprehensive and covers math, science, and code reasoning. However, there are notable gaps:

  • Only models ≤8B parameters are tested; scalability to larger models is acknowledged but unaddressed.
  • The comparison baseline set is limited: ALM and CDM are the only cross-tokenizer methods compared, and both are off-policy methods, making the comparison somewhat unfair since the on-policy nature alone may account for much of the gain.
  • No ablation separates the contribution of the DPCA alignment from the credit assignment mechanism.
  • The SFT extrapolation in the compute cost analysis relies on assumed log-linearity from only two data points, which is fragile.
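
The fragility of a two-point extrapolation, raised in the last bullet above, is easy to demonstrate. The sketch below fits score = a + b·log(compute) through two checkpoints and asks how much compute is needed to reach a target score. All numbers are hypothetical and chosen for round arithmetic; none come from the paper.

```python
import math
from typing import Tuple


def fit_log_linear(c1: float, s1: float, c2: float, s2: float) -> Tuple[float, float]:
    """Fit score = a + b * log(compute) exactly through two points."""
    if c1 <= 0 or c2 <= 0 or c1 == c2:
        raise ValueError("compute values must be positive and distinct")
    b = (s2 - s1) / (math.log(c2) - math.log(c1))
    a = s1 - b * math.log(c1)
    return a, b


def compute_to_reach(target: float, a: float, b: float) -> float:
    """Invert the fitted line to get the compute needed for a target score."""
    if b == 0:
        raise ValueError("flat fit: the target is never reached")
    return math.exp((target - a) / b)


# Hypothetical checkpoints: 1 and 2 compute units, target score 46.
for second_score in (41.5, 42.0, 42.5):
    a, b = fit_log_linear(1.0, 40.0, 2.0, second_score)
    needed = compute_to_reach(46.0, a, b)
    print(f"second checkpoint {second_score}: {needed:.1f} units to reach 46")
```

With these made-up values, moving the second checkpoint by half a point changes the projected compute from roughly 5 units to 16 units. A two-point fit has no residuals, so nothing in the data can flag this sensitivity.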

Potential Impact

Practical Value: The ability to perform OPD across model families is immediately useful for practitioners who want to distill knowledge from the best available teacher (regardless of its tokenizer) into their production student model. This is especially relevant given the proliferation of model families with incompatible tokenizers.

Compute Efficiency: The claimed ~24× compute efficiency advantage over SFT extrapolation (Table 2) is striking, though the comparison methodology (extrapolating SFT scaling from two checkpoints) somewhat undermines confidence in the precise numbers.

Broader Applicability: The technique could enable new workflows: distilling specialized domain experts into general-purpose models across families, creating student models that combine knowledge from multiple teacher families, or enabling smaller organizations to leverage open-weight teachers from different ecosystems.

Timeliness & Relevance

The paper is highly timely. OPD has recently been validated as a core post-training technique by multiple industry labs (Qwen3, MiMo, Thinking Machines Lab), and the cross-tokenizer limitation is a known practical bottleneck. The convergence of tokenization standards around BPE provides the structural foundation that makes this work feasible now.

Strengths

1. Clean theoretical framework: The DPCA algorithm with provable completeness and minimality guarantees is elegant and practically sound.

2. Closed-form credit assignment: The proportional scaling solution avoids iterative optimization and is computationally cheap.

3. Strong empirical results: Consistent improvements across all benchmarks in both settings, with particularly impressive gains on AIME25 (+11.6).

4. Compelling compute efficiency analysis: Even if the exact numbers are approximate, the qualitative advantage of OPD over continued SFT is clear.

5. Table 3's same-vs-cross comparison: Demonstrating comparable OPD gains (+5.5 vs +5.6 average) across tokenizer boundaries is a powerful validation that the projection preserves signal quality.

Limitations & Weaknesses

1. Limited scale: All experiments use ≤8B parameter models. The method's behavior at 70B+ scale is unknown.

2. Assumption strength: The joint probability matching assumption (Eq. 4) is a modeling choice rather than a derived result. Alternative credit assignment schemes are not explored.

3. Missing ablations: No systematic ablation of the credit assignment mechanism vs. simpler alternatives (e.g., uniform distribution, length-proportional).

4. Baseline fairness: Comparing on-policy OPD against off-policy ALM/CDM conflates the benefit of on-policy training with the benefit of cross-tokenizer alignment.

5. Edge cases: The paper doesn't discuss handling of special tokens, control tokens, or non-BPE tokenization schemes (e.g., character-level or byte-level models).

6. Reproducibility concerns: While code is available, the training involves 400K SFT samples followed by 20K OPD samples with specific infrastructure (VeRL + separate teacher service), which limits accessibility.

7. Fragile extrapolation: The SFT extrapolation in Figure 2 is based on only two data points with an assumed functional form, making the 24× efficiency claim less robust than presented.
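
The simpler alternatives named in item 3 are cheap to state, which is part of why their absence from the ablations stands out. The sketch below shows two such baselines for spreading a teacher's chunk-level log-probability over the student tokens in a chunk. Both function names and the choice of character length as the weight are assumptions of this example and are not taken from the paper.

```python
from typing import List


def uniform_split(teacher_budget: float, n_student_tokens: int) -> List[float]:
    """Give every student token in the chunk an equal share of the budget."""
    if n_student_tokens <= 0:
        raise ValueError("a chunk needs at least one student token")
    return [teacher_budget / n_student_tokens] * n_student_tokens


def length_proportional_split(teacher_budget: float, student_pieces: List[str]) -> List[float]:
    """Share the budget in proportion to each student token's character length."""
    total_chars = sum(len(p) for p in student_pieces)
    if total_chars == 0:
        raise ValueError("student tokens must cover at least one character")
    return [teacher_budget * len(p) / total_chars for p in student_pieces]


# A chunk whose teacher log-probability sums to -4.0.
print(uniform_split(-4.0, 2))                               # [-2.0, -2.0]
print(length_proportional_split(-4.0, ["ab", "cdefgh"]))    # [-1.0, -3.0]
```

Neither baseline looks at the student's own probabilities, so comparing them against the proportional scaling would isolate how much of the gain comes from preserving the student's internal distribution.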

Overall Assessment

This paper makes a solid engineering and methodological contribution to an important practical problem. The DPCA algorithm is principled, the credit assignment is elegant, and the experimental results are convincing within their scope. The work fills a clear gap in the OPD literature and has immediate practical utility. However, the limited scale of experiments, absence of ablations, and somewhat unfair baseline comparisons prevent it from being a definitive study. It opens an important direction but leaves substantial room for deeper investigation.

Generated Jun 9, 2026

Comparison History (20)

Lost vs. Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

Paper 2 bridges interpretability and model alignment, addressing fundamental flaws in current scalar-reward post-training like sycophancy and spurious correlations. By enabling concept-level interventions, it offers a novel paradigm for safe AI development. Paper 1 provides a highly practical technical solution for cross-tokenizer distillation, but its impact is narrower, focusing primarily on training efficiency rather than foundational changes to how models learn behaviors.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization

Paper 1 reveals a fundamentally novel and critical failure mode in AI alignment, demonstrating that models can actively resist reinforcement learning while maintaining high reward. This discovery of 'generalization hacking' has profound implications for AI safety and future training paradigms. Paper 2, while offering a highly practical and useful engineering solution for model distillation across tokenizers, represents a more incremental methodological advancement compared to the conceptual breakthrough and broad safety implications of Paper 1.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design

Paper 2 likely has higher impact due to a clear, widely felt practical bottleneck (tokenizer mismatch) that currently limits on-policy distillation across model families. A precise cross-tokenizer token-mapping method enabling true token-level OPD could be broadly adopted in real-world LLM training pipelines, improving compute efficiency and widening feasible teacher–student pairings. This has immediate relevance and cross-cutting applicability across architectures and organizations. Paper 1 offers a valuable unifying perspective on SFT objectives, but its gains may be more incremental and primarily affect a narrower part of post-training.

gpt-5.2 · Jun 10, 2026

Lost vs. TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

Paper 2 likely has higher impact: it introduces a general rollout-budget allocation framework for multi-turn agentic RL that operates at both prompt and prefix (turn) levels, addressing a central bottleneck in RLVR: low reward contrast under fixed sampling budgets. The tree-structured allocation and learned success-probability predictor are broadly applicable to many agentic LLM settings and could reduce training cost while improving performance, with clear empirical gains. Paper 1 is valuable but more specialized (cross-tokenizer OPD via token mapping) and may have narrower cross-field influence.

gpt-5.2 · Jun 10, 2026

Won vs. Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

Paper 1 addresses a fundamental and universal barrier in LLM training by enabling on-policy distillation across different tokenizers. This broadly unlocks the ability to mix and match any teacher-student model pair, significantly expanding the design space for knowledge transfer. While Paper 2 offers a valuable optimization for RLVR reasoning models, Paper 1's solution to cross-model compatibility has wider applicability across the entire landscape of open-source AI and model development.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. Perturbative Contrastive Physical Learning

Paper 1 is more scientifically novel and broadly impactful: it proposes a unifying framework (PCPL) connecting multiple physical-learning paradigms and demonstrates learning in distinct physical substrates (mechanical networks, photonic circuits), suggesting new routes for autonomous, hardware-native learning with implications across physics, materials, photonics, and neuromorphic/analog computing. Paper 2 is timely and practically valuable for LLM post-training, but its core contribution (token-mapping to enable OPD across tokenizers) is more incremental and likely narrower in cross-field scientific reach despite strong applied relevance.

gpt-5.2 · Jun 9, 2026

Won vs. Loss-Guided Adaptive Scale Refinement for Molecular Force Prediction

Paper 2 addresses a fundamental and widely applicable limitation in LLM knowledge distillation: the tokenizer compatibility barrier between teacher and student models. This has broad impact across the entire LLM community, enabling cross-family distillation with richer token-level signals. Paper 1, while methodologically interesting, addresses a narrower problem (adaptive scale refinement for molecular force prediction) with relatively incremental improvements on a minimal testbed (NaCl aqueous system). Paper 2's practical utility, broader applicability, and timeliness in the rapidly growing LLM field give it higher potential impact.

claude-opus-4-6 · Jun 9, 2026

Won vs. Investigating Calibration Challenges in Probabilistic Electricity Price Forecasting

Paper 1 offers a concrete, technically novel contribution: enabling on-policy distillation across different tokenizers via precise token mapping, which removes a widely encountered constraint in LLM post-training. This is timely, likely to be adopted broadly in LLM pipelines, and has high cross-domain impact wherever distillation/compression is needed, with claims of compute-efficiency and extensive experiments suggesting methodological strength. Paper 2 identifies an important issue (calibration vs sharpness) but is more diagnostic/positioning and appears less methodologically actionable from the abstract, with impact narrower to energy forecasting.

gpt-5.2 · Jun 9, 2026

Won vs. Operator learning for solving Fokker-Planck equations with various initial conditions

Paper 2 likely has higher scientific impact due to strong timeliness and broad applicability in LLM post-training. Enabling on-policy distillation across different tokenizers removes a major practical constraint, expanding teacher–student pairing across model families and improving compute efficiency, which is highly relevant to current industry and academic workflows. The contribution is broadly impactful across NLP, systems, and model compression. Paper 1 is technically novel for stochastic PDE operator learning, but its impact is narrower (specialized to Fokker–Planck/SDE settings) and likely targets a smaller community, despite solid methodological rigor.

gpt-5.2 · Jun 9, 2026

Won vs. Escaping the KL Agreement Trap in On-Policy Distillation

Paper 1 addresses a fundamental limitation of on-policy distillation (the requirement for shared tokenizers), enabling cross-family knowledge transfer between any LLM pairs. This opens a much broader design space for distillation and has wide applicability across the entire LLM ecosystem. Paper 2, while technically solid, addresses a more specific optimization issue (low-KL agreement traps) within existing OPD frameworks. Paper 1's contribution is more foundational, enabling new teacher-student combinations previously impossible, which has greater breadth of impact and practical utility for the community.

claude-opus-4-6 · Jun 9, 2026