Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation

Fangbo Tu, Junhua Zhao, Chi Liu, Xin Chen, Haifeng Wu, Jian Wan, Srinivasan Manoharan

Jun 4, 2026

arXiv:2606.05682v1

cs.AI (primary), cs.LG

#1474 of 3355 · Artificial Intelligence

Tournament Score: 1418 ± 48 (scale 1050 to 1800)

Win Rate: 56%

Rating: 5.5 / 10

Significance 5.5 · Rigor 5 · Novelty 4.5 · Clarity 7

Abstract

Demand for low-precision inference, including NVFP4-based approaches, has grown as large language models are increasingly deployed in latency- and cost-constrained production environments. Quantization-aware distillation (QAD) helps recover accuracy lost under low-bit quantization by training a quantized student to match the output distribution of a frozen higher-precision teacher via a KL-divergence loss. In this work, we first provide a representation-level diagnosis of QAD: output matching alone can mask internal degradation, because many intermediate activation geometries can yield similar teacher-aligned logits. Using CKA, we show that KL-only QAD can reduce layerwise representational similarity relative to the BF16 teacher, with especially severe drift in RL-post-trained models. This drift correlates with downstream bottlenecks on reasoning and coding tasks, suggesting that low-bit recovery requires preserving internal geometry rather than matching outputs alone. Motivated by this finding, we propose CKA-QAD, a CKA-guided representational alignment method for NVFP4 QAD and low-bit LLM accuracy recovery. The method adds a lightweight regularizer that preserves internal representational geometry during distillation by aligning layerwise Gram matrices through CKA. Across Nemotron 3 Nano and Qwen3-4B-Thinking-2507, CKA-QAD substantially improves representational alignment and improves downstream reasoning and coding accuracy with modest training overhead. Our findings position CKA-guided representational alignment as a practical complement to output matching for quantized LLM recovery.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation"

1. Core Contribution

The paper identifies and addresses a specific failure mode in quantization-aware distillation (QAD) for NVFP4 LLMs: standard KL-divergence-based distillation can successfully match output distributions while simultaneously degrading internal representational geometry. The authors propose CKA-QAD, which augments the KL distillation objective with a Centered Kernel Alignment (CKA) regularizer that aligns layerwise Gram matrices between teacher and student models. The key diagnostic insight—that KL-only QAD can *worsen* internal CKA scores relative to PTQ alone—is the paper's most compelling finding, particularly the observation that RL-post-trained models exhibit the most severe drift.
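To make the objective described above concrete, the following is a minimal NumPy sketch of linear CKA computed from centered Gram matrices, combined with a KL term. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `1 - CKA` penalty form, and the fixed weight `lam` are all hypothetical, and the paper's dynamic balancing (Eq. 5) and top-k logit handling are not reproduced here.

```python
import numpy as np

def centered_gram(x):
    """Linear Gram matrix K = X X^T, double-centered as H K H."""
    k = x @ x.T
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ k @ h

def linear_cka(x, y):
    """CKA between activations x (n, d1) and y (n, d2) for the same n tokens."""
    kx, ky = centered_gram(x), centered_gram(y)
    hsic_xy = np.sum(kx * ky)  # equals trace(kx @ ky) for symmetric matrices
    return hsic_xy / (np.linalg.norm(kx) * np.linalg.norm(ky))

def kl_divergence(teacher_logits, student_logits):
    """Mean KL(teacher || student) over rows of logits."""
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    lt, ls = log_softmax(teacher_logits), log_softmax(student_logits)
    return float(np.mean(np.sum(np.exp(lt) * (lt - ls), axis=-1)))

def combined_loss(t_logits, s_logits, t_hidden, s_hidden, lam=0.1):
    """KL term plus a mean (1 - CKA) penalty over paired layers.

    lam is a hypothetical fixed weight used only for illustration.
    """
    penalty = np.mean([1.0 - linear_cka(t, s) for t, s in zip(t_hidden, s_hidden)])
    return kl_divergence(t_logits, s_logits) + lam * penalty
```

In this sketch, a student whose logits match the teacher exactly still incurs a nonzero loss whenever its layerwise activations have drifted, which is the failure mode that a KL-only objective cannot see.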

2. Methodological Rigor

The methodology is generally sound but has notable gaps:

Strengths in experimental design:

The paper evaluates on two architecturally distinct models (hybrid Mamba-Transformer MoE and pure Transformer), strengthening generalizability claims.

Multiple challenging benchmarks (AIME25, GPQA-D, LiveCodeBench-v5) are used with appropriate multi-sample evaluation protocols.

Training overhead is carefully profiled with concrete numbers.

Concerns:

The paper evaluates only two models, both relatively small (3B active parameters and 4B parameters). The claim that this method generalizes broadly is insufficiently supported.

The claim that CKA drift *causes* downstream accuracy degradation is supported only by correlational evidence. The paper acknowledges this ("empirical correlation"), but the framing sometimes implies causation.

Ablation studies are essentially absent. There is no systematic study of: (a) how many/which layers to regularize, (b) sensitivity to the dynamic balancing mechanism, (c) comparison against simpler feature-matching alternatives (e.g., MSE with proper normalization), or (d) the effect of CKA regularization strength.

The dynamic loss balancing (Eq. 5) is presented as an advantage but never compared against fixed-λ alternatives, leaving its contribution unclear.

The Qwen3-4B CKA-QAD entries are missing from the LiveCodeBench difficulty breakdown (Table 3), reducing completeness.
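The missing comparison against MSE-style feature matching matters because the two losses respond differently to changes of basis. The sketch below is a hedged illustration on synthetic data, not an experiment from the paper: it uses the feature-space form of linear CKA (d x d products rather than n x n Gram matrices) and a hypothetical `normalized_mse` baseline, then applies a random rotation and rescaling to a stand-in "teacher" activation matrix.

```python
import numpy as np

def feature_space_cka(x, y):
    """Linear CKA from d x d products instead of n x n Gram matrices."""
    xc = x - x.mean(axis=0, keepdims=True)
    yc = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(yc.T @ xc) ** 2
    return cross / (np.linalg.norm(xc.T @ xc) * np.linalg.norm(yc.T @ yc))

def normalized_mse(x, y):
    """MSE after scaling each matrix to unit Frobenius norm (hypothetical baseline)."""
    xn = x / np.linalg.norm(x)
    yn = y / np.linalg.norm(y)
    return float(np.mean((xn - yn) ** 2))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(64, 16))  # synthetic activations: 64 tokens, 16 features

# Random orthogonal matrix from a QR decomposition.
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
rotated = 3.0 * teacher @ q  # same pairwise geometry, different basis and scale

cka_value = feature_space_cka(teacher, rotated)
mse_value = normalized_mse(teacher, rotated)
print("CKA after rotation and rescaling:", cka_value)
print("Normalized MSE after rotation and rescaling:", mse_value)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so it stays at 1 (up to floating-point error) here, while the normalized MSE is clearly nonzero. Whether that invariance is what drives the reported gains is exactly what an ablation against such baselines would test.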

3. Potential Impact

Practical value: The method adds modest overhead (~0.5% wall-clock, ~7% VRAM) while providing meaningful accuracy improvements on reasoning benchmarks (e.g., +3.8% on AIME25 for Qwen3-4B). For production LLM deployment teams already using QAD pipelines, CKA-QAD appears to be a practical, drop-in enhancement.

Breadth of influence: The finding that output-matching distillation can mask internal degradation has broader implications beyond NVFP4 specifically. This diagnostic insight could influence how the community thinks about distillation objectives in general, including for pruning, architecture search, and other compression techniques.

Limitations on impact: The paper is narrowly scoped to NVFP4, and while Section 5.4 mentions future extension to INT4/MXFP4/sub-4-bit, no evidence is provided. The industrial focus (PayPal authorship, NVIDIA-specific format) may limit academic adoption. Additionally, CKA as a distillation signal is not novel—the paper itself cites prior work using CKA for BERT distillation [12] and visual knowledge distillation [13].

4. Timeliness & Relevance

The paper addresses a timely need: NVFP4 is a recently introduced format with growing adoption, and LLM quantization for production deployment is an active area. The focus on reasoning models (which are increasingly important) and the observation about RL-post-trained models being particularly vulnerable to representational drift is timely and practically relevant. However, the NVFP4 format is NVIDIA-proprietary, which somewhat limits the paper's reach compared to format-agnostic contributions.

5. Strengths & Limitations

Key Strengths:

The diagnostic finding (KL-only QAD worsening internal CKA relative to PTQ) is genuinely interesting and well-visualized in Figure 1.

The method is lightweight and practical, with clear engineering considerations (top-k logit distillation, feature-space CKA computation, dynamic balancing).

The observation that RL-post-trained models are most susceptible to representational drift during QAD is novel and actionable for practitioners.

The paper is clearly written with a logical flow from diagnosis to solution.

Notable Weaknesses:

Limited novelty: CKA as a distillation loss is established; the contribution is primarily applying it to NVFP4 QAD with engineering adaptations. The dynamic balancing mechanism (Eq. 5) is the most novel technical element but is not thoroughly analyzed.

Narrow evaluation scope: Only two models, both under 30B parameters. No evaluation on larger models where quantization effects may differ substantially.

Incomplete baselines: No comparison against other feature-matching approaches (MSE, attention transfer, SP distillation) adapted for NVFP4, making it impossible to assess whether CKA specifically is necessary versus any intermediate supervision.

Statistical reporting: Error bars or confidence intervals are not reported for benchmark results, despite the stochastic nature of multi-sample evaluation.

Theoretical depth: The paper provides intuitive explanations for why CKA helps but no formal analysis. The connection to the superposition hypothesis [4] is mentioned but not developed.

Incomplete results: The Qwen3-4B CKA-QAD LiveCodeBench difficulty breakdown is missing from Table 3, and the paper notes "completed evaluations" suggesting some results may be preliminary.
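On the statistical reporting point above, one common way to attach uncertainty to multi-sample benchmark accuracy is a percentile bootstrap over problems. The following is a minimal sketch under assumptions: the function name, the (problems x samples) 0/1 layout, and the resampling unit are illustrative choices, and the data in the usage lines are synthetic, not results from the paper.

```python
import numpy as np

def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean accuracy.

    correct: array of shape (n_problems, n_samples) holding 0/1 outcomes.
    Problems are resampled with replacement; the samples drawn for one
    problem stay together, since they are not independent of each other.
    """
    correct = np.asarray(correct, dtype=float)
    rng = np.random.default_rng(seed)
    per_problem = correct.mean(axis=1)       # accuracy per problem
    n = per_problem.shape[0]
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = per_problem[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return per_problem.mean(), lo, hi

# Synthetic example: 30 problems, 8 sampled generations each.
rng = np.random.default_rng(1)
outcomes = rng.integers(0, 2, size=(30, 8))
mean_acc, lower, upper = bootstrap_ci(outcomes)
print(f"accuracy {mean_acc:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

On a benchmark with only a few dozen problems, such an interval can be several points wide, which is the scale at which gains like the reported AIME25 improvement need to be judged.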

6. Additional Observations

The paper's arXiv date (June 2026) and reference to a 2026 paper [1] suggest this is very recent work. The focus on a specific hardware vendor's format (NVIDIA) positions this as applied/systems research rather than foundational ML research. The promised code release would significantly enhance reproducibility and impact. The improvement magnitudes, while meaningful (especially +3.8% on AIME25), are not transformative—they represent incremental gains in an already well-performing pipeline.

Rating: 5.5 / 10

Significance 5.5 · Rigor 5 · Novelty 4.5 · Clarity 7

Generated Jun 5, 2026

Comparison History (16)

vs. Beyond Prompt-Based Planning: MCP-Native Graph Planning-based Biomedical Agent System

gpt-5.2 · 6/6/2026

Paper 2 has higher likely impact: it targets a broadly relevant and timely problem—accurate low-bit (NVFP4) LLM deployment—affecting many domains and production systems. It contributes a clear diagnostic (KL-only QAD hides internal representational drift), links this drift to downstream reasoning/coding performance, and proposes a general, lightweight remedy (CKA-based regularization) demonstrated on multiple model families. Paper 1 is innovative but more domain-specific (biomedical tooling/MCP ecosystem) and depends on platform adoption, narrowing breadth of impact compared to quantization methods applicable across LLM deployments.

vs. Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

gpt-5.2 · 6/6/2026

Paper 2 likely has higher scientific impact due to strong real-world applicability (production NVFP4/low-bit inference), a clear methodological contribution (representation-level diagnosis + CKA-regularized distillation), and timeliness as quantization is central to deploying LLMs. It links internal geometry drift to downstream reasoning/coding degradation and proposes a lightweight, generally applicable fix validated on multiple models. Paper 1 is novel and useful as an evaluation benchmark, but its impact is more indirect (measurement-focused) and narrower to LLM evaluation rather than immediately improving deployed systems.

vs. PieArena: Ranking and Profiling Language Agents in Realistic Negotiation Scenarios

claude-opus-4.6 · 6/6/2026

PieArena introduces a novel, large-scale benchmark for evaluating LLM negotiation capabilities—a relatively underexplored area combining strategic reasoning, theory of mind, and economic reasoning. It offers broader interdisciplinary impact (AI, economics, behavioral science, business), includes human baselines from MBA students, and provides a reusable evaluation framework. Paper 1, while technically solid, addresses a more incremental improvement (CKA regularization for quantization-aware distillation) within a narrower scope. PieArena's finding that GPT-5 matches human negotiators is likely to attract significant attention across multiple communities.

vs. Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving

claude-opus-4.6 · 6/6/2026

Paper 2 demonstrates higher potential scientific impact due to: (1) remarkable practical results—a 1B model outperforming a 78B model with 42x less memory and surpassing GPT-5.1 on planning; (2) broader real-world applicability in safety-critical autonomous driving; (3) novel multi-teacher distillation with asymmetric gradient projection addressing a fundamental challenge; (4) cross-model generalization validation. While Paper 1 offers a useful diagnostic insight (CKA-guided alignment for NVFP4 quantization), it represents a more incremental contribution—adding a regularizer to existing QAD pipelines—with narrower scope limited to low-bit quantization recovery.

vs. Benchmark Everything Everywhere All at Once

gemini-3.1 · 6/5/2026

Paper 1 addresses a fundamental bottleneck in AI: the rapid saturation and high cost of creating LLM benchmarks. An autonomous benchmark generation system offers massive scalability, high novelty, and broad applicability across all AI domains. Paper 2 presents a rigorous methodological improvement for model quantization, but its impact is more specialized to efficiency and deployment, whereas Paper 1 could fundamentally shift the overarching evaluation paradigm of the entire field.

vs. Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

gemini-3.1 · 6/5/2026

Paper 1 addresses a critical bottleneck in LLM deployment: low-precision inference (NVFP4). By identifying that traditional output matching degrades internal representational geometry and proposing a CKA-guided solution, it offers a fundamental algorithmic improvement for model compression. Given the immense cost and latency pressures in deploying LLMs, innovations that successfully recover reasoning and coding capabilities at 4-bit precision will have widespread, immediate industry and academic impact. While Paper 2 provides a valuable methodological reality check for multi-agent systems, Paper 1's foundational contribution to efficient, hardware-aware LLM scaling has greater potential for broad, lasting impact.

vs. Agents on a Tree: Pathwise Coordination for Multi-Objective Molecular Optimization

gemini-3.1 · 6/5/2026

While Paper 1 offers a novel multi-agent approach to molecular optimization with strong domain applications, Paper 2 addresses a highly critical and timely bottleneck in modern AI: efficient deployment of Large Language Models. By improving low-precision (NVFP4) quantization through internal geometry preservation, Paper 2 has immediate, wide-ranging implications for reducing computational costs and latency across almost all LLM applications, giving it a broader and more immediate scientific and industrial impact.

vs. No Need to Train Your RDB Foundation Model

gemini-3.1 · 6/5/2026

Paper 2 addresses a fundamental challenge in enterprise AI: applying foundation models to multi-table relational databases (RDBs) without retraining. By providing a training-free encoder and scalable SQL primitives, it bridges the gap between LLMs and structured enterprise data. While Paper 1 offers a valuable optimization for LLM quantization, Paper 2 has a significantly broader scope of real-world applications across multiple domains, offering a novel paradigm for zero-shot predictive modeling on heterogeneous tabular data.

vs. SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization

gemini-3.1 · 6/5/2026

Paper 2 tackles low-bit quantization for LLMs, a critical bottleneck for deploying large models efficiently. By addressing internal representation degradation during distillation, it offers broad, immediate impacts across all fields utilizing LLMs. Paper 1 is valuable for scientific visualization workflows, but its scope and potential applications are more niche compared to fundamental improvements in foundational AI model efficiency.

vs. FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

gemini-3.1 · 6/5/2026

Paper 2 addresses a critical and universal bottleneck in LLM deployment—low-precision quantization—offering a novel method to preserve internal geometry during distillation. Its findings and proposed CKA-QAD method have immediate, widespread applicability across all LLM domains, significantly impacting efficient model scaling and deployment. Paper 1, while providing valuable insights into MLLM limitations, is restricted to a niche domain (Feynman diagrams) and serves primarily as a benchmark rather than a broadly applicable algorithmic advancement.

vs. Closing the Loop on Latent Reasoning via Test-Time Reconstruction

gpt-5.2 · 6/5/2026

Paper 2 is more novel and broadly impactful: it introduces a general test-time self-supervised “closed-loop” constraint for latent reasoning (reconstruction as a fidelity check), applicable beyond quantization and potentially across many latent-coordination or cache-based reasoning systems. The reported gains (e.g., large AIME improvement) indicate strong practical relevance and timeliness given current interest in latent reasoning and test-time training. Paper 1 is methodologically solid and valuable for deployment (NVFP4 distillation), but its scope is narrower (quantized LLM recovery) and the core idea (representation alignment via CKA regularization) is a more incremental extension of existing distillation diagnostics/regularizers.

vs. Learning to replenish: A hybrid deep reinforcement learning for dynamic inventory management in the pharmaceutical supply chains

gemini-3.1 · 6/5/2026

Paper 2 addresses a critical and highly timely bottleneck in AI: the efficient deployment of Large Language Models (LLMs) via ultra-low-precision quantization (NVFP4). By diagnosing internal degradation in standard distillation and proposing a novel CKA-guided alignment method, it offers a fundamental improvement to LLM compression. This has immense potential for broad, cross-disciplinary impact by making advanced reasoning models cheaper and faster to deploy. Paper 1 offers a solid application of DRL to supply chain management, but its scope is narrower and lacks the transformative cross-field potential of optimizing frontier AI models.

vs. CogManip: Benchmarking Manipulative Behavior in Multi-Turn Interactions with Large Language Model

gemini-3.1 · 6/5/2026

Paper 1 addresses a critical, emerging frontier in AI safety—covert psychological manipulation in dynamic interactions. By introducing a novel multi-turn benchmark and revealing implicit model behaviors, it has broad interdisciplinary impact across AI alignment, psychology, and HCI. While Paper 2 offers significant practical value for model deployment, Paper 1 tackles a fundamental, widely relevant safety challenge with higher potential to shape future regulatory and alignment research.

vs. Bidirectional Search for Longest Paths: Case for Front-to-Front Heuristics

gpt-5.2 · 6/5/2026

Paper 1 is likely to have higher impact: it targets a timely, high-demand problem (accurate low-bit LLM deployment), introduces a broadly applicable and lightweight representational-alignment regularizer (CKA-guided) beyond standard KL distillation, and provides diagnostic evidence linking internal drift to downstream reasoning/coding degradation. Its applications span many LLMs and production inference contexts, with potential influence across quantization, distillation, and interpretability. Paper 2 is methodologically solid but more niche (longest-path variants), with narrower cross-field uptake despite useful algorithmic advances.

vs. A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR

gpt-5.2 · 6/5/2026

Paper 2 has higher likely impact due to strong real-world applicability (production NVFP4 inference), broad relevance (quantization, distillation, representation learning), and an actionable method (CKA-guided regularization) that can be adopted widely across models and stacks. It offers a clear diagnosis plus a practical fix with measurable downstream gains, which tends to translate into rapid uptake. Paper 1 is methodologically rigorous and novel for causal decomposition in RLVR, but its impact is narrower (alignment/RLVR auditing) and more contingent on the community adopting its specific estimand and simulator-based framework.

vs. Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

claude-opus-4.6 · 6/5/2026

Paper 2 (SAGE-PTQ) demonstrates higher potential scientific impact due to its more fundamental contribution: achieving near-1-bit quantization with dramatically better perplexity (6.74 vs 55.8 for BiLLM on LLaMA-3-8B) while reducing memory by 50% and improving inference speed 1.5x. The graph-based approach to optimizing group structures is more novel and broadly applicable. Paper 1's CKA-QAD offers incremental improvements to an existing distillation pipeline for a specific format (NVFP4), while Paper 2 addresses the harder ultra-low-bit regime with a complete framework showing order-of-magnitude improvements over baselines.