Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation
Fangbo Tu, Junhua Zhao, Chi Liu, Xin Chen, Haifeng Wu, Jian Wan, Srinivasan Manoharan
Abstract
Demand for low-precision inference, including NVFP4-based approaches, has grown as large language models are increasingly deployed in latency- and cost-constrained production environments. Quantization-aware distillation (QAD) helps recover accuracy lost under low-bit quantization by training a quantized student to match the output distribution of a frozen higher-precision teacher via a KL-divergence loss. In this work, we first provide a representation-level diagnosis of QAD: output matching alone can mask internal degradation, because many intermediate activation geometries can yield similar teacher-aligned logits. Using CKA, we show that KL-only QAD can reduce layerwise representational similarity relative to the BF16 teacher, with especially severe drift in RL-post-trained models. This drift correlates with downstream bottlenecks on reasoning and coding tasks, suggesting that low-bit recovery requires preserving internal geometry rather than matching outputs alone. Motivated by this finding, we propose CKA-QAD, a CKA-guided representational alignment method for NVFP4 QAD and low-bit LLM accuracy recovery. The method adds a lightweight regularizer that preserves internal representational geometry during distillation by aligning layerwise Gram matrices through CKA. Across Nemotron 3 Nano and Qwen3-4B-Thinking-2507, CKA-QAD substantially improves representational alignment and improves downstream reasoning and coding accuracy with modest training overhead. Our findings position CKA-guided representational alignment as a practical complement to output matching for quantized LLM recovery.
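For readers unfamiliar with the metric, CKA between two Gram matrices in its standard form is written out below, where $K$ and $L$ are $n \times n$ Gram matrices computed from teacher and student activations on the same $n$ inputs. The last line is only an assumed form of a combined objective, given for illustration: the abstract does not state the exact loss, so the weight $\lambda$, the layer set $\mathcal{S}$, and the $1 - \mathrm{CKA}$ penalty are all assumptions.

\[
H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}, \qquad
\mathrm{HSIC}(K, L) = \frac{1}{(n-1)^2}\,\mathrm{tr}(K H L H)
\]
\[
\mathrm{CKA}(K, L) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(L, L)}}
\]
\[
\mathcal{L}_{\text{total}} = \mathcal{L}_{\mathrm{KL}} + \lambda \cdot \frac{1}{|\mathcal{S}|} \sum_{\ell \in \mathcal{S}} \bigl(1 - \mathrm{CKA}(K_\ell^{\text{teacher}}, K_\ell^{\text{student}})\bigr)
\]

Because the $(n-1)^2$ factors cancel in the ratio, CKA lies in $[0, 1]$ for positive semidefinite Gram matrices and is unchanged by isotropic rescaling of either representation.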
AI Impact Assessments
Scientific Impact Assessment: "Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation"
1. Core Contribution
The paper identifies and addresses a specific failure mode in quantization-aware distillation (QAD) for NVFP4 LLMs: standard KL-divergence-based distillation can successfully match output distributions while simultaneously degrading internal representational geometry. The authors propose CKA-QAD, which augments the KL distillation objective with a Centered Kernel Alignment (CKA) regularizer that aligns layerwise Gram matrices between teacher and student models. The key diagnostic insight—that KL-only QAD can *worsen* internal CKA scores relative to PTQ alone—is the paper's most compelling finding, particularly the observation that RL-post-trained models exhibit the most severe drift.
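The idea of adding a CKA term to a KL distillation loss can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the linear kernel, the `1 - CKA` penalty, the per-layer averaging, and the weight `lam` are all hypothetical choices, and a real training loop would use an autodiff framework rather than NumPy.

```python
import numpy as np

def gram_linear(x):
    """Linear-kernel Gram matrix for activations x of shape (n_samples, dim)."""
    return x @ x.T

def center(gram):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 11^T."""
    n = gram.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ gram @ h

def cka(x, y):
    """Linear CKA between two activation matrices that share the sample axis."""
    kx, ky = center(gram_linear(x)), center(gram_linear(y))
    hsic = np.sum(kx * ky)  # equals tr(kx @ ky) because both are symmetric
    return hsic / (np.linalg.norm(kx) * np.linalg.norm(ky))

def kl_divergence(teacher_logits, student_logits):
    """Mean KL(teacher || student) over rows of logits."""
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    lt, ls = log_softmax(teacher_logits), log_softmax(student_logits)
    return float(np.mean(np.sum(np.exp(lt) * (lt - ls), axis=-1)))

def cka_qad_loss(t_logits, s_logits, t_acts, s_acts, lam=0.1):
    """Assumed combined objective: KL term plus lam * mean(1 - CKA) over layers."""
    penalty = np.mean([1.0 - cka(t, s) for t, s in zip(t_acts, s_acts)])
    return kl_divergence(t_logits, s_logits) + lam * float(penalty)
```

In this sketch, `t_acts` and `s_acts` are lists of per-layer activation matrices from teacher and student on the same batch. Note that identical logits give a zero KL term regardless of how far the hidden activations have drifted, which is the masking effect the assessment describes; only the CKA term responds to that drift.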
2. Methodological Rigor
The methodology is generally sound but has notable gaps:
Strengths in experimental design:
Concerns:
3. Potential Impact
Practical value: The method adds modest overhead (~0.5% wall-clock, ~7% VRAM) while providing meaningful accuracy improvements on reasoning benchmarks (e.g., +3.8% on AIME25 for Qwen3-4B). For production LLM deployment teams already using QAD pipelines, CKA-QAD appears to be a practical, drop-in enhancement.
Breadth of influence: The finding that output-matching distillation can mask internal degradation has broader implications beyond NVFP4 specifically. This diagnostic insight could influence how the community thinks about distillation objectives in general, including for pruning, architecture search, and other compression techniques.
Limitations on impact: The paper is narrowly scoped to NVFP4, and while Section 5.4 mentions future extension to INT4/MXFP4/sub-4-bit, no evidence is provided. The industrial focus (PayPal authorship, NVIDIA-specific format) may limit academic adoption. Additionally, CKA as a distillation signal is not novel—the paper itself cites prior work using CKA for BERT distillation [12] and visual knowledge distillation [13].
4. Timeliness & Relevance
The paper addresses a timely need: NVFP4 is a recently introduced format with growing adoption, and LLM quantization for production deployment is an active area. The focus on reasoning models (which are increasingly important) and the observation about RL-post-trained models being particularly vulnerable to representational drift is timely and practically relevant. However, the NVFP4 format is NVIDIA-proprietary, which somewhat limits the paper's reach compared to format-agnostic contributions.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
6. Additional Observations
The paper's arXiv date (June 2026) and reference to a 2026 paper [1] suggest this is very recent work. The focus on a specific hardware vendor's format (NVIDIA) positions this as applied/systems research rather than foundational ML research. The promised code release would significantly enhance reproducibility and impact. The improvement magnitudes, while meaningful (especially +3.8% on AIME25), are not transformative—they represent incremental gains in an already well-performing pipeline.
Generated Jun 5, 2026
Comparison History (16)
Paper 2 has higher likely impact: it targets a broadly relevant and timely problem—accurate low-bit (NVFP4) LLM deployment—affecting many domains and production systems. It contributes a clear diagnostic (KL-only QAD hides internal representational drift), links this drift to downstream reasoning/coding performance, and proposes a general, lightweight remedy (CKA-based regularization) demonstrated on multiple model families. Paper 1 is innovative but more domain-specific (biomedical tooling/MCP ecosystem) and depends on platform adoption, narrowing breadth of impact compared to quantization methods applicable across LLM deployments.
Paper 2 likely has higher scientific impact due to strong real-world applicability (production NVFP4/low-bit inference), a clear methodological contribution (representation-level diagnosis + CKA-regularized distillation), and timeliness as quantization is central to deploying LLMs. It links internal geometry drift to downstream reasoning/coding degradation and proposes a lightweight, generally applicable fix validated on multiple models. Paper 1 is novel and useful as an evaluation benchmark, but its impact is more indirect (measurement-focused) and narrower to LLM evaluation rather than immediately improving deployed systems.
PieArena introduces a novel, large-scale benchmark for evaluating LLM negotiation capabilities—a relatively underexplored area combining strategic reasoning, theory of mind, and economic reasoning. It offers broader interdisciplinary impact (AI, economics, behavioral science, business), includes human baselines from MBA students, and provides a reusable evaluation framework. Paper 1, while technically solid, addresses a more incremental improvement (CKA regularization for quantization-aware distillation) within a narrower scope. PieArena's finding that GPT-5 matches human negotiators is likely to attract significant attention across multiple communities.
Paper 2 demonstrates higher potential scientific impact due to: (1) remarkable practical results—a 1B model outperforming a 78B model with 42x less memory and surpassing GPT-5.1 on planning; (2) broader real-world applicability in safety-critical autonomous driving; (3) novel multi-teacher distillation with asymmetric gradient projection addressing a fundamental challenge; (4) cross-model generalization validation. While Paper 1 offers a useful diagnostic insight (CKA-guided alignment for NVFP4 quantization), it represents a more incremental contribution—adding a regularizer to existing QAD pipelines—with narrower scope limited to low-bit quantization recovery.
Paper 1 addresses a fundamental bottleneck in AI: the rapid saturation and high cost of creating LLM benchmarks. An autonomous benchmark generation system offers massive scalability, high novelty, and broad applicability across all AI domains. Paper 2 presents a rigorous methodological improvement for model quantization, but its impact is more specialized to efficiency and deployment, whereas Paper 1 could fundamentally shift the overarching evaluation paradigm of the entire field.
Paper 1 addresses a critical bottleneck in LLM deployment: low-precision inference (NVFP4). By identifying that traditional output matching degrades internal representational geometry and proposing a CKA-guided solution, it offers a fundamental algorithmic improvement for model compression. Given the immense cost and latency pressures in deploying LLMs, innovations that successfully recover reasoning and coding capabilities at 4-bit precision will have widespread, immediate industry and academic impact. While Paper 2 provides a valuable methodological reality check for multi-agent systems, Paper 1's foundational contribution to efficient, hardware-aware LLM scaling has greater potential for broad, lasting impact.
While Paper 1 offers a novel multi-agent approach to molecular optimization with strong domain applications, Paper 2 addresses a highly critical and timely bottleneck in modern AI: efficient deployment of Large Language Models. By improving low-precision (NVFP4) quantization through internal geometry preservation, Paper 2 has immediate, wide-ranging implications for reducing computational costs and latency across almost all LLM applications, giving it a broader and more immediate scientific and industrial impact.
Paper 2 addresses a fundamental challenge in enterprise AI: applying foundation models to multi-table relational databases (RDBs) without retraining. By providing a training-free encoder and scalable SQL primitives, it bridges the gap between LLMs and structured enterprise data. While Paper 1 offers a valuable optimization for LLM quantization, Paper 2 has a significantly broader scope of real-world applications across multiple domains, offering a novel paradigm for zero-shot predictive modeling on heterogeneous tabular data.
Paper 2 tackles low-bit quantization for LLMs, a critical bottleneck for deploying large models efficiently. By addressing internal representation degradation during distillation, it offers broad, immediate impacts across all fields utilizing LLMs. Paper 1 is valuable for scientific visualization workflows, but its scope and potential applications are more niche compared to fundamental improvements in foundational AI model efficiency.
Paper 2 addresses a critical and universal bottleneck in LLM deployment—low-precision quantization—offering a novel method to preserve internal geometry during distillation. Its findings and proposed CKA-QAD method have immediate, widespread applicability across all LLM domains, significantly impacting efficient model scaling and deployment. Paper 1, while providing valuable insights into MLLM limitations, is restricted to a niche domain (Feynman diagrams) and serves primarily as a benchmark rather than a broadly applicable algorithmic advancement.
Paper 2 is more novel and broadly impactful: it introduces a general test-time self-supervised “closed-loop” constraint for latent reasoning (reconstruction as a fidelity check), applicable beyond quantization and potentially across many latent-coordination or cache-based reasoning systems. The reported gains (e.g., large AIME improvement) indicate strong practical relevance and timeliness given current interest in latent reasoning and test-time training. Paper 1 is methodologically solid and valuable for deployment (NVFP4 distillation), but its scope is narrower (quantized LLM recovery) and the core idea (representation alignment via CKA regularization) is a more incremental extension of existing distillation diagnostics/regularizers.
Paper 2 addresses a critical and highly timely bottleneck in AI: the efficient deployment of Large Language Models (LLMs) via ultra-low-precision quantization (NVFP4). By diagnosing internal degradation in standard distillation and proposing a novel CKA-guided alignment method, it offers a fundamental improvement to LLM compression. This has immense potential for broad, cross-disciplinary impact by making advanced reasoning models cheaper and faster to deploy. Paper 1 offers a solid application of DRL to supply chain management, but its scope is narrower and lacks the transformative cross-field potential of optimizing frontier AI models.
Paper 1 addresses a critical, emerging frontier in AI safety—covert psychological manipulation in dynamic interactions. By introducing a novel multi-turn benchmark and revealing implicit model behaviors, it has broad interdisciplinary impact across AI alignment, psychology, and HCI. While Paper 2 offers significant practical value for model deployment, Paper 1 tackles a fundamental, widely relevant safety challenge with higher potential to shape future regulatory and alignment research.
Paper 1 is likely to have higher impact: it targets a timely, high-demand problem (accurate low-bit LLM deployment), introduces a broadly applicable and lightweight representational-alignment regularizer (CKA-guided) beyond standard KL distillation, and provides diagnostic evidence linking internal drift to downstream reasoning/coding degradation. Its applications span many LLMs and production inference contexts, with potential influence across quantization, distillation, and interpretability. Paper 2 is methodologically solid but more niche (longest-path variants), with narrower cross-field uptake despite useful algorithmic advances.
Paper 2 has higher likely impact due to strong real-world applicability (production NVFP4 inference), broad relevance (quantization, distillation, representation learning), and an actionable method (CKA-guided regularization) that can be adopted widely across models and stacks. It offers a clear diagnosis plus a practical fix with measurable downstream gains, which tends to translate into rapid uptake. Paper 1 is methodologically rigorous and novel for causal decomposition in RLVR, but its impact is narrower (alignment/RLVR auditing) and more contingent on the community adopting its specific estimand and simulator-based framework.
Paper 2 (SAGE-PTQ) demonstrates higher potential scientific impact due to its more fundamental contribution: achieving near-1-bit quantization with dramatically better perplexity (6.74 vs 55.8 for BiLLM on LLaMA-3-8B) while reducing memory by 50% and improving inference speed 1.5x. The graph-based approach to optimizing group structures is more novel and broadly applicable. Paper 1's CKA-QAD offers incremental improvements to an existing distillation pipeline for a specific format (NVFP4), while Paper 2 addresses the harder ultra-low-bit regime with a complete framework showing order-of-magnitude improvements over baselines.