Qian Zhao, Kunlong Chen, Changxin Tian, Zhonghui Jiang, Haitao Zhang, Chaofan Yu, Peijie Jiang, Mingliang Gong
FP4 training promises substantial reductions in memory and computation cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), providing a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) bypass this grid-geometry error and better convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. On Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, UFP4 consistently achieves lower BF16-relative loss degradation than strong E2M1-based baselines, supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.
The paper identifies Shrinkage Bias, a systematic negative rounding error inherent to non-uniform FP4 formats like E2M1 when using Round-to-Nearest-Even (RTNE). The key insight is geometric: in E2M1's non-uniform grid, rounding bins at spacing-transition points (e.g., between the {1.5, 2, 3} levels) have asymmetric widths (ℓ_i ≠ r_i), producing a consistent toward-zero bias. The authors show this bias: (a) accumulates multiplicatively across layers in deep networks, and (b) is paradoxically *amplified* by Random Hadamard Transforms (RHT), which push tensor mass into the most asymmetric bins. This provides a unified mechanistic explanation for training instabilities observed in E2M1-based FP4 recipes like NVFP4.
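The geometric argument can be checked with a few lines of arithmetic. The sketch below assumes the standard E2M1 magnitude levels {0, 0.5, 1, 1.5, 2, 3, 4, 6}, round-to-nearest bin edges at the midpoints between neighbouring levels, and a locally uniform density inside each bin. The function name `bin_bias` and the integer grid standing in for an E1M2/INT4-style format are illustrative choices, not taken from the paper.

```python
# Per-bin mean rounding error E[q - x] for interior levels of a magnitude grid,
# assuming x is uniformly distributed inside each round-to-nearest bin.

E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
UNIFORM_LEVELS = [float(k) for k in range(8)]  # stand-in for an E1M2/INT4-style grid


def bin_bias(levels):
    """Map each interior level q_i to its mean error (l_i - r_i) / 2."""
    bias = {}
    for i in range(1, len(levels) - 1):
        left = (levels[i] - levels[i - 1]) / 2   # l_i: half-distance to lower neighbour
        right = (levels[i + 1] - levels[i]) / 2  # r_i: half-distance to upper neighbour
        # The mean of x on [q - left, q + right] is q + (right - left) / 2,
        # so the mean of (q - x) is (left - right) / 2.
        bias[levels[i]] = (left - right) / 2
    return bias


if __name__ == "__main__":
    for name, levels in [("E2M1", E2M1_LEVELS), ("uniform", UNIFORM_LEVELS)]:
        print(name, bin_bias(levels))
```

Under these assumptions only the E2M1 bins at levels 2 and 4 have a nonzero mean error (-0.125 and -0.25), and both are negative, while every interior bin of the uniform grid comes out at zero. The edge bins (zero and the maximum level) are left out because their behaviour depends on how clipping is handled.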
The proposed solution, UFP4, is elegantly simple: switch to a uniform grid (E1M2/INT4), which eliminates shrinkage bias by construction. This format change enables RHT across all three training GEMMs (FPROP, DGRAD, WGRAD) — something harmful under E2M1 — while restricting stochastic rounding to dY alone.
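The building blocks of such a recipe can be sketched in plain Python. This is a minimal illustration, not the authors' kernel: the block length of 16, the absmax scaling, the symmetric integer code range [-7, 7], and all function names are assumptions made here, and the sketch does not say which operand of which GEMM receives the transform.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    v = list(values)
    n = len(v)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in v]


def rht(values, signs):
    """Random Hadamard Transform: flip signs, then apply the orthonormal FWHT."""
    return fwht([s * x for s, x in zip(signs, values)])


def inverse_rht(values, signs):
    """Undo rht(): the orthonormal Hadamard matrix is its own inverse."""
    return [s * x for s, x in zip(signs, fwht(values))]


def quantize_uniform(values, rng=None, max_code=7):
    """Quantize to integer codes in [-max_code, max_code] with an absmax scale.

    rng=None uses round-to-nearest-even; passing a random.Random instance uses
    stochastic rounding (round up with probability equal to the fractional part).
    """
    absmax = max(abs(x) for x in values)
    scale = absmax / max_code if absmax > 0 else 1.0
    codes = []
    for x in values:
        t = x / scale
        if rng is None:
            code = round(t)  # Python's round() rounds halves to even
        else:
            low = math.floor(t)
            code = low + (1 if rng.random() < t - low else 0)
        codes.append(max(-max_code, min(max_code, code)))
    return codes, scale


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(16)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(16)]
    y = rht(x, signs)
    codes_rtn, scale = quantize_uniform(y)      # deterministic rounding path
    codes_sr, _ = quantize_uniform(y, rng=rng)  # stochastic path, reserved for dY in the recipe
    print(codes_rtn, codes_sr, scale)
```

Because the transform is orthonormal, it preserves the vector norm and can be undone exactly up to floating-point error, so the only lossy step in this sketch is the rounding itself.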
Theoretical analysis is clean and well-grounded. The derivation of per-bin expected error under locally uniform density (Equation 2) is straightforward but provides genuine insight. The multiplicative accumulation argument (Equations 3-5) via orthogonal decomposition of quantized operands is mathematically sound, though the assumption that residual terms are incoherent with the signal deserves more scrutiny — this is likely violated for structured weight matrices.
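One hedged way to read the accumulation argument is sketched below. The symbols \alpha, \epsilon_\ell, r, and L are notation introduced here for illustration and may not match the paper's Equations 3-5, and the numerical example uses hypothetical values.

```latex
% Orthogonal decomposition of a quantized operand Q(x) against the signal x
Q(x) = \alpha\, x + r, \qquad
\alpha = \frac{\langle Q(x),\, x\rangle}{\lVert x\rVert^{2}}, \qquad
\langle r,\, x\rangle = 0 .

% If layer \ell has signal gain \alpha_\ell = 1 - \epsilon_\ell and the residuals
% stay incoherent with the signal, the gains compound over L layers:
\prod_{\ell=1}^{L} (1 - \epsilon_\ell) \;\approx\; \exp\!\Big(-\sum_{\ell=1}^{L} \epsilon_\ell\Big),
\qquad \text{e.g. } (1 - 0.01)^{64} \approx 0.53 .
```

The incoherence assumption is what lets the residual terms be ignored in the product. If the residuals correlate with the signal, as the concern about structured weight matrices suggests they might, the per-layer gains no longer multiply this cleanly.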
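Empirical validation is comprehensive across multiple levels: long-run pretraining on Dense 1.5B, MoE 7.9B, and MoE 124B models, supported by scaling-law analysis and ablation studies.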
Controlled baselines: The authors invest effort in tuning the E2M1 baseline through controlled one-factor ablations (Appendix B), avoiding the straw-man comparison trap. This significantly strengthens the claims.
One limitation: the locally-uniform density assumption within bins (Equation 2) is a simplification. Post-RHT distributions are approximately Gaussian, and the actual bias depends on the density profile within each bin. However, the empirical results strongly corroborate the theoretical predictions.
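How much the density profile matters can be probed with a short calculation. The sketch below compares the uniform-density bias of the E2M1 bin around level 2 with the bias under a zero-mean Gaussian restricted to the same bin, using the closed-form truncated-normal mean. The function names and the values of `sigma` (expressed in grid units after scaling) are hypothetical choices for illustration, not figures from the paper.

```python
import math


def _pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def gaussian_bin_bias(level, lo, hi, sigma):
    """E[level - x | lo <= x <= hi] for x ~ N(0, sigma^2)."""
    a, b = lo / sigma, hi / sigma
    mass = _cdf(b) - _cdf(a)
    mean_x = sigma * (_pdf(a) - _pdf(b)) / mass  # truncated-normal mean
    return level - mean_x


if __name__ == "__main__":
    # E2M1 bin around level 2: the midpoints to neighbours 1.5 and 3 are 1.75 and 2.5.
    uniform_bias = 2.0 - (1.75 + 2.5) / 2  # locally uniform density
    for sigma in (1.0, 2.0, 4.0):
        print(sigma, gaussian_bin_bias(2.0, 1.75, 2.5, sigma), uniform_bias)
```

For these hypothetical scales the Gaussian bias stays negative but is smaller in magnitude than the uniform-density value of -0.125, because a density that decays away from zero puts more of the bin's mass on the side nearer zero. As `sigma` grows, the density inside the bin flattens and the two values converge.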
Hardware design implications: The paper's most consequential claim is that future accelerators should support E1M2/INT4-style uniform grids as first-class training primitives. Given that NVIDIA Blackwell/Rubin and AMD MI350 are built around E2M1, this is a direct challenge to current industry direction. If validated at larger scales, this could influence next-generation accelerator ISA design — a high-stakes outcome.
Practical training recipes: UFP4 is immediately applicable on hardware supporting uniform 4-bit computation. The HiFloat4/Ascend 960 connection noted by the authors provides a near-term platform. The fused RHT+quantization kernel demonstrates minimal overhead (1.06-1.07×), making the approach practical.
Broader numerical format research: The shrinkage bias framework provides a reusable analytical tool for evaluating any non-uniform quantization format, applicable beyond FP4 to FP8 and other emerging formats.
This paper arrives at a critical juncture. FP4 training is transitioning from research curiosity to hardware-supported reality with Blackwell. The industry is making multi-year hardware commitments based on E2M1. Identifying a fundamental limitation *before* these choices become irrevocable makes this work highly timely. The June 2026 date suggests it targets the design window for post-Rubin architectures.
The paper also addresses the practical bottleneck of FP4 training instability, which has been documented but not satisfactorily explained. Providing a root-cause analysis rather than symptomatic fixes is valuable.
The paper's strength lies in combining theoretical elegance with industrial-scale empirical validation. The concept of a "regime shift" after RHT, from a dynamic-range-limited to a local-resolution-limited regime, is an important conceptual contribution that extends beyond FP4. The effective bucket ratio metric provides a useful diagnostic tool.
The work is well-positioned as complementary to existing quantization improvements (adaptive rounding, tensor decomposition, etc.), which could be combined with uniform grids for further gains.
Generated Jun 19, 2026
Paper 1 establishes a foundational impossibility theorem for Eliciting Latent Knowledge, a critical AI safety problem. While Paper 2 offers highly timely advancements for LLM hardware efficiency, its impact is tied to current quantization paradigms. In contrast, Paper 1 provides a mathematically rigorous, timeless theoretical bound on AI alignment, proving behavioral feedback alone cannot guarantee AI honesty. This fundamentally shifts how researchers must approach AGI safety, mechanistic interpretability, and model evaluation, offering broader and more enduring long-term scientific impact than a transient systems optimization.
Paper 1 has higher impact due to its profound implications for LLM pretraining and future AI hardware design. By identifying 'Shrinkage Bias' in standard E2M1 formats and proving the superiority of uniform grids at massive scales (up to 124B parameters), it directly challenges current paradigms for next-gen accelerators (NVIDIA Blackwell/AMD MI350). Enabling stable 4-bit training offers immense compute and memory savings globally. In contrast, Paper 2 presents a useful but narrower context-compression technique for agents, operating at a smaller scale (4B parameters) and building heavily on existing soft-prompting methods.
Paper 2 demonstrates higher potential scientific impact due to its industry-wide implications for LLM pretraining and hardware design. By identifying 'Shrinkage Bias' in E2M1 formats and proposing the UFP4 recipe, it challenges current AI hardware assumptions (NVIDIA Blackwell) and offers a scalable solution for 4-bit training. While Paper 1 presents a valuable robotics contribution, Paper 2 addresses a critical computational bottleneck for frontier AI, scaling successfully to 124B parameters. This promises massive efficiency gains and broader fundamental impact across the deep learning systems community.
Paper 2 addresses a critical bottleneck in LLM pretraining by identifying a mathematical flaw (Shrinkage Bias) in current FP4 hardware paths. Its proposed UFP4 recipe, validated on massive 124B parameter models, offers immense, immediate real-world utility and methodological rigor that will directly influence future AI hardware and foundation model training. While Paper 1 is philosophically profound and conceptually novel regarding autonomous AI, it is highly theoretical and lacks the immediate, quantifiable empirical and industrial impact of Paper 2.
Paper 2 identifies a fundamental mathematical limitation (Shrinkage Bias) in current FP4 hardware designs used by major GPU manufacturers (NVIDIA, AMD), proposes a concrete solution (UFP4), and validates it at scale up to 124B parameters. Its impact spans hardware architecture design, numerical methods, and LLM training efficiency—directly influencing next-generation accelerator design decisions. Paper 1 provides useful characterization of jailbreaking mechanisms but is more incremental, building on known demonstration-based jailbreaking with primarily empirical observations across four models, with narrower implications for the field.
Paper 2 likely has higher scientific impact due to strong real-world applicability and timeliness: FP4 pretraining is an immediate bottleneck for frontier-scale LLMs, and a recipe (UFP4) plus hardware implications (uniform grids as primitives) can influence both training practice and accelerator design. It offers a clear mechanistic explanation (geometric shrinkage bias, RHT interaction), validated via long-run pretraining across multiple model scales with ablations and scaling-law analysis. Paper 1 is novel and useful for agent design, but its claims hinge on architectural-theorem framing that may be harder to generalize and translate into broad, near-term adoption.
Paper 2 addresses a fundamental limitation in FP4 training for LLMs—a highly timely topic given the rapid scaling of LLM pretraining and the introduction of next-generation GPU architectures (Blackwell/Rubin, MI350). It identifies a novel theoretical insight (Shrinkage Bias from geometric asymmetry), proposes a practical recipe (UFP4) validated at scale (up to 124B parameters), and provides actionable hardware design recommendations. Its breadth of impact spans ML systems, hardware architecture, and numerical methods. Paper 1 addresses a narrower niche (fault diagnosis with limited data) with more incremental contributions in transfer learning and data augmentation.
Paper 1 presents a novel, rigorous analysis of a specific technical problem (shrinkage bias in FP4 training) with concrete mathematical foundations, empirical validation across multiple model scales, and actionable hardware design implications. It addresses a timely problem in LLM efficiency with quantitative results and scaling-law analysis. Paper 2 proposes a conceptual framework for integrating generative models into systems but lacks empirical validation, formal rigor, and specific technical depth—reading more as an opinion/position piece with abstract design principles rather than a scientifically grounded contribution.
Paper 2 likely has higher scientific impact due to a broadly applicable, timely contribution to LLM pretraining efficiency on emerging FP4 hardware. It identifies a fundamental, format-geometric source of training instability (shrinkage bias), offers a unified explanation tied to RHT, and proposes a concrete, validated recipe (UFP4) demonstrated at large scales (up to MoE 124B) with ablations and scaling-law analysis. The findings can influence both software training recipes and future accelerator design, affecting many models and domains. Paper 1 is innovative but more domain-specific and harder to translate into widespread practice.
Paper 2 has higher likely impact: it addresses a widely felt bottleneck (million-token attention) with an end-to-end solution spanning algorithm + GPU kernel, demonstrated at 109B scale and released code/model, making near-term adoption and downstream applications (agents, codebases, long-term memory, multimodal) highly plausible. Its relevance is immediate for deployment economics and could influence many long-context systems. Paper 1 is novel and important for FP4 training stability and hardware design, but its applicability depends more on specific quantization formats/hardware support and is narrower than long-context inference.