
Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

Qian Zhao, Kunlong Chen, Changxin Tian, Zhonghui Jiang, Haitao Zhang, Chaofan Yu, Peijie Jiang, Mingliang Gong

Jun 18, 2026 · arXiv:2606.20381v1
cs.AI
#46 of 3753 · Artificial Intelligence
Tournament Score: 1570 ± 48 (scale 1050 to 1800)
Win Rate: 89% (17 wins, 2 losses, 19 matches)
Rating: 7.3 / 10
Significance 7.5 · Rigor 7.5 · Novelty 7 · Clarity 8

Abstract

FP4 training promises substantial reductions in memory and computation cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), providing a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) bypass this grid-geometry error and better convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. On Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, UFP4 consistently achieves lower BF16-relative loss degradation than strong E2M1-based baselines, supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Rethinking Shrinkage Bias in LLM FP4 Pretraining

1. Core Contribution

The paper identifies Shrinkage Bias, a systematic negative rounding error inherent to non-uniform FP4 formats like E2M1 when using Round-to-Nearest-Even (RTNE). The key insight is geometric: in E2M1's non-uniform grid, rounding bins at spacing-transition points (e.g., between the {1.5, 2, 3} levels) have asymmetric widths (ℓ_i ≠ r_i), producing a consistent toward-zero bias. The authors show this bias: (a) accumulates multiplicatively across layers in deep networks, and (b) is paradoxically *amplified* by Random Hadamard Transforms (RHT), which push tensor mass into the most asymmetric bins. This provides a unified mechanistic explanation for training instabilities observed in E2M1-based FP4 recipes like NVFP4.
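The geometric argument can be checked with a few lines of arithmetic. The sketch below is not taken from the paper. It assumes the standard E2M1 magnitude set {0, 0.5, 1, 1.5, 2, 3, 4, 6}, round-to-nearest bins, and a density that is uniform within each bin, and it reads ℓ_i and r_i as the distances from level i to the lower and upper edges of its rounding bin. Under those assumptions the expected signed error for a bin is (ℓ_i − r_i)/2. The names `bin_bias`, `E2M1`, and `UNIFORM` are illustrative.

```python
from fractions import Fraction as F

# Positive E2M1 magnitudes (the sign bit is handled separately).
E2M1 = [F(0), F(1, 2), F(1), F(3, 2), F(2), F(3), F(4), F(6)]
# An equally spaced 8-level grid, standing in for an E1M2/INT4-style format (assumption).
UNIFORM = [F(k, 2) for k in range(8)]


def bin_bias(levels):
    """Expected value of Q(x) - x for x uniform over each interior level's rounding bin."""
    bias = {}
    for i in range(1, len(levels) - 1):
        left = (levels[i] - levels[i - 1]) / 2   # l_i: level to lower bin edge
        right = (levels[i + 1] - levels[i]) / 2  # r_i: level to upper bin edge
        bias[levels[i]] = (left - right) / 2     # level minus bin midpoint
    return bias


if __name__ == "__main__":
    for name, grid in (("E2M1", E2M1), ("uniform", UNIFORM)):
        print(name, {float(k): float(v) for k, v in bin_bias(grid).items()})
```

In this toy setting only the levels where the spacing changes (2 and 4) pick up a nonzero expected error, and it is negative in both cases (−1/8 and −1/4 in grid units). The equally spaced grid has zero expected error at every interior level.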

The proposed solution, UFP4, is elegantly simple: switch to a uniform grid (E1M2/INT4), which eliminates shrinkage bias by construction. This format change enables RHT across all three training GEMMs (FPROP, DGRAD, WGRAD) — something harmful under E2M1 — while restricting stochastic rounding to dY alone.
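As background for why a rotation can be applied inside a GEMM at all, the sketch below demonstrates the standard identity behind RHT: rotating both operands along the contraction dimension with the same orthogonal matrix leaves the exact product unchanged. It is a pure-Python illustration, not the paper's fused kernel, and it omits the quantization step. In a real recipe any difference would come only from quantizing the rotated operands. The helper names (`hadamard`, `rht_matrix`, `matmul`, `transpose`) and the small matrices are hypothetical.

```python
import math
import random


def hadamard(n):
    """Sylvester construction of a +/-1 Hadamard matrix; n must be a power of two."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H


def rht_matrix(n, seed=0):
    """Orthogonal matrix R = H @ diag(signs) / sqrt(n) with random +/-1 signs."""
    rng = random.Random(seed)
    signs = [rng.choice((-1, 1)) for _ in range(n)]
    H = hadamard(n)
    s = 1.0 / math.sqrt(n)
    return [[s * H[i][j] * signs[j] for j in range(n)] for i in range(n)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


X = [[1.0, -2.0, 0.5, 3.0], [0.0, 4.0, -1.0, 2.0]]  # activations, shape (2, 4)
W = [[0.5, 1.0, -1.5, 2.0], [1.0, 0.0, 2.0, -0.5]]  # weights, shape (2, 4)
R = rht_matrix(4)

exact = matmul(X, transpose(W))                          # X @ W.T
rotated = matmul(matmul(X, R), transpose(matmul(W, R)))  # (X R) @ (W R).T
max_diff = max(abs(a - b) for ra, rb in zip(exact, rotated) for a, b in zip(ra, rb))
```

Because R Rᵀ = I, `max_diff` is zero up to floating-point error. The rotation changes only how values are distributed over the quantization grid, which is why the grid geometry decides whether RHT helps or hurts.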

2. Methodological Rigor

Theoretical analysis is clean and well-grounded. The derivation of per-bin expected error under locally uniform density (Equation 2) is straightforward but provides genuine insight. The multiplicative accumulation argument (Equations 3-5) via orthogonal decomposition of quantized operands is mathematically sound, though the assumption that residual terms are incoherent with the signal deserves more scrutiny — this is likely violated for structured weight matrices.
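To see why a small per-layer bias matters, consider a deliberately simplified model. It is an illustration only and does not reproduce the paper's Equations 3-5: suppose each quantized layer scales the signal magnitude by a factor (1 − ε_l), where ε_l > 0 is that layer's relative shrinkage. The symbols ε_l, s_0, and s_L are introduced here for the example.

```latex
s_L \;=\; s_0 \prod_{l=1}^{L} (1 - \varepsilon_l)
\;\approx\; s_0 \exp\!\Big(-\sum_{l=1}^{L} \varepsilon_l\Big)
\qquad \text{for small } \varepsilon_l ,
\\[4pt]
\text{e.g. } \varepsilon_l = 0.01,\; L = 32:
\qquad \frac{s_L}{s_0} = 0.99^{32} \approx 0.725 .
```

Under this toy model a 1% shrinkage per layer compounds to roughly a 27% loss in magnitude after 32 layers, whereas a zero-mean error of the same size would not compound in this way.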

Empirical validation is comprehensive across multiple levels:

  • Tensor-level diagnostics: SQNR and effective bucket ratio measurements on real training tensors demonstrate the format-dependent RHT effect convincingly (Figures 2, 4-7).
  • GEMM-level analysis: Shows the pattern survives matrix multiplication, not just element-wise quantization.
  • End-to-end training: Dense 1.5B, MoE 7.9B, and MoE 124B models show consistent improvements. The 124B MoE experiment is particularly compelling for industrial relevance.
  • Scaling law analysis: Following established protocols, E1M2 curves remain below E2M1 across model sizes.
  • Ablation studies: Systematic ablation of RHT scope and SR scope (Table 2) cleanly isolates contributions.
  • Controlled baselines: The authors invest effort in tuning the E2M1 baseline through controlled one-factor ablations (Appendix B), avoiding the straw-man comparison trap. This significantly strengthens the claims.
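For readers unfamiliar with the tensor-level metric, SQNR is conventionally defined as 10·log10(Σx² / Σ(x − Q(x))²). The sketch below applies that definition to a single block quantized onto the E2M1 grid with a real-valued absmax scale. It is a simplified stand-in: the paper's block size, scale format, and its effective bucket ratio metric are not reproduced here. The function names are hypothetical, and ties go to the smaller level rather than following RTNE.

```python
import math

# Positive E2M1 magnitudes; the sign is restored after rounding.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(x, levels):
    """Absmax-scale a block onto the grid, round each magnitude to the nearest level, rescale."""
    absmax = max(abs(v) for v in x)
    if absmax == 0.0:
        return list(x)
    scale = absmax / levels[-1]  # real-valued scale (assumption: no FP8 scale rounding)
    out = []
    for v in x:
        mag = min(levels, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(mag * scale, v))
    return out


def sqnr_db(x, xq):
    """Signal-to-quantization-noise ratio in decibels."""
    signal = sum(v * v for v in x)
    noise = sum((v - q) ** 2 for v, q in zip(x, xq))
    return math.inf if noise == 0.0 else 10.0 * math.log10(signal / noise)


block = [6.0, 2.4, -1.2, 0.3]
quantized = quantize_block(block, E2M1)  # [6.0, 2.0, -1.0, 0.5]

if __name__ == "__main__":
    print(quantized, round(sqnr_db(block, quantized), 2))
```

Swapping `E2M1` for an equally spaced list of levels gives the uniform-grid counterpart, so the same two functions can compare formats on any block of values.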

    One limitation: the locally-uniform density assumption within bins (Equation 2) is a simplification. Post-RHT distributions are approximately Gaussian, and the actual bias depends on the density profile within each bin. However, the empirical results strongly corroborate the theoretical predictions.
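The dependence on the within-bin density can be made concrete. The sketch below compares the uniform-density bias for the E2M1 level 2 (rounding bin [1.75, 2.5]) with the bias when x follows a zero-mean Gaussian restricted to the same bin, using the closed-form truncated-normal mean. The choice σ = 2 in grid units is arbitrary and purely illustrative, as are the function names.

```python
import math


def normal_pdf(x, sigma):
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, sigma):
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def bin_bias_gaussian(level, lo, hi, sigma):
    """E[level - x | lo < x < hi] for x ~ N(0, sigma^2)."""
    mass = normal_cdf(hi, sigma) - normal_cdf(lo, sigma)
    # Integral of x * pdf(x) over [lo, hi] equals sigma^2 * (pdf(lo) - pdf(hi)).
    mean_x = sigma ** 2 * (normal_pdf(lo, sigma) - normal_pdf(hi, sigma)) / mass
    return level - mean_x


# Uniform density: the conditional mean is the bin midpoint, 2.125.
uniform_bias = 2.0 - 0.5 * (1.75 + 2.5)
# Gaussian density with an assumed sigma of 2 grid units.
gaussian_bias = bin_bias_gaussian(2.0, 1.75, 2.5, sigma=2.0)

if __name__ == "__main__":
    print(uniform_bias, gaussian_bias)
```

Because the Gaussian density decays across the bin, the conditional mean sits below the midpoint, so in this example the bias stays negative but is smaller in magnitude than the uniform-density value of −0.125.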

    3. Potential Impact

    Hardware design implications: The paper's most consequential claim is that future accelerators should support E1M2/INT4-style uniform grids as first-class training primitives. Given that NVIDIA Blackwell/Rubin and AMD MI350 are built around E2M1, this is a direct challenge to current industry direction. If validated at larger scales, this could influence next-generation accelerator ISA design — a high-stakes outcome.

    Practical training recipes: UFP4 is immediately applicable on hardware supporting uniform 4-bit computation. The HiFloat4/Ascend 960 connection noted by the authors provides a near-term platform. The fused RHT+quantization kernel demonstrates minimal overhead (1.06-1.07×), making the approach practical.

    Broader numerical format research: The shrinkage bias framework provides a reusable analytical tool for evaluating any non-uniform quantization format, applicable beyond FP4 to FP8 and other emerging formats.

    4. Timeliness & Relevance

    This paper arrives at a critical juncture. FP4 training is transitioning from research curiosity to hardware-supported reality with Blackwell. The industry is making multi-year hardware commitments based on E2M1. Identifying a fundamental limitation *before* these choices become irrevocable makes this work highly timely. The June 2026 date suggests it targets the design window for post-Rubin architectures.

    The paper also addresses the practical bottleneck of FP4 training instability, which has been documented but not satisfactorily explained. Providing a root-cause analysis rather than symptomatic fixes is valuable.

    5. Strengths & Limitations

    Key Strengths:

  • Root-cause identification: Rather than proposing yet another stabilization trick, the paper identifies a *format-level* source of error, offering an explanation with predictive power.
  • Multi-scale validation: From single-bin theory → tensor diagnostics → GEMM outputs → end-to-end training at 124B scale. This progression is convincing.
  • Clean experimental design: Matching all auxiliary configurations (block size, scale hierarchy, SR scope) between recipes isolates the grid-format variable.
  • Actionable hardware recommendation: The conclusion is concrete and falsifiable.
    Notable Limitations:

  • Residual BF16 gap: Even UFP4 incurs measurable degradation versus BF16 (0.97-1.85% relative loss error). The improvement over E2M1 is meaningful but modest in absolute terms.
  • Hardware availability: E1M2 support is not yet widely available on mainstream accelerators, limiting immediate adoption. The paper acknowledges this but it constrains near-term impact.
  • Range-restricted E2M1 failure: The negative result on emulating uniform grids via range restriction (Figure 10) is important but tested with relatively simple approaches. More sophisticated mapping schemes might bridge the gap on existing hardware.
  • Limited downstream evaluation: Only LM loss is reported; downstream task performance (reasoning, coding, etc.) is not evaluated.
  • Scale of improvement: The reduction in BF16-relative loss degradation (e.g., 1.26% → 0.97% on Dense 1.5B, about a 23% smaller gap) is consistent but not transformative, and it remains open whether the effect grows or shrinks at even larger scales.
  • Theoretical analysis: The locally-uniform density assumption is convenient but not rigorously justified for real tensor distributions.
    Additional Observations

    The paper's strength lies in combining theoretical elegance with industrial-scale empirical validation. The concept of "regime shift" from dynamic-range-limited to local-resolution-limited post-RHT is an important conceptual contribution that extends beyond FP4. The effective bucket ratio metric provides a useful diagnostic tool.

    The work is well-positioned as complementary to existing quantization improvements (adaptive rounding, tensor decomposition, etc.), which could be combined with uniform grids for further gains.

    Rating: 7.3 / 10
    Significance 7.5 · Rigor 7.5 · Novelty 7 · Clarity 8

    Generated Jun 19, 2026

    Comparison History (19)

    Lost vs. The Impossibility of Eliciting Latent Knowledge

    Paper 1 establishes a foundational impossibility theorem for Eliciting Latent Knowledge, a critical AI safety problem. While Paper 2 offers highly timely advancements for LLM hardware efficiency, its impact is tied to current quantization paradigms. In contrast, Paper 1 provides a mathematically rigorous, timeless theoretical bound on AI alignment, proving behavioral feedback alone cannot guarantee AI honesty. This fundamentally shifts how researchers must approach AGI safety, mechanistic interpretability, and model evaluation, offering broader and more enduring long-term scientific impact than a transient systems optimization.

    gemini-3.1-pro-preview · Jun 19, 2026

    Won vs. SoftSkill: Behavioral Compression for Contextual Adaptation

    Paper 1 has higher impact due to its profound implications for LLM pretraining and future AI hardware design. By identifying 'Shrinkage Bias' in standard E2M1 formats and proving the superiority of uniform grids at massive scales (up to 124B parameters), it directly challenges current paradigms for next-gen accelerators (NVIDIA Blackwell/AMD MI350). Enabling stable 4-bit training offers immense compute and memory savings globally. In contrast, Paper 2 presents a useful but narrower context-compression technique for agents, operating at a smaller scale (4B parameters) and building heavily on existing soft-prompting methods.

    gemini-3.1-pro-preview · Jun 19, 2026

    Won vs. PhysDrift: Bridging the Embodiment Gap in Humanoid Co-Speech Motion Generation

    Paper 2 demonstrates higher potential scientific impact due to its industry-wide implications for LLM pretraining and hardware design. By identifying 'Shrinkage Bias' in E2M1 formats and proposing the UFP4 recipe, it challenges current AI hardware assumptions (NVIDIA Blackwell) and offers a scalable solution for 4-bit training. While Paper 1 presents a valuable robotics contribution, Paper 2 addresses a critical computational bottleneck for frontier AI, scaling successfully to 124B parameters. This promises massive efficiency gains and broader fundamental impact across the deep learning systems community.

    gemini-3.1-pro-preview · Jun 19, 2026

    Won vs. The Tao of Agency: Autotelic AI, Embedded Agency and Dissolution of the Self

    Paper 2 addresses a critical bottleneck in LLM pretraining by identifying a mathematical flaw (Shrinkage Bias) in current FP4 hardware paths. Its proposed UFP4 recipe, validated on massive 124B parameter models, offers immense, immediate real-world utility and methodological rigor that will directly influence future AI hardware and foundation model training. While Paper 1 is philosophically profound and conceptually novel regarding autonomous AI, it is highly theoretical and lacks the immediate, quantifiable empirical and industrial impact of Paper 2.

    gemini-3.1-pro-preview · Jun 19, 2026

    Won vs. What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

    Paper 2 identifies a fundamental mathematical limitation (Shrinkage Bias) in current FP4 hardware designs used by major GPU manufacturers (NVIDIA, AMD), proposes a concrete solution (UFP4), and validates it at scale up to 124B parameters. Its impact spans hardware architecture design, numerical methods, and LLM training efficiency—directly influencing next-generation accelerator design decisions. Paper 1 provides useful characterization of jailbreaking mechanisms but is more incremental, building on known demonstration-based jailbreaking with primarily empirical observations across four models, with narrower implications for the field.

    claude-opus-4-6 · Jun 19, 2026

    Won vs. The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary

    Paper 2 likely has higher scientific impact due to strong real-world applicability and timeliness: FP4 pretraining is an immediate bottleneck for frontier-scale LLMs, and a recipe (UFP4) plus hardware implications (uniform grids as primitives) can influence both training practice and accelerator design. It offers a clear mechanistic explanation (geometric shrinkage bias, RHT interaction), validated via long-run pretraining across multiple model scales with ablations and scaling-law analysis. Paper 1 is novel and useful for agent design, but its claims hinge on architectural-theorem framing that may be harder to generalize and translate into broad, near-term adoption.

    gpt-5.2 · Jun 19, 2026

    Won vs. Leveraging systems' non-linearity to tackle the scarcity of data in the design of Intelligent Fault Diagnosis Systems

    Paper 2 addresses a fundamental limitation in FP4 training for LLMs—a highly timely topic given the rapid scaling of LLM pretraining and the introduction of next-generation GPU architectures (Blackwell/Rubin, MI350). It identifies a novel theoretical insight (Shrinkage Bias from geometric asymmetry), proposes a practical recipe (UFP4) validated at scale (up to 124B parameters), and provides actionable hardware design recommendations. Its breadth of impact spans ML systems, hardware architecture, and numerical methods. Paper 1 addresses a narrower niche (fault diagnosis with limited data) with more incremental contributions in transfer learning and data augmentation.

    claude-opus-4-6 · Jun 19, 2026

    Won vs. Grounded Inference: Principles for Deterministically Encapsulated Generative Models

    Paper 1 presents a novel, rigorous analysis of a specific technical problem (shrinkage bias in FP4 training) with concrete mathematical foundations, empirical validation across multiple model scales, and actionable hardware design implications. It addresses a timely problem in LLM efficiency with quantitative results and scaling-law analysis. Paper 2 proposes a conceptual framework for integrating generative models into systems but lacks empirical validation, formal rigor, and specific technical depth—reading more as an opinion/position piece with abstract design principles rather than a scientifically grounded contribution.

    claude-opus-4-6 · Jun 19, 2026

    Won vs. Lagrange: An Open-Vocabulary, Energy-Based Sparse Framework for Generalized End-to-End Driving

    Paper 2 likely has higher scientific impact due to a broadly applicable, timely contribution to LLM pretraining efficiency on emerging FP4 hardware. It identifies a fundamental, format-geometric source of training instability (shrinkage bias), offers a unified explanation tied to RHT, and proposes a concrete, validated recipe (UFP4) demonstrated at large scales (up to MoE 124B) with ablations and scaling-law analysis. The findings can influence both software training recipes and future accelerator design, affecting many models and domains. Paper 1 is innovative but more domain-specific and harder to translate into widespread practice.

    gpt-5.2 · Jun 19, 2026

    Lost vs. MiniMax Sparse Attention

    Paper 2 has higher likely impact: it addresses a widely felt bottleneck (million-token attention) with an end-to-end solution spanning algorithm + GPU kernel, demonstrated at 109B scale and released code/model, making near-term adoption and downstream applications (agents, codebases, long-term memory, multimodal) highly plausible. Its relevance is immediate for deployment economics and could influence many long-context systems. Paper 1 is novel and important for FP4 training stability and hardware design, but its applicability depends more on specific quantization formats/hardware support and is narrower than long-context inference.

    gpt-5.2 · Jun 19, 2026