ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Sihwa Lee, Janghwan Lee, Donghoon Yoo, Jae Gon Kim, Hanyul Ryu, Soojung Ryu, Jungwook Choi

cs.LG · cs.AI
#2212 of 5669 · cs.LG
Tournament Score: 1427 ± 44 (scale: 1050–1750)
Win Rate: 63%
Wins: 12
Losses: 7
Matches: 19
Rating: 6.5 / 10
Significance: 6.5
Rigor: 7
Novelty: 6
Clarity: 7.5

Abstract

Large reasoning models (LRMs) improve complex problem-solving by generating long intermediate reasoning traces, but this substantially increases inference costs. NVFP4 inference offers a promising approach to reduce both computational and memory costs through hardware-supported low-precision execution. However, directly applying NVFP4 to LRMs introduces two practical limitations: reasoning accuracy degrades under quantization, and existing NVFP4 kernels do not fully realize latency benefits in small-batch autoregressive decoding. In this work, we analyze the effect of NVFP4 quantization on token-level uncertainty during reasoning. We show that quantization increases incorrect sampling at low-entropy symbolic tokens, while causing over-concentration on a small set of tokens in high-uncertainty reasoning steps. Based on this observation, we propose ReSET, a reasoning-step entropy-based temperature-scaling method that estimates step-level uncertainty online and adapts the decoding temperature using both token-level and step-level entropy signals. To address the latency gap, we further design a CUDA-core small-M NVFP4 kernel for latency-critical autoregressive decoding. Across reasoning benchmarks and model scales, ReSET improves NVFP4 reasoning accuracy by up to ~2 points over the NVFP4 baseline. Our CUDA-core small-M kernel further improves latency-critical decoding, delivering up to 2.5× kernel-level speedup over NVFP4 vLLM and approximately 2× end-to-end decoding speedup over BF16. Code is available at https://github.com/aiha-lab/ReSET.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: ReSET

1. Core Contribution

ReSET addresses two concrete obstacles in deploying large reasoning models (LRMs) with NVFP4 quantization: (a) accuracy degradation during quantized reasoning, and (b) latency inefficiency of existing NVFP4 kernels at small batch sizes typical of production decode settings.

The accuracy contribution is a step-aware temperature scaling method that adaptively adjusts decoding temperature based on reasoning-step-level entropy rather than token-level entropy alone. The key insight is that token-level entropy is dominated by the uncertainty of the surrounding reasoning step — a symbolic token (e.g., a digit) might appear high-entropy simply because it sits within an uncertain reasoning step, causing naive token-level thresholds to miss critical sharpening opportunities. ReSET uses a hybrid online estimator combining sliding-window initialization with causal within-step averaging to track step entropy during autoregressive generation.
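A minimal sketch of how such an estimator could be wired into a decoding loop, under assumptions the assessment does not spell out. The names `StepEntropyTracker` and `choose_temperature`, the default values, and in particular the rule that compares token entropy against `tau0` times the step estimate are hypothetical illustrations, not the authors' implementation.

```python
import math
from collections import deque


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class StepEntropyTracker:
    """Hypothetical online step-entropy estimator (illustration only)."""

    def __init__(self, window=8):
        self.recent = deque(maxlen=window)  # last `window` token entropies
        self.step_sum = 0.0
        self.step_count = 0

    def start_step(self):
        # Sliding-window initialization: seed the new step with the mean
        # entropy of recent tokens so early estimates are less noisy.
        if self.recent:
            self.step_sum = sum(self.recent) / len(self.recent)
            self.step_count = 1
        else:
            self.step_sum, self.step_count = 0.0, 0

    def update(self, h):
        # Causal within-step averaging: only tokens seen so far contribute.
        self.recent.append(h)
        self.step_sum += h
        self.step_count += 1
        return self.step_sum / self.step_count


def choose_temperature(h_token, h_step, tau0=0.5, t_low=0.3, t_high=1.0):
    # Assumed rule: sharpen when the token is confident relative to its step.
    return t_low if h_token < tau0 * h_step else t_high
```

In a decoding loop, `start_step()` would be called whenever a step delimiter is emitted and `update()` once per token before sampling, so the temperature choice sees both the token's own entropy and the running step estimate.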

The latency contribution is a CUDA-core NVFP4 GEMV kernel for small-M (batch size 1–8) decoding. The paper correctly identifies that Blackwell Tensor Cores require M=128 tile sizes, leading to <1% utilization at production batch sizes. The custom kernel uses multi-token CTA fusion, multi-accumulator threading, and register-only dequantization to avoid the tile under-occupancy problem entirely.
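The under-occupancy argument can be checked with simple arithmetic. The sketch below assumes that a tile with 128 rows does useful work on only M of them, so row occupancy is at most M/128. The helper `tile_row_occupancy` is hypothetical, and measured Tensor Core utilization depends on further factors that this bound ignores.

```python
def tile_row_occupancy(m, tile_m=128):
    """Upper bound on the fraction of tile rows doing useful work."""
    if m <= 0 or tile_m <= 0:
        raise ValueError("m and tile_m must be positive")
    return min(m, tile_m) / tile_m


# Print the bound for the small-M range discussed above.
for m in (1, 2, 4, 8):
    print(f"M={m}: {tile_row_occupancy(m):.2%}")
```

At M = 1 the bound is 1/128, roughly 0.78%, which is in line with the <1% figure quoted above; larger M within the 1–8 range raises the bound proportionally.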

2. Methodological Rigor

Analysis quality. The paper's diagnostic analysis is well-structured. The progression from observing that quantization increases incorrect sampling at low-entropy symbolic tokens (Sec 3.1), to showing a naive fix partially works (Sec 3.2), to demonstrating why it's incomplete due to step-level entropy domination (Sec 3.3), is logically compelling. The 1.5M token analysis from R1-Qwen-14B across 90 AIME problems provides adequate statistical grounding.

Experimental evaluation. The evaluation spans five models across two families (R1-Distill-Qwen and Qwen3), three benchmarks (AIME-120, GPQA-Diamond, LiveCodeBench), and 8 random seeds per configuration. This is reasonably thorough. Comparisons against four PTQ baselines (RTN, BRQ, 4/6, MR-GPTQ) provide a fair competitive landscape. The ablation studies (Tables 5, 6, 9, 10, 13) systematically validate design choices.

Concerns. The accuracy improvements, while consistent, are modest — up to ~2.6 points on AIME-120 but often smaller on GPQA-Diamond and LiveCodeBench. On LiveCodeBench, the NVFP4+ReSET result (40.2 avg) still falls below the RTN baseline for some individual models (e.g., R1-Qwen-7B: 28.4 vs 29.5 RTN). The claim that ReSET "dominates the NVFP4 PTQ frontier" is slightly overstated for individual model-task pairs, though it holds on average. The paper acknowledges that combining ReSET with GPTQ provides minimal additional benefit (Table 12), which somewhat limits the extensibility argument.

3. Potential Impact

Practical deployment. This work directly addresses a real production bottleneck. The observation that NVFP4's headline 4× throughput advantage collapses at SLO-feasible batch sizes is important for practitioners deploying reasoning models. The ~2× end-to-end speedup over BF16 on Qwen3-32B at batch size 1 is a significant practical result.

CUDA-core NVFP4 kernel. The authors claim this is the first public CUDA-core NVFP4 GEMV implementation, which could serve as a reference for other low-precision kernel developers. The 1.57–2.49× kernel-level speedup over vLLM-CUTLASS is substantial.

Step-level entropy as a control signal. The conceptual contribution that step-level entropy is the appropriate granularity for reasoning-time interventions could influence future work on adaptive decoding, not just for quantized models but potentially for full-precision reasoning as well.

Scope limitations. Impact is bounded to NVIDIA Blackwell hardware and NVFP4 specifically. The method's effectiveness on other quantization formats (e.g., INT4, MXFP4) or hardware platforms is unknown. The approach is also specific to reasoning models with structured chain-of-thought traces.

4. Timeliness & Relevance

This paper is highly timely. LRMs (DeepSeek-R1, Qwen3, OpenAI o-series) are rapidly proliferating, and inference-time scaling makes their serving costs a first-order concern. NVIDIA's Blackwell architecture with native NVFP4 support is the current generation hardware. The gap between NVFP4's theoretical benefits and practical realization at production batch sizes is an immediate pain point that this work addresses directly.

The paper is positioned at the intersection of two active research threads: efficient LLM inference and reasoning model deployment, making it relevant to both the systems and ML communities.

5. Strengths & Limitations

Key Strengths:

  • Problem identification is sharp: The Tensor-Core utilization collapse at small M (Fig. 1) and the step-level entropy domination of token entropy (Fig. 3) are clearly demonstrated insights.
  • Minimal overhead: ReSET adds ~1.5% per-decode-step overhead (~100μs on Qwen3-32B), making it deployment-ready.
  • Complementary contributions: The accuracy (ReSET) and latency (CUDA kernel) contributions address orthogonal bottlenecks and compose naturally.
  • Code availability: Public release at GitHub enables reproducibility.
  • Hyperparameter robustness: Window size w, temperatures T_low/T_high, and threshold τ_0 show stable behavior across reasonable ranges.

Notable Weaknesses:

  • Modest accuracy gains on some benchmarks: GPQA-Diamond improvement is ~1.3 points average; LiveCodeBench shows only ~1.0 point average gain and inconsistent per-model improvements.
  • Step boundary definition is simplistic: Using "\n\n" delimiters as step boundaries is heuristic and may not generalize to models with different formatting conventions.
  • Limited to reasoning models: The entropy dynamics analyzed here are specific to CoT reasoning; applicability to general LLM tasks is unexplored.
  • Single hardware platform: All results are on B200; generalization to other GPUs or quantization formats is not evaluated.
  • The two contributions are somewhat disjoint: While presented as a unified system, the temperature scaling and kernel design are independent contributions that could have been separate papers. Their joint evaluation (Fig. 6) shows they compose but doesn't demonstrate synergy.
  • No comparison to other adaptive decoding methods beyond basic top-p/min-p sweeps; methods like SEAL [36] are cited but not compared against experimentally.
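The step-boundary concern listed among the weaknesses can be illustrated with a small sketch. The function `split_steps` and both example traces are made up for illustration; the assessment only states that "\n\n" is used as the delimiter.

```python
def split_steps(trace, delimiter="\n\n"):
    """Split a reasoning trace into steps on a blank-line delimiter."""
    return [s.strip() for s in trace.split(delimiter) if s.strip()]


# Hypothetical trace that separates steps with blank lines.
blank_line_trace = "Let x = 3.\n\nThen 2x = 6.\n\nSo the answer is 6."
# Hypothetical trace that uses numbered lines with single newlines.
numbered_trace = "1. Let x = 3.\n2. Then 2x = 6.\n3. So the answer is 6."

print(len(split_steps(blank_line_trace)))  # 3 steps
print(len(split_steps(numbered_trace)))    # 1 step: no blank lines to split on
```

A model that formats its reasoning like the second trace would be treated as producing a single long step, so any step-level statistic would average over the whole trace.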

Overall Assessment

    ReSET makes a solid engineering and analytical contribution to NVFP4 reasoning model deployment. The step-level entropy insight is well-motivated and the CUDA-core kernel fills a real gap. However, the accuracy improvements are modest on some benchmarks, and the work's applicability is narrowly scoped to NVFP4 on Blackwell hardware. It represents a meaningful incremental advance for the LLM inference systems community rather than a paradigm-shifting contribution.

    Rating: 6.5 / 10
    Significance 6.5 · Rigor 7 · Novelty 6 · Clarity 7.5

    Generated Jun 12, 2026

    Comparison History (19)

    Lost vs. VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

    VideoMDM addresses a fundamental challenge in 3D human motion generation—eliminating the need for 3D ground truth supervision—which could broadly impact motion synthesis, animation, and embodied AI. The theoretical contribution (showing depth-weighted 2D reprojection loss equivalence to 3D supervision) is novel and generalizable. Paper 2, while practically useful, addresses a narrower optimization problem (NVFP4 quantization for reasoning models) that is more incremental and tied to specific hardware. VideoMDM's ability to learn from abundant monocular video data opens significantly wider research directions and real-world applications.

    claude-opus-4-6 · Jun 12, 2026

    Won vs. Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

    Paper 2 addresses the critical bottleneck of high inference costs in large reasoning models. By enabling accurate NVFP4 quantization and significantly improving decoding latency, it offers substantial systemic improvements for deploying advanced AI at scale. While Paper 1 provides a highly practical UX improvement for coding agents, Paper 2's fundamental infrastructure optimization has a broader and more immediate impact across all applications relying on resource-intensive reasoning models.

    gemini-3.1-pro-preview · Jun 12, 2026

    Won vs. PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update

    Paper 2 targets the highly critical and timely bottleneck of Large Reasoning Model (LRM) inference costs. By enabling accurate, latency-critical NVFP4 quantization and providing a custom CUDA kernel, it directly impacts the scalability and deployment of cutting-edge AI models across the massive LLM ecosystem. While Paper 1 presents an elegant solution for safety-critical control systems, Paper 2's focus on foundational model efficiency addresses a much broader and immediate industrial and research need, giving it higher potential for widespread scientific and practical impact.

    gemini-3.1-pro-preview · Jun 12, 2026

    Lost vs. The Geometry of Phase Transitions in Generative Dynamics via Projection Caustics

    Paper 2 offers a more novel, broadly applicable theoretical framework: a geometric explanation of abrupt qualitative changes in diffusion/flow dynamics via projection caustics, plus a general diagnostic (CBD) that can guide interventions and control. This can impact multiple areas (generative modeling theory, diffusion training/sampling, controllability, robustness) and is timely given widespread diffusion use. Paper 1 is strong and practical for efficient LLM inference, but its contributions (temperature scaling heuristics + custom NVFP4 kernel) are more incremental and narrower in scope, with impact tied to specific hardware/precision stacks.

    gpt-5.2 · Jun 12, 2026

    Won vs. Reinforcement Learning for Flow-Matching Policies with Density Transport

    Paper 2 addresses a critical bottleneck in the current AI landscape: the high inference cost and latency of Large Reasoning Models. By proposing hardware-aware optimizations for the emerging NVFP4 standard, it offers immediate, widespread practical applications for deploying large language models. While Paper 1 introduces a mathematically elegant approach to continuous control, Paper 2's focus on LLM efficiency, hardware-software co-design, and latency-critical decoding ensures broader and more immediate impact across both academia and industry.

    gemini-3.1-pro-preview · Jun 12, 2026

    Won vs. Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

    Paper 2 addresses a critical bottleneck in deploying Large Reasoning Models by optimizing low-precision NVFP4 inference. Its combination of algorithmic innovation (entropy-based temperature scaling) and systems engineering (custom CUDA kernel) delivers tangible improvements in both accuracy and speed. This provides immediate, high-impact practical utility for real-world AI deployment, whereas Paper 1 offers more theoretical and specialized analytical insights into the mechanics of model distillation.

    gemini-3.1-pro-preview · Jun 12, 2026

    Won vs. Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios

    Paper 2 has higher likely impact: it targets a broad, timely bottleneck (efficient inference for large reasoning models) with immediate applicability across many LLM deployments. It contributes both an algorithmic fix (step-aware entropy temperature scaling to recover accuracy under NVFP4) and a systems advance (new small-batch CUDA kernel) with strong reported latency gains, increasing adoption potential. Paper 1 is novel and rigorous but is more domain-specific (power systems forecasting benchmarks/metrics), likely narrowing breadth and immediate cross-field uptake.

    gpt-5.2 · Jun 12, 2026

    Lost vs. Extracting Governing Equations from Latent Dynamics via Multi-View Contrastive Learning

    Paper 1 advances AI for Science by enabling the symbolic recovery of governing equations from noisy, high-dimensional data. Its theoretical guarantees and applicability across disciplines, such as physics and neuroscience, provide a fundamental contribution to scientific discovery. While Paper 2 offers significant practical advancements for large language model inference efficiency, Paper 1 demonstrates broader, longer-term potential for cross-disciplinary scientific impact and methodological innovation.

    gemini-3.1-pro-preview · Jun 12, 2026

    Won vs. Adjusted Cup-Product Neural Layer

    Paper 1 addresses a highly critical and timely bottleneck in modern AI: the inference cost and latency of Large Reasoning Models. By providing tangible accuracy and speed improvements for NVFP4 execution, it has immediate, massive real-world applications across the AI industry. Paper 2 presents an elegant theoretical integration of gauge theory into neural layers, but its impact is likely confined to a narrower niche of physics-informed machine learning compared to the broad, urgent applicability of Paper 1.

    gemini-3.1-pro-preview · Jun 12, 2026

    Won vs. N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

    Paper 1 addresses a critical bottleneck in deploying Large Reasoning Models: inference cost and latency. By combining algorithmic innovation (step-aware temperature scaling) with a system-level optimization (custom CUDA NVFP4 kernel), it provides substantial hardware acceleration for next-generation GPUs. While Paper 2 offers a novel RL exploration strategy, Paper 1's full-stack approach to making massive reasoning models computationally feasible gives it broader and more immediate real-world impact.

    gemini-3.1-pro-preview · Jun 12, 2026