ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Sihwa Lee, Janghwan Lee, Donghoon Yoo, Jae Gon Kim, Hanyul Ryu, Soojung Ryu, Jungwook Choi

Jun 11, 2026 · arXiv:2606.13233v1

cs.LG · cs.AI

#2212 of 5669 · cs.LG

Tournament Score: 1427±44

Win Rate: 63%

Rating: 6.5 / 10

Significance 6.5 · Rigor 7 · Novelty 6 · Clarity 7.5

Abstract

Large reasoning models (LRMs) improve complex problem-solving by generating long intermediate reasoning traces, but this substantially increases inference costs. NVFP4 inference offers a promising approach to reduce both computational and memory costs through hardware-supported low-precision execution. However, directly applying NVFP4 to LRMs introduces two practical limitations: reasoning accuracy degrades under quantization, and existing NVFP4 kernels do not fully realize latency benefits in small-batch autoregressive decoding. In this work, we analyze the effect of NVFP4 quantization on token-level uncertainty during reasoning. We show that quantization increases incorrect sampling at low-entropy symbolic tokens, while causing over-concentration on a small set of tokens in high-uncertainty reasoning steps. Based on this observation, we propose ReSET, a reasoning-step entropy-based temperature-scaling method that estimates step-level uncertainty online and adapts the decoding temperature using both token-level and step-level entropy signals. To address the latency gap, we further design a CUDA-core small-M NVFP4 kernel for latency-critical autoregressive decoding. Across reasoning benchmarks and model scales, ReSET improves NVFP4 reasoning accuracy by up to ~2 points over the NVFP4 baseline. Our CUDA-core small-M kernel further improves latency-critical decoding, delivering up to 2.5× kernel-level speedup over NVFP4 vLLM and approximately 2× end-to-end decoding speedup over BF16. Code is available at https://github.com/aiha-lab/ReSET.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: ReSET

1. Core Contribution

ReSET addresses two concrete obstacles in deploying large reasoning models (LRMs) with NVFP4 quantization: (a) accuracy degradation during quantized reasoning, and (b) latency inefficiency of existing NVFP4 kernels at small batch sizes typical of production decode settings.

The accuracy contribution is a step-aware temperature scaling method that adaptively adjusts decoding temperature based on reasoning-step-level entropy rather than token-level entropy alone. The key insight is that token-level entropy is dominated by the uncertainty of the surrounding reasoning step — a symbolic token (e.g., a digit) might appear high-entropy simply because it sits within an uncertain reasoning step, causing naive token-level thresholds to miss critical sharpening opportunities. ReSET uses a hybrid online estimator combining sliding-window initialization with causal within-step averaging to track step entropy during autoregressive generation.
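To make the step-aware idea concrete, here is a minimal Python sketch under stated assumptions. It is not the authors' implementation. The class `StepEntropyTracker`, the function `choose_temperature`, the `window` and `min_step_tokens` parameters, and the threshold rule `max(tau0, step_entropy)` are all hypothetical names and choices introduced for illustration. Only the general ingredients come from the text: a sliding-window estimate at the start of a step, a causal running mean within the step, "\n\n" as the step delimiter, and a low/high temperature pair with a base threshold.

```python
import math
from collections import deque


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class StepEntropyTracker:
    """Hypothetical hybrid online estimator of reasoning-step entropy.

    Assumption: early in a step, fall back to the mean of the last `window`
    token entropies; once `min_step_tokens` tokens of the current step have
    been seen, use the causal mean over the current step only.
    """

    def __init__(self, window=16, min_step_tokens=4):
        self.recent = deque(maxlen=window)   # sliding window across steps
        self.step_sum = 0.0                  # running sum within the step
        self.step_count = 0                  # tokens seen in the step
        self.min_step_tokens = min_step_tokens

    def update(self, entropy, token_text):
        """Consume one generated token and return the step-entropy estimate."""
        self.recent.append(entropy)
        self.step_sum += entropy
        self.step_count += 1
        if self.step_count >= self.min_step_tokens:
            estimate = self.step_sum / self.step_count
        else:
            estimate = sum(self.recent) / len(self.recent)
        # Assumed step boundary: a token containing a blank-line delimiter.
        if "\n\n" in token_text:
            self.step_sum, self.step_count = 0.0, 0
        return estimate


def choose_temperature(tok_entropy, step_entropy, t_low, t_high, tau0):
    """Illustrative rule only: raise the sharpening threshold in uncertain steps.

    A token whose entropy is below max(tau0, step_entropy) is treated as
    relatively confident for its step and sampled at t_low; otherwise t_high.
    """
    threshold = max(tau0, step_entropy)
    return t_low if tok_entropy < threshold else t_high
```

With these illustrative settings, a token of entropy 0.8 inside a step whose estimated entropy is 1.2 is sharpened (`t_low`), whereas a fixed token-level threshold of `tau0 = 0.5` alone would have left it at `t_high`. That is the missed sharpening opportunity the review describes; the actual thresholds and combination rule in the paper may differ.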

The latency contribution is a CUDA-core NVFP4 GEMV kernel for small-M (batch size 1–8) decoding. The paper correctly identifies that Blackwell Tensor Cores require M=128 tile sizes, leading to <1% utilization at production batch sizes. The custom kernel uses multi-token CTA fusion, multi-accumulator threading, and register-only dequantization to avoid the tile under-occupancy problem entirely.
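The tile under-occupancy argument can be checked with simple arithmetic. The sketch below assumes a deliberately simplified model in which utilization is the fraction of rows in an M=128 tile that hold real tokens. The helper `tile_row_utilization` is hypothetical, and the paper's own utilization metric may be defined differently.

```python
def tile_row_utilization(m, tile_m=128):
    """Fraction of tile rows occupied by real tokens (simplified model).

    Assumption: the M dimension is padded up to a multiple of `tile_m`,
    and every padded row is wasted work.
    """
    padded_m = -(-m // tile_m) * tile_m  # round m up to a multiple of tile_m
    return m / padded_m


if __name__ == "__main__":
    for m in (1, 2, 4, 8):
        print(f"M={m}: {tile_row_utilization(m) * 100:.4f}% of tile rows used")
```

Under this model, batch size 1 uses 1/128 ≈ 0.78% of the tile rows and batch size 8 uses 8/128 = 6.25%, so the "<1%" figure corresponds to M=1 here. A GEMV-style kernel on CUDA cores sidesteps the padding entirely because it has no fixed M tile to fill.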

2. Methodological Rigor

Analysis quality. The paper's diagnostic analysis is well-structured. The progression from observing that quantization increases incorrect sampling at low-entropy symbolic tokens (Sec 3.1), to showing a naive fix partially works (Sec 3.2), to demonstrating why it's incomplete due to step-level entropy domination (Sec 3.3), is logically compelling. The 1.5M token analysis from R1-Qwen-14B across 90 AIME problems provides adequate statistical grounding.

Experimental evaluation. The evaluation spans five models across two families (R1-Distill-Qwen and Qwen3), three benchmarks (AIME-120, GPQA-Diamond, LiveCodeBench), and 8 random seeds per configuration. This is reasonably thorough. Comparisons against four PTQ baselines (RTN, BRQ, 4/6, MR-GPTQ) provide a fair competitive landscape. The ablation studies (Tables 5, 6, 9, 10, 13) systematically validate design choices.

Concerns. The accuracy improvements, while consistent, are modest — up to ~2.6 points on AIME-120 but often smaller on GPQA-Diamond and LiveCodeBench. On LiveCodeBench, the NVFP4+ReSET result (40.2 avg) still falls below the RTN baseline for some individual models (e.g., R1-Qwen-7B: 28.4 vs 29.5 RTN). The claim that ReSET "dominates the NVFP4 PTQ frontier" is slightly overstated for individual model-task pairs, though it holds on average. The paper acknowledges that combining ReSET with GPTQ provides minimal additional benefit (Table 12), which somewhat limits the extensibility argument.

3. Potential Impact

Practical deployment. This work directly addresses a real production bottleneck. The observation that NVFP4's headline 4× throughput advantage collapses at SLO-feasible batch sizes is important for practitioners deploying reasoning models. The ~2× end-to-end speedup over BF16 on Qwen3-32B at batch size 1 is a significant practical result.

CUDA-core NVFP4 kernel. The authors claim this is the first public CUDA-core NVFP4 GEMV implementation, which could serve as a reference for other low-precision kernel developers. The 1.57–2.49× kernel-level speedup over vLLM-CUTLASS is substantial.

Step-level entropy as a control signal. The conceptual contribution that step-level entropy is the appropriate granularity for reasoning-time interventions could influence future work on adaptive decoding, not just for quantized models but potentially for full-precision reasoning as well.

Scope limitations. Impact is bounded to NVIDIA Blackwell hardware and NVFP4 specifically. The method's effectiveness on other quantization formats (e.g., INT4, MXFP4) or hardware platforms is unknown. The approach is also specific to reasoning models with structured chain-of-thought traces.

4. Timeliness & Relevance

This paper is highly timely. LRMs (DeepSeek-R1, Qwen3, OpenAI o-series) are rapidly proliferating, and inference-time scaling makes their serving costs a first-order concern. NVIDIA's Blackwell architecture with native NVFP4 support is the current generation hardware. The gap between NVFP4's theoretical benefits and practical realization at production batch sizes is an immediate pain point that this work addresses directly.

The paper is positioned at the intersection of two active research threads: efficient LLM inference and reasoning model deployment, making it relevant to both the systems and ML communities.

5. Strengths & Limitations

Key Strengths:

Problem identification is sharp: The Tensor-Core utilization collapse at small M (Fig. 1) and the step-level entropy domination of token entropy (Fig. 3) are clearly demonstrated insights.

Minimal overhead: ReSET adds ~1.5% per-decode-step overhead (~100μs on Qwen3-32B), making it deployment-ready.

Complementary contributions: The accuracy (ReSET) and latency (CUDA kernel) contributions address orthogonal bottlenecks and compose naturally.

Code availability: Public release at GitHub enables reproducibility.

Hyperparameter robustness: Window size w, temperatures T_low/T_high, and threshold τ_0 show stable behavior across reasonable ranges.

Notable Weaknesses:

Modest accuracy gains on some benchmarks: GPQA-Diamond improvement is ~1.3 points average; LiveCodeBench shows only ~1.0 point average gain and inconsistent per-model improvements.

Step boundary definition is simplistic: Using "\n\n" delimiters as step boundaries is heuristic and may not generalize to models with different formatting conventions.

Limited to reasoning models: The entropy dynamics analyzed here are specific to CoT reasoning; applicability to general LLM tasks is unexplored.

Single hardware platform: All results are on B200; generalization to other GPUs or quantization formats is not evaluated.

The two contributions are somewhat disjoint: While presented as a unified system, the temperature scaling and kernel design are independent contributions that could have been separate papers. Their joint evaluation (Fig. 6) shows they compose but doesn't demonstrate synergy.

No comparison to other adaptive decoding methods beyond basic top-p/min-p sweeps; methods like SEAL [36] are cited but not compared against experimentally.

Overall Assessment

ReSET makes a solid engineering and analytical contribution to NVFP4 reasoning model deployment. The step-level entropy insight is well-motivated and the CUDA-core kernel fills a real gap. However, the accuracy improvements are modest on some benchmarks, and the work's applicability is narrowly scoped to NVFP4 on Blackwell hardware. It represents a meaningful incremental advance for the LLM inference systems community rather than a paradigm-shifting contribution.

Rating: 6.5 / 10

Significance 6.5 · Rigor 7 · Novelty 6 · Clarity 7.5

Generated Jun 12, 2026

Comparison History (19)

Lost vs. VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

VideoMDM addresses a fundamental challenge in 3D human motion generation—eliminating the need for 3D ground truth supervision—which could broadly impact motion synthesis, animation, and embodied AI. The theoretical contribution (showing depth-weighted 2D reprojection loss equivalence to 3D supervision) is novel and generalizable. Paper 2, while practically useful, addresses a narrower optimization problem (NVFP4 quantization for reasoning models) that is more incremental and tied to specific hardware. VideoMDM's ability to learn from abundant monocular video data opens significantly wider research directions and real-world applications.

claude-opus-4-6 · Jun 12, 2026

Won vs. Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

Paper 2 addresses the critical bottleneck of high inference costs in large reasoning models. By enabling accurate NVFP4 quantization and significantly improving decoding latency, it offers substantial systemic improvements for deploying advanced AI at scale. While Paper 1 provides a highly practical UX improvement for coding agents, Paper 2's fundamental infrastructure optimization has a broader and more immediate impact across all applications relying on resource-intensive reasoning models.