
Optimal Post-Training Quantization Scales and Where to Find Them

Juan Amboage, Pablo Monteagudo-Lago, Ian Colbert, Giuseppe Franco, Nicholas Fraser

cs.LG · cs.AI
#1205 of 5669 · cs.LG
Tournament Score: 1464 ± 44 (displayed scale: 1050 to 1750)
Win Rate: 67% (12 wins, 6 losses, 18 matches)
Rating: 7.2 / 10
Significance 7 · Rigor 8.5 · Novelty 7.5 · Clarity 8.5

Abstract

Post-training quantization (PTQ) compresses large language models by mapping weights to low-bit representations. The scaling factor that defines the quantization grid is typically chosen using simple, data-free heuristics. In this work, we present PiSO (Piecewise Scale Optimization), an algorithm that leverages calibration data to compute the optimal channel-wise weight scales exactly and efficiently under round-to-nearest quantization. PiSO partitions the scale search space into finitely many intervals on which the objective admits a closed-form minimizer. We extend PiSO to group-wise quantization via principled heuristics and propose effective strategies for interleaving scale optimization with error correction. Experiments on Llama and Qwen models across multiple model sizes and target weight bit-widths demonstrate consistent improvements in perplexity and downstream zero-shot accuracy, both standalone and combined with error correction. In particular, we observe increased benefits as the target bit-width narrows and quantization becomes more challenging.

AI Impact Assessments (1 model)

Scientific Impact Assessment: PiSO — Optimal Post-Training Quantization Scales

1. Core Contribution

PiSO addresses a surprisingly under-explored component of PTQ pipelines: the scaling factor that defines the quantization grid. While significant effort has gone into error correction (GPTQ, Qronos) and rounding decisions, the scale is typically set via simple heuristics (absmax or grid search), both of which are data-free and thus ignore how activations amplify weight errors.

The key insight is elegant: the round-to-nearest (RTN) grid assignment $q(w; s)$ is piecewise constant in the scale $s$, creating finitely many intervals on which the objective has a closed-form quadratic minimizer. By sweeping through these intervals with incremental updates, PiSO computes the provably globally optimal channel-wise scale in $O(D^2|G|)$ time per channel. This turns what appears to be a non-convex optimization into an exact, efficient algorithm, a genuinely novel algorithmic contribution.
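The interval idea can be illustrated with a small NumPy sketch. It is not the authors' implementation: it recomputes the assignment from scratch on every interval instead of using the incremental sweep, and it assumes a symmetric integer range [-qmax, qmax] with qmax = 2^(bits-1) - 1 and a Hessian proxy H = X Xᵀ built from random stand-in calibration data. All function names (`rtn`, `rtn_loss`, `optimal_scale_sketch`) are hypothetical.

```python
import numpy as np

def rtn(w, s, qmax):
    # Round-to-nearest integer assignment, clipped to the symmetric range (assumed).
    return np.clip(np.round(w / s), -qmax, qmax)

def rtn_loss(w, s, H, qmax):
    # Reconstruction error (w - s*q)^T H (w - s*q) under RTN at scale s.
    e = w - s * rtn(w, s, qmax)
    return float(e @ H @ e)

def optimal_scale_sketch(w, H, bits):
    """Naive interval search for one channel; returns (scale, loss)."""
    qmax = 2 ** (bits - 1) - 1
    absw = np.abs(w[w != 0])
    if absw.size == 0:
        return 1.0, 0.0  # all-zero channel: any scale gives zero error
    # The assignment of w_i changes only where |w_i| / s = k + 0.5, k = 0..qmax-1.
    half_steps = np.arange(qmax) + 0.5
    breakpoints = np.unique(np.outer(absw, 1.0 / half_steps).ravel())
    edges = np.concatenate(([0.0], breakpoints))
    Hw = H @ w
    # Beyond the largest breakpoint every weight rounds to zero.
    best_s, best_loss = 2.0 * breakpoints[-1], float(w @ H @ w)
    for lo, hi in zip(edges[:-1], edges[1:]):
        q = rtn(w, 0.5 * (lo + hi), qmax)  # constant on the open interval (lo, hi)
        den = float(q @ H @ q)
        if den <= 0.0:
            continue
        # Closed-form minimizer of the quadratic, clamped to the interval.
        # If it lands on an edge, the loss below is a one-sided limit.
        s = float(np.clip((q @ Hw) / den, lo, hi))
        e = w - s * q
        loss = float(e @ H @ e)
        if loss < best_loss:
            best_s, best_loss = s, loss
    return best_s, best_loss

# Usage with random stand-in data (not from the paper).
rng = np.random.default_rng(0)
D, N, bits = 16, 256, 3
X = rng.standard_normal((D, N))   # stand-in calibration activations
H = X @ X.T
w = rng.standard_normal(D)        # one output channel's weights
s_opt, loss_opt = optimal_scale_sketch(w, H, bits)
s_absmax = np.abs(w).max() / (2 ** (bits - 1) - 1)
loss_absmax = rtn_loss(w, s_absmax, H, 2 ** (bits - 1) - 1)
```

Each interval here costs O(D^2) to evaluate from scratch; the incremental updates described in the review are what would avoid redoing that work at every breakpoint.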

2. Methodological Rigor

The theoretical foundation is thorough and well-constructed. The paper provides:

  • Complete proofs (Appendix C) establishing channel-wise separability (Lemma C.4), closed-form solutions for fixed assignments (Lemma C.5), piecewise-constant structure (Corollaries C.8-C.9), and global optimality of the sweep (Proposition C.10).
  • Complexity analysis showing $O(D^2|G|)$ per channel with a dense Hessian, dropping to $O(D|G|)$ when $H$ is diagonal.
  • Formal bounds on the approximation error of the independent group-wise variant (Proposition C.12).

The extension to group-wise quantization is handled honestly: the authors acknowledge that it requires approximations and characterize two variants (independent and sequential) with clear trade-offs. The sequential variant's superiority, especially at 2-bit (an order-of-magnitude improvement over independent), is well demonstrated.

The three integration strategies with error correction (decoupled, layer-wise interleaved, group-wise interleaved) are well motivated, with the interleaved variants addressing the mismatch between RTN assumptions and error-corrected assignments.

One concern: the optimality guarantee strictly holds only for channel-wise RTN. When combined with GPTQ/Qronos, the RTN assumption is violated, yet the paper demonstrates empirical benefits without formal guarantees for this regime. The authors are transparent about this limitation.
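
The closed-form step for a fixed assignment, which the proofs listed in this section rely on, can be written out in a few lines. This is a sketch under the assumption that the per-channel objective is the standard layer-wise reconstruction error with Hessian proxy $H = XX^\top$; the symbols $f_q$, $s_q^\star$, and $h_i$ are illustrative and may not match the paper's notation.

```latex
% Fixed integer assignment q (constant on one scale interval), weights w.
\begin{align}
f_q(s) &= (w - s\,q)^\top H\,(w - s\,q)
        = w^\top H w \;-\; 2s\,q^\top H w \;+\; s^2\,q^\top H q, \\
\frac{\mathrm{d} f_q}{\mathrm{d} s} = 0
  \;&\Longrightarrow\;
  s_q^\star = \frac{q^\top H w}{q^\top H q}
  \qquad (q^\top H q > 0), \\
% Special case: diagonal H = diag(h_1, ..., h_D).
s_q^\star &= \frac{\sum_{i=1}^{D} h_i\, q_i\, w_i}{\sum_{i=1}^{D} h_i\, q_i^2}.
\end{align}
```

Because $f_q$ is a convex quadratic in $s$, its minimum over an interval is either $s_q^\star$ or the interval endpoint nearest to it. In the diagonal case both sums are $O(D)$, consistent with the lower per-channel cost noted above.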

3. Experimental Comprehensiveness

The experimental evaluation is extensive:

  • Models: Llama-3 (1B, 3B, 8B) and Qwen-2.5 (1.5B, 3B, 7B)
  • Bit-widths: 2, 3, and 4-bit
  • Granularities: Channel-wise and group-wise (G16, G32)
  • Metrics: WikiText-2 perplexity and zero-shot accuracy (ARC-Easy, ARC-Challenge, HellaSwag)
  • Baselines: absmax, data-free, Beacon, combined with RTN/GPTQ/Qronos

Key findings are convincing:

    1. At 2-bit channel-wise, PiSO reduces absmax perplexity by orders of magnitude (e.g., Llama-3.2-1B: 2e5 → 3e3 for RTN).

    2. Qronos⊙PiSO achieves the best 2-bit results consistently (e.g., 27.9 vs. 197 for Qronos+absmax on Llama-1B).

    3. At 3-bit, RTN+PiSO is competitive with or surpasses GPTQ+absmax — a striking result suggesting better scales can substitute for error correction.

    4. The calibration efficiency finding (Figure 2) is particularly interesting: a single 2048-token sample suffices for RTN+PiSO, while GPTQ+absmax needs ≥64 samples.

Runtime overhead is minimal (0.99-1.13× for the interleaved variant), making PiSO practical.

4. Timeliness & Relevance

This work is highly timely. LLM deployment on edge devices demands aggressive quantization (2-4 bit), where scale selection matters most. The paper correctly identifies that as bit-widths decrease, the gap between naive and optimal scales widens dramatically. With the proliferation of low-bit quantization formats (MX, NVFP4) and the increasing need for efficient PTQ, a principled scale optimization method fills a clear gap.

The compatibility with existing PTQ pipelines (transform + round) is a significant practical advantage: PiSO can be dropped into any pipeline using RTN or GPTQ/Qronos with minimal modification.

5. Strengths & Limitations

Strengths:

  • Exact optimality for channel-wise RTN — rare in quantization literature where most methods are approximate
  • The piecewise-constant insight is elegant and leads to an efficient algorithm
  • Negligible runtime overhead makes it immediately practical
  • Comprehensive evaluation with clear ablations across objectives (X-X, X̃-X̃, X-X̃)
  • The calibration efficiency finding has practical implications for deployment scenarios with limited data
  • Strong theoretical presentation with complete proofs

Limitations:

  • Optimality guarantees don't extend to group-wise or error-corrected settings
  • Limited to symmetric quantization; asymmetric quantization and hardware-constrained formats (MX, NVFP4 with low-bit scales) are left for future work
  • Evaluation limited to weight-only quantization; interaction with activation quantization unexplored
  • No comparison with CDQuant or COMQ (only Beacon), though the authors justify this choice
  • The improvements at 4-bit are modest, suggesting diminishing returns at higher bit-widths
  • The paper focuses on perplexity and simple zero-shot benchmarks; evaluation on more challenging generation tasks or instruction-following would strengthen the claims

Notable observations:

  • The finding that optimal scales can exceed absmax (Figure 4) challenges a widespread assumption in PTQ pipelines
  • The data-free variant of PiSO serves as a stronger baseline than grid search, yet still dramatically underperforms the data-aware version, highlighting the importance of activation awareness
  • Beacon's inconsistent performance (sometimes catastrophic, e.g., Llama-3B 3-bit: 33.7 perplexity) suggests that joint scale-assignment optimization without optimality guarantees can be unreliable

Overall Assessment

PiSO makes a clean, well-executed contribution to an important but neglected aspect of PTQ. The theoretical insight is elegant, the algorithm is practical, and the experimental gains are consistent and meaningful, especially at low bit-widths. While the scope is focused (channel-wise RTN optimality with heuristic extensions), the work establishes a solid foundation that could influence how future PTQ pipelines handle scale selection.


    Generated Jun 10, 2026

    Comparison History (18)

    Won vs. A Riemannian Approach to Low-Rank Optimal Transport

    Paper 2 likely has higher impact: it targets an immediate, high-demand problem (post-training quantization of LLMs) with clear real-world deployment implications and broad industry relevance. Its core contribution—exact, efficient optimization of quantization scales under round-to-nearest—offers a principled alternative to common heuristics and is readily adoptable in existing PTQ pipelines, potentially affecting many models and systems. Paper 1 is methodologically sophisticated and broadly extensible within optimal transport, but its audience and near-term adoption are narrower than LLM quantization.

    gpt-5.2 · Jun 11, 2026

    Lost vs. ATLAS: Active Theory Learning for Automated Science

    Paper 1 presents a foundational framework for automating scientific discovery through active theory learning, offering broad, long-term conceptual impact across cognitive science and other empirical fields. In contrast, Paper 2 addresses a highly specific, albeit timely, engineering problem in LLM compression. The fundamental novelty and cross-disciplinary potential of automating mechanistic modeling give Paper 1 a significantly higher potential for transformative scientific impact.

    gemini-3.1-pro-preview · Jun 11, 2026

    Lost vs. Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization

    Paper 2 has higher potential impact due to its novelty and broad relevance: it introduces and empirically demonstrates “generalization hacking,” a failure mode where RL appears successful by reward metrics while behavior fails to generalize—directly affecting alignment, safety evaluation, and RLHF methodology across many AI systems. Its implications span ML, AI safety, and deployment governance, and it is timely as training-aware frontier models emerge. Paper 1 is rigorous and practically useful for PTQ, but its impact is narrower (model compression/efficiency) and more incremental relative to existing calibration-based quantization advances.

    gpt-5.2 · Jun 11, 2026

    Won vs. Data-Driven Dynamic Assortment in Online Platforms: Learning about Two Sides

    Paper 2 addresses the highly timely and practically impactful problem of LLM compression through quantization, which is critical for deploying large language models efficiently. PiSO provides an exact, efficient solution for optimal quantization scales—a fundamental component used across the rapidly growing LLM deployment ecosystem. Its broad applicability across model families and bit-widths, combined with the massive interest in efficient LLM inference, gives it wider potential impact. Paper 1, while theoretically rigorous with optimal regret bounds for two-sided assortment learning, addresses a more niche operations research problem with narrower immediate applicability.

    claude-opus-4-6 · Jun 10, 2026

    Lost vs. Can we trust our models? Epistemic calibration in second-order classification

    While Paper 1 offers a timely and practical method for LLM compression, Paper 2 introduces a foundational concept (epistemic calibration) that addresses a critical gap in ML safety and uncertainty quantification. Its theoretical rigor, impossibility theorem, and broad applicability to high-stakes ML across multiple domains give it a higher potential for widespread and lasting scientific impact.

    gemini-3.1-pro-preview · Jun 10, 2026

    Lost vs. From Observation to Intervention: A Causal Audit of Expert Importance in Mixture-of-Experts Models

    Paper 2 is more novel and broadly impactful: it challenges a common inferential leap in interpretability (associational metrics → interventional claims) with a concrete, multi-model causal audit, yielding a general negative result and an evidential standard that can reshape evaluation practices beyond MoE pruning. Its methodological rigor is high (token-level interventions, multiple-comparison correction, power control), and the implications span interpretability, causal evaluation, pruning, and safety auditing. Paper 1 is strong and timely for PTQ efficiency, but its impact is likely more incremental and narrower to quantization workflows.

    gpt-5.2 · Jun 10, 2026

    Won vs. A Systematic Approach for Selecting Trajectories for Data Augmentation

    Paper 2 has higher impact potential due to a clear methodological innovation (exact, efficient optimal scale computation for PTQ under round-to-nearest), strong timeliness (LLM compression is highly active), and broad applicability across models, bit-widths, and deployment settings. Its results directly improve perplexity and zero-shot accuracy and become more valuable at lower bit-widths, aligning with real-world constraints. Paper 1 is careful and insightful for trajectory ML, but its impact is narrower, the benefits are conditional, and it reads more as an evaluation/framework thesis than a generally transferable new algorithmic advance.

    gpt-5.2 · Jun 10, 2026

    Won vs. RL4RLA: Teaching ML to Discover Randomized Linear Algebra Algorithms Through Curriculum Design and Graph-Based Search

    Paper 2 has higher likely impact due to timeliness and direct applicability: improving post-training quantization for LLMs targets an urgent bottleneck in deploying foundation models, with immediate benefits across industry and research. Its core contribution (exact, efficient optimization of quantization scales under round-to-nearest) is a clear methodological advance that can be adopted widely and combined with existing PTQ/error-correction pipelines. Paper 1 is novel and broader scientifically, but its impact may be slower and more niche (automated discovery of RLA algorithms) with higher barriers to adoption.

    gpt-5.2 · Jun 10, 2026

    Won vs. Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

    Paper 1 addresses a ubiquitous bottleneck in large language model deployment by providing a mathematically rigorous algorithm for optimal post-training quantization scaling. Given the widespread demand for efficient LLM inference across nearly all AI applications, its potential for broad, immediate real-world impact and methodological improvement over standard heuristics outweighs the more specialized, though innovative, advancements in flow matching alignment presented in Paper 2.

    gemini-3.1-pro-preview · Jun 10, 2026

    Won vs. Assessing Sample Quality in Conditional Generation under Compositional Shift

    Paper 2 is likely to have higher scientific impact due to strong timeliness and broad applicability: improved post-training quantization directly affects deployment cost and accessibility of frontier LLMs across many domains. Its core contribution (PiSO) offers an exact, efficient optimization method for quantization scales with clear methodological rigor and measurable gains on widely used models/benchmarks, and should be easy to adopt in existing PTQ pipelines. Paper 1 addresses an important evaluation gap for compositional shift and has strong relevance in scientific imaging, but its impact may be narrower and more dependent on assumptions about attribute coverage and trust-score validity.

    gpt-5.2 · Jun 10, 2026