Juan Amboage, Pablo Monteagudo-Lago, Ian Colbert, Giuseppe Franco, Nicholas Fraser
Post-training quantization (PTQ) compresses large language models by mapping weights to low-bit representations. The scaling factor that defines the quantization grid is typically chosen using simple, data-free heuristics. In this work, we present PiSO (Piecewise Scale Optimization), an algorithm that leverages calibration data to compute the optimal channel-wise weight scales exactly and efficiently under round-to-nearest quantization. PiSO partitions the scale search space into finitely many intervals on which the objective admits a closed-form minimizer. We extend PiSO to group-wise quantization via principled heuristics and propose effective strategies for interleaving scale optimization with error correction. Experiments on Llama and Qwen models across multiple model sizes and target weight bit-widths demonstrate consistent improvements in perplexity and downstream zero-shot accuracy, both standalone and combined with error correction. In particular, we observe increased benefits as the target bit-width narrows and quantization becomes more challenging.
PiSO addresses a surprisingly under-explored component of PTQ pipelines: the scaling factor that defines the quantization grid. While significant effort has gone into error correction (GPTQ, Qronos) and rounding decisions, the scale is typically set via simple heuristics (absmax or grid search), both of which are data-free and thus ignore how activations amplify weight errors.
The key insight is elegant: the round-to-nearest (RTN) grid assignment is piecewise constant in the scale, creating finitely many intervals on which the objective has a closed-form quadratic minimizer. By sweeping through these intervals with incremental updates, PiSO efficiently computes the provably globally optimal channel-wise scale for each channel. This transforms what appears to be a non-convex optimization into an exact, efficient algorithm, which is a genuinely novel algorithmic contribution.
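To make the interval argument concrete, here is a minimal sketch of the idea for a single channel. It is not the authors' implementation, and several details are assumptions made for illustration: a symmetric integer grid [-qmax, qmax], an objective of the form ||Xw - sXq||^2 with H = X^T X built from calibration activations X, and the hypothetical function names `rtn`, `rtn_loss`, and `piecewise_scale_search`. For a fixed assignment q the objective is quadratic in s with minimizer s = (q^T H w) / (q^T H q). The sketch recomputes that quadratic from scratch on every interval rather than using the incremental updates the paper describes, so it is slower than the real algorithm.

```python
import numpy as np


def rtn(w, s, qmax):
    """Round-to-nearest onto the symmetric integer grid [-qmax, qmax] (assumed grid)."""
    return np.clip(np.round(w / s), -qmax, qmax)


def rtn_loss(w, s, H, qmax):
    """Activation-weighted error ||X w - s X q||^2, written with H = X^T X."""
    e = w - s * rtn(w, s, qmax)
    return float(e @ H @ e)


def piecewise_scale_search(w, X, bits=3):
    """Brute-force sweep over the intervals on which the RTN assignment is constant.

    Assumes bits >= 2 and that w has at least one nonzero entry.
    Returns (scale, loss). If the minimizer is clamped to an interval edge, the
    loss is that interval's infimum; tie-breaking exactly at the edge may
    realize the neighbouring assignment instead.
    """
    qmax = 2 ** (bits - 1) - 1
    H = X.T @ X
    # RTN switches between k and k+1 where |w_i| / s = k + 0.5, for k = 0..qmax-1.
    halves = np.arange(qmax) + 0.5
    breakpoints = np.unique((np.abs(w)[:, None] / halves[None, :]).ravel())
    breakpoints = breakpoints[breakpoints > 0]
    edges = np.concatenate(([0.0], breakpoints))

    best_s, best_loss = None, np.inf
    for lo, hi in zip(edges[:-1], edges[1:]):
        # The assignment is constant inside (lo, hi), so sample it at the midpoint.
        q = rtn(w, 0.5 * (lo + hi), qmax)
        denom = q @ H @ q
        if denom <= 0:
            continue  # X q = 0: the scale has no effect on this interval
        # Closed-form minimizer of the quadratic, clamped to the interval.
        s = float(np.clip((q @ H @ w) / denom, lo, hi))
        if s <= 0:
            continue  # degenerate case: this interval prefers s -> 0
        e = w - s * q
        cur = float(e @ H @ e)
        if cur < best_loss:
            best_s, best_loss = s, cur
    return best_s, best_loss
```

Calling `piecewise_scale_search(w, X, bits=3)` on a weight vector `w` of length d and a calibration matrix `X` of shape (n, d) returns a scale whose loss can be compared against the absmax choice `np.max(np.abs(w)) / qmax` via `rtn_loss`. Scales beyond the largest breakpoint quantize every weight to zero and are deliberately left out of the sweep.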
The theoretical foundation is thorough and well-constructed.
The extension to group-wise quantization is handled honestly — the authors acknowledge it requires approximations and characterize two variants (independent and sequential) with clear trade-offs. The sequential variant's superiority, especially at 2-bit (order-of-magnitude improvement over independent), is well-demonstrated.
The three integration strategies with error correction (decoupled, layer-wise interleaved, group-wise interleaved) are well-motivated, with the interleaved variants addressing the mismatch between RTN assumptions and error-corrected assignments.
One concern: the optimality guarantee strictly holds only for channel-wise RTN. When combined with GPTQ/Qronos, the RTN assumption is violated, yet the paper demonstrates empirical benefits without formal guarantees for this regime. The authors are transparent about this limitation.
The experimental evaluation is extensive.
Key findings are convincing:
1. At 2-bit channel-wise, PiSO reduces perplexity by orders of magnitude relative to absmax (e.g., Llama-3.2-1B with RTN: 2e5 → 3e3).
2. Qronos⊙PiSO achieves the best 2-bit results consistently (e.g., 27.9 vs. 197 for Qronos+absmax on Llama-1B).
3. At 3-bit, RTN+PiSO is competitive with or surpasses GPTQ+absmax — a striking result suggesting better scales can substitute for error correction.
4. The calibration efficiency finding (Figure 2) is particularly interesting: a single 2048-token sample suffices for RTN+PiSO, while GPTQ+absmax needs ≥64 samples.
Runtime overhead is minimal (0.99–1.13× for the interleaved variant), making PiSO practical.
This work is highly timely. LLM deployment on edge devices demands aggressive quantization (2-4 bit), where scale selection matters most. The paper correctly identifies that as bit-widths decrease, the gap between naive and optimal scales widens dramatically. With the proliferation of low-bit quantization formats (MX, NVFP4) and the increasing need for efficient PTQ, a principled scale optimization method fills a clear gap.
The compatibility with existing PTQ pipelines (transform + round) is a significant practical advantage — PiSO can be dropped into any pipeline using RTN or GPTQ/Qronos with minimal modification.
PiSO makes a clean, well-executed contribution to an important but neglected aspect of PTQ. The theoretical insight is elegant, the algorithm is practical, and the experimental gains are consistent and meaningful, especially at low bit-widths. While the scope is focused (channel-wise RTN optimality with heuristic extensions), the work establishes a solid foundation that could influence how future PTQ pipelines handle scale selection.
Generated Jun 10, 2026
Paper 2 likely has higher impact: it targets an immediate, high-demand problem (post-training quantization of LLMs) with clear real-world deployment implications and broad industry relevance. Its core contribution—exact, efficient optimization of quantization scales under round-to-nearest—offers a principled alternative to common heuristics and is readily adoptable in existing PTQ pipelines, potentially affecting many models and systems. Paper 1 is methodologically sophisticated and broadly extensible within optimal transport, but its audience and near-term adoption are narrower than LLM quantization.
Paper 1 presents a foundational framework for automating scientific discovery through active theory learning, offering broad, long-term conceptual impact across cognitive science and other empirical fields. In contrast, Paper 2 addresses a highly specific, albeit timely, engineering problem in LLM compression. The fundamental novelty and cross-disciplinary potential of automating mechanistic modeling give Paper 1 a significantly higher potential for transformative scientific impact.
Paper 2 has higher potential impact due to its novelty and broad relevance: it introduces and empirically demonstrates “generalization hacking,” a failure mode where RL appears successful by reward metrics while behavior fails to generalize—directly affecting alignment, safety evaluation, and RLHF methodology across many AI systems. Its implications span ML, AI safety, and deployment governance, and it is timely as training-aware frontier models emerge. Paper 1 is rigorous and practically useful for PTQ, but its impact is narrower (model compression/efficiency) and more incremental relative to existing calibration-based quantization advances.
Paper 2 addresses the highly timely and practically impactful problem of LLM compression through quantization, which is critical for deploying large language models efficiently. PiSO provides an exact, efficient solution for optimal quantization scales—a fundamental component used across the rapidly growing LLM deployment ecosystem. Its broad applicability across model families and bit-widths, combined with the massive interest in efficient LLM inference, gives it wider potential impact. Paper 1, while theoretically rigorous with optimal regret bounds for two-sided assortment learning, addresses a more niche operations research problem with narrower immediate applicability.
While Paper 1 offers a timely and practical method for LLM compression, Paper 2 introduces a foundational concept (epistemic calibration) that addresses a critical gap in ML safety and uncertainty quantification. Its theoretical rigor, impossibility theorem, and broad applicability to high-stakes ML across multiple domains give it a higher potential for widespread and lasting scientific impact.
Paper 2 is more novel and broadly impactful: it challenges a common inferential leap in interpretability (associational metrics → interventional claims) with a concrete, multi-model causal audit, yielding a general negative result and an evidential standard that can reshape evaluation practices beyond MoE pruning. Its methodological rigor is high (token-level interventions, multiple-comparison correction, power control), and the implications span interpretability, causal evaluation, pruning, and safety auditing. Paper 1 is strong and timely for PTQ efficiency, but its impact is likely more incremental and narrower to quantization workflows.
Paper 2 has higher impact potential due to a clear methodological innovation (exact, efficient optimal scale computation for PTQ under round-to-nearest), strong timeliness (LLM compression is highly active), and broad applicability across models, bit-widths, and deployment settings. Its results directly improve perplexity and zero-shot accuracy and become more valuable at lower bit-widths, aligning with real-world constraints. Paper 1 is careful and insightful for trajectory ML, but its impact is narrower, the benefits are conditional, and it reads more as an evaluation/framework thesis than a generally transferable new algorithmic advance.
Paper 2 has higher likely impact due to timeliness and direct applicability: improving post-training quantization for LLMs targets an urgent bottleneck in deploying foundation models, with immediate benefits across industry and research. Its core contribution (exact, efficient optimization of quantization scales under round-to-nearest) is a clear methodological advance that can be adopted widely and combined with existing PTQ/error-correction pipelines. Paper 1 is novel and broader scientifically, but its impact may be slower and more niche (automated discovery of RLA algorithms) with higher barriers to adoption.
Paper 1 addresses a ubiquitous bottleneck in large language model deployment by providing a mathematically rigorous algorithm for optimal post-training quantization scaling. Given the widespread demand for efficient LLM inference across nearly all AI applications, its potential for broad, immediate real-world impact and methodological improvement over standard heuristics outweighs the more specialized, though innovative, advancements in flow matching alignment presented in Paper 2.
Paper 2 is likely to have higher scientific impact due to strong timeliness and broad applicability: improved post-training quantization directly affects deployment cost and accessibility of frontier LLMs across many domains. Its core contribution (PiSO) offers an exact, efficient optimization method for quantization scales with clear methodological rigor and measurable gains on widely used models/benchmarks, and should be easy to adopt in existing PTQ pipelines. Paper 1 addresses an important evaluation gap for compositional shift and has strong relevance in scientific imaging, but its impact may be narrower and more dependent on assumptions about attribute coverage and trust-score validity.