
Loss-Shift Transfer via Bayes Quotients

Vasileios Sevetlidis

cs.LG
#2893 of 5669 · cs.LG
Tournament Score: 1399 ± 50 (scale 1050 to 1750)
Win Rate: 56% (10 wins, 8 losses, 18 matches)
Rating: 5.8 / 10
Significance 5.5 · Rigor 7.5 · Novelty 5 · Clarity 8.5

Abstract

Transfer learning is usually studied as a consequence of distribution shift. This paper identifies an orthogonal failure mode in which the data distribution is fixed and the loss changes. This setting is called *loss shift*. A loss determines which information in X is Bayes-relevant, and two losses may therefore require different representations even under the same joint law P(X, Y). The idea is formalized using Bayes quotients, which allow losses to be ordered by refinement. In the Bayes-quotient formulation, strict refinement gives an immediate qualitative obstruction. A source-minimal representation for a coarser loss is insufficient for a strictly finer target loss. For finite-output log loss, this obstruction becomes an exact quantitative identity. The excess risk is the conditional information about Y discarded by the representation. Experiments in controlled, learned, synthetic-image, and real-image settings show the predicted effect, i.e., classification-equivalent representations can have different optimal log-loss performance under a fixed data distribution.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Loss-Shift Transfer via Bayes Quotients"

1. Core Contribution

This paper identifies and formalizes loss shift — a transfer learning failure mode in which the data distribution P(X,Y) remains fixed but the loss function changes between source and target tasks. The key insight is that different losses require different "Bayes-relevant" information from representations. A representation that is minimal and sufficient for a coarse loss (e.g., 0-1 classification) may be provably insufficient for a finer loss (e.g., log loss) under the *same* distribution.

The formalization uses Bayes quotients: for a fixed (P, ℓ), the Bayes quotient partitions the input space by equivalence classes that share the same Bayes-optimal action. Two losses are compared via a refinement preorder on their induced sigma-algebras. The main quantitative result (Theorem 3.3) shows that for finite-output log loss, the frozen-transfer excess risk equals exactly I(Y; X | H) — the conditional mutual information discarded by the representation.
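The identity described above can be checked numerically on a small discrete example. The sketch below is illustrative only: the four-point input space, the posterior values in `p_y1_given_x`, and all function names are assumptions made for this example and are not taken from the paper. It takes the 0-1 Bayes action as the frozen representation H, confirms that the log-loss quotient strictly refines the 0-1 quotient, and compares the frozen-transfer excess log-loss risk with I(Y; X | H) computed from its definition.

```python
import math

# Hypothetical toy joint law (an assumption for illustration):
# X is uniform on {0, 1, 2, 3} and Y is binary.
p_x = [0.25, 0.25, 0.25, 0.25]
p_y1_given_x = [0.1, 0.4, 0.6, 0.9]  # P(Y = 1 | X = x)


def h_of(x):
    """0-1 Bayes action at x, used here as the frozen representation H."""
    return int(p_y1_given_x[x] > 0.5)


def p_y_given_x(y, x):
    """Posterior P(Y = y | X = x), the Bayes action for log loss."""
    return p_y1_given_x[x] if y == 1 else 1.0 - p_y1_given_x[x]


def refines(fine, coarse):
    """True if every class of `fine` sits inside a single class of `coarse`."""
    return len(set(zip(fine, coarse))) == len(set(fine))


# Bayes quotients on this finite space: label each x by its Bayes action.
quotient_01 = [h_of(x) for x in range(len(p_x))]
quotient_log = [p_y1_given_x[x] for x in range(len(p_x))]

# Marginals P(H = h) and P(H = h, Y = y), needed for P(Y | H).
p_h, p_hy = {}, {}
for x, px in enumerate(p_x):
    h = h_of(x)
    p_h[h] = p_h.get(h, 0.0) + px
    for y in (0, 1):
        p_hy[(h, y)] = p_hy.get((h, y), 0.0) + px * p_y_given_x(y, x)


def p_y_given_h(y, h):
    return p_hy[(h, y)] / p_h[h]


def log_risk(predict):
    """Population log loss of a predictor predict(y, x) -> probability."""
    return -sum(
        px * p_y_given_x(y, x) * math.log(predict(y, x))
        for x, px in enumerate(p_x)
        for y in (0, 1)
    )


bayes_risk = log_risk(p_y_given_x)                            # uses all of X
frozen_risk = log_risk(lambda y, x: p_y_given_h(y, h_of(x)))  # uses only H
excess = frozen_risk - bayes_risk

# I(Y; X | H) from its definition; H is a function of X, so p(x, y, h) = p(x, y).
cmi = 0.0
for x, px in enumerate(p_x):
    h = h_of(x)
    for y in (0, 1):
        p_xy = px * p_y_given_x(y, x)
        cmi += p_xy * math.log((p_xy / p_h[h]) / ((px / p_h[h]) * p_y_given_h(y, h)))

print("log-loss quotient refines 0-1 quotient:", refines(quotient_log, quotient_01))
print("0-1 quotient refines log-loss quotient:", refines(quotient_01, quotient_log))
print("excess log-loss risk:", excess)
print("I(Y; X | H):         ", cmi)
```

In this toy setting H loses nothing for 0-1 classification, since it is the Bayes classifier itself, yet the two printed risk quantities agree and are strictly positive, which is the qualitative obstruction and the quantitative identity side by side.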

2. Methodological Rigor

The theoretical framework is clean and mathematically well-structured. The paper carefully works within the standard Borel setting, defines Bayes sufficiency and minimality precisely, and proves a preorder on losses through sigma-algebra containment. The proofs (Appendix C) are straightforward but correct — they rely on standard information-theoretic identities (cross-entropy decomposition, tower property, conditional mutual information).

However, the theoretical novelty should be assessed carefully. Theorem 3.3 is essentially the well-known identity that the excess log-loss risk of any predictor relative to the Bayes predictor equals the expected conditional KL divergence, which is a standard result in information theory and Bayesian learning theory. The paper acknowledges this connection to Cover & Thomas (2006) and Xu & Raginsky (2022). The contribution is therefore more in the *framing* and *application* of known information-theoretic identities to the transfer learning context rather than in deriving fundamentally new mathematical results.
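For readers who want the standard argument spelled out, the following is a sketch of the textbook derivation the paragraph above refers to, not a reproduction of the paper's proof. The notation is assumed for this sketch: H = h(X) is the frozen representation, q(· | H) is any predictor built on H, and entropy is written as a calligraphic \mathcal{H} to avoid a clash with the representation H.

```latex
% Assumed notation (not from the paper): H = h(X), q(. | H) is any predictor on H,
% \mathcal{H} denotes (conditional) Shannon entropy.
\begin{align*}
\mathbb{E}\bigl[-\log q(Y \mid H)\bigr] - \mathbb{E}\bigl[-\log P(Y \mid X)\bigr]
  &= \mathbb{E}\Bigl[\log \frac{P(Y \mid X)}{q(Y \mid H)}\Bigr] \\
  &= \mathbb{E}\Bigl[\mathrm{KL}\bigl(P(\cdot \mid X) \,\big\|\, q(\cdot \mid H)\bigr)\Bigr], \\[4pt]
\min_{q}\, \mathbb{E}\bigl[-\log q(Y \mid H)\bigr] - \mathbb{E}\bigl[-\log P(Y \mid X)\bigr]
  &= \mathcal{H}(Y \mid H) - \mathcal{H}(Y \mid X)
     && \text{(minimizer } q = P(Y \mid H)\text{)} \\
  &= \mathcal{H}(Y \mid H) - \mathcal{H}(Y \mid X, H)
     && \text{(} H \text{ is a function of } X\text{)} \\
  &= I(Y; X \mid H).
\end{align*}
```

The first two lines hold for any predictor on H; the last three specialize to the best such predictor, which is where the conditional mutual information appears.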

The experiments are well-designed for isolating the mechanism:

  • Controlled binary model: Exact population quantities are computable, and the empirical gap matches the theoretical prediction to high precision (correlation 0.999995). This is convincing.
  • Learned bottleneck: Shows the effect persists with optimization-derived representations, with fine-tuning recovering performance (confirming the information is in X, just not in H).
  • dSprites: Image-based with known latent structure; diagnostic probes confirm differential information retention.
  • CIFAR-10H: Real-image setting using human soft labels, providing ecological validity.

The experiments are thorough and well-controlled, with proper confidence intervals across replications.

3. Potential Impact

The concept of loss shift provides a useful diagnostic lens for practitioners. When transferring pretrained representations to tasks with different loss functions (e.g., from classification to calibrated probability estimation), the framework explains why frozen features may be fundamentally insufficient. This is practically relevant because:

  • Modern ML commonly reuses classification-pretrained features for downstream probabilistic tasks.
  • The calibration literature has noted that accuracy-optimized networks can have poor probability estimates; this paper provides a representation-theoretic explanation complementing existing calibration work.
  • The framework gives a clean information-theoretic measure (conditional MI) of the transfer gap.

However, the practical implications may be somewhat limited. The primary example (accuracy → log loss) is well-understood informally: practitioners already know that classification features may not preserve calibration. The paper's contribution is making this precise rather than revealing a surprising new phenomenon. The framework doesn't immediately suggest new algorithms beyond "train with the target loss" or "don't freeze."

4. Timeliness & Relevance

The paper is timely given the prevalence of foundation models and frozen-feature transfer. The emphasis on what information frozen representations preserve or discard is directly relevant to how pretrained models are deployed. The connection to calibration is also relevant given growing interest in uncertainty quantification.

The paper fills a conceptual gap: transfer learning theory has focused almost exclusively on distribution shift, and this work correctly points out that loss mismatch under fixed distributions is an independent axis of difficulty.

5. Strengths & Limitations

Strengths:

  • Clean formalization of an intuitive but under-theorized phenomenon
  • The Bayes quotient preorder provides an elegant language for comparing loss functions' representational demands
  • Experimental design is exemplary in isolating the mechanism across four settings of increasing realism
  • The exact quantitative identity (Theorem 3.3) is satisfying even if the underlying math is known
  • Paper is well-written with clear progression from theory to experiments

Limitations:

  • The core theoretical results are relatively straightforward applications of known information-theoretic identities. The sigma-algebra containment argument is clean but not deep.
  • The quantitative result is restricted to finite-output log loss as the target. Extension to other proper scoring rules, Bregman losses, or structured losses is left open.
  • The Bayes-quotient regime requires almost-sure uniqueness of Bayes actions, excluding important cases.
  • Practical guidance is limited — the framework is diagnostic rather than prescriptive.
  • The paper builds on the author's own concurrent work (Sevetlidis 2026a, 2026b), both from June 2026. The dependency on very recent, potentially unreviewed prior work creates a fragile foundation.
  • The experiments, while well-controlled, use relatively simple settings. The CIFAR-10H experiment is the most realistic but uses a small dataset (10K images) with high variance across seeds.
  • The "loss shift" framing, while novel as terminology, captures something practitioners implicitly understand. The gap between the formalization's elegance and its actionable novelty is notable.

Overall Assessment

This is a well-executed paper that provides a clean theoretical framework for an underappreciated phenomenon. The formalization via Bayes quotients is elegant, and the experiments convincingly demonstrate the predicted effects. The main limitation is that the core insight (different losses need different information) is relatively intuitive, and the mathematical machinery, while polished, primarily repackages known information-theoretic results. The paper's impact will likely be conceptual rather than algorithmic, providing useful vocabulary and formal tools for thinking about representation adequacy across different objectives.

Rating: 5.8 / 10
Significance 5.5 · Rigor 7.5 · Novelty 5 · Clarity 8.5

Generated Jun 12, 2026

Comparison History (18)

Lost vs. Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

Paper 2 addresses the highly timely and impactful area of reinforcement learning for LLM reasoning. By providing mechanistic insights and practical interventions for scaling reasoning capabilities, it has immediate, broad applicability in current AI research. While Paper 1 offers a novel theoretical framework for transfer learning, Paper 2's direct relevance to the rapid development of advanced reasoning models gives it higher potential for widespread scientific and real-world impact.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. Distributional Loss for Robust Classification

Paper 2 introduces a foundational theoretical framework addressing a novel, under-explored problem (loss shift as orthogonal to distribution shift) using a new mathematical formalism (Bayes quotients). While Paper 1 offers a practical algorithmic improvement for robust classification, Paper 2 challenges core assumptions in transfer and representation learning. Its rigorous theoretical formalization, combined with empirical validation, gives it a higher potential to fundamentally shape future research directions, influence theoretical machine learning, and impact how researchers conceptualize representation sufficiency.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios

Paper 2 introduces a fundamental theoretical framework ('loss shift' and 'Bayes quotients') to transfer learning, an area that underpins modern machine learning. While Paper 1 provides an impressive domain-specific benchmark and method for power systems, Paper 2's insights into representation learning and loss functions have the potential for broader impact across all fields applying machine learning, influencing both theoretical understanding and practical model design.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. Clustering Node Attributed Networks with Graph Neural Networks and Self Learning

Paper 2 introduces a broadly applicable and conceptually novel transfer-learning setting (loss shift) orthogonal to distribution shift, with a formal framework (Bayes quotients) that yields qualitative impossibility results and an exact quantitative identity for log loss linking excess risk to discarded conditional information. This combination of new problem framing, theoretical rigor, and cross-domain relevance (representation learning, information theory, transfer, evaluation metrics) suggests wider and longer-lasting impact than Paper 1’s incremental advance on GNN-based graph clustering, which is more application-specific and shows mixed real-data gains.

gpt-5.2 · Jun 12, 2026

Won vs. To GAN or Not To GAN: Segmentation Analysis on Mars DEM

Paper 2 has higher potential impact: it introduces a novel, general transfer-learning failure mode (loss shift) orthogonal to distribution shift, formalized via Bayes quotients with qualitative and quantitative results (including an exact identity for log loss). The framework is broadly applicable across ML tasks where objectives change (classification vs calibration, different decision costs), is timely given representation learning/transfer, and is supported by theory plus experiments. Paper 1 is more application-specific, and its main empirical finding (GAN augmentation not improving segmentation) is narrower and less broadly generalizable.

gpt-5.2 · Jun 12, 2026

Lost vs. Novel Aspects of IEEE SA P3109 Arithmetic Formats for Machine Learning

Paper 1 describes the IEEE SA P3109 standard for machine learning arithmetic formats, which has high practical impact as an industry standard affecting hardware and software implementations across the entire ML ecosystem. Standards like this shape how billions of computations are performed. Paper 2 introduces the novel concept of 'loss shift' and Bayes quotients, which is theoretically interesting but addresses a more niche aspect of transfer learning. While Paper 2 is conceptually elegant, the breadth of impact of an IEEE standard on ML numerical formats, which affects chip designers, framework developers, and practitioners, gives Paper 1 greater estimated scientific impact.

claude-opus-4-6 · Jun 12, 2026

Lost vs. Soft Sequence Policy Optimization

Paper 2 demonstrates higher potential scientific impact due to its timeliness and direct real-world applicability in Large Language Model alignment. While Paper 1 introduces a highly novel theoretical framework for transfer learning, Paper 2 addresses critical bottlenecks in modern RLHF/GRPO pipelines. By improving training stability and performance in mathematical reasoning and coding tasks, both highly sought-after capabilities in contemporary AI, Paper 2 is positioned for rapid adoption and widespread citation across the highly active LLM research and applied AI communities.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. Accelerating Speculative Diffusions via Block Verification

Paper 1 introduces a fundamental theoretical framework ('loss shift' and Bayes quotients) that addresses a novel failure mode in transfer learning independent of distribution shift. This foundational insight has broad implications across representation learning and generalization. In contrast, Paper 2 offers a valuable but highly specific algorithmic speedup for diffusion models. Theoretical advances like those in Paper 1 typically yield broader, longer-lasting scientific impact across multiple subfields of machine learning.

gemini-3.1-pro-preview · Jun 12, 2026

Lost vs. MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Paper 1 represents a major breakthrough in AI reasoning, achieving gold-medal performance on high-profile benchmarks like the IMO and USAMO. The introduction of population-level test-time scaling addresses a critical bottleneck in LLM reasoning capabilities. While Paper 2 offers a solid theoretical contribution to transfer learning, Paper 1 solves a highly visible grand challenge in artificial intelligence, virtually guaranteeing broader immediate attention, extensive follow-up research, and significant real-world impact in automated theorem proving and advanced reasoning systems.

gemini-3.1-pro-preview · Jun 12, 2026

Lost vs. Learning with Simulators: No Regret in a Computationally Bounded World

Paper 1 is more novel and broadly impactful: it introduces a new learning-theoretic framework (simulatable processes) that extends PAC-style VC-dimension guarantees to arbitrarily dependent, computationally bounded data-generating processes, and connects regret to time-bounded Kolmogorov complexity. This is a significant conceptual generalization with potential cross-field influence (learning theory, online learning, complexity, causal/simulation-based modeling). Paper 2 offers a clean and timely reframing of transfer under loss shift with useful identities for log loss, but its scope is narrower and more tied to representation/transfer phenomena than foundational guarantees.

gpt-5.2 · Jun 12, 2026