Arthur Hoarau
Uncertainty estimation is critical for deploying machine learning models in high-stakes settings. However, classical calibration only assesses the reliability of predicted probabilities and does not evaluate whether epistemic uncertainty estimates are themselves trustworthy. This limitation is particularly relevant for second-order classification models. We introduce epistemic calibration, a principled criterion that measures whether reported epistemic uncertainty faithfully reflects the dispersion of model predictions around the ground truth. We show that epistemic calibration is a strictly stronger notion than classical calibration and captures failure modes invisible to standard metrics. We relate this work to the existing literature through an impossibility theorem that holds under the epistemic calibration hypothesis. To operationalize this concept, we propose the Expected Epistemic Calibration Error (EECE), which we prove to be a consistent estimator of a True Epistemic Calibration Error (TECE). Experiments across a broad range of uncertainty quantification methods show that epistemic calibration is a coherent and meaningful criterion and reveal substantial differences across methods, despite similar predictive performance.
This paper introduces epistemic calibration, a new criterion for evaluating whether second-order classification models (those that produce distributions over predicted probabilities, not just point estimates) faithfully represent their epistemic uncertainty. The key insight is that classical calibration only checks whether the *mean* prediction matches empirical frequencies, but says nothing about whether the *variance* (epistemic uncertainty) of the second-order distribution is meaningful. The paper formalizes this via Definition 5: a second-order predictor Π is epistemically calibrated if the ground-truth probability of the positive class, conditioned on the full second-order distribution Ψ, equals the mean of Ψ. The practical implication (Proposition 1) is that the reported epistemic uncertainty should equal the expected squared error of sampled models around the truth.
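The link between Definition 5 and Proposition 1, as summarized above, can be checked numerically with a small sketch. It relies only on the standard identity E[(θ − t)²] = Var(θ) + (E[θ] − t)² and is not the paper's own formulation: the Beta second-order distribution, the parameters `a` and `b`, and the helper `expected_squared_error` are illustrative assumptions. When the reference probability `t` equals the mean of the second-order distribution, the squared error of sampled first-order predictions matches the reported variance. When it does not, a squared bias term is added.

```python
import numpy as np


def expected_squared_error(samples, truth):
    """Mean squared distance between sampled first-order predictions and a reference probability."""
    samples = np.asarray(samples, dtype=float)
    return float(np.mean((samples - truth) ** 2))


rng = np.random.default_rng(0)

# Hypothetical second-order prediction for a single input: Beta(a, b) over P(Y = 1).
a, b = 6.0, 4.0
theta = rng.beta(a, b, size=200_000)  # sampled first-order predictions
mean = float(theta.mean())
reported_eu = float(theta.var())  # epistemic uncertainty taken as the sample variance

# Case 1: the reference probability equals the mean (the condition in Definition 5,
# as paraphrased in this review). Squared error equals the reported variance.
err_calibrated = expected_squared_error(theta, truth=mean)

# Case 2: the reference probability is shifted away from the mean.
# Squared error is now the variance plus the squared bias.
truth_shifted = mean - 0.2
err_shifted = expected_squared_error(theta, truth=truth_shifted)
bias_sq = (mean - truth_shifted) ** 2

print("reported variance:", reported_eu)
print("squared error, truth = mean:", err_calibrated)
print("squared error, shifted truth:", err_shifted)
print("variance + bias^2:", reported_eu + bias_sq)
```

In the second case the reported variance understates the actual error by the squared bias, which is the kind of failure the criterion is meant to expose.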
The paper also proposes the Expected Epistemic Calibration Error (EECE) as a computable metric and proves it consistently estimates the True Epistemic Calibration Error (TECE). An impossibility theorem (Theorem 2) connects this to recent critiques of epistemic uncertainty estimation, showing that under epistemic calibration, the spread of the posterior carries no additional information about Y beyond the mean.
The theoretical development is generally sound. The proofs for Theorem 2 (impossibility), Proposition 1 (reformulation), and Theorem 3 (consistency of EECE) are provided in full and follow logically. The consistency proof for EECE mirrors the classical ECE consistency argument, extended to the joint prediction space, which is a natural and well-motivated generalization.
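For reference, the classical binned ECE that the consistency argument builds on can be sketched as follows. This is the standard first-order metric for binary classification, not the paper's EECE, whose exact definition is not reproduced in this review. The function name `binned_ece`, the equal-width binning, and the synthetic ensembles are assumptions made for illustration. The example also shows why a mean-only metric cannot see epistemic spread: two ensembles with the same mean prediction but different variance receive essentially the same ECE.

```python
import numpy as np


def binned_ece(probs, labels, n_bins=10):
    """Classical binned ECE for binary classification with equal-width bins on [0, 1]."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0..n_bins-1.
    idx = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for k in range(n_bins):
        mask = idx == k
        if mask.any():
            # Bin weight times |empirical frequency - mean predicted probability|.
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return float(ece)


rng = np.random.default_rng(0)
n, m = 5000, 20  # number of inputs, ensemble members per input (hypothetical)

mean_pred = rng.uniform(0.1, 0.9, size=n)
labels = (rng.uniform(size=n) < mean_pred).astype(float)

# Bounded noise, centered per input so that both ensembles share the same mean.
noise = rng.uniform(-1.0, 1.0, size=(n, m))
noise -= noise.mean(axis=1, keepdims=True)
narrow = mean_pred[:, None] + 0.005 * noise  # small epistemic spread
wide = mean_pred[:, None] + 0.04 * noise     # larger epistemic spread

ece_narrow = binned_ece(narrow.mean(axis=1), labels)
ece_wide = binned_ece(wide.mean(axis=1), labels)
print("ECE (narrow ensemble):", ece_narrow)
print("ECE (wide ensemble):", ece_wide)
```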
However, several concerns merit attention:
The concept of epistemic calibration addresses a genuine gap. As uncertainty quantification becomes increasingly important in safety-critical applications, having principled tools to evaluate *whether epistemic uncertainty estimates are trustworthy* is valuable. Currently, practitioners lack standardized metrics for this purpose.
Practical applications include:
Limitations on impact: The binary-only restriction significantly narrows applicability. Most practical uncertainty quantification (UQ) challenges arise in multi-class or regression settings. The computational cost of EECE (which requires binning in high-dimensional joint spaces) may also limit adoption at scale.
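The scaling concern can be made concrete with simple arithmetic. The sketch below assumes an equal-width grid with the same number of bins on every axis. That is an assumption for illustration, since the paper's binning scheme is not described in this review, and the helper names are hypothetical.

```python
def grid_cells(bins_per_axis: int, n_axes: int) -> int:
    """Number of cells in an equal-width grid with `bins_per_axis` bins on each axis."""
    return bins_per_axis ** n_axes


def mean_samples_per_cell(n_samples: int, bins_per_axis: int, n_axes: int) -> float:
    """Average number of samples per cell if samples were spread evenly over the grid."""
    return n_samples / grid_cells(bins_per_axis, n_axes)


# With 10 bins per axis and 10,000 evaluation points, cells empty out quickly
# as the number of binned axes grows.
for n_axes in (1, 2, 3, 4):
    print(n_axes, grid_cells(10, n_axes), mean_samples_per_cell(10_000, 10, n_axes))
```

Under these assumptions the per-cell sample count drops by a factor of ten with each added axis, so binned estimates become noisy unless the evaluation set grows accordingly.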
The paper is well-timed. There is growing recognition that epistemic uncertainty estimation methods may be "fundamentally incomplete" (Jiménez et al., 2026, cited as [16]), and several recent position papers have questioned the disentanglement of aleatoric and epistemic uncertainty. This paper provides a constructive response: rather than declaring the problem unsolvable, it offers a local, testable condition under which second-order models can be trusted. The connection to the impossibility results in [16] through Theorem 2 is intellectually satisfying: epistemic calibration implies that the bias term in the generalized bias-variance decomposition vanishes, making the remaining variance term tractable.
The writing is generally clear, though some notation choices could be improved (for example, Ψ is overloaded as both a distribution and an index). The paper would benefit from discussing how epistemic calibration relates to proper scoring rules for second-order predictions, and whether EECE could be extended to assess the calibration of credal sets or the other imprecise probability representations mentioned in the introduction.
Generated Jun 10, 2026
Paper 2 likely has higher impact: it proposes a new, strictly stronger calibration notion (epistemic calibration), introduces principled metrics (EECE/TECE) with consistency guarantees, and connects to theory via an impossibility theorem. This targets a timely, widely relevant problem (trustworthy uncertainty in high-stakes ML) with broad applicability across domains using probabilistic/uncertainty estimates. Paper 1 is innovative for automated experiment design in cognitive science, but its current validation is primarily in silico on bandit-agent recovery, making near-term real-world uptake and cross-field breadth less certain than Paper 2’s general framework.
Paper 2 likely has higher impact: it introduces a stronger, principled notion of calibration (epistemic calibration) addressing a key gap in uncertainty quantification, backed by theory (strictness vs classical calibration, impossibility theorem) and a consistent estimator (EECE for TECE) plus broad empirical comparison. This is broadly relevant across ML, statistics, and safety-critical deployment, with clear real-world implications for trust and decision-making. Paper 1 is a useful unifying reframing and yields empirical gains for SFT, but is more scoped to LLM fine-tuning practice and may have narrower cross-field reach.
Paper 1 introduces a fundamentally new theoretical concept (epistemic calibration) that addresses a critical gap in uncertainty quantification for machine learning. It provides formal definitions, an impossibility theorem, a consistent estimator, and broad experimental validation. This foundational contribution has the potential to reshape how epistemic uncertainty is evaluated across many domains, especially high-stakes applications. Paper 2, while practically useful, is more incremental: it proposes a smaller, efficient time series model with covariate support, which represents engineering optimization rather than conceptual innovation. Paper 1's theoretical depth and broad applicability give it higher long-term scientific impact.
Paper 2 addresses a foundational issue in machine learning uncertainty quantification. By introducing a novel, strictly stronger notion of calibration together with supporting theoretical proofs, it has potential impact across high-stakes ML applications (e.g., medicine, autonomous systems). Paper 1 is valuable for physical modeling, but has a narrower methodological and application scope.
While Paper 1 offers a timely and practical method for LLM compression, Paper 2 introduces a foundational concept (epistemic calibration) that addresses a critical gap in ML safety and uncertainty quantification. Its theoretical rigor, impossibility theorem, and broad applicability to high-stakes ML across multiple domains give it a higher potential for widespread and lasting scientific impact.
Paper 2 addresses a fundamental and highly timely issue in machine learning: the reliability of epistemic uncertainty estimates. By introducing a novel, strictly stronger calibration criterion and a provably consistent estimator of the associated calibration error, it offers broad implications for deploying ML in high-stakes, real-world applications across multiple disciplines. Paper 1 is methodologically rigorous and optimal for two-sided platforms, but its impact is narrower, primarily confined to operations research and specialized e-commerce applications.
Paper 2 introduces a fundamental new concept (epistemic calibration) with broad applicability across all of machine learning, supported by theoretical results (impossibility theorem, consistency proofs) and empirical validation. It addresses a universal gap in uncertainty quantification that affects any high-stakes deployment. Paper 1, while achieving strong results on neural population modeling benchmarks and addressing practical BCI recalibration, is more incremental and domain-specific, combining existing ideas (Transformers, adapters, gain modulation) for a narrower neuroscience/BCI application. Paper 2's theoretical contributions have the potential to influence calibration research broadly.
Paper 2 is likely to have higher scientific impact due to strong timeliness and broad applicability: efficiently enabling speech-to-LLM capabilities via lightweight LoRA/distillation addresses a rapidly growing real-world need (voice assistants, accessibility, edge inference) and can be adopted widely across models and products. Its method is concrete, scalable, and positioned to influence multimodal LLM system design. Paper 1 is conceptually novel and rigorous (new calibration notion + estimator), but its impact is more specialized to uncertainty evaluation and second-order classification, with slower translation to mainstream deployments.
Paper 2 introduces a foundational concept (epistemic calibration) with theoretical guarantees and broad applicability across all high-stakes ML domains. In contrast, Paper 1 focuses on a domain-specific application (IMU-based HAR). Paper 2's theoretical depth, methodological rigor, and broader scope give it significantly higher potential for widespread scientific impact.
Paper 1 introduces a novel theoretical framework, including an impossibility theorem and a consistent estimator of the epistemic calibration error (EECE as an estimator of TECE), which addresses a fundamental and broad problem in high-stakes machine learning. Paper 2, while empirically rigorous, focuses on a narrower niche (trajectory data augmentation) and relies on evaluating existing heuristics. The foundational nature and broad applicability of Paper 1's contributions across all ML domains give it significantly higher potential for widespread scientific impact.