Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

Hankyeol Kim, Pilsung Kang

May 26, 2026

arXiv:2605.27752v1

cs.AI (primary)

#1413 of 2682 · Artificial Intelligence

Tournament Score: 1404 ± 47

Win Rate: 62%

Rating: 6.5 / 10

Significance 7 · Rigor 8 · Novelty 5.5 · Clarity 8

Abstract

LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals are sometimes treated as direct readouts of model uncertainty, but their comparison depends on measurement choices that are rarely made explicit. In the main analysis, we hold the verbalized-confidence elicitation fixed: a single prompt template, probability scale, and output format. We then vary the measurement axes that define the verbalized-vs-token comparison: which answer string receives the token-probability score, how that score is read from the answer tokens, and under which conditioning context it is measured. We evaluate this design on four QA benchmarks across three open 7-8B base/Instruct model families, with larger Qwen2.5 variants as same-family robustness checks. The resulting comparison is sensitive to these choices: conditioning context changes the sign or magnitude of the ECE gap across settings, token readout produces smaller but still sign-moving changes, and changing the ECE estimator has little effect. Under the default generated-answer, bare-context protocol, Instruct settings are close to parity rather than showing a large calibration gain for verbalized confidence. In a separate supplied-answer analysis, surface-plausible wrong answers receive nearly the same confidence as supplied gold answers, suggesting that verbalized confidence also reflects answer plausibility and provenance rather than correctness alone. We argue that both confidence signals should be treated as protocol-dependent behavioral measurements, and provide a reporting checklist covering elicitation provenance, scored answer, token-probability readout, and conditioning context.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper identifies and systematically characterizes a largely overlooked source of variability in LLM confidence calibration studies: the measurement protocol itself. While prior work (Xiong et al., 2024; Yang et al., 2024; Dai, 2026) has shown that verbalized confidence is sensitive to *how it is elicited* (prompt template, scale, format), this paper holds the elicitation fixed and instead varies the *comparison infrastructure*—which answer string receives the token-probability score, how that score is read from answer tokens, and under which conditioning context it is measured. The central finding is that these seemingly mundane measurement choices can flip the sign of the ECE gap between verbalized and token-probability confidence, making conclusions about which signal is "better calibrated" protocol-dependent rather than model-dependent.
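To make the "sign flip" concrete, the following is a minimal sketch of an equal-width binned ECE and of the gap between two confidence signals scored against the same correctness labels. It is an illustration under stated assumptions, not the paper's implementation: the function names, the 10-bin default, and the sign convention (verbalized ECE minus token-probability ECE) are all assumed here, and the numbers are toy values rather than results from the paper.

```python
def ece(confidences, correct, n_bins=10):
    """Equal-width binned expected calibration error (ECE)."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that 1.0 lands in the last bin and nothing falls below bin 0.
        index = max(0, min(int(conf * n_bins), n_bins - 1))
        bins[index].append((conf, hit))
    total = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(hit for _, hit in members) / len(members)
        # Weight each bin's |confidence - accuracy| by its share of the items.
        total += len(members) / len(confidences) * abs(mean_conf - accuracy)
    return total


def ece_gap(verbal_conf, token_conf, correct, n_bins=10):
    """Assumed sign convention: positive means the verbalized signal is worse."""
    return ece(verbal_conf, correct, n_bins) - ece(token_conf, correct, n_bins)


# Toy data, not results from the paper.
correct = [1, 1, 1, 0]
verbal = [0.8, 0.8, 0.8, 0.8]
token_context_a = [0.75, 0.75, 0.75, 0.75]  # hypothetical conditioning context A
token_context_b = [0.5, 0.5, 0.5, 0.5]      # hypothetical conditioning context B

print(ece_gap(verbal, token_context_a, correct))  # positive: token signal looks better
print(ece_gap(verbal, token_context_b, correct))  # negative: verbalized signal looks better
```

In this toy case the verbalized scores never change; only the token-side measurement does, and that alone is enough to reverse which signal appears better calibrated.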

The paper also contributes a supplied-answer analysis showing that verbalized confidence barely discriminates between plausible wrong answers and correct answers (paired difference of only +0.021), while clearly rejecting off-type wrong answers (mean confidence 0.240 vs. 0.832 for gold). This suggests verbalized confidence tracks surface plausibility and answer provenance rather than correctness per se.
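A paired difference of this kind can be sketched as follows. The sketch assumes that each question contributes one confidence for the supplied gold answer and one for the supplied wrong answer, and that the difference is taken as gold minus wrong; both the pairing scheme and the sign convention are assumptions, and the values are toy numbers.

```python
def paired_mean_difference(gold_conf, wrong_conf):
    """Mean of per-question (gold - wrong) verbalized-confidence differences."""
    if len(gold_conf) != len(wrong_conf) or not gold_conf:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [gold - wrong for gold, wrong in zip(gold_conf, wrong_conf)]
    return sum(diffs) / len(diffs)


# Toy confidences for the same four questions, not data from the paper.
gold = [0.90, 0.80, 0.85, 0.75]
plausible_wrong = [0.88, 0.80, 0.80, 0.74]  # hypothetical plausible wrong answers
off_type_wrong = [0.20, 0.30, 0.25, 0.15]   # hypothetical off-type wrong answers

print(paired_mean_difference(gold, plausible_wrong))  # small positive difference
print(paired_mean_difference(gold, off_type_wrong))   # large positive difference
```

A difference near zero means the model reports almost the same confidence whether the supplied answer is right or merely plausible.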

Methodological Rigor

The experimental design is carefully controlled and well-documented. The paper uses a factorial structure across protocol axes (Table 1) with each comparison changing one variable at a time. Three model families × two variants (base/Instruct) × four datasets = 24 settings provide reasonable coverage. Bootstrap confidence intervals (B=1000) are reported throughout, and the authors are transparent about limitations—base model parse rates, sample sizes, and the descriptive (not causal) nature of their comparisons.
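For readers unfamiliar with the interval construction, here is a minimal sketch of a percentile bootstrap with B = 1000 resamples. The percentile method, the item-level resampling, the 95% level, and the example statistic are assumptions made for illustration; the review does not state which bootstrap variant the paper uses.

```python
import random


def bootstrap_ci(items, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for statistic(items)."""
    if not items:
        raise ValueError("items must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(items)
    replicates = []
    for _ in range(n_boot):
        # Resample n items with replacement and recompute the statistic.
        resample = [items[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    replicates.sort()
    lower = replicates[int((alpha / 2) * n_boot)]
    upper = replicates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


def mean_overconfidence(records):
    """Mean of (confidence - correctness) over (confidence, correct) pairs."""
    return sum(conf - hit for conf, hit in records) / len(records)


# Toy records, not data from the paper.
records = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.9, 0), (0.5, 0)]
low, high = bootstrap_ci(records, mean_overconfidence)
print(low, high)
```

Any per-setting statistic, such as an ECE gap, could be passed in place of `mean_overconfidence`. An interval that excludes zero indicates that the sign of the statistic is stable under resampling.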

The robustness checking is thorough: bin-count sweeps (Appendix D), an integer 0-100 scale check (Appendix I), four non-ECE metrics (Appendix L), a strict scorer check (Appendix M), and same-family scaling checks at 14B, 32B, and 72B. The conditioning-context effect survives all of these. The paper is also careful to distinguish between protocol effects and causal claims—a methodological discipline sometimes lacking in this literature.
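One way to check that a conclusion does not depend on binning is to recompute the comparison with a metric that has no bins at all. The sketch below uses the Brier score as a stand-in; the review does not name the four non-ECE metrics, so the choice of Brier score is an assumption, as are the function names and the toy values.

```python
def brier_score(confidences, correct):
    """Mean squared difference between confidence and 0/1 correctness."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((conf - hit) ** 2 for conf, hit in zip(confidences, correct)) / len(correct)


def brier_gap(verbal_conf, token_conf, correct):
    """Assumed sign convention: positive means the verbalized signal is worse."""
    return brier_score(verbal_conf, correct) - brier_score(token_conf, correct)


# Toy data, not results from the paper.
correct = [1, 0, 1, 1]
verbal = [0.9, 0.9, 0.9, 0.9]      # uniformly high, so it cannot separate right from wrong
token_prob = [0.9, 0.2, 0.8, 0.7]  # hypothetical token-probability scores

print(brier_gap(verbal, token_prob, correct))  # positive in this toy case
```

If the sign of the gap under a binning-free metric agrees with the sign under ECE, the result is less likely to be an artifact of the estimator.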

One limitation in rigor is that the study uses only three model families at the primary scale, all open-weight 7-8B models. The larger-model checks use only one family (Qwen2.5). Additionally, the supplied-answer analysis mixes cross-model plausible wrong answers (60.7%) with off-type fallbacks (39.3%), which somewhat muddies the interpretation of "plausible wrong" results, though the per-cell breakdowns are provided.

Potential Impact

The practical implications are significant for the growing community of researchers and practitioners using LLM confidence for selective prediction, abstention, and safety-critical deployment. The paper demonstrates that published claims about alignment improving self-reported uncertainty, or verbalized confidence being superior to token probabilities, may be artifacts of implicit protocol choices rather than genuine model properties.

The reporting checklist (C1-C4) is a concrete, actionable contribution that could improve reproducibility and comparability across the field. If adopted, it would make confidence calibration papers more interpretable and prevent misleading conclusions.
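A checklist like this could be enforced in an evaluation harness with a small record type. The sketch below takes the four items named in the abstract (elicitation provenance, scored answer, token-probability readout, conditioning context); mapping them to C1-C4 in that order is an assumption, and the class name, field names, and example values are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ProtocolReport:
    """One record per reported calibration comparison (hypothetical schema)."""
    elicitation_provenance: str  # assumed C1: prompt template, scale, output format
    scored_answer: str           # assumed C2: which answer string is scored
    token_readout: str           # assumed C3: how the token score is read out
    conditioning_context: str    # assumed C4: context the score is measured under

    def missing(self):
        """Names of checklist items left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Example values are illustrative labels, not the paper's exact wording.
report = ProtocolReport(
    elicitation_provenance="single fixed template, probability scale",
    scored_answer="generated",
    token_readout="mean token log-probability",
    conditioning_context="bare",
)
print(report.missing())
```

A harness could refuse to emit a results table while `missing()` is non-empty, so that every reported gap carries its protocol with it.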

The supplied-answer finding has implications for retrieval-augmented generation and tool-use systems where models evaluate externally provided information—verbalized confidence may not reliably distinguish correct from plausible-but-wrong retrieved answers.

Timeliness & Relevance

This paper is highly timely. As LLMs are deployed in high-stakes domains (medical QA, legal analysis, autonomous agents), confidence calibration is increasingly important for safety. Multiple recent papers (Tian et al., 2023; Xiong et al., 2024; Yang et al., 2024) have made claims about verbalized confidence calibration that this paper shows are potentially protocol-dependent. The finding that Instruct models show near-parity rather than large calibration gains under the default protocol challenges a narrative that has influenced deployment decisions.

Strengths

1. Identifies a blind spot: The paper addresses measurement infrastructure choices that are typically implicit, moving the conversation from "which signal is better?" to "better under what protocol?"—a more scientifically rigorous framing.

2. Quantitative context for protocol effects: Table 4 is particularly valuable, showing that conditioning context (|Δg| = 0.28) is comparable to model family (0.37) and larger than dataset identity (0.17), while estimator choice is negligible (0.01). This gives the field concrete guidance on which choices matter most.

3. Extensive robustness: The paper is unusually thorough in checking whether its conclusions are artifacts of specific methodological decisions—different scales, metrics, bin counts, model sizes, and scoring rules.

4. Conservative, well-scoped claims: The authors consistently frame results as sensitivity analyses rather than causal decompositions, and clearly distinguish primary evidence (Instruct models) from stress tests (base models).

Limitations

1. No closed-model coverage: The study cannot address GPT-4, Claude, or other API-only models where token probabilities may be unavailable or computed differently. This limits generalizability to the models most commonly used in practice.

2. QA-only evaluation: All benchmarks are short-answer QA. The paper's findings may not transfer to long-form generation, multi-turn dialogue, or code generation, where "correctness" and "answer" are less discrete.

3. Primarily diagnostic, not prescriptive: The paper convincingly shows that protocol choices matter but does not recommend a preferred protocol or provide guidance on which configuration best approximates "true" model uncertainty (if such a thing exists). The checklist standardizes reporting but not methodology.

4. Limited novelty in individual findings: That conditioning context matters for language models is well-known; the paper's novelty lies in systematically applying this insight to the specific case of confidence calibration comparison, which is more of an empirical contribution than a conceptual breakthrough.

5. Single prompt template: While intentional (to isolate token-side variation), using one verbalized-confidence template means the findings describe protocol sensitivity conditional on that template.

Overall Assessment

This is a careful, well-executed measurement study that makes an important methodological contribution to LLM confidence calibration research. It is unlikely to generate breakthrough follow-up work, but it should influence how confidence calibration is reported and interpreted. The impact is most likely to be felt through improved experimental standards rather than new capabilities.

Rating: 6.5 / 10

Significance 7 · Rigor 8 · Novelty 5.5 · Clarity 8

Generated May 28, 2026

Comparison History (13)

vs. BuddyBench: A Privacy-Constrained Multi-Task Benchmark for Pediatric Social-Communication Personalization

gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact: it introduces a new privacy-constrained, multi-task benchmark with linked longitudinal learning data, standardized clinical assessments, and randomized-treatment endpoints—enabling broad methodological work (knowledge tracing, recommendation, prediction, causal inference) and direct real-world relevance in pediatric personalization. The inclusion of an RCT cohort and a synthetic companion dataset improves rigor and reproducibility. Paper 1 is timely and methodologically careful, but is mainly a measurement/protocol critique within LLM evaluation and is less likely to translate into cross-domain applications beyond AI calibration research.

vs. Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

gpt-5.2 · 5/28/2026

Paper 2 is likely to have higher scientific impact due to a more concrete, experimentally grounded contribution: it isolates and quantifies protocol sensitivity in LLM confidence calibration across models and benchmarks, provides robustness checks, and proposes a practical reporting checklist. This is timely for safety, evaluation, and deployment of LLMs, and can influence both research methodology and applied auditing. Paper 1 raises important, high-utility concerns for low-resource evaluation, but appears more conceptual/framework-oriented with less demonstrated empirical methodology in the abstract, potentially limiting immediate uptake and citation impact.

vs. Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis

gemini-3.1 · 5/28/2026

Paper 2 addresses a foundational methodological issue in how LLM confidence and calibration are measured. While Paper 1 presents a valuable neuro-symbolic framework for medical AI, Paper 2's insights into evaluation protocol sensitivity will broadly impact foundational LLM research, uncertainty quantification, and AI safety across all domains, likely driving widespread adoption of its reporting checklist.

vs. $D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

gpt-5.2 · 5/28/2026

Paper 2 is likely higher impact because it identifies a broad, protocol-dependent measurement confound in LLM confidence calibration that affects many benchmarks, model families, and downstream uses (evaluation, uncertainty estimation, decision-making). Its findings generalize across AR models and provide actionable guidance (a reporting checklist) that can reshape community standards and improve reproducibility. Paper 1 is novel and timely for diffusion LLM safety, but its impact is narrower (limited to D-LLMs and a specific monitoring architecture) and depends on the adoption trajectory of diffusion LLMs.

vs. Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

claude-opus-4.6 · 5/28/2026

Paper 2 introduces a novel benchmark (MentalMap) addressing a fundamental question about LLM world models with a well-designed multilingual hierarchy. The discovery of a universal 'L3 reasoning cliff' is a striking empirical finding with broad implications for understanding LLM capabilities and limitations. The human comparison strengthens claims about fundamental text-based reasoning constraints. Paper 1, while methodologically rigorous, is more narrowly focused on measurement protocol sensitivity in confidence calibration—important but incremental. Paper 2's breadth (13 models, 8 languages, human baselines) and its implications for multimodal AI give it wider impact potential.

vs. Meta-Agent: From Task Descriptions to Verified Multi-Agent Systems

gemini-3.1 · 5/28/2026

Paper 2 addresses a critical and highly timely bottleneck in AI: the brittleness of complex multi-agent workflows. By proposing an automated, self-verifying framework to construct and execute multi-agent systems, it offers significant practical utility across various domains like coding and reasoning. While Paper 1 provides crucial methodological insights for LLM calibration, Paper 2's generative approach to building robust AI systems presents broader potential applications and aligns directly with the rapidly growing interest in autonomous agent deployment.

vs. Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts

claude-opus-4.6 · 5/28/2026

Paper 1 addresses a fundamental methodological issue in LLM confidence calibration—showing that measurement protocol choices (conditioning context, token readout, answer string selection) can change the sign of calibration comparisons. This has broad impact across the entire LLM evaluation community, as confidence calibration is central to trustworthy AI deployment. The actionable reporting checklist and systematic experimental design across multiple models/benchmarks provide immediate practical value. Paper 2 offers interesting insights on process vs. output alignment in specific domains, but its narrower scope (two specific decision contexts) and more niche audience limit its breadth of impact.

vs. A Policy-Driven Runtime Layer for Agentic LLM Serving

claude-opus-4.6 · 5/28/2026

Paper 1 addresses a critical architectural gap in multi-agent LLM serving systems—an increasingly dominant production workload. It proposes a novel runtime layer with concrete primitives and validates it with substantial empirical gains (13-37pp cache hit improvement, 12-29% latency reduction). Its systems-level contribution has broad practical impact across the rapidly growing LLM deployment ecosystem. Paper 2 makes a valuable methodological contribution about calibration measurement sensitivity, but its impact is narrower—primarily improving evaluation practices rather than enabling new capabilities. Paper 1's timeliness and real-world applicability give it higher potential impact.

vs. StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

gemini-3.1 · 5/28/2026

Paper 1 addresses a fundamental methodological flaw in how LLM confidence calibration is measured, exposing high sensitivity to protocol choices. By providing a reporting checklist, it has the potential to broadly influence evaluation standards and improve reproducibility across the entire LLM research community. Paper 2 presents a strong, though more specialized, algorithmic improvement for agent RL, making its overall scientific impact likely narrower.

vs. CoRe-Code: Collaborative Reinforcement Learning for Code Generation

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a fundamental methodological issue in LLM confidence calibration that affects a wide range of downstream research. By demonstrating that calibration comparisons are highly sensitive to measurement protocol choices—often unreported—it provides a critical methodological contribution with broad implications across all fields using LLM uncertainty estimates. The reporting checklist offers lasting practical value. Paper 1, while solid engineering work combining multi-agent systems with RL for code generation, is more incremental, building on existing GRPO and planner-coder paradigms in a rapidly evolving area where methods are quickly superseded.

vs. The Compressive Knowledge Graph Hypothesis: Which Graph Facts Matter for Scientific Hypothesis Generation?

gemini-3.1 · 5/28/2026

Paper 1 has higher potential impact because it addresses a fundamental methodological flaw in measuring LLM confidence and calibration. As AI reliability becomes critical across all domains, establishing rigorous evaluation protocols for model uncertainty is universally essential. While Paper 2 offers valuable insights for knowledge graph compression in scientific discovery, its scope is somewhat narrower. Paper 1's findings on protocol sensitivity challenge widespread assumptions in AI evaluation and propose a reporting checklist that could become a standard for future LLM research.

vs. EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA

gpt-5.2 · 5/28/2026

Paper 1 has higher likely impact due to broader relevance and timeliness: it exposes protocol-dependent artifacts in a widely used evaluation paradigm (LLM confidence calibration), offers systematic ablations across models/benchmarks, and provides actionable reporting guidance that can influence many subsequent studies and applications (safety, uncertainty, human-AI interaction). Paper 2 is a useful RLVR training tweak for open-ended QA, but its demonstrated scope is narrower (two medical QA datasets) and may generalize less broadly; its methodological contribution is more incremental compared to reframing and standard-setting in Paper 1.

vs. MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact due to strong real-world applicability (privacy-preserving, low-latency on-device mobile agents), timely relevance, and broader cross-field influence (systems/OS, HCI, mobile computing, VLM agents). It proposes a concrete framework (online exploration + structured memory + rollback) and reports device-level evaluations with latency and success-rate gains, suggesting methodological rigor and deployability. Paper 1 is valuable and conceptually important for LLM evaluation rigor, but its primary contribution is diagnostic/protocol-clarifying rather than enabling new capabilities, so downstream practical impact may be narrower.