
From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation

Bora Kargi, David Salinas

cs.LG
#3371 of 5669 · cs.LG
Tournament Score: 1379±48
Win Rate: 59% (10 wins, 7 losses, 17 matches)
Rating: 7/10
Significance 7 · Rigor 8 · Novelty 6 · Clarity 8.5

Abstract

Evaluating new large language models typically requires costly human annotation campaigns at scale. LLM-as-a-judge offers a cheaper alternative, but judge scores carry systematic errors (such as position bias, self-preference, or intransitivity) that can strongly miscalibrate the resulting rankings. We quantify the resulting judge-human disagreement at two complementary levels. At the local level, we estimate per-battle uncertainty from the judge's own score differences by propagating calibrated win probabilities rather than hard labels into the Bradley-Terry procedure. This alone provides a drastic improvement to Elo estimation accuracy, bringing LLM-derived ratings within 17.9 Elo MAE of human-derived ones when averaged over 55 held-out models on LMArena. At the global level, we apply split conformal prediction to the residual gap between LLM-derived and human-derived Elo ratings across held-out models, producing prediction intervals with distribution-free marginal coverage guarantees that account for irreducible LLM-human disagreement. Together, these two layers yield a low-cost evaluation tool that provides developers with calibrated Elo estimates and honest uncertainty bounds, without access to large-scale human annotations. To facilitate reproducibility, we release our code at https://github.com/kargibora/SoftElo .

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation"

1. Core Contribution

The paper addresses a practical and increasingly important problem: how to reliably estimate human-derived Elo ratings for LLMs using automated LLM-as-a-judge evaluations, while providing honest uncertainty quantification. The core innovation is a two-level uncertainty framework:

Local level (Soft-Elo): Instead of collapsing LLM judge score differences into ternary {0, 0.5, 1} labels, the authors fit a temperature parameter β* via MLE on human non-tie battles to map score differences to calibrated win probabilities σ(β*s(x)). These soft targets replace hard labels in the standard Bradley-Terry likelihood — the BT model itself is unchanged. This simple modification reduces held-out Elo MAE from 45.9 to 17.9 across 8 judges and 55 models on LMArena.
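To make the temperature fit concrete, here is a minimal sketch of the calibration step as described above: a one-parameter logistic MLE for β in P(A wins) = σ(β·s), fit on battles with binary human outcomes. The data are synthetic, and the function names (`fit_beta`, `sigmoid`) and the Newton solver are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_beta(score_diff, human_win, n_iter=100, tol=1e-10):
    """1-D logistic MLE for beta in P(A wins) = sigmoid(beta * s), via Newton's method."""
    s = np.asarray(score_diff, dtype=float)
    y = np.asarray(human_win, dtype=float)
    beta = 0.0
    for _ in range(n_iter):
        p = sigmoid(beta * s)
        grad = np.sum((p - y) * s)            # derivative of the negative log-likelihood
        hess = np.sum(p * (1.0 - p) * s * s)  # second derivative (non-negative)
        if hess < 1e-12:
            break
        step = grad / hess
        beta -= step
        if abs(step) < tol:
            break
    return beta

# Synthetic stand-in for human non-tie battles (assumption: not LMArena data).
rng = np.random.default_rng(0)
true_beta = 0.7
score_diff = rng.normal(0.0, 2.0, size=5000)  # judge score(A) - judge score(B)
human_win = (rng.random(5000) < sigmoid(true_beta * score_diff)).astype(float)

beta_hat = fit_beta(score_diff, human_win)

# Soft targets that would replace the ternary labels in the BT fit.
soft_targets = sigmoid(beta_hat * score_diff)
# Ternary labels, shown for contrast with the soft targets.
hard_labels = np.where(score_diff > 0, 1.0, np.where(score_diff < 0, 0.0, 0.5))
```

A small fitted β pushes every soft target toward 0.5, which is consistent with the review's later remark that a low β* signals an uninformative score-difference signal.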

Global level: Split conformal prediction is applied to the residual gap between LLM-derived and human-derived Elo ratings, producing prediction intervals with distribution-free marginal coverage guarantees. The normalized nonconformity score divides absolute residuals by bootstrap standard errors, yielding locally adaptive intervals.
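The normalized split-conformal step can be sketched as follows. This is a generic illustration of the technique under stated assumptions: the calibration ratings and bootstrap standard errors are synthetic, the variable names are hypothetical, and the bootstrap that would produce the standard errors is taken as given.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    return scores[k - 1] if k <= n else np.inf

# Synthetic calibration set (assumption: stands in for held-out models with both ratings).
rng = np.random.default_rng(1)
n_cal = 40
boot_se = rng.uniform(5.0, 20.0, size=n_cal)         # bootstrap SE of each LLM-derived Elo
elo_human = rng.normal(1200.0, 100.0, size=n_cal)
elo_llm = elo_human + boot_se * rng.normal(size=n_cal)

# Normalized nonconformity score: absolute residual divided by the bootstrap SE.
scores = np.abs(elo_llm - elo_human) / boot_se

alpha = 0.10                                          # target 90% marginal coverage
q_hat = conformal_quantile(scores, alpha)

# Interval for a new model: wider when its own bootstrap SE is larger.
new_elo_llm, new_boot_se = 1250.0, 12.0               # hypothetical values
lower = new_elo_llm - q_hat * new_boot_se
upper = new_elo_llm + q_hat * new_boot_se
```

Because the quantile is taken over residuals measured in units of standard error, the interval half-width `q_hat * new_boot_se` adapts per model, which is what "locally adaptive" refers to here.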

The elegance lies in the minimalism: changing only the *targets* fed into an existing BT pipeline, rather than modifying the ranking model itself, yields dramatic improvements.
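A short sketch may help show what "changing only the targets" means in practice: the same Bradley-Terry fitting routine accepts either hard or soft targets. Everything below is an illustrative assumption (plain gradient ascent, synthetic battles, the function names, and the Elo offset of 1000), not the paper's implementation. The toy judge is noiseless, so the scale inflation under hard labels is a property of this toy setup and not a reproduction of the reported MAE numbers.

```python
import numpy as np

def fit_bradley_terry(idx_a, idx_b, targets, n_models, lr=1.0, n_iter=2000):
    """Gradient ascent on the BT log-likelihood with targets in [0, 1].

    targets[k] is the (possibly soft) probability that model idx_a[k] beats idx_b[k].
    """
    theta = np.zeros(n_models)
    n = len(targets)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[idx_a] - theta[idx_b])))
        resid = targets - p  # per-battle gradient with respect to the logit
        grad = (np.bincount(idx_a, weights=resid, minlength=n_models)
                - np.bincount(idx_b, weights=resid, minlength=n_models))
        theta += lr * grad / n
        theta -= theta.mean()  # BT strengths are only defined up to a shift
    return theta

def to_elo(theta, scale=400.0, base=10.0, offset=1000.0):
    """Map BT log-strengths to an Elo-like scale (offset is an assumed convention)."""
    return offset + scale * theta / np.log(base)

# Synthetic battles between 5 models with known strengths.
rng = np.random.default_rng(0)
n_models, n_battles = 5, 4000
true_theta = np.linspace(-1.0, 1.0, n_models)
idx_a = rng.integers(0, n_models, size=n_battles)
idx_b = (idx_a + rng.integers(1, n_models, size=n_battles)) % n_models  # never equals idx_a

# Soft targets: calibrated win probabilities. Hard targets: the same signal thresholded.
soft = 1.0 / (1.0 + np.exp(-(true_theta[idx_a] - true_theta[idx_b])))
hard = (soft > 0.5).astype(float)

theta_soft = fit_bradley_terry(idx_a, idx_b, soft, n_models)
theta_hard = fit_bradley_terry(idx_a, idx_b, hard, n_models)
elo_soft, elo_hard = to_elo(theta_soft), to_elo(theta_hard)
```

In this toy case the soft-target fit returns the true strengths, while the hard-target fit stretches the scale, since thresholded labels claim the stronger model always wins.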

2. Methodological Rigor

The experimental design is thorough and well-controlled:

  • Leave-one-model-out evaluation prevents data leakage: β* is refit excluding the held-out model's battles, and the variation across folds is shown to be negligible (std ≤ 0.005).
  • Eight diverse judges spanning multiple model families (Qwen, Gemma, GPT-OSS, Llama, DeepSeek) provide breadth.
  • Three corpora (LMArena 100K, 140K, ComparIA) test generalization. The ComparIA French-language stress test is particularly valuable, revealing failure modes (Qwen judges lose rank correlation when β* collapses).
  • Careful diagnostic separation of battle-level agreement (κ), rank fidelity (ρ), and scale fidelity (MAE) exposes the specific failure mode: judges recover rankings but distort the Elo scale.
  • The paper honestly documents failure modes: Soft-Elo can over-compress when the score-difference signal is uninformative (low β*), and the conformal guarantee requires exchangeability that may not hold under distribution shift.
  • One notable strength is the label-smoothing baseline comparison (Appendix H), which tests whether the gain comes specifically from score-difference information or merely from moving targets off {0,1}. While label smoothing with optimally tuned c achieves competitive MAE (26.9 vs 17.9), it requires retrospective tuning without a principled diagnostic, whereas β* is fit via interpretable MLE.
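The three diagnostics named in the list above (κ, ρ, MAE) measure different things, and a toy computation shows how a judge can score perfectly on rank fidelity while failing on scale fidelity. The helper names and all numbers below are made up for illustration; the Spearman helper assumes no tied ratings.

```python
import numpy as np

def cohen_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    a, b = np.asarray(a), np.asarray(b)
    p_obs = np.mean(a == b)
    p_exp = sum(np.mean(a == lab) * np.mean(b == lab) for lab in np.union1d(a, b))
    return (p_obs - p_exp) / (1.0 - p_exp)

def spearman_rho(x, y):
    """Spearman rank correlation via double argsort (assumes no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]

def elo_mae(elo_pred, elo_ref):
    """Mean absolute error between two sets of Elo ratings."""
    return float(np.mean(np.abs(np.asarray(elo_pred) - np.asarray(elo_ref))))

# Battle-level agreement between human and judge verdicts (1 = model A wins).
human_labels = [1, 1, 0, 0, 1, 0]
judge_labels = [1, 0, 0, 0, 1, 1]
kappa = cohen_kappa(human_labels, judge_labels)

# A judge that preserves the ordering but doubles the spread around the mean.
elo_human = np.array([1000.0, 1100.0, 1200.0, 1300.0])
elo_judge = elo_human.mean() + 2.0 * (elo_human - elo_human.mean())

rho = spearman_rho(elo_judge, elo_human)  # rank fidelity
mae = elo_mae(elo_judge, elo_human)       # scale fidelity
```

Here the stretched ratings keep the ordering intact, so ρ alone would not reveal the distortion that MAE picks up.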

The conformal construction is standard but appropriately applied. The normalized nonconformity scores and the 5-split evaluation protocol are sound. Coverage is maintained (92-96% at 90% nominal) while intervals shrink 39-70%.
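One plausible reading of a multi-split coverage check is sketched below. It assumes the protocol means five random calibration/test partitions of the 55-model pool, which the review does not spell out; the 40/15 split size, the synthetic ratings, and the variable names are likewise assumptions.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    return scores[k - 1] if k <= n else np.inf

rng = np.random.default_rng(2)
n_models, n_cal, n_splits, alpha = 55, 40, 5, 0.10

# Synthetic pool of models with human Elo, LLM-derived Elo, and bootstrap SE.
boot_se = rng.uniform(5.0, 20.0, size=n_models)
elo_human = rng.normal(1200.0, 100.0, size=n_models)
elo_llm = elo_human + boot_se * rng.normal(size=n_models)
scores = np.abs(elo_llm - elo_human) / boot_se  # normalized nonconformity scores

coverages, widths = [], []
for _ in range(n_splits):
    perm = rng.permutation(n_models)
    cal, test = perm[:n_cal], perm[n_cal:]
    q_hat = conformal_quantile(scores[cal], alpha)
    # score <= q_hat is equivalent to the human Elo lying inside elo_llm +/- q_hat * SE.
    covered = scores[test] <= q_hat
    coverages.append(covered.mean())
    widths.append(np.median(2.0 * q_hat * boot_se[test]))

mean_coverage = float(np.mean(coverages))
median_width = float(np.median(widths))
```

Reporting both numbers together matters: coverage alone can be satisfied by very wide intervals, so the median width is the quantity that shows whether the intervals are useful.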

3. Potential Impact

Immediate practical value: The method provides a concrete, deployable pipeline for LLM developers who want Elo estimates with uncertainty bounds without running expensive human annotation campaigns. The 17.9 Elo MAE is operationally meaningful: it is roughly the difference between adjacent models on crowded leaderboards.

Leaderboard methodology: The paper makes a convincing case against fixed-baseline win-rate reporting (citing the Gemini-2.5 example where rankings shift dramatically with judge choice). The multi-opponent Elo estimation paradigm with conformal intervals is a better framework for model comparison.

Broader methodological insight: The finding that score differences carry calibrated uncertainty information is transferable beyond LLM evaluation. Any domain using pairwise comparisons with scalar scores (recommendation systems, sports analytics, peer review) could benefit from soft BT targets.

Limitations on impact: The method still requires some human-labeled battles for calibration (β* fitting and conformal calibration set), though not for the model under test. The conformal guarantee is marginal, not conditional on model strength; the residual analysis shows remaining strength-correlated structure even under Soft-Elo. The reliance on exchangeability is a real constraint for frontier model evaluation, precisely the regime of greatest interest.

4. Timeliness & Relevance

This paper arrives at a critical juncture. LLM-as-a-judge is becoming the dominant evaluation paradigm as human annotation costs scale poorly. Simultaneously, the community is recognizing the fragility of existing benchmarks (Arena-Hard score sensitivity to judge choice, leaderboard gaming, benchmark saturation). The paper directly addresses the reliability gap that makes automated evaluation untrustworthy for deployment decisions.

The use of open-weight judges exclusively (except DeepSeek via API) aligns with the reproducibility concern the paper raises about closed-weight judge deprecation.

5. Strengths & Limitations

Key Strengths:

  • Minimal, principled intervention (changing targets, not models) with large empirical gains
  • Comprehensive diagnostics that separate ranking from scaling failures
  • Cross-corpus and cross-lingual validation with honest failure-mode documentation
  • The β* diagnostic serves as a pre-deployment check: low β* warns when Soft-Elo shouldn't be trusted
  • Code release and detailed appendices support reproducibility

Notable Limitations:

  • The 55-model calibration pool is modest for conformal prediction; exchangeability across model families/vintages is assumed but may not hold for truly novel architectures
  • No epistemic uncertainty modeling (hallucinated scores, prompt ambiguity)
  • The conformal intervals, while narrower, are still quite wide (74-143 Elo median for Soft-Elo) — enough to span several tiers on a leaderboard
  • The method inherits all BT limitations (transitivity assumption, scalar projection of preferences)
  • β* calibration requires human non-tie labels from the same prompt distribution

Overall Assessment

This is a well-executed, practically motivated paper that makes a clear contribution to LLM evaluation methodology. The insight that hard labels discard useful uncertainty information from judge scores is not entirely novel (the related work on ordinal feedback and soft preferences is acknowledged), but the specific application to Elo estimation with the conformal prediction wrapper is new and well-validated. The paper's main contribution is showing that a minimal, interpretable modification to existing pipelines yields substantial practical gains. It is more engineering-oriented than theoretically novel, but the engineering is rigorous and the problem is important.


Generated Jun 12, 2026

Comparison History (17)

Won vs. Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

    Paper 1 targets a highly timely bottleneck—reliable, low-cost evaluation of LLMs—where even incremental improvements can propagate broadly across model development, benchmarking, and deployment. It combines a practical innovation (soft win-probability propagation in Bradley–Terry/Elo) with distribution-free conformal intervals, directly addressing systematic judge–human mismatch with calibrated uncertainty, and is validated on a major real-world platform (LMArena) with released code. Paper 2 is methodologically interesting for ensemble pruning/calibration, but bagging compression is a more mature area and likely has narrower cross-field urgency and impact today.

    gpt-5.2 · Jun 12, 2026

Won vs. WHAR Arena: Benchmarking the State of the Art in Efficient Wearable Human Activity Recognition

    Paper 1 addresses a high-impact, timely problem in LLM evaluation—a rapidly growing field with broad relevance. It introduces a novel combination of soft Elo estimation with conformal prediction to provide calibrated uncertainty bounds for LLM rankings without costly human annotation, demonstrating strong empirical results (17.9 Elo MAE). This has immediate practical applications for the entire LLM development community. Paper 2 provides a valuable benchmarking contribution for wearable HAR but serves a narrower community, and its main finding—performance plateau—limits its forward-looking impact. Paper 1's methodological novelty and broader relevance give it higher estimated impact.

    claude-opus-4-6 · Jun 12, 2026

Won vs. Clustering Node Attributed Networks with Graph Neural Networks and Self Learning

    Paper 1 targets a timely, high-leverage bottleneck: scalable, trustworthy LLM evaluation. Its combination of soft-label Bradley–Terry/Elo with calibrated win probabilities plus split conformal prediction for distribution-free uncertainty intervals is methodologically novel and rigorous, and directly applicable to real-world model development/benchmarking. It also broadens impact across ML evaluation, statistics, and AI governance. Paper 2 is a reasonable incremental extension of existing GNN-based unsupervised clustering with self-training; impact is narrower, results appear more conditional (e.g., balanced clusters), and novelty is less distinct in a crowded area.

    gpt-5.2 · Jun 12, 2026

Lost vs. SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

    SlimSearcher addresses a critical and timely problem—computational efficiency of AI agents—with a principled multi-stage framework combining Pareto-efficient filtration and adaptive reward gating. It demonstrates substantial practical impact (17-58% reduction in tool calls) across multiple benchmarks while maintaining accuracy. The efficiency-accuracy tradeoff is fundamental to scaling AI agents in real-world deployment. Paper 2, while methodologically sound in combining conformal prediction with Elo estimation for LLM evaluation, addresses a narrower problem with more incremental contributions. SlimSearcher's broader applicability to the rapidly growing agent ecosystem gives it higher potential impact.

    claude-opus-4-6 · Jun 12, 2026

Lost vs. The Geometry of Phase Transitions in Generative Dynamics via Projection Caustics

    Paper 2 is more likely to have higher scientific impact: it proposes a broadly applicable geometric theory (projection caustics) for abrupt transitions in continuous-time generative dynamics and introduces a diagnostic (CBD) with demonstrated use across toy, diffusion, flow-matching, and latent text-to-image models. This combines novelty with cross-domain relevance to a rapidly evolving core area of ML. Paper 1 is practical and timely for LLM evaluation, but is more incremental (calibration + conformal intervals atop established Bradley–Terry/Elo) and its impact is narrower to benchmarking workflows.

    gpt-5.2 · Jun 12, 2026

Won vs. Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs

    Paper 2 addresses the highly timely and practically important problem of LLM evaluation, proposing a principled statistical framework (conformal prediction + calibrated Bradley-Terry) that reduces reliance on expensive human annotations. Its broad applicability to the rapidly growing LLM ecosystem, concrete quantitative improvements (17.9 Elo MAE), distribution-free coverage guarantees, and released code give it wider potential impact. Paper 1 makes solid but incremental improvements to multimodal VAEs, a more niche area with less immediate broad impact compared to the urgent need for reliable LLM benchmarking.

    claude-opus-4-6 · Jun 12, 2026

Won vs. To GAN or Not To GAN: Segmentation Analysis on Mars DEM

    Paper 2 has higher potential impact: it introduces a novel, methodologically rigorous calibration framework (probabilistic Bradley–Terry/Elo plus split conformal prediction) with distribution-free uncertainty guarantees, directly addressing a timely bottleneck in LLM evaluation and benchmarking. Its applications are broad and immediate (model development, leaderboards, safety/regression testing) across AI/ML and related fields, and it reports quantitative improvements on a large real-world dataset with released code. Paper 1 is applied and domain-specific; its negative result on GAN augmentation and narrower scope likely limit broader scientific influence.

    gpt-5.2 · Jun 12, 2026

Lost vs. Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

    Paper 2 likely has higher scientific impact: it targets a central, timely question—how RL post-training produces reasoning gains—and proposes mechanistic explanations (strategy selection vs. improvement) with actionable levers (SFT diversity, RL difficulty schedules) that could influence many future training pipelines across reasoning/coding models. Its breadth spans mechanistic interpretability, RLHF/RLAIF methodology, and capability scaling. Paper 1 is innovative and practically useful for low-cost evaluation, but its impact is more niche (LLM ranking calibration) and depends on access to human ground truth for conformal calibration, limiting generality.

    gpt-5.2 · Jun 12, 2026

Lost vs. Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers

    Paper 1 provides fundamental theoretical contributions to asynchronous distributed optimization—proving that gradient clipping removes dependence on maximum delay and establishing high-probability convergence bounds under heavy-tailed noise. These results have broad impact across distributed ML, federated learning, and large-scale training. Paper 2 addresses a timely but narrower problem (LLM evaluation calibration) with incremental methodological contributions combining existing techniques (conformal prediction, Bradley-Terry). While practically useful, Paper 1's theoretical insights are more foundational and applicable across a wider range of settings.

    claude-opus-4-6 · Jun 12, 2026

Won vs. How Much Memory Do We Need? Adaptive Memory Gate for Neural Operators

    Paper 1 addresses a highly timely and broadly impactful problem—reliable and cost-effective LLM evaluation—which affects the entire AI community. It combines calibrated soft Elo estimation with conformal prediction to provide distribution-free uncertainty guarantees, offering strong methodological novelty and immediate practical utility for LLM developers. Paper 2 proposes a useful but incremental improvement (adaptive memory gating) for neural operators on specific PDEs, with narrower scope and audience. Paper 1's relevance to the rapidly growing LLM ecosystem gives it significantly broader potential impact.

    claude-opus-4-6 · Jun 12, 2026