Bora Kargi, David Salinas
Evaluating new large language models typically requires costly human annotation campaigns at scale. LLM-as-a-judge offers a cheaper alternative, but judge scores carry systematic errors, such as position bias, self-preference, or intransitivity, that can strongly miscalibrate the resulting rankings. We quantify the resulting judge-human disagreement at two complementary levels. At the local level, we estimate per-battle uncertainty from the judge's own score differences by propagating calibrated win probabilities rather than hard labels into the Bradley-Terry procedure. This alone provides a drastic improvement to Elo estimation accuracy, bringing LLM-derived ratings within 17.9 Elo MAE of human-derived ones when averaged over 55 held-out models on LMArena. At the global level, we apply split conformal prediction to the residual gap between LLM-derived and human-derived Elo ratings across held-out models, producing prediction intervals with distribution-free marginal coverage guarantees that account for irreducible LLM-human disagreement. Together, these two layers yield a low-cost evaluation tool that provides developers with calibrated Elo estimates and honest uncertainty bounds, without access to large-scale human annotations. To facilitate reproducibility, we release our code at https://github.com/kargibora/SoftElo.
The paper addresses a practical and increasingly important problem: how to reliably estimate human-derived Elo ratings for LLMs using automated LLM-as-a-judge evaluations, while providing honest uncertainty quantification. The core innovation is a two-level uncertainty framework:
Local level (Soft-Elo): Instead of collapsing LLM judge score differences into ternary {0, 0.5, 1} labels, the authors fit a temperature parameter β* via MLE on human non-tie battles to map score differences to calibrated win probabilities σ(β*s(x)). These soft targets replace hard labels in the standard Bradley-Terry likelihood — the BT model itself is unchanged. This simple modification reduces held-out Elo MAE from 45.9 to 17.9 across 8 judges and 55 models on LMArena.
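The local-level idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy score differences, the ternary-search optimizer, the gradient-ascent fit, and the 1000-point Elo anchor are all hypothetical choices made for the example. It fits a temperature beta by maximum likelihood on human non-tie battles, maps judge score differences to soft targets sigmoid(beta * s), and plugs those targets into an otherwise unchanged Bradley-Terry log-likelihood.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softplus(x):
    # Stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def neg_log_lik(beta, score_diffs, human_wins):
    # Bernoulli negative log-likelihood of human outcomes under sigmoid(beta * s).
    total = 0.0
    for s, y in zip(score_diffs, human_wins):
        z = beta * s
        total += softplus(-z) if y == 1 else softplus(z)
    return total

def fit_beta(score_diffs, human_wins, lo=0.0, hi=20.0, iters=100):
    # The objective is convex in beta, so a ternary search on [lo, hi] suffices.
    # The search interval is an assumption of this sketch.
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if neg_log_lik(m1, score_diffs, human_wins) < neg_log_lik(m2, score_diffs, human_wins):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0

def soft_bradley_terry(battles, n_models, lr=0.5, steps=2000):
    # battles: (i, j, p) with p the soft probability that model i beats model j.
    # Maximizes sum of p*log(sigmoid(d)) + (1-p)*log(1-sigmoid(d)), d = theta_i - theta_j.
    theta = [0.0] * n_models
    for _ in range(steps):
        grad = [0.0] * n_models
        for i, j, p in battles:
            q = sigmoid(theta[i] - theta[j])
            grad[i] += p - q  # derivative of the soft log-likelihood w.r.t. theta_i
            grad[j] -= p - q
        theta = [t + lr * g / len(battles) for t, g in zip(theta, grad)]
        mean = sum(theta) / n_models  # center the strengths for identifiability
        theta = [t - mean for t in theta]
    return theta

# Toy calibration data (made up): judge score differences and human outcomes (1 = first model won).
calib_diffs = [2.0, 1.0, -1.0, -2.0, 1.0, -1.0, 3.0, -3.0]
calib_wins = [1, 1, 0, 0, 0, 1, 1, 0]
beta_star = fit_beta(calib_diffs, calib_wins)

# Toy judged battles (made up): (model_i, model_j, judge score difference).
judged = [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 3.0), (1, 0, -1.0)]
soft_battles = [(i, j, sigmoid(beta_star * s)) for i, j, s in judged]

theta = soft_bradley_terry(soft_battles, n_models=3)
# Conventional Elo scaling; the 1000-point anchor is an arbitrary assumption.
elo = [1000.0 + 400.0 / math.log(10.0) * t for t in theta]
print(round(beta_star, 3), [round(r, 1) for r in elo])
```

Replacing `sigmoid(beta_star * s)` with a hard 0/1 label recovers the usual hard-label fit, which makes the two pipelines easy to compare on the same battles.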
Global level: Split conformal prediction is applied to the residual gap between LLM-derived and human-derived Elo ratings, producing prediction intervals with distribution-free marginal coverage guarantees. The normalized nonconformity score divides absolute residuals by bootstrap standard errors, yielding locally adaptive intervals.
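A minimal sketch of the global-level construction follows, assuming a calibration set of models with both human-derived and LLM-derived Elo ratings plus a bootstrap standard error per model. All numbers and function names below are made up for illustration, and the paper's exact protocol (for example its 5-split evaluation) is not reproduced here. The nonconformity score is the absolute residual divided by the standard error, and the interval for a new model scales the calibrated quantile by that model's own standard error.

```python
import math

def conformal_quantile(scores, alpha):
    # Finite-sample split-conformal quantile: the k-th smallest score
    # with k = ceil((n + 1) * (1 - alpha)); infinite if k exceeds n.
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]

def conformal_interval(llm_elo, std_err, q):
    # Locally adaptive interval: wider where the bootstrap standard error is larger.
    return llm_elo - q * std_err, llm_elo + q * std_err

# Hypothetical calibration models (illustrative values only).
human_elo = [1210.0, 1185.0, 1160.0, 1142.0, 1120.0, 1101.0, 1080.0, 1066.0, 1040.0, 1015.0]
llm_elo = [1195.0, 1200.0, 1150.0, 1160.0, 1110.0, 1090.0, 1095.0, 1050.0, 1030.0, 1035.0]
boot_se = [8.0, 10.0, 6.0, 9.0, 7.0, 8.0, 11.0, 9.0, 6.0, 12.0]

# Normalized nonconformity scores on the calibration set.
scores = [abs(h - l) / se for h, l, se in zip(human_elo, llm_elo, boot_se)]
alpha = 0.1
q = conformal_quantile(scores, alpha)

# Interval for a new held-out model that has only an LLM-derived Elo and a bootstrap SE.
lower, upper = conformal_interval(llm_elo=1130.0, std_err=9.0, q=q)
print(q, lower, upper)
```

The coverage guarantee this construction gives is marginal and relies on the new model being exchangeable with the calibration models, which is the constraint the review raises for frontier models.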
The elegance lies in the minimalism: changing only the *targets* fed into an existing BT pipeline, rather than modifying the ranking model itself, yields dramatic improvements.
The experimental design is thorough and well-controlled.
One notable strength is the label-smoothing baseline comparison (Appendix H), which tests whether the gain comes specifically from score-difference information or merely from moving targets off {0,1}. While label smoothing with optimally tuned c achieves competitive MAE (26.9 vs 17.9), it requires retrospective tuning without a principled diagnostic, whereas β* is fit via interpretable MLE.
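The difference between the two kinds of target can be made concrete with a short sketch. The exact form of label smoothing used in Appendix H is not given here, so the symmetric form below (targets 1 - c and c for a win and a loss) is an assumption, as are the values chosen for c and beta.

```python
import math

def hard_target(score_diff):
    # Ternary label from the sign of the judge's score difference.
    if score_diff > 0:
        return 1.0
    if score_diff < 0:
        return 0.0
    return 0.5

def smoothed_target(score_diff, c):
    # Assumed label-smoothing form: a win becomes 1 - c, a loss becomes c, a tie stays 0.5.
    h = hard_target(score_diff)
    return h * (1.0 - c) + (1.0 - h) * c

def soft_target(score_diff, beta):
    # Soft target: calibrated win probability that grows with the score margin.
    return 1.0 / (1.0 + math.exp(-beta * score_diff))

c, beta = 0.1, 1.0  # hypothetical values for illustration
for s in (0.5, 5.0):
    print(s, smoothed_target(s, c), soft_target(s, beta))
```

Label smoothing assigns a narrow win and a decisive win the same target, whereas the soft target keeps the margin information carried by the score difference.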
The conformal construction is standard but appropriately applied. The normalized nonconformity scores and the 5-split evaluation protocol are sound. Coverage is maintained (92-96% at 90% nominal) while intervals shrink 39-70%.
Immediate practical value: The method provides a concrete, deployable pipeline for LLM developers who want Elo estimates with uncertainty bounds without running expensive human annotation campaigns. The 17.9 Elo MAE is operationally meaningful — roughly the difference between adjacent models on crowded leaderboards.
Leaderboard methodology: The paper makes a convincing case against fixed-baseline win-rate reporting (citing the Gemini-2.5 example where rankings shift dramatically with judge choice). The multi-opponent Elo estimation paradigm with conformal intervals is a better framework for model comparison.
Broader methodological insight: The finding that score differences carry calibrated uncertainty information is transferable beyond LLM evaluation. Any domain using pairwise comparisons with scalar scores (recommendation systems, sports analytics, peer review) could benefit from soft BT targets.
Limitations on impact: The method still requires some human-labeled battles for calibration (β* fitting and conformal calibration set), though not for the model under test. The conformal guarantee is marginal, not conditional on model strength — the residual analysis shows remaining strength-correlated structure even under Soft-Elo. The reliance on exchangeability is a real constraint for frontier model evaluation, precisely the regime of greatest interest.
This paper arrives at a critical juncture. LLM-as-a-judge is becoming the dominant evaluation paradigm as human annotation costs scale poorly. Simultaneously, the community is recognizing the fragility of existing benchmarks (Arena-Hard score sensitivity to judge choice, leaderboard gaming, benchmark saturation). The paper directly addresses the reliability gap that makes automated evaluation untrustworthy for deployment decisions.
The use of open-weight judges exclusively (except DeepSeek via API) aligns with the reproducibility concern the paper raises about closed-weight judge deprecation.
This is a well-executed, practically motivated paper that makes a clear contribution to LLM evaluation methodology. The insight that hard labels discard useful uncertainty information from judge scores is not entirely novel (the related work on ordinal feedback and soft preferences is acknowledged), but the specific application to Elo estimation with the conformal prediction wrapper is new and well-validated. The paper's main contribution is showing that a minimal, interpretable modification to existing pipelines yields substantial practical gains. It is more engineering-oriented than theoretically novel, but the engineering is rigorous and the problem is important.
Generated Jun 12, 2026
Paper 1 targets a highly timely bottleneck—reliable, low-cost evaluation of LLMs—where even incremental improvements can propagate broadly across model development, benchmarking, and deployment. It combines a practical innovation (soft win-probability propagation in Bradley–Terry/Elo) with distribution-free conformal intervals, directly addressing systematic judge–human mismatch with calibrated uncertainty, and is validated on a major real-world platform (LMArena) with released code. Paper 2 is methodologically interesting for ensemble pruning/calibration, but bagging compression is a more mature area and likely has narrower cross-field urgency and impact today.
Paper 1 addresses a high-impact, timely problem in LLM evaluation—a rapidly growing field with broad relevance. It introduces a novel combination of soft Elo estimation with conformal prediction to provide calibrated uncertainty bounds for LLM rankings without costly human annotation, demonstrating strong empirical results (17.9 Elo MAE). This has immediate practical applications for the entire LLM development community. Paper 2 provides a valuable benchmarking contribution for wearable HAR but serves a narrower community, and its main finding—performance plateau—limits its forward-looking impact. Paper 1's methodological novelty and broader relevance give it higher estimated impact.
Paper 1 targets a timely, high-leverage bottleneck: scalable, trustworthy LLM evaluation. Its combination of soft-label Bradley–Terry/Elo with calibrated win probabilities plus split conformal prediction for distribution-free uncertainty intervals is methodologically novel and rigorous, and directly applicable to real-world model development/benchmarking. It also broadens impact across ML evaluation, statistics, and AI governance. Paper 2 is a reasonable incremental extension of existing GNN-based unsupervised clustering with self-training; impact is narrower, results appear more conditional (e.g., balanced clusters), and novelty is less distinct in a crowded area.
SlimSearcher addresses a critical and timely problem—computational efficiency of AI agents—with a principled multi-stage framework combining Pareto-efficient filtration and adaptive reward gating. It demonstrates substantial practical impact (17-58% reduction in tool calls) across multiple benchmarks while maintaining accuracy. The efficiency-accuracy tradeoff is fundamental to scaling AI agents in real-world deployment. Paper 2, while methodologically sound in combining conformal prediction with Elo estimation for LLM evaluation, addresses a narrower problem with more incremental contributions. SlimSearcher's broader applicability to the rapidly growing agent ecosystem gives it higher potential impact.
Paper 2 is more likely to have higher scientific impact: it proposes a broadly applicable geometric theory (projection caustics) for abrupt transitions in continuous-time generative dynamics and introduces a diagnostic (CBD) with demonstrated use across toy, diffusion, flow-matching, and latent text-to-image models. This combines novelty with cross-domain relevance to a rapidly evolving core area of ML. Paper 1 is practical and timely for LLM evaluation, but is more incremental (calibration + conformal intervals atop established Bradley–Terry/Elo) and its impact is narrower to benchmarking workflows.
Paper 2 addresses the highly timely and practically important problem of LLM evaluation, proposing a principled statistical framework (conformal prediction + calibrated Bradley-Terry) that reduces reliance on expensive human annotations. Its broad applicability to the rapidly growing LLM ecosystem, concrete quantitative improvements (17.9 Elo MAE), distribution-free coverage guarantees, and released code give it wider potential impact. Paper 1 makes solid but incremental improvements to multimodal VAEs, a more niche area with less immediate broad impact compared to the urgent need for reliable LLM benchmarking.
Paper 2 has higher potential impact: it introduces a novel, methodologically rigorous calibration framework (probabilistic Bradley–Terry/Elo plus split conformal prediction) with distribution-free uncertainty guarantees, directly addressing a timely bottleneck in LLM evaluation and benchmarking. Its applications are broad and immediate (model development, leaderboards, safety/regression testing) across AI/ML and related fields, and it reports quantitative improvements on a large real-world dataset with released code. Paper 1 is applied and domain-specific; its negative result on GAN augmentation and narrower scope likely limit broader scientific influence.
Paper 2 likely has higher scientific impact: it targets a central, timely question—how RL post-training produces reasoning gains—and proposes mechanistic explanations (strategy selection vs. improvement) with actionable levers (SFT diversity, RL difficulty schedules) that could influence many future training pipelines across reasoning/coding models. Its breadth spans mechanistic interpretability, RLHF/RLAIF methodology, and capability scaling. Paper 1 is innovative and practically useful for low-cost evaluation, but its impact is more niche (LLM ranking calibration) and depends on access to human ground truth for conformal calibration, limiting generality.
Paper 1 provides fundamental theoretical contributions to asynchronous distributed optimization—proving that gradient clipping removes dependence on maximum delay and establishing high-probability convergence bounds under heavy-tailed noise. These results have broad impact across distributed ML, federated learning, and large-scale training. Paper 2 addresses a timely but narrower problem (LLM evaluation calibration) with incremental methodological contributions combining existing techniques (conformal prediction, Bradley-Terry). While practically useful, Paper 1's theoretical insights are more foundational and applicable across a wider range of settings.
Paper 1 addresses a highly timely and broadly impactful problem—reliable and cost-effective LLM evaluation—which affects the entire AI community. It combines calibrated soft Elo estimation with conformal prediction to provide distribution-free uncertainty guarantees, offering strong methodological novelty and immediate practical utility for LLM developers. Paper 2 proposes a useful but incremental improvement (adaptive memory gating) for neural operators on specific PDEs, with narrower scope and audience. Paper 1's relevance to the rapidly growing LLM ecosystem gives it significantly broader potential impact.