Mina Remeli, Moritz Hardt
Pairwise comparisons combined with aggregation methods like Elo have become central to evaluating generative models, yet concerns remain that they reward superficial stylistic cues or display judge biases. In a more positive turn, we show that model rankings from pairwise comparisons strongly agree with ground-truth-based accuracy rankings when such ground truth is available for comparison. By converting five well-known benchmarks into free-form generative evaluations, we find that Elo rankings achieve a Spearman correlation above 0.9 with accuracy rankings and substantially outperform direct evaluation when the judge is weak. Furthermore, style and judge bias have only minor effects on model rankings, despite most judgments occurring on pairs where both candidate answers are correct (or incorrect). On such pairs, we find that repetition after the final answer (echo) is a causal driver of judge preference.
This paper addresses a fundamental validity question about LLM evaluation: do pairwise comparison rankings (aggregated via Bradley-Terry/Elo) actually reflect accuracy-based rankings when ground truth is available? The authors convert five established benchmarks (MMLU Pro, GPQA Diamond, SimpleQA, GSM8K, BBH) into free-form generative evaluations, collect pairwise comparisons using LLM judges, and demonstrate Spearman correlations above 0.9 with accuracy-based rankings. The key insight is that pairwise comparisons recover accuracy orderings even without access to ground truth, and critically, outperform direct judge evaluation when the judge is weak.
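The pipeline summarized above (aggregate pairwise judgments into Bradley-Terry/Elo scores, then compare the resulting ranking with the accuracy ranking) can be illustrated with a minimal sketch. The win counts and accuracies below are made-up toy data for four hypothetical models, and the fitting routine is the textbook MM update for Bradley-Terry, not necessarily the procedure the authors used.

```python
import math

def fit_bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths with the classic MM update.

    wins[i][j] = number of times model i was preferred over model j.
    Assumes every model has at least one win and one comparison.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom)
        scale = sum(new_p)
        p = [x * n / scale for x in new_p]  # normalize so the mean strength is 1
    return p

def ranks(values):
    """Rank 1 = largest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy data (hypothetical): 10 judged comparisons per model pair.
wins = [
    [0, 7, 8, 9],
    [3, 0, 6, 8],
    [2, 4, 0, 7],
    [1, 2, 3, 0],
]
accuracy = [0.9, 0.7, 0.6, 0.4]  # hypothetical ground-truth accuracies

strengths = fit_bradley_terry(wins)
# Conventional Elo scaling of Bradley-Terry strengths; the 1000 offset is arbitrary.
elo = [1000 + 400 * math.log10(s) for s in strengths]
rho = spearman(strengths, accuracy)
print([round(e) for e in elo], rho)
```

In this toy case the judge's preferences are consistent with accuracy, so the two rankings agree exactly; the paper's claim is that real judge data comes close to this (Spearman above 0.9).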
A secondary but notable contribution is the identification of "echo" (post-answer repetition) as a causal driver of judge preference on non-discriminative pairs, providing mechanistic insight into how judges make decisions when both answers are correct or both incorrect.
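For concreteness, an echo intervention of the kind the paper studies (appending the question-answer sequence three times after the response) might look like the sketch below. The function name and the exact "Question:/Answer:" template are assumptions for illustration; the paper's actual formatting is not given here.

```python
def add_echo(response: str, question: str, final_answer: str, repeats: int = 3) -> str:
    """Append a question-answer restatement `repeats` times after the response.

    The template below is hypothetical; only the idea of post-answer
    repetition comes from the reviewed paper.
    """
    echo = f"\n\nQuestion: {question}\nAnswer: {final_answer}"
    return response + echo * repeats

original = "2 + 2 = 4, so the final answer is 4."
echoed = add_echo(original, "What is 2 + 2?", "4")
print(echoed)
```

The original response is left untouched, so any change in judge preference between `original` and `echoed` can be attributed to the appended repetition alone.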
The experimental design is generally sound. The authors test across five diverse benchmarks spanning multiple-choice (converted to free-form), math, factual QA, and multi-task settings, providing reasonable breadth. They evaluate with multiple judge models (gpt-oss-20b, gpt-oss-120b, o3, plus phi-4 and gemma-3-27b-it for generalization), which strengthens the findings.
However, several methodological choices warrant scrutiny:
1. Model filtering: The filtering criteria (>90% parse rate, above-baseline accuracy, 1% accuracy gap between pairs) remove 22% of models in the final step alone. While the authors provide an ablation showing that correlation drops only modestly (0.85→0.82 on GSM8K), this filtering could still inflate apparent alignment by removing the hardest-to-distinguish cases.
2. Number of models ranked: Rankings involve only 10-27 models, which limits the statistical power of rank correlation metrics. With 10 models, a single swap changes Kendall's tau by ~0.044.
3. Echo causal analysis: The intervention study (appending question-answer sequences three times) is well-designed with appropriate controls, though the effect sizes on discriminative pairs (n=108) have wide confidence intervals, making the null result less conclusive.
4. Benchmark selection: All five benchmarks have verifiable answers. The paper acknowledges this limitation, but the title and abstract could more clearly signal this scope restriction.
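The arithmetic behind point 2 can be checked directly. With 10 models there are 45 model pairs, and swapping two adjacent models turns one concordant pair into a discordant one, so Kendall's tau falls by 2/45. The sketch below computes tau-a for tie-free rankings; the Spearman figure is included only for comparison.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two rankings without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs

n = 10
reference = list(range(1, n + 1))
swapped = reference.copy()
swapped[4], swapped[5] = swapped[5], swapped[4]  # swap two adjacent models

tau = kendall_tau(reference, swapped)
tau_drop = 1 - tau                            # 2 / 45
spearman_drop = 6 * 2 / (n * (n ** 2 - 1))    # sum of d^2 is 2 for one adjacent swap

print(f"tau = {tau:.4f}, drop = {tau_drop:.4f}")   # tau = 0.9556, drop = 0.0444
print(f"Spearman drop = {spearman_drop:.4f}")      # Spearman drop = 0.0121
```

A single adjacent swap therefore costs about 0.044 in tau but only about 0.012 in Spearman correlation, which is worth keeping in mind when reading correlations reported over 10 models.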
The practical implications are significant for the LLM evaluation community.
The impact is somewhat bounded by the restriction to tasks with verifiable answers. The most compelling use case for pairwise comparisons, open-ended generation, remains unaddressed.
This paper is highly timely. Pairwise comparison-based evaluation (Chatbot Arena, AlpacaEval, Arena-Hard) has become the de facto standard for LLM ranking, yet skepticism about what these rankings measure has grown. The paper directly addresses this tension with controlled experiments. The weak-judge setting is particularly relevant as frontier models increasingly surpass available evaluators.
Strengths:
1. Clean experimental question with practical significance: The paper asks a simple but important question and provides a clear answer for the studied settings.
2. Multi-benchmark, multi-judge evaluation: Testing across 5 benchmarks and 5 judges provides convincing evidence of generality within the studied scope.
3. Mechanistic insight via echo: Moving beyond correlation to identify a causal mechanism is a genuine contribution that elevates the paper beyond a purely empirical study.
4. Honest treatment of limitations: The paper clearly states scope restrictions and discusses cases where findings don't hold (gemma judge, SimpleQA).
5. Reproducibility: Code is publicly available, and the experimental pipeline is clearly described.
Weaknesses:
1. Generalization gap: The paper's title ("Correct Looks Better") and framing suggest broad applicability, but results are limited to tasks with verifiable answers. The leap to open-ended settings—where pairwise comparisons are most needed—remains unjustified.
2. Limited model diversity in some benchmarks: With as few as 10 models ranked, correlation metrics can be sensitive to individual model placement.
3. Bias correction analysis is shallow: Controlling for 4 style features and self-preference via linear coefficients may not capture complex bias interactions. The finding that bias correction barely moves rankings could reflect inadequate bias modeling rather than absence of bias effects.
4. No adversarial robustness analysis: If pairwise rankings become a training signal, models could be optimized to exploit judge preferences (acknowledged but not explored).
5. The "almost paradoxical" claim is overstated: It's not paradoxical that a judge can rank models by relative quality without knowing absolute accuracy—this is well-understood in measurement theory (comparative judgment has been known to be easier than absolute judgment since Thurstone, 1927).
6. Missing comparison with other cheap evaluation methods: Self-consistency, confidence calibration, or ensemble-based approaches could serve as additional baselines.
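To make point 3 concrete, a style-controlled Bradley-Terry model is commonly written in the form below. This is an assumption about the general shape of such a correction, not the paper's exact specification: the symbols $\beta$, $s$, and $\gamma$ are introduced here for illustration.

```latex
\[
\Pr(i \succ j) \;=\; \sigma\bigl(\beta_i - \beta_j + \gamma^{\top}(s_i - s_j)\bigr),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
\]
```

Here $\beta_i$ is the strength of model $i$, $s_i$ is the vector of style features (and, where applicable, a self-preference indicator) for its response, and $\gamma$ holds the linear bias coefficients. Because the bias term is additive and linear, interactions between features, or between a feature and answer correctness, are not captured, which is the concern raised in point 3.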
This is a well-executed empirical study that provides useful evidence for the LLM evaluation community. The main finding—high alignment between pairwise rankings and accuracy rankings—is reassuring and practically relevant, especially in the weak-judge regime. The echo finding adds mechanistic depth. However, the impact is constrained by the restriction to verifiable tasks, and the paper would benefit from more explicit framing as a controlled validation study rather than a general endorsement of pairwise evaluation. The work is solid and timely, contributing meaningfully to our understanding of evaluation methodology, though it leaves the most important questions (open-ended tasks, adversarial robustness) for future work.
Generated Jun 9, 2026
While Paper 1 presents a highly practical and deployed framework for enterprise decision-making, Paper 2 addresses a fundamental and heavily debated methodological issue in generative AI: evaluation. By demonstrating that pairwise Elo comparisons strongly correlate with ground-truth accuracy and analyzing the impact of stylistic biases, Paper 2 provides critical insights that will influence how the broader AI community evaluates and ranks foundation models, leading to a wider scientific impact across the field.
Paper 2 is likely higher impact: it addresses a timely, widely used evaluation paradigm (pairwise preferences/Elo) with broad applicability across AI benchmarks and model development, and offers empirical evidence of strong alignment with ground-truth accuracy plus actionable findings about judge behavior (e.g., echo effects). Paper 1 is novel and intriguing but is narrower (mathematical conjecture generation focused on a specific conjecture and a neural analogue) with less immediate, generalizable real-world uptake and harder-to-validate long-term impact.
Paper 2 addresses a critical, widely debated issue in modern AI: whether pairwise evaluation methods (like Chatbot Arena's Elo) reflect actual accuracy or merely stylistic biases. By demonstrating a >0.9 correlation with ground-truth accuracy, its findings validate the primary evaluation paradigm used across the entire LLM community. Paper 1, while methodologically rigorous and pre-registered, focuses on a much narrower subfield (spatial memory and occlusion in embodied agents), limiting its breadth of impact compared to the fundamental evaluation insights provided by Paper 2.
Paper 1 addresses a fundamental question about the validity of pairwise comparison evaluation methods (like Elo) used across the entire generative AI field. Demonstrating >0.9 Spearman correlation with ground-truth accuracy rankings provides strong empirical grounding for a widely adopted evaluation paradigm, with broad implications for all model evaluation. Paper 2, while technically solid and practically useful, is more narrowly focused on a specific training framework for visual code generation. Paper 1's findings impact how the entire community thinks about evaluation methodology, giving it broader and more lasting scientific impact.
Paper 1 is likely to have higher scientific impact due to broader relevance and timeliness: it addresses a central, cross-domain problem in modern AI—how to reliably evaluate generative models via pairwise comparisons. Its findings (high correlation with accuracy, robustness to judge/style bias, and identification of a causal preference artifact) can influence benchmarking methodology across many tasks and communities. Paper 2 is innovative and high-value for mining/industrial scheduling, but its impact is more domain-specific and may be harder to generalize beyond constrained simulator-guided LLM control.
Paper 1 is more novel: it proposes a deterministically constrained, process-semantics workflow for LLM-based autoformalization grounded in a principled notion of proof rigor, and demonstrates an end-to-end Lean formalization of a recent research-level result—high potential to shift how formalization is done and to impact mathematics, PL, and AI verification. Paper 2 is timely and useful for evaluation practice, but its core contribution is an empirical validation of pairwise/Elo rankings and an identified bias factor; impact is likely narrower and more incremental.
Paper 1 introduces a novel paradigm for both evaluating and training LLMs using expert rubrics, demonstrating substantial performance improvements across multiple model scales. Its dual utility in evaluation and reinforcement learning offers broader practical applications and greater potential to advance alignment methodologies compared to Paper 2's empirical validation of existing pairwise comparison methods.
Paper 1 addresses a fundamental and highly debated issue in AI: the validity of pairwise comparisons and LLM-as-a-judge. By demonstrating a strong correlation between Elo rankings and ground-truth accuracy, it validates the field's dominant evaluation paradigm. Paper 2 presents a valuable but more niche benchmark focused on table representation. Consequently, Paper 1 has a broader scope, higher timeliness, and greater potential to influence how generative models are evaluated across the entire community.
Paper 2 addresses a widely applicable methodological concern in AI evaluation (pairwise comparisons for generative models), providing empirical evidence that Elo rankings correlate strongly with ground-truth accuracy. This has broad, immediate impact across the entire generative AI community, affecting how models are benchmarked and evaluated. Paper 1, while intellectually rigorous and insightful about RAG limitations in law, is more domain-specific and primarily theoretical/architectural in nature, limiting its breadth of impact. Paper 2's findings are more actionable and relevant to a larger research community.
Paper 2 addresses a fundamental issue in the evaluation of generative models, a critical challenge across the entire AI community. Validating that pairwise comparisons strongly correlate with ground-truth accuracy bolsters confidence in widely used systems like Chatbot Arena. While Paper 1 presents an impressive web-agent engineering effort, Paper 2's findings have much broader implications, affecting how researchers across multiple subfields evaluate and rank foundation models.