Validation

Does AI agree with human peer reviewers? Three methods across multiple datasets: pairwise comparison, single-item rating, and full tournament ranking.

Work in progress: results are preliminary and still being refined.

Tournament — ICLR Code Generation

Code generation papers from ICLR 2024-2025

Papers: 62 (62 full text)
Human Experts: 262 (62 papers reviewed)
AI Matches: 5398 (Abstract + Summary (Opus 4.5), non-tie pairs)
Extraction: 0 (16582 abstract-only)
Avg/Paper: 534.9 (468–573)
Tournament: Complete
Expert-Expert: 79.5% (5196/6532 non-tie pairs, CI: [78.6%, 80.5%])
Expert vs Majority: 89.7% (4774/5324 non-tie pairs, CI: [88.8%, 90.5%])
AI vs Expert: 76.1% (4108/5398 non-tie pairs, CI: [74.9%, 77.2%])
AI vs Majority: 78.9% (1271/1611 non-tie pairs, CI: [76.8%, 80.8%])
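
The rates and intervals above can be recomputed from the raw counts. The sketch below is a minimal illustration that assumes a 95% Wilson score interval. The page does not state which interval method it uses, although Wilson bounds computed this way round to the reported values for the rows listed above. The function names are illustrative.

```python
import math

def agreement_rate(agree: int, total: int) -> float:
    """Fraction of non-tie pairs on which two raters order the papers the same way."""
    if total <= 0:
        raise ValueError("total must be positive")
    return agree / total

def wilson_interval(agree: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method, z=1.96 for 95%)."""
    p = agreement_rate(agree, total)
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Counts copied from the AI vs Expert and AI vs Majority rows.
for label, agree, total in [("AI vs Expert", 4108, 5398), ("AI vs Majority", 1271, 1611)]:
    lo, hi = wilson_interval(agree, total)
    print(f"{label}: {agreement_rate(agree, total):.1%}  CI [{lo:.1%}, {hi:.1%}]")
```

Pairs may share papers and reviewers, so the binomial independence assumption behind this interval is only approximate; treat the bounds as a rough guide.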

Accuracy by Expert Score Gap

How accuracy changes when the quality difference between papers is obvious vs subtle
medium (1-2): 68.2% (1750/2565 pairs)
large (>2): 83.2% (2358/2833 pairs)
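
A breakdown like this can be produced by bucketing each non-tie pair by the absolute difference in expert scores. The sketch below is illustrative only: the bucket edges follow the labels shown (medium: 1 to 2, large: above 2), the treatment of boundary values is an assumption, the "small" bucket is hypothetical (the two displayed buckets already account for all 5398 AI matches, 2565 + 2833), and the toy pairs are invented.

```python
from collections import defaultdict

def gap_bucket(gap: float) -> str:
    """Map an absolute expert score gap to a bucket label."""
    if gap > 2:
        return "large (>2)"
    if gap >= 1:
        return "medium (1-2)"
    return "small (<1)"  # hypothetical bucket, not shown on the page

def accuracy_by_gap(pairs):
    """pairs: iterable of (expert_score_a, expert_score_b, ai_prefers_a).

    Returns {bucket: (correct, total, accuracy)} over non-tie pairs.
    """
    tally = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for score_a, score_b, ai_prefers_a in pairs:
        if score_a == score_b:
            continue  # ties are excluded, as in the reported rates
        bucket = gap_bucket(abs(score_a - score_b))
        tally[bucket][1] += 1
        if ai_prefers_a == (score_a > score_b):
            tally[bucket][0] += 1
    return {b: (c, t, c / t) for b, (c, t) in tally.items()}

# Invented toy data, not the page's matches.
toy_pairs = [
    (8.0, 5.0, True),   # large gap, AI agrees with experts
    (7.5, 4.0, False),  # large gap, AI disagrees
    (6.0, 7.5, False),  # medium gap, AI agrees
    (6.5, 5.0, True),   # medium gap, AI agrees
    (6.0, 6.0, True),   # tie, skipped
]
result = accuracy_by_gap(toy_pairs)
print(result)
```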
Agreement rates are computed on non-tie paper pairs only (pairs where reviewers gave different scores). Rates are based on different match sets per content mode and are not directly comparable across formats. For a fair comparison on the same pairs, see the Pairwise section.

Pairwise Ranking (Abstract + Summary (Opus 4.5))

Spearman ρ: +0.704
Kendall τ: +0.530
Pearson r: +0.759
62 papers · 5909 human pairs · 6749 AI matches
Paper | H Rank | AI | Δ
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions | 1 | 7 | +6
Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows | 2 | 1 | -1
Batched Low-Rank Adaptation of Foundation Models | 3 | 13 | +10
$\mathcal{B}$-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis | 4 | 27 | +23
L2MAC: Large Language Model Automatic Computer for Extensive Code Generation | 5 | 10 | +5
LILO: Learning Interpretable Libraries by Compressing and Documenting Code | 6 | 6 | 0
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation | 7 | 21 | +14
ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis | 8 | 18 | +10
Learning Performance-Improving Code Edits | 9 | 3 | -6
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | 10 | 4 | -6
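
For readers who want to see how rank-agreement coefficients of this kind are computed, here is a minimal pure-Python sketch. It assumes two rankings of the same items with no tied ranks; the page does not say how ties are handled or which quantities feed Pearson r, so this is an illustration and not the page's pipeline. The toy ranks are invented.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(rank_x, rank_y):
    """Spearman's rho: Pearson correlation applied to ranks (no ties assumed)."""
    return pearson_r(rank_x, rank_y)

def kendall_tau(rank_x, rank_y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs (no ties assumed)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_x)), 2):
        s = (rank_x[i] - rank_x[j]) * (rank_y[i] - rank_y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(rank_x) * (len(rank_x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Invented toy rankings of five items.
human_rank = [1, 2, 3, 4, 5]
ai_rank = [2, 1, 4, 3, 5]
rho = spearman_rho(human_rank, ai_rank)
tau = kendall_tau(human_rank, ai_rank)
print(f"rho={rho:.3f} tau={tau:.3f}")
```

With real data that contains ties, `scipy.stats.spearmanr` and `scipy.stats.kendalltau` handle tie corrections and are a safer choice than this sketch.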

IRT Score (Abstract + Summary (Opus 4.5))

Spearman ρ: +0.680
Kendall τ: +0.506
Pearson r: +0.755
29 → 57 distinct scores · Δρ = -0.006
Paper | IRT | AI | Δ
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions | 1 | 7 | +6
Batched Low-Rank Adaptation of Foundation Models | 2 | 13 | +11
Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows | 3 | 1 | -2
$\mathcal{B}$-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis | 4 | 27 | +23
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | 5 | 4 | -1
Learning Performance-Improving Code Edits | 6 | 3 | -3
L2MAC: Large Language Model Automatic Computer for Extensive Code Generation | 7 | 10 | +3
LILO: Learning Interpretable Libraries by Compressing and Documenting Code | 8 | 6 | -2
ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis | 9 | 18 | +9
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation | 10 | 21 | +11

AI Pairwise Ranking vs Single-Item Score (Opus 4.6 Thinking)

Spearman ρ: +0.902
Kendall τ: +0.777
Pearson r: +0.917
62 papers with both pairwise matches and single-item scores
Correlation between the ranking from pairwise tournament matches (round-robin judges reading Opus 4.6 Thinking summaries) and the ranking from Opus 4.6 Thinking single-item scores (direct paper scoring without comparison).
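
One simple way to turn round-robin match results into a ranking is to sort papers by win rate. The sketch below assumes that aggregation; the page does not say how tournament matches are combined into a ranking (a Bradley-Terry or Elo-style fit would be equally plausible), and the match data and names here are invented.

```python
from collections import Counter

def rank_by_win_rate(matches):
    """Rank papers by fraction of matches won.

    matches: iterable of (paper_a, paper_b, winner), where winner is
    paper_a, paper_b, or None for a tie (counted as half a win each).
    Returns {paper: rank}, with rank 1 for the highest win rate.
    """
    wins, games = Counter(), Counter()
    for a, b, winner in matches:
        games[a] += 1
        games[b] += 1
        if winner is None:
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            wins[winner] += 1
    rates = {p: wins[p] / games[p] for p in games}
    # Break equal win rates alphabetically so the output is deterministic.
    ordered = sorted(rates, key=lambda p: (-rates[p], p))
    return {p: i + 1 for i, p in enumerate(ordered)}

# Invented toy round-robin among three papers.
toy_matches = [("A", "B", "A"), ("A", "C", "A"), ("B", "C", "B")]
pairwise_rank = rank_by_win_rate(toy_matches)
print(pairwise_rank)
```

The resulting ranks can then be compared with the ranks implied by single-item scores using any rank correlation.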

Acceptance Tier Ranking vs AI — Abstract + Summary (Opus 4.5)

Spearman ρ: +0.639
Kendall τ: +0.463
53 papers with tiers · Pairwise accuracy (non-tie pairs): 79.5% · oral: 5 · spotlight: 2 · poster: 14 · reject: 32
Paper | Tier | Score | Tier # | AI # | Δ
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions | oral | 9 | 1 | 7 | +6
Batched Low-Rank Adaptation of Foundation Models | oral | 8 | 2 | 13 | +11
Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows | oral | 8 | 3 | 1 | -2
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | oral | 7.25 | 4 | 4 | 0
ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis | oral | 7 | 5 | 18 | +13
$\mathcal{B}$-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis | spotlight | 7.5 | 6 | 27 | +21
Learning Performance-Improving Code Edits | spotlight | 7.25 | 7 | 3 | -4
L2MAC: Large Language Model Automatic Computer for Extensive Code Generation | poster | 7.2 | 8 | 10 | +2
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation | poster | 7 | 9 | 21 | +12
LILO: Learning Interpretable Libraries by Compressing and Documenting Code | poster | 7 | 10 | 6 | -4
CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules | poster | 6.5 | 11 | 17 | +6
Combining Induction and Transduction for Abstract Reasoning | poster | 6.25 | 12 | 5 | -7
Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain | poster | 6.25 | 13 | 19 | +6
SFS: Smarter Code Space Search improves LLM Inference Scaling | poster | 6.2 | 14 | 22 | +8
Generating CAD Code with Vision-Language Models for 3D Designs | poster | 6 | 15 | 48 | +33
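
Judging from the rows shown, the Tier # column orders papers by acceptance tier first and by score within a tier, and Δ is AI # minus Tier #. That ordering is inferred from the table, not stated on the page. The sketch below reproduces it for the seven oral and spotlight rows, with abbreviated titles. How equal scores within a tier are broken is unknown, so the sketch simply relies on input order through Python's stable sort.

```python
# Tier precedence inferred from the table: oral, then spotlight, poster, reject.
TIER_ORDER = {"oral": 0, "spotlight": 1, "poster": 2, "reject": 3}

# (short title, tier, score, AI rank), copied from the table above.
rows = [
    ("B-Coder", "spotlight", 7.5, 27),
    ("Learning Performance-Improving Code Edits", "spotlight", 7.25, 3),
    ("BigCodeBench", "oral", 9, 7),
    ("Batched Low-Rank Adaptation", "oral", 8, 13),
    ("Spider 2.0", "oral", 8, 1),
    ("A Real-World WebAgent", "oral", 7.25, 4),
    ("ExeDec", "oral", 7, 18),
]

def tier_ranking(rows):
    """Return (title, tier_rank, ai_rank, delta) sorted by tier, then score descending."""
    ordered = sorted(rows, key=lambda r: (TIER_ORDER[r[1]], -r[2]))
    return [
        (title, tier_rank, ai_rank, ai_rank - tier_rank)
        for tier_rank, (title, _tier, _score, ai_rank) in enumerate(ordered, start=1)
    ]

ranking = tier_ranking(rows)
for title, tier_rank, ai_rank, delta in ranking:
    print(f"{title}: Tier # {tier_rank}, AI # {ai_rank}, delta {delta:+d}")
```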