Validation

Does AI agree with human peer reviewers? Three methods are tested across multiple datasets: pairwise comparison, single-item rating, and full tournament ranking.

Work in progress: results are preliminary and actively being refined.

Tournament — ICLR Code Generation

Code generation papers from ICLR 2024-2025

Papers

62 full text

Human Experts

262

62 papers reviewed

AI Matches

5398

Abstract + Summary (Opus 4.5) non-tie pairs

Extraction

16582 abstract-only

Avg/Paper

534.9

468–573

Tournament

Complete

Expert-Expert

79.5%

5196/6532 non-tie pairs

CI: [78.6%, 80.5%]

Expert vs Majority

89.7%

4774/5324 non-tie pairs

CI: [88.8%, 90.5%]

AI vs Expert

76.1%

4108/5398 non-tie pairs

CI: [74.9%, 77.2%]

AI vs Majority

78.9%

1271/1611 non-tie pairs

CI: [76.8%, 80.8%]
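The page does not say how the confidence intervals are computed. The reported bounds appear consistent with a 95% Wilson score interval on the agreement counts, so the sketch below assumes that method. The function name `wilson_interval` and the `cards` dictionary are illustrative; only the counts come from the cards above.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


# (agreements, non-tie pairs) copied from the cards above.
cards = {
    "Expert-Expert": (5196, 6532),
    "Expert vs Majority": (4774, 5324),
    "AI vs Expert": (4108, 5398),
    "AI vs Majority": (1271, 1611),
}

for name, (agree, total) in cards.items():
    low, high = wilson_interval(agree, total)
    print(f"{name}: {agree / total:.1%}  CI: [{low:.1%}, {high:.1%}]")
```

Treating pairs as independent trials is itself an assumption: the same paper appears in many pairs, so the true uncertainty may be wider than these intervals suggest.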

Accuracy by Expert Score Gap

How accuracy changes when the quality difference between papers is obvious vs subtle

medium (1-2)

68.2%

1750/2565 pairs

large (>2)

83.2%

2358/2833 pairs

Agreement rates are computed on non-tie paper pairs only (pairs where reviewers gave different scores). Rates are based on different match sets per content mode and are not directly comparable across formats. For a fair comparison on the same pairs, see the Pairwise section.
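As a rough illustration of the non-tie rule and the score-gap buckets, the sketch below drops pairs where the expert gave both papers the same score, then tallies AI agreement per bucket. The input format, the function names, and the exact bucket thresholds (gap of at most 2 is "medium", above 2 is "large") are assumptions inferred from the labels above, and the sample pairs are invented.

```python
from collections import defaultdict


def gap_bucket(gap: float) -> str:
    """Label an absolute expert score gap (thresholds assumed from the page labels).

    Any gap below 1 would also fall into the medium bucket here.
    """
    return "large (>2)" if gap > 2 else "medium (1-2)"


def agreement_by_gap(pairs):
    """pairs: iterable of (expert_score_a, expert_score_b, ai_prefers_a).

    Returns {bucket: (agreements, total, rate)} over non-tie pairs only.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [agreements, total]
    for score_a, score_b, ai_prefers_a in pairs:
        if score_a == score_b:
            continue  # tie: the expert expresses no preference, so skip the pair
        expert_prefers_a = score_a > score_b
        bucket = gap_bucket(abs(score_a - score_b))
        counts[bucket][1] += 1
        if ai_prefers_a == expert_prefers_a:
            counts[bucket][0] += 1
    return {b: (agree, total, agree / total) for b, (agree, total) in counts.items()}


# Hypothetical pairs, not taken from the dataset.
sample_pairs = [
    (8, 6, True),   # medium gap, AI agrees
    (6, 8, True),   # medium gap, AI disagrees
    (5, 5, True),   # tie, ignored
    (8, 3, True),   # large gap, AI agrees
    (3, 8, False),  # large gap, AI agrees
    (6, 5, False),  # medium gap, AI disagrees
]
result = agreement_by_gap(sample_pairs)
for bucket, (agree, total, rate) in result.items():
    print(f"{bucket}: {rate:.1%} ({agree}/{total} pairs)")
```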

Pairwise Ranking (Abstract + Summary (Opus 4.5))

Spearman ρ

+0.704

Kendall τ

+0.530

Pearson r

+0.759

62 papers · 5909 human pairs · 6749 AI matches

Paper	Human Rank	AI Rank	Δ (AI - Human)
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions	1	7	+6
Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows	2	1	-1
Batched Low-Rank Adaptation of Foundation Models	3	13	+10
$\mathcal{B}$-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis	4	27	+23
L2MAC: Large Language Model Automatic Computer for Extensive Code Generation	5	10	+5
LILO: Learning Interpretable Libraries by Compressing and Documenting Code	6	6	0
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation	7	21	+14
ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis	8	18	+10
Learning Performance-Improving Code Edits	9	3	-6
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis	10	4	-6
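For readers who want to reproduce this kind of comparison, here is a minimal sketch of the Δ column and the three correlation statistics in plain Python. The rankings are invented, the function names are illustrative, and the formulas assume tie-free rankings (Spearman via squared rank differences, Kendall as tau-a). The page does not state which variants or which underlying scores it uses, so its numbers may be computed differently.

```python
import math
from itertools import combinations


def rank_deltas(human_rank, ai_rank):
    """Per-paper delta: AI rank minus human rank (positive = AI ranks it lower)."""
    return [a - h for h, a in zip(human_rank, ai_rank)]


def spearman_rho(x, y):
    """Spearman's rho for two tie-free rankings of the same items."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n**2 - 1))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) // 2)


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical rankings of five papers (not taken from the table above).
human_rank = [1, 2, 3, 4, 5]
ai_rank = [2, 1, 3, 5, 4]

print("deltas:", rank_deltas(human_rank, ai_rank))
print("Spearman rho:", spearman_rho(human_rank, ai_rank))
print("Kendall tau:", kendall_tau(human_rank, ai_rank))
print("Pearson r on ranks:", pearson_r(human_rank, ai_rank))
```

On tie-free ranks, Pearson and Spearman coincide, so a Pearson value that differs from Spearman (as in the cards above) suggests it was computed on underlying scores rather than ranks.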

IRT Score (Abstract + Summary (Opus 4.5))

Spearman ρ

+0.680

Kendall τ

+0.506

Pearson r

+0.755

29 → 57 distinct scores · Δρ = -0.006

Paper	IRT Rank	AI Rank	Δ (AI - IRT)
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions	1	7	+6
Batched Low-Rank Adaptation of Foundation Models	2	13	+11
Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows	3	1	-2
$\mathcal{B}$-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis	4	27	+23
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis	5	4	-1
Learning Performance-Improving Code Edits	6	3	-3
L2MAC: Large Language Model Automatic Computer for Extensive Code Generation	7	10	+3
LILO: Learning Interpretable Libraries by Compressing and Documenting Code	8	6	-2
ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis	9	18	+9
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation	10	21	+11

AI Pairwise Ranking vs Single-Item Score (Opus 4.6 Thinking)

Spearman ρ

+0.902

Kendall τ

+0.777

Pearson r

+0.917

62 papers with both pairwise matches and single-item scores

Correlation between the ranking from pairwise tournament matches (round-robin judges reading Opus 4.6 Thinking summaries) and the ranking from Opus 4.6 Thinking single-item scores (direct paper scoring without comparison).
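The page does not say how round-robin match results are aggregated into a ranking. The sketch below assumes the simplest option, ranking papers by win rate, and compares that order with a ranking built from direct single-item scores. All paper labels, match results, and scores are invented, and the function names are illustrative; a real pipeline might use Bradley-Terry or Elo instead.

```python
from collections import defaultdict


def rank_by_win_rate(matches):
    """matches: iterable of (winner, loser). Returns {paper: rank}, 1 = best.

    Papers with equal win rates are ordered by name for determinism.
    """
    wins = defaultdict(int)
    played = defaultdict(int)
    for winner, loser in matches:
        wins[winner] += 1
        played[winner] += 1
        played[loser] += 1
    rate = {paper: wins[paper] / played[paper] for paper in played}
    ordered = sorted(rate, key=lambda paper: (-rate[paper], paper))
    return {paper: position + 1 for position, paper in enumerate(ordered)}


def rank_by_score(scores):
    """scores: {paper: single-item score}. Returns {paper: rank}, 1 = best."""
    ordered = sorted(scores, key=lambda paper: (-scores[paper], paper))
    return {paper: position + 1 for position, paper in enumerate(ordered)}


# Hypothetical round-robin over four papers, plus hypothetical direct scores.
matches = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("D", "C")]
single_scores = {"A": 8, "B": 6, "C": 5, "D": 4}

pairwise_rank = rank_by_win_rate(matches)
single_rank = rank_by_score(single_scores)
moved = sorted(p for p in pairwise_rank if pairwise_rank[p] != single_rank[p])

print("pairwise:", pairwise_rank)
print("single-item:", single_rank)
print("papers ranked differently:", moved)
```

Once both rankings exist for the same papers, any rank correlation (Spearman, Kendall) can be computed over them.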

Acceptance Tier Ranking vs AI — Abstract + Summary (Opus 4.5)

Spearman ρ

+0.639

Kendall τ

+0.463

53 papers with tiers · Pairwise accuracy (non-tie pairs): 79.5% · oral: 5 · spotlight: 2 · poster: 14 · reject: 32

Paper	Tier	Score	Tier Rank	AI Rank	Δ (AI - Tier)
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions	oral	9	1	7	+6
Batched Low-Rank Adaptation of Foundation Models	oral	8	2	13	+11
Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows	oral	8	3	1	-2
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis	oral	7.25	4	4	0
ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis	oral	7	5	18	+13
$\mathcal{B}$-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis	spotlight	7.5	6	27	+21
Learning Performance-Improving Code Edits	spotlight	7.25	7	3	-4
L2MAC: Large Language Model Automatic Computer for Extensive Code Generation	poster	7.2	8	10	+2
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation	poster	7	9	21	+12
LILO: Learning Interpretable Libraries by Compressing and Documenting Code	poster	7	10	6	-4
CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules	poster	6.5	11	17	+6
Combining Induction and Transduction for Abstract Reasoning	poster	6.25	12	5	-7
Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain	poster	6.25	13	19	+6
SFS: Smarter Code Space Search improves LLM Inference Scaling	poster	6.2	14	22	+8
Generating CAD Code with Vision-Language Models for 3D Designs	poster	6	15	48	+33
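The table above looks as if it is ordered by acceptance tier first (oral, spotlight, poster, reject) and by score within each tier: the spotlight paper scoring 7.5 sits below an oral paper scoring 7. The sketch below reproduces that ordering for six rows copied from the table, with titles abbreviated. The sort key is an inference from the table rather than a documented rule, and the page does not show how equal scores within a tier are broken, so the sample avoids such ties.

```python
# Assumed tier precedence, best first.
TIER_ORDER = {"oral": 0, "spotlight": 1, "poster": 2, "reject": 3}


def tier_ranking(papers):
    """papers: list of (title, tier, score). Returns titles ordered best-first.

    Sorts by tier precedence, then by descending score within a tier.
    """
    ordered = sorted(papers, key=lambda row: (TIER_ORDER[row[1]], -row[2]))
    return [title for title, _, _ in ordered]


# Rows from the table above (titles abbreviated), deliberately out of order.
rows = [
    ("L2MAC", "poster", 7.2),
    ("B-Coder", "spotlight", 7.5),
    ("ExeDec", "oral", 7.0),
    ("BigCodeBench", "oral", 9.0),
    ("Learning Performance-Improving Code Edits", "spotlight", 7.25),
    ("A Real-World WebAgent", "oral", 7.25),
]

ranking = tier_ranking(rows)
for position, title in enumerate(ranking, start=1):
    print(position, title)
```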