Does AI agree with human peer reviewers? Three methods (pairwise comparison, single-item rating, and full tournament ranking), multiple datasets.
Code generation papers from ICLR 2024-2025

| Paper | Human rank | AI rank | Δ (AI − human) |
|---|---|---|---|
| BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions | 1 | 7 | +6 |
| Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows | 2 | 1 | -1 |
| Batched Low-Rank Adaptation of Foundation Models | 3 | 13 | +10 |
| $\mathcal{B}$-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis | 4 | 27 | +23 |
| L2MAC: Large Language Model Automatic Computer for Extensive Code Generation | 5 | 10 | +5 |
| LILO: Learning Interpretable Libraries by Compressing and Documenting Code | 6 | 6 | 0 |
| ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation | 7 | 21 | +14 |
| ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis | 8 | 18 | +10 |
| Learning Performance-Improving Code Edits | 9 | 3 | -6 |
| A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | 10 | 4 | -6 |

| Paper | IRT rank | AI rank | Δ (AI − IRT) |
|---|---|---|---|
| BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions | 1 | 7 | +6 |
| Batched Low-Rank Adaptation of Foundation Models | 2 | 13 | +11 |
| Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows | 3 | 1 | -2 |
| $\mathcal{B}$-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis | 4 | 27 | +23 |
| A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | 5 | 4 | -1 |
| Learning Performance-Improving Code Edits | 6 | 3 | -3 |
| L2MAC: Large Language Model Automatic Computer for Extensive Code Generation | 7 | 10 | +3 |
| LILO: Learning Interpretable Libraries by Compressing and Documenting Code | 8 | 6 | -2 |
| ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis | 9 | 18 | +9 |
| ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation | 10 | 21 | +11 |

| Paper | Tier | Score | Tier rank | AI rank | Δ (AI − tier) |
|---|---|---|---|---|---|
| BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions | oral | 9 | 1 | 7 | +6 |
| Batched Low-Rank Adaptation of Foundation Models | oral | 8 | 2 | 13 | +11 |
| Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows | oral | 8 | 3 | 1 | -2 |
| A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | oral | 7.25 | 4 | 4 | 0 |
| ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis | oral | 7 | 5 | 18 | +13 |
| $\mathcal{B}$-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis | spotlight | 7.5 | 6 | 27 | +21 |
| Learning Performance-Improving Code Edits | spotlight | 7.25 | 7 | 3 | -4 |
| L2MAC: Large Language Model Automatic Computer for Extensive Code Generation | poster | 7.2 | 8 | 10 | +2 |
| ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation | poster | 7 | 9 | 21 | +12 |
| LILO: Learning Interpretable Libraries by Compressing and Documenting Code | poster | 7 | 10 | 6 | -4 |
| CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules | poster | 6.5 | 11 | 17 | +6 |
| Combining Induction and Transduction for Abstract Reasoning | poster | 6.25 | 12 | 5 | -7 |
| Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain | poster | 6.25 | 13 | 19 | +6 |
| SFS: Smarter Code Space Search improves LLM Inference Scaling | poster | 6.2 | 14 | 22 | +8 |
| Generating CAD Code with Vision-Language Models for 3D Designs | poster | 6 | 15 | 48 | +33 |
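
The Δ column in each table is the AI rank minus the reference rank, so a positive value means the AI placed the paper lower than the reference did. As a rough sketch of how these columns can be summarized, the snippet below recomputes Δ for the 15 rows of the tier table and adds two summary numbers: mean absolute Δ and a Spearman rank correlation. The Spearman step is an assumption added for illustration, not a metric the source reports. Because the AI ranks come from a larger pool (they run up to 48), they are re-ranked within these 15 rows before correlating. The function names are hypothetical.

```python
# (tier rank, AI rank) pairs copied from the tier table, in row order.
ROWS = [
    (1, 7), (2, 13), (3, 1), (4, 4), (5, 18),
    (6, 27), (7, 3), (8, 10), (9, 21), (10, 6),
    (11, 17), (12, 5), (13, 19), (14, 22), (15, 48),
]


def deltas(rows):
    """Delta as used in the tables: AI rank minus reference rank."""
    return [ai - ref for ref, ai in rows]


def mean_abs_delta(rows):
    """Average number of rank positions by which the two orderings differ."""
    ds = deltas(rows)
    return sum(abs(d) for d in ds) / len(ds)


def rerank(values):
    """Map values to ranks 1..n within this subset (assumes no ties)."""
    position = {v: i + 1 for i, v in enumerate(sorted(values))}
    return [position[v] for v in values]


def spearman(rows):
    """Spearman correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    ref = rerank([r for r, _ in rows])
    ai = rerank([a for _, a in rows])
    n = len(rows)
    d2 = sum((x - y) ** 2 for x, y in zip(ref, ai))
    return 1 - 6 * d2 / (n * (n * n - 1))


if __name__ == "__main__":
    print("deltas:", deltas(ROWS))
    print("mean |delta|:", mean_abs_delta(ROWS))
    print("spearman (within subset):", round(spearman(ROWS), 3))
```

The same functions apply to the human-rank and IRT-rank tables by swapping in their rank columns. The no-ties formula is only valid when both rank lists are free of duplicates, which holds for the rows shown here.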