How Kurate.org ranks arXiv preprints across 46 categories using AI-powered pairwise comparison.
The system fetches the latest preprints from the arXiv API and downloads the full PDF for each. The 46 categories covered are Artificial Intelligence, Machine Learning, Quantum Physics, Applied Physics, Cosmology, Probability, Cryptography and Security, Distributed Computing, Game Theory, Information Theory, Robotics, Quantum Gases, Materials Science, Mathematical Physics, Computational Physics, General Physics, Optics, High Energy Astrophysics, Instrumentation and Methods, Inorganic Chemistry, Information Retrieval, Programming Languages, Social and Information Networks, Disordered Systems and Neural Networks, Mesoscale and Nanoscale Physics, Strongly Correlated Electrons, Superconductivity, Cryptographic Protocols, Public-key Cryptography, Economics, Optimization and Control, Chaotic Dynamics, General Relativity, High Energy Physics - Lattice, High Energy Physics - Phenomenology, High Energy Physics - Theory, Pattern Formation and Solitons, Chemical Physics, Plasma Physics, Space Physics, Biomolecules, Genomics, Neurons and Cognition, Populations and Evolution, Quantitative Methods, and Machine Learning (Statistics). When a new category is added, the system fetches all papers published in the last 30 days; after that, it fetches incrementally from the most recent stored paper onward.
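The backfill-then-incremental rule can be sketched as follows. This is a minimal illustration, not the site's actual fetcher: the function names, the page size, and the idea of tracking a `last_seen` timestamp per category are assumptions. The query parameters (`search_query`, `sortBy`, `sortOrder`, `start`, `max_results`) are those of the public arXiv API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
BACKFILL_DAYS = 30  # window used when a category is first added


def fetch_cutoff(last_seen: Optional[datetime], now: datetime) -> datetime:
    """Return the earliest publication time that still needs fetching.

    last_seen is the publication time of the newest paper already stored
    for the category, or None if the category was just added (hypothetical
    bookkeeping, assumed for this sketch).
    """
    if last_seen is None:
        # New category: backfill the last 30 days.
        return now - timedelta(days=BACKFILL_DAYS)
    # Known category: resume from the newest stored paper.
    return last_seen


def build_query_url(category: str, start: int = 0, page_size: int = 100) -> str:
    """Build an arXiv API URL that lists one category, newest first."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": start,
        "max_results": page_size,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(fetch_cutoff(None, now))
    print(build_query_url("cs.AI"))
```

A caller would page through the newest-first results and stop at the first entry older than the cutoff.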
Each paper receives a Claude Opus 4.6 Impact Assessment (with extended thinking mode) generated from the full PDF text, analyzing novelty, methodology, potential impact, and limitations. The assessment serves as input to the pairwise tournament.
Separately, each model assigns a direct 1–10 Single-Item (SI) rating across five dimensions — significance, rigor, novelty, clarity, and overall score. These ratings provide a complementary signal to the pairwise tournament and power the Score–Pairwise Coherence analysis.
Papers are compared head-to-head within the same category, using each paper's abstract and AI Impact Assessment as input. Each comparison is judged by one of three models, chosen by round-robin rotation.
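A minimal sketch of round-robin judge assignment follows. The judge identifiers are placeholders, not the actual model names, and rotating over a single global queue of comparisons is an assumption. A per-paper rotation would be an equally plausible reading.

```python
from itertools import cycle
from typing import Iterable, Iterator, List, Tuple

# Placeholder identifiers; the real judge models are not assumed here.
JUDGES: List[str] = ["judge_a", "judge_b", "judge_c"]

Pair = Tuple[str, str]


def assign_judges(pairs: Iterable[Pair],
                  judges: List[str] = JUDGES) -> Iterator[Tuple[Pair, str]]:
    """Yield (pair, judge), cycling through the judges in a fixed order."""
    rotation = cycle(judges)
    for pair in pairs:
        yield pair, next(rotation)


if __name__ == "__main__":
    queue = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p1", "p4")]
    for pair, judge in assign_judges(queue):
        print(pair, judge)
```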
Each model evaluates which paper has higher potential scientific impact across five dimensions: novelty, real-world applications, methodological rigor, breadth of impact, and timeliness.
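The presentation order of each pair is randomly flipped with 50% probability before it is sent to the LLM. Models are known to favor the paper presented first; randomizing the order means this position bias falls on either paper with equal probability, so it averages out across matches instead of systematically favoring one side.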
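The order randomization can be sketched like this. The `judge` callable and its return convention (0 for the first-presented paper, 1 for the second) are assumptions made for illustration. The key step is mapping the model's positional answer back to a paper identifier after the flip.

```python
import random
from typing import Callable, Optional


def judge_with_random_order(paper_a: str,
                            paper_b: str,
                            judge: Callable[[str, str], int],
                            rng: Optional[random.Random] = None) -> str:
    """Present the pair in random order and return the winning paper's id.

    judge(first, second) is a hypothetical wrapper around the LLM call that
    returns 0 if it prefers the first-presented paper and 1 otherwise.
    """
    if rng is None:
        rng = random.Random()
    flipped = rng.random() < 0.5  # flip with 50% probability
    first, second = (paper_b, paper_a) if flipped else (paper_a, paper_b)
    choice = judge(first, second)
    # Translate the positional answer back to a paper id.
    return first if choice == 0 else second


if __name__ == "__main__":
    always_first = lambda first, second: 0  # a maximally position-biased judge
    rng = random.Random(0)
    wins = sum(judge_with_random_order("A", "B", always_first, rng) == "A"
               for _ in range(1000))
    print(f"A wins {wins} of 1000 against a judge that always picks first")
```

Even a judge that always picks the first-presented paper ends up splitting its votes roughly evenly between the two papers, so the bias adds noise but no systematic advantage.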
Opponent selection uses TrueSkill match quality, a function of both the skill difference between two papers and their rating uncertainty, to favor the matches expected to be most informative. New papers with high uncertainty are matched against a wider range of opponents; established papers face similarly rated peers for fine-grained ranking.
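For two players, TrueSkill match quality has a closed form, sketched below in plain Python. The value of `BETA` is TrueSkill's conventional default and is an assumption here, as is the `pick_opponent` helper, which simply takes the highest-quality candidate. The site's actual selection rule may add sampling or other constraints.

```python
import math
from typing import Dict, Tuple

BETA = 25.0 / 6.0  # conventional TrueSkill default (assumed, not from the source)

Rating = Tuple[float, float]  # (mu, sigma)


def match_quality(mu_a: float, sigma_a: float,
                  mu_b: float, sigma_b: float,
                  beta: float = BETA) -> float:
    """Two-player TrueSkill match quality, a value in (0, 1]."""
    denom = 2 * beta ** 2 + sigma_a ** 2 + sigma_b ** 2
    scale = math.sqrt(2 * beta ** 2 / denom)
    return scale * math.exp(-((mu_a - mu_b) ** 2) / (2 * denom))


def pick_opponent(paper_id: str, ratings: Dict[str, Rating]) -> str:
    """Return the candidate with the highest match quality (hypothetical helper)."""
    mu, sigma = ratings[paper_id]
    candidates = (pid for pid in ratings if pid != paper_id)
    return max(candidates, key=lambda pid: match_quality(mu, sigma, *ratings[pid]))


if __name__ == "__main__":
    ratings = {"x": (25.0, 1.0), "near": (26.0, 1.0), "far": (35.0, 1.0)}
    print(pick_opponent("x", ratings))
```

Because the sigmas appear in the denominator of the exponent, a paper with large uncertainty sees a flatter quality curve across skill gaps, which is consistent with new papers being matched against a wider range of opponents.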
Convergence uses two tiers of TrueSkill sigma targets: general papers (±50 Elo points) and top-K papers (±40 Elo points). Papers with extreme win rates (100% or 0%) continue receiving matches until they face a decisive result or reach the match floor (50 comparisons). A calibration ratio ensures new papers are compared against established ones for transitive score calibration.
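One way to read the convergence rule is as a per-paper predicate, sketched below. Several points are assumptions: the uncertainty is taken as already converted to Elo points, "facing a decisive result" is interpreted as the win rate no longer being exactly 0% or 100%, and the `PaperStats` fields are hypothetical. The calibration ratio is not modeled.

```python
from dataclasses import dataclass

GENERAL_TARGET_ELO = 50.0  # target for general papers
TOP_K_TARGET_ELO = 40.0    # tighter target for top-K papers
MATCH_FLOOR = 50           # comparisons after which extreme win rates stop forcing matches


@dataclass
class PaperStats:
    uncertainty_elo: float  # rating uncertainty in Elo points (assumed pre-converted)
    wins: int
    matches: int
    is_top_k: bool


def needs_more_matches(p: PaperStats) -> bool:
    """Return True while the paper should keep receiving comparisons."""
    if p.matches == 0:
        return True
    # A 0% or 100% win rate means the paper has not yet met a decisive result.
    extreme = p.wins == 0 or p.wins == p.matches
    if extreme and p.matches < MATCH_FLOOR:
        return True
    target = TOP_K_TARGET_ELO if p.is_top_k else GENERAL_TARGET_ELO
    return p.uncertainty_elo > target


if __name__ == "__main__":
    print(needs_more_matches(PaperStats(30.0, 10, 10, False)))  # undefeated so far
    print(needs_more_matches(PaperStats(45.0, 6, 10, True)))    # top-K, above 40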
Global rankings use TrueSkill (Bayesian skill estimation) as the primary metric, updated incrementally after each match. Scores use a conservative estimate (μ − 3σ) mapped to an Elo-style scale centered at 1200. The 95% confidence interval is derived from TrueSkill sigma (±2σ in Elo points), so it reflects both match count and opponent quality, not just win rate.
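The incremental update can be illustrated with the standard two-player, no-draw TrueSkill update, sketched below using only the standard library. This is a simplified textbook form: it assumes the conventional default for `BETA`, omits the dynamics term that TrueSkill normally adds to sigma before each match, and leaves out the rescaling to the Elo-style scale, whose constants are not given.

```python
import math
from statistics import NormalDist
from typing import Tuple

BETA = 25.0 / 6.0  # conventional TrueSkill default (assumed)
_STD_NORMAL = NormalDist()

Rating = Tuple[float, float]  # (mu, sigma)


def update_after_match(winner: Rating, loser: Rating,
                       beta: float = BETA) -> Tuple[Rating, Rating]:
    """Return updated (winner, loser) ratings after one decisive match."""
    mu_w, sigma_w = winner
    mu_l, sigma_l = loser
    c2 = 2 * beta ** 2 + sigma_w ** 2 + sigma_l ** 2
    c = math.sqrt(c2)
    t = (mu_w - mu_l) / c
    # v and w are the mean and variance correction factors for a win.
    v = _STD_NORMAL.pdf(t) / _STD_NORMAL.cdf(t)
    w = v * (v + t)
    new_winner = (mu_w + sigma_w ** 2 / c * v,
                  sigma_w * math.sqrt(1 - sigma_w ** 2 / c2 * w))
    new_loser = (mu_l - sigma_l ** 2 / c * v,
                 sigma_l * math.sqrt(1 - sigma_l ** 2 / c2 * w))
    return new_winner, new_loser


def conservative_skill(rating: Rating) -> float:
    """Conservative estimate mu - 3*sigma, before any rescaling."""
    mu, sigma = rating
    return mu - 3 * sigma


if __name__ == "__main__":
    fresh = (25.0, 25.0 / 3.0)
    w, l = update_after_match(fresh, fresh)
    print(w, l, conservative_skill(w))
```

An upset (a low-rated paper beating a high-rated one) makes `t` negative and `v` large, so the means move further than after an expected result. Both sigmas shrink after every match, which is why the confidence interval narrows with match count.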
Round-robin rotation ensures each model contributes equally to every paper's ranking. The Model Analysis page shows inter-model agreement rates and per-model ranking correlations.
Papers with multiple arXiv tags can be viewed across primary categories using AND/OR tag filtering.
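AND/OR tag filtering reduces to set operations, as in this small sketch. The paper ids, the dictionary layout, and the `filter_by_tags` name are illustrative assumptions. AND keeps papers carrying every selected tag, and OR keeps papers carrying at least one.

```python
from typing import Dict, Iterable, List, Set


def filter_by_tags(papers: Dict[str, Set[str]],
                   tags: Iterable[str],
                   mode: str = "AND") -> List[str]:
    """Return ids of papers whose arXiv tags match the selection."""
    wanted = set(tags)
    if mode == "AND":
        matches = lambda paper_tags: wanted <= paper_tags       # all selected tags present
    elif mode == "OR":
        matches = lambda paper_tags: bool(wanted & paper_tags)  # any selected tag present
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [pid for pid, paper_tags in papers.items() if matches(paper_tags)]


if __name__ == "__main__":
    papers = {
        "p1": {"cs.AI", "cs.LG"},
        "p2": {"cs.LG"},
        "p3": {"quant-ph"},
    }
    print(filter_by_tags(papers, ["cs.AI", "cs.LG"], "AND"))
    print(filter_by_tags(papers, ["cs.AI", "cs.LG"], "OR"))
```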
Rankings display TrueSkill score, win rate, confidence interval, AI rating (averaged SI score), gap score, match count, and publication date. Default sort is by TrueSkill. All data is pre-computed for instant loading and updates automatically as new papers arrive and matches complete.
The system runs autonomously — fetching new papers on a configurable schedule, running pairwise comparisons until convergence targets are met, then idling. Administrators can trigger additional rounds or adjust parameters at any time.
Limitations
AI-based evaluation is an approximation of scientific impact, not a replacement for human peer review. Rankings reflect the consensus of three large language models. Papers with very few matches may have wide confidence intervals regardless of their win rate.