ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

Shunkai Zhang, Haoran Zhang, Yun Luo, Qianjia Cheng, Haodi Lei, Yizhuo Li, Runzhe Zhan, Zhilin Wang

Jun 9, 2026arXiv:2606.10479v1

cs.AI

#2707of 3489·Artificial Intelligence

#2707 of 3489 · Artificial Intelligence

Tournament Score

1321±44

10501800

47%

Win Rate

Wins

Losses

Matches

Rating

6.8/ 10

Significance7

Rigor6.5

Novelty7

Clarity7.5

Abstract

Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.

AI Impact Assessments

(1 models)

Scientific Impact Assessment: ComBench

1. Core Contribution

ComBench introduces a 100-problem Olympiad-level combinatorics benchmark that explicitly separates two capabilities: Rigorous Proof Reasoning (producing valid mathematical arguments) and Constructive Realization (producing concrete, machine-checkable discrete witnesses such as colorings, tilings, graphs, or permutations). The key innovation is the verifier-gated scoring protocol: construction-centric problems receive both a rubric-guided proof score (0/1/6/7) from an LLM judge and a binary pass/fail from a deterministic Python verifier. When constructions fail verification, high proof scores are demoted (7→6, 6→1), operationalizing the insight that a proof claiming existence without a valid witness may contain hidden gaps.

This addresses a genuine evaluation blind spot. Prior benchmarks either use answer-only matching (which reveals nothing about proof quality) or rubric-guided proof grading (which cannot verify whether claimed constructions are actually valid). The formal verification line (e.g., CombiBench in Lean 4) addresses verification but requires formalization, limiting scope. ComBench occupies a middle ground: natural-language proofs with executable witness checking.

2. Methodological Rigor

The annotation pipeline is carefully structured in three stages: specification/rubric construction, verifier generation with semantic audit, and assembly with executable reference checking. The quality control targets three risks (target fidelity, verifier adequacy, metadata consistency) and includes LLM-assisted semantic auditing plus human review.

Strengths in evaluation design:

The 0/1/6/7 rubric aligns with actual IMO grading conventions, capturing meaningful partial progress

The verifier-gated demotion rule is well-motivated by case studies (Appendix J) showing high proof scores coexisting with invalid constructions

A manual audit of 50 sampled proof-judge decisions shows 90% strict acceptance, 98% accepted-or-minor

Methodological concerns:

The benchmark contains only 100 problems, which limits statistical power for fine-grained comparisons, especially at the category level (e.g., Counting has only 7 problems)

The proof judge is an LLM (Gemini-3.1-Pro), introducing potential systematic biases. The 50-sample manual audit is relatively small

The overlap analysis with IMO-Bench (14%) is documented but overlapping problems are retained, which could inflate scores for models trained on these public problems

The verifier-gated demotion rule, while well-motivated, involves somewhat arbitrary choices (why 7→6 and 6→1 specifically?)

3. Potential Impact

Diagnostic value: The paper's most impactful finding is the empirical demonstration that proof reasoning and constructive realization are genuinely distinct capabilities. Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4—a result that would be invisible under either answer-only or proof-only evaluation. This finding should influence how the community evaluates and trains mathematical reasoning models.

Error taxonomy: The proof error analysis reveals that 41.2% of failures stem from "Missing Core Mechanism" and 20% from "Wrong Mathematical Target"—substantive mathematical failures rather than formatting issues. This provides actionable diagnostic information for model developers.

Benchmark design patterns: The verifier-gated scoring approach—combining natural-language proof evaluation with deterministic verification of specific output components—is a reusable design pattern applicable beyond combinatorics to any domain requiring both argumentation and concrete artifacts.

4. Timeliness & Relevance

The paper is highly timely. It directly addresses the observation that frontier models (including Gemini Deep Think and DeepSeek-R1) failed IMO 2025 P6, a combinatorics problem. As LLMs approach human-competitive performance on Olympiad mathematics, the need for more nuanced evaluation beyond pass/fail on final answers becomes critical. The paper fills the gap between answer-matching benchmarks (which are being saturated) and formal theorem-proving benchmarks (which have high formalization overhead).

The benchmark is also well-calibrated for current frontier models: the best model achieves 65.4% Avg. and 75.3% Best@4, leaving substantial headroom without being frustratingly difficult.

5. Strengths & Limitations

Key Strengths:

The conceptual separation of proof reasoning from constructive realization is the paper's strongest intellectual contribution, supported by compelling empirical evidence

The evaluation protocol is reproducible: deterministic verifiers provide ground-truth construction evaluation, and the rubric-guided judge uses standardized guidelines

Extensive appendices (case studies, error taxonomy, verifier examples, prompt templates) support transparency and reproducibility

The category-level analysis reveals that Existence/Construction problems are consistently hardest, providing targeted diagnostic information

Notable Limitations:

100 problems is small for a benchmark aspiring to fine-grained diagnosis; category-level conclusions rest on as few as 7 problems

The deterministic verifier can only check prescribed witness formats, potentially penalizing valid but differently-formatted constructions

The proof judge reliability check is limited (50 samples, 1 rejection), and systematic biases across problem types or models are not investigated

The paper does not explore whether construction ability can be improved through specific training interventions, limiting actionability

Some construction-centric records may test format compliance as much as mathematical insight, though the case studies suggest this is not the dominant failure mode

Other observations: The paper evaluates 10 models spanning a wide capability range, but the rapid pace of model development (all models from 2026) means these rankings will quickly become dated—the benchmark's lasting value lies in its evaluation methodology and diagnostic framework rather than specific model rankings. The dataset curation from 15 competition sources across 25 years provides good coverage, though the construction-centric subset is heavily concentrated in extremal problems (28/50).

Summary

ComBench makes a meaningful contribution by formalizing the distinction between proof reasoning and constructive realization in Olympiad combinatorics, backed by a carefully designed evaluation protocol and informative empirical results. The benchmark size is modest and the evaluation has inherent limitations (LLM judge, prescribed formats), but the conceptual framework and diagnostic insights are valuable for the mathematical reasoning community. The work is well-executed within its scope and addresses a timely gap.

Rating:6.8/ 10

Significance 7Rigor 6.5Novelty 7Clarity 7.5

Generated Jun 10, 2026

Comparison History (19)

Wonvs. Knowing When to Ask: Self-Gated Clarification for Hierarchical Language Agents

ComBench addresses a critical gap in evaluating frontier LLMs on Olympiad-level combinatorics, a domain central to mathematical AI research. Its benchmark design separating analysis and construction reasoning provides novel diagnostic insights about distinct model capabilities. The benchmark serves the rapidly growing field of mathematical reasoning in LLMs and will likely be widely adopted. Paper 2, while methodologically interesting in its self-gated clarification approach for hierarchical agents, targets a narrower application domain (tariff classification) and has more limited generalizability despite its sound experimental design.

claude-opus-4-6·Jun 11, 2026

Lostvs. Automated Mediator for Human Negotiation: Pre-Mediation via a Structured LLM Pipeline

Paper 2 offers broader real-world applications and interdisciplinary impact by applying structured LLMs to conflict resolution. Its rigorous human-subject experiments comparing AI against professional human mediators demonstrate immediate practical utility in economics, law, and psychology. In contrast, while Paper 1 provides a valuable benchmark for mathematical reasoning, it contributes primarily to a crowded subfield of AI evaluation and may have a shorter lifespan of relevance as models rapidly evolve.

gemini-3.1-pro-preview·Jun 11, 2026

Wonvs. Human-Enhanced Loop Modeling (HELM): Agent-Based Finite Element Modeling of Concrete Bridge Barriers

While Paper 1 presents a valuable human-in-the-loop framework for engineering simulations, Paper 2 addresses a fundamental and highly active area in artificial intelligence: advanced mathematical reasoning in LLMs. By introducing a rigorous benchmark for Olympiad-level combinatorics, Paper 2 has a significantly broader potential impact across the vast AI research community, as evaluating and improving frontier models' reasoning capabilities is currently a central bottleneck in global AI development.

gemini-3.1-pro-preview·Jun 11, 2026

Lostvs. Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

Paper 1 addresses the practically important and timely problem of LLM-based office automation with a large-scale, well-structured benchmark (200 tasks, 7,118 criteria) that directly evaluates real-world productivity software use. This has broader impact across industry and research, as office automation affects millions of users. Paper 2, while rigorous, targets a narrower niche (Olympiad combinatorics reasoning) with only 100 problems. Paper 1's findings on the significant gap between LLM capabilities and reliable document automation have more immediate implications for the rapidly growing LLM agent deployment ecosystem.

claude-opus-4-6·Jun 10, 2026

Wonvs. Instruction Finetuning DeepSeek-R1-8B Model Using LoRA and NEFTune

Paper 2 likely has higher scientific impact: it introduces a new, broadly useful benchmark (ComBench) targeting a timely, unsolved weakness in LLMs—rigorous proof reasoning and constructive combinatorics—with a careful evaluation protocol (rubric-guided grading + deterministic verification). Benchmarks often catalyze progress across many labs and subfields (LLM evaluation, reasoning, formal methods, education), and the results indicate it is not saturated. Paper 1 is a solid domain adaptation study but is more incremental (standard LoRA/NEFTune) with limited dataset size and narrower applicability (financial NER).

gpt-5.2·Jun 10, 2026

Lostvs. ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

ActiveMem proposes a novel architectural framework addressing a fundamental limitation in LLM agents (context overload) by decoupling reasoning and memory. This innovation has broad applicability across numerous long-horizon tasks and agentic workflows, offering a scalable solution to improve efficiency and capability. While ComBench provides a rigorous evaluation for a specific niche (combinatorics), ActiveMem's contribution directly enhances model capabilities, promising wider real-world applications and higher overall scientific impact across AI research.

gemini-3.1-pro-preview·Jun 10, 2026

Wonvs. Soul Computing: A Theoretical Framework and Technical Architecture for Intelligent Agents with Independent Consciousness

ComBench offers a concrete, reproducible benchmark with empirical results on frontier LLMs, addressing a well-defined gap in evaluating combinatorial reasoning. It provides actionable diagnostics and a clear evaluation protocol. Paper 2 proposes a speculative theoretical framework ('Soul Computing') with vague claims about AI consciousness, lacking empirical validation, rigorous methodology, or falsifiable hypotheses. Paper 1's methodological rigor, practical utility for the active LLM evaluation community, and timeliness give it substantially higher scientific impact potential.

claude-opus-4-6·Jun 10, 2026

Wonvs. A Reliable Fault Diagnosis Method Based on Belief Rule Base Consider Robustness Analysis

ComBench addresses a timely and high-impact topic—benchmarking LLM reasoning capabilities in mathematical problem-solving. It introduces a novel benchmark with careful evaluation methodology distinguishing proof reasoning from constructive realization, relevant to the rapidly growing AI/LLM research community. Its breadth of impact across AI, mathematics education, and reasoning research is substantial. Paper 1, while competent, addresses a more incremental improvement in fault diagnosis using belief rule bases, a narrower domain with less transformative potential and smaller research community engagement.

claude-opus-4-6·Jun 10, 2026

Lostvs. Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm

Paper 1 addresses a fundamental and practical challenge in cross-modal knowledge distillation without paired data, providing both theoretical foundations and a principled algorithmic framework with guarantees. It has broader real-world applicability across multimodal AI systems where paired data is scarce. Paper 2, while valuable as a benchmark for evaluating LLM combinatorial reasoning, has narrower impact primarily within the LLM evaluation community and will likely become outdated as models improve. Paper 1's theoretical contributions on distributional alignment are more enduring and broadly applicable across machine learning.

claude-opus-4-6·Jun 10, 2026

Lostvs. Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Paper 1 proposes a novel, domain-agnostic framework for self-bootstrapping LLM agents, offering broad applicability across various autonomous tasks. Its algorithmic innovations (WIA and AIW) directly address core challenges in agent generalization and learning efficiency. In contrast, Paper 2 introduces a highly rigorous but domain-specific benchmark for combinatorics. While valuable for evaluation, Paper 1's general methodological advancements in agent architecture give it a significantly higher potential for widespread adoption and real-world impact across diverse fields.

gemini-3.1-pro-preview·Jun 10, 2026

#2707of 3489·Artificial Intelligence

#2707 of 3489 · Artificial Intelligence

Tournament Score

1321±44

10501800

47%

Win Rate

Wins

Losses

Matches

Rating

6.8/ 10

Significance7

Rigor6.5

Novelty7

Clarity7.5