Shunkai Zhang, Haoran Zhang, Yun Luo, Qianjia Cheng, Haodi Lei, Yizhuo Li, Runzhe Zhan, Zhilin Wang
Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.
ComBench introduces a 100-problem Olympiad-level combinatorics benchmark that explicitly separates two capabilities: Rigorous Proof Reasoning (producing valid mathematical arguments) and Constructive Realization (producing concrete, machine-checkable discrete witnesses such as colorings, tilings, graphs, or permutations). The key innovation is the verifier-gated scoring protocol: construction-centric problems receive both a rubric-guided proof score (0/1/6/7) from an LLM judge and a binary pass/fail from a deterministic Python verifier. When constructions fail verification, high proof scores are demoted (7→6, 6→1), operationalizing the insight that a proof claiming existence without a valid witness may contain hidden gaps.
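The demotion rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name `gated_score` and the `DEMOTION` table are hypothetical, and the choice to leave scores of 0 and 1 unchanged on a failed verification is an assumption, since the text only states that high scores are demoted.

```python
from typing import Optional

# Rubric levels named in the text.
VALID_SCORES = {0, 1, 6, 7}

# Demotions named in the text: 7 becomes 6, and 6 becomes 1.
# Assumption: scores of 0 and 1 are left unchanged when verification fails.
DEMOTION = {7: 6, 6: 1}


def gated_score(proof_score: int, verifier_passed: Optional[bool]) -> int:
    """Combine an LLM-judge proof score with a deterministic verifier result.

    verifier_passed is None for analysis-centric problems, which have no
    construction to check (hypothetical convention for this sketch).
    """
    if proof_score not in VALID_SCORES:
        raise ValueError(f"unexpected rubric score: {proof_score}")
    if verifier_passed is None or verifier_passed:
        return proof_score
    # Construction failed verification: demote high proof scores.
    return DEMOTION.get(proof_score, proof_score)
```

Under these assumptions, a proof graded 7 whose construction fails verification ends at 6, while the same proof with a passing construction keeps its 7.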
This addresses a genuine evaluation blind spot. Prior benchmarks either use answer-only matching (which reveals nothing about proof quality) or rubric-guided proof grading (which cannot verify whether claimed constructions are actually valid). The formal verification line (e.g., CombiBench in Lean 4) addresses verification but requires formalization, limiting scope. ComBench occupies a middle ground: natural-language proofs with executable witness checking.
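To make "executable witness checking" concrete, here is a hedged sketch of what a deterministic verifier for one kind of witness, a proper vertex coloring, might look like. The problem, the function name `verify_coloring`, and the input format are all hypothetical; the benchmark's actual verifiers and prescribed output formats are not shown in this summary.

```python
from typing import List, Sequence, Tuple


def verify_coloring(
    n: int,
    edges: Sequence[Tuple[int, int]],
    coloring: List[int],
    max_colors: int,
) -> bool:
    """Return True iff `coloring` is a proper coloring of the graph.

    Vertices are 0..n-1 and colors are integers in 0..max_colors-1
    (an assumed format for this illustration).
    """
    # Reject witnesses with the wrong shape.
    if len(coloring) != n:
        return False
    # Reject colors outside the allowed palette.
    for c in coloring:
        if not isinstance(c, int) or c < 0 or c >= max_colors:
            return False
    # Every edge must join two differently colored vertices.
    return all(coloring[u] != coloring[v] for u, v in edges)


# Example: a 5-cycle needs three colors.
cycle5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
ok = verify_coloring(5, cycle5, [0, 1, 0, 1, 2], max_colors=3)
bad = verify_coloring(5, cycle5, [0, 1, 0, 1, 0], max_colors=2)
```

The check is pass/fail and involves no model judgment, which is what lets it expose cases where a convincing-looking proof is attached to an invalid construction.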
The annotation pipeline is carefully structured in three stages: specification/rubric construction, verifier generation with semantic audit, and assembly with executable reference checking. The quality control targets three risks (target fidelity, verifier adequacy, metadata consistency) and includes LLM-assisted semantic auditing plus human review.
Diagnostic value: The paper's most impactful finding is the empirical demonstration that proof reasoning and constructive realization are genuinely distinct capabilities. Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4—a result that would be invisible under either answer-only or proof-only evaluation. This finding should influence how the community evaluates and trains mathematical reasoning models.
Error taxonomy: The proof error analysis reveals that 41.2% of failures stem from "Missing Core Mechanism" and 20% from "Wrong Mathematical Target"—substantive mathematical failures rather than formatting issues. This provides actionable diagnostic information for model developers.
Benchmark design patterns: The verifier-gated scoring approach—combining natural-language proof evaluation with deterministic verification of specific output components—is a reusable design pattern applicable beyond combinatorics to any domain requiring both argumentation and concrete artifacts.
The paper is highly timely. It directly addresses the observation that frontier models (including Gemini Deep Think and DeepSeek-R1) failed IMO 2025 P6, a combinatorics problem. As LLMs approach human-competitive performance on Olympiad mathematics, the need for more nuanced evaluation beyond pass/fail on final answers becomes critical. The paper fills the gap between answer-matching benchmarks (which are being saturated) and formal theorem-proving benchmarks (which have high formalization overhead).
The benchmark is also well-calibrated for current frontier models: the best model achieves 65.4% Avg. and 75.3% Best@4, leaving substantial headroom without being frustratingly difficult.
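The two headline metrics can be illustrated with a short sketch. The exact definitions are not given in this summary, so the following rests on assumptions: each problem is attempted four times, each attempt receives a final score out of 7, Avg. is the mean normalized score over all attempts, and Best@4 is the mean of the per-problem maximum. The function name `avg_and_best_at_k` is hypothetical.

```python
from typing import List, Tuple


def avg_and_best_at_k(
    scores: List[List[int]], max_score: int = 7
) -> Tuple[float, float]:
    """Return (Avg., Best@k) as percentages.

    scores[i] holds the final scores of the k attempts on problem i.
    Assumed definitions; the paper's aggregation may differ.
    """
    if not scores or any(len(runs) == 0 for runs in scores):
        raise ValueError("every problem needs at least one attempt")
    n = len(scores)
    # Avg.: mean over attempts per problem, then mean over problems.
    mean_total = sum(sum(runs) / len(runs) for runs in scores)
    # Best@k: best attempt per problem, then mean over problems.
    best_total = sum(max(runs) for runs in scores)
    return (
        100.0 * mean_total / (n * max_score),
        100.0 * best_total / (n * max_score),
    )


# Two problems, four attempts each.
avg, best = avg_and_best_at_k([[7, 7, 7, 7], [0, 0, 7, 0]])
```

Under these definitions Best@4 is never below Avg., and the size of the gap between them reflects how inconsistent a model is across attempts.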
Other observations: The paper evaluates 10 models spanning a wide capability range, but all of the models are from 2026 and model development moves quickly, so these rankings will soon become dated. The benchmark's lasting value lies in its evaluation methodology and diagnostic framework rather than in specific model rankings. The dataset, curated from 15 competition sources across 25 years, provides good coverage, though the construction-centric subset is heavily concentrated in extremal problems (28/50).
ComBench makes a meaningful contribution by formalizing the distinction between proof reasoning and constructive realization in Olympiad combinatorics, backed by a carefully designed evaluation protocol and informative empirical results. The benchmark size is modest and the evaluation has inherent limitations (LLM judge, prescribed formats), but the conceptual framework and diagnostic insights are valuable for the mathematical reasoning community. The work is well-executed within its scope and addresses a timely gap.
Generated Jun 10, 2026
ComBench addresses a critical gap in evaluating frontier LLMs on Olympiad-level combinatorics, a domain central to mathematical AI research. Its benchmark design separating analysis and construction reasoning provides novel diagnostic insights about distinct model capabilities. The benchmark serves the rapidly growing field of mathematical reasoning in LLMs and will likely be widely adopted. Paper 2, while methodologically interesting in its self-gated clarification approach for hierarchical agents, targets a narrower application domain (tariff classification) and has more limited generalizability despite its sound experimental design.
Paper 2 offers broader real-world applications and interdisciplinary impact by applying structured LLMs to conflict resolution. Its rigorous human-subject experiments comparing AI against professional human mediators demonstrate immediate practical utility in economics, law, and psychology. In contrast, while Paper 1 provides a valuable benchmark for mathematical reasoning, it contributes primarily to a crowded subfield of AI evaluation and may have a shorter lifespan of relevance as models rapidly evolve.
While Paper 1 presents a valuable human-in-the-loop framework for engineering simulations, Paper 2 addresses a fundamental and highly active area in artificial intelligence: advanced mathematical reasoning in LLMs. By introducing a rigorous benchmark for Olympiad-level combinatorics, Paper 2 has a significantly broader potential impact across the vast AI research community, as evaluating and improving frontier models' reasoning capabilities is currently a central bottleneck in global AI development.
Paper 1 addresses the practically important and timely problem of LLM-based office automation with a large-scale, well-structured benchmark (200 tasks, 7,118 criteria) that directly evaluates real-world productivity software use. This has broader impact across industry and research, as office automation affects millions of users. Paper 2, while rigorous, targets a narrower niche (Olympiad combinatorics reasoning) with only 100 problems. Paper 1's findings on the significant gap between LLM capabilities and reliable document automation have more immediate implications for the rapidly growing LLM agent deployment ecosystem.
Paper 2 likely has higher scientific impact: it introduces a new, broadly useful benchmark (ComBench) targeting a timely, unsolved weakness in LLMs—rigorous proof reasoning and constructive combinatorics—with a careful evaluation protocol (rubric-guided grading + deterministic verification). Benchmarks often catalyze progress across many labs and subfields (LLM evaluation, reasoning, formal methods, education), and the results indicate it is not saturated. Paper 1 is a solid domain adaptation study but is more incremental (standard LoRA/NEFTune) with limited dataset size and narrower applicability (financial NER).
ActiveMem proposes a novel architectural framework addressing a fundamental limitation in LLM agents (context overload) by decoupling reasoning and memory. This innovation has broad applicability across numerous long-horizon tasks and agentic workflows, offering a scalable solution to improve efficiency and capability. While ComBench provides a rigorous evaluation for a specific niche (combinatorics), ActiveMem's contribution directly enhances model capabilities, promising wider real-world applications and higher overall scientific impact across AI research.
ComBench offers a concrete, reproducible benchmark with empirical results on frontier LLMs, addressing a well-defined gap in evaluating combinatorial reasoning. It provides actionable diagnostics and a clear evaluation protocol. Paper 2 proposes a speculative theoretical framework ('Soul Computing') with vague claims about AI consciousness, lacking empirical validation, rigorous methodology, or falsifiable hypotheses. ComBench's methodological rigor, practical utility for the active LLM evaluation community, and timeliness give it substantially higher scientific impact potential.
ComBench addresses a timely and high-impact topic—benchmarking LLM reasoning capabilities in mathematical problem-solving. It introduces a novel benchmark with careful evaluation methodology distinguishing proof reasoning from constructive realization, relevant to the rapidly growing AI/LLM research community. Its breadth of impact across AI, mathematics education, and reasoning research is substantial. Paper 1, while competent, addresses a more incremental improvement in fault diagnosis using belief rule bases, a narrower domain with less transformative potential and smaller research community engagement.
Paper 1 addresses a fundamental and practical challenge in cross-modal knowledge distillation without paired data, providing both theoretical foundations and a principled algorithmic framework with guarantees. It has broader real-world applicability across multimodal AI systems where paired data is scarce. Paper 2, while valuable as a benchmark for evaluating LLM combinatorial reasoning, has narrower impact primarily within the LLM evaluation community and will likely become outdated as models improve. Paper 1's theoretical contributions on distributional alignment are more enduring and broadly applicable across machine learning.
Paper 1 proposes a novel, domain-agnostic framework for self-bootstrapping LLM agents, offering broad applicability across various autonomous tasks. Its algorithmic innovations (WIA and AIW) directly address core challenges in agent generalization and learning efficiency. In contrast, Paper 2 introduces a highly rigorous but domain-specific benchmark for combinatorics. While valuable for evaluation, Paper 1's general methodological advancements in agent architecture give it a significantly higher potential for widespread adoption and real-world impact across diverse fields.