OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields

Wanhao Liu, Jiaqing Xie, Qian Tan, Weida Wang, Jue Wang, Ran Sun, Zhuo Yang, Wanli Ouyang

May 28, 2026

arXiv:2605.29833v1

cs.AI (primary)

#1278 of 2821 · Artificial Intelligence

Tournament Score: 1419 ± 48 (scale 1050–1800)

Win Rate: 54%

Rating: 7/10

Significance 7.5 · Rigor 6.5 · Novelty 6.5 · Clarity 7

Abstract

As multimodal language models play an increasingly important role in scientific research, materials science offers a critical testbed due to its interdisciplinary, multimodal, and application-driven nature. However, existing materials benchmarks mainly focus on property prediction, knowledge QA, or characterization understanding, leaving the broader reasoning process from materials knowledge to application underexplored. To fill this gap, we present OmniMatBench, a human-calibrated multimodal reasoning benchmark for materials science. OmniMatBench contains 3,171 expert-curated QA and calculation problems across 19 materials-science subfields, spanning fundamental materials knowledge, structural and engineering materials, materials processing and manufacturing, and functional and applied materials. We evaluate 13 open-source and closed-source MLLMs and find that the best model achieves only a 0.372 overall score, revealing a substantial gap in current materials-science reasoning. Further analysis shows strong variation across subfields, fixed reasoning heuristics, uneven materials knowledge, and limited high-level knowledge application under formula-, retrieval-, and code-assisted settings. OmniMatBench provides crucial insights into the capabilities and limitations of current MLLMs and establishes a foundation for reliable AI assistants in materials-science research.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: OmniMatBench

1. Core Contribution

OmniMatBench introduces a human-calibrated multimodal reasoning benchmark spanning 3,171 expert-curated QA and calculation problems across 19 materials science subfields. Its central novelty lies in two aspects: (a) breadth of coverage organized around a Knowledge–Structure–Processing–Application (KSPA) taxonomy that goes beyond prior benchmarks focused narrowly on property prediction, knowledge recall, or characterization understanding; and (b) a fine-grained evaluation protocol that separates rubric-based open-ended QA scoring (with expert key points) from strict slot-based calculation evaluation (with formula traces, unit requirements, and numerical precision constraints). This dual evaluation design enables diagnosis of *where* models fail—concept identification, formula selection, multimodal grounding, unit-aware computation, or answer formatting—rather than merely *whether* they fail.
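To make the strict slot-based calculation scoring described above concrete, here is a minimal Python sketch. It is not the paper's implementation: the `GoldSlot` structure, the unit normalization, and the 1% relative tolerance are all assumptions chosen for illustration. The idea shown is that a problem scores 1 only when every answer slot matches in both value and unit.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldSlot:
    """One expected answer slot (hypothetical structure, not from the paper)."""
    value: float
    unit: str
    rel_tol: float = 0.01  # assumed numerical precision constraint


def normalize_unit(unit: str) -> str:
    """Very light normalization: drop spaces and unify the product sign."""
    return unit.replace(" ", "").replace("·", "*")


def slot_matches(pred_value: float, pred_unit: str, gold: GoldSlot) -> bool:
    """A slot is correct only if both the unit and the value agree."""
    units_agree = normalize_unit(pred_unit) == normalize_unit(gold.unit)
    values_agree = math.isclose(pred_value, gold.value, rel_tol=gold.rel_tol)
    return units_agree and values_agree


def strict_slot_score(preds: list[tuple[float, str]], golds: list[GoldSlot]) -> float:
    """Return 1.0 only if every slot is correct, otherwise 0.0."""
    if len(preds) != len(golds):
        return 0.0
    all_ok = all(slot_matches(v, u, g) for (v, u), g in zip(preds, golds))
    return 1.0 if all_ok else 0.0


# Illustrative values only (not taken from the benchmark).
gold = [GoldSlot(210.0, "GPa"), GoldSlot(7.85, "g/cm^3")]
right = [(210.5, " GPa"), (7.85, "g/cm^3")]
wrong_unit = [(210.5, "MPa"), (7.85, "g/cm^3")]

print(strict_slot_score(right, gold))
print(strict_slot_score(wrong_unit, gold))
```

Under this all-or-nothing rule, a numerically correct answer with the wrong unit scores the same as a wrong number, which is one way a unit-aware computation failure becomes visible in the aggregate score.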

The benchmark fills a genuine gap: existing materials science benchmarks (MatBench, MaCBench, MatVQA, MSQA, MatQnA, MatCha, MATRIX) either focus on text-only tasks, multiple-choice formats, or narrower subfield coverage. OmniMatBench is the first to systematically include engineering-critical but underrepresented subfields such as powder materials, gem materials and gemology, welding technology, and nanomaterials, while also requiring free-form numerical computation.

2. Methodological Rigor

The data construction pipeline is multi-stage and expert-involved: source material selection, expert extraction and adaptation, LLM-based preliminary verification, and human expert validation. The annotation team is substantial (8 PhD researchers, 60 trained annotators, 5 professors), though the paper provides limited detail on inter-annotator agreement or disagreement resolution beyond stating that ambiguous cases were resolved by professors.

The evaluation protocol is carefully designed. QA evaluation uses precision, recall, and F1 against expert-curated key points, validated against human judgments with Spearman correlations of 0.91–0.94 and F1 MAE of 0.122. This is reasonable but not perfect—an MAE of 0.122 on a 0–1 scale suggests non-trivial noise, which the authors acknowledge. CAL evaluation uses strict slot exact accuracy with normalization, which is appropriately stringent for computational tasks.
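The metrics named in this paragraph can be sketched as follows. This is an illustrative Python example under stated assumptions, not the authors' evaluator: it assumes precision is the share of a response's claims supported by expert key points and recall is the share of key points covered, and it uses made-up scores to show how MAE and Spearman correlation between automated and human F1 would be computed. All function names are hypothetical.

```python
import math
from statistics import mean


def key_point_prf(covered: int, total_key_points: int,
                  supported: int, total_claims: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from key-point match counts (assumed definitions)."""
    precision = supported / total_claims if total_claims else 0.0
    recall = covered / total_key_points if total_key_points else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_absolute_error(a: list[float], b: list[float]) -> float:
    """Average absolute gap between two paired score lists."""
    return mean(abs(x - y) for x, y in zip(a, b))


def average_ranks(xs: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(a: list[float], b: list[float]) -> float:
    """Spearman correlation as the Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


# Made-up F1 scores for four responses (illustration only).
auto_f1 = [0.2, 0.5, 0.9, 0.4]
human_f1 = [0.1, 0.6, 0.8, 0.5]

print(key_point_prf(covered=3, total_key_points=4, supported=3, total_claims=5))
print(mean_absolute_error(auto_f1, human_f1))
print(spearman(auto_f1, human_f1))
```

In the toy data the automated scores rank the responses exactly as the human scores do while still differing by about 0.1 on each item. This mirrors the point made above: high rank correlation and a non-trivial MAE can coexist.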

The evaluation of 13 MLLMs (8 closed-source, 5 open-source) is comprehensive and includes contemporary frontier models (Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro, o3, etc.). The inclusion of ablation studies—formula assistance (oracle vs. distractor retrieval), code execution augmentation, and difficulty-stratified analysis—adds analytical depth beyond simple leaderboard reporting. These probe experiments are the most intellectually interesting part of the paper, revealing that formula access alone is insufficient (best score only 35.62 with oracle context) and that code execution can actually *hurt* performance (GPT-5.5 drops from 27.65 to 19.06), pointing to a knowledge-to-execution gap rather than a pure arithmetic bottleneck.

One limitation in rigor: the human baseline is computed from only two PhD students, which is statistically thin for calibration purposes. The paper appropriately caveats this, but a more robust human reference would strengthen claims about the difficulty level.

3. Potential Impact

Within materials science AI: OmniMatBench provides a much-needed comprehensive evaluation infrastructure. The KSPA taxonomy and 19-subfield coverage could become a standard diagnostic tool for materials-focused AI development, particularly as the field moves beyond property prediction toward reasoning and decision support.

For MLLM development: The finding that even the best model achieves only 0.372 overall (and that code augmentation hurts performance) provides actionable feedback for model developers. The category-level analysis reveals specific failure modes—fixed reasoning heuristics, inability to distinguish applicable formulas from plausible distractors, visual parameter misinterpretation—that could guide targeted improvements.

Broader scientific AI: The evaluation methodology—combining rubric-based conceptual assessment with structured computational evaluation, plus systematic probing of assistance modalities—could serve as a template for benchmarks in other applied science domains (chemistry, civil engineering, biomedical engineering).

The qualitative error case studies (Appendix H) are particularly valuable, providing concrete failure taxonomies that are transferable across scientific reasoning contexts.

4. Timeliness & Relevance

The paper is highly timely. The rapid deployment of MLLMs as scientific assistants creates an urgent need for rigorous domain-specific evaluation. The materials science community specifically lacks comprehensive reasoning benchmarks that test the full pipeline from knowledge understanding to engineering application. The inclusion of very recent models (GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro, all 2026 releases) ensures immediate relevance.

5. Strengths & Limitations

Key Strengths:

Unprecedented breadth across 19 materials subfields with KSPA taxonomy

Dual QA+CAL evaluation with fine-grained failure diagnosis

Insightful probe experiments (formula assistance, code execution, difficulty analysis) that go beyond leaderboard reporting

Expert-curated key points enabling process-level evaluation, not just final-answer matching

Strong practical utility: the benchmark identifies specific bottlenecks for improving AI-assisted materials research

Notable Limitations:

The human baseline is statistically weak (n=2), limiting calibration claims

Some subfield imbalance in question counts (e.g., 98 for nanomaterials vs. 231 for electronic information materials)

The automated QA evaluator, while validated, has non-trivial MAE (0.122), introducing evaluation noise

Limited discussion of potential data contamination—given that source materials are from "classical and used materials knowledge resources," some content may appear in training data of evaluated models

The paper does not release full details of the LLM verifier stage, reducing reproducibility of the construction pipeline

No analysis of whether multimodal inputs are actually necessary for solving each problem (i.e., how many questions could be answered without the images)
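The missing image-necessity analysis noted in the last limitation could be run along these lines. This is a hypothetical Python sketch, not something the paper reports: it assumes per-question correctness is available for two runs of the same model, one with images and one with images withheld, and the function and key names are invented for illustration.

```python
def image_necessity(with_image: dict[str, bool],
                    text_only: dict[str, bool]) -> dict[str, float]:
    """Summarize how often a question is solved with and without its image.

    Both arguments map a question id to whether the model answered correctly.
    Only questions present in both runs are compared.
    """
    ids = sorted(with_image.keys() & text_only.keys())
    if not ids:
        raise ValueError("no overlapping question ids between the two runs")
    n = len(ids)
    solved_text_only = sum(text_only[q] for q in ids)
    image_dependent = sum(with_image[q] and not text_only[q] for q in ids)
    return {
        "solvable_without_image": solved_text_only / n,
        "image_dependent": image_dependent / n,
    }


# Made-up outcomes for four questions (illustration only).
run_with_image = {"q1": True, "q2": True, "q3": False, "q4": True}
run_text_only = {"q1": True, "q2": False, "q3": False, "q4": False}

print(image_necessity(run_with_image, run_text_only))
```

A high `solvable_without_image` share would suggest that many questions do not actually require visual grounding, while a high `image_dependent` share would support the benchmark's multimodal claim.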

Additional Observations

The paper's comparison table (Table 1) effectively positions OmniMatBench relative to prior work. The "Difficult" designation (best model <50%) is meaningful and suggests the benchmark will remain useful for some time. The extensive appendix with qualitative error cases adds substantial analytical value, though the main paper could better synthesize these into actionable recommendations.

The benchmark's design choice to include both common and rare materials subfields is commendable from a coverage perspective but may introduce evaluation noise due to varying question quality across subfields with different availability of expert-curated source materials.

Rating: 7/10

Significance 7.5 · Rigor 6.5 · Novelty 6.5 · Clarity 7

Generated May 29, 2026

Comparison History (13)

vs. Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility

gemini-3.1 · 5/29/2026

Paper 1 introduces a comprehensive, expert-curated multimodal benchmark for materials science, addressing a critical bottleneck in 'AI for Science.' While Paper 2 offers a valuable methodological improvement for LLM distillation, Paper 1 has a higher potential for broad interdisciplinary impact by establishing a foundational testbed that could accelerate real-world materials discovery and engineering.

vs. BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

gemini-3.1 · 5/29/2026

Paper 2 demonstrates higher potential scientific impact due to its broad applicability across the rapidly advancing field of autonomous AI agents. While Paper 1 provides a valuable domain-specific benchmark for materials science, Paper 2 addresses a fundamental bottleneck in general LLM capabilities: self-reflection and controlled evolution. By introducing a novel metric (Failure Avoidance Rate) and a controlled simulation framework, BenchTrace offers a methodological leap for evaluating agentic improvement that transcends specific domains, ensuring wider adoption and relevance across artificial intelligence research.

vs. When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs

gemini-3.1 · 5/29/2026

Paper 2 presents a rigorous, human-calibrated benchmark for multimodal AI in materials science, a critical domain for real-world physical discoveries and engineering. While Paper 1 offers useful insights into LLM prompting behavior, Paper 2's potential to accelerate AI-driven scientific discovery in a specialized, high-impact field gives it greater long-term scientific value and interdisciplinary impact.

vs. Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

gemini-3.1 · 5/29/2026

Paper 2 introduces a highly novel backward-learning methodology for GPU kernel optimization, addressing a critical bottleneck in AI and high-performance computing. While Paper 1 provides a valuable domain-specific benchmark for materials science, Paper 2's automated optimization framework has broader potential impact across all computational fields by significantly improving the efficiency of underlying hardware execution and advancing AI-driven code optimization.

vs. Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

claude-opus-4.6 · 5/29/2026

OmniMatBench addresses a significant gap in evaluating MLLMs for materials science, a field with enormous real-world applications. Its breadth (19 subfields, 3,171 problems, 13 models evaluated) and human-calibrated design make it a foundational benchmark resource that can drive progress across materials science and AI. Paper 1, while technically sound, offers an incremental improvement to RLVR training via first-token diversification—a narrower contribution within the reinforcement learning community. Benchmarks tend to have broader, longer-lasting impact by shaping research directions across multiple communities.

vs. CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval

gpt-5.2 · 5/29/2026

Paper 2 likely has higher scientific impact because it introduces a large, expert-curated, human-calibrated benchmark spanning 19 materials-science subfields, enabling standardized evaluation of multimodal reasoning in a high-impact scientific domain. Its breadth and timeliness (MLLMs for scientific discovery), plus clear evidence of current capability gaps, can influence both AI and materials research communities and guide model development. Paper 1 is a solid, methodologically interesting retrieval-training contribution, but its impact is narrower (tool/API retrieval) and evaluated on a relatively small subset, with more incremental gains.

vs. Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning

gpt-5.2 · 5/29/2026

Paper 2 has higher potential impact due to a broadly applicable, novel framework (cognitive scheduling of visual evidence acquisition) that addresses structural limitations in multimodal reasoning and can generalize across many tasks and domains beyond a single field. Its architectural idea (LM-controlled, on-demand perception) is timely and could influence future multimodal system design, offering clear real-world applicability. Paper 1 is valuable and rigorous but primarily contributes a domain-specific benchmark; its impact is likely more concentrated within materials science and evaluation work rather than reshaping core multimodal reasoning methods.

vs. Quantifying and Optimizing Simplicity via Polynomial Representations

gpt-5.2 · 5/29/2026

Paper 1 is more likely to have higher broad scientific impact: it proposes a new, general-purpose, quantitative simplicity metric (effective polynomial degree) that predicts generalization and yields a differentiable regularizer improving performance across diverse settings (vision, text, VLM fine-tuning, RL). This is methodologically innovative and potentially influences theory and practice across ML. Paper 2 is timely and valuable for materials-science AI evaluation, but as a benchmark its impact is narrower (materials + MLLM assessment) and depends on community adoption; it is less likely to shift core methodology across fields than Paper 1.

vs. You Live More Than Once: Towards Hierarchical Skill Meta-Evolving

gemini-3.1 · 5/29/2026

Paper 1 introduces a comprehensive, expert-curated benchmark bridging AI and materials science. High-quality, domain-specific benchmarks often have long-lasting foundational impact by driving model development and evaluating progress in interdisciplinary scientific discovery. While Paper 2 offers an interesting algorithmic advancement for general LLM agents, Paper 1 provides a highly rigorous, large-scale resource that directly facilitates the application of AI to real-world physical sciences, likely yielding broader and more sustained interdisciplinary impact.

vs. The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

gemini-3.1 · 5/29/2026

While Paper 1 provides a rigorous statistical correction to a specific LLM evaluation debate, Paper 2 introduces a foundational, expert-curated benchmark bridging AI and materials science. OmniMatBench has higher potential impact due to its interdisciplinary breadth and direct relevance to accelerating real-world physical science discoveries. Establishing a comprehensive evaluation framework across 19 subfields will likely spur widespread development of domain-specific multimodal models, driving innovations in materials design and manufacturing that extend far beyond the AI community itself.

vs. EvoMD-LLM: Learning the Language of Species Evolution in Reactive Molecular Dynamics

claude-opus-4.6 · 5/29/2026

OmniMatBench has broader impact potential: it establishes a comprehensive benchmark across 19 materials science subfields with 3,171 expert-curated problems, evaluates 13 models, and identifies systematic gaps in MLLM reasoning. Benchmarks historically drive community-wide progress and attract citations. Its breadth across materials science subfields and relevance to the rapidly growing MLLM evaluation space give it wider applicability. EvoMD-LLM, while novel in framing reactive MD as language modeling, addresses a narrower problem with moderate accuracy (66.14%) and more limited immediate practical applications.

vs. TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems

gemini-3.1 · 5/29/2026

Paper 2 offers a novel, broadly applicable framework for optimizing multi-agent LLM systems by co-evolving prompts and communication topologies. Its ability to maintain high accuracy on major benchmarks while significantly reducing token costs addresses a critical bottleneck in modern AI. While Paper 1 provides a valuable domain-specific benchmark, Paper 2's fundamental methodological advancement in multi-agent architectures has a wider potential for cross-disciplinary impact and widespread adoption in AI research and applications.

vs. Double-Edged Sword or Sharp Tool? Designing and Evaluating Triadic LLM-Teacher Collaboration for K-12 Writing at Scale

gemini-3.1 · 5/29/2026

Paper 2 presents a massive, longitudinal empirical study (10,195 students across 120 schools over two years) on LLM-teacher collaboration. Its exceptional methodological rigor and scale provide definitive insights into real-world educational impacts, giving it broader societal relevance and stronger empirical validation compared to the domain-specific benchmark presented in Paper 1.