TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
Zhiqiang Liu, Wenhui Dong, Yilang Tan, Yuwen Qu, Haochen Yin, Chenyang Si
Abstract
Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce MM-ToolBench, a benchmark and evaluation harness for task-oriented omni-modal tool use. MM-ToolBench contains 100 executable tasks from two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of MM-ToolBench is closed-loop multimodal verification: agents must execute tools, inspect rendered or transformed artifacts, and self-correct when outputs fail task-specific requirements. To make such evaluation scalable and verifiable, MM-ToolBench couples MCP-based execution with task-specific grounded evaluators and a semi-automated construction pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. Experiments on 15 contemporary agentic models show that MM-ToolBench remains highly challenging: Claude Opus 4.6, commonly regarded as one of the strongest coding-agent models, achieves only 32.0% task success, far below the 94.0% human benchmark. We envision MM-ToolBench as a practical foundation for evaluating and advancing next-generation omni-modal tool-using agents through closed-loop multimodal verification.
AI Impact Assessments
Scientific Impact Assessment: TOBench
1. Core Contribution
TOBench addresses a genuine gap in the evaluation landscape for tool-using AI agents: the lack of benchmarks that simultaneously test multimodal perception, tool execution, artifact inspection, and iterative self-correction within realistic professional workflows. The benchmark contributes 100 executable tasks across 20 subcategory slices, supported by 27 MCP (Model Context Protocol) servers with 324 tools, organized into two macro families—Customer Service and Intelligent Creation.
The central design principle of closed-loop multimodal verification is the paper's most distinctive conceptual contribution. Unlike benchmarks that evaluate tool selection or final-answer correctness in isolation, TOBench requires agents to execute tools, inspect rendered artifacts (documents, images, presentations), and revise their outputs when they fail task-specific requirements. This perceive–act–inspect–revise loop better mirrors real-world professional workflows.
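The perceive-act-inspect-revise loop described above can be sketched in a few lines. This is only an illustrative sketch, not the TOBench harness: `closed_loop`, `produce`, and `inspect` are hypothetical names, and the toy "artifact" is a slide title with an assumed 20-character limit standing in for a rendered document or image.

```python
from typing import Callable, Optional, Tuple, TypeVar

A = TypeVar("A")


def closed_loop(
    produce: Callable[[Optional[str]], A],
    inspect: Callable[[A], Optional[str]],
    max_rounds: int = 3,
) -> Tuple[A, bool]:
    """Run act -> inspect -> revise until inspection passes or rounds run out.

    `produce` takes the previous feedback (None on the first round) and returns
    an artifact; `inspect` returns None if the artifact is acceptable, otherwise
    a feedback string. Both are hypothetical stand-ins for tool calls.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        artifact = produce(feedback)   # act: call tools, build the artifact
        feedback = inspect(artifact)   # inspect: check the produced artifact
        if feedback is None:           # requirements met, stop revising
            return artifact, True
    return artifact, False             # budget exhausted without passing


# Toy example (assumption): a slide title must fit in 20 characters.
def produce(feedback: Optional[str]) -> str:
    title = "Quarterly Customer Satisfaction Overview"
    return title[:20] if feedback else title


def inspect(title: str) -> Optional[str]:
    return None if len(title) <= 20 else "title too long"


artifact, ok = closed_loop(produce, inspect)
open_loop_artifact, open_loop_ok = closed_loop(produce, inspect, max_rounds=1)
```

With `max_rounds=1` the agent never acts on the inspection result, which corresponds to the open-loop behaviour the review contrasts with closed-loop verification.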
2. Methodological Rigor
Strengths in design: The task formalism (Equation 1) is well-specified, decomposing each instance into instruction packages, executable environments, latent state, action/observation spaces, transition dynamics, evaluation criteria, and grounded verifiers. The three-tier evaluation criterion taxonomy (format constraints, judge-based multimodal constraints, tool/result constraints) is sensible and well-motivated.
Construction pipeline: The semi-automated pipeline—scenario discovery, task instantiation, evaluator synthesis, and human audit—is a reasonable approach, though it raises reproducibility questions. The paper acknowledges significant manual curation (two PhD students, approximately one month), with roughly two-thirds of initially collected cases discarded. This manual intervention is both a strength (quality control) and a limitation (scalability).
Evaluation concerns: The per-task evaluator synthesis approach is creative but introduces potential evaluator variance. Using VLM-based judges for multimodal constraints inherits known biases from LLM-as-judge paradigms. The paper acknowledges this but does not report inter-evaluator agreement or any other evaluator reliability metric. The strict all-or-nothing task success metric (Equation 10) is principled but may obscure partial progress; a supplementary partial-credit analysis would strengthen the findings.
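To make the contrast between strict success and partial credit concrete, the sketch below scores one hypothetical task both ways. It assumes that strict success is the conjunction of all per-criterion checks, which is how this review describes the metric; the paper's Equation 10 is not reproduced here, and the criterion names are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CriterionResult:
    name: str      # hypothetical criterion label
    passed: bool   # outcome of that criterion's verifier


def strict_success(results: List[CriterionResult]) -> bool:
    """All-or-nothing: the task counts only if every criterion passes."""
    return bool(results) and all(r.passed for r in results)


def partial_credit(results: List[CriterionResult]) -> float:
    """Fraction of criteria passed (an assumed, unweighted variant)."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


# Hypothetical task with one criterion from each of the three tiers.
results = [
    CriterionResult("format: output is a .pptx file", True),
    CriterionResult("judge: chart is legible in the rendered slide", False),
    CriterionResult("tool/result: figures match the source table", True),
]

strict = strict_success(results)
partial = partial_credit(results)
```

Here the task fails under the strict metric even though two of three criteria pass, which is the kind of partial progress a supplementary analysis would surface.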
Experimental breadth: Testing 15 models is commendable, but 100 tasks is a small sample. The paper reports no confidence intervals, so it is hard to tell whether closely ranked models (e.g., Claude Opus 4.6 at 32.0% vs. Gemini-3-Pro at 32.0%) differ meaningfully.
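A quick way to see why 100 tasks limits the resolution of the leaderboard is to compute a binomial confidence interval for a reported success rate. The sketch below uses the Wilson score interval at an approximate 95% level; treating the tasks as independent trials is an assumption, and the function name is illustrative.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# 32 of 100 tasks solved, matching the 32.0% success rate quoted above.
low, high = wilson_interval(32, 100)
print(f"32/100 -> 95% CI: [{low:.3f}, {high:.3f}]")
```

At this sample size the interval spans roughly nine percentage points on either side of the point estimate, so differences of a few points between models are not distinguishable without more tasks or repeated runs.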
3. Potential Impact
Practical relevance: The benchmark targets a real and growing need—evaluating agents in professional workflows involving heterogeneous tools and multimodal artifacts. The MCP-based infrastructure aligns with an emerging industry standard, making the benchmark directly relevant to deployed systems.
Diagnostic value: The detailed error taxonomy (Table 4) with five top-level categories and 16 subcategories, combined with per-model error heatmaps, provides actionable diagnostic information. The finding that tool call and parameter errors remain the dominant bottleneck—not higher-level reasoning—is a useful insight for the community.
Gap identification: The large gap between human and model performance (94% for humans vs. 41% for the best model) shows that current systems are far from solving these tasks, establishing TOBench as a meaningful challenge benchmark for the near-to-medium term.
Limitations on broader impact: The 100-task scale, while carefully curated, is small compared to benchmarks like ToolBench (16,000+ APIs) or OSWorld (369 tasks). The narrow focus on two macro families limits generalizability claims. The MCP ecosystem dependency means benchmark longevity depends on protocol stability.
4. Timeliness & Relevance
The paper is highly timely. MCP adoption is accelerating, multimodal agents are becoming production-relevant, and the community lacks integrated benchmarks that test the full agent loop. The emphasis on closed-loop verification addresses a specific blind spot: many current benchmarks reward open-loop execution without penalizing agents that fail to self-verify. The finding that "missing visual verification" is a distinct and common failure mode is novel and practically important.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Overall Assessment
TOBench makes a solid contribution to the agent evaluation literature by combining multimodal tool use, artifact inspection, and self-correction into a unified benchmark. The closed-loop verification concept is well-motivated and the error analysis is informative. However, the modest scale, lack of evaluator validation, and absence of ablation studies limit the strength of the empirical contribution. The work is more of an engineering and design contribution than a methodological breakthrough, but it fills a genuine gap and provides a useful testbed for the rapidly growing tool-using agent community.
Generated May 19, 2026
Comparison History (20)
Paper 1 introduces a comprehensive, end-to-end benchmark for omni-modal, tool-using agents grounded in real-world workflows. High-quality benchmarks frequently drive broad foundation model development and often attract high citation counts and widespread adoption. While Paper 2 offers a novel and rigorous methodological improvement for multi-agent aggregation using signed graphs, its scope is narrower. Paper 1's focus on closed-loop multimodal verification and integration with emerging standards (MCP) gives it more immediate, widespread applicability and a higher potential to shape the trajectory of autonomous agent research.
Paper 2 likely has higher scientific impact due to its strong real-world applicability and timeliness: an executable, tool-integrated, omni-modal benchmark with closed-loop verification directly targets a central bottleneck in evaluating and improving agentic systems. Such benchmarks often become shared infrastructure, enabling broad, cross-field impact (LLMs/agents, HCI, software engineering, multimodal learning, evaluation). Paper 1 is methodologically rigorous and novel in regret characterization for MNL MDPs, but is more specialized with narrower immediate adoption outside RL theory.
AnchorDiff introduces a genuinely novel approach—the first masked-diffusion framework for radiology report generation—combining knowledge-graph-derived clinical anchors with diffusion language modeling. This represents a significant methodological innovation that bridges diffusion models and medical NLP, with clear clinical applications and potential to influence both the diffusion modeling and medical AI communities. Paper 1 (TOBench) is a solid benchmark contribution but is incremental in nature, combining existing evaluation paradigms. Benchmarks have shorter-lived impact and face rapid obsolescence, whereas Paper 2's architectural innovations are more likely to inspire follow-up research across multiple domains.
Paper 2 likely has higher impact due to broader, timely applicability: an executable, closed-loop benchmark for real-world omni-modal tool-using agents with grounded evaluators and scalable infrastructure. Benchmarks often become community standards, shaping model development across multimodal reasoning, agentic tool use, evaluation, and safety. Its methodological contribution (verifiable execution + artifact inspection + self-correction loops) and strong baseline results indicate relevance and adoption potential. Paper 1 is novel and useful for cost-efficient LLM evaluation on generative tasks, but its impact is narrower (sampling/evaluation efficiency) and more incremental relative to the benchmark/platform effect.
Paper 2 likely has higher scientific impact because it introduces a broad, timely benchmark/harness for omni-modal, closed-loop tool-using agents—a central emerging paradigm with wide applicability across AI research and industry. Its executable tasks, grounded evaluators, and verification loop can become a standard for measuring progress, enabling reproducibility and catalyzing model/tooling advances across many subfields (agentic LMs, multimodal reasoning, evaluation, HCI). Paper 1 is a solid, novel RL credit-assignment improvement but is narrower in scope (generative recommendation with SIDs) and may impact a more specialized community.
Paper 1 (TOBench/MM-ToolBench) addresses a broader, more fundamental challenge in AI—omni-modal tool use with closed-loop verification—spanning multiple domains with 324 tools and MCP-based infrastructure. Its contributions (benchmark design, evaluation harness, multimodal verification paradigm) are generalizable across the agent research community. Paper 2 (TeleCom-Bench) is valuable but narrowly focused on telecommunications, limiting its cross-field impact. While both reveal significant performance gaps, Paper 1's domain-agnostic framework and methodological contributions (semi-automated construction pipeline, grounded evaluators) will likely influence broader agent and benchmark research.
Paper 2 likely has higher scientific impact: it introduces a large, executable, omni-modal, closed-loop benchmark with grounded evaluators and an end-to-end harness, directly enabling broad, standardized evaluation of real-world tool-using agents. This has immediate real-world relevance, strong timeliness, and wide applicability across agent systems, multimodal reasoning, HCI, and software/tool integration. Paper 1 is novel and methodologically interesting, but is narrower (test-time aggregation for reasoning traces) and its impact depends on adoption in specific inference pipelines, whereas a widely used benchmark can shape an entire research area and accelerate progress.
Paper 2 has higher likely impact due to its timeliness and broad relevance to rapidly evolving tool-using, multimodal AI agents. A high-quality benchmark with executable tasks, grounded evaluators, and closed-loop verification can become community infrastructure, enabling reproducible comparisons and accelerating progress across NLP, CV, HCI, and agent systems. Its methodological contribution (scalable construction + verifiable evaluation harness) and clear empirical gap to humans suggest strong downstream influence. Paper 1 is a useful optimization/clustering variant with niche applications, but incremental relative to extensive clustering/metaheuristic literature and narrower cross-field reach.
Paper 2 introduces a comprehensive benchmark for evaluating omni-modal tool-using AI agents, a critical and rapidly growing area in artificial intelligence. Benchmarks typically exert massive influence by standardizing evaluation and driving future model development across multiple domains. In contrast, while Paper 1 presents an innovative approach to hypothesis generation, its immediate impact is largely confined to the specific domain of nanomedicine.
Paper 2 introduces a much-needed benchmark for omni-modal, tool-using AI agents, directly addressing a critical bottleneck in the rapidly expanding field of agentic AI. Benchmarks like this typically drive immediate and widespread technical progress, adoption, and citations across the AI community. While Paper 1 offers a valuable resource for AI governance, Paper 2 has a significantly broader scope of impact, higher technical timeliness, and more direct applicability to real-world AI development workflows.
Paper 1 provides deeper scientific insights through systematic investigation of mobile world models across four modalities, yielding three actionable findings about representation effectiveness, training transferability, and agent guidance strategies. Its contributions span both model training and evaluation methodology with novel empirical conclusions. Paper 2, while addressing a real gap with a comprehensive benchmark (TOBench/MM-ToolBench), is primarily an evaluation resource. Paper 1's findings about when and how world models help GUI agents offer more generalizable scientific contributions that can influence future agent architectures and training paradigms.
Paper 1 addresses a fundamental bottleneck in AI for Science by enabling LLMs to understand complex chemical reaction diagrams. Its novel Visual Anchor mechanism and impressive 20-point performance gain directly accelerate chemical reasoning and drug discovery. While Paper 2 offers a valuable benchmark for general tool-use agents, it operates in a highly saturated area of AI evaluation. Paper 1's targeted methodology bridges a critical modality gap in molecular science, offering higher potential for transformative real-world scientific breakthroughs.
Paper 2 demonstrates broader and more timely scientific impact. While Paper 1 offers a valuable domain-specific contribution to cultural heritage knowledge graphs, Paper 2 addresses a critical bottleneck in the rapidly advancing field of autonomous AI agents: evaluating omni-modal tool use in realistic workflows. By providing a comprehensive, executable benchmark with closed-loop multimodal verification and testing 15 contemporary models, Paper 2 establishes a foundational evaluation standard that will likely be widely adopted by researchers developing next-generation multimodal AI agents across various domains.
Paper 2 is likely higher impact: it introduces a substantial, executable omni-modal benchmark with closed-loop multimodal verification, grounded evaluators, and an end-to-end harness—assets that can standardize evaluation and drive progress across agentic LLMs, multimodal reasoning, HCI, and tool-use research. The methodology appears more rigorous (measurable tasks, human baseline, multi-model experiments) and broadly applicable to real-world workflows. Paper 1 is useful engineering work (reducing vendor lock-in) but is less novel scientifically and may have narrower, more incremental impact.
Benchmark papers in highly active areas like omni-modal tool-using agents typically generate widespread, immediate scientific impact by establishing standardized evaluation metrics. While Paper 1 offers a valuable theoretical paradigm, Paper 2 provides a concrete, scalable evaluation framework that addresses a critical gap in current AI research, directly facilitating and measuring the development of next-generation agentic models.
Paper 1 introduces a novel concept—temporal memory contamination—that identifies a previously understudied longitudinal safety risk in memory-equipped LLM agents. It proposes a rigorous evaluation protocol (trigger-probe with NullMemory baseline), demonstrates the phenomenon across multiple architectures, and offers a practical detection mechanism. This addresses a fundamental and increasingly critical safety concern as persistent-memory agents become widespread. Paper 2, while valuable as a benchmark contribution, is more incremental—adding another multi-modal tool-use benchmark to an already crowded space. Paper 1's conceptual novelty and safety implications give it broader and more lasting impact.
Paper 1 provides a concrete, executable benchmark for omni-modal tool-using agents, addressing a critical and immediate need for evaluating complex, real-world AI agent workflows. Benchmarks typically generate high scientific impact through widespread adoption by researchers testing new models. While Paper 2 offers a valuable conceptual framework for environment scaling, Paper 1 delivers a tangible resource with strong methodological rigor (closed-loop multimodal verification) that will directly drive and measure progress in the rapidly growing field of AI agents.
MEMOIR introduces a novel memory-guided tree-search framework with concrete architectural innovations (two-level memory hierarchy, cross-branch knowledge transfer) that demonstrably improves LLM-based solver synthesis across multiple combinatorial optimization problems. It shows strong empirical gains in both validity and quality with reduced variance. Paper 2 (TOBench/MM-ToolBench) is a benchmark contribution, which, while useful, has narrower methodological novelty—benchmarks typically have shorter-lived impact unless widely adopted. MEMOIR's approach addresses fundamental limitations in LLM-guided search and has broader applicability beyond its specific domain.
Paper 2 addresses a highly timely and rapidly growing field: multimodal, tool-using AI agents. By providing a comprehensive, real-world benchmark (TOBench/MM-ToolBench) with closed-loop verification, it is likely to be widely adopted and cited by researchers developing next-generation LLM agents. In contrast, Paper 1 focuses on theoretical runtime bounds for a specific subfield of evolutionary algorithms, which, while methodologically rigorous, has a much narrower scope and limited immediate real-world impact.
TOBench introduces a comprehensive benchmark addressing a significant gap in evaluating multimodal tool-using agents—a rapidly growing area in AI. Its 100 executable tasks, 324 tools, and evaluation of 15 contemporary models provide substantial infrastructure for the community. The finding that top models achieve only 32% vs. 94% human performance highlights a clear research challenge. Paper 1's TIDE framework, while useful for argumentative essay understanding, addresses a narrower problem domain with incremental improvements to prompt optimization, limiting its broader impact across fields.