Lingyi Meng, Zecong Tang, Haoran Li, Tengju Ru, Zhejun Cui, Weitong Lian, Qi Kang, Hangshuo Cao
In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this important task, as they are often limited to single-turn or static settings, and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue of UMMs that jointly evaluates their understanding and generation capabilities. Our IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.
IMUG-Bench addresses a genuine gap in the evaluation landscape for unified multimodal models (UMMs) — the absence of benchmarks that jointly assess understanding and generation in dynamic, multi-turn interleaved image-text dialogues. The benchmark introduces three key innovations: (1) a three-class taxonomy (Static Spatial, Temporal Causal, Hybrid) spanning 19 domains and 97 tasks with 3,113 samples and 12,034 interaction turns; (2) Dynamic-MCQ questions whose ground-truth answers depend on model-generated images from prior turns, enabling evaluation of self-generated content comprehension; and (3) systematic identification and mitigation of exposure bias in multi-turn generation through test-time scaling strategies.
The problem is well-motivated: as UMMs are increasingly deployed in real-world interactive settings, evaluating them only on single-turn or static tasks fails to capture error accumulation, context drift, and the interplay between understanding and generation across turns. The Dynamic-MCQ concept is particularly valuable, as it creates a closed-loop evaluation that mirrors genuine interactive use.
Benchmark construction follows a structured pipeline: manual template design → LLM/VLM-based filling → two-person human verification. This semi-automated approach balances scale with quality control. The use of multiple image sources (public platforms, AI-generated, real-world capture) helps reduce source-specific biases.
Evaluation design is thoughtful. The regex-based scoring for MCQs applies a format weight (w_fmt) that penalizes formatting failures while still crediting correct content. The VLM-as-a-judge paradigm for image generation, decomposed into multiple evaluation points, is well-justified and validated against human ratings (Pearson correlation above 0.72 for every model-class combination, with an overall mean of 0.804).
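To make the format-weighting idea concrete, here is a minimal sketch of one way such a scorer could work. It is not the paper's scoring rule: the strict `Answer: X` pattern, the lenient fallback, the four-option assumption, and the default weight of 0.5 are all illustrative assumptions, and `score_mcq` is a hypothetical name.

```python
import re

# Assumed strict format: a line of the form "Answer: B" or "Answer: (B)".
STRICT = re.compile(r"^\s*Answer:\s*\(?([A-D])\)?\s*$", re.IGNORECASE | re.MULTILINE)
# Lenient fallback: any standalone uppercase option letter in free text.
LENIENT = re.compile(r"\b([A-D])\b")


def score_mcq(response: str, gold: str, w_fmt: float = 0.5) -> float:
    """Score one MCQ response against the gold option letter.

    Full credit (1.0) for a correct, well-formatted answer; partial credit
    (w_fmt, an assumed value) when the correct letter is the only option
    mentioned but the required format is missing; 0.0 otherwise.
    """
    gold = gold.strip().upper()
    strict = STRICT.search(response)
    if strict:
        return 1.0 if strict.group(1).upper() == gold else 0.0
    # Format failure: credit only an unambiguous mention of a single option.
    letters = set(LENIENT.findall(response))
    if letters == {gold}:
        return w_fmt
    return 0.0
```

For example, `score_mcq("I think the right option is B.", "B")` earns only the format-weighted credit, while a response naming two options earns nothing. The lenient pattern would also match a capitalized article such as "A", so a real scorer would need a more careful fallback.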
Experimental coverage is reasonable, spanning 8 models (6 open-source, 2 closed-source) of varying sizes. However, there are notable limitations. The benchmark relies heavily on LLM/VLM-generated content, which could introduce systematic biases. The evaluation of exposure bias, while convincing in trend, lacks formal statistical testing (e.g., confidence intervals, significance tests for performance decay across turns). The human validation covers only 5 of 8 models and uses task-level sampling rather than comprehensive annotation.
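The kind of statistical test asked for here could be as simple as a confidence interval on the per-turn slope of generation scores. The sketch below is one possible approach, not something the paper reports: it fits an ordinary least-squares slope of score against turn index and bootstraps over whole dialogues so that turns from the same dialogue stay together. The function names and the input layout (one list of per-turn scores per dialogue) are assumptions.

```python
import random
from statistics import mean


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def pooled_slope(dialogues):
    """Slope of score vs. turn index (1-based), pooled over dialogues."""
    xs = [t for d in dialogues for t in range(1, len(d) + 1)]
    ys = [s for d in dialogues for s in d]
    return ols_slope(xs, ys)


def bootstrap_slope_ci(dialogues, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap CI for the per-turn slope.

    Resamples dialogues (not individual turns) with replacement, which
    respects the dependence between turns of the same dialogue.
    """
    rng = random.Random(seed)
    slopes = sorted(
        pooled_slope([rng.choice(dialogues) for _ in dialogues])
        for _ in range(n_boot)
    )
    lo = slopes[int((alpha / 2) * n_boot)]
    hi = slopes[int((1 - alpha / 2) * n_boot) - 1]
    return pooled_slope(dialogues), lo, hi
```

An interval that lies entirely below zero would support the claim that scores decay across turns; an interval straddling zero would suggest the observed trend could be noise.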
The test-time scaling experiments (CoT, Self-Verification, Best-of-N) provide useful ablations, though the analysis is limited to a single open-source model (BAGEL), weakening generalizability claims.
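For readers unfamiliar with Best-of-N sampling, the general pattern is sketched below. This shows the generic technique only, not the paper's implementation or BAGEL's API: `generate` and `score` are hypothetical callables standing in for a model's image generator and a verifier (for instance, the same model's understanding branch used as a judge).

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(
    generate: Callable[[int], T],
    score: Callable[[T], float],
    n: int = 4,
) -> Tuple[T, float]:
    """Draw n candidates and return the highest-scoring one with its score.

    generate(seed) is assumed to return one candidate per seed; score(c) is
    assumed to return a higher value for a better candidate. Ties are broken
    in favor of the earliest candidate.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(seed) for seed in range(n)]
    scored = [(score(c), -i, c) for i, c in enumerate(candidates)]
    best_score, _, best = max(scored, key=lambda item: (item[0], item[1]))
    return best, best_score
```

The cost grows linearly with `n`, and the gain depends on how well the scorer separates good candidates from bad ones. That dependence is one reason results from a single model may not transfer to others.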
Direct impact: IMUG-Bench fills an underserved niche and could become a standard evaluation suite for UMMs, particularly as the field rapidly evolves. The exposure bias finding — that image generation scores consistently decay across turns while text understanding remains stable — is an important empirical insight that could influence model architecture and training strategy design.
Practical applications: The test-time scaling strategies (especially CoT yielding the largest gains) offer immediately actionable techniques for practitioners deploying UMMs in production. The Self-Verification mechanism is conceptually elegant, leveraging the understanding-generation capability gap.
Broader influence: The Dynamic-MCQ paradigm could influence benchmark design beyond this specific domain, as the principle of adapting ground truth to model outputs is applicable to any multi-step evaluation. The systematic comparison between open-source and closed-source models provides valuable reference data for the community.
However, impact may be limited by the benchmark's specificity to current UMM architectures. As the field evolves rapidly, the benchmark's shelf life could be short without ongoing updates.
The timing is excellent. UMMs are at an inflection point: multiple models (BAGEL, BLIP3-o, OmniGen2, etc.) were released in 2025, and commercial models (GPT-5, Gemini 2.5 Flash) now support interleaved generation. The lack of adequate multi-turn evaluation has been a recognized blind spot. IMUG-Bench arrives when the community needs it most.
The exposure bias finding is particularly timely, as it highlights a fundamental challenge that must be addressed before UMMs can be reliably deployed in extended interactive sessions.
The paper's contribution is primarily empirical rather than theoretical. The benchmark design choices are sensible but not deeply innovative in methodology — the novelty lies in their application to an underexplored evaluation setting. The writing is clear and well-organized, though the paper could benefit from more rigorous statistical analysis of the exposure bias phenomenon (regression analysis, effect sizes, etc.).
The comparison table (Table 1) effectively positions IMUG-Bench against existing benchmarks, clearly showing it is the only one covering all desired features (multi-turn interaction, dynamic understanding, exposure bias analysis, and test-time scaling).
Generated Jun 9, 2026
IMUG-Bench addresses a timely and broadly impactful gap in evaluating unified multimodal models, which are a rapidly growing area of AI research. Benchmarks tend to have outsized impact by shaping research directions for entire communities. The paper covers both open-source and closed-source models, reveals exposure bias in multi-turn generation, and explores mitigation strategies. While Paper 1 presents a rigorous and novel constrained optimization framework for memory retention, its scope is narrower (memory management for language agents) and its impact is more incremental. Paper 2's benchmark utility gives it broader adoption potential.
Paper 2 proposes a highly novel methodological framework combining visual feedback with policy optimization for code generation, addressing the challenging non-differentiable rendering problem. Its application across multiple domains (charts, web/UI, slides) and integration of reinforcement learning (GRPO) with multimodal LLMs suggest broader methodological impact and real-world applicability compared to Paper 1, which primarily introduces a new benchmark.
Paper 2 likely has higher impact: it introduces a new benchmark targeting a timely, under-evaluated capability (multi-turn interleaved multimodal dialogue) with broad applicability across multimodal modeling, evaluation, and deployment. Benchmarks often become community standards, enabling reproducible comparisons and catalyzing follow-on work. It also surfaces exposure bias and tests mitigation strategies, increasing practical relevance. Paper 1 offers a useful insight into feedback design for self-distillation, but the contribution is narrower and more incremental, with less cross-field breadth than a widely adoptable benchmark.
IMUG-Bench addresses a more fundamental and broadly impactful gap: evaluating unified multimodal models on interleaved understanding and generation in multi-turn settings. Benchmarks tend to have outsized impact by shaping research directions for entire communities. It covers a wider scope (3,113 samples, 12,034 turns, multiple model families), reveals novel findings like exposure bias in generation, and explores mitigation strategies. Paper 1, while solid, addresses a narrower problem (LLM agent memory) with incremental architectural contributions and evaluation on a single benchmark.
ReasonAlloc addresses a critical and timely bottleneck in LLM reasoning inference—KV cache growth during long chain-of-thought decoding. It introduces a novel hierarchical budget allocation framework with the 'Reasoning Wave' concept, offering a practical, training-free, plug-and-play solution with demonstrated gains. This directly impacts the rapidly growing field of reasoning models (e.g., DeepSeek-R1). While IMUG-Bench provides a useful evaluation benchmark for unified multimodal models, benchmarks generally have narrower methodological contribution. ReasonAlloc's practical applicability and relevance to efficient inference give it higher potential impact.
Paper 2 likely has higher impact due to a more novel, generalizable method (self-evolving, utility-governed skill memory) with clear real-world relevance to interactive clinical decision support. It proposes an end-to-end post-deployment framework (read–write–assess–govern) addressing robustness, governance, and continual improvement without weight updates—broadly applicable to agentic systems beyond medicine. Paper 1 is timely and useful but primarily contributes a benchmark and test-time strategies; its impact is narrower and more incremental relative to fast-evolving multimodal evaluation suites.
Paper 1 addresses a critical bottleneck in the deployment of real-world AI agents by benchmarking multi-turn interleaved image-text dialogues. Its focus on unified multimodal models and exposure bias offers broader immediate applications and cross-disciplinary impact. While Paper 2 provides rigorous methodological advancements in formal theorem proving, its impact is concentrated within a more specialized subfield of mathematical reasoning.
Paper 2 addresses a highly critical bottleneck in current AI: long-horizon agentic workflows and finite context windows. By introducing a methodology to synthesize training data for 'delegation intelligence' in multi-agent systems, it directly impacts the rapidly growing field of AI agents and 'deep research.' While Paper 1 provides a valuable benchmark for multimodal models, Paper 2's focus on internalizing multi-agent delegation into model weights represents a more profound methodological innovation with broader applications across all complex, real-world LLM tasks.
Paper 2 targets unified multimodal models, a fast-moving and broadly influential area with immediate relevance to real-world assistants. Its focus on multi-turn interleaved image-text interaction and exposure bias addresses a widely recognized evaluation gap and can impact model design, safety, and deployment across vision-language, dialog, and agentic systems. Paper 1 is methodologically strong and valuable for tabular representation learning, but tabular encoder evaluation is a narrower community and likely yields more specialized impact. Overall, Paper 2 has greater cross-field breadth and timeliness.
IMUG-Bench addresses a broader and more active research area (unified multimodal models) with wider applicability across AI/ML. It evaluates both understanding and generation in multi-turn settings, identifies exposure bias as a key failure mode, and explores mitigation strategies—contributions relevant to the rapidly growing UMM community. While PSEBench is rigorous and valuable for healthcare AI, its scope is narrower (Minnesota-specific patient safety regulations), limiting its broader scientific influence. IMUG-Bench's findings on exposure bias and test-time scaling strategies have wider methodological implications across multimodal AI research.