IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

Lingyi Meng, Zecong Tang, Haoran Li, Tengju Ru, Zhejun Cui, Weitong Lian, Qi Kang, Hangshuo Cao

Jun 8, 2026 · arXiv:2606.09169v1

cs.AI · cs.CV · cs.MM

#1518 of 3489 · Artificial Intelligence

Tournament Score: 1415 ± 42 (scale 1050 to 1800)

Win Rate: 52%

Rating: 6.5 / 10

Significance 7 · Rigor 6 · Novelty 6.5 · Clarity 7.5

Abstract

In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this important task, as they are often limited to single-turn or static settings, and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue of UMMs that jointly evaluates their understanding and generation capabilities. Our IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: IMUG-Bench

1. Core Contribution

IMUG-Bench addresses a genuine gap in the evaluation landscape for unified multimodal models (UMMs) — the absence of benchmarks that jointly assess understanding and generation in dynamic, multi-turn interleaved image-text dialogues. The benchmark introduces three key innovations: (1) a three-class taxonomy (Static Spatial, Temporal Causal, Hybrid) spanning 19 domains and 97 tasks with 3,113 samples and 12,034 interaction turns; (2) Dynamic-MCQ questions whose ground-truth answers depend on model-generated images from prior turns, enabling evaluation of self-generated content comprehension; and (3) systematic identification and mitigation of exposure bias in multi-turn generation through test-time scaling strategies.

The problem is well-motivated: as UMMs are increasingly deployed in real-world interactive settings, evaluating them only on single-turn or static tasks fails to capture error accumulation, context drift, and the interplay between understanding and generation across turns. The Dynamic-MCQ concept is particularly valuable, as it creates a closed-loop evaluation that mirrors genuine interactive use.

2. Methodological Rigor

Benchmark construction follows a structured pipeline: manual template design → LLM/VLM-based filling → two-person human verification. This semi-automated approach balances scale with quality control. The use of multiple image sources (public platforms, AI-generated, real-world capture) helps reduce source-specific biases.

Evaluation design is thoughtful. The regex-based scoring for MCQs with a format weight (w_fmt) appropriately penalizes formatting failures while still crediting correct content. The VLM-as-a-judge paradigm for image generation, decomposed into multiple evaluation points, is well-justified and validated against human ratings (Pearson correlation above 0.72 across all model-class combinations, with an overall mean of 0.804).
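The format-weighted MCQ scoring described above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the strict "Answer: X" pattern, the fallback pattern, the function name score_mcq, and the default weight w_fmt = 0.5 are all assumptions, since the assessment does not state the exact rule.

```python
import re

# Hypothetical strict format: a line of the reply reads "Answer: B" or "Answer: (B)".
STRICT = re.compile(r"^\s*Answer:\s*\(?([A-D])\)?\s*$", re.IGNORECASE | re.MULTILINE)
# Hypothetical fallback: a standalone uppercase option letter anywhere in the reply.
LOOSE = re.compile(r"\b([A-D])\b")


def score_mcq(reply: str, gold: str, w_fmt: float = 0.5) -> float:
    """Score one multiple-choice reply.

    Returns 1.0 for a correct, well-formatted answer, w_fmt for a correct
    answer recovered from a badly formatted reply, and 0.0 otherwise.
    """
    gold = gold.upper()
    strict = STRICT.findall(reply)
    if strict:
        # Use the last strictly formatted answer line.
        return 1.0 if strict[-1].upper() == gold else 0.0
    loose = LOOSE.findall(reply)
    if loose and loose[-1] == gold:
        # Content is right but the required format was not followed.
        return w_fmt
    return 0.0
```

Under these assumptions a reply such as "I think it is B" earns partial credit when B is correct, while "Answer: B" earns full credit. A real scorer would need a stricter fallback, because a bare "A" can also be the English article.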

Experimental coverage is reasonable, spanning 8 models (6 open-source, 2 closed-source) of varying sizes. However, there are notable limitations. The benchmark relies heavily on LLM/VLM-generated content, which could introduce systematic biases. The evaluation of exposure bias, while convincing in trend, lacks formal statistical testing (e.g., confidence intervals, significance tests for performance decay across turns). The human validation covers only 5 of 8 models and uses task-level sampling rather than comprehensive annotation.
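One way the missing statistical testing could look is a bootstrap confidence interval on the per-dialogue slope of generation score against turn index. The sketch below is an assumption about a reasonable analysis, not something the paper reports; the function names are hypothetical and the scores are synthetic.

```python
import random
from statistics import mean


def slope(scores):
    """Least-squares slope of score against turn index 1..T (needs T >= 2)."""
    turns = range(1, len(scores) + 1)
    t_bar, s_bar = mean(turns), mean(scores)
    num = sum((t - t_bar) * (s - s_bar) for t, s in zip(turns, scores))
    den = sum((t - t_bar) ** 2 for t in turns)
    return num / den


def bootstrap_slope_ci(dialogues, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-dialogue slope.

    `dialogues` is a list of per-turn score lists, one per sample.
    Resampling is over dialogues, so turns within a dialogue stay together.
    """
    rng = random.Random(seed)
    slopes = [slope(d) for d in dialogues]
    boots = sorted(
        mean(rng.choices(slopes, k=len(slopes))) for _ in range(n_boot)
    )
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return mean(slopes), lo, hi


# Synthetic scores for illustration only, not results from the paper.
toy = [[0.9, 0.8, 0.7, 0.6], [0.8, 0.8, 0.6, 0.5], [0.7, 0.6, 0.6, 0.4]]
est, lo, hi = bootstrap_slope_ci(toy)
```

An interval that lies entirely below zero would support the claim that generation scores decay across turns. Running the same analysis on text-understanding scores would test the claim that they stay stable.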

The test-time scaling experiments (CoT, Self-Verification, Best-of-N) provide useful ablations, though the analysis is limited to a single open-source model (BAGEL), weakening generalizability claims.
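Best-of-N sampling combined with self-verification can be sketched roughly as below. The generate and verify callables are hypothetical stand-ins for a model's image-generation call and for its own understanding branch scoring a candidate against the prompt. They are not BAGEL's API, and the paper's exact procedure may differ.

```python
from typing import Callable, Tuple, TypeVar

Image = TypeVar("Image")


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], Image],
    verify: Callable[[str, Image], float],
    n: int = 4,
) -> Tuple[Image, float]:
    """Sample n candidates and keep the one the verifier scores highest.

    Ties are broken in favor of the earliest candidate.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt, seed) for seed in range(n)]
    scored = [(verify(prompt, c), i, c) for i, c in enumerate(candidates)]
    score, _, best = max(scored, key=lambda x: (x[0], -x[1]))
    return best, score


# Toy stand-ins so the sketch runs without any model (assumptions only).
def toy_generate(prompt: str, seed: int) -> str:
    return f"{prompt}#{seed}"


def toy_verify(prompt: str, image: str) -> float:
    fake_scores = {0: 0.2, 1: 0.9, 2: 0.9, 3: 0.1}
    return fake_scores[int(image.rsplit("#", 1)[1])]


best, score = best_of_n("a red cube", toy_generate, toy_verify, n=4)
```

In a multi-turn setting the selected candidate would presumably be the image carried into the next turn's context, which is one plausible reason such selection reduces error accumulation.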

3. Potential Impact

Direct impact: IMUG-Bench fills an underserved niche and could become a standard evaluation suite for UMMs, particularly as the field rapidly evolves. The exposure bias finding — that image generation scores consistently decay across turns while text understanding remains stable — is an important empirical insight that could influence model architecture and training strategy design.

Practical applications: The test-time scaling strategies (especially CoT yielding the largest gains) offer immediately actionable techniques for practitioners deploying UMMs in production. The Self-Verification mechanism is conceptually elegant, leveraging the understanding-generation capability gap.

Broader influence: The Dynamic-MCQ paradigm could influence benchmark design beyond this specific domain, as the principle of adapting ground truth to model outputs is applicable to any multi-step evaluation. The systematic comparison between open-source and closed-source models provides valuable reference data for the community.

However, impact may be limited by the benchmark's specificity to current UMM architectures. As the field evolves rapidly, the benchmark's shelf life could be short without ongoing updates.

4. Timeliness & Relevance

The timing is excellent. UMMs are at an inflection point: multiple models (BAGEL, BLIP3-o, OmniGen2, etc.) were released in 2025, and commercial models (GPT-5, Gemini 2.5 Flash) now support interleaved generation. The lack of adequate multi-turn evaluation has been a recognized blind spot. IMUG-Bench arrives when the community needs it most.

The exposure bias finding is particularly timely, as it highlights a fundamental challenge that must be addressed before UMMs can be reliably deployed in extended interactive sessions.

5. Strengths & Limitations

Strengths:

Novel evaluation paradigm: Dynamic-MCQ is a genuinely new contribution that enables closed-loop evaluation of self-generated content understanding

Comprehensive taxonomy: The three-class structure with 97 tasks across 19 domains provides fine-grained diagnostic capability

Actionable findings: The exposure bias characterization and test-time scaling mitigation strategies provide both diagnostic value and practical solutions

Scale: 12,034 turns provide sufficient statistical power for turn-level analysis

Human validation: Pearson correlation analysis with two-person human ratings strengthens credibility

Limitations:

Test-time scaling limited to one model: Only BAGEL is used for CoT/Self-Verification/Best-of-N experiments, limiting generalizability

No training-time solutions explored: The paper identifies exposure bias but only explores inference-time mitigations; scheduled sampling or curriculum learning during training would be valuable

VLM judge circularity: Using VLMs to judge VLM outputs raises potential bias concerns, though the human correlation analysis partially addresses this

Template-based construction: Despite LLM diversification, template-based generation may produce less naturalistic interactions than real user dialogues

Missing baselines: No comparison with pipeline approaches (separate understanding + generation models) to contextualize unified model performance

Limited analysis of understanding-generation coupling: While the paper notes imbalance, it doesn't deeply analyze how generation errors propagate to understanding (beyond Dynamic-MCQ scores)

Reproducibility concerns: Closed-source model evaluations via APIs may not be reproducible as models are updated

Additional Observations

The paper's contribution is primarily empirical rather than theoretical. The benchmark design choices are sensible but not deeply innovative in methodology — the novelty lies in their application to an underexplored evaluation setting. The writing is clear and well-organized, though the paper could benefit from more rigorous statistical analysis of the exposure bias phenomenon (regression analysis, effect sizes, etc.).

The comparison table (Table 1) effectively positions IMUG-Bench against existing benchmarks, clearly showing it is the only one covering all desired features (multi-turn interaction, dynamic understanding, exposure bias analysis, and test-time scaling).

Rating: 6.5 / 10

Significance 7 · Rigor 6 · Novelty 6.5 · Clarity 7.5

Generated Jun 9, 2026

Comparison History (21)

Won vs. Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

IMUG-Bench addresses a timely and broadly impactful gap in evaluating unified multimodal models, which are a rapidly growing area of AI research. Benchmarks tend to have outsized impact by shaping research directions for entire communities. The paper covers both open-source and closed-source models, reveals exposure bias in multi-turn generation, and explores mitigation strategies. While Paper 1 presents a rigorous and novel constrained optimization framework for memory retention, its scope is narrower (memory management for language agents) and its impact is more incremental. Paper 2's benchmark utility gives it broader adoption potential.

claude-opus-4-6·Jun 10, 2026

Lost vs. Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts

Paper 2 proposes a highly novel methodological framework combining visual feedback with policy optimization for code generation, addressing the challenging non-differentiable rendering problem. Its application across multiple domains (charts, web/UI, slides) and integration of reinforcement learning (GRPO) with multimodal LLMs suggest broader methodological impact and real-world applicability compared to Paper 1, which primarily introduces a new benchmark.

gemini-3.1-pro-preview·Jun 10, 2026

Won vs. The Role of Feedback Alignment in Self-Distillation

Paper 2 likely has higher impact: it introduces a new benchmark targeting a timely, under-evaluated capability (multi-turn interleaved multimodal dialogue) with broad applicability across multimodal modeling, evaluation, and deployment. Benchmarks often become community standards, enabling reproducible comparisons and catalyzing follow-on work. It also surfaces exposure bias and tests mitigation strategies, increasing practical relevance. Paper 1 offers a useful insight into feedback design for self-distillation, but the contribution is narrower and more incremental, with less cross-field breadth than a widely adoptable benchmark.

gpt-5.2·Jun 10, 2026

Won vs. Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory

IMUG-Bench addresses a more fundamental and broadly impactful gap: evaluating unified multimodal models on interleaved understanding and generation in multi-turn settings. Benchmarks tend to have outsized impact by shaping research directions for entire communities. It covers a wider scope (3,113 samples, 12,034 turns, multiple model families), reveals novel findings like exposure bias in generation, and explores mitigation strategies. Paper 1, while solid, addresses a narrower problem (LLM agent memory) with incremental architectural contributions and evaluation on a single benchmark.

claude-opus-4-6·Jun 10, 2026

Lost vs. ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

ReasonAlloc addresses a critical and timely bottleneck in LLM reasoning inference—KV cache growth during long chain-of-thought decoding. It introduces a novel hierarchical budget allocation framework with the 'Reasoning Wave' concept, offering a practical, training-free, plug-and-play solution with demonstrated gains. This directly impacts the rapidly growing field of reasoning models (e.g., DeepSeek-R1). While IMUG-Bench provides a useful evaluation benchmark for unified multimodal models, benchmarks generally have narrower methodological contribution. ReasonAlloc's practical applicability and relevance to efficient inference give it higher potential impact.

claude-opus-4-6·Jun 10, 2026

Lost vs. Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

Paper 2 likely has higher impact due to a more novel, generalizable method (self-evolving, utility-governed skill memory) with clear real-world relevance to interactive clinical decision support. It proposes an end-to-end post-deployment framework (read–write–assess–govern) addressing robustness, governance, and continual improvement without weight updates—broadly applicable to agentic systems beyond medicine. Paper 1 is timely and useful but primarily contributes a benchmark and test-time strategies; its impact is narrower and more incremental relative to fast-evolving multimodal evaluation suites.

gpt-5.2·Jun 9, 2026

Won vs. TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

Paper 1 addresses a critical bottleneck in the deployment of real-world AI agents by benchmarking multi-turn interleaved image-text dialogues. Its focus on unified multimodal models and exposure bias offers broader immediate applications and cross-disciplinary impact. While Paper 2 provides rigorous methodological advancements in formal theorem proving, its impact is concentrated within a more specialized subfield of mathematical reasoning.

gemini-3.1-pro-preview·Jun 9, 2026

Lost vs. SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

Paper 2 addresses a highly critical bottleneck in current AI: long-horizon agentic workflows and finite context windows. By introducing a methodology to synthesize training data for 'delegation intelligence' in multi-agent systems, it directly impacts the rapidly growing field of AI agents and 'deep research.' While Paper 1 provides a valuable benchmark for multimodal models, Paper 2's focus on internalizing multi-agent delegation into model weights represents a more profound methodological innovation with broader applications across all complex, real-world LLM tasks.

gemini-3.1-pro-preview·Jun 9, 2026

Won vs. TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Paper 2 targets unified multimodal models, a fast-moving and broadly influential area with immediate relevance to real-world assistants. Its focus on multi-turn interleaved image-text interaction and exposure bias addresses a widely recognized evaluation gap and can impact model design, safety, and deployment across vision-language, dialog, and agentic systems. Paper 1 is methodologically strong and valuable for tabular representation learning, but tabular encoder evaluation is a narrower community and likely yields more specialized impact. Overall, Paper 2 has greater cross-field breadth and timeliness.

gpt-5.2·Jun 9, 2026

Won vs. PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage

IMUG-Bench addresses a broader and more active research area (unified multimodal models) with wider applicability across AI/ML. It evaluates both understanding and generation in multi-turn settings, identifies exposure bias as a key failure mode, and explores mitigation strategies—contributions relevant to the rapidly growing UMM community. While PSEBench is rigorous and valuable for healthcare AI, its scope is narrower (Minnesota-specific patient safety regulations), limiting its broader scientific influence. IMUG-Bench's findings on exposure bias and test-time scaling strategies have wider methodological implications across multimodal AI research.

claude-opus-4-6·Jun 9, 2026
