Advancing Creative Physical Intelligence in Large Multimodal Models

Cheng Qian, Hyeonjeong Ha, Jiayu Liu, Jeonghwan Kim, Emre Can Acikgoz, Bingxuan Li, Kunlun Zhu, Jiateng Liu

May 25, 2026

arXiv:2605.26396v1

cs.AI (primary) · cs.CL · cs.LG

#584 of 2682 · Artificial Intelligence

Tournament Score: 1471 ± 44 (scale: 1050 to 1800)

Win Rate: 73%

Rating: 6.8 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 6.5

Abstract

Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize to discovering visually grounded solutions in open-ended environments, beyond pattern recognition. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative problem-solving is central to human intelligence, but remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not due to lack of generative capability, but because they do not sustain grounded exploration. Models often overlook relevant entities, under-examine critical parts, or hallucinate attributes not grounded in the image. Motivated by this failure mode, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization, we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, while substantially reducing hallucination and grounding-related errors.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Advancing Creative Physical Intelligence in Large Multimodal Models"

1. Core Contribution

This paper introduces MM-CreativityBench, a benchmark for evaluating visually grounded creative tool repurposing in multimodal environments, alongside an affordance-grounded alignment training method. The core insight is that creative problem-solving in LMMs requires not just generating plausible solutions, but sustaining an evidence-driven exploration process that connects visual perception with physical affordance reasoning at the part level. The benchmark operationalizes this through an interactive protocol where models inspect scenes, entities, and zoomed-in parts before committing to answers.

The paper also proposes a two-stage training approach: supervised fine-tuning (SFT) on knowledge-guided positive trajectories, followed by Direct Preference Optimization (DPO) using hard-negative trajectories that capture realistic failure modes like hallucinated affordances and premature commitment. This combination more than doubles gold-correct performance on 4B and 8B models.
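For readers unfamiliar with the second stage, the standard DPO objective for a single preference pair can be sketched as below. This is a minimal illustration of the generic DPO loss, not the authors' training code: the function names, the `beta` value, and the idea of passing whole-trajectory log-probabilities as plain floats are assumptions made for this example.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) trajectory pair.

    Each argument is the summed log-probability of a full trajectory under
    either the trainable policy or the frozen reference model.
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    # How much more the policy likes each trajectory than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)
```

In this framing, a grounded trajectory would play the role of "chosen" and a hard-negative trajectory (for example, one with a hallucinated affordance) the role of "rejected". The loss equals log 2 when the policy matches the reference and falls as the policy shifts probability toward the chosen trajectory.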

2. Methodological Rigor

Strengths in design: The reverse task construction methodology is well-conceived—building tasks from verified entity-part-affordance triples in a knowledge base ensures that each instance has a known, grounded solution path. The three-level image hierarchy (environment → entity → part) and the exploration stack formalism provide clean experimental control.
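The assessment does not spell out the exploration stack, so the following is only one plausible reading of it: a stack of views that can grow from environment to entity to part and shrink back again. The class name, method names, and the "scene" root label are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field

# The three image levels named in the assessment.
LEVELS = ("environment", "entity", "part")


@dataclass
class ExplorationStack:
    """Hypothetical tracker for which view a model is currently inspecting."""
    frames: list = field(default_factory=lambda: [("environment", "scene")])

    def current(self):
        """Return the (level, name) pair of the view on top of the stack."""
        return self.frames[-1]

    def zoom_in(self, name):
        """Push the next-finer view, e.g. from an entity to one of its parts."""
        level, _ = self.frames[-1]
        depth = LEVELS.index(level)
        if depth == len(LEVELS) - 1:
            raise ValueError("already at part level; cannot zoom further")
        self.frames.append((LEVELS[depth + 1], name))
        return self.current()

    def zoom_out(self):
        """Pop back to the coarser view."""
        if len(self.frames) == 1:
            raise ValueError("already at environment level")
        self.frames.pop()
        return self.current()
```

A structure like this makes the experimental control concrete: at any turn the evaluator knows exactly which image the model has seen, so an answer that cites a part the model never zoomed into can be flagged as ungrounded.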

Evaluation depth is notable. The paper goes well beyond reporting accuracy: it analyzes exploration efficiency (repetition rates, similarity density, exploration progress), affordance similarity effects, typicality effects, error categorization (with 92% human-model agreement), ablation across image conditions, and prompting format variations. The case studies (Section 5.7) are genuinely informative, showing specific failure modes and how training repairs them.
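To make one of these interaction-level metrics concrete, here is a sketch of a repetition rate under an assumed definition: the fraction of inspection actions that revisit a view already inspected earlier in the same episode. The paper's exact definition may differ, and the function name and the view labels are hypothetical.

```python
def repetition_rate(actions):
    """Fraction of actions that revisit an already-inspected view.

    `actions` is the ordered list of view identifiers a model requested
    during one episode. Returns 0.0 for an empty episode.
    """
    seen = set()
    repeats = 0
    for view in actions:
        if view in seen:
            repeats += 1
        seen.add(view)
    return repeats / len(actions) if actions else 0.0


# One of four requests revisits a view ("cup"), so the rate is 0.25.
episode = ["scene", "cup", "cup/handle", "cup"]
print(repetition_rate(episode))
```

A high value under this definition would indicate a model that loops over the same views instead of broadening its search, which is the kind of behavior accuracy alone cannot reveal.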

Concerns about rigor:

Synthetic images are a significant limitation. While the authors justify this as controlled evaluation, it introduces a fundamental gap: generated images may encode visual cues differently from real photographs, and models may behave differently on synthetic vs. real imagery. The paper acknowledges but does not empirically quantify this gap.

Single gold answer evaluation is conservative. The paper admits multiple solutions may exist, but the strict metric penalizes valid alternative solutions. This could systematically underestimate model capability.

GPT-5.4 as trajectory teacher and evaluator introduces circularity concerns. The training trajectories, error categorization, and some aspects of task construction rely on the same model family, which could create subtle biases.

The knowledge base originates from the same research group's prior work, and the benchmark construction is tightly coupled to its structure, raising questions about generalizability beyond the annotated entity set.

3. Potential Impact

Benchmark contribution: MM-CreativityBench fills a genuine gap. Table 1 convincingly shows that no prior benchmark simultaneously addresses creative tool use, affordance grounding, attribute grounding, part-level reasoning, distractor inclusion, visual grounding, and interactive evaluation. This could become a useful diagnostic tool for the community.

Training methodology: The affordance-grounded alignment approach demonstrates that structured affordance knowledge can serve as a training signal for improving multimodal reasoning. The finding that SFT teaches exploration structure while hard-negative DPO teaches evidence discrimination is an actionable insight for the alignment community.

Broader implications: The work connects to embodied AI, robotics, and cognitive science. The framing around Sternberg's Triarchic Theory and Gibson's affordances provides theoretical grounding. The distinction between creativity and hallucination (Section 6) is particularly valuable—the paper argues that creative tool use requires *conditional and verifiable* imagination, not unconstrained generation.

Limitations on impact: The reliance on synthetic images and a closed knowledge base may limit adoption. The benchmark requires substantial infrastructure (image generation, multi-turn evaluation, knowledge base integration), which could hinder reproducibility despite code availability.

4. Timeliness & Relevance

This paper addresses a timely gap. As LMMs are increasingly deployed in agentic and embodied settings, understanding whether they can perform physically grounded creative reasoning—rather than pattern-matching from training data—is essential. The finding that GPT-5.4 can underperform open-source models like Qwen on this task (Table 3) challenges the assumption that scaling alone solves grounded reasoning, which is a relevant and provocative result.

The interactive evaluation paradigm aligns with the growing interest in agent-based evaluation (VisualAgentBench, VisEscape). The preference optimization approach connects to the active research area of RLHF/DPO for multimodal models.

5. Strengths & Limitations

Key Strengths:

Comprehensive evaluation framework with interaction-level metrics beyond final accuracy

Strong ablation study separating SFT and DPO contributions, with clear mechanistic explanations

Diagnostic value: the benchmark reveals *why* models fail (part-level grounding, not entity-level recognition), which is more informative than just showing they fail

Well-articulated distinction between creative reasoning and hallucination

The exploration stack formalism is elegant and could generalize to other interactive reasoning tasks

Notable Weaknesses:

Synthetic image dependency limits ecological validity

Scale of benchmark (333 test instances) is modest; statistical significance of differences is not reported

No real-world or robotics validation—the paper claims relevance to embodied AI but provides no evidence of transfer

The training improvements, while substantial in relative terms, still yield absolute accuracy below 42%, suggesting the problem remains largely unsolved

Comparison fairness: different models have different context window handling, multi-image processing, and instruction-following capabilities, making cross-family comparisons noisy

The paper is extremely long (50+ pages with appendix), which somewhat dilutes the core message
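The concern about benchmark scale can be quantified with a standard Wilson score interval for a binomial proportion. The sketch below uses the 333 test instances cited above together with a hypothetical count of 133 correct answers (about 40%); that count is chosen for illustration only and is not a result from the paper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical: 133 correct out of the benchmark's 333 test instances.
low, high = wilson_interval(133, 333)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

At this sample size and accuracy level the interval spans roughly five percentage points on either side of the point estimate, so differences of a few points between models are hard to interpret without reported significance tests.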

Overall Assessment

This is a substantive contribution that identifies an important capability gap in LMMs and provides both evaluation tools and initial training solutions. The benchmark design is thoughtful, the analysis is thorough, and the findings are actionable. The main limitations are the synthetic image setting and the modest benchmark scale. The paper would benefit from stronger evidence that findings transfer to real-world visual settings.

Rating: 6.8 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 6.5

Generated May 27, 2026

Comparison History (22)

vs. MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents

claude-opus-4.6 · 5/28/2026

MemCog introduces a fundamental paradigm shift from Memory-as-Tool to Memory-as-Cognition in conversational agents, addressing core architectural limitations with a comprehensive framework (navigable memory stores, cross-dimensional navigation, proactive reasoning). It achieves SOTA on multiple established benchmarks and introduces a novel benchmark (ProactiveMemBench). The concept of integrating memory as cognition rather than a tool has broad implications for agent architectures, LLM-based systems, and cognitive AI. Paper 2, while valuable, is more narrowly focused on creative physical reasoning benchmarks and alignment techniques for LMMs, with comparatively less paradigmatic novelty.

vs. AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models

gpt-5.2 · 5/28/2026

Paper 1 has higher potential scientific impact due to its more novel framing and benchmark contribution: it operationalizes “creative physical intelligence” in LMMs with a new, fine-grained, visually grounded tool-use benchmark and proposes an alignment method targeting a clear failure mode (grounded exploration vs. hallucination). This is timely for multimodal reasoning/embodied cognition and could influence evaluation and training across multimodal AI, robotics, HCI, and cognitive science. Paper 2 is highly application-oriented and strong on performance, but is closer to established retrieval/knowledge-augmented agent patterns with narrower methodological novelty.

vs. A Query Engine for the Agents

claude-opus-4.6 · 5/28/2026

Paper 2 introduces a novel benchmark (MM-CreativityBench) and a new alignment method (affordance-grounded alignment) addressing a fundamental gap in LMM evaluation—creative physical reasoning. This targets a core AI challenge with broad implications across robotics, embodied AI, and cognitive science. Paper 1, while practically useful, is more of an engineering contribution (optimized JS query libraries for AI traces) with narrower impact. Paper 2's methodological contribution to preference learning and its identification of systematic failure modes in frontier models are likely to influence a wider research community.

vs. Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement

claude-opus-4.6 · 5/28/2026

Paper 1 introduces a novel benchmark and alignment methodology for creative physical reasoning in LMMs—a largely untested but fundamental aspect of intelligence. It addresses a deeper capability gap (affordance-grounded creative problem-solving) with broad implications for embodied AI, robotics, and cognitive science. Paper 2 presents a clever engineering contribution to prompt optimization with solid empirical gains, but operates in a more incremental, narrower space. Paper 1's contribution to understanding and improving creative physical intelligence in multimodal models has broader cross-disciplinary impact and higher long-term significance.

vs. CyberJurors: A Multi-Agent Simulation Task for E-Commerce Disputes Verdict

gemini-3.1 · 5/28/2026

Paper 2 demonstrates higher potential scientific impact because it addresses a fundamental limitation of current Large Multimodal Models: creative physical intelligence and affordance grounding. While Paper 1 offers a valuable, practical application in e-commerce dispute resolution, Paper 2 tackles a broader challenge essential for embodied AI, robotics, and advanced physical reasoning. By introducing a novel benchmark and an affordance-grounded alignment method using Direct Preference Optimization, Paper 2 pushes the boundaries of how models perceive and interact with physical environments, offering wider applicability across multiple AI domains.

vs. AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents

gemini-3.1 · 5/27/2026

Paper 1 tackles a fundamental cognitive capability—creative physical intelligence and affordance reasoning—which has broad implications across multimodal AI, robotics, and cognitive science. Introducing a novel benchmark and alignment strategy for open-ended problem-solving pushes the boundaries of LMM capabilities. While Paper 2 offers a valuable and rigorous systems-level optimization for LLM agent efficiency, Paper 1's focus on advancing core reasoning and grounding capabilities suggests a wider and more transformative scientific impact across multiple disciplines.

vs. Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL

claude-opus-4.6 · 5/27/2026

Paper 2 introduces a novel benchmark (MM-CreativityBench) and a new capability dimension (creative physical intelligence) for large multimodal models, addressing a fundamental gap in AI evaluation. It proposes affordance-grounded alignment via DPO, which is methodologically novel and broadly applicable. The work spans multiple high-impact areas (LMMs, embodied AI, creative reasoning) and is highly timely given the rapid advancement of multimodal models. Paper 1, while solid, makes a more incremental contribution to hierarchical RL with skill reuse, a well-studied area with narrower immediate impact.

vs. ORCA: An End-to-End Interactive Copilot for Optimized Root Cause Analysis

gpt-5.2 · 5/27/2026

Paper 1 is more scientifically novel and broadly impactful: it introduces a new benchmark for physically grounded creative tool use in multimodal models and proposes an alignment method (preference learning plus affordance-guided supervision) targeting a clear, currently limiting failure mode (grounded exploration vs hallucination). This advances evaluation methodology and model training for embodied/grounded reasoning, with relevance to robotics, HCI, vision-language, and safety. Paper 2 is highly applicable as an engineering system for causal/RCA workflows, but appears less methodologically novel and more domain-tooling oriented, with impact depending on deployment and validation depth.

vs. Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

gpt-5.2 · 5/27/2026

Paper 1 likely has higher impact due to its novel identification and empirical demonstration of a structural vulnerability in the dominant RLHF paradigm, with broad implications for AI safety, alignment, evaluation methodology, and deployment governance. The “alignment tampering” mechanism is widely relevant across LLMs trained via preference data, and its real-world stakes (bias amplification, propaganda, goal-seeking) are immediate and timely. While Paper 2 introduces a useful benchmark and training approach for multimodal grounded creativity, its impact is more domain-specific and incremental relative to ongoing benchmark/grounding work.

vs. MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

gemini-3.1 · 5/27/2026

Paper 1 addresses a fundamental cognitive capability in AI—creative physical problem-solving and affordance reasoning—which has broad implications for foundation models, embodied AI, and robotics. It introduces both a novel benchmark and a new alignment methodology. Paper 2, while highly practical and valuable for on-device deployment and privacy, represents a narrower engineering optimization for mobile GUI agents rather than a fundamental leap in AI reasoning capabilities.

vs. Position: AI Safety Requires Effective Controllability

gemini-3.1 · 5/27/2026

Paper 2 offers a critical paradigm shift in AI safety, moving beyond static alignment to active runtime controllability. As autonomous AI agents become ubiquitous, ensuring they can be reliably interrupted and redirected is an urgent challenge with massive real-world and policy implications. While Paper 1 makes strong methodological contributions to embodied AI and LMMs, Paper 2's foundational framework and benchmark for agentic controllability address a more universally pressing bottleneck across the broader artificial intelligence landscape, giving it a wider and potentially more transformative scientific impact.

vs. Boosting Knowledge Graph Foundation Models via Enhanced Negative Sampling

gemini-3.1 · 5/27/2026

Paper 1 targets a significant frontier in AI—creative physical reasoning and tool use in Large Multimodal Models (LMMs). By introducing a novel benchmark (MM-CreativityBench) and an alignment method using Direct Preference Optimization to ground affordances, it addresses a fundamental limitation in current LMMs. In contrast, Paper 2 offers a valuable but more incremental technical optimization (adaptive negative sampling) for Knowledge Graph Foundation Models. Paper 1's focus on bridging multimodal perception with physical, open-ended problem-solving gives it higher potential for broad, cross-disciplinary impact in both general AI and embodied robotics.

vs. OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

gpt-5.2 · 5/27/2026

Paper 2 has higher estimated impact due to stronger novelty and broader applicability: it introduces a visually grounded benchmark for creative, physically constrained tool use (a key gap for multimodal agents) and pairs it with a concrete training recipe (affordance-grounded alignment via DPO + KB supervision) showing measurable improvements and reduced hallucination. This combination of evaluation + intervention is timely for robotics/embodied AI and multimodal agents, with potential real-world translation. Paper 1 is rigorous and valuable for social reasoning diagnostics, but is more niche (ToM evaluation) and primarily benchmark-focused without an associated capability-improving method.

vs. Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs

gpt-5.2 · 5/27/2026

Paper 2 identifies a broadly consequential failure mode in deployed retrieval-augmented LLMs: multi-turn evidence monitoring that does not translate into safe action selection. It offers large-scale evaluation (50k+ turn-level), cross-model analysis, human validation, and mechanistic probes, making the claim robust and actionable for safety-critical RAG applications (health, law, ops). The monitoring-control gap is timely and likely to influence evaluation standards and system design across NLP/AI safety. Paper 1 is novel but more benchmark/technique-specific and narrower in immediate real-world risk relevance.

vs. Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models

gemini-3.1 · 5/27/2026

Paper 1 pioneers a novel frontier in Large Multimodal Models by addressing creative physical intelligence and affordance grounding, which are critical for advancing embodied AI and robotics. By introducing a new benchmark and an alignment method to solve fundamental reasoning gaps, it offers broader long-term scientific impact across multiple disciplines compared to Paper 2's narrower, though highly practical, focus on LLM fine-tuning security.

vs. Proper Scoring Rules for Agentic Uncertainty Quantification

claude-opus-4.6 · 5/27/2026

Paper 2 introduces a principled, theoretically grounded scoring framework (TPS) for evaluating uncertainty quantification in agentic AI systems—a rapidly growing area. Its contribution is foundational: it provides strictly proper scoring rules with formal proofs, addresses censored trajectories, and demonstrates that existing metrics are theoretically deficient. This has broad applicability across all agentic LLM systems. Paper 1, while addressing an interesting niche (creative physical reasoning in LMMs), is more application-specific with a narrower benchmark contribution and incremental alignment technique. Paper 2's methodological rigor and generalizability give it higher potential impact.

vs. LELA: An End-to-end LLM-based Entity Linking Framework with Zero-shot Domain Adaptation

claude-opus-4.6 · 5/27/2026

Paper 2 introduces a novel benchmark (MM-CreativityBench) and a new training paradigm (affordance-grounded alignment) addressing a fundamental gap in LMM evaluation—creative physical reasoning. It tackles a deeper scientific question about grounded intelligence beyond pattern recognition, with broader implications for AI safety, embodied AI, and cognitive science. Paper 1 presents an engineering contribution (a Python library for entity linking using existing LLM techniques), which, while practical, offers incremental novelty and narrower scientific impact compared to Paper 2's new benchmark and methodology.

vs. Natural Language Query to Configuration for Retrieval Agents

gemini-3.1 · 5/27/2026

Paper 2 addresses a fundamental frontier in AI—creative physical intelligence and affordance-grounded reasoning in LMMs. By introducing a novel benchmark and an alignment method for embodied problem-solving, it bridges vision, language, and robotics, offering broader theoretical and multi-disciplinary impact. Paper 1 presents a highly practical systems optimization for retrieval agents, but its contribution is more incremental and applied compared to the foundational cognitive capabilities explored in Paper 2.

vs. Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy

gemini-3.1 · 5/27/2026

Paper 2 addresses a critical bottleneck in LLM trustworthiness—Chain-of-Thought faithfulness—by bridging mechanistic interpretability with external outputs. Its computationally efficient circuit-tracing approach using Fused Gromov-Wasserstein distance offers high methodological rigor. While Paper 1 introduces a valuable multimodal benchmark, Paper 2's foundational contribution to AI safety and alignment has broader theoretical implications and higher potential impact across the widespread deployment of reasoning models.

vs. FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning

gpt-5.2 · 5/27/2026

Paper 1 likely has higher impact due to stronger novelty and broader relevance: it introduces a new benchmark targeting an under-evaluated capability (affordance-grounded creative tool use) and proposes an alignment framework addressing a core failure mode (grounded exploration vs hallucination) applicable across multimodal agents and embodied reasoning. Its applications extend to robotics, interactive assistants, and safety/grounding. Paper 2 is valuable and practical for improving long-text alignment in CLIP with an efficient method and dataset, but the contribution is more incremental within established VLM fine-tuning and alignment paradigms.