Advancing Creative Physical Intelligence in Large Multimodal Models
Cheng Qian, Hyeonjeong Ha, Jiayu Liu, Jeonghwan Kim, Emre Can Acikgoz, Bingxuan Li, Kunlun Zhu, Jiateng Liu
Abstract
Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize to discovering visually grounded solutions in open-ended environments, beyond pattern recognition. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative problem-solving is central to human intelligence, but remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not due to lack of generative capability, but because they do not sustain grounded exploration. Models often overlook relevant entities, under-examine critical parts, or hallucinate attributes not grounded in the image. Motivated by this failure mode, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization, we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, while substantially reducing hallucination and grounding-related errors.
AI Impact Assessments
Scientific Impact Assessment: "Advancing Creative Physical Intelligence in Large Multimodal Models"
1. Core Contribution
This paper introduces MM-CreativityBench, a benchmark for evaluating visually grounded creative tool repurposing in multimodal environments, alongside an affordance-grounded alignment training method. The core insight is that creative problem-solving in LMMs requires not just generating plausible solutions, but sustaining an evidence-driven exploration process that connects visual perception with physical affordance reasoning at the part level. The benchmark operationalizes this through an interactive protocol where models inspect scenes, entities, and zoomed-in parts before committing to answers.
The paper also proposes a two-stage training approach: supervised fine-tuning (SFT) on knowledge-guided positive trajectories, followed by Direct Preference Optimization (DPO) using hard-negative trajectories that capture realistic failure modes like hallucinated affordances and premature commitment. This combination more than doubles gold-correct performance on 4B and 8B models.
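To make the preference-learning stage concrete, the standard DPO objective for a single (preferred, rejected) trajectory pair can be sketched as follows. This is a minimal illustration of the generic DPO loss, not the paper's training code. The function names, the `beta` default, and the example log-probabilities are assumptions chosen for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the review does not report the paper's setting
) -> float:
    """Generic DPO loss for one preference pair.

    Each argument is the summed log-probability of a full trajectory under
    the trainable policy or the frozen reference model.
    """
    # How much more the policy likes each trajectory than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


# Hypothetical numbers: the policy has moved toward the grounded trajectory
# and away from the hallucinated one, relative to the reference model.
improved = dpo_loss(-5.0, -12.0, -8.0, -9.0)
untrained = dpo_loss(-8.0, -9.0, -8.0, -9.0)
print(improved < untrained)
```

In this framing, the grounded trajectory plays the role of the chosen sample and a hard-negative trajectory (for example, one with a hallucinated affordance) plays the role of the rejected sample. The loss falls as the policy raises the likelihood of the former relative to the latter.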
2. Methodological Rigor
Strengths in design: The reverse task construction methodology is well-conceived—building tasks from verified entity-part-affordance triples in a knowledge base ensures that each instance has a known, grounded solution path. The three-level image hierarchy (environment → entity → part) and the exploration stack formalism provide clean experimental control.
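One way to picture the three-level hierarchy and an exploration stack is the sketch below. The review does not reproduce the paper's formal definition, so the class, its method names, and the example entity and part names are all hypothetical. It assumes only that a model starts at the environment view, can zoom into an entity and then a part, and can step back.

```python
from dataclasses import dataclass, field

# The three view levels named in the review, from coarsest to finest.
LEVELS = ("environment", "entity", "part")


@dataclass
class ExplorationStack:
    """Hypothetical stack of views; the bottom frame is always the scene."""

    frames: list = field(default_factory=lambda: [("environment", "scene")])

    def current_level(self) -> str:
        return self.frames[-1][0]

    def zoom_in(self, name: str) -> None:
        """Push the next finer view (entity, then part)."""
        depth = len(self.frames)
        if depth >= len(LEVELS):
            raise ValueError("already at part level; cannot zoom further")
        self.frames.append((LEVELS[depth], name))

    def back(self) -> None:
        """Pop the current view and return to its parent."""
        if len(self.frames) == 1:
            raise ValueError("already at environment level")
        self.frames.pop()

    def path(self) -> str:
        return " > ".join(name for _, name in self.frames)


# Illustrative walk: inspect an entity, zoom into one of its parts, step back.
stack = ExplorationStack()
stack.zoom_in("umbrella")   # hypothetical entity
stack.zoom_in("handle")     # hypothetical part
deepest = stack.path()
stack.back()
print(deepest, "|", stack.path())
```

Restricting moves to push and pop along the hierarchy is what would give the experimental control the review credits: every inspection a model makes is a well-defined position in the environment, entity, part tree.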
Evaluation depth is notable. The paper goes well beyond reporting accuracy: it analyzes exploration efficiency (repetition rates, similarity density, exploration progress), affordance similarity effects, typicality effects, error categorization (with 92% human-model agreement), ablation across image conditions, and prompting format variations. The case studies (Section 5.7) are genuinely informative, showing specific failure modes and how training repairs them.
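As a rough illustration of what an exploration-efficiency metric such as a repetition rate might measure, the sketch below counts how often a model re-inspects a view it has already seen. The paper's exact definition is not given in this review, so the formula, the function name, and the view identifiers are assumptions.

```python
def repetition_rate(inspected_views: list) -> float:
    """Fraction of inspection actions that revisit an already-seen view.

    Assumed definition for illustration only; returns 0.0 for an empty log.
    """
    if not inspected_views:
        return 0.0
    seen = set()
    repeats = 0
    for view in inspected_views:
        if view in seen:
            repeats += 1
        else:
            seen.add(view)
    return repeats / len(inspected_views)


# Hypothetical inspection log: the last action revisits the umbrella entity.
log = ["scene", "umbrella", "umbrella/handle", "umbrella"]
print(repetition_rate(log))
```

Under this assumed definition, a high rate would indicate that a model keeps looking at the same views instead of widening its search, which matches the failure pattern the review describes as not sustaining grounded exploration.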
Concerns about rigor:
3. Potential Impact
Benchmark contribution: MM-CreativityBench fills a genuine gap. Table 1 convincingly shows that no prior benchmark simultaneously addresses creative tool use, affordance grounding, attribute grounding, part-level reasoning, distractor inclusion, visual grounding, and interactive evaluation. This could become a useful diagnostic tool for the community.
Training methodology: The affordance-grounded alignment approach demonstrates that structured affordance knowledge can serve as a training signal for improving multimodal reasoning. The finding that SFT teaches exploration structure while hard-negative DPO teaches evidence discrimination is an actionable insight for the alignment community.
Broader implications: The work connects to embodied AI, robotics, and cognitive science. The framing around Sternberg's Triarchic Theory and Gibson's affordances provides theoretical grounding. The distinction between creativity and hallucination (Section 6) is particularly valuable—the paper argues that creative tool use requires *conditional and verifiable* imagination, not unconstrained generation.
Limitations on impact: The reliance on synthetic images and a closed knowledge base may limit adoption. The benchmark requires substantial infrastructure (image generation, multi-turn evaluation, knowledge base integration), which could hinder reproducibility despite code availability.
4. Timeliness & Relevance
This paper addresses a timely gap. As LMMs are increasingly deployed in agentic and embodied settings, understanding whether they can perform physically grounded creative reasoning—rather than pattern-matching from training data—is essential. The finding that GPT-5.4 can underperform open-source models like Qwen on this task (Table 3) challenges the assumption that scaling alone solves grounded reasoning, which is a relevant and provocative result.
The interactive evaluation paradigm aligns with the growing interest in agent-based evaluation (VisualAgentBench, VisEscape). The preference optimization approach connects to the active research area of RLHF/DPO for multimodal models.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Overall Assessment
This is a substantive contribution that identifies an important capability gap in LMMs and provides both evaluation tools and initial training solutions. The benchmark design is thoughtful, the analysis is thorough, and the findings are actionable. The main limitations are the synthetic image setting and the modest benchmark scale. The paper would benefit from stronger evidence that findings transfer to real-world visual settings.
Generated May 27, 2026
Comparison History (22)
MemCog introduces a fundamental paradigm shift from Memory-as-Tool to Memory-as-Cognition in conversational agents, addressing core architectural limitations with a comprehensive framework (navigable memory stores, cross-dimensional navigation, proactive reasoning). It achieves SOTA on multiple established benchmarks and introduces a novel benchmark (ProactiveMemBench). The concept of integrating memory as cognition rather than a tool has broad implications for agent architectures, LLM-based systems, and cognitive AI. Paper 2, while valuable, is more narrowly focused on creative physical reasoning benchmarks and alignment techniques for LMMs, with comparatively less paradigmatic novelty.
Paper 1 has higher potential scientific impact due to its more novel framing and benchmark contribution: it operationalizes “creative physical intelligence” in LMMs with a new, fine-grained, visually grounded tool-use benchmark and proposes an alignment method targeting a clear failure mode (grounded exploration vs. hallucination). This is timely for multimodal reasoning/embodied cognition and could influence evaluation and training across multimodal AI, robotics, HCI, and cognitive science. Paper 2 is highly application-oriented and strong on performance, but is closer to established retrieval/knowledge-augmented agent patterns with narrower methodological novelty.
Paper 2 introduces a novel benchmark (MM-CreativityBench) and a new alignment method (affordance-grounded alignment) addressing a fundamental gap in LMM evaluation—creative physical reasoning. This targets a core AI challenge with broad implications across robotics, embodied AI, and cognitive science. Paper 1, while practically useful, is more of an engineering contribution (optimized JS query libraries for AI traces) with narrower impact. Paper 2's methodological contribution to preference learning and its identification of systematic failure modes in frontier models are likely to influence a wider research community.
Paper 1 introduces a novel benchmark and alignment methodology for creative physical reasoning in LMMs—a largely untested but fundamental aspect of intelligence. It addresses a deeper capability gap (affordance-grounded creative problem-solving) with broad implications for embodied AI, robotics, and cognitive science. Paper 2 presents a clever engineering contribution to prompt optimization with solid empirical gains, but operates in a more incremental, narrower space. Paper 1's contribution to understanding and improving creative physical intelligence in multimodal models has broader cross-disciplinary impact and higher long-term significance.
Paper 2 demonstrates higher potential scientific impact because it addresses a fundamental limitation of current Large Multimodal Models: creative physical intelligence and affordance grounding. While Paper 1 offers a valuable, practical application in e-commerce dispute resolution, Paper 2 tackles a broader challenge essential for embodied AI, robotics, and advanced physical reasoning. By introducing a novel benchmark and an affordance-grounded alignment method using Direct Preference Optimization, Paper 2 pushes the boundaries of how models perceive and interact with physical environments, offering wider applicability across multiple AI domains.
Paper 1 tackles a fundamental cognitive capability—creative physical intelligence and affordance reasoning—which has broad implications across multimodal AI, robotics, and cognitive science. Introducing a novel benchmark and alignment strategy for open-ended problem-solving pushes the boundaries of LMM capabilities. While Paper 2 offers a valuable and rigorous systems-level optimization for LLM agent efficiency, Paper 1's focus on advancing core reasoning and grounding capabilities suggests a wider and more transformative scientific impact across multiple disciplines.
Paper 2 introduces a novel benchmark (MM-CreativityBench) and a new capability dimension (creative physical intelligence) for large multimodal models, addressing a fundamental gap in AI evaluation. It proposes affordance-grounded alignment via DPO, which is methodologically novel and broadly applicable. The work spans multiple high-impact areas (LMMs, embodied AI, creative reasoning) and is highly timely given the rapid advancement of multimodal models. Paper 1, while solid, makes a more incremental contribution to hierarchical RL with skill reuse, a well-studied area with narrower immediate impact.
Paper 1 is more scientifically novel and broadly impactful: it introduces a new benchmark for physically grounded creative tool use in multimodal models and proposes an alignment method (preference learning plus affordance-guided supervision) targeting a clear, currently limiting failure mode (grounded exploration vs hallucination). This advances evaluation methodology and model training for embodied/grounded reasoning, with relevance to robotics, HCI, vision-language, and safety. Paper 2 is highly applicable as an engineering system for causal/RCA workflows, but appears less methodologically novel and more domain-tooling oriented, with impact depending on deployment and validation depth.
Paper 1 likely has higher impact due to its novel identification and empirical demonstration of a structural vulnerability in the dominant RLHF paradigm, with broad implications for AI safety, alignment, evaluation methodology, and deployment governance. The “alignment tampering” mechanism is widely relevant across LLMs trained via preference data, and its real-world stakes (bias amplification, propaganda, goal-seeking) are immediate and timely. While Paper 2 introduces a useful benchmark and training approach for multimodal grounded creativity, its impact is more domain-specific and incremental relative to ongoing benchmark/grounding work.
Paper 1 addresses a fundamental cognitive capability in AI—creative physical problem-solving and affordance reasoning—which has broad implications for foundation models, embodied AI, and robotics. It introduces both a novel benchmark and a new alignment methodology. Paper 2, while highly practical and valuable for on-device deployment and privacy, represents a narrower engineering optimization for mobile GUI agents rather than a fundamental leap in AI reasoning capabilities.
Paper 2 offers a critical paradigm shift in AI safety, moving beyond static alignment to active runtime controllability. As autonomous AI agents become ubiquitous, ensuring they can be reliably interrupted and redirected is an urgent challenge with massive real-world and policy implications. While Paper 1 makes strong methodological contributions to embodied AI and LMMs, Paper 2's foundational framework and benchmark for agentic controllability address a more universally pressing bottleneck across the broader artificial intelligence landscape, giving it a wider and potentially more transformative scientific impact.
Paper 1 targets a significant frontier in AI—creative physical reasoning and tool use in Large Multimodal Models (LMMs). By introducing a novel benchmark (MM-CreativityBench) and an alignment method using Direct Preference Optimization to ground affordances, it addresses a fundamental limitation in current LMMs. In contrast, Paper 2 offers a valuable but more incremental technical optimization (adaptive negative sampling) for Knowledge Graph Foundation Models. Paper 1's focus on bridging multimodal perception with physical, open-ended problem-solving gives it higher potential for broad, cross-disciplinary impact in both general AI and embodied robotics.
Paper 2 has higher estimated impact due to stronger novelty and broader applicability: it introduces a visually grounded benchmark for creative, physically constrained tool use (a key gap for multimodal agents) and pairs it with a concrete training recipe (affordance-grounded alignment via DPO + KB supervision) showing measurable improvements and reduced hallucination. This combination of evaluation + intervention is timely for robotics/embodied AI and multimodal agents, with potential real-world translation. Paper 1 is rigorous and valuable for social reasoning diagnostics, but is more niche (ToM evaluation) and primarily benchmark-focused without an associated capability-improving method.
Paper 2 identifies a broadly consequential failure mode in deployed retrieval-augmented LLMs: multi-turn evidence monitoring that does not translate into safe action selection. It offers large-scale evaluation (50k+ turn-level), cross-model analysis, human validation, and mechanistic probes, making the claim robust and actionable for safety-critical RAG applications (health, law, ops). The monitoring-control gap is timely and likely to influence evaluation standards and system design across NLP/AI safety. Paper 1 is novel but more benchmark/technique-specific and narrower in immediate real-world risk relevance.
Paper 1 pioneers a novel frontier in Large Multimodal Models by addressing creative physical intelligence and affordance grounding, which are critical for advancing embodied AI and robotics. By introducing a new benchmark and an alignment method to solve fundamental reasoning gaps, it offers broader long-term scientific impact across multiple disciplines compared to Paper 2's narrower, though highly practical, focus on LLM fine-tuning security.
Paper 2 introduces a principled, theoretically grounded scoring framework (TPS) for evaluating uncertainty quantification in agentic AI systems—a rapidly growing area. Its contribution is foundational: it provides strictly proper scoring rules with formal proofs, addresses censored trajectories, and demonstrates that existing metrics are theoretically deficient. This has broad applicability across all agentic LLM systems. Paper 1, while addressing an interesting niche (creative physical reasoning in LMMs), is more application-specific with a narrower benchmark contribution and incremental alignment technique. Paper 2's methodological rigor and generalizability give it higher potential impact.
Paper 2 introduces a novel benchmark (MM-CreativityBench) and a new training paradigm (affordance-grounded alignment) addressing a fundamental gap in LMM evaluation—creative physical reasoning. It tackles a deeper scientific question about grounded intelligence beyond pattern recognition, with broader implications for AI safety, embodied AI, and cognitive science. Paper 1 presents an engineering contribution (a Python library for entity linking using existing LLM techniques), which, while practical, offers incremental novelty and narrower scientific impact compared to Paper 2's new benchmark and methodology.
Paper 2 addresses a fundamental frontier in AI—creative physical intelligence and affordance-grounded reasoning in LMMs. By introducing a novel benchmark and an alignment method for embodied problem-solving, it bridges vision, language, and robotics, offering broader theoretical and multi-disciplinary impact. Paper 1 presents a highly practical systems optimization for retrieval agents, but its contribution is more incremental and applied compared to the foundational cognitive capabilities explored in Paper 2.
Paper 2 addresses a critical bottleneck in LLM trustworthiness—Chain-of-Thought faithfulness—by bridging mechanistic interpretability with external outputs. Its computationally efficient circuit-tracing approach using Fused Gromov-Wasserstein distance offers high methodological rigor. While Paper 1 introduces a valuable multimodal benchmark, Paper 2's foundational contribution to AI safety and alignment has broader theoretical implications and higher potential impact across the widespread deployment of reasoning models.
Paper 1 likely has higher impact due to stronger novelty and broader relevance: it introduces a new benchmark targeting an under-evaluated capability (affordance-grounded creative tool use) and proposes an alignment framework addressing a core failure mode (grounded exploration vs hallucination) applicable across multimodal agents and embodied reasoning. Its applications extend to robotics, interactive assistants, and safety/grounding. Paper 2 is valuable and practical for improving long-text alignment in CLIP with an efficient method and dataset, but the contribution is more incremental within established VLM fine-tuning and alignment paradigms.