Brick-Composer: Using MLLMs for Assembly with Diverse Bricks

Jiateng Liu, Bingxuan Li, Zhenhailong Wang, Rushi Wang, Kaiwen Hong, Cheng Qian, Jiayu Liu, Denghui Zhang

Jun 3, 2026

arXiv:2606.05445v1

cs.AI (primary)

#2041 of 3355 · Artificial Intelligence

Tournament Score

1378±47 (scale: 1050 to 1800)

Win Rate: 55%

Rating: 5.5 / 10

Significance 5.5 · Rigor 5.5 · Novelty 6 · Clarity 7

Abstract

We dream of AI agents that can read arbitrary designs and construct real-world objects from reusable building blocks. As a first step toward this vision, we study whether multimodal large language models (MLLMs) possess the visual grounding and spatial reasoning capabilities required for brick assembly. We formulate brick assembly as a sequential decision-making problem, where each step involves two subtasks: brick selection, identifying the target brick from candidate components, and brick pose estimation, predicting where and how the selected brick should be placed. To support this study, we introduce BC-Bench (Brick Construction Benchmark), the first benchmark for evaluating MLLMs on assembly with diverse bricks. Experiments show that current state-of-the-art MLLMs remain far from reliable builders, struggling with fine-grained brick selection and failing at precise pose estimation. To bridge this gap, we propose Brick-Composer, a learning framework that equips MLLMs with assembly skills through three complementary signals: Human Design Sparks, which provide affordance-rich construction demonstrations; World Feedback, which grounds predicted actions in visual and physical consequences; and Synthetic Experience, which scales learning beyond existing object designs. Brick-Composer improves brick selection accuracy by over three times, substantially reduces pose estimation errors, and raises strict step-level assembly success from less than 1% to around 15%. After training, a Qwen-3-8B can correctly compose up to 42% of the steps for a complete object, suggesting that MLLMs can acquire assembly capabilities through targeted, physically grounded learning.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Brick-Composer

1. Core Contribution

This paper addresses the problem of whether multimodal large language models (MLLMs) can perform brick assembly — specifically, selecting the correct brick from candidates and estimating its 6DoF pose for placement. The contributions are threefold: (1) BC-Bench, the first benchmark for evaluating MLLMs on diverse brick assembly with both selection and full SE(3) pose estimation; (2) a systematic evaluation of state-of-the-art MLLMs (GPT-5.4, Qwen, InternVL, Gemma) revealing their limitations; and (3) Brick-Composer, a learning framework combining human design demonstrations, simulator-based world feedback, and synthetic experience generation to improve assembly performance.

The formulation of brick assembly as a sequential decision-making problem with two coupled subtasks (selection and pose estimation) is clean and well-motivated. The proxy task of LEGO-style assembly is a reasonable testbed for compositional spatial reasoning, bridging perception and physical action planning.
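To make the two-subtask formulation concrete, the following Python sketch shows one way a strict step-level check could combine brick selection and pose estimation. The `Placement` type, the yaw-only rotation, and the tolerance values are illustrative assumptions, not BC-Bench's actual definitions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """One assembly step: which brick, and where/how it is placed (hypothetical type)."""
    brick_id: str
    position: tuple   # (x, y, z) in arbitrary units
    yaw_deg: float    # simplification: rotation about the vertical axis only


def strict_step_success(pred, gold, pos_tol=0.5, rot_tol_deg=5.0):
    """A step counts only if selection AND pose are both right.

    pos_tol and rot_tol_deg are assumed thresholds chosen for illustration.
    """
    if pred.brick_id != gold.brick_id:
        return False  # wrong brick: the pose is irrelevant
    pos_err = math.dist(pred.position, gold.position)
    # Wrap the yaw difference into [-180, 180) before taking its magnitude.
    rot_err = abs((pred.yaw_deg - gold.yaw_deg + 180.0) % 360.0 - 180.0)
    return pos_err <= pos_tol and rot_err <= rot_tol_deg
```

Because both conditions must hold at once, the strict success rate can never exceed the selection accuracy, which helps explain why it sits far below the individual subtask numbers.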

2. Methodological Rigor

Benchmark Design: BC-Bench is carefully constructed with 102 objects (82 train, 20 test), split at the object level to prevent data leakage. The multi-view rendering pipeline (six orthogonal + one isometric view) provides rich spatial information. The symmetry-aware rotation error metric is a thoughtful design choice that avoids penalizing physically equivalent orientations.
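The review does not give the formula for the symmetry-aware rotation error, so the sketch below shows a common way such a metric is built: take the geodesic angle between two rotations and minimize it over the brick's symmetry group. The function names and the example symmetry set (a 180° turn about the vertical axis) are assumptions for illustration.

```python
import math


def rot_z(deg):
    """3x3 rotation matrix about the z (vertical) axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))


def matmul(a, b):
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )


def transpose(a):
    return tuple(tuple(a[j][i] for j in range(3)) for i in range(3))


def geodesic_deg(r_pred, r_gold):
    """Angle of the relative rotation r_gold^T r_pred, in degrees."""
    rel = matmul(transpose(r_gold), r_pred)
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for float safety
    return math.degrees(math.acos(cos_theta))


def symmetry_aware_error_deg(r_pred, r_gold, symmetries):
    """Smallest geodesic error over all physically equivalent gold orientations.

    Each symmetry is expressed in the brick's own frame, so it multiplies on the right.
    """
    return min(geodesic_deg(r_pred, matmul(r_gold, s)) for s in symmetries)
```

For a brick that looks identical after a half turn, passing `[rot_z(0), rot_z(180)]` as `symmetries` means a prediction rotated by 180° scores close to zero error instead of the maximum.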

Learning Framework: The three-signal approach is logical but somewhat incremental:

*Human Design Sparks*: Standard supervised fine-tuning on ~3K human-designed assembly steps, serialized as text tokens.

*World Feedback*: A feedback loop where the simulator renders the model's prediction, and the model attempts correction — either at inference time or via training on correction trajectories.

*Synthetic Experience*: Procedural generation of ~40K training steps from ~700 synthetic objects using feasible attachment filtering and density-based compactness rewards.
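The inference-time variant of the World Feedback signal described above can be sketched as a propose, simulate, correct loop. Everything here is a toy stand-in: in the paper the simulator renders images, whereas this sketch uses short text hints, and `refine_with_feedback`, `toy_simulate`, and `toy_propose` are hypothetical names.

```python
from typing import Callable, Optional, Tuple

Pose = Tuple[int, int, int]  # toy pose: integer grid position only


def refine_with_feedback(
    propose: Callable[[Optional[Pose], Optional[str]], Pose],
    simulate: Callable[[Pose], Optional[str]],
    max_corrections: int = 3,
) -> Tuple[Pose, bool]:
    """Propose a pose, check it in the simulator, and let the model correct it.

    Returns the final pose and whether the simulator accepted it.
    """
    pose = propose(None, None)  # first attempt, no feedback yet
    for _ in range(max_corrections):
        feedback = simulate(pose)
        if feedback is None:      # simulator reports no problem
            return pose, True
        pose = propose(pose, feedback)
    return pose, simulate(pose) is None


TARGET: Pose = (2, 1, 0)  # assumed ground-truth position for the toy example


def toy_simulate(pose: Pose) -> Optional[str]:
    """Toy simulator: returns a hint such as 'move +x', or None when correct."""
    for axis, name in enumerate("xyz"):
        if pose[axis] != TARGET[axis]:
            sign = "+" if pose[axis] < TARGET[axis] else "-"
            return f"move {sign}{name}"
    return None


def toy_propose(prev: Optional[Pose], feedback: Optional[str]) -> Pose:
    """Toy model: starts at a fixed guess, then follows the hint by one unit."""
    if prev is None or feedback is None:
        return (0, 1, 0)
    axis = "xyz".index(feedback[-1])
    step = 1 if feedback[-2] == "+" else -1
    moved = list(prev)
    moved[axis] += step
    return (moved[0], moved[1], moved[2])
```

Calling `refine_with_feedback(toy_propose, toy_simulate)` walks the toy pose toward `TARGET`. The training-time variant would instead log these correction trajectories as supervision.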

The ablation study (Table 3) is informative: world feedback alone can hurt performance, but combining it with designer supervision helps. Synthetic experience provides the largest marginal gain. However, there are methodological concerns:

The evaluation set is relatively small (1,013 steps from 20 objects), which may lead to high variance in reported metrics, particularly the "best-object" numbers.

The "best-object performance" metric is inherently cherry-picked and could be misleading. Reporting per-object statistics (e.g., median, quartiles) would be more informative.

The strict step-wise success rate going from ~0% to ~15% (average) is notable but still low. The 42% best-object figure, while impressive relative to the baseline, applies to a single favorable object.

The paper does not discuss error propagation in detail: in sequential assembly, early errors cascade, yet evaluation appears to use ground-truth assembly states for each step rather than the model's own accumulated predictions.
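The per-object statistics suggested above are cheap to compute from step-level results. The sketch below uses only the Python standard library; the input format, a list of `(object_id, success)` pairs, is an assumption about how results might be logged.

```python
from collections import defaultdict
from statistics import mean, median, quantiles


def per_object_summary(step_results):
    """Summarize step success per object instead of reporting only the best one.

    step_results: iterable of (object_id, success) pairs, one per evaluated step.
    Needs at least two objects, since quartiles are undefined for one value.
    """
    by_object = defaultdict(list)
    for object_id, success in step_results:
        by_object[object_id].append(1.0 if success else 0.0)

    rates = sorted(mean(flags) for flags in by_object.values())
    q1, _, q3 = quantiles(rates, n=4, method="inclusive")
    return {
        "objects": len(rates),
        "mean": mean(rates),
        "median": median(rates),
        "q1": q1,
        "q3": q3,
        "worst": rates[0],
        "best": rates[-1],
    }
```

Reporting the median and quartiles next to the best-object figure shows at a glance whether a number like 42% is typical or an outlier.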

3. Potential Impact

Spatial Reasoning Benchmark: BC-Bench fills a genuine gap — no prior benchmark combines diverse brick vocabularies, manual-based instruction following, and full SE(3) pose estimation for MLLMs. This could become a useful testbed for the spatial reasoning community.

Practical Assembly Applications: The paper frames this as a step toward robotic assembly, manufacturing, and design assistance. However, the gap between simulation-only evaluation and real-world deployment is substantial (acknowledged in limitations). The sim-to-real transfer challenges — perception noise, grasping, contact dynamics — are non-trivial.

Learning Framework Generalizability: The three-signal learning paradigm (human demos + world feedback + synthetic scaling) is conceptually applicable beyond brick assembly to other spatial reasoning and assembly tasks. The synthetic data generation approach for creating physically plausible training configurations is a practical contribution.

Adjacent Fields: This work connects to robotics (6DoF pose estimation, assembly planning), spatial AI (3D reasoning in MLLMs), and instruction following. It provides evidence that MLLMs can learn spatial skills through targeted training, which has implications for embodied AI more broadly.

4. Timeliness & Relevance

The paper is timely given the rapid development of MLLMs and growing interest in their spatial and physical reasoning capabilities. Multiple recent works (SpatialVLM, 3D-LLM, SpatialBot) have highlighted spatial reasoning as a key frontier. The assembly domain is particularly relevant because it requires compositional reasoning (understanding individual parts, their affordances, and their spatial relationships), which tests deeper capabilities than standard VQA tasks.

The comparison with very recent models (GPT-5.4, Qwen-3.5-VL-27B) ensures relevance, and the benchmark evaluation reveals that even frontier models struggle severely with precise pose estimation, validating the need for this research direction.

5. Strengths & Limitations

Key Strengths:

Well-defined task formulation that decomposes assembly into measurable subtasks

First comprehensive benchmark for MLLM-based diverse brick assembly with SE(3) pose estimation

Thoughtful multi-view rendering pipeline with coordinate-augmented views

Ablation study clearly showing complementary value of each learning signal

Honest reporting of limitations: at 15% average step-wise success, the task is far from solved

Notable Weaknesses:

Scale limitations: 102 total objects (20 for evaluation) is small. The ~3K human-design training steps are acknowledged as limited, but the 40K synthetic steps may introduce a domain gap (no object-level semantics).

Evaluation protocol: Step-wise evaluation with ground-truth states at each step doesn't capture the cascading error problem of real sequential assembly. Object-level completion rates would be more meaningful.

Pose estimation as text generation: Predicting SE(3) poses as serialized text tokens is conceptually interesting but likely inherently limited compared to regression-based approaches. The paper doesn't compare against specialized pose estimation baselines.

No real-world validation: Everything remains in simulation. The paper's framing around "real-world objects from reusable building blocks" is aspirational rather than demonstrated.

Reproducibility concerns: The paper relies on copyrighted BrickLink designs and custom simulation, and while a GitHub page is referenced, the actual availability of the benchmark and code is unclear.

Limited model coverage for fine-tuning: Only Gemma-3-12B and Qwen-3-VL-8B are fine-tuned, leaving open how the approach scales with model size.
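The evaluation-protocol weakness above can be made concrete with a back-of-the-envelope calculation. Assuming, purely for illustration, that steps succeed independently with a fixed probability and that one failure ends a build, whole-object completion shrinks geometrically with the number of steps. The independence assumption and the function names are mine; the 15% rate and the 1,013 steps over 20 objects (roughly 51 steps per object) come from the review.

```python
def completion_probability(step_success_rate: float, num_steps: int) -> float:
    """Chance that every one of num_steps independent steps succeeds."""
    return step_success_rate ** num_steps


def expected_correct_prefix(step_success_rate: float, num_steps: int) -> float:
    """Expected number of consecutive correct steps before the first failure.

    Uses E[prefix] = sum over k of P(first k steps all succeed).
    """
    return sum(step_success_rate ** k for k in range(1, num_steps + 1))


# Figures taken from the review: ~15% strict step success, ~51 steps per object.
avg_rate, steps_per_object = 0.15, 51
print(completion_probability(avg_rate, steps_per_object))
print(expected_correct_prefix(avg_rate, steps_per_object))
```

Under these assumptions the expected correct prefix is well below one step, which is why object-level completion rates would tell a very different story from step-wise accuracy measured from ground-truth states.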

Overall Assessment

Brick-Composer makes a solid foundational contribution by introducing a well-structured benchmark and demonstrating that MLLMs can learn non-trivial assembly skills through physically grounded training. The task formulation is clean and the results are honest about current limitations. However, the small evaluation scale, simulation-only setting, lack of comparison with specialized baselines, and the still-low absolute performance (15% strict success) temper the immediate impact. This is a reasonable first step that opens a research direction rather than solving the problem.

Rating: 5.5 / 10

Significance 5.5 · Rigor 5.5 · Novelty 6 · Clarity 7

Generated Jun 5, 2026

Comparison History (22)

vs. Agents' Last Exam

claude-opus-4.6 · 6/6/2026

Agents' Last Exam (ALE) has higher potential scientific impact due to its breadth and timeliness. It introduces a comprehensive, living benchmark spanning 55 subfields and 13 industry clusters with 1K+ tasks, developed with 250+ industry experts, addressing a fundamental gap between AI benchmark performance and real-world economic deployment. This addresses a critical, widely-recognized problem in AI evaluation and could influence the entire field's research direction. Brick-Composer, while innovative in its specific domain of physical assembly with MLLMs, addresses a narrower problem with more limited cross-field applicability.

vs. What Type of Inference is Active Inference?

gpt-5.2 · 6/6/2026

Paper 1 has higher likely scientific impact due to its creation of a new benchmark (BC-Bench), a concrete training framework (Brick-Composer) with multiple grounded supervision signals, and sizable empirical gains on a challenging, application-relevant task (robotic/physical assembly). It is timely given rapid advances in MLLMs and embodied AI, and its artifacts can catalyze follow-up work across vision-language, robotics, and simulation-to-real learning. Paper 2 offers valuable theoretical clarification and improved planning formulations, but is more incremental within an established active-inference literature and demonstrated on limited grid-world settings.

vs. Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers

gpt-5.2 · 6/6/2026

Paper 1 is likely higher impact due to its concrete, timely benchmark (BC-Bench) and a physically grounded training framework that yields large empirical gains on an important real-world capability: embodied/assembly reasoning with reusable parts. It offers immediate utility to the vision-language/robotics community via standardized evaluation and a scalable recipe (human demos + world feedback + synthetic data). Paper 2 has strong conceptual framing and theory, but relies on assumptions-heavy regret decompositions and limited empirical validation, making near-term adoption and broader downstream influence less certain.

vs. AdaMEM: Test-Time Adaptive Memory for Language Agents

claude-opus-4.6 · 6/6/2026

AdaMEM addresses a fundamental challenge in language agent systems—test-time adaptation through dynamic memory—with broad applicability across multiple agent tasks (ALFWorld, WebShop, HotpotQA). It introduces a novel hybrid memory architecture and scaling dimension for agentic memory that could influence the design of future agent systems broadly. Paper 2 (Brick-Composer) tackles an interesting but narrower problem of brick assembly with MLLMs, achieving modest results (15% step success). While creative, its impact is more domain-specific, whereas AdaMEM's contributions to adaptive agent architectures have wider implications for the rapidly growing field of language agents.

vs. Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

gemini-3.1 · 6/6/2026

Paper 2 introduces a novel benchmark and application (physical brick assembly) that bridges multimodal LLMs with spatial reasoning and embodied AI. This cross-disciplinary approach offers higher novelty and broader potential real-world applications in robotics. In contrast, Paper 1 presents a more incremental multi-agent feedback mechanism in the highly saturated domain of text-based mathematical reasoning.

vs. Product-Aware Deep Autoencoders for Robust Process Monitoring in Multi-Product Cyber-Physical Systems

claude-opus-4.6 · 6/6/2026

Paper 1 introduces a novel benchmark (BC-Bench) and framework (Brick-Composer) at the intersection of MLLMs and physical assembly—a timely, underexplored area with broad implications for robotics, embodied AI, and construction automation. Its novelty in formulating brick assembly as sequential decision-making for MLLMs, combined with a multi-signal learning framework, addresses a fundamental capability gap. Paper 2 makes a solid but incremental contribution to anomaly detection in manufacturing by proposing product-aware autoencoders, which is a relatively straightforward extension of existing methods to a known problem domain with narrower impact scope.

vs. LLM Self-Recognition: Steering and Retrieving Activation Signatures

gemini-3.1 · 6/5/2026

Paper 1 addresses the urgent and highly impactful problem of AI-generated content attribution. Its novel approach of using internal activation steering to create an undetectable fingerprint offers immediate real-world utility for AI safety, academic integrity, and misinformation mitigation. While Paper 2 presents an interesting step forward for spatial reasoning and robotics using MLLMs, its application is currently more niche and exploratory, with relatively low success rates. Paper 1's broad applicability across NLP and AI policy gives it a higher potential for immediate scientific impact.

vs. Rethinking Infrastructure Inspection as Image Difference Classification: A Traffic Sign Case Study

claude-opus-4.6 · 6/5/2026

Brick-Composer introduces a novel benchmark (BC-Bench) and learning framework for evaluating and improving MLLMs on spatial reasoning and physical assembly tasks—a relatively unexplored area with broad implications for robotics, AI planning, and embodied intelligence. Its multi-signal training approach (human demonstrations, world feedback, synthetic experience) is methodologically innovative and applicable beyond brick assembly. Paper 2 addresses a narrower problem (traffic sign defect detection via image difference classification) with incremental contributions. While useful for infrastructure inspection, its scope and novelty are more limited compared to Paper 1's breadth and timeliness in the rapidly growing MLLM and embodied AI fields.

vs. Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact: it targets a broadly relevant, timely problem (multi-turn image editing) with immediate real-world applications in mainstream creative tools. Its core contribution—a context-aware RL post-training framework jointly optimizing discrete reasoning and continuous image generation, plus trajectory filtering—generalizes across multimodal generation tasks and aligns with current trends in RL for foundation models. The accompanying large-scale benchmark (MICE-Bench) with automated metrics further boosts adoption. Paper 1 is novel and valuable for embodied assembly, but its applicability and near-term deployment are narrower and results still relatively low (e.g., ~15% strict step success).

vs. Bidirectional Search for Longest Paths: Case for Front-to-Front Heuristics

claude-opus-4.6 · 6/5/2026

Brick-Composer addresses the emerging and high-impact intersection of multimodal large language models and physical reasoning/robotics assembly. It introduces a novel benchmark (BC-Bench), proposes a new learning framework with three complementary training signals, and demonstrates significant improvements. This work has broader impact potential across AI, robotics, and manufacturing, and is highly timely given the rapid advancement of MLLMs. Paper 1, while technically sound, addresses a more niche algorithmic problem in combinatorial search with incremental improvements, limiting its breadth of impact.

vs. Assessing the Carbon Emissions and Energy Consumption of U.S. Hyperscale Data Centers

gemini-3.1 · 6/5/2026

Paper 2 addresses a highly urgent and globally relevant issue: the environmental impact and energy consumption of hyperscale data centers driven by the AI boom. Its findings have broad implications for climate policy, energy grid planning, and the tech industry. While Paper 1 presents an interesting AI/robotics framework for spatial reasoning, Paper 2's potential to influence real-world sustainability efforts, policy decisions, and cross-disciplinary research gives it a higher scientific and societal impact.

vs. Step-adaptive multimodal fusion network with multi-scale cloud feature learning for ultra-short-term solar irradiance forecasting

gemini-3.1 · 6/5/2026

Paper 1 addresses a fundamental and highly timely challenge in artificial intelligence—equipping Multimodal LLMs with spatial reasoning and physical assembly skills. By introducing a novel benchmark and a grounded learning framework, it offers broad, transformative implications for embodied AI, robotics, and autonomous construction. In contrast, Paper 2 presents a highly specialized, though practically valuable, incremental improvement in solar irradiance forecasting. Paper 1's broader applicability and foundational novelty give it a higher potential for widespread scientific impact across multiple disciplines.

vs. QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

claude-opus-4.6 · 6/5/2026

Brick-Composer introduces a novel problem formulation (brick assembly as sequential decision-making for MLLMs), a new benchmark (BC-Bench), and a learning framework combining multiple training signals. It opens a new research direction at the intersection of embodied AI, spatial reasoning, and construction, with broad potential applications in robotics and manufacturing. While QCFuse is a solid engineering contribution optimizing RAG serving efficiency, it addresses an incremental improvement in cache fusion for existing systems. Brick-Composer's novelty, benchmark contribution, and cross-disciplinary impact give it higher potential.

vs. Knowledge Activation: AI Skills as the Institutional Knowledge Primitive for Agentic Software Development

gpt-5.2 · 6/5/2026

Paper 1 has higher scientific impact potential due to clearer technical novelty and broader research relevance: it introduces a new benchmark (BC-Bench) for a hard, emerging capability (MLLM grounded assembly), plus a concrete training framework with multiple supervision signals and measurable performance gains. It advances embodied/spatial reasoning and could influence robotics, vision-language learning, and evaluation methodologies. Paper 2 targets an important applied problem in enterprise workflows and reports a deployment study, but its contribution is more systems/knowledge-management architecture and standardization, with less generalizable scientific methodology and fewer transferable algorithmic insights.

vs. Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack

claude-opus-4.6 · 6/5/2026

Paper 2 reveals a fundamental paradox in LLM safety alignment—that improving safety awareness inherently increases vulnerability to a novel attack vector. This has broader, more immediate impact across the entire AI safety community, affecting all aligned LLMs including frontier models like GPT-5 and Claude. The finding challenges core assumptions in current alignment paradigms and has urgent implications for AI policy and deployment. Paper 1, while interesting, addresses a narrower robotics/assembly domain with incremental progress (15% success rate), limiting its broader scientific influence.

vs. Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

claude-opus-4.6 · 6/5/2026

Paper 1 addresses a foundational systems-level challenge for the rapidly growing LLM agent ecosystem. Its systematic taxonomy, profiling methodology, and actionable design recommendations for agent memory have broad applicability across all agent-based AI systems, influencing infrastructure decisions at scale. Paper 2, while creative in applying MLLMs to brick assembly, targets a narrower robotics/embodied AI niche with modest results (~15% step success). Paper 1's timeliness, breadth of impact across the entire agent systems community, and practical utility for deployment give it significantly higher potential impact.

vs. Beyond Semantic Organization: Memory as Execution State Management for Long-Horizon Agents

gpt-5.2 · 6/5/2026

Paper 2 has higher likely impact: it proposes a broadly applicable, model-agnostic memory/state management paradigm for long-horizon LLM agents, addressing a timely bottleneck (state coherence, error isolation, and context cost) across many domains (tool use, workflows, coding, planning). The hierarchical state-tree with explicit operations is a clear methodological contribution with strong efficiency gains and sizable success-rate improvements on a relevant benchmark. Paper 1 is novel and valuable for embodied assembly with new benchmarking, but its impact is narrower (brick-like assembly) and current gains still leave low end-to-end success.

vs. GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection

claude-opus-4.6 · 6/5/2026

Brick-Composer introduces a novel problem formulation (brick assembly as sequential decision-making for MLLMs), a new benchmark (BC-Bench), and a learning framework with physically grounded training signals. It opens a new research direction connecting MLLMs with robotic assembly and spatial reasoning, with broad implications for robotics, manufacturing, and embodied AI. GuardNet, while practically useful, presents an incremental contribution using established techniques (BiLSTM ensembles) for prompt injection detection, and acknowledges that larger LLMs still outperform it on key metrics.

vs. Optimal Transport-based Permutation-Invariant Bayesian Optimization of Offshore Wind Farm Layouts

gpt-5.2 · 6/5/2026

Paper 2 likely has higher scientific impact due to stronger timeliness and broader cross-field relevance: it targets MLLMs, embodied/robotic assembly, and introduces BC-Bench, a reusable benchmark that can catalyze follow-on work. Its framework (human demos + world feedback + synthetic experience) is broadly applicable to grounded action learning beyond bricks. Paper 1 is novel and rigorous for a high-value domain (wind farm layout) but is more specialized; its methodological contribution (permutation-invariant BO via optimal transport) may see narrower adoption compared to benchmarked, generalizable advances in multimodal agent assembly.

vs. Agentic Molecular Recovery via Molecule-Aware Exploration

gemini-3.1 · 6/5/2026

Paper 1 tackles a foundational challenge in embodied AI and spatial reasoning, bridging multimodal LLMs with physical assembly. Its introduction of a novel benchmark and learning framework offers broad, transformative applications in robotics and manufacturing. While Paper 2 addresses a valuable problem in computational chemistry, Paper 1's focus on equipping AI with generalizable, physically grounded construction skills represents a broader leap in capability toward general-purpose AI agents.