MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

Xiaoyu Dong, Zhi Li, Xiao-Ming Wu

May 27, 2026

arXiv:2605.28579v1

cs.AI (primary)

#777 of 2682 · Artificial Intelligence

Tournament Score: 1454 ± 49 (scale shown from 1050 to 1800)

Win Rate: 69%

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.8 · Novelty 6.5 · Clarity 7

Abstract

Large language models (LLMs) have recently advanced text-driven 3D generation, yet Text-to-CAD remains far from supporting industrial product design. Existing benchmarks focus primarily on generating single-part CAD models and evaluate them using geometric similarity metrics that fail to capture functionality, manufacturability, and assemblability. To address this gap, we introduce MUSE, a Text-to-CAD benchmark focused on complex, editable boundary representation (B-Rep) assemblies. MUSE pairs practical design instances with structured Design Specifications and evaluates generated models through a three-stage protocol: code check, geometric check, and design-intent alignment. The final stage uses design-specific rubrics to assess functionality, manufacturability, and assemblability, moving beyond shape matching toward practical design quality. To enable scalable evaluation, we use a rubric-based visual language model (VLM) judge and validate its reliability through human annotation. Experiments on closed-source and open-source LLMs reveal a clear failure cascade from executable code to valid geometry and finally to engineering-ready design, with even the strongest models achieving limited success on fine-grained engineering criteria. Together, MUSE provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design. Our project website, including the leaderboard, dataset, and code, is available at https://dong7313.github.io/muse-benchmark/.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: MUSE Benchmark

1. Core Contribution

MUSE addresses a genuine gap in the Text-to-CAD evaluation landscape by shifting the focus from geometric shape matching to engineering design quality. The benchmark introduces three key innovations: (1) a structured Design Specification formalism that decomposes design intent into assembly graphs, valid parameter spaces, and manufacturing plans; (2) a three-stage evaluation protocol (code check → geometric check → design-intent alignment) that reveals where LLMs fail in the generation pipeline; and (3) a rubric-based VLM judge for scalable assessment of functionality, manufacturability, and assemblability.

The core problem is well-motivated: existing Text-to-CAD benchmarks (DeepCAD, Text2CAD, CADFusion) focus on single-part models evaluated by Chamfer Distance, which is insufficient for real engineering design. MUSE explicitly targets multi-component assemblies with manufacturing constraints — a substantially harder and more practical problem.

2. Methodological Rigor

Strengths in methodology:

The three-stage funnel evaluation is well-designed and reveals interpretable failure modes. The strict sequential gating (failing early stages zeros downstream scores) is appropriate for engineering contexts.

The geometric validity checks (watertightness, manifold, self-intersection, overlap) are standard B-Rep quality metrics, properly implemented.

The VLM judge validation against human annotation is reasonably thorough: 20 design instances, 4 annotators across 2 rounds, with correlations reported at three granularities and bootstrap confidence intervals.
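The sequential gating described above can be illustrated with a minimal sketch. It is not taken from the MUSE codebase: the names `StageResult`, `gated_score`, and `funnel_summary` are hypothetical, the check names simply mirror the four geometric checks listed in this review, and the rubric score is assumed to lie in [0, 1].

```python
from dataclasses import dataclass, field

# Hypothetical check names, mirroring the four checks named in the review.
GEOMETRY_CHECKS = ("watertight", "manifold", "self_intersection_free", "overlap_free")


@dataclass
class StageResult:
    code_ok: bool                                   # stage 1: script executed
    geometry: dict = field(default_factory=dict)    # stage 2: check name -> passed?
    intent_score: float = 0.0                       # stage 3: rubric score, assumed in [0, 1]

    @property
    def geometry_ok(self) -> bool:
        # A missing check counts as a failure.
        return all(self.geometry.get(name, False) for name in GEOMETRY_CHECKS)


def gated_score(r: StageResult) -> float:
    """Final score under strict gating: any earlier failure zeros the result."""
    if not r.code_ok or not r.geometry_ok:
        return 0.0
    return r.intent_score


def funnel_summary(results):
    """Pass rates per stage plus the mean gated score."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    code = sum(r.code_ok for r in results)
    geom = sum(r.code_ok and r.geometry_ok for r in results)
    final = sum(gated_score(r) for r in results)
    return {"code_pass": code / n, "geometry_pass": geom / n, "final": final / n}
```

Under this scheme an instance with a high rubric score but a single failed geometric check contributes zero to the final score, which is what makes the funnel expose where each model drops out.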

Weaknesses in methodology:

The benchmark contains only 106 design instances, which is quite small. While the authors argue each instance is costly to create (requiring expert designers), this limits statistical power and coverage.

The dataset construction relies heavily on LLM augmentation (Claude Opus for script augmentation, GPT-5.5 for Design Specifications), creating potential circularity when evaluating these same model families.

The VLM judge validation uses only 20 design instances (≈19% of the benchmark), which is a modest sample for establishing reliability. The Pearson r of 0.713 at the sub-criteria level is reasonable but not exceptional.

Engineering drawings rendered as SVGs, while better than perspective renders, still cannot capture all relevant manufacturing features (e.g., internal tolerances, material grain direction).

The authors acknowledge that no physical manufacturing validation was performed — all validation is by professional designers reviewing digital models.
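To make the concern about the 20-instance validation sample concrete, the sketch below shows one common way to attach a bootstrap confidence interval to a judge-versus-human Pearson correlation. It is an assumption about the general technique (a percentile bootstrap over paired scores), not the authors' procedure; the function names and defaults are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; NaN if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Pearson r, resampling (x, y) pairs together."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([x[i] for i in idx], [y[i] for i in idx])
        if not math.isnan(r):          # skip degenerate resamples
            stats.append(r)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi
```

With few paired items the interval returned by `bootstrap_ci` tends to be wide, which is why a point estimate such as r = 0.713 is best read together with its interval.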

3. Potential Impact

Direct impact on Text-to-CAD research: MUSE establishes a new evaluation paradigm that could redirect the field from shape-generation metrics toward engineering-grounded assessment. The "failure cascade" finding — that even the best models (GPT-5.5 at ~52% final score with Gemini judge, ~67% with GPT-5.5 judge) struggle with design intent after passing code and geometry checks — is a valuable diagnostic.

Impact on LLM evaluation: The rubric-based VLM judge methodology could transfer to other domains requiring structured, domain-specific evaluation beyond text quality.

Industrial relevance: The benchmark's emphasis on CNC milling, 3D printing, laser cutting, and real material constraints (timber, PLA, acrylic) makes it more relevant to actual product design than existing academic benchmarks. However, the complexity level (chairs, tables, card holders) remains relatively simple compared to industrial assemblies.

Limitations on impact: The benchmark is CadQuery-specific, which limits applicability to other CAD scripting environments (OpenSCAD, FreeCAD API, SolidWorks API). The 106-instance scale may be insufficient for fine-grained model development or training.

4. Timeliness & Relevance

This work is highly timely. The rapid adoption of LLMs for code generation has naturally extended to CAD scripting, and multiple concurrent works (ArtiCAD, CADSmith, EvoCAD, CADCodeVerify) address related problems. The community clearly needs better evaluation frameworks. MUSE fills a specific niche by focusing on assembly-level design with engineering constraints, which no existing benchmark adequately covers.

The paper is also well-positioned relative to the broader "LLM-as-judge" trend, applying structured rubric-based evaluation rather than holistic scoring.

5. Strengths & Limitations

Key Strengths:

Well-defined problem formulation: The Design Specification formalism (D, G, Ω, M) provides a clean mathematical framework for what constitutes a valid design.

Comprehensive evaluation: The six sub-criteria across three pillars (functionality, manufacturability, assemblability) are well-justified and cover distinct engineering concerns.

Transparent findings: The failure cascade analysis is genuinely informative — showing that Overlap Free is the hardest geometric check and that code execution ability doesn't predict geometric quality (RQ4) are actionable insights.

Reproducibility: Public release of dataset, code, and leaderboard supports community adoption.

Notable Weaknesses:

Small scale: 106 instances limits benchmark depth and model differentiation.

No physics-based validation: The Robust criterion uses qualitative LLM assessment rather than FEA or physics simulation, which undermines claims about structural evaluation.

Potential judge bias: Using GPT-5.5 as both a judge and an evaluated model raises concerns. Table 9 shows GPT-5.5 achieves 67.14% final score when judged by itself versus 52.36% when judged by Gemini — a ~15-point inflation suggesting self-preference bias.

Limited assembly complexity: Most instances appear to be furniture-like objects with <10 components. Industrial assemblies routinely involve hundreds of parts.

Missing baselines: No existing Text-to-CAD methods could be evaluated, making this purely an LLM benchmark rather than a method comparison. ArtiCAD's unavailability is noted but weakens the experimental contribution.
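The judge-bias concern above rests on simple arithmetic, sketched here using only the two Table 9 figures quoted in this review. The `(judge, model)` dictionary layout and the function name `judge_gap` are hypothetical.

```python
# Final scores (%) keyed by (judge, evaluated model); values as quoted in the review.
scores = {
    ("GPT-5.5", "GPT-5.5"): 67.14,
    ("Gemini", "GPT-5.5"): 52.36,
}


def judge_gap(scores, model, self_judge, other_judge):
    """Points by which `self_judge` scores `model` above `other_judge`."""
    return round(scores[(self_judge, model)] - scores[(other_judge, model)], 2)


gap = judge_gap(scores, "GPT-5.5", "GPT-5.5", "Gemini")
print(gap)  # 14.78
```

This gap alone cannot separate self-preference from a judge that is simply more lenient overall. Distinguishing the two would require comparing how each judge scores the other evaluated models as well.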

Overall Assessment

MUSE makes a solid conceptual contribution by reframing Text-to-CAD evaluation around engineering design quality. The three-stage protocol and rubric-based evaluation are well-conceived. However, the benchmark's small scale, reliance on LLM-generated components in its own construction pipeline, and lack of physics-based validation temper its immediate practical impact. It serves primarily as a proof-of-concept for engineering-grounded CAD benchmarking rather than a definitive evaluation resource. The findings about LLM failure modes in CAD generation are valuable but not deeply surprising.

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.8 · Novelty 6.5 · Clarity 7

Generated May 28, 2026

Comparison History (16)

vs. How Much is Brain Data Worth for Machine Learning?

claude-opus-4.6 · 5/28/2026

Paper 1 addresses a fundamental theoretical question in NeuroAI—quantifying the value of brain data for machine learning—by deriving analytical scaling laws and exchange rates. This provides a broadly applicable theoretical framework spanning neuroscience and ML, with potential to guide resource allocation in data collection across many domains. Paper 2, while practically useful, is a benchmark paper for a relatively narrow application (Text-to-CAD). Benchmarks can be impactful, but Paper 1's novel theoretical contributions and cross-disciplinary breadth give it higher potential for lasting scientific influence.

vs. Generative AI and the Productivity Divide: Human-AI Complementarities in Education

gpt-5.2 · 5/28/2026

Paper 2 likely has higher scientific impact due to its novel, field-defining benchmark for Text-to-CAD that targets real industrial constraints (manufacturability, functionality, assemblability) and introduces a scalable evaluation protocol with validated VLM-based judging. Benchmarks and leaderboards often become widely adopted infrastructure, catalyzing progress across ML, CAD/CAE, robotics, and manufacturing. Paper 1 is timely and methodologically strong (RCT) with clear managerial implications, but its impact is narrower to education/knowledge-work productivity and may be less reusable as a community standard than a benchmark/dataset.

vs. Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

claude-opus-4.6 · 5/28/2026

Paper 2 (ARR) introduces a more broadly applicable methodological framework that addresses a fundamental bottleneck in RLHF for multimodal generative models—transforming implicit preferences into explicit, interpretable rubrics for reward modeling. This has wide-reaching implications across text-to-image, image editing, and potentially other generative domains. The proposed Rubric Policy Optimization (RPO) offers a novel training paradigm. Paper 1 (MUSE), while valuable as a benchmark for Text-to-CAD, is more niche in scope, primarily serving the CAD/engineering design community. Paper 2's broader methodological contribution and cross-domain applicability give it higher potential impact.

vs. Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning

gpt-5.2 · 5/28/2026

Paper 2 (MUSE) has higher likely impact due to broader applicability and timeliness: text-to-CAD for manufacturable, functional assemblies targets a large industrial/design ecosystem and multiple research communities (LLMs, CAD/graphics, manufacturing, HCI). Its benchmark, multi-stage evaluation protocol, public leaderboard, and scalable VLM-judge validation can become shared infrastructure, accelerating progress and enabling standardized comparison. Paper 1 offers a strong, novel trustworthiness framing for legal AI with solver-grounding, but its domain specificity (law + statutes formalization) and deployment constraints likely limit breadth relative to a widely reusable CAD benchmark.

vs. Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

gemini-3.1 · 5/28/2026

Paper 2 addresses a critical safety and reliability flaw in LLM reasoning, with broad applicability across high-stakes domains like healthcare and law. Its Judge-Then-Solve framework improves both safety and inference efficiency for general AI systems. While Paper 1 introduces a highly rigorous and valuable benchmark for industrial Text-to-CAD, Paper 2's focus on fundamental AI reasoning control offers significantly wider multidisciplinary impact and timely relevance to the broader AI community.

vs. Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning

gpt-5.2 · 5/28/2026

Paper 1 is likely to have higher impact due to a clear, widely usable benchmark that targets an important industrial gap (manufacturable, functional, assemblable CAD assemblies) with structured specs and a multi-stage evaluation protocol beyond geometry. Benchmarks and leaderboards often catalyze broad progress across models and communities (CAD/CAE, graphics, robotics, manufacturing, LLM evaluation). It also offers direct real-world applicability in engineering design workflows. Paper 2 is novel mechanistic analysis for agents, but its impact is narrower and more dependent on adoption by the interpretability community.

vs. EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

claude-opus-4.6 · 5/28/2026

EnactToM addresses a fundamental gap in AI research—functional Theory of Mind in embodied agents—which has broad implications across multi-agent systems, human-AI collaboration, and cognitive science. Its finding that frontier models score 0% on functional ToM despite 45% on literal belief probes reveals a critical capability gap with wide relevance. The evolving benchmark design and detailed failure taxonomy provide actionable research directions. MUSE, while valuable for CAD/engineering applications, targets a narrower domain. EnactToM's contributions span multiple research communities (NLP, robotics, cognitive science, multi-agent systems), giving it broader potential impact.

vs. Dr-CiK: A Testbed for Foresight-Driven Agents

gpt-5.2 · 5/28/2026

Paper 1 (MUSE) is likely higher impact: it targets a major, under-benchmarked bottleneck for industrial-grade Text-to-CAD—assemblies with manufacturability/assemblability and design intent—introducing richer evaluation beyond geometric similarity and validating scalable rubric-based judging. This is timely for LLM-driven engineering workflows and could influence both CAD generation methods and evaluation standards in CAD/CAE, robotics, and manufacturing. Paper 2 (Dr-CiK) is a solid benchmark for context-retrieval forecasting agents, but its scope is narrower and closer to existing retrieval+forecasting paradigms, with likely more incremental methodological novelty.

vs. DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

gemini-3.1 · 5/28/2026

Paper 1 (MUSE) addresses a critical bottleneck in generative AI for physical design: bridging the gap between geometric similarity and real-world manufacturability, functionality, and assemblability. By providing a scalable, VLM-judged benchmark for complex CAD assemblies, it has significant potential to accelerate the adoption of text-to-CAD systems in industrial applications. While Paper 2 offers valuable insights into LLM limitations in operations research, Paper 1's focus on engineering-ready generative AI offers broader, more transformative cross-disciplinary impact across AI, mechanical engineering, and manufacturing.

vs. Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

claude-opus-4.6 · 5/28/2026

MUSE introduces a novel, concrete benchmark with dataset, code, and leaderboard for an important and timely problem (Text-to-CAD), addressing clear gaps in evaluating manufacturability, functionality, and assemblability. It provides empirical results across multiple LLMs and a validated evaluation framework. Paper 2, while raising important points about AI evaluation in low-resource contexts, is primarily a position/framework paper without new datasets, experiments, or concrete artifacts, limiting its immediate measurable scientific impact despite its relevance.

vs. MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents

gpt-5.2 · 5/28/2026

Paper 2 (MUSE) is likely to have higher scientific impact because it introduces a broadly useful benchmark and evaluation protocol for Text-to-CAD assemblies, targeting manufacturability/functionality/assemblability—key unmet needs for industrial relevance. Benchmarks often become community infrastructure, influencing many subsequent methods across CAD, robotics, manufacturing, and multimodal LLM evaluation. Its methodology (multi-stage checks, rubric-based VLM judge validated by humans, public leaderboard) supports rigor and adoption. Paper 1 is strong and application-relevant, but is more domain-specific (molecular design) and its gains may depend on engineering choices around agents/representations.

vs. Automatic Layer Selection for Hallucination Detection

gpt-5.2 · 5/28/2026

Paper 1 (MUSE) is likely higher impact due to its strong novelty and real-world relevance: it introduces a new benchmark and evaluation protocol that targets industrially critical properties (manufacturability, functionality, assemblability) for Text-to-CAD assemblies, moving the field beyond geometric similarity. Benchmarks often catalyze broad progress across academia and industry, and its rubric-based scalable evaluation plus reliability validation add rigor. Paper 2 offers a solid, practical method for hallucination detection (automatic layer selection via intrinsic dimension), but it is narrower in scope and less likely to reshape a domain compared to a widely adopted CAD benchmark.

vs. Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

claude-opus-4.6 · 5/28/2026

Paper 1 introduces a novel and concerning security threat ('Sleeper Attack') for LLM agents that persists across interactions—a fundamentally new attack paradigm with broad implications for AI safety. It formalizes the threat model, provides a comprehensive benchmark, and demonstrates vulnerability across seven major LLMs. Given the rapid deployment of LLM agents in real-world systems, this work has urgent, cross-cutting impact on AI security. Paper 2, while valuable for CAD/engineering evaluation, addresses a narrower application domain with less transformative implications for the broader AI research community.

vs. Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG

claude-opus-4.6 · 5/28/2026

MUSE addresses a more impactful and broadly relevant problem—bridging LLM-driven generation with industrial CAD design—spanning AI, manufacturing, and engineering. It introduces a comprehensive benchmark with practical evaluation criteria (functionality, manufacturability, assemblability) that goes beyond geometric similarity, with clear real-world applications in product design automation. Paper 1, while introducing a useful diagnostic concept (citation laundering) and benchmark for RAG evaluation, addresses a narrower methodological concern within NLP evaluation. MUSE's interdisciplinary reach, practical utility, and connection to industrial applications give it higher potential impact.

vs. Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

gpt-5.2 · 5/28/2026

Paper 2 (MUSE) likely has higher impact: it introduces a timely, broadly useful benchmark and evaluation protocol for Text-to-CAD that targets real industrial constraints (manufacturability, functionality, assemblability) and provides dataset, rubric-based evaluation, and a leaderboard—assets that can catalyze community progress across ML, CAD, robotics, and manufacturing. Paper 1 is practical and elegant, but is an incremental, narrow codec contribution in a mature area (embedding compression/ANN), with more limited cross-field reach.

vs. Utility-Aware Multimodal Contrastive Learning for Product Image Generation

gpt-5.2 · 5/28/2026

Paper 2 likely has higher scientific impact: it introduces a broadly useful benchmark (MUSE) and evaluation protocol for Text-to-CAD that targets real industrial criteria (manufacturability, functionality, assemblability) and provides dataset/code/leaderboard, enabling community-wide progress and reproducibility. Its rubric-based VLM judge plus human validation adds methodological rigor and scalability. The work is timely given rapid LLM advances and could influence CAD, graphics, HCI, and manufacturing research. Paper 1 is innovative and commercially relevant, but is more domain-specific (marketplace demand optimization) and may generalize less across fields.