Architect-Ant: Editable Automatic Furnishing of Architectural Floor Plans

Fedor Rodionov, Aleksandar Cvejic, Michael Birsak, John Femiani, Peter Wonka

Jun 9, 2026 · arXiv:2606.10953v1

cs.AI · cs.CV

#2968 of 3489 · Artificial Intelligence

Tournament Score: 1290 ± 44 (scale: 1050 to 1800)

Win Rate: 30%

Rating: 5.5 / 10

Significance 5 · Rigor 6 · Novelty 5 · Clarity 7.5

Abstract

Furnished floor plans are fundamental to real estate visualization, interior design, and architectural workflows. However, progress in automatic furniture arrangement has been limited by the lack of real, professionally designed floor-plan datasets with object-level furniture annotations. To address this gap, we introduce AntPlan-270, a curated dataset of 270 architectural floor plans with per-room furniture bounding box annotations across ten residential room categories. Building on this dataset, we present Architect-Ant, an editable automatic furnishing framework powered by a fine-tuned vision-language model. Furniture layouts are represented using a compact, coordinate-based domain-specific language (DSL) that encodes object categories and placements relative to the room geometry. To improve spatial reasoning, we generate procedural reasoning traces that capture architectural constraints such as wall alignment, door and window clearance, circulation, fixture compatibility, and room-specific furniture inventories, and use them to supervise fine-tuning of the model. We then apply preference optimization over candidate object placements to further refine layout quality. The generated DSL can be rasterized into semantic masks and used to condition a Flux-based LoRA renderer, producing realistic blueprint-style furnished floor-plan images while preserving the editable symbolic layout. Experiments on layout furnishing show that Architect-Ant produces geometrically valid and functionally plausible layouts, and suggest a scalable path for furnishing larger structure-only floor-plan datasets.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Architect-Ant

1. Core Contribution

Architect-Ant addresses the problem of automatic furniture placement in 2D architectural floor plans—a practically important but under-explored niche compared to 3D indoor scene synthesis. The paper makes three interlinked contributions: (1) AntPlan-270, a dataset of 270 real architectural floor plans with per-room furniture bounding-box pseudo-labels across ten residential room categories; (2) a structured generation framework that represents furniture layouts as a coordinate-based DSL and uses a fine-tuned vision-language model (Qwen3.5-9B) with LoRA adapters to generate placements; and (3) a training pipeline combining supervised fine-tuning with procedural reasoning traces (including fail-and-fix recovery augmentation) and direct preference optimization (DPO) driven by a deterministic rule-based scorer encoding architectural constraints.
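To make the idea of a coordinate-based DSL concrete, here is a minimal sketch of a parser and serializer. The review does not reproduce the paper's actual syntax, so the format below is an assumption: one object per line as `category x0 y0 x1 y1`, with axis-aligned coordinates normalized to the room's bounding box. The names `Box`, `parse_layout`, and `serialize_layout` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """One furniture item; coordinates are room-normalized to [0, 1] (assumed)."""
    category: str
    x0: float
    y0: float
    x1: float
    y1: float


def parse_layout(text: str) -> list[Box]:
    """Parse the hypothetical 'category x0 y0 x1 y1' line format."""
    boxes = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, *coords = line.split()
        if len(coords) != 4:
            raise ValueError(f"expected 4 coordinates, got: {line!r}")
        x0, y0, x1, y1 = map(float, coords)
        # Reject degenerate boxes and boxes that leave the room.
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"box outside room or degenerate: {line!r}")
        boxes.append(Box(name, x0, y0, x1, y1))
    return boxes


def serialize_layout(boxes: list[Box]) -> str:
    """Write boxes back to text, so an edited layout round-trips."""
    return "\n".join(
        f"{b.category} {b.x0:.2f} {b.y0:.2f} {b.x1:.2f} {b.y1:.2f}" for b in boxes
    )
```

Because the layout stays symbolic, an edit such as moving or deleting one item is a one-line change to the text rather than a pixel-level operation.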

The key conceptual insight is moving constraint enforcement *into* the generative model's training distribution via preference optimization, rather than relying on post-hoc solvers or repair steps—a meaningful distinction from prior constraint-based and LLM-driven layout methods.

2. Methodological Rigor

The methodology is well-structured with clear ablations isolating each pipeline component (Table 5): zero-shot → SFT without recovery → SFT with recovery → SFT with image input → synthetic-pair DPO → model-pair DPO. The progression demonstrates incremental value at each stage. The distinction between synthetic-pair and model-pair DPO is particularly insightful—the authors honestly report that model-pair DPO achieves higher rule scores but produces qualitatively worse layouts (reward hacking), leading them to adopt the more conservative synthetic-pair approach.

However, several methodological concerns emerge:

Dataset scale: 270 floor plans yielding ~1,351 room samples (before augmentation) is quite small. The pseudo-label pipeline (detector bootstrapped from hand-labeled subset, then manually reviewed) introduces noise that the authors acknowledge but cannot fully quantify.

Evaluation limitations: The rule-based scorer serves double duty as both the DPO training signal and a primary evaluation metric, creating circularity. The VLM-as-judge (Gemini 3 Flash) provides independence but has acknowledged calibration issues, especially on kitchens. No human evaluation is conducted.

Axis-aligned bounding boxes only: The DSL lacks rotation, which is a significant simplification for furniture placement (e.g., angled sofas, rotated desks).

Out-of-distribution evaluation: While testing on CubiCasa5K rooms is appropriate, the comparison with frontier models (Kimi K2.5, GLM-5V-Turbo) uses different candidate budgets (K=10 vs K=2), making direct comparison difficult despite the authors' acknowledgment.
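The candidate-budget concern can be illustrated with a small calculation. The sketch assumes each system samples K candidates and reports the best one under some score; the review does not state the selection rule, so this is an assumption. The function name and the toy score lists are illustrative only.

```python
from collections import Counter


def expected_best_of_k(scores: list[float], k: int) -> float:
    """Exact E[max] of k i.i.d. draws (with replacement) from an empirical score list."""
    n = len(scores)
    counts = Counter(scores)
    expected, cumulative = 0.0, 0
    for value in sorted(counts):
        previous = cumulative
        cumulative += counts[value]
        # P(max == value) = F(value)^k - F(value-)^k
        expected += value * ((cumulative / n) ** k - (previous / n) ** k)
    return expected


# Same per-sample score distribution, different budgets (toy numbers).
toy_scores = [0.2, 0.5, 0.9]
budget_small = expected_best_of_k(toy_scores, 2)
budget_large = expected_best_of_k(toy_scores, 10)
```

Under this assumption, two systems with identical per-sample quality report different best-of-K scores whenever their budgets differ, which is why results at K=10 and K=2 are hard to compare directly.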

3. Potential Impact

Practical applications in real estate visualization, interior design, and architectural CAD workflows are clear and commercially relevant. The editable DSL representation is a strength—unlike pixel-based approaches, it supports downstream modifications, a genuine requirement for professional workflows.

Research impact is more moderate. The paper occupies a narrow intersection: 2D floor-plan furnishing with bounding boxes. Most active research in indoor scene synthesis operates in 3D (3D-FRONT, ScanNet ecosystems). The contribution may have limited uptake because:

The dataset is small and based on pseudo-labels rather than clean annotations.

The method is specific to axis-aligned 2D boxes in residential rooms.

The rendering pipeline (FLUX LoRA) is a visualization convenience rather than a technical contribution.
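For context on the rendering step, the abstract says the DSL is rasterized into semantic masks that condition the renderer. A minimal sketch of such a rasterization follows. The class-id mapping, grid size, and function name are assumptions, and the paper's actual mask format is not described in this review.

```python
def rasterize(boxes, class_ids, size=8):
    """Paint axis-aligned boxes into a size x size grid of integer class labels.

    boxes: iterable of (category, x0, y0, x1, y1) in room-normalized [0, 1] coords.
    class_ids: hypothetical mapping from category name to a nonzero label.
    Label 0 is background; later boxes overwrite earlier ones.
    """
    mask = [[0] * size for _ in range(size)]
    for category, x0, y0, x1, y1 in boxes:
        label = class_ids[category]
        # Snap normalized coordinates to the nearest grid line.
        c0, c1 = round(x0 * size), round(x1 * size)
        r0, r1 = round(y0 * size), round(y1 * size)
        for r in range(max(r0, 0), min(r1, size)):
            for c in range(max(c0, 0), min(c1, size)):
                mask[r][c] = label
    return mask


demo_mask = rasterize(
    [("bed", 0.0, 0.0, 0.5, 0.5), ("desk", 0.75, 0.75, 1.0, 1.0)],
    {"bed": 1, "desk": 2},
    size=4,
)
```

The mask is derived from the symbolic layout, so re-rendering after an edit only requires re-running this step on the modified boxes.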

The broader principle—using rule-derived preferences to train structured generators when clean demonstrations are scarce—has wider applicability to constrained layout problems (chip design, UI layout, warehouse planning), though the paper doesn't empirically demonstrate transfer.

4. Timeliness & Relevance

The paper is timely in two respects. First, applying VLMs/LLMs to structured spatial reasoning is an active frontier, and prior work (FloorplanQA, LayoutGPT) has exposed brittleness that motivates task-specific adaptation. Second, preference optimization (DPO) from programmatic verifiers is a growing paradigm (code, math), and extending it to geometric layout is natural but not yet well-explored.
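For reference, the standard DPO objective is given below. The review does not state whether the paper uses exactly this form or which value of $\beta$ it chooses, so treat it as the generic formulation. Here $x$ would be the room input, $y_w$ the layout preferred by the rule scorer, $y_l$ the rejected layout, $\pi_\theta$ the fine-tuned model, and $\pi_{\mathrm{ref}}$ the frozen reference model.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

With a programmatic verifier, the pairs $(y_w, y_l)$ come from the scorer's ranking rather than from human annotators.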

The gap the paper identifies—lack of real 2D furnished floor-plan data—is genuine. However, whether 270 plans constitute a sufficient solution is debatable.

5. Strengths & Limitations

Strengths:

Clean problem formulation: separating structure generation from furnishing, and keeping layouts as editable symbolic objects rather than pixels.

Honest reporting of failure modes: reward hacking with model-pair DPO, kitchen difficulties, VLM judge calibration issues.

The synthetic-pair DPO construction (single bounding-box perturbation) is clever and well-motivated, preventing the model from exploiting surface-form shortcuts.

Comprehensive rule scorer with interpretable, decomposable penalties.

Thorough ablation study.
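The two strengths concerning the rule scorer and the synthetic pairs can be sketched together. This is a simplified illustration under stated assumptions, not the paper's scorer: the three penalty terms, the unit-square room, the single door box, and all names (`score_layout`, `make_preference_pair`) are hypothetical.

```python
import random
from typing import NamedTuple


class Box(NamedTuple):
    category: str
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_area(a: Box, b: Box) -> float:
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return max(w, 0.0) * max(h, 0.0)


def out_of_room_area(b: Box) -> float:
    """Area of the box lying outside the unit-square room."""
    inside_w = max(min(b.x1, 1.0) - max(b.x0, 0.0), 0.0)
    inside_h = max(min(b.y1, 1.0) - max(b.y0, 0.0), 0.0)
    return (b.x1 - b.x0) * (b.y1 - b.y0) - inside_w * inside_h


def score_layout(layout: list[Box], door: Box) -> dict[str, float]:
    """Return named penalties (hypothetical terms); a lower total is better."""
    penalties = {
        "furniture_overlap": sum(
            overlap_area(a, b) for i, a in enumerate(layout) for b in layout[i + 1:]
        ),
        "door_blocked": sum(overlap_area(b, door) for b in layout),
        "out_of_room": sum(out_of_room_area(b) for b in layout),
    }
    penalties["total"] = sum(penalties.values())
    return penalties


def make_preference_pair(layout, door, rng, max_shift=0.3, tries=50):
    """Shift exactly one box; keep the pair only if the scorer prefers the original."""
    base = score_layout(layout, door)["total"]
    for _ in range(tries):
        i = rng.randrange(len(layout))
        dx = rng.uniform(-max_shift, max_shift)
        dy = rng.uniform(-max_shift, max_shift)
        b = layout[i]
        moved = b._replace(x0=b.x0 + dx, x1=b.x1 + dx, y0=b.y0 + dy, y1=b.y1 + dy)
        rejected = layout[:i] + [moved] + layout[i + 1:]
        if score_layout(rejected, door)["total"] > base:
            return layout, rejected  # (chosen, rejected)
    return None  # no strictly worse perturbation found


room = [Box("bed", 0.1, 0.1, 0.6, 0.5), Box("wardrobe", 0.7, 0.05, 0.95, 0.3)]
door_zone = Box("door", 0.0, 0.8, 0.1, 1.0)
pair = make_preference_pair(room, door_zone, random.Random(0))
```

Because the chosen and rejected layouts differ in a single box, the two serialized outputs are nearly identical in length and format, which is one plausible reading of how the construction avoids surface-form shortcuts. Returning the penalties as a dictionary keeps each term inspectable.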

Limitations:

Small dataset with pseudo-labels limits generalization claims.

No orientation/rotation modeling—axis-aligned boxes are a significant constraint.

No human evaluation study, which is critical for a design-oriented task.

The four-room-type scope (bedroom, bathroom, kitchen, living room) is narrow; commercial and mixed-use spaces are absent.

Scalability is claimed but not demonstrated beyond 270 plans.

The FLUX LoRA rendering, while visually appealing, is not evaluated rigorously and serves primarily as visualization.

Comparison fairness issues with frontier baselines due to different candidate budgets.

Limited novelty in individual components—the contribution is more in their combination for this specific domain.

Overall Assessment

Architect-Ant is a competent systems paper that assembles existing techniques (VLM fine-tuning, DPO, rule-based scoring) into a practical pipeline for a well-defined problem. The work is methodologically careful with good ablations and honest limitations. However, the dataset is small, the representation is simplified (no rotation), evaluation relies heavily on the training signal itself, and the impact may be limited by the narrow 2D bounding-box setting when the field is moving toward richer 3D representations. The paper contributes a useful proof-of-concept for rule-guided preference optimization in spatial layout but falls short of a transformative advance.

Rating: 5.5 / 10

Significance 5 · Rigor 6 · Novelty 5 · Clarity 7.5

Generated Jun 10, 2026

Comparison History (20)

Lost vs. Knowing When to Ask: Self-Gated Clarification for Hierarchical Language Agents

Paper 1 addresses a fundamental bottleneck in AI agents—recognizing uncertainty and seeking clarification—by integrating help-seeking directly into the action space. This approach has broad, cross-disciplinary implications for improving the reliability and safety of autonomous LLM agents. In contrast, Paper 2 presents a valuable but highly domain-specific dataset and pipeline for architectural floor plans, which has practical industry applications but narrower foundational scientific impact.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. Nonslop: A Gamified Experiment in Human-AI Collaborative Writing

Paper 2 has higher likely scientific impact due to a clearer technical contribution (new annotated dataset + editable furnishing framework), stronger methodological components (DSL representation, constraint-based reasoning traces, preference optimization, and rendering pipeline), and direct real-world applicability in architecture/real estate/interior design. Its dataset addresses a key bottleneck and can enable follow-on work, broadening impact across vision-language modeling, structured generation, and design automation. Paper 1 is conceptually novel and timely for HCI, but its smaller-scale gamified study (74 participants) suggests more limited generalizability and downstream reuse compared with a reusable dataset + model framework.

gpt-5.2 · Jun 11, 2026

Won vs. Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football

Paper 1 offers a highly practical and scalable solution to a major bottleneck in architecture and real estate design. By combining a novel dataset, a domain-specific language, and vision-language models for procedural reasoning, it presents a comprehensive neuro-symbolic framework. While Paper 2 introduces an innovative multi-agent world model approach for sports analytics, Paper 1 has broader immediate real-world applications and commercial potential across multiple large-scale industries.