ControlRoom3D 🤖
Room Generation using Semantic Proxy Rooms

published at CVPR 2024

¹Meta GenAI, ²RWTH Aachen University, ³Technical University of Munich
*Work performed during internship at Meta GenAI.

ControlRoom3D creates diverse and plausible 3D room meshes aligning well with user-defined room layouts and textual descriptions of the room style.


Manually creating 3D environments for AR/VR applications is a complex process requiring expert knowledge in 3D modeling software. Pioneering works facilitate this process by generating room meshes conditioned on textual style descriptions. Yet, many of these automatically generated 3D meshes do not adhere to typical room layouts, compromising their plausibility, e.g., by placing several beds in one bedroom. To address these challenges, we present ControlRoom3D, a novel method to generate high-quality room meshes. Central to our approach is a user-defined 3D semantic proxy room that outlines a rough room layout based on semantic bounding boxes and a textual description of the overall room style. Our key insight is that when rendered to 2D, this 3D representation provides valuable geometric and semantic information to control powerful 2D models to generate 3D-consistent textures and geometry that align well with the proxy room. Backed up by an extensive study including quantitative metrics and qualitative user evaluations, our method generates diverse and globally plausible 3D room meshes, thus empowering users to design 3D rooms effortlessly without specialized knowledge.




Geometry Alignment

Scale ambiguity leads to significant inaccuracies in state-of-the-art metric depth estimators such as ZoeDepth. In contrast, our proposed depth alignment module iteratively optimizes the alignment loss to achieve strong alignment with the proxy room.

[Image comparison: No Optimization vs. After Depth Alignment]
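
To make the idea of iteratively optimizing an alignment loss concrete, here is a minimal sketch. It is not the paper's implementation: it assumes that the proxy room has been rendered into hypothetical per-pixel `near` and `far` depth bounds, that the optimized parameters are a single global scale and shift of the predicted depth, and that the loss is a squared hinge penalty. This page does not specify any of these choices.

```python
import numpy as np

def alignment_loss(depth, near, far):
    """Mean squared hinge penalty for depth values outside [near, far]."""
    below = np.clip(near - depth, 0.0, None)  # in front of the near bound
    above = np.clip(depth - far, 0.0, None)   # behind the far bound
    return float(np.mean(below**2 + above**2))

def align_depth(pred, near, far, steps=2000, lr=0.2):
    """Fit a global scale and shift by gradient descent (illustrative only).

    The step size suits depth values of order one; larger values
    would need a smaller lr.
    """
    scale, shift = 1.0, 0.0
    for _ in range(steps):
        depth = scale * pred + shift
        below = np.clip(near - depth, 0.0, None)
        above = np.clip(depth - far, 0.0, None)
        grad_depth = 2.0 * (above - below)        # per-pixel dL/d(depth)
        scale -= lr * np.mean(grad_depth * pred)  # chain rule: d(depth)/d(scale) = pred
        shift -= lr * np.mean(grad_depth)         # chain rule: d(depth)/d(shift) = 1
    return scale * pred + shift, scale, shift
```

Calling `align_depth(pred, near, far)` on a prediction whose scale is off by a constant factor should drive `alignment_loss` close to zero, mimicking the move from the "No Optimization" view to the "After Depth Alignment" view.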

SAM Masks

We leverage SAM to obtain pixel-precise instance masks for each object. For pixels located within the rendered bounding box but outside the SAM mask, we assign the near depth value to the far depth. Including SAM masks leads to sharper 3D object boundaries, resulting in a more seamless integration into the 3D room mesh.
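
The mask-based adjustment can be sketched as follows. This is one possible reading of the sentence above, not a confirmed detail of the method: it assumes that, for pixels inside the rendered bounding box but outside the SAM mask, the near bound is raised to the far bound so that those pixels are no longer pulled into the object's volume. The function and array names (`refine_bounds`, `box_mask`, `sam_mask`) are hypothetical.

```python
import numpy as np

def refine_bounds(near, far, box_mask, sam_mask):
    """Tighten per-pixel depth bounds with an instance mask (assumed reading).

    near, far : float arrays with the depth bounds from the rendered box
    box_mask  : bool array, True where the rendered bounding box covers a pixel
    sam_mask  : bool array, True where the instance mask covers a pixel
    """
    # Pixels the box covers but the instance mask does not.
    outside_object = box_mask & ~sam_mask
    # Assumption: the near bound takes the far value for those pixels.
    near_refined = np.where(outside_object, far, near)
    return near_refined, far
```

Pixels inside the mask keep their original bounds, so only the region around the object silhouette is affected.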


Normal Loss

Although the depth alignment loss effectively aligns the frame with the 3D proxy room, it may occasionally distort the surface of objects to fit them within their bounding boxes. To counter this, we introduce the normal preservation loss, retaining the original shape of the objects.
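
A normal-based regularizer of this kind can be sketched as below. The exact formulation is not given on this page, so this is an assumption: normals are estimated from the depth map with image-space finite differences, and the loss is one minus the cosine similarity between the normals of the aligned and the original depth.

```python
import numpy as np

def normals_from_depth(depth):
    """Approximate unit surface normals from image-space depth gradients."""
    dz_dy, dz_dx = np.gradient(depth)  # derivatives along rows, then columns
    normals = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return normals / np.linalg.norm(normals, axis=-1, keepdims=True)

def normal_preservation_loss(depth_aligned, depth_original):
    """Mean (1 - cosine similarity) between the two normal maps."""
    n_aligned = normals_from_depth(depth_aligned)
    n_original = normals_from_depth(depth_original)
    cosine = np.sum(n_aligned * n_original, axis=-1)
    return float(np.mean(1.0 - cosine))
```

Under this sketch, a pure depth shift leaves the loss at zero because the gradients are unchanged, while flattening or stretching a surface changes its normals and is penalized.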



  author    = {Schult, Jonas and Tsai, Sam and H\"ollein, Lukas and Wu, Bichen and Wang, Jialiang and Ma, Chih-Yao and Li, Kunpeng and Wang, Xiaofang and Wimbauer, Felix and He, Zijian and Zhang, Peizhao and Leibe, Bastian and Vajda, Peter and Hou, Ji},
  title     = {ControlRoom3D: Room Generation using Semantic Proxy Rooms},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024},