Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • Text2Room is a method for generating 3D meshes from text prompts.
  • Pre-trained 2D text-to-image models are used to create a sequence of images from different poses.
  • Monocular depth estimation and a text-conditioned inpainting model are used to lift the images into a 3D scene representation.
  • A tailored viewpoint selection is used to fuse the content of each image into a seamless, textured 3D mesh.
  • Text2Room is the first method to generate room-scale 3D geometry with compelling textures from only text as input.

Paper Content


  • Generating 3D meshes from text
  • 2D text-to-image models used to create 3D models
  • 3D datasets are smaller than 2D datasets
  • Iterative optimization problem in image domain to generate 3D objects
  • Generating large scenes with dense and coherent content
  • Merging generated content with existing mesh to create smooth transitions
  • Text-based Generation has seen advances due to datasets and model architectures
  • Diffusion models have achieved impressive results on image synthesis
  • Text-to-image models can generate diverse, high-fidelity, and controllable outputs
  • Text-based generation has been extended to other modalities including audio, video, and 4D fields
  • Text-to-3D methods use 3D data for supervised training or optimization in the image domain
  • Recent methods combine text-to-image diffusion models and neural radiance fields to generate 3D objects
  • Several methods have been proposed for novel-view synthesis from a single image
  • Other methods optimize a neural 3D representation of an object
  • Perpetual view generation synthesizes videos from a single RGB image
  • Fridman et al. create 3D scenes from text, focusing on zoom-out video generation
  • We generate complete, textured 3D room geometry from arbitrary trajectories


  • PureClipNeRF creates key objects but not complete 3D structure
  • Outpainting creates high-detail textures but has occlusion issues
  • Text2Light and Blockade create high-detail 360 view but have occlusion issues
  • Our approach creates high-detail textures and geometry without holes

Iterative 3d scene generation

  • Scene is represented as a mesh with vertices, colors, and faces
  • Input to method is set of text prompts and poses
  • Iterative scene generation process follows render-refine-repeat pattern
  • For each step of generation, render current scene from novel viewpoint
  • Use text-to-image model to inpaint unobserved pixels
  • Inpaint unobserved depth with monocular depth estimator
  • Combine novel content with existing mesh using fusion scheme

Depth alignment step

  • Predicted depth is used to lift 2D image into 3D
  • Similar regions in a scene should be placed at similar depth
  • Depth alignment is done in two stages
  • State-of-the-art depth inpainting network is used
  • Optimization for scale and shift parameters is done
  • Smoothing is applied to extracted depth

Mesh fusion step

  • Insert new content into the scene
  • Backproject image-space pixels into world-space point cloud
  • Use triangulation scheme to create 3D geometry
  • Filter faces based on edge length and angle between surface normal and viewing direction
  • Fuse newly generated mesh patch with existing geometry

Two-stage viewpoint selection

  • Choice of text prompts and camera poses used to create indoor scene
  • Can lead to stretch and hole artifacts if poses chosen carelessly
  • Two-stage viewpoint selection strategy proposed to sample optimal positions and refine empty regions
  • First stage creates main parts of scene, including general layout and furniture
  • Second stage samples additional poses to inpaint scene and close any remaining holes


  • Implemented mesh rasterization and fusion with Pytorch3D
  • Used Stable Diffusion model for text-to-image
  • Used Iron-Depth model for monocular depth estimator
  • Generated prompts using guidelines suggested by Pierre
  • Compared against four related methods

Qualitative results

  • Showed top-down views and RGB renderings from within
  • Showed additional results of method in Figure 5

Quantitative results

  • We render 60 images from novel viewpoints for each scene to calculate 2D metrics
  • Stretched-out geometry and holes in 3D geometry lead to lower scores for baselines
  • Our approach achieves highest scores due to accurate and complete geometry and RGB texture
  • We use separate text prompts for different poses to combine multiple text prompts
  • We use depth alignment, mesh fusion and two-stage viewpoint selection
  • Our method can fail under certain conditions, such as stretched-out geometry or incomplete holes
  • We conduct a user study to score Perceptual Quality and 3D Structure Completeness
  • We use a tailored two-stage viewpoint selection scheme to generate the scene
  • We use a monocular depth inpainting network and align rendered depth and inpainted depth
  • We use a classical inpainting algorithm to inpaint small holes