Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Text2Room is a method for generating 3D meshes from text prompts.
- Pre-trained 2D text-to-image models are used to create a sequence of images from different poses.
- Monocular depth estimation and a text-conditioned inpainting model are used to lift the images into a 3D scene representation.
- A tailored viewpoint selection is used to fuse the content of each image into a seamless, textured 3D mesh.
- Text2Room is the first method to generate room-scale 3D geometry with compelling textures from only text as input.
Paper Content
Introduction
- Generating 3D meshes from text
- 2D text-to-image models are used to create 3D models
- 3D datasets are smaller than 2D datasets
- Existing approaches formulate an iterative optimization problem in the image domain to generate 3D objects
- Generating large scenes with dense and coherent content is a core challenge
- Generated content is merged with the existing mesh to create smooth transitions
Related work
- Text-based generation has seen advances due to datasets and model architectures
- Diffusion models have achieved impressive results on image synthesis
- Text-to-image models can generate diverse, high-fidelity, and controllable outputs
- Text-based generation has been extended to other modalities including audio, video, and 4D fields
- Text-to-3D methods use 3D data for supervised training or optimization in the image domain
- Recent methods combine text-to-image diffusion models and neural radiance fields to generate 3D objects
- Several methods have been proposed for novel-view synthesis from a single image
- Other methods optimize a neural 3D representation of an object
- Perpetual view generation synthesizes videos from a single RGB image
- Fridman et al. create 3D scenes from text, focusing on zoom-out video generation
- We generate complete, textured 3D room geometry from arbitrary trajectories
Method
- PureClipNeRF creates key objects but not a complete 3D structure
- Outpainting creates high-detail textures but has occlusion issues
- Text2Light and Blockade create high-detail 360° views but have occlusion issues
- Our approach creates high-detail textures and geometry without holes
Iterative 3D scene generation
- The scene is represented as a mesh with vertices, colors, and faces
- The input to the method is a set of text prompts and camera poses
- The iterative scene generation process follows a render-refine-repeat pattern
- At each generation step, the current scene is rendered from a novel viewpoint
- A text-to-image model inpaints the unobserved pixels
- A monocular depth estimator inpaints the unobserved depth
- A fusion scheme combines the novel content with the existing mesh
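The render-refine-repeat pattern above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: every callable (`render`, `inpaint_rgb`, `inpaint_depth`, `fuse`) is a hypothetical stand-in, and the toy versions below replace the mesh with a 1D strip of columns and each pose with a window start index, purely so the control flow can be run.

```python
import numpy as np

def generate_scene(scene, prompts, poses, render, inpaint_rgb, inpaint_depth, fuse):
    """Render-refine-repeat loop. All callables are hypothetical stand-ins."""
    for prompt, pose in zip(prompts, poses):
        rgb, depth, observed = render(scene, pose)   # render current scene
        hole = ~observed                             # pixels with no content yet
        rgb = inpaint_rgb(rgb, hole, prompt)         # refine: fill in color
        depth = inpaint_depth(rgb, depth, hole)      # refine: fill in depth
        scene = fuse(scene, rgb, depth, hole, pose)  # merge new content only
    return scene

# Toy stand-ins (assumptions for illustration only).
WIDTH = 4  # number of columns visible from one "pose"

def empty_scene(n):
    return {"rgb": np.zeros(n), "depth": np.zeros(n), "seen": np.zeros(n, dtype=bool)}

def toy_render(scene, pose):
    window = slice(pose, pose + WIDTH)
    return (scene["rgb"][window].copy(),
            scene["depth"][window].copy(),
            scene["seen"][window].copy())

def toy_inpaint_rgb(rgb, hole, prompt):
    out = rgb.copy()
    out[hole] = float(len(prompt))  # placeholder "color" derived from the prompt
    return out

def toy_inpaint_depth(rgb, depth, hole):
    out = depth.copy()
    out[hole] = 1.0                 # placeholder constant depth
    return out

def toy_fuse(scene, rgb, depth, hole, pose):
    idx = np.arange(pose, pose + WIDTH)[hole]  # only previously unseen columns
    scene["rgb"][idx] = rgb[hole]
    scene["depth"][idx] = depth[hole]
    scene["seen"][idx] = True
    return scene
```

In this toy setup, overlapping windows show the key property of the loop: content already in the scene is kept, and each step only adds what the new viewpoint reveals.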
Depth alignment step
- The predicted depth is used to lift the 2D image into 3D
- Similar regions in a scene should be placed at similar depth
- Depth alignment is done in two stages
- First, a state-of-the-art depth inpainting network predicts the depth
- Second, scale and shift parameters are optimized to align the predicted depth with the rendered depth
- Smoothing is applied to the resulting depth
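The scale-and-shift optimization can be illustrated with a closed-form least-squares fit. This is a sketch under stated assumptions: it fits the two parameters directly in depth space over the pixels the existing mesh already covers, whereas the paper's exact objective may differ (for example, it may operate on inverse depth). The function name `align_depth` is hypothetical, and the smoothing step is omitted.

```python
import numpy as np

def align_depth(pred, rendered, observed):
    """Fit scale and shift so that scale * pred + shift matches rendered
    depth on observed pixels, then apply them to the whole prediction."""
    if np.count_nonzero(observed) < 2:
        # Not enough overlap to constrain two parameters (e.g. the first view).
        return pred.copy(), 1.0, 0.0
    d = pred[observed]
    A = np.stack([d, np.ones_like(d)], axis=1)  # columns: [pred, 1]
    (scale, shift), *_ = np.linalg.lstsq(A, rendered[observed], rcond=None)
    return scale * pred + shift, float(scale), float(shift)
```

Fitting only two global parameters keeps the predicted depth's internal structure intact while moving it into agreement with geometry that is already in the scene.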
Mesh fusion step
- Insert new content into the scene
- Backproject image-space pixels into a world-space point cloud
- Use a triangulation scheme to create 3D geometry
- Filter faces based on edge length and the angle between surface normal and viewing direction
- Fuse the newly generated mesh patch with the existing geometry
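The backprojection, triangulation, and face-filtering steps above can be sketched with NumPy. This is an illustrative sketch rather than the paper's code: it assumes a pinhole intrinsics matrix `K` and a 4x4 camera-to-world matrix, triangulates each 2x2 pixel block into two faces, and uses hypothetical thresholds `max_edge` and `min_cos`. Stitching the patch to the existing mesh is not shown.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift every pixel (u, v) with depth d to a world-space 3D point."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T            # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)
    pts_h = np.concatenate([pts_cam, np.ones((h * w, 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]

def grid_faces(h, w):
    """Two triangles per 2x2 block of neighboring pixels."""
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    return np.concatenate([np.stack([tl, bl, tr], axis=1),
                           np.stack([tr, bl, br], axis=1)])

def filter_faces(verts, faces, cam_center, max_edge, min_cos):
    """Drop faces with a long edge or seen at a grazing angle."""
    tri = verts[faces]                                     # (F, 3, 3)
    edges = np.linalg.norm(tri - np.roll(tri, 1, axis=1), axis=2)
    short = edges.max(axis=1) <= max_edge
    normal = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    normal /= np.linalg.norm(normal, axis=1, keepdims=True) + 1e-12
    view = cam_center - tri.mean(axis=1)                   # face centroid to camera
    view /= np.linalg.norm(view, axis=1, keepdims=True) + 1e-12
    # abs() makes the test independent of triangle winding order.
    facing = np.abs((normal * view).sum(axis=1)) >= min_cos
    return faces[short & facing]
```

Both filters target the same failure: triangles that span a depth discontinuity become long and nearly parallel to the viewing ray, which shows up as stretched geometry when seen from other viewpoints.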
Two-stage viewpoint selection
- The choice of text prompts and camera poses determines the generated indoor scene
- Carelessly chosen poses can lead to stretch and hole artifacts
- A two-stage viewpoint selection strategy is proposed to sample optimal positions and refine empty regions
- The first stage creates the main parts of the scene, including the general layout and furniture
- The second stage samples additional poses to inpaint the scene and close any remaining holes
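One plausible way to pick poses in the second stage is to rank sampled candidates by how many unobserved pixels each one sees. The sketch below is an assumption made for illustration, not the paper's documented procedure: `pick_completion_pose` and `hole_mask_fn` are hypothetical names, and `hole_mask_fn(pose)` is expected to return a boolean image that is `True` wherever rendering from that pose hits no existing geometry.

```python
import numpy as np

def pick_completion_pose(candidates, hole_mask_fn, min_hole_pixels=1):
    """Return the candidate pose that views the most unobserved pixels,
    or (None, 0) if no candidate sees at least min_hole_pixels of them."""
    best_pose, best_count = None, 0
    for pose in candidates:
        count = int(np.count_nonzero(hole_mask_fn(pose)))
        if count >= min_hole_pixels and count > best_count:
            best_pose, best_count = pose, count
    return best_pose, best_count
```

Calling this repeatedly, and re-rendering the hole masks after each fusion step, would keep sampling until no candidate sees enough unobserved pixels.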
Results
- Implemented mesh rasterization and fusion with PyTorch3D
- Used a Stable Diffusion model as the text-to-image model
- Used the IronDepth model as the monocular depth estimator
- Generated prompts using guidelines suggested by Pierre
- Compared against four related methods
Qualitative results
- Showed top-down views and RGB renderings from within the scenes
- Showed additional results of the method in Figure 5
Quantitative results
- We render 60 images from novel viewpoints for each scene to calculate 2D metrics
- Stretched-out geometry and holes in 3D geometry lead to lower scores for baselines
- Our approach achieves the highest scores due to accurate and complete geometry and RGB texture
- We use separate text prompts for different poses, so a single scene can combine multiple text prompts
- We use depth alignment, mesh fusion, and two-stage viewpoint selection
- Our method can fail under certain conditions, leaving stretched-out geometry or holes that are not fully closed
- We conduct a user study to score Perceptual Quality and 3D Structure Completeness
- We use a tailored two-stage viewpoint selection scheme to generate the scene
- We use a monocular depth inpainting network and align the rendered depth with the inpainted depth
- We use a classical inpainting algorithm to inpaint small holes
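Filling small holes with a classical inpainting algorithm can be illustrated with a simple neighbor-averaging fill. This is a stand-in chosen for brevity, not the algorithm the authors used: it works on a single-channel 2D image, and the name `fill_small_holes` and the `iterations` parameter are hypothetical.

```python
import numpy as np

def fill_small_holes(image, hole, iterations=50):
    """Iteratively replace hole pixels with the mean of their known
    4-neighbors, growing inward from the hole boundary."""
    img = image.astype(float).copy()
    known = ~hole
    for _ in range(iterations):
        if known.all():
            break
        pad_img = np.pad(img * known, 1)            # unknown pixels contribute 0
        pad_known = np.pad(known.astype(float), 1)
        nb_sum = (pad_img[:-2, 1:-1] + pad_img[2:, 1:-1]
                  + pad_img[1:-1, :-2] + pad_img[1:-1, 2:])
        nb_cnt = (pad_known[:-2, 1:-1] + pad_known[2:, 1:-1]
                  + pad_known[1:-1, :-2] + pad_known[1:-1, 2:])
        fill = ~known & (nb_cnt > 0)                # holes touching known pixels
        img[fill] = nb_sum[fill] / nb_cnt[fill]
        known = known | fill
    return img
```

A fill like this is only reasonable for holes a few pixels wide; larger regions need the text-conditioned inpainting described earlier.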