Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Text2Room is a method for generating 3D meshes from text prompts.
Pre-trained 2D text-to-image models are used to create a sequence of images from different poses.
Monocular depth estimation and a text-conditioned inpainting model are used to lift the images into a 3D scene representation.
A tailored viewpoint selection is used to fuse the content of each image into a seamless, textured 3D mesh.
Text2Room is the first method to generate room-scale 3D geometry with compelling textures from only text as input.

Paper Content

Introduction

Generating 3D meshes from text
2D text-to-image models used to create 3D models
3D datasets are smaller than 2D datasets
Iterative optimization problem in image domain to generate 3D objects
Generating large scenes with dense and coherent content
Merging generated content with existing mesh to create smooth transitions

Related work

Text-based generation has seen advances due to datasets and model architectures
Diffusion models have achieved impressive results on image synthesis
Text-to-image models can generate diverse, high-fidelity, and controllable outputs
Text-based generation has been extended to other modalities including audio, video, and 4D fields
Text-to-3D methods use 3D data for supervised training or optimization in the image domain
Recent methods combine text-to-image diffusion models and neural radiance fields to generate 3D objects
Several methods have been proposed for novel-view synthesis from a single image
Other methods optimize a neural 3D representation of an object
Perpetual view generation synthesizes videos from a single RGB image
Fridman et al. create 3D scenes from text, focusing on zoom-out video generation
We generate complete, textured 3D room geometry from arbitrary trajectories

Method

PureClipNeRF creates key objects but not complete 3D structure
Outpainting creates high-detail textures but has occlusion issues
Text2Light and Blockade create high-detail 360° views but have occlusion issues
Our approach creates high-detail textures and geometry without holes

Iterative 3D scene generation

Scene is represented as a mesh with vertices, colors, and faces
Input to the method is a set of text prompts and camera poses
Iterative scene generation process follows render-refine-repeat pattern
For each step of generation, render current scene from novel viewpoint
Use text-to-image model to inpaint unobserved pixels
Inpaint unobserved depth with monocular depth estimator
Combine novel content with existing mesh using fusion scheme
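
The render-refine-repeat pattern above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the function `generate_scene` and the four callables it takes (`render`, `inpaint_image`, `inpaint_depth`, `fuse`) are hypothetical placeholders standing in for the rasterizer, the text-to-image model, the depth estimator, and the fusion scheme.

```python
from typing import Any, Callable, Sequence


def generate_scene(
    prompts: Sequence[str],
    poses: Sequence[Any],
    render: Callable,         # hypothetical: (mesh, pose) -> (image, depth, observed_mask)
    inpaint_image: Callable,  # hypothetical: (image, observed_mask, prompt) -> image
    inpaint_depth: Callable,  # hypothetical: (image, depth, observed_mask) -> depth
    fuse: Callable,           # hypothetical: (mesh, image, depth, observed_mask, pose) -> mesh
    mesh: Any = None,
) -> Any:
    """Grow a scene mesh one viewpoint at a time (render, refine, repeat)."""
    for prompt, pose in zip(prompts, poses):
        # Render: look at the current scene from a novel viewpoint.
        image, depth, observed = render(mesh, pose)
        # Refine: fill unobserved pixels, then unobserved depth.
        image = inpaint_image(image, observed, prompt)
        depth = inpaint_depth(image, depth, observed)
        # Repeat: merge the new content so the next render can see it.
        mesh = fuse(mesh, image, depth, observed, pose)
    return mesh
```

Passing the models in as callables keeps the loop itself independent of any particular text-to-image or depth network.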

Depth alignment step

Predicted depth is used to lift 2D image into 3D
Similar regions in a scene should be placed at similar depth
Depth alignment is done in two stages
State-of-the-art depth inpainting network is used
Optimization for scale and shift parameters is done
Smoothing is applied to extracted depth
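
The scale-and-shift optimization can be illustrated with a closed-form least-squares fit. This is a sketch under stated assumptions: the function name `align_depth` is hypothetical, and the fit is done directly on depth values over the pixels already covered by the rendered mesh. The summary does not say whether the fit happens in depth or in inverse depth, so treat that choice as an assumption. The smoothing step is not shown.

```python
import numpy as np


def align_depth(predicted, rendered, observed):
    """Fit scale and shift so `predicted` matches `rendered` where `observed` is True.

    predicted, rendered: (H, W) float arrays; observed: (H, W) boolean mask.
    Returns the aligned depth map together with the fitted scale and shift.
    """
    p = predicted[observed].ravel()
    r = rendered[observed].ravel()
    # Solve min over (scale, shift) of || scale * p + shift - r ||^2.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, r, rcond=None)
    return scale * predicted + shift, scale, shift
```

Only the observed pixels constrain the fit, but the resulting scale and shift are applied to the whole predicted map, so newly inpainted regions land at depths consistent with the existing geometry.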

Mesh fusion step

Insert new content into the scene
Backproject image-space pixels into world-space point cloud
Use triangulation scheme to create 3D geometry
Filter faces based on edge length and angle between surface normal and viewing direction
Fuse newly generated mesh patch with existing geometry
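
The backprojection, triangulation, and filtering steps above might look roughly like the following. This is a minimal sketch, not the paper's code: the function names, the pinhole intrinsics matrix `K`, the camera-to-world matrix, and the thresholds `max_edge` and `min_cos` are all illustrative assumptions, and depth is treated as distance along the camera z-axis.

```python
import numpy as np


def backproject(depth, K, cam_to_world):
    """Lift every pixel (u, v) with depth d to a world-space point; returns (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T            # camera-space rays with z = 1
    cam_pts = rays * depth.reshape(-1, 1)      # scale each ray by its depth
    cam_h = np.concatenate([cam_pts, np.ones((h * w, 1))], axis=1)
    return (cam_h @ cam_to_world.T)[:, :3]


def triangulate(h, w):
    """Two triangles per 2x2 pixel block, indexing the flattened pixel grid."""
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    return np.concatenate([np.stack([tl, bl, tr], 1), np.stack([tr, bl, br], 1)])


def filter_faces(points, faces, cam_center, max_edge, min_cos):
    """Drop faces with a long edge or a normal nearly perpendicular to the view ray."""
    tri = points[faces]                                            # (F, 3, 3)
    edges = np.linalg.norm(tri - np.roll(tri, 1, axis=1), axis=2)  # (F, 3)
    normals = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    normals /= np.linalg.norm(normals, axis=1, keepdims=True) + 1e-12
    view = np.asarray(cam_center, dtype=float) - tri.mean(axis=1)
    view /= np.linalg.norm(view, axis=1, keepdims=True) + 1e-12
    # Absolute value, so the test does not depend on face winding order.
    cos = np.abs((normals * view).sum(axis=1))
    keep = (edges.max(axis=1) <= max_edge) & (cos >= min_cos)
    return faces[keep]
```

Both filters target the same artifact: triangles that span a depth discontinuity are long and are viewed at a grazing angle, so removing them avoids stretched-out geometry between foreground and background.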

Two-stage viewpoint selection

Choice of text prompts and camera poses used to create indoor scene
Can lead to stretch and hole artifacts if poses chosen carelessly
Two-stage viewpoint selection strategy proposed to sample optimal positions and refine empty regions
First stage creates main parts of scene, including general layout and furniture
Second stage samples additional poses to inpaint scene and close any remaining holes
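
One way to picture the second stage is as a search over candidate poses for the view that exposes the largest remaining hole. The sketch below is an assumption-laden illustration: the summary does not state how candidates are scored, so the "most unobserved pixels" criterion, the function `pick_completion_pose`, and the `count_unobserved` callable are all hypothetical.

```python
from typing import Any, Callable, Optional, Sequence


def pick_completion_pose(
    candidates: Sequence[Any],
    count_unobserved: Callable[[Any], int],  # hypothetical: pose -> number of hole pixels seen
    min_unobserved: int = 1,
) -> Optional[Any]:
    """Return the candidate pose that sees the most unobserved pixels.

    Returns None when no candidate sees at least `min_unobserved` hole pixels,
    which signals that this region needs no further completion.
    """
    best_pose, best_count = None, min_unobserved - 1
    for pose in candidates:
        n = count_unobserved(pose)
        if n > best_count:
            best_pose, best_count = pose, n
    return best_pose
```

The selected pose would then be fed through the same render, inpaint, and fuse steps used in the first stage.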

Results

Implemented mesh rasterization and fusion with PyTorch3D
Used the Stable Diffusion model for text-to-image generation
Used the IronDepth model as the monocular depth estimator
Generated prompts using guidelines suggested by Pierre
Compared against four related methods

Qualitative results

Showed top-down views and RGB renderings from within
Showed additional results of method in Figure 5

Quantitative results

We render 60 images from novel viewpoints for each scene to calculate 2D metrics
Stretched-out geometry and holes in 3D geometry lead to lower scores for baselines
Our approach achieves highest scores due to accurate and complete geometry and RGB texture
Using separate text prompts for different poses lets a single scene combine multiple prompts
We use depth alignment, mesh fusion and two-stage viewpoint selection
Our method can fail under certain conditions, leaving stretched-out geometry or holes that are not fully closed
We conduct a user study to score Perceptual Quality and 3D Structure Completeness
We use a tailored two-stage viewpoint selection scheme to generate the scene
We use a monocular depth inpainting network and align rendered depth and inpainted depth
We use a classical inpainting algorithm to inpaint small holes
