Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Presents DreamBooth3D, an approach to personalize 3D models from 3-6 images
Combines text-to-image and text-to-3D models
Naively combining these methods fails to yield satisfactory 3D assets
Uses 3-stage optimization strategy to leverage 3D consistency and personalization
Produces high-quality, subject-specific 3D assets with text-driven modifications

Paper Content

Introduction

Text-to-Image (T2I) generative models can create and edit visual content
Recent works have demonstrated high-quality Text-to-3D generation
Applications in graphics, VR, movies, and gaming
Text prompts allow for some degree of control, but difficult to precisely control identity, geometry, and appearance
Recent success in personalizing T2I models for subject-specific 2D image generation
DreamBooth3D proposed for subject-driven Text-to-3D generation
Given a few casual image captures of a subject, generate subject-specific 3D assets
Draws inspiration from recent works
3-stage optimization framework proposed
Synergistic optimization of NeRF and T2I models
Results indicate realistic 3D assets with high likeness to given subject

Text-to-Image Generation uses GANs, autoregressive models, and masked image models to generate images
Denoising diffusion models can generate high-quality images and be conditioned on various inputs
3D Generation uses 3D reconstruction from images or generative models from image collections
Text-to-3D methods generate 3D assets from text prompts
Subject-driven Generation enables users to personalize image generation for specific subjects
Textual Inversion optimizes for a new “word” in the embedding space of a pre-trained text-to-image model

Approach

Input consists of k casual subject captures with n pixels and a text prompt
Aim is to generate a 3D asset that captures the identity and is faithful to the text prompt
3D assets are optimized in the form of Neural Radiance Fields
Problem is more challenging than typical 3D reconstruction setting
Technique is based on advances in Text-to-3D optimization and personalization

Preliminaries

DreamBooth is a method to personalize a text-to-image (T2I) diffusion model.
DreamBooth uses a diffusion loss function to finetune the T2I model.
DreamFusion optimizes a volume represented as a NeRF to match a text prompt using a T2I diffusion model.
DreamFusion uses score distillation sampling to push noisy versions of the rendered images to lower energy states of the T2I diffusion model.

Failure of naive dreambooth+fusion

Personalizing a T2I model and using it for Text-to-3D optimization
Naive DreamBooth+Fusion technique results in unsatisfactory results
DreamBooth tends to overfit to the subject views in the training views, reducing viewpoint diversity in the image generations

Dreambooth3d optimization

Proposed an effective multi-stage optimization scheme called Dream-Booth3D for text-to-3D generation
3 stages: 3D with Partial DreamBooth, Multi-view Data Generation, Final NeRF with Multi-view DreamBooth
Initial checkpoints of Dream-Booth do not overfit to given subject views
Use SDS loss to optimize initial NeRF asset
Generate pseudo multi-view subject images with fully-trained DreamBooth
Add noise to renders from initial NeRF asset
Use Img2Img translation to generate images with more likeness to given subject
Further finetune partially trained DreamBooth with augmented data
Optimize NeRF with SDS and multi-view reconstruction losses
Final NeRF optimization objective includes NeRF regularizations

Experiments

Imagen T2I model and T5-XXL language model used
3 hours per prompt to complete 3 stages of optimization on 4 core TPUv4
150 iterations to train partial DreamBooth model, 800 iterations to train full DreamBooth model
20 images uniformly sampled at fixed radius for pseudo multi-view data generation
150 iterations to finetune partially trained model
Dataset consists of 30 image collections with 4-6 casual captures of a variety of subjects
Latent-NeRF and DreamBooth3D used as baselines
CLIP R-Precision metric used to evaluate results
User studies conducted to compare different results

Results

DreamBooth3D faithfully respects context in input text prompt
Latent-NeRF often fails to produce coherent 3D model
DreamBooth+Fusion produces 3D assets with Janus problem
DreamBooth3D consistently produces 360 consistent 3D assets
DreamBooth3D has higher scores than DreamBooth+Fusion
Initial NeRF has partial likeness to subject, final NeRF has better likeness
User study shows DreamBooth3D is preferred over baselines

Sample applications

DreamBooth3D can represent context and preserve identity
Simple changes in text-prompt enable 3D applications
Results consistently respect context in text-prompt
Color/material editing possible with simple text prompts
DreamBooth3D can convert 2D cartoon images into 3D shapes

Limitations

Method allows for high-quality 3D asset creation
Optimized 3D representations can be oversaturated and oversmoothed
Low image resolution of 64x64 pixels
Janus problem of appearing to be front-facing from multiple inconsistent viewpoints
Struggles to reconstruct thin object structures like sunglasses

Conclusion

Proposed DreamBooth3D, a method for subject-driven text-to-3D generation
Given a few casual image captures of a subject, generate 3D assets that adhere to the contextualization provided in the input text
Experiments on DreamBooth dataset show realistic 3D assets with high likeness to a given subject
Outperforms several baselines in quantitative and qualitative evaluations
Future plans to improve photorealism and controllability of subject-driven 3D generation
User study to compare to other approaches
Example applications include material editing, accessorization, color changes, and pose changes
Mip-NeRF used as volumetric representation
NeRF volume regularized using orientation and opacity loss
Results demonstrate high-quality geometry estimation
User study shows preference for DreamBooth3D over baselines
3D recontextualizations with DreamBooth3D
Quantitative comparisons show more accurate resemblance to text prompts
Applications include accessorization, stylization, material editing
Can produce plausible 3D models from unrealistic cartoon images
Can fail to reconstruct thin object structures and objects with not enough view variation

Link to paper#

Abstract#

Paper Content#

Introduction#

Related works#

Approach#

Preliminaries#

Failure of naive dreambooth+fusion#

Dreambooth3d optimization#

Experiments#

Results#

Sample applications#

Limitations#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related works

Approach

Preliminaries

Failure of naive dreambooth+fusion

Dreambooth3d optimization

Experiments

Results

Sample applications

Limitations

Conclusion