Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Presents DreamBooth3D, an approach to personalize 3D models from 3-6 images
- Combines text-to-image and text-to-3D models
- Naively combining these methods fails to yield satisfactory 3D assets
- Uses 3-stage optimization strategy to leverage 3D consistency and personalization
- Produces high-quality, subject-specific 3D assets with text-driven modifications
Paper Content
Introduction
- Text-to-Image (T2I) generative models can create and edit visual content
- Recent works have demonstrated high-quality Text-to-3D generation
- Applications in graphics, VR, movies, and gaming
- Text prompts allow for some degree of control, but difficult to precisely control identity, geometry, and appearance
- Recent success in personalizing T2I models for subject-specific 2D image generation
- DreamBooth3D proposed for subject-driven Text-to-3D generation
- Given a few casual image captures of a subject, generate subject-specific 3D assets
- Draws inspiration from recent works
- 3-stage optimization framework proposed
- Synergistic optimization of NeRF and T2I models
- Results indicate realistic 3D assets with high likeness to given subject
Related works
- Text-to-Image Generation uses GANs, autoregressive models, and masked image models to generate images
- Denoising diffusion models can generate high-quality images and be conditioned on various inputs
- 3D Generation uses 3D reconstruction from images or generative models from image collections
- Text-to-3D methods generate 3D assets from text prompts
- Subject-driven Generation enables users to personalize image generation for specific subjects
- Textual Inversion optimizes for a new “word” in the embedding space of a pre-trained text-to-image model
Approach
- Input consists of k casual subject captures with n pixels and a text prompt
- Aim is to generate a 3D asset that captures the identity and is faithful to the text prompt
- 3D assets are optimized in the form of Neural Radiance Fields
- Problem is more challenging than typical 3D reconstruction setting
- Technique is based on advances in Text-to-3D optimization and personalization
Preliminaries
- DreamBooth is a method to personalize a text-to-image (T2I) diffusion model.
- DreamBooth uses a diffusion loss function to finetune the T2I model.
- DreamFusion optimizes a volume represented as a NeRF to match a text prompt using a T2I diffusion model.
- DreamFusion uses score distillation sampling to push noisy versions of the rendered images to lower energy states of the T2I diffusion model.
Failure of naive dreambooth+fusion
- Personalizing a T2I model and using it for Text-to-3D optimization
- Naive DreamBooth+Fusion technique results in unsatisfactory results
- DreamBooth tends to overfit to the subject views in the training views, reducing viewpoint diversity in the image generations
Dreambooth3d optimization
- Proposed an effective multi-stage optimization scheme called Dream-Booth3D for text-to-3D generation
- 3 stages: 3D with Partial DreamBooth, Multi-view Data Generation, Final NeRF with Multi-view DreamBooth
- Initial checkpoints of Dream-Booth do not overfit to given subject views
- Use SDS loss to optimize initial NeRF asset
- Generate pseudo multi-view subject images with fully-trained DreamBooth
- Add noise to renders from initial NeRF asset
- Use Img2Img translation to generate images with more likeness to given subject
- Further finetune partially trained DreamBooth with augmented data
- Optimize NeRF with SDS and multi-view reconstruction losses
- Final NeRF optimization objective includes NeRF regularizations
Experiments
- Imagen T2I model and T5-XXL language model used
- 3 hours per prompt to complete 3 stages of optimization on 4 core TPUv4
- 150 iterations to train partial DreamBooth model, 800 iterations to train full DreamBooth model
- 20 images uniformly sampled at fixed radius for pseudo multi-view data generation
- 150 iterations to finetune partially trained model
- Dataset consists of 30 image collections with 4-6 casual captures of a variety of subjects
- Latent-NeRF and DreamBooth3D used as baselines
- CLIP R-Precision metric used to evaluate results
- User studies conducted to compare different results
Results
- DreamBooth3D faithfully respects context in input text prompt
- Latent-NeRF often fails to produce coherent 3D model
- DreamBooth+Fusion produces 3D assets with Janus problem
- DreamBooth3D consistently produces 360 consistent 3D assets
- DreamBooth3D has higher scores than DreamBooth+Fusion
- Initial NeRF has partial likeness to subject, final NeRF has better likeness
- User study shows DreamBooth3D is preferred over baselines
Sample applications
- DreamBooth3D can represent context and preserve identity
- Simple changes in text-prompt enable 3D applications
- Results consistently respect context in text-prompt
- Color/material editing possible with simple text prompts
- DreamBooth3D can convert 2D cartoon images into 3D shapes
Limitations
- Method allows for high-quality 3D asset creation
- Optimized 3D representations can be oversaturated and oversmoothed
- Low image resolution of 64x64 pixels
- Janus problem of appearing to be front-facing from multiple inconsistent viewpoints
- Struggles to reconstruct thin object structures like sunglasses
Conclusion
- Proposed DreamBooth3D, a method for subject-driven text-to-3D generation
- Given a few casual image captures of a subject, generate 3D assets that adhere to the contextualization provided in the input text
- Experiments on DreamBooth dataset show realistic 3D assets with high likeness to a given subject
- Outperforms several baselines in quantitative and qualitative evaluations
- Future plans to improve photorealism and controllability of subject-driven 3D generation
- User study to compare to other approaches
- Example applications include material editing, accessorization, color changes, and pose changes
- Mip-NeRF used as volumetric representation
- NeRF volume regularized using orientation and opacity loss
- Results demonstrate high-quality geometry estimation
- User study shows preference for DreamBooth3D over baselines
- 3D recontextualizations with DreamBooth3D
- Quantitative comparisons show more accurate resemblance to text prompts
- Applications include accessorization, stylization, material editing
- Can produce plausible 3D models from unrealistic cartoon images
- Can fail to reconstruct thin object structures and objects with not enough view variation