Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Presents DreamBooth3D, an approach to personalize 3D models from 3-6 images
  • Combines text-to-image and text-to-3D models
  • Naively combining these methods fails to yield satisfactory 3D assets
  • Uses 3-stage optimization strategy to leverage 3D consistency and personalization
  • Produces high-quality, subject-specific 3D assets with text-driven modifications

Paper Content

Introduction

  • Text-to-Image (T2I) generative models can create and edit visual content
  • Recent works have demonstrated high-quality Text-to-3D generation
  • Applications in graphics, VR, movies, and gaming
  • Text prompts allow for some degree of control, but difficult to precisely control identity, geometry, and appearance
  • Recent success in personalizing T2I models for subject-specific 2D image generation
  • DreamBooth3D proposed for subject-driven Text-to-3D generation
  • Given a few casual image captures of a subject, generate subject-specific 3D assets
  • Draws inspiration from recent works
  • 3-stage optimization framework proposed
  • Synergistic optimization of NeRF and T2I models
  • Results indicate realistic 3D assets with high likeness to given subject
  • Text-to-Image Generation uses GANs, autoregressive models, and masked image models to generate images
  • Denoising diffusion models can generate high-quality images and be conditioned on various inputs
  • 3D Generation uses 3D reconstruction from images or generative models from image collections
  • Text-to-3D methods generate 3D assets from text prompts
  • Subject-driven Generation enables users to personalize image generation for specific subjects
  • Textual Inversion optimizes for a new “word” in the embedding space of a pre-trained text-to-image model

Approach

  • Input consists of k casual subject captures with n pixels and a text prompt
  • Aim is to generate a 3D asset that captures the identity and is faithful to the text prompt
  • 3D assets are optimized in the form of Neural Radiance Fields
  • Problem is more challenging than typical 3D reconstruction setting
  • Technique is based on advances in Text-to-3D optimization and personalization

Preliminaries

  • DreamBooth is a method to personalize a text-to-image (T2I) diffusion model.
  • DreamBooth uses a diffusion loss function to finetune the T2I model.
  • DreamFusion optimizes a volume represented as a NeRF to match a text prompt using a T2I diffusion model.
  • DreamFusion uses score distillation sampling to push noisy versions of the rendered images to lower energy states of the T2I diffusion model.

Failure of naive dreambooth+fusion

  • Personalizing a T2I model and using it for Text-to-3D optimization
  • Naive DreamBooth+Fusion technique results in unsatisfactory results
  • DreamBooth tends to overfit to the subject views in the training views, reducing viewpoint diversity in the image generations

Dreambooth3d optimization

  • Proposed an effective multi-stage optimization scheme called Dream-Booth3D for text-to-3D generation
  • 3 stages: 3D with Partial DreamBooth, Multi-view Data Generation, Final NeRF with Multi-view DreamBooth
  • Initial checkpoints of Dream-Booth do not overfit to given subject views
  • Use SDS loss to optimize initial NeRF asset
  • Generate pseudo multi-view subject images with fully-trained DreamBooth
  • Add noise to renders from initial NeRF asset
  • Use Img2Img translation to generate images with more likeness to given subject
  • Further finetune partially trained DreamBooth with augmented data
  • Optimize NeRF with SDS and multi-view reconstruction losses
  • Final NeRF optimization objective includes NeRF regularizations

Experiments

  • Imagen T2I model and T5-XXL language model used
  • 3 hours per prompt to complete 3 stages of optimization on 4 core TPUv4
  • 150 iterations to train partial DreamBooth model, 800 iterations to train full DreamBooth model
  • 20 images uniformly sampled at fixed radius for pseudo multi-view data generation
  • 150 iterations to finetune partially trained model
  • Dataset consists of 30 image collections with 4-6 casual captures of a variety of subjects
  • Latent-NeRF and DreamBooth3D used as baselines
  • CLIP R-Precision metric used to evaluate results
  • User studies conducted to compare different results

Results

  • DreamBooth3D faithfully respects context in input text prompt
  • Latent-NeRF often fails to produce coherent 3D model
  • DreamBooth+Fusion produces 3D assets with Janus problem
  • DreamBooth3D consistently produces 360 consistent 3D assets
  • DreamBooth3D has higher scores than DreamBooth+Fusion
  • Initial NeRF has partial likeness to subject, final NeRF has better likeness
  • User study shows DreamBooth3D is preferred over baselines

Sample applications

  • DreamBooth3D can represent context and preserve identity
  • Simple changes in text-prompt enable 3D applications
  • Results consistently respect context in text-prompt
  • Color/material editing possible with simple text prompts
  • DreamBooth3D can convert 2D cartoon images into 3D shapes

Limitations

  • Method allows for high-quality 3D asset creation
  • Optimized 3D representations can be oversaturated and oversmoothed
  • Low image resolution of 64x64 pixels
  • Janus problem of appearing to be front-facing from multiple inconsistent viewpoints
  • Struggles to reconstruct thin object structures like sunglasses

Conclusion

  • Proposed DreamBooth3D, a method for subject-driven text-to-3D generation
  • Given a few casual image captures of a subject, generate 3D assets that adhere to the contextualization provided in the input text
  • Experiments on DreamBooth dataset show realistic 3D assets with high likeness to a given subject
  • Outperforms several baselines in quantitative and qualitative evaluations
  • Future plans to improve photorealism and controllability of subject-driven 3D generation
  • User study to compare to other approaches
  • Example applications include material editing, accessorization, color changes, and pose changes
  • Mip-NeRF used as volumetric representation
  • NeRF volume regularized using orientation and opacity loss
  • Results demonstrate high-quality geometry estimation
  • User study shows preference for DreamBooth3D over baselines
  • 3D recontextualizations with DreamBooth3D
  • Quantitative comparisons show more accurate resemblance to text prompts
  • Applications include accessorization, stylization, material editing
  • Can produce plausible 3D models from unrealistic cartoon images
  • Can fail to reconstruct thin object structures and objects with not enough view variation