Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Recent breakthroughs in text-to-image synthesis have been driven by diffusion models
  • Applying the same approach to 3D synthesis would require large-scale labeled 3D datasets and efficient architectures for denoising 3D data, neither of which currently exists
  • Text-to-3D synthesis is done using a pretrained 2D text-to-image diffusion model
  • Loss based on probability density distillation enables use of 2D diffusion model as prior
  • DeepDream-like procedure optimizes a randomly-initialized 3D model via gradient descent

Paper Content

Introduction

  • Generative image models now support high-fidelity, diverse and controllable image synthesis
  • Quality improvements come from large aligned image-text datasets and scalable generative model architectures
  • Diffusion models are effective at learning high-quality image generators
  • Applying diffusion models to other modalities requires large amounts of modality-specific training data
  • This work develops techniques to transfer pretrained 2D image-text diffusion models to 3D object synthesis
  • 3D generative models can be trained on explicit representations of structure
  • GANs can learn controllable 3D generators from photographs of a single object category
  • Neural Radiance Fields can be used for neural inverse rendering
  • Many 3D generative approaches have found success incorporating NeRF-like models
  • This work uses pretrained 2D image-text models for 3D synthesis
  • Score Distillation Sampling (SDS) enables sampling via optimization in differentiable image parameterizations
  • DreamFusion generates high-fidelity coherent 3D objects and scenes for user-provided text prompts

Diffusion models and score distillation sampling

  • Diffusion models are generative models that learn to transform a sample from a noise distribution to a data distribution.
  • The forward process is typically a Gaussian distribution that transitions from a less noisy latent to a noisier latent.
  • The reverse process is trained to slowly add structure starting from random noise.
  • The optimal reverse process step is also Gaussian and related to an optimal MSE denoiser.
  • The generative model is trained with a weighted evidence lower bound (ELBO).
  • Text-to-image diffusion models use classifier-free guidance (CFG) to improve sample fidelity.
  • Parameters are updated with SGD and samples are updated in pixel space.
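
The forward process and classifier-free guidance described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the cosine/sine noise schedule and the function names are assumptions chosen for the example, and the guidance rule follows the common form eps_hat = (1 + ω) · eps_cond − ω · eps_uncond.

```python
import math
import random

def noise_schedule(t):
    """Assumed variance-preserving schedule: alpha_t^2 + sigma_t^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

def forward_noise(x, t, rng):
    """Sample z_t = alpha_t * x + sigma_t * eps with eps ~ N(0, I)."""
    alpha_t, sigma_t = noise_schedule(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    z_t = [alpha_t * xi + sigma_t * ei for xi, ei in zip(x, eps)]
    return z_t, eps

def classifier_free_guidance(eps_cond, eps_uncond, omega):
    """Combine text-conditional and unconditional noise predictions.

    omega = 0 returns the conditional prediction unchanged; larger omega
    pushes the prediction further away from the unconditional one.
    """
    return [(1.0 + omega) * c - omega * u for c, u in zip(eps_cond, eps_uncond)]

# Usage with made-up numbers (hypothetical predictions, not model outputs).
rng = random.Random(0)
x = [0.2, -0.5, 0.9]
z_t, eps = forward_noise(x, t=0.3, rng=rng)
guided = classifier_free_guidance([0.1, 0.2, 0.3], [0.0, 0.1, 0.5], omega=2.0)
```

In a real system the two noise predictions would come from the same network, evaluated with and without the text embedding.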

Score distillation sampling vs. ancestral sampling

  • 2D sampling methods are compared in Figure 2
  • Score distillation sampling is used as an example with an image generator that restricts images to be symmetric

How can we sample in parameter space, not pixel space?

  • Existing approaches for sampling from diffusion models generate a sample that is the same type and dimensionality as the observed data.
  • Conditional diffusion sampling adds some flexibility, for example inpainting.
  • Diffusion models trained on pixels have traditionally been used to sample only pixels.
  • We want to create 3D models that look like good images when rendered from random angles.
  • We use a differentiable image parameterization (DIP) to express constraints, optimize in more compact spaces, or leverage more powerful optimization algorithms.
  • We need a differentiable loss function where plausible images have low loss, and implausible images have high loss.
  • We investigated reusing the diffusion training loss to find modes of the learned conditional density.
  • Minimizing the diffusion training loss with respect to a generated datapoint does not produce realistic samples.
  • We found that omitting the U-Net Jacobian term leads to an effective gradient for optimizing DIPs with diffusion models.
  • We use a weighted probability density distillation loss to compute parameter updates.
  • We name our sampling approach Score Distillation Sampling (SDS).
  • SDS produces detail comparable to ancestral sampling.
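
The SDS update described above has the form $\nabla_\theta \mathcal{L}_{\text{SDS}} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,(\hat{\epsilon}_\phi(z_t; y, t) - \epsilon)\,\partial x / \partial \theta \right]$, where $x = g(\theta)$ is the differentiable image parameterization and the U-Net Jacobian is omitted. The toy sketch below shows the mechanics under strong assumptions: the "pretrained model" is replaced by the closed-form optimal noise predictor for Gaussian data N(MU, S²), the generator is a symmetric two-pixel image, and w(t) = 1. All names and constants are hypothetical.

```python
import math
import random

MU, S = 1.0, 0.1  # toy data distribution N(MU, S^2); hypothetical values

def noise_schedule(t):
    """Assumed variance-preserving schedule."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

def toy_eps_hat(z_t, t):
    """Closed-form optimal noise prediction for Gaussian data.

    Stands in for the frozen diffusion model; it is never differentiated.
    """
    alpha_t, sigma_t = noise_schedule(t)
    var_z = alpha_t ** 2 * S ** 2 + sigma_t ** 2
    return sigma_t * (z_t - alpha_t * MU) / var_z

def render(theta):
    """Generator x = g(theta): a two-pixel image constrained to be symmetric."""
    return [theta, theta]

def sds_grad(theta, rng, w=lambda t: 1.0):
    """One stochastic estimate of the SDS gradient with respect to theta."""
    t = rng.uniform(0.02, 0.98)
    alpha_t, sigma_t = noise_schedule(t)
    grad = 0.0
    for x_i in render(theta):
        eps = rng.gauss(0.0, 1.0)
        z_t = alpha_t * x_i + sigma_t * eps
        # (eps_hat - eps) is treated as a constant; dx_i/dtheta = 1 here.
        grad += w(t) * (toy_eps_hat(z_t, t) - eps) * 1.0
    return grad

def optimize(theta0=3.0, steps=2000, lr=0.01, seed=0):
    """Plain gradient descent on theta using SDS gradient estimates."""
    rng = random.Random(seed)
    theta = theta0
    for _ in range(steps):
        theta -= lr * sds_grad(theta, rng)
    return theta

theta_final = optimize()  # drifts toward MU, the mode of the toy density
```

Because the expected residual is proportional to (theta − MU), the parameter is pulled toward the mode of the toy density, which mirrors the mode-seeking behaviour the summary attributes to SDS.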

The DreamFusion algorithm

  • Use diffusion model as loss in continuous optimization problem to generate samples
  • Use Imagen model from Saharia et al. (2022) to synthesize images from text
  • Use 64x64 base model, no modifications
  • Initialize NeRF-like model with random weights
  • Render views of NeRF from random camera positions and angles
  • Use renderings as input to score distillation loss function
  • Gradient descent eventually results in 3D model resembling text

Neural rendering of a 3D model

  • NeRF is a technique for neural inverse rendering that uses a volumetric raytracer and a multilayer perceptron
  • Rendering an image from a NeRF is done by casting a ray for each pixel from a camera’s center of projection
  • Sampled 3D points are passed through an MLP, which produces 4 scalar values as output
  • Output includes volumetric density and an RGB color
  • NeRF MLP is trained from random initialization using a mean squared error loss function
  • Model is built upon mip-NeRF 360 which reduces aliasing
  • MLP parameterizes the color of the surface itself (its albedo), which is then lit by a controllable illumination (shading)
  • Regularization penalty on the opacity along each ray and a modified version of the orientation loss are used
  • Full details on these regularizers and additional hyperparameters of NeRF are in Appendix A.2
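
The per-ray rendering step described above can be illustrated with the standard NeRF alpha-compositing rule: each sample contributes its color weighted by its own opacity times the transmittance of everything in front of it. This is a sketch of that generic rule only; it leaves out the MLP, the shading model, and the mip-NeRF 360 specifics, and the sample values are made up.

```python
import math

def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray, ordered front to back.

    densities: volumetric density tau_i at each sample
    colors:    RGB triple at each sample
    deltas:    distance between consecutive samples
    Returns the pixel color and the per-sample rendering weights.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for tau, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-tau * delta)  # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        pixel = [p + weight * c for p, c in zip(pixel, rgb)]
        transmittance *= 1.0 - alpha
    return pixel, weights

# Usage: empty space, then a dense green sample, then a faint blue one.
densities = [0.0, 50.0, 5.0]
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
deltas = [0.1, 0.1, 0.1]
pixel, weights = composite_ray(densities, colors, deltas)
```

The sum of the weights is the accumulated opacity along the ray, which is the quantity an opacity regularizer would penalize.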

Text-to-3D synthesis

  • Pretrained text-to-image diffusion model, differentiable image parameterization in the form of a NeRF, and a loss function used for text-to-3D synthesis
  • Randomly sample a camera and light
  • Render an image of the NeRF from the camera and shade with the light
  • Compute gradients of the SDS loss with respect to the NeRF parameters
  • Update the NeRF parameters using an optimizer
  • Optimize for 15,000 iterations
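
The loop above can be sketched as follows. The camera ranges (elevation in [−10°, 90°], azimuth in [0°, 360°], distance in [1, 1.5]) are assumptions used for illustration, and `render`, `sds_grad`, and `apply_update` are hypothetical callables standing in for the NeRF renderer, the score distillation gradient, and the optimizer step.

```python
import math
import random

def sample_camera(rng):
    """Sample a camera position in spherical coordinates (assumed ranges)."""
    elevation = math.radians(rng.uniform(-10.0, 90.0))
    azimuth = math.radians(rng.uniform(0.0, 360.0))
    radius = rng.uniform(1.0, 1.5)
    position = (
        radius * math.cos(elevation) * math.cos(azimuth),
        radius * math.cos(elevation) * math.sin(azimuth),
        radius * math.sin(elevation),
    )
    return position, elevation, azimuth

def sample_light(camera_position, rng):
    """Light direction drawn around the camera position, then rescaled.

    Direction ~ N(p_cam, I); norm ~ U(0.8, 1.5).
    """
    direction = [rng.gauss(p, 1.0) for p in camera_position]
    length = math.sqrt(sum(d * d for d in direction))
    norm = rng.uniform(0.8, 1.5)
    return tuple(norm * d / length for d in direction)

def train(params, render, sds_grad, apply_update, steps, seed=0):
    """Generic optimization loop; all three callables are placeholders."""
    rng = random.Random(seed)
    for _ in range(steps):
        camera, _, _ = sample_camera(rng)        # 1. random camera
        light = sample_light(camera, rng)        #    and random light
        image = render(params, camera, light)    # 2. render and shade
        grads = sds_grad(image, params)          # 3. SDS gradient
        params = apply_update(params, grads)     # 4. optimizer step
    return params

# Usage with stand-in callables: a scalar "scene" pulled toward 2.0.
final = train(
    0.0,
    render=lambda p, cam, light: p,
    sds_grad=lambda image, p: image - 2.0,
    apply_update=lambda p, g: p - 0.1 * g,
    steps=200,
)
```

The stand-in callables exist only to make the loop executable; in the real setting the gradient would flow from the rendered image back to the NeRF weights.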

Experiments

  • Evaluating DreamFusion’s ability to generate 3D scenes from text prompts
  • Comparing to existing zero-shot text-to-3D generative models
  • Identifying key components of DreamFusion that enable accurate 3D geometry
  • Exploring qualitative capabilities of DreamFusion
  • Evaluating with CLIP R-Precision
  • Ablation study to evaluate components of DreamFusion
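
CLIP R-Precision measures how often a rendered image retrieves its own caption from a set of candidate captions. A minimal sketch of the metric, assuming a precomputed similarity matrix (the scores below are made up and would come from CLIP image and text embeddings in practice):

```python
def r_precision(similarity):
    """Fraction of renders whose top-scoring caption is the correct one.

    similarity[i][j] is the score between render i and caption j;
    caption i is assumed to be the prompt that produced render i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=lambda j: row[j])
        hits += int(best == i)
    return hits / len(similarity)

# Usage: renders 0 and 2 retrieve their own caption, render 1 does not.
scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.2, 0.7],
]
precision = r_precision(scores)
```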

Discussion

  • DreamFusion is a technique for text-to-3D synthesis
  • DreamFusion uses a Score Distillation Sampling approach and a NeRF-like rendering engine
  • DreamFusion does not require 3D or multi-view training data
  • DreamFusion has limitations, such as oversaturated and oversmoothed results and lack of diversity

Ethics statement

  • Generative models for images have ethical concerns
  • Imagen diffusion model has biases and limitations
  • LAION-400M subset of Imagen data contains undesirable images
  • Imagen is conditioned on features from a pretrained language model
  • Need to be careful about datasets used in text-to-image and image-to-3D models
  • Generative models can be used to generate disinformation
  • 3D objects may be more convincing than 2D images
  • Generative models may displace creative workers but also enable growth and improve accessibility

Reproducibility statement

  • Mip-NeRF 360 model is publicly available through the “MultiNeRF” code repository
  • Imagen diffusion model is not publicly available, but other conditional diffusion models may produce similar results with the DreamFusion algorithm
  • Schematic overview of algorithm, pseudocode, hyperparameters, and evaluation setup details included
  • Derivations for loss included in Appendix A.4
  • Sinusoidal positional encoding function uses frequencies $2^0, 2^1, \ldots, 2^{L-1}$, where $L = 8$
  • NeRF MLP consists of 5 ResNet blocks with 128 hidden units, Swish/SiLU activation, and layer normalization
  • Ambient light color $\ell_a$ set to 1 and diffuse light color $\ell_\rho$ set to 0 for first 1k steps of optimization
  • Textureless shading ($\rho = 1$) chosen with probability 0.5 when shading is on ($\ell_\rho > 0$)
  • Small “blob” of density around the origin added to output of MLP
  • Uniformly sample camera elevation $\phi_{\text{cam}}$ from biased distribution with probability 0.5
  • Light position vector direction sampled from $\mathcal{N}(p_{\text{cam}}, I)$ and norm sampled from $\mathcal{U}(0.8, 1.5)$
  • Orientation loss weight set to $10^{-2}$ and annealed in starting from $10^{-4}$ over first 5k steps
  • Accumulated alpha loss weight set to lie in $[10^{-3}, 5 \times 10^{-3}]$
  • Interpolate between front/side/back view prompt augmentations based on which quadrant contains sampled azimuth $\theta_{\text{cam}}$
  • Optimizer uses Distributed Shampoo with $\beta_1 = 0.9$, $\beta_2 = 0.9$, exponent override = 2, block size = 128, graft type = SQRT_N, $\epsilon = 10^{-6}$
  • Linear warmup of learning rate over 3000 steps from $10^{-9}$ to $10^{-4}$ followed by cosine decay down to $10^{-6}$
  • Score distillation sampling loss $\mathcal{L}_{\text{SDS}}$ used to find modes of score functions
  • Gradient of this loss leads to the same update as optimizing the training loss $\mathcal{L}_{\text{Diff}}$ with the U-Net Jacobian term omitted
  • GAN-like amortized samplers can be learned by minimizing the Stein discrepancy
  • Large guidance weights (ω = 100) important for learning high-quality 3D models
  • DreamFusion does not yield large amounts of diversity across random seeds
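
Two of the hyperparameter choices listed above, the sinusoidal positional encoding with L = 8 and the warmup-plus-cosine learning rate schedule, can be written out as a short sketch. Some details are assumptions made for the example: the encoding is shown for a single scalar coordinate without any extra scaling factor, and the cosine decay is assumed to span the steps remaining after warmup out of 15,000 total.

```python
import math

def positional_encoding(x, num_freqs=8):
    """Sin/cos features at frequencies 2^0, 2^1, ..., 2^(L-1) for scalar x."""
    feats = []
    for k in range(num_freqs):
        feats.append(math.sin((2 ** k) * x))
        feats.append(math.cos((2 ** k) * x))
    return feats

def learning_rate(step, total_steps=15000, warmup_steps=3000,
                  lr_init=1e-9, lr_peak=1e-4, lr_final=1e-6):
    """Linear warmup from lr_init to lr_peak, then cosine decay to lr_final."""
    if step < warmup_steps:
        frac = step / warmup_steps
        return lr_init + frac * (lr_peak - lr_init)
    # Assumed: decay runs over the remaining steps and then stays at lr_final.
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return lr_final + 0.5 * (lr_peak - lr_final) * (1.0 + math.cos(math.pi * progress))

# Usage: 16 features per coordinate, and the schedule at a few steps.
features = positional_encoding(0.5)
schedule = [learning_rate(s) for s in (0, 1500, 3000, 9000, 15000)]
```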