Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Recent breakthroughs in text-to-image synthesis have been driven by diffusion models
- 3D synthesis requires large-scale datasets and efficient architectures, which don’t exist
- Text-to-3D synthesis is done using a pretrained 2D text-to-image diffusion model
- Loss based on probability density distillation enables use of 2D diffusion model as prior
- DeepDream-like procedure optimizes a randomly-initialized 3D model via gradient descent
Paper Content
Introduction
- Generative image models now support high-fidelity, diverse and controllable image synthesis
- Quality improvements come from large aligned image-text datasets and scalable generative model architectures
- Diffusion models are effective at learning high-quality image generators
- Applying diffusion models to other modalities requires large amounts of modality-specific training data
- This work develops techniques to transfer pretrained 2D image-text diffusion models to 3D object synthesis
- 3D generative models can be trained on explicit representations of structure
- GANs can learn controllable 3D generators from photographs of a single object category
- Neural Radiance Fields can be used for neural inverse rendering
- Many 3D generative approaches have found success incorporating NeRF-like models
- This work uses pretrained 2D image-text models for 3D synthesis
- Score Distillation Sampling (SDS) enables sampling via optimization in differentiable image parameterizations
- DreamFusion generates high-fidelity coherent 3D objects and scenes for user-provided text prompts
Diffusion models and score distillation sampling
- Diffusion models are generative models that learn to transform a sample from a noise distribution to a data distribution.
- The forward process is typically a Gaussian distribution that transitions from a less noisy latent to a noisier latent.
- The reverse process is trained to slowly add structure starting from random noise.
- The optimal reverse process step is also Gaussian and related to an optimal MSE denoiser.
- The generative model is trained with a weighted evidence lower bound (ELBO).
- Text-to-image diffusion models use classifier-free guidance (CFG) to improve sample fidelity.
- Parameters are updated with SGD and samples are updated in pixel space.
Score distillation sampling ancestral sampling
- 2D sampling methods are compared in Figure 2
- Score distillation sampling is used as an example with an image generator that restricts images to be symmetric
How can we sample in parameter space, not pixel space?
- Existing approaches for sampling from diffusion models generate a sample that is the same type and dimensionality as the observed data.
- Conditional diffusion sampling enables flexibility.
- Diffusion models trained on pixels have traditionally been used to sample only pixels.
- We want to create 3D models that look like good images when rendered from random angles.
- We use a differentiable image parameterization (DIP) to express constraints, optimize in more compact spaces, or leverage more powerful optimization algorithms.
- We need a differentiable loss function where plausible images have low loss, and implausible images have high loss.
- We investigated reusing the diffusion training loss to find modes of the learned conditional density.
- Minimizing the diffusion training loss with respect to a generated datapoint does not produce realistic samples.
- We found that omitting the U-Net Jacobian term leads to an effective gradient for optimizing DIPs with diffusion models.
- We use a weighted probability density distillation loss to compute parameter updates.
- We name our sampling approach Score Distillation Sampling (SDS).
- SDS produces detail comparable to ancestral sampling.
The dreamfusion algorithm
- Use diffusion model as loss in continuous optimization problem to generate samples
- Use Imagen model from Saharia et al. (2022) to synthesize images from text
- Use 64x64 base model, no modifications
- Initialize NeRF-like model with random weights
- Render views of NeRF from random camera positions and angles
- Use renderings as input to score distillation loss function
- Gradient descent eventually results in 3D model resembling text
Neural rendering of a 3d model
- NeRF is a technique for neural inverse rendering that uses a volumetric raytracer and a multilayer perceptron
- Rendering an image from a NeRF is done by casting a ray for each pixel from a camera’s center of projection
- Sampled 3D points are passed through an MLP, which produces 4 scalar values as output
- Output includes volumetric density and an RGB color
- NeRF MLP is trained from random initialization using a mean squared error loss function
- Model is built upon mip-NeRF 360 which reduces aliasing
- MLP parameterizes the color of the surface itself, which is then lit by an illumination
- Regularization penalty on the opacity along each ray and a modified version of the orientation loss are used
- Full details on these regularizers and additional hyperparameters of NeRF are in Appendix A.2
Text-to-3d synthesis
- Pretrained text-to-image diffusion model, differentiable image parameterization in the form of a NeRF, and a loss function used for text-to-3D synthesis
- Randomly sample a camera and light
- Render an image of the NeRF from the camera and shade with the light
- Compute gradients of the SDS loss with respect to the NeRF parameters
- Update the NeRF parameters using an optimizer
- Optimize for 15,000 iterations
Experiments
- Evaluating DreamFusion’s ability to generate 3D scenes from text prompts
- Comparing to existing zero-shot text-to-3D generative models
- Identifying key components of DreamFusion that enable accurate 3D geometry
- Exploring qualitative capabilities of DreamFusion
- Evaluating with CLIP R-Precision
- Ablation study to evaluate components of DreamFusion
Discussion
- DreamFusion is a technique for text-to-3D synthesis
- DreamFusion uses a Score Distillation Sampling approach and a NeRF-like rendering engine
- DreamFusion does not require 3D or multi-view training data
- DreamFusion has limitations, such as oversaturated and oversmoothed results and lack of diversity
Ethics statement
- Generative models for images have ethical concerns
- Imagen diffusion model has biases and limitations
- LAION-400M subset of Imagen data contains undesirable images
- Imagen is conditioned on features from a pretrained language model
- Need to be careful about datasets used in text-to-image and image-to-3D models
- Generative models can be used to generate disinformation
- 3D objects may be more convincing than 2D images
- Generative models may displace creative workers but also enable growth and improve accessibility
Reproducibility statement
- Mip-NeRF 360 model is publicly available through the “MultiNeRF” code repository
- DreamFusion algorithm can produce similar results to Imagen diffusion model
- Schematic overview of algorithm, pseudocode, hyperparameters, and evaluation setup details included
- Derivations for loss included in Appendix A.4
- Sinusoidal positional encoding function uses frequencies 2 0 , 2 1 , . . . , 2 L−1 , where L = 8
- NeRF MLP consists of 5 ResNet blocks with 128 hidden units, Swish/SiLU activation, and layer normalization
- Ambient light color a set to 1 and diffuse light color ρ set to 0 for first 1k steps of optimization
- Textureless shading (ρ = 1) chosen with probability 0.5 when shading is on ( ρ > 0)
- Small “blob” of density around the origin added to output of MLP
- Uniformly sample camera elevation φ cam from biased distribution with probability 0.5
- Light position vector direction sampled from N (p cam , I) and norm sampled from U(0.8, 1.5)
- Orientation loss weight set to 10 −2 and annealed in starting from 10 −4 over first 5k steps
- Accumulated alpha loss weight set to lie in [10 −3 , 5 × 10 −3 ]
- Interpolate between front/side/back view prompt augmentations based on which quadrant contains sampled azimuth θ cam
- Optimizer uses Distributed Shampoo with β 1 = 0.9, β 2 = 0.9, exponent override = 2, block size = 128, graft type = SQRT N, = 10 −6
- Linear warmup of learning rate over 3000 steps from 10 −9 to 10 −4 followed by cosine decay down to 10 −6
- Score distillation sampling loss L SDS used to find modes of score functions
- Gradient of loss leads to same update as optimizing training loss L Diff
- GAN-like amortized samplers can be learned by minimizing the Stein discrepancy
- Large guidance weights (ω = 100) important for learning high-quality 3D models
- DreamFusion does not yield large amounts of diversity across random seeds