Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Recent breakthroughs in text-to-image synthesis have been driven by diffusion models
3D synthesis requires large-scale datasets and efficient architectures, which don’t exist
Text-to-3D synthesis is done using a pretrained 2D text-to-image diffusion model
Loss based on probability density distillation enables use of 2D diffusion model as prior
DeepDream-like procedure optimizes a randomly-initialized 3D model via gradient descent

Paper Content

Introduction

Generative image models now support high-fidelity, diverse and controllable image synthesis
Quality improvements come from large aligned image-text datasets and scalable generative model architectures
Diffusion models are effective at learning high-quality image generators
Applying diffusion models to other modalities requires large amounts of modality-specific training data
This work develops techniques to transfer pretrained 2D image-text diffusion models to 3D object synthesis
3D generative models can be trained on explicit representations of structure
GANs can learn controllable 3D generators from photographs of a single object category
Neural Radiance Fields can be used for neural inverse rendering
Many 3D generative approaches have found success incorporating NeRF-like models
This work uses pretrained 2D image-text models for 3D synthesis
Score Distillation Sampling (SDS) enables sampling via optimization in differentiable image parameterizations
DreamFusion generates high-fidelity coherent 3D objects and scenes for user-provided text prompts

Diffusion models and score distillation sampling

Diffusion models are generative models that learn to transform a sample from a noise distribution to a data distribution.
The forward process is typically a Gaussian distribution that transitions from a less noisy latent to a noisier latent.
The reverse process is trained to slowly add structure starting from random noise.
The optimal reverse process step is also Gaussian and related to an optimal MSE denoiser.
The generative model is trained with a weighted evidence lower bound (ELBO).
Text-to-image diffusion models use classifier-free guidance (CFG) to improve sample fidelity.
Parameters are updated with SGD and samples are updated in pixel space.

Score distillation sampling ancestral sampling

2D sampling methods are compared in Figure 2
Score distillation sampling is used as an example with an image generator that restricts images to be symmetric

How can we sample in parameter space, not pixel space?

Existing approaches for sampling from diffusion models generate a sample that is the same type and dimensionality as the observed data.
Conditional diffusion sampling enables flexibility.
Diffusion models trained on pixels have traditionally been used to sample only pixels.
We want to create 3D models that look like good images when rendered from random angles.
We use a differentiable image parameterization (DIP) to express constraints, optimize in more compact spaces, or leverage more powerful optimization algorithms.
We need a differentiable loss function where plausible images have low loss, and implausible images have high loss.
We investigated reusing the diffusion training loss to find modes of the learned conditional density.
Minimizing the diffusion training loss with respect to a generated datapoint does not produce realistic samples.
We found that omitting the U-Net Jacobian term leads to an effective gradient for optimizing DIPs with diffusion models.
We use a weighted probability density distillation loss to compute parameter updates.
We name our sampling approach Score Distillation Sampling (SDS).
SDS produces detail comparable to ancestral sampling.

The dreamfusion algorithm

Use diffusion model as loss in continuous optimization problem to generate samples
Use Imagen model from Saharia et al. (2022) to synthesize images from text
Use 64x64 base model, no modifications
Initialize NeRF-like model with random weights
Render views of NeRF from random camera positions and angles
Use renderings as input to score distillation loss function
Gradient descent eventually results in 3D model resembling text

Neural rendering of a 3d model

NeRF is a technique for neural inverse rendering that uses a volumetric raytracer and a multilayer perceptron
Rendering an image from a NeRF is done by casting a ray for each pixel from a camera’s center of projection
Sampled 3D points are passed through an MLP, which produces 4 scalar values as output
Output includes volumetric density and an RGB color
NeRF MLP is trained from random initialization using a mean squared error loss function
Model is built upon mip-NeRF 360 which reduces aliasing
MLP parameterizes the color of the surface itself, which is then lit by an illumination
Regularization penalty on the opacity along each ray and a modified version of the orientation loss are used
Full details on these regularizers and additional hyperparameters of NeRF are in Appendix A.2

Text-to-3d synthesis

Pretrained text-to-image diffusion model, differentiable image parameterization in the form of a NeRF, and a loss function used for text-to-3D synthesis
Randomly sample a camera and light
Render an image of the NeRF from the camera and shade with the light
Compute gradients of the SDS loss with respect to the NeRF parameters
Update the NeRF parameters using an optimizer
Optimize for 15,000 iterations

Experiments

Evaluating DreamFusion’s ability to generate 3D scenes from text prompts
Comparing to existing zero-shot text-to-3D generative models
Identifying key components of DreamFusion that enable accurate 3D geometry
Exploring qualitative capabilities of DreamFusion
Evaluating with CLIP R-Precision
Ablation study to evaluate components of DreamFusion

Discussion

DreamFusion is a technique for text-to-3D synthesis
DreamFusion uses a Score Distillation Sampling approach and a NeRF-like rendering engine
DreamFusion does not require 3D or multi-view training data
DreamFusion has limitations, such as oversaturated and oversmoothed results and lack of diversity

Ethics statement

Generative models for images have ethical concerns
Imagen diffusion model has biases and limitations
LAION-400M subset of Imagen data contains undesirable images
Imagen is conditioned on features from a pretrained language model
Need to be careful about datasets used in text-to-image and image-to-3D models
Generative models can be used to generate disinformation
3D objects may be more convincing than 2D images
Generative models may displace creative workers but also enable growth and improve accessibility

Reproducibility statement

Mip-NeRF 360 model is publicly available through the “MultiNeRF” code repository
DreamFusion algorithm can produce similar results to Imagen diffusion model
Schematic overview of algorithm, pseudocode, hyperparameters, and evaluation setup details included
Derivations for loss included in Appendix A.4
Sinusoidal positional encoding function uses frequencies 2 0 , 2 1 , . . . , 2 L−1 , where L = 8
NeRF MLP consists of 5 ResNet blocks with 128 hidden units, Swish/SiLU activation, and layer normalization
Ambient light color a set to 1 and diffuse light color ρ set to 0 for first 1k steps of optimization
Textureless shading (ρ = 1) chosen with probability 0.5 when shading is on ( ρ > 0)
Small “blob” of density around the origin added to output of MLP
Uniformly sample camera elevation φ cam from biased distribution with probability 0.5
Light position vector direction sampled from N (p cam , I) and norm sampled from U(0.8, 1.5)
Orientation loss weight set to 10 −2 and annealed in starting from 10 −4 over first 5k steps
Accumulated alpha loss weight set to lie in [10 −3 , 5 × 10 −3 ]
Interpolate between front/side/back view prompt augmentations based on which quadrant contains sampled azimuth θ cam
Optimizer uses Distributed Shampoo with β 1 = 0.9, β 2 = 0.9, exponent override = 2, block size = 128, graft type = SQRT N, = 10 −6
Linear warmup of learning rate over 3000 steps from 10 −9 to 10 −4 followed by cosine decay down to 10 −6
Score distillation sampling loss L SDS used to find modes of score functions
Gradient of loss leads to same update as optimizing training loss L Diff
GAN-like amortized samplers can be learned by minimizing the Stein discrepancy
Large guidance weights (ω = 100) important for learning high-quality 3D models
DreamFusion does not yield large amounts of diversity across random seeds

Link to paper#

Abstract#

Paper Content#

Introduction#

Diffusion models and score distillation sampling#

Score distillation sampling ancestral sampling#

How can we sample in parameter space, not pixel space?#

The dreamfusion algorithm#

Neural rendering of a 3d model#

Text-to-3d synthesis#

Experiments#

Discussion#

Ethics statement#

Reproducibility statement#

Link to paper

Abstract

Paper Content

Introduction

Diffusion models and score distillation sampling

Score distillation sampling ancestral sampling

How can we sample in parameter space, not pixel space?

The dreamfusion algorithm

Neural rendering of a 3d model

Text-to-3d synthesis

Experiments

Discussion

Ethics statement

Reproducibility statement