Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Problem of reconstructing a full 360° photographic model of an object from a single image.
  • Fitting a neural radiance field to the image is severely ill-posed.
  • Using an approach inspired by DreamFields and DreamFusion to fuse the given input view, the conditional prior, and other regularizers in a final, consistent reconstruction.
  • Demonstrating state-of-the-art reconstruction results on benchmark images.
  • Reconstructions provide a faithful match of the input view and a plausible extrapolation of its appearance and 3D shape.

Paper Content

Introduction

  • Problem of obtaining a 360° photographic reconstruction of any object given a single image
  • Single image does not contain sufficient information for 3D reconstruction
  • Skilled 3D artist can take a picture of almost any object and create a plausible 3D model
  • Algorithmically, must marry visual geometry with a powerful statistical model of the 3D world
  • Recent explosion of 2D image generators suggests 3D models might not be far behind
  • Diffusion models can solve highly-ambiguous generation tasks
  • Training a 3D diffusion model is infeasible due to lack of 3D data
  • Alternative is to extract 3D information from an existing 2D model
  • Early GAN-based generators showed some success for simple data
  • Recent methods do not solve single-image 3D reconstruction problem
  • RealFusion proposed to extract 3D reconstruction from single image without assumptions on type of object or 3D supervision
  • Leverages existing 2D diffusion image generator via single-image variant of textual inversion
  • Introduces new regularizers and efficient implementation using Instant-NGP
  • Demonstrates state-of-the-art reconstruction results on in-the-wild images and existing datasets
  • Early work on 3D reconstruction used photometry only to match image features and then discarded it
  • Neural radiance fields (RFs) can be used to model 3D fields
  • Variants of NeRF-like models use signed distance functions (SDFs) to recover cleaner geometry
  • Some authors have attempted to improve the statistical efficiency of NeRF-like models by learning or incorporating various kinds of priors
  • Some authors have attempted to recover full radiance fields from single images, but this generally requires multi-view data for training
  • Extracting 3D models from 2D image generators has been proposed, using GANs and CLIP embeddings
  • Our method optimizes a neural radiance field using two objectives simultaneously: a reconstruction objective and a prior objective
  • Denoising diffusion probabilistic models are a class of generative models based on iteratively reversing a Markovian noising process
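
The Markovian noising process in the last bullet can be illustrated with a minimal sketch. It assumes the common DDPM parametrization with a linear variance schedule; the schedule values, the number of steps, and all function names are illustrative assumptions and are not taken from the paper.

```python
import math
import random

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear schedule (assumed values)."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) in closed form; returns (x_t, noise)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(alpha_bar[t])
    b = math.sqrt(1.0 - alpha_bar[t])
    x_t = [a * x + b * e for x, e in zip(x0, eps)]
    return x_t, eps

def denoising_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

rng = random.Random(0)
alpha_bar = make_alpha_bar()
x0 = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # stand-in for image pixels
x_t, eps = add_noise(x0, t=500, alpha_bar=alpha_bar, rng=rng)

loss_perfect = denoising_loss(eps, eps)            # an ideal denoiser recovers eps
loss_zero = denoising_loss([0.0] * len(eps), eps)  # a trivial predictor does not
```

A denoising network is trained to drive this loss toward zero for every step `t`; sampling then runs the process in reverse, starting from pure noise.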

Method

Radiance fields and DreamFusion

  • Radiance fields are a pair of functions mapping a 3D point to an opacity and color value
  • Neural networks can be used to implement the functions
  • Rendering is done using the emission-absorption model
  • Diffusion models draw a sample from a probability distribution by iteratively denoising, reversing a process that gradually adds noise to an image
  • Denoising neural networks are used to predict the noise component
  • DreamFusion extracts a 3D rendition of a concept from a diffusion model
  • Single-image textual inversion is used as a substitute for alternative views
  • Coarse-to-fine training is used to optimize the radiance field
  • Normal vector regularization is used to encourage smooth normals
  • A mask loss term is used to incorporate an object mask
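  • The final objective consists of four terms: the diffusion prior, the image reconstruction loss, the mask loss, and the normal smoothness regularizer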
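
Emission-absorption rendering, mentioned in the list above, can be sketched for a single ray as follows. This is a minimal illustration that assumes the usual discretization found in NeRF-style renderers; the function name and the toy sample values are hypothetical.

```python
import math

def render_ray(sigmas, colors, deltas):
    """Composite samples along one ray, ordered from near to far.

    sigmas: opacity (density) at each sample
    colors: (r, g, b) at each sample
    deltas: length of the ray segment each sample represents
    Returns the pixel color and the accumulated opacity.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # absorption within this segment
        weight = transmittance * alpha          # contribution of this sample
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, 1.0 - transmittance

# Empty space, then a nearly opaque red surface, with blue hidden behind it.
pixel, opacity = render_ray(
    sigmas=[0.0, 50.0, 50.0],
    colors=[(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    deltas=[0.5, 0.5, 0.5],
)
```

The red surface absorbs almost all light, so the blue sample behind it contributes next to nothing. The accumulated opacity is the kind of per-pixel quantity that a mask loss can compare against an object mask.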

Experiments

Implementation details

  • We use the same set of hyperparameters for all experiments.
  • We use an open-source Stable Diffusion model.
  • We use an Instant-NGP radiance field model with 16 resolution levels, a feature dimension of 2, and a maximum resolution of 2048.
  • The camera for reconstruction is placed looking at the origin on a sphere of radius 1.8, at an angle of 15° above the plane.
  • We use λ_image = 5.0, λ_mask = 0.5, and λ_normal = 0.5.
  • We keep nearly all parameters the same as [33].
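
The camera placement and loss weights quoted above can be turned into a short worked example. This is a sketch under stated assumptions: the y-up axis convention, the azimuth parameter, the unit weight on the prior term, and all function names are illustrative and not confirmed by the paper.

```python
import math

# Values quoted in the implementation details.
CAMERA_RADIUS = 1.8
CAMERA_ELEVATION_DEG = 15.0
LAMBDA_IMAGE, LAMBDA_MASK, LAMBDA_NORMAL = 5.0, 0.5, 0.5

def camera_position(radius, elevation_deg, azimuth_deg=0.0):
    """Point on a sphere around the origin; y is assumed to be the up axis."""
    elev = math.radians(elevation_deg)
    azim = math.radians(azimuth_deg)
    x = radius * math.cos(elev) * math.sin(azim)
    y = radius * math.sin(elev)
    z = radius * math.cos(elev) * math.cos(azim)
    return (x, y, z)

def view_direction(position):
    """Unit vector pointing from the camera toward the origin."""
    norm = math.sqrt(sum(p * p for p in position))
    return tuple(-p / norm for p in position)

def total_loss(prior, image, mask, normal):
    """Weighted sum of four terms; the unit weight on the prior is an assumption."""
    return prior + LAMBDA_IMAGE * image + LAMBDA_MASK * mask + LAMBDA_NORMAL * normal

cam = camera_position(CAMERA_RADIUS, CAMERA_ELEVATION_DEG)
look = view_direction(cam)
```

With these settings the camera sits about 0.47 units above the horizontal plane (1.8 · sin 15°), and the image term is weighted ten times more heavily than the mask and normal terms.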

Quantitative results

  • Few methods attempt to reconstruct arbitrary objects in 3D
  • Shelf-Supervised Mesh Prediction is the most recent and best-performing method
  • Evaluated on seven categories in the CO3D dataset
  • Quality of recovered 3D shape tested in Fig. 5
  • F-score with threshold 0.05 used to measure distance between predicted and ground truth point clouds
  • Novel-view renderings compared to check if generated views are close to other views given in CO3D
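
The F-score between point clouds can be sketched as below, assuming the usual definition as the harmonic mean of precision and recall at a distance threshold. The brute-force nearest-neighbor search and the toy point clouds are illustrative only, and any alignment or normalization applied before scoring in the paper is not reproduced here.

```python
import math

def _fraction_within(source, target, threshold):
    """Fraction of source points whose nearest target point is within threshold."""
    hits = sum(
        1 for p in source
        if min(math.dist(p, q) for q in target) <= threshold
    )
    return hits / len(source)

def f_score(predicted, ground_truth, threshold=0.05):
    """Harmonic mean of precision and recall between two non-empty point clouds."""
    precision = _fraction_within(predicted, ground_truth, threshold)
    recall = _fraction_within(ground_truth, predicted, threshold)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
predicted = [(0.01, 0.0, 0.0), (1.0, 0.02, 0.0), (0.5, 0.5, 0.5)]
score = f_score(predicted, ground_truth)
```

For these toy clouds, two of the three predicted points lie within 0.05 of a ground-truth point (precision 2/3) and two of the four ground-truth points are matched (recall 1/2), giving an F-score of 4/7. The search is quadratic in the number of points, so a spatial index would be preferable for real point clouds.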

Qualitative results

  • Figure 4 shows multiple 3D reconstructions of an object from a single image.
  • Figure 6 explores the ability of RealFusion to sample the space of possible solutions.
  • Figure 9 shows the effect of normal smoothness on reconstruction quality.
  • Figure 11 shows two typical failure modes of RealFusion.
  • Results from two different priors show that Stable Diffusion yields higher-quality reconstructions.

Analysis and ablations

  • RealFusion uses single-image textual inversion to correctly imagine novel views of a specific object.
  • Without textual inversion, the model reconstructs the backside of the object as a generic instance from the object category.
  • Normal smoothness regularizer of Eq. (5) results in smoother, more realistic meshes and reduces the number of artifacts.
  • Coarse-to-fine optimization reduces the presence of low-level artifacts and results in smoother surfaces.
  • Stable Diffusion works significantly better than relying on an alternative such as CLIP.

Conclusions

  • Introduced RealFusion, a new approach to obtain full 360° photographic reconstructions of any object given a single image
  • Used an off-the-shelf diffusion model trained using only 2D images and no special supervision for 3D reconstruction
  • Selected the model prompt to imagine other views of the object
  • Learned an efficient, multi-scale radiance field representation of the reconstructed object
  • Incorporated an additional regularizer to smooth out the reconstructed surface
  • Generated plausible 3D reconstructions of objects captured in the wild
  • Future work includes specializing the diffusion model and incorporating dynamics
  • Used the CO3D dataset in a manner compatible with its terms
  • Rendered at resolution 96px and upsampled to 512px before passing to Stable Diffusion latent space encoder
  • Optimized using Adam optimizer with learning rate 1e-3 for 5000 iterations
  • Used two-layer MLP as background model
  • Added orientation loss and entropy loss as regularizers
  • Single-image textual inversion step used heavy image augmentations
  • Compared to recent single-view reconstruction methods on the lego scene from the synthetic NeRF dataset
  • Explored the idea of reconstructing a 3D object from a text prompt alone
  • Three most common failure cases: neural fields lacking well-defined geometry, floaters, and the Janus problem