Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Problem of reconstructing a full 360° photographic model of an object from a single image.
- Fitting a neural radiance field to the image is severely ill-posed.
- Using an approach inspired by DreamFields and DreamFusion to fuse the given input view, the conditional prior, and other regularizers in a final, consistent reconstruction.
- Demonstrating state-of-the-art reconstruction results on benchmark images.
- Reconstructions provide a faithful match of the input view and a plausible extrapolation of its appearance and 3D shape.
Paper Content
Introduction
- Problem of obtaining a 360° photographic reconstruction of any object given a single image
- Single image does not contain sufficient information for 3D reconstruction
- Skilled 3D artist can take a picture of almost any object and create a plausible 3D model
- Algorithmically, must marry visual geometry with a powerful statistical model of the 3D world
- Recent explosion of 2D image generators suggests 3D models might not be far behind
- Diffusion models can solve highly-ambiguous generation tasks
- Training a 3D diffusion model is infeasible due to lack of 3D data
- Alternative is to extract 3D information from an existing 2D model
- Early GAN-based generators showed some success for simple data
- Recent methods do not solve single-image 3D reconstruction problem
- RealFusion proposed to extract 3D reconstruction from single image without assumptions on type of object or 3D supervision
- Leverages existing 2D diffusion image generator via single-image variant of textual inversion
- Introduces new regularizers and efficient implementation using Instant-NGP
- Demonstrates state-of-the-art reconstruction results on in-the-wild images and existing datasets
Related work
- Early work on 3D reconstruction used photometry to match image features and then discarded it
- Neural radiance fields (RFs) can be used to model 3D fields
- Variants of NeRF-like models use sign distance functions (SDFs) to recover cleaner geometry
- Some authors have attempted to improve the statistical efficiency of NeRF-like models by learning or incorporating various kinds of priors
- Some authors have attempted to recover full radiance fields from single images, but this generally requires multi-view data for training
- Extracting 3D models from 2D image generators has been proposed, using GANs and CLIP embeddings
- Our method optimizes a neural radiance field using two objectives simultaneously: a reconstruction objective and a prior objective
- Diffusion denoising probabilistic models are a class of generative models based on iteratively reversing a Markovian noising process
Method
Radiance fields and dreamfusion
- Radiance fields are a pair of functions mapping a 3D point to an opacity and color value
- Neural networks can be used to implement the functions
- Rendering is done using the emission-absorption model
- Diffusion models draw a sample from a probability distribution by adding noise to the image
- Denoising neural networks are used to predict the noise component
- Dream-Fusion extracts a 3D rendition of a concept from a diffusion model
- Single-image textual inversion is used as a substitute for alternative views
- Coarse-to-fine training is used to optimize the radiance field
- Normal vector regularization is used to encourage smooth normals
- A mask loss term is used to incorporate an object mask
- The final objective consists of four terms
Experiments
Implementation details
- We use the same set of hyperparameters for all experiments.
- We use an open-source Stable Diffusion model.
- We use a model with 16 resolution levels, a feature dimension of 2, and a maximum resolution of 2048.
- The camera for reconstruction is placed looking at the origin on a sphere of radius 1.8, at an angle of 15deg above the plane.
- We use λ image = 5.0, λ mask = 0.5, and λ normal = 0.5.
- We keep nearly all parameters the same as [33].
Quantitative results
- Few methods attempt to reconstruct arbitrary objects in 3D
- Shelf-Supervised Mesh Prediction is the most recent and best-performing method
- Evaluated on seven categories in the CO3D dataset
- Quality of recovered 3D shape tested in Fig. 5
- F-score with threshold 0.05 used to measure distance between predicted and ground truth point clouds
- Novel-view renderings compared to check if generated views are close to other views given in CO3D
Qualitative results
- Figure 4 shows multiple 3D reconstructions of an object from a single image.
- Figure 6 explores the ability of RealFusion to sample the space of possible solutions.
- Figure 9 shows the effect of normal smoothness on reconstruction quality.
- Figure 11 shows two typical failure modes of RealFusion.
- Results from two different priors show that Stable Diffusion yields higher-quality reconstructions.
Analysis and ablations
- RealFusion uses single-image textual inversion to correctly imagine novel views of a specific object.
- Without textual inversion, the model reconstructs the backside of the object as a generic instance from the object category.
- Normal smoothness regularizer of Eq. (5) results in smoother, more realistic meshes and reduces the number of artifacts.
- Coarse-to-fine optimization reduces the presence of low-level artifacts and results in smoother surfaces.
- Stable Diffusion works significantly better than relying on an alternative such as CLIP.
Conclusions
- Introduced RealFusion, a new approach to obtain full 360° photographic reconstructions of any object given a single image
- Used an off-the-shelf diffusion model trained using only 2D images and no special supervision for 3D reconstruction
- Selected the model prompt to imagine other views of the object
- Learned an efficient, multi-scale radiance field representation of the reconstructed object
- Incorporated an additional regularizer to smooth out the reconstructed surface
- Generated plausible 3D reconstructions of objects captured in the wild
- Future works include specializing the diffusion model and incorporating dynamics
- Used CO3D dataset in a manner compatible with their terms
- Rendered at resolution 96px and upsampled to 512px before passing to Stable Diffusion latent space encoder
- Optimized using Adam optimizer with learning rate 1e-3 for 5000 iterations
- Used two-layer MLP as background model
- Added orientation loss and entropy loss as regularizers
- Single-image textual inversion step used heavy image augmentations
- Compared to recent single-view reconstruction methods on the lego scene from the synthetic NeRF dataset
- Explored the idea of reconstructing a 3D object from a text prompt alone
- Three most common failure cases: neural fields lacking well-defined geometry, floaters, and the Janus problem