Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Reconstructing the 3D shape of an object from a single RGB image is a challenging problem in computer vision.
The proposed method generates a sparse point cloud via a conditional denoising diffusion process.
The method takes a single RGB image and its camera pose as input.
A projection conditioning process enables high-resolution sparse geometries that are well-aligned with the input image.
The method can generate multiple different shapes that are consistent with a single input image.
It performs well on both synthetic and real-world data.

Reconstructing 3D structure from a single 2D view is a difficult computer vision problem.
Humans are good at using cues and prior knowledge to infer 3D structure from single views.
Single-view reconstruction has practical applications in augmented and virtual reality.
Recent research has used end-to-end deep learning methods to predict volumes from single images.
Diffusion models have been used to generate high-fidelity image samples from scratch or when conditioned on textual inputs.
This work uses diffusion models to conditionally generate the shape of unseen regions of a 3D object.
The model is able to generate multiple plausible 3D point clouds which are all consistent with the input.
The model performs competitively on the synthetic ShapeNet benchmark and on the challenging, real-world Co3D dataset.

Many single-view 3D reconstruction methods use 2D and 3D convolutional networks to map an input image into a 3D representation.
3D-R2N2 is a pioneering method in this line of work.
PC^2 reconstructs a colored point cloud from a single input image.
It uses a projection conditioning method to project image features onto the partially-denoised point cloud.
LegoFormer uses a transformer-based approach to encode an image into a feature vector and decode a 3D voxel grid.
NeRF-WCE and PixelNeRF are methods for single/few-view reconstruction.
This paper uses denoising diffusion probabilistic models for point cloud-based reconstruction.

Denoising diffusion probabilistic models are generative models inspired by stochastic differential equations and non-equilibrium thermodynamics.
They are based on an iterative noising process that gradually turns a data sample into noise.
To form a generative model, the reverse diffusion process is considered, which begins with a sample from the noise distribution and denoises it over a series of steps.
Each reverse step is a distribution approximated by a neural network, which predicts its mean.
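The noising process described above can be illustrated with a minimal NumPy sketch. It assumes the standard Gaussian formulation in which a noised sample can be drawn in closed form from the clean sample; the linear schedule, its endpoint values, and the names `linear_beta_schedule` and `q_sample` are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Per-step noise variances beta_1..beta_T (values are assumptions)."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form.

    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)
    Returns both the noised sample and the noise that was added.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal retention per step

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 3))     # toy "data" sample
x_early, _ = q_sample(x0, 0, alpha_bar, rng)    # almost unchanged
x_late, _ = q_sample(x0, 999, alpha_bar, rng)   # almost pure noise
```

With this schedule, `alpha_bar` decays from nearly 1 to nearly 0, so early steps barely perturb the sample while the final step is close to a draw from the noise distribution.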

A 3D point cloud with N points is treated as a 3N-dimensional object
The network denoises a set of points from a spherical Gaussian ball into a recognizable object
The network is trained to predict the noise added in the most recent time step
At inference time, a random point cloud is sampled from a 3N-dimensional Gaussian
The reverse diffusion process is then run to produce a sample
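The training target and sampling loop summarized above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the common DDPM parameterization (noise-prediction loss on a closed-form noised sample, ancestral sampling with variance beta_t), and `dummy_noise_model` is a hypothetical stand-in for the trained denoising network.

```python
import numpy as np

def make_schedule(num_steps=100, beta_start=1e-4, beta_end=2e-2):
    """Linear variance schedule (an assumption for this sketch)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def dummy_noise_model(x_t, t):
    """Hypothetical placeholder for the trained network: predicts zero noise."""
    return np.zeros_like(x_t)

def noise_prediction_loss(model, x0, t, alpha_bar, rng):
    """L2 loss between the predicted noise and the noise actually added."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return float(np.mean((model(x_t, t) - eps) ** 2))

def sample_point_cloud(model, num_points, betas, alphas, alpha_bar, rng):
    """Start from Gaussian noise over N points and run the reverse process."""
    x = rng.standard_normal((num_points, 3))  # x_T: a 3N-dimensional Gaussian
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = model(x, t)
        # Mean of the reverse step under the noise-prediction parameterization.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added at the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else np.zeros_like(x)
        x = mean + np.sqrt(betas[t]) * noise
    return x
```

A real model would replace `dummy_noise_model` with a network that also receives the conditioning signal; the loop structure stays the same.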

3D reconstruction is formulated as conditional generation
The model is conditioned on a reference image and its camera view
Prior work used encoder-decoder architectures conditioned on image embeddings to generate 3D shapes
Such conditioning yields only a weak form of geometric consistency between the input image and the reconstructed shape
PC^2 uses projection-based conditioning to promote geometric consistency
PC^2 also reconstructs object color
PC^2-FM and PC^2-FA use object silhouettes to filter the sampled results
PC^2-FM uses the object mask; PC^2-FA uses mutual agreement between predictions
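The projection-based conditioning described above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: a pinhole camera with rotation `R`, translation `t`, and intrinsics `K`, plus nearest-pixel feature lookup. It does not model occlusion, which a rasterization-based projection would handle, and the function names are hypothetical.

```python
import numpy as np

def project_points(points_world, R, t, K):
    """Project (N, 3) world points to (N, 2) pixel coordinates and (N,) depths."""
    cam = points_world @ R.T + t           # world frame -> camera frame
    depth = cam[:, 2]
    uvw = cam @ K.T                        # apply intrinsics
    safe = np.where(depth > 1e-8, depth, 1.0)  # avoid dividing by zero
    return uvw[:, :2] / safe[:, None], depth

def gather_point_features(feature_map, uv, depth):
    """Look up an (H, W, C) feature map at each projected point.

    Points behind the camera or outside the image receive zero features.
    """
    H, W, C = feature_map.shape
    cols = np.round(uv[:, 0]).astype(int)
    rows = np.round(uv[:, 1]).astype(int)
    valid = (depth > 1e-8) & (cols >= 0) & (cols < W) & (rows >= 0) & (rows < H)
    feats = np.zeros((uv.shape[0], C), dtype=feature_map.dtype)
    feats[valid] = feature_map[rows[valid], cols[valid]]
    return feats

# Toy example: identity pose, focal length 10, principal point (4, 3).
K = np.array([[10.0, 0.0, 4.0], [0.0, 10.0, 3.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
feature_map = np.arange(6 * 8 * 2, dtype=float).reshape(6, 8, 2)
points = np.array([[0.0, 0.0, 2.0],    # projects to the principal point
                   [0.2, -0.2, 2.0],   # projects inside the image
                   [0.0, 0.0, -1.0],   # behind the camera
                   [10.0, 0.0, 1.0]])  # outside the image
uv, depth = project_points(points, R, t, K)
feats = gather_point_features(feature_map, uv, depth)
# Conditioning input: each point's coordinates joined with its image features.
conditioned = np.concatenate([points, feats], axis=1)   # shape (N, 3 + C)
```

Repeating this lookup at every denoising step, on the current partially-denoised points, is what ties each point to the image region it projects onto.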

The paper proposes PC^2, a novel diffusion-based method for single-view 3D shape reconstruction
It iteratively reconstructs a shape by projecting image features onto a partially-denoised point cloud
It outperforms prior methods on synthetic benchmarks
It reconstructs objects with high levels of detail from challenging real-world images
It uses a Point-Voxel model to process the point cloud
It projects points onto the image using rasterization
Implementation details include a mask distance function and the projection method
The method is evaluated on the ShapeNet-R2N2 dataset
Without filtering, performance is similar to prior work; with filtering, it outperforms prior work
Failure cases occur on ambiguous categories
Qualitative examples show reconstructions on Co3D and ShapeNet
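The silhouette-based filtering behind the PC^2-FM and PC^2-FA results can be sketched as follows. This assumes that several candidate point clouds have already been sampled and projected to pixel coordinates, and that candidates are scored by intersection-over-union (IoU) of binary silhouettes. The IoU score and all function names are assumptions for illustration.

```python
import numpy as np

def rasterize_silhouette(uv, height, width):
    """Binary silhouette: mark every pixel hit by at least one projected point."""
    sil = np.zeros((height, width), dtype=bool)
    cols = np.round(uv[:, 0]).astype(int)
    rows = np.round(uv[:, 1]).astype(int)
    valid = (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)
    sil[rows[valid], cols[valid]] = True
    return sil

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union > 0 else 0.0

def filter_by_mask(silhouettes, mask):
    """Mask-based variant: keep the candidate that best matches the object mask."""
    scores = [iou(s, mask) for s in silhouettes]
    return int(np.argmax(scores)), scores

def filter_by_agreement(silhouettes):
    """Agreement-based variant: keep the candidate most similar to the others."""
    n = len(silhouettes)
    scores = [
        sum(iou(silhouettes[i], silhouettes[j]) for j in range(n) if j != i) / max(n - 1, 1)
        for i in range(n)
    ]
    return int(np.argmax(scores)), scores

def square_points(lo, hi):
    """Toy projected points covering the pixel square [lo, hi) x [lo, hi)."""
    ys, xs = np.mgrid[lo:hi, lo:hi]
    return np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)

mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True
candidates = [square_points(0, 4), square_points(2, 6), square_points(2, 6)]
silhouettes = [rasterize_silhouette(uv, 8, 8) for uv in candidates]
best_fm, fm_scores = filter_by_mask(silhouettes, mask)
best_fa, fa_scores = filter_by_agreement(silhouettes)
```

The mask-based variant needs an object mask for the input image, whereas the agreement-based variant relies only on the sampled predictions themselves.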