Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Reconstructing 3D shape from single RGB image is a challenging problem in computer vision.
- Proposed method generates sparse point cloud via a conditional denoising diffusion process.
- Method takes input of single RGB image and camera pose.
- Projection conditioning process enables high-resolution sparse geometries that are well-aligned with input image.
- Method can generate multiple different shapes consistent with single input image.
- Performs well on synthetic and real-world data.
Paper Content
Introduction
- Reconstructing 3D structure from a single 2D view is a difficult computer vision problem.
- Humans are good at using cues and prior knowledge to infer 3D structure from single views.
- Single-view reconstruction has practical applications in augmented and virtual reality.
- Recent research has used end-to-end deep learning methods to predict volumes from single images.
- Diffusion models have been used to generate high-fidelity image samples from scratch or when conditioned on textual inputs.
- This work uses diffusion models to conditionally generate the shape of unseen regions of a 3D object.
- The model is able to generate multiple plausible 3D point clouds which are all consistent with the input.
- The model performs competitively on the synthetic ShapeNet benchmark and on the challenging, real-world Co3D dataset.
Related work
- Single-view 3D reconstruction uses 2D and 3D convolutional networks to map an input image into a 3D representation.
- 3D-R2N2 is a pioneering method in this line of work.
- PC² reconstructs a colored point cloud from a single input image.
- It uses a projection conditioning method that projects image features onto the partially denoised point cloud.
- LegoFormer uses a transformer-based approach to encode an image into a feature vector and decode a 3D voxel grid.
- NeRF-WCE and PixelNeRF are methods for single/few-view reconstruction.
- This paper uses denoising diffusion probabilistic models for point cloud-based reconstruction.
Diffusion models
- Denoising diffusion probabilistic models are generative models inspired by stochastic differential equations and non-equilibrium thermodynamics.
- They are based on an iterative noising process that gradually adds noise to a data sample.
- To form a generative model, the reverse diffusion process is considered, which begins with a sample from the noise distribution and denoises over a series of steps.
- The reverse transition distribution is learned with a neural network, which predicts the mean of each denoising step.
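For reference, the noising and denoising processes described above are usually written as follows. This uses the standard notation for denoising diffusion probabilistic models; the symbols (the noise schedule \beta_t, the cumulative product \bar\alpha_t, the learned mean \mu_\theta and the fixed variance \sigma_t^2) are assumptions taken from that general formulation, not from this summary.

```latex
\begin{align}
q(x_t \mid x_{t-1}) &= \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right) \\
q(x_t \mid x_0) &= \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right),
  \qquad \bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s) \\
p_\theta(x_{t-1} \mid x_t) &= \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)
\end{align}
```

The first line is a single noising step, the second is its closed form for jumping directly from the clean sample to step t, and the third is the learned reverse step that the generative model applies repeatedly, starting from pure noise.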
Method
- Overview of denoising diffusion models
- Introduction of the novel conditioning scheme PC²
- Description of the filtering method PC²-FM
Point cloud diffusion models
- A 3D point cloud with N points is treated as a 3N-dimensional object
- Network denoises a set of points from a spherical Gaussian ball into a recognizable object
- Network is trained to predict the noise added in the most recent time step
- At inference time, a random point cloud is sampled from a 3N-dimensional Gaussian
- Reverse diffusion process is run to produce a sample
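The training target and sampling loop listed above can be sketched roughly as follows. This is a minimal NumPy illustration, assuming the common DDPM parameterization in which the network regresses the noise from the closed-form forward process. The function names, the schedule values and the `dummy_denoiser` stand-in are hypothetical and are not taken from the paper.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule. The values are illustrative, not the paper's."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas_bar = np.cumprod(1.0 - betas)
    return betas, alphas_bar

def add_noise(x0, t, alphas_bar, rng):
    """Noise a clean (N, 3) point cloud to step t in closed form."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps  # eps is the regression target for the network

def sample(denoiser, num_points, betas, alphas_bar, rng):
    """Start from a 3N-dimensional Gaussian sample and denoise step by step."""
    x = rng.standard_normal((num_points, 3))
    for t in reversed(range(len(betas))):
        eps_hat = denoiser(x, t)  # predicted noise, shape (N, 3)
        alpha_t = 1.0 - betas[t]
        mean = (x - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps_hat) / np.sqrt(alpha_t)
        # No noise is added at the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

def dummy_denoiser(x, t):
    """Hypothetical stand-in for the trained network: predicts zero noise."""
    return np.zeros_like(x)
```

In a real setup `denoiser` would be the trained point cloud network, and the training loss would be the squared error between its output on `xt` and the `eps` returned by `add_noise`.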
Conditional point cloud diffusion models
- 3D reconstruction is formulated as conditional generation
- Model is conditioned on reference image and camera view
- Prior work used encoder-decoder architectures conditioned on image embeddings to generate 3D shapes
- Weak form of geometric consistency between input image and reconstructed shape
- PC² uses projection-based conditioning to promote geometric consistency
- PC² also reconstructs object color
- PC²-FM and PC²-FA use silhouettes to filter results
- PC²-FM uses the object mask, PC²-FA uses mutual agreement between predictions
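The projection-based conditioning described above can be illustrated with a small NumPy sketch: each point of the partially denoised cloud is projected into the image, and the feature of the pixel it lands on is attached to that point. This assumes a simple pinhole camera with intrinsics `K`, rotation `R` and translation `t`, uses nearest-pixel lookup, and ignores occlusion, all of which are simplifications. The function names are hypothetical.

```python
import numpy as np

def project_points(points, K, R, t):
    """Project (N, 3) world-space points to pixel coordinates (pinhole camera)."""
    cam = points @ R.T + t                    # world -> camera coordinates
    z = cam[:, 2]                             # depth along the optical axis
    z_safe = np.where(z > 1e-8, z, 1.0)       # avoid dividing by zero or negative depth
    uv = (cam @ K.T)[:, :2] / z_safe[:, None] # intrinsics, then perspective divide
    return uv, z

def gather_image_features(points, features, K, R, t):
    """Concatenate each point with the feature of the pixel it projects onto.

    features: (H, W, C) image feature map. Points that fall outside the image
    or behind the camera receive a zero feature vector.
    """
    H, W, C = features.shape
    uv, z = project_points(points, K, R, t)
    cols = np.round(uv[:, 0]).astype(int)
    rows = np.round(uv[:, 1]).astype(int)
    valid = (z > 1e-8) & (cols >= 0) & (cols < W) & (rows >= 0) & (rows < H)
    out = np.zeros((points.shape[0], C), dtype=features.dtype)
    out[valid] = features[rows[valid], cols[valid]]
    return np.concatenate([points, out], axis=1)  # shape (N, 3 + C)
```

In a diffusion setting this lookup would be repeated at every denoising step, so that the features attached to each point follow the point as it moves.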
Experiments
- ShapeNet Dataset is a collection of 3D CAD models
- Model is Point-Voxel CNN (PVCNN)
- Implemented in PyTorch and PyTorch3D library
- Trained with batch size 16 for 100,000 steps
- MAE (masked autoencoder) used for image feature extraction
- AdamW used for optimization
- Images of size 137x137px and point clouds with 8192 points
- Linear schedule with warmup for diffusion noise schedule
- Quantitative results show on-par performance with prior work
- Qualitative results show realistic object shapes from any viewpoint
- Probabilistic approach allows for multiple plausible shapes
- Filtering improves performance substantially
- Limitation is need for point cloud ground truth for training
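Because the model is probabilistic, several candidate shapes can be sampled for one image and the best one kept. The mask-based and agreement-based filtering ideas mentioned earlier could be sketched as below. This is a rough illustration that assumes the score is silhouette intersection-over-union, which may differ from the distance function used in the paper. All function names are hypothetical.

```python
import numpy as np

def silhouette(uv, height, width):
    """Splat projected points, given as (N, 2) pixel coordinates, into a binary mask."""
    mask = np.zeros((height, width), dtype=bool)
    cols = np.round(uv[:, 0]).astype(int)
    rows = np.round(uv[:, 1]).astype(int)
    inside = (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)
    mask[rows[inside], cols[inside]] = True
    return mask

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union > 0 else 0.0

def select_by_mask(silhouettes, object_mask):
    """Mask-based filtering: index of the sample that best matches the object mask."""
    scores = [iou(s, object_mask) for s in silhouettes]
    return int(np.argmax(scores))

def select_by_agreement(silhouettes):
    """Agreement-based filtering: index of the sample most consistent with the others.

    Needs at least two samples.
    """
    n = len(silhouettes)
    scores = [
        np.mean([iou(silhouettes[i], silhouettes[j]) for j in range(n) if j != i])
        for i in range(n)
    ]
    return int(np.argmax(scores))
```

The mask-based variant needs an object mask for the input image, while the agreement-based variant only compares the sampled shapes with each other.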
Conclusions
- Proposed PC², a novel diffusion-based method for single-view 3D shape reconstruction
- Iteratively reconstructs a shape by projecting image features onto a partially-denoised point cloud
- Outperforms prior methods on synthetic benchmarks
- Reconstructs objects with high levels of detail from challenging real-world images
- Uses Point-Voxel model to process point cloud
- Projects points onto image using rasterization
- Includes mask distance function and projection method
- Evaluated on ShapeNet-R2N2 dataset
- Performance similar to prior work without filtering, outperforms with filtering
- Failure cases on ambiguous categories
- Qualitative examples of reconstructions on Co3D and ShapeNet