Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Reconstructing 3D shape from a single RGB image is a challenging problem in computer vision.
  • The proposed method generates a sparse point cloud via a conditional denoising diffusion process.
  • The method takes a single RGB image and its camera pose as input.
  • A projection conditioning process enables generation of high-resolution sparse geometries that are well aligned with the input image.
  • The method can generate multiple different shapes consistent with a single input image.
  • Performs well on synthetic and real-world data.

Paper Content

Introduction

  • Reconstructing 3D structure from a single 2D view is a difficult computer vision problem.
  • Humans are good at using cues and prior knowledge to infer 3D structure from single views.
  • Single-view reconstruction has practical applications in augmented and virtual reality.
  • Recent research has used end-to-end deep learning methods to predict volumes from single images.
  • Diffusion models have been used to generate high-fidelity image samples from scratch or when conditioned on textual inputs.
  • This work uses diffusion models to conditionally generate the shape of unseen regions of a 3D object.
  • The model is able to generate multiple plausible 3D point clouds which are all consistent with the input.
  • The model performs competitively on the synthetic ShapeNet benchmark and on the challenging, real-world Co3D dataset.
  • Many single-view 3D reconstruction methods use 2D and 3D convolutional networks to map an input image into a 3D representation.
  • 3D-R2N2 is a pioneering method in this line of work.
  • PC² reconstructs a colored point cloud from a single input image.
  • It uses a projection conditioning method to project image features onto the partially denoised point cloud.
  • LegoFormer uses a transformer-based approach to encode an image into a feature vector and decode a 3D voxel grid.
  • NeRF-WCE and PixelNeRF are methods for single- or few-view reconstruction.
  • This paper uses denoising diffusion probabilistic models for point cloud-based reconstruction.

Diffusion models

  • Denoising diffusion probabilistic models are generative models inspired by stochastic differential equations and non-equilibrium thermodynamics.
  • Denoising diffusion models are based on an iterative noising process.
  • To form a generative model, the reverse diffusion process is considered, which begins with a sample from the noise distribution and denoises it over a series of steps.
  • Each reverse step is modeled as a learned distribution whose mean is predicted by a neural network.
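As a rough illustration of the iterative noising process described above, here is a minimal NumPy sketch of the forward (noising) direction in the standard DDPM formulation. The linear beta schedule, its endpoint values, and the helper names make_schedule and q_sample are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule; the endpoint values are illustrative assumptions."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas_cumprod = np.cumprod(1.0 - betas)  # \bar{alpha}_t in DDPM notation
    return betas, alphas_cumprod

def q_sample(x0, t, alphas_cumprod, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form; returns (x_t, noise)."""
    noise = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    return x_t, noise

rng = np.random.default_rng(0)
betas, alphas_cumprod = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(8, 3))             # toy "clean" sample
x_early, _ = q_sample(x0, 10, alphas_cumprod, rng)   # still close to x0
x_late, _ = q_sample(x0, 999, alphas_cumprod, rng)   # close to pure noise
```

Early steps leave the sample almost unchanged, while the final step retains almost none of the original signal, which is why the reverse process can start from a plain Gaussian sample.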

Method

  • Overview of denoising diffusion models
  • Introduction of novel conditioning scheme PC²
  • Description of filtering method PC²-FM

Point cloud diffusion models

  • A 3D point cloud with N points is treated as a 3N-dimensional object
  • Network denoises a set of points from a spherical Gaussian ball into a recognizable object
  • Network is trained to predict the noise added in the most recent time step
  • At inference time, a random point cloud is sampled from a 3N-dimensional Gaussian
  • Reverse diffusion process is run to produce a sample
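The training target and the sampling loop listed above can be sketched as follows. This is a minimal NumPy sketch that assumes the standard DDPM noise-prediction objective and ancestral sampler; dummy_noise_predictor is a hypothetical stand-in for the trained network, and the schedule values are illustrative assumptions.

```python
import numpy as np

def linear_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Illustrative linear schedule; returns betas, alphas, cumulative alphas."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def dummy_noise_predictor(x_t, t):
    """Hypothetical placeholder for the trained denoising network."""
    return np.zeros_like(x_t)

def training_example(x0, t, alpha_bars, rng):
    """Build one (network input, regression target) pair for step t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps  # the network is trained so that predictor(x_t, t) ~ eps

def sample(num_points, num_steps, predictor, rng):
    """Run the reverse process from Gaussian noise to an (N, 3) point cloud."""
    betas, alphas, alpha_bars = linear_schedule(num_steps)
    x = rng.standard_normal((num_points, 3))  # a draw from a 3N-dim Gaussian
    for t in reversed(range(num_steps)):
        eps_hat = predictor(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0  # no noise at the last step
        x = mean + np.sqrt(betas[t]) * noise
    return x

rng = np.random.default_rng(0)
cloud = sample(num_points=64, num_steps=50, predictor=dummy_noise_predictor, rng=rng)
```

With the placeholder predictor the output is just noise; the point of the sketch is the data flow, in which every point is updated jointly at each step.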

Conditional point cloud diffusion models

  • 3D reconstruction is formulated as conditional generation
  • Model is conditioned on reference image and camera view
  • Prior work used encoder-decoder architectures conditioned on image embeddings to generate 3D shapes
  • Weak form of geometric consistency between input image and reconstructed shape
  • PC² uses projection-based conditioning to promote geometric consistency
  • PC² also reconstructs object color
  • PC²-FM and PC²-FA use silhouettes to filter results
  • PC²-FM uses the object mask; PC²-FA uses mutual agreement between predictions
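To make projection-based conditioning more concrete, here is a minimal NumPy sketch that projects 3D points into the image with a pinhole camera and copies the image feature at each landing pixel. The function name project_features, the camera convention (world-to-camera R and t, intrinsics K), and the crude per-pixel depth test are all assumptions made for illustration; the per-pixel depth test is only a rough stand-in for a real rasterizer.

```python
import numpy as np

def project_features(points, feat_map, K, R, t):
    """Attach image features to 3D points by pinhole projection.

    points: (N, 3) world coordinates; feat_map: (H, W, C) image features;
    K: (3, 3) intrinsics; R, t: world-to-camera rotation and translation.
    Returns (N, C); rows stay zero for points behind the camera, outside
    the image, or hidden behind a nearer point in the same pixel.
    """
    H, W, C = feat_map.shape
    cam = points @ R.T + t
    z = cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)          # avoid dividing by z <= 0
    uvw = cam @ K.T
    u = np.floor(uvw[:, 0] / safe_z).astype(int)
    v = np.floor(uvw[:, 1] / safe_z).astype(int)
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    # Crude z-buffer: keep only the nearest point per pixel.
    idx = np.flatnonzero(valid)
    depth = np.full((H, W), np.inf)
    np.minimum.at(depth, (v[idx], u[idx]), z[idx])
    visible = np.zeros(len(points), dtype=bool)
    visible[idx] = z[idx] <= depth[v[idx], u[idx]]

    out = np.zeros((len(points), C))
    vis = np.flatnonzero(visible)
    out[vis] = feat_map[v[vis], u[vis]]
    return out

K = np.array([[4.0, 0.0, 4.0], [0.0, 4.0, 4.0], [0.0, 0.0, 1.0]])
feat_map = np.arange(8 * 8 * 2, dtype=float).reshape(8, 8, 2)
points = np.array([[0, 0, 2], [0, 0, 5], [0, 0, -1], [10, 0, 1]], dtype=float)
feats = project_features(points, feat_map, K, np.eye(3), np.zeros(3))
denoiser_input = np.concatenate([points, feats], axis=1)  # (N, 3 + C)
```

In this toy example only the first point receives features: the second is occluded by it, the third lies behind the camera, and the fourth projects outside the image. Repeating the projection at every denoising step is what ties the partially denoised cloud to the input view.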

Experiments

  • ShapeNet Dataset is a collection of 3D CAD models
  • Model is Point-Voxel CNN (PVCNN)
  • Implemented in PyTorch and PyTorch3D library
  • Trained with batch size 16 for 100,000 steps
  • A masked autoencoder (MAE) is used for image feature extraction
  • AdamW used for optimization
  • Images of size 137×137 px and point clouds with 8192 points
  • Linear schedule with warmup for diffusion noise schedule
  • Quantitative results show on-par performance with prior work
  • Qualitative results show realistic object shapes from any viewpoint
  • Probabilistic approach allows for multiple plausible shapes
  • Filtering improves performance substantially
  • Limitation is need for point cloud ground truth for training
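The filtering step mentioned above selects one reconstruction out of several sampled candidates. A minimal sketch follows, assuming each candidate has already been rendered to a binary silhouette from the input viewpoint and that agreement is measured with intersection-over-union (IoU). The IoU criterion and the helper names select_by_mask and select_by_agreement are assumptions for illustration.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

def select_by_mask(silhouettes, object_mask):
    """Mask-based filtering: pick the candidate that best matches the object mask."""
    scores = [iou(s, object_mask) for s in silhouettes]
    return int(np.argmax(scores))

def select_by_agreement(silhouettes):
    """Agreement-based filtering: pick the candidate most consistent with the others."""
    n = len(silhouettes)
    if n < 2:
        return 0
    scores = [
        np.mean([iou(silhouettes[i], silhouettes[j]) for j in range(n) if j != i])
        for i in range(n)
    ]
    return int(np.argmax(scores))

def box(rows, cols, size=10):
    """Toy silhouette: a filled rectangle on a size x size canvas."""
    m = np.zeros((size, size), dtype=bool)
    m[rows[0]:rows[1], cols[0]:cols[1]] = True
    return m

a, b, d = box((2, 6), (2, 6)), box((3, 7), (2, 6)), box((4, 8), (2, 6))
c = box((8, 10), (8, 10))                       # outlier candidate
best_fm = select_by_mask([c, b, a], object_mask=a)
best_fa = select_by_agreement([a, b, d, c])
```

The mask-based variant needs an object mask for the input image, whereas the agreement-based variant needs only the candidates themselves, which matches the distinction between PC²-FM and PC²-FA drawn earlier.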

Conclusions

  • Proposed PC², a novel diffusion-based method for single-view 3D shape reconstruction
  • Iteratively reconstructs a shape by projecting image features onto a partially-denoised point cloud
  • Outperforms prior methods on synthetic benchmarks
  • Reconstructs objects with high levels of detail from challenging real-world images
  • Uses Point-Voxel model to process point cloud
  • Projects points onto image using rasterization
  • Includes mask distance function and projection method
  • Evaluated on ShapeNet-R2N2 dataset
  • Performance similar to prior work without filtering, outperforms with filtering
  • Failure cases on ambiguous categories
  • Qualitative examples of reconstructions on Co3D and ShapeNet