Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Introduce Zero-1-to-3, a framework for changing camera viewpoint of an object from a single RGB image
  • Capitalize on geometric priors learned from large-scale diffusion models
  • Use synthetic dataset to learn controls of relative camera viewpoint
  • Model has strong zero-shot generalization ability to out-of-distribution datasets and in-the-wild images
  • Can be used for 3D reconstruction from a single image
  • Outperforms state-of-the-art single-view 3D reconstruction and novel view synthesis models

Paper Content

Introduction

  • Humans can imagine 3D shape and appearance from a single camera view
  • This ability is important for everyday tasks and visual creativity
  • Humans rely on prior knowledge accumulated through a lifetime of visual exploration
  • Existing approaches for 3D image reconstruction rely on expensive 3D annotations or category-specific priors
  • Recent methods have made strides in open-world 3D reconstruction
  • Paper demonstrates that large diffusion models have learned rich 3D priors from 2D images
  • Paper presents experiments to evaluate zero-shot view synthesis and 3D reconstruction from a single image
  • Recent advancements in generative image architectures have made it possible to synthesize high-fidelity diverse scenes and objects.
  • Diffusion models have been used to learn scalable image generators.
  • Neural Radiance Fields (NeRFs) have emerged as a powerful representation for single-scene reconstruction.
  • DreamFields has shown that NeRF can be used as the main component in a 3D generative system.
  • Reconstructing 3D objects from a single view is a challenging problem that requires strong priors.

Method

  • Goal is to synthesize an image of an object from a different camera viewpoint
  • Model f synthesizes a new image under a relative camera rotation R and translation T
  • Estimate x̂_{R,T} to be perceptually similar to the true but unobserved novel view x_{R,T}
  • Novel view synthesis from monocular RGB image is severely under-constrained
  • Approach capitalizes on large diffusion models to perform this task
  • Diffusion models do not explicitly encode the correspondences between viewpoints
  • Generative models inherit viewpoint biases reflected on the Internet
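
The formulation in this section can be written compactly. The block below is a sketch in the notation the bullets imply, where x is the input image and R, T are assumed to denote the relative camera rotation and translation of the target viewpoint:

```latex
\hat{x}_{R,T} = f(x, R, T), \qquad R \in \mathbb{R}^{3 \times 3},\; T \in \mathbb{R}^{3}
```

The goal is for the estimate \hat{x}_{R,T} to be perceptually close to the unobserved true view x_{R,T}, not necessarily pixel-identical to it.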

Learning to control camera viewpoint

  • Diffusion models have been trained on internet-scale data
  • Diffusion models cannot control the camera extrinsics with which a photo is captured
  • An approach is proposed to fine-tune a pre-trained diffusion model to learn controls over the camera parameters
  • The model is fine-tuned to learn a generic mechanism for controlling the camera viewpoints
  • The model can generate photorealistic images with control of viewpoints
  • The model can synthesize new views for object classes that lack 3D assets
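
The fine-tuning described above can be sketched as a standard latent diffusion denoising objective with the viewpoint added to the conditioning. The symbols here are assumptions that this summary does not define: \mathcal{E} is the latent encoder, \epsilon_\theta the denoising U-Net, t the diffusion timestep, z_t the noised latent of the target view, and c(x, R, T) an embedding of the input view together with the relative camera pose.

```latex
\min_{\theta} \; \mathbb{E}_{z \sim \mathcal{E}(x),\, t,\, \epsilon \sim \mathcal{N}(0, 1)}
\left\| \epsilon - \epsilon_{\theta}\big(z_t,\, t,\, c(x, R, T)\big) \right\|_2^2
```

Once \epsilon_\theta is trained this way, sampling starts from Gaussian noise and denoises iteratively while conditioned on c(x, R, T).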

View-conditioned diffusion

  • 3D reconstruction from a single image requires both low-level and high-level understanding
  • Hybrid conditioning mechanism is adopted to achieve this
  • Cross-attention is used to condition the denoising U-Net
  • Input image is channel-concatenated with the image being denoised to keep identity and details

3D reconstruction

  • Synthesizing novel views of an object is not enough; a full 3D reconstruction is desired.
  • A recently open-sourced framework, Score Jacobian Chaining (SJC), is used to optimize a 3D representation with priors from text-to-image diffusion models.
  • A technique used in SJC is to set the classifier-free guidance value higher than usual to improve the fidelity of the reconstruction.
  • An MSE loss, depth smoothness loss, and near-view consistency loss are used to regularize the NeRF representation.
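
The guidance value mentioned above refers to classifier-free guidance, which extrapolates from an unconditional noise prediction toward the conditional one. A minimal sketch of the standard combination rule follows; the function name and the flat-list representation of the noise predictions are illustrative assumptions, not part of SJC's API.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    # Standard rule: eps = eps_uncond + scale * (eps_cond - eps_uncond).
    # scale = 1 reproduces the conditional prediction; larger values push
    # the sample further toward the conditioning signal.
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

plain = classifier_free_guidance([0.5, 0.5], [1.5, 1.5], scale=1.0)
strong = classifier_free_guidance([0.0, 0.0], [1.0, 2.0], scale=3.0)
print(plain, strong)
```

A higher scale trades sample diversity for agreement with the conditioning, which is the effect the bullet above relies on.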

Dataset

  • Objaverse is a large-scale open-source dataset containing 800K+ 3D models created by 100K+ artists.
  • It has no explicit class labels like ShapeNet, but has a large diversity of high-quality 3D models with rich geometry and fine-grained details and material properties.
  • For each object in the dataset, 12 camera extrinsic matrices are randomly sampled and 12 views are rendered with a raytracing engine.
  • At training time, two views are sampled for each object to form an image pair and the corresponding relative viewpoint transformation is derived from the two extrinsic matrices.

Experiments

  • Assessed model performance on zero-shot novel view synthesis and 3D reconstruction
  • Compared model to state-of-the-art on synthetic objects and scenes with different levels of complexity
  • Reported qualitative results using diverse in-the-wild images

Tasks

  • Novel view synthesis is a 3D problem in computer vision that requires a model to learn the depth, texture, and shape of an object from a single view.
  • Our approach for view-conditional image generation inverts the order of 3D reconstruction and novel view synthesis, while still retaining the identity of the object depicted in the input image.

Baselines

  • Comparing to methods that use single-view RGB images as input
  • Comparing to DietNeRF, Image Variations, and SJC-I
  • Comparing to Multiview Compressive Coding and Point-E
  • Using MiDaS for depth estimation

Benchmarks and metrics

  • Evaluated tasks on Google Scanned Objects and RTMV datasets
  • Ground truth 3D models used for 3D reconstruction
  • Numerically evaluated with four metrics for image similarity: PSNR, SSIM, LPIPS, and FID
  • Measured Chamfer Distance and volumetric IoU for 3D reconstruction
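
The two 3D metrics above can be sketched in plain Python. Chamfer Distance has several conventions (squared or unsquared distances, sums or means); this sketch assumes the symmetric mean of unsquared nearest-neighbor distances, which may differ from the exact variant used in the paper. Volumetric IoU is computed here over sets of occupied voxel indices.

```python
import math

def chamfer_distance(points_a, points_b):
    # Symmetric Chamfer Distance: mean nearest-neighbor distance from A to B
    # plus the same from B to A. Brute force, O(len(A) * len(B)).
    def mean_nearest(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)

def volumetric_iou(voxels_a, voxels_b):
    # Intersection over union of two sets of occupied voxel indices.
    a, b = set(voxels_a), set(voxels_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

cd = chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
iou = volumetric_iou([(0, 0, 0), (1, 0, 0)], [(1, 0, 0), (2, 0, 0)])
print(cd, iou)
```

Lower Chamfer Distance and higher IoU both indicate a reconstruction closer to the ground truth shape.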

Novel view synthesis results

  • Our method is able to generate photorealistic images that are consistent with the ground truth.
  • Point-E achieves better results than other baselines and has good zero-shot generalizability.
  • Our model is able to synthesize high-fidelity viewpoints while maintaining object type, identity and low-level details.
  • Diffusion models are a good choice of architecture compared to NeRF for capturing underlying uncertainty.

3D reconstruction results

  • Tables 3 and 4 show numerical results for 3D reconstruction on GSO and RTMV, respectively.
  • Our method reconstructs high-fidelity 3D meshes that are consistent with the ground truth.
  • MCC and SJC-I often fail to correctly infer the geometry at the back of the object.
  • Point-E is able to predict a reasonable estimate of object geometry, but generates non-uniform sparse point clouds.
  • Our method leverages multi-view priors and NeRF-style representation, resulting in improvements in terms of CD and volumetric IoU.

Text to image to 3D

  • Tested method on images generated by txt2img models
  • Generated novel views of images while preserving object identity

Discussion

  • Proposed a novel approach for zero-shot, single-image novel-view synthesis and 3D reconstruction
  • Leverages the Stable Diffusion model, pre-trained on internet-scale data, to capture rich semantic and geometric priors
  • Fine-tuned the model on synthetic data to learn control over the camera viewpoint
  • Demonstrated state-of-the-art results on several benchmarks

Future work

  • Approach is trained on single objects on plain background
  • Generalization to scenes with several objects is demonstrated
  • Generalization to scenes with complex backgrounds is an important challenge
  • Reasoning about geometry of dynamic scenes from single view would open novel research directions
  • Combining graphics pipelines with Stable Diffusion
  • Extract 3D knowledge of objects from Stable Diffusion
  • Smoothness loss applied to the depth map to remove holes in the object representation
  • Near-view consistency loss to regularize the difference between images rendered from nearby views
  • Mesh extraction from the VoxelRF representation
  • Evaluation of 3D shape using chamfer distance and volumetric IoU
  • Baselines used for comparison
  • Viewpoint bias in text-to-image models
  • Zero-1-to-3 is a viewpoint-conditioned image translation model
  • 3D reconstruction with Zero-1-to-3
  • Novel view synthesis on Google Scanned Objects and RTMV
  • Diversity of novel view synthesis
  • Qualitative examples of 3D reconstruction
  • Novel view synthesis from DALL-E 2 generated images
  • Results for novel view synthesis on Google Scanned Objects and RTMV
  • Results for single view 3D reconstruction on RTMV