Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Introduce Zero-1-to-3, a framework for changing camera viewpoint of an object from a single RGB image
Capitalize on geometric priors learned from large-scale diffusion models
Use synthetic dataset to learn controls of relative camera viewpoint
Model has strong zero-shot generalization ability to out-of-distribution datasets and in-the-wild images
Can be used for 3D reconstruction from a single image
Outperforms state-of-the-art single-view 3D reconstruction and novel view synthesis models

Paper Content

Introduction

Humans can imagine 3D shape and appearance from a single camera view
This ability is important for everyday tasks and visual creativity
Humans rely on prior knowledge accumulated through a lifetime of visual exploration
Existing approaches for 3D image reconstruction rely on expensive 3D annotations or category-specific priors
Recent methods have made strides in open-world 3D reconstruction
Paper demonstrates that large diffusion models have learned rich 3D priors from 2D images
Paper presents experiments to evaluate zero-shot view synthesis and 3D reconstruction from a single image

Related work

Recent advancements in generative image architectures have made it possible to synthesize high-fidelity diverse scenes and objects.
Diffusion models have been used to learn scalable image generators.
Neural Radiance Fields (NeRFs) have emerged as a powerful representation for single-scene reconstruction.
DreamFields has shown that NeRF can be used as the main component in a 3D generative system.
Reconstructing 3D objects from a single view is a challenging problem that requires strong priors.

Method

Goal is to synthesize an image of an object from a different camera viewpoint
Model f synthesizes a new image under a camera transformation with relative rotation R and translation T: $\hat{x}_{R,T} = f(x, R, T)$
Estimate $\hat{x}_{R,T}$ to be perceptually similar to the true but unobserved novel view $x_{R,T}$
Novel view synthesis from monocular RGB image is severely under-constrained
Approach capitalizes on large diffusion models to perform this task
Diffusion models do not explicitly encode the correspondences between viewpoints
Generative models inherit viewpoint biases reflected on the Internet

Learning to control camera viewpoint

Diffusion models have been trained on internet-scale data
Diffusion models cannot control the camera extrinsics with which a photo is captured
An approach is proposed to fine-tune a pre-trained diffusion model to learn controls over the camera parameters
The model is fine-tuned to learn a generic mechanism for controlling the camera viewpoints
The model can generate photorealistic images with control of viewpoints
The model can synthesize new views for object classes that lack 3D assets
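
In the notation commonly used for latent diffusion models, the fine-tuning described above can be written as a noise-prediction objective. The symbols are assumptions not defined in these notes: $\mathcal{E}$ is the image encoder, $z_t$ is the noised latent of the target view at timestep $t$, $\epsilon_\theta$ is the denoising U-Net, and $c(x, R, T)$ is an embedding of the input view $x$ together with the relative camera transformation $(R, T)$.

```latex
\min_{\theta} \; \mathbb{E}_{z \sim \mathcal{E}(x),\, t,\, \epsilon \sim \mathcal{N}(0, 1)}
\left\lVert \epsilon - \epsilon_{\theta}\big(z_t,\, t,\, c(x, R, T)\big) \right\rVert_2^2
```

Read this way, only the conditioning changes relative to ordinary diffusion training: the network still predicts noise, but it is told which viewpoint change to apply.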

View-conditioned diffusion

3D reconstruction from a single image requires both low-level and high-level understanding
Hybrid conditioning mechanism is adopted to achieve this
Cross-attention is used to condition the denoising U-Net
Input image is channel-concatenated with the image being denoised to keep identity and details
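
The two conditioning streams above can be sketched in plain Python. This is an illustrative sketch, not the authors' code: the function names, the nested-list `[C][H][W]` layout, and the four-number pose encoding (with the azimuth wrapped through sin/cos) are all assumptions made for the example.

```python
import math

def relative_pose_vector(d_polar, d_azimuth, d_radius):
    """Encode a relative camera change as a small conditioning vector.

    Hypothetical encoding: the azimuth goes through sin/cos so that
    angles differing by 2*pi map to the same code.
    """
    return [d_polar, math.sin(d_azimuth), math.cos(d_azimuth), d_radius]

def posed_embedding(image_embedding, pose_vector):
    """High-level stream: image embedding concatenated with the pose code.

    The result would be fed to the U-Net through cross-attention.
    """
    return list(image_embedding) + list(pose_vector)

def channel_concat(noisy_latent, input_latent):
    """Low-level stream: stack the input view's channels onto the latent
    being denoised. Both arguments are nested lists shaped [C][H][W]."""
    same_height = len(noisy_latent[0]) == len(input_latent[0])
    same_width = len(noisy_latent[0][0]) == len(input_latent[0][0])
    if not (same_height and same_width):
        raise ValueError("spatial sizes must match for channel concatenation")
    return noisy_latent + input_latent
```

The split mirrors the bullets: the embedding carries semantic content plus the viewpoint change, while the concatenated channels give the network direct access to the input's pixels-level detail.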

3D reconstruction

Synthesizing novel views of an object is not enough; a full 3D reconstruction is desired.
A recently open-sourced framework, Score Jacobian Chaining (SJC), is used to optimize a 3D representation with priors from text-to-image diffusion models.
A technique used in SJC is to set the classifier-free guidance value higher than usual to improve the fidelity of the reconstruction.
The input view is fit with an MSE loss, while a depth smoothness loss and a near-view consistency loss regularize the NeRF representation.
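
The guidance-scale point above refers to the standard classifier-free guidance combination, which can be sketched as follows. The function name and the list-of-floats representation are illustrative assumptions and not SJC's API; real implementations operate on tensors.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions.

    scale = 0 ignores the condition, scale = 1 returns the conditional
    prediction unchanged, and scale > 1 extrapolates past it, which
    trades sample diversity for fidelity to the condition.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

For example, `classifier_free_guidance([0.0], [1.0], 3.0)` pushes the prediction three times as far from the unconditional estimate as the conditional one alone.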

Dataset

Objaverse is a large-scale open-source dataset containing 800K+ 3D models created by 100K+ artists.
It has no explicit class labels like ShapeNet, but has a large diversity of high-quality 3D models with rich geometry and fine-grained details and material properties.
For each object in the dataset, 12 camera extrinsic matrices are randomly sampled and 12 views are rendered with a raytracing engine.
At training time, two views are sampled for each object to form an image pair and the corresponding relative viewpoint transformation is derived from the two extrinsic matrices.
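
Deriving the relative viewpoint transformation from two extrinsic matrices can be sketched as below. This assumes each extrinsic is a 4x4 rigid world-to-camera matrix stored as nested lists; that convention, and all function names, are assumptions for illustration rather than details from the paper.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def invert_rigid(m):
    """Invert a 4x4 rigid transform [R | t] as [R^T | -R^T t]."""
    r_t = [[m[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    t = [m[i][3] for i in range(3)]
    new_t = [-sum(r_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def relative_transform(extrinsic_src, extrinsic_tgt):
    """Transform taking source-camera coordinates to target-camera
    coordinates, assuming both inputs are world-to-camera matrices."""
    return matmul(extrinsic_tgt, invert_rigid(extrinsic_src))

def rot_z(angle, translation):
    """Helper for examples: rotation about the z-axis plus a translation."""
    c, s = math.cos(angle), math.sin(angle)
    tx, ty, tz = translation
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]
```

A quick sanity check is that a view paired with itself yields the identity, and that pairing an identity source camera with any target returns the target's own extrinsic.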

Experiments

Assessed model performance on zero-shot novel view synthesis and 3D reconstruction
Compared model to state-of-the-art on synthetic objects and scenes with different levels of complexity
Reported qualitative results using diverse in-the-wild images

Tasks

Novel view synthesis is a 3D problem in computer vision that requires a model to learn the depth, texture, and shape of an object from a single view.
Our approach for view-conditional image generation inverts the order of 3D reconstruction and novel view synthesis, while still retaining the identity of the object depicted in the input image.

Baselines

Comparing to methods that use single-view RGB images as input
Comparing to DietNeRF, Image Variations, and SJC-I
Comparing to Multiview Compressive Coding and Point-E
Using MiDaS for depth estimation

Benchmarks and metrics

Evaluated tasks on Google Scanned Objects and RTMV datasets
Ground truth 3D models used for 3D reconstruction
Numerically evaluated novel view synthesis with four image-similarity metrics: PSNR, SSIM, LPIPS, and FID
Measured Chamfer Distance and volumetric IoU for 3D reconstruction
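
The two 3D metrics can be sketched in plain Python. Definitions of Chamfer distance vary (squared or unsquared distances, sums or means), so this is one common variant and not necessarily the one used in the paper; the function names and the voxel-index representation are assumptions.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets.

    Uses the mean unsquared nearest-neighbour distance in each direction
    and adds the two directions together. Brute force, O(len(a) * len(b)).
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)

def volumetric_iou(voxels_a, voxels_b):
    """Intersection over union of two sets of occupied voxel indices."""
    a, b = set(voxels_a), set(voxels_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

Lower Chamfer distance and higher IoU both indicate a reconstruction closer to the ground truth; the first is sensitive to surface placement, the second to overall occupied volume.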

Novel view synthesis results

Our method is able to generate photorealistic images that are consistent with the ground truth.
Point-E achieves better results than other baselines and has good zero-shot generalizability.
Our model is able to synthesize high-fidelity viewpoints while maintaining object type, identity and low-level details.
Diffusion models are a good choice of architecture compared to NeRF for capturing underlying uncertainty.

3D reconstruction results

Tables 3 and 4 show numerical results for 3D reconstruction on GSO and RTMV.
Our method reconstructs high-fidelity 3D meshes that are consistent with the ground truth.
MCC and SJC-I often fail to correctly infer the geometry at the back of the object.
Point-E is able to predict a reasonable estimate of object geometry, but generates non-uniform sparse point clouds.
Our method leverages multi-view priors and NeRF-style representation, resulting in improvements in terms of CD and volumetric IoU.

Text to image to 3D

Tested method on images generated by txt2img models
Generated novel views of images while preserving object identity

Discussion

Proposed a novel approach for zero-shot, single-image novel-view synthesis and 3D reconstruction
Leverages the Stable Diffusion model, pre-trained on internet-scale data, to capture rich semantic and geometric priors
Fine-tuned the model on synthetic data to learn control over the camera viewpoint
Demonstrated state-of-the-art results on several benchmarks

Future work

Approach is trained on single objects on plain background
Generalization to scenes with several objects is demonstrated
Generalization to scenes with complex backgrounds is an important challenge
Reasoning about geometry of dynamic scenes from single view would open novel research directions
Combining graphics pipelines with Stable Diffusion
Extract 3D knowledge of objects from Stable Diffusion
Smoothness loss to depth map to remove holes in object representation
Near-view consistency loss to regularize difference between images from different views
Mesh extraction from VoxelRF representation
Evaluation of 3D shape using chamfer distance and volumetric IoU
Baselines used for comparison
Viewpoint bias in text-to-image models
Zero-1-to-3 is a viewpoint-conditioned image translation model
3D reconstruction with Zero-1-to-3
Novel view synthesis on Google Scanned Objects and RTMV
Diversity of novel view synthesis
Qualitative examples of 3D reconstruction
Novel view synthesis from Dall-E-2 generated images
Results for novel view synthesis on Google Scanned Objects and RTMV
Results for single view 3D reconstruction on RTMV
