Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Introduce Zero-1-to-3, a framework for changing camera viewpoint of an object from a single RGB image
- Capitalize on geometric priors learned from large-scale diffusion models
- Use synthetic dataset to learn controls of relative camera viewpoint
- Model has strong zero-shot generalization ability to out-of-distribution datasets and in-the-wild images
- Can be used for 3D reconstruction from a single image
- Outperforms state-of-the-art single-view 3D reconstruction and novel view synthesis models
Paper Content
Introduction
- Humans can imagine 3D shape and appearance from a single camera view
- This ability is important for everyday tasks and visual creativity
- Humans rely on prior knowledge accumulated through a lifetime of visual exploration
- Existing approaches for 3D image reconstruction rely on expensive 3D annotations or category-specific priors
- Recent methods have made strides in open-world 3D reconstruction
- Paper demonstrates that large diffusion models have learned rich 3D priors from 2D images
- Paper presents experiments to evaluate zero-shot view synthesis and 3D reconstruction from a single image
Related work
- Recent advancements in generative image architectures have made it possible to synthesize high-fidelity diverse scenes and objects.
- Diffusion models have been used to learn scalable image generators.
- Neural Radiance Fields (NeRFs) have emerged as a powerful representation for single-scene reconstruction.
- DreamFields has shown that NeRF can be used as the main component in a 3D generative system.
- Reconstructing 3D objects from a single view is a challenging problem that requires strong priors.
Method
- Goal is to synthesize an image of an object from a different camera viewpoint
- Model f synthesizes a new image under a camera transformation
- Estimate xR,T to be perceptually similar to the true but unobserved novel view x R,T
- Novel view synthesis from monocular RGB image is severely under-constrained
- Approach capitalizes on large diffusion models to perform this task
- Diffusion models do not explicitly encode the correspondences between viewpoints
- Generative models inherit viewpoint biases reflected on the Internet
Learning to control camera viewpoint
- Diffusion models have been trained on internetscale data
- Diffusion models cannot control the camera extrinsics with which a photo is captured
- An approach is proposed to fine-tune a pre-trained diffusion model to learn controls over the camera parameters
- The model is fine-tuned to learn a generic mechanism for controlling the camera viewpoints
- The model can generate photorealistic images with control of viewpoints
- The model can synthesize new views for object classes that lack 3D assets
View-conditioned diffusion
- 3D reconstruction from a single image requires both low-level and high-level understanding
- Hybrid conditioning mechanism is adopted to achieve this
- Cross-attention is used to condition the denoising U-Net
- Input image is channel-concatenated with the image being denoised to keep identity and details
3d reconstruction
- Synthesizing novel views of an object is not enough, a full 3D reconstruction is desired.
- A recently open-sourced framework, Score Jacobian Chaining (SJC), is used to optimize a 3D representation with priors from text-to-image diffusion models.
- A technique used in SJC is to set the classifier-free guidance value higher than usual to improve the fidelity of the reconstruction.
- An MSE loss, depth smoothness loss, and near-view consistency loss are used to regularize the NeRF representation.
Dataset
- Objaverse is a large-scale open-source dataset containing 800K+ 3D models created by 100K+ artists.
- It has no explicit class labels like ShapeNet, but has a large diversity of high-quality 3D models with rich geometry and fine-grained details and material properties.
- For each object in the dataset, 12 camera extrinsic matrices are randomly sampled and 12 views are rendered with a raytracing engine.
- At training time, two views are sampled for each object to form an image pair and the corresponding relative viewpoint transformation is derived from the two extrinsic matrices.
Experiments
- Assessed model performance on zero-shot novel view synthesis and 3D reconstruction
- Compared model to state-of-the-art on synthetic objects and scenes with different levels of complexity
- Reported qualitative results using diverse in-the-wild images
Tasks
- Novel view synthesis is a 3D problem in computer vision that requires a model to learn the depth, texture, and shape of an object from a single view.
- Our approach for view-conditional image generation inverts the order of 3D reconstruction and novel view synthesis, while still retaining the identity of the object depicted in the input image.
Baselines
- Comparing to methods that use single-view RGB images as input
- Comparing to DietNeRF, Image Variations, and SJC-I
- Comparing to Multiview Compressive Coding and Point-E
- Using MiDaS for depth estimation
Benchmarks and metrics
- Evaluated tasks on Google Scanned Objects and RTMV datasets
- Ground truth 3D models used for 3D reconstruction
- Numerically evaluated with four metrics for image similarity
- Measured Chamfer Distance and volumetric IoU for 3D reconstruction
Novel view synthesis results
- Our method is able to generate photorealistic images that are consistent with the ground truth.
- Point-E achieves better results than other baselines and has good zero-shot generalizability.
- Our model is able to synthesize high-fidelity viewpoints while maintaining object type, identity and low-level details.
- Diffusion models are a good choice of architecture compared to NeRF for capturing underlying uncertainty.
3d reconstruction results
- Tables 3 and 4 show numerical results for 3D reconstruction on GSO.
- Our method reconstructs high-fidelity 3D meshes that are consistent with the ground truth.
- MCC and SJC-I often fail to correctly infer the geometry at the back of the object.
- Point-E is able to predict a reasonable estimate of object geometry, but generates non-uniform sparse point clouds.
- Our method leverages multi-view priors and NeRF-style representation, resulting in improvements in terms of CD and volumetric IoU.
Text to image to 3d
- Tested method on images generated by txt2img models
- Generated novel views of images while preserving object identity
Discussion
- Proposed a novel approach for zero-shot, single-image novel-view synthesis and 3D reconstruction
- Leverages Stable Diffusion model, pre-trained on internet-scaled data, to capture rich semantic and geometric priors
- Fine-tuned the model on synthetic data to learn control over the camera viewpoint
- Demonstrated state-of-the-art results on several benchmarks
Future work
- Approach is trained on single objects on plain background
- Generalization to scenes with several objects is demonstrated
- Generalization to scenes with complex backgrounds is an important challenge
- Reasoning about geometry of dynamic scenes from single view would open novel research directions
- Combining graphics pipelines with Stable Diffusion
- Extract 3D knowledge of objects from Stable Diffusion
- Smoothness loss to depth map to remove holes in object representation
- Near-view consistency loss to regularize difference between images from different views
- Mesh extraction from Vox-elRF representation
- Evaluation of 3D shape using chamfer distance and volumetric IoU
- Baselines used for comparison
- Viewpoint bias in text-to-image models
- Zero-1-to-3 is a viewpoint-conditioned image translation model
- 3D reconstruction with Zero-1-to-3
- Novel view synthesis on Google Scanned Objects and RTMV
- Diversity of novel view synthesis
- Qualitative examples of 3D reconstruction
- Novel view synthesis from Dall-E-2 generated images
- Results for novel view synthesis on Google Scanned Objects and RTMV
- Results for single view 3D reconstruction on RTMV