Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Visual recognition aims to understand objects and scenes from a single image.
3D recognition is more challenging due to occlusions not depicted in the image.
This work explores single-view 3D reconstruction by learning generalizable representations.
A framework is introduced that operates on 3D points of single objects or whole scenes.
The model, Multiview Compressive Coding (MCC), learns to compress the input appearance and geometry.
MCC is efficient and can learn from large-scale and diverse data sources.

Paper Content

Introduction

Images depict objects and scenes in diverse settings.
Popular 2D visual tasks aim to recognize them on the image plane.
3D reconstruction is a longstanding problem in AI with applications in robotics and AR/VR.
Structure from Motion lifts images to 3D by triangulation.
NeRF optimizes radiance fields to synthesize novel views.
Others predict 3D from a single image but rely on expensive CAD supervision.
Some introduce object-specific priors via category-specific 3D templates, pose or symmetries.
Large-scale learning is largely underexplored for 3D reconstruction.
Motivated by advances in domain-agnostic architectures and large-scale category-agnostic learning, a scalable, general-purpose model for 3D reconstruction from a single image is presented.
The model operates directly on 3D points and enables large-scale category-agnostic training.
The input to the model is a single RGB-D image, whose depth channel yields the visible 3D points via unprojection.
Supervision is sourced from multiple RGB-D views with relative camera poses.
The model is compared to state-of-the-art methods and shows superiority in both object and scene reconstruction.
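To make the unprojection step mentioned above concrete, here is a minimal sketch. It assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and depth measured along the optical axis; the function names and the camera model are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to an (H, W, 3) map of camera-space XYZ points.

    Assumption: pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy) in pixels; depth is measured along the optical (z) axis.
    """
    h, w = depth.shape
    # u indexes columns (image x), v indexes rows (image y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def visible_points(depth, rgb, fx, fy, cx, cy):
    """Return (N, 3) points and (N, 3) colors for pixels with valid depth."""
    xyz = unproject_depth(depth, fx, fy, cx, cy)
    valid = np.isfinite(depth) & (depth > 0)
    return xyz[valid], rgb[valid]
```

Pixels with missing or non-positive depth are dropped, so only the points actually seen by the sensor are passed on.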

Related work

Multiview 3D reconstruction is a longstanding problem in computer vision
Traditional techniques include binocular stereopsis, SfM, SLAM, reconstruction by analysis, and synthesis via volume rendering
Supervised approaches predict 3D geometry via CAD, meshes, voxels, or point clouds
Single-view 3D reconstruction is challenging
Weakly supervised approaches use category-specific priors or learn via 2D silhouettes and re-projection
Shape completion methods complete the 3D geometry of partial reconstructions
Implicit 3D representations such as SDFs and occupancy nets have proven effective
Self-supervised learning has advanced image and language understanding
MCC adopts an encoder-decoder architecture with an attention mechanism
MCC requires only points for supervision: the “true” points are extracted from posed RGB-D views
MCC’s input is a single RGB-D image
MCC’s decoder masks out the attention weights to break unwanted dependencies
MCC’s inference is efficient and the encoder cost is amortized
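One way to read the decoder masking described above is that each query may attend to the encoded tokens and to itself, but not to other queries, so every query is decoded independently. The sketch below builds such a mask. The token ordering, the rule that encoded tokens ignore queries, and the function names are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def decoder_attention_mask(num_encoded, num_queries):
    """Boolean mask of shape (T, T), with T = num_encoded + num_queries.

    mask[i, j] is True when token i may attend to token j.
    Assumed token order: [encoded tokens..., query tokens...].
    """
    total = num_encoded + num_queries
    mask = np.zeros((total, total), dtype=bool)
    # Every token may attend to all encoded tokens.
    mask[:, :num_encoded] = True
    # Each query additionally attends to itself, but to no other query.
    q = np.arange(num_encoded, total)
    mask[q, q] = True
    return mask

def masked_softmax(scores, mask):
    """Row-wise softmax over `scores`; disallowed entries get zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)
```

Because no query sees another query, the number of queries can change freely between training and inference without affecting any individual prediction.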

Query sampling

Training involves sampling 550 queries from 3D world space
A training query is labeled “occupied” if it lies within a radius of 0.1 of a ground-truth point, and “unoccupied” otherwise
The ground truth is the union of the unprojected points from all RGB-D views of the scene
Inference involves uniformly sampling a grid of points covering 3D space
Queries with occupancy score > 0.1 and their color predictions form final reconstruction
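The labeling rule and the inference-time selection above can be sketched as follows. The radius and threshold of 0.1 come from the text; the brute-force nearest-neighbor search, the cubic grid bounds, and all function names are assumptions chosen for brevity.

```python
import numpy as np

def label_queries(queries, gt_points, radius=0.1):
    """Mark each (Q, 3) query as occupied if any (P, 3) ground-truth point
    lies within `radius` of it. Brute force: O(Q * P) time and memory."""
    diff = queries[:, None, :] - gt_points[None, :, :]
    nearest = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)
    return nearest <= radius

def grid_queries(lo, hi, steps):
    """Uniform (steps**3, 3) grid of query points covering the cube [lo, hi]^3."""
    axis = np.linspace(lo, hi, steps)
    grid = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack(grid, axis=-1).reshape(-1, 3)

def select_reconstruction(queries, occupancy, colors, threshold=0.1):
    """Keep queries whose predicted occupancy score exceeds `threshold`."""
    keep = occupancy > threshold
    return queries[keep], colors[keep]
```

For large point sets, a spatial index such as a k-d tree would replace the brute-force distance matrix.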

Implementation details

XYZ Patch Embeddings use a self-attention-based design to transform 3D coordinates into a C-dimensional vector
RGB Patch Embeddings use a convolution-based design
The encoders use a 12-layer 768-dimensional “ViT-Base” architecture, and the decoder is an 8-layer 512-dimensional transformer

Object reconstruction experiments

MCC works for both objects and scenes
The CO3D-v2 dataset is used for single-object reconstruction
10 categories held out for evaluation, 41 used for training
Metrics used: accuracy, completeness, F-score
3D augmentations build in rotation equivariance
CO3D coordinate system used for training and testing points
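The three metrics listed above are commonly defined on point sets as follows: accuracy is the fraction of predicted points within a distance threshold of some ground-truth point, completeness is the fraction of ground-truth points within that threshold of some predicted point, and F-score is their harmonic mean. The sketch below follows that common definition; the default threshold of 0.1 and the function names are assumptions rather than values confirmed by this summary.

```python
import numpy as np

def nearest_distances(src, dst):
    """For each (N, 3) point in `src`, the distance to its nearest (M, 3) point in `dst`."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def reconstruction_metrics(pred, gt, threshold=0.1):
    """Return (accuracy, completeness, f_score) for two point sets."""
    # Accuracy: how many predicted points are close to the ground truth.
    accuracy = float((nearest_distances(pred, gt) <= threshold).mean())
    # Completeness: how much of the ground truth is covered by predictions.
    completeness = float((nearest_distances(gt, pred) <= threshold).mean())
    denom = accuracy + completeness
    f_score = 2 * accuracy * completeness / denom if denom > 0 else 0.0
    return accuracy, completeness, f_score
```

A prediction that covers only the visible surface can score high accuracy but low completeness, which is why the F-score is reported alongside both.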

Qualitative results on novel categories

MCC tackles heavy self-occlusions and complex shapes
MCC predicts texture for unseen regions
MCC is robust to noisy depth from COLMAP

Ablation study

Encoder Structure: Decoupled design performs slightly better than shared transformer
E^XYZ Design: Transformer and PointNet work slightly better than MLP
Training Query Sampling: Uniform and contrastive-style sampling work similarly
Feature Conditioning: Detailed conditioning works better than average-pooled vector or bilinearly interpolated vector
Decoder Design: Concat+attn transformer works better than loc+MLP or cross-attn
Comparison to Prior Work: MCC outperforms PoinTr by a large margin

Scaling behavior analysis

MCC only requires points for training and does not rely on any shape priors
Performance improves with larger training data and more categories
Building category-agnostic scalable models is a promising direction for general-purpose 3D reconstruction
Expanding datasets, especially categories, is promising

Zero-shot generalization in-the-wild

Generalization to novel categories from CO3D dataset
MCC reconstructions on ImageNet, iPhone captures, and AI-generated images
MCC learns general shape priors instead of memorizing training set

Comparison to image-conditioned NeRF

3D reconstruction is possible from one or few views
Variance in object types, image styles, depth systems, and visual context can be handled
NeRF-WCE and NerFormer are two recent best-performing methods on CO3D
MCC outperforms NeRF-WCE and NerFormer when using depth as input or supervision
MCC predicts more accurate shapes from just a single view

Scene reconstruction experiments

MCC can handle single objects and scenes without modifications.
The experiments test 3D scene reconstruction from a single RGB-D image.
The aim is to reconstruct everything in front of the camera up to a certain range.
MCC outperforms the state-of-the-art scene reconstruction approach.
Experiment on the Hypersim dataset with over 77k images.

Hypersim scene reconstruction

MCC is able to complete furniture, walls, floors, and ceilings from a single view.
MCC reconstructs the room geometry but fails to capture fine details in both shape and texture.
MCC outperforms DRDF across all metrics.

Zero-shot generalization to Taskonomy

MCC is trained on Hypersim and deployed on novel scenes from Taskonomy
MCC is able to reconstruct the room layout in this challenging setting

Failure cases

Sensitivity to depth input: with poor depth, MCC can fail to reconstruct accurate 3D geometry
Distribution shifts: inputs far from the training data lead to errors in texture and geometry
High-fidelity texture: MCC omits fine details

Conclusions

We present MCC, a general-purpose 3D reconstruction model for both objects and scenes
We show generalization to challenging settings, including in-the-wild captures and AI-generated images
A simple point-based method coupled with category-agnostic large-scale training is effective
We provide 360-view animations and interactive 3D visualizations
The transformer architecture is composed of 12 layers of a 768-dimensional self-attention operator with 12 heads
We hold out 10 randomly selected categories as our test set
We train MCC on Hypersim with Adam for 100k iterations
We normalize each scene to have zero-mean and unit-variance
At inference time, we predict points up to 6.0 units away from the camera origin
We randomly scale augment images by s ∈ [0.8, 1.2]
We train with Adam for 150k iterations
We perform 3D augmentations by randomly rotating 3D points along each axis by θ ∈ [−180°, 180°]
We test MCC on three challenging settings
We show reconstructions on held-out Hypersim scenes and novel scenes from Taskonomy
MCC predicts shape details and color
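The scene normalization and 3D rotation augmentation mentioned in the training details above can be sketched as follows. This is an illustration under stated assumptions: a single scalar standard deviation is used for the whole scene, the rotation order is x, then y, then z, and the function names are hypothetical. The summary does not specify these choices.

```python
import numpy as np

def normalize_points(points):
    """Shift (N, 3) points to zero mean and rescale to unit variance.

    Assumption: one scalar standard deviation over all coordinates,
    which preserves the aspect ratio of the scene. Requires points
    that are not all identical.
    """
    centered = points - points.mean(axis=0)
    return centered / centered.std()

def random_rotation(rng):
    """Rotation matrix with an independent angle in [-180°, 180°] per axis."""
    angles = rng.uniform(-np.pi, np.pi, size=3)
    cx, cy, cz = np.cos(angles)
    sx, sy, sz = np.sin(angles)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rot_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    # Assumed composition order: rotate about x, then y, then z.
    return rot_z @ rot_y @ rot_x

def rotate_points(points, rotation):
    """Apply a 3x3 rotation to (N, 3) row-vector points."""
    return points @ rotation.T
```

The same rotation would need to be applied to the input points and to the query and ground-truth points of a training example, so that the occupancy labels stay consistent.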
