Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Visual recognition aims to understand objects and scenes from a single image.
  • 3D recognition is more challenging due to occlusions not depicted in the image.
  • This work explores single-view 3D reconstruction by learning generalizable representations.
  • A framework is introduced that operates on 3D points of single objects or whole scenes.
  • The model, Multiview Compressive Coding (MCC), learns to compress the input appearance and geometry.
  • MCC is efficient and can learn from large-scale and diverse data sources.

Paper Content

Introduction

  • Images depict objects and scenes in diverse settings.
  • Popular 2D visual tasks aim to recognize them on the image plane.
  • 3D reconstruction is a longstanding problem in AI with applications in robotics and AR/VR.
  • Structure from Motion lifts images to 3D by triangulation.
  • NeRF optimizes radiance fields to synthesize novel views.
  • Others predict 3D from a single image but rely on expensive CAD supervision.
  • Some introduce object-specific priors via category-specific 3D templates, pose or symmetries.
  • Large-scale learning is largely underexplored for 3D reconstruction.
  • Motivated by advances in domain-agnostic architectures and large-scale category-agnostic learning, a scalable, general-purpose model for 3D reconstruction from a single image is presented.
  • The model operates directly on 3D points and enables large-scale category-agnostic training.
  • Input to the model is a single RGB-D image, which returns the visible 3D points via unprojection.
  • Supervision is sourced from multiple RGB-D views with relative camera poses.
  • The model is compared to state-of-the-art methods and shows superiority in both object and scene reconstruction.
  • Multiview 3D reconstruction is a longstanding problem in computer vision
  • Traditional techniques include binocular stereopsis, SfM, SLAM, reconstruction by analysis, and synthesis via volume rendering
  • Supervised approaches predict 3D geometry via CAD, meshes, voxels, or point clouds
  • Single-view 3D reconstruction is challenging
  • Weakly supervised approaches use category-specific priors or learn via 2D silhouettes and re-projection
  • Shape completion methods complete the 3D geometry of partial reconstructions
  • Implicit 3D representations such as SDFs and occupancy nets have proven effective
  • Self-supervised learning has advanced image and language understanding
  • MCC adopts an encoder-decoder architecture with an attention mechanism
  • MCC is supervised with “true” points derived from posed RGB-D views
  • MCC requires only points for supervision, extracted from posed RGB-D views
  • MCC’s input is a single RGB-D image
  • MCC’s decoder masks out the attention weights to break unwanted dependencies
  • MCC’s inference is efficient and the encoder cost is amortized

Query sampling

  • Training involves sampling 550 queries from 3D world space
  • Training queries are considered “occupied” if within 0.1 radius of ground truth point, “unoccupied” otherwise
  • Ground truth is union of all unprojected points from all RGB-D views of scene
  • Inference involves uniformly sampling a grid of points covering 3D space
  • Queries with occupancy score > 0.1 and their color predictions form final reconstruction

Implementation details

  • XYZ Patch Embeddings use a self-attention-based design to transform 3D coordinates into a C-dimensional vector
  • RGB Patch Embeddings use a convolution-based design
  • Architecture uses a 12-layer 768-dimensional “ViT-Base” architecture and an 8-layer 512-dimensional transformer

Object reconstruction experiments

  • MCC works for both objects and scenes
  • Used CO3D-v2 dataset for single object reconstruction
  • 10 categories held out for evaluation, 41 used for training
  • Metrics used: accuracy, completeness, F-score
  • 3D augmentations build in rotation equivariance
  • CO3D coordinate system used for training and testing points

Qualitative results on novel categories

  • MCC tackles heavy self-occlusions and complex shapes
  • MCC predicts texture for unseen regions
  • MCC is robust to noisy depth from COLMAP

Ablation study

  • Encoder Structure: Decoupled design performs slightly better than shared transformer
  • E XYZ Design: Transformer and PointNet work slightly better than MLP
  • Training Query Sampling: Uniform and contrastive-style sampling work similarly
  • Feature Conditioning: Detailed conditioning works better than average-pooled vector or bilinearly interpolated vector
  • Decoder Design: Concat+attn transformer works better than loc+MLP or cross-attn
  • Comparison to Prior Work: MCC outperforms PoinTr by a large margin

Scaling behavior analysis

  • MCC only requires points for training and does not rely on any shape priors
  • Performance improves with larger training data and more categories
  • Building category-agnostic scaleable models is a promising direction for general-purpose 3D reconstruction
  • Expanding datasets, especially categories, is promising

Zero-shot generalization in-the-wild

  • Generalization to novel categories from CO3D dataset
  • MCC reconstructions on ImageNet, iPhone captures, and AI-generated images
  • MCC learns general shape priors instead of memorizing training set

Comparison to image-conditioned nerf

  • 3D reconstruction is possible from one or few views
  • Variance in object types, image styles, depth systems, and visual context can be handled
  • NeRF-WCE and Ner-Former are two recent best performing methods on CO3D
  • MCC outperforms NeRF-WCE and Ner-Former when using depth as input or supervision
  • MCC predicts more accurate shapes from just a single view

Scene reconstruction experiments

  • MCC can handle single objects and scenes without modifications.
  • Test 3D scene reconstruction from a single RGB-D image.
  • Aim to reconstruct everything in front of the camera up to a certain range.
  • MCC outperforms the state-of-the-art scene reconstruction approach.
  • Experiment on the Hypersim dataset with over 77k images.

Hypersim scene reconstruction

  • MCC is able to complete furniture, walls, floors, and ceilings from a single view.
  • MCC reconstructs the room geometry but fails to capture fine details in both shape and texture.
  • MCC outperforms DRDF across all metrics.

Zero-shot generalization to taskonomy

  • Model MCC is trained on Hypersim and deployed on novel scenes from Taskonomy
  • Model MCC is able to reconstruct room layout in challenging setting

Failure cases

  • Sensitivity to depth input, can fail to reconstruct accurate 3D geometry
  • Distribution shifts, errors in texture and geometry
  • High-fidelity texture, omits details

Conclusions

  • We present MCC, a general-purpose 3D reconstruction model for both objects and scenes
  • We show generalization to challenging settings, including in-the-wild captures and AI-generated images
  • A simple point-based method coupled with category-agnostic large-scale training is effective
  • We provide 360-view animations and interactive 3D visualizations
  • The transformer architecture is composed of 12 layers of a 768-dimensional self-attention operator with 12 heads
  • We hold out 10 randomly selected categories as our test set
  • We train MCC on Hypersim with Adam for 100k iterations
  • We normalize each scene to have zero-mean and unit-variance
  • At inference time, we predict points up to 6.0 units away from the camera origin
  • We randomly scale augment images by s ∈ [0.8, 1.2]
  • We train with Adam for 150k iterations
  • We perform 3D augmentations by randomly rotating 3D points along each axis by θ ∈ [−180 o , 180 o ]
  • We test MCC on three challenging settings
  • We show reconstructions on held-out Hypersim scenes and novel scenes from Taskonomy
  • MCC predicts shape details and color