Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Deep learning algorithms are data-driven
  • ImageNet has driven a trend of “learning from large-scale data” in computer vision
  • Pretraining on ImageNet has been used to benefit 2D visual tasks
  • There is no generic dataset for 3D vision like ImageNet
  • MVImgNet is a large-scale dataset of multi-view images
  • MVImgNet has been used to benefit 3D and 2D visual tasks
  • MVPNet is a 3D object point cloud dataset derived from MVImgNet

Paper Content

Introduction

  • Deep learning algorithms are data-driven
  • Training on large-scale datasets allows deep neural networks to extract rich representations
  • ImageNet is the pioneer of large-scale real-world image datasets
  • Pretraining on ImageNet boosts model performance
  • Various 3D datasets have been produced to facilitate 3D visual applications
  • Existing 3D datasets are either synthetic or not comparable to ImageNet
  • Goal of paper is to build primary dataset and explore its effect
  • Dataset is created from multi-view images
  • Dataset contains 6.5 million frames from 219,188 videos
  • Dataset is used to explore view-based 3D reconstruction, general image classification, and salient object detection
  • Derived 3D object point cloud dataset called MVPNet
  • Datasets will be made public
  • MNIST is a pioneering dataset of 70k monochrome images of handwritten digits
  • CIFAR10 and CIFAR100 contain 60k tiny color images of objects or animals in 10 and 100 classes
  • ImageNet is a large scale, high accuracy, large diversity, and hierarchical structure dataset
  • MSCOCO is a popular dataset with 328k images and rich annotations
  • Other datasets include PASCAL VOC, Visual Genome, Cityscapes, MPII, etc.
  • HMDB-51 and UCF-101 are video datasets
  • ActivityNet and Kinetics are larger scale and variation datasets
  • Other video datasets are for human pose estimation, object detection/segmentation/tracking, emotion recognition, and captioning
  • 3D datasets include indoor scenes, outdoor scenes, object point clouds, shape analysis, and real 3D objects
  • Multi-view image datasets include real objects, 3D models, textures, and synthetic datasets
  • CO3D is a similar dataset to MVImgNet but with smaller scale

Raw data preparation

  • Building a large-scale dataset is challenging
  • Rapid development of mobile devices makes multi-view images accessible
  • Different classes have different expected number of video captures
  • Requirements for video capture: length, no blur, 180/360 view, 15% proportion, one class, stereoscopic
  • 1000 normal collectors and 200 expert data cleaners to review submissions

Dataset summary

  • MVImgNet includes 238 object classes from 6.5 million frames of 219,188 videos
  • 65 classes overlap with ImageNet
  • Used to train Neural Radiance Fields (NeRF)
  • Fits the data demand of learning-based generalizable NeRF methods

3d reconstruction

  • IBRNet is used as the baseline for an empirical study
  • Pretraining on MVImgNet improves the generalization ability of the model
  • Pretraining on CO3D and MVImgNet-small is also tested
  • Results show that MVImgNet has greater power than CO3D
  • More data leads to better generalization metrics

Multi-view stereo

  • Multi-view Stereo (MVS) is a 3D computer vision task to reconstruct 3D geometry from multi-view images
  • Deep learning methods have been introduced to MVS to solve issues such as weak textures and nonlaplacian spheres
  • Self-supervised MVS methods have emerged to reduce the need for massive amounts of RGB-D data
  • MVImgNet is capable of benefiting data-efficient MVS with limited training examples
  • Pretraining on MVImgNet improves model accuracy compared to training from scratch

View-consistent image classification

  • Humans can recognize objects from different views, but deep learning models cannot.
  • MVImgNet provides images from different viewpoints to help enhance the model’s view consistency.
  • A new training set, MVI-Mix, is created by mixing the original ImageNet data with MVImgNet.
  • Two other artificial training sets, MVI-Gap and MVI-Aug, are created with different image views.
  • MVImgNet brings more benefits than CO3D for view-consistent image recognition.

View-consistent contrastive learning

  • CL is a self-supervised training technique
  • MVImgNet can be used as positive pairs for CL
  • Finetuning MoCo-v2 on MVImgNet can improve view consistency and accuracy
  • View-consistent image classification is described in the supplementary material

View-consistent sod

  • Salient Object Detection (SOD) aims to segment the most visually prominent objects in an image
  • U2Net failed to segment “hard” views
  • Leverage multi-view consistency to improve SOD with the help of optical flows
  • Finetune U2Net on MVImgNet to improve SOD performance for “hard” views

3d point cloud classification

  • Focused on point cloud classification
  • Utilize dataset for 3D understanding tasks
  • Pay attention to real-world setup
  • Used ScanObjectNN for comparison
  • Pretrained models on MVPNet
  • Evaluated with small and heavy rotation
  • Pretraining on MVPNet increased accuracy

Self-supervised point cloud pretraining

  • Self-supervised learning has been used for 3D object point cloud understanding.
  • Most of these methods are pretrained on synthetic datasets, making it difficult to obtain rich real-world representations.
  • MVPNet is a natural choice to solve this problem.
  • PointMAE pretrained on MVPNet outperforms state-of-the-art methods when finetuned on ScanObjectNN.
  • Pretraining on MVPNet is more powerful than CO3D for real-world point cloud classification.

Mvpnet benchmark challenge

  • 64,000 training and 16,000 testing samples are included in the MVPNet benchmark challenge for realworld point cloud classification.
  • MVPNet is more challenging than ScanObjectNN.
  • Accuracy drops significantly when training on ScanObjectNN and testing on MVPNet.

Conclusion

  • Introduced MVImgNet, a large-scale dataset of multi-view images
  • Collected by shooting videos of real-world objects
  • 3D-aware visual signals
  • Pilot experiments on various visual tasks
  • Bonus of MVImgNet: MVPNet
  • Experiments show MVPNet can benefit real-world 3D object classification
  • Human-centric classes
  • Category richness of MVImgNet is less than ImageNet
  • Data do not consider very complex backgrounds
  • Xianggang Yu contributes to the whole building pipeline
  • Mutian Xu advised the exploration of view-consistent image understanding
  • Yidan Zhang worked on data processing
  • Haolin Liu was responsible for point cloud labeling
  • Chongjie Ye conducted data labelling
  • Yushuang Wu worked on the data collection and annotation
  • Zizheng Yan suggested the experiments on view-consistent image understanding
  • Chenming Zhu conducted the experiment on view-consistent contrastive learning
  • Zhangyang Xiong worked on data collection and processing
  • Tianyou Liang worked on data cleaning
  • Guanying Chen, Shuguang Cui advised the project
  • Xiaoguang Han proposed and led the whole research