Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Deep learning algorithms are data-driven
- ImageNet has driven a trend of “learning from large-scale data” in computer vision
- Pretraining on ImageNet has been used to benefit 2D visual tasks
- There is no generic dataset for 3D vision like ImageNet
- MVImgNet is a large-scale dataset of multi-view images
- MVImgNet has been used to benefit 3D and 2D visual tasks
- MVPNet is a 3D object point cloud dataset derived from MVImgNet
Paper Content
Introduction
- Deep learning algorithms are data-driven
- Training on large-scale datasets allows deep neural networks to extract rich representations
- ImageNet is the pioneer of large-scale real-world image datasets
- Pretraining on ImageNet boosts model performance
- Various 3D datasets have been produced to facilitate 3D visual applications
- Existing 3D datasets are either synthetic or not comparable to ImageNet
- Goal of paper is to build primary dataset and explore its effect
- Dataset is created from multi-view images
- Dataset contains 6.5 million frames from 219,188 videos
- Dataset is used to explore view-based 3D reconstruction, general image classification, and salient object detection
- Derived 3D object point cloud dataset called MVPNet
- Datasets will be made public
Related work
- MNIST is a pioneering dataset of 70k monochrome images of handwritten digits
- CIFAR10 and CIFAR100 contain 60k tiny color images of objects or animals in 10 and 100 classes
- ImageNet is a large scale, high accuracy, large diversity, and hierarchical structure dataset
- MSCOCO is a popular dataset with 328k images and rich annotations
- Other datasets include PASCAL VOC, Visual Genome, Cityscapes, MPII, etc.
- HMDB-51 and UCF-101 are video datasets
- ActivityNet and Kinetics are larger scale and variation datasets
- Other video datasets are for human pose estimation, object detection/segmentation/tracking, emotion recognition, and captioning
- 3D datasets include indoor scenes, outdoor scenes, object point clouds, shape analysis, and real 3D objects
- Multi-view image datasets include real objects, 3D models, textures, and synthetic datasets
- CO3D is a similar dataset to MVImgNet but with smaller scale
Raw data preparation
- Building a large-scale dataset is challenging
- Rapid development of mobile devices makes multi-view images accessible
- Different classes have different expected number of video captures
- Requirements for video capture: length, no blur, 180/360 view, 15% proportion, one class, stereoscopic
- 1000 normal collectors and 200 expert data cleaners to review submissions
Dataset summary
- MVImgNet includes 238 object classes from 6.5 million frames of 219,188 videos
- 65 classes overlap with ImageNet
- Used to train Neural Radiance Fields (NeRF)
- Fits the data demand of learning-based generalizable NeRF methods
3d reconstruction
- IBRNet is used as the baseline for an empirical study
- Pretraining on MVImgNet improves the generalization ability of the model
- Pretraining on CO3D and MVImgNet-small is also tested
- Results show that MVImgNet has greater power than CO3D
- More data leads to better generalization metrics
Multi-view stereo
- Multi-view Stereo (MVS) is a 3D computer vision task to reconstruct 3D geometry from multi-view images
- Deep learning methods have been introduced to MVS to solve issues such as weak textures and nonlaplacian spheres
- Self-supervised MVS methods have emerged to reduce the need for massive amounts of RGB-D data
- MVImgNet is capable of benefiting data-efficient MVS with limited training examples
- Pretraining on MVImgNet improves model accuracy compared to training from scratch
View-consistent image classification
- Humans can recognize objects from different views, but deep learning models cannot.
- MVImgNet provides images from different viewpoints to help enhance the model’s view consistency.
- A new training set, MVI-Mix, is created by mixing the original ImageNet data with MVImgNet.
- Two other artificial training sets, MVI-Gap and MVI-Aug, are created with different image views.
- MVImgNet brings more benefits than CO3D for view-consistent image recognition.
View-consistent contrastive learning
- CL is a self-supervised training technique
- MVImgNet can be used as positive pairs for CL
- Finetuning MoCo-v2 on MVImgNet can improve view consistency and accuracy
- View-consistent image classification is described in the supplementary material
View-consistent sod
- Salient Object Detection (SOD) aims to segment the most visually prominent objects in an image
- U2Net failed to segment “hard” views
- Leverage multi-view consistency to improve SOD with the help of optical flows
- Finetune U2Net on MVImgNet to improve SOD performance for “hard” views
3d point cloud classification
- Focused on point cloud classification
- Utilize dataset for 3D understanding tasks
- Pay attention to real-world setup
- Used ScanObjectNN for comparison
- Pretrained models on MVPNet
- Evaluated with small and heavy rotation
- Pretraining on MVPNet increased accuracy
Self-supervised point cloud pretraining
- Self-supervised learning has been used for 3D object point cloud understanding.
- Most of these methods are pretrained on synthetic datasets, making it difficult to obtain rich real-world representations.
- MVPNet is a natural choice to solve this problem.
- PointMAE pretrained on MVPNet outperforms state-of-the-art methods when finetuned on ScanObjectNN.
- Pretraining on MVPNet is more powerful than CO3D for real-world point cloud classification.
Mvpnet benchmark challenge
- 64,000 training and 16,000 testing samples are included in the MVPNet benchmark challenge for realworld point cloud classification.
- MVPNet is more challenging than ScanObjectNN.
- Accuracy drops significantly when training on ScanObjectNN and testing on MVPNet.
Conclusion
- Introduced MVImgNet, a large-scale dataset of multi-view images
- Collected by shooting videos of real-world objects
- 3D-aware visual signals
- Pilot experiments on various visual tasks
- Bonus of MVImgNet: MVPNet
- Experiments show MVPNet can benefit real-world 3D object classification
- Human-centric classes
- Category richness of MVImgNet is less than ImageNet
- Data do not consider very complex backgrounds
- Xianggang Yu contributes to the whole building pipeline
- Mutian Xu advised the exploration of view-consistent image understanding
- Yidan Zhang worked on data processing
- Haolin Liu was responsible for point cloud labeling
- Chongjie Ye conducted data labelling
- Yushuang Wu worked on the data collection and annotation
- Zizheng Yan suggested the experiments on view-consistent image understanding
- Chenming Zhu conducted the experiment on view-consistent contrastive learning
- Zhangyang Xiong worked on data collection and processing
- Tianyou Liang worked on data cleaning
- Guanying Chen, Shuguang Cui advised the project
- Xiaoguang Han proposed and led the whole research