Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Deep learning algorithms are data-driven
ImageNet has driven a trend of “learning from large-scale data” in computer vision
Pretraining on ImageNet has been used to benefit 2D visual tasks
There is no generic dataset for 3D vision like ImageNet
MVImgNet is a large-scale dataset of multi-view images
MVImgNet has been used to benefit 3D and 2D visual tasks
MVPNet is a 3D object point cloud dataset derived from MVImgNet

Paper Content

Introduction

Deep learning algorithms are data-driven
Training on large-scale datasets allows deep neural networks to extract rich representations
ImageNet is the pioneer of large-scale real-world image datasets
Pretraining on ImageNet boosts model performance
Various 3D datasets have been produced to facilitate 3D visual applications
Existing 3D datasets are either synthetic or not comparable to ImageNet
Goal of paper is to build primary dataset and explore its effect
Dataset is created from multi-view images
Dataset contains 6.5 million frames from 219,188 videos
Dataset is used to explore view-based 3D reconstruction, general image classification, and salient object detection
Derived 3D object point cloud dataset called MVPNet
Datasets will be made public

MNIST is a pioneering dataset of 70k monochrome images of handwritten digits
CIFAR10 and CIFAR100 contain 60k tiny color images of objects or animals in 10 and 100 classes
ImageNet is a large scale, high accuracy, large diversity, and hierarchical structure dataset
MSCOCO is a popular dataset with 328k images and rich annotations
Other datasets include PASCAL VOC, Visual Genome, Cityscapes, MPII, etc.
HMDB-51 and UCF-101 are video datasets
ActivityNet and Kinetics are larger scale and variation datasets
Other video datasets are for human pose estimation, object detection/segmentation/tracking, emotion recognition, and captioning
3D datasets include indoor scenes, outdoor scenes, object point clouds, shape analysis, and real 3D objects
Multi-view image datasets include real objects, 3D models, textures, and synthetic datasets
CO3D is a similar dataset to MVImgNet but with smaller scale

Raw data preparation

Building a large-scale dataset is challenging
Rapid development of mobile devices makes multi-view images accessible
Different classes have different expected number of video captures
Requirements for video capture: length, no blur, 180/360 view, 15% proportion, one class, stereoscopic
1000 normal collectors and 200 expert data cleaners to review submissions

Dataset summary

MVImgNet includes 238 object classes from 6.5 million frames of 219,188 videos
65 classes overlap with ImageNet
Used to train Neural Radiance Fields (NeRF)
Fits the data demand of learning-based generalizable NeRF methods

3d reconstruction

IBRNet is used as the baseline for an empirical study
Pretraining on MVImgNet improves the generalization ability of the model
Pretraining on CO3D and MVImgNet-small is also tested
Results show that MVImgNet has greater power than CO3D
More data leads to better generalization metrics

Multi-view stereo

Multi-view Stereo (MVS) is a 3D computer vision task to reconstruct 3D geometry from multi-view images
Deep learning methods have been introduced to MVS to solve issues such as weak textures and nonlaplacian spheres
Self-supervised MVS methods have emerged to reduce the need for massive amounts of RGB-D data
MVImgNet is capable of benefiting data-efficient MVS with limited training examples
Pretraining on MVImgNet improves model accuracy compared to training from scratch

View-consistent image classification

Humans can recognize objects from different views, but deep learning models cannot.
MVImgNet provides images from different viewpoints to help enhance the model’s view consistency.
A new training set, MVI-Mix, is created by mixing the original ImageNet data with MVImgNet.
Two other artificial training sets, MVI-Gap and MVI-Aug, are created with different image views.
MVImgNet brings more benefits than CO3D for view-consistent image recognition.

View-consistent contrastive learning

CL is a self-supervised training technique
MVImgNet can be used as positive pairs for CL
Finetuning MoCo-v2 on MVImgNet can improve view consistency and accuracy
View-consistent image classification is described in the supplementary material

View-consistent sod

Salient Object Detection (SOD) aims to segment the most visually prominent objects in an image
U2Net failed to segment “hard” views
Leverage multi-view consistency to improve SOD with the help of optical flows
Finetune U2Net on MVImgNet to improve SOD performance for “hard” views

3d point cloud classification

Focused on point cloud classification
Utilize dataset for 3D understanding tasks
Pay attention to real-world setup
Used ScanObjectNN for comparison
Pretrained models on MVPNet
Evaluated with small and heavy rotation
Pretraining on MVPNet increased accuracy

Self-supervised point cloud pretraining

Self-supervised learning has been used for 3D object point cloud understanding.
Most of these methods are pretrained on synthetic datasets, making it difficult to obtain rich real-world representations.
MVPNet is a natural choice to solve this problem.
PointMAE pretrained on MVPNet outperforms state-of-the-art methods when finetuned on ScanObjectNN.
Pretraining on MVPNet is more powerful than CO3D for real-world point cloud classification.

Mvpnet benchmark challenge

64,000 training and 16,000 testing samples are included in the MVPNet benchmark challenge for realworld point cloud classification.
MVPNet is more challenging than ScanObjectNN.
Accuracy drops significantly when training on ScanObjectNN and testing on MVPNet.

Conclusion

Introduced MVImgNet, a large-scale dataset of multi-view images
Collected by shooting videos of real-world objects
3D-aware visual signals
Pilot experiments on various visual tasks
Bonus of MVImgNet: MVPNet
Experiments show MVPNet can benefit real-world 3D object classification
Human-centric classes
Category richness of MVImgNet is less than ImageNet
Data do not consider very complex backgrounds
Xianggang Yu contributes to the whole building pipeline
Mutian Xu advised the exploration of view-consistent image understanding
Yidan Zhang worked on data processing
Haolin Liu was responsible for point cloud labeling
Chongjie Ye conducted data labelling
Yushuang Wu worked on the data collection and annotation
Zizheng Yan suggested the experiments on view-consistent image understanding
Chenming Zhu conducted the experiment on view-consistent contrastive learning
Zhangyang Xiong worked on data collection and processing
Tianyou Liang worked on data cleaning
Guanying Chen, Shuguang Cui advised the project
Xiaoguang Han proposed and led the whole research

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Raw data preparation#

Dataset summary#

3d reconstruction#

Multi-view stereo#

View-consistent image classification#

View-consistent contrastive learning#

View-consistent sod#

3d point cloud classification#

Self-supervised point cloud pretraining#

Mvpnet benchmark challenge#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Raw data preparation

Dataset summary

3d reconstruction

Multi-view stereo

View-consistent image classification

View-consistent contrastive learning

View-consistent sod

3d point cloud classification

Self-supervised point cloud pretraining

Mvpnet benchmark challenge

Conclusion