Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Investigating if progress on ImageNet transfers to real-world datasets
Evaluating ImageNet pre-trained models with varying accuracy on six practical image classification datasets
Datasets collected with the goal of solving real-world tasks
Higher ImageNet accuracy does not consistently yield performance improvements
Data augmentation can improve performance even when architectures do not

Paper Content

Introduction

ImageNet is a widely used dataset in machine learning
AlexNet’s success in 2012 re-popularized neural networks
ImageNet is still a main benchmark for computer vision models
Machine learning community has invested effort into increasing performance on ImageNet
Question of whether the community is over-optimizing for this dataset
Early methodological innovations transferred more broadly to other tasks
Goal of paper is to investigate transfer of neural network architecture to real-world data
Consider classification tasks derived from image data collected with goal of classification in mind

Related work

Previous work has investigated the effect of architecture on the transferability of ImageNet-pretrained models to different datasets.
Kornblith et al. (2019) showed that ImageNet accuracy is strongly correlated with downstream accuracy on web-scraped object-centric computer vision tasks.
Studies have investigated the relationship between ImageNet and transfer accuracy for self-supervised networks, adversarially trained networks, and networks trained with different loss functions.
VTAB (Zhai et al., 2019) comprises a more diverse set of tasks, including natural and non-natural classification tasks as well as non-classification tasks.
Models have been extensively evaluated on real-world data in the medical imaging domain, with limited gains from newer models that perform better on ImageNet.
Tuggener et al. (2021) investigate performance of 500 CNN architectures on datasets, several of which are not web-scraped.
Other work has evaluated transferability of representations of networks trained on datasets beyond ImageNet.
Abnar et al. (2022) explore the relationship between upstream and downstream accuracy for models pretrained on JFT and ImageNet-21K.
Early work identified lack of diversity as a key shortcoming of the benchmarks of the time.
More recent studies have investigated the extent to which ImageNet classification accuracy correlates with accuracy on out-of-distribution data.
Miller et al. (2021) have shown that in-distribution accuracy improvements often directly yield out-of-distribution accuracy improvements.

Datasets

Criteria for dataset selection: diverse data sources, relevance to an application, availability of baseline models
Evaluation of model performance on target tasks

Selection criteria

Prior work has investigated transfer of ImageNet architectures to many downstream datasets
12 datasets used by Kornblith et al. (2019) often serve as a standard evaluation suite
Image classification problems not represented by these datasets
1,500 datasets listed on Kaggle website
Six datasets chosen based on criteria of diverse data sources, application relevance, and availability of baselines
Four popular image classification competitions on Kaggle
Caltech Camera Traps and EuroSAT datasets used to broaden applications studied

Datasets studied

Main experiments

19 model architectures tested, including CNNs and Vision Transformers
Extensive hyperparameter tuning done
Positive trend between ImageNet performance and CCT-20 performance
CLIP ViT L/14-336px model with additional augmentation achieved 83.4% accuracy on CCT-20

Caltech camera traps

APTOS 2019 blindness detection

Dataset created for Kaggle competition run by APTOS
Images taken using fundus photography
Labeled by clinicians on a scale of 0-4 for severity of diabetic retinopathy
Evaluation metric is quadratic weighted kappa (QWK)
Models after VGG do not show significant improvement
DeiT and EfficientNets perform slightly worse
Accuracy has similar trend as QWK
Top Kaggle submission achieves 0.936 QWK
The top submission used additional augmentation, external data, training with an L1 loss, replacing the pooling layer with generalized mean pooling, and ensembling models with different input sizes
Increasing color and affine augmentation can account for 0.03 QWK point improvement
Top leaderboard entry Inception-ResNet v2 trained with additional interventions achieves 0.927 QWK
New models trained with additional interventions do at least 0.03 QWK points better
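
Since quadratic weighted kappa is the evaluation metric for this task, a minimal sketch of how it can be computed may help. The function name and signature below are illustrative assumptions, not code from the paper or the competition. It assumes integer grades in the range 0 to num_classes - 1 and at least two distinct grades in the data, so the expected disagreement is non-zero.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Cohen's kappa with quadratic weights for ordinal integer labels."""
    n = len(y_true)
    # observed[i][j]: number of samples with true grade i and predicted grade j.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            # Quadratic penalty: disagreeing by two grades costs four times
            # as much as disagreeing by one.
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            # Count expected by chance if labels and predictions were independent.
            expected = true_hist[i] * pred_hist[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected
```

A score of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean agreement worse than chance. With the 0-4 severity scale described above, num_classes would be 5.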

Human protein atlas image classification

Human Protein Atlas runs a Kaggle competition to build a tool to identify and locate proteins from microscopy images
Macro F1 score is used to account for multiple proteins in images
73%/18%/9% train/validation/test-validation split used
Positive trend between task performance and ImageNet performance
Challenges include class imbalance, multi-label thresholding, and generalization
Competitors used external data, data cleaning, augmentation, ensembling, and oversampling
Modified architectures and metric learning used in first place solution
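
To make the macro F1 metric and the multi-label thresholding step concrete, here is a small sketch. The helper names, the 0.5 default threshold, and the convention of scoring an empty class as 0 are assumptions made for illustration; the source does not specify them.

```python
def binarize(probabilities, threshold=0.5):
    """Turn per-class probabilities into 0/1 predictions (threshold is an assumption)."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]


def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores over all classes.

    y_true and y_pred are lists of rows, one row per image and one 0/1 entry
    per class, so a single image may carry several positive labels.
    """
    num_classes = len(y_true[0])
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 0 and p[c] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 0)
        denominator = 2 * tp + fp + fn
        # Assumed convention: a class with no positives and no predictions scores 0.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / num_classes
```

Because every class contributes equally to the average, rare classes weigh as much as common ones, which is why class imbalance and the choice of threshold matter for this metric.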

SIIM-ISIC melanoma classification

SIIM & ISIC ran a Kaggle competition to identify melanoma
33,126 training images and 25,331 images from previous competitions were used
80% to 20% class-balanced and year-balanced train/validation split
Evaluation metric was area under ROC curve
Weak positive correlation (0.44) between ImageNet performance and task performance
Classification accuracy showed stronger trend for transfer
Top two Kaggle solutions used models with different input size, ensembling, cross-validation and training augmentation
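
The area under the ROC curve used as the metric here can be read as the probability that a randomly chosen positive image receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. The function name is an illustrative assumption, and the pairwise loop is quadratic in the number of samples, so it is meant for explanation rather than large-scale evaluation.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0   # positive ranked above negative
            elif pos_score == neg_score:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))
```

Unlike accuracy, this metric depends only on how the scores rank the images, not on a decision threshold, which is useful when one class (here, melanoma) is rare.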

Cassava leaf disease classification

Makerere Artificial Intelligence Lab focuses on applications that benefit the developing world
Goal of Cassava Leaf Disease Classification Kaggle competition is to give farmers access to methods for diagnosing plant diseases
Images were taken with an inexpensive camera and labeled by agricultural experts
Each image was classified as healthy or as one of four different diseases
80%/20% random class-balanced train/validation split of the provided training data
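Weak positive correlation (0.12) and a near-zero normalized slope (0.02) between ImageNet performance and task performance
EuroSAT images are available in 13 spectral bands
The experiments use RGB images and keep the experimental setup consistent across models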
All models over 60% ImageNet accuracy achieve over 98.5% EuroSAT accuracy
Majority of models achieve over 99.0% EuroSAT accuracy
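
The 80%/20% class-balanced train/validation split mentioned for the Cassava data can be sketched as a per-class shuffle and cut. This is one plausible way to build such a split under stated assumptions; the function name, the seed, and the rounding rule are not taken from the paper.

```python
import random
from collections import defaultdict


def class_balanced_split(labels, val_fraction=0.2, seed=0):
    """Return (train_indices, val_indices) with each class split at the same ratio."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)
    train_indices, val_indices = [], []
    for label in sorted(indices_by_class):
        indices = indices_by_class[label]
        rng.shuffle(indices)
        # Each class sends the same fraction of its samples to validation.
        n_val = round(len(indices) * val_fraction)
        val_indices.extend(indices[:n_val])
        train_indices.extend(indices[n_val:])
    return sorted(train_indices), sorted(val_indices)
```

Splitting within each class keeps the class proportions of the validation set close to those of the training set, which matters when some classes are much rarer than others.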

Additional studies

Augmentation ablations

Experiments were conducted to compare models and minimize confounding factors.
Different architectures and augmentation strategies may have different results.
Experiments were conducted to explore the effect of data augmentation on transfer.
Pre-training with RandAugment improved performance on downstream tasks, but pre-training with AugMix did not.
Fine-tuning with RandAugment usually yields additional performance gains.
Additional augmentation did not significantly increase performance on downstream tasks for DeiT models.

CLIP models

CLIP models from Radford et al. (2021) use diverse pre-training data and achieve high performance on downstream datasets
Results of fine-tuning CLIP models are visualized in Appendix I Figure 8
Using larger images increases performance on all datasets
Pre-training data helps for CCT-20, HPA, and Cassava

Discussion

ImageNet accuracy does not always correlate with transfer accuracy on real-world tasks
Number of classes and dataset size do not explain the differences from Kornblith et al. (2019)
Parameter count is not a good indicator of improved transfer performance on real-world datasets
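
The correlations quoted for the individual datasets relate each model's ImageNet accuracy to its score on the target task. As a rough illustration of how such a number is obtained, the sketch below computes a Pearson correlation coefficient. Which coefficient the paper actually reports is not stated in this summary, so the choice of Pearson here is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either sequence is constant.
    return covariance / math.sqrt(variance_x * variance_y)
```

Here xs would hold one ImageNet accuracy per model and ys the matching target-task scores. A value near 1 indicates that better ImageNet models are consistently better on the task, while a value near 0 indicates little linear relationship.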

Differences between web-scraped datasets and real-world images

It is possible to perform well on web-scraped target datasets by collecting a large amount of data from the Internet and training a large model on it.
Recent models trained on large web-scraped datasets have achieved high accuracy on web-scraped benchmarks.
There is a large amount of distribution shift between web-scraped datasets and real-world datasets.
Gains in ImageNet accuracy over the last decade have primarily come from improving and scaling architectures, and these gains generally transfer to other web-scraped datasets.
Improvements arising from architecture generally do not transfer to non-web-scraped tasks.
Data augmentation and other tweaks can provide further gains on these tasks.
Researchers should explicitly search for methods that improve accuracy on real-world non-web-scraped datasets.
There may be methods that improve accuracy on real-world tasks but not ImageNet.
Average accuracy across non-web-scraped datasets explains variance beyond that explained by ImageNet accuracy.
Most methodological innovations that help on ImageNet are useful for some real-world tasks.
Future benchmarks should include more diverse datasets to encourage a more comprehensive approach to improving learning algorithms.
19 model architectures are examined to observe the relationship between ImageNet performance and target dataset performance.
Hyperparameter tuning is used to get the best performance out of each model.
Models are pretrained on ImageNet and then fine-tuned on the downstream task.
Experiments are run on TPU v2-8s and NVIDIA A40s.
Models are trained using AdamW with a cosine scheduler.
Images are normalized to ImageNet’s mean and standard deviation.
Data augmentation techniques are used.
There is a strong linear trend between ImageNet accuracy and the target metrics.
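
Two of the training details above, the cosine learning-rate schedule and normalization to ImageNet statistics, can be sketched in a few lines. The constants are the commonly used ImageNet channel means and standard deviations for pixel values scaled to [0, 1]; that the paper uses exactly these values, and that the schedule has no warmup, are assumptions. The function names are illustrative.

```python
import math

# Commonly used ImageNet RGB statistics for pixel values in [0, 1] (assumed here).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Normalize one RGB pixel (values in [0, 1]) channel by channel."""
    return tuple((value - mean) / std
                 for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate at a given step under cosine decay from base_lr to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The schedule starts at base_lr, passes through the midpoint between base_lr and min_lr halfway through training, and ends at min_lr. In practice the returned value would be fed to an optimizer such as AdamW at each step.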
