Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Investigating if progress on ImageNet transfers to real-world datasets
- Evaluating ImageNet pre-trained models with varying accuracy on six practical image classification datasets
- Datasets collected with the goal of solving real-world tasks
- Higher ImageNet accuracy does not consistently yield performance improvements
- Data augmentation can improve performance even when architectures do not
Paper Content
Introduction
- ImageNet is a widely used dataset in machine learning
- AlexNet’s success in 2012 re-popularized neural networks
- ImageNet is still a main benchmark for computer vision models
- Machine learning community has invested effort into increasing performance on ImageNet
- Question of whether the community is over-optimizing for this dataset
- Early methodological innovations transferred more broadly to other tasks
- Goal of paper is to investigate transfer of neural network architecture to real-world data
- Consider classification tasks derived from image data collected with goal of classification in mind
Related work
- Previous work has investigated the effect of architecture on transferability of ImageNet-pretrained models to different datasets.
- Kornblith et al. (2019) showed that ImageNet accuracy is strongly correlated with downstream accuracy on web-scraped object-centric computer vision tasks.
- Studies have investigated the relationship between ImageNet and transfer accuracy for self-supervised networks, adversarially trained networks, and networks trained with different loss functions.
- VTAB (Zhai et al., 2019) comprises a more diverse set of tasks, including natural and non-natural classification tasks as well as non-classification tasks.
- Models have been extensively evaluated on real-world data in the medical imaging domain, with limited gains from newer models that perform better on ImageNet.
- Tuggener et al. (2021) investigate performance of 500 CNN architectures on datasets, several of which are not web-scraped.
- Other work has evaluated transferability of representations of networks trained on datasets beyond ImageNet.
- Abnar et al. (2022) explore the relationship between upstream and downstream accuracy for models pretrained on JFT and ImageNet-21K.
- Early work identified lack of diversity as a key shortcoming of the benchmarks of the time.
- More recent studies have investigated the extent to which ImageNet classification accuracy correlates with accuracy on out-of-distribution data.
- Miller et al. (2021) has shown that in-distribution accuracy improvements often directly yield out-of-distribution accuracy improvements.
Datasets
- Criteria for dataset selection: diverse data sources, relevance to an application, availability of baseline models
- Evaluation of model performance on target tasks
Selection criteria
- Prior work has investigated transfer of ImageNet architectures to many downstream datasets
- 12 datasets used by Kornblith et al. (2019) often serve as a standard evaluation suite
- Image classification problems not represented by these datasets
- 1,500 datasets listed on Kaggle website
- Six datasets chosen based on criteria of diverse data sources, application relevance, and availability of baselines
- Four popular image classification competitions on Kaggle
- Caltech Camera Traps and EuroSAT datasets used to broaden applications studied
Datasets studied
Main experiments
- 19 model architectures tested, including CNNs and Vision Transformers
- Extensive hyperparameter tuning done
- Positive trend between ImageNet performance and CCT-20 performance
- CLIP ViT L/14-336px model with additional augmentation achieved 83.4% accuracy on CCT-20
Caltech camera traps
Aptos 2019 blindness detection
- Dataset created for Kaggle competition run by APTOS
- Images taken using fundus photography
- Labeled by clinicians on a scale of 0-4 for severity of diabetic retinopathy
- Evaluation metric is quadratic weighted kappa (QWK)
- Models after VGG do not show significant improvement
- DeiT and EfficientNets perform slightly worse
- Accuracy has similar trend as QWK
- Top Kaggle submission achieves 0.936 QWK
- Additional augmentation, external data, training on L1-loss, replacing pooling layer with generalized mean pooling, and ensembling models with different input sizes used
- Increasing color and affine augmentation can account for 0.03 QWK point improvement
- Top leaderboard entry Inception-ResNet v2 trained with additional interventions achieves 0.927 QWK
- New models trained with additional interventions do at least 0.03 QWK points better
Human protein atlas image classification
- Human Protein Atlas runs a Kaggle competition to build a tool to identify and locate proteins from microscopy images
- Macro F1 score is used to account for multiple proteins in images
- 73%/18%/9% train/validation/test-validation split used
- Positive trend between task performance and ImageNet performance
- Challenges include class imbalance, multi-label thresholding, and generalization
- Competitors used external data, data cleaning, augmentation, ensembling, and oversampling
- Modified architectures and metric learning used in first place solution
Siim-isic melanoma classification
- SIIM & ISIC ran a Kaggle competition to identify Melanoma
- 33,126 training images and 25,331 images from previous competitions were used
- 80% to 20% class-balanced and year-balanced train/validation split
- Evaluation metric was area under ROC curve
- Weak positive correlation (0.44) between ImageNet performance and task performance
- Classification accuracy showed stronger trend for transfer
- Top two Kaggle solutions used models with different input size, ensembling, cross-validation and training augmentation
Cassava leaf disease classification
- Makerere Artificial Intelligence Lab focuses on applications that benefit the developing world
- Goal of Cassava Leaf Disease Classification Kaggle competition is to give farmers access to methods for diagnosing plant diseases
- Images were taken with an inexpensive camera and labeled by agricultural experts
- Each image was classified as healthy or as one of four different diseases
- 80%/20% random class-balanced train/validation split of the provided training data
- Weak positive correlation (0.12) and a near-zero normalized slope (0.02)
- Using 13 spectral bands
- Using RGB images and keeping experimental setup consistent
- All models over 60% ImageNet accuracy achieve over 98.5% EuroSAT accuracy
- Majority of models achieve over 99.0% EuroSAT accuracy
Additional studies
Augmentation ablations
- Experiments were conducted to compare models and minimize confounding factors.
- Different architectures and augmentation strategies may have different results.
- Experiments were conducted to explore the effect of data augmentation on transfer.
- Pre-training with RandAugment improved performance on downstream tasks, but pre-training with AugMix did not.
- Fine-tuning with RandAugment usually yields additional performance gains.
- Additional augmentation did not significantly increase performance on downstream tasks for DeiT models.
Clip models
- CLIP models from Radford et al. (2021) use diverse pre-training data and achieve high performance on downstream datasets
- Results of fine-tuning CLIP models are visualized in Appendix I Figure 8
- Using larger images increases performance on all datasets
- Pre-training data helps for CCT-20, HPA, and Cassava
Discussion
- ImageNet accuracy does not always correlate with transfer accuracy on real-world tasks
- Number of classes and dataset size do not explain the differences from Kornblith et al. (2019)
- Parameter count is not a good indicator of improved transfer performance on real-world datasets
Differences between web-scraped datasets and real-world images
- It is possible to perform well on web-scraped target datasets by collecting a large amount of data from the Internet and training a large model on it.
- Recent models trained on large web-scraped datasets have achieved high accuracy on web-scraped benchmarks.
- There is a large amount of distribution shift between web-scraped datasets and real-world datasets.
- Gains in ImageNet accuracy over the last decade have primarily come from improving and scaling architectures, and these gains generally transfer to other web-scraped datasets.
- Improvements arising from architecture generally do not transfer to non-web-scraped tasks.
- Data augmentation and other tweaks can provide further gains on these tasks.
- Researchers should explicitly search for methods that improve accuracy on real-world non-web-scraped datasets.
- There may be methods that improve accuracy on real-world tasks but not ImageNet.
- Average accuracy across non-web-scraped datasets explains variance beyond that explained by ImageNet accuracy.
- Most methodological innovations that help on ImageNet are useful for some real-world tasks.
- Future benchmarks should include more diverse datasets to encourage a more comprehensive approach to improving learning algorithms.
- 19 model architectures are examined to observe the relationship between ImageNet performance and target dataset performance.
- Hyperparameter tuning is used to get the best performance out of each model.
- Models are pretrained on ImageNet and then fine-tuned on the downstream task.
- Experiments are ran on TPU v2-8s and NVIDIA A40s.
- Models are trained using AdamW with a cosine scheduler.
- Images are normalized to ImageNet’s mean and standard deviation.
- Data augmentation techniques are used.
- There is a strong linear trend between ImageNet accuracy and the target metrics.