Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Transfer learning produces high-accuracy models
- Pre-training size affects transfer learning performance
- Choice of pre-training data source is essential for few-shot transfer
- Label noise and size of pre-training dataset have trade-offs
- Language-image contrastive vs. image-image contrastive pre-training methods have different effects on downstream accuracy
Paper Content
Introduction
- Transfer learning is a popular computer vision model production method
- Pre-trained models have improved in recent years
- Research question: How do pre-training dataset and algorithm affect downstream performance?
- Differences between pre-training sources diminish as more data is available for downstream tasks
- In few-shot setting, different pre-training datasets lead to noticeable differences in downstream performance
- When controlling for size of pre-train model and downstream dataset, changing pre-train dataset leads to noticeable differences in downstream accuracy
- Certain pre-training datasets consistently lead to better transfer accuracy than others
- Comparing supervised and semi-supervised pre-training strategies
- Increasing pre-training dataset size improves transfer performance, but depends on dataset and number of fine-tuning samples
- SimCLR pre-training leads to better transfer than CLIP pre-training in few-shot regime
- 4000 experiments conducted to evaluate downstream performance
Related work
- Kim et al. (2022) studied the effect of network architecture, pre-training dataset, supervised vs self-supervised learning objectives, and different domain transfer methods on the transferability of representations to new domains
- Abnar et al. (2021) explored how different upstream training settings affect transfer accuracy for two upstream datasets and more than 20 downstream tasks
- Abnar et al. (2021) lacked controlled comparison between different distributions in the pre-training datasets
- This work extends the results of Abnar et al. (2021) to more pre-training datasets and methods, with a special focus on data distribution and curation
- This work looks at both few-shot and full-shot transfer accuracy to study the effect of transfer learning as more target data become available
Experimental setup
- CLIP model has demonstrated robustness to natural distribution shifts
- CLIP learns a joint embedding space for images and captions
- ResNet-50 used as image encoder
- Vary pre-training data distribution, curation method, and pre-training dataset size
- Change contrastive loss function to SimCLR
- Fine-tune pre-trained model end-to-end on target transfer dataset
- Pre-training datasets consist of million-size image and language pairs from multiple sources
- Downstream tasks use nine different datasets
- Six datasets are internet-crawled, three are domain-specific
Experiments and results
- Pre-training data sources have an impact on transfer learning performance
- As more images are available for fine-tuning, the difference in accuracy between different pre-training models is reduced
- Changing the pre-training dataset leads to noticeable differences in the downstream performance in a few-shot setting
Which data distribution is better for transfer learning?
- Pre-training on Shutterstock and LAION datasets results in superior transfer performance
- Redcaps yields superior performance for PETS
- Most common words in Redcaps captions are “cats” and “dogs”
- Most common words in Shutterstock captions are “background”, “design”, “pattern”, and “texture”
- Transfer learning from pre-training dataset outperforms training from scratch
Do well-curated pre-training datasets lead to better transfer?
- Significant effort to create computer vision datasets with high-quality labels
- Recent datasets are large but noisy
- Investigating how much laborious ImageNet labeling is worth
- Pre-training ResNet-50 on ImageNet-1K using supervised cross-entropy loss
- Discarding ImageNet labels and using CLIP to pre-train on ImageNet with Flickr captions
- Supervised pre-training on ImageNet outperforms CLIP pre-training
- Pre-training on larger datasets improves downstream transfer accuracy
Effect of pertaining loss
- Pre-training loss from language-image contrastive in CLIP replaced with image-only contrastive loss in SimCLR
- Changing pre-train dataset leads to differences in few-shot downstream performance
- SimCLR pre-training leads to better downstream transfer accuracy than CLIP in few-shot regime
- Difference in accuracy between pre-training methods decreases with more data for fine-tuning
- Difference in accuracy between CLIP and SimCLR varies across datasets
- Pre-training decisions lead to similar accuracy in high-shot regime, outperforming training from scratch
A effect of the pre-training data distribution
- Figure 7 shows detailed results for Figure 1.
- Pre-training datasets lead to differences in downstream performance in low-shot settings.
- If many samples are available for fine-tuning, the difference in accuracy between models pre-trained on different sources is reduced.
- Figure 7 compares different data sources for pre-training.
- Changing the pre-training dataset leads to noticeable differences in the downstream low-shot performance of nine datasets.
B training details
- CLIP models trained from scratch using AdamW optimizer
- Data augmentations from Radford et al. (2021)
- SimCLR implementation follows SLIP (Mu et al., 2021)
- Finetuned on downstream tasks for 128 epochs with learning rate from 0.0001-0.003
C effect of data curation: imagenet captioning
- Compared CLIP models pre-trained on LAION with CLIP models pre-trained on two versions of the curated ImageNet dataset
- Image deduplication routine and removal of text containing profanity left dataset of 463,622 images with corresponding text data
- IN1K-Template-Captions dataset includes all data in ImageNet dataset, paired with templated captions
- CLIP pre-training on clean images and text similar to standard supervised training
D other architectures
- Results were extended to include Vison Transformers
- Figure 9 shows the effect of data distribution on finetune transfer to CIFAR100, DTD, and CALTECH101 when using ViT instead of ResNet-50
- Difference between fine-tune performance is minimal
- Both models perform similarly in few-shot setting
- Hypothesis that this is due to similarity between LAION and OpenAI distributions
E effect of pre-training data distribution: simclr instead of clip
- Previous experiments used CLIP and fine-tuned end-to-end from a zero-shot pre-trained model
- SimCLR finetuning uses LP-FT instead of a zero-shot pre-trained model
- LP-FT is a two-step procedure
- First step freezes the encoder and trains a classification head from random initialization
- Second step initializes the classification head with the linear probe and finetunes the whole model
F extended related works
- Transfer learning is widely used in deep learning
- Neyshabur et al. (2020) separated the effect of feature reuse from that of learning low-level pre-training data statistics
- Raghu et al. (2019) found that transfer learning from ImageNet pre-trained models shows little benefit in performance
- Ericsson et al. (2021) found that self-supervised models can outperform supervised pre-training
- Islam et al. (2021) found that contrastively trained models outperform standard cross-entropy models in transfer learning
- Goyal et al. (2021) showed that self-supervised models outperform supervised models on ImageNet
- Radford et al. (2021) introduced CLIP which learns a joint embedding space for images and captions
- Flamingo (Alayrac et al., 2022) enables visual question answering and image captioning
- Fang et al. (2022) found that the diverse training distribution is the main cause of the robustness properties of CLIP
- Nguyen et al. (2022) explored the role of the pre-training dataset for CLIP
- Santurkar et al. (2022) found language supervision important if the pre-training dataset is large
- Cherti & Jitsev (2021) found that pre-training on natural ImageNet-21k is as good or better than pre-trained medical X-Ray data
- Djolonga et al. (2021) found that increasing pre-training data size and model size significantly improves robustness
- 9 different downstream datasets used
G.2 pre-training datasets
- 7 pre-training datasets used: YFCC, LAION, Redcaps, Shutterstock, Conceptual Captions-3m, Conceptual Captions-12m, WIT
- ImageNet-1K compared to contrastive pre-training on original captions from Flickr and Templated captions using ImageNet labels
- Supervised pre-training on ImageNet leads to better transfer accuracy than contrastive pre-training
- Increasing pre-training dataset size leads to better transfer accuracy on downstream tasks
- Pre-training CLIP on ImageNet outperforms LAION-1m by a large margin
- Including 15x more data from LAION outperforms supervised ImageNet pre-training on CIFAR100
- 2000x more data from LAION needed to match or outperform ImageNet pre-training on DTD, REAL, and CLIPART
- Supervised ImageNet pre-training still the best choice for CALTECH101 and PETS
- SimCLR pre-training leads to better downstream transfer accuracy than CLIP pre-training
- Different pre-training datasets lead to noticeable differences in downstream performance in the low-shot setting
- Pre-training CLIP on LAION-1m is only as good as ImageNet with Flickr captions with half of the data
- Details on downstream datasets and pre-training datasets provided