Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Transfer learning produces high-accuracy models
  • Pre-training size affects transfer learning performance
  • Choice of pre-training data source is essential for few-shot transfer
  • There is a trade-off between label noise and the size of the pre-training dataset
  • Language-image contrastive vs. image-image contrastive pre-training methods have different effects on downstream accuracy

Paper Content

Introduction

  • Transfer learning is a popular way to build computer vision models
  • Pre-trained models have improved in recent years
  • Research question: How do pre-training dataset and algorithm affect downstream performance?
  • Differences between pre-training sources diminish as more data is available for downstream tasks
  • In few-shot setting, different pre-training datasets lead to noticeable differences in downstream performance
  • When controlling for size of pre-train model and downstream dataset, changing pre-train dataset leads to noticeable differences in downstream accuracy
  • Certain pre-training datasets consistently lead to better transfer accuracy than others
  • Comparing supervised and semi-supervised pre-training strategies
  • Increasing pre-training dataset size improves transfer performance, but depends on dataset and number of fine-tuning samples
  • SimCLR pre-training leads to better transfer than CLIP pre-training in few-shot regime
  • 4000 experiments conducted to evaluate downstream performance
  • Kim et al. (2022) studied the effect of network architecture, pre-training dataset, supervised vs self-supervised learning objectives, and different domain transfer methods on the transferability of representations to new domains
  • Abnar et al. (2021) explored how different upstream training settings affect transfer accuracy for two upstream datasets and more than 20 downstream tasks
  • Abnar et al. (2021) lacked controlled comparison between different distributions in the pre-training datasets
  • This work extends the results of Abnar et al. (2021) to more pre-training datasets and methods, with a special focus on data distribution and curation
  • This work looks at both few-shot and full-shot transfer accuracy to study the effect of transfer learning as more target data become available

Experimental setup

  • CLIP model has demonstrated robustness to natural distribution shifts
  • CLIP learns a joint embedding space for images and captions
  • ResNet-50 used as image encoder
  • Vary pre-training data distribution, curation method, and pre-training dataset size
  • Change contrastive loss function to SimCLR
  • Fine-tune pre-trained model end-to-end on target transfer dataset
  • Pre-training datasets consist of million-scale sets of image-text pairs from multiple sources
  • Downstream tasks use nine different datasets
  • Six datasets are internet-crawled, three are domain-specific
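As a rough illustration of the language-image contrastive objective behind the joint embedding space mentioned above, here is a minimal pure-Python sketch of a symmetric image-text contrastive loss. The function names, the toy embeddings, and the fixed temperature of 0.07 are assumptions for illustration; this is not the paper's training code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Cross-entropy of one list of logits against an integer target index."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of matched pairs.

    image_embs[i] and text_embs[i] are assumed to describe the same example.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and caption j, over temperature
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text: each image should pick out its own caption.
    loss_i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image: each caption should pick out its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

With toy embeddings, correctly paired images and captions should give a much lower loss than pairs whose captions have been swapped.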

Experiments and results

  • Pre-training data sources have an impact on transfer learning performance
  • As more images are available for fine-tuning, the difference in accuracy between different pre-training models is reduced
  • Changing the pre-training dataset leads to noticeable differences in the downstream performance in a few-shot setting
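The few-shot setting referred to above fine-tunes on only a handful of labeled examples per class. A minimal sketch of how such a subset could be drawn is shown below; the function name, the seed handling, and the toy labels are hypothetical and not taken from the paper.

```python
import random
from collections import defaultdict

def sample_k_shot(labels, k, seed=0):
    """Return sorted indices of a subset with at most k examples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)        # random order within the class
        chosen.extend(indices[:k])  # keep the first k (or fewer, if the class is small)
    return sorted(chosen)

# Toy label list: 5 cats, 3 dogs, 1 bird.
labels = ["cat"] * 5 + ["dog"] * 3 + ["bird"]
subset = sample_k_shot(labels, k=2)
```

Fixing the seed keeps the subset identical across pre-trained models, so that differences in accuracy come from pre-training rather than from the choice of fine-tuning samples.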

Which data distribution is better for transfer learning?

  • Pre-training on Shutterstock and LAION datasets results in superior transfer performance
  • Redcaps yields superior performance for PETS
  • Most common words in Redcaps captions are “cats” and “dogs”
  • Most common words in Shutterstock captions are “background”, “design”, “pattern”, and “texture”
  • Transfer learning from pre-training dataset outperforms training from scratch

Do well-curated pre-training datasets lead to better transfer?

  • Significant effort to create computer vision datasets with high-quality labels
  • Recent datasets are large but noisy
  • Investigating how much laborious ImageNet labeling is worth
  • Pre-training ResNet-50 on ImageNet-1K using supervised cross-entropy loss
  • Discarding ImageNet labels and using CLIP to pre-train on ImageNet with Flickr captions
  • Supervised pre-training on ImageNet outperforms CLIP pre-training
  • Pre-training on larger datasets improves downstream transfer accuracy

Effect of pre-training loss

  • Pre-training loss from language-image contrastive in CLIP replaced with image-only contrastive loss in SimCLR
  • Changing pre-train dataset leads to differences in few-shot downstream performance
  • SimCLR pre-training leads to better downstream transfer accuracy than CLIP in few-shot regime
  • Difference in accuracy between pre-training methods decreases with more data for fine-tuning
  • Difference in accuracy between CLIP and SimCLR varies across datasets
  • Pre-training decisions lead to similar accuracy in high-shot regime, outperforming training from scratch
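For contrast with the language-image objective, the image-only contrastive loss used by SimCLR (NT-Xent) can be sketched as follows. This is a pure-Python illustration under stated assumptions: the function name, the toy inputs, and the temperature of 0.1 are not from the paper, and real implementations operate on batched tensors.

```python
import math

def nt_xent_loss(views_a, views_b, temperature=0.1):
    """NT-Xent loss for N images, each with two augmented views.

    views_a[i] and views_b[i] are embeddings of two augmentations of image i.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    z = [normalize(v) for v in views_a] + [normalize(v) for v in views_b]
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same image
        sims = [sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(2 * n)]
        # The denominator covers every other embedding, excluding i itself.
        others = [sims[j] for j in range(2 * n) if j != i]
        m = max(others)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in others))
        total += log_denominator - sims[positive]
    return total / (2 * n)
```

The only supervision signal here is that two augmentations of the same image should be close; no captions are involved, which is the key difference from the CLIP objective.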

A. Effect of the pre-training data distribution

  • Figure 7 gives the detailed results behind Figure 1, comparing different data sources for pre-training.
  • Changing the pre-training dataset leads to noticeable differences in low-shot downstream performance across the nine datasets.
  • If many samples are available for fine-tuning, the difference in accuracy between models pre-trained on different sources is reduced.

B. Training details

  • CLIP models trained from scratch using AdamW optimizer
  • Data augmentations from Radford et al. (2021)
  • SimCLR implementation follows SLIP (Mu et al., 2021)
  • Fine-tuned on downstream tasks for 128 epochs with learning rates ranging from 0.0001 to 0.003
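A fine-tuning sweep over the stated learning-rate range could be set up along these lines. The range and the epoch count come from the summary above; the log spacing, the grid size of four, and all names are hypothetical choices for illustration.

```python
def log_spaced(low, high, num):
    """Return num values from low to high, evenly spaced on a log scale."""
    if num < 2:
        raise ValueError("num must be at least 2")
    ratio = (high / low) ** (1.0 / (num - 1))
    return [low * ratio ** i for i in range(num)]

# Epoch count and range endpoints are from the summary; the grid size is assumed.
FINETUNE_EPOCHS = 128
learning_rates = log_spaced(1e-4, 3e-3, num=4)
configs = [{"lr": lr, "epochs": FINETUNE_EPOCHS} for lr in learning_rates]
```

Log spacing is a common choice for learning rates because useful values tend to differ by multiplicative factors rather than fixed increments.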

C. Effect of data curation: ImageNet captioning

  • Compared CLIP models pre-trained on LAION with CLIP models pre-trained on two versions of the curated ImageNet dataset
  • Image deduplication routine and removal of text containing profanity left dataset of 463,622 images with corresponding text data
  • IN1K-Template-Captions dataset includes all data in ImageNet dataset, paired with templated captions
  • CLIP pre-training on clean images and text similar to standard supervised training
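To make the idea of templated captions concrete, the sketch below pairs class names with caption templates. The templates and class names are hypothetical examples in the style of common prompt templates; the summary does not list the exact templates used for IN1K-Template-Captions.

```python
def templated_captions(class_names, templates):
    """Pair each class name with every caption template."""
    return {name: [t.format(name) for t in templates] for name in class_names}

# Hypothetical templates and class names, for illustration only.
TEMPLATES = ["a photo of a {}.", "a close-up photo of a {}."]
captions = templated_captions(["tabby cat", "golden retriever"], TEMPLATES)
```

Because every caption is a deterministic function of the label, contrastive training on such pairs carries essentially the same information as the class label, which fits the observation above that it behaves much like standard supervised training.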

D. Other architectures

  • Results were extended to include Vision Transformers
  • Figure 9 shows the effect of data distribution on fine-tuning transfer to CIFAR100, DTD, and CALTECH101 when using ViT instead of ResNet-50
  • The difference in fine-tuning performance is minimal
  • Both models perform similarly in the few-shot setting
  • The hypothesized reason is the similarity between the LAION and OpenAI data distributions

E. Effect of pre-training data distribution: SimCLR instead of CLIP

  • Previous experiments used CLIP and fine-tuned end-to-end from a zero-shot pre-trained model
  • SimCLR finetuning uses LP-FT instead of a zero-shot pre-trained model
  • LP-FT is a two-step procedure
  • First step freezes the encoder and trains a classification head from random initialization
  • Second step initializes the classification head with the linear probe and finetunes the whole model
  • Transfer learning is widely used in deep learning
  • Neyshabur et al. (2020) separated the effect of feature reuse from that of learning low-level pre-training data statistics
  • Raghu et al. (2019) found that transfer learning from ImageNet pre-trained models shows little benefit in performance
  • Ericsson et al. (2021) found that self-supervised models can outperform supervised pre-training
  • Islam et al. (2021) found that contrastively trained models outperform standard cross-entropy models in transfer learning
  • Goyal et al. (2021) showed that self-supervised models outperform supervised models on ImageNet
  • Radford et al. (2021) introduced CLIP which learns a joint embedding space for images and captions
  • Flamingo (Alayrac et al., 2022) enables visual question answering and image captioning
  • Fang et al. (2022) found that the diverse training distribution is the main cause of the robustness properties of CLIP
  • Nguyen et al. (2022) explored the role of the pre-training dataset for CLIP
  • Santurkar et al. (2022) found language supervision important if the pre-training dataset is large
  • Cherti & Jitsev (2021) found that pre-training on natural-image ImageNet-21k is as good as or better than pre-training on medical X-ray data
  • Djolonga et al. (2021) found that increasing pre-training data size and model size significantly improves robustness
  • 9 different downstream datasets used
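The two-step LP-FT procedure described earlier in this list (train a head on a frozen encoder, then fine-tune the whole model starting from that head) can be illustrated with a toy model. Everything here is an assumption made for illustration: the 2x2 linear "encoder", the logistic head, plain SGD, the hyperparameters, and the toy data are not the paper's setup.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(encoder, head, x):
    """Encoder: linear map given as a list of rows; head: logistic regression on the features."""
    feats = [sum(w * xi for w, xi in zip(row, x)) for row in encoder]
    prob = sigmoid(sum(h * f for h, f in zip(head, feats)))
    return feats, prob

def train(encoder, head, data, epochs, lr, update_encoder):
    """Plain SGD on binary cross-entropy; the encoder stays frozen unless update_encoder is True."""
    for _ in range(epochs):
        for x, y in data:
            feats, prob = forward(encoder, head, x)
            err = prob - y  # gradient of the loss with respect to the logit
            # Encoder gradients use the head values from before this step.
            enc_grads = [[err * h * xi for xi in x] for h in head]
            head[:] = [h - lr * err * f for h, f in zip(head, feats)]
            if update_encoder:
                for row, grads in zip(encoder, enc_grads):
                    row[:] = [w - lr * g for w, g in zip(row, grads)]

def mean_loss(encoder, head, data):
    eps = 1e-12
    total = 0.0
    for x, y in data:
        _, p = forward(encoder, head, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)

def lp_ft(pretrained_encoder, data, lp_epochs=20, ft_epochs=20, lr=0.1, seed=0):
    rng = random.Random(seed)
    encoder = [row[:] for row in pretrained_encoder]  # copy; the input is left untouched
    # Step 1 (linear probe): freeze the encoder, train a randomly initialized head.
    head = [rng.uniform(-0.1, 0.1) for _ in encoder]
    train(encoder, head, data, lp_epochs, lr, update_encoder=False)
    probe_loss = mean_loss(encoder, head, data)
    # Step 2 (fine-tune): keep the probed head and update the whole model.
    train(encoder, head, data, ft_epochs, lr, update_encoder=True)
    return encoder, head, probe_loss, mean_loss(encoder, head, data)

# Toy usage: an identity "pre-trained" encoder and four separable points.
pretrained = [[1.0, 0.0], [0.0, 1.0]]
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 0.2], 1), ([0.2, 1.0], 0)]
encoder, head, probe_loss, ft_loss = lp_ft(pretrained, data)
```

Starting the second step from the probed head, rather than from a random one, means the first full-model updates are driven by a head that already fits the frozen features.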

G.2 Pre-training datasets

  • 7 pre-training datasets used: YFCC, LAION, Redcaps, Shutterstock, Conceptual Captions-3m, Conceptual Captions-12m, WIT
  • Supervised pre-training on ImageNet-1K is compared to contrastive pre-training on the original Flickr captions and on templated captions built from ImageNet labels
  • Supervised pre-training on ImageNet leads to better transfer accuracy than contrastive pre-training
  • Increasing pre-training dataset size leads to better transfer accuracy on downstream tasks
  • Pre-training CLIP on ImageNet outperforms LAION-1m by a large margin
  • Including 15x more data from LAION outperforms supervised ImageNet pre-training on CIFAR100
  • 2000x more data from LAION needed to match or outperform ImageNet pre-training on DTD, REAL, and CLIPART
  • Supervised ImageNet pre-training still the best choice for CALTECH101 and PETS
  • SimCLR pre-training leads to better downstream transfer accuracy than CLIP pre-training
  • Different pre-training datasets lead to noticeable differences in downstream performance in the low-shot setting
  • Pre-training CLIP on LAION-1m is only as good as pre-training on ImageNet with Flickr captions, which uses about half as much data
  • Details on downstream datasets and pre-training datasets provided