Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Joint-embedding self-supervised learning (JE-SSL) has seen rapid development in recent years due to its promise to leverage large unlabeled datasets.
  • Development of JE-SSL methods has been driven mainly by the search for ever higher classification accuracy, using large computational resources.
  • This has led to numerous pre-conceived ideas that carried over across methods.
  • This work debunks these ideas to unleash the full potential of JE-SSL.
  • Examples of debunked ideas include the belief that SimCLR requires large mini-batches and strong data augmentations.
  • A PyTorch library is introduced to let researchers easily run more extensive evaluations of their methods.

Paper Content

Introduction

  • Interest in Self-Supervised Learning (SSL) has increased
  • [5] used data augmentations to design positive pairs of examples and large mini-batches to define negative examples
  • Other works tried to build upon the contrastive method of [5]
  • Common criticism is the need for a large number of negative examples
  • FFCV-SSL is a proposed library optimized for Self-Supervised Learning
  • FFCV-SSL enables empirical investigations against preconceived failure modes of SSL models
  • SimCLR can perform equally well with small or large mini-batch training
  • Strong data-augmentations are not always necessary
  • It is possible to train an SSL method with only 1 GPU
  • SSL experiments can run almost 3 times faster with FFCV-SSL

Joint-embedding self-supervised methods and notations

  • JE-SSL relies on processing multiple related views of an input through a deep network
  • A positive term and a collapse prevention term are used in the loss function
  • Input samples and a known positive relationship between them are needed
  • Data augmentation is used to construct the JE-SSL dataset
  • Different losses and constraints are used to facilitate training
  • Projector networks are added on top of the deep network to improve performance
  • JE-SSL models require a large number of hyper-parameters and high computation cost to train
  • Previous studies have only explored a small part of the hyper-parameter space
  • Sensitivity experiments can be useful but can also lead to misconceptions
  • This paper aims to investigate and debunk previous observations
  • Results show that JE-SSL methods are more similar than previously thought and suffer less dramatic failures
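The split of the loss into a positive term and a collapse prevention term can be illustrated with a SimCLR-style contrastive loss. The following is a minimal sketch in plain Python, not the paper's or FFCV-SSL's implementation; the function names and the list-of-lists embedding format are assumptions made for illustration, and the default temperature of 0.15 is simply the single-GPU value reported later in this summary.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def nt_xent_loss(views_a, views_b, temperature=0.15):
    """SimCLR-style contrastive loss (illustrative sketch).

    views_a[i] and views_b[i] are embeddings of two augmented views of
    the same input i; every other embedding in the batch is a negative.
    """
    z = [l2_normalize(v) for v in views_a + views_b]
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same input
        # Scaled similarities to every other embedding (positive and negatives).
        sims = [
            sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(2 * n)
            if j != i
        ]
        # Collapse prevention term: numerically stable log-sum-exp.
        largest = max(sims)
        log_denominator = largest + math.log(sum(math.exp(s - largest) for s in sims))
        # Positive term: similarity between the two views of the same input.
        positive_sim = sum(a * b for a, b in zip(z[i], z[positive])) / temperature
        total += log_denominator - positive_sim
    return total / (2 * n)
```

Calling `nt_xent_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])` gives a lower loss than the same call with the second argument's rows swapped, because in the first case each input's two views agree.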

The impact of mini-batch size for SimCLR and BarlowTwins

  • SimCLR and BarlowTwins are thought to benefit from large mini-batch sizes, which give more accurate estimates of their loss terms
  • The belief that many methods rely on large mini-batch sizes appears in recent studies
  • The performance gap between larger and smaller batch sizes can be reduced with better optimization
  • Performance on ImageNet is better with larger batch sizes, but this does not necessarily hold for all downstream tasks
  • Hyper-parameters of SSL loss need to be optimized
  • Learning rate has direct impact on gap in accuracy between larger and smaller batch size
  • Adding layers in projector helps bridge gap in performances between large and small mini-batch size
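Because the learning rate interacts with the batch size, a sweep is usually centered on a batch-size-dependent starting point. The sketch below uses the linear scaling heuristic, which is a common convention and an assumption here, not a rule stated in this summary; the function names, the reference batch size of 256, and the sweep factors are all hypothetical.

```python
def scaled_learning_rate(base_lr, batch_size, reference_batch_size=256):
    """Linear scaling heuristic: the learning rate grows with the batch size."""
    if batch_size <= 0 or reference_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * batch_size / reference_batch_size


def candidate_learning_rates(base_lr, batch_size, factors=(0.5, 1.0, 2.0)):
    """Learning rates to try around the scaled value for one batch size."""
    center = scaled_learning_rate(base_lr, batch_size)
    return [center * factor for factor in factors]
```

For example, `candidate_learning_rates(0.3, 512)` returns three values centered on 0.6. The scaled value is only a starting point: the summary's claim is that the learning rate has to be tuned per batch size, not that any fixed rule gives the optimum.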

The impact of data-augmentation

  • JE-SSL methods are commonly believed to need strong data augmentations
  • Solarization transformation applied with 20% probability
  • Grayscale operation applied with 20% probability
  • Gaussian blur applied 100% of the time
  • ColorJitter operation applied with 80% probability
  • Grayscaling has most significant impact on ImageNet accuracy
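The probabilities listed above can be read as independent coin flips per image view. The following plain-Python sketch only samples which operations would be applied; it performs no image processing, and the operation names, their order, and the helper `sample_augmentations` are assumptions for illustration (random cropping, which usually precedes these operations, is left out).

```python
import random

# Application probabilities taken from the list above; the key names are hypothetical.
AUGMENTATION_PROBABILITIES = {
    "color_jitter": 0.8,
    "grayscale": 0.2,
    "gaussian_blur": 1.0,
    "solarization": 0.2,
}


def sample_augmentations(rng, probabilities=AUGMENTATION_PROBABILITIES):
    """Return the names of the operations to apply to one image view.

    Each operation is drawn independently. rng.random() lies in [0, 1),
    so a probability of 1.0 means the operation is always applied.
    """
    return [name for name, p in probabilities.items() if rng.random() < p]
```

Calling `sample_augmentations(random.Random(0))` returns one such list. Setting a probability to 0.0 in a copy of the dictionary is a simple way to ablate a single operation, which is the kind of comparison behind the observation about grayscaling.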

The impact of the evaluation protocol

  • Self-Supervised learning involves evaluating the learned representation
  • Evaluation typically includes linear probing and/or finetuning
  • Non-linear probing should also be considered
  • Linear probing is compared in online and offline settings
  • Non-linear probing results in a significant boost in accuracy
  • Non-linear probing can easily overfit on its training set

Per-instance positive and negative sample generation

  • JE-SSL methods need positive and negative inputs to prevent representation from collapsing
  • Hard-negative sampling is not necessary for the model to work
  • Negative samples can be generated from simple transformations of the same image
  • Instance-based SimCLR can learn good representations for downstream tasks
  • Accuracy on CIFAR10 and EuroSAT is high, but lower on ImageNet
  • Instance SimCLR learns to denoise a specific patch of the image

FFCV-SSL: a fast data loading library tuned to improve JE-SSL training time

  • Most implementations of Self-Supervised learning use PyTorch and Torchvision
  • Data loading process could become a bottleneck when training SSL models
  • Replaced Torchvision with FFCV library, created a fork called FFCV-SSL
  • Adding data augmentations significantly increases time for a full epoch
  • Switching data loading library can give almost 3 times speed up for a single epoch

Enabling single GPU training with FFCV-SSL

  • SimCLR can be trained using a single GPU
  • Best hyperparameters for single GPU training are temperature of 0.15 and learning rate of 1.0
  • This leads to 65.58% accuracy in online linear probing and 68.5% in non-linear probing

Recalibrated observations for self-supervised learning research

  • The importance of batch size and data augmentations has been overstated in JE-SSL studies.
  • The effect of batch size can be counteracted by adjusting the learning rate.
  • Strong data augmentations are not necessary.
  • ImageNet-1k top-1 accuracy does not give the full picture.
  • Projector architecture is important for performance.

Conclusion

  • Popular ideas around joint-embedding SSL have remained unchallenged for years
  • Key findings regarding the requirement of rich data augmentations and the impact of mini-batch size
  • Strategies to speed up training and evaluation
  • PyTorch library FFCV-SSL developed to enable rapid training of JE-SSL models
  • SimCLR can perform equally well with small or large mini-batch training
  • Strong data-augmentations are not always necessary
  • Cropping + grayscale is enough to reach competitive performance across SSL methods
  • Possible to train an SSL method using only 1 GPU
  • Performance on ImageNet is better with larger batch sizes
  • Optimal temperature may differ depending on the batch size
  • Gain in accuracy using all sets of augmentations available
  • Grayscale transformation is very competitive
  • The MLP probe needs only a few epochs; for the linear probe, online and offline performance differ little
  • Instance-based SimCLR learns to denoise a specific patch of the image
  • Best hyper-parameters for single GPU training are temperature of 0.15 and learning rate of 1.0
  • Optimal learning rate can be different depending on batch size
  • Blur and ColorJitter operations add considerable time in training