Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- JE-SSL has seen rapid developments in recent years due to its promise to leverage large unlabeled data.
- Development of JE-SSL methods driven by search for increasing classification accuracies and use of computational resources.
- This has led to numerous pre-conceived ideas that carried over across methods.
- This work debunks these ideas to unleash the full potential of JE-SSL.
- Examples of debunked ideas include SimCLR requiring large mini batches and strong data augmentations.
- PyTorch library introduced to allow researchers to easily make more extensive evaluations of their methods.
Paper Content
Introduction
- Interest in Self-Supervised Learning (SSL) has increased
- [5] used data augmentations to design positive pairs of examples and large mini-batches to define negative examples
- Other works tried to build upon the contrastive method of [5]
- Common criticism is the need for a large number of negative examples
- FFCV-SSL is proposed library optimized for Self-Supervised Learning
- FFCV-SSL enables empirical investigations against preconceived failure modes of SSL models
- SimCLR can perform equally well with small or large mini-batch training
- Strong data-augmentations are not always necessary
- It is possible to train SSL method with only 1 GPU
- SSL experiment can be run 3 times faster with FFCV-SSL
Joint-embedding self-supervised methods and notations
- JE-SSL relies on processing multiple related views of an input through a deep network
- A positive term and a collapse prevention term are used in the loss function
- Input samples and a known positive relationship between them are needed
- Data augmentation is used to construct the JE-SSL dataset
- Different losses and constraints are used to facilitate training
- Projector networks are added on top of the deep network to improve performance
Debunking popular myths about joint-embedding failure cases
- JE-SSL models require a large number of hyper-parameters and high computation cost to train
- Previous studies have only explored a small part of the hyper-parameter space
- Sensitivity experiments can be useful but can also lead to misconceptions
- This paper aims to investigate and debunk previous observations
- Results show that JE-SSL methods are more similar than previously thought and suffer less dramatic failures
The impact of mini-batch size for simclr and barlowtwins
- SimCLR and BarlowTwins benefit from large mini-batch size for more accurate estimation
- Belief that many methods rely on large mini-batch size is mentioned in recent studies
- Gap in performances between bigger and smaller batch size can be reduced with better optimization technique
- Performances on ImageNet better with larger batch size, but not necessarily true for all downstream tasks
- Hyper-parameters of SSL loss need to be optimized
- Learning rate has direct impact on gap in accuracy between larger and smaller batch size
- Adding layers in projector helps bridge gap in performances between large and small mini-batch size
The impact of data-augmentation
- JE-SSL methods need strong data augmentations
- Solarization transformation applied with 20% probability
- Grayscale operation applied with 20% probability
- Gaussian blur applied 100% of the time
- ColorJitter operation applied with 80% probability
- Grayscaling has most significant impact on ImageNet accuracy
The impact of the evaluation protocol
- Self-Supervised learning involves evaluating the learned representation
- Evaluation typically includes linear probing and/or finetuning
- Non linear probing should also be considered
- Comparing linear probing in online and offline settings
- Non linear probing results in a significant boost in accuracy
- Non linear probing can easily overfit on its training set
Per-instance positive and negative sample generation
- JE-SSL methods need positive and negative inputs to prevent representation from collapsing
- Hard-negative sampling is not necessary for the model to work
- Negative samples can be generated from simple transformations of the same image
- Instance-based SimCLR can learn good representations for downstream tasks
- Accuracy on CIFAR10 and Eurosat is high, but lower on ImageNet
- Instance SimCLR learns to denoise a specific patch of the image
Ffcv-ssl: a fast data loading library tuned to improve je-ssl training time
- Most implementations of Self-Supervised learning use Pytorch and Torchvision
- Data loading process could become a bottleneck when training SSL models
- Replaced Torchvision with FFCV library, created a fork called FFCV-SSL
- Adding data augmentations significantly increases time for a full epoch
- Switching data loading library can give almost 3 times speed up for a single epoch
Enabling single gpu training with ffcv-ssl
- SimCLR can be trained using a single GPU
- Best hyperparameters for single GPU training are temperature of 0.15 and learning rate of 1.0
- This leads to 65.58 accuracy in online linear probing and 68.5% in non linear probing
Recalibrated observations for selfsupervised learning research
- Batch size and data augmentations have been overstated in JE-SSL studies.
- Batch size can be counteracted by adjusting learning rate.
- Strong data augmentations are not necessary.
- Imagenet-1k top-1 metric is not a full picture.
- Projector architecture is important for performance.
Conclusion
- Popular ideas around joint-embedding SSL have remained unchallenged for years
- Key findings regarding the requirement of rich data augmentations and the impact of mini-batch size
- Strategies to speed up training and evaluation
- Py-Torch library FFCV-SSL developed to enable rapid training of JE-SSL models
- SimCLR can perform equally well with small or large mini-batch training
- Strong data-augmentations are not always necessary
- Crop-ping+grayscale is enough to reach competitive performances across SSL methods
- Possible to train SSL method using only 1 GPU
- Performances on ImageNet better with larger batch size
- Optimal temperature might not be the same depending of the batch size
- Gain in accuracy using all sets of augmentations available
- Grayscale transformation is very competitive
- MLP only needs a few epochs, linear case limited differences between online and offline performances
- Instance based SimCLR learn to denoise a specific patch of the image
- Best hyper-parameters for single GPU training are temperature of 0.15 and learning rate of 1.0
- Optimal learning rate can be different depending on batch size
- Blur and ColorJitter operations add considerable time in training