Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Dropout is a regularizer for preventing overfitting in neural networks
  • Dropout can also mitigate underfitting when used at the start of training
  • Dropout reduces the directional variance of gradients across mini-batches
  • Early dropout (dropout used only during the initial phases of training) can improve performance in underfitting models
  • Late dropout (dropout disabled in the early iterations and activated only later in training) can regularize overfitting models

Paper Content

Introduction

  • AlexNet’s “ImageNet moment” in 2012 launched a new era in deep learning
  • Dropout was invented in 2012 and has since become widely adopted to reduce overfitting in neural networks
  • Deep learning is evolving quickly and dropout has stayed relevant
  • Drop rate of dropout has generally been decreasing over the years
  • Dropout can be used to tackle underfitting
  • Dropout can reduce gradient variance and allow the model to update in more consistent directions
  • Early and late dropout can improve results compared to no dropout and standard dropout

Revisiting overfitting vs. underfitting

  • Overfitting occurs when a model fits the training data too well but generalizes poorly to unseen data
  • Factors that determine overfitting include model capacity, dataset scale, and training length
  • Smaller datasets and larger models lead to more overfitting
  • Dropout is a method used to prevent overfitting
  • Stochastic depth is a dropout variant designed for regularizing residual networks
  • Optimal drop rate depends on model size and dataset size
  • With increasing data size, drop rate used for dropout has generally decreased
  • Future models may have more trouble fitting data properly than overfitting
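
As a reminder of the mechanism these bullets refer to, here is a minimal sketch of standard (inverted) dropout on a plain list of activations. The function name and arguments are illustrative assumptions rather than anything taken from the paper: each unit is zeroed with probability drop_rate during training, and the survivors are rescaled so the expected activation is unchanged.

```python
import random


def dropout(activations, drop_rate, training=True, rng=random):
    """Inverted dropout on a list of floats (illustrative helper, not from the paper)."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    # At evaluation time, or with a zero drop rate, activations pass through unchanged.
    if not training or drop_rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_rate
    # Keep each unit with probability keep_prob and rescale it by 1 / keep_prob,
    # so the expected value of every output matches its input.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]
```

With a larger drop_rate more units are zeroed per step, which is the knob the bullets above describe as having decreased over the years.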

How dropout can reduce underfitting

  • Dropout can be used to reduce underfitting
  • Gradient norm of dropout model is smaller
  • Dropout model moves a larger distance from its initial point than the baseline model
  • Dropout model produces more consistent gradient directions
  • Gradient direction error of dropout model is smaller at the beginning of training
  • Gradient variance of dropout model is lower
  • By reducing gradient variance, dropout helps prevent the model from overfitting to specific mini-batches early in training
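
The gradient-direction claims above can be made concrete with a small sketch. It assumes that "gradient variance" means the average pairwise cosine distance between mini-batch gradients, and that "gradient direction error" means the average cosine distance between each mini-batch gradient and a reference whole-dataset gradient. The function names and the exact normalization are assumptions for illustration and may differ from the paper's definitions.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Half of (1 - cosine similarity): 0 for aligned vectors, 1 for opposite ones.

    Assumes both vectors are non-zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.5 * (1.0 - dot / (norm_u * norm_v))


def gradient_direction_variance(grads):
    """Average pairwise cosine distance among mini-batch gradients (needs >= 2)."""
    pairs = list(combinations(grads, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


def gradient_direction_error(grads, reference):
    """Average cosine distance between each mini-batch gradient and a reference."""
    return sum(cosine_distance(g, reference) for g in grads) / len(grads)
```

Under these assumed definitions, a set of mini-batch gradients that all point the same way has zero direction variance, and lower values of either quantity correspond to the "more consistent gradient directions" described above.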

Approach

  • Dropout can improve model’s ability to fit training data
  • Underfitting and overfitting regimes can be difficult to define
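  • Early dropout: use dropout before a certain iteration, then disable it for the rest of training
  • Late dropout: train without dropout before a certain iteration, then enable it
  • Two new hyper-parameters: the number of epochs to wait before switching dropout off (early) or on (late), and the drop rate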
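
The two schedules above can be sketched as a single function that returns the drop rate to use at a given epoch. The names drop_rate_at, mode, and cutoff_epoch are hypothetical, and the sketch switches at epoch granularity with a constant rate while dropout is active. Both choices are assumptions for illustration, not necessarily the authors' implementation.

```python
def drop_rate_at(epoch, mode, drop_rate, cutoff_epoch):
    """Return the drop rate for a 0-indexed epoch (illustrative helper).

    mode: "early"    -> dropout only before cutoff_epoch
          "late"     -> dropout only from cutoff_epoch onward
          "standard" -> dropout throughout training
    """
    if mode == "early":
        return drop_rate if epoch < cutoff_epoch else 0.0
    if mode == "late":
        return 0.0 if epoch < cutoff_epoch else drop_rate
    if mode == "standard":
        return drop_rate
    raise ValueError(f"unknown mode: {mode!r}")


# Example: a 4-epoch run with the switch after 2 epochs.
early = [drop_rate_at(e, "early", 0.1, 2) for e in range(4)]  # [0.1, 0.1, 0.0, 0.0]
late = [drop_rate_at(e, "late", 0.1, 2) for e in range(4)]    # [0.0, 0.0, 0.1, 0.1]
```

In a training loop, the returned value would be written into the model's dropout layers at the start of each epoch.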

Experiments

  • Conducted empirical evaluations on ImageNet-1K classification
  • 1,000 classes and 1.2M training images

Early dropout

  • Evaluated early dropout using small models on ImageNet-1K
  • Doubled training epochs and reduced mixup and cutmix strength
  • Baselines achieved improved accuracy, surpassing previous literature results
  • Early dropout provided further boost in accuracy

Analysis

  • Ablation studies were conducted to understand the characteristics of early dropout
  • Different strategies for scheduling dropout or related regularizers have been explored
  • These strategies typically increase or decrease the strength of dropout gradually over all or nearly all of training
  • The purpose of these strategies is to reduce overfitting rather than underfitting
  • Experiments used a linearly decreasing schedule from an initial value p to 0 by default
  • Early dropout helps models fit the training data better
  • Early dropout improves the performance of the two smaller models but not of the larger ViT-B
  • Early dropout does not depend on one particular schedule to work
  • The optimal p may differ for each scheduling option
  • Early dropout consistently improves accuracy with or without lr warmup
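
To illustrate the scheduling options discussed above, the sketch below compares a linearly decreasing early-dropout schedule (from an initial value p to 0) with a constant one. The function names and the step-based cutoff are assumptions made for this example. The summary states only that a linearly decreasing schedule was the default and that other schedules also work.

```python
def linear_early_drop_rate(step, initial_rate, cutoff_step):
    """Decay the drop rate linearly from initial_rate at step 0 to 0 at cutoff_step."""
    if cutoff_step <= 0 or step >= cutoff_step:
        return 0.0
    return initial_rate * (1.0 - step / cutoff_step)


def constant_early_drop_rate(step, initial_rate, cutoff_step):
    """Hold the drop rate at initial_rate until cutoff_step, then switch it off."""
    return initial_rate if step < cutoff_step else 0.0


# Example: compare the two schedules over a 10-step early phase.
for step in (0, 5, 10):
    print(step,
          linear_early_drop_rate(step, 0.2, 10),
          constant_early_drop_rate(step, 0.2, 10))
```

Because the linear schedule spends less time at the full rate, its best initial value p need not match the best constant rate, which is consistent with the note above that the optimal p may differ per option.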

Downstream tasks

  • Evaluate pre-trained ImageNet-1K models by finetuning them on downstream tasks
  • Direct evaluation on robustness benchmarks is reported in Appendix D
  • Finetune pre-trained Swin-F and ConvNeXt-F backbones with Mask R-CNN
  • Finetune pre-trained models on the ADE20K semantic segmentation task
  • Evaluate on several downstream classification datasets

Related work
  • Weight decay (L2 regularization) is commonly used to train neural networks
  • L1 regularization can select features
  • Label smoothing replaces one-hot targets with soft probabilities
  • Data augmentation can be a form of regularization
  • Dropout has many variants to improve or adapt it
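
As a concrete instance of one regularizer listed above, here is a minimal sketch of label smoothing in one common formulation: the one-hot target is mixed with a uniform distribution over the classes. The function name and the default smoothing value of 0.1 are illustrative assumptions, not values taken from the paper.

```python
def smooth_labels(target_index, num_classes, smoothing=0.1):
    """Return a soft target: (1 - smoothing) * one_hot + smoothing / num_classes."""
    if not 0.0 <= smoothing < 1.0:
        raise ValueError("smoothing must be in [0, 1)")
    uniform_mass = smoothing / num_classes
    # Every class receives the uniform share; the true class also keeps 1 - smoothing.
    return [
        (1.0 - smoothing) + uniform_mass if k == target_index else uniform_mass
        for k in range(num_classes)
    ]
```

The resulting vector still sums to 1 and still peaks at the true class, but no class has probability exactly 0 or 1, which discourages over-confident predictions.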

Conclusion

  • Dropout helps reduce overfitting
  • Dropout counters the mini-batch sampling randomness introduced by SGD, reducing gradient variance early in training
  • Early dropout helps underfitting models fit better
  • Late dropout helps improve generalization of overfitting models
  • Dropout produces mini-batch gradients that are more aligned with the entire dataset
  • Dropout leads to smaller gradient magnitudes and a greater distance traveled from the initial point in parameter space
  • Early dropout results in lower training loss and higher test accuracy
  • Late dropout improves test accuracy and reduces overfitting
  • Late dropout is less sensitive to changes in drop rate