Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Dropout is a regularizer for preventing overfitting in neural networks
Dropout can also mitigate underfitting when used at the start of training
Dropout reduces the directional variance of gradients across mini-batches
Early dropout (dropout used only during the initial phases of training) can improve performance in underfitting models
Late dropout (dropout disabled in the early iterations and activated only later in training) can regularize overfitting models

Paper Content

Introduction

AlexNet’s “ImageNet moment” in 2012 launched a new era in deep learning
Dropout was invented in 2012 and has since become widely adopted to reduce overfitting in neural networks
Deep learning is evolving quickly and dropout has stayed relevant
The drop rate used in practice has generally decreased over the years
Dropout can be used to tackle underfitting
Dropout can reduce gradient variance and allow the model to update in more consistent directions
Early and late dropout can improve results compared to no dropout and standard dropout

Revisiting overfitting vs. underfitting

Overfitting occurs when a model fits the training data too well but generalizes poorly to unseen data
Factors that determine overfitting include model capacity, dataset scale, and training length
Smaller datasets and larger models lead to more overfitting
Dropout is a method used to prevent overfitting
Stochastic depth is a dropout variant designed for regularizing residual networks
Optimal drop rate depends on model size and dataset size
With increasing data size, drop rate used for dropout has generally decreased
Future models may struggle more with fitting the data properly (underfitting) than with overfitting
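
As a reference for the two regularizers named above, here is a minimal sketch of dropout and stochastic depth on plain Python lists. The function names and the choice to rescale surviving values at training time (the "inverted" convention) are assumptions made for illustration, not details taken from the paper.

```python
import random


def _check_rate(drop_rate):
    # The rescaling below divides by (1 - drop_rate), so a rate of 1.0 is rejected.
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")


def dropout(values, drop_rate, training=True, rng=random):
    """Zero each value with probability drop_rate and rescale the survivors."""
    _check_rate(drop_rate)
    if not training or drop_rate == 0.0:
        return list(values)
    keep = 1.0 - drop_rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def stochastic_depth(x, residual_fn, drop_rate, training=True, rng=random):
    """Residual block x + f(x) whose whole branch f is skipped with probability drop_rate."""
    _check_rate(drop_rate)
    if not training or drop_rate == 0.0:
        return [a + b for a, b in zip(x, residual_fn(x))]
    keep = 1.0 - drop_rate
    if rng.random() >= keep:
        return list(x)  # branch dropped: the block acts as the identity
    return [a + b / keep for a, b in zip(x, residual_fn(x))]
```

Dropout removes individual activations, while stochastic depth removes an entire residual branch for the current step; both become deterministic when `training` is false.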

How dropout can reduce underfitting

Dropout can be used to reduce underfitting
The dropout model has a smaller gradient norm than the baseline
Despite smaller gradients, the dropout model moves a larger distance from its initial point than the baseline model
The dropout model produces more consistent gradient directions across mini-batches
The gradient direction error of the dropout model is smaller at the beginning of training
The gradient variance of the dropout model is lower
Dropout helps prevent the model from overfitting
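
This summary does not spell out how gradient direction consistency is measured. One plausible way to quantify it, assumed here for illustration, is the average pairwise cosine distance between mini-batch gradients (a directional variance) and the average cosine distance to a reference gradient such as a whole-dataset gradient (a direction error). The function names below are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gradient_direction_variance(grads):
    """Mean of 0.5 * (1 - cos) over all pairs of mini-batch gradients (range 0 to 1)."""
    pairs = list(combinations(grads, 2))
    if not pairs:
        raise ValueError("need at least two gradients")
    total = sum(0.5 * (1.0 - cosine_similarity(g, h)) for g, h in pairs)
    return total / len(pairs)


def gradient_direction_error(grads, reference):
    """Mean of 0.5 * (1 - cos) between each mini-batch gradient and a reference gradient."""
    if not grads:
        raise ValueError("need at least one gradient")
    total = sum(0.5 * (1.0 - cosine_similarity(g, reference)) for g in grads)
    return total / len(grads)
```

Under this definition, identical directions give 0, orthogonal directions give 0.5, and opposite directions give 1, so a lower value means mini-batch updates agree more closely.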

Approach

Dropout can improve a model’s ability to fit the training data
Underfitting and overfitting regimes can be difficult to define
Early dropout: use dropout before a certain iteration, then disable it for the rest of training
Late dropout: train without dropout before a certain iteration, then enable it for the rest of training
Two hyper-parameters: number of epochs to wait and drop rate
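
The two variants above differ only in when the drop rate is non-zero. A minimal sketch, assuming the switch point is expressed in epochs and using a hypothetical helper name `drop_rate_at`:

```python
def drop_rate_at(epoch, mode, drop_rate, switch_epoch):
    """Return the drop rate to apply at a given epoch.

    mode is one of:
      "none"     - never use dropout
      "standard" - use dropout for the whole of training
      "early"    - use dropout only before switch_epoch
      "late"     - use dropout only from switch_epoch onward
    """
    if mode == "none":
        return 0.0
    if mode == "standard":
        return drop_rate
    if mode == "early":
        return drop_rate if epoch < switch_epoch else 0.0
    if mode == "late":
        return 0.0 if epoch < switch_epoch else drop_rate
    raise ValueError(f"unknown mode: {mode!r}")
```

In a training loop, the returned value would be written into the model's dropout layers at the start of each epoch, so the two hyper-parameters map directly onto `switch_epoch` and `drop_rate`.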

Experiments

Empirical evaluations were conducted on ImageNet-1K classification
ImageNet-1K has 1,000 classes and 1.2M training images

Early dropout

Evaluated early dropout using small models on ImageNet-1K
Doubled training epochs and reduced mixup and cutmix strength
Baselines achieved improved accuracy, surpassing previous literature results
Early dropout provided further boost in accuracy

Analysis

Ablation studies were conducted to understand the characteristics of early dropout.
Different strategies for scheduling dropout or related regularizers have been explored.
Strategies typically involve either gradually increasing or decreasing the strength of dropout over the entire or nearly the entire training process.
The purpose of these strategies is to reduce overfitting rather than underfitting.
Experiments used a linear decreasing schedule from an initial value p to 0 by default.
Early dropout helps models fit better to the training data.
Results show that early dropout improves the performance of the first two models but is not effective for the larger ViT-B.
Early dropout does not depend on one particular schedule to work.
Optimal p value for each option may differ.
Early dropout consistently improves the accuracy regardless of the use of lr warmup.
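
The default linear decreasing schedule mentioned above can be sketched as a function of the training step. The step-based interface, the function name, and the `"constant"` alternative are assumptions added for comparison; only the linear decay from p to 0 comes from the text.

```python
def early_dropout_rate(step, early_steps, p, schedule="linear"):
    """Drop rate during the early-dropout phase, and 0 once the phase has ended.

    step        - current training step, starting at 0
    early_steps - number of steps during which dropout is active (must be > 0)
    p           - initial drop rate
    schedule    - "linear" decays from p to 0 over early_steps;
                  "constant" holds p until the phase ends
    """
    if early_steps <= 0:
        raise ValueError("early_steps must be positive")
    if step >= early_steps:
        return 0.0
    if schedule == "constant":
        return p
    if schedule == "linear":
        return p * (1.0 - step / early_steps)
    raise ValueError(f"unknown schedule: {schedule!r}")
```

With the linear option the rate reaches 0 exactly when the early phase ends, so there is no abrupt change in regularization strength at the switch point.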

Downstream tasks

Evaluate pre-trained ImageNet-1K models by finetuning them on downstream tasks
Direct evaluation on robustness benchmarks is reported in Appendix D
Finetune pre-trained Swin-F and ConvNeXt-F backbones with Mask R-CNN
Finetune pre-trained models on the ADE20K semantic segmentation task
Evaluate on several downstream classification datasets

Related work

Weight decay (L2 regularization) is commonly used to train neural networks
L1 regularization can select features
Label smoothing replaces one-hot targets with soft probabilities
Data augmentation can be a form of regularization
Dropout has many variants to improve or adapt it
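
As a concrete instance of one regularizer listed above, label smoothing can be sketched as follows. This uses the common formulation that mixes the one-hot target with a uniform distribution; the function name and the default `epsilon` are illustrative assumptions.

```python
def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Return a soft target: (1 - epsilon) * one_hot + epsilon / num_classes."""
    if not 0 <= target_index < num_classes:
        raise ValueError("target_index out of range")
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be in [0, 1]")
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if i == target_index else uniform
        for i in range(num_classes)
    ]
```

The smoothed vector still sums to 1, but the true class no longer receives the full probability mass, which discourages over-confident predictions.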

Conclusion

Dropout helps reduce overfitting
Dropout counteracts the data randomness introduced by SGD
Early dropout helps underfitting models fit better
Late dropout helps improve generalization of overfitting models
Dropout produces mini-batch gradients that are more aligned with the entire dataset
Dropout leads to smaller gradient magnitudes and greater distance in parameter space
Early dropout results in lower training loss and higher test accuracy
Late dropout improves test accuracy and reduces overfitting
Late dropout is less sensitive to changes in drop rate
