Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Dropout is a regularizer for preventing overfitting in neural networks
- Dropout can also mitigate underfitting when used at the start of training
- Dropout reduces the directional variance of gradients across mini-batches
- Early dropout (dropout used only during the initial phases of training) can improve performance in underfitting models
- Late dropout (dropout not used in the early iterations and is only activated later in training) can regularize overfitting models
Paper Content
Introduction
- AlexNet’s “ImageNet moment” in 2012 launched a new era in deep learning
- Dropout was invented in 2012 and has since become widely adopted to reduce overfitting in neural networks
- Deep learning is evolving quickly and dropout has stayed relevant
- Drop rate of dropout has generally been decreasing over the years
- Dropout can be used to tackle underfitting
- Dropout can reduce gradient variance and allow the model to update in more consistent directions
- Early and late dropout can improve results compared to no dropout and standard dropout
Revisiting overfitting vs. underfitting
- Overfitting occurs when a model fits the training data too well but generalizes poorly to unseen data
- Factors that determine overfitting include model capacity, dataset scale, and training length
- Smaller datasets and larger models lead to more overfitting
- Dropout is a method used to prevent overfitting
- Stochastic depth is a dropout variant designed for regularizing residual networks
- Optimal drop rate depends on model size and dataset size
- With increasing data size, drop rate used for dropout has generally decreased
- Future models may have more trouble fitting data properly than overfitting
How dropout can reduce underfitting
- Dropout can be used to reduce underfitting
- Gradient norm of dropout model is smaller
- Dropout model moves a larger distance from its initial point than the baseline model
- Dropout model produces more consistent gradient directions
- Gradient direction error of dropout model is smaller at the beginning of training
- Gradient variance of dropout model is lower
- Dropout helps prevent the model from overfitting
Approach
- Dropout can improve model’s ability to fit training data
- Underfitting and overfitting regimes can be difficult to define
- Early dropout: use dropout before certain iteration, then disable
- Late dropout: don’t use dropout before certain iteration, then use
- Two hyper-parameters: number of epochs to wait and drop rate
Experiments
- Conducted empirical evaluations on ImageNet-1K classification
- 1,000 classes and 1.2M training images
Early dropout
- Evaluated early dropout using small models on ImageNet-1K
- Doubled training epochs and reduced mixup and cutmix strength
- Baselines achieved improved accuracy, surpassing previous literature results
- Early dropout provided further boost in accuracy
Analysis
- Ablation studies were conducted to understand the characteristics of early dropout.
- Different strategies for scheduling dropout or related regularizers have been explored.
- Strategies typically involve either gradually increasing or decreasing the strength of dropout over the entire or nearly the entire training process.
- The purpose of these strategies is to reduce overfitting rather than underfitting.
- Experiments used a linear decreasing schedule from an initial value p to 0 by default.
- Early dropout helps models fit better to the training data.
- Results show that early dropout is effective in improving the performance of the first two models, but was not effective in the case of the larger ViT-B.
- Early dropout does not depend on one particular schedule to work.
- Optimal p value for each option may differ.
- Early dropout consistently improves the accuracy regardless of the use of lr warmup.
Downstream tasks
- Evaluate pre-trained ImageNet-1K models by finetuning them on downstream tasks
- Direct evaluation of robustness benchmarks in Appendix D
- Finetune pre-trained Swin-F and ConvNeXt-F backbones with Mask-RCNN
- Finetune pretrained models on ADE-20K semantic segmentation task
- Evaluate on several downstream classification datasets
Related work
- Weight decay (L2 regularization) is commonly used to train neural networks
- L1 regularization can select features
- Label smoothing replaces one-hot targets with soft probabilities
- Data augmentation can be a form of regularization
- Dropout has many variants to improve or adapt it
Conclusion
- Dropout helps reduce overfitting
- Dropout counters data randomness brought by SGD
- Early dropout helps underfitting models fit better
- Late dropout helps improve generalization of overfitting models
- Dropout produces mini-batch gradients that are more aligned with the entire dataset
- Dropout leads to smaller gradient magnitudes and greater distance in parameter space
- Early dropout results in lower training loss and higher test accuracy
- Late dropout improves test accuracy and reduces overfitting
- Late dropout is less sensitive to changes in drop rate