Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Recent trends in language modeling focus on increasing performance through scaling
- Training language models is out of reach for most researchers and practitioners
- Investigating how far can be achieved with a single GPU in one day
- Re-analyzing components of the pretraining pipeline and providing a modified pipeline
- Investigating why scaling down is hard and which modifications improve performance
- Performance follows scaling laws observed in large-compute settings
- Categorizing recent improvements to training and architecture and discussing their merit
Paper Content
Scaling up and scaling down
- Large-scale training of machine learning models with transformer architectures has improved natural language processing
- Performance of these systems increases when number of model parameters and amount of data grow
- Power of scale has created an environment where few researchers or practitioners feel capable of training a language model
- BERT model requires significant amount of computation to train
- Competition for largest language model has become a focal point for industrial labs
- Goal is to investigate how to best scale down language model training and what trade-offs emerge
- Scaled-down model pretraining opens up a host of further academic investigations
Tying our hands behind our back: a setup with limited compute
- Training a transformer-based language model from scratch
- No pre-trained models allowed
- Raw text can be included for training
- Pre-processing of raw data is exempted from compute budget
- Training on single GPU for 24 hours
- Downstream performance evaluated on GLUE
Preprint
Related work on efficient transformers
- Training BERT requires varying hardware and software setups.
- An upper bound on compute can be established by finding the total number of FLOPs.
- Initial reactions estimated up to 11 days of compute for comparable results on GPUs.
- Improvements in software have reduced the upper limit significantly.
Scaling laws
- Difficulty in finding tangible improvements is echoed in Kaplan et al. (2020) scaling laws
- Model size (non-embedding layers) strongly predicts performance
- Optimal model size can be derived for fixed compute budget
- Performance is mildly connected to model size
- Scaling laws continue to be iterated on and adapted for related settings
Investigations
- Implemented and tested modifications to Devlin et al. (2019)
- Investigated architectural, training and dataset improvements
Implementation details
- Implemented in PyTorch
- No specialized implementations used
- Automated operator fusion used
- Efficient attention kernel enabled after choosing final architecture variant
- Experiments and ablation studies run with automated mixed precision
Initial data setup
- Used recent dump of English Wikipedia and English bookcorpus
- Forced text into lower-case, stripped accents and non-ascii characters
- Created English tokenizer from scratch
- Used WordPiece with vocabulary size of 2 15
- Packed tokenized data into randomized sequences of length 128
- Separated unrelated fragments by
- No impact from including token in pretraining
- Sequence length sufficient for downstream applications
- Micro-batch sizes of 64 to 96 for most variations of base BERT architecture
- Accumulated micro-batch sizes into larger batch sizes
- Single-epoch training, no data point revisited
Modifying the architecture
- Model architecture can be modified to efficiently scale down training
- Per-token efficiency of training depends on model size, not transformer type
- Smaller models learn less efficiently, reducing throughput gains
- Scaling laws hold in low-resource regime
- Architecture selection based on how it affects computation time for a single gradient step
Modifying the training setup
- Study impact of training hyper-parameters on BERT-base architecture
- Original BERT training recipe results in poor model performance
- Mask language modeling with 15% masking rate
- 10% of masks filled with random words, 10% unchanged
- No improvement from masking at larger rates
- No difference enabling/disabling 20% rule
- Evaluate other functions for masked-language objective, no benefits
- Adam optimizer with weight decay of 0.01
- Gradient clipping at clip value of 0.5
- One-cycle learning rate with peak of 10-3
- Batch size schedule to maximize performance
- Dropout disabled during pretraining, re-enabled during fine-tuning
- No gains from length curricula or token dropping
Optimizing the dataset
- Scaling laws create a barrier to making major gains with architectural modifications
- Training on better data can be done to improve performance
- Two data based pathways to better down-scaling: filtering, processing, or sorting existing data; swapping data source
- Experimented with several subsets of The Pile
- Evaluated deduplication and filtering for uncompressible data
- Sorting tokenized sequences by average token prevalence and increasing batch size can improve performance
- Investigated whether original vocabulary size of 32768 is optimal
Finetuning performance on glue
- Evaluated performance on GLUE benchmark minus WNLI
- Used MNLI (m) during previous sections, no hyperparameter tuning based on full GLUE scores
- Finetuned BERT-base checkpoint and models with same constraints
- BERT-base finetuned with batch size of 32 and learning rate of 2 x 10^-5
- Crammed models finetuned with batch size of 16 and learning rate of 4 x 10^-5 with cosine decay
- Performance surprisingly decent, especially for larger datasets
- Gains over naive BERT training and recipe from Izsak et al. (2021)
- CoLA performance intriguing, two hypotheses offered
Ablation -which changes really mattered?
- Table 5 provides a summary of all changes discussed in the paper
- Modifications are grouped into three categories: architecture, training, and data
- Minimal modifications are necessary, as architecture changes allow for more aggressive learning rate schedules
What happens when training longer?
- We tested what happens when the cramming recipe is used with more budget.
- We trained models for 48 hours on 8 A6000 GPUs, which is 208 total exaFLOP.
- The setting described earlier was applied, with the learning rate schedule adjusted to cover the new budget.
- The recipe generalizes to larger compute budgets.
- The newly trained models have strong performances, especially on MNLI and SST-2.
- The new models barely improve in other tasks, such as CoLA.
Limitations
- Investigated transformer-based architectures trained with MLM objectives
- Relaxing constraints of investigation is interesting
- Modifications proposed to MLM objective
- MLM still holds up well as pretraining objective
- Suggestions such as ELECTRA could be beneficial for crammed models
- Optimal architecture might not be transformer-based
Conclusions
- Transformer-based language models can achieve decent performance with limited compute
- Cramming language models is difficult
- Baseline for exploring cramming question
- Training on single GPU with mixed precision
- Batch size of 4036 and dataset is bookcorpus-wikipedia
- Downstream evaluation as described in Section 5
- Pretraining on an A4000
- Discrepancy between optimal pretraining batch size and optimal batch size for evaluation on MNLI when ramp-up is used
- Improvements through larger models are evened out by their slower speed
- Ablation study to determine which improvements were most important