Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Recent trends in language modeling focus on increasing performance through scaling
Training language models is out of reach for most researchers and practitioners
Investigating how much can be achieved with a single GPU in one day
Re-analyzing components of the pretraining pipeline and providing a modified pipeline
Investigating why scaling down is hard and which modifications improve performance
Performance follows scaling laws observed in large-compute settings
Categorizing recent improvements to training and architecture and discussing their merit

Paper Content

Scaling up and scaling down

Large-scale training of machine learning models with transformer architectures has improved natural language processing
Performance of these systems increases when number of model parameters and amount of data grow
Power of scale has created an environment where few researchers or practitioners feel capable of training a language model
BERT model requires significant amount of computation to train
Competition for largest language model has become a focal point for industrial labs
Goal is to investigate how to best scale down language model training and what trade-offs emerge
Scaled-down model pretraining opens up a host of further academic investigations

Tying our hands behind our back: a setup with limited compute

Training a transformer-based language model from scratch
No pre-trained models allowed
Raw text can be included for training
Pre-processing of raw data is exempted from compute budget
Training on single GPU for 24 hours
Downstream performance evaluated on GLUE

Preprint

Reported BERT training times are hard to compare because hardware and software setups vary widely.
An upper bound on compute can be established by finding the total number of FLOPs.
Initial reactions estimated up to 11 days of compute for comparable results on GPUs.
Improvements in software have reduced the upper limit significantly.

Scaling laws

Difficulty in finding tangible improvements is echoed in the scaling laws of Kaplan et al. (2020)
Model size (non-embedding layers) strongly predicts performance
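An optimal model size can be derived for a fixed compute budget
At fixed compute, performance is only mildly connected to model size: larger models process less data per unit of compute but improve faster by almost the same margin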
Scaling laws continue to be iterated on and adapted for related settings

Investigations

Implemented and tested modifications to Devlin et al. (2019)
Investigated architectural, training and dataset improvements

Implementation details

Implemented in PyTorch
No specialized implementations used
Automated operator fusion used
Efficient attention kernel enabled after choosing final architecture variant
Experiments and ablation studies run with automated mixed precision

Initial data setup

Used recent dump of English Wikipedia and English bookcorpus
Forced text into lower-case, stripped accents and non-ascii characters
Created English tokenizer from scratch
Used WordPiece with vocabulary size of 2^15 = 32768
Packed tokenized data into randomized sequences of length 128
Separated unrelated fragments by a <sep> token
No impact from including the <cls> token in pretraining
Sequence length sufficient for downstream applications
Micro-batch sizes of 64 to 96 for most variations of base BERT architecture
Accumulated micro-batch sizes into larger batch sizes
Single-epoch training, no data point revisited
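
The packing step above can be illustrated with a minimal sketch. It assumes documents are already tokenized into lists of strings; the separator name `SEP`, the function name `pack_sequences`, and the choice to drop the incomplete tail are illustrative assumptions, not details confirmed by the paper.

```python
import random

SEP = "<sep>"  # assumed name for the separator token


def pack_sequences(documents, seq_len=128, seed=0):
    """Concatenate tokenized documents and cut them into fixed-length sequences.

    documents: list of token lists (one list per unrelated fragment).
    Returns a shuffled list of sequences, each exactly seq_len tokens long.
    """
    stream = []
    for doc in documents:
        stream.extend(doc)
        stream.append(SEP)  # mark the boundary between unrelated fragments

    # Cut the token stream into full-length chunks; the ragged tail is dropped
    # (an assumption made here to keep every sequence the same length).
    n_full = len(stream) // seq_len
    sequences = [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

    # Randomize sequence order so that training batches mix sources.
    random.Random(seed).shuffle(sequences)
    return sequences


docs = [[f"tok{i}_{j}" for j in range(50)] for i in range(10)]
packed = pack_sequences(docs)
```

Because every sequence has the same length, no padding tokens are needed and each micro-batch does the same amount of work.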

Modifying the architecture

Model architecture can be modified to efficiently scale down training
Per-token efficiency of training depends on model size, not transformer type
Smaller models learn less efficiently, reducing throughput gains
Scaling laws hold in low-resource regime
Architecture selection based on how it affects computation time for a single gradient step

Modifying the training setup

Study impact of training hyper-parameters on BERT-base architecture
Original BERT training recipe results in poor model performance
Masked language modeling with 15% masking rate
10% of masks filled with random words, 10% unchanged
No improvement from masking at larger rates
No difference from enabling or disabling the 20% rule (10% random words, 10% unchanged)
Evaluated other loss functions for the masked-language objective, no benefits
Adam optimizer with weight decay of 0.01
Gradient clipping at clip value of 0.5
One-cycle learning rate with peak of 10^-3
Batch size schedule to maximize performance
Dropout disabled during pretraining, re-enabled during fine-tuning
No gains from length curricula or token dropping
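
Two of the items above, the masking rule and the one-cycle learning rate, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the mask token name `MASK`, the helper names, and the position of the learning-rate peak (`peak_frac=0.5`, a symmetric triangle) are hypothetical and not taken from the paper.

```python
import random

MASK = "<mask>"  # assumed name for the mask token


def mask_tokens(tokens, vocab, mask_rate=0.15, rng=None):
    """Select ~mask_rate of positions; of those, 80% become MASK,
    10% become a random vocabulary word, 10% are left unchanged."""
    rng = rng or random.Random(0)
    inputs = list(tokens)
    labels = [None] * len(tokens)  # None means "not part of the loss"
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)
        # otherwise the input token stays as it is
    return inputs, labels


def one_cycle_lr(step, total_steps, peak_lr=1e-3, peak_frac=0.5):
    """Triangular schedule: linear warm-up to peak_lr, then linear decay to 0.
    peak_frac (where the peak sits) is an assumption of this sketch."""
    peak_step = peak_frac * total_steps
    if step <= peak_step:
        return peak_lr * step / peak_step
    return peak_lr * (total_steps - step) / (total_steps - peak_step)


tokens = ["a"] * 10000
inputs, labels = mask_tokens(tokens, vocab=["a", "b", "c"])
```

Disabling the 20% rule in this sketch would amount to always taking the `MASK` branch for selected positions.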

Optimizing the dataset

Scaling laws create a barrier to making major gains with architectural modifications
Training on better data can be done to improve performance
Two data-based pathways to better down-scaling: filtering, processing, or sorting existing data; or swapping the data source
Experimented with several subsets of The Pile
Evaluated deduplication and filtering for uncompressible data
Sorting tokenized sequences by average token prevalence and increasing batch size can improve performance
Investigated whether original vocabulary size of 32768 is optimal
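
The sorting idea above can be sketched as follows. The sketch assumes that "token prevalence" means unigram frequency over the whole tokenized dataset and that the most likely sequences are placed first; both the function name and the sort direction are assumptions of this example rather than confirmed details.

```python
from collections import Counter


def sort_by_token_prevalence(sequences):
    """Order non-empty token sequences by their average unigram prevalence,
    most prevalent (most 'typical') sequences first."""
    counts = Counter(tok for seq in sequences for tok in seq)
    total = sum(counts.values())

    def avg_prevalence(seq):
        # Mean relative frequency of the tokens in this sequence.
        return sum(counts[t] for t in seq) / (len(seq) * total)

    return sorted(sequences, key=avg_prevalence, reverse=True)


example = [["z", "q"], ["the", "the"], ["the", "q"]]
ordered = sort_by_token_prevalence(example)
```

With single-epoch training, an ordering like this decides which data the model sees early, when the learning rate is still ramping up.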

Finetuning performance on GLUE

Evaluated performance on GLUE benchmark minus WNLI
Used MNLI (m) during previous sections, no hyperparameter tuning based on full GLUE scores
Finetuned BERT-base checkpoint and models with same constraints
BERT-base finetuned with batch size of 32 and learning rate of 2 x 10^-5
Crammed models finetuned with batch size of 16 and learning rate of 4 x 10^-5 with cosine decay
Performance surprisingly decent, especially for larger datasets
Gains over naive BERT training and recipe from Izsak et al. (2021)
CoLA performance intriguing, two hypotheses offered

Ablation - which changes really mattered?

Table 5 provides a summary of all changes discussed in the paper
Modifications are grouped into three categories: architecture, training, and data
Minimal modifications are necessary, as architecture changes allow for more aggressive learning rate schedules

What happens when training longer?

We tested what happens when the cramming recipe is used with more budget.
We trained models for 48 hours on 8 A6000 GPUs, which is 208 total exaFLOP.
The setting described earlier was applied, with the learning rate schedule adjusted to cover the new budget.
The recipe generalizes to larger compute budgets.
The newly trained models have strong performances, especially on MNLI and SST-2.
The new models barely improve in other tasks, such as CoLA.
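
As a quick sanity check on the budget quoted above, the per-GPU rate implied by 208 exaFLOP over 48 hours on 8 GPUs can be worked out directly. The sketch assumes the budget is spread evenly over all GPUs and the full wall-clock time; the function name is illustrative.

```python
def implied_flops_per_gpu(total_exaflop, hours, num_gpus):
    """FLOP/s each GPU must sustain to spend the whole budget in the given time."""
    seconds = hours * 3600
    return total_exaflop * 1e18 / (seconds * num_gpus)


rate = implied_flops_per_gpu(total_exaflop=208, hours=48, num_gpus=8)
print(f"{rate / 1e12:.1f} TFLOP/s per GPU")  # prints "150.5 TFLOP/s per GPU"
```

A figure of this size suggests the budget is counted at close to peak hardware throughput rather than measured utilization, though that reading is an inference from the arithmetic.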

Limitations

Investigated transformer-based architectures trained with MLM objectives
Relaxing the constraints of this investigation is an interesting direction for further work
A number of modifications to the MLM objective have been proposed
MLM still holds up well as pretraining objective
Suggestions such as ELECTRA could be beneficial for crammed models
Optimal architecture might not be transformer-based

Conclusions

Transformer-based language models can achieve decent performance with limited compute
Cramming language models is difficult
This work provides a baseline for exploring the cramming question
Training on single GPU with mixed precision
Batch size of 4036 and dataset is bookcorpus-wikipedia
Downstream evaluation as described in Section 5
Pretraining on an A4000
Discrepancy between optimal pretraining batch size and optimal batch size for evaluation on MNLI when ramp-up is used
Improvements through larger models are evened out by their slower speed
Ablation study to determine which improvements were most important
