Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Recent trends in language modeling focus on increasing performance through scaling
  • Training language models is out of reach for most researchers and practitioners
  • Investigating how much can be achieved with a single GPU in one day
  • Re-analyzing components of the pretraining pipeline and providing a modified pipeline
  • Investigating why scaling down is hard and which modifications improve performance
  • Performance follows scaling laws observed in large-compute settings
  • Categorizing recent improvements to training and architecture and discussing their merit

Paper Content

Scaling up and scaling down

  • Large-scale training of machine learning models with transformer architectures has improved natural language processing
  • Performance of these systems increases when number of model parameters and amount of data grow
  • Power of scale has created an environment where few researchers or practitioners feel capable of training a language model
  • BERT model requires significant amount of computation to train
  • Competition for largest language model has become a focal point for industrial labs
  • Goal is to investigate how to best scale down language model training and what trade-offs emerge
  • Scaled-down model pretraining opens up a host of further academic investigations

Tying our hands behind our back: a setup with limited compute

  • Training a transformer-based language model from scratch
  • No pre-trained models allowed
  • Raw text can be included for training
  • Pre-processing of raw data is exempted from compute budget
  • Training on single GPU for 24 hours
  • Downstream performance evaluated on GLUE

Related work on efficient transformers

  • Training BERT requires varying hardware and software setups.
  • An upper bound on compute can be established by finding the total number of FLOPs.
  • Initial reactions estimated up to 11 days of compute for comparable results on GPUs.
  • Improvements in software have reduced the upper limit significantly.
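
The FLOP upper bound mentioned above is simple arithmetic: peak throughput times wall-clock time times number of devices. A minimal sketch follows; the function name and the 100 TFLOP/s peak value are illustrative assumptions, not figures from the paper.

```python
def compute_budget_flops(peak_flops_per_second: float, hours: float, num_gpus: int = 1) -> float:
    """Upper bound on total floating-point operations for a training run.

    Assumes every device runs at its peak throughput for the whole run,
    which real training never reaches, so this is only an upper limit.
    """
    seconds = hours * 3600
    return peak_flops_per_second * seconds * num_gpus


# Hypothetical accelerator with an assumed peak of 100 TFLOP/s, used for 24 hours.
budget = compute_budget_flops(100e12, hours=24)
print(f"{budget / 1e18:.2f} exaFLOP")
```

Comparing budgets in total FLOPs rather than in days makes runs on different hardware setups comparable, which is the point of the bound.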

Scaling laws

  • Difficulty in finding tangible improvements is echoed in Kaplan et al. (2020) scaling laws
  • Model size (non-embedding layers) strongly predicts performance
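  • Optimal model size can be derived for a fixed compute budget
  • For a fixed compute budget, performance is only mildly connected to model size: larger models process less data per unit of compute but improve faster by almost the same margin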
  • Scaling laws continue to be iterated on and adapted for related settings
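
To make "model size (non-embedding layers)" concrete, the sketch below counts the weight-matrix parameters of a standard transformer encoder stack. It assumes four attention projections of size d_model x d_model and a two-layer feed-forward block per layer, and ignores biases and layer norms; with d_ff = 4 * d_model this reduces to the familiar 12 * n_layers * d_model^2 approximation. The function name is illustrative.

```python
from typing import Optional


def non_embedding_params(n_layers: int, d_model: int, d_ff: Optional[int] = None) -> int:
    """Approximate parameter count of the transformer layers, excluding embeddings."""
    if d_ff is None:
        d_ff = 4 * d_model  # common default ratio, assumed here
    attention = 4 * d_model * d_model  # query, key, value and output projections
    feed_forward = 2 * d_model * d_ff  # up- and down-projection
    return n_layers * (attention + feed_forward)


# BERT-base shape: 12 layers, hidden size 768.
print(non_embedding_params(12, 768))
```

For the BERT-base shape this gives roughly 85 million parameters, the remainder of the model being embeddings and small terms omitted here.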

Investigations

  • Implemented and tested modifications to Devlin et al. (2019)
  • Investigated architectural, training and dataset improvements

Implementation details

  • Implemented in PyTorch
  • No specialized implementations used
  • Automated operator fusion used
  • Efficient attention kernel enabled after choosing final architecture variant
  • Experiments and ablation studies run with automated mixed precision

Initial data setup

  • Used recent dump of English Wikipedia and English bookcorpus
  • Forced text into lower-case, stripped accents and non-ascii characters
  • Created English tokenizer from scratch
  • Used WordPiece with vocabulary size of 2^15 = 32768
  • Packed tokenized data into randomized sequences of length 128
  • Separated unrelated fragments by a <sep> token
  • No impact from including a <cls> token in pretraining
  • Sequence length sufficient for downstream applications
  • Micro-batch sizes of 64 to 96 for most variations of base BERT architecture
  • Accumulated micro-batch sizes into larger batch sizes
  • Single-epoch training, no data point revisited
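
The packing and accumulation steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the function names, the separator id, shuffling at document level, and dropping the trailing partial sequence are all choices made for the example.

```python
import random


def pack_sequences(tokenized_docs, seq_len=128, sep_id=0, seed=0):
    """Concatenate documents with a separator token and cut into fixed-length sequences."""
    rng = random.Random(seed)
    docs = list(tokenized_docs)
    rng.shuffle(docs)  # randomize document order before packing
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(sep_id)  # marks the boundary between unrelated fragments
    n_full = len(stream) // seq_len  # the trailing partial sequence is dropped
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


def accumulation_steps(batch_size, micro_batch_size):
    """Number of micro-batches to accumulate before one optimizer step (ceiling division)."""
    return -(-batch_size // micro_batch_size)


# Toy documents of 100, 50 and 200 tokens: 353 tokens including separators.
packed = pack_sequences([[1] * 100, [2] * 50, [3] * 200])
print(len(packed), len(packed[0]))
print(accumulation_steps(1536, 96))  # 1536 is a hypothetical target batch size
```

Because every packed sequence has the same length, no padding tokens are needed and each micro-batch does the same amount of work.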

Modifying the architecture

  • Model architecture can be modified to efficiently scale down training
  • Per-token efficiency of training depends on model size, not transformer type
  • Smaller models learn less efficiently, reducing throughput gains
  • Scaling laws hold in low-resource regime
  • Architecture selection based on how it affects computation time for a single gradient step

Modifying the training setup

  • Study impact of training hyper-parameters on BERT-base architecture
  • Original BERT training recipe results in poor model performance
  • Masked language modeling with 15% masking rate
  • 10% of masks filled with random words, 10% unchanged
  • No improvement from masking at larger rates
  • No difference enabling/disabling 20% rule
  • Evaluated other loss functions for the masked-language objective, no benefits
  • Adam optimizer with weight decay of 0.01
  • Gradient clipping at clip value of 0.5
  • One-cycle learning rate with peak of 10^-3
  • Batch size schedule to maximize performance
  • Dropout disabled during pretraining, re-enabled during fine-tuning
  • No gains from length curricula or token dropping
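
The masking scheme listed above (15% of positions selected; of those, 10% replaced by a random word, 10% left unchanged, the rest replaced by the mask token) can be sketched as follows. The function name, the -100 ignore label, and drawing random replacements uniformly from the whole vocabulary are assumptions made for illustration.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, mask_rate=0.15, seed=None):
    """Return (inputs, labels) for masked language modeling."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100: position is ignored by the loss (a common convention)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_rate:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


tokens = list(range(1000, 1020))
inputs, labels = mask_tokens(tokens, mask_id=4, vocab_size=32768, seed=0)
print(inputs)
print(labels)
```

Setting the two 10% branches to zero gives the variant without the "20% rule", which the notes above report made no difference.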

Optimizing the dataset

  • Scaling laws create a barrier to making major gains with architectural modifications
  • Training on better data can be done to improve performance
  • Two data-based pathways to better down-scaling: filtering, processing, or sorting the existing data, and swapping the data source
  • Experimented with several subsets of The Pile
  • Evaluated deduplication and filtering for uncompressible data
  • Sorting tokenized sequences by average token prevalence and increasing batch size can improve performance
  • Investigated whether original vocabulary size of 32768 is optimal
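
Sorting by average token prevalence can be illustrated with unigram counts. In this sketch the function name is hypothetical, prevalence is taken as the relative unigram frequency over the dataset itself, and sequences made of common tokens are placed first; the ordering direction is an assumption.

```python
from collections import Counter


def sort_by_token_prevalence(sequences):
    """Order non-empty tokenized sequences by the mean unigram frequency of their tokens."""
    counts = Counter(tok for seq in sequences for tok in seq)
    total = sum(counts.values())

    def avg_prevalence(seq):
        return sum(counts[tok] for tok in seq) / (len(seq) * total)

    # Most "typical" sequences (highest average prevalence) come first.
    return sorted(sequences, key=avg_prevalence, reverse=True)


data = [[5, 6, 7], [1, 1, 2], [1, 1, 1]]
print(sort_by_token_prevalence(data))
```

With single-epoch training the order of the data is the order in which the model sees it, so a sort like this acts as a simple curriculum.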

Finetuning performance on GLUE

  • Evaluated performance on GLUE benchmark minus WNLI
  • Used MNLI (m) during previous sections, no hyperparameter tuning based on full GLUE scores
  • Finetuned BERT-base checkpoint and models with same constraints
  • BERT-base finetuned with batch size of 32 and learning rate of 2 x 10^-5
  • Crammed models finetuned with batch size of 16 and learning rate of 4 x 10^-5 with cosine decay
  • Performance surprisingly decent, especially for larger datasets
  • Gains over naive BERT training and recipe from Izsak et al. (2021)
  • CoLA performance intriguing, two hypotheses offered
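
The cosine decay used when finetuning the crammed models can be written as a short function of the training step. This is a sketch of a standard cosine schedule; decaying all the way to zero and the absence of warmup are assumptions, and the function name is illustrative.

```python
import math


def cosine_decay_lr(step, total_steps, peak_lr=4e-5):
    """Learning rate that falls from peak_lr to 0 along half a cosine wave."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, 750, 1000):
    print(step, cosine_decay_lr(step, total_steps=1000))
```

The rate starts at the peak, passes through half the peak at the midpoint, and flattens out near zero at the end of finetuning.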

Ablation - which changes really mattered?

  • Table 5 provides a summary of all changes discussed in the paper
  • Modifications are grouped into three categories: architecture, training, and data
  • Some minimal architecture modifications are necessary, because they allow for the more aggressive learning rate schedule

What happens when training longer?

  • We tested what happens when the cramming recipe is used with more budget.
  • We trained models for 48 hours on 8 A6000 GPUs, which is 208 total exaFLOP.
  • The setting described earlier was applied, with the learning rate schedule adjusted to cover the new budget.
  • The recipe generalizes to larger compute budgets.
  • The newly trained models have strong performance, especially on MNLI and SST-2.
  • The new models barely improve in other tasks, such as CoLA.
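
Adjusting the learning rate schedule to a new budget is straightforward when the schedule is defined over the fraction of the budget used rather than over a fixed step count. The sketch below uses a triangular one-cycle shape with the 10^-3 peak mentioned earlier; the function name and the position of the peak (halfway through the budget) are assumptions.

```python
def one_cycle_lr(elapsed_hours, budget_hours, peak_lr=1e-3, peak_fraction=0.5):
    """Triangular one-cycle schedule expressed over the fraction of budget used."""
    frac = min(max(elapsed_hours / budget_hours, 0.0), 1.0)
    if frac < peak_fraction:
        return peak_lr * frac / peak_fraction  # linear ramp up to the peak
    return peak_lr * (1.0 - frac) / (1.0 - peak_fraction)  # linear decay to zero


# The same point in the cycle is reached after 6 of 24 hours or 12 of 48 hours.
print(one_cycle_lr(6, 24), one_cycle_lr(12, 48))
```

Doubling the budget stretches the cycle without changing its shape, so the recipe itself stays the same.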

Limitations

  • Investigated transformer-based architectures trained with MLM objectives
  • Relaxing constraints of investigation is interesting
  • Modifications proposed to MLM objective
  • MLM still holds up well as pretraining objective
  • Suggestions such as ELECTRA could be beneficial for crammed models
  • Optimal architecture might not be transformer-based

Conclusions

  • Transformer-based language models can achieve decent performance with limited compute
  • Cramming language models is difficult
  • Baseline for exploring cramming question
  • Training on single GPU with mixed precision
  • Batch size of 4036 and dataset is bookcorpus-wikipedia
  • Downstream evaluation as described in Section 5
  • Pretraining on an A4000
  • Discrepancy between optimal pretraining batch size and optimal batch size for evaluation on MNLI when ramp-up is used
  • Improvements through larger models are evened out by their slower speed
  • Ablation study to determine which improvements were most important