Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Modern masked language models are trained on large corpora.
  • We explore the effects of training on a smaller, well-balanced corpus.
  • Pre-training on this corpus can reach better performance than the original BERT model.
  • Smaller corpora have potential as a language modeling benchmark.
  • We present comparative studies of LMs to evaluate training objectives and model architectures.
  • We propose an optimized LM architecture called LTG-BERT.

Paper Content

Introduction

  • NLP practitioners use large amounts of data to pre-train language models
  • Aim is to focus on more efficient language modeling on a small and standardizable pre-training corpus
  • Study data efficiency of current language models on an openly available corpus of approximately 100M words
  • Goal is not to rival the paradigm of ‘massively pre-trained language models’
  • Contribution is twofold: 100M words is enough to train a competitive language model and reproducibility and fair comparison of language models can be achieved by pre-training on the British National Corpus
  • Language models have been pretrained on different corpora tokenized by different tokenizers and fine-tuned by increasingly complex learning methods
  • Data requirements of language models have been growing in orders of magnitude
  • ELMo and BERT introduced deep contextualized embeddings of words
  • XLNet, RoBERTa and GPT-3 trained on 33B, 30B and 400B words respectively
  • Effect of corpus size has been studied
  • Evaluate effect of training on a small corpus which was carefully curated to create a representative sample of English

British national corpus

  • BNC is a monolingual English corpus
  • Contains 100 million words of written and spoken language
  • Sources include newspapers, journals, books, letters, conversations, radio shows, phone calls
  • Sources are truncated to 45,000 words to ensure diversity
  • Widely acknowledged to have been a major influence on language corpora
  • Does not reflect 21st century English language
  • Used as a model for creating representative corpora for other languages
  • Third release of the corpus is BNC XML Edition (2007)

Preprocessing

  • XML version of BNC is converted to Markdown format to make it human-readable
  • Metainformation is preserved
  • Articles are randomly placed into training and development splits
  • Text units are words, sentences, paragraphs and articles
  • Wordtokens are not preserved, heuristics are used
  • Headers, speech turns, quotes and incomprehensible speech are kept in Markdown format

Model architecture

  • Depart from typical post-norm Transformer architecture
  • Preliminary experiments showed model tends to diverge
  • Follow recent improvements of Transformer
  • Introduce NormFormer architecture to stabilize training
  • Use EGLU activation function to enhance expressiveness

Training objectives

  • Established a controlled test bed for a comparative study of training objectives
  • Evaluated five different configurations of two self-supervised training objectives (MLM and NSP)

Masked language modeling (mlm)

  • BERT learns a bidirectional contextualized representation for each token in a text segment.
  • 15% of subword tokens are randomly selected and 80% are masked, 10% randomly replaced and 10% are left untouched.
  • Three common choices of the masked text units: subwords, whole words, and spans.

Next sentence prediction (nsp)

  • Masked language modeling is a token-level training objective
  • Some downstream tasks need a single sentence-level representation
  • Researchers have designed additional semi-supervised training objectives
  • NSP objectives may not help downstream performance and can be dropped
  • Experiment with two NSP objectives: document discrimination and sentence-order discrimination

Evaluation metrics

  • Evaluating amount of linguistic knowledge acquired by BNC language models using 3 methods
  • SuperGLUE datasets test model’s ability to adapt to NLU tasks
  • Edge probing tasks evaluate how much linguistic info can be extracted from frozen pre-trained model
  • BLiMP uses pretrained network to model language and probes knowledge without additional training

(super)glue

  • GLUE and SuperGLUE are used to evaluate language understanding capabilities of language models
  • Technical details of SuperGLUE fine-tuning in Appendix B.1
  • Winograd schema datasets, WNLI and WSC excluded
  • 14 (Super)GLUE datasets measure performance on inference, linguistic acceptability, sentiment analysis, semantic similarity, word sense disambiguation, and question answering
  • Deep learning systems prone to finding spurious correlations in training data
  • HANS test set to identify fallible syntactic heuristics
  • Models tested on MNLI

Edge probing

  • GLUE tasks measure the ability of a language model to be finetuned on a sentence-level NLU problem.
  • Edge probing is a simple approach of probing for a diverse set of linguistic phenomena.
  • Edge probing reformulates traditional NLP tasks as span classification.
  • Five basic tasks are probed: POS, DP, SRL, NER and CR.
  • Model only learns to classify each span provided to the model as gold data.

Blimp

  • Evaluation metrics can be skewed by supervised training, making it difficult to separate prior knowledge from acquired knowledge.
  • BLiMP measures language model knowledge without additional training.
  • BLiMP consists of 67,000 sentence pairs, with one sentence being grammatically valid.
  • Language models can assign a probability to each sentence, and be tested on how often they assign a higher probability to the correct sentence.

Experiments

  • Conducted experiments to compare different training hyperparameters and model configurations
  • Used overall best training setting to compare training objectives
  • Investigated sampling efficiency of proposed language model and compared BNC with a Wikipedia & BookCorpus subset of same size
  • Central model used was a base-sized Transformer with 12 encoder layers, hidden size 768 and 12 attention heads
  • Utilized same cased WordPiece tokenizer with a vocabulary size of 16384 trained with BNC dataset

Comparison of model architectures and training settings

  • NormFormer-like layer normalization performs better than post-norm and pre-norm transformer variants
  • Absolute positional embeddings perform better on language modeling but less adaptable for fine-tuning
  • Weight decay of 0.1 boosts performance on masked language modeling
  • GEGLU activation, lower weight norms, and no bias parameters in feed-forward layers improve performance

Training objective comparison

  • Three masking methods compared: subword, whole-word, span masking
  • Span-based masking performs best
  • All methods perform equally well on edge probing
  • Subword masking is still a competitive baseline
  • Combining NSP task and subword masking does not lead to improved performance
  • Order discrimination leads to worse performance

Sampling efficiency

  • Training steps are important for efficient language models.
  • Increasing the steps does not lead to better performance.
  • Training for half the time is enough to get comparable performance.
  • Decreasing the training steps further degrades the downstream results.
  • Current self-supervised language modeling methods are sampling inefficient.

100 million subset of wikipedia & bookcorpus

  • Experiment evaluates how much curation of BNC helps downstream performance
  • Experiment uses a random subset of Wikipedia and BookCorpus (equal size to BNC)
  • BNC is a corpus of British English from 1990s
  • Quality of data source is not necessary to learn from 100M words, but better quality leads to noticeable difference in downstream performance

Conclusion

  • Evaluated how data-efficient masked language models can be
  • Trained a variety of models with different training objectives on the same training data: British National Corpus
  • BNC is small but well balanced and carefully crafted
  • Models perform better than BERT base trained on a much larger corpus
  • Limited data regime is beneficial for the development of efficient and reliable language models
  • 100 million word tokens is enough to learn basic linguistic skills
  • Huge amounts of training data are not always necessary
  • Next sentence prediction objective does not improve BERT-like models
  • Standard subword masking is outperformed by span masking
  • Linguistic performance can be increased by better neural architectures and training configurations
  • Results serve as foundation for future research
  • Only considers language modeling of English
  • Training process still requires a similar amount of computational resources
  • Pseudo-loglikelihood score of a sentence used to evaluate models
  • Layer-wise convex weights used to rate contribution of each Transformer layer to a particular task
  • NormFormer-like architecture used
  • Results of all evaluated models provided in tables
  • Pre-training and fine-tuning hyperparameters listed in tables