Abstract
- Modern masked language models are trained on large corpora.
- We explore the effects of training on a smaller, well-balanced corpus.
- Pre-training on this corpus can reach better performance than the original BERT model.
- Smaller corpora have potential as a language modeling benchmark.
- We present comparative studies of LMs to evaluate training objectives and model architectures.
- We propose an optimized LM architecture called LTG-BERT.
Paper Content
Introduction
- NLP practitioners use large amounts of data to pre-train language models
- Aim is to focus on more efficient language modeling on a small and standardizable pre-training corpus
- Study data efficiency of current language models on an openly available corpus of approximately 100M words
- Goal is not to rival the paradigm of ‘massively pre-trained language models’
- Contribution is twofold: (1) 100M words is enough to train a competitive language model, and (2) reproducibility and fair comparison of language models can be achieved by pre-training on the British National Corpus
- Language models have been pretrained on different corpora tokenized by different tokenizers and fine-tuned by increasingly complex learning methods
- Data requirements of language models have been growing by orders of magnitude
- ELMo and BERT introduced deep contextualized embeddings of words
- XLNet, RoBERTa and GPT-3 trained on 33B, 30B and 400B words respectively
- Effect of corpus size has been studied
- Evaluate effect of training on a small corpus which was carefully curated to create a representative sample of English
British National Corpus
- BNC is a monolingual English corpus
- Contains 100 million words of written and spoken language
- Sources include newspapers, journals, books, letters, conversations, radio shows, phone calls
- Each source is truncated to at most 45,000 words to ensure diversity
- Widely acknowledged to have been a major influence on the construction of language corpora
- Does not reflect 21st century English language
- Used as a model for creating representative corpora for other languages
- Third release of the corpus is BNC XML Edition (2007)
Preprocessing
- XML version of BNC is converted to Markdown format to make it human-readable
- Metainformation is preserved
- Articles are randomly placed into training and development splits
- Text units are words, sentences, paragraphs and articles
- The original word-level tokenization is not preserved; heuristics are used to reconstruct running text
- Headers, speech turns, quotes and incomprehensible speech are kept in Markdown format
Model architecture
- Depart from typical post-norm Transformer architecture
- Preliminary experiments showed model tends to diverge
- Follow recent improvements of Transformer
- Adopt NormFormer-like layer normalization to stabilize training
- Use GEGLU activation function to enhance expressiveness
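As a rough illustration of the gated activation named here (GEGLU, per the architecture comparison later in this summary), the sketch below applies the common formulation in which one half of a projected vector gates the other half through GELU. The function names and the choice of which half acts as the gate are assumptions made for this example, not details taken from the paper.

```python
import math
from typing import List


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def geglu(projected: List[float]) -> List[float]:
    """Split the vector into a gate half and a value half (assumed layout),
    then return GELU(gate) * value element-wise."""
    if len(projected) % 2 != 0:
        raise ValueError("GEGLU expects an even-sized input")
    half = len(projected) // 2
    gate, value = projected[:half], projected[half:]
    return [gelu(g) * v for g, v in zip(gate, value)]


# The output has half the width of the projected input.
out = geglu([0.0, 1.0, 2.0, 3.0])
```

In a feed-forward layer this means the first linear projection has to produce twice the intermediate width, since the activation halves it again.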
Training objectives
- Established a controlled test bed for a comparative study of training objectives
- Evaluated five different configurations of two self-supervised training objectives (MLM and NSP)
Masked language modeling (MLM)
- BERT learns a bidirectional contextualized representation for each token in a text segment.
- 15% of subword tokens are randomly selected; of these, 80% are replaced with a mask token, 10% are replaced with a random token, and 10% are left untouched.
- Three common choices of the masked text units: subwords, whole words, and spans.
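The 15% selection with its 80/10/10 corruption rule can be sketched at the subword level as follows. This is a minimal illustration: the `[MASK]` string, the function name, and the use of an independent 15% draw per token (rather than selecting an exact 15% of positions) are assumptions for the example.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"  # assumed mask symbol


def mask_tokens(
    tokens: List[str],
    vocab: List[str],
    select_prob: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted inputs, labels); labels are None where no loss applies."""
    if rng is None:
        rng = random.Random()
    inputs: List[str] = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # token not selected for prediction
        labels[i] = token  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK  # 80%: replace with the mask symbol
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels
```

Whole-word and span masking change only which positions get selected together; the corruption rule applied to the selected positions can stay the same.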
Next sentence prediction (NSP)
- Masked language modeling is a token-level training objective
- Some downstream tasks need a single sentence-level representation
- Researchers have designed additional semi-supervised training objectives
- NSP objectives may not help downstream performance and can be dropped
- Experiment with two NSP objectives: document discrimination and sentence-order discrimination
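One plausible way to build training pairs for the two objectives is sketched below. The exact construction used in the paper is not given in this summary, so the 50/50 sampling, the label convention, and both function names are assumptions.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]


def order_example(first: str, second: str, rng: random.Random) -> Tuple[Pair, int]:
    """Sentence-order discrimination: label 1 if the two consecutive
    segments are kept in their original order, 0 if they are swapped."""
    if rng.random() < 0.5:
        return (first, second), 1
    return (second, first), 0


def document_example(docs: List[List[str]], rng: random.Random) -> Tuple[Pair, int]:
    """Document discrimination: label 1 if the second segment continues the
    same document, 0 if it is drawn from a different document.
    Each document is assumed to hold at least two segments."""
    doc_idx = rng.randrange(len(docs))
    doc = docs[doc_idx]
    pos = rng.randrange(len(doc) - 1)
    if len(docs) < 2 or rng.random() < 0.5:
        return (doc[pos], doc[pos + 1]), 1
    other_idx = rng.choice([i for i in range(len(docs)) if i != doc_idx])
    return (doc[pos], rng.choice(docs[other_idx])), 0
```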
Evaluation metrics
- Evaluating amount of linguistic knowledge acquired by BNC language models using 3 methods
- SuperGLUE datasets test model’s ability to adapt to NLU tasks
- Edge probing tasks evaluate how much linguistic info can be extracted from frozen pre-trained model
- BLiMP uses pretrained network to model language and probes knowledge without additional training
(Super)GLUE
- GLUE and SuperGLUE are used to evaluate language understanding capabilities of language models
- Technical details of SuperGLUE fine-tuning in Appendix B.1
- The Winograd schema datasets (WNLI and WSC) are excluded
- 14 (Super)GLUE datasets measure performance on inference, linguistic acceptability, sentiment analysis, semantic similarity, word sense disambiguation, and question answering
- Deep learning systems prone to finding spurious correlations in training data
- HANS test set is used to identify fallible syntactic heuristics
- Models fine-tuned on MNLI are tested on HANS
Edge probing
- GLUE tasks measure the ability of a language model to be finetuned on a sentence-level NLU problem.
- Edge probing is a simple approach of probing for a diverse set of linguistic phenomena.
- Edge probing reformulates traditional NLP tasks as span classification.
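- Five basic tasks are probed: part-of-speech tagging (POS), dependency parsing (DP), semantic role labeling (SRL), named-entity recognition (NER) and coreference resolution (CR).
- The probe only learns to classify spans; the spans themselves are provided to it as gold data.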
BLiMP
- Evaluation metrics can be skewed by supervised training, making it difficult to separate prior knowledge from acquired knowledge.
- BLiMP measures language model knowledge without additional training.
- BLiMP consists of 67,000 sentence pairs; in each pair, only one sentence is grammatically valid.
- Language models can assign a probability to each sentence, and be tested on how often they assign a higher probability to the correct sentence.
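The evaluation loop described above can be sketched as follows. The summary's closing notes mention that a pseudo-log-likelihood score is used for sentences, so the example sums per-token log-probabilities obtained by masking one position at a time. The `masked_log_prob` callable is a hypothetical stand-in for a real masked language model, and the toy scorer exists only to make the example runnable.

```python
import math
from typing import Callable, List, Sequence, Tuple

Sentence = List[str]


def pseudo_log_likelihood(
    tokens: Sentence, masked_log_prob: Callable[[Sentence, int], float]
) -> float:
    """Sum of log P(token_i | all other tokens), one masked position at a time."""
    return sum(masked_log_prob(tokens, i) for i in range(len(tokens)))


def blimp_accuracy(
    pairs: Sequence[Tuple[Sentence, Sentence]], score: Callable[[Sentence], float]
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs where the
    grammatical sentence receives the strictly higher score."""
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


def toy_masked_log_prob(tokens: Sentence, i: int) -> float:
    """Hypothetical model: penalizes a token that disagrees with its left neighbour."""
    bad_bigrams = {("cat", "sleep"), ("cats", "sleeps")}
    prev = tokens[i - 1] if i > 0 else None
    return math.log(0.01) if (prev, tokens[i]) in bad_bigrams else math.log(0.5)


pairs = [
    (["the", "cat", "sleeps"], ["the", "cat", "sleep"]),
    (["the", "cats", "sleep"], ["the", "cats", "sleeps"]),
]
accuracy = blimp_accuracy(pairs, lambda s: pseudo_log_likelihood(s, toy_masked_log_prob))
```

Because no parameters are updated, the accuracy reflects only what the pre-trained model already encodes.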
Experiments
- Conducted experiments to compare different training hyperparameters and model configurations
- Used overall best training setting to compare training objectives
- Investigated sampling efficiency of proposed language model and compared BNC with a Wikipedia & BookCorpus subset of same size
- Central model used was a base-sized Transformer with 12 encoder layers, hidden size 768 and 12 attention heads
- All models use the same cased WordPiece tokenizer with a vocabulary size of 16,384, trained on the BNC dataset
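The central configuration listed above can be written down as a small config object. The class and field names below are hypothetical and do not come from the authors' code; only the numeric values are taken from this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Base-sized encoder settings as listed in the summary (field names assumed)."""
    num_layers: int = 12
    hidden_size: int = 768
    num_heads: int = 12
    vocab_size: int = 16_384  # equals 2 ** 14

    @property
    def head_size(self) -> int:
        """Per-head dimension; the hidden size must divide evenly across heads."""
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads


config = EncoderConfig()
```

With these values each attention head works on 64 dimensions.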
Comparison of model architectures and training settings
- NormFormer-like layer normalization performs better than post-norm and pre-norm transformer variants
- Absolute positional embeddings perform better on language modeling but are less adaptable for fine-tuning
- Weight decay of 0.1 boosts performance on masked language modeling
- GEGLU activation, lower weight norms, and no bias parameters in feed-forward layers improve performance
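To make the three normalization placements concrete, the sketch below wraps an arbitrary sublayer (attention or feed-forward) in each residual pattern. It is a simplification: layer normalization here has no learned gain or bias, and the "NormFormer-like" variant is reduced to one extra normalization on the sublayer output, which may differ from the exact placement used in the paper.

```python
import math
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a: Vector, b: Vector) -> Vector:
    return [p + q for p, q in zip(a, b)]


def post_norm(x: Vector, sublayer: Sublayer) -> Vector:
    """Original Transformer: normalize after the residual sum."""
    return layer_norm(add(x, sublayer(x)))


def pre_norm(x: Vector, sublayer: Sublayer) -> Vector:
    """Normalize the sublayer input; the residual path stays untouched."""
    return add(x, sublayer(layer_norm(x)))


def normformer_like(x: Vector, sublayer: Sublayer) -> Vector:
    """Pre-norm plus an additional normalization of the sublayer output."""
    return add(x, layer_norm(sublayer(layer_norm(x))))
```

In the post-norm variant the residual stream itself is renormalized at every layer, whereas the other two leave the skip connection as an identity path.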
Training objective comparison
- Three masking methods compared: subword, whole-word, span masking
- Span-based masking performs best
- All methods perform equally well on edge probing
- Subword masking is still a competitive baseline
- Combining NSP task and subword masking does not lead to improved performance
- Order discrimination leads to worse performance
Sampling efficiency
- The number of training steps is an important factor in the efficiency of language models.
- Increasing the number of steps does not lead to better performance.
- Training for half the time is enough to get comparable performance.
- Decreasing the training steps further degrades the downstream results.
- Current self-supervised language modeling methods are sampling inefficient.
100-million-word subset of Wikipedia & BookCorpus
- Experiment evaluates how much curation of BNC helps downstream performance
- Experiment uses a random subset of Wikipedia and BookCorpus (equal size to BNC)
- BNC is a corpus of British English from the 1990s
- A high-quality data source is not strictly necessary to learn from 100M words, but better quality leads to a noticeable difference in downstream performance
Conclusion
- Evaluated how data-efficient masked language models can be
- Trained a variety of models with different training objectives on the same training data: British National Corpus
- BNC is small but well balanced and carefully crafted
- Models perform better than BERT base trained on a much larger corpus
- Limited data regime is beneficial for the development of efficient and reliable language models
- 100 million word tokens is enough to learn basic linguistic skills
- Huge amounts of training data are not always necessary
- Next sentence prediction objective does not improve BERT-like models
- Standard subword masking is outperformed by span masking
- Linguistic performance can be increased by better neural architectures and training configurations
- Results serve as foundation for future research
Limitations
- Only considers language modeling of English
- Training process still requires a similar amount of computational resources
Appendix
- Pseudo-log-likelihood score of a sentence used to evaluate models
- Layer-wise convex weights used to rate contribution of each Transformer layer to a particular task
- NormFormer-like architecture used
- Results of all evaluated models provided in tables
- Pre-training and fine-tuning hyperparameters listed in tables
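The layer-wise convex weights mentioned in the notes above can be illustrated with a softmax-weighted mix of per-layer representations. The summary does not say how the weights are parameterized, so the softmax over one learned scalar per layer is an assumption, as are the function names.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Turn arbitrary scores into non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_outputs: List[Vector], layer_scores: List[float]) -> Vector:
    """Convex combination of per-layer vectors for one token or span.
    The resulting weights indicate how much each layer contributes."""
    if len(layer_outputs) != len(layer_scores):
        raise ValueError("need one score per layer")
    weights = softmax(layer_scores)
    dim = len(layer_outputs[0])
    return [
        sum(w * layer[d] for w, layer in zip(weights, layer_outputs))
        for d in range(dim)
    ]


# Equal scores give a plain average of the layers.
mixed = mix_layers([[0.0, 2.0], [2.0, 4.0]], [0.0, 0.0])
```

After training a probe on a task, inspecting the softmax weights shows which layers the task relies on most.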