Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Modern masked language models are trained on large corpora.
We explore the effects of training on a smaller, well-balanced corpus.
Pre-training on this corpus can reach better performance than the original BERT model.
Smaller corpora have potential as a language modeling benchmark.
We present comparative studies of LMs to evaluate training objectives and model architectures.
We propose an optimized LM architecture called LTG-BERT.

Paper Content

Introduction

NLP practitioners use large amounts of data to pre-train language models
Aim is to focus on more efficient language modeling on a small and standardizable pre-training corpus
Study data efficiency of current language models on an openly available corpus of approximately 100M words
Goal is not to rival the paradigm of ‘massively pre-trained language models’
Contribution is twofold: (1) 100M words is enough to train a competitive language model, and (2) reproducibility and fair comparison of language models can be achieved by pre-training on the British National Corpus
Language models have been pretrained on different corpora tokenized by different tokenizers and fine-tuned by increasingly complex learning methods
Data requirements of language models have been growing by orders of magnitude
ELMo and BERT introduced deep contextualized embeddings of words
XLNet, RoBERTa and GPT-3 trained on 33B, 30B and 400B words respectively
Effect of corpus size has been studied
Evaluate effect of training on a small corpus which was carefully curated to create a representative sample of English

British National Corpus

BNC is a monolingual English corpus
Contains 100 million words of written and spoken language
Sources include newspapers, journals, books, letters, conversations, radio shows, phone calls
Sources are truncated to 45,000 words to ensure diversity
Widely acknowledged to have been a major influence on language corpora
Does not reflect 21st century English language
Used as a model for creating representative corpora for other languages
Third release of the corpus is BNC XML Edition (2007)

Preprocessing

XML version of BNC is converted to Markdown format to make it human-readable
Metainformation is preserved
Articles are randomly placed into training and development splits
Text units are words, sentences, paragraphs and articles
Word-level tokenization is not preserved; heuristics are used instead
Headers, speech turns, quotes and incomprehensible speech are kept in Markdown format

Model architecture

Depart from typical post-norm Transformer architecture
Preliminary experiments showed model tends to diverge
Follow recent improvements of Transformer
Adopt NormFormer-style layer normalization to stabilize training
Use GEGLU activation function to enhance expressiveness
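The gated feed-forward idea behind GEGLU (the name used in the experiments section) can be illustrated with a minimal pure-Python sketch. It assumes the common formulation in which one linear projection is passed through GELU and multiplied element-wise with a second projection. The function names and toy weight matrices below are illustrative assumptions, not the paper's implementation.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(weight, x):
    """Multiply a (rows x cols) matrix, given as a list of rows, by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def geglu_feed_forward(x, w_gate, w_value, w_out):
    """Compute W_out (GELU(W_gate x) * (W_value x)) without bias terms."""
    gate = [gelu(g) for g in matvec(w_gate, x)]
    value = matvec(w_value, x)
    hidden = [g * v for g, v in zip(gate, value)]
    return matvec(w_out, hidden)

# Toy dimensions (assumed): input size 2, hidden size 3, output size 2.
x = [1.0, -2.0]
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_value = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
w_out = [[1.0, 1.0, 1.0], [1.0, 0.0, -1.0]]
y = geglu_feed_forward(x, w_gate, w_value, w_out)
```

The gate lets the layer scale each hidden unit by a data-dependent factor, which is the usual argument for why gated variants are more expressive than a plain GELU layer of the same width.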

Training objectives

Established a controlled test bed for a comparative study of training objectives
Evaluated five different configurations of two self-supervised training objectives (MLM and NSP)

Masked language modeling (MLM)

BERT learns a bidirectional contextualized representation for each token in a text segment.
15% of subword tokens are randomly selected for prediction; of these, 80% are replaced with a mask token, 10% are replaced with a random token and 10% are left unchanged.
Three common choices of the masked text units: subwords, whole words, and spans.
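The subword variant of this corruption scheme can be sketched as follows. This is a minimal illustration of the 15% selection and 80/10/10 split described above; the function name, the `[MASK]` string and the toy vocabulary are assumptions made for the example, and whole-word or span masking would select contiguous groups of tokens instead of single subwords.

```python
import random

MASK = "[MASK]"  # assumed mask symbol

def mask_subwords(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            # Not selected: keep the token and do not predict it.
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # 80%: mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: random token
        else:
            corrupted.append(token)              # 10%: left unchanged
    return corrupted, labels

rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "##s"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 50
corrupted, labels = mask_subwords(tokens, vocab, rng)
```

Only positions with a non-`None` label contribute to the training loss.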

Next sentence prediction (NSP)

Masked language modeling is a token-level training objective
Some downstream tasks need a single sentence-level representation
Researchers have designed additional semi-supervised training objectives
NSP objectives may not help downstream performance and can be dropped
Experiment with two NSP objectives: document discrimination and sentence-order discrimination
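How training pairs for the two objectives might be built is sketched below. The exact sampling procedure is not given in this summary, so the definitions here are assumptions: document discrimination asks whether two segments come from the same document, and sentence-order discrimination asks whether two adjacent segments appear in their original order. All function names are hypothetical.

```python
import random

def document_discrimination_pair(docs, rng):
    """Return (first, second, label); label 1 means both come from the same document."""
    doc_idx = rng.randrange(len(docs))
    doc = docs[doc_idx]
    pos = rng.randrange(len(doc) - 1)
    first = doc[pos]
    if rng.random() < 0.5:
        return first, doc[pos + 1], 1
    # Negative example: draw the second segment from a different document.
    other_idx = rng.choice([i for i in range(len(docs)) if i != doc_idx])
    return first, rng.choice(docs[other_idx]), 0

def sentence_order_pair(doc, rng):
    """Return (first, second, label); label 1 means the original order is kept."""
    pos = rng.randrange(len(doc) - 1)
    a, b = doc[pos], doc[pos + 1]
    if rng.random() < 0.5:
        return a, b, 1
    return b, a, 0

# Toy documents: each is a list of at least two distinct segments.
docs = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3", "c4"]]
rng = random.Random(0)
example = document_discrimination_pair(docs, rng)
```

Both functions assume at least two documents and at least two segments per document.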

Evaluation metrics

The amount of linguistic knowledge acquired by the BNC language models is evaluated with three methods
SuperGLUE datasets test model’s ability to adapt to NLU tasks
Edge probing tasks evaluate how much linguistic info can be extracted from frozen pre-trained model
BLiMP uses pretrained network to model language and probes knowledge without additional training

(Super)GLUE

GLUE and SuperGLUE are used to evaluate language understanding capabilities of language models
Technical details of SuperGLUE fine-tuning in Appendix B.1
The Winograd schema datasets (WNLI and WSC) are excluded
14 (Super)GLUE datasets measure performance on inference, linguistic acceptability, sentiment analysis, semantic similarity, word sense disambiguation, and question answering
Deep learning systems prone to finding spurious correlations in training data
HANS test set is used to identify fallible syntactic heuristics
Models fine-tuned on MNLI are tested on HANS

Edge probing

GLUE tasks measure the ability of a language model to be finetuned on a sentence-level NLU problem.
Edge probing is a simple approach of probing for a diverse set of linguistic phenomena.
Edge probing reformulates traditional NLP tasks as span classification.
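Five basic tasks are probed: part-of-speech tagging (POS), dependency parsing (DP), semantic role labeling (SRL), named entity recognition (NER) and coreference resolution (CR).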
Model only learns to classify each span provided to the model as gold data.

BLiMP

Evaluation metrics can be skewed by supervised training, making it difficult to separate prior knowledge from acquired knowledge.
BLiMP measures language model knowledge without additional training.
BLiMP consists of 67,000 sentence pairs; in each pair, one sentence is grammatical and the other is not.
Language models can assign a probability to each sentence, and be tested on how often they assign a higher probability to the correct sentence.
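The scoring procedure can be sketched in a few lines. A masked language model does not define a sentence probability directly, so this example assumes a pseudo-log-likelihood score: the sum, over positions, of the log-probability of each token given the rest of the sentence. The callable `token_log_prob` is a hypothetical stand-in for a real model, and the toy scorer only knows a handful of subject-verb agreements.

```python
import math

def pseudo_log_likelihood(tokens, token_log_prob):
    """Sum the log-probability of each token given all the other tokens."""
    return sum(token_log_prob(tokens, i) for i in range(len(tokens)))

def minimal_pair_accuracy(pairs, token_log_prob):
    """Fraction of (grammatical, ungrammatical) pairs where the first scores higher."""
    correct = 0
    for good, bad in pairs:
        good_score = pseudo_log_likelihood(good, token_log_prob)
        bad_score = pseudo_log_likelihood(bad, token_log_prob)
        if good_score > bad_score:
            correct += 1
    return correct / len(pairs)

# Toy stand-in for a masked LM (assumption): it rewards known agreements only.
AGREES = {("cat", "sleeps"), ("cats", "sleep"), ("dog", "barks"), ("dogs", "bark")}

def toy_token_log_prob(tokens, i):
    if i > 0 and (tokens[i - 1], tokens[i]) in AGREES:
        return math.log(0.9)
    return math.log(0.1)

pairs = [
    (["the", "cat", "sleeps"], ["the", "cat", "sleep"]),
    (["the", "dogs", "bark"], ["the", "dogs", "barks"]),
]
accuracy = minimal_pair_accuracy(pairs, toy_token_log_prob)
```

Because no parameters are updated, the resulting accuracy reflects only what the model learned during pre-training.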

Experiments

Conducted experiments to compare different training hyperparameters and model configurations
Used overall best training setting to compare training objectives
Investigated sampling efficiency of proposed language model and compared BNC with a Wikipedia & BookCorpus subset of same size
Central model used was a base-sized Transformer with 12 encoder layers, hidden size 768 and 12 attention heads
All models use the same cased WordPiece tokenizer with a vocabulary size of 16,384, trained on the BNC
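For orientation, the stated base configuration can be written down as a small config object. The class and field names are hypothetical; only the four numbers come from the text above, and the derived quantities follow from simple arithmetic.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EncoderConfig:
    """Base-sized encoder settings as listed in the summary (field names assumed)."""
    num_layers: int = 12
    hidden_size: int = 768
    num_attention_heads: int = 12
    vocab_size: int = 16384

    @property
    def head_size(self):
        """Per-head dimension; the hidden size must divide evenly across heads."""
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads

    @property
    def token_embedding_parameters(self):
        """Number of weights in the token embedding matrix."""
        return self.vocab_size * self.hidden_size

config = EncoderConfig()
print(config.head_size, config.token_embedding_parameters)
```

With these values each attention head works on 64 dimensions, and the token embedding matrix holds 16,384 x 768 weights.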

Comparison of model architectures and training settings

NormFormer-like layer normalization performs better than post-norm and pre-norm transformer variants
Absolute positional embeddings perform better on language modeling but are less adaptable for fine-tuning
Weight decay of 0.1 boosts performance on masked language modeling
GEGLU activation, lower weight norms, and no bias parameters in feed-forward layers improve performance
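The difference between the normalization placements compared above can be sketched on plain Python lists. The post-norm and pre-norm orderings are the standard ones. The third function is a simplified reading of the NormFormer idea, which additionally normalizes the sublayer output before the residual addition; it is an assumption for illustration and omits learnable scale and shift parameters as well as other NormFormer details.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    """Element-wise sum of two vectors."""
    return [u + v for u, v in zip(a, b)]

def post_norm(x, sublayer):
    """Post-norm: normalize after the residual addition."""
    return layer_norm(add(x, sublayer(x)))

def pre_norm(x, sublayer):
    """Pre-norm: normalize the sublayer input, leave the residual stream untouched."""
    return add(x, sublayer(layer_norm(x)))

def extra_norm(x, sublayer):
    """Simplified NormFormer-style block: normalize both sublayer input and output."""
    return add(x, layer_norm(sublayer(layer_norm(x))))

def double(v):
    """Toy sublayer standing in for attention or the feed-forward network."""
    return [2.0 * t for t in v]

x = [1.0, 2.0, 3.0, 6.0]
outputs = {
    "post": post_norm(x, double),
    "pre": pre_norm(x, double),
    "extra": extra_norm(x, double),
}
```

In the pre-norm and extra-norm variants the residual stream `x` passes through unchanged, which is commonly cited as the reason they train more stably than post-norm.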

Training objective comparison

Three masking methods compared: subword, whole-word, span masking
Span-based masking performs best
All methods perform equally well on edge probing
Subword masking is still a competitive baseline
Combining NSP task and subword masking does not lead to improved performance
Order discrimination leads to worse performance

Sampling efficiency

The number of training steps matters for how efficiently a language model learns from its data.
Increasing the number of steps beyond the default setting does not lead to better performance.
Training for half the time is enough to get comparable performance.
Decreasing the training steps further degrades the downstream results.
Current self-supervised language modeling methods are sampling inefficient.

100 million word subset of Wikipedia & BookCorpus

Experiment evaluates how much curation of BNC helps downstream performance
Experiment uses a random subset of Wikipedia and BookCorpus (equal size to BNC)
BNC is a corpus of British English from 1990s
A carefully curated data source is not strictly necessary to learn from 100M words, but better data quality leads to a noticeable difference in downstream performance

Conclusion

Evaluated how data-efficient masked language models can be
Trained a variety of models with different training objectives on the same training data: British National Corpus
BNC is small but well balanced and carefully crafted
Models perform better than BERT base trained on a much larger corpus
Limited data regime is beneficial for the development of efficient and reliable language models
100 million word tokens is enough to learn basic linguistic skills
Huge amounts of training data are not always necessary
Next sentence prediction objective does not improve BERT-like models
Standard subword masking is outperformed by span masking
Linguistic performance can be increased by better neural architectures and training configurations
Results serve as foundation for future research
Only considers language modeling of English
Training process still requires a similar amount of computational resources
Pseudo-loglikelihood score of a sentence used to evaluate models
Layer-wise convex weights used to rate contribution of each Transformer layer to a particular task
NormFormer-like architecture used
Results of all evaluated models provided in tables
Pre-training and fine-tuning hyperparameters listed in tables
