Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Various types of pretraining architectures exist, including autoencoding, autoregressive, and encoder-decoder models.
- None of the pretraining frameworks performs best for all tasks.
- GLM is proposed to address this challenge; it improves blank-filling pretraining by adding 2D positional encodings and allowing spans to be predicted in an arbitrary order.
- GLM outperforms BERT, T5, and GPT on a wide range of tasks.
Paper Content
Introduction
- Language models pretrained on unlabeled texts have advanced the state of the art in NLP tasks
- Existing pretraining frameworks can be categorized into three families: autoregressive, autoencoding, and encoder-decoder models
- Autoregressive models learn left-to-right language models, but their attention mechanism is unidirectional
- Autoencoding models learn bidirectional context encoders via denoising objectives
- Encoder-decoder models adopt bidirectional attention for the encoder, unidirectional attention for the decoder, and cross attention between them
- T5 unifies NLU and conditional generation via encoder-decoder models
- Previous works have tried to unify different frameworks by combining their objectives via multi-task learning
- GLM proposed as a pretraining framework based on autoregressive blank infilling
- GLM outperforms BERT, RoBERTa, BART, and T5 on NLU and generation tasks
- GLM formulates NLU tasks as cloze questions that contain task descriptions
- GLM with multi-task pretraining achieves improvements in NLU, conditional text generation, and language modeling tasks
Pretraining objective
- GLM is trained by optimizing an autoregressive blank infilling objective
- Text spans are sampled from the input, and each span is replaced with a single [MASK] token
- Model predicts missing tokens in spans from corrupted text in an autoregressive manner
- Order of spans is randomly permuted
- Tokens in each blank are generated in a left-to-right order
- Span lengths are drawn from a Poisson distribution with λ = 3
- New spans are sampled repeatedly until at least 15% of the original tokens are masked
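The corruption procedure described in this list can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the [S] and [E] boundary tokens, and the rule of rejecting overlapping spans are assumptions made for the example.

```python
import numpy as np

MASK, START, END = "[MASK]", "[S]", "[E]"  # assumed token strings


def sample_spans(n_tokens, rng, mask_ratio=0.15, lam=3.0):
    """Sample non-overlapping (start, length) spans until at least
    mask_ratio of the tokens are covered."""
    covered = np.zeros(n_tokens, dtype=bool)
    spans = []
    target = int(np.ceil(mask_ratio * n_tokens))
    while covered.sum() < target:
        # Span length from a Poisson distribution, clipped to [1, n_tokens].
        length = min(max(1, int(rng.poisson(lam))), n_tokens)
        start = int(rng.integers(0, n_tokens - length + 1))
        if covered[start:start + length].any():
            continue  # assumption: overlapping candidates are rejected
        covered[start:start + length] = True
        spans.append((start, length))
    return sorted(spans)


def corrupt(tokens, rng):
    """Return (part_a, part_b_inputs, part_b_targets) for one example."""
    spans = sample_spans(len(tokens), rng)

    # Part A: the corrupted text, each span collapsed to a single [MASK].
    part_a, cursor = [], 0
    for start, length in spans:
        part_a.extend(tokens[cursor:start])
        part_a.append(MASK)
        cursor = start + length
    part_a.extend(tokens[cursor:])

    # Part B: spans in a random order; tokens inside a span stay left to right.
    part_b_inputs, part_b_targets = [], []
    for i in rng.permutation(len(spans)):
        start, length = spans[i]
        span = list(tokens[start:start + length])
        part_b_inputs.extend([START] + span)   # shifted-right decoder input
        part_b_targets.extend(span + [END])    # tokens to predict
    return part_a, part_b_inputs, part_b_targets


rng = np.random.default_rng(0)
tokens = [f"x{i}" for i in range(40)]
part_a, b_in, b_tgt = corrupt(tokens, rng)
print(part_a)
print(b_in)
```

The model would be trained to predict `part_b_targets` from `part_a` plus `part_b_inputs`, so each span is conditioned on the corrupted text and on the spans that precede it in the shuffled order.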
Multi-task pretraining
- GLM evaluated in multi-task setting
- Sampled short and long spans with equal chances
- Evaluated on NLU, seq2seq, blank infilling, and zero-shot language modeling
- GLM outperforms BERT and UniLM
- GLM Sent outperforms GLM Doc
- Increasing GLM Doc’s parameters leads to better performance
- GLM Large matches other pretraining models on generation tasks
- GLM RoBERTa matches seq2seq BART model
- GLM outperforms previous methods on text infilling
- With the same number of parameters, GLM Doc performs worse than GPT Large on language modeling
- Increasing GLM Doc’s parameters leads to performance close to GPT Large
- Encoding context with bidirectional attention improves language modeling
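The mixing of short and long spans can be illustrated with a small sampler. This is a hedged sketch: the 50/50 choice comes from the list above, while the long-span rule (one span whose length is uniform over 50% to 100% of the text, as in the paper's document-level objective) and the names `choose_objective` and `sample_long_span` are assumptions for the example.

```python
import math

import numpy as np


def choose_objective(rng):
    """Pick the short-span or long-span objective with equal probability."""
    return "short" if rng.random() < 0.5 else "long"


def sample_long_span(n_tokens, rng):
    """Sample one long span covering 50% to 100% of the text (assumed rule)."""
    min_len = math.ceil(0.5 * n_tokens)
    length = int(rng.integers(min_len, n_tokens + 1))    # upper bound is exclusive
    start = int(rng.integers(0, n_tokens - length + 1))
    return start, length


rng = np.random.default_rng(0)
print(choose_objective(rng), sample_long_span(512, rng))
```

A long span leaves only a short prefix or suffix as context, which pushes the model toward free-form generation, while short spans keep the objective close to the NLU-oriented blank infilling.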
Model architecture
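- GLM uses a single Transformer with several modifications: layer normalization and the residual connection are reordered, a single linear layer predicts output tokens, and GeLU replaces ReLU
- Each token carries two positional ids: the first is its position in the corrupted text (for span tokens, the position of the corresponding [MASK]); the second is its position within the span (0 for tokens in the corrupted text)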
- Model is not aware of the length of the masked span
- NLU classification tasks are reformulated as generation tasks of blank infilling
- Finetune GLM with a cross-entropy loss
- GLM can be used for unconditional or conditional generation tasks
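The two positional ids can be made concrete with a small example. The sketch below also builds the attention mask used in the paper (corrupted text fully visible, spans generated causally), which this summary does not spell out; the function name `glm_inputs` and the 1-based numbering are assumptions for illustration.

```python
import numpy as np


def glm_inputs(part_a_len, spans):
    """Build 2D positional ids and an attention mask.

    part_a_len: number of tokens in the corrupted text (Part A).
    spans: list of (mask_index, n_inputs) in generation order, where
        mask_index is the 0-based index of the span's [MASK] in Part A and
        n_inputs counts the start token plus the span tokens.
    """
    pos1 = list(range(1, part_a_len + 1))  # position in the corrupted text
    pos2 = [0] * part_a_len                # intra-span position, 0 in Part A
    for mask_index, n_inputs in spans:
        pos1 += [mask_index + 1] * n_inputs    # all share the [MASK] position
        pos2 += list(range(1, n_inputs + 1))   # 1, 2, ... inside the span
    total = len(pos1)

    # mask[i, j] is True when query position i may attend to key position j.
    mask = np.tril(np.ones((total, total), dtype=bool))  # causal by default
    mask[:, :part_a_len] = True                          # Part A visible to all
    return np.array(pos1), np.array(pos2), mask


# Original text x1..x6 with spans [x3] and [x5, x6]:
# Part A = x1 x2 [M] x4 [M]; Part B = [S] x5 x6 [S] x3
pos1, pos2, mask = glm_inputs(5, [(4, 3), (2, 2)])
print(pos1.tolist())  # [1, 2, 3, 4, 5, 5, 5, 5, 3, 3]
print(pos2.tolist())  # [0, 0, 0, 0, 0, 1, 2, 3, 1, 2]
```

Because every token of a span shares the first id of its [MASK] and the second id only counts upward, nothing in the encoding reveals how long the span is before it has been generated.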
Discussion and analysis
- GLM captures interdependencies of masked tokens, while BERT does not
- BERT cannot fill in multiple tokens properly
- XLNet uses original position encodings before corruption
- XLNet uses two-stream self-attention mechanism
- T5 uses sentinel tokens to differentiate masked spans
- GLM outperforms T5 on NLU and seq2seq tasks
- UniLM replaces masked spans with [MASK] tokens
- GLM unifies NLU and generation tasks with autoregressive pretraining
Experiments
- Pretraining setup described
- Evaluation of downstream tasks described
Pretraining setup
- Used BooksCorpus and English Wikipedia as pretraining data
- Used same architectures as BERT Base and BERT Large
- Trained GLM Base and GLM Large with 110M and 340M parameters respectively
- Trained two larger GLM models with 410M and 515M parameters
- Trained GLM RoBERTa with same data, tokenization, and hyperparameters as RoBERTa
- GLM RoBERTa was pretrained for 250,000 steps
SuperGLUE
- Experiments conducted on SuperGLUE benchmark
- Classification tasks reformulated as blank infilling with human-crafted cloze questions
- BERT Base and BERT Large used as baselines
- GLM consistently outperforms BERT on most tasks
- GLM Base scores 4.6% higher than BERT Base on average
- GLM Large scores 5.0% higher than BERT Large on average
- GLM RoBERTa outperforms T5 Large but is only half its size
- BART does not perform well on SuperGLUE benchmark
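The cloze reformulation of a classification task can be sketched as below. The pattern string, the label words in `VERBALIZER`, and the dictionary of logits are hypothetical stand-ins; a real setup would read the logits at the [MASK] position from the model.

```python
import math

# Hypothetical mapping from class labels to single answer words.
VERBALIZER = {"positive": "good", "negative": "bad"}


def cloze_input(sentence, pattern="{sentence} It's really [MASK]."):
    """Wrap the input in a cloze question with one blank."""
    return pattern.format(sentence=sentence)


def label_probs(mask_logits, verbalizer=VERBALIZER):
    """Softmax over the answer words only.

    mask_logits maps vocabulary words to the model's logit at [MASK].
    """
    scores = {label: mask_logits[word] for label, word in verbalizer.items()}
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def cross_entropy(probs, gold_label):
    """Finetuning loss for one example."""
    return -math.log(probs[gold_label])


text = cloze_input("The movie was fun.")
probs = label_probs({"good": 2.0, "bad": 0.0, "the": 5.0})  # made-up logits
print(text)
print(probs, cross_entropy(probs, "positive"))
```

Restricting the softmax to the answer words means unrelated vocabulary items, such as "the" in the made-up logits, do not affect the predicted class.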
Ablation study
- GLM is superior to BERT on NLU tasks
- GLM outperforms BERT with cloze-style finetuning on ReCoRD and WSC
- Cloze formulation is critical for GLM’s performance on NLU tasks
- Removing span shuffling (always predicting spans left to right) leads to a performance drop, as does using different sentinel tokens instead of a single [MASK] token
Related work
- Pretrained language models improve performance of downstream tasks
- Three types of pretrained models: autoregressive, encoder-decoder, and masked language modeling
- NLU tasks can be completed by generative language models without finetuning
- PET and Athiwaratkun et al. study blank infilling models
Conclusions
- GLM is a pretraining framework for natural language understanding and generation
- GLM unifies the pretraining objectives for different tasks as autoregressive blank infilling
- GLM outperforms previous methods for NLU tasks
- Hyperparameters for all pre-training settings are summarized in Table 7
- SuperGLUE benchmark consists of 8 NLU tasks
- Finetuning GLM on SuperGLUE tasks involves constructing the input using cloze questions and replacing the blank with a [MASK] token
- Baseline classifiers concatenate the input parts of each task and add a classification layer on top of the [CLS] token representation
- Text summarization task uses Gigaword dataset for model finetuning and evaluation
- Question generation task uses SQuAD 1.1 dataset
- GLUE and SQuAD benchmarks used to compare GLM with BERT
- Text infilling performance evaluated on Yahoo Answers dataset
- Language modeling ability evaluated with perplexity on BookWiki and accuracy on the LAMBADA dataset
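The two language modeling metrics in the last item can be computed as in this sketch. It assumes per-token natural-log probabilities are already available from a model, and the function names are illustrative only.

```python
import math


def perplexity(token_log_probs):
    """Exponential of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def last_word_accuracy(predicted_words, target_words):
    """Fraction of passages whose final word is predicted exactly."""
    correct = sum(p == t for p, t in zip(predicted_words, target_words))
    return correct / len(target_words)


# A model that gives every token probability 0.25 has perplexity 4.
print(perplexity([math.log(0.25)] * 4))
print(last_word_accuracy(["door", "cat"], ["door", "dog"]))
```

Lower perplexity is better, while the last-word accuracy is a higher-is-better score.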