Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Various types of pretraining architectures exist, including autoencoding, autoregressive, and encoder-decoder models.
None of the pretraining frameworks performs best for all tasks.
GLM is proposed to address this challenge, which improves blank filling pretraining by adding 2D positional encodings and allowing an arbitrary order to predict spans.
GLM outperforms BERT, T5, and GPT on a wide range of tasks.

Paper Content

Introduction

Language models pretrained on unlabeled texts have advanced the state of the art in NLP tasks
Existing pretraining frameworks can be categorized into three families: autoregressive, autoencoding, and encoder-decoder models
Autoregressive models learn left-to-right language models, but have unidirectional attention mechanism
Autoencoding models learn bidirectional context encoders via denoising objectives
Encoder-decoder models adopt bidirectional attention for the encoder, unidirectional attention for the decoder, and cross attention between them
T5 unifies NLU and conditional generation via encoder-decoder models
Previous works have tried to unify different frameworks by combining their objectives via multi-task learning
GLM proposed as a pretraining framework based on autoregressive blank infilling
GLM outperforms BERT, RoBERTa, BART, and T5 on NLU and generation tasks
GLM formulates NLU tasks as cloze questions that contain task descriptions
GLM with multi-task pretraining achieves improvements in NLU, conditional text generation, and language modeling tasks

Pretraining objective

GLM is trained by optimizing an autoregressive blank infilling objective
Input text is replaced with a single [MASK] token
Model predicts missing tokens in spans from corrupted text in an autoregressive manner
Order of spans is randomly permuted
Tokens in each blank are generated in a left-to-right order
Spans of length are randomly sampled from a Poisson distribution
At least 15% of original tokens are masked

Multi-task pretraining

GLM evaluated in multi-task setting
Sampled short and long spans with equal chances
Evaluated on NLU, seq2seq, blank infilling, and zero-shot language modeling
GLM outperforms BERT and UniLM
GLM Sent outperforms GLM Doc
Increasing GLM Doc’s parameters leads to better performance
GLM Large matches other pretraining models on generation tasks
GLM RoBERTa matches seq2seq BART model
GLM outperforms previous methods on text infilling
GLM Large performs worse than GPT Large on language modeling
Increasing GLM Doc’s parameters leads to performance close to GPT Large
Encoding context with bidirectional attention improves language modeling

Model architecture

GLM uses a single Transformer with modifications to the architecture
Positional information is encoded with two positional ids
Model is not aware of the length of the masked span
NLU classification tasks are reformulated as generation tasks of blank infilling
Finetune GLM with a cross-entropy loss
GLM can be used for unconditional or conditional generation tasks

Discussion and analysis

GLM captures interdependencies of masked tokens, while BERT does not
BERT cannot fill in multiple tokens properly
XLNet uses original position encodings before corruption
XLNet uses two-stream self-attention mechanism
T5 uses sentinel tokens to differentiate masked spans
GLM outperforms T5 on NLU and seq2seq tasks
UniLM replaces masked spans with [MASK] tokens
GLM unifies NLU and generation tasks with autoregressive pretraining

Experiments

Pretraining setup described
Evaluation of downstream tasks described

Pretraining setup

Used BooksCorpus and English Wikipedia as pretraining data
Used same architectures as BERT Base and BERT Large
Trained GLM Base and GLM Large with 110M and 340M parameters respectively
Trained two larger GLM models with 410M and 515M parameters
Trained GLM RoBERTa with same data, tokenization, and hyperparameters as RoBERTa
Pretrained model for 250,000 steps

Superglue

Experiments conducted on SuperGLUE benchmark
Classification tasks reformulated as blank infilling with human-crafted cloze questions
BERT Base and BERT Large used as baselines
GLM consistently outperforms BERT on most tasks
GLM Base scores 4.6% higher than BERT Base
GLM Large scores 5.0% higher than BERT Large
GLM RoBERTa outperforms T5 Large but is only half its size
BART does not perform well on SuperGLUE benchmark

Ablation study

GLM is superior to BERT on NLU tasks
GLM outperforms BERT with cloze-style finetuning on ReCoRD and WSC
Cloze formulation is critical for GLM’s performance on NLU tasks
Removing span shuffling and using different sentinel tokens leads to a performance drop

Pretrained language models improve performance of downstream tasks
Three types of pretrained models: autoregressive, encoder-decoder, and masked language modeling
NLU tasks can be completed by generative language models without finetuning
PET and Athiwaratkun et al. study blanking infilling models

Conclusions

GLM is a pretraining framework for natural language understanding and generation
GLM unifies the pretraining objectives for different tasks as autoregressive blank infilling
GLM outperforms previous methods for NLU tasks
Hyperparameters for all pre-training settings are summarized in Table 7
SuperGLUE benchmark consists of 8 NLU tasks
Finetuning GLM on SuperGLUE tasks involves constructing the input using cloze questions and replacing the blank with a [MASK] token
Baseline classifiers concatenate the input parts of each task and add a classification layer on top of the [CLS] token representation
Text summarization task uses Gigaword dataset for model finetuning and evaluation
Question generation task uses SQuAD 1.1 dataset
GLUE and SQuAD benchmarks used to compare GLM with BERT
Text infilling performance evaluated on Yahoo Answers dataset
Language modeling ability evaluated with perplexity on BookWiki and accuracy on LAMBDA dataset

Link to paper#

Abstract#

Paper Content#

Introduction#

Pretraining objective#

Multi-task pretraining#

Model architecture#

Discussion and analysis#

Experiments#

Pretraining setup#

Superglue#

Ablation study#

Related work#

Conclusions#

Link to paper

Abstract

Paper Content

Introduction

Pretraining objective

Multi-task pretraining

Model architecture

Discussion and analysis

Experiments

Pretraining setup

Superglue

Ablation study

Related work

Conclusions