Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Various types of pretraining architectures exist, including autoencoding, autoregressive, and encoder-decoder models.
  • None of the pretraining frameworks performs best for all tasks.
  • GLM is proposed to address this challenge, which improves blank filling pretraining by adding 2D positional encodings and allowing an arbitrary order to predict spans.
  • GLM outperforms BERT, T5, and GPT on a wide range of tasks.

Paper Content

Introduction

  • Language models pretrained on unlabeled texts have advanced the state of the art in NLP tasks
  • Existing pretraining frameworks can be categorized into three families: autoregressive, autoencoding, and encoder-decoder models
  • Autoregressive models learn left-to-right language models, but have unidirectional attention mechanism
  • Autoencoding models learn bidirectional context encoders via denoising objectives
  • Encoder-decoder models adopt bidirectional attention for the encoder, unidirectional attention for the decoder, and cross attention between them
  • T5 unifies NLU and conditional generation via encoder-decoder models
  • Previous works have tried to unify different frameworks by combining their objectives via multi-task learning
  • GLM proposed as a pretraining framework based on autoregressive blank infilling
  • GLM outperforms BERT, RoBERTa, BART, and T5 on NLU and generation tasks
  • GLM formulates NLU tasks as cloze questions that contain task descriptions
  • GLM with multi-task pretraining achieves improvements in NLU, conditional text generation, and language modeling tasks

Pretraining objective

  • GLM is trained by optimizing an autoregressive blank infilling objective
  • Input text is replaced with a single [MASK] token
  • Model predicts missing tokens in spans from corrupted text in an autoregressive manner
  • Order of spans is randomly permuted
  • Tokens in each blank are generated in a left-to-right order
  • Spans of length are randomly sampled from a Poisson distribution
  • At least 15% of original tokens are masked

Multi-task pretraining

  • GLM evaluated in multi-task setting
  • Sampled short and long spans with equal chances
  • Evaluated on NLU, seq2seq, blank infilling, and zero-shot language modeling
  • GLM outperforms BERT and UniLM
  • GLM Sent outperforms GLM Doc
  • Increasing GLM Doc’s parameters leads to better performance
  • GLM Large matches other pretraining models on generation tasks
  • GLM RoBERTa matches seq2seq BART model
  • GLM outperforms previous methods on text infilling
  • GLM Large performs worse than GPT Large on language modeling
  • Increasing GLM Doc’s parameters leads to performance close to GPT Large
  • Encoding context with bidirectional attention improves language modeling

Model architecture

  • GLM uses a single Transformer with modifications to the architecture
  • Positional information is encoded with two positional ids
  • Model is not aware of the length of the masked span
  • NLU classification tasks are reformulated as generation tasks of blank infilling
  • Finetune GLM with a cross-entropy loss
  • GLM can be used for unconditional or conditional generation tasks

Discussion and analysis

  • GLM captures interdependencies of masked tokens, while BERT does not
  • BERT cannot fill in multiple tokens properly
  • XLNet uses original position encodings before corruption
  • XLNet uses two-stream self-attention mechanism
  • T5 uses sentinel tokens to differentiate masked spans
  • GLM outperforms T5 on NLU and seq2seq tasks
  • UniLM replaces masked spans with [MASK] tokens
  • GLM unifies NLU and generation tasks with autoregressive pretraining

Experiments

  • Pretraining setup described
  • Evaluation of downstream tasks described

Pretraining setup

  • Used BooksCorpus and English Wikipedia as pretraining data
  • Used same architectures as BERT Base and BERT Large
  • Trained GLM Base and GLM Large with 110M and 340M parameters respectively
  • Trained two larger GLM models with 410M and 515M parameters
  • Trained GLM RoBERTa with same data, tokenization, and hyperparameters as RoBERTa
  • Pretrained model for 250,000 steps

Superglue

  • Experiments conducted on SuperGLUE benchmark
  • Classification tasks reformulated as blank infilling with human-crafted cloze questions
  • BERT Base and BERT Large used as baselines
  • GLM consistently outperforms BERT on most tasks
  • GLM Base scores 4.6% higher than BERT Base
  • GLM Large scores 5.0% higher than BERT Large
  • GLM RoBERTa outperforms T5 Large but is only half its size
  • BART does not perform well on SuperGLUE benchmark

Ablation study

  • GLM is superior to BERT on NLU tasks
  • GLM outperforms BERT with cloze-style finetuning on ReCoRD and WSC
  • Cloze formulation is critical for GLM’s performance on NLU tasks
  • Removing span shuffling and using different sentinel tokens leads to a performance drop
  • Pretrained language models improve performance of downstream tasks
  • Three types of pretrained models: autoregressive, encoder-decoder, and masked language modeling
  • NLU tasks can be completed by generative language models without finetuning
  • PET and Athiwaratkun et al. study blanking infilling models

Conclusions

  • GLM is a pretraining framework for natural language understanding and generation
  • GLM unifies the pretraining objectives for different tasks as autoregressive blank infilling
  • GLM outperforms previous methods for NLU tasks
  • Hyperparameters for all pre-training settings are summarized in Table 7
  • SuperGLUE benchmark consists of 8 NLU tasks
  • Finetuning GLM on SuperGLUE tasks involves constructing the input using cloze questions and replacing the blank with a [MASK] token
  • Baseline classifiers concatenate the input parts of each task and add a classification layer on top of the [CLS] token representation
  • Text summarization task uses Gigaword dataset for model finetuning and evaluation
  • Question generation task uses SQuAD 1.1 dataset
  • GLUE and SQuAD benchmarks used to compare GLM with BERT
  • Text infilling performance evaluated on Yahoo Answers dataset
  • Language modeling ability evaluated with perplexity on BookWiki and accuracy on LAMBDA dataset