Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Gated Linear Units (GLU) are a product of two linear projections, one of which is passed through a sigmoid function.
  • Variations on GLU can use different nonlinear or linear functions instead of sigmoid.
  • GLU variants are tested in the feed-forward sublayers of the Transformer sequence-to-sequence model.
  • Some GLU variants yield quality improvements over ReLU or GELU activations.

Paper Content

Introduction

  • Transformer sequence-to-sequence model alternates between multi-head attention and position-wise feed-forward networks
  • Feed-forward networks take a vector x and pass it through two linear transformations
  • Rectified-linear activation function applied between the two linear transformations
  • Gated Linear Units (GLU) and Variants introduced Gated Linear Units (GLU) as a neural network layer
  • Variations on the Transformer FFN layer use GLU or one of its variants in place of the first linear transformation and the activation function

Experiments on text-to-text transfer transformer (t5)

  • FFN variants tested on transfer-learning setup from Raffel et al., 2019
  • Encoder-decoder transformer model trained on denoising objective of predicting missing text segments
  • Fine-tuned on various language understanding tasks

Model architecture

  • Used same code base, model architecture, and training task as base model from Raffel et al., 2019
  • Encoder and decoder each have 12 layers, with d model = 768
  • Attention layers have h = 12 and d k = d v = 64
  • FFN layers have hidden size d f f = 3072
  • GLU-variant-based FFN layers have hidden layer d f f = 2048 to maintain same parameter and operation counts as base model

Pre-training and perplexity results

  • Pre-trained for 524,288 steps on the span-filling objective on the C4 dataset
  • Each training batch consists of 128 examples, each with an input of 512 tokens and an output of 114 tokens
  • Used the Adafactor optimizer and an inverse-square-root learning-rate schedule
  • Learning rate decayed linearly for the final 10 percent of training steps
  • No dropout during pre-training
  • Log-perplexity on the training objective used to measure model quality

Fine-tuning

  • Fine-tuned model on SQuAD and GLUE/SuperGlue benchmarks
  • 131072 steps with learning rate of 10-3
  • Input sequences have combined length of 65,536 tokens
  • Dropout rate of 0.1 on layer outputs, feed-forward hidden-layers and attention weights
  • Embedding matrices fixed during fine-tuning

Conclusions

  • Extended GLU family of layers and proposed their use in Transformer
  • Better perplexities for de-noising objective used in pre-training and better results on downstream language-understanding tasks
  • Simple to implement and no apparent computational drawbacks
  • Success attributed to divine benevolence