Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

A Gated Linear Unit (GLU) is the component-wise product of two linear projections, one of which is first passed through a sigmoid function.
Variations on GLU can use different nonlinear or linear functions instead of sigmoid.
GLU variants are tested in the feed-forward sublayers of the Transformer sequence-to-sequence model.
Some GLU variants yield quality improvements over ReLU or GELU activations.

Transformer sequence-to-sequence model alternates between multi-head attention and position-wise feed-forward networks
Feed-forward networks take a vector x and pass it through two linear transformations
Rectified-linear activation function applied between the two linear transformations
Gated Linear Units (GLU) were introduced as a neural network layer by Dauphin et al., 2016
Variations on the Transformer FFN layer use GLU or one of its variants in place of the first linear transformation and the activation function
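The two layer types described above can be sketched in plain Python. This is an illustrative sketch rather than the paper's code: the helper names (linear, ffn, ffn_glu), the toy dimensions and the random weights are all assumptions made here, and bias terms are left out to keep the example short. The gate function is a parameter, so swapping sigmoid for another function (or for the identity) gives a different variant.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(z, 0.0)


def identity(z):
    return z


def linear(x, weights):
    """Multiply a vector x (length n) by an n x m matrix given as a list of rows."""
    n_out = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(n_out)]


def ffn(x, w1, w2, act=relu):
    """Standard FFN: an activation applied between two linear transformations."""
    hidden = [act(h) for h in linear(x, w1)]
    return linear(hidden, w2)


def ffn_glu(x, w, v, w2, gate=sigmoid):
    """GLU-style FFN: gate(xW) times xV, component-wise, then a final linear map."""
    gated = [gate(a) * b for a, b in zip(linear(x, w), linear(x, v))]
    return linear(gated, w2)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumptions for illustration); the GLU hidden size is 2/3 of the
# standard one so that both layers hold the same number of weights.
rng = random.Random(0)
d_model, d_ff, d_ff_glu = 4, 12, 8
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

y_relu = ffn(x, random_matrix(d_model, d_ff, rng), random_matrix(d_ff, d_model, rng))
y_glu = ffn_glu(
    x,
    random_matrix(d_model, d_ff_glu, rng),
    random_matrix(d_model, d_ff_glu, rng),
    random_matrix(d_ff_glu, d_model, rng),
)
```

Passing gate=relu or gate=identity to ffn_glu gives a ReLU-gated or purely bilinear layer. The GLU layer uses three weight matrices where the standard FFN uses two.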

FFN variants tested on transfer-learning setup from Raffel et al., 2019
Encoder-decoder transformer model trained on denoising objective of predicting missing text segments
Fine-tuned on various language understanding tasks

Used same code base, model architecture, and training task as base model from Raffel et al., 2019
Encoder and decoder each have 12 layers, with d_model = 768
Attention layers have h = 12 and d_k = d_v = 64
FFN layers have hidden size d_ff = 3072
GLU-variant-based FFN layers have hidden size d_ff = 2048 to maintain the same parameter and operation counts as the base model
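The choice of 2048 can be checked with a little arithmetic. The sketch below assumes the FFN layers carry no bias terms, so a standard FFN holds two weight matrices and a GLU-variant FFN holds three; the variable names are chosen here for illustration.

```python
d_model = 768
d_ff_base = 3072   # hidden size of the standard FFN
d_ff_glu = 2048    # hidden size of the GLU-variant FFN
num_heads, d_k = 12, 64

# Standard FFN: d_model x d_ff and d_ff x d_model weight matrices.
base_params = 2 * d_model * d_ff_base

# GLU-variant FFN: two input projections plus one output projection.
glu_params = 3 * d_model * d_ff_glu

print(base_params, glu_params)   # both are 4718592
print(num_heads * d_k)           # 768, matching d_model
```

Shrinking the hidden size by a factor of 2/3 offsets the extra projection exactly.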

Pre-trained for 524,288 steps on the span-filling objective on the C4 dataset
Each training batch consists of 128 examples, each with an input of 512 tokens and an output of 114 tokens
Used the Adafactor optimizer and an inverse-square-root learning-rate schedule
Learning rate decayed linearly for the final 10 percent of training steps
No dropout during pre-training
Log-perplexity on the training objective used to measure model quality
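A minimal sketch of a schedule with this shape follows. The notes above give only the overall form, so the details are assumptions: a constant phase of 10,000 steps before the inverse-square-root decay begins, and a final phase that scales the rate linearly down to zero. The function name learning_rate is also hypothetical.

```python
import math

TOTAL_STEPS = 524_288
WARMUP_STEPS = 10_000    # assumption: not stated in the notes above
DECAY_FRACTION = 0.10    # final 10 percent of training steps


def learning_rate(step, total_steps=TOTAL_STEPS,
                  warmup_steps=WARMUP_STEPS, decay_fraction=DECAY_FRACTION):
    """Inverse-square-root schedule with a linear ramp-down at the end."""
    # Constant until warmup_steps, then proportional to 1 / sqrt(step).
    base = 1.0 / math.sqrt(max(step, warmup_steps))

    # Assumption: during the final phase the rate is scaled linearly to zero.
    decay_start = int(total_steps * (1.0 - decay_fraction))
    if step <= decay_start:
        return base
    remaining = (total_steps - step) / (total_steps - decay_start)
    return base * max(remaining, 0.0)


# Each pre-training batch holds 128 * 512 = 65,536 input tokens.
tokens_per_batch = 128 * 512
```

With these assumed settings the rate stays at 0.01 for the first 10,000 steps, halves by step 40,000 and reaches zero at the last step.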

Fine-tuned each model on SQuAD and the GLUE and SuperGLUE benchmarks
131,072 steps with a learning rate of 10^-3
Input sequences for each step have a combined length of about 65,536 tokens, as in pre-training (128 examples of 512 tokens)
Dropout rate of 0.1 on layer outputs, feed-forward hidden-layers and attention weights
Embedding matrices fixed during fine-tuning

Extended GLU family of layers and proposed their use in Transformer
Better perplexities for de-noising objective used in pre-training and better results on downstream language-understanding tasks
Simple to implement and no apparent computational drawbacks
Success attributed to divine benevolence