Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Dominant sequence transduction models use complex neural networks.
- Transformer is a new, simpler network architecture based on attention mechanisms.
- Transformer performs better than existing models and is more parallelizable.
- Transformer achieves good results on two machine translation tasks.
- Transformer is successful in English constituency parsing.
Paper Content
Introduction
- Recurrent neural networks are used for sequence modeling and transduction problems.
- Recurrent models factor computation along the symbol positions of the input and output sequences.
- Attention mechanisms allow modeling of dependencies without regard to their distance in the input or output sequences.
- The Transformer is a model architecture that relies entirely on an attention mechanism and allows for more parallelization.
- The Transformer can reach a new state of the art in translation quality after being trained for a short amount of time.
Background
- Goal of reducing sequential computation
- Extended Neural GPU, ByteNet, and ConvS2S use convolutional neural networks
- Number of operations required to relate signals from two arbitrary input or output positions grows with distance
- Transformer reduces to a constant number of operations
- Self-attention used in a variety of tasks
- End-to-end memory networks use recurrent attention mechanism
- Transformer is the first transduction model relying entirely on self-attention
Model architecture
- Most competitive neural sequence transduction models have an encoder-decoder structure.
- The encoder maps an input sequence of symbols to a sequence of continuous representations.
- The decoder then generates an output sequence of symbols one element at a time.
- The Transformer follows this overall architecture using self-attention and point-wise, fully connected layers.
Encoder and decoder stacks
- Encoder consists of 6 identical layers with two sub-layers each
- Decoder consists of 6 identical layers with three sub-layers each
- Sub-layers use residual connections and layer normalization
- Decoder sub-layer uses masking to prevent positions from attending to subsequent positions
Attention
- Attention function maps query and key-value pairs to output vector
- Output is weighted sum of values
- Scaled Dot-Product Attention used in this paper
- Input consists of queries, keys, and values of different dimensions
- Two most common attention functions are additive and dot-product
- Multi-head attention used in Transformer
- Encoder-decoder attention, self-attention in encoder, self-attention in decoder
- Masking used to prevent leftward information flow in decoder
Position-wise feed-forward networks
- Each layer in the encoder and decoder contains a fully connected feed-forward network.
- This network consists of two linear transformations with a ReLU activation in between.
- The linear transformations use different parameters from layer to layer.
Embeddings and softmax
- Model uses learned embeddings to convert input and output tokens to vectors of dimension d model
- Model uses linear transformation and softmax function to convert decoder output to predicted next-token probabilities
- Weight matrix is shared between embedding layers and pre-softmax linear transformation
Positional encoding
- Model does not contain recurrence or convolution, so positional encodings are added to input embeddings to make use of sequence order.
- Positional encodings have same dimension as embeddings, so they can be summed.
- Positional encodings use sine and cosine functions of different frequencies.
- Experiments with learned positional embeddings produced similar results.
- Sinusoidal version chosen as it may allow model to extrapolate to longer sequences.
Why self-attention
- Self-attention layers compared to recurrent and convolutional layers
- Three desiderata for self-attention layers: total computational complexity, amount of computation that can be parallelized, and path length between long-range dependencies
- Self-attention layers have shorter paths between long-range dependencies than recurrent and convolutional layers
- Self-attention layers are faster than recurrent layers when sequence length is smaller than representation dimensionality
- Convolutional layers require a stack of layers to connect all pairs of input and output positions
- Separable convolutions decrease complexity of convolutional layers
- Self-attention layers may yield more interpretable models
Training
- Training regime described
- Models trained
Training data and batching
- Trained on WMT 2014 English-German dataset with 4.5 million sentence pairs
- Used byte-pair encoding with 37000 tokens
- Used WMT 2014 English-French dataset with 36M sentences and 32000 word-piece vocabulary
- Sentence pairs batched together by approximate sequence length, each batch containing 25000 source and 25000 target tokens
Hardware and schedule
- Trained models on one machine with 8 NVIDIA P100 GPUs
- Base models took 0.4 seconds per training step, trained for 100,000 steps (12 hours)
- Big models took 1.0 seconds per training step, trained for 300,000 steps (3.5 days)
Optimizer
- Used Adam optimizer with specific parameters
- Varied learning rate over training according to formula
- Increased learning rate linearly for first 4000 training steps, decreased thereafter
Regularization
- Three types of regularization are used during training: residual dropout, label smoothing, and results.
- Dropout is applied to the output of each sub-layer, the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.
- Label smoothing of value ls = 0.1 is used during training, which hurts perplexity but improves accuracy and BLEU score.
Machine translation
- Transformer (big) model achieved a BLEU score of 28.4 on WMT 2014 English-to-German translation task
- Transformer (big) model achieved a BLEU score of 41.0 on WMT 2014 English-to-French translation task
- Base model outperformed all previously published models and ensembles at a fraction of the training cost
Model variations
- Varying the base model and measuring the change in performance on English-to-German translation.
- Quality drops off with too many attention heads.
- Reducing the attention key size hurts model quality.
- Bigger models are better and dropout is helpful.
- Replacing positional encoding with learned positional embeddings yields nearly identical results.
English constituency parsing
- Evaluated if Transformer can generalize to other tasks using English constituency parsing
- Output is subject to strong structural constraints and is longer than input
- RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes
- Trained 4-layer transformer with d model = 1024 on Wall Street Journal portion of Penn Treebank
- Trained in semi-supervised setting using larger high-confidence and BerkleyParser corpora
- Used 16K tokens for WSJ only setting and 32K tokens for semi-supervised setting
- Performed small number of experiments to select dropout, attention and residual, learning rates and beam size
- Transformer outperforms Berkeley-Parser even when training only on WSJ training set of 40K sentences
Conclusion
- Transformer is a sequence transduction model based on attention, replacing recurrent layers in encoder-decoder architectures.
- Transformer can be trained faster than recurrent or convolutional architectures.
- Transformer achieves new state of the art on English-to-German and English-to-French translation tasks.