Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

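Dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder.
The Transformer is a new, simpler network architecture based solely on attention mechanisms, with no recurrence or convolutions.
The Transformer achieves better translation quality than existing models while being more parallelizable and requiring significantly less time to train.
The Transformer sets a new state of the art on two machine translation tasks, WMT 2014 English-to-German and English-to-French.
The Transformer also generalizes to English constituency parsing, with both large and limited training data.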

Paper Content

Introduction

Recurrent neural networks are used for sequence modeling and transduction problems.
Recurrent models factor computation along the symbol positions of the input and output sequences.
Attention mechanisms allow modeling of dependencies without regard to their distance in the input or output sequences.
The Transformer is a model architecture that relies entirely on an attention mechanism and allows for more parallelization.
The Transformer can reach a new state of the art in translation quality after being trained for a short amount of time.

Background

Goal of reducing sequential computation
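Extended Neural GPU, ByteNet, and ConvS2S use convolutional neural networks as their basic building block
In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between them: linearly for ConvS2S and logarithmically for ByteNet
Transformer reduces this to a constant number of operations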
Self-attention used in a variety of tasks
End-to-end memory networks use recurrent attention mechanism
Transformer is the first transduction model relying entirely on self-attention

Model architecture

Most competitive neural sequence transduction models have an encoder-decoder structure.
The encoder maps an input sequence of symbols to a sequence of continuous representations.
The decoder then generates an output sequence of symbols one element at a time.
The Transformer follows this overall architecture using self-attention and point-wise, fully connected layers.

Encoder and decoder stacks

Encoder consists of 6 identical layers with two sub-layers each
Decoder consists of 6 identical layers with three sub-layers each
Sub-layers use residual connections and layer normalization
Decoder sub-layer uses masking to prevent positions from attending to subsequent positions
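
The residual connection and layer normalization around each sub-layer can be sketched as LayerNorm(x + Sublayer(x)). The snippet below is a minimal illustration in plain Python, assuming a single vector input and a layer normalization without learned gain and bias. The names `layer_norm` and `residual_sublayer` and the toy sub-layer are hypothetical, not taken from the paper.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance.

    Assumption: the learned gain and bias of a full layer norm are omitted.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_sublayer(x, sublayer):
    """Apply a sub-layer, add the input back, then normalize."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Toy sub-layer (hypothetical): scales every component by 0.5.
out = residual_sublayer([1.0, 2.0, 3.0, 4.0], lambda v: [0.5 * t for t in v])
```

The sub-layer output must have the same dimension as its input for the sum to be defined, which is why every sub-layer and the embeddings share one output dimension.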

Attention

Attention function maps query and key-value pairs to output vector
Output is weighted sum of values
Scaled Dot-Product Attention used in this paper
Input consists of queries and keys of dimension d_k, and values of dimension d_v
Two most common attention functions are additive and dot-product
Multi-head attention used in Transformer
Encoder-decoder attention, self-attention in encoder, self-attention in decoder
Masking used to prevent leftward information flow in decoder
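
As an illustration of the points above, scaled dot-product attention computes softmax(QK^T / sqrt(d_k))V, and the decoder mask sets scores for later positions to negative infinity before the softmax. The following is a minimal single-head sketch in plain Python. The function names and the `causal` flag are assumptions made for this example, and the learned projections used by multi-head attention are left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values, causal=False):
    """Return one output vector per query: a weighted sum of the values.

    queries and keys are lists of d_k-dimensional vectors, values are
    d_v-dimensional. With causal=True, query i ignores positions j > i.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if causal:
            # Mask out subsequent positions before the softmax.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values)) for c in range(d_v)])
    return outputs

q = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
masked = scaled_dot_product_attention(q, q, v, causal=True)
```

With the mask on, the first position can only attend to itself, so its output is exactly the first value vector.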

Position-wise feed-forward networks

Each layer in the encoder and decoder contains a fully connected feed-forward network.
This network consists of two linear transformations with a ReLU activation in between.
The linear transformations are the same across positions but use different parameters from layer to layer.
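
The feed-forward network described above can be written as FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position separately. Here is a minimal sketch in plain Python with toy dimensions. The helper names `linear` and `feed_forward` and the small weight matrices are made up for illustration; the paper's base model uses much larger dimensions (512 for the model, 2048 for the inner layer).

```python
def linear(x, weight, bias):
    """Compute x @ weight + bias for one vector x.

    weight has len(x) rows and len(bias) columns.
    """
    return [
        sum(xi * row[j] for xi, row in zip(x, weight)) + bias[j]
        for j in range(len(bias))
    ]

def feed_forward(x, w1, b1, w2, b2):
    """Two linear transformations with a ReLU in between."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)

# Toy parameters (hypothetical): model dimension 2, inner dimension 3.
w1 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b2 = [0.5, 0.5]
y = feed_forward([1.0, -1.0], w1, b1, w2, b2)
```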

Embeddings and softmax

Model uses learned embeddings to convert input and output tokens to vectors of dimension d_model
Model uses linear transformation and softmax function to convert decoder output to predicted next-token probabilities
Weight matrix is shared between embedding layers and pre-softmax linear transformation

Positional encoding

Model does not contain recurrence or convolution, so positional encodings are added to input embeddings to make use of sequence order.
Positional encodings have same dimension as embeddings, so they can be summed.
Positional encodings use sine and cosine functions of different frequencies.
Experiments with learned positional embeddings produced similar results.
Sinusoidal version chosen as it may allow model to extrapolate to longer sequences.
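
The sinusoidal encoding uses PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)). A minimal sketch in plain Python follows. The function name `positional_encoding` is chosen here for illustration, and the tiny `d_model` of 4 only keeps the example readable.

```python
import math

def positional_encoding(position, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position.

    Even dimensions use sine, odd dimensions use cosine, and each
    sine/cosine pair shares one frequency.
    """
    encoding = []
    for dim in range(d_model):
        pair_index = dim // 2
        angle = position / (10000 ** (2 * pair_index / d_model))
        encoding.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
    return encoding

first = positional_encoding(0, 4)   # position 0: sines are 0, cosines are 1
second = positional_encoding(1, 4)
```

Because the encoding has the same length as the embedding, the two vectors can be added element by element.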

Why self-attention

Self-attention layers compared to recurrent and convolutional layers
Three desiderata for self-attention layers: total computational complexity, amount of computation that can be parallelized, and path length between long-range dependencies
Self-attention layers have shorter paths between long-range dependencies than recurrent and convolutional layers
Self-attention layers are faster than recurrent layers when sequence length is smaller than representation dimensionality
Convolutional layers require a stack of layers to connect all pairs of input and output positions
Separable convolutions decrease complexity of convolutional layers
Self-attention layers may yield more interpretable models
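
To make the speed comparison concrete, the paper's per-layer complexity figures are O(n^2 * d) for self-attention, O(n * d^2) for a recurrent layer and O(k * n * d^2) for a convolutional layer, where n is the sequence length, d the representation dimension and k the kernel width. The sketch below simply evaluates those expressions without constant factors. The function name and the example sizes are assumptions for illustration.

```python
def per_layer_operations(n, d, k=3):
    """Rough per-layer operation counts, ignoring constant factors."""
    return {
        "self_attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
    }

# Short sequence relative to the representation size: self-attention is cheaper.
short = per_layer_operations(n=50, d=512)
# Long sequence relative to the representation size: self-attention costs more.
long_seq = per_layer_operations(n=1000, d=512)
```

The crossover sits at n = d, which matches the statement that self-attention is faster than recurrence when the sequence length is smaller than the representation dimensionality.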

Training

This section describes the training regime for the models.

Training data and batching

Trained on WMT 2014 English-German dataset with 4.5 million sentence pairs
Used byte-pair encoding with 37000 tokens
Used WMT 2014 English-French dataset with 36M sentences and 32000 word-piece vocabulary
Sentence pairs batched together by approximate sequence length, each batch containing approximately 25000 source tokens and 25000 target tokens

Hardware and schedule

Trained models on one machine with 8 NVIDIA P100 GPUs
Base models took 0.4 seconds per training step, trained for 100,000 steps (12 hours)
Big models took 1.0 seconds per training step, trained for 300,000 steps (3.5 days)
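
The stated training times follow from the step counts and step durations. A quick arithmetic check in Python (the helper name `training_time_hours` is made up for this example):

```python
def training_time_hours(steps, seconds_per_step):
    """Total wall-clock training time in hours."""
    return steps * seconds_per_step / 3600

base_hours = training_time_hours(100_000, 0.4)      # a little over 11 hours
big_days = training_time_hours(300_000, 1.0) / 24   # a little under 3.5 days
```

Both values round to the reported figures of 12 hours and 3.5 days.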

Optimizer

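Used Adam optimizer with β1 = 0.9, β2 = 0.98 and ε = 10^-9
Varied learning rate over training according to the formula lrate = d_model^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5)
Increased learning rate linearly for first 4000 training steps (warmup_steps = 4000), then decreased it proportionally to the inverse square root of the step number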
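
The schedule can be sketched directly from the paper's formula, lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). The function name `learning_rate` is an assumption for this example, and the defaults use the base model's d_model of 512. Steps are counted from 1, since step 0 would raise a division error.

```python
def learning_rate(step, d_model=512, warmup_steps=4000):
    """Warmup-then-decay learning rate for a training step >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

halfway = learning_rate(2000)   # still in the linear warmup
peak = learning_rate(4000)      # the two terms meet at warmup_steps
later = learning_rate(16000)    # inverse square root decay
```

During warmup the rate grows linearly, so the value at step 2000 is half the peak; after warmup, quadrupling the step count halves the rate.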

Regularization

Regularization during training includes residual dropout and label smoothing.
Dropout is applied to the output of each sub-layer before it is added to the sub-layer input and normalized, and to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. The base model uses a dropout rate of 0.1.
Label smoothing of value ε_ls = 0.1 is used during training, which hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
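
Label smoothing replaces the one-hot training target with a softened distribution. The sketch below assumes the common formulation that mixes the one-hot target with a uniform distribution over all classes. The summary does not spell out the exact variant used, so treat this as an illustration rather than the authors' implementation. The function name `smooth_labels` is hypothetical.

```python
def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution.

    Assumption: q'(k) = (1 - epsilon) * one_hot(k) + epsilon / num_classes.
    """
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) * (1.0 if i == target_index else 0.0) + uniform
        for i in range(num_classes)
    ]

# Toy vocabulary of 5 tokens with the correct token at index 2.
target = smooth_labels(2, 5)
```

The correct token keeps most of the probability mass, and the remainder is spread evenly, which discourages the model from becoming overconfident.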

Machine translation

Transformer (big) model achieved a BLEU score of 28.4 on WMT 2014 English-to-German translation task
Transformer (big) model achieved a BLEU score of 41.0 on WMT 2014 English-to-French translation task
Base model outperformed all previously published models and ensembles at a fraction of the training cost

Model variations

Varying the base model and measuring the change in performance on English-to-German translation.
Quality drops off with too many attention heads.
Reducing the attention key size hurts model quality.
Bigger models are better and dropout is helpful.
Replacing positional encoding with learned positional embeddings yields nearly identical results.

English constituency parsing

Evaluated if Transformer can generalize to other tasks using English constituency parsing
Output is subject to strong structural constraints and is longer than input
RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes
Trained 4-layer Transformer with d_model = 1024 on Wall Street Journal portion of Penn Treebank
Also trained in semi-supervised setting using larger high-confidence and BerkeleyParser corpora
Used 16K tokens for WSJ only setting and 32K tokens for semi-supervised setting
Performed small number of experiments to select dropout (both attention and residual), learning rates and beam size
Transformer outperforms BerkeleyParser even when training only on WSJ training set of 40K sentences

Conclusion

Transformer is a sequence transduction model based on attention, replacing recurrent layers in encoder-decoder architectures.
Transformer can be trained faster than recurrent or convolutional architectures.
Transformer achieves new state of the art on English-to-German and English-to-French translation tasks.
