Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • Dominant sequence transduction models use complex neural networks.
  • Transformer is a new, simpler network architecture based on attention mechanisms.
  • Transformer performs better than existing models and is more parallelizable.
  • Transformer achieves state-of-the-art BLEU scores on two machine translation tasks, WMT 2014 English-to-German and English-to-French.
  • Transformer is successful in English constituency parsing.

Paper Content


  • Recurrent neural networks are used for sequence modeling and transduction problems.
  • Recurrent models factor computation along the symbol positions of the input and output sequences.
  • Attention mechanisms allow modeling of dependencies without regard to their distance in the input or output sequences.
  • The Transformer is a model architecture that relies entirely on an attention mechanism and allows for more parallelization.
  • The Transformer can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.


  • Goal of reducing sequential computation
  • Extended Neural GPU, ByteNet, and ConvS2S use convolutional neural networks
  • In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between them (linearly for ConvS2S, logarithmically for ByteNet)
  • Transformer reduces this to a constant number of operations
  • Self-attention used in a variety of tasks
  • End-to-end memory networks use recurrent attention mechanism
  • Transformer is the first transduction model relying entirely on self-attention

Model architecture

  • Most competitive neural sequence transduction models have an encoder-decoder structure.
  • The encoder maps an input sequence of symbols to a sequence of continuous representations.
  • The decoder then generates an output sequence of symbols one element at a time.
  • The Transformer follows this overall architecture using self-attention and point-wise, fully connected layers.

Encoder and decoder stacks

  • Encoder consists of 6 identical layers with two sub-layers each
  • Decoder consists of 6 identical layers with three sub-layers each
  • Sub-layers use residual connections and layer normalization
  • Decoder sub-layer uses masking to prevent positions from attending to subsequent positions
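
The residual-plus-normalization wrapper around each sub-layer can be sketched as LayerNorm(x + Sublayer(x)). The snippet below is a minimal pure-Python illustration for a single position vector. The helper names, the `eps` value, and the omission of the learned gain and bias are assumptions made for brevity, not details taken from this summary.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_sublayer(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for a single position vector x."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical sub-layer that simply doubles each feature.
out = residual_sublayer([1.0, 2.0, 3.0, 4.0], lambda v: [2.0 * t for t in v])
```

Because the residual sum requires matching shapes, every sub-layer has to produce outputs of the same dimension as its input.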


  • Attention function maps query and key-value pairs to output vector
  • Output is weighted sum of values
  • Scaled Dot-Product Attention used in this paper
  • Input consists of queries and keys of dimension d_k, and values of dimension d_v
  • Two most common attention functions are additive and dot-product
  • Multi-head attention used in Transformer
  • Multi-head attention is used in three ways: encoder-decoder attention, self-attention in encoder, self-attention in decoder
  • Masking used to prevent leftward information flow in decoder
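
As a rough illustration of the bullets above, scaled dot-product attention computes softmax(QK^T / sqrt(d_k)) V, and the decoder mask sets the scores of later positions to negative infinity before the softmax. The sketch below uses plain Python lists. The function names and the `causal` flag are hypothetical, and multi-head projection is left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v, all as lists of lists."""
    d_k = len(K[0])
    d_v = len(V[0])
    out = []
    for i, q in enumerate(Q):
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Position i may not attend to positions after i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(d_v)])
    return out

Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
full = scaled_dot_product_attention(Q, K, V)
masked = scaled_dot_product_attention(Q, K, V, causal=True)
```

With the mask enabled, the first position can only see itself, so its output equals the first value vector.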

Position-wise feed-forward networks

  • Each layer in the encoder and decoder contains a fully connected feed-forward network, applied to each position separately and identically.
  • This network consists of two linear transformations with a ReLU activation in between.
  • The linear transformations are the same across positions but use different parameters from layer to layer.
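
The network described above can be written as FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently to every position. The following is a minimal sketch with tiny, made-up weights. The helper names and the toy dimensions are assumptions chosen only to keep the example readable.

```python
def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, t) for t in v]

def linear(x, W, b):
    """Compute x W + b, where W has shape d_in x d_out."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same two-layer network to each position vector in X."""
    return [linear(relu(linear(x, W1, b1)), W2, b2) for x in X]

# Toy weights (hypothetical): inner layer of width 2, model width 1.
W1, b1 = [[1.0, -1.0]], [0.0, 0.0]
W2, b2 = [[1.0], [1.0]], [0.0]
ffn_out = position_wise_ffn([[2.0], [-3.0]], W1, b1, W2, b2)
```

With these particular weights the network happens to compute the absolute value of its input, which makes the role of the ReLU easy to see.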

Embeddings and softmax

  • Model uses learned embeddings to convert input and output tokens to vectors of dimension d_model
  • Model uses linear transformation and softmax function to convert decoder output to predicted next-token probabilities
  • Weight matrix is shared between embedding layers and pre-softmax linear transformation
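
One way to picture the weight sharing is that a single matrix E serves both as the embedding lookup table and, transposed, as the pre-softmax projection. The sketch below assumes a toy vocabulary of three tokens and omits any bias or scaling. All names are hypothetical.

```python
import math

def embed(token_id, E):
    """Look up the embedding row for a token (E has shape vocab x d_model)."""
    return list(E[token_id])

def next_token_probs(hidden, E):
    """Project a decoder output onto the vocabulary with the same matrix E, then softmax."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in E]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy shared matrix: 3 tokens, d_model = 2.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vec = embed(2, E)
probs = next_token_probs([2.0, 0.0], E)
```

Sharing the matrix means the parameter count for the vocabulary is paid once rather than separately for input and output.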

Positional encoding

  • Model does not contain recurrence or convolution, so positional encodings are added to input embeddings to make use of sequence order.
  • Positional encodings have same dimension as embeddings, so they can be summed.
  • Positional encodings use sine and cosine functions of different frequencies.
  • Experiments with learned positional embeddings produced similar results.
  • Sinusoidal version chosen as it may allow model to extrapolate to longer sequences.
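
The sinusoidal scheme can be sketched as PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)). The function name and the list-of-lists layout below are illustrative assumptions.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal positional encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for k in range(d_model):
            i = k // 2  # index of the sine/cosine pair
            angle = pos / (10000 ** (2 * i / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(max_len=4, d_model=8)
```

Each row has the same width as a token embedding, so it can be added to the embedding element by element.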

Why self-attention

  • Self-attention layers compared to recurrent and convolutional layers
  • Three criteria motivate the comparison: total computational complexity per layer, amount of computation that can be parallelized, and path length between long-range dependencies in the network
  • Self-attention layers have shorter paths between long-range dependencies than recurrent and convolutional layers
  • Self-attention layers are faster than recurrent layers when sequence length is smaller than representation dimensionality
  • Convolutional layers require a stack of layers to connect all pairs of input and output positions
  • Separable convolutions decrease complexity of convolutional layers
  • Self-attention layers may yield more interpretable models
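
To make the speed comparison concrete, the sketch below uses the usual order-of-magnitude per-layer operation counts: n^2 * d for self-attention, n * d^2 for a recurrent layer, and k * n * d^2 for a convolution of kernel width k. Constants are dropped, and the function name and the default k = 3 are assumptions for illustration.

```python
def per_layer_ops(n, d, k=3):
    """Rough per-layer operation counts for sequence length n and dimension d."""
    return {
        "self_attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
    }

short_seq = per_layer_ops(n=50, d=512)    # n < d: self-attention is cheaper
long_seq = per_layer_ops(n=1000, d=512)   # n > d: the recurrent layer is cheaper
```

The crossover sits at n = d, which matches the condition stated in the list above.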


  • This section describes the training regime for the models

Training data and batching

  • Trained on WMT 2014 English-German dataset with 4.5 million sentence pairs
  • Sentences encoded using byte-pair encoding with a shared source-target vocabulary of about 37000 tokens
  • Also used the larger WMT 2014 English-French dataset with 36M sentences and a 32000 word-piece vocabulary
  • Sentence pairs batched together by approximate sequence length, each batch containing approximately 25000 source tokens and 25000 target tokens

Hardware and schedule

  • Trained models on one machine with 8 NVIDIA P100 GPUs
  • Base models took 0.4 seconds per training step, trained for 100,000 steps (12 hours)
  • Big models took 1.0 seconds per training step, trained for 300,000 steps (3.5 days)
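
The quoted wall-clock times follow directly from step time multiplied by step count. A quick check of the arithmetic (the function name is made up for illustration):

```python
def training_hours(seconds_per_step, steps):
    """Total training time in hours."""
    return seconds_per_step * steps / 3600

base_hours = training_hours(0.4, 100_000)      # roughly 11.1 hours
big_days = training_hours(1.0, 300_000) / 24   # roughly 3.47 days
```

Both figures round to the 12 hours and 3.5 days given in the list.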


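  • Used Adam optimizer with β1 = 0.9, β2 = 0.98, and ε = 10^-9
  • Varied learning rate over training according to lrate = d_model^-0.5 · min(step_num^-0.5, step_num · warmup_steps^-1.5)
  • Increased learning rate linearly for first warmup_steps = 4000 training steps, decreased thereafter proportionally to the inverse square root of the step number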
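
The warmup-then-decay schedule can be sketched as lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). The default d_model = 512 below is an assumption (the base-model width), and the function name is hypothetical. Steps are counted from 1.

```python
def learning_rate(step, d_model=512, warmup_steps=4000):
    """Linear warmup for warmup_steps steps, then inverse-square-root decay."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = learning_rate(4000)      # the two branches meet at the end of warmup
early = learning_rate(2000)     # halfway through warmup: half the peak
late = learning_rate(16000)     # four times the warmup length: half the peak
```

The peak value scales with d_model^-0.5, so wider models get a smaller maximum learning rate under this rule.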


  • Regularization used during training includes residual dropout and label smoothing.
  • Dropout is applied to the output of each sub-layer, before it is added to the sub-layer input and normalized, and to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.
  • Label smoothing of value ε_ls = 0.1 is used during training, which hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
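
Label smoothing replaces the one-hot training target with a softened distribution. The sketch below uses one common formulation, assumed here rather than taken from this summary, in which a fraction eps of the probability mass is spread uniformly over the whole vocabulary. The function names are hypothetical.

```python
import math

def smooth_labels(target_index, vocab_size, eps=0.1):
    """Return a target distribution with eps of the mass spread uniformly."""
    dist = [eps / vocab_size] * vocab_size
    dist[target_index] += 1.0 - eps
    return dist

def cross_entropy(target_dist, predicted_probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target_dist, predicted_probs) if t > 0)

dist = smooth_labels(target_index=2, vocab_size=4)
loss = cross_entropy(dist, [0.1, 0.1, 0.7, 0.1])
```

Because the target is never fully confident, the model is discouraged from predicting probabilities close to 1, which is consistent with the perplexity cost noted above.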

Machine translation

  • Transformer (big) model achieved a BLEU score of 28.4 on WMT 2014 English-to-German translation task
  • Transformer (big) model achieved a BLEU score of 41.0 on WMT 2014 English-to-French translation task
  • On English-to-German, even the base model outperformed all previously published models and ensembles at a fraction of the training cost

Model variations

  • Components of the base model were varied to measure the change in performance on English-to-German translation.
  • Single-head attention is worse than the best setting, and quality also drops off with too many attention heads.
  • Reducing the attention key size hurts model quality.
  • Bigger models are better and dropout is helpful.
  • Replacing positional encoding with learned positional embeddings yields nearly identical results.

English constituency parsing

  • Evaluated if Transformer can generalize to other tasks using English constituency parsing
  • Output is subject to strong structural constraints and is longer than input
  • RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes
  • Trained 4-layer Transformer with d_model = 1024 on Wall Street Journal (WSJ) portion of Penn Treebank
  • Also trained in semi-supervised setting using larger high-confidence and BerkeleyParser corpora
  • Used vocabulary of 16K tokens for WSJ-only setting and 32K tokens for semi-supervised setting
  • Performed small number of experiments to select dropout (both attention and residual), learning rates, and beam size
  • Transformer outperforms BerkeleyParser even when training only on WSJ training set of 40K sentences


  • Transformer is a sequence transduction model based entirely on attention, replacing the recurrent layers in encoder-decoder architectures with multi-headed self-attention.
  • For translation tasks, Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers.
  • Transformer achieves new state of the art on English-to-German and English-to-French translation tasks.