Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Position modeling is important for Transformers.
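  • This paper focuses on training on short texts and evaluating on longer sequences.
  • Attention resolution is defined as an indicator of extrapolation, and two designs are proposed to improve it: a relative position encoding and blockwise causal attention.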

Paper Content

Introduction

  • Transformer (Vaswani et al., 2017) is a universal choice for NLP
  • Most Transformers can only handle inputs whose length falls within the training distribution
  • Length-extrapolatable Transformer is essential for wider usage
  • Position information plays a crucial role in sequence modeling
  • Vaswani et al. (2017) proposed an absolute sinusoidal position embedding, and Devlin et al. (2019) replaced it with a learnable one
  • Relative position embeddings (Shaw et al., 2018; Su et al., 2021; Press et al., 2021) are more effective
  • Design principles of Transformers for position modeling include sensitivity to order, robustness to position translation, and the ability to handle any input length

Order variance

  • Transformer captures long-term dependency efficiently
  • In self-attention, the distance between any two tokens is 1
  • A Transformer without position information is a bag-of-words model
  • Position information is essential for sequence modeling
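
The bag-of-words point can be checked directly. The sketch below is a simplified single-head attention with identity projections and no position signal (both simplifications are assumptions made for illustration, not the paper's setup). Shuffling the input tokens only shuffles the outputs: each token receives the same vector wherever it sits.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head attention with identity projections and no position signal."""
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 2.0], [1.5, -0.5]]  # toy embeddings (hypothetical values)
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# The token now at new_pos came from old_pos; its output vector is unchanged.
same = all(
    math.isclose(out_shuffled[new_pos][j], out[old_pos][j], abs_tol=1e-9)
    for new_pos, old_pos in enumerate(perm)
    for j in range(2)
)
print(same)
```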

Translation invariance

  • The representation of a sequence should be robust to position translation.
  • The meaning of a sentence is invariant to padding before or after the whole sentence.
  • Relative position encodings, unlike absolute ones, are translation invariant.
  • Absolute sinusoidal embedding has a similar property, but adding it to the initial word embedding disturbs the attention weights.
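
A small numerical check of the last two points, under simplifying assumptions: vectors are 2-D and written as complex numbers, projections are the identity, and `THETA` is a hypothetical frequency. A rotation-based relative encoding gives the same score when both positions are shifted by the same offset, while an additive absolute sinusoidal embedding leaves position-dependent cross terms in the dot product.

```python
import cmath
import math

THETA = 0.3  # hypothetical rotation frequency

def rotary_score(q, k, m, n):
    """Rotate query and key by their positions, then take the 2-D dot product."""
    q_rot = q * cmath.exp(1j * THETA * m)
    k_rot = k * cmath.exp(1j * THETA * n)
    return (q_rot * k_rot.conjugate()).real

def additive_score(q, k, m, n):
    """Add a sinusoidal position vector to each input before the dot product."""
    q_abs = q + cmath.exp(1j * THETA * m)
    k_abs = k + cmath.exp(1j * THETA * n)
    return (q_abs * k_abs.conjugate()).real

q, k = 1.0 + 2.0j, 0.5 - 1.0j
shift = 10

rotary_before = rotary_score(q, k, 3, 1)
rotary_after = rotary_score(q, k, 3 + shift, 1 + shift)
additive_before = additive_score(q, k, 3, 1)
additive_after = additive_score(q, k, 3 + shift, 1 + shift)

print(math.isclose(rotary_before, rotary_after, abs_tol=1e-9))
print(abs(additive_before - additive_after) > 1e-3)
```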

Length extrapolation

  • Pre-training costs are increasing due to larger models and corpora.
  • Learnable absolute position embedding cannot extrapolate.
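  • Performance of almost every position embedding drops significantly when evaluated on sequences longer than those seen in training.
  • Alibi solves the problem by adding an exponential decay on the attention matrix.
  • Alibi’s performance is lower than that of other relative strategies.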
  • Position embedding is a crucial component for extrapolation.
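
To make the decay concrete: Alibi subtracts a penalty proportional to the query-key distance from the attention logits, which after the softmax becomes an exponential decay in the weights. The sketch below assumes a single query at the last position and a hypothetical slope of 0.5. With equal raw scores, each step further away scales the weight by exp(-slope).

```python
import math

def alibi_weights(scores, slope):
    """Causal attention weights for the last query with a linear distance penalty."""
    last = len(scores) - 1
    biased = [s - slope * (last - j) for j, s in enumerate(scores)]
    m = max(biased)
    exps = [math.exp(b - m) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

slope = 0.5  # hypothetical value, chosen only for illustration
weights = alibi_weights([0.0] * 6, slope)

# Ratio between a key and its nearer neighbour is constant: exp(-slope).
ratios = [weights[j] / weights[j + 1] for j in range(5)]
```

Distant tokens are never masked out, but their weight shrinks geometrically, which is the "soft window" behaviour.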

A length-extrapolatable transformer

  • Attention resolution is an indicator of length extrapolation
  • Two ways to maximize attention resolution: relative position encoding and blockwise causal masking
  • Proposed architecture is called Length-Extrapolatable (LEX) Transformer

Attention resolution

  • Monotonicity of attention scores is essential to represent distance in language models.
  • Softmax operation is used to simulate the attention probability.
  • Su et al. (2021) proposed adding absolute position embedding on query and key to encode relative position information.

Improve resolution by position encoding

  • A simple solution is proposed to obtain q and k by a linear transformation
  • If ξ = 0, the form is the same as ROPE
  • Transformation provides a rotation on vectors
  • The cosine value is not monotonic once the rotating angle exceeds π
  • Expectation of inner product oscillates dramatically with growth of relative distance
  • Function is defined to represent property of relative position
  • Optimal values for ζ are found by manually setting γ and numerical optimization methods
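
The combination of rotation and decay can be sketched for a single 2-D component written as a complex number. This assumes the query is multiplied by exp((ξ + iθ)n) and the key by exp((-ξ + iθ)m); the sign convention, the helper names, and the numeric values are all assumptions for illustration. Under that assumption the score depends only on n - m, reduces to a plain rotation when ξ = 0, and is bounded by an envelope that shrinks with distance when ξ < 0.

```python
import cmath
import math

def encode_query(q, n, theta, xi):
    # Rotate by n * theta and scale by exp(xi * n).
    return q * cmath.exp((xi + 1j * theta) * n)

def encode_key(k, m, theta, xi):
    # Rotate by m * theta and apply the inverse scale exp(-xi * m).
    return k * cmath.exp((-xi + 1j * theta) * m)

def score(q, k, n, m, theta, xi):
    # Real part of q_n * conj(k_m) equals the dot product of the 2-D vectors.
    return (encode_query(q, n, theta, xi) * encode_key(k, m, theta, xi).conjugate()).real

q, k = 1.0 + 0.5j, 0.8 - 0.2j
theta, xi = 0.4, -0.05  # hypothetical frequency and decay rate

near = score(q, k, 5, 2, theta, xi)
shifted = score(q, k, 105, 102, theta, xi)   # same relative distance of 3
rope_like = score(q, k, 5, 2, theta, 0.0)    # xi = 0 removes the decay
plain_rotation = (q * k.conjugate() * cmath.exp(1j * theta * 3)).real

# The oscillating cosine term is damped by exp(xi * distance).
scores = [score(q, k, d, 0, theta, xi) for d in range(50)]
envelope = [abs(q) * abs(k) * math.exp(xi * d) for d in range(50)]
```

Without the decay factor, the scores keep oscillating at full amplitude as the distance grows, which is the non-monotonic behaviour described in the bullets above.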

Blockwise causal attention

  • Blockwise Causal Attention works for ROPE to prevent perplexity from exploding.
  • Alibi performs well without windowed attention due to its “soft window”.
  • XPOS’s perplexity increases without BCA, but it can recognize position with BCA’s constraint.
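
A minimal sketch of a blockwise causal mask, assuming each query may attend to keys in its own block and in the immediately preceding block, never to future positions. The function name and the `block_size` parameter are hypothetical, and the exact blocking scheme is an assumption rather than a detail stated in this summary.

```python
def blockwise_causal_mask(seq_len, block_size):
    """mask[i][j] is True when query i may attend to key j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i  # never look at future tokens
            near = (i // block_size) - (j // block_size) <= 1  # own or previous block
            row.append(causal and near)
        mask.append(row)
    return mask

mask = blockwise_causal_mask(seq_len=8, block_size=2)

# Largest query-key distance that is still visible anywhere in the mask.
max_distance = max(i - j for i in range(8) for j in range(8) if mask[i][j])
```

Because the visible distance is capped at 2 * block_size - 1, a model evaluated on a long sequence never sees a relative distance larger than the one it was trained on, provided the block size is chosen accordingly.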

Experiments

Pre-training

  • Pre-trained Transformer from scratch with 1024 hidden dimension, 16 heads, and 24 layers
  • Training corpus includes subset of Pile datasets
  • Training procedure implemented on 16xV100 GPUs
  • Maximal length of 1024 for saving memory and extrapolation evaluation
  • Learning rate of 3x10^-4 and polynomial decay to adjust learning rate
  • Global batch size of 512 sequences, about 0.5M tokens per batch
  • Adam optimizer with β1=0.9, β2=0.98, ε=10^-6
  • Code based on TorchScale
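
The batch figures above are consistent: 512 sequences of 1024 tokens give 524,288 tokens, roughly 0.5M. The sketch below checks that arithmetic and shows one common form of polynomial learning-rate decay. The total step count, end learning rate, and power are hypothetical, since the summary does not state them, and any warmup phase is omitted.

```python
SEQ_LEN = 1024
GLOBAL_BATCH = 512
tokens_per_batch = SEQ_LEN * GLOBAL_BATCH  # 524288, roughly 0.5M tokens

def polynomial_decay(step, total_steps, peak_lr=3e-4, end_lr=0.0, power=1.0):
    """Decay from peak_lr to end_lr over total_steps; power=1.0 is linear."""
    progress = min(step, total_steps) / total_steps
    return (peak_lr - end_lr) * (1.0 - progress) ** power + end_lr

TOTAL_STEPS = 1000  # hypothetical value for illustration
lr_start = polynomial_decay(0, TOTAL_STEPS)
lr_mid = polynomial_decay(TOTAL_STEPS // 2, TOTAL_STEPS)
lr_end = polynomial_decay(TOTAL_STEPS, TOTAL_STEPS)
```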

Language modeling

  • Measured perplexity on arXiv documents
  • Evaluated performance on different input lengths
  • Analyzed interpolation and extrapolation capability
  • XPOS had a stable advantage over other models
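
For reference, perplexity is the exponential of the average negative log-likelihood per token. The helper below is a generic sketch and not the paper's evaluation code; the input is assumed to be a list of natural-log probabilities that a model assigned to the observed tokens.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-probability over the evaluated tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that always assigns probability 0.25 to the correct token
# is as uncertain as a uniform choice among four options.
uniform = [math.log(0.25)] * 8
ppl_uniform = perplexity(uniform)
```

Comparing input lengths then amounts to computing this value over documents truncated or chunked to each length, both within and beyond the training length.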

Measuring resolution

  • Resolution is a crucial index for building an effective Transformer
  • Resolution is calculated in every layer
  • XPOS makes positions more recognizable within the training length
  • Alibi uses explicit decay to achieve stable resolution
  • Ablation on BCA shows that it helps model distinguish positions better
  • Combination of vector rotation and exponential decay is necessary for strong performance
  • Linear attention targets efficiency while underperforming vanilla Transformers
  • Sparse attention leverages structured sparsity to reduce computation
  • XPOS provides stable and accurate position modeling
  • Higher resolution indicates better ability to distinguish context tokens