Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Position modeling is important for Transformers.
- This paper focuses on training on short texts and evaluating longer sequences.
- Two designs are proposed to improve the metric of Transformers.
Paper Content
Introduction
- Transformer (Vaswani et al., 2017) is a universal choice for NLP
- Most Transformers can only deal with in-distribution size of inputs
- Length-extrapolatable Transformer is essential for wider usage
- Position information plays a crucial role in sequence modeling
- Vaswani et al. (2017) and Devlin et al. (2019) proposed absolute and learnable position embeddings
- Relative position embeddings (Shaw et al., 2018;Su et al., 2021;Press et al., 2021) are more effective
- Design principles of Transformers for position modeling include sensitivity to order, position translation, and ability to deal with any input length
Order variance
- Transformer captures long-term dependency efficiently
- Distance between every two tokens is 1
- Transformer without position information is a bag-of-word model
- Position information is essential for sequence modeling
Translation invariance
- Representation of a sequence should be robust with position translation.
- Meaning of a sentence is variant with padding before or after the whole sentence.
- Relative positions have translation invariance property instead of absolute ones.
- Absolute sinusoidal embedding has similar property, but addition operation in initial word embedding messes attention weight.
Length extrapolation
- Pre-training costs are increasing due to larger models and corpora.
- Learnable absolute position embedding cannot extrapolate.
- Performance of position embedding drops significantly.
- Alibi solves the problem by adding an exponential decay.
- Alibi’s performance is lower than other relative strategies.
- Position embedding is a crucial component for extrapolation.
A length-extrapolatable transformer
- Attention resolution is an indicator of length extrapolation
- Two ways to maximize attention resolution: relative position encoding and blockwise causal masking
- Proposed architecture is called Length-Extrapolatable (LEX) Transformer
Attention resolution
- Monotonicity of attention scores is essential to represent distance in language models.
- Softmax operation is used to simulate the attention probability.
- Su et al. (2021) proposed adding absolute position embedding on query and key to encode relative position information.
Improve resolution by position encoding
- A simple solution is proposed to obtain q and k by a linear transformation
- If ξ = 0, the form is the same as ROPE
- Transformation provides a rotation on vectors
- Cosine value is not monotony if rotating angle is larger than π
- Expectation of inner product oscillates dramatically with growth of relative distance
- Function is defined to represent property of relative position
- Optimal values for ζ are found by manually setting γ and numerical optimization methods
Blockwise causal attention
- Blockwise Causal Attention works for ROPE to prevent perplexity from exploding.
- Alibi performs well without windowed attention due to its “soft window”.
- XPOS’s perplexity increases without BCA, but it can recognize position with BCA’s constraint.
Experiments
Pre-training
- Pre-trained Transformer from scratch with 1024 hidden dimension, 16 heads, and 24 layers
- Training corpus includes subset of Pile datasets
- Training procedure implemented on 16xV100 GPUs
- Maximal length of 1024 for saving memory and extrapolation evaluation
- Learning rate of 3x10^-4 and polynomial decay to adjust learning rate
- Global batch size of 512, 0.5M token size
- Adam optimizer with β1=0.9, β2=0.98, ε=10^-6
- Code based on TorchScale
Language modeling
- Measured perplexity on arXiv documents
- Evaluated performance on different input lengths
- Analyzed interpolation and extrapolation capability
- XPOS had a stable advantage on other models
Measuring resolution
- Resolution is a crucial index for building an effective Transformer
- Resolution is calculated in every layer
- XPOS makes the position more recognizable in training length
- Alibi uses explicit decay to achieve stable resolution
- Ablation on BCA shows that it helps model distinguish positions better
- Combination of vector rotation and exponential decay is necessary for strong performance
- Linear attention targets efficiency while underperforming vanilla Transformers
- Sparse attention leverages structured sparsity to reduce computation
- XPOS provides stable and accurate position modeling
- Higher resolution indicates better ability to distinguish context tokens