Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Position modeling is important for Transformers.
This paper focuses on length extrapolation: training on short texts and evaluating on longer sequences.
Attention resolution is defined as an indicator of extrapolation, and two designs are proposed to improve it.

Paper Content

Introduction

Transformer (Vaswani et al., 2017) is a universal choice for NLP
Most Transformers can only handle inputs whose length is in-distribution
A length-extrapolatable Transformer is essential for wider usage
Position information plays a crucial role in sequence modeling
Vaswani et al. (2017) proposed absolute sinusoidal position embeddings, and Devlin et al. (2019) used learnable absolute ones
Relative position embeddings (Shaw et al., 2018; Su et al., 2021; Press et al., 2021) are more effective
Design principles for position modeling in Transformers: sensitivity to order (order variance), robustness to position translation (translation invariance), and the ability to handle any input length (length extrapolation)

Order variance

Transformers capture long-term dependencies efficiently
In self-attention, the distance between any two tokens is 1
A Transformer without position information is a bag-of-words model
Position information is essential for sequence modeling
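
The bag-of-words point can be checked numerically. The sketch below is a minimal illustration, assuming a single bidirectional attention head with identity projections and no position signal (all simplifications, not the paper's setup): shuffling the input tokens merely shuffles the outputs, so token order carries no information.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head attention with identity projections and no position signal."""
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return outputs

# Hypothetical toy embeddings for three tokens.
tokens = [[1.0, 0.0], [0.0, 2.0], [1.5, -0.5]]
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Each token receives the same output vector wherever it sits in the sequence.
same = all(
    math.isclose(out_shuffled[new][j], out[old][j], abs_tol=1e-12)
    for new, old in enumerate(perm)
    for j in range(2)
)
print(same)
```

With a causal mask the picture changes somewhat, since each token then sees a different prefix; the sketch covers only the unmasked case.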

Translation invariance

The representation of a sequence should be robust to position translation.
The meaning of a sentence does not change when padding is added before or after the whole sentence.
Relative positions, unlike absolute ones, have the translation invariance property.
Absolute sinusoidal embeddings have a similar property, but adding them to the initial word embedding disturbs the attention weights.
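
The remark about sinusoidal embeddings can be illustrated with a small sketch, assuming the standard sin/cos layout with base 10000, identity query/key projections, and a hypothetical all-ones word embedding. The inner product of two pure position vectors depends only on their offset, but once a word embedding is added the cross terms depend on absolute position.

```python
import math

def sinusoidal(pos, d=8):
    """Absolute sinusoidal embedding: (sin, cos) pairs at geometrically spaced frequencies."""
    vec = []
    for i in range(d // 2):
        theta = 10000 ** (-2 * i / d)
        vec += [math.sin(pos * theta), math.cos(pos * theta)]
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Pure position embeddings: the inner product depends only on the offset (4 here).
pure_near = dot(sinusoidal(3), sinusoidal(7))
pure_far = dot(sinusoidal(103), sinusoidal(107))

# Adding the same (hypothetical) word embedding to both tokens introduces
# word-position cross terms that depend on absolute position.
word = [1.0] * 8

def embed(pos):
    return [w + p for w, p in zip(word, sinusoidal(pos))]

mixed_near = dot(embed(3), embed(7))
mixed_far = dot(embed(103), embed(107))

print(math.isclose(pure_near, pure_far, abs_tol=1e-9))    # offset-only dependence
print(math.isclose(mixed_near, mixed_far, abs_tol=1e-9))  # broken by the addition
```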

Length extrapolation

Pre-training costs are increasing due to larger models and corpora.
Learnable absolute position embedding cannot extrapolate.
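Performance of most position embeddings drops significantly on inputs longer than those seen in training.
Alibi (Press et al., 2021) addresses the problem by adding an exponential decay to the attention matrix.
Alibi’s performance is still lower than that of other relative position strategies.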
Position embedding is a crucial component for extrapolation.
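
To see why a distance penalty acts as an exponential decay, here is a minimal sketch of an Alibi-style bias for a single query. The slope value is a hypothetical choice for illustration, and the per-head slope schedule of the actual method is not reproduced. A penalty that is linear in distance on the logits becomes a geometric decay in the attention weights after softmax.

```python
import math

def alibi_weights(scores, slope):
    """Causal attention weights for the last query, with a linear distance penalty."""
    n = len(scores)
    # Key j sits at distance (n - 1 - j) from the last query position.
    biased = [s - slope * (n - 1 - j) for j, s in enumerate(scores)]
    m = max(biased)
    exps = [math.exp(b - m) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# With identical content scores, only the distance penalty shapes the weights.
weights = alibi_weights([0.0] * 6, slope=0.5)

# Moving one step farther from the query multiplies the weight by exp(-slope).
ratios = [weights[j] / weights[j + 1] for j in range(5)]
print(ratios)
```

Because distant keys are suppressed this way, the bias behaves like a soft sliding window, which is consistent with the trade-off noted above: stable extrapolation at the cost of long-range information.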

A length-extrapolatable transformer

Attention resolution is an indicator of length extrapolation
Two ways to maximize attention resolution: relative position encoding and blockwise causal masking
Proposed architecture is called Length-Extrapolatable (LEX) Transformer

Attention resolution

Monotonicity of attention scores is essential to represent distance in language models.
Softmax operation is used to simulate the attention probability.
Su et al. (2021) proposed applying an absolute position embedding to the query and key so that their inner product encodes relative position information.
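
The summary does not reproduce the formula for attention resolution, so the sketch below is an assumption about its form rather than the paper's exact definition: it takes the attention scores at increasing relative distance, applies softmax, and sums p[i] * (p[i] - p[i+1]) over adjacent distances. Under this assumed form, scores that fall off monotonically and sharply yield a high value, while flat scores yield zero.

```python
import math

def resolution(scores):
    """Sum over distances i of p[i] * (p[i] - p[i+1]), with p = softmax(scores).

    `scores[i]` is the attention score at relative distance i.
    This is an assumed form of the metric, used for illustration only.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sum(p * (p - nxt) for p, nxt in zip(probs, probs[1:]))

steep = resolution([0.0, -1.0, -2.0, -3.0])    # sharply decreasing with distance
gentle = resolution([0.0, -0.1, -0.2, -0.3])   # weakly decreasing
flat = resolution([0.0, 0.0, 0.0, 0.0])        # positions indistinguishable

print(steep, gentle, flat)
```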

Improve resolution by position encoding

A simple solution is proposed to obtain q and k by a linear transformation
If ξ = 0, the form is the same as ROPE
The transformation rotates the vectors
The cosine is not monotonic when the rotation angle exceeds π
The expectation of the inner product oscillates dramatically as the relative distance grows
A function of the relative distance is defined to represent this property
Suitable values for ζ are found by setting γ manually and by numerical optimization methods
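
The combination of rotation and decay can be sketched with complex numbers, treating each pair of dimensions as one complex component. The assumptions here are mine: the decay base ζ is taken to correspond to e^ξ, queries are scaled by ζ^n and keys by ζ^(-m), and the θ and ζ values are hypothetical. Under these assumptions the score depends only on the relative distance n - m, and setting ζ = 1 leaves a pure rotation of the ROPE form.

```python
import cmath
import math

def encode(vec, pos, thetas, zetas, is_key=False):
    """Rotate each complex component by pos * theta and scale it by zeta ** (+/- pos)."""
    sign = -1 if is_key else 1
    return [
        x * (z ** (sign * pos)) * cmath.exp(1j * pos * t)
        for x, t, z in zip(vec, thetas, zetas)
    ]

def score(q, k):
    # Real part of the Hermitian inner product, which equals the ordinary
    # dot product of the underlying real 2-D pairs.
    return sum((a * b.conjugate()).real for a, b in zip(q, k))

# Hypothetical query/key vectors, rotation frequencies, and decay bases.
q = [1 + 2j, 0.5 - 1j]
k = [0.3 + 0.7j, -1 + 0.2j]
thetas = [1.0, 0.01]
zetas = [0.98, 0.995]

# Same relative distance (6) at two different absolute offsets.
s_near = score(encode(q, 10, thetas, zetas), encode(k, 4, thetas, zetas, is_key=True))
s_far = score(encode(q, 110, thetas, zetas), encode(k, 104, thetas, zetas, is_key=True))

# With zeta = 1 the scaling vanishes and only the rotation remains.
ones = [1.0, 1.0]
s_rotation_only = score(encode(q, 10, thetas, ones), encode(k, 4, thetas, ones, is_key=True))

print(math.isclose(s_near, s_far, rel_tol=1e-9))
```

Scaling queries and keys in opposite directions is what keeps the product a function of n - m alone, so the decay does not break translation invariance.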

Blockwise causal attention

Blockwise Causal Attention (BCA) keeps the perplexity of ROPE from exploding.
Alibi performs well without windowed attention due to its “soft window”.
XPOS’s perplexity increases without BCA, but it can recognize position with BCA’s constraint.
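
A blockwise causal mask can be sketched as follows. The summary does not spell out the block layout, so this assumes that each query attends causally to keys in its own block and in the immediately preceding block; the function name and the block size are hypothetical.

```python
def blockwise_causal_mask(seq_len, block_size):
    """mask[i][j] is True when query i may attend to key j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            # Only the query's own block and the block just before it are visible.
            near = (i // block_size) - (j // block_size) <= 1
            row.append(causal and near)
        mask.append(row)
    return mask

mask = blockwise_causal_mask(seq_len=8, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under this layout no query ever sees more than 2 * block_size keys, so choosing the block size as half the training length would keep every relative distance at inference within the range seen during training.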

Experiments

Pre-training

A Transformer is pre-trained from scratch with a hidden dimension of 1024, 16 heads, and 24 layers
The training corpus is a subset of the Pile
Training runs on 16 V100 GPUs
Maximum length of 1024, to save memory and to allow extrapolation evaluation
Learning rate of 3×10^-4 with polynomial decay
Global batch size of 512 sequences, about 0.5M tokens
Adam optimizer with β1=0.9, β2=0.98, ε=10^-6
Code based on TorchScale
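
For reference, the settings above can be collected into a small configuration object. This is only a sketch: the class and field names are hypothetical and do not mirror the TorchScale API, and details the summary omits (warmup, decay power, total steps) are left out rather than guessed. The derived properties confirm that 512 sequences of 1024 tokens come to 524,288 tokens, or roughly 0.5M per batch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PretrainConfig:
    # Field names are hypothetical; values are the ones listed in the text.
    hidden_dim: int = 1024
    num_heads: int = 16
    num_layers: int = 24
    max_length: int = 1024
    batch_size: int = 512          # sequences per global batch
    learning_rate: float = 3e-4    # peak value, decayed polynomially
    adam_betas: tuple = (0.9, 0.98)
    adam_eps: float = 1e-6

    @property
    def head_dim(self) -> int:
        return self.hidden_dim // self.num_heads

    @property
    def tokens_per_batch(self) -> int:
        return self.batch_size * self.max_length

config = PretrainConfig()
print(config.head_dim, config.tokens_per_batch)
```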

Language modeling

Measured perplexity on arXiv documents
Evaluated performance on different input lengths
Analyzed interpolation and extrapolation capability
XPOS had a stable advantage over other models
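
Perplexity, the metric used here, is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming the per-token log-probabilities (natural log) have already been obtained from a model:

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Sanity check: a model that assigns probability 1/4 to every token
# is as uncertain as a uniform choice among four options.
uniform_quarter = perplexity([math.log(0.25)] * 10)
print(uniform_quarter)
```

Lower is better. Comparing the value at input lengths within and beyond the training length is what separates interpolation from extrapolation behavior.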

Measuring resolution

Resolution is a crucial index for building an effective Transformer
Resolution is calculated in every layer
XPOS makes positions more recognizable within the training length
Alibi uses explicit decay to achieve stable resolution
Ablation on BCA shows that it helps model distinguish positions better
Combination of vector rotation and exponential decay is necessary for strong performance
Linear attention targets efficiency while underperforming vanilla Transformers
Sparse attention leverages structured sparsity to reduce computation
XPOS provides stable and accurate position modeling
Higher resolution indicates better ability to distinguish context tokens
