Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Relative positional embeddings (RPE) have been studied for their ability to model the relative distance between tokens and enable length extrapolation.
- KERPLE is a framework that uses conditionally positive definite (CPD) kernels to generalize RPEs for length extrapolation.
- CPD kernels can be transformed into PD kernels by adding a constant offset, which is absorbed in the Softmax normalization during self-attention.
- Experiments show that the logarithmic variant of KERPLE achieves excellent extrapolation performance on three large language modeling datasets.
Paper Content
Introduction
- Transformer-based models have been used for various natural language processing tasks.
- These tasks often require the model to operate on longer text sequences than those used during training.
- Training the model with a larger training length L is often infeasible, because the cost of self-attention grows quadratically with sequence length.
- Relative positional embeddings are believed to be more robust to input length change.
- Relative positional embeddings encode the idea of shift-invariance.
- Recent work has studied shift-invariant kernels for RPE.
- The paper proposes a framework, KErnelized Relative Positional Embedding for Length Extrapolation (KERPLE).
Preliminary
- Input tokens are scalars used to index embedding vectors
- Learnable matrices convert embedding vectors into query, key, and value vectors
- Self-attention module computes scaled attention scores and generates output vector
- Positional information helps model token interactions
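The self-attention step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the query, key, and value vectors have already been produced by the learnable matrices, and it assumes causal attention (position m only looks at positions n <= m). The function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def causal_self_attention(queries, keys, values):
    """Return one output vector per position (assumption: causal masking)."""
    d = len(queries[0])
    outputs = []
    for m, q in enumerate(queries):
        # Scaled attention scores of position m against positions 0..m.
        scores = [dot(q, keys[n]) / math.sqrt(d) for n in range(m + 1)]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * values[n][j] for n, w in enumerate(weights))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs
```

Nothing in this sketch depends on where a token sits in the sequence, apart from the causal mask, which is why positional information has to be injected separately.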
Positional embedding
- Absolute positional embeddings assign a positional vector to each position and add it to the embedding vector.
- Learnable absolute positional embeddings have been used in various tasks.
- Relative positional embeddings model the positional difference between two positions.
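The difference between the two schemes can be made concrete with a small sketch. The absolute example uses the fixed sinusoidal vectors of the original Transformer as one possible choice (a learnable table would be used the same way), and the relative example only builds the matrix of offsets m - n that a relative scheme would consume. All function names are illustrative assumptions.

```python
import math

def sinusoidal_position(pos, dim):
    """Fixed absolute positional vector for one position (dim must be even)."""
    vec = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec

def add_absolute_positions(embeddings):
    """Absolute scheme: add the vector for position m to the m-th embedding."""
    dim = len(embeddings[0])
    return [[e + p for e, p in zip(emb, sinusoidal_position(m, dim))]
            for m, emb in enumerate(embeddings)]

def relative_offsets(length):
    """Relative scheme: entry [m][n] depends only on the difference m - n."""
    return [[m - n for n in range(length)] for m in range(length)]
```

Because `relative_offsets` depends only on m - n, shifting both positions by the same amount leaves the entry unchanged, which is the shift-invariance mentioned in the introduction.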
Kernel and its application in transformer
- The kernel trick is a classic approach to generalizing the inner product to high-dimensional spaces.
- Prior work has applied kernels to the self-attention structure to enhance performance.
- Prior work has also used a kernel's feature map to linearize the self-attention module and reduce its computational cost.
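As background for the linearization point, here is a rough sketch of attention computed through a feature map. It assumes the feature map elu(x) + 1, which is one choice from the linear-attention literature, and a non-causal setting. It is not part of KERPLE itself, and the function names are hypothetical.

```python
import math

def feature_map(x):
    """elu(x) + 1 applied elementwise, so every feature is positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(queries, keys, values):
    """Non-causal attention with similarity phi(q) . phi(k)."""
    phi_k = [feature_map(k) for k in keys]
    d, dv = len(phi_k[0]), len(values[0])
    # Summaries shared by every query:
    #   S = sum_n phi(k_n) v_n^T   and   z = sum_n phi(k_n).
    S = [[sum(pk[i] * v[j] for pk, v in zip(phi_k, values)) for j in range(dv)]
         for i in range(d)]
    z = [sum(pk[i] for pk in phi_k) for i in range(d)]
    outputs = []
    for q in queries:
        pq = feature_map(q)
        norm = sum(a * b for a, b in zip(pq, z))
        outputs.append([sum(pq[i] * S[i][j] for i in range(d)) / norm
                        for j in range(dv)])
    return outputs
```

The summaries `S` and `z` are built once and reused for every query, so the cost grows linearly with sequence length instead of quadratically.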
PD and CPD kernels
- Shift-invariant conditionally positive definite (CPD) kernels are used to model the effect of relative positional differences.
- CPD kernels generalize distance metrics to high-dimensional spaces.
- PD kernels represent inner products.
- CPD kernels can be transformed into PD kernels if needed.
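A small numerical check can make the CPD condition concrete. The sketch below uses the negative distance k(m, n) = -|m - n| as an example kernel and evaluates the quadratic form sum over i, j of c_i c_j k(i, j). For a CPD kernel this must be non-negative whenever the coefficients sum to zero, while a PD kernel needs it for all coefficients. The helper names and the random test setup are assumptions made for illustration.

```python
import random

def neg_distance_kernel(m, n):
    """k(m, n) = -|m - n|, a shift-invariant kernel on positions."""
    return -abs(m - n)

def quadratic_form(kernel, coeffs):
    """Compute sum_{i,j} c_i c_j k(i, j) over positions 0..len(coeffs)-1."""
    return sum(ci * cj * kernel(i, j)
               for i, ci in enumerate(coeffs)
               for j, cj in enumerate(coeffs))

def zero_sum_coeffs(length, rng):
    """Random coefficients recentred so that they sum to zero."""
    raw = [rng.uniform(-1.0, 1.0) for _ in range(length)]
    mean = sum(raw) / length
    return [r - mean for r in raw]

rng = random.Random(0)
# CPD condition: the form is non-negative for zero-sum coefficients.
cpd_values = [quadratic_form(neg_distance_kernel, zero_sum_coeffs(6, rng))
              for _ in range(200)]
# PD would need non-negativity for all coefficients, which fails here:
# c = (1, 1) does not sum to zero and gives 2 * k(0, 1).
unconstrained = quadratic_form(neg_distance_kernel, [1.0, 1.0])
```

Random sampling is only a sanity check, not a proof. It does show the gap between the two definitions: the zero-sum constraint is exactly what lets a distance-like kernel qualify.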
Constructing PD kernels from CPD kernels via constant shifts
- CPD kernels can be scaled and summed
- CPD kernels can be transformed into PD kernels with a constant shift
- Exact value of constant shift is not needed due to Softmax normalization
- Rich family of CPD kernels can be generated
- Lemma 1 proves CPD kernels can be made PD with a large enough constant
- Constant shift can be left as an under-determined constant in positional embedding design
- Geometric sequence search can be used to find a suitable constant
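The constant-shift argument can be illustrated numerically. The sketch below takes the example CPD kernel k(m, n) = -|m - n|, tries shifts c = 1, 2, 4, ... until the Gram matrix of c + k passes a Cholesky-based positive-definiteness test, and then checks that adding c to every score in a row leaves the Softmax output unchanged. The kernel choice, starting value, ratio, and tolerance are all assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shifted_gram(length, c):
    """Gram matrix of c + k(m, n) with the CPD kernel k(m, n) = -|m - n|."""
    return [[c - abs(m - n) for n in range(length)] for m in range(length)]

def is_positive_definite(matrix, tol=1e-9):
    """Attempt a Cholesky factorisation; every pivot must exceed tol."""
    size = len(matrix)
    lower = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            partial = sum(lower[i][t] * lower[j][t] for t in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= tol:
                    return False
                lower[i][i] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True

def find_shift(length, start=1.0, ratio=2.0, max_steps=60):
    """Geometric search c = start, start * ratio, ... until c + k is PD."""
    c = start
    for _ in range(max_steps):
        if is_positive_definite(shifted_gram(length, c)):
            return c
        c *= ratio
    raise ValueError("no suitable shift found")
```

In a positional embedding the search is never actually needed: because the shift cancels in the Softmax, it is enough to know that some valid c exists.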
Kernelized relative positional embedding
- Input queries and keys are denoted as {q_m} and {k_n}
- Learnable parameters (r_1, …, r_ℓ) are proposed
- A kernelized relative positional embedding is proposed
- A composite kernel is introduced
- Two variants of the composite kernel are proposed: a power variant and a logarithmic variant
- Connection to prior work is discussed
- Logarithmic variant has an implicit connection to T5 positional bias
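To show how the pieces fit together, here is a minimal single-head sketch of the two variants used as an additive bias on the attention scores. The forms -r1 * |m - n| ** r2 and -r1 * log(1 + r2 * |m - n|) follow the paper's power and logarithmic kernels as best understood here. The function names, the causal masking, and adding the bias before the 1/sqrt(d) scaling are assumptions of this sketch. In a real model r1 and r2 would be learnable parameters rather than plain arguments.

```python
import math

def kerple_power_bias(m, n, r1, r2):
    """Power variant: -r1 * |m - n| ** r2, assuming r1 > 0 and 0 < r2 <= 2."""
    return -r1 * abs(m - n) ** r2

def kerple_log_bias(m, n, r1, r2):
    """Logarithmic variant: -r1 * log(1 + r2 * |m - n|), assuming r1, r2 > 0."""
    return -r1 * math.log(1.0 + r2 * abs(m - n))

def attention_weights(query, keys, m, bias_fn, r1, r2):
    """Causal attention weights of position m over keys 0..m with a bias."""
    d = len(query)
    scores = [(sum(a * b for a, b in zip(query, keys[n]))
               + bias_fn(m, n, r1, r2)) / math.sqrt(d)
              for n in range(m + 1)]
    # Stable softmax; any constant shift of the bias would cancel here.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With r2 = 1 the power variant is a linear penalty in |m - n|, the same shape as the ALiBi bias but with a learnable slope. The logarithmic variant decays much more slowly, so distant tokens keep a non-negligible weight.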
Experiments
Dataset and implementation description
- Experiments conducted on OpenWebText2, GitHub, and ArXiv datasets
- GitHub includes open-source repositories written in Java, C/C++, Python, and Go
- ArXiv includes papers written in LaTeX in Math, Computer Science, Physics, and related fields
- Model trained on one NVIDIA A100 GPU with 40 GB of memory
- Compared KERPLE with Sinusoidal, Rotary, T5, and ALiBi
- Logarithmic variant of KERPLE better than power variant
- Logarithmic variant faster than T5 and better at longer extrapolation lengths
- Logarithmic variant slower than ALiBi, Rotary, and Sinusoidal but consistently outperforms them
Experiments on complicated kernels
- (bias+wht) and (3-para-log) are two complicated versions of the composite kernel
- (bias+wht) uses a weight and bias kernel
- (3-para-log) uses a logarithmic variant
- The performance of these RPEs is tested in the KERPLE framework
- Increasing kernel complexity does not necessarily give better performance
Plots of kernel functions
- ALiBi and its generalized power variant quickly reach a very negative value
- Log variant successfully discovers several flat kernels, extending window attention
Position-wise perplexity evaluation
- KERPLE-log has the lowest PPL@512 among all model variants
- KERPLE-log uses more distant information than window attention
- PPL of KERPLE-log continues to decrease up to the final position (4096)
- T5 lies below KERPLE-log-windowed@512 most of the time
- ALiBi lies above KERPLE-log-windowed@512 for almost all positions
- KERPLE-log is almost like a free lunch compared to window attention
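Position-wise perplexity can be computed from per-token losses along the following lines. This is a hedged sketch: it assumes perplexity at position i is the exponential of the mean negative log-likelihood of the tokens at that position across evaluation sequences, and the function name and input layout are hypothetical. The paper's exact evaluation protocol may differ.

```python
import math

def positionwise_perplexity(token_nll):
    """Return exp(mean NLL) separately for every position.

    token_nll[s][i] is the negative log-likelihood (in nats) of token i in
    evaluation sequence s; all sequences are assumed to have equal length.
    """
    num_seqs = len(token_nll)
    length = len(token_nll[0])
    return [math.exp(sum(seq[i] for seq in token_nll) / num_seqs)
            for i in range(length)]
```

Plotting the returned list against position gives a curve like the ones discussed above: a curve that keeps falling at later positions indicates that the model is making use of more distant context.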
Conclusion and future work
- A general framework, KERPLE, is proposed to kernelize relative positional embeddings for length extrapolation
- At the core of this framework is the application of CPD kernels and the derivation of practical variants
- CPD kernels can be implicitly converted to PD kernels, which keeps the inner product interpretation of self-attention
- Logarithmic variant achieves exceptional extrapolation performance on three large language modeling datasets
- Future directions include more general kernel families and modeling non-monotonic effects of positional differences
- Learnable parameters in KERPLE might enable better generalization to inputs with more than one dimension
- Memory efficiency can be improved by adjusting the model architecture and training procedure
- Results apply to domains where positional information is helpful, such as natural language, programming language, and DNA/protein sequences
- Positive economic effects from transformers include enabling new tasks and enhancing accuracy and efficiency
- Negative societal impacts include job loss due to automation, ethical challenges from improper text generation, and privacy issues in data collection