Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Relative positional embeddings (RPE) have been studied for their ability to model the relative distance between tokens and enable length extrapolation.
KERPLE is a framework that uses conditionally positive definite (CPD) kernels to generalize RPEs for length extrapolation.
CPD kernels can be transformed into PD kernels by adding a constant offset, which is absorbed in the Softmax normalization during self-attention.
Experiments show that the logarithmic variant of KERPLE achieves excellent extrapolation performance on three large language modeling datasets.

Paper Content

Introduction

Transformer-based models have been used for various natural language processing tasks.
These tasks often require the model to operate on longer text sequences than used during training.
Training the model with a larger maximum sequence length L is often infeasible, because the cost of self-attention grows quadratically with L.
Relative positional embeddings are believed to be more robust to input length change.
Relative positional embeddings encode the idea of shift-invariance.
Recent work has studied shift-invariant kernels for RPE.
This paper proposes KERPLE, a framework for KErnelized Relative Positional Embedding for Length Extrapolation.

Preliminary

Input tokens are scalars used to index embedding vectors
Learnable matrices convert embedding vectors into query, key, and value vectors
Self-attention module computes scaled attention scores and generates output vector
Positional information helps model token interactions
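
The self-attention step described above can be sketched in a few lines. This is a minimal, single-query illustration in plain Python, not the paper's implementation; the function names `softmax` and `attend` and the toy keys and values are assumptions made for the example.

```python
import math

def softmax(scores):
    # Subtracting the maximum keeps exp() stable and does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query attention: weights = softmax(q . k / sqrt(d))."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    dim_v = len(values[0])
    # The output is the attention-weighted average of the value vectors.
    output = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(dim_v)
    ]
    return weights, output

# Toy inputs (hypothetical): three keys/values of dimension 2.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, output = attend([1.0, 0.0], keys, values)
```

Nothing in this computation depends on where a token sits in the sequence, which is why positional information has to be injected separately.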

Positional embedding

Absolute positional embeddings assign a positional vector to each position and add it to the embedding vector.
Learnable absolute positional embeddings have been used in various tasks.
Relative positional embeddings model the positional difference between two positions.

Kernel and its application in transformer

The kernel trick is a classic approach to generalizing the inner product to high-dimensional spaces
Kernels have been applied to the self-attention structure to enhance performance
A kernel's feature map can be leveraged to linearize the self-attention module and reduce computational cost

PD and CPD kernels

Shift-invariant conditionally positive definite (CPD) kernels are used to model the effect of relative positional differences.
CPD kernels generalize distance metrics to high dimensional spaces.
PD kernels represent inner products.
CPD kernels can be transformed into PD kernels if needed.
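
A small numerical check can make the CPD condition concrete. The sketch below assumes the standard definition: a symmetric kernel k is CPD if the quadratic form sum_i sum_j c_i c_j k(x_i, x_j) is non-negative for every coefficient vector c whose entries sum to zero. It uses the negative distance kernel -|m - n|^p with p = 1 as an example; the helper names and the random test setup are hypothetical.

```python
import random

def neg_power_kernel(m, n, p=1.0):
    # Negative distance raised to a power; CPD for 0 < p <= 2.
    return -(abs(m - n) ** p)

def quadratic_form(kernel, points, coeffs):
    return sum(
        ci * cj * kernel(xi, xj)
        for ci, xi in zip(coeffs, points)
        for cj, xj in zip(coeffs, points)
    )

random.seed(0)
positions = list(range(8))

# Zero-sum coefficients: the quadratic form should never be negative.
min_form = float("inf")
for _ in range(200):
    c = [random.uniform(-1.0, 1.0) for _ in positions]
    mean = sum(c) / len(c)
    c = [ci - mean for ci in c]  # enforce sum(c) == 0
    min_form = min(min_form, quadratic_form(neg_power_kernel, positions, c))

# Coefficients that do not sum to zero: the form can be negative,
# so this kernel is CPD but not PD.
ones_form = quadratic_form(neg_power_kernel, positions, [1.0] * len(positions))
```

Here `min_form` stays non-negative up to floating-point error, while `ones_form` is clearly negative, which illustrates the gap between CPD and PD.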

Constructing PD kernels from CPD kernels via constant shifts

CPD kernels can be scaled and summed
CPD kernels can be transformed into PD kernels with a constant shift
Exact value of constant shift is not needed due to Softmax normalization
Rich family of CPD kernels can be generated
Lemma 1 proves CPD kernels can be made PD with a large enough constant
Constant shift can be left as an under-determined constant in positional embedding design
Geometric sequence search can be used to find a suitable constant
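
The reason the exact constant never has to be computed is that Softmax is invariant to adding the same value to every score. The following sketch checks this numerically; the scores and the shift value are arbitrary numbers chosen for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.3, -1.2, 2.0, 0.5]  # arbitrary attention scores for one query
shift = 100.0                    # stands in for the constant that makes a CPD kernel PD

base = softmax(scores)
shifted = softmax([s + shift for s in scores])

# Largest difference between the two attention distributions.
max_gap = max(abs(a - b) for a, b in zip(base, shifted))
```

`max_gap` is zero up to rounding error, so the attention weights computed from the CPD kernel match those computed from its PD counterpart.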

Kernelized relative positional embedding

Input queries and keys are denoted as {q_m} and {k_n}
Learnable parameters (r_1, …, r_ℓ) are proposed
A kernelized relative positional embedding is proposed
A composite kernel is introduced
Two variants of the composite kernel are proposed
Connection to prior work is discussed
Logarithmic variant has an implicit connection to T5 positional bias
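
As a rough sketch of how such a positional bias enters the attention scores, the code below assumes the two variants take the forms -r1 * |m - n|^r2 (power) and -r1 * log(1 + r2 * |m - n|) (logarithmic), added to the scaled query-key score before Softmax. The function names and parameter values are illustrative; in the paper the parameters are learned rather than fixed by hand.

```python
import math

def power_bias(m, n, r1, r2):
    """Power variant: -r1 * |m - n| ** r2, assuming r1 > 0 and 0 < r2 <= 2."""
    return -r1 * abs(m - n) ** r2

def log_bias(m, n, r1, r2):
    """Logarithmic variant: -r1 * log(1 + r2 * |m - n|), assuming r1, r2 > 0."""
    return -r1 * math.log(1.0 + r2 * abs(m - n))

def biased_scores(raw_scores, query_pos, bias_fn, **params):
    # raw_scores[n] plays the role of the scaled q_m . k_n score for key position n.
    return [s + bias_fn(query_pos, n, **params) for n, s in enumerate(raw_scores)]

# Hypothetical example: a query at position 3 attending to keys 0..3,
# with all content scores set to zero so only the positional bias is visible.
raw = [0.0, 0.0, 0.0, 0.0]
log_scores = biased_scores(raw, query_pos=3, bias_fn=log_bias, r1=1.0, r2=1.0)
pow_scores = biased_scores(raw, query_pos=3, bias_fn=power_bias, r1=1.0, r2=1.0)
```

With r2 = 1 the power form reduces to a linear penalty on distance, the same shape that ALiBi uses, while the logarithmic form penalizes distant keys far more gently.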

Experiments

Dataset and implementation description

Experiments conducted on OpenWebText2, GitHub, and ArXiv datasets
GitHub includes open-source repositories written in Java, C/C++, Python, and Go
ArXiv includes papers written in LaTeX in Math, Computer Science, Physics, and related fields
Model trained on one NVIDIA A100 GPU with 40 GB of memory
Compared KERPLE with Sinusoidal, Rotary, T5, and ALiBi
Logarithmic variant of KERPLE better than power variant
Logarithmic variant faster than T5 and better at longer extrapolation lengths
Logarithmic variant slower than ALiBi, Rotary, and Sinusoidal but consistently outperforms them

Experiments on complicated kernels

(bias+wht) and (3-para-log) are two complicated versions of the composite kernel
(bias+wht) uses a weight and bias kernel
(3-para-log) uses a three-parameter logarithmic variant
Performance of these RPE is tested in the KERPLE framework
Increasing kernel complexity does not necessarily improve performance

Plots of kernel functions

ALiBi and its generalized power variant quickly reach a very negative value
Log variant successfully discovers several flat kernels, extending window attention
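
The difference between the two shapes is easy to see by exponentiating the bias, since exp(bias) is the factor by which the bias scales an unnormalized attention weight. The sketch below assumes a linear bias -slope * d for ALiBi and a logarithmic bias -r1 * log(1 + r2 * d); the slope and the parameter values are illustrative, not taken from the paper's plots.

```python
import math

def linear_bias(distance, slope):
    return -slope * distance

def log_bias(distance, r1, r2):
    return -r1 * math.log(1.0 + r2 * distance)

distances = [1, 16, 256, 4096]

# exp(bias) decays exponentially for the linear form ...
linear_weights = [math.exp(linear_bias(d, slope=0.5)) for d in distances]
# ... but only polynomially for the log form: exp(-r1*log(1+r2*d)) = (1+r2*d)**(-r1).
log_weights = [math.exp(log_bias(d, r1=1.0, r2=1.0)) for d in distances]
```

At a distance of 4096 the linear factor underflows to zero in double precision, whereas the logarithmic factor is still about 1/4097, so distant tokens can keep a small but nonzero influence.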

Position-wise perplexity evaluation

KERPLE-log has the lowest PPL@512 among all model variants
KERPLE-log uses more distant information than window attention
PPL of KERPLE-log continues to decrease till the end of 4096 positions
T5 lies below KERPLE-log-windowed@512 most of the time
ALiBi lies above KERPLE-log-windowed@512 for almost all positions
KERPLE-log is almost like a free lunch compared to window attention
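
For readers unfamiliar with the metric, position-wise perplexity can be sketched as the exponential of the average negative log-likelihood at each position, taken over evaluation sequences. This is an assumed reading of the evaluation, not the paper's code, and the numbers below are made up so the result is easy to verify by hand.

```python
import math

def positionwise_perplexity(nll_per_sequence):
    """nll_per_sequence[s][i] is the negative log-likelihood of token i in sequence s."""
    n_sequences = len(nll_per_sequence)
    n_positions = len(nll_per_sequence[0])
    return [
        math.exp(sum(seq[i] for seq in nll_per_sequence) / n_sequences)
        for i in range(n_positions)
    ]

# Made-up losses for two sequences of length 2.
nll = [
    [math.log(4.0), math.log(2.0)],
    [math.log(4.0), math.log(8.0)],
]
ppl = positionwise_perplexity(nll)
```

A curve that keeps falling as the position index grows, as reported for KERPLE-log, indicates that the model is still benefiting from the additional context.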

Conclusion and future work

A general framework, KERPLE, is proposed to kernelize relative positional embeddings for length extrapolation
At the core of this framework is the application of CPD kernels and the derivation of practical variants
CPD kernels can be implicitly converted to PD kernels, which keep the inner product interpretation of self-attention
Logarithmic variant achieves exceptional extrapolation performance on three large language modeling datasets
Future directions include exploring more general kernel families and modeling non-monotonic effects of positional differences
Learnable parameters in KERPLE might enable better generalization to inputs with more than one dimension
Memory efficiency can be improved by adjusting the model architecture and training procedure
Results apply to domains where positional information is helpful, such as natural language, programming language, and DNA/protein sequences
Positive economic effects from transformers include enabling new tasks and enhancing accuracy and efficiency
Negative societal impacts include job loss due to automation, ethical challenges from improper text generation, and privacy issues in data collection
