Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Relative positional embeddings (RPE) have been studied for their ability to model the relative distance between tokens and enable length extrapolation.
  • KERPLE is a framework that uses conditionally positive definite (CPD) kernels to generalize RPEs for length extrapolation.
  • CPD kernels can be transformed into PD kernels by adding a constant offset, which is absorbed in the Softmax normalization during self-attention.
  • Experiments show that the logarithmic variant of KERPLE achieves excellent extrapolation performance on three large language modeling datasets.

Paper Content

Introduction

  • Transformer-based models have been used for various natural language processing tasks.
  • These tasks often require the model to operate on longer text sequences than used during training.
  • Training the model with a larger value of L is often infeasible.
  • Relative positional embeddings are believed to be more robust to input length change.
  • Relative positional embeddings encode the idea of shift-invariance.
  • Recent work has studied shift-invariant kernels for RPE.
  • A framework for KErnelize Relative Positional Embedding for Length Extrapolation (KERPLE) has been proposed.

Preliminary

  • Input tokens are scalars used to index embedding vectors
  • Learnable matrices convert embedding vectors into query, key, and value vectors
  • Self-attention module computes scaled attention scores and generates output vector
  • Positional information helps model token interactions

Positional embedding

  • Absolute positional embeddings assign a positional vector to each position and add it to the embedding vector.
  • Learnable absolute positional embeddings have been used in various tasks.
  • Relative positional embeddings model the positional difference between two positions.

Kernel and its application in transformer

  • Kernel trick is a classic approach to generalize inner product to high dimensional spaces
  • Examples of applying kernels to self-attention structure to enhance performance
  • Leveraging kernel’s feature map to linearize self-attention module and reduce computational cost

Pd and cpd kernels

  • Shift-invariant conditionally positive definite (CPD) kernels are used to model the effect of relative positional differences.
  • CPD kernels generalize distance metrics to high dimensional spaces.
  • PD kernels represent inner products.
  • CPD kernels can be transformed into PD kernels if needed.

Constructing pd kernels from cpd kernels via constant shifts

  • CPD kernels can be scaled and summed
  • CPD kernels can be transformed into PD kernels with a constant shift
  • Exact value of constant shift is not needed due to Softmax normalization
  • Rich family of CPD kernels can be generated
  • Lemma 1 proves CPD kernels can be made PD with a large enough constant
  • Constant shift can be left as an under-determined constant in positional embedding design
  • Geometric sequence search can be used to find a suitable constant

Kernelized relative positional embedding

  • Input queries and keys are denoted as {q m } and {k n }
  • Learnable parameters (r 1 , …, r ) are proposed
  • A kernelized relative positional embedding is proposed
  • A composite kernel is introduced
  • Two variants of the composite kernel are proposed
  • Connection to prior work is discussed
  • Logarithmic variant has an implicit connection to T5 positional bias

Experiments

Dataset and implementation description

  • Experiments conducted on OpenWebText2, GitHub, and ArXiv datasets
  • GitHub includes open-source repositories written in Java, C/C++, Python, and Go
  • ArXiv includes papers written in LaTex in Math, Computer Science, Physics, and related fields
  • Model trained on one NVIDIA A100 GPU with 40 GB of memory
  • Compared KERPLE with Sinusoidal, Rotary, T5, and ALiBi
  • Logarithmic variant of KERPLE better than power variant
  • Logarithmic variant faster than T5 and better at longer extrapolation lengths
  • Logarithmic variant slower than ALiBi, Rotary, and Sinusoidal but consistently outperforms them

Experiments on complicated kernels

  • (bias+wht) and (3-para-log) are two complicated versions of the composite kernel
  • (bias+wht) uses a weight and bias kernel
  • (3-para-log) uses a logarithmic variant
  • Performance of these RPE is tested in the KERPLE framework
  • Enlarging the complexity of kernels does not necessarily give better performance

Plots of kernel functions

  • ALiBi and its generalized power variant quickly reach a very negative value
  • Log variant successfully discovers several flat kernels, extending window attention

Position-wise perplexity evaluation

  • KERPLE-log has the lowest PPL@512 among all model variants
  • KERPLE-log uses more distant information than window attention
  • PPL of KERPLE-log continues to decrease till the end of 4096 positions
  • T5 lies below KERPLE-log-windowed@512 most of the time
  • ALiBi lies above KERPLE-log-windowed@512 for almost all positions
  • KERPLE-log is almost like a free lunch compared to window attention

Conclusion and future work

  • A general framework, KERPLE, is proposed to kernelize relative positional embeddings for length extrapolation
  • At the core of this framework is the application of CPD kernels and the derivation of practical variants
  • CPD kernels can be implicitly converted to PD kernels, which keep the inner product interpretation of self-attention
  • Logarithmic variant achieves exceptional extrapolation performance on three large language modeling datasets
  • Future directions include general kernel families and model non-monotonic effects due to positional differences
  • Learnable parameters in KERPLE might enable better generalization to inputs higher than one-dimensional
  • Memory efficiency can be improved by adjusting the model architecture and training procedure
  • Results apply to domains where positional information is helpful, such as natural language, programming language, and DNA/protein sequences
  • Positive economic effects from transformers include enabling new tasks and enhancing accuracy and efficiency
  • Negative societal impacts include job loss due to automation, ethical challenges from improper text generation, and privacy issues in data collection