Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Position encoding has been effective in the transformer architecture.
  • Various methods have been used to integrate positional information into the learning process of transformer-based language models.
  • RoPE encodes the absolute position with a rotation matrix and incorporates the explicit relative position dependency in self-attention formulation.
  • RoPE has properties such as flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding.
  • RoFormer has been evaluated on various long text classification benchmark datasets and consistently outperforms its alternatives.

Paper Content

Introduction

  • Order of words is important for natural language understanding
  • Recurrent neural networks (RNNs) encode tokens’ order
  • Convolutional neural networks (CNNs) are typically considered position-agnostic
  • Pre-trained language models (PLMs) use self-attention mechanism to capture contextual representation
  • Various approaches have been proposed to encode position information
  • Absolute position encoding and relative position encoding
  • Rotary Position Embedding (RoPE) encodes absolute position with a rotation matrix
  • Inter-token dependency under RoPE decays with increasing relative distance
  • RoPE is compatible with linear self-attention
  • RoFormer (RoPE + PLMs) achieves better performance than alternatives

Preliminary

  • Sequence of N input tokens denoted as $S_N$
  • Word embedding of $S_N$ denoted as $E_N = \{x_i\}_{i=1}^N$, where $x_i \in \mathbb{R}^d$ is the d-dimensional word embedding vector
  • Position information incorporated into queries, keys, and value representations
  • Attention weights computed from query and key values
  • Output computed as a weighted sum over the value representations
  • Existing approaches to transformer-based position encoding focus on choosing a suitable function to form Equation (1)
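
For reference, the quantities described above can be written out as follows. This is a sketch in the usual self-attention notation; the symbols $q_m$, $k_n$, $v_n$, $a_{m,n}$ and $o_m$ are assumed names for the query, key, value, attention weight and output, and the equation numbers follow the references to Equations (1) and (2) in this summary.

```latex
% Position information enters through the functions f_q, f_k, f_v
q_m = f_q(x_m, m), \qquad k_n = f_k(x_n, n), \qquad v_n = f_v(x_n, n) \tag{1}

% Attention weights from query-key products, output as a weighted sum of values
a_{m,n} = \frac{\exp\left( q_m^{\top} k_n / \sqrt{d} \right)}
               {\sum_{j=1}^{N} \exp\left( q_m^{\top} k_j / \sqrt{d} \right)},
\qquad
o_m = \sum_{n=1}^{N} a_{m,n} \, v_n \tag{2}
```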

Absolute position embedding

  • A typical choice for Equation (1) adds a d-dimensional position vector $p_i$ to the word embedding $x_i$ before the projection
  • RoPE instead incorporates relative position information by multiplying with sinusoidal functions, rather than adding a position vector
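
A minimal sketch of the additive scheme, assuming the sinusoidal position vector of Vaswani et al. (sine on even indices, cosine on odd indices, base 10000). The function names `sinusoidal_position` and `add_position` are hypothetical and chosen only for illustration.

```python
import math

def sinusoidal_position(i, d):
    """Return the d-dimensional sinusoidal vector p_i for position i (d even)."""
    p = []
    for t in range(d // 2):
        angle = i / (10000 ** (2 * t / d))
        p.append(math.sin(angle))  # entry 2t
        p.append(math.cos(angle))  # entry 2t + 1
    return p

def add_position(x, i):
    """Additive scheme: x_i + p_i, computed before the query/key/value projections."""
    p = sinusoidal_position(i, len(x))
    return [xj + pj for xj, pj in zip(x, p)]

x = [0.5, -1.0, 0.25, 2.0]
print(add_position(x, 0))  # at position 0 the added vector alternates sin(0)=0, cos(0)=1
print(add_position(x, 3))
```

The position vector is added to the embedding here, which is the point of contrast with RoPE's multiplicative rotation.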

Relative position embedding

  • Shaw et al. applied different settings of an equation to train relative position embeddings
  • Dai et al. proposed to decompose an equation to replace absolute position embeddings with relative counterparts
  • $W_k$ and $\widetilde{W}_k$ distinguished for content-based and location-based key vectors
  • Position information removed from value term
  • Later work followed these settings by encoding relative position information into attention weights
  • Raffel et al. reformed equation and Ke et al. investigated middle two terms
  • He et al. argued relative positions of two tokens can only be modeled using middle two terms
  • Absolute position embeddings replaced with relative position embeddings
  • Comparison of four variants of relative position embeddings showed one variant is most efficient
  • All of these approaches modify the expanded query-key product that follows from the additive decomposition proposed in Vaswani et al.
  • RoPE instead aims to derive the relative position encoding from Equation (1) under some constraints
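
The "middle two terms" mentioned above refer to the expansion of the query-key product under the additive scheme. As a worked sketch, assuming queries and keys of the form $W_q(x_m + p_m)$ and $W_k(x_n + p_n)$, the product splits into four terms:

```latex
q_m^{\top} k_n
  = \underbrace{x_m^{\top} W_q^{\top} W_k \, x_n}_{\text{content-content}}
  + \underbrace{x_m^{\top} W_q^{\top} W_k \, p_n}_{\text{content-position}}
  + \underbrace{p_m^{\top} W_q^{\top} W_k \, x_n}_{\text{position-content}}
  + \underbrace{p_m^{\top} W_q^{\top} W_k \, p_n}_{\text{position-position}}
```

The relative position methods listed above differ mainly in which of these terms they keep and in how they replace $p_m$ and $p_n$ with relative quantities.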

Proposed approach

  • Formulated the relative position encoding problem
  • Derived the RoPE
  • Investigated properties of RoPE

Formulation

  • Transformer-based language modeling uses self-attention mechanism to leverage position information of individual tokens.
  • Equation (2) enables knowledge conveyance between tokens at different positions.
  • To incorporate relative position information, inner product of query and key is formulated by a function g.
  • Function g takes word embeddings and relative position as input variables.
  • Goal is to find an equivalent encoding mechanism, that is, functions $f_q$ and $f_k$ that satisfy this relation.
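
Written out, the relation described above requires the inner product of the position-encoded query and key to depend on the positions m and n only through their difference:

```latex
\left\langle f_q(x_m, m), \; f_k(x_n, n) \right\rangle = g(x_m, x_n, m - n)
```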

Rotary position embedding

  • We prove that a solution to the formulation exists in the 2D case.
  • We can write the solution as a matrix multiplication.
  • We can generalize the solution to any $x_i \in \mathbb{R}^d$ where d is even.
  • Our approach is multiplicative and incorporates relative position information through rotation matrix product.
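
The rotation described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the vectors `q` and `k` stand for already-projected query and key vectors, the frequencies are assumed to be `base ** (-2 * i / d)` with `base = 10000`, and the helper names `rope` and `dot` are hypothetical.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each consecutive pair (x[2i], x[2i+1]) by the angle pos * theta_i."""
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])  # 2D rotation of the pair
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

score_a = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2, offset 3
score_b = dot(rope(q, 105), rope(k, 102))  # both shifted by 100, same offset 3
print(abs(score_a - score_b) < 1e-9)       # the score depends only on the offset
```

Each token is rotated using only its own absolute position, yet the query-key score depends only on the relative offset, which is the property the formulation asks for.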

Properties of RoPE

  • Long-term decay property means that the inner-product will decrease when the relative position increases.
  • Self-attention has a quadratic complexity of $O(N^2)$ in the sequence length N.
  • Linear attention variants reformulate self-attention using non-negative feature maps or by applying softmax to queries and keys separately.
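
The long-term decay property can be explored numerically. The sketch below assumes the frequencies `base ** (-2 * i / d)` with `base = 10000` and evaluates the average magnitude of the partial sums of `exp(1j * rel * theta_i)`, which is the quantity that bounds the inner product as a function of the relative distance `rel`. The function name `relative_upper_bound` and the default `d = 128` are illustrative choices.

```python
import cmath

def relative_upper_bound(rel, d=128, base=10000.0):
    """Average of |S_j| for j = 1..d/2, with S_j = sum over i < j of exp(1j * rel * theta_i)."""
    half = d // 2
    total = 0.0
    partial = 0j
    for i in range(half):
        theta = base ** (-2 * i / d)
        partial += cmath.exp(1j * rel * theta)  # partial now equals S_{i+1}
        total += abs(partial)
    return total / half

for rel in (0, 10, 50, 100, 200):
    print(rel, round(relative_upper_bound(rel), 2))
```

At `rel = 0` all phases align and the bound is at its maximum. For larger distances the phases spread out and the bound shrinks; the trend is downward overall, though not necessarily monotone at every step.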

Theoretical explanation

  • Two word embedding vectors $x_q$ and $x_k$ correspond to the query and key, at positions m and n
  • There is a function g that defines the inner product between the vectors produced by $f_{\{q,k\}}$
  • Radial functions $R_q$, $R_k$ and $R_g$ are independent of the position information
  • Angular functions do not depend on query and key
  • Equation (31) is the final solution
  • Entries of vectors q and k can be grouped in pairs
  • Inner product of RoPE can be written as a complex number multiplication
  • Abel transformation can be used to rewrite the summation
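
The pairing of entries and the complex-number view can be checked directly. In this sketch each pair of entries is treated as one complex number, rotation by position becomes multiplication by a unit complex number, and the score is the real part of a product that contains only the relative phase. All function names are hypothetical, and the frequencies are assumed to be `base ** (-2 * i / d)`.

```python
import cmath

def to_complex(x):
    """Group entries in pairs: (x[2i], x[2i+1]) becomes x[2i] + 1j * x[2i+1]."""
    return [complex(x[2 * i], x[2 * i + 1]) for i in range(len(x) // 2)]

def thetas(d, base=10000.0):
    return [base ** (-2 * i / d) for i in range(d // 2)]

def score_by_rotation(q, k, m, n):
    """Rotate query by position m and key by position n, then take the real inner product."""
    ts = thetas(len(q))
    qc = [z * cmath.exp(1j * m * t) for z, t in zip(to_complex(q), ts)]
    kc = [z * cmath.exp(1j * n * t) for z, t in zip(to_complex(k), ts)]
    return sum((a * b.conjugate()).real for a, b in zip(qc, kc))

def score_by_relative_phase(q, k, m, n):
    """Same score written with a single phase factor that depends only on m - n."""
    ts = thetas(len(q))
    return sum(
        (a * b.conjugate() * cmath.exp(1j * (m - n) * t)).real
        for a, b, t in zip(to_complex(q), to_complex(k), ts)
    )

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]
print(abs(score_by_rotation(q, k, 7, 3) - score_by_relative_phase(q, k, 7, 3)) < 1e-9)
```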

Experiments and evaluation

  • Evaluated proposed RoFormer on various NLP tasks
  • Validated performance on machine translation task
  • Compared RoPE implementation with BERT (Devlin et al. [2019])
  • Evaluated across different downstream tasks from the GLUE benchmarks (Singh et al. [2018])
  • Conducted experiments using RoPE with the linear attention of PerFormer (Choromanski et al. [2020])
  • Additional tests on Chinese data included

Machine translation

  • RoFormer is tested on the WMT 2014 English-German dataset and the Enwik8 dataset.
  • RoPE is incorporated into the 12-layer char-based PerFormer.
  • RoFormer is compared to the transformer-based baseline alternative and outperforms it.

Pre-training language modeling

  • Replaced sinusoidal position encoding of BERT with RoPE during pre-training step
  • Evaluated performance using F1-score, Spearman correlation and accuracy

Fine-tuning on GLUE tasks

  • Pre-trained RoFormer weights were fine-tuned
  • Evaluated generalization ability on downstream NLP tasks

Performer with RoPE

  • Performer (Choromanski et al. [2020]) introduced an alternative attention mechanism called linear attention.
  • Linear attention is designed to avoid quadratic computation cost that scales with input sequence length.
  • RoPE can be implemented in the PerFormer model to realize relative position encoding while keeping linear complexity in self-attention.
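
A rough sketch of how rotation can be combined with linear attention while keeping cost linear in the sequence length. It is not the PerFormer implementation: it assumes the simple non-negative feature map `elu(x) + 1`, a non-causal setting, rotation applied only in the numerator, and an unrotated denominator so the normalizer stays positive. All function names are hypothetical.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each consecutive pair of entries by pos * theta_i."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out.extend([x[2 * i] * c - x[2 * i + 1] * s, x[2 * i] * s + x[2 * i + 1] * c])
    return out

def phi(x):
    """Assumed non-negative feature map elu(x) + 1, applied elementwise."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_attention_rope(Q, K, V):
    """Linear-time form: accumulate key-value summaries once, then reuse them per query."""
    d, dv = len(Q[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]  # sum over n of rotated phi(k_n) times v_n
    z = [0.0] * d                        # sum over n of unrotated phi(k_n)
    for n, (k, v) in enumerate(zip(K, V)):
        fk = phi(k)
        rk = rope(fk, n)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                kv[a][b] += rk[a] * v[b]
    out = []
    for m, q in enumerate(Q):
        fq = phi(q)
        rq = rope(fq, m)
        denom = dot(fq, z)
        out.append([sum(rq[a] * kv[a][b] for a in range(d)) / denom for b in range(dv)])
    return out

def quadratic_reference(Q, K, V):
    """Same quantity computed pair by pair, used only to check the linear form."""
    out = []
    for m, q in enumerate(Q):
        fq = phi(q)
        rq = rope(fq, m)
        num, den = [0.0] * len(V[0]), 0.0
        for n, (k, v) in enumerate(zip(K, V)):
            fk = phi(k)
            w = dot(rq, rope(fk, n))
            den += dot(fq, fk)
            num = [acc + w * vb for acc, vb in zip(num, v)]
        out.append([x / den for x in num])
    return out

Q = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.1, -0.3, 0.2], [-0.4, 0.6, 0.2, -0.1]]
K = [[0.2, 0.3, -0.1, 0.5], [-0.2, 0.4, 0.1, 0.3], [0.6, -0.5, 0.2, 0.1]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fast = linear_attention_rope(Q, K, V)
slow = quadratic_reference(Q, K, V)
print(all(abs(a - b) < 1e-9 for r1, r2 in zip(fast, slow) for a, b in zip(r1, r2)))
```

Because the rotated products in the numerator can be negative, the resulting weights are not guaranteed to form a probability distribution; the sketch only illustrates that the relative-position rotation is compatible with the linear-time factorization.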

Evaluation on Chinese data

  • Conducted experiments on English and Chinese data
  • Experiments on long documents with length exceeding 512 characters
  • Task is to predict whether pair (A, B) is closer than (A, C)
  • Most existing methods perform poorly on the CAIL2019-SCM dataset because of the length of its documents
  • Split train, validation and test sets based on 6:2:2 ratio

Conclusions

  • Proposed a new position embedding method for transformer architectures
  • Method incorporates explicit relative position dependency in self-attention
  • Relative position can be naturally formulated using vector products in self-attention
  • Mathematical analysis illustrates the advantageous properties of the method when applied to the Transformer
  • Experiments on English and Chinese datasets show faster convergence in pre-training