Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Position encoding has been effective in the transformer architecture.
Various methods have been used to integrate positional information into the learning process of transformer-based language models.
RoPE encodes the absolute position with a rotation matrix and incorporates the explicit relative position dependency in self-attention formulation.
RoPE has properties such as flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding.
RoFormer has been evaluated on various long text classification benchmark datasets and consistently outperforms its alternatives.

Paper Content

Introduction

Order of words is important for natural language understanding
Recurrent neural networks (RNNs) encode tokens’ order
Convolution neural networks (CNNs) are position-agnostic
Pre-trained language models (PLMs) use self-attention mechanism to capture contextual representation
Various approaches have been proposed to encode position information
Absolute position encoding and relative position encoding
Rotary Position Embedding (RoPE) encodes absolute position with a rotation matrix
Inter-token dependency under RoPE decays with increasing relative distance
RoPE is compatible with linear self-attention
RoFormer (RoPE + PLMs) achieves better performance than alternatives

Preliminary

Sequence of N input tokens denoted as S_N = {w_i}, i = 1, ..., N
Word embedding of S_N denoted as E_N = {x_i}, i = 1, ..., N, where x_i ∈ R^d is the d-dimensional word embedding vector of token w_i, without position information
Position information incorporated into queries, keys, and value representations
Attention weights computed from query and key values
Output computed as weighted sum over value representations
Existing approaches of transformer-based position encoding focus on choosing suitable function for Equation (1)
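As a rough illustration of the last three points, the sketch below computes softmax attention weights from queries and keys and returns the weighted sum over the values. It is a minimal pure-Python sketch, not the paper's code: the names `softmax` and `attention` and the toy vectors are assumptions made for illustration, and the queries, keys and values are taken to be already position-encoded.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of d-dimensional vectors."""
    d = len(keys[0])
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        # Attention weights a_{m,n} from the scaled query-key inner products.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output o_m is the weighted sum over the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(dim_v)])
    return outputs

# Toy example with hypothetical numbers.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = attention(q, k, v)
```

Each output row is a convex combination of the value vectors, so the first query, which matches the first key more closely, yields an output nearer to the first value.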

Absolute position embedding

A typical choice for Equation (1) adds a d-dimensional position vector p_i to the word embedding x_i before the linear projection
RoPE proposes to incorporate relative position information by multiplying with sinusoidal functions
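To make the additive scheme concrete, here is a small sketch of a sinusoidal position vector in the style of Vaswani et al. and its addition to a word embedding. The function names `sinusoidal_position` and `add_position` are assumptions for illustration, and the projection matrices are left out.

```python
import math

def sinusoidal_position(i, d):
    """Return a d-dimensional sinusoidal position vector p_i (d even)."""
    p = []
    for t in range(d // 2):
        angle = i / (10000 ** (2 * t / d))
        p.append(math.sin(angle))  # entry at even index 2t
        p.append(math.cos(angle))  # entry at odd index 2t + 1
    return p

def add_position(x, i):
    """Additive absolute encoding x_i + p_i, before any learned projection."""
    p = sinusoidal_position(i, len(x))
    return [xj + pj for xj, pj in zip(x, p)]

encoded = add_position([0.5, -0.5, 1.0, 0.0], i=3)
```

The contrast with RoPE is that here the position enters by addition, whereas RoPE multiplies the representation by sinusoidal terms.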

Relative position embedding

Shaw et al. applied different settings of an equation to train relative position embeddings
Dai et al. proposed to decompose an equation to replace absolute position embeddings with relative counterparts
W_k and W̃_k distinguished for content-based and location-based key vectors
Position information removed from value term
Later work followed these settings by encoding relative position information into attention weights
Raffel et al. reformed equation and Ke et al. investigated middle two terms
He et al. argued relative positions of two tokens can only be modeled using middle two terms
Absolute position embeddings replaced with relative position embeddings
Comparison of four variants of relative position embeddings showed one variant is most efficient
All approaches attempt to modify equation based on decomposition proposed in Vaswani et al.
Approach aims to derive relative position encoding from equation under constraints
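For reference, the decomposition these approaches build on can be written out as a worked equation. With the additive absolute encoding q_m = W_q(x_m + p_m) and k_n = W_k(x_n + p_n), expanding the query-key inner product gives four terms; the "middle two terms" mentioned above are the content-position cross terms. This is a sketch in the notation of this summary, not a reproduction of the paper's numbered equations.

```latex
\begin{aligned}
q_m^{\top} k_n
&= (x_m + p_m)^{\top} W_q^{\top} W_k \,(x_n + p_n) \\
&= x_m^{\top} W_q^{\top} W_k x_n
 + x_m^{\top} W_q^{\top} W_k p_n
 + p_m^{\top} W_q^{\top} W_k x_n
 + p_m^{\top} W_q^{\top} W_k p_n
\end{aligned}
```

The relative-position variants listed above differ mainly in which of these terms they keep and in what they substitute for p_m and p_n.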

Proposed approach

Formulated the relative position encoding problem
Derived the RoPE
Investigated properties of RoPE

Formulation

Transformer-based language modeling uses self-attention mechanism to leverage position information of individual tokens.
Equation (2) enables knowledge conveyance between tokens at different positions.
To incorporate relative position information, inner product of query and key is formulated by a function g.
Function g takes word embeddings and relative position as input variables.
Goal is to find an equivalent encoding mechanism, that is, to solve for functions f_q and f_k that conform to this relation.
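Written as an equation, the requirement described above is roughly the following, where f_q and f_k produce the position-encoded query and key and g may depend on the positions only through their difference m - n:

```latex
q_m = f_q(x_m, m), \qquad k_n = f_k(x_n, n), \qquad
\langle f_q(x_m, m),\, f_k(x_n, n) \rangle = g(x_m, x_n, m - n)
```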

Rotary position embedding

We prove that a solution to a formulation exists in a 2D plane.
We can write the solution in a multiplication matrix.
We can generalize the solution to any x_i ∈ R^d where d is even.
Our approach is multiplicative and incorporates relative position information through rotation matrix product.
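The sketch below illustrates the even-dimensional case: consecutive pairs of entries are treated as 2D points and rotated by an angle proportional to the position, with per-pair frequencies theta_i = 10000^(-2i/d). It is a minimal pure-Python sketch under stated assumptions: the function names `rope` and `dot` and the toy vectors are hypothetical, and the inputs stand for already-projected query and key vectors.

```python
import math

def rope(x, position, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle position * theta_i."""
    d = len(x)
    assert d % 2 == 0, "the rotation needs an even dimension"
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        # 2D rotation of the pair (a, b).
        out.extend([a * c - b * s, a * s + b * c])
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
score_a = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2, m - n = 3
score_b = dot(rope(q, 105), rope(k, 102))  # positions 105 and 102, m - n = 3
```

The two scores agree up to floating-point error because the product of the two rotations depends only on the relative position m - n. Since each rotation is orthogonal, the vector norm is also unchanged.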

Properties of RoPE

Long-term decay property means that the inner-product will decrease when the relative position increases.
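Self-attention has a quadratic complexity of O(N^2) in the sequence length N.
Linear attentions reformulate self-attention by applying non-negative feature functions, or softmax applied separately, to the queries and keys, so that the cost scales linearly with N.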
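To show why a non-negative feature map removes the quadratic cost, the sketch below computes the same kernelized attention two ways: once through sums accumulated over the keys (linear in N) and once pairwise (quadratic in N). The feature map elu(x) + 1 is one common choice and is an assumption here, as are all function names and the toy data.

```python
import math

def elu_plus_one(x):
    """Non-negative feature map elu(x) + 1, applied element-wise."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def linear_attention(queries, keys, values, feature=elu_plus_one):
    """Non-causal kernelized attention computed in time linear in N."""
    phi_k = [feature(k) for k in keys]
    d, dv = len(phi_k[0]), len(values[0])
    kv = [[0.0] * dv for _ in range(d)]  # sum_n phi(k_n) v_n^T
    k_sum = [0.0] * d                    # sum_n phi(k_n)
    for pk, v in zip(phi_k, values):
        for a in range(d):
            k_sum[a] += pk[a]
            for b in range(dv):
                kv[a][b] += pk[a] * v[b]
    outputs = []
    for q in queries:
        pq = feature(q)
        denom = sum(pq[a] * k_sum[a] for a in range(d))
        outputs.append([sum(pq[a] * kv[a][b] for a in range(d)) / denom
                        for b in range(dv)])
    return outputs

def quadratic_reference(queries, keys, values, feature=elu_plus_one):
    """The same quantity computed pairwise, in O(N^2), for comparison."""
    outputs = []
    for q in queries:
        pq = feature(q)
        sims = [sum(a * b for a, b in zip(pq, feature(k))) for k in keys]
        total = sum(sims)
        outputs.append([sum(s * v[j] for s, v in zip(sims, values)) / total
                        for j in range(len(values[0]))])
    return outputs

queries = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
keys = [[1.0, 0.0], [-0.5, 0.7], [0.2, -0.4]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fast = linear_attention(queries, keys, values)
slow = quadratic_reference(queries, keys, values)
```

Both routes give the same outputs; the linear version simply reorders the multiplications so that the sums over keys are computed once and reused for every query.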

Theoretical explanation

Two word embedding vectors x_q and x_k correspond to query and key, with positions m and n
There is a function g that defines the inner product between vectors produced by f_{q,k}
Radial functions R_q, R_k and R_g are independent of the position information
Angular functions do not depend on query and key
Equation (31) is the final solution
Entries of vectors q and k can be grouped in pairs
Inner product of RoPE can be written as a complex number multiplication
Abel transformation can be used to rewrite the summation
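The decay argument can be probed numerically. Grouping entries in pairs and writing each pair as a complex number, the Abel transformation bounds the inner product by a data-dependent factor times the sum of |S_j|, where S_j is the partial sum of exp(i (m - n) theta_i) over the first j frequencies. The sketch below computes the average |S_j| for a few relative distances; the function name `relative_upper_bound` and the choice d = 128 are assumptions for illustration.

```python
import cmath

def relative_upper_bound(distance, d=128, base=10000.0):
    """Average magnitude of the partial sums S_j for a relative distance m - n."""
    half = d // 2
    partial = 0j
    total = 0.0
    for i in range(half):
        theta = base ** (-2 * i / d)
        partial += cmath.exp(1j * distance * theta)
        total += abs(partial)  # |S_{i+1}|
    return total / half

bounds = {r: relative_upper_bound(r) for r in (0, 10, 50, 200)}
```

At distance 0 every term equals 1, so the average is largest; for larger distances the phases spread out and the partial sums shrink. The trend is a decay overall, although the curve is not strictly monotone.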

Experiments and evaluation

Evaluated proposed RoFormer on various NLP tasks
Validated performance on machine translation task
Compared RoPE implementation with BERT (Devlin et al. [2019])
Evaluated across different downstream tasks from GLUE benchmarks (Singh et al. [2018])
Conducted experiments using RoPE with linear attention of PerFormer (Choromanski et al. [2020])
Additional tests on Chinese data included

Machine translation

RoFormer is tested on the WMT 2014 English-German dataset and the Enwik8 dataset.
RoPE is incorporated into the 12-layer char-based PerFormer.
RoFormer is compared to the transformer-based baseline alternative and outperforms it.

Pre-training language modeling

Replaced sinusoidal position encoding of BERT with RoPE during pre-training step
Evaluated performance using F1-score, Spearman correlation and accuracy

Fine-tuning on GLUE tasks

Pre-trained RoFormer weights were fine-tuned
Evaluated generalization ability on downstream NLP tasks

Performer with RoPE

Performer (Choromanski et al. [2020]) introduced an alternative attention mechanism called linear attention.
Linear attention is designed to avoid quadratic computation cost that scales with input sequence length.
RoPE can be implemented in the PerFormer model to realize relative position encoding while keeping linear complexity in self-attention.
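One way to combine the two ideas, sketched below, is to rotate the feature-mapped queries and keys in the numerator while leaving the normalizing denominator unrotated, so that the denominator stays positive. This is a minimal pure-Python sketch and not the PerFormer implementation: the feature map elu(x) + 1, the `offset` parameter and all function names are assumptions for illustration.

```python
import math

def feature(x):
    """Non-negative feature map elu(x) + 1 (an assumed choice)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def rotate(x, position, base=10000.0):
    """Apply the rotary rotation to consecutive pairs of x (even length)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = position * base ** (-2 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[2 * i] * c - x[2 * i + 1] * s,
                x[2 * i] * s + x[2 * i + 1] * c]
    return out

def rope_linear_attention(queries, keys, values, offset=0):
    """Non-causal linear attention with rotated numerator; offset shifts all positions."""
    d, dv = len(keys[0]), len(values[0])
    kv = [[0.0] * dv for _ in range(d)]  # sum_n (R_n phi(k_n)) v_n^T
    k_sum = [0.0] * d                    # sum_n phi(k_n), left unrotated
    for n, (k, v) in enumerate(zip(keys, values)):
        pk = feature(k)
        rk = rotate(pk, n + offset)
        for a in range(d):
            k_sum[a] += pk[a]
            for b in range(dv):
                kv[a][b] += rk[a] * v[b]
    outputs = []
    for m, q in enumerate(queries):
        pq = feature(q)
        rq = rotate(pq, m + offset)
        denom = sum(pq[a] * k_sum[a] for a in range(d))
        outputs.append([sum(rq[a] * kv[a][b] for a in range(d)) / denom
                        for b in range(dv)])
    return outputs

queries = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
keys = [[1.0, 0.0], [-0.5, 0.7], [0.2, -0.4]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
base_out = rope_linear_attention(queries, keys, values)
shifted_out = rope_linear_attention(queries, keys, values, offset=50)
```

Shifting every position by the same offset leaves the outputs unchanged up to floating-point error, which reflects the relative nature of the encoding, and the cost remains linear in the sequence length. Because the rotation can make individual numerator terms negative, the weights are no longer guaranteed to be a probability distribution.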

Evaluation on Chinese data

Conducted experiments on English and Chinese data
Experiments on long documents with length exceeding 512 characters
Task is to predict whether pair (A, B) is closer than (A, C)
Most existing methods struggle on the CAIL2019-SCM dataset because of the length of its documents
Split train, validation and test sets based on 6:2:2 ratio
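A 6:2:2 split can be sketched as follows. The helper name `split_622`, the fixed seed and the shuffling step are assumptions for illustration; the source does not say how the examples were ordered or sampled.

```python
import random

def split_622(items, seed=0):
    """Shuffle a copy of items and split it 6:2:2 into train, validation, test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding in the cut points.
    n_train = len(items) * 6 // 10
    n_val = len(items) * 2 // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = split_622(range(100))
```

Any remainder from the integer division ends up in the test portion.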

Conclusions

Proposed a new position embedding method for transformer architectures
Method incorporates explicit relative position dependency in self-attention
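Relative position can be formulated using the vector product in self-attention, with absolute position encoded through a rotation matrix
Advantageous properties of the proposed method when applied to the Transformer were illustrated mathematically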
Experiments on English and Chinese datasets show faster convergence in pre-training
