Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Position encoding has been effective in the transformer architecture.
- Various methods have been used to integrate positional information into the learning process of transformer-based language models.
- RoPE encodes the absolute position with a rotation matrix and incorporates the explicit relative position dependency in self-attention formulation.
- RoPE has properties such as flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding.
- RoFormer has been evaluated on various long text classification benchmark datasets and consistently outperforms its alternatives.
Paper Content
Introduction
- Order of words is important for natural language understanding
- Recurrent neural networks (RNNs) encode tokens’ order
- Convolutional neural networks (CNNs) were typically considered position-agnostic, although recent work shows they can implicitly learn position information
- Pre-trained language models (PLMs) use self-attention mechanism to capture contextual representation
- Various approaches have been proposed to encode position information
- Absolute position encoding and relative position encoding
- Rotary Position Embedding (RoPE) encodes absolute position with a rotation matrix
- Under RoPE, inter-token dependency decays with increasing relative distance
- RoPE is compatible with linear self-attention
- RoFormer (RoPE + PLMs) achieves better performance than alternatives
Preliminary
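- Sequence of $N$ input tokens denoted as $\mathbb{S}_N = \{w_i\}_{i=1}^{N}$
- Word embedding of $\mathbb{S}_N$ denoted as $\mathbb{E}_N = \{x_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ is the d-dimensional word embedding vector of token $w_i$ without position information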
- Position information incorporated into queries, keys, and value representations
- Attention weights computed from query and key values
- Output computed as weighted sum over value representations
- Existing approaches of transformer-based position encoding focus on choosing suitable function for Equation (1)
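For reference, the setup these bullets describe can be written out as follows. This is a reconstruction in standard self-attention notation; the symbols $q_m$, $k_n$, $v_n$, $a_{m,n}$ and $o_m$ are assumed here because the summary does not spell them out.

```latex
\begin{aligned}
q_m &= f_q(x_m, m), \quad k_n = f_k(x_n, n), \quad v_n = f_v(x_n, n) && (1) \\
a_{m,n} &= \frac{\exp\left(q_m^{\top} k_n / \sqrt{d}\right)}{\sum_{j=1}^{N} \exp\left(q_m^{\top} k_j / \sqrt{d}\right)}, \quad
o_m = \sum_{n=1}^{N} a_{m,n}\, v_n && (2)
\end{aligned}
```

Position enters only through the functions $f_q$, $f_k$ and $f_v$, which is why the choice of these functions is the focus of the position-encoding literature.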
Absolute position embedding
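- A typical choice for Equation (1) is $f_t(x_i, i) = W_t(x_i + p_i)$ with $t \in \{q, k, v\}$, where $p_i \in \mathbb{R}^d$ is a d-dimensional vector that depends on the position of token $x_i$
- Instead of adding the position vector to the context representation, RoPE incorporates relative position information by multiplying with sinusoidal functions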
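A minimal sketch of the additive scheme, assuming the fixed sinusoidal vectors of Vaswani et al. as the choice of the position vector. The function names `sinusoidal_position` and `add_position` are hypothetical and used only for illustration.

```python
import math

def sinusoidal_position(i, d):
    """Return the d-dimensional sinusoidal vector p_i for position i (d must be even)."""
    p = []
    for t in range(d // 2):
        angle = i / (10000 ** (2 * t / d))
        p.extend([math.sin(angle), math.cos(angle)])  # entries 2t and 2t+1
    return p

def add_position(x, i):
    """Additive scheme: x_i + p_i, computed before the W_q / W_k / W_v projections."""
    return [a + b for a, b in zip(x, sinusoidal_position(i, len(x)))]

# Position 0 contributes sin(0) = 0 and cos(0) = 1 in every pair.
p0 = sinusoidal_position(0, 4)
x0 = add_position([1.0, 1.0, 1.0, 1.0], 0)
```

The contrast with RoPE is that the position is added to the embedding here, whereas RoPE multiplies the projected vector by position-dependent sines and cosines.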
Relative position embedding
- Shaw et al. applied different settings of an equation to train relative position embeddings
- Dai et al. proposed to decompose an equation to replace absolute position embeddings with relative counterparts
- $W_k$ and $\widetilde{W}_k$ distinguished for content-based and location-based key vectors
- Position information removed from value term
- Later work followed these settings by encoding relative position information into attention weights
- Raffel et al. reformulated the equation, and Ke et al. investigated its middle two terms
- He et al. argued relative positions of two tokens can only be modeled using middle two terms
- Absolute position embeddings replaced with relative position embeddings
- Comparison of four variants of relative position embeddings showed one variant is most efficient
- All of these approaches modify the same decomposition of the query-key product, under the self-attention settings proposed in Vaswani et al.
- RoFormer instead aims to derive the relative position encoding from Equation (1) under some constraints
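The "decomposition" and "middle two terms" mentioned above refer to the expansion of the query-key product under the additive form $f_t(x_i, i) = W_t(x_i + p_i)$. Written out, as a reconstruction from that assumption:

```latex
q_m^{\top} k_n
= x_m^{\top} W_q^{\top} W_k x_n
+ x_m^{\top} W_q^{\top} W_k p_n
+ p_m^{\top} W_q^{\top} W_k x_n
+ p_m^{\top} W_q^{\top} W_k p_n
```

The first term is content-to-content, the middle two mix content and position, and the last is position-to-position. The relative-position variants listed above differ in which of these terms they keep and in how they replace $p_m$ and $p_n$ with relative counterparts.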
Proposed approach
- Formulated the relative position encoding problem
- Derived the RoPE
- Investigated properties of RoPE
Formulation
- Transformer-based language modeling uses self-attention mechanism to leverage position information of individual tokens.
- Equation (2) enables knowledge conveyance between tokens at different positions.
- To incorporate relative position information, inner product of query and key is formulated by a function g.
- Function g takes word embeddings and relative position as input variables.
- Goal is to find an equivalent encoding mechanism, solving for functions $f_q$ and $f_k$ that conform to the relation.
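The relation in question can be stated compactly, using the notation of the bullets above with $m$ and $n$ as the positions of the query and key tokens:

```latex
\langle f_q(x_m, m),\, f_k(x_n, n) \rangle = g(x_m, x_n, m - n)
```

The right-hand side depends on the positions only through the difference $m - n$, which is what makes the encoding relative.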
Rotary position embedding
- We prove that a solution to the formulation exists in the 2D case.
- We can write the solution as multiplication by a rotation matrix.
- We can generalize the solution to any $x_i \in \mathbb{R}^d$ where $d$ is even.
- Our approach is multiplicative and incorporates relative position information through rotation matrix product.
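The rotation described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: it assumes the usual frequencies $\theta_i = 10000^{-2i/d}$ (zero-indexed), applies the rotation directly to example vectors standing in for the projected query and key, and the names `rope` and `dot` are hypothetical.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle pos * theta_i."""
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x1, x2 = x[2 * i], x[2 * i + 1]
        out.extend([x1 * c - x2 * s, x1 * s + x2 * c])
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # stands in for W_q x_m
k = [1.0, 0.4, -0.6, 0.9]   # stands in for W_k x_n

# Same relative offset (m - n = 3) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
```

Up to floating-point error, `score_a` and `score_b` agree, because the product of the two rotations depends only on the offset between positions. Each vector is rotated using its absolute position, yet the attention score sees only the relative one.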
Properties of RoPE
- Long-term decay property means that the inner-product will decrease when the relative position increases.
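- Self-attention has a quadratic complexity of $O(N^2)$ in the sequence length $N$.
- Linear attention reformulates self-attention by replacing the softmax kernel with non-negative feature functions of the query and key, or by applying softmax to queries and keys separately, so that keys and values can be aggregated first.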
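The long-term decay property can be checked numerically. The sketch below assumes $\theta_i = 10000^{-2i/d}$ and computes the average magnitude of the partial sums $S_j = \sum_{i<j} e^{\mathrm{i} r \theta_i}$ for a relative distance $r$, which serves as a relative upper bound on the inner product. The function name `relative_upper_bound` and the default $d = 128$ are choices made for this illustration.

```python
import cmath

def relative_upper_bound(r, d=128, base=10000.0):
    """Average of |S_j| over j = 1..d/2, where S_j = sum_{i<j} exp(1j * r * theta_i)."""
    half = d // 2
    partial = 0j
    total = 0.0
    for i in range(half):
        theta = base ** (-2 * i / d)
        partial += cmath.exp(1j * r * theta)  # partial now holds S_{i+1}
        total += abs(partial)
    return total / half

# Bound at a few relative distances; inspect how it shrinks as r grows.
bounds = {r: relative_upper_bound(r) for r in (0, 1, 10, 100)}
```

At $r = 0$ all phases align and the bound takes its maximum. For larger distances the phases spread out and the partial sums partly cancel, which is the mechanism behind the decay.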
Theoretical explanation
- Two word embedding vectors $x_q$ and $x_k$ correspond to the query and key, at positions $m$ and $n$
- There is a function $g$ that defines the inner product between the vectors produced by $f_{\{q,k\}}$
- Radial functions $R_q$, $R_k$ and $R_g$ are independent of the position information
- Angular functions do not depend on query and key
- Equation (31) is the final solution
- Entries of vectors q and k can be grouped in pairs
- Inner product of RoPE can be written as a complex number multiplication
- Abel transformation can be used to rewrite the summation
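The complex-number view of the 2D case can be illustrated directly. In this sketch a pair $(x_1, x_2)$ is treated as the complex number $x_1 + \mathrm{i} x_2$, and the rotation becomes multiplication by $e^{\mathrm{i} m \theta}$. The names `rope_2d` and `inner`, and the value $\theta = 1$, are assumptions made for the example.

```python
import cmath

THETA = 1.0  # a single rotation frequency, chosen arbitrarily for the 2D case

def rope_2d(x, pos, theta=THETA):
    """2D RoPE: multiply the complex number x by exp(1j * pos * theta)."""
    return x * cmath.exp(1j * pos * theta)

def inner(a, b):
    """Real inner product of two 2D vectors written as complex numbers: Re[a * conj(b)]."""
    return (a * b.conjugate()).real

q, k = complex(0.3, -1.2), complex(1.0, 0.4)
m, n = 7, 4

# Left: rotate both vectors by their absolute positions, then take the inner product.
lhs = inner(rope_2d(q, m), rope_2d(k, n))
# Right: a single rotation by the relative position m - n.
rhs = (q * k.conjugate() * cmath.exp(1j * (m - n) * THETA)).real
```

The two sides agree up to floating-point error because conjugating the rotated key flips the sign of its phase, leaving only the phase difference $(m - n)\theta$.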
Experiments and evaluation
- Evaluated proposed RoFormer on various NLP tasks
- Validated performance on machine translation task
- Compared RoPE implementation with BERT Devlin et al. [2019]
- Evaluated across different downstream tasks from GLUE benchmarks Singh et al. [2018]
- Conducted experiments using RoPE with linear attention of PerFormer Choromanski et al. [2020]
- Additional tests on Chinese data included
Machine translation
- RoFormer is tested on the WMT 2014 English-to-German translation task.
- RoPE is also incorporated into the 12-layer char-based PerFormer, which is evaluated on the Enwik8 dataset.
- RoFormer is compared to the transformer-based baseline alternative and outperforms it.
Pre-training language modeling
- Replaced sinusoidal position encoding of BERT with RoPE during pre-training step
- Evaluated performance using F1-score, Spearman correlation and accuracy
Fine-tuning on GLUE tasks
- Pre-trained RoFormer weights were fine-tuned
- Evaluated generalization ability on downstream NLP tasks
Performer with RoPE
- Performer Choromanski et al. [2020] introduced an alternative attention mechanism called linear attention.
- Linear attention is designed to avoid quadratic computation cost that scales with input sequence length.
- RoPE can be implemented in the PerFormer model to realize relative position encoding while keeping linear complexity in self-attention.
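A rough sketch of how rotation and linear attention can coexist. Several assumptions apply: the feature map is `elu(x) + 1`, one common non-negative choice, and not Performer's random-feature approximation; the attention is non-causal; positions are zero-indexed; and all function names are hypothetical. The rotation is applied to the feature-mapped queries and keys in the numerator only, so the normalizer stays positive.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by pos * theta_i."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out.extend([x[2 * i] * c - x[2 * i + 1] * s, x[2 * i] * s + x[2 * i + 1] * c])
    return out

def phi(x):
    """elu(x) + 1, a non-negative feature map (an assumed choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def linear_attention_rope(Q, K, V):
    d, dv = len(K[0]), len(V[0])
    # Key/value summaries built once in O(N): S is d x dv, z has length d.
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for n, (k, v) in enumerate(zip(K, V)):
        fk = phi(k)
        rk = rope(fk, n)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += rk[a] * v[b]
    out = []
    for m, q in enumerate(Q):
        fq = phi(q)
        rq = rope(fq, m)
        denom = dot(fq, z)  # unrotated, so it stays positive
        out.append([dot(rq, [S[a][b] for a in range(d)]) / denom for b in range(dv)])
    return out

def quadratic_reference(Q, K, V):
    """Same quantity computed pairwise in O(N^2), used only as a check."""
    out = []
    for m, q in enumerate(Q):
        fq = phi(q)
        rq = rope(fq, m)
        w = [dot(rq, rope(phi(k), n)) for n, k in enumerate(K)]
        denom = sum(dot(fq, phi(k)) for k in K)
        out.append([sum(wn * v[b] for wn, v in zip(w, V)) / denom
                    for b in range(len(V[0]))])
    return out

Q = [[0.5, -1.0, 0.2, 0.1], [-0.3, 0.8, 1.5, -0.7], [0.0, 0.4, -0.9, 1.1]]
K = [[1.0, 0.4, -0.6, 0.9], [-0.2, 0.3, 0.7, -1.4], [0.6, -0.5, 0.1, 0.2]]
V = [[1.0, 2.0], [0.5, -1.0], [-0.3, 0.7]]
fast = linear_attention_rope(Q, K, V)
slow = quadratic_reference(Q, K, V)
```

The linear version never forms the N-by-N score matrix: each query reads the shared summaries `S` and `z`. Because rotations preserve norms but not signs, the rotated numerator weights can be negative, so the result is not a strict probability-weighted average.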
Evaluation on Chinese data
- Conducted experiments on English and Chinese data
- Experiments on long documents with length exceeding 512 characters
- Task is to predict whether pair (A, B) is closer than (A, C)
- Most existing methods do not achieve strong results on the CAIL2019-SCM dataset because the documents are long
- Split train, validation and test sets based on 6:2:2 ratio
Conclusions
- Proposed a new position embedding method for transformer architectures
- Method incorporates explicit relative position dependency in self-attention
- Relative position can be formulated naturally through the vector product (the query-key inner product) in self-attention
- Mathematically illustrated the advantageous properties of the proposed method when applied to the Transformer
- Experiments on English and Chinese datasets show faster convergence in pre-training