Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Self-attention is used for sequence modeling problems.
Self-attention has quadratic compute and memory requirements.
Routing Transformer learns dynamic sparse attention patterns to reduce complexity.
Routing Transformer outperforms comparable models on language and image generation tasks.
Routing Transformer sets a new state-of-the-art on PG-19 data-set.

Paper Content

Introduction

Generative models of sequences have seen rapid progress due to the use of attention in neural networks
Attention has improved machine translation
Self-attention has been used to learn powerful representations of language for natural language processing tasks
Self-attention has also been used for image and music generation
Self-attention operates over sequences in a stepwise manner
Self-attention is a particular case of attention
Self-attention is commonly used in autoregressive generative models
Self-attention can focus on any part of the previous context
RNNs and CNNs have direct interactions with only a local neighborhood of context
Self-attention has time and space complexity of quadratic in n
Previous work has proposed data independent or fixed sparsity patterns
Correia et al. proposed content-based sparsity
Present work proposes clustering of attention
Routing Transformer combines efficient clustering-based sparse attention with classical local attention
Sets new state-of-the-art on language and image generation tasks

Research on efficient attention neural models parallels the advent of attention-based architectures
Jaitly et al. (2016) proposed the Neural Transducer which segments sequences in non-overlapping chunks
Limiting attention to a fixed temporal context around the current prediction has been explored
Hierarchical attention strategies have been explored
Liu et al. (2018) simplified hierarchical attention by alternating coarse and local layers
Child et al. (2019) introduced bounded and strided attention
Sukhbaatar et al. (2019) let the model learn the length of the temporal context for each attention module
Content-based sparse attention has been introduced to allow for richer patterns
Martins and Kreutzer (2017); Malaviya et al. (2018) proposed to compute attention weights with variants of sparsemax
Correia et al. (2019) generalized this approach to every layer in a Transformer
Kitaev et al. (2020) proposed to use Locality Sensitive Hashing (LSH) to infer content based sparsity patterns
Previous work often refers to this goal as gating for conditional computation
Memory augmented neural networks with sparse reads and writes have been proposed
Sparsely gated Mixture-of-experts (MOE) (Shazeer et al., 2017) has been used
Lample et al. (2019) used product quantization based key-value lookups to replace the feed forward network in the Transformer

Self-attentive auto-regressive sequence modeling

Auto-regressive sequence models decompose the probability of a sequence
Neural models use a neural network with learned parameters to model the conditional distribution
Transformer architectures have been used in language modeling, image generation, and music generation
Self-attention layers use three linear projections (keys, queries, and values)
Attention matrix A is a matrix of weights in [0, 1]
Transformer adds extensions to self-attention mechanism
Each layer relies on multiple attention heads
Transformer further processes the result of attention through a learnable non-linear transformation
Full Transformer model is a chain of attention modules
Our work is interested in the application of the Transformer to long sequences

Efficient content-dependent sparse attention

Attention-based models can be problematic for long sequences.
Sparse attention models rely on attention matrices with a majority of zero entries.
For each query, a sparse attention model defines a set of keys which can be attended to.
Previous work on attention sparsity defined sets based on positions, independently of query and key vectors.
Content-based sparse attention should be carefully implemented to avoid instantiating full attention matrices.

Routing attention with clustering

Model sparse attention matrices with low rank sparsity patterns using k-means clustering
Assign queries and keys to clusters
Queries and keys from same cluster considered for attention
Model parameters are centroid vectors shared across sequences
Sparse attention strategy routes queries to keys belonging to same cluster
MIPS problem is equivalent to Nearest Neighbor Search when norm of every element is constant
Project queries and keys onto unit ball before computing them
If Q and K belong to same cluster, then QK > 1-2ε2
Clustering routing strategy preserves large attention weights as non-zero entries
Computational complexity is O(nkd + n2d/k)
Optimal choice of k is √n
Mini-batch k-means used to train cluster centroids
Update each cluster centroid by exponentially moving average of all keys and queries assigned to it
Exclude padding tokens from affecting centroids
Additional masking strategy needed for causal attention

Experiments

Evaluated sparse attention model on various generative modeling tasks
Found scaled up version of local attention is a strong baseline
Routing Transformer outperforms Transformer-XL and Sparse Transformer on all tasks
Routing Transformer sets new state-of-the-art result on PG-19 dataset
Adam optimizer used with learning rate 2 x 10-4

Cifar-10

CIFAR-10 is a data-set of 60,000 colored images of size 32x32
Used as a toy data-set to test the effect of various hyper-parameter choices on model performance
12 layer models with 8 attention heads tested
Hyper-parameters varied: number of routing attention heads, number of routing attention layers, size of attention window
Routing Transformer compared to Random Transformer (where K idx is randomly chosen)
Models trained with batch size of 32 and for 200,000 steps

Wikitext-103

Wikitext-103 is a benchmark data-set for testing long term dependencies in word-level language models.
Routing Transformer with 16 heads is trained using relative position encoding.
Results are compared to other recent work on sparse or recurrent attention.
Local attention is better than adaptive methods.
Routing Transformer model gets a test perplexity of 15.8, improving on 18.3 obtained by TransformerXL.

Enwik-8

Enwik-8 is a data-set used to benchmark text compression algorithms
Sequence length n = 8192
24 layer model with 8 attention heads, attention and ReLU dropout rate of 0.4
Routing attention set to k = 32 and attention window 256
Perplexity of 0.99
Evaluate long term dependencies on ImageNet 64x64 data-set
24 layer model with 16 attention heads, half local attention, half routing attention
Routing attention set to k = 8, attention window 2048
Local attention (Parmar et al., 2018) achieves 3.48 bits/dim
Routing Transformer model achieves 3.425 bits/dim

Pg-19

PG-19 is a new data-set created from 28,000 Project Gutenberg books published before 1919
It consists of 1.9 billion tokens and an average context size of roughly
Training setup changed in three ways: 2 routing heads, routing heads only in last two layers, Adafactor optimizer
Evaluated difference in attention patterns between local and routing attention by computing Jensen-Shannon divergence
Divergence between local and routing attention heads is close to upper-bound of 0.6931
MIPS search used to select high dot product items for attention, which facilitates global consistency

Recurrence vs sparse attention

Sparse attention is an alternative approach to Transformer-XL and Compressive Transformer
Transformer-XL and Compressive Transformer train on small sequences and use cross attention to generalize to longer sequences
Sparse attention trains directly on long sequences from the beginning
Transformer-XL is less memory consuming and can scale to more layers
Sparse attention is more memory expensive but can still be competitive with Transformer-XL

Wall-clock time

Local Transformer is 1.7x faster than Routing Transformer on CIFAR-10
Sparse operations on TPU lack support, but can be sped up on GPU
Goal of work is memory efficient version of sparse attention, not wall-clock time efficiency

Conclusion

Transformer models are the state-of-the-art in auto-regressive generative models for sequential data
Their space-time complexity is quadratic in sequence length due to attention modules
Routing Transformer proposed as a sparse attention model, based on content-similarity
Does not require fixed attention patterns, but has similar space-time complexity
Scaled up version of local attention establishes a strong baseline on modern benchmark
Routing Transformer redefines the state-of-the-art in large long sequence benchmarks
Offers complementary attention patterns compared to local attention
Could be useful in domains with naturally sparse inputs, such as 3D point clouds, social networks, or protein interactions
Bible study and Apocrypha study created a desire for deeper study of the Bible
Passion for writing profane and sacred history fostered by increase of Roman Church
Study of Scripture in more liberal spirit helped accelerate spread of new and healthier conception of Christian life
Reformation movement had a great effect on other aspects of life of people and growth and extension of Protestantism
Philip of Hesse was a man of religious instincts and aspirations
Revival of primitive and devout tendencies in Lutheran Church
Enthusiasm for study of Bible in original languages broken down
Government of German Empire set itself to adapt Bible to spirit and needs of Germany
Aim of Reformation was to study Bible as living interpreter of God’s words and acquire living, active, self-interpreting, and God-glorifying Christian spirit
Introduction of commentaries on text of Old Testament and growth of literature for it due to German Christians
Work of translation accomplished in able and painstaking fashion by Commentary on the Galatians in 1531

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Self-attentive auto-regressive sequence modeling#

Efficient content-dependent sparse attention#

Routing attention with clustering#

Experiments#

Cifar-10#

Wikitext-103#

Enwik-8#

Pg-19#

Recurrence vs sparse attention#

Wall-clock time#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Self-attentive auto-regressive sequence modeling

Efficient content-dependent sparse attention

Routing attention with clustering

Experiments

Cifar-10

Wikitext-103

Enwik-8

Pg-19

Recurrence vs sparse attention

Wall-clock time

Conclusion