Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Self-attention is used for sequence modeling problems.
  • Self-attention has quadratic compute and memory requirements.
  • Routing Transformer learns dynamic sparse attention patterns to reduce complexity.
  • Routing Transformer outperforms comparable models on language and image generation tasks.
  • Routing Transformer sets a new state-of-the-art on PG-19 data-set.

Paper Content

Introduction

  • Generative models of sequences have seen rapid progress due to the use of attention in neural networks
  • Attention has improved machine translation
  • Self-attention has been used to learn powerful representations of language for natural language processing tasks
  • Self-attention has also been used for image and music generation
  • Self-attention operates over sequences in a stepwise manner
  • Self-attention is a particular case of attention
  • Self-attention is commonly used in autoregressive generative models
  • Self-attention can focus on any part of the previous context
  • RNNs and CNNs have direct interactions with only a local neighborhood of context
  • Self-attention has time and space complexity of quadratic in n
  • Previous work has proposed data independent or fixed sparsity patterns
  • Correia et al. proposed content-based sparsity
  • Present work proposes clustering of attention
  • Routing Transformer combines efficient clustering-based sparse attention with classical local attention
  • Sets new state-of-the-art on language and image generation tasks
  • Research on efficient attention neural models parallels the advent of attention-based architectures
  • Jaitly et al. (2016) proposed the Neural Transducer which segments sequences in non-overlapping chunks
  • Limiting attention to a fixed temporal context around the current prediction has been explored
  • Hierarchical attention strategies have been explored
  • Liu et al. (2018) simplified hierarchical attention by alternating coarse and local layers
  • Child et al. (2019) introduced bounded and strided attention
  • Sukhbaatar et al. (2019) let the model learn the length of the temporal context for each attention module
  • Content-based sparse attention has been introduced to allow for richer patterns
  • Martins and Kreutzer (2017); Malaviya et al. (2018) proposed to compute attention weights with variants of sparsemax
  • Correia et al. (2019) generalized this approach to every layer in a Transformer
  • Kitaev et al. (2020) proposed to use Locality Sensitive Hashing (LSH) to infer content based sparsity patterns
  • Previous work often refers to this goal as gating for conditional computation
  • Memory augmented neural networks with sparse reads and writes have been proposed
  • Sparsely gated Mixture-of-experts (MOE) (Shazeer et al., 2017) has been used
  • Lample et al. (2019) used product quantization based key-value lookups to replace the feed forward network in the Transformer

Self-attentive auto-regressive sequence modeling

  • Auto-regressive sequence models decompose the probability of a sequence
  • Neural models use a neural network with learned parameters to model the conditional distribution
  • Transformer architectures have been used in language modeling, image generation, and music generation
  • Self-attention layers use three linear projections (keys, queries, and values)
  • Attention matrix A is a matrix of weights in [0, 1]
  • Transformer adds extensions to self-attention mechanism
  • Each layer relies on multiple attention heads
  • Transformer further processes the result of attention through a learnable non-linear transformation
  • Full Transformer model is a chain of attention modules
  • Our work is interested in the application of the Transformer to long sequences

Efficient content-dependent sparse attention

  • Attention-based models can be problematic for long sequences.
  • Sparse attention models rely on attention matrices with a majority of zero entries.
  • For each query, a sparse attention model defines a set of keys which can be attended to.
  • Previous work on attention sparsity defined sets based on positions, independently of query and key vectors.
  • Content-based sparse attention should be carefully implemented to avoid instantiating full attention matrices.

Routing attention with clustering

  • Model sparse attention matrices with low rank sparsity patterns using k-means clustering
  • Assign queries and keys to clusters
  • Queries and keys from same cluster considered for attention
  • Model parameters are centroid vectors shared across sequences
  • Sparse attention strategy routes queries to keys belonging to same cluster
  • MIPS problem is equivalent to Nearest Neighbor Search when norm of every element is constant
  • Project queries and keys onto unit ball before computing them
  • If Q and K belong to same cluster, then QK > 1-2ε2
  • Clustering routing strategy preserves large attention weights as non-zero entries
  • Computational complexity is O(nkd + n2d/k)
  • Optimal choice of k is √n
  • Mini-batch k-means used to train cluster centroids
  • Update each cluster centroid by exponentially moving average of all keys and queries assigned to it
  • Exclude padding tokens from affecting centroids
  • Additional masking strategy needed for causal attention

Experiments

  • Evaluated sparse attention model on various generative modeling tasks
  • Found scaled up version of local attention is a strong baseline
  • Routing Transformer outperforms Transformer-XL and Sparse Transformer on all tasks
  • Routing Transformer sets new state-of-the-art result on PG-19 dataset
  • Adam optimizer used with learning rate 2 x 10-4

Cifar-10

  • CIFAR-10 is a data-set of 60,000 colored images of size 32x32
  • Used as a toy data-set to test the effect of various hyper-parameter choices on model performance
  • 12 layer models with 8 attention heads tested
  • Hyper-parameters varied: number of routing attention heads, number of routing attention layers, size of attention window
  • Routing Transformer compared to Random Transformer (where K idx is randomly chosen)
  • Models trained with batch size of 32 and for 200,000 steps

Wikitext-103

  • Wikitext-103 is a benchmark data-set for testing long term dependencies in word-level language models.
  • Routing Transformer with 16 heads is trained using relative position encoding.
  • Results are compared to other recent work on sparse or recurrent attention.
  • Local attention is better than adaptive methods.
  • Routing Transformer model gets a test perplexity of 15.8, improving on 18.3 obtained by TransformerXL.

Enwik-8

  • Enwik-8 is a data-set used to benchmark text compression algorithms
  • Sequence length n = 8192
  • 24 layer model with 8 attention heads, attention and ReLU dropout rate of 0.4
  • Routing attention set to k = 32 and attention window 256
  • Perplexity of 0.99
  • Evaluate long term dependencies on ImageNet 64x64 data-set
  • 24 layer model with 16 attention heads, half local attention, half routing attention
  • Routing attention set to k = 8, attention window 2048
  • Local attention (Parmar et al., 2018) achieves 3.48 bits/dim
  • Routing Transformer model achieves 3.425 bits/dim

Pg-19

  • PG-19 is a new data-set created from 28,000 Project Gutenberg books published before 1919
  • It consists of 1.9 billion tokens and an average context size of roughly
  • Training setup changed in three ways: 2 routing heads, routing heads only in last two layers, Adafactor optimizer
  • Evaluated difference in attention patterns between local and routing attention by computing Jensen-Shannon divergence
  • Divergence between local and routing attention heads is close to upper-bound of 0.6931
  • MIPS search used to select high dot product items for attention, which facilitates global consistency

Recurrence vs sparse attention

  • Sparse attention is an alternative approach to Transformer-XL and Compressive Transformer
  • Transformer-XL and Compressive Transformer train on small sequences and use cross attention to generalize to longer sequences
  • Sparse attention trains directly on long sequences from the beginning
  • Transformer-XL is less memory consuming and can scale to more layers
  • Sparse attention is more memory expensive but can still be competitive with Transformer-XL

Wall-clock time

  • Local Transformer is 1.7x faster than Routing Transformer on CIFAR-10
  • Sparse operations on TPU lack support, but can be sped up on GPU
  • Goal of work is memory efficient version of sparse attention, not wall-clock time efficiency

Conclusion

  • Transformer models are the state-of-the-art in auto-regressive generative models for sequential data
  • Their space-time complexity is quadratic in sequence length due to attention modules
  • Routing Transformer proposed as a sparse attention model, based on content-similarity
  • Does not require fixed attention patterns, but has similar space-time complexity
  • Scaled up version of local attention establishes a strong baseline on modern benchmark
  • Routing Transformer redefines the state-of-the-art in large long sequence benchmarks
  • Offers complementary attention patterns compared to local attention
  • Could be useful in domains with naturally sparse inputs, such as 3D point clouds, social networks, or protein interactions
  • Bible study and Apocrypha study created a desire for deeper study of the Bible
  • Passion for writing profane and sacred history fostered by increase of Roman Church
  • Study of Scripture in more liberal spirit helped accelerate spread of new and healthier conception of Christian life
  • Reformation movement had a great effect on other aspects of life of people and growth and extension of Protestantism
  • Philip of Hesse was a man of religious instincts and aspirations
  • Revival of primitive and devout tendencies in Lutheran Church
  • Enthusiasm for study of Bible in original languages broken down
  • Government of German Empire set itself to adapt Bible to spirit and needs of Germany
  • Aim of Reformation was to study Bible as living interpreter of God’s words and acquire living, active, self-interpreting, and God-glorifying Christian spirit
  • Introduction of commentaries on text of Old Testament and growth of literature for it due to German Christians
  • Work of translation accomplished in able and painstaking fashion by Commentary on the Galatians in 1531