Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Transformer architecture is widely used for natural language processing tasks, but not for computer vision.
  • Attention is usually used in conjunction with convolutional networks or to replace certain components of convolutional networks.
  • Vision Transformer (ViT) can perform well on image classification tasks without relying on convolutional networks.
  • ViT can attain excellent results on mid-sized or small image recognition benchmarks with fewer computational resources.

Paper Content

Introduction

  • Transformers are the model of choice in NLP
  • Transformers are computationally efficient and scalable
  • Convolutional architectures are dominant in computer vision
  • Some works combine CNNs and self-attention
  • Transformer applied directly to images, split into patches
  • Model trained on mid-sized datasets yields modest accuracies
  • Large scale training trumps inductive bias
  • ViT approaches or beats state of the art on multiple image recognition benchmarks
  • Transformers proposed for machine translation and used in many NLP tasks
  • Large Transformer-based models pre-trained on large corpora and then fine-tuned
  • BERT uses denoising self-supervised pre-training task
  • GPT line of work uses language modeling as pre-training task
  • Applying Transformers to images requires approximations
  • Combining CNNs with forms of self-attention
  • iGPT applies Transformers to image pixels after reducing resolution and color space
  • Use of additional data sources allows to achieve state-of-the-art results on standard benchmarks

Method

  • Model design follows Transformer (Vaswani et al., 2017)
  • Advantages of simple setup: scalable NLP Transformer architectures and efficient implementations can be used

Vision transformer (vit)

  • Model takes 1D sequence of token embeddings as input
  • Image reshaped into sequence of flattened 2D patches
  • Trainable linear projection maps patches to constant latent vector size
  • Output of projection is patch embeddings
  • Learnable embedding prepended to sequence of embedded patches
  • Classification head attached to output of Transformer encoder
  • Position embeddings added to patch embeddings
  • Transformer encoder consists of alternating layers of multiheaded self-attention and MLP blocks
  • Hybrid model uses feature maps from CNN as input sequence

Fine-tuning and higher resolution

  • Pre-train ViT on large datasets and fine-tune to smaller downstream tasks.
  • Use a zero-initialized feedforward layer for downstream classes.
  • Fine-tune at higher resolution than pre-training.
  • Keep patch size the same when feeding higher resolution images.
  • Vision Transformer can handle arbitrary sequence lengths.
  • Perform 2D interpolation of pre-trained position embeddings.

Experiments

  • Evaluated representation learning capabilities of ResNet, Vision Transformer (ViT), and hybrid
  • Pre-trained on datasets of varying size and evaluated benchmark tasks
  • ViT performs well and attains state of the art on most recognition benchmarks at lower pre-training cost
  • Small experiment using self-supervision shows promise for the future

Setup

  • Used 3 datasets: ImageNet (1k classes, 1.3M images), ImageNet-21k (21k classes, 14M images), JFT (18k classes, 303M images)
  • Transferred models to several benchmark tasks: ImageNet (original and cleaned-up labels), CIFAR-10/100, Oxford-IIIT Pets, Oxford Flowers-102, VTAB (19 tasks)
  • Used 4 model variants: Base, Large, Huge, ResNet (BiT)
  • Used Adam for training, SGD for fine-tuning
  • Reported results on downstream datasets through few-shot and fine-tuning accuracy

Comparison to state of the art

  • ViT-H/14 and ViT-L/16 models compared to state-of-the-art CNNs
  • ViT-L/16 model pre-trained on ImageNet-21k dataset performs well on most datasets
  • ViT-H/14 outperforms BiT-R152x4 and other methods on Natural and Structured tasks

Pre-training data requirements

  • Vision Transformer performs well when pre-trained on large JFT-300M dataset
  • Experiments performed on datasets of increasing size: ImageNet, ImageNet-21k, and JFT-300M
  • Regularization parameters optimized to boost performance on smaller datasets
  • Experiments performed on random subsets of 9M, 30M, and 90M as well as full JFT-300M dataset
  • Results show Vision Transformers overfit more than ResNets on smaller datasets

Scaling study

  • Controlled scaling study of different models was performed by evaluating transfer performance from JFT-300M.
  • Model set included 7 ResNets, 6 Vision Transformers, and 5 hybrids.
  • Vision Transformers dominate ResNets on performance/compute trade-off, using 2-4x less compute to attain same performance.

Inspecting vision transformer

Input attention

  • Learned position embedding is added to patch representations
  • Closer patches tend to have more similar position embeddings
  • Row-column structure appears in position embeddings
  • Hand-crafted 2D-aware embedding variants do not yield improvements
  • Self-attention allows ViT to integrate information across the entire image
  • Average distance in image space across which information is integrated is computed
  • Some heads attend to most of the image already in the lowest layers
  • Highly localized attention is less pronounced in hybrid models
  • Attention distance increases with network depth
  • Model attends to image regions that are semantically relevant for classification

Self-supervision

  • Transformers show good performance on NLP tasks due to scalability and self-supervised pre-training
  • Smaller ViT-B/16 model improved accuracy on ImageNet by 2% when pre-trained with self-supervision, but still 4% behind supervised pre-training

Conclusion

  • Explored direct application of Transformers to image recognition
  • Did not introduce image-specific inductive biases
  • Interpreted image as sequence of patches and processed with standard Transformer encoder
  • Surprisingly well when coupled with pre-training on large datasets
  • Matched or exceeded state of the art on many image classification datasets
  • Cheap to pre-train
  • Challenge to apply to other computer vision tasks
  • Challenge to explore self-supervised pre-training methods
  • Gap between self-supervised and large-scale supervised pre-training
  • Scaling of ViT would likely lead to improved performance
  • Removed head and replaced with single, zero-initialized linear layer
  • Used same hyperparameter setting for all tasks
  • Masked patch prediction objective for self-supervision experiments
  • Predicting mean, 3-bit color of corrupted patches
  • Diminishing returns on downstream performance after 100k pretraining steps
  • Similar gains when pretraining on ImageNet
  • Little to no difference between different ways of encoding positional information
  • Pattern of position embedding similarity depends on training hyperparameters
  • ViT outperforms ResNets with same computational budget
  • Hybrids improve upon pure Transformers for smaller model sizes, but gap vanishes for larger models