Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

The Transformer architecture is the de facto standard for natural language processing tasks, but its applications to computer vision remain limited.
Attention is usually used in conjunction with convolutional networks or to replace certain components of convolutional networks.
Vision Transformer (ViT) can perform well on image classification tasks without relying on convolutional networks.
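When pre-trained on large amounts of data and transferred to mid-sized or small image recognition benchmarks, ViT attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.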

Paper Content

Introduction

Transformers are the model of choice in NLP
Transformers are computationally efficient and scalable
Convolutional architectures are dominant in computer vision
Some works combine CNNs and self-attention
Transformer applied directly to images, split into patches
Model trained on mid-sized datasets without strong regularization yields modest accuracies, a few percentage points below ResNets of comparable size
Large scale training trumps inductive bias
ViT approaches or beats state of the art on multiple image recognition benchmarks

Related work

Transformers proposed for machine translation and used in many NLP tasks
Large Transformer-based models pre-trained on large corpora and then fine-tuned
BERT uses denoising self-supervised pre-training task
GPT line of work uses language modeling as pre-training task
Applying Transformers to images requires approximations
Combining CNNs with forms of self-attention
iGPT applies Transformers to image pixels after reducing resolution and color space
Using additional data sources makes it possible to achieve state-of-the-art results on standard benchmarks

Method

Model design follows Transformer (Vaswani et al., 2017)
Advantages of simple setup: scalable NLP Transformer architectures and efficient implementations can be used

Vision Transformer (ViT)

Model takes 1D sequence of token embeddings as input
Image reshaped into sequence of flattened 2D patches
Trainable linear projection maps patches to constant latent vector size
Output of projection is patch embeddings
Learnable embedding prepended to sequence of embedded patches
Classification head attached to output of Transformer encoder
Position embeddings added to patch embeddings
Transformer encoder consists of alternating layers of multiheaded self-attention and MLP blocks
Hybrid model uses feature maps from CNN as input sequence
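
The input pipeline described above (patches, linear projection, prepended learnable embedding, position embeddings) can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the toy sizes, the random initialization, and the helper names `patchify`, `linear`, and `embed` are all assumptions made for the example.

```python
import random

def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches

def linear(vector, weights):
    """Project a length-d_in vector with a d_in x d_out weight matrix."""
    d_out = len(weights[0])
    return [sum(v * row[j] for v, row in zip(vector, weights)) for j in range(d_out)]

def embed(image, patch_size, projection, class_token, position_embeddings):
    """Patch embeddings, with the class token prepended and position embeddings added."""
    tokens = [class_token] + [linear(p, projection) for p in patchify(image, patch_size)]
    return [[t + p for t, p in zip(token, pos)]
            for token, pos in zip(tokens, position_embeddings)]

# Toy sizes (hypothetical): 8x8 RGB image, 4x4 patches, latent size 6.
rng = random.Random(0)
height = width = 8
channels, patch_size, latent = 3, 4, 6
image = [[[rng.random() for _ in range(channels)] for _ in range(width)]
         for _ in range(height)]
num_patches = (height // patch_size) * (width // patch_size)
projection = [[rng.gauss(0, 0.02) for _ in range(latent)]
              for _ in range(patch_size * patch_size * channels)]
class_token = [0.0] * latent
position_embeddings = [[rng.gauss(0, 0.02) for _ in range(latent)]
                       for _ in range(num_patches + 1)]

sequence = embed(image, patch_size, projection, class_token, position_embeddings)
print(len(sequence), len(sequence[0]))  # 5 6
```

The resulting sequence has one more token than there are patches; the Transformer encoder's output at that extra position is what the classification head reads.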

Fine-tuning and higher resolution

Pre-train ViT on large datasets and fine-tune to smaller downstream tasks.
Remove the pre-trained prediction head and attach a zero-initialized D × K feedforward layer, where K is the number of downstream classes.
Fine-tune at higher resolution than pre-training.
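Keep patch size the same when feeding higher resolution images, which results in a larger effective sequence length.
Vision Transformer can handle arbitrary sequence lengths (up to memory constraints), but the pre-trained position embeddings may no longer be meaningful.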
Perform 2D interpolation of pre-trained position embeddings.
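
A rough sketch of the position-embedding resizing step follows. The notes above only say "2D interpolation"; bilinear interpolation with aligned corners is an assumption here, as are the function name `resize_position_embeddings` and the choice to leave the class-token embedding untouched.

```python
def resize_position_embeddings(pos_emb, old_grid, new_grid):
    """Bilinearly resize a grid of position embeddings (assumed scheme).

    pos_emb holds 1 + old_grid**2 vectors: the class-token embedding first,
    then the patch embeddings in row-major order.
    """
    class_pos, grid = pos_emb[0], pos_emb[1:]
    dim = len(class_pos)

    def at(row, col):
        return grid[row * old_grid + col]

    def source_coord(index):
        # Map a new grid index to a fractional position in the old grid.
        if new_grid == 1:
            return 0, 0, 0.0
        pos = index * (old_grid - 1) / (new_grid - 1)
        low = int(pos)
        high = min(low + 1, old_grid - 1)
        return low, high, pos - low

    resized = [list(class_pos)]  # class-token embedding is kept as-is
    for i in range(new_grid):
        y0, y1, wy = source_coord(i)
        for j in range(new_grid):
            x0, x1, wx = source_coord(j)
            resized.append([
                (1 - wy) * ((1 - wx) * at(y0, x0)[k] + wx * at(y0, x1)[k])
                + wy * ((1 - wx) * at(y1, x0)[k] + wx * at(y1, x1)[k])
                for k in range(dim)
            ])
    return resized

# Toy check: a 2x2 grid of 1-D embeddings resized to 3x3.
toy = [[9.0], [0.0], [1.0], [2.0], [3.0]]  # class token, then the 2x2 grid
resized = resize_position_embeddings(toy, old_grid=2, new_grid=3)
print(len(resized))   # 10
print(resized[5])     # [1.5]  (centre of the new grid)

# Illustrative resolutions: with 16x16 patches, 224 px gives a 14x14 grid
# and 384 px gives a 24x24 grid.
print((224 // 16) ** 2 + 1, (384 // 16) ** 2 + 1)  # 197 577
```

Because the patch size is unchanged, only the number of position embeddings has to grow; the rest of the pre-trained weights are reused without modification.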

Experiments

Evaluated representation learning capabilities of ResNet, Vision Transformer (ViT), and hybrid
Pre-trained on datasets of varying size and evaluated benchmark tasks
ViT performs well and attains state of the art on most recognition benchmarks at lower pre-training cost
Small experiment using self-supervision shows promise for the future

Setup

Used 3 datasets: ImageNet (1k classes, 1.3M images), ImageNet-21k (21k classes, 14M images), JFT (18k classes, 303M images)
Transferred models to several benchmark tasks: ImageNet (original and cleaned-up labels), CIFAR-10/100, Oxford-IIIT Pets, Oxford Flowers-102, VTAB (19 tasks)
Used 3 ViT variants (Base, Large, Huge), with ResNet (BiT) as the convolutional baseline
Used Adam for pre-training, SGD with momentum for fine-tuning
Reported results on downstream datasets through few-shot and fine-tuning accuracy
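
The model names used in the following sections (for example ViT-L/16) combine a size code with the input patch size. The sketch below decodes that notation and derives sequence lengths. The layer counts and widths are quoted from the paper's model-variant table and are not stated in these notes, so treat them as assumptions; the 224-pixel input and the helper names are hypothetical, and the parameter count is a deliberately rough estimate.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ViTConfig:
    layers: int
    hidden_size: int
    mlp_size: int
    heads: int

# Values assumed from the paper's model-variant table (not from these notes).
VARIANTS = {
    "Base": ViTConfig(layers=12, hidden_size=768, mlp_size=3072, heads=12),
    "Large": ViTConfig(layers=24, hidden_size=1024, mlp_size=4096, heads=16),
    "Huge": ViTConfig(layers=32, hidden_size=1280, mlp_size=5120, heads=16),
}
SIZE_CODES = {"B": "Base", "L": "Large", "H": "Huge"}

def parse_name(name):
    """'ViT-L/16' -> ('Large', 16): model size and patch side length in pixels."""
    model, patch = name.split("/")
    size_code = model.split("-")[1]
    return SIZE_CODES[size_code], int(patch)

def sequence_length(image_size, patch_size):
    """Number of patches plus one for the prepended learnable embedding."""
    return (image_size // patch_size) ** 2 + 1

def approx_encoder_params(cfg):
    """Weight matrices only: 4*D^2 for attention plus 2*D*D_mlp for the MLP per layer.
    Biases, layer norms, and embeddings are ignored."""
    per_layer = 4 * cfg.hidden_size ** 2 + 2 * cfg.hidden_size * cfg.mlp_size
    return cfg.layers * per_layer

for name in ("ViT-B/16", "ViT-L/16", "ViT-H/14"):
    variant, patch = parse_name(name)
    cfg = VARIANTS[variant]
    print(name, sequence_length(224, patch), round(approx_encoder_params(cfg) / 1e6), "M")
```

A smaller patch size gives a longer sequence for the same image, so ViT-H/14 processes more tokens per image than a /16 model would.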

Comparison to state of the art

ViT-H/14 and ViT-L/16 models compared to state-of-the-art CNNs
ViT-L/16 model pre-trained on ImageNet-21k dataset performs well on most datasets
ViT-H/14 outperforms BiT-R152x4 and other methods on Natural and Structured tasks

Pre-training data requirements

Vision Transformer performs well when pre-trained on large JFT-300M dataset
Experiments performed on datasets of increasing size: ImageNet, ImageNet-21k, and JFT-300M
Regularization parameters optimized to boost performance on smaller datasets
Experiments performed on random subsets of 9M, 30M, and 90M as well as full JFT-300M dataset
Results show Vision Transformers overfit more than ResNets on smaller datasets

Scaling study

Controlled scaling study of different models was performed by evaluating transfer performance from JFT-300M.
Model set included 7 ResNets, 6 Vision Transformers, and 5 hybrids.
Vision Transformers dominate ResNets on performance/compute trade-off, using 2-4x less compute to attain same performance.

Inspecting Vision Transformer

Input attention

Learned position embedding is added to patch representations
Closer patches tend to have more similar position embeddings
Row-column structure appears in position embeddings
Hand-crafted 2D-aware embedding variants do not yield improvements
Self-attention allows ViT to integrate information across the entire image
Attention distance is computed from the attention weights as the average distance in image space across which information is integrated
Some heads attend to most of the image already in the lowest layers
Highly localized attention is less pronounced in hybrid models
Attention distance increases with network depth
Model attends to image regions that are semantically relevant for classification
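
The attention-distance measurement mentioned above can be approximated as an attention-weighted average of pixel distances between patches. The sketch below is one plausible reading, not the authors' code: excluding the class token, measuring distances between patch centres, and the function name `mean_attention_distance` are assumptions.

```python
import math

def mean_attention_distance(attention, grid_size, patch_size):
    """Average, over query patches, of the attention-weighted pixel distance.

    attention[q][k] is the weight query patch q places on key patch k
    (each row sums to 1). Patches are indexed row-major on a
    grid_size x grid_size grid; the class token is assumed to be excluded.
    """
    num_patches = grid_size * grid_size
    total = 0.0
    for q in range(num_patches):
        q_row, q_col = divmod(q, grid_size)
        for k in range(num_patches):
            k_row, k_col = divmod(k, grid_size)
            distance = patch_size * math.hypot(q_row - k_row, q_col - k_col)
            total += attention[q][k] * distance
    return total / num_patches

# A head that attends only to its own patch is fully local.
local_head = [[1.0 if q == k else 0.0 for k in range(4)] for q in range(4)]
print(mean_attention_distance(local_head, grid_size=2, patch_size=16))  # 0.0

# A head that attends uniformly integrates information across the whole image.
global_head = [[0.25] * 4 for _ in range(4)]
print(mean_attention_distance(global_head, grid_size=2, patch_size=16))
```

Computed per head and per layer, this quantity plays a role analogous to receptive field size in a CNN: small values indicate localized heads, large values indicate heads that mix information globally.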

Self-supervision

Transformers show good performance on NLP tasks due to scalability and self-supervised pre-training
With self-supervised pre-training, the smaller ViT-B/16 model improves ImageNet accuracy by 2% over training from scratch, but is still 4% behind supervised pre-training

Conclusion

Explored direct application of Transformers to image recognition
Did not introduce image-specific inductive biases
Interpreted image as sequence of patches and processed with standard Transformer encoder
Works surprisingly well when coupled with pre-training on large datasets
Matched or exceeded state of the art on many image classification datasets
Relatively cheap to pre-train
Remaining challenge: applying ViT to other computer vision tasks, such as detection and segmentation
Remaining challenge: continuing to explore self-supervised pre-training methods
Gap between self-supervised and large-scale supervised pre-training
Scaling of ViT would likely lead to improved performance
For fine-tuning, removed the pre-trained head and replaced it with a single, zero-initialized linear layer
Used same hyperparameter setting for all tasks
Masked patch prediction objective for self-supervision experiments
Predicting mean, 3-bit color of corrupted patches
Diminishing returns on downstream performance after 100k pretraining steps
Similar gains when pretraining on ImageNet
Little to no difference between different ways of encoding positional information
Pattern of position embedding similarity depends on training hyperparameters
ViT outperforms ResNets with same computational budget
Hybrids improve upon pure Transformers for smaller model sizes, but gap vanishes for larger models
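
For the masked patch prediction objective noted above, the target for each corrupted patch is its mean color at 3-bit precision. A minimal sketch of building such a target is given below, assuming 3 bits per channel (512 possible targets), 8-bit RGB input, quantization by keeping the top 3 bits, and a packed class id; these details and the function name are assumptions, not taken from these notes.

```python
def mean_color_3bit_target(patch):
    """Return a class id in [0, 512) for a patch given as a list of (r, g, b) pixels.

    Each channel's mean is quantized to 3 bits (8 levels) and the three
    levels are packed into one integer (assumed encoding).
    """
    count = len(patch)
    means = [sum(pixel[c] for pixel in patch) / count for c in range(3)]
    r, g, b = (int(mean) >> 5 for mean in means)  # 256 / 8 = 32 values per level
    return (r << 6) | (g << 3) | b

black = [(0, 0, 0)] * 16
white = [(255, 255, 255)] * 16
red = [(255, 0, 0)] * 16
print(mean_color_3bit_target(black))  # 0
print(mean_color_3bit_target(white))  # 511
print(mean_color_3bit_target(red))    # 448
```

Framing the target as one of a small number of color classes turns the pretext task into classification over corrupted patches rather than pixel-level regression.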
