Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Transformer architecture is widely used for natural language processing tasks, but not for computer vision.
- Attention is usually used in conjunction with convolutional networks or to replace certain components of convolutional networks.
- Vision Transformer (ViT) can perform well on image classification tasks without relying on convolutional networks.
- ViT can attain excellent results on mid-sized or small image recognition benchmarks with fewer computational resources.
Paper Content
Introduction
- Transformers are the model of choice in NLP
- Transformers are computationally efficient and scalable
- Convolutional architectures are dominant in computer vision
- Some works combine CNNs and self-attention
- Transformer applied directly to images, split into patches
- Model trained on mid-sized datasets yields modest accuracies
- Large scale training trumps inductive bias
- ViT approaches or beats state of the art on multiple image recognition benchmarks
Related work
- Transformers proposed for machine translation and used in many NLP tasks
- Large Transformer-based models pre-trained on large corpora and then fine-tuned
- BERT uses denoising self-supervised pre-training task
- GPT line of work uses language modeling as pre-training task
- Applying Transformers to images requires approximations
- Combining CNNs with forms of self-attention
- iGPT applies Transformers to image pixels after reducing resolution and color space
- Use of additional data sources allows to achieve state-of-the-art results on standard benchmarks
Method
- Model design follows Transformer (Vaswani et al., 2017)
- Advantages of simple setup: scalable NLP Transformer architectures and efficient implementations can be used
Vision transformer (vit)
- Model takes 1D sequence of token embeddings as input
- Image reshaped into sequence of flattened 2D patches
- Trainable linear projection maps patches to constant latent vector size
- Output of projection is patch embeddings
- Learnable embedding prepended to sequence of embedded patches
- Classification head attached to output of Transformer encoder
- Position embeddings added to patch embeddings
- Transformer encoder consists of alternating layers of multiheaded self-attention and MLP blocks
- Hybrid model uses feature maps from CNN as input sequence
Fine-tuning and higher resolution
- Pre-train ViT on large datasets and fine-tune to smaller downstream tasks.
- Use a zero-initialized feedforward layer for downstream classes.
- Fine-tune at higher resolution than pre-training.
- Keep patch size the same when feeding higher resolution images.
- Vision Transformer can handle arbitrary sequence lengths.
- Perform 2D interpolation of pre-trained position embeddings.
Experiments
- Evaluated representation learning capabilities of ResNet, Vision Transformer (ViT), and hybrid
- Pre-trained on datasets of varying size and evaluated benchmark tasks
- ViT performs well and attains state of the art on most recognition benchmarks at lower pre-training cost
- Small experiment using self-supervision shows promise for the future
Setup
- Used 3 datasets: ImageNet (1k classes, 1.3M images), ImageNet-21k (21k classes, 14M images), JFT (18k classes, 303M images)
- Transferred models to several benchmark tasks: ImageNet (original and cleaned-up labels), CIFAR-10/100, Oxford-IIIT Pets, Oxford Flowers-102, VTAB (19 tasks)
- Used 4 model variants: Base, Large, Huge, ResNet (BiT)
- Used Adam for training, SGD for fine-tuning
- Reported results on downstream datasets through few-shot and fine-tuning accuracy
Comparison to state of the art
- ViT-H/14 and ViT-L/16 models compared to state-of-the-art CNNs
- ViT-L/16 model pre-trained on ImageNet-21k dataset performs well on most datasets
- ViT-H/14 outperforms BiT-R152x4 and other methods on Natural and Structured tasks
Pre-training data requirements
- Vision Transformer performs well when pre-trained on large JFT-300M dataset
- Experiments performed on datasets of increasing size: ImageNet, ImageNet-21k, and JFT-300M
- Regularization parameters optimized to boost performance on smaller datasets
- Experiments performed on random subsets of 9M, 30M, and 90M as well as full JFT-300M dataset
- Results show Vision Transformers overfit more than ResNets on smaller datasets
Scaling study
- Controlled scaling study of different models was performed by evaluating transfer performance from JFT-300M.
- Model set included 7 ResNets, 6 Vision Transformers, and 5 hybrids.
- Vision Transformers dominate ResNets on performance/compute trade-off, using 2-4x less compute to attain same performance.
Inspecting vision transformer
Input attention
- Learned position embedding is added to patch representations
- Closer patches tend to have more similar position embeddings
- Row-column structure appears in position embeddings
- Hand-crafted 2D-aware embedding variants do not yield improvements
- Self-attention allows ViT to integrate information across the entire image
- Average distance in image space across which information is integrated is computed
- Some heads attend to most of the image already in the lowest layers
- Highly localized attention is less pronounced in hybrid models
- Attention distance increases with network depth
- Model attends to image regions that are semantically relevant for classification
Self-supervision
- Transformers show good performance on NLP tasks due to scalability and self-supervised pre-training
- Smaller ViT-B/16 model improved accuracy on ImageNet by 2% when pre-trained with self-supervision, but still 4% behind supervised pre-training
Conclusion
- Explored direct application of Transformers to image recognition
- Did not introduce image-specific inductive biases
- Interpreted image as sequence of patches and processed with standard Transformer encoder
- Surprisingly well when coupled with pre-training on large datasets
- Matched or exceeded state of the art on many image classification datasets
- Cheap to pre-train
- Challenge to apply to other computer vision tasks
- Challenge to explore self-supervised pre-training methods
- Gap between self-supervised and large-scale supervised pre-training
- Scaling of ViT would likely lead to improved performance
- Removed head and replaced with single, zero-initialized linear layer
- Used same hyperparameter setting for all tasks
- Masked patch prediction objective for self-supervision experiments
- Predicting mean, 3-bit color of corrupted patches
- Diminishing returns on downstream performance after 100k pretraining steps
- Similar gains when pretraining on ImageNet
- Little to no difference between different ways of encoding positional information
- Pattern of position embedding similarity depends on training hyperparameters
- ViT outperforms ResNets with same computational budget
- Hybrids improve upon pure Transformers for smaller model sizes, but gap vanishes for larger models