Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Vision Transformers convert images to sequences by slicing them into patches.
- Changing the patch size typically requires retraining the model, but randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes.
Paper Content
Introduction
- ViTs process images by cutting them into nonoverlapping patches
- Patchification is different from CNNs, which use small local and overlapping filters
- Patchification has enabled new capabilities
- Patch size is an important factor in ViT models
- FlexiViT is a flexible ViT model that can match or outperform standard fixed-patch ViTs
- FlexiViT can be used for efficient transfer learning and pre-training
- FlexiViT representations are often similar across different patch sizes
Related work
- Exploiting patchification to improve ViT’s efficiency
- Removing tokens during or after training
- Training a cascade of Transformers to allow early exiting during inference
- Keeping all tokens and not discarding any information
- Training one “supernet” from which individual, differently-shaped “subnets” can be extracted
- Patchifying an image at multiple scales and dropping random tokens to reduce the sequence length
- Focusing on ViT’s patch size only and allowing benefit from existing pretrained models
- Training models whose output vector contains meaningful subvectors
Making vit flexible
- Standard ViT models are not flexible.
- FlexiViT model and training procedure introduced.
- Experiments performed on ImageNet-21k dataset.
- ViT-B scale model and unregularized light2 setting used.
- Models trained for 90 epochs.
Background and notation
- FlexiViT is based on the Vision Transformer (ViT) architecture
- ViT tokenizes an image into a sequence of patches
- Patch size determines the length of the input sequence to the Transformer model
- ViT models are denoted as ViT-S/p, where S is the model scale and p is the patch size
- Flexible ViT model works simultaneously for any patch size
Standard vits are not flexible
- Evaluating a standard pre-trained ViT model at different patch sizes yields poor performance.
- To change the patch size, the patch embedding weights and position embeddings are resized with bilinear interpolation.
Training flexible vits
- FlexiViT-B model matches ViT-B/16 and ViT-B/30 when evaluated at their training patch sizes and outperforms them for all other patch sizes
- Two small changes to the model and training code are necessary
- Image resolution of 240² px used to have a large variety of patch sizes
- Patch size sampled from uniform distribution P over patch sizes 48-8, inclusive
How to resize patch embeddings
- Patch and embedding weights can be resized with bilinear interpolation
- Resizing can cause a dramatic change in token norm
- To maintain token norm, patch embeddings can be normalized
- PI-resize is a more principled way to maintain token norm
- PI-resize is compatible with existing pre-trained models
- PI-resize works better than other heuristics when upsampling and downsampling
Connection to knowledge distillation
- Knowledge distillation is a technique used to improve the performance of a smaller student model by training it to mimic the predictions of a larger teacher model.
- Knowledge distillation is a more challenging optimization problem than standard supervised training.
- Initializing the student close to the teacher can simplify the optimization problem.
- FlexiViT can be initialized with the weights of a powerful ViT teacher to improve distillation performance.
Flexivit’s internal representation
- FlexiViT processes inputs with different patch sizes in similar ways
- CKA and arccosine transform are used to compare representations within and across neural networks
- Feature map representations are similar across grid sizes until the MLP sublayer of block 6
- CLS token representations remain aligned across grid sizes
- Output representations are generally aligned
Using pre-trained flexivits
- ViTs can be trained flexibly without significant loss of performance
- Pre-trained FlexiViTs are comparable to individual fixed patch-size ViTs when transferred to other tasks
- FlexiViT is tested on classification, locked-image tuning, open-vocabulary detection and Universal Vision Model tasks
- Results are provided in Appendix E
Results
- FlexiViT model performs similarly to two fixed ViT models
- FlexiViT performs better than fixed ViT models at smaller patch sizes
Resource-efficient transfer via flexibility
- FlexiViT enables a more resource efficient way of making transfer learning
- FlexiViT saves accelerator memory and compute by using flexibility in input grid size
Flexifying existing training setups
- Flexifying models during pretraining
- Flexifying existing pre-trained models during transfer to downstream tasks
- Flexifying a diverse set of existing training setups
Transfer learning
- We use the same datasets and settings from Section 4.
- We evaluate the model at different patch sizes.
- Flexible transfer works best, but fixed transfer also works well.
- The baseline is a fixed-size model transferred and evaluated at the same size.
Multimodal image-text training
- FlexiLiT and FlexiCLIP are two ways to flexify multimodal imagetext training
- FlexiLiT trains a text tower to produce text embeddings that align with visual embeddings from various patch sizes
- Figure 10 shows zero-shot image to text retrieval results on the Flickr30k dataset
- FlexiLiT-B/flexi performs the best on average
- Flexification provides the possibility of fast transfer
- LiT-ViT baselines match FlexiLiT on the sequence length it has been trained for, but performance drops quickly when using a different sequence length during inference
Open-vocabulary detection
- Flexification works for object detection training.
- Flexible patch sizes can lead to improved results over evaluation at the smallest patch size.
Training times and flexification
- FlexiViT can be used to pre-train fixed ViTs faster
- A curriculum is used to specify a sequence of probability distributions over patch sizes
- Training with a patch size curriculum leads to better performance per compute budget than standard training
Analyzing flexivits
- Attention relevance changes at different scales
- Representations of tokens at different patch sizes are similar
- Ensembling multiple smaller FlexiViTs is not always better than running a single FlexiViT
- FlexiViT has a similar texture bias to ViT at the same patch size
Discussion of alternatives
- Varying patch embedding stride: changing the sampling stride of a fixed patch size to increase sequence length
- Varying model depth: adding flexibility in terms of depth by attaching the shared head to various intermediate layers
Conclusion
- FlexiViT is a way of trading off compute and predictive performance with a single model
- It can reduce pre-training costs by only training a single model for all scales at once
- It performs well at a variety of downstream tasks
- It has two new hyper-parameters: patch-embedding weights and position-embeddings
- Patch-size parameter has little influence on performance as long as it is in a “reasonable” range
- Position embedding parameter has little influence on performance in plain ViT training
- Used a uniform distribution for patch-size for simplicity
- Rewrote objective function in Eq. (2)
- Derived an analytic solution for an arbitrary objective function
- Visualized upscaling and downscaling matrix for bilinear and PI-resize operations
- PCA of patch embeddings
- Used BiT-HyperRule for transfer
- Used SGD optimizer with a momentum of 0.9, no weight decay, no dropout, and no other augmentations
- Used 4B image-text pairs dataset to train LiT models
- Used FlexiViT model instead of a standard ViT model
- Used same hyper parameters as LiT models
- Used same transfer learning setup from [15] to finetune FlexiViT models
- Used same experimental setup as Segmenter [51] for end-to-end finetuning of Vision Transformer
- Used same setup as UViM [27] for open-vocabulary detection
- Used same setup as LiT [66] for FlexiLiT
- Used same setup as CLIP [43] for FlexiCLIP
- Used a curriculum of probability distributions over patch sizes