Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Vision Transformers convert images to sequences by slicing them into patches.
  • Changing the patch size typically requires retraining the model, but randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes.

Paper Content

Introduction

  • ViTs process images by cutting them into non-overlapping patches
  • Patchification differs from CNNs, which use small, local, overlapping filters
  • Patchification has enabled new capabilities
  • Patch size is an important factor in ViT models
  • FlexiViT is a flexible ViT model that can match or outperform standard fixed-patch ViTs
  • FlexiViT can be used for efficient transfer learning and pre-training
  • FlexiViT representations are often similar across different patch sizes
  • Exploiting patchification to improve ViT’s efficiency
  • Removing tokens during or after training
  • Training a cascade of Transformers to allow early exiting during inference
  • Keeping all tokens and not discarding any information
  • Training one “supernet” from which individual, differently-shaped “subnets” can be extracted
  • Patchifying an image at multiple scales and dropping random tokens to reduce the sequence length
  • Focusing on ViT’s patch size only and allowing benefit from existing pretrained models
  • Training models whose output vector contains meaningful subvectors

Making ViT flexible

  • Standard ViT models are not flexible.
  • FlexiViT model and training procedure introduced.
  • Experiments performed on ImageNet-21k dataset.
  • ViT-B scale model and unregularized light2 setting used.
  • Models trained for 90 epochs.

Background and notation

  • FlexiViT is based on the Vision Transformer (ViT) architecture
  • ViT tokenizes an image into a sequence of patches
  • Patch size determines the length of the input sequence to the Transformer model
  • ViT models are denoted as ViT-S/p, where S is the model scale and p is the patch size
  • Flexible ViT model works simultaneously for any patch size
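
To make the link between patch size and sequence length concrete, here is a minimal NumPy sketch of patchification. The function name `patchify` and the all-zero input image are illustrative assumptions, not code from the paper.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("patch_size must divide both image sides")
    rows, cols = height // patch_size, width // patch_size
    # (rows, p, cols, p, C) -> (rows, cols, p, p, C) -> (rows * cols, p * p * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * channels)

image = np.zeros((240, 240, 3), dtype=np.float32)  # placeholder image
tokens_p16 = patchify(image, 16)  # 15 x 15 grid of patches
tokens_p30 = patchify(image, 30)  # 8 x 8 grid of patches
print(tokens_p16.shape, tokens_p30.shape)
```

For a fixed image size, a smaller patch size gives a longer sequence and therefore more compute per image.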

Standard ViTs are not flexible

  • Evaluating a standard pre-trained ViT model at different patch sizes yields poor performance.
  • To change the patch size, the patch embedding weights and position embeddings are resized with bilinear interpolation.
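
The bilinear resizing mentioned above can be sketched as follows for a grid of position embeddings. This is a simplified illustration: the helper names are hypothetical, the grid is assumed square, and the interpolation uses half-pixel centers without anti-aliasing, which may differ from what an image-resizing library does when downsampling.

```python
import numpy as np

def linear_resize_matrix(old: int, new: int) -> np.ndarray:
    """(new, old) matrix for 1-D linear interpolation with half-pixel centers."""
    # Positions of the new samples expressed in old-grid coordinates.
    coords = (np.arange(new) + 0.5) * old / new - 0.5
    # Interpolating each basis vector yields one column of the matrix.
    columns = [np.interp(coords, np.arange(old), basis) for basis in np.eye(old)]
    return np.stack(columns, axis=1)

def resize_grid_bilinear(grid: np.ndarray, new_side: int) -> np.ndarray:
    """Resize a square (side, side, dim) grid to (new_side, new_side, dim)."""
    r = linear_resize_matrix(grid.shape[0], new_side)
    # Bilinear interpolation is separable: apply the 1-D map along both axes.
    return np.einsum("ia,jb,abd->ijd", r, r, grid)

rng = np.random.default_rng(0)
pos_emb = rng.normal(size=(15, 15, 768))      # grid for patch size 16 at 240 px
pos_emb_p30 = resize_grid_bilinear(pos_emb, 8)  # grid for patch size 30
print(pos_emb_p30.shape)
```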

Training flexible ViTs

  • FlexiViT-B model matches ViT-B/16 and ViT-B/30 when evaluated at their training patch sizes and outperforms them for all other patch sizes
  • Two small changes to the model and training code are necessary
  • Image resolution of 240² px used to have a large variety of patch sizes
  • Patch size sampled from a uniform distribution P over patch sizes from 48 down to 8, inclusive
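
A minimal sketch of the per-step patch size sampling is shown below. It assumes the candidate patch sizes are the divisors of 240 between 8 and 48, so that every patch size tiles the 240² px image exactly; the names `PATCH_SIZES` and `sample_patch_size` are illustrative.

```python
import random

IMAGE_SIZE = 240
# Assumption: candidates are the divisors of 240 between 8 and 48 inclusive.
PATCH_SIZES = [p for p in range(8, 49) if IMAGE_SIZE % p == 0]

def sample_patch_size(rng: random.Random) -> int:
    """Draw one patch size uniformly at random for the current training step."""
    return rng.choice(PATCH_SIZES)

rng = random.Random(0)
for step in range(3):
    p = sample_patch_size(rng)
    grid = IMAGE_SIZE // p
    print(f"step {step}: patch size {p}, grid {grid}x{grid}, {grid * grid} tokens")
```

In a real training loop, the sampled patch size would determine how the patch embedding weights and position embeddings are resized before the forward pass of that step.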

How to resize patch embeddings

  • Patch embedding weights can be resized with bilinear interpolation
  • Resizing can cause a dramatic change in token norm
  • To maintain token norm, patch embeddings can be normalized
  • PI-resize is a more principled way to maintain token norm
  • PI-resize is compatible with existing pre-trained models
  • PI-resize works better than other heuristics when upsampling and downsampling
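
The idea behind PI-resize can be sketched as follows, under stated assumptions. Let B be the linear map that bilinearly resizes a flattened patch. The resized weights are obtained by applying the pseudoinverse of Bᵀ to the original weights, so that the token ⟨Bx, ŵ⟩ computed from a resized patch matches the original token ⟨x, w⟩. The sketch below uses a single channel and a single output dimension, a simple half-pixel-center interpolation, and hypothetical helper names; it is not the authors' implementation.

```python
import numpy as np

def linear_resize_matrix(old: int, new: int) -> np.ndarray:
    """(new, old) matrix for 1-D linear interpolation with half-pixel centers."""
    coords = (np.arange(new) + 0.5) * old / new - 0.5
    columns = [np.interp(coords, np.arange(old), basis) for basis in np.eye(old)]
    return np.stack(columns, axis=1)

def patch_resize_matrix(old: int, new: int) -> np.ndarray:
    """B: maps a flattened (old, old) patch to a flattened (new, new) patch."""
    r = linear_resize_matrix(old, new)
    return np.kron(r, r)  # separable bilinear interpolation, row-major flattening

def pi_resize(weights: np.ndarray, old: int, new: int) -> np.ndarray:
    """Resize flattened (old * old,) weights with the pseudoinverse of B^T."""
    b = patch_resize_matrix(old, new)
    return np.linalg.pinv(b.T) @ weights

rng = np.random.default_rng(0)
old, new = 4, 6
x = rng.normal(size=old * old)  # a flattened patch
w = rng.normal(size=old * old)  # flattened embedding weights
b = patch_resize_matrix(old, new)

token_original = x @ w
token_pi = (b @ x) @ pi_resize(w, old, new)  # PI-resized weights
token_bilinear = (b @ x) @ (b @ w)           # naively resized weights
print(token_original, token_pi, token_bilinear)
```

When upsampling, B has full column rank, so the PI-resized weights reproduce the original token exactly, while naively resized weights generally do not. When downsampling, an exact match is not possible in general; the pseudoinverse then gives the least-squares fit, which minimizes the squared token error averaged over patches with identity covariance.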

Connection to knowledge distillation

  • Knowledge distillation is a technique used to improve the performance of a smaller student model by training it to mimic the predictions of a larger teacher model.
  • Knowledge distillation is a more challenging optimization problem than standard supervised training.
  • Initializing the student close to the teacher can simplify the optimization problem.
  • FlexiViT can be initialized with the weights of a powerful ViT teacher to improve distillation performance.
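
As an illustration of "mimicking the teacher's predictions", the sketch below computes a KL divergence between teacher and student class distributions. KL divergence is a common choice for distillation and is used here as an assumption; the function names are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student) between the two predictive distributions."""
    log_t = log_softmax(teacher_logits)
    log_s = log_softmax(student_logits)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))

teacher = [2.0, 0.0, -1.0]
print(distillation_loss(teacher, teacher))          # identical predictions
print(distillation_loss(teacher, [0.0, 0.0, 0.0]))  # uninformed student
```

The loss is zero when the student reproduces the teacher's predictions, which is why a student initialized from the teacher's weights starts the optimization close to a solution.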

FlexiViT’s internal representation

  • FlexiViT processes inputs with different patch sizes in similar ways
  • CKA and arccosine transform are used to compare representations within and across neural networks
  • Feature map representations are similar across grid sizes until the MLP sublayer of block 6
  • CLS token representations remain aligned across grid sizes
  • Output representations are generally aligned
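
For readers unfamiliar with CKA, here is a minimal sketch of the linear variant followed by an arccosine transform, which turns the similarity into an angular distance. Using linear CKA specifically is an assumption made for this example, and the function names are illustrative.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between (n_examples, dim_x) and (n_examples, dim_y) features."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

def cka_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Arccosine of CKA: 0 for identical representations, larger when dissimilar."""
    return float(np.arccos(np.clip(linear_cka(x, y), 0.0, 1.0)))

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(50, 8))       # stand-in for features at one grid size
feats_b = rng.normal(size=(50, 8))       # unrelated features
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
feats_a_rotated = 3.0 * feats_a @ rotation  # same information, rotated and scaled

print(cka_distance(feats_a, feats_a_rotated), cka_distance(feats_a, feats_b))
```

Linear CKA is invariant to rotations and isotropic scaling of the feature space, so it compares what two representations encode rather than the particular coordinates they use.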

Using pre-trained FlexiViTs

  • ViTs can be trained flexibly without significant loss of performance
  • Pre-trained FlexiViTs are comparable to individual fixed patch-size ViTs when transferred to other tasks
  • FlexiViT is tested on classification, locked-image tuning, open-vocabulary detection and Universal Vision Model tasks
  • Results are provided in Appendix E

Results

  • FlexiViT model performs similarly to two fixed ViT models
  • FlexiViT performs better than fixed ViT models at smaller patch sizes

Resource-efficient transfer via flexibility

  • FlexiViT enables a more resource-efficient way of doing transfer learning
  • FlexiViT saves accelerator memory and compute by using flexibility in input grid size
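
To give a feel for the savings, the sketch below estimates encoder compute as a function of grid size. It is a rough proxy only: it counts multiply-accumulates in the Transformer blocks of a ViT-B-sized encoder (width 768, depth 12, MLP ratio 4) and ignores the patch embedding, normalization layers, and the head. The function names are illustrative.

```python
def tokens_per_image(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image (class token not counted)."""
    grid = image_size // patch_size
    return grid * grid

def encoder_macs(num_tokens: int, width: int = 768, depth: int = 12) -> int:
    """Rough multiply-accumulate count for a Transformer encoder.

    Per block: 12 * N * d^2 for the QKV, output, and MLP projections
    plus 2 * N^2 * d for the attention scores and the weighted sum of values.
    """
    n, d = num_tokens, width
    return depth * (12 * n * d * d + 2 * n * n * d)

for patch_size in (30, 16, 8):
    n = tokens_per_image(240, patch_size)
    print(f"patch {patch_size}: {n} tokens, {encoder_macs(n) / 1e9:.1f} GMACs")
```

Under this estimate, training at a coarse grid and evaluating at a fine grid moves most of the cost to inference, where it can be chosen per deployment.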

Flexifying existing training setups

  • Flexifying models during pretraining
  • Flexifying existing pre-trained models during transfer to downstream tasks
  • Flexifying a diverse set of existing training setups

Transfer learning

  • We use the same datasets and settings from Section 4.
  • We evaluate the model at different patch sizes.
  • Flexible transfer works best, but fixed transfer also works well.
  • The baseline is a fixed-size model transferred and evaluated at the same size.

Multimodal image-text training

  • FlexiLiT and FlexiCLIP are two ways to flexify multimodal image-text training
  • FlexiLiT trains a text tower to produce text embeddings that align with visual embeddings from various patch sizes
  • Figure 10 shows zero-shot image to text retrieval results on the Flickr30k dataset
  • FlexiLiT-B/flexi performs the best on average
  • Flexification provides the possibility of fast transfer
  • LiT-ViT baselines match FlexiLiT on the sequence length it has been trained for, but performance drops quickly when using a different sequence length during inference

Open-vocabulary detection

  • Flexification works for object detection training.
  • Flexible patch sizes can lead to improved results over evaluation at the smallest patch size.

Training times and flexification

  • FlexiViT can be used to pre-train fixed ViTs faster
  • A curriculum is used to specify a sequence of probability distributions over patch sizes
  • Training with a patch size curriculum leads to better performance per compute budget than standard training
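
A patch size curriculum can be sketched as a function from training progress to a probability distribution over patch sizes. The schedule below is purely hypothetical: it blends linearly from a uniform distribution to all mass on a target patch size, and is not the schedule used in the paper. The list of patch sizes and all names are assumptions for illustration.

```python
import random

# Assumed candidate patch sizes (divisors of 240 between 8 and 48).
PATCH_SIZES = [8, 10, 12, 15, 16, 20, 24, 30, 40, 48]

def curriculum_distribution(progress: float, target: int = 16) -> list[float]:
    """Blend from uniform (progress=0) to all mass on `target` (progress=1)."""
    progress = min(max(progress, 0.0), 1.0)
    uniform = 1.0 / len(PATCH_SIZES)
    return [
        (1.0 - progress) * uniform + (progress if p == target else 0.0)
        for p in PATCH_SIZES
    ]

def sample_patch_size(step: int, total_steps: int, rng: random.Random) -> int:
    """Sample a patch size from the distribution assigned to this step."""
    weights = curriculum_distribution(step / max(total_steps - 1, 1))
    return rng.choices(PATCH_SIZES, weights=weights, k=1)[0]

rng = random.Random(0)
for step in (0, 500, 999):
    print(step, sample_patch_size(step, 1000, rng))
```

Early steps mix in larger, cheaper patch sizes, while the final steps train only at the target patch size of the fixed model.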

Analyzing FlexiViTs

  • Attention relevance changes at different scales
  • Representations of tokens at different patch sizes are similar
  • Ensembling multiple smaller FlexiViTs is not always better than running a single FlexiViT
  • FlexiViT has a similar texture bias to ViT at the same patch size

Discussion of alternatives

  • Varying patch embedding stride: changing the sampling stride of a fixed patch size to increase sequence length
  • Varying model depth: adding flexibility in terms of depth by attaching the shared head to various intermediate layers
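
The stride-based alternative can be illustrated with the standard sliding-window count. The sketch assumes no padding, and the function name is illustrative.

```python
def tokens_per_side(image_size: int, patch_size: int, stride: int) -> int:
    """Number of patch positions along one image side, without padding."""
    return (image_size - patch_size) // stride + 1

# Non-overlapping patches: stride equals the patch size.
standard = tokens_per_side(240, 16, 16)
# Halving the stride keeps the patch size fixed but lets patches overlap.
overlapping = tokens_per_side(240, 16, 8)
print(standard ** 2, overlapping ** 2)
```

Reducing the stride lengthens the sequence without changing the patch embedding weights, at the cost of overlapping patches.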

Conclusion

  • FlexiViT is a way of trading off compute and predictive performance with a single model
  • It can reduce pre-training costs by only training a single model for all scales at once
  • It performs well at a variety of downstream tasks
  • It introduces two new hyper-parameters: the underlying sizes of the patch-embedding weights and of the position embeddings
  • The underlying patch-embedding size has little influence on performance as long as it is in a “reasonable” range
  • The underlying position-embedding size has little influence on performance in plain ViT training
  • Used a uniform distribution for patch-size for simplicity
  • Rewrote objective function in Eq. (2)
  • Derived an analytic solution for an arbitrary objective function
  • Visualized upscaling and downscaling matrix for bilinear and PI-resize operations
  • PCA of patch embeddings
  • Used BiT-HyperRule for transfer
  • Used SGD optimizer with a momentum of 0.9, no weight decay, no dropout, and no other augmentations
  • Used 4B image-text pairs dataset to train LiT models
  • Used FlexiViT model instead of a standard ViT model
  • Used same hyper parameters as LiT models
  • Used same transfer learning setup from [15] to finetune FlexiViT models
  • Used same experimental setup as Segmenter [51] for end-to-end finetuning of Vision Transformer
  • Used same setup as UViM [27] for open-vocabulary detection
  • Used same setup as LiT [66] for FlexiLiT
  • Used same setup as CLIP [43] for FlexiCLIP
  • Used a curriculum of probability distributions over patch sizes