Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Proposed a novel architecture to enhance parameter and compute utilization for computer vision tasks
  • Model uses global and local self-attention modules to model long and short-range spatial interactions
  • Addresses lack of inductive bias and improves modeling of inter-channel dependencies
  • Achieves new state-of-the-art performance across image classification, object detection and semantic segmentation tasks

Paper Content

Introduction

  • Transformers have achieved SOTA performance in NLP benchmarks.
  • Self-attention mechanism allows for capturing contextual representations.
  • Vision Transformer (ViT) proposed to utilize image patches as tokens.
  • ViT-based models have achieved SOTA or competitive performance in various computer vision tasks.
  • Self-attention mechanism in ViT allows for learning more uniform short and long-range information.
  • Monolithic architecture of ViT and quadratic computational complexity of self-attention limit application to high resolution images.
  • Efforts have attempted to address balance between short-and long-range spatial dependencies.
  • Limited receptive field of local windows challenges self-attention to capture long-range information.
  • Global Context (GC) ViT network proposed to address limitations.
  • Hierarchical ViT architecture consisting of local and global self-attention modules.
  • Global query tokens shared across all global self-attention modules.
  • Novel downsampling block with a parameter-efficient fused-MBConv layer.
  • GC ViT achieves SOTA benchmarks of 83.4%, 83.9%, 84.4% and 84.6% Top-1 accuracy.
  • GC ViT outperforms ConvNeXt and Swin Transformer models.
  • GC ViT achieves SOTA results on MS COCO and ADE20K datasets.

Input features

Fused mbconv

  • Max pooling is a technique used in computer science
  • Max pooling involves dividing an image into 2x2 sections and taking the maximum value from each section

Extracted global features reshape repeat

Stage-wise global tokens

  • Spatial matching with local tokens
  • Tokenized features used for input-to-stage dimension matching

Global query generator

  • A layer in the generator consists of a Fused-MBConv block and a max pooling layer.
  • Global query tokens are computed once at every stage of the model and shared across all global attention blocks.
  • Global attention layers learn local key and value features which will be used for interaction with global query tokens.
  • Global self-attention query, key and value features are computed using a linear layer.

Global self-attention

  • Algorithm 1 presents a PyTorch-like pseudocode for computing global self-attention in GC ViT.
  • Interaction with rich contextual information embedded in the global query tokens provides an effective way of enlarging the receptive field and attending to various regions in the input feature maps.
  • ViT (Dosovitskiy et al., 2020) is an alternative to CNNs
  • ViT has an enlarged receptive field due to its self-attention layers

Experiments

  • Trained and tested model on ImageNet-1K dataset
  • Used AdamW optimizer for 300 epochs
  • Used MS COCO dataset for object detection and instance segmentation
  • Used ADE20K dataset for semantic segmentation
  • Compared against Tiny, Small and Base model variants

Classification

  • Presents ImageNet-1K classification benchmarks
  • Focuses on the performance of deep learning models

Ade20k semantic segmentation results

  • Models using pretrained GC ViT backbones outperform counterpart models
  • Removing window shifting causes significant performance degradation
  • Changing distribution of parameters improves performance
  • Adding CNN-based stem of GC ViT provides additional improvements
  • Using proposed downsampler further improves performance
  • Leveraging proposed global self-attention improves performance

Downsampler design

  • Studied effectiveness of downsampler blocks in Table 5
  • Simplest alternative is convolutional and maxpooling layers, but reduces ImageNet Top-1 accuracy by -0.7
  • Patch merging is another variant introduced in Swin Transformers (Liu et al., 2021)

Down-sampler architecture

  • Top-1 Conv Conv (s=1) achieved 82.7% accuracy
  • Swin Linear achieved 82.9% accuracy

Gc vit

  • Modified Fused-MBConv (s=2) achieves 83.4% Top-1 accuracy on ImageNet
  • Accuracy is reduced by -0.5 when using a different down-sampler
  • Proposed down-sampler consists of a modified Fused-MBConv block and strided convolution
  • SE operation boosts cross channel interaction while keeping number of parameters and FLOPs low
  • Proposed down-sampler is essential to achieve high accuracy as it introduces convolutional inductive bias

Interpretability

  • Proposed global self-attention and query tokens are interpretable
  • Visualization of the learned attention and Grad-CAM can be used to provide insights

E training details

  • GC ViT models were trained using 4 nodes with 32 NVIDIA A100 GPUs
  • Training batch size was 1024 for some models and 4096 for others
  • Training took 32 hours on average
  • Used timm package (Wightman, 2019)
  • Detection and instance segmentation models used 1 node with 8 NVIDIA A40 GPUs and took 56 hours on average
  • Semantic segmentation models used mmsegmentation (Contributors, 2020) and took 34 hours on average

F complexity analysis

  • GC ViT has a computational complexity similar to Swin Transformer
  • GC ViT captures long-range information and has better accuracy for classification and downstream tasks

G imagenet classification benchmarks

  • Table S.4 provides a benchmark for models trained on ImageNet-1K dataset
  • No additional data was used