Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Proposes Global Context ViT (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision tasks
Model uses global and local self-attention modules to model long and short-range spatial interactions
Addresses the lack of inductive bias in ViTs and improves modeling of inter-channel dependencies
Achieves new state-of-the-art performance across image classification, object detection and semantic segmentation tasks

Paper Content

Introduction

Transformers have achieved SOTA performance in NLP benchmarks.
Self-attention mechanism allows for capturing contextual representations.
Vision Transformer (ViT) proposed to utilize image patches as tokens.
ViT-based models have achieved SOTA or competitive performance in various computer vision tasks.
Self-attention mechanism in ViT allows for learning more uniform short and long-range information.
Monolithic architecture of ViT and quadratic computational complexity of self-attention limit application to high resolution images.
Prior efforts have attempted to balance short- and long-range spatial dependencies.
Limited receptive field of local windows makes it hard for self-attention to capture long-range information.
Global Context (GC) ViT network proposed to address limitations.
Hierarchical ViT architecture consisting of local and global self-attention modules.
Global query tokens shared across all global self-attention modules.
Novel downsampling block with a parameter-efficient fused-MBConv layer.
GC ViT variants achieve SOTA Top-1 accuracies of 83.4%, 83.9%, 84.4% and 84.6%.
GC ViT outperforms ConvNeXt and Swin Transformer models.
GC ViT achieves SOTA results on MS COCO and ADE20K datasets.

Input features

Fused-MBConv

Max pooling is a downsampling operation that keeps the largest value in each window of a feature map
With a 2x2 window, the input is divided into non-overlapping 2x2 sections and the maximum value of each section is kept, halving the height and width
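
As a small worked example of the 2x2 case described above, the sketch below pools a 4x4 grid in plain Python. The function name `max_pool_2x2` and the sample values are made up for illustration; a real model would apply the operation per channel with a deep learning framework.

```python
def max_pool_2x2(feature_map):
    """Apply 2x2 max pooling with stride 2 to a 2D list of numbers."""
    height, width = len(feature_map), len(feature_map[0])
    # Assumption: even height and width, so the map splits into whole 2x2 sections.
    assert height % 2 == 0 and width % 2 == 0
    pooled = []
    for row in range(0, height, 2):
        pooled_row = []
        for col in range(0, width, 2):
            section = (
                feature_map[row][col], feature_map[row][col + 1],
                feature_map[row + 1][col], feature_map[row + 1][col + 1],
            )
            pooled_row.append(max(section))  # keep only the largest value
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [7, 0, 3, 3],
    [1, 6, 2, 8],
]
print(max_pool_2x2(feature_map))  # [[4, 5], [7, 8]]
```

The 4x4 input becomes a 2x2 output, which is the halving of resolution that the global query generator relies on.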

Extracted global features: reshape and repeat

Stage-wise global tokens

Spatial matching with local tokens
Tokenized features used for input-to-stage dimension matching

Global query generator

A layer in the generator consists of a Fused-MBConv block and a max pooling layer.
Global query tokens are computed once at every stage of the model and shared across all global attention blocks.
Global attention layers learn local key and value features which will be used for interaction with global query tokens.
Key and value features are computed using a linear layer, while the query in global self-attention comes from the shared global query tokens.
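
To get a feel for how deep the generator has to be, the sketch below counts the Fused-MBConv plus max pooling layers needed when each layer halves the spatial size, as 2x2 max pooling does. The function name `generator_layers`, the stage resolutions (56, 28, 14, 7) and the window size of 7 are assumptions chosen for illustration, not values taken from the paper.

```python
def generator_layers(feature_size, window_size):
    """Count halving layers needed to shrink feature_size down to window_size."""
    layers = 0
    size = feature_size
    while size > window_size:
        if size % 2 != 0:
            raise ValueError("size must stay even to be halved by 2x2 pooling")
        size //= 2  # one Fused-MBConv + 2x2 max pooling layer
        layers += 1
    if size != window_size:
        raise ValueError("feature_size is not window_size times a power of two")
    return layers


# Hypothetical stage resolutions for a four-stage hierarchical model.
print([generator_layers(size, 7) for size in (56, 28, 14, 7)])  # [3, 2, 1, 0]
```

Under these assumptions the number of layers is log2(feature_size / window_size), so earlier, higher-resolution stages need a deeper generator than later ones.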

Global self-attention

Algorithm 1 presents a PyTorch-like pseudocode for computing global self-attention in GC ViT.
Interaction with rich contextual information embedded in the global query tokens provides an effective way of enlarging the receptive field and attending to various regions in the input feature maps.
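
The sketch below illustrates the idea in plain Python for a single head: one set of global query tokens is reused for every local window, while keys and values come from each window's own tokens through a linear projection. It is not the paper's Algorithm 1. Multi-head splitting, biases and any positional terms are omitted, and the names `global_window_attention`, `w_key` and `w_value` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    shift = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def global_window_attention(global_query, windows, w_key, w_value):
    """Attend from shared global queries to the tokens of each local window."""
    dim = len(global_query[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for tokens in windows:
        keys = matmul(tokens, w_key)      # local keys from a linear projection
        values = matmul(tokens, w_value)  # local values from a linear projection
        scores = matmul(global_query, transpose(keys))
        weights = [softmax([s * scale for s in row]) for row in scores]
        outputs.append(matmul(weights, values))
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
# One query per window position (2x2 window, 4 tokens), shared by all windows.
global_query = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2], [0.0, 0.0]]
windows = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]],
    [[2.0, 1.0], [1.0, 2.0], [0.5, 0.5], [1.5, 0.0]],
]
outputs = global_window_attention(global_query, windows, identity, identity)
print(len(outputs), len(outputs[0]), len(outputs[0][0]))  # 2 4 2
```

Because the query is computed once and passed in, only the key and value projections are evaluated per window, which is the sharing the text describes.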

Related work

ViT (Dosovitskiy et al., 2020) is an alternative to CNNs
ViT has an enlarged receptive field due to its self-attention layers

Experiments

Trained and tested model on ImageNet-1K dataset
Used AdamW optimizer for 300 epochs
Used MS COCO dataset for object detection and instance segmentation
Used ADE20K dataset for semantic segmentation
Compared against Tiny, Small and Base model variants

Classification

Presents ImageNet-1K classification benchmarks
Compares GC ViT with competing deep learning models such as ConvNeXt and Swin Transformer

ADE20K semantic segmentation results

Models using pretrained GC ViT backbones outperform counterpart models
Removing window shifting causes significant performance degradation
Changing distribution of parameters improves performance
Adding CNN-based stem of GC ViT provides additional improvements
Using proposed downsampler further improves performance
Leveraging proposed global self-attention improves performance

Downsampler design

Studied effectiveness of downsampler blocks in Table 5
Simplest alternative is a pair of convolutional and max pooling layers, but it lowers ImageNet Top-1 accuracy by 0.7 percentage points
Patch merging is another variant introduced in Swin Transformers (Liu et al., 2021)

Down-sampler architecture

Conv down-sampler (Conv (s=1) with max pooling) achieved 82.7% Top-1 accuracy
Swin down-sampler (Linear patch merging) achieved 82.9% Top-1 accuracy

GC ViT

Modified Fused-MBConv (s=2) achieves 83.4% Top-1 accuracy on ImageNet
Accuracy drops by 0.5 percentage points when the Swin patch merging down-sampler is used instead
Proposed down-sampler consists of a modified Fused-MBConv block and strided convolution
SE operation boosts cross channel interaction while keeping number of parameters and FLOPs low
Proposed down-sampler is essential to achieve high accuracy as it introduces convolutional inductive bias
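
To make the SE (squeeze-and-excitation) step more concrete, the sketch below follows the standard recipe: average each channel to one number, pass the result through a small bottleneck, and rescale each channel by a sigmoid gate. Whether the GC ViT block matches this recipe exactly is an assumption, and the function name `squeeze_excite` and all weights are hypothetical.

```python
import math


def squeeze_excite(feature_map, w_reduce, w_expand):
    """Rescale each channel of feature_map (channels x H x W nested lists)."""
    # Squeeze: global average pooling gives one number per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excite, step 1: bottleneck linear layer followed by ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(weights, squeezed)))
        for weights in w_reduce
    ]
    # Excite, step 2: expand back to one sigmoid gate per channel.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(weights, hidden))))
        for weights in w_expand
    ]
    # Scale: every value in a channel is multiplied by that channel's gate.
    return [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]


feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],  # channel 0
    [[0.0, 2.0], [2.0, 4.0]],  # channel 1
]
w_reduce = [[0.5, 0.5]]    # 2 channels -> 1 hidden unit
w_expand = [[0.0], [1.0]]  # 1 hidden unit -> 2 channel gates
scaled = squeeze_excite(feature_map, w_reduce, w_expand)
print(len(scaled), len(scaled[0]), len(scaled[0][0]))  # 2 2 2
```

The gate for each channel depends on the pooled statistics of all channels, which is how the operation mixes information across channels. Its only parameters are the two small weight matrices, so the added cost stays low.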

Interpretability

Proposed global self-attention and query tokens are interpretable
Visualization of the learned attention and Grad-CAM can be used to provide insights

E Training details

GC ViT models were trained using 4 nodes with 32 NVIDIA A100 GPUs
Training batch size was 1024 for some models and 4096 for others
Training took 32 hours on average
Used timm package (Wightman, 2019)
Detection and instance segmentation models used 1 node with 8 NVIDIA A40 GPUs and took 56 hours on average
Semantic segmentation models used mmsegmentation (Contributors, 2020) and took 34 hours on average

F Complexity analysis

GC ViT has a computational complexity similar to Swin Transformer
GC ViT captures long-range information and has better accuracy for classification and downstream tasks
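
For a rough sense of scale, the sketch below compares the cost of full self-attention over all tokens with window-based self-attention, using the complexity formulas given in the Swin Transformer paper (Liu et al., 2021). Treating the window formula as a proxy for GC ViT is an assumption based on the similarity stated above. The function names and the 56x56x96 feature map with window size 7 are illustrative choices.

```python
def global_attention_flops(height, width, channels):
    """Full self-attention: 4*h*w*C^2 + 2*(h*w)^2*C (quadratic in tokens)."""
    tokens = height * width
    return 4 * tokens * channels ** 2 + 2 * tokens ** 2 * channels


def window_attention_flops(height, width, channels, window):
    """Window self-attention: 4*h*w*C^2 + 2*M^2*h*w*C (linear in tokens)."""
    tokens = height * width
    return 4 * tokens * channels ** 2 + 2 * window ** 2 * tokens * channels


# Hypothetical first-stage feature map: 56x56 tokens, 96 channels, 7x7 windows.
full = global_attention_flops(56, 56, 96)
windowed = window_attention_flops(56, 56, 96, 7)
print(full, windowed, round(full / windowed, 1))
```

Doubling the resolution multiplies the windowed cost by 4 but multiplies the attention term of the full version by 16, which is why quadratic self-attention becomes impractical on high resolution images.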

G ImageNet classification benchmarks

Table S.4 provides a benchmark for models trained on ImageNet-1K dataset
No additional data was used
