Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proposed a novel architecture to enhance parameter and compute utilization for computer vision tasks
- Model uses global and local self-attention modules to model long and short-range spatial interactions
- Addresses lack of inductive bias and improves modeling of inter-channel dependencies
- Achieves new state-of-the-art performance across image classification, object detection and semantic segmentation tasks
Paper Content
Introduction
- Transformers have achieved SOTA performance in NLP benchmarks.
- Self-attention mechanism allows for capturing contextual representations.
- Vision Transformer (ViT) proposed to utilize image patches as tokens.
- ViT-based models have achieved SOTA or competitive performance in various computer vision tasks.
- Self-attention mechanism in ViT allows for learning more uniform short and long-range information.
- Monolithic architecture of ViT and quadratic computational complexity of self-attention limit application to high resolution images.
- Efforts have attempted to address balance between short-and long-range spatial dependencies.
- Limited receptive field of local windows challenges self-attention to capture long-range information.
- Global Context (GC) ViT network proposed to address limitations.
- Hierarchical ViT architecture consisting of local and global self-attention modules.
- Global query tokens shared across all global self-attention modules.
- Novel downsampling block with a parameter-efficient fused-MBConv layer.
- GC ViT achieves SOTA benchmarks of 83.4%, 83.9%, 84.4% and 84.6% Top-1 accuracy.
- GC ViT outperforms ConvNeXt and Swin Transformer models.
- GC ViT achieves SOTA results on MS COCO and ADE20K datasets.
Input features
Fused mbconv
- Max pooling is a technique used in computer science
- Max pooling involves dividing an image into 2x2 sections and taking the maximum value from each section
Extracted global features reshape repeat
Stage-wise global tokens
- Spatial matching with local tokens
- Tokenized features used for input-to-stage dimension matching
Global query generator
- A layer in the generator consists of a Fused-MBConv block and a max pooling layer.
- Global query tokens are computed once at every stage of the model and shared across all global attention blocks.
- Global attention layers learn local key and value features which will be used for interaction with global query tokens.
- Global self-attention query, key and value features are computed using a linear layer.
Global self-attention
- Algorithm 1 presents a PyTorch-like pseudocode for computing global self-attention in GC ViT.
- Interaction with rich contextual information embedded in the global query tokens provides an effective way of enlarging the receptive field and attending to various regions in the input feature maps.
Related work
- ViT (Dosovitskiy et al., 2020) is an alternative to CNNs
- ViT has an enlarged receptive field due to its self-attention layers
Experiments
- Trained and tested model on ImageNet-1K dataset
- Used AdamW optimizer for 300 epochs
- Used MS COCO dataset for object detection and instance segmentation
- Used ADE20K dataset for semantic segmentation
- Compared against Tiny, Small and Base model variants
Classification
- Presents ImageNet-1K classification benchmarks
- Focuses on the performance of deep learning models
Ade20k semantic segmentation results
- Models using pretrained GC ViT backbones outperform counterpart models
- Removing window shifting causes significant performance degradation
- Changing distribution of parameters improves performance
- Adding CNN-based stem of GC ViT provides additional improvements
- Using proposed downsampler further improves performance
- Leveraging proposed global self-attention improves performance
Downsampler design
- Studied effectiveness of downsampler blocks in Table 5
- Simplest alternative is convolutional and maxpooling layers, but reduces ImageNet Top-1 accuracy by -0.7
- Patch merging is another variant introduced in Swin Transformers (Liu et al., 2021)
Down-sampler architecture
- Top-1 Conv Conv (s=1) achieved 82.7% accuracy
- Swin Linear achieved 82.9% accuracy
Gc vit
- Modified Fused-MBConv (s=2) achieves 83.4% Top-1 accuracy on ImageNet
- Accuracy is reduced by -0.5 when using a different down-sampler
- Proposed down-sampler consists of a modified Fused-MBConv block and strided convolution
- SE operation boosts cross channel interaction while keeping number of parameters and FLOPs low
- Proposed down-sampler is essential to achieve high accuracy as it introduces convolutional inductive bias
Interpretability
- Proposed global self-attention and query tokens are interpretable
- Visualization of the learned attention and Grad-CAM can be used to provide insights
E training details
- GC ViT models were trained using 4 nodes with 32 NVIDIA A100 GPUs
- Training batch size was 1024 for some models and 4096 for others
- Training took 32 hours on average
- Used timm package (Wightman, 2019)
- Detection and instance segmentation models used 1 node with 8 NVIDIA A40 GPUs and took 56 hours on average
- Semantic segmentation models used mmsegmentation (Contributors, 2020) and took 34 hours on average
F complexity analysis
- GC ViT has a computational complexity similar to Swin Transformer
- GC ViT captures long-range information and has better accuracy for classification and downstream tasks
G imagenet classification benchmarks
- Table S.4 provides a benchmark for models trained on ImageNet-1K dataset
- No additional data was used