Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Vision Transformers have recently shown great success in various tasks.
  • A convolution-based framework can be used to implement the key ingredients of vision Transformers.
  • A new operation called Recursive Gated Convolution ($\textit{g}^\textit{n}$Conv) is highly flexible and customizable.
  • HorNet is a new family of generic vision backbones based on $\textit{g}^\textit{n}$Conv.
  • HorNet outperforms Swin Transformers and ConvNeXt.
  • $\textit{g}^\textit{n}$Conv can be applied to task-specific decoders and improve dense prediction performance.
  • $\textit{g}^\textit{n}$Conv is a new basic module for visual modeling that combines the merits of both vision Transformers and CNNs.

Paper Content

Introduction

  • Convolutional neural networks (CNNs) have been used for deep learning and computer vision since the introduction of AlexNet
  • CNNs have useful properties that make them suitable for a wide range of vision applications
  • CNNs are efficient on high-performance GPUs and edge devices
  • Vision Transformers have challenged the dominance of CNNs
  • Vision Transformers have shown leading performance on various vision tasks
  • Vision Transformers are more powerful than CNNs
  • Efforts have been made to improve CNN architectures by learning from vision Transformers
  • The success of self-attention and other dynamic networks suggests that explicit and high-order spatial interactions are beneficial
  • The key ingredient behind the success of vision Transformers is the new way of spatial modeling with input-adaptive, long-range and high-order spatial interactions
  • A convolution-based framework is proposed to efficiently implement the key ingredients of vision Transformers
  • Experiments are conducted to verify the effectiveness of the proposed models
  • Transformer architecture was originally designed for natural language processing tasks
  • Dosovitskiy et al. showed that vision models constructed with Transformer blocks and a patch embedding layer can achieve performance competitive with CNNs
  • State-of-the-art vision Transformers usually utilize a CNN-like hierarchical architecture and local self-attention
  • Combining vision Transformers and CNNs to develop hybrid architectures is a new direction
  • Gated convolution (gConv) is used to achieve efficient 1-order spatial interactions
  • $\textit{g}^\textit{n}$Conv is designed to further enhance the model capacity by introducing higher-order interactions
  • $\textit{g}^\textit{n}$Conv achieves n-order spatial interactions with a similar computational cost to a convolutional layer
  • 7x7 convolution and Global Filter (GF) are used to capture long-term dependencies
  • Vision Transformers have higher-order spatial interactions in each basic block
  • $\textit{g}^\textit{n}$Conv is an extension of self-attention in terms of the order of the spatial mixing weight
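The recursion described in the last few bullets can be sketched on a 1D signal in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the random weights, the channel-halving schedule, the placement of the `alpha` rescaling, and the 1D setting are all assumptions made for the example. With `order=1` it reduces to a single gated convolution.

```python
import random

def linear(x, w):
    """Pointwise projection: x is [L][C_in], w is [C_in][C_out]."""
    return [[sum(row[i] * w[i][j] for i in range(len(w)))
             for j in range(len(w[0]))] for row in x]

def depthwise_conv1d(x, kernels):
    """Per-channel 1D convolution with zero padding; kernels is [C][K], K odd."""
    length, channels, ksize = len(x), len(x[0]), len(kernels[0])
    pad = ksize // 2
    out = [[0.0] * channels for _ in range(length)]
    for t in range(length):
        for c in range(channels):
            for k in range(ksize):
                s = t + k - pad
                if 0 <= s < length:
                    out[t][c] += kernels[c][k] * x[s][c]
    return out

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def gn_conv_1d(x, order, kernel_size=7, alpha=1.0, seed=0):
    """Hypothetical 1D recursive gated convolution; x is [L][C]."""
    rng = random.Random(seed)  # random weights stand in for learned ones
    channels = len(x[0])
    if channels % 2 ** (order - 1) != 0:
        raise ValueError("channels must be divisible by 2**(order - 1)")
    # Assumed coarse-to-fine widths, e.g. C=8, order=3 -> [2, 4, 8].
    dims = [channels // 2 ** (order - 1 - k) for k in range(order)]
    # One input projection produces p_0 and all gating features q_0..q_{n-1}.
    proj = linear(x, rand_matrix(channels, dims[0] + sum(dims), rng))
    p = [row[:dims[0]] for row in proj]
    q_all = depthwise_conv1d([row[dims[0]:] for row in proj],
                             rand_matrix(sum(dims), kernel_size, rng))
    offset = 0
    for k, d in enumerate(dims):
        q_k = [row[offset:offset + d] for row in q_all]
        offset += d
        if k > 0:
            # Match the channel width of the next gating feature.
            p = linear(p, rand_matrix(dims[k - 1], d, rng))
        # Gating: element-wise product with spatially mixed features.
        p = [[a * b / alpha for a, b in zip(pr, qr)] for pr, qr in zip(p, q_k)]
    return linear(p, rand_matrix(channels, channels, rng))  # output projection

x = [[float(t + c) for c in range(8)] for t in range(6)]
y = gn_conv_1d(x, order=3)
print(len(y), len(y[0]))  # 6 8
```

Each recursion step multiplies in one more input-dependent factor. In this bias-free sketch, doubling the input therefore scales the output by 2^(order + 1), which is one simple way to see that the interaction order grows with `order`.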

Model architectures

  • HorNet uses $\textit{g}^\textit{n}$Conv as a drop-in replacement for the spatial mixing layer in vision Transformers and modern CNNs.
  • HorNet has two series of model variants: HorNet 7x7 and HorNet GF.
  • HorFPN replaces standard convolution with $\textit{g}^\textit{n}$Conv to improve spatial interactions.
  • HorFPN has two implementations: HorFPN 7x7 and HorFPN GF.
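The difference between the 7x7 and GF variants is where the spatial mixing happens. A rough 1D sketch of the Global Filter idea follows, assuming GF multiplies the signal's spectrum by a set of weights in the frequency domain. The function names, the naive DFT, and the 1D setting are illustrative assumptions and not taken from the paper.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n)
               for t in range(n)) for f in range(n)]
    return [v / n for v in out] if inverse else out

def global_filter(x, freq_weights):
    """Multiply the spectrum of x by per-frequency weights, then transform back.

    Taking the real part is valid when freq_weights is the DFT of a real kernel.
    """
    spectrum = dft(x)
    filtered = [s * w for s, w in zip(spectrum, freq_weights)]
    return [v.real for v in dft(filtered, inverse=True)]

def circular_conv(x, kernel):
    """Reference: circular convolution with a kernel as long as the signal."""
    n = len(x)
    return [sum(kernel[k] * x[(t - k) % n] for k in range(n)) for t in range(n)]

x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.25]  # hypothetical kernel covering the whole signal
via_frequency = global_filter(x, dft(kernel))
via_space = circular_conv(x, kernel)
print(max(abs(a - b) for a, b in zip(via_frequency, via_space)) < 1e-9)  # True
```

By the convolution theorem, the frequency-domain product equals a circular convolution whose kernel spans the full input, so every position can interact with every other position in a single layer. A 7x7 depth-wise convolution only covers a local window.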

Experiments

  • Conducted extensive experiments to verify effectiveness of method
  • Presented main results on ImageNet and compared with various architectures
  • Tested models on downstream dense prediction tasks on ADE20K and COCO
  • Provided ablation studies and analyzed effectiveness of $\textit{g}^\textit{n}$Conv on a wide range of models

ImageNet classification

  • Conducted image classification experiments on ImageNet dataset
  • Used the standard ImageNet-1K dataset and the ImageNet-22K dataset
  • Trained models for 300 epochs on ImageNet-1K and 90 epochs on ImageNet-22K
  • Compared models with state-of-the-art vision Transformers and CNNs
  • Models achieved competitive performance and surpassed Swin Transformers
  • Models generalized well to larger image resolution, model sizes and training data

Dense prediction tasks

  • Evaluated HorNet for semantic segmentation on ADE20K dataset
  • HorNet 7x7 and HorNet GF models outperform Swin and ConvNeXt models
  • HorNet-L 7x7 and HorNet-L GF outperform ConvNeXt-XL with fewer FLOPs
  • Evaluated HorNet for object detection on COCO dataset
  • HorNet models achieve better performance than Swin/ConvNeXt counterparts
  • HorFPN reduces FLOPs and achieves better validation mIoU for semantic segmentation
  • HorFPN outperforms standard FPN in box AP and mask AP on different backbones
  • HorFPN GF is consistently better than HorFPN 7x7

Analysis

  • Ablation study shows that $\textit{g}^\textit{n}$Conv with $n = 1$ ($\textit{g}^{\{1,1,1,1\}}$Conv) improves over the baseline model
  • Increasing order ($\textit{g}^{\{2,3,4,5\}}$Conv) further improves accuracy
  • $\textit{g}^\textit{n}$Conv better captures high-order spatial interactions than self-attention and depth-wise convolution
  • $\textit{g}^\textit{n}$Conv improves isotropic architectures by a large margin
  • $\textit{g}^\textit{n}$Conv improves 3x3 depth-wise convolution and 3x3 pooling by large margins
  • HorNet has better accuracy-complexity trade-offs than vision Transformers and modern CNNs
  • $\textit{g}^\textit{n}$Conv has adaptive weights to input samples and spatial locations
  • HorNet is slower than ConvNeXt with similar FLOPs on GPU

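The superscripts in the ablation, such as {1,1,1,1} and {2,3,4,5}, give the interaction order used in each of the four stages. The sketch below shows how per-step channel widths could be derived from a stage width and an order. The halving schedule, the helper name, and the stage widths `[64, 128, 256, 512]` are assumptions chosen for illustration, not values stated in this summary.

```python
def gn_conv_channel_schedule(width, order):
    """Channel width used at each recursion step, coarse to fine (assumed schedule)."""
    if width % 2 ** (order - 1) != 0:
        raise ValueError("width must be divisible by 2**(order - 1)")
    return [width // 2 ** (order - 1 - k) for k in range(order)]

stage_widths = [64, 128, 256, 512]  # hypothetical four-stage backbone widths
configs = {"orders 1,1,1,1": [1, 1, 1, 1], "orders 2,3,4,5": [2, 3, 4, 5]}

schedules = {
    name: [gn_conv_channel_schedule(w, n) for w, n in zip(stage_widths, orders)]
    for name, orders in configs.items()
}

for name, per_stage in schedules.items():
    print(name)
    for width, dims in zip(stage_widths, per_stage):
        # First width (p_0) plus all gating widths: the size of the input projection.
        projected = dims[0] + sum(dims)
        print(f"  stage width {width}: steps {dims}, projected channels {projected}")
```

Under this schedule the input projection always produces twice the stage width in channels, whatever the order. That is one way a higher order can be added without a large increase in cost, because the early recursion steps operate on narrow features.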
Conclusion

  • We presented Recursive Gated Convolution ($\textit{g}^\textit{n}$Conv) for efficient, extendable, and translation-equivariant high-order spatial interactions
  • $\textit{g}^\textit{n}$Conv can be used as a drop-in replacement for spatial mixing layers in vision Transformers and convolution-based models
  • We constructed a new family of generic vision backbones called HorNet
  • Experiments demonstrate the effectiveness of $\textit{g}^\textit{n}$Conv and HorNet on commonly used visual recognition benchmarks
  • We used LayerScale techniques to make our models more stable during training
  • We compared our models with ConvNeXt and Swin Transformers on ImageNet-1K
  • We found that re-scaling the output of gated convolution with $\alpha = 3$ leads to the best performance
  • We found that using no activation function in gated convolutions achieves the best performance
  • We compared HorFPN with standard FPN on different backbones and found HorFPN consistently outperforms standard FPN
  • We compared our models with recent state-of-the-art frameworks on object detection and semantic segmentation and found our models outperform them
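The LayerScale technique mentioned in the list above can be sketched as a per-channel scale applied to a residual branch. This is a generic illustration of the idea. The function name, the toy branch, and the initial value `1e-6` are assumptions, and the summary does not state how HorNet configures it.

```python
def layer_scale_residual(x, branch, gamma):
    """Return x + gamma * branch(x), with one scale value per channel.

    x: list of per-channel values; branch: function mapping such a list to a
    list of the same length; gamma: per-channel scales (learned in practice).
    """
    fx = branch(x)
    return [xi + g * fi for xi, g, fi in zip(x, gamma, fx)]

def toy_branch(v):
    # Stand-in for a spatial mixing or feed-forward branch.
    return [10.0 * a for a in v]

x = [1.0, -2.0, 0.5]
gamma = [1e-6] * len(x)  # hypothetical small initial value
y = layer_scale_residual(x, toy_branch, gamma)
print(max(abs(a - b) for a, b in zip(x, y)) < 1e-4)  # True
```

With a small initial scale the block starts close to the identity mapping, so early updates change the features gradually. That is the usual motivation for using LayerScale to stabilize training.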