Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Recent progress in vision Transformers has been successful in various tasks.
- A convolution-based framework can be used to implement the key ingredients of vision Transformers.
- A new operation called Recursive Gated Convolution ($\textit{g}^\textit{n}$Conv) is highly flexible and customizable.
- HorNet is a new family of generic vision backbones based on $\textit{g}^\textit{n}$Conv.
- HorNet outperforms Swin Transformers and ConvNeXt.
- $\textit{g}^\textit{n}$Conv can be applied to task-specific decoders and improve dense prediction performance.
- $\textit{g}^\textit{n}$Conv is a new basic module for visual modeling that combines the merits of both vision Transformers and CNNs.
Paper Content
Introduction
- Convolutional neural networks (CNNs) have been used for deep learning and computer vision since the introduction of AlexNet
- CNNs have useful properties that make them suitable for a wide range of vision applications
- CNNs are efficient on high-performance GPUs and edge devices
- Vision Transformers have challenged the dominance of CNNs
- Vision Transformers have shown leading performance on various vision tasks
- Vision Transformers are more powerful than CNNs
- Efforts have been made to improve CNN architectures by learning from vision Transformers
- The success of self-attention and other dynamic networks suggests that explicit and high-order spatial interactions are beneficial
- The key ingredient behind the success of vision Transformers is the new way of spatial modeling with input-adaptive, long-range and high-order spatial interactions
- A convolution-based framework is proposed to efficiently implement the key ingredients of vision Transformers
- Experiments are conducted to verify the effectiveness of the proposed models
Related work
- Transformer architecture was originally designed for natural language processing tasks
- Dosovitskiy et al. showed that vision models constructed with Transformer blocks and a patch embedding layer can achieve performance competitive with CNNs
- State-of-the-art vision Transformers usually utilize a CNN-like hierarchical architecture and local self-attention
- Combining vision Transformers and CNNs to develop hybrid architectures is a new direction
- Gated convolution (gConv) is used to achieve efficient 1-order spatial interactions
- $\textit{g}^\textit{n}$Conv is designed to further enhance the model capacity by introducing higher-order interactions
- $\textit{g}^\textit{n}$Conv achieves n-order spatial interactions with a similar computational cost to a convolutional layer
- 7x7 convolution and Global Filter (GF) are used to capture long-term dependencies
- Vision Transformers have higher-order spatial interactions in each basic block
- $\textit{g}^\textit{n}$Conv is an extension of self-attention in terms of the order of the spatial mixing weight
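The recursion behind $\textit{g}^\textit{n}$Conv can be illustrated with a toy 1-D version in plain Python. This is a minimal sketch under assumptions, not the authors' implementation: the function names (`channel_dims`, `linear`, `depthwise_conv1d`, `gnconv1d`), the list-of-lists tensor layout, the channel-halving schedule, and the `alpha` scaling argument are illustrative choices based on one reading of the method, and the real layer operates on 2-D feature maps.

```python
def channel_dims(dim, order):
    """Channel width of each order (assumed schedule: halve per lower order)."""
    return [dim // 2 ** (order - 1 - k) for k in range(order)]


def linear(x, weight):
    """Point-wise linear layer. x: [in_ch][length], weight: [out_ch][in_ch]."""
    length = len(x[0])
    return [[sum(w * ch[i] for w, ch in zip(row, x)) for i in range(length)]
            for row in weight]


def depthwise_conv1d(x, kernels):
    """Depth-wise 1-D filtering with zero padding, one odd-length kernel per
    channel (cross-correlation, as in common deep-learning libraries)."""
    out = []
    for ch, kernel in zip(x, kernels):
        pad = len(kernel) // 2
        padded = [0.0] * pad + list(ch) + [0.0] * pad
        out.append([sum(w * padded[i + j] for j, w in enumerate(kernel))
                    for i in range(len(ch))])
    return out


def gnconv1d(x, order, w_in, dw_kernels, w_mix, w_out, alpha=1.0):
    """Toy recursive gated convolution on x: [dim][length].

    Assumes dim is divisible by 2 ** (order - 1), so the input projection
    to 2 * dim channels splits exactly into p_0 and q_0 .. q_{order-1}.
    """
    dims = channel_dims(len(x), order)
    fused = linear(x, w_in)                  # dim -> 2 * dim channels
    p = fused[:dims[0]]                      # p_0: the gated branch
    q = depthwise_conv1d(fused[dims[0]:], dw_kernels)  # all q_k, concatenated
    start = 0
    for k in range(order):
        q_k = q[start:start + dims[k]]
        start += dims[k]
        if k > 0:
            p = linear(p, w_mix[k - 1])      # widen p to the width of q_k
        # Gating: element-wise product of the two branches, scaled by 1/alpha.
        p = [[a * b / alpha for a, b in zip(pc, qc)] for pc, qc in zip(p, q_k)]
    return linear(p, w_out)                  # output projection


# Usage: order 1 with identity-like projections and a size-1 kernel.
x = [[1, 2, 3], [4, 5, 6]]
w_in = [[1, 0], [0, 1], [1, 0], [0, 1]]      # copies x into both branches
eye = [[1, 0], [0, 1]]
y = gnconv1d(x, 1, w_in, [[1.0], [1.0]], [], eye)
```

With `order=1`, identity projections and a size-1 kernel, the output is the element-wise square of the input, which corresponds to the first-order (gConv) case. Each additional order multiplies in one more spatially filtered projection of the input.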
Model architectures
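- HorNet is built by using $\textit{g}^\textit{n}$Conv as a drop-in replacement for the spatial mixing layer in vision Transformers and modern CNNs.
- HorNet has two series of model variants: HorNet 7x7 and HorNet GF.
- HorFPN replaces the standard convolution in FPN with $\textit{g}^\textit{n}$Conv to improve spatial interactions.
- HorFPN has two implementations: HorFPN 7x7 and HorFPN GF.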
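The 7x7 and GF variants differ in how the depth-wise spatial filter is realized. As a rough sketch of the GF idea, assuming it amounts to multiplying features by a filter in the frequency domain, the following 1-D example shows that such a filter is equivalent to a circular convolution whose kernel spans the whole input. The helper names (`dft`, `global_filter_1d`, `circular_conv_1d`) are hypothetical, the naive DFT is for illustration only, and an actual layer would use learnable weights and a 2-D FFT.

```python
import cmath


def dft(signal, inverse=False):
    """Naive O(N^2) discrete Fourier transform (1/N scaling on the inverse)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t in range(n))
        out.append(total / n if inverse else total)
    return out


def global_filter_1d(signal, freq_filter):
    """Filter a real 1-D signal by element-wise multiplication of its spectrum."""
    spectrum = dft(signal)
    filtered = [s * w for s, w in zip(spectrum, freq_filter)]
    return [value.real for value in dft(filtered, inverse=True)]


def circular_conv_1d(signal, kernel):
    """Reference: circular convolution with a kernel as long as the signal."""
    n = len(signal)
    return [sum(kernel[j] * signal[(i - j) % n] for j in range(n))
            for i in range(n)]


# Usage: a frequency-domain filter built from a spatial kernel.
signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25, 0.0, 0.25]
via_frequency = global_filter_1d(signal, dft(kernel))
via_space = circular_conv_1d(signal, kernel)
```

Both results agree up to floating-point error, which is why a frequency-domain filter can cover the whole input at once, whereas a 7x7 kernel only covers a local window.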
Experiments
- Conducted extensive experiments to verify effectiveness of method
- Presented main results on ImageNet and compared with various architectures
- Tested models on downstream dense prediction tasks on ADE20K and COCO
- Provided ablation studies and analyzed effectiveness of $\textit{g}^\textit{n}$Conv on wide range of models
ImageNet classification
- Conducted image classification experiments on ImageNet dataset
- Used the standard ImageNet-1K dataset and the larger ImageNet-22K dataset
- Trained models for 300 epochs on ImageNet-1K and 90 epochs on ImageNet-22K
- Compared models with state-of-the-art vision Transformers and CNNs
- Models achieved competitive performance and surpassed Swin Transformers
- Models generalized well to larger image resolution, model sizes and training data
Dense prediction tasks
- Evaluated HorNet for semantic segmentation on ADE20K dataset
- HorNet 7x7 and HorNet GF models outperform Swin and ConvNeXt models
- HorNet-L 7x7 and HorNet-L GF outperform ConvNeXt-XL with fewer FLOPs
- Evaluated HorNet for object detection on COCO dataset
- HorNet models achieve better performance than Swin/ConvNeXt counterparts
- HorFPN reduces FLOPs and achieves better validation mIoU for semantic segmentation
- HorFPN outperforms standard FPN in box AP and mask AP on different backbones
- HorFPN GF is consistently better than HorFPN 7x7
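Mean IoU (mIoU), the segmentation metric referred to above, averages the intersection-over-union of each class. The following is a minimal sketch on flat label lists, assuming that classes absent from both prediction and ground truth are skipped. The function name `mean_iou` is illustrative, and benchmark toolkits typically accumulate counts over the whole validation set rather than per image.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        # Pixels where both agree on class c, and pixels where either says c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Usage: two classes, four pixels.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In this example class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.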
Analysis
- Ablation study shows that $\textit{g}^\textit{n}$Conv with $n = 1$ ($\textit{g}^{\{1,1,1,1\}}$Conv) improves over the baseline model
- Increasing the order ($\textit{g}^{\{2,3,4,5\}}$Conv, where the braces list the order used in each of the four stages) further improves accuracy
- $\textit{g}^\textit{n}$Conv better captures high-order spatial interactions than self-attention and depth-wise convolution
- $\textit{g}^\textit{n}$Conv improves isotropic architectures by a large margin
- $\textit{g}^\textit{n}$Conv improves 3x3 depth-wise convolution and 3x3 pooling by large margins
- HorNet has better accuracy-complexity trade-offs than vision Transformers and modern CNNs
- $\textit{g}^\textit{n}$Conv has adaptive weights to input samples and spatial locations
- HorNet is slower than ConvNeXt with similar FLOPs on GPU
Conclusion
- We presented Recursive Gated Convolution ($\textit{g}^\textit{n}$Conv) for efficient, extendable, and translation-equivariant high-order spatial interactions
- $\textit{g}^\textit{n}$Conv can be used as a drop-in replacement for spatial mixing layers in vision Transformers and convolution-based models
- We constructed a new family of generic vision backbones called HorNet
- Experiments demonstrate the effectiveness of $\textit{g}^\textit{n}$Conv and HorNet on commonly used visual recognition benchmarks
- We used LayerScale techniques to make our models more stable during training
- We compared our models with ConvNeXt and Swin Transformers on ImageNet-1K
- We found that re-scaling the output of gated convolution with $\alpha = 3$ leads to the best performance
- We found that no activation function achieves the best performance for gated convolutions
- We compared HorFPN with standard FPN on different backbones and found HorFPN consistently outperforms standard FPN
- We compared our models with recent state-of-the-art frameworks on object detection and semantic segmentation and found our models outperform them