Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Many works have focused on reducing the number of FLOPs to design fast neural networks.
- Reducing FLOPs does not necessarily lead to a similar level of reduction in latency.
- Low FLOPS (floating-point operations per second, as opposed to the FLOPs count) is mainly due to frequent memory access by operators, especially the depthwise convolution.
- Proposed a novel partial convolution (PConv) to extract spatial features more efficiently.
- Proposed FasterNet, a new family of neural networks, which is faster and more accurate than existing networks.
Paper Content
Introduction
- Neural networks have developed rapidly across computer vision tasks such as classification, detection, and segmentation
- Researchers and practitioners prefer to design cost-effective fast neural networks
- Networks are designed to reduce computational complexity, measured in FLOPs
- Depthwise convolution (DWConv) and group convolution (GConv) are used to extract spatial features at reduced FLOPs
- MicroNet further decomposes and sparsifies the network to reduce FLOPs
- ViTs and MLPs are also being made smaller and faster
- FLOPS (floating-point operations per second) is a measure of effective computational speed
- Many existing neural networks suffer from low FLOPS
- As a result, a discrepancy between reductions in FLOPs and reductions in latency is observed
- PConv is proposed as a competitive alternative to reduce computational redundancy and memory access
- FasterNet is introduced as a new family of networks with low latency and high throughput
- PConv and FasterNet are validated with extensive experiments
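The discrepancy between FLOPs and latency noted above follows from the relation latency = FLOPs / FLOPS. The following is a minimal sketch of that arithmetic; the function name and all numbers are hypothetical and chosen only for illustration, not taken from the paper's measurements.

```python
def latency_seconds(flops: float, flops_per_second: float) -> float:
    """Latency = FLOPs / FLOPS (operations divided by operations per second)."""
    if flops_per_second <= 0:
        raise ValueError("flops_per_second must be positive")
    return flops / flops_per_second


# Hypothetical baseline operator: 4 GFLOPs executed at 2 TFLOPS.
baseline = latency_seconds(flops=4.0e9, flops_per_second=2.0e12)

# Hypothetical "efficient" operator: 4x fewer FLOPs, but it is memory-bound
# and only reaches 4x lower FLOPS, so the latency does not improve.
reduced = latency_seconds(flops=1.0e9, flops_per_second=0.5e12)

print(f"baseline: {baseline * 1e3:.3f} ms, reduced: {reduced * 1e3:.3f} ms")
```

In this toy case both operators take the same time, which is why a lower FLOPs count alone says little about speed.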
Related work
- CNNs are the mainstream architecture in computer vision
- Numerous studies have been done to increase efficiency
- Group convolution and depthwise separable convolution are popular
- Growing interest in ViT and variants
- Studies have attempted to improve ViT in terms of training and model design
- Attention-based mechanisms generally run slower than convolutional counterparts
- Focus on analyzing convolution operations, particularly DWConv
Design of PConv and FasterNet
- DWConv has an issue with frequent memory access
- PConv is an alternative operator to resolve the issue
- FasterNet is introduced and its details are explained
Preliminary
- DWConv is a variant of Conv and is used in many neural networks.
- DWConv has low FLOPs compared to regular Conv.
- DWConv is typically followed by a pointwise convolution.
- In practice, DWConv needs a larger channel count (a wider network) to compensate for the accuracy drop it causes.
- This results in higher memory access and can slow down computation.
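The trade-off described above can be made concrete with rough cost counts. This is a sketch under stated assumptions: a regular Conv with c input and c output channels costs about h*w*k^2*c^2 FLOPs, a DWConv costs h*w*k^2*c, and memory access is counted in elements as feature-map reads and writes plus filter reads. The feature-map size, channel count, and the 6x widening factor are hypothetical values for illustration.

```python
def conv_flops(h, w, k, c):
    """Regular Conv, c input and c output channels: h*w*k^2*c^2."""
    return h * w * k * k * c * c


def conv_memory_access(h, w, k, c):
    """Input read + output write (h*w*2c) plus filter reads (k^2*c^2)."""
    return h * w * 2 * c + k * k * c * c


def dwconv_flops(h, w, k, c):
    """Depthwise Conv, one k x k filter per channel: h*w*k^2*c."""
    return h * w * k * k * c


def dwconv_memory_access(h, w, k, c):
    """Input read + output write (h*w*2c) plus filter reads (k^2*c)."""
    return h * w * 2 * c + k * k * c


# Hypothetical layer: 56x56 feature map, 3x3 kernel, 64 channels.
h, w, k, c = 56, 56, 3, 64
c_wide = 6 * c  # assumed widening of the DWConv to recover accuracy

regular = (conv_flops(h, w, k, c), conv_memory_access(h, w, k, c))
depthwise = (dwconv_flops(h, w, k, c_wide), dwconv_memory_access(h, w, k, c_wide))

print("regular Conv   FLOPs, memory access:", regular)
print("widened DWConv FLOPs, memory access:", depthwise)
```

Under these assumptions the widened DWConv still has far fewer FLOPs than the regular Conv, but it touches several times more memory, which is the effect the list above attributes the slowdown to.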
Partial convolution as a basic operator
- Feature maps share high similarities among different channels
- Proposed a simple PConv to reduce computational redundancy and memory access
- PConv has fewer FLOPs and less memory access than a regular Conv
- Remaining channels are kept untouched instead of removed
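A minimal sketch of the operator described above, written in plain Python on nested lists so that it runs without any framework. It assumes the first c_p channels are the ones convolved and that zero "same" padding is used; the helper names `conv2d_same` and `pconv` are hypothetical, and this is not the paper's own implementation.

```python
def conv2d_same(x, weight):
    """Naive zero-padded 2D convolution (cross-correlation).

    x:      [c_in][h][w] nested lists
    weight: [c_out][c_in][k][k] nested lists, k odd
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(weight[0][0])
    pad = k // 2
    out = []
    for filt in weight:  # one filter per output channel
        plane = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for ci in range(c_in):
                    for di in range(k):
                        for dj in range(k):
                            ii, jj = i + di - pad, j + dj - pad
                            if 0 <= ii < h and 0 <= jj < w:  # zero padding
                                acc += x[ci][ii][jj] * filt[ci][di][dj]
                plane[i][j] = acc
        out.append(plane)
    return out


def pconv(x, weight):
    """Partial convolution: convolve the first c_p channels, keep the rest untouched."""
    c_p = len(weight)  # the filters map c_p channels to c_p channels
    convolved = conv2d_same(x[:c_p], weight)
    return convolved + x[c_p:]  # remaining channels pass through unchanged


# Four 3x3 channels filled with 1.0, 2.0, 3.0, 4.0; convolve only the first one.
x = [[[float(ch)] * 3 for _ in range(3)] for ch in range(1, 5)]
weight = [[[[1.0] * 3 for _ in range(3)]]]  # shape [1][1][3][3], all ones
y = pconv(x, weight)
print(len(y), y[0][1][1], y[1] == x[1])
```

The output keeps all four channels: the first is a 3x3 box sum of the input, and the other three are passed through as they are, so a later pointwise convolution can still use them.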
PConv followed by PWConv
- A PConv followed by a pointwise convolution (PWConv) has an effective receptive field on the input feature maps that looks like a T-shaped Conv.
- The center position of the T-shaped Conv is more important than its surrounding neighbors.
- Decomposing the T-shaped Conv into a PConv and a PWConv saves FLOPs.
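The FLOPs saving from the decomposition can be checked with a short calculation. This sketch assumes the T-shaped Conv applies a k x k kernel to c_p input channels and a 1x1 kernel to the remaining c - c_p channels for each of the c output channels, while the decomposed form is a PConv on c_p channels followed by a PWConv on all c channels. Function names and example sizes are hypothetical.

```python
def t_shaped_conv_flops(h, w, k, c, c_p):
    """k x k on c_p input channels plus 1x1 on the other c - c_p, for c outputs."""
    return h * w * (k * k * c_p * c + c * (c - c_p))


def pconv_pwconv_flops(h, w, k, c, c_p):
    """PConv (k x k, c_p -> c_p) followed by PWConv (1x1, c -> c)."""
    return h * w * (k * k * c_p * c_p + c * c)


# Hypothetical layer: 56x56 feature map, 3x3 kernel, 64 channels, c_p = c / 4.
h, w, k, c = 56, 56, 3, 64
c_p = c // 4

t_shaped = t_shaped_conv_flops(h, w, k, c, c_p)
decomposed = pconv_pwconv_flops(h, w, k, c, c_p)
print(f"T-shaped: {t_shaped}, PConv + PWConv: {decomposed}, "
      f"ratio: {decomposed / t_shaped:.3f}")
```

With these example sizes the decomposed form needs roughly half the FLOPs of the single T-shaped Conv.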
FasterNet as a general backbone
- Proposed FasterNet, a new family of neural networks
- Architecture has four hierarchical stages
- Each stage has a stack of FasterNet blocks
- Blocks in the last two stages require less memory access and tend to have higher FLOPS, so more blocks and computation are assigned to those stages
- Each FasterNet block has a PConv layer followed by two PWConv layers
- Normalization and activation layers are placed only after each middle PWConv
- Batch normalization is used for normalization, and GELU or ReLU for activation
- Global average pooling, Conv 1x1, and fully-connected layer used for feature transformation and classification
- Four variants of FasterNet provided (Tiny, Small, Medium, Large)
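The block layout described above can be written down as a list of layers with their channel counts. This is only a structural sketch: the function name is hypothetical, the partial ratio of 1/4 and the expansion factor of 2 for the middle PWConv are assumed defaults, and the shortcut connection around the block is included as an assumption rather than something stated in the list above.

```python
def fasternet_block_layers(c, partial_ratio=0.25, expansion=2, act="GELU"):
    """Return (layer, in_channels, out_channels) tuples for one block.

    partial_ratio and expansion are assumed defaults for illustration.
    """
    c_p = int(c * partial_ratio)  # channels actually convolved by PConv
    hidden = c * expansion        # width of the middle PWConv
    return [
        ("PConv 3x3", c_p, c_p),        # other c - c_p channels pass through
        ("PWConv 1x1", c, hidden),      # middle PWConv (expands channels)
        ("BatchNorm", hidden, hidden),  # normalization after the middle PWConv
        (act, hidden, hidden),          # activation after the middle PWConv
        ("PWConv 1x1", hidden, c),      # projects back to c channels
        ("shortcut add", c, c),         # assumed residual connection
    ]


for name, c_in, c_out in fasternet_block_layers(64):
    print(f"{name:<14} {c_in:>4} -> {c_out:<4}")
```

Listing the layers this way makes it easy to see that normalization and activation appear once per block, between the two PWConv layers.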
Experimental results
- Examined the computational speed of PConv and its effectiveness when combined with a PWConv
- Evaluated performance of FasterNet for classification, detection, and segmentation tasks
- Conducted ablation study
- Benchmarked latency and throughput using GPU, CPU, and ARM processors
PConv is fast with high FLOPS
- PConv is fast and exploits on-device computational capacity
- With a partial ratio of r = 1/4, PConv has 1/16 of the FLOPs of a regular Conv, and it achieves higher FLOPS than the other FLOPs-reducing variants
- Regular Conv has highest FLOPS but unaffordable latency/throughput
- GConv and DWConv achieve a significant reduction in FLOPs but suffer from decreased FLOPS and increased latency
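The 1/16 figure can be reproduced from the cost counts. This is a small sketch assuming a partial ratio r = c_p / c = 1/4, FLOPs of h*w*k^2*c_p^2 for PConv against h*w*k^2*c^2 for a regular Conv, and memory access approximated by feature-map reads and writes only; the layer sizes are hypothetical.

```python
from fractions import Fraction


def conv_flops(h, w, k, channels):
    """FLOPs of a k x k Conv mapping `channels` to `channels`."""
    return h * w * k * k * channels * channels


def approx_memory_access(h, w, channels):
    """Feature-map reads plus writes, ignoring the much smaller filter term."""
    return h * w * 2 * channels


# Hypothetical layer: 56x56 feature map, 3x3 kernel, 64 channels, r = 1/4.
h, w, k, c = 56, 56, 3, 64
c_p = c // 4

flops_ratio = Fraction(conv_flops(h, w, k, c_p), conv_flops(h, w, k, c))
memory_ratio = Fraction(approx_memory_access(h, w, c_p),
                        approx_memory_access(h, w, c))
print(flops_ratio, memory_ratio)  # prints: 1/16 1/4
```

The FLOPs ratio scales as r squared while the approximate memory access scales as r, independent of the feature-map size.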
PConv is effective together with PWConv
- PConv followed by PWConv is effective in approximating a regular Conv to transform feature maps
- 4 datasets created by feeding ImageNet-1k val split images into a pre-trained ResNet50 and extracting feature maps before and after the first Conv 3x3 in each of the 4 stages
- Simple network consisting of PConv followed by PWConv trained on feature map datasets with mean squared error loss
- PConv + PWConv achieved lowest test loss, meaning they better approximate a regular Conv in feature transformation
- PConv shows potential to be new go-to choice in designing fast and effective neural networks
FasterNet on ImageNet-1k classification
- Conducted experiments on ImageNet-1k classification dataset
- Trained models for 300 epochs using AdamW optimizer
- Used regularization and augmentation techniques
- Used 192x192 resolution for first 280 epochs and 224x224 for remaining 20 epochs
- Reported top-1 accuracy on validation set with center crop at 224x224 resolution
- Achieved state-of-the-art in balancing accuracy and latency/throughput
- FasterNet runs faster than various CNN, ViT and MLP models on a wide range of devices
- The largest variant, FasterNet-L, achieved 83.5% top-1 accuracy, comparable to Swin-B and ConvNeXt-B
- FasterNet is much simpler than many other models in terms of architectural design
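The two-phase training resolution described above amounts to a simple schedule. The following sketch expresses it as a function; the function and parameter names are hypothetical, and only the 300-epoch length, the 280-epoch switch point, and the 192 and 224 resolutions come from the text.

```python
def train_resolution(epoch, total_epochs=300, low_res_epochs=280):
    """Return the square input resolution for a zero-based training epoch."""
    if not 0 <= epoch < total_epochs:
        raise ValueError(f"epoch must be in [0, {total_epochs})")
    return 192 if epoch < low_res_epochs else 224


# Epochs 0..279 train at 192x192, epochs 280..299 at 224x224.
schedule = [train_resolution(e) for e in range(300)]
print(schedule.count(192), schedule.count(224))  # prints: 280 20
```

Evaluation is a separate step and, per the list above, uses a 224x224 center crop.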
FasterNet on downstream tasks
- Experiments conducted on COCO dataset for object detection and instance segmentation
- ImageNet pre-trained FasterNet used as backbone
- Mask R-CNN detector used
- FasterNet outperforms ResNet and ResNeXt in terms of latency and average precision
- FasterNet-S saves 36% compute time and yields higher box and mask AP compared to ResNet50
- FasterNet is competitive against ViT variants
Ablation study
- Partial ratio r affects accuracy, throughput, and latency
- BatchNorm (BN) is used instead of LayerNorm for faster inference
- GELU is more efficient than ReLU for FasterNet-T0/T1, but not for FasterNet-T2/S/M/L
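One reason BatchNorm helps inference speed is that, at inference time, it can be merged into an adjacent Conv layer. The sketch below shows the standard folding arithmetic for a 1x1 (pointwise) convolution followed by BN, evaluated at a single spatial position. All function names and numeric values are hypothetical, and the code illustrates the general technique rather than the paper's own implementation.

```python
import math


def pwconv(x, weight, bias):
    """1x1 Conv at one spatial position: x is [c_in], weight is [c_out][c_in]."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def batchnorm_eval(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BN using fixed running statistics per channel."""
    return [(yi - m) / math.sqrt(v + eps) * g + b
            for yi, g, b, m, v in zip(y, gamma, beta, mean, var)]


def fold_bn_into_pwconv(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (weight, bias) of a single Conv equivalent to Conv followed by BN."""
    folded_w, folded_b = [], []
    for o in range(len(weight)):
        scale = gamma[o] / math.sqrt(var[o] + eps)
        folded_w.append([wi * scale for wi in weight[o]])
        folded_b.append((bias[o] - mean[o]) * scale + beta[o])
    return folded_w, folded_b


# Hypothetical 2-in, 2-out layer and BN statistics.
weight, bias = [[1.0, 2.0], [0.5, -1.0]], [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.0, 0.3]
mean, var = [0.2, -0.1], [4.0, 0.25]
x = [3.0, -2.0]

reference = batchnorm_eval(pwconv(x, weight, bias), gamma, beta, mean, var)
fused_w, fused_b = fold_bn_into_pwconv(weight, bias, gamma, beta, mean, var)
fused = pwconv(x, fused_w, fused_b)
print(reference, fused)
```

The two results agree up to floating-point rounding, so the BN layer adds no separate work once its parameters are folded into the convolution.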
Conclusion
- Investigated issue of low FLOPS in established neural networks
- Identified bottleneck operator DWConv and cause of slowdown - frequent memory access
- Proposed PConv operator to overcome issue and achieve faster neural networks
- Introduced FasterNet built upon PConv to achieve state-of-the-art speed and accuracy trade-off
- Provided details on experimental settings, comparison plots, architectural configurations, PConv implementations, comparisons with related work, limitations, and future work
- Used ImageNet-1k pre-trained weights to initialize FasterNet backbone for object detection and instance segmentation
- Compared PConv to GConv and FasterNet to ConvNeXt
- Explored other paradigms for efficient inference
- Provided two implementations of the forward pass for PConv
- Demonstrated that PConv and FasterNet are fast and effective
- Visualized feature maps in an intermediate layer of a pre-trained ResNet50
- Showed histogram of salient position distribution for regular Conv 3x3 filters in a pre-trained ResNet18
- Compared FasterNet with state-of-the-art networks
- Explained benefit of BN and how it can be merged into adjacent Conv layers for faster inference
- Compared models with similar top-1 accuracy on ImageNet-1k benchmark
- Evaluated FasterNet on COCO object detection and instance segmentation benchmarks
- Conducted ablation on partial ratio, normalization, and activation of FasterNet