Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Visual recognition has improved rapidly in the early 2020s
- ConvNets (ConvNeXt) have demonstrated strong performance
- Self-supervised learning techniques (MAE) can potentially benefit ConvNets
- A new model family (ConvNeXt V2) has been proposed to improve performance on various recognition benchmarks
- Pre-trained ConvNeXt V2 models of various sizes are available
Paper Content
Introduction
- Visual recognition has ushered in a new era of large-scale visual representation learning
- Three main factors influence performance of a visual representation learning system: neural network architecture, training method, and data used for training
- Convolutional neural network architectures (ConvNets) have had a significant impact on computer vision research
- Transformer architecture has gained popularity due to its strong scaling behavior
- ConvNeXt architecture has modernized traditional ConvNets
- Focus of visual representation learning has shifted from supervised learning with labels to self-supervised pre-training with pretext objectives
- Masked autoencoders (MAE) have recently brought success in masked language modeling to the vision domain
- Combining design elements of architectures and self-supervised learning frameworks can present challenges
- Proposed to co-design the network architecture and the masked autoencoder under the same framework
- Introduced ConvNeXt V2 which demonstrates improved performance when used in conjunction with masked autoencoders
- ConvNeXt V2 models can be used in a variety of compute regimes and includes models of varying complexity
Related work
- ConvNets were first introduced in the 1980s and have been improved over the years
- Supervised training on the ImageNet dataset has been used to discover innovations
- Self-supervised pre-text tasks such as rotation prediction and colorization have been used
- ConvNeXt has excelled in scenarios requiring lower complexity
- Masked autoencoders are a self-supervised learning strategy for visual recognition
Fully convolutional masked autoencoder
- Approach is conceptually simple and runs in a fully convolutional manner
- Model predicts missing parts of raw input visuals given the remaining context
- Masking ratio of 0.6 used
- Minimal data augmentation used
- ConvNeXt model used as encoder
- Sparse convolution used to prevent model from learning shortcuts
- Lightweight ConvNeXt block used as decoder
- Mean squared error used to compute reconstruction target
- Fully Convolutional Masked AutoEncoder (FCMAE) created
- Evaluated effectiveness of FCMAE using ConvNeXt-Base model
- Pre-trained and fine-tuned using ImageNet-1K dataset
- Compared self-supervised approach to supervised learning
Global response normalization
- Introduction of Global Response Normalization (GRN) technique to make FCMAE pretraining more effective in conjunction with the ConvNeXt architecture
- Qualitative and quantitative feature analyses reveal “feature collapse” phenomenon
- Feature cosine distance analysis shows FCMAE pretrained ConvNeXt model exhibits severe feature collapse behavior
- GRN unit consists of three steps: global feature aggregation, feature normalization, and feature calibration
- GRN incorporated into ConvNeXt block, creating various models with varying efficiency and capacity
- GRN effectively mitigates feature collapse issue and improves fine-tuning performance
- Comparison of GRN with other normalization layers and feature gating methods
- GRN important for both pre-training and fine-tuning
Imagenet experiments
- FCMAE pre-training framework and ConvNeXt V2 architecture co-designed to make masked-based self-supervised pre-training successful
- FCMAE framework and model architecture improvement (GRN layer) studied empirically
- Model scaling behavior demonstrated with improved performance over supervised baseline
- FCMAE outperforms Swin transformer pre-trained with Sim-MIM
- ImageNet-22K intermediate fine-tuning sets new state-of-the-art accuracy
- Results using publicly available data only (ImageNet-1K and ImageNet-22K)
Transfer learning experiments
- Evaluated impact of co-design (ConvNeXt V1 + supervised vs. ConvNeXt V2 + FC-MAE)
- Compared approach with Swin transformer models pre-trained with SimMIM
- Gradual improvement as proposals applied (V1 to V2)
- Final proposal (ConvNeXt V2 pre-trained on FCMAE) outperforms Swin transformer counterparts across all model sizes
Conclusion
- Introduce a new ConvNet model family called ConvNeXt V2
- Designed to be more suitable for self-supervised learning
- Improve performance of pure ConvNets across various downstream tasks
- Use fully convolutional masked autoencoder pre-training
- Use AdamW base learning rate 8e-4, weight decay 0.05, optimizer momentum β1, β2=0.9, 0.999
- Layer-wise lr decay [3,16] 0.9, batch size 1024
- Learning rate schedule cosine decay, warmup epochs 40, training epochs 300
- Augmentation RandAug (9, 0.5) [19], label smoothing [62] 0.1
- Mixup [80] 0.8, cutmix [79] 1.0, drop path [40] 0.2, head init [52] 0.001, ema 0.9999
- Input resolution of 512×512, multi-scale test using resolutions [0.75,0.875,1.0,1.125,1.25] of 512×2048
- Model weights initialized after supervised fine-tuning on ImageNet-1K
- Consistent and significant improvement across all models
- Class selectivity index analysis on FC-MAE pre-trained weights for ConvNeXt V1 and V2
- Feature normalization not preceded by global aggregation
- Masking ratio of 0.5 to 0.7 produces best results
- Compare performance of contrastive learning and masked image modeling
- FCMAE leads to better representation quality than MoCo V3
- Global response normalization (GRN) layer added after dimension-expansion MLP layer
- GRN does effective and efficient feature re-weighting without parameter overhead
- GRN should be used in both pre-training and fine-tuning stages
- ConvNeXt V2 Huge model sets new state-of-the-art accuracy of 88.9%