Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Visual recognition has improved rapidly in the early 2020s
  • ConvNets (ConvNeXt) have demonstrated strong performance
  • Self-supervised learning techniques (MAE) can potentially benefit ConvNets
  • A new model family (ConvNeXt V2) has been proposed to improve performance on various recognition benchmarks
  • Pre-trained ConvNeXt V2 models of various sizes are available

Paper Content

Introduction

  • Visual recognition has ushered in a new era of large-scale visual representation learning
  • Three main factors influence performance of a visual representation learning system: neural network architecture, training method, and data used for training
  • Convolutional neural network architectures (ConvNets) have had a significant impact on computer vision research
  • Transformer architecture has gained popularity due to its strong scaling behavior
  • ConvNeXt architecture has modernized traditional ConvNets
  • Focus of visual representation learning has shifted from supervised learning with labels to self-supervised pre-training with pretext objectives
  • Masked autoencoders (MAE) have recently brought the success of masked language modeling to the vision domain
  • Combining design elements of architectures and self-supervised learning frameworks can present challenges
  • Proposed to co-design the network architecture and the masked autoencoder under the same framework
  • Introduced ConvNeXt V2 which demonstrates improved performance when used in conjunction with masked autoencoders
  • ConvNeXt V2 models cover a variety of compute regimes and include models of varying complexity
  • ConvNets were first introduced in the 1980s and have been improved over the years
  • Supervised training on the ImageNet dataset has been used to discover innovations
  • Self-supervised pretext tasks such as rotation prediction and colorization have been used
  • ConvNeXt has excelled in scenarios requiring lower complexity
  • Masked autoencoders are a self-supervised learning strategy for visual recognition

Fully convolutional masked autoencoder

  • Approach is conceptually simple and runs in a fully convolutional manner
  • Model predicts missing parts of raw input visuals given the remaining context
  • Masking ratio of 0.6 used
  • Minimal data augmentation used
  • ConvNeXt model used as encoder
  • Sparse convolution used to prevent model from learning shortcuts
  • Lightweight ConvNeXt block used as decoder
  • Mean squared error (MSE) between the reconstructed and target images used as the loss, applied only to masked patches
  • Fully Convolutional Masked AutoEncoder (FCMAE) created
  • Evaluated effectiveness of FCMAE using ConvNeXt-Base model
  • Pre-trained and fine-tuned using ImageNet-1K dataset
  • Compared self-supervised approach to supervised learning
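The masking and reconstruction loss described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `random_patch_mask` and `masked_mse` are hypothetical, images are represented as plain lists of flattened patches, and the loss is assumed (as in MAE) to be averaged over masked patches only.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.6, seed=None):
    """Return a list of 0/1 flags, one per patch (1 = masked, hidden from the encoder)."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    return [1 if i in masked_ids else 0 for i in range(num_patches)]


def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only.

    pred, target: lists of patches, each patch a flat list of pixel values.
    mask: list of 0/1 flags as produced by random_patch_mask.
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:  # visible patches do not contribute to the loss
            total += sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Assumed example: a 7x7 grid of patches with the 0.6 masking ratio from the text.
    mask = random_patch_mask(49, mask_ratio=0.6, seed=0)
    print(sum(mask), "of", len(mask), "patches masked")
```

In a real pipeline the mask would also be used to hide the masked regions from the encoder (the text mentions sparse convolution for this); the sketch covers only mask sampling and the loss.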

Global response normalization

  • Introduction of Global Response Normalization (GRN) technique to make FCMAE pretraining more effective in conjunction with the ConvNeXt architecture
  • Qualitative and quantitative feature analyses reveal “feature collapse” phenomenon
  • Feature cosine distance analysis shows FCMAE pretrained ConvNeXt model exhibits severe feature collapse behavior
  • GRN unit consists of three steps: global feature aggregation, feature normalization, and feature calibration
  • GRN incorporated into ConvNeXt block, creating various models with varying efficiency and capacity
  • GRN effectively mitigates feature collapse issue and improves fine-tuning performance
  • Comparison of GRN with other normalization layers and feature gating methods
  • GRN important for both pre-training and fine-tuning
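The three GRN steps listed above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the reference implementation: it assumes aggregation is a per-channel L2 norm over spatial positions, normalization divides by the mean of those norms across channels, and calibration rescales the input with learnable `gamma` and `beta` plus a residual connection. The function name `grn` and the nested-list layout `[H][W][C]` are choices made for this example.

```python
import math


def grn(x, gamma, beta, eps=1e-6):
    """Apply a GRN-style transform to a feature map.

    x: nested list of shape [H][W][C]; gamma, beta: lists of length C.
    """
    height, width, channels = len(x), len(x[0]), len(x[0][0])

    # Step 1: global feature aggregation (L2 norm of each channel over H x W).
    gx = [
        math.sqrt(sum(x[i][j][c] ** 2 for i in range(height) for j in range(width)))
        for c in range(channels)
    ]

    # Step 2: feature normalization (each channel's norm relative to the channel mean).
    mean_gx = sum(gx) / channels
    nx = [g / (mean_gx + eps) for g in gx]

    # Step 3: feature calibration with a residual connection.
    return [
        [
            [gamma[c] * x[i][j][c] * nx[c] + beta[c] + x[i][j][c] for c in range(channels)]
            for j in range(width)
        ]
        for i in range(height)
    ]


if __name__ == "__main__":
    x = [[[3.0, 0.0], [4.0, 0.0]]]  # H=1, W=2, C=2
    print(grn(x, gamma=[1.0, 1.0], beta=[0.0, 0.0]))
```

With `gamma` and `beta` set to zero the function returns its input unchanged, so under these assumptions the layer starts as an identity mapping and only learns to re-weight channels during training.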

ImageNet experiments

  • FCMAE pre-training framework and ConvNeXt V2 architecture co-designed to make mask-based self-supervised pre-training successful
  • FCMAE framework and model architecture improvement (GRN layer) studied empirically
  • Model scaling behavior demonstrated with improved performance over supervised baseline
  • FCMAE outperforms Swin transformer pre-trained with SimMIM
  • ImageNet-22K intermediate fine-tuning sets new state-of-the-art accuracy
  • Results using publicly available data only (ImageNet-1K and ImageNet-22K)

Transfer learning experiments

  • Evaluated impact of co-design (ConvNeXt V1 + supervised vs. ConvNeXt V2 + FCMAE)
  • Compared approach with Swin transformer models pre-trained with SimMIM
  • Gradual improvement as proposals applied (V1 to V2)
  • Final proposal (ConvNeXt V2 pre-trained with FCMAE) outperforms Swin transformer counterparts across all model sizes

Conclusion

  • Introduce a new ConvNet model family called ConvNeXt V2
  • Designed to be more suitable for self-supervised learning
  • Improve performance of pure ConvNets across various downstream tasks
  • Use fully convolutional masked autoencoder pre-training
  • Use AdamW optimizer with base learning rate 8e-4, weight decay 0.05, optimizer momentum β1 = 0.9, β2 = 0.999
  • Layer-wise lr decay [3,16] 0.9, batch size 1024
  • Learning rate schedule cosine decay, warmup epochs 40, training epochs 300
  • Augmentation RandAug (9, 0.5) [19], label smoothing [62] 0.1
  • Mixup [80] 0.8, cutmix [79] 1.0, drop path [40] 0.2, head init [52] 0.001, ema 0.9999
  • Input resolution of 512×512, multi-scale test using resolutions [0.75,0.875,1.0,1.125,1.25] of 512×2048
  • Model weights initialized after supervised fine-tuning on ImageNet-1K
  • Consistent and significant improvement across all models
  • Class selectivity index analysis on FCMAE pre-trained weights for ConvNeXt V1 and V2
  • Feature normalization not preceded by global aggregation
  • Masking ratio of 0.5 to 0.7 produces best results
  • Compare performance of contrastive learning and masked image modeling
  • FCMAE leads to better representation quality than MoCo V3
  • Global response normalization (GRN) layer added after dimension-expansion MLP layer
  • GRN does effective and efficient feature re-weighting without parameter overhead
  • GRN should be used in both pre-training and fine-tuning stages
  • ConvNeXt V2 Huge model sets new state-of-the-art ImageNet accuracy of 88.9% using only public training data
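The learning-rate settings listed above (base learning rate 8e-4, 40 warmup epochs, 300 training epochs, cosine decay, layer-wise decay 0.9) can be sketched as follows. This is a simplified illustration, not the authors' training code: the function names are hypothetical, warmup is assumed to be linear from zero, the schedule is assumed to decay to `min_lr = 0.0`, and the layer-wise scaling assumes the last layer keeps the full learning rate while each earlier layer is multiplied by a further factor of 0.9.

```python
import math


def lr_at_epoch(epoch, base_lr=8e-4, warmup_epochs=40, total_epochs=300, min_lr=0.0):
    """Learning rate at a given epoch: linear warmup followed by cosine decay."""
    if epoch < warmup_epochs:
        # Assumed linear warmup from 0 up to base_lr.
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def layerwise_lr_scales(num_layers, decay=0.9):
    """Per-layer multipliers: the last layer gets 1.0, earlier layers get smaller rates."""
    return [decay ** (num_layers - 1 - i) for i in range(num_layers)]


if __name__ == "__main__":
    for epoch in (0, 20, 40, 170, 300):
        print(epoch, lr_at_epoch(epoch))
    print(layerwise_lr_scales(4))
```

The effective learning rate of a layer at a given epoch would then be `lr_at_epoch(epoch)` multiplied by that layer's scale. How layers are grouped in practice (for example, per block or per stage) is not specified in the text and is left out of this sketch.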