Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Visual recognition has improved rapidly in the early 2020s
Modern ConvNets such as ConvNeXt have demonstrated strong performance
Self-supervised learning techniques such as masked autoencoders (MAE) can potentially benefit ConvNets
A new model family (ConvNeXt V2) has been proposed to improve performance on various recognition benchmarks
Pre-trained ConvNeXt V2 models of various sizes are available

Visual recognition has ushered in a new era of large-scale visual representation learning
Three main factors influence performance of a visual representation learning system: neural network architecture, training method, and data used for training
Convolutional neural network architectures (ConvNets) have had a significant impact on computer vision research
Transformer architecture has gained popularity due to its strong scaling behavior
ConvNeXt architecture has modernized traditional ConvNets
Focus of visual representation learning has shifted from supervised learning with labels to self-supervised pre-training with pretext objectives
Masked autoencoders (MAE) have recently brought the success of masked language modeling to the vision domain
Combining design elements of architectures and self-supervised learning frameworks can present challenges
The authors propose co-designing the network architecture and the masked autoencoder under the same framework
ConvNeXt V2 is introduced and demonstrates improved performance when used in conjunction with masked autoencoders
The ConvNeXt V2 family includes models of varying complexity and can be used in a variety of compute regimes

ConvNets were first introduced in the 1980s and have been improved over the years
Supervised training on the ImageNet dataset has been used to discover innovations
Self-supervised pretext tasks such as rotation prediction and colorization have been used
ConvNeXt has excelled in scenarios requiring lower complexity
Masked autoencoders are a self-supervised learning strategy for visual recognition
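
The core mechanic of a masked autoencoder is to hide a random subset of image patches and train the network to reconstruct them. The following is a minimal sketch of that random masking step, not the authors' implementation. The function name `random_patch_mask`, the 224-pixel image size, and the 32-pixel patch size are assumptions chosen for illustration; the default ratio of 0.6 sits inside the 0.5 to 0.7 range these notes report as working best.

```python
import random


def random_patch_mask(image_size=224, patch_size=32, mask_ratio=0.6, seed=None):
    """Return a 2D grid of booleans where True marks a masked (hidden) patch.

    image_size and patch_size are illustrative assumptions, not values
    taken from these notes.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size          # patches per side
    num_patches = grid * grid
    num_masked = round(num_patches * mask_ratio)

    # Draw the masked patch indices uniformly at random, without replacement.
    rng = random.Random(seed)
    masked = set(rng.sample(range(num_patches), num_masked))

    return [[(r * grid + c) in masked for c in range(grid)] for r in range(grid)]


if __name__ == "__main__":
    # Print the mask as a small grid: '#' is hidden, '.' is visible.
    for row in random_patch_mask(seed=0):
        print("".join("#" if m else "." for m in row))
```

During pre-training, only the visible patches carry image content into the encoder, and the reconstruction loss is typically computed on the hidden ones.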

Global Response Normalization (GRN) is introduced to make fully convolutional masked autoencoder (FCMAE) pre-training more effective in conjunction with the ConvNeXt architecture
Qualitative and quantitative feature analyses reveal a “feature collapse” phenomenon
Feature cosine distance analysis shows that the FCMAE pre-trained ConvNeXt model exhibits severe feature collapse behavior
GRN unit consists of three steps: global feature aggregation, feature normalization, and feature calibration
GRN incorporated into ConvNeXt block, creating various models with varying efficiency and capacity
GRN effectively mitigates feature collapse issue and improves fine-tuning performance
Comparison of GRN with other normalization layers and feature gating methods
GRN important for both pre-training and fine-tuning
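
The three GRN steps named above (global feature aggregation, feature normalization, feature calibration) can be sketched in a few lines. This is an illustrative sketch in plain Python lists rather than tensors, so each step stays explicit. It assumes L2-norm aggregation per channel, division by the mean channel norm, learnable per-channel `gamma` and `beta`, and a residual connection; the function name `grn` and the data layout are choices made here for illustration.

```python
import math


def grn(x, gamma, beta, eps=1e-6):
    """Global Response Normalization over a single feature map (sketch).

    x:     list of spatial positions, each a list of C channel values.
    gamma: per-channel scale (length C), assumed learnable.
    beta:  per-channel bias (length C), assumed learnable.
    """
    num_channels = len(x[0])

    # 1. Global feature aggregation: L2 norm of each channel over all positions.
    gx = [math.sqrt(sum(pos[c] ** 2 for pos in x)) for c in range(num_channels)]

    # 2. Feature normalization: each channel norm relative to the mean norm.
    mean_gx = sum(gx) / num_channels
    nx = [g / (mean_gx + eps) for g in gx]

    # 3. Feature calibration: rescale the input, then add bias and a residual.
    return [
        [gamma[c] * pos[c] * nx[c] + beta[c] + pos[c] for c in range(num_channels)]
        for pos in x
    ]


if __name__ == "__main__":
    features = [[3.0, 0.0], [4.0, 0.0]]   # two positions, two channels
    print(grn(features, gamma=[1.0, 1.0], beta=[0.0, 0.0]))
```

With `gamma` and `beta` set to zero, the function returns its input unchanged because of the residual term, so under these assumptions the layer can start as an identity mapping and learn how strongly to re-weight channels.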

FCMAE pre-training framework and ConvNeXt V2 architecture co-designed to make mask-based self-supervised pre-training successful
FCMAE framework and model architecture improvement (GRN layer) studied empirically
Model scaling behavior demonstrated with improved performance over supervised baseline
ConvNeXt V2 pre-trained with FCMAE outperforms Swin Transformer pre-trained with SimMIM
ImageNet-22K intermediate fine-tuning sets new state-of-the-art accuracy
Results using publicly available data only (ImageNet-1K and ImageNet-22K)

Evaluated impact of co-design (ConvNeXt V1 + supervised vs. ConvNeXt V2 + FCMAE)
Compared approach with Swin Transformer models pre-trained with SimMIM
Gradual improvement as proposals applied (V1 to V2)
Final proposal (ConvNeXt V2 pre-trained with FCMAE) outperforms Swin Transformer counterparts across all model sizes

Introduce a new ConvNet model family called ConvNeXt V2
Designed to be more suitable for self-supervised learning
Improve performance of pure ConvNets across various downstream tasks
Use fully convolutional masked autoencoder pre-training
Optimizer: AdamW with base learning rate 8e-4, weight decay 0.05, and momentum β1 = 0.9, β2 = 0.999
Layer-wise learning rate decay [3, 16]: 0.9; batch size: 1024
Learning rate schedule: cosine decay with 40 warmup epochs and 300 training epochs
Augmentation: RandAug (9, 0.5) [19]; label smoothing [62]: 0.1
Mixup [80]: 0.8; CutMix [79]: 1.0; drop path [40]: 0.2; head init [52]: 0.001; EMA: 0.9999
Input resolution of 512×512; multi-scale testing at scale factors [0.75, 0.875, 1.0, 1.125, 1.25] of 512×2048
Model weights initialized after supervised fine-tuning on ImageNet-1K
Consistent and significant improvement across all models
Class selectivity index analysis on FCMAE pre-trained weights for ConvNeXt V1 and V2
Feature normalization not preceded by global aggregation
Masking ratio of 0.5 to 0.7 produces best results
Compare performance of contrastive learning and masked image modeling
FCMAE leads to better representation quality than MoCo V3
Global response normalization (GRN) layer added after dimension-expansion MLP layer
GRN performs effective and efficient feature re-weighting without extra parameter overhead
GRN should be used in both pre-training and fine-tuning stages
The ConvNeXt V2 Huge model sets a new state-of-the-art ImageNet top-1 accuracy of 88.9% using only publicly available training data
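
The learning-rate settings listed earlier in this section (base learning rate 8e-4, cosine decay, 40 warmup epochs, 300 training epochs, layer-wise decay 0.9) interact in a way that is easier to see in code. The sketch below is one plausible reading of those settings, not the authors' training script. It assumes linear warmup, a minimum learning rate of zero, per-epoch rather than per-step updates, and a layer-wise scheme in which the last layer group keeps the full rate; the function names are hypothetical.

```python
import math

BASE_LR = 8e-4        # from the notes
WARMUP_EPOCHS = 40    # from the notes
TOTAL_EPOCHS = 300    # from the notes
LAYER_DECAY = 0.9     # from the notes


def lr_at_epoch(epoch, base_lr=BASE_LR, warmup=WARMUP_EPOCHS,
                total=TOTAL_EPOCHS, min_lr=0.0):
    """Linear warmup followed by cosine decay.

    Linear warmup, min_lr=0.0, and per-epoch granularity are assumptions.
    """
    if epoch < warmup:
        return base_lr * epoch / warmup
    progress = (epoch - warmup) / (total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def layer_lr_scales(num_layers, decay=LAYER_DECAY):
    """Per-layer-group multipliers: the last group gets 1.0, and each
    earlier group is scaled down by one more factor of `decay`."""
    return [decay ** (num_layers - 1 - i) for i in range(num_layers)]


if __name__ == "__main__":
    for epoch in (0, 20, 40, 170, 300):
        print(epoch, lr_at_epoch(epoch))
    print(layer_lr_scales(4))
```

The effective rate for a layer group is the scheduled rate multiplied by that group's scale, so earlier layers, which hold more generic pre-trained features, move more slowly during fine-tuning than layers near the head.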