Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- MetaFormer plays a significant role in achieving competitive performance
- MetaFormer ensures a solid lower bound on performance
- MetaFormer works well with arbitrary token mixers
- MetaFormer effortlessly offers state-of-the-art results
- ConvFormer outperforms ConvNeXt
- CAFormer sets a new record on ImageNet-1K
- StarReLU reduces activation FLOPs by 71% compared with GELU
Paper Content
Recap of the MetaFormer concept
- MetaFormer is a general architecture abstracted from Transformer.
- Input is embedded as a sequence of features.
- MetaFormer blocks are repeated and contain normalizations, token mixers, activation functions, and learnable parameters.
- MetaFormer is instantiated into specific models by specifying token mixers.
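The recap above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the usual two-sub-block layout (token mixing, then a per-token channel MLP, each with a residual connection), and the names `metaformer_block`, `mean_pool_mixer`, and `relu_mlp` are hypothetical stand-ins chosen for this example.

```python
import math
from typing import Callable, List

Tokens = List[List[float]]  # shape: [num_tokens][channels]


def layer_norm(tokens: Tokens, eps: float = 1e-6) -> Tokens:
    """Normalize each token over its channels (no learnable scale or bias)."""
    out = []
    for tok in tokens:
        mean = sum(tok) / len(tok)
        var = sum((v - mean) ** 2 for v in tok) / len(tok)
        out.append([(v - mean) / math.sqrt(var + eps) for v in tok])
    return out


def add(a: Tokens, b: Tokens) -> Tokens:
    """Element-wise residual addition."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def metaformer_block(
    x: Tokens,
    token_mixer: Callable[[Tokens], Tokens],
    channel_mlp: Callable[[Tokens], Tokens],
) -> Tokens:
    """Two residual sub-blocks: token mixing, then a per-token channel MLP."""
    x = add(x, token_mixer(layer_norm(x)))
    x = add(x, channel_mlp(layer_norm(x)))
    return x


def mean_pool_mixer(tokens: Tokens) -> Tokens:
    """Hypothetical token mixer: replace every token with the token average."""
    n = len(tokens)
    avg = [sum(col) / n for col in zip(*tokens)]
    return [list(avg) for _ in tokens]


def relu_mlp(tokens: Tokens) -> Tokens:
    """Hypothetical channel MLP stand-in: a weight-free ReLU per channel."""
    return [[max(0.0, v) for v in tok] for tok in tokens]


x = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
y = metaformer_block(x, mean_pool_mixer, relu_mlp)
```

Passing a different `token_mixer` callable is the only change needed to obtain a different model, which is the sense in which MetaFormer is "instantiated" by its token mixer.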
Techniques to improve MetaFormer
- The paper introduces a new activation function, StarReLU
- It also introduces two other modifications to improve MetaFormer
StarReLU
- Vanilla Transformer uses ReLU as activation, costing 1 FLOP per unit.
- GPT uses GELU as activation, costing 14 FLOPs per unit.
- Squared ReLU is simpler than GELU but has worse performance.
- StarReLU is proposed as an alternative, costing 4 FLOPs per unit and achieving better performance.
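The FLOP counts above can be made concrete with a short sketch. It assumes the paper's definition StarReLU(x) = s * ReLU(x)^2 + b, and it derives default values for `s` and `b` under the assumption that the input follows a standard normal distribution, so that the output has zero mean and unit variance. The constant names are illustrative. In a trained model the scale and bias would be learnable scalars rather than fixed.

```python
import math

# Assuming x ~ N(0, 1): E[ReLU(x)^2] = 0.5 and Var[ReLU(x)^2] = 1.25.
MEAN_SQ_RELU = 0.5
VAR_SQ_RELU = 1.25
S = 1.0 / math.sqrt(VAR_SQ_RELU)  # about 0.8944
B = -MEAN_SQ_RELU * S             # about -0.4472


def relu(x: float) -> float:
    return max(0.0, x)


def star_relu(x: float, s: float = S, b: float = B) -> float:
    """s * ReLU(x)^2 + b: one max, one square, one multiply, one add (4 FLOPs)."""
    return s * relu(x) ** 2 + b


# Per-unit activation cost as listed in the bullets above.
FLOPS = {"ReLU": 1, "GELU": 14, "StarReLU": 4}
flops_saving = 1 - FLOPS["StarReLU"] / FLOPS["GELU"]  # 10/14, about 71%
```

The last line reproduces the 71% figure quoted in the abstract: replacing a 14-FLOP activation with a 4-FLOP one removes 10 of every 14 activation FLOPs.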
Other modifications
- Branch outputs can be scaled by multiplying the layer output by a learnable vector (LayerScale)
- ResScale, which scales the residual (shortcut) branch instead, performs best in the paper's experiments
- Biases are disabled in MetaFormer blocks
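The difference between scaling the layer output and scaling the residual branch can be shown side by side. This is a minimal sketch on a single token vector; `branch` stands for the normalized sub-layer F(Norm(x)), `lam` for the learnable per-channel vector, and the function names are chosen for this example only.

```python
from typing import Callable, List

Vector = List[float]


def layer_scale(x: Vector, branch: Callable[[Vector], Vector], lam: Vector) -> Vector:
    """Scale the layer output: x + lam * F(x)."""
    return [xi + li * fi for xi, li, fi in zip(x, lam, branch(x))]


def res_scale(x: Vector, branch: Callable[[Vector], Vector], lam: Vector) -> Vector:
    """Scale the residual (shortcut) path instead: lam * x + F(x)."""
    return [li * xi + fi for xi, li, fi in zip(x, lam, branch(x))]


def double(v: Vector) -> Vector:
    """Toy stand-in for a sub-layer."""
    return [2.0 * t for t in v]


x = [1.0, -2.0]
lam = [0.5, 0.5]
out_layer_scale = layer_scale(x, double, lam)
out_res_scale = res_scale(x, double, lam)
```

Both variants add the same number of parameters (one scalar per channel); they differ only in which path the vector multiplies.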
IdentityFormer and RandFormer
- Identity mapping does not mix tokens, but is still treated as a token mixer.
- Global random mixing adds extra frozen parameters and computation cost.
- IdentityFormer uses identity mapping in all four stages.
- RandFormer uses identity mapping in the first two stages and global random mixing in the last two stages.
- PoolFormerV2 applies the same techniques as IdentityFormer/RandFormer to PoolFormer.
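The two basic token mixers can be sketched as follows. This is an illustration under stated assumptions: the random mixer here builds its frozen matrix by applying a row-wise softmax to uniform random values, which may differ in detail from the paper's construction, and all function names are hypothetical.

```python
import math
import random
from typing import List

Tokens = List[List[float]]  # shape: [num_tokens][channels]


def identity_mixer(tokens: Tokens) -> Tokens:
    """Identity mapping: no information moves between tokens."""
    return [list(tok) for tok in tokens]


def make_random_mixing_matrix(num_tokens: int, seed: int = 0) -> List[List[float]]:
    """Frozen N x N matrix; each row is a softmax over random values."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_tokens):
        exps = [math.exp(rng.random()) for _ in range(num_tokens)]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def random_mixer(tokens: Tokens, weights: List[List[float]]) -> Tokens:
    """Each output token is a fixed random weighted average of all tokens."""
    channels = len(tokens[0])
    return [
        [sum(w * tok[c] for w, tok in zip(row, tokens)) for c in range(channels)]
        for row in weights
    ]


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
weights = make_random_mixing_matrix(len(tokens))  # 4 * 4 = 16 frozen values
mixed = random_mixer(tokens, weights)
```

The matrix has N x N entries for N tokens and is never updated during training, which is where the extra frozen parameters and computation cost mentioned above come from.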
ConvFormer and CAFormer
- IdentityFormer and RandFormer use basic token mixers to probe the lower bound of performance and the universality of the architecture
- ConvFormer and CAFormer instead specify the token mixer as commonly used operators to probe the architecture's potential
- ConvFormer uses depthwise separable convolutions as the token mixer
- CAFormer adopts the 4-stage framework, with convolutions as the token mixer in the first two stages and attention in the last two stages
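The per-stage token mixer choices described in this section and the previous one can be summarized as a small lookup table. This is a reading aid rather than configuration taken from the authors' code; the dictionary name and the mixer labels are hypothetical.

```python
# Token mixer used in each of the four stages, per model family.
STAGE_TOKEN_MIXERS = {
    "IdentityFormer": ("identity", "identity", "identity", "identity"),
    "RandFormer": ("identity", "identity", "random", "random"),
    "PoolFormerV2": ("pooling", "pooling", "pooling", "pooling"),
    "ConvFormer": ("sepconv", "sepconv", "sepconv", "sepconv"),
    "CAFormer": ("sepconv", "sepconv", "attention", "attention"),
}


def uses_attention(model: str) -> bool:
    """True if any stage of the model uses self-attention as its token mixer."""
    return "attention" in STAGE_TOKEN_MIXERS[model]
```

Read this way, CAFormer is the only one of the five that uses attention, and only in its last two stages.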
Setup
- ImageNet-1K is used to benchmark baseline models
- ImageNet-1K contains 1.3M images of 1K classes
- Pre-training is done on ImageNet-21K, which contains 14M images of 21841 classes
- Experiments are run on TPUs
- Training and fine-tuning on ImageNet-1K follows hyper-parameters of DeiT
- Data augmentation and regularization techniques are used
- The AdamW optimizer with a batch size of 4096 is used, except for CAFormer, which uses the LAMB optimizer
- Models are fine-tuned at 384×384 resolution for 30 epochs with EMA
Results of models with basic token mixers
- IdentityFormer performs well on ImageNet-1K, especially for small model sizes
- Identity mapping does not conduct any token mixing, which explains why IdentityFormer does not perform as well as PoolFormerV2
- PoolFormerV2 outperforms RandFormer and IdentityFormer, likely due to the local inductive bias of pooling
Results of models with commonly-used token mixers
- We build ConvFormer and CAFormer with token mixers of separable convolutions and vanilla self-attention
- ConvFormer and CAFormer achieve remarkable performance on ImageNet-1K
- CAFormer sets a new record on ImageNet-1K with top-1 accuracy of 85.5%
- When pre-trained on ImageNet-21K, ConvFormer-B36 and CAFormer-B36 improve accuracy by 2.2% and 1.9%, respectively
- StarReLU reduces activation FLOPs by 71% compared with GELU
- ResScale and disabling biases of each block are employed by default
- Transformers have become popular for various tasks
- Many research endeavors have been focused on improving attention-based token mixers
- MLP-Mixer and FNet show competitive results when replacing attention in Transformer with spatial MLP and Fourier transform
- MetaFormer was abstracted from Transformer, together with the hypothesis that it is MetaFormer that really plays the critical role
- PoolFormer surpasses well-tuned ResNet/ViT/MLP-like baselines
- We explore the capacity of MetaFormer with basic or “old-fashioned” token mixers
- IdentityFormer, RandFormer, ConvFormer, and CAFormer are built
- StarReLU achieves better performance and reduces FLOPs
- RSB-ResNet and MetaFormer models are compared on ImageNet-1K
- Ablation study for ConvFormer-S18/CAFormer-S18 on ImageNet-1K is conducted