Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • MetaFormer plays a significant role in achieving competitive performance
  • MetaFormer ensures solid lower bound of performance
  • MetaFormer works well with arbitrary token mixers
  • MetaFormer effortlessly offers state-of-the-art results
  • ConvFormer outperforms ConvNeXt
  • CAFormer sets new record on ImageNet-1K
  • StarReLU reduces 71% FLOPs of activation compared with GELU

Paper Content

Recap the concept of metaformer

  • MetaFormer is a general architecture abstracted from Transformer.
  • Input is embedded as a sequence of features.
  • MetaFormer blocks are repeated and contain normalizations, token mixers, activation functions, and learnable parameters.
  • MetaFormer is instantiated into specific models by specifying token mixers.

Techniques to improve metaformer

  • Introduces a new activation StarReLU
  • Introduces two modifications to improve MetaFormer

Starrelu

  • Vanilla Transformer uses ReLU as activation, costing 1 FLOP per unit.
  • GPT uses GELU as activation, costing 14 FLOPs per unit.
  • Squared ReLU is simpler than GELU but has worse performance.
  • StarReLU is proposed as an alternative, costing 4 FLOPs per unit and achieving better performance.

Other modifications

  • Scaling branch output can be done by multiplying layer output by a learnable vector
  • ResScale performs best according to experiments
  • Bias is disabled in MetaFormer blocks

Identityformer and randformer

  • Identity mapping does not mix tokens, but is still treated as a token mixer.
  • Global random mixing adds extra frozen parameters and computation cost.
  • IdentityFormer uses identity mapping in all four stages.
  • RandFormer uses identity mapping in the first two stages and global random mixing in the last two stages.
  • PoolFormerV2 applies the same techniques as IdentityFormer/RandFormer to PoolFormer.

Convformer and caformer

  • Utilizes basic token mixers to probe lower bound of performance and model universality
  • Specify token mixer as commonly-used operators to probe model potential
  • Choose depthwise separable convolution as token mixer
  • Adopt 4-stage framework and specify token mixer as convolutions in first two stages and attention in last two stages

Setup

  • ImageNet-1K is used to benchmark baseline models
  • ImageNet-1K contains 1.3M images of 1K classes
  • Pre-training is done on ImageNet-21K, which contains 14M images of 21841 classes
  • Experiments are run on TPUs
  • Training and fine-tuning on ImageNet-1K follows hyper-parameters of DeiT
  • Data augmentation and regularization techniques are used
  • AdamW optimizer with batch size of 4096 is used, except for CAFormer which uses LAMB optimizer
  • Models are fine-tuned at 384 2 resolution for 30 epochs with EMA

Results of models with basic token mixers

  • IdentityFormer performs well on ImageNet-1K, especially for small model sizes
  • Identity mapping does not conduct any token mixing, which explains why IdentityFormer does not perform as well as PoolFormerV2
  • PoolFormerV2 outperforms RandFormer and IdentityFormer, likely due to the local inductive bias of pooling

Results of models with commonly-used token mixers

  • We build ConvFormer and CAFormer with token mixers of separable convolutions and vanilla self-attention
  • ConvFormer and CAFormer achieve remarkable performance on ImageNet-1K
  • CAFormer sets a new record on ImageNet-1K with top-1 accuracy of 85.5%
  • When pre-trained on ImageNet-21K, ConvFormer-B36 and CAFormer-B36 have 2.2% and 1.9% accuracy improvement
  • StarReLU reduces 71% activation FLOPs compared with GELU
  • ResScale and disabling biases of each block are employed by default
  • Transformers have become popular for various tasks
  • Many research endeavors have been focused on improving attention-based token mixers
  • MLP-Mixer and FNet show competitive results when replacing attention in Transformer with spatial MLP and Fourier transform
  • MetaFormer is abstracted from Transformer and proposed hypothesis that it is MetaFormer that really plays a critical role
  • PoolFormer surpasses well-tuned ResNet/ViT/MLP-like baselines
  • We explore capacity of MetaFormer with basic or “old-fashioned” token mixers
  • IdentityFormer, RandFormer, ConvFormer, and CAFormer are built
  • StarReLU achieves better performance and reduces FLOPs
  • RSB-ResNet and MetaFormer models are compared on ImageNet-1K
  • Ablation study for ConvFormer-S18/CAFormer-S18 on ImageNet-1K is conducted