Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Convolution operation cannot handle irregular, random-masked input images
  • Single-scale nature of BERT pre-training is inconsistent with convnet’s hierarchical structure
  • Unmasked pixels are treated as sparse voxels of 3D point clouds and encoded with sparse convolution
  • Hierarchical decoder developed to reconstruct images from multi-scale encoded features
  • Sparse masKed modeling (SparK) can be used directly on any convolutional model without backbone modifications
  • Surpasses state-of-the-art contrastive learning and transformer-based masked modeling by similarly large margins
  • Improves object detection and instance segmentation by up to +3.5%
  • Favorable scaling behavior observed on larger models

Paper Content

Introduction

  • Popularized by BERT and GPT, the pretrain-finetune paradigm is effective in NLP
  • Masked image modeling has extended the success of BERT to vision transformers
  • A high mask ratio is credited as a key factor in this success
  • Visual self-supervised learning has shifted from contrastive learning to BERT-style masked modeling
  • Extending the success of BERT to convnets is a wonderful but unrealized vision
  • Difficulty is rooted in the gap between language and vision in terms of data processing
  • Removing the information of masked “words” is difficult for convnets
  • Single-scale algorithm cannot learn multi-scale (hierarchical) features
  • SparK is proposed to adapt convnets to irregular masked input without a distribution shift
  • SparK accurately eliminates the information of masked parts
  • SparK embraces the advantage of convnet’s hierarchy
  • SparK outperforms state-of-the-art contrastive learning and transformer-based masked modeling
  • SparK provides a leap in convnet’s performance across downstream tasks

Hierarchical visual processing systems

  • Hierarchical structure is a gold standard for visual representation systems
  • Feature descriptors extract multi-scale visual representations
  • Descriptors allow the system to cope with varying object sizes
  • Descriptors are widely used in visual tasks
  • Hierarchical modules allow information aggregation at different granularities

Recent progress on visual self-supervised learning

  • Contrastive learning is a form of self-supervised learning
  • Efforts have been made to overcome the issue of mode collapse
  • Advanced methods have been developed
  • Masked image modeling is inspired by success of masked language modeling in NLP
  • Transformer with a heavier patchifier is used for masked modeling
  • Masked modeling has been verified successfully on vision transformers
  • Contrastive learning is still state-of-the-art for convnets

Sparse convolution for visual representation

  • Convolution is used in 2D computer vision
  • Sliding window is used on regular grids
  • Dense convolution quickly becomes unaffordable for 3D point clouds because the number of grid cells grows cubically with resolution
  • Sparse convolution is used for 3D visual tasks
  • Minkowski Engine is a common sparse convolution framework
  • Sparse convolution is used for faster 2D visual understanding
  • Sparse convolution is used to facilitate the adaptation of convnet to BERT masked modeling

Approach

  • SparK framework pre-trains a convolutional network encoder
  • SparK uses a sparse masking strategy, a hierarchical encoder-decoder architecture, and an optimization target
  • UNet-style architecture is used to decode multi-scale sparse feature maps
  • Regression loss is optimized on masked patches
  • After pre-training, only the encoder is used for downstream tasks

Sparsely gathering unmasked patches

  • Patch-wise masking strategy is used in masked image modeling
  • Image is divided into non-overlapping square patches and masked independently
  • Transformer-based masked modeling can easily eliminate information by removing or replacing masked patches
  • Convnets cannot do this, so new approaches are needed
  • Proposed solution is to gather unmasked patches into a sparse image and use sparse convolutions
  • Benefits of this strategy include no information leakage, efficiency, and consistent masking effect
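
The masking and gathering steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the 0.75 mask ratio, and the tiny 8x8 image are all assumptions, and the sparse convolution is only emulated by zeroing masked positions before and after a dense 3x3 convolution rather than by calling a real sparse-convolution library.

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio, seed=0):
    """Return a grid of booleans; True marks a patch that is kept (unmasked)."""
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_keep = round(num_patches * (1.0 - mask_ratio))
    kept = set(rng.sample(range(num_patches), num_keep))
    return [[(r * grid_w + c) in kept for c in range(grid_w)] for r in range(grid_h)]


def upsample_mask(mask, factor):
    """Repeat every mask cell factor x factor times (patch grid -> pixel grid)."""
    return [[cell for cell in row for _ in range(factor)]
            for row in mask for _ in range(factor)]


def apply_mask(image, mask):
    """Zero out every position whose mask entry is False."""
    return [[v if keep else 0.0 for v, keep in zip(row, mrow)]
            for row, mrow in zip(image, mask)]


def conv3x3_dense(image, kernel):
    """Plain 3x3 convolution with zero padding on a 2D list."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy + 1][dx + 1] * image[yy][xx]
            out[y][x] = acc
    return out


def conv3x3_masked(image, kernel, mask):
    """Emulated sparse convolution: outputs exist only at unmasked positions."""
    return apply_mask(conv3x3_dense(apply_mask(image, mask), kernel), mask)


def count_nonzero(feature_map):
    return sum(v != 0.0 for row in feature_map for v in row)


# Illustrative setup: 4x4 patch grid, 2x2-pixel patches, 75% of patches masked.
patch_mask = random_patch_mask(4, 4, mask_ratio=0.75)
pixel_mask = upsample_mask(patch_mask, 2)
image = [[1.0] * 8 for _ in range(8)]
kernel = [[1.0 / 9.0] * 3 for _ in range(3)]

dense_out = conv3x3_dense(apply_mask(image, pixel_mask), kernel)
sparse_out = conv3x3_masked(image, kernel, pixel_mask)
print("non-zero outputs, dense conv: ", count_nonzero(dense_out))
print("non-zero outputs, masked conv:", count_nonzero(sparse_out))
```

With the dense convolution, non-zero values spread into masked positions, so the mask pattern erodes layer by layer. With the masked variant, the set of active positions stays identical to the set of unmasked pixels.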

Hierarchical encoding and decoding

  • Encoder generates feature maps with different resolutions
  • Feature maps are referred to as S1, S2, S3, and S4
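
As a worked example of the multi-scale feature maps, the sketch below computes the spatial size of S1 to S4 and how many times one patch-level mask cell must be repeated to cover each scale. The strides of 4, 8, 16, and 32 are an assumption typical of ResNet-style four-stage encoders, and the 224x224 input and patch size of 32 are illustrative values, not figures taken from this summary.

```python
def feature_map_sizes(height, width, strides=(4, 8, 16, 32)):
    """Spatial size of each encoder feature map S1..S4 for the assumed strides."""
    sizes = {}
    for i, stride in enumerate(strides, start=1):
        sizes[f"S{i}"] = (height // stride, width // stride)
    return sizes


def mask_repeat_factors(patch_size, strides=(4, 8, 16, 32)):
    """How many cells per side one masked patch covers on each feature map."""
    return {f"S{i}": patch_size // stride for i, stride in enumerate(strides, start=1)}


sizes = feature_map_sizes(224, 224)
factors = mask_repeat_factors(patch_size=32)
for name in sizes:
    print(name, sizes[name], "mask cell repeated", factors[name], "times per side")
```

Under these assumptions the patch-level mask matches S4 exactly and is upsampled by a factor of 2, 4, and 8 for S3, S2, and S1, which keeps the masked regions aligned across all scales.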

  • Decoder follows design of UNet
  • Decoder contains three successive blocks with upsampling layers
  • Empty positions on the sparse feature maps must be filled in with mask embeddings before decoding (called “densifying”)
  • A projection layer is applied in case the encoder and decoder have different network widths
  • Four different mask embeddings and projection layers are required, one for each feature-map scale
  • Final output of decoder is D1
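
The densifying and projection steps can be illustrated as follows. This is a minimal sketch under stated assumptions: feature maps are nested lists of per-position channel vectors, the projection is a bias-free per-position linear map (equivalent to a 1x1 convolution), projection is applied after densifying, and the names `densify` and `project` as well as all numeric values are hypothetical.

```python
def densify(features, active, mask_embedding):
    """Fill every inactive (masked) position with a copy of the mask embedding."""
    return [
        [list(vec) if is_active else list(mask_embedding)
         for vec, is_active in zip(feature_row, active_row)]
        for feature_row, active_row in zip(features, active)
    ]


def project(features, weight):
    """Per-position linear projection; weight has shape out_channels x in_channels."""
    return [
        [[sum(w * v for w, v in zip(weight_row, vec)) for weight_row in weight]
         for vec in feature_row]
        for feature_row in features
    ]


# A 2x2 sparse feature map with 2 channels; only the diagonal is active.
sparse_map = [[[1.0, 2.0], [0.0, 0.0]],
              [[0.0, 0.0], [3.0, 4.0]]]
active = [[True, False],
          [False, True]]
mask_embedding = [0.5, -0.5]   # learnable in practice, fixed here for illustration

dense_map = densify(sparse_map, active, mask_embedding)

# Project from encoder width 2 to an assumed decoder width of 3.
weight = [[1.0, 0.0],
          [0.0, 1.0],
          [1.0, 1.0]]
decoder_input = project(dense_map, weight)
print(decoder_input)
```

In a full model there would be one mask embedding and one projection per scale, with the embedding width matching that scale's encoder channels.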

Optimization target and transferring to downstream

  • A head module with two upsampling layers is needed to reconstruct the image from D1
  • Per-patch normalized pixels are used as targets with an L2 loss
  • Calculate errors only on masked positions
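
The optimization target can be sketched as a mean squared error over per-patch normalized pixels, computed only on masked patches. The function names, the use of population variance, and the `eps` value are assumptions made for this illustration; patches are represented as flat lists of pixel values.

```python
import math


def normalize_patch(pixels, eps=1e-6):
    """Normalize one patch to zero mean and (approximately) unit variance."""
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return [(p - mean) / math.sqrt(var + eps) for p in pixels]


def masked_l2_loss(pred_patches, target_patches, masked):
    """Mean squared error against normalized targets, averaged over masked patches."""
    total, count = 0.0, 0
    for pred, target, is_masked in zip(pred_patches, target_patches, masked):
        if not is_masked:
            continue  # unmasked patches do not contribute to the loss
        norm_target = normalize_patch(target)
        total += sum((p - t) ** 2 for p, t in zip(pred, norm_target)) / len(pred)
        count += 1
    return total / count if count else 0.0


targets = [[0.0, 2.0, 4.0, 6.0],   # masked patch
           [1.0, 1.0, 1.0, 1.0]]   # visible patch, ignored by the loss
masked = [True, False]

perfect = [normalize_patch(targets[0]), [9.0, 9.0, 9.0, 9.0]]
zeros = [[0.0] * 4, [9.0] * 4]

loss_perfect = masked_l2_loss(perfect, targets, masked)
loss_zeros = masked_l2_loss(zeros, targets, masked)
print(loss_perfect, loss_zeros)
```

The prediction for the visible patch is deliberately wrong in both cases and has no effect on the loss, which mirrors the rule that errors are computed only on masked positions.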

Empirical results

Implementation details

  • SparK can use any convolutional network as the encoder
  • Mask embeddings are implemented as random-initialized learnable feature vectors
  • Decoding uses a lightweight UNet decoder
  • Pre-training uses 1.28 million unlabeled images from ImageNet-1K
  • Pre-training uses minimal augmentation and a LAMB optimizer
  • Fine-tuning uses official implementations of ResNet, MoCoV2, and ConvNeXt

ImageNet evaluation

  • Performance comparison between SparK and self-supervised transformers
  • SparK has the advantage of encoding efficiency compared to contrastive learning
  • Evaluated representation quality on object detection and instance segmentation on COCO
  • SparK is the best performer and the only one that pre-trains a convnet
  • SparK yields superior results compared to SimMIM
  • SparK exhibits highest improvements over supervised baselines
  • SparK performs significantly better than convolutional contrastive learning methods

Visualization

  • Model is able to make plausible predictions on masked regions
  • Model can reconstruct round shapes and capture visual signals with medium or high frequencies

[Figure panels: raw input, masked input, prediction]

Conclusion

  • The NLP community has seen the rise of masked modeling on transformers
  • Problems arise when trying to apply it to convnets
  • The work examines the differences between language and image processing
  • Unmasked patches are treated as sparse voxels and encoded with sparse convolution
  • A hierarchical decoder is employed to make full use of the convnet’s hierarchy
  • SparK makes masked modeling well suited to any convnet
  • The promise of BERT-style pre-training on convnets has been initially demonstrated

Figure and table captions

  • Pixel intensity histograms are plotted before and after masking for different masking strategies
  • MAE has no distribution shift thanks to the transformer’s ability to process variable-length input
  • Sparse masked modeling with hierarchy
  • Sparse convolution is used to address the “mask pattern vanishing” issue
  • Reconstruction examples by a pre-trained ConvNeXt-Base
  • A ResNet-style model typically contains 4 stages, each with convolutional blocks and a downsampling module
  • Convolutional models with SparK pre-training are compared to transformer-based pre-training methods
  • Convnets may have more potential than expected
  • SparK is compared to state-of-the-art contrastive learning algorithms
  • SparK learns highly transferable features through BERT-style generative pre-training
  • SparK is scaled up with model size and training resolution
  • Ablation study on the importance of each component in SparK