Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Convolution operation cannot handle irregular, random-masked input images
Single-scale nature of BERT pre-training is inconsistent with convnet’s hierarchical structure
Unmasked pixels are treated as sparse voxels of 3D point clouds and encoded with sparse convolution
Hierarchical decoder developed to reconstruct images from multi-scale encoded features
Sparse masKed modeling (SparK) can be used directly on any convolutional model without backbone modifications
Surpasses state-of-the-art contrastive learning and transformer-based masked modeling by similarly large margins
Improves object detection and instance segmentation by up to +3.5%
Favorable scaling behavior observed on larger models

Paper Content

Introduction

Popularized by BERT and GPT, the pretrain-finetune paradigm is effective in NLP
Masked image modeling has extended the success of BERT to vision transformers
Much of this success is credited to raising the mask ratio to a high level
Visual self-supervised learning has shifted from contrastive learning to BERT-style masked modeling
Extending the success of BERT to convnets is a wonderful, but unrealized vision
Difficulty is rooted in the gap between language and vision in terms of data processing
Removing the information of masked “words” is difficult for convnets
Single-scale algorithm cannot learn multi-scale (hierarchical) features
SparK is proposed to adapt convnets to irregular masked input without a distribution shift
SparK accurately eliminates the information of masked parts
SparK embraces the advantage of convnet’s hierarchy
SparK outperforms state-of-the-art contrastive learning and transformer-based masked modeling
SparK provides a leap in convnet’s performance across downstream tasks

Related work

Hierarchical visual processing systems

Hierarchical structure is a gold standard for visual representation systems.
Feature descriptors extract multi-scale visual representations.
Descriptors allow the system to cope with varying object sizes.
Descriptors are widely used in visual tasks.
Hierarchical modules allow information aggregation at different granularities.

Recent progress on visual self-supervised learning

Contrastive learning is a form of self-supervised learning
Efforts have been made to overcome the issue of mode collapse
Advanced methods have been developed
Masked image modeling is inspired by success of masked language modeling in NLP
Transformer with a heavier patchifier is used for masked modeling
These masked modeling methods have been verified successfully on vision transformers
Contrastive learning is still state-of-the-art for convnets

Sparse convolution for visual representation

Convolution is used in 2D computer vision
Sliding window is used on regular grids
Dense convolution quickly becomes unaffordable for 3D point clouds because the number of grid cells grows cubically
Sparse convolution is used for 3D visual tasks
Minkowski Engine is a common sparse convolution framework
Sparse convolution is used for faster 2D visual understanding
Sparse convolution is used to facilitate the adaptation of convnet to BERT masked modeling
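
To make the cost argument concrete, the sketch below counts the cells of a dense grid in 2D and in 3D, then stores only the occupied cells of a small 3D grid as a coordinate list. It is an illustration only and is not taken from the paper; the helper names `dense_cells` and `to_sparse` are hypothetical, and nothing here uses the Minkowski Engine API.

```python
import numpy as np

def dense_cells(resolution, dims):
    """Number of cells in a dense grid with `resolution` cells per axis."""
    return resolution ** dims

def to_sparse(grid):
    """Return (coords, values) for the non-zero cells of a dense grid."""
    occupied = grid != 0
    return np.argwhere(occupied), grid[occupied]

# Cell counts grow quadratically in 2D but cubically in 3D.
for r in (32, 64, 128):
    print(r, dense_cells(r, 2), dense_cells(r, 3))

# A mostly empty 3D grid: only two of its 64 cells carry a value.
grid = np.zeros((4, 4, 4))
grid[0, 1, 2] = 5.0
grid[3, 3, 3] = 2.0
coords, values = to_sparse(grid)
print(coords.tolist(), values.tolist())
```

A sparse convolution only visits the stored coordinates, so its cost follows the number of occupied cells rather than the size of the full grid.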

Approach

SparK framework pre-trains a convolutional network encoder
SparK uses a sparse masking strategy, a hierarchical encoder-decoder architecture, and an optimization target
UNet-style architecture is used to decode multi-scale sparse feature maps
Regression loss is optimized on masked patches
After pre-training, only the encoder is used for downstream tasks

Sparsely gathering unmasked patches

Patch-wise masking strategy is used in masked image modeling
Image is divided into non-overlapping square patches and masked independently
Transformer-based masked modeling can easily eliminate information by removing or replacing masked patches
Convnets slide windows over regular grids and cannot simply drop patches, so a different approach is needed
Proposed solution is to gather unmasked patches into a sparse image and use sparse convolutions
Benefits of this strategy include no information leakage, efficiency, and consistent masking effect
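
The points above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the mask keeps a fixed number of patches chosen by a random permutation, a 3x3 mean filter stands in for a learned convolution, and sparse convolution is emulated by zeroing masked positions before and after a dense filter. All function names are hypothetical.

```python
import numpy as np

def random_patch_mask(grid_h, grid_w, mask_ratio, rng):
    """Boolean (grid_h, grid_w) array; True marks a patch that is kept."""
    num_patches = grid_h * grid_w
    num_keep = round(num_patches * (1.0 - mask_ratio))
    keep = np.zeros(num_patches, dtype=bool)
    keep[rng.permutation(num_patches)[:num_keep]] = True
    return keep.reshape(grid_h, grid_w)

def upsample_mask(mask, factor):
    """Repeat each patch-level entry so the mask matches a finer grid."""
    return np.repeat(np.repeat(mask, factor, axis=0), factor, axis=1)

def box_filter3x3(x):
    """Dense 3x3 mean filter with zero padding (stand-in for a learned conv)."""
    h, w = x.shape
    padded = np.pad(x, 1)
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def masked_conv(x, keep_pixels):
    """Emulated sparse conv: zero masked inputs, filter, zero masked outputs."""
    return box_filter3x3(x * keep_pixels) * keep_pixels

rng = np.random.default_rng(0)
keep_patches = random_patch_mask(4, 4, mask_ratio=0.75, rng=rng)
keep_pixels = upsample_mask(keep_patches, factor=4)  # 16x16 pixel mask
image = np.ones((16, 16))

dense_out = box_filter3x3(image * keep_pixels)
sparse_out = masked_conv(image, keep_pixels)

# The dense filter writes non-zero values into masked pixels near kept ones,
# so the mask pattern erodes; the masked variant leaves them at zero.
print("non-zero masked pixels, dense: ", int((dense_out[~keep_pixels] != 0).sum()))
print("non-zero masked pixels, masked:", int((sparse_out[~keep_pixels] != 0).sum()))
```

Stacking more dense layers would spread the leak further into masked regions, which is why the mask has to be respected at every layer and not only at the input.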

Hierarchical encoding and decoding

Encoder generates feature maps with different resolutions
Feature maps are referred to as S1, S2, S3, and S4

Decoder follows design of UNet
Decoder contains three successive blocks with upsampling layers
Empty positions on the sparse feature maps must be filled in with mask embeddings (called “densifying”)
A projection layer is applied in case the encoder and decoder have different network widths
Four different mask embeddings and projection layers are required, one per feature map
Final output of decoder is D1
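
The decoding steps above can be sketched as follows. This is a simplified illustration under several assumptions, not the paper's implementation: feature maps are plain `(C, H, W)` arrays, the projection is a per-pixel linear map, nearest-neighbour repetition stands in for the learned upsampling blocks, the decoder uses one width at every scale, and each decoder stage adds the upsampled coarser map to the projected, densified encoder map. All names are hypothetical.

```python
import numpy as np

def densify(sparse_feat, keep, mask_embedding):
    """Fill masked positions of a (C, H, W) map with a (C,) mask embedding."""
    return np.where(keep[None, :, :], sparse_feat, mask_embedding[:, None, None])

def project(feat, weight):
    """Per-pixel linear projection; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, feat)

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling (stand-in for a learned block)."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def decode(sparse_feats, keeps, mask_embeddings, proj_weights):
    """Coarse-to-fine decoding; inputs are ordered [S1, S2, S3, S4]."""
    d = None
    stages = list(zip(sparse_feats, keeps, mask_embeddings, proj_weights))
    for s, keep, emb, w in reversed(stages):
        lateral = project(densify(s, keep, emb), w)
        d = lateral if d is None else upsample2x(d) + lateral
    return d  # plays the role of D1

rng = np.random.default_rng(0)
enc_widths, dec_width = [8, 16, 32, 64], 4
keep4 = np.array([[True, False], [False, True]])  # mask at the S4 resolution
keeps = [keep4.repeat(f, axis=0).repeat(f, axis=1) for f in (8, 4, 2, 1)]
feats = [rng.normal(size=(c,) + k.shape) * k for c, k in zip(enc_widths, keeps)]
embeddings = [rng.normal(size=c) for c in enc_widths]
weights = [rng.normal(size=(dec_width, c)) for c in enc_widths]

d1 = decode(feats, keeps, embeddings, weights)
print(d1.shape)
```

The loop performs three upsampling steps (S4 to S3, S3 to S2, S2 to S1), matching the three decoder blocks, and uses a separate embedding and projection per scale because the channel widths differ.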

Optimization target and transferring to downstream

A head module with two upsampling layers reconstructs the image from D1
Per-patch normalized pixels serve as targets, optimized with an L2 loss
Errors are calculated only on masked positions
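
A minimal NumPy sketch of this objective follows. It assumes a single-channel image, a zero-mean and unit-variance normalization inside each patch, and a small `eps` for numerical stability; these details and the function names are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W) image into (num_patches, p*p) rows, row-major by patch."""
    h, w = img.shape
    return img.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3).reshape(-1, p * p)

def normalize_per_patch(patches, eps=1e-6):
    """Give every patch zero mean and (approximately) unit variance."""
    mean = patches.mean(axis=1, keepdims=True)
    var = patches.var(axis=1, keepdims=True)
    return (patches - mean) / np.sqrt(var + eps)

def masked_l2_loss(pred, img, masked, p):
    """Mean squared error against normalized targets, over masked patches only.

    pred:   (num_patches, p*p) predicted normalized pixels
    masked: (num_patches,) boolean, True where the patch was masked
    """
    target = normalize_per_patch(patchify(img, p))
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return per_patch[masked].mean()

rng = np.random.default_rng(0)
img = rng.random((8, 8))
masked = np.array([True, False, True, False])  # four 4x4 patches

target = normalize_per_patch(patchify(img, 4))
# Correct on masked patches, deliberately wrong (all zeros) on unmasked ones.
pred = np.where(masked[:, None], target, 0.0)
loss = masked_l2_loss(pred, img, masked, 4)
print(loss)
```

Because unmasked patches are excluded from the average, the wrong values written into them do not change the loss.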

Empirical results

Implementation details

SparK can use any convolutional network as the encoder
Mask embeddings are implemented as randomly initialized learnable feature vectors
Decoding uses a lightweight UNet decoder
Pre-training uses 1.28 million unlabeled images from ImageNet-1K
Pre-training uses minimal augmentation and a LAMB optimizer
Fine-tuning uses official implementations of ResNet, MoCoV2, and ConvNeXt

ImageNet evaluation

Performance comparison between SparK and self-supervised transformers
SparK has the advantage of encoding efficiency compared to contrastive learning
Evaluated representation quality on object detection and instance segmentation on COCO
SparK is the best performer and the only one that pre-trains a convnet
SparK yields superior results compared to SimMIM
SparK exhibits highest improvements over supervised baselines
SparK performs significantly better than convolutional contrastive learning methods

Visualization

Model is able to make plausible predictions on masked regions
Model can reconstruct round shapes and capture visual signals with medium or high frequencies

[Figure panels: raw input, masked input, prediction]

Conclusion

The NLP community has seen the rise of masked modeling on transformers
Problems arise when trying to apply it to convnets
The paper examines the differences between language and image processing
Unmasked patches are treated as sparse voxels and encoded with sparse convolution
A hierarchical decoder makes full use of the convnet’s hierarchy
SparK makes masked modeling well suited for any convnet
The results give an initial demonstration of BERT-style pre-training on convnets

Pixel intensity histograms are plotted before and after masking for different masking strategies
MAE has no distribution shift thanks to the transformer’s ability to process variable-length input
Sparse masked modeling with hierarchy
Sparse convolution addresses the “mask pattern vanishing” issue
Reconstruction examples by a pre-trained ConvNeXt-Base
A ResNet-style model typically contains 4 stages, each with convolutional blocks and a downsampling module
Convolutional models with SparK pre-training are compared to transformer-based pre-training methods
Convnets may have more potential than expected
SparK is compared to state-of-the-art contrastive learning algorithms
SparK learns highly transferable features through BERT-style generative pre-training
SparK is scaled up in model size and training resolution
An ablation study measures the importance of each component in SparK
