Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Convolution operation cannot handle irregular, random-masked input images
- Single-scale nature of BERT pre-training is inconsistent with convnet’s hierarchical structure
- Sparse convolution used to encode unmasked pixels as sparse voxels of 3D point clouds
- Hierarchical decoder developed to reconstruct images from multi-scale encoded features
- Sparse masKed modeling (SparK) can be used directly on any convolutional model without backbone modifications
- Surpasses state-of-the-art contrastive learning and transformer-based masked modeling by similarly large margins
- Improves object detection and instance segmentation up to +3.5%
- Favorable scaling behavior observed on larger models
Paper Content
Introduction
- Popularized by BERT and GPT, the pretrain-finetune paradigm is effective in NLP
- Masked image modeling has extended the success of BERT to vision transformers
- Increasing the mask ratio to a high level is credited with success
- Visual self-supervised learning has shifted from contrastive learning to BERT-style masked modeling
- Extending the success of BERT to convnets is a wonderful, but unrealized vision
- Difficulty is rooted in the gap between language and vision in terms of data processing
- Removing the information of masked “words” is difficult for convnets
- Single-scale algorithm cannot learn multi-scale (hierarchical) features
- SparK is proposed to adapt convnets to irregular masked input without a distribution shift
- SparK accurately eliminates the information of masked parts
- SparK embraces the advantage of convnet’s hierarchy
- SparK outperforms state-of-the-art contrastive learning and transformer-based masked modeling
- SparK provides a leap in convnet’s performance across downstream tasks
Related work
Hierarchical visual processing systems
- Hierarchical structure is a gold standard for visual representation systems.
- Feature descriptors extract multi-scale visual representations.
- Descriptors allow the system to cope with varying object sizes.
- Descriptors are widely used in visual tasks.
- Hierarchical modules allow information aggregation at different granularities.
Recent progress on visual self-supervised learning
- Contrastive learning is a form of self-supervised learning
- Efforts have been made to overcome the issue of mode collapse
- Advanced methods have been developed
- Masked image modeling is inspired by success of masked language modeling in NLP
- Transformer with a heavier patchifier is used for masked modeling
- Vision transformers have been successfully verified
- Contrastive learning is still state-of-the-art for convnets
Sparse convolution for visual representation
- Convolution is used in 2D computer vision
- Sliding window is used on regular grids
- Convolution quickly becomes unaffordable for 3D point clouds due to the cubic increasing number of grids
- Sparse convolution is used for 3D visual tasks
- Minkowski Engine is a common sparse convolution framework
- Sparse convolution is used for faster 2D visual understanding
- Sparse convolution is used to facilitate the adaptation of convnet to BERT masked modeling
Approach
- SparK framework pre-trains a convolutional network encoder
- SparK uses a sparse masking strategy, a hierarchical encoder-decoder architecture, and an optimization target
- UNet-style architecture is used to decode multi-scale sparse feature maps
- Regression loss is optimized on masked patches
- After pre-training, only the encoder is used for downstream tasks
Sparsely gathering unmasked patches
- Patch-wise masking strategy is used in masked image modeling
- Image is divided into non-overlapping square patches and masked independently
- Transformer-based masked modeling can easily eliminate information by removing or replacing masked patches
- Convnets cannot do this, so new approaches are needed
- Proposed solution is to gather unmasked patches into a sparse image and use sparse convolutions
- Benefits of this strategy include no information leakage, efficiency, and consistent masking effect
Hierarchical encoding and decoding
- Encoder generates feature maps with different resolutions
- Feature maps are referred to as S1, S2, S3, and S4
Raw mask
- Decoder follows design of UNet
- Decoder contains three successive blocks with upsampling layers
- Necessary to fill in empty positions on sparse feature maps (called “densifying”)
- Projection layer applied in case encoder and decoder have different network widths
- Four different mask embeddings and projection layers required
- Final output of decoder is D1
Optimization target and transferring to downstream
- Need head module with two upsampling layers to reconstruct image from D1
- Use per-patch normalized pixels as targets with L2-loss
- Calculate errors only on masked positions
Empirical results
Implementation details
- SparK can use any convolutional network as the encoder
- Mask embeddings are implemented as random-initialized learnable feature vectors
- Decoding uses a lightweight UNet decoder
- Pre-training uses 1.28 million unlabeled images from ImageNet-1K
- Pre-training uses minimal augmentation and a LAMB optimizer
- Fine-tuning uses official implementations of ResNet, MoCoV2, and ConvNeXt
Imagenet evaluation
- Performance comparison between SparK and self-supervised transformers
- SparK has the advantage of encoding efficiency compared to contrastive learning
- Evaluated representation quality on object detection and instance segmentation on COCO
- SparK is the best performer and the only one that pre-trains a convnet
- SparK yields superior results compared to SimMIM
- SparK exhibits highest improvements over supervised baselines
- SparK performs significantly better than convolutional contrastive learning methods
Visualization
- Model is able to make plausible predictions on masked regions
- Model can reconstruct round shapes and capture visual signals with medium or high frequencies
Raw input
Masked input
- Prediction is a process of forecasting future events.
- Input is the data used to make a prediction.
Conclusion
- NLP community has seen rise of masked modeling on transformers
- Problems arise when trying to apply this to convnets
- Look at differences between language and image processing
- Treat unmasked patches as sparse voxels and use sparse convolution to encode them
- Employ hierarchical decoder to make full use of convnet’s hierarchy
- SparK makes masked modeling well suited for any convnet
- BERT-style pre-training on convnets has been initially shown
- Masking strategies have pixel intensity histograms plotted before and after masking
- MAE has no distribution shift thanks to transformer’s ability to process variable-length input
- Sparse masked modeling with hierarchy
- Use sparse convolution to address “mask pattern vanishing” issue
- Reconstruction examples by a pre-trained ConvNeXt-Base
- ResNet-style model typically contains 4 stages with convolutional blocks and downsampling module
- Compare convolutional models with SparK pre-training to transformer-based pre-training methods
- Convnets may have more potential than expected
- SparK compared to state-of-the-art contrastive learning algorithms
- SparK learns highly transferable features through BERT-style generative pre-training
- Scale up SparK with model size and training resolution
- Ablation study on importance of each component in SparK