Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Self supervision and natural language supervision are two ways to train general purpose image encoders.
- M3AE and SLIP have suggested that these approaches can be combined.
- Results from these approaches use small pre-training datasets (<50M samples).
- Investigating whether a similar approach can be effective with larger datasets.
- Combination of two state of the art approaches (MAE and CLIP) provides benefit when trained on 11.3M image-text pairs.
- Little to no benefit over CLIP when trained on 1.4B images.
Paper Content
Introduction
- Large scale pretraining is a powerful tool in computer vision
- Labels are often unreliable for datasets of this size
- Two general classes of methods to train general purpose image encoders have emerged: self-supervised and natural language supervised
- Recent work has combined both forms of supervision
- No clear improvement in state of the art
- Study of performance of self-supervised and language-supervised methods in two regimes
- Improvement in low sample size regime, but not in high sample size regime
Related work
- Combines natural language supervision and self-supervision in a multi-task approach to visual encoding
- Natural language supervision for visual encoding uses contrastive pairwise alignment signal applied to large batch and dataset sizes
- Hard negative pairs produced by random sampling, avoiding need for memory bank or momentum distillation
- Image captioning used as pre-training task
- Self-supervised approaches for visual encoding target consistency constraints between multiple views of the same scene or attempt visual reconstruction on corrupted or conditioning representations of images
- Consistency constraints derived through data-augmentation applied to a single real image
- Denoising autoencoder (DAE) popular as a means of self-supervision
- Masked patch prediction problem applied to joint image-text data space
- Multi-task methods for pre-training visual encoders highly active area of research
- Masked language modeling (MLM) and image-text-matching (ITM) used
- Masking used to speed up training by dropping the masked tokens during the forward and backward pass
Contrastive language-image pre-training
- CLIP and ALIGN showed that contrastive image-language supervision can produce an image encoder that works well for downstream tasks.
- A pairwise InfoNCE loss is applied to a large global batch of paired image-text embeddings.
- Three pooling strategies are investigated: MAP, GAP, and MAX.
Masked autoencoders
- MAE is a self-supervised image-encoder pre-training technique
- ViT encoder-decoder architecture is used
- Input to encoder consists of visible, unmasked image patches
- High patch masking ratio is important for good performance
- Decoder consumes output of encoder and learned ‘masked-patch’ embedding
- Output of decoder is measured against ground truth using mean-squared-error loss
- BART applies similar strategy to text data with lower masking ratio
- M3AE extends MAE and BART to incorporate inputs from both text and image modalities
Mae-clip
- MAE-CLIP attempts to incorporate aspects of MAE/M3AE into CLIP
- Model architecture consists of three components: image encoder, text encoder, and cross-modality decoder
- Image encoder divides input image into patches, applies linear projection and adds 2-D position encoding
- Text encoder is based on BERT, tokenizes and embeds text, adds 1-D trainable position encoding
- Decoder receives encoded image and text representations from encoders for masked and un-masked elements
- Final loss is a weighted sum of losses from each task
Experiments
- MAE, M3AE, CLIP and MAE-CLIP were studied to see how self-supervision affects representation quality of natural language supervised models
- MAE-CLIP combines mask based self-supervision with contrastive language-image learning
Experimental setup
- Combined 2.2B example webcrawled image-text pair dataset with other datasets to create 1.4B examples
- Increased learning rate warm-up to 1,000 steps
- Computed local contrastive loss for first 10,000 steps
- Trained for 480,000 steps (6 full passes through dataset)
Image classification
- MAE-CLIP typically shows worse performance than CLIP on ImageNet.
- Hypothesized that this may be due to model capacity being lost to predicting irrelevant patches.
- GAP provides consistent improvement over MAX on VTAB.
- M3AE is a strong performer on VTAB.
- M3AE does better on structured tasks, MAE-CLIP in between.
Vqa
- Finetuning strategy used with results in Table 12b
- No consistent difference between CLIP and MAE-CLIP
- Possible exception of CLEVR where M3AE is best performing
- GAP is marginally stronger than MAX
Pooling analysis
- MAE-CLIP performs better than CLIP
- Standard multi-head attention pooling (MAP) is compared to global average pooling (GAP) and max pooling (MAX)
- GAP and MAX pooling perform better than MAP pooling across ImageNet, VTAB and VQA tasks
- Qualitative analysis of perceptual grouping in visual representation is studied in Section 6.1
Experiments at scale
- Self-supervision combined with natural language supervision can improve the quality of visual representations.
- Choice of image-encoder pooling strategy can have a big impact on results.
- Investigating whether these conclusions hold when training on a larger dataset.
- Training M3AE, GAP and MAX variants of CLIP and MAE-CLIP.
- Not training MAE at scale due to resource limitations and lower performance than models with natural language supervision.
Analysis and discussion
- Self-supervision fails at scale when combined with natural language supervision.
- Two hypotheses presented to explore in future work.
Visual grounding
- Visual grounding measures how well a representation can localize objects in an image
- Representations with better visual grounding will excel at general purpose tasks
- Pooling operator has a larger effect on performance than MAE
- MAE and pooling operator have an effect on visual grounding of CLIP image and text encoders
- Max pooling produces better results than other pooling operators
Dataset diversity
- Self-supervision and natural language supervision excel for different parts of the dataset diversity-size spectrum
- Strong self-supervised visual-encoder baselines excel on ImageNet
- Specialized self-supervised methods excel on less diverse datasets such as Cifar-10
- MAEs have shown near state-of-the-art performance on Audioset
- Self-supervised methods achieve 10% worse performance than natural language supervised methods at similar data scale
- Natural language supervised models show competitive performance on massive and diverse datasets
A supplementary overview
- Used two internal datasets and several public datasets to build a “large-scale” pre-training dataset
- High Quality Image Text Pairs dataset consists of 134M diverse and high quality images with descriptive captions and titles
- English Web Image Text dataset consists of 2.2B images with one or more related pieces of text
- Public datasets include Conceptual 12M, CC3M, and LAION-400M
- Final training dataset of 1.4B image-text pairs after global image-bytelevel de-duplication
B.2 vtab evaluation data
- Trained linear classifier on predicted visual features of VTAB datasets
- Diabetic Retinopathy dataset not included due to licensing concerns
- Sun397 not included due to missing image at time of preparing datasets for VTAB benchmark
C training configuration
- Used AdamW optimizer and linear learning rate warmup over 200 steps
- Multiplied generative image loss by 0.05 and generative text loss by 0.1 when computing global contrastive loss
- Used fixed 2D position encoding for image encoder
- Used pre-layer-norm and initialization scheme from [70]
- 10,000 local contrastive loss steps and 1,000 warmup steps for cosine learning rate scheduler for web-crawled dataset
D vqa finetuning
- Evaluated on three VQA benchmark datasets
- Treated as a classification problem
- Image and text embeddings are concatenated and positional encoding is added
- BOS token is used as an output token and projected to the possible classes
- For ADE20K, several prompts are associated with each of the 150 categories
F mae-clip masking strategy
- Random masking and similarity masking perform similarly on MAE-CLIP MAX
- Random masking slightly improves classification tasks
- Similarity masking slightly improves VQA and semantic segmentation tasks
G m3ae results
- Updated version of tables from Section 5 in main paper
- M3AE baseline had not fully converged at time of submission
- M3AE performs on par with MAE-CLIP GAP on VTAB tasks
- M3AE performs worse on VTAB natural tasks, better on VTAB structured tasks
- MAE-CLIP improves classification performance of CLIP in small scale regime
- MAE-CLIP performs better than self-supervised or language supervised methods
- Self-supervision does not complement natural language supervision in large scale regime
- Finetuning pipeline uses 133 label names from https://github.com/cocodataset/panopticapi