Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Self-supervision and natural language supervision are two ways to train general-purpose image encoders.
M3AE and SLIP have suggested that these approaches can be combined.
Results from these approaches use small pre-training datasets (<50M samples).
This paper investigates whether a similar approach is effective with larger datasets.
Combining two state-of-the-art approaches (MAE and CLIP) provides a benefit over CLIP when trained on 11.3M image-text pairs.
The combination provides little to no benefit over CLIP when trained on 1.4B images.

Paper Content

Introduction

Large scale pretraining is a powerful tool in computer vision
Labels are often unreliable for datasets of this size
Two general classes of methods to train general purpose image encoders have emerged: self-supervised and natural language supervised
Recent work has combined both forms of supervision
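The combination has not yet shown a clear improvement over the state of the art
This paper studies the performance of self-supervised and language-supervised methods in two regimes: low and high sample size
Combining the two improves results in the low sample size regime, but not in the high sample size regime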

Related work

This work combines natural language supervision and self-supervision in a multi-task approach to visual encoding
Natural language supervision for visual encoding uses contrastive pairwise alignment signal applied to large batch and dataset sizes
Hard negative pairs produced by random sampling, avoiding need for memory bank or momentum distillation
Image captioning used as pre-training task
Self-supervised approaches for visual encoding target consistency constraints between multiple views of the same scene or attempt visual reconstruction on corrupted or conditioning representations of images
Consistency constraints derived through data-augmentation applied to a single real image
Denoising autoencoder (DAE) popular as a means of self-supervision
Masked patch prediction problem applied to joint image-text data space
Multi-task methods for pre-training visual encoders highly active area of research
Masked language modeling (MLM) and image-text-matching (ITM) used
Masking used to speed up training by dropping the masked tokens during the forward and backward pass

Contrastive language-image pre-training

CLIP and ALIGN showed that contrastive image-language supervision can produce an image encoder that works well for downstream tasks.
A pairwise InfoNCE loss is applied to a large global batch of paired image-text embeddings.
Three pooling strategies are investigated: MAP, GAP, and MAX.
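
The pairwise InfoNCE loss described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the temperature is fixed at 0.07 as an assumption (CLIP treats it as a learned parameter), and the loss is averaged over the image-to-text and text-to-image directions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    image_emb[i] and text_emb[i] form the positive pair; every other
    combination in the batch acts as a negative. The temperature value
    is an assumed constant.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: row i should select column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should select row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)
```

With correctly paired embeddings the loss approaches zero, and it grows when the pairing is shuffled, which is the alignment signal the contrastive objective relies on.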

Masked autoencoders

MAE is a self-supervised image-encoder pre-training technique
ViT encoder-decoder architecture is used
Input to encoder consists of visible, unmasked image patches
High patch masking ratio is important for good performance
Decoder consumes output of encoder and learned ‘masked-patch’ embedding
Output of decoder is measured against ground truth using mean-squared-error loss
BART applies similar strategy to text data with lower masking ratio
M3AE extends MAE and BART to incorporate inputs from both text and image modalities
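
The masking and reconstruction steps above can be sketched as follows. This is an illustration under stated assumptions: the 0.75 masking ratio is the default from the original MAE paper rather than a value given in these notes, the loss is computed on masked patches only (as in the original MAE), and the function names are hypothetical.

```python
import random


def random_mask(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into visible and masked sets for one image.

    mask_ratio=0.75 is an assumed default; the notes only say the ratio is high.
    """
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(round(num_patches * mask_ratio))
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def masked_mse(pred, target, masked_idx):
    """Mean squared error between predicted and true pixels of masked patches."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count
```

Only the `visible` indices would be passed to the encoder; the decoder fills the `masked` positions with the learned masked-patch embedding before predicting pixels.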

MAE-CLIP

MAE-CLIP attempts to incorporate aspects of MAE/M3AE into CLIP
Model architecture consists of three components: image encoder, text encoder, and cross-modality decoder
Image encoder divides input image into patches, applies linear projection and adds 2-D position encoding
Text encoder is based on BERT, tokenizes and embeds text, adds 1-D trainable position encoding
Decoder receives encoded image and text representations from encoders for masked and un-masked elements
Final loss is a weighted sum of losses from each task
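
The weighted sum of task losses can be written out explicitly. A minimal sketch: the image and text weights (0.05 and 0.1) are the values quoted in the training configuration notes (section C), while the contrastive weight of 1.0 and all names are assumptions made for illustration.

```python
def combined_loss(contrastive, image_recon, text_recon,
                  w_contrastive=1.0, w_image=0.05, w_text=0.1):
    """Weighted sum of the contrastive and the two generative (masked) losses.

    w_contrastive=1.0 is an assumption; w_image and w_text follow the notes.
    """
    return (w_contrastive * contrastive
            + w_image * image_recon
            + w_text * text_recon)
```

For example, `combined_loss(2.0, 10.0, 5.0)` weights the reconstruction terms down to 0.5 each, so the contrastive term dominates the total.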

Experiments

MAE, M3AE, CLIP and MAE-CLIP were studied to see how self-supervision affects representation quality of natural language supervised models
MAE-CLIP combines mask based self-supervision with contrastive language-image learning

Experimental setup

Combined a 2.2B-example web-crawled image-text pair dataset with other datasets, yielding 1.4B examples after de-duplication
Increased learning rate warm-up to 1,000 steps
Computed local contrastive loss for first 10,000 steps
Trained for 480,000 steps (6 full passes through dataset)
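
As a quick sanity check on the figures above, the step count and number of passes imply a global batch size. The batch size is not stated in these notes, so the result is only a derived estimate that assumes exactly six passes over exactly 1.4B examples; the function name is illustrative.

```python
def implied_batch_size(steps, dataset_size, passes):
    """Examples per step needed to cover `passes` epochs in `steps` updates."""
    return passes * dataset_size / steps


print(implied_batch_size(480_000, 1_400_000_000, 6))  # 17500.0
```

If the "6 full passes" figure is rounded, the true batch size may differ from this estimate.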

Image classification

MAE-CLIP typically shows worse performance than CLIP on ImageNet.
Hypothesized that this may be due to model capacity being lost to predicting irrelevant patches.
GAP provides consistent improvement over MAX on VTAB.
M3AE is a strong performer on VTAB.
M3AE does better on structured tasks, MAE-CLIP in between.

VQA

Finetuning strategy used with results in Table 12b
No consistent difference between CLIP and MAE-CLIP
Possible exception of CLEVR where M3AE is best performing
GAP is marginally stronger than MAX

Pooling analysis

MAE-CLIP performs better than CLIP
Standard multi-head attention pooling (MAP) is compared to global average pooling (GAP) and max pooling (MAX)
GAP and MAX pooling perform better than MAP pooling across ImageNet, VTAB and VQA tasks
Qualitative analysis of perceptual grouping in visual representation is studied in Section 6.1
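
The two non-parametric pooling operators compared above reduce a sequence of token embeddings to a single image embedding. A minimal sketch with hypothetical function names; MAP is omitted because it needs learned attention parameters.

```python
def gap_pool(tokens):
    """Global average pooling: mean of each feature across all tokens."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def max_pool(tokens):
    """Max pooling: largest value of each feature across all tokens."""
    return [max(column) for column in zip(*tokens)]
```

For two tokens `[[1.0, 4.0], [3.0, 0.0]]`, `gap_pool` gives `[2.0, 2.0]` and `max_pool` gives `[3.0, 4.0]`: averaging blends all patches, while max pooling lets a single strongly activated patch determine each feature.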

Experiments at scale

Self-supervision combined with natural language supervision can improve the quality of visual representations.
Choice of image-encoder pooling strategy can have a big impact on results.
This section investigates whether these conclusions hold when training on a larger dataset.
M3AE and the GAP and MAX variants of CLIP and MAE-CLIP are trained at scale.
MAE is not trained at scale due to resource limitations and its lower performance compared with models that use natural language supervision.

Analysis and discussion

Self-supervision fails at scale when combined with natural language supervision.
Two hypotheses presented to explore in future work.

Visual grounding

Visual grounding measures how well a representation can localize objects in an image
Representations with better visual grounding will excel at general purpose tasks
Pooling operator has a larger effect on performance than MAE
MAE and pooling operator have an effect on visual grounding of CLIP image and text encoders
Max pooling produces better results than other pooling operators

Dataset diversity

Self-supervision and natural language supervision excel for different parts of the dataset diversity-size spectrum
Strong self-supervised visual-encoder baselines excel on ImageNet
Specialized self-supervised methods excel on less diverse datasets such as CIFAR-10
MAEs have shown near state-of-the-art performance on AudioSet
Self-supervised methods achieve 10% worse performance than natural language supervised methods at similar data scale
Natural language supervised models show competitive performance on massive and diverse datasets

A supplementary overview

Used two internal datasets and several public datasets to build a “large-scale” pre-training dataset
High Quality Image Text Pairs dataset consists of 134M diverse and high quality images with descriptive captions and titles
English Web Image Text dataset consists of 2.2B images with one or more related pieces of text
Public datasets include Conceptual 12M, CC3M, and LAION-400M
Final training dataset of 1.4B image-text pairs after global image-byte-level de-duplication

B.2 VTAB evaluation data

Trained linear classifier on predicted visual features of VTAB datasets
Diabetic Retinopathy dataset not included due to licensing concerns
Sun397 not included due to missing image at time of preparing datasets for VTAB benchmark

C Training configuration

Used AdamW optimizer and linear learning rate warmup over 200 steps
Multiplied generative image loss by 0.05 and generative text loss by 0.1 when combining them with the contrastive loss
Used fixed 2D position encoding for image encoder
Used pre-layer-norm and initialization scheme from [70]
10,000 local contrastive loss steps and 1,000 warmup steps for cosine learning rate scheduler for web-crawled dataset
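
The warmup and cosine schedule mentioned above can be sketched as a single function. This is an assumption-laden illustration: the peak learning rate is a hypothetical parameter (no value is given in these notes), and the cosine phase is assumed to decay to zero.

```python
import math


def learning_rate(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then cosine decay towards zero.

    peak_lr is a placeholder; decay-to-zero is an assumption.
    """
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions, the small-scale runs would use `warmup_steps=200` and the web-crawled runs `warmup_steps=1000` with `total_steps=480000`.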

D VQA finetuning

Evaluated on three VQA benchmark datasets
Treated as a classification problem
Image and text embeddings are concatenated and positional encoding is added
BOS token is used as an output token and projected to the possible classes
For ADE20K, several prompts are associated with each of the 150 categories

F MAE-CLIP masking strategy

Random masking and similarity masking perform similarly on MAE-CLIP MAX
Random masking slightly improves classification tasks
Similarity masking slightly improves VQA and semantic segmentation tasks

G M3AE results

Updated version of tables from Section 5 in main paper
M3AE baseline had not fully converged at time of submission
M3AE performs on par with MAE-CLIP GAP on VTAB tasks
M3AE performs worse on VTAB natural tasks, better on VTAB structured tasks
MAE-CLIP improves classification performance of CLIP in small scale regime
MAE-CLIP performs better than self-supervised or language supervised methods
Self-supervision does not complement natural language supervision in large scale regime
Finetuning pipeline uses 133 label names from https://github.com/cocodataset/panopticapi
