Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Self-supervision and natural language supervision are two ways to train general-purpose image encoders.
  • M3AE and SLIP have suggested that these approaches can be combined.
  • Those results were obtained with small pre-training datasets (<50M samples).
  • This work investigates whether a similar approach remains effective with larger datasets.
  • Combining two state-of-the-art approaches (MAE and CLIP) provides a benefit when trained on 11.3M image-text pairs.
  • Little to no benefit over CLIP when trained on 1.4B images.

Paper Content

Introduction

  • Large scale pretraining is a powerful tool in computer vision
  • Labels are often unreliable for datasets of this size
  • Two general classes of methods to train general purpose image encoders have emerged: self-supervised and natural language supervised
  • Recent work has combined both forms of supervision
  • No clear improvement in state of the art
  • Study of performance of self-supervised and language-supervised methods in two regimes
  • Improvement in low sample size regime, but not in high sample size regime
  • Combines natural language supervision and self-supervision in a multi-task approach to visual encoding
  • Natural language supervision for visual encoding uses contrastive pairwise alignment signal applied to large batch and dataset sizes
  • Hard negative pairs produced by random sampling, avoiding need for memory bank or momentum distillation
  • Image captioning used as pre-training task
  • Self-supervised approaches for visual encoding target consistency constraints between multiple views of the same scene or attempt visual reconstruction on corrupted or conditioning representations of images
  • Consistency constraints derived through data-augmentation applied to a single real image
  • Denoising autoencoder (DAE) popular as a means of self-supervision
  • Masked patch prediction problem applied to joint image-text data space
  • Multi-task methods for pre-training visual encoders highly active area of research
  • Masked language modeling (MLM) and image-text-matching (ITM) used
  • Masking used to speed up training by dropping the masked tokens during the forward and backward pass

Contrastive language-image pre-training

  • CLIP and ALIGN showed that contrastive image-language supervision can produce an image encoder that works well for downstream tasks.
  • A pairwise InfoNCE loss is applied to a large global batch of paired image-text embeddings.
  • Three pooling strategies are investigated: MAP, GAP, and MAX.
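
As an illustration of the contrastive objective summarized above, here is a minimal sketch of a symmetric pairwise InfoNCE loss over a batch of paired embeddings. It is written in plain Python so that it stays self-contained. The temperature of 0.07, the L2 normalization, and the averaging of the image-to-text and text-to-image directions follow common CLIP-style practice and are assumptions here, not details confirmed by this summary. The helper names (`info_nce`, `l2_normalize`, `log_softmax`) are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of image_emb is paired with row i of text_emb."""
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    n = len(img)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image -> text: the correct text for image i sits in column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: same computation on the transposed logits.
    logits_t = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch: matched pairs give a low loss, mismatched pairs a high one.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(aligned, aligned)
loss_swapped = info_nce(aligned, swapped)
```

Every other item in the batch acts as a negative for a given pair, which is why this style of loss benefits from the large global batch mentioned above.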

Masked autoencoders

  • MAE is a self-supervised image-encoder pre-training technique
  • ViT encoder-decoder architecture is used
  • Input to encoder consists of visible, unmasked image patches
  • High patch masking ratio is important for good performance
  • Decoder consumes output of encoder and learned ‘masked-patch’ embedding
  • Output of decoder is measured against ground truth using mean-squared-error loss
  • BART applies similar strategy to text data with lower masking ratio
  • M3AE extends MAE and BART to incorporate inputs from both text and image modalities
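
The masking and reconstruction steps listed above can be sketched roughly as follows. This is a toy illustration under stated assumptions: the 0.75 masking ratio is the value used in the original MAE work, and restricting the mean-squared error to masked patches also follows MAE; this summary does not state either detail for the models studied here. The encoder and decoder are omitted, so `pred` is only a placeholder for decoder output, and `random_mask` and `masked_mse` are hypothetical helper names.

```python
import random

def random_mask(num_patches, mask_ratio, rng):
    """Return (visible_idx, masked_idx) for a uniformly random patch mask."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = list(range(num_patches))
    rng.shuffle(perm)
    masked = sorted(perm[:num_masked])
    visible = sorted(perm[num_masked:])
    return visible, masked

def masked_mse(pred_patches, target_patches, masked_idx):
    """Mean-squared error averaged over the pixels of masked patches only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred_patches[i], target_patches[i]):
            total += (p - t) ** 2
            count += 1
    return total / count

rng = random.Random(0)
visible, masked = random_mask(num_patches=16, mask_ratio=0.75, rng=rng)

# 16 toy patches of 4 "pixels" each; the encoder would only see `visible`.
target = [[float(i)] * 4 for i in range(16)]
pred = [[0.0] * 4 for _ in range(16)]  # placeholder for decoder output
loss = masked_mse(pred, target, masked)
```

Because only the visible patches are passed to the encoder, a high masking ratio also reduces the encoder's compute per image.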

MAE-CLIP

  • MAE-CLIP attempts to incorporate aspects of MAE/M3AE into CLIP
  • Model architecture consists of three components: image encoder, text encoder, and cross-modality decoder
  • Image encoder divides input image into patches, applies linear projection and adds 2-D position encoding
  • Text encoder is based on BERT, tokenizes and embeds text, adds 1-D trainable position encoding
  • Decoder receives encoded image and text representations from encoders for masked and un-masked elements
  • Final loss is a weighted sum of losses from each task
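
To make the last point concrete, a weighted multi-task loss could look like the sketch below. The image and text weights (0.05 and 0.1) are taken from the training-configuration notes later in this summary; the contrastive weight of 1.0 and the function name `combined_loss` are assumptions made for illustration only.

```python
def combined_loss(contrastive, image_recon, text_recon,
                  w_contrastive=1.0, w_image=0.05, w_text=0.1):
    """Weighted sum of per-task losses (contrastive weight is an assumption)."""
    return (w_contrastive * contrastive
            + w_image * image_recon
            + w_text * text_recon)

# Example with made-up per-task loss values.
example = combined_loss(contrastive=1.0, image_recon=2.0, text_recon=3.0)
```

With small weights on the reconstruction terms, the contrastive signal dominates the gradient while the generative tasks act as auxiliary objectives.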

Experiments

  • MAE, M3AE, CLIP and MAE-CLIP were studied to see how self-supervision affects representation quality of natural language supervised models
  • MAE-CLIP combines mask based self-supervision with contrastive language-image learning

Experimental setup

  • Combined a 2.2B-example web-crawled image-text pair dataset with other datasets to create a 1.4B-example training set (after de-duplication)
  • Increased learning rate warm-up to 1,000 steps
  • Computed local contrastive loss for first 10,000 steps
  • Trained for 480,000 steps (6 full passes through dataset)

Image classification

  • MAE-CLIP typically shows worse performance than CLIP on ImageNet.
  • Hypothesized that this may be due to model capacity being lost to predicting irrelevant patches.
  • GAP provides consistent improvement over MAX on VTAB.
  • M3AE is a strong performer on VTAB.
  • M3AE does better on structured tasks, MAE-CLIP in between.

VQA

  • Finetuning strategy used with results in Table 12b
  • No consistent difference between CLIP and MAE-CLIP
  • Possible exception of CLEVR where M3AE is best performing
  • GAP is marginally stronger than MAX

Pooling analysis

  • MAE-CLIP performs better than CLIP
  • Standard multi-head attention pooling (MAP) is compared to global average pooling (GAP) and max pooling (MAX)
  • GAP and MAX pooling perform better than MAP pooling across ImageNet, VTAB and VQA tasks
  • Qualitative analysis of perceptual grouping in visual representation is studied in Section 6.1
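
For reference, the two parameter-free pooling operators compared above reduce a sequence of patch-token embeddings to a single image embedding as sketched below. MAP is omitted because it requires learned attention parameters. The function names are hypothetical and the token values are made up.

```python
def gap_pool(tokens):
    """Global average pooling: mean over tokens for each feature dimension."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def max_pool(tokens):
    """Max pooling: maximum over tokens for each feature dimension."""
    return [max(col) for col in zip(*tokens)]

# Three patch tokens with two feature dimensions each.
tokens = [[1.0, -2.0], [3.0, 0.0], [2.0, 5.0]]
gap_embedding = gap_pool(tokens)
max_embedding = max_pool(tokens)
```

GAP lets every token contribute equally to each dimension, whereas MAX selects a single token per dimension, which ties each output feature to one image location.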

Experiments at scale

  • Self-supervision combined with natural language supervision can improve the quality of visual representations.
  • Choice of image-encoder pooling strategy can have a big impact on results.
  • Investigating whether these conclusions hold when training on a larger dataset.
  • Training M3AE, GAP and MAX variants of CLIP and MAE-CLIP.
  • Not training MAE at scale due to resource limitations and lower performance than models with natural language supervision.

Analysis and discussion

  • Self-supervision fails at scale when combined with natural language supervision.
  • Two hypotheses presented to explore in future work.

Visual grounding

  • Visual grounding measures how well a representation can localize objects in an image
  • Representations with better visual grounding will excel at general purpose tasks
  • Pooling operator has a larger effect on performance than MAE
  • MAE and pooling operator have an effect on visual grounding of CLIP image and text encoders
  • Max pooling produces better results than other pooling operators

Dataset diversity

  • Self-supervision and natural language supervision excel for different parts of the dataset diversity-size spectrum
  • Strong self-supervised visual-encoder baselines excel on ImageNet
  • Specialized self-supervised methods excel on less diverse datasets such as CIFAR-10
  • MAEs have shown near state-of-the-art performance on AudioSet
  • Self-supervised methods achieve 10% worse performance than natural language supervised methods at similar data scale
  • Natural language supervised models show competitive performance on massive and diverse datasets

A supplementary overview

  • Used two internal datasets and several public datasets to build a “large-scale” pre-training dataset
  • High Quality Image Text Pairs dataset consists of 134M diverse and high quality images with descriptive captions and titles
  • English Web Image Text dataset consists of 2.2B images with one or more related pieces of text
  • Public datasets include Conceptual 12M, CC3M, and LAION-400M
  • Final training dataset of 1.4B image-text pairs after global image byte-level de-duplication
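
Byte-level de-duplication of images can be approximated by hashing the raw image bytes and keeping one pair per distinct hash. The sketch below is only an illustration: the choice of SHA-256, the keep-first policy, and the function name `dedup_by_image_bytes` are assumptions, and the summary does not describe how the authors implemented this step.

```python
import hashlib

def dedup_by_image_bytes(pairs):
    """Keep the first (image_bytes, text) pair for each distinct image payload."""
    seen = set()
    kept = []
    for image_bytes, text in pairs:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest in seen:
            continue  # identical bytes already seen, drop this pair
        seen.add(digest)
        kept.append((image_bytes, text))
    return kept

# Toy input: the first and third entries share identical image bytes.
pairs = [(b"img-a", "a cat"), (b"img-b", "a dog"), (b"img-a", "cat photo")]
unique_pairs = dedup_by_image_bytes(pairs)
```

Note that this only removes exact byte-for-byte duplicates; resized or re-encoded copies of the same picture would survive.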

B.2 VTAB evaluation data

  • Trained linear classifier on predicted visual features of VTAB datasets
  • Diabetic Retinopathy dataset not included due to licensing concerns
  • Sun397 not included due to missing image at time of preparing datasets for VTAB benchmark

C training configuration

  • Used AdamW optimizer and linear learning rate warmup over 200 steps
  • Multiplied generative image loss by 0.05 and generative text loss by 0.1 when computing global contrastive loss
  • Used fixed 2D position encoding for image encoder
  • Used pre-layer-norm and initialization scheme from [70]
  • 10,000 local contrastive loss steps and 1,000 warmup steps for cosine learning rate scheduler for web-crawled dataset
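
A linear warmup followed by cosine decay, as referred to above, is commonly implemented along the lines of the sketch below. The 1,000 warmup steps and 480,000 total steps come from this summary; the base learning rate of 1e-3, the final learning rate of 0, and the function name `lr_at_step` are placeholders chosen for illustration and are not values reported here.

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp: reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

BASE_LR = 1e-3  # placeholder value, not from the paper summary
schedule = [lr_at_step(s, BASE_LR, warmup_steps=1_000, total_steps=480_000)
            for s in (0, 999, 1_000, 240_500, 480_000)]
```

The sampled steps cover the start of warmup, the end of warmup, the start of decay, the midpoint of decay, and the final step.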

D VQA finetuning

  • Evaluated on three VQA benchmark datasets
  • Treated as a classification problem
  • Image and text embeddings are concatenated and positional encoding is added
  • BOS token is used as an output token and projected to the possible classes
  • For ADE20K, several prompts are associated with each of the 150 categories

F MAE-CLIP masking strategy

  • Random masking and similarity masking perform similarly on MAE-CLIP MAX
  • Random masking slightly improves classification tasks
  • Similarity masking slightly improves VQA and semantic segmentation tasks

G M3AE results

  • Updated version of tables from Section 5 in main paper
  • M3AE baseline had not fully converged at time of submission
  • M3AE performs on par with MAE-CLIP GAP on VTAB tasks
  • M3AE performs worse on VTAB natural tasks, better on VTAB structured tasks
  • MAE-CLIP improves classification performance of CLIP in small scale regime
  • MAE-CLIP performs better than self-supervised or language supervised methods
  • Self-supervision does not complement natural language supervision in large scale regime
  • Finetuning pipeline uses 133 label names from https://github.com/cocodataset/panopticapi