Scaling Language-Image Pre-training via Masking

Important disclaimer: the following content is AI-generated, please make sure to fact check the presented information by reading the full paper.

December 1, 2022 · 449 words · Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, Kaiming He

Scaling Language-Image Pre-training via Masking

Table of Contents

Link to paper
Abstract
Paper Content

Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

FLIP is a method for training CLIP that is more efficient.
FLIP masks and removes image patches during training.
FLIP improves accuracy and speed compared to the no-masking baseline.
FLIP outperforms CLIP counterparts on downstream tasks.
FLIP encourages research on scaling vision-language learning.

Paper Content

Introduction

Language-supervised visual pre-training is a powerful methodology for learning representations
Pre-trained models have strong zeroshot transferability
Pre-trained encoders can improve multimodal and unimodal visual tasks
Natural language provides richer forms of supervision
Large-scale training is essential for language-supervised models
FLIP is a method for efficient CLIP training
FLIP trains faster and is more accurate than its CLIP counterpart
FLIP reduces computation by 2-4x and allows larger batches
FLIP outperforms its CLIP counterparts on downstream tasks
FLIP can improve accuracy with data scaling at no extra training cost

Denoising Autoencoders proposed as unsupervised representation learning method
BERT is an application of Denoising Autoencoders
Masked Autoencoder (MAE) reduces training time and memory
MAE applied to videos, point clouds, graphs, audio, visual control, vision-language, and other modalities
Our work related to MAE and its vision-language extensions
Focus on scaling aspect enabled by sparse computation
CLIP performs contrastive learning on pairs of image and text samples
Randomly mask out image patches with high masking ratio
Language-supervised learning popularized by CLIP and related works
Generative learning methods explored, optionally combined with contrastive losses

Method

Masking reduces computation in CLIP training
Tradeoff between encoding density and number of samples compared
Image masking randomly masks out a portion of patches
Text masking optionally masks out a portion of tokens
Contrastive loss used to train encoders
No reconstruction loss used
Unmasking tuning strategy to close distribution gap

Implementation

Implementation follows CLIP and OpenCLIP with modifications
Image encoder follows ViT paper, no extra LayerNorm
Global average pooling at end of image encoder
Text encoder is non-autoregressive Transformer
WordPiece tokenizer, sequences padded/cut to length of 32
Outputs of image/text encoder projected to same-dimensional embedding space
Cosine similarities of embeddings used as input to InfoNCE loss
Prompt engineering used for zero-shot transfer
Implementation based on JAX with t5x library
Training run on TPU v3 infrastructures

Experiments

FLIP design ablated
Image encoder is ViT-L/16
Text encoder has smaller size
Trained on LAION-400M
Evaluated zeroshot accuracy on ImageNet-1K
Image masking ratios studied
Batch size studied
Text masking studied
Inference unmasking studied
Unmasked tuning studied
Reconstruction studied
Accuracy vs. time trade-off studied
CLIP baselines compared
ImageNet zero-shot transfer compared
ImageNet linear probing compared
ImageNet fine-tuning compared
Zero-shot classification on more datasets compared
Zero-shot retrieval compared
Zero-shot robustness evaluation compared
Image captioning compared
Visual question answering compared