Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • FLIP is a method for training CLIP that is more efficient.
  • FLIP masks and removes image patches during training.
  • FLIP improves accuracy and speed compared to the no-masking baseline.
  • FLIP outperforms CLIP counterparts on downstream tasks.
  • FLIP encourages research on scaling vision-language learning.

Paper Content

Introduction

  • Language-supervised visual pre-training is a powerful methodology for learning representations
  • Pre-trained models have strong zeroshot transferability
  • Pre-trained encoders can improve multimodal and unimodal visual tasks
  • Natural language provides richer forms of supervision
  • Large-scale training is essential for language-supervised models
  • FLIP is a method for efficient CLIP training
  • FLIP trains faster and is more accurate than its CLIP counterpart
  • FLIP reduces computation by 2-4x and allows larger batches
  • FLIP outperforms its CLIP counterparts on downstream tasks
  • FLIP can improve accuracy with data scaling at no extra training cost
  • Denoising Autoencoders proposed as unsupervised representation learning method
  • BERT is an application of Denoising Autoencoders
  • Masked Autoencoder (MAE) reduces training time and memory
  • MAE applied to videos, point clouds, graphs, audio, visual control, vision-language, and other modalities
  • Our work related to MAE and its vision-language extensions
  • Focus on scaling aspect enabled by sparse computation
  • CLIP performs contrastive learning on pairs of image and text samples
  • Randomly mask out image patches with high masking ratio
  • Language-supervised learning popularized by CLIP and related works
  • Generative learning methods explored, optionally combined with contrastive losses

Method

  • Masking reduces computation in CLIP training
  • Tradeoff between encoding density and number of samples compared
  • Image masking randomly masks out a portion of patches
  • Text masking optionally masks out a portion of tokens
  • Contrastive loss used to train encoders
  • No reconstruction loss used
  • Unmasking tuning strategy to close distribution gap

Implementation

  • Implementation follows CLIP and OpenCLIP with modifications
  • Image encoder follows ViT paper, no extra LayerNorm
  • Global average pooling at end of image encoder
  • Text encoder is non-autoregressive Transformer
  • WordPiece tokenizer, sequences padded/cut to length of 32
  • Outputs of image/text encoder projected to same-dimensional embedding space
  • Cosine similarities of embeddings used as input to InfoNCE loss
  • Prompt engineering used for zero-shot transfer
  • Implementation based on JAX with t5x library
  • Training run on TPU v3 infrastructures

Experiments

  • FLIP design ablated
  • Image encoder is ViT-L/16
  • Text encoder has smaller size
  • Trained on LAION-400M
  • Evaluated zeroshot accuracy on ImageNet-1K
  • Image masking ratios studied
  • Batch size studied
  • Text masking studied
  • Inference unmasking studied
  • Unmasked tuning studied
  • Reconstruction studied
  • Accuracy vs. time trade-off studied
  • CLIP baselines compared
  • ImageNet zero-shot transfer compared
  • ImageNet linear probing compared
  • ImageNet fine-tuning compared
  • Zero-shot classification on more datasets compared
  • Zero-shot retrieval compared
  • Zero-shot robustness evaluation compared
  • Image captioning compared
  • Visual question answering compared