Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • This paper presents a method called contrastive-tuning which uses contrastive training to align image and text models.
  • The best results are achieved when the image models are locked and the text models are unlocked. This is called “Locked-image Tuning” (LiT).
  • LiT gives the model the ability to transfer to new vision tasks, such as image classification or retrieval.
  • LiT works with multiple pre-training methods and across diverse architectures.
  • With the transformer-based pre-trained ViT-g/14 model, the LiT model achieves high zero-shot transfer accuracy on the ImageNet test set and on the out-of-distribution ObjectNet test set.

Paper Content

Introduction

  • Transfer learning is a successful paradigm in computer vision
  • Zero-shot learning is an alternative approach to develop models without task-specific data
  • Web-sourced paired image-text data can be used to pre-train models for zero-shot transfer
  • Contrastive learning framework uses an image model and text model to minimize a contrastive loss
  • Locked-image Tuning (LiT) teaches a text model to read out representations from a pre-trained image model
  • Transfer learning is a two-step process of pre-training and fine-tuning
  • Scaling up model and dataset sizes leads to improved transfer effectiveness and robustness
  • Large pre-trained models are effective in low-data (few-shot) regime
  • Zero-shot transfer is an alternative paradigm that avoids fine-tuning
  • Alignment between image and text embedding spaces is used for zero-shot transfer
  • Contrastive learning simplifies the learning task and encourages focus on high-level information

Methods

Contrastive pre-training

  • Images and text descriptions are used to train visual models.
  • Models can be used for tasks such as classification and image/text retrieval.
  • Contrastive pre-training is an effective approach for training models from image-text data.
  • Contrastive loss encourages corresponding image-text pairs to have similar embeddings and non-corresponding pairs to have distinct embeddings.
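
The loss described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function names, the fixed temperature of 0.07, and the toy two-dimensional embeddings are assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    top = max(row)
    log_z = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric softmax contrastive loss.

    Row i of image_embs and row i of text_embs form a matching pair;
    every other row in the batch acts as a negative.
    The temperature value is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: each image should select its own text.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy batch: matched pairs point in the same direction.
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
low = contrastive_loss(aligned_images, aligned_texts)
high = contrastive_loss(aligned_images, swapped_texts)
```

In this toy batch the loss is close to zero when each image matches its own text and large when the texts are swapped, which is the behavior the bullet above describes.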

Contrastive-tuning

  • Contrastive pre-training can be viewed as learning two tasks at the same time: learning an image embedding, and learning a text embedding that aligns with the image embedding space.
  • Common approach to learning image embeddings is to use a large and relatively clean dataset of (semi)manually labeled images.
  • Image-text data does not have the limitation of predefined categories, but may be of lower quality.
  • Contrastive-tuning combines advantages of both sources of data by initializing contrastive pre-training with a pre-trained image model.
  • Contrastive-tuning is flexible enough to integrate any models that can produce meaningful representations.

Design choices and locked-image tuning

  • The image and text towers can each be randomly initialized or initialized from a pre-trained model, and pre-trained towers can be locked (frozen) or unlocked (trainable).
  • Representations from the two towers may have different sizes, so a linear projection is used to map them to a common dimensionality.
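  • A two-character notation is used to discuss the potential design choices: the first character describes the image tower and the second the text tower, with L for locked pre-trained, U for unlocked pre-trained, and u for unlocked randomly initialized.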
  • The Lu setting (locked pre-trained image model, and unlocked (trainable) randomly initialized text model) works particularly well.
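
The two-character notation can be made concrete with a small parser. This is a sketch under the assumption that the first character refers to the image tower and the second to the text tower, with L meaning locked and pre-trained, U meaning unlocked and pre-trained, and u meaning unlocked and randomly initialized. The names `TowerSetting` and `parse_setting` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TowerSetting:
    pretrained: bool  # initialized from a pre-trained checkpoint
    trainable: bool   # weights are updated during contrastive tuning


# Assumed meaning of each character in the notation.
_CODES = {
    "L": TowerSetting(pretrained=True, trainable=False),
    "U": TowerSetting(pretrained=True, trainable=True),
    "u": TowerSetting(pretrained=False, trainable=True),
}


def parse_setting(code):
    """Map a code such as 'Lu' to per-tower settings (image first, text second)."""
    if len(code) != 2 or any(ch not in _CODES for ch in code):
        raise ValueError(f"expected two characters from L, U, u; got {code!r}")
    return {"image": _CODES[code[0]], "text": _CODES[code[1]]}


lit = parse_setting("Lu")
```

Here `lit["image"]` is pre-trained and frozen while `lit["text"]` is randomly initialized and trainable, matching the Lu description above.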

Image-text datasets

  • CC12M is a dataset of 12 million image-text pairs
  • YFCC100m is a dataset of roughly 100 million images with rich metadata; the YFCC100m-CLIP subset contains 15 million images
  • 4 billion image-text pairs were collected following the same process as ALIGN
  • Near-duplicate images were removed from the dataset

Experiments

  • Compared LiT to state-of-the-art image-text models in two scenarios: public datasets and private data
  • Evaluated contrastive tuning design choices with various training settings and datasets for 0-shot ImageNet classification and MSCOCO image and text retrieval

Comparison to the previous state-of-the-art

  • LiT results are reported on the authors' privately collected image-text dataset
  • ViT-g/14 model used with a 32k batch size, tuned for 18 billion image-text pairs seen
  • LiT compared to previous state-of-the-art methods
  • LiT significantly outperforms previous methods on ImageNet zero-shot classification
  • LiT sets new state-of-the-art 82.5% accuracy on ObjectNet test set
  • LiT achieves 81.7% top-1 accuracy on 0-shot ImageNet transfer with only 300M image-text pairs seen
  • LiT achieves 75.7% zero-shot transfer on ImageNet with only public data sources

Evaluation of design choices

  • Small-scale investigation of various combinations of image and text towers
  • Each tower is either initialized with pre-trained weights (and then locked or unlocked) or randomly initialized and unlocked
  • Training on YFCC100m-CLIP dataset, varying total number of steps
  • Locking image tower works best and pre-trained image tower helps
  • Pre-trained text tower only marginally improves performance
  • Locking text tower does not work well
  • Unexpectedly, locking the pre-trained image tower provides benefits even with a large dataset
  • Table 2 shows results of contrastive tuning on the dataset of 4 billion image-text pairs
  • Locked (L) better than unlocked (U)
  • Figure 4 gives hints as to why
  • Pre-trained models better suited for LiT
  • Results highlight importance of generally pre-trained model and varied set of evaluation tasks
  • Appendix A explores other architectures

Which text model to use?

  • Related work has focused on image model, text model is underexplored
  • Four transformer-based text models considered
  • BERT model performs best on YFCC100M-CLIP dataset
  • BERT model less stable to train
  • No improvement when using ViT text encoder with BERT’s WordPiece tokenizer
  • Increasing text tower capacity improves performance
  • De-duplication of examples does not influence results strongly

Technical advantages of locked image models

  • Training is sped up and memory use is reduced
  • Precomputing image model’s embeddings reduces computation time and memory requirements
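
Because a locked image model never changes, its output for a given image is fixed and can be computed once and reused. The sketch below illustrates the idea, assuming images are not randomly augmented during tuning (augmentation would change the embedding from step to step). The function names and the toy model are hypothetical.

```python
def precompute_image_embeddings(image_model, images):
    """Run the locked image model once per image and store results by image id."""
    return {image_id: image_model(pixels) for image_id, pixels in images.items()}


def tuning_batches(cache, pairs, batch_size):
    """Yield (image_embeddings, texts) batches without calling the image model.

    pairs is a list of (image_id, text) tuples; cache maps image_id to the
    precomputed embedding.
    """
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        yield [cache[image_id] for image_id, _ in chunk], [text for _, text in chunk]


def toy_image_model(pixels):
    """Stand-in for a real image tower: returns a one-dimensional 'embedding'."""
    return [sum(pixels) / len(pixels)]


images = {"img0": [0.0, 1.0], "img1": [1.0, 1.0], "img2": [0.5, 0.0]}
pairs = [("img0", "a cat"), ("img1", "a dog"), ("img2", "a bird")]
cache = precompute_image_embeddings(toy_image_model, images)
first_epoch = list(tuning_batches(cache, pairs, batch_size=2))
```

Every later epoch reads from `cache`, so only the text tower needs a forward and backward pass, and the image model does not have to be held in memory during tuning.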

Preliminary multilingual experiments

  • Common practice is to filter image-text datasets to English language data only
  • Removing this restriction has potential to benefit larger part of world’s population
  • No translations required, rely on pre-trained, locked image model to bridge language barrier
  • Experiments show promise of LiT for multilingual image-text models

Discussion

  • Explores only classification and retrieval as zero-shot transfer tasks
  • Evaluating zero-shot transfer to a broader set of tasks left as future work
  • Lu setup can save computational cost within a fixed budget
  • Double-edged sword: the technique makes it simpler to create malicious, offensive, or obscene text tower counterparts to existing image models

Conclusion

  • We present a method called contrastive-tuning that allows transferring any pre-trained vision model in a zero-shot fashion
  • The proposed LiT setup halves the gap between the from-scratch contrastive learning setup and the per-task supervised fine-tuning setup
  • LiT works for different model families, but ViT models are more amenable to learning image-text mappings
  • Increasing the model capacity of the pre-trained image tower improves zero-shot ImageNet accuracy
  • We use the AdaFactor optimizer with a learning rate of 0.001, batch size of 16384, and no weight decay
  • We use pre-trained ViT models from [55] and pre-trained transformer models from [69]
  • We use BERT-base and BERT-large from [17] for most experiments
  • We use Adam optimizer with a learning rate of 0.001 and weight decay of 0.0001
  • We pre-process images by Inception-style cropping to a size of 224 pixels
  • We train for 20 epochs (200 million seen image-text pairs)
  • We explore three ways of learning from all text signals and make use of the full 100M images
  • We obtain the best results with LiT using all text signals jointly on the YFCC CLIP subset
  • Global contrastive loss increases the effective batch size for contrastive learning and leads to better performance
  • Pre-computation eliminates loading the image model to memory during training, allowing larger batch sizes
  • Training the image tower with a smaller learning rate and/or delaying training of the image tower results in better retrieval metrics
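
The last point, training the image tower with a smaller learning rate and/or starting its training later, can be sketched as a per-tower learning-rate rule. The function name and all numeric values below (base rate, scale factor, delay) are hypothetical choices for illustration and are not the settings used in the paper.

```python
def tower_learning_rates(step, base_lr=1e-3, image_lr_scale=0.1, image_delay_steps=1000):
    """Return the learning rate of each tower at a given training step.

    The text tower always trains at base_lr. The image tower stays frozen
    (rate 0.0) for the first image_delay_steps steps and then trains at a
    reduced rate of base_lr * image_lr_scale. All defaults are illustrative.
    """
    image_lr = 0.0 if step < image_delay_steps else base_lr * image_lr_scale
    return {"text": base_lr, "image": image_lr}


early = tower_learning_rates(step=0)
late = tower_learning_rates(step=5000)
```

Setting `image_lr_scale=0.0` recovers a fully locked image tower, while `image_delay_steps=0` with `image_lr_scale=1.0` corresponds to training both towers at the same rate from the start.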