Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

This paper presents a method called contrastive-tuning which uses contrastive training to align image and text models.
The best results are achieved when the image models are locked and the text models are unlocked. This is called “Locked-image Tuning” (LiT).
LiT gives the model the ability to transfer zero-shot to new vision tasks, such as image classification or retrieval.
LiT works with multiple pre-training methods and across diverse architectures.
With the transformer-based pre-trained ViT-g/14 model, the LiT model achieves high zero-shot transfer accuracy on the ImageNet test set and 82.5% on the ObjectNet test set.

Paper Content

Introduction

Transfer learning is a successful paradigm in computer vision
Zero-shot learning is an alternative approach to develop models without task-specific data
Web-sourced paired image-text data can be used to pre-train models for zero-shot transfer
Contrastive learning framework uses an image model and text model to minimize a contrastive loss
Locked-image Tuning (LiT) teaches a text model to read out representations from a pre-trained image model

Related work

Transfer learning is a two-step process of pre-training and fine-tuning
Scaling up model and dataset sizes leads to improved transfer effectiveness and robustness
Large pre-trained models are effective in low-data (few-shot) regime
Zero-shot transfer is an alternative paradigm that avoids fine-tuning
Alignment between image and text embedding spaces is used for zero-shot transfer
Contrastive learning simplifies the learning task and encourages focus on high-level information
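
As a rough illustration of how aligned image and text embedding spaces enable zero-shot transfer, the sketch below classifies an image by picking the class whose text embedding is most similar to the image embedding. The function names and the toy 3-dimensional embeddings are assumptions made for this example; they are not outputs of the models described in the paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the best-matching class name and all cosine similarities.

    class_text_embeddings maps a class name to the embedding of a text
    describing that class (hypothetical values below, not real model output).
    """
    image = l2_normalize(image_embedding)
    scores = {}
    for name, text_embedding in class_text_embeddings.items():
        text = l2_normalize(text_embedding)
        scores[name] = sum(i * t for i, t in zip(image, text))
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings standing in for the outputs of the text tower.
class_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
# Toy embedding standing in for the output of the image tower.
predicted, similarity = zero_shot_classify([0.8, 0.3, 0.1], class_embeddings)
```

No task-specific training data is involved: adding a new class only requires embedding a new piece of text.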

Methods

Contrastive pre-training

Images and text descriptions are used to train visual models.
Models can be used for tasks such as classification and image/text retrieval.
Contrastive pre-training is an effective approach for training models from image-text data.
Contrastive loss encourages corresponding image-text pairs to have similar embeddings and non-corresponding pairs to have distinct embeddings.
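
One common way to write such a loss is a symmetric softmax cross-entropy over the similarity matrix of a batch. The sketch below assumes that formulation; the fixed temperature of 0.1 and all function names are illustrative assumptions, not details taken from this summary.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of one row of logits against the index of the true match."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[target_index]

def contrastive_loss(image_embeddings, text_embeddings, temperature=0.1):
    """Average of the image-to-text and text-to-image losses over one batch.

    Entry i of each list is assumed to come from the same image-text pair.
    """
    images = [l2_normalize(v) for v in image_embeddings]
    texts = [l2_normalize(v) for v in text_embeddings]
    n = len(images)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature.
    logits = [[dot(img, txt) / temperature for txt in texts] for img in images]
    image_to_text = sum(softmax_cross_entropy(logits[i], i) for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(softmax_cross_entropy(columns[j], j) for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

# Matching pairs give a loss near zero; mismatched pairs give a large loss.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Every other example in the batch acts as a negative, which is why larger batches give the loss more non-corresponding pairs to push apart.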

Contrastive-tuning

Contrastive pre-training can be viewed as learning two tasks at the same time: learning an image embedding, and learning a text embedding that aligns with the image embedding space.
Common approach to learning image embeddings is to use a large and relatively clean dataset of (semi)manually labeled images.
Image-text data does not have the limitation of predefined categories, but may be of lower quality.
Contrastive-tuning combines advantages of both sources of data by initializing contrastive pre-training with a pre-trained image model.
Contrastive-tuning is flexible enough to integrate any models that can produce meaningful representations.

Design choices and locked-image tuning

The image and text towers can each be randomly initialized or initialized from a pre-trained model, and a pre-trained tower can be either locked (frozen) or unlocked (trainable).
Representations from the two towers may have different sizes, so a linear projection is used to map them to a common dimensionality.
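A two-character notation is used to discuss the potential design choices: the first character describes the image tower and the second the text tower, with L for locked pre-trained, U for unlocked pre-trained, and u for unlocked and randomly initialized.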
The Lu setting (locked pre-trained image model, and unlocked (trainable) randomly initialized text model) works particularly well.
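
The design space can be sketched as a small lookup plus a linear projection. The code assumes that the first character refers to the image tower and the second to the text tower, and that U means an unlocked pre-trained tower; `TowerSetup`, `parse_setting`, and `project` are hypothetical helpers written for this example, not the authors' code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TowerSetup:
    pretrained: bool  # initialized from pre-trained weights?
    trainable: bool   # updated during contrastive-tuning?

# Assumed reading of the notation.
SYMBOLS = {
    "L": TowerSetup(pretrained=True, trainable=False),  # locked, pre-trained
    "U": TowerSetup(pretrained=True, trainable=True),   # unlocked, pre-trained
    "u": TowerSetup(pretrained=False, trainable=True),  # unlocked, random init
}

def parse_setting(code):
    """Split a two-character code into (image tower, text tower) setups."""
    if len(code) != 2 or any(ch not in SYMBOLS for ch in code):
        raise ValueError(f"expected two characters from L/U/u, got {code!r}")
    return SYMBOLS[code[0]], SYMBOLS[code[1]]

def project(vector, weights):
    """Linear projection; weights has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

image_tower, text_tower = parse_setting("Lu")
# Toy example: a 3-d text representation mapped into a 2-d shared space.
shared = project([1.0, 2.0, 3.0], [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]])
```

In the Lu setting only the text tower and the projection receive gradient updates, since the image tower is marked as not trainable.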

Image-text datasets

CC12M is a dataset of 12 million image-text pairs
YFCC100m is a dataset of about 100 million images with rich metadata; its CLIP-filtered subset of 15 million images (YFCC100m-CLIP) is also used
4 billion image-text pairs were collected following the same process as ALIGN
Near-duplicate images were removed from the dataset

Experiments

Compared LiT to state-of-the-art image-text models in two scenarios: public datasets and private data
Evaluated contrastive tuning design choices with various training settings and datasets for 0-shot ImageNet classification and MSCOCO image and text retrieval

Comparison to the previous state-of-the-art

LiT results presented on the privately collected dataset of 4 billion image-text pairs
ViT-g/14 model used, 32k batch size, tuned for 18 billion image-text pairs
LiT compared to previous state-of-the-art methods
LiT significantly outperforms previous methods on ImageNet zero-shot classification
LiT sets new state-of-the-art 82.5% accuracy on ObjectNet test set
LiT achieves 81.7% top-1 accuracy on 0-shot ImageNet transfer with only 300M image-text pairs seen
LiT achieves 75.7% zero-shot transfer on ImageNet with only public data sources

Evaluation of design choices

Small-scale investigation of various combinations of image and text towers
Each tower is either pre-trained and locked, pre-trained and unlocked, or randomly initialized and unlocked
Training on YFCC100m-CLIP dataset, varying total number of steps
Locking image tower works best and pre-trained image tower helps
Pre-trained text tower only marginally improves performance
Locking text tower does not work well
Unexpectedly, locking the pre-trained image tower provides benefits even with a large dataset
Table 2 shows results of contrastive tuning on 4 billion images
Locked (L) better than unlocked (U)
Figure 4 gives hints as to why
More generally pre-trained image models are better suited for LiT
Results highlight importance of generally pre-trained model and varied set of evaluation tasks
Appendix A explores other architectures

Which text model to use?

Related work has focused on image model, text model is underexplored
Four transformer-based text models considered
BERT model performs best on YFCC100M-CLIP dataset
BERT model less stable to train
No improvement when using ViT text encoder with BERT’s WordPiece tokenizer
Increasing text tower capacity improves performance
De-duplication of examples does not influence results strongly

Technical advantages of locked image models

Training is sped up and memory use is reduced
Precomputing image model’s embeddings reduces computation time and memory requirements
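
A minimal sketch of the precomputation idea follows, assuming each image is preprocessed the same way every time it is seen, so that its embedding never changes. `toy_image_tower` is a stand-in written for this example; a real locked image model would take its place.

```python
def precompute_image_embeddings(image_tower, images):
    """Run the locked image tower once per image and keep the results."""
    return {image_id: image_tower(pixels) for image_id, pixels in images.items()}

calls = 0

def toy_image_tower(pixels):
    """Stand-in for an expensive frozen model: here just the mean pixel value."""
    global calls
    calls += 1
    return [sum(pixels) / len(pixels)]

images = {"img0": [0.0, 1.0], "img1": [0.5, 0.5, 0.5]}
cache = precompute_image_embeddings(toy_image_tower, images)

# Several passes of text-tower training read from the cache only.
for _epoch in range(3):
    for image_id in images:
        embedding = cache[image_id]  # no image-tower forward pass here
```

The image tower runs once per image regardless of the number of passes, and it does not need to stay in memory while the text tower trains.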

Preliminary multilingual experiments

Common practice is to filter image-text datasets to English language data only
Removing this restriction has potential to benefit larger part of world’s population
No translations required, rely on pre-trained, locked image model to bridge language barrier
Experiments show promise of LiT for multilingual image-text models

Discussion

Explores only classification and retrieval as zero-shot transfer tasks
Evaluating zero-shot transfer to a broader set of tasks left as future work
Lu setup can save computational cost within a fixed budget
Double-edged sword: technique makes it simpler to create malicious, offensive, or obscene text tower counterparts ("pendants") to existing image models

Conclusion

We present a method called contrastive-tuning that allows transferring any pre-trained vision model in a zero-shot fashion
The proposed LiT setup halves the gap between the from-scratch contrastive learning setup and the per-task supervised fine-tuning setup
LiT works for different model families, but ViT models are more amenable to learning image-text mappings
Increasing the model capacity of the pre-trained image tower improves zero-shot ImageNet accuracy
We use the AdaFactor optimizer with a learning rate of 0.001, batch size of 16384, and no weight decay
We use pre-trained ViT models from [55] and pre-trained transformer models from [69]
We use BERT-base and BERT-large from [17] for most experiments
We use Adam optimizer with a learning rate of 0.001 and weight decay of 0.0001
We pre-process images by Inception-style cropping to a size of 224 pixels
We train for 20 epochs (200 million seen image-text pairs)
We explore three ways of learning from all text signals and make use of the full 100 M images
We obtain the best results with LiT using all text signals jointly on the YFCC CLIP subset
Global contrastive loss increases the effective batch size for contrastive learning and leads to better performance
Pre-computation eliminates loading the image model to memory during training, allowing larger batch sizes
Training the image tower with a smaller learning rate and/or delaying training of the image tower results in better retrieval metrics
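
For the variant in which the image tower is trained after all, the two options in the last point (a smaller learning rate and a delayed start) can be combined in a simple per-tower schedule. The sketch below is only an illustration: the scale factor of 0.1 and the delay of 1000 steps are made-up values, and `tower_learning_rates` is a hypothetical helper.

```python
import math

def tower_learning_rates(step, base_lr=1e-3, image_lr_scale=0.1, image_delay_steps=1000):
    """Return (text_lr, image_lr) for a given training step.

    The text tower always trains at base_lr. The image tower stays frozen
    for the first image_delay_steps steps and then trains at a reduced rate.
    """
    text_lr = base_lr
    if step < image_delay_steps:
        image_lr = 0.0
    else:
        image_lr = base_lr * image_lr_scale
    return text_lr, image_lr

early = tower_learning_rates(0)
late = tower_learning_rates(1000)
```

Setting `image_delay_steps` beyond the total number of training steps keeps the image tower frozen throughout, which corresponds to the locked setting.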
