Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- This paper presents a method called contrastive-tuning which uses contrastive training to align image and text models.
- The best results are achieved when the image models are locked and the text models are unlocked. This is called “Locked-image Tuning” (LiT).
- LiT gives the model the ability to transfer to new vision tasks, such as image classification or retrieval.
- LiT works with multiple pre-training methods and across diverse architectures.
- With the transformer-based pre-trained ViT-g/14 model, the LiT model achieves high accuracy on the ImageNet and ObjectNet test sets.
Paper Content
Introduction
- Transfer learning is a successful paradigm in computer vision
- Zero-shot learning is an alternative approach to develop models without task-specific data
- Web-sourced paired image-text data can be used to pre-train models for zero-shot transfer
- Contrastive learning framework uses an image model and text model to minimize a contrastive loss
- Locked-image Tuning (LiT) teaches a text model to read out representations from a pre-trained image model
Related work
- Transfer learning is a two-step process of pre-training and fine-tuning
- Scaling up model and dataset sizes leads to improved transfer effectiveness and robustness
- Large pre-trained models are effective in low-data (few-shot) regime
- Zero-shot transfer is an alternative paradigm that avoids fine-tuning
- Alignment between image and text embedding spaces is used for zero-shot transfer
- Contrastive learning simplifies the learning task and encourages focus on high-level information
Methods
Contrastive pre-training
- Images and text descriptions are used to train visual models.
- Models can be used for tasks such as classification and image/text retrieval.
- Contrastive pre-training is an effective approach for training models from image-text data.
- Contrastive loss encourages corresponding image-text pairs to have similar embeddings and non-corresponding pairs to have distinct embeddings.
Contrastive-tuning
- Contrastive pre-training can be used to learn two tasks at the same time: image embedding and text embedding alignment.
- Common approach to learning image embeddings is to use a large and relatively clean dataset of (semi)manually labeled images.
- Image-text data does not have the limitation of predefined categories, but may be of lower quality.
- Contrastive-tuning combines advantages of both sources of data by initializing contrastive pre-training with a pre-trained image model.
- Flexible enough to integrate any models that can produce meaningful representations.
Design choices and locked-image tuning
- Pre-trained image and text models can be randomly initialized or from a pre-trained model.
- Representations from the two towers may have different sizes, so a linear projection is used to map them to a common dimensionality.
- A two-character notation is used to discuss the potential design choices.
- The Lu setting (locked pre-trained image model, and unlocked (trainable) randomly initialized text model) works particularly well.
Image-text datasets
- CC12M is a dataset of 12 million image-text pairs
- YFCC100m is a dataset of 15 million images with rich metadata
- 4 billion image-text pairs were collected following the same process as ALIGN
- Near-duplicate images were removed from the dataset
Experiments
- Compared LiT to state-of-the-art image-text models in two scenarios: public datasets and private data
- Evaluated contrastive tuning design choices with various training settings and datasets for 0-shot ImageNet classification and MSCOCO image and text retrieval
Comparison to the previous state-of-the-art
- LiT results presented on dataset
- ViT-g/14 model used, 32k batch size, tuned for 18 billion image-text pairs
- LiT compared to previous state-of-the-art methods
- LiT significantly outperforms previous methods on ImageNet zero-shot classification
- LiT sets new state-of-the-art 82.5% accuracy on ObjectNet test set
- LiT achieves 81.7% top-1 accuracy on 0-shot ImageNet transfer with only 300M image-text pairs seen
- LiT achieves 75.7% zero-shot transfer on ImageNet with only public data sources
Evaluation of design choices
- Small-scale investigation of various combinations of image and text towers
- Pre-trained weights and locked or unlocked or randomly initialized and unlocked
- Training on YFCC100m-CLIP dataset, varying total number of steps
- Locking image tower works best and pre-trained image tower helps
- Pre-trained text tower only marginally improves performance
- Locking text tower does not work well
- Locking pre-trained image tower helps even with large dataset
- Unexpectedly, locking image tower provides benefits even with large dataset
- Table 2 shows results of contrastive tuning on 4 billion images
- Locked (L) better than unlocked (U)
- Figure 4 gives hints as to why
- Pre-trained models better suited for LiT
- Results highlight importance of generally pre-trained model and varied set of evaluation tasks
- Appendix A explores other architectures
Which text model to use?
- Related work has focused on image model, text model is underexplored
- Four transformer-based text models considered
- BERT model performs best on YFCC100M-CLIP dataset
- BERT model less stable to train
- No improvement when using ViT text encoder with BERT’s WordPiece tokenizer
- Increasing text tower capacity improves performance
- De-duplication of examples does not influence results strongly
Technical advantages of locked image models
- Training is sped up and memory use is reduced
- Precomputing image model’s embeddings reduces computation time and memory requirements
Preliminary multilingual experiments
- Common practice is to filter imagetext datasets to English language data only
- Removing this restriction has potential to benefit larger part of world’s population
- No translations required, rely on pre-trained, locked image model to bridge language barrier
- Experiments show promise of LiT for multilingual image-text models
Discussion
- Explores only classification and retrieval as zero-shot transfer tasks
- Evaluating zero-shot transfer to a broader set of tasks left as future work
- Lu setup can save computational cost within a fixed budget
- Double-edged sword: technique makes it simpler to create malicious, offensive, or obscene text tower pendants to existing image models
Conclusion
- We present a method called contrastive-tuning that allows transferring any pre-trained vision model in a zero-shot fashion
- The proposed LiT setup halves the gap between the from-scratch contrastive learning setup and the per-task supervised fine-tuning setup
- LiT works for different model families, but ViT models are more amenable to learning image-text mappings
- Increasing the model capacity of the pre-trained image tower improves zero-shot ImageNet accuracy
- We use the AdaFactor optimizer with a learning rate of 0.001, batch size of 16384, and no weight decay
- We use pre-trained ViT models from [55] and pre-trained transformer models from [69]
- We use BERT-base and BERT-large from [17] for most experiments
- We use Adam optimizer with a learning rate of 0.001 and weight decay of 0.0001
- We pre-process images by Inception-style cropping to a size of 224 pixels
- We train for 20 epochs (200 million seen image-text pairs)
- We explore three ways of learning from all text signals and make use of the full 100 M images
- We obtain the best results with LiT using all text signals jointly on the YFCC CLIP subset
- Global contrastive loss increases the effective batch size for contrastive learning and leads to better performance
- Pre-computation eliminates loading the image model to memory during training, allowing larger batch sizes
- Training the image tower with a smaller learning rate and/or delaying training of the image tower results in better retrieval metrics