Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Image token removal is an efficient augmentation strategy for reducing the cost of computing image features.
- Removing a large portion of image tokens may discard the semantic content associated with a given text description.
- Proposed an attentive token removal approach for CLIP training which retains tokens with a high semantic correlation to the text description.
- Experiments show that the proposed approach performs better than the previous method of random token removal for CLIP training.
- Compared to other CLIP improvements, our method is more effective and more efficient.
Paper Content
Introduction
- CLIP and ALIGN models demonstrate zero-shot image classification and multi-modal retrieval capabilities
- CLIP and ALIGN require large amounts of data and a high training cost
- This paper aims to improve the efficiency of CLIP training
- Image token removal drops a large portion of image tokens and reduces computation
- Token removal process can harm CLIP performance
- Proposed attentive token removal strategy selects tokens to remove according to correlation scores
- Averaged attention weights of all layers used to compute correlation scores
- A-CLIP framework is efficient and effective
- A-CLIP-eff is even more efficient and accurate
Related work
- Goal of computer vision is to interpret visual signals using language
- Recent works suggest a new way to better connect visual signals with linguistic semantics
- Training data is more scalable with billions of image-alt-text pairs
- CLIP model is a mainstream visual learning method
- Masking tokens for efficient computation
- Dynamic ViTs learn to remove tokens for efficient image classification
- Masked autoencoder randomly masks 75% of tokens
- Our method improves both effectiveness and efficiency of CLIP pre-training
- FLIP uses random masking for CLIP training
- Our method uses attentive masking which removes semantically meaningless tokens
- Our method introduces multiple masked image views
- Masking for data augmentation
- Combining CLIP with other representation learning methods
Method
A brief review of CLIP
- CLIP is a visual representation learning approach that uses image-text pairs.
- CLIP applies an InfoNCE-like loss to classify pairs as positive or negative.
- CLIP uses a Vision Transformer and a language Transformer.
- CLIP applies a vision-language contrastive loss to project the [CLS] and [EOS] features into an embedding space.
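The symmetric contrastive objective reviewed above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names are hypothetical, and the fixed temperature of 0.07 is an assumption (CLIP-style models usually learn this value during training).

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each
    array is assumed to come from the same image-text pair.
    """
    # Project both modalities onto the unit sphere.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])
    # The matching pair sits on the diagonal; all other entries are negatives.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

In a full model the two inputs would be the projected [CLS] feature of the vision Transformer and the projected [EOS] feature of the text Transformer.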
Masking for efficient CLIP training
- Random masking method is effective for masked image modeling
- MIM is mainly for pre-training, CLIP is for zero-shot classification and retrieval
- Removing semantically rich image regions harms CLIP training more than it harms MIM
- Random masking with 50% mask rate reduces zero-shot accuracy on ImageNet-1K
- Attentive masking method resolves issues of random masking
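For reference, the random-masking baseline discussed above might look like the sketch below. The function name is hypothetical, and the example sizes are an assumption: a ViT-B/16 at a 224x224 input yields 14 x 14 = 196 patch tokens, so a 50% mask rate leaves 98 tokens for the encoder.

```python
import numpy as np

def random_token_mask(tokens, mask_ratio=0.5, rng=None):
    """Drop a random subset of patch tokens.

    tokens: array of shape (num_tokens, dim), patch tokens only
    (a [CLS] token, if any, is assumed to be handled separately).
    Returns the kept tokens and their original indices.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_tokens = tokens.shape[0]
    num_keep = int(round(num_tokens * (1.0 - mask_ratio)))
    # Pick a random subset, then restore spatial order.
    keep_idx = np.sort(rng.permutation(num_tokens)[:num_keep])
    return tokens[keep_idx], keep_idx
```

Because the kept subset ignores image content, it can discard exactly the regions that the paired text describes, which is the failure mode attentive masking targets.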
Attentive mask
- Goal of attentive masking is to keep tokens relevant to language description
- Representation of [CLS] token after CLIP training corresponds to semantics in associated alt-text
- Attention weights to other image tokens act as good indicator of relevance
- Score of token at location P is computed using query and key embedding
- Three strategies for selecting image tokens to mask: low, high, mixed
- Low strategy performs best and is set as default
- EMA network used to generate attention scores
- EMA computation done at reduced image resolution to save cost
- Shared EMA score map used for multiple masked views
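The scoring and selection steps above can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' code: the function names are hypothetical, the [CLS] token is assumed to sit at index 0, the attention weights are assumed to be already softmax-normalized and to come from the EMA network, and the "mixed" branch follows one reading of the summary (keep the top 25% of tokens by score and fill the remainder at random).

```python
import numpy as np

def cls_attention_scores(attn):
    """Average [CLS]-to-patch attention over all layers and heads.

    attn: array of shape (layers, heads, 1 + N, 1 + N) holding softmax
    attention weights, with [CLS] at index 0 (an assumption).
    Returns one relevance score per patch token, shape (N,).
    """
    cls_to_patches = attn[:, :, 0, 1:]       # (layers, heads, N)
    return cls_to_patches.mean(axis=(0, 1))  # (N,)

def select_tokens(scores, keep_ratio=0.5, strategy="low", top_ratio=0.25, rng=None):
    """Return the sorted indices of the tokens that are kept (not masked)."""
    rng = np.random.default_rng() if rng is None else rng
    n = scores.shape[0]
    num_keep = int(round(n * keep_ratio))
    order = np.argsort(-scores)  # most relevant token first
    if strategy == "low":
        # Mask the lowest-scoring tokens, keep the most relevant ones.
        keep = order[:num_keep]
    elif strategy == "high":
        # Mask the highest-scoring tokens, keep the least relevant ones.
        keep = order[n - num_keep:]
    elif strategy == "mixed":
        # Keep the top-scoring tokens, then fill up with random others.
        num_top = min(num_keep, int(round(n * top_ratio)))
        rest = rng.permutation(order[num_top:])[: num_keep - num_top]
        keep = np.concatenate([order[:num_top], rest])
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return np.sort(keep)
```

Since the score map depends only on the EMA network's output, it can be computed once per image and reused to draw several masked views.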
Overall framework
- A-CLIP is an attentive mask CLIP method
- A-CLIP takes multiple masked views for better results
- A-CLIP uses two auxiliary self-supervised tasks
- EMA inference is used for zero-shot classification and multi-modal retrieval
- A-CLIP-eff is a more efficient variant of A-CLIP
Experiments
Implementation details
- Trained model on 15M subset of YFCC100M
- Randomly sample caption for each image
- Data augmentation includes color jitter, grayscale, solarize and blur
- ViT-B/16 architecture used for visual encoders
- Text encoder uses 12 layers with 8-head multi-head attention and 512 embedding dimension
- AdamW optimizer with 5e-4 learning rate and 0.5 weight decay
- EMA model momentum starts from 0.996 and increases to 1
- A-CLIP-eff uses halved resolution image as input
- CLIP and SLIP use publicly available models and checkpoints
- MaskCLIP reproduced with 75% of tokens masked out and 0.999 to 0.9999 EMA momentum
- 4 nodes with 8 NVIDIA Tesla V100 GPUs used for training
- Speed test of different frameworks using single node of 8 NVIDIA A100 GPUs
- Zero-shot retrieval on COCO and Flickr30k
- Zero-shot transfer capacity evaluated on ImageNet and Caltech-101
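The EMA momentum setting listed above (starting at 0.996 and increasing to 1) could be realized as in the sketch below. The summary does not state the shape of the schedule, so the cosine ramp used here is an assumption borrowed from common practice in EMA-based self-supervised methods; the function names and the dictionary-of-arrays parameter format are likewise hypothetical.

```python
import math
import numpy as np

def ema_momentum(step, total_steps, base=0.996, final=1.0):
    # Cosine ramp from `base` at step 0 to `final` at the last step
    # (the schedule shape is an assumption, not taken from the paper).
    progress = step / max(total_steps, 1)
    return final - (final - base) * (math.cos(math.pi * progress) + 1.0) / 2.0

def ema_update(ema_params, online_params, momentum):
    # ema <- momentum * ema + (1 - momentum) * online, applied per parameter.
    return {
        name: momentum * ema_params[name] + (1.0 - momentum) * online_params[name]
        for name in ema_params
    }

# Usage: one update step on a toy parameter dictionary.
ema = {"w": np.zeros(3)}
online = {"w": np.ones(3)}
ema = ema_update(ema, online, ema_momentum(step=0, total_steps=1000))
```

As the momentum approaches 1, the EMA encoder changes more and more slowly, so the attention scores it produces become increasingly stable late in training.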
Main results
- CLIP with random masking reduces zero-shot accuracy by 2.6%
- Attentive mask strategy recaptures performance and surpasses original CLIP by +1.9%
- Low selection retains top 50% of most relevant tokens, achieving best results with +3.7% performance improvement
- High selection masks off most relevant 50% of tokens, resulting in drastic performance collapse
- Mix selection takes the top 25% most relevant tokens and selects the rest randomly; it does not match the performance of low selection
- EMA-eff uses half-resolution images to reduce computational overhead
- A-CLIP provides efficient paradigm for combining SSL with CLIP
- A-CLIP achieves +1.6%, +2.5%, and +3.9% gains over SLIP on 25-dataset suite with 25, 50, and 100 epochs training
- A-CLIP-eff achieves +5.1%, +9.4/5.9, and +7.2/3.8 gains over plain CLIP on ImageNet-1K, Flickr30K, and MS COCO
Ablation study and analysis
- Hypothesize that attentive mask input can alleviate over-fitting
- Adding stronger data augmentation can improve performance
- SSL(online) is more dependent on stronger augmentation
- SSL(online-EMA) can bring stable gains without color+blur
- A-CLIP can be extended for multiple views
- Using EMA for evaluation leads to performance gain
- Changing mask size to 32 improves performance
- Using all layers works better for attentive token selection
Conclusion
- Introduces A-CLIP framework for more efficient and effective CLIP training
- A-CLIP uses attentive masks for the image branch
- A-CLIP is flexible and can incorporate multiple masked views and auxiliary self-supervised tasks
- A-CLIP performs better than other CLIP improvements such as SLIP and MaskCLIP
- A-CLIP-eff is more efficient than the original CLIP method
- A-CLIP significantly outperforms other methods in terms of average accuracy and number of winning tracks
- A-CLIP benefits more from longer training than the original CLIP method
- A-CLIP uses an EMA update vision encoder to generate the attentive mask