Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Image token removal is an efficient augmentation strategy for reducing the cost of computing image features.
Removing a large portion of image tokens may discard the semantic content associated with a given text description.
We propose an attentive token removal approach for CLIP training, which retains tokens with a high semantic correlation to the text description.
Experiments show that the proposed approach performs better than the previous method of random token removal for CLIP training.
Compared to other CLIP improvements, our method is more effective and more efficient.

Paper Content

Introduction

CLIP and ALIGN models demonstrate zero-shot image classification and multi-modal retrieval capabilities
CLIP and ALIGN require large-scale data and a high training cost
This paper aims to improve the efficiency of CLIP training
Image token removal drops a large portion of image tokens and reduces computation
Token removal process can harm CLIP performance
Proposed attentive token removal strategy selects tokens to remove according to correlation scores
Averaged attention weights of all layers used to compute correlation scores
A-CLIP framework is efficient and effective
A-CLIP-eff is even more efficient and accurate

Related work

Goal of computer vision is to interpret visual signals using language
Recent works suggest a new way to better connect visual signals with linguistic semantics
Training data is more scalable with billions of image-alt-text pairs
CLIP model is a mainstream visual learning method
Masking tokens for efficient computation
Dynamic ViTs learn to remove tokens for efficient image classification
Masked autoencoder randomly masks 75% tokens
Our method improves both effectiveness and efficiency of CLIP pre-training
FLIP uses random masking for CLIP training
Our method uses attentive masking which removes semantically meaningless tokens
Our method introduces multiple masked image views
Masking for data augmentation
Combining CLIP with other representation learning methods

Method

A brief review of CLIP

CLIP is a visual representation learning approach that is trained on image-text pairs.
CLIP applies an InfoNCE-like loss to classify pairs as positive or negative.
CLIP uses a Vision Transformer and a language Transformer.
CLIP applies a vision-language contrastive loss to project the [CLS] and [EOS] features into an embedding space.
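
The contrastive objective summarized above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the function name `clip_contrastive_loss`, the fixed `temperature=0.07`, and the random toy features are assumptions made for the example (CLIP itself learns the temperature during training).

```python
import numpy as np

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired features.

    image_feats, text_feats: arrays of shape (B, D), where row i of each
    array comes from the same image-text pair. The temperature value is
    an assumption for this sketch.
    """
    # L2-normalize so that dot products are cosine similarities.
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B), diagonal = matched pairs

    def diagonal_cross_entropy(l):
        # Row-wise log-softmax, then read off the matched-pair entries.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Toy check: identical features are perfectly aligned pairs,
# while reversing the text batch mismatches every pair.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
aligned_loss = clip_contrastive_loss(feats, feats)
shuffled_loss = clip_contrastive_loss(feats, feats[::-1])
```

In the toy check the aligned batch should give a much lower loss than the mismatched one, which is the behaviour the loss is designed to reward.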

Masking for efficient CLIP training

Random masking method is effective for masked image modeling
MIM is mainly for pre-training, CLIP is for zero-shot classification and retrieval
Removing highly semantic visuals affects CLIP training more than MIM
Random masking with 50% mask rate reduces zero-shot accuracy on ImageNet-1K
Attentive masking method resolves issues of random masking
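
For reference, the random masking baseline discussed above can be sketched as follows. This is an illustrative NumPy sketch only: the helper name `random_token_keep` is hypothetical, and the 196-token grid assumes a 224x224 input to ViT-B/16, which the summary does not state.

```python
import numpy as np

def random_token_keep(tokens, keep_ratio=0.5, rng=None):
    """Keep a uniformly random subset of patch tokens.

    tokens: array of shape (N, D). Returns the kept tokens and their
    original indices (sorted, so spatial order is preserved).
    """
    if rng is None:
        rng = np.random.default_rng()
    n_tokens = tokens.shape[0]
    n_keep = int(n_tokens * keep_ratio)
    keep_idx = np.sort(rng.choice(n_tokens, size=n_keep, replace=False))
    return tokens[keep_idx], keep_idx

# Assumed setting: 14 x 14 = 196 patch tokens (ViT-B/16 at 224x224 input).
patches = np.arange(196 * 4, dtype=np.float32).reshape(196, 4)
kept, idx = random_token_keep(patches, keep_ratio=0.5, rng=np.random.default_rng(0))
```

Because the subset is chosen without looking at the image or the text, nothing prevents it from dropping exactly the patches the caption describes, which is the failure mode attentive masking is meant to address.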

Attentive mask

Goal of attentive masking is to keep tokens relevant to language description
Representation of [CLS] token after CLIP training corresponds to semantics in associated alt-text
Attention weights to other image tokens act as good indicator of relevance
Score of the token at location P is computed from the [CLS] query embedding and the key embedding of that token
Three strategies for selecting image tokens to mask: low, high, mixed
Low strategy performs best and is set as default
EMA network used to generate attention scores
EMA computation done at reduced image resolution to save cost
Shared EMA score map used for multiple masked views
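
A rough sketch of the scoring step described above, under stated assumptions: the function name `cls_attention_scores` and the tensor layout are hypothetical, the query and key embeddings are random stand-ins for what the EMA vision encoder would produce, and the softmax here runs over patch tokens only (a simplification of full self-attention, where [CLS] also attends to itself).

```python
import numpy as np

def cls_attention_scores(q_cls, k_patches):
    """Relevance score per patch token from [CLS] attention.

    q_cls:     (L, H, C)    [CLS] query embedding per layer and head
    k_patches: (L, H, N, C) key embeddings of the N patch tokens
    returns:   (N,)         attention weight averaged over layers and heads
    """
    channels = q_cls.shape[-1]
    # Scaled dot product between the [CLS] query and every patch key.
    logits = np.einsum("lhc,lhnc->lhn", q_cls, k_patches) / np.sqrt(channels)
    # Softmax over patch positions, separately for each layer and head.
    logits = logits - logits.max(axis=-1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    # Average over all layers and heads to get one score per token.
    return attn.mean(axis=(0, 1))

# Toy shapes: 2 layers, 3 heads, 16 patch tokens, 8 channels per head.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 3, 8))
k = rng.normal(size=(2, 3, 16, 8))
scores = cls_attention_scores(q, k)
```

Since the score map depends only on the EMA encoder's output for the image, it can be computed once and reused to draw several masked views, as noted above.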

Overall framework

A-CLIP is an attentive mask CLIP method
A-CLIP takes multiple masked views for better results
A-CLIP uses two auxiliary self-supervised tasks
EMA inference is used for zero-shot classification and multi-modal retrieval
A-CLIP-eff is a more efficient variant of A-CLIP
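
The EMA network used for score generation and for inference is maintained with the standard exponential moving average rule. The sketch below shows that rule on a plain dictionary of arrays; the function name `ema_update` and the parameter dictionary layout are assumptions for illustration, not the paper's code.

```python
import numpy as np

def ema_update(ema_params, online_params, momentum):
    """Return updated EMA parameters: ema <- m * ema + (1 - m) * online."""
    return {
        name: momentum * ema_params[name] + (1.0 - momentum) * online_params[name]
        for name in online_params
    }

# Toy example with a single three-element parameter.
ema = {"w": np.zeros(3)}
online = {"w": np.ones(3)}
ema = ema_update(ema, online, momentum=0.996)
```

With a momentum close to 1 the EMA weights move only slightly per step, so the score maps they produce change smoothly over training.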

Experiments

Implementation details

Trained model on 15M subset of YFCC100M
Randomly sample caption for each image
Data augmentation includes color jitter, grayscale, solarize and blur
ViT-B/16 architecture used for visual encoders
Text encoder uses 12 layers with 8-head multi-head attention and 512 embedding dimension
AdamW optimizer with 5e-4 learning rate and 0.5 weight decay
EMA model momentum starts from 0.996 and increases to 1
A-CLIP-eff uses halved resolution image as input
CLIP and SLIP use publicly available model and checkpoints
MaskCLIP reproduced with 75% of tokens masked out and 0.999 to 0.9999 EMA momentum
4 nodes with 8 NVIDIA Tesla V100 GPUs used for training
Speed test of different frameworks using single node of 8 NVIDIA A100 GPUs
Zero-shot retrieval on COCO and Flickr30k
Zero-shot transfer capacity evaluated on ImageNet and Caltech-101
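
The summary states only that the EMA momentum starts at 0.996 and increases to 1. One common way to realize such a ramp is a cosine schedule; the sketch below assumes that shape, and both the function name `ema_momentum` and the choice of cosine are assumptions rather than details confirmed by the source.

```python
import math

def ema_momentum(step, total_steps, base=0.996, final=1.0):
    """Cosine ramp from `base` at step 0 to `final` at the last step.

    The cosine shape is an assumption; the source gives only the
    start (0.996) and end (1) values.
    """
    progress = step / max(1, total_steps)
    return final - (final - base) * (math.cos(math.pi * progress) + 1.0) / 2.0

# Momentum at the start, midpoint, and end of a 100-step run.
schedule = [ema_momentum(s, 100) for s in (0, 50, 100)]
```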

Main results

CLIP with random masking reduces zero-shot accuracy by 2.6%
Attentive mask strategy recaptures performance and surpasses original CLIP by +1.9%
Low selection retains top 50% of most relevant tokens, achieving best results with +3.7% performance improvement
High selection masks off most relevant 50% of tokens, resulting in drastic performance collapse
Mix selection keeps the top 25% most relevant tokens and selects the rest randomly; it does not match the performance of low selection
EMA-eff uses half-resolution images to reduce computational overhead
A-CLIP provides efficient paradigm for combining SSL with CLIP
A-CLIP achieves +1.6%, +2.5%, and +3.9% gains over SLIP on 25-dataset suite with 25, 50, and 100 epochs training
A-CLIP-eff achieves +5.1%, +9.4/5.9, and +7.2/3.8 gains over plain CLIP on ImageNet-1K, Flickr30K, and MS COCO
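
The three selection strategies compared above can be sketched as one function over a per-token score vector. This is an illustration under assumptions: the name `select_kept_tokens` and the `top_ratio` parameter are hypothetical, and for the mix strategy the sketch assumes the remaining kept tokens are sampled uniformly from the tokens outside the top 25%.

```python
import numpy as np

def select_kept_tokens(scores, strategy="low", keep_ratio=0.5, top_ratio=0.25, rng=None):
    """Return sorted indices of the tokens that are kept (not masked).

    "low":  mask the low-score tokens, keep the most relevant ones.
    "high": mask the high-score tokens, keep the least relevant ones.
    "mix":  keep the top `top_ratio` of tokens, fill the rest at random
            (random fill from the remaining tokens is an assumption).
    """
    if rng is None:
        rng = np.random.default_rng()
    n_tokens = len(scores)
    n_keep = int(n_tokens * keep_ratio)
    order = np.argsort(scores)[::-1]  # most relevant first
    if strategy == "low":
        kept = order[:n_keep]
    elif strategy == "high":
        kept = order[n_tokens - n_keep:]
    elif strategy == "mix":
        n_top = int(n_tokens * top_ratio)
        rest = rng.choice(order[n_top:], size=n_keep - n_top, replace=False)
        kept = np.concatenate([order[:n_top], rest])
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return np.sort(kept)

# Toy scores where a higher index means a more relevant token.
toy_scores = np.arange(8.0)
low_kept = select_kept_tokens(toy_scores, "low")
high_kept = select_kept_tokens(toy_scores, "high")
mix_kept = select_kept_tokens(toy_scores, "mix", rng=np.random.default_rng(0))
```

On the toy scores, low selection keeps the four highest-scoring tokens, high selection keeps the four lowest, and mix always keeps the top two plus two random others.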

Ablation study and analysis

Hypothesize that attentive mask input can alleviate over-fitting
Adding stronger data augmentation can improve performance
SSL(online) is more dependent on stronger augmentation
SSL(online-EMA) can bring stable gains without color+blur
A-CLIP can be extended for multiple views
Using EMA for evaluation leads to performance gain
Changing mask size to 32 improves performance
Using all layers works better for attentive token selection

Conclusion

Introduces A-CLIP framework for more efficient and effective CLIP training
A-CLIP uses attentive masks for the image branch
A-CLIP is flexible and can incorporate multiple masked views and auxiliary self-supervised tasks
A-CLIP performs better than other CLIP improvements such as SLIP and MaskCLIP
A-CLIP-eff is more efficient than the original CLIP method
A-CLIP significantly outperforms other methods in terms of average accuracy and number of winning tracks
A-CLIP benefits more from longer training than the original CLIP method
A-CLIP uses an EMA-updated vision encoder to generate the attentive mask
