Abstract

  • Image token removal is an efficient augmentation strategy for reducing the cost of computing image features.
  • Removing a large portion of image tokens may discard the semantic content associated with a given text description.
  • The paper proposes an attentive token removal approach for CLIP training that retains tokens with a high semantic correlation to the text description.
  • Experiments show that the proposed approach performs better than the previous method of random token removal for CLIP training.
  • Compared to other CLIP improvements, our method is more effective and more efficient.

Paper Content

Introduction

  • CLIP and ALIGN models demonstrate zero-shot image classification and multi-modal retrieval capabilities
  • CLIP and ALIGN require large data and training cost
  • This paper aims to improve the efficiency of CLIP training
  • Image token removal drops a large portion of image tokens and reduces computation
  • Token removal process can harm CLIP performance
  • Proposed attentive token removal strategy selects tokens to remove according to correlation scores
  • Averaged attention weights of all layers used to compute correlation scores
  • A-CLIP framework is efficient and effective
  • A-CLIP-eff is even more efficient and accurate
  • Goal of computer vision is to interpret visual signals using language
  • Recent works suggest a new way to better connect visual signals with linguistic semantics
  • Training data is more scalable with billions of image-alt-text pairs
  • CLIP model is a mainstream visual learning method
  • Masking tokens for efficient computation
  • Dynamic ViTs learn to remove tokens for efficient image classification
  • Masked autoencoder randomly masks 75% of tokens
  • Our method improves both effectiveness and efficiency of CLIP pre-training
  • FLIP uses random masking for CLIP training
  • Our method uses attentive masking which removes semantically meaningless tokens
  • Our method introduces multiple masked image views
  • Masking for data augmentation
  • Combining CLIP with other representation learning methods

Method

A brief review of CLIP

  • CLIP is a visual representation learning approach that trains on image-text pairs.
  • CLIP applies an InfoNCE-like loss to classify pairs as positive or negative.
  • CLIP uses a Vision Transformer and a language Transformer.
  • CLIP projects the image [CLS] feature and the text [EOS] feature into a shared embedding space, where the vision-language contrastive loss is applied.
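
The contrastive objective reviewed above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the paper's implementation: the function names are hypothetical, the temperature is fixed at 0.07 here (an assumption; in practice it is usually a learned parameter), and the inputs stand in for the projected [CLS] and [EOS] features.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired features.

    image_feats[i] and text_feats[i] form the positive pair; every other
    combination in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)
    # logits[i][j]: temperature-scaled cosine similarity of image i and text j
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: each row should pick out its own caption.
    image_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: each column should pick out its own image.
    text_to_image = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)
```

With a batch whose image and text features are correctly paired, the loss is close to zero; swapping the captions between two images makes it large.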

Masking for efficient CLIP training

  • Random masking method is effective for masked image modeling
  • MIM is mainly for pre-training, CLIP is for zero-shot classification and retrieval
  • Removing semantically important image regions hurts CLIP training more than it hurts MIM
  • Random masking with 50% mask rate reduces zero-shot accuracy on ImageNet-1K
  • Attentive masking method resolves issues of random masking
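
As a point of reference for the random masking baseline discussed above, the following sketch drops image tokens uniformly at random. It assumes a 224x224 input split into 16x16 patches (an assumption for illustration; the summary does not state the input resolution), and both helper names are hypothetical.

```python
import random

def num_patch_tokens(image_size, patch_size=16):
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2

def random_token_removal(num_tokens, mask_ratio=0.5, seed=None):
    """Return the sorted indices of tokens kept after uniform random removal."""
    rng = random.Random(seed)
    num_keep = int(round(num_tokens * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_tokens), num_keep))

tokens = num_patch_tokens(224)   # 14 x 14 patch grid
kept = random_token_removal(tokens, mask_ratio=0.5, seed=0)
print(tokens, len(kept))         # prints "196 98"
```

Because the kept indices ignore image content, nothing prevents the tokens that carry the caption's semantics from being among those removed, which is the weakness the attentive strategy targets.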

Attentive mask

  • Goal of attentive masking is to keep tokens relevant to language description
  • Representation of [CLS] token after CLIP training corresponds to semantics in associated alt-text
  • Attention weights to other image tokens act as good indicator of relevance
  • Score of the token at location P is computed from the query embedding of [CLS] and the key embedding of that token
  • Three strategies for selecting image tokens to mask: low, high, mixed
  • Low strategy performs best and is set as default
  • EMA network used to generate attention scores
  • EMA computation done at reduced image resolution to save cost
  • Shared EMA score map used for multiple masked views
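
The scoring step described above can be illustrated as follows. This is a simplified sketch, not the paper's code: it assumes the per-layer, per-head [CLS] query vectors and patch key vectors have already been extracted from the EMA vision encoder, it takes the softmax over patch tokens only, and the function names and input layout are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cls_attention_scores(cls_queries, patch_keys):
    """Average [CLS]-to-patch attention over all layers and heads.

    cls_queries[l][h]   : query vector of [CLS] at layer l, head h
    patch_keys[l][h][p] : key vector of patch token p at layer l, head h
    Returns one relevance score per patch token.
    """
    num_layers = len(cls_queries)
    num_heads = len(cls_queries[0])
    num_patches = len(patch_keys[0][0])
    scores = [0.0] * num_patches
    for l in range(num_layers):
        for h in range(num_heads):
            q = cls_queries[l][h]
            scale = math.sqrt(len(q))  # scaled dot-product attention
            logits = [sum(a * b for a, b in zip(q, k)) / scale
                      for k in patch_keys[l][h]]
            for p, weight in enumerate(softmax(logits)):
                scores[p] += weight / (num_layers * num_heads)
    return scores
```

Tokens with the highest averaged score are the ones treated as most relevant to the text. Since the score map depends only on the EMA encoder's output, it can be computed once and reused for several masked views of the same image.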

Overall framework

  • A-CLIP is an attentive mask CLIP method
  • A-CLIP takes multiple masked views for better results
  • A-CLIP uses two auxiliary self-supervised tasks
  • EMA inference is used for zero-shot classification and multi-modal retrieval
  • A-CLIP-eff is a more efficient variant of A-CLIP

Experiments

Implementation details

  • Trained model on 15M subset of YFCC100M
  • Randomly sample caption for each image
  • Data augmentation includes color jitter, grayscale, solarize and blur
  • ViT-B/16 architecture used for visual encoders
  • Text encoder uses 12 layers with 8-head multi-head attention and 512 embedding dimension
  • AdamW optimizer with 5e-4 learning rate and 0.5 weight decay
  • EMA model momentum starts from 0.996 and increases to 1
  • A-CLIP-eff uses halved resolution image as input
  • CLIP and SLIP use publicly available model and checkpoints
  • MaskCLIP reproduced with 75% of tokens masked out and 0.999 to 0.9999 EMA momentum
  • 4 nodes with 8 NVIDIA Tesla V100 GPUs used for training
  • Speed test of different frameworks using single node of 8 NVIDIA A100 GPUs
  • Zero-shot retrieval on COCO and Flickr30k
  • Zero-shot transfer capacity evaluated on ImageNet and Caltech-101
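
The EMA momentum setting listed above (starting at 0.996 and increasing to 1) can be sketched as below. The summary does not state the shape of the schedule, so the cosine ramp used here is an assumption, and the function names and the flat list of parameters are hypothetical simplifications.

```python
import math

def ema_momentum(step, total_steps, base=0.996, final=1.0):
    """Cosine ramp from `base` at step 0 to `final` at the last step.

    The cosine shape is an assumption; only the two endpoints come from
    the implementation details.
    """
    progress = step / max(1, total_steps)
    return final - (final - base) * (math.cos(math.pi * progress) + 1.0) / 2.0

def ema_update(ema_params, online_params, momentum):
    """Move each EMA parameter toward the matching online parameter."""
    return [momentum * e + (1.0 - momentum) * o
            for e, o in zip(ema_params, online_params)]
```

As the momentum approaches 1, the EMA encoder changes more and more slowly, so the attention scores it produces become increasingly stable late in training.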

Main results

  • CLIP with random masking reduces zero-shot accuracy by 2.6%
  • Attentive mask strategy recaptures performance and surpasses original CLIP by +1.9%
  • Low selection retains top 50% of most relevant tokens, achieving best results with +3.7% performance improvement
  • High selection masks off most relevant 50% of tokens, resulting in drastic performance collapse
  • Mix selection keeps the top 25% most relevant tokens and selects the rest randomly, but does not match the performance of low selection
  • EMA-eff uses half-resolution images to reduce computational overhead
  • A-CLIP provides efficient paradigm for combining SSL with CLIP
  • A-CLIP achieves +1.6%, +2.5%, and +3.9% gains over SLIP on 25-dataset suite with 25, 50, and 100 epochs training
  • A-CLIP-eff achieves +5.1%, +9.4/5.9, and +7.2/3.8 gains over plain CLIP on ImageNet-1K, Flickr30K, and MS COCO

Ablation study and analysis

  • Hypothesize that attentive mask input can alleviate over-fitting
  • Adding stronger data augmentation can improve performance
  • SSL(online) is more dependent on stronger augmentation
  • SSL(online-EMA) can bring stable gains without color+blur
  • A-CLIP can be extended for multiple views
  • Using EMA for evaluation leads to performance gain
  • Changing mask size to 32 improves performance
  • Using all layers works better for attentive token selection

Conclusion

  • Introduces A-CLIP framework for more efficient and effective CLIP training
  • A-CLIP uses attentive masks for the image branch
  • A-CLIP is flexible and can incorporate multiple masked views and auxiliary self-supervised tasks
  • A-CLIP performs better than other CLIP improvements such as SLIP and MaskCLIP
  • A-CLIP-eff is more efficient than the original CLIP method
  • A-CLIP significantly outperforms other methods in terms of average accuracy and number of winning tracks
  • A-CLIP benefits more from longer training than the original CLIP method
  • A-CLIP uses an EMA-updated vision encoder to generate the attentive mask