Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Visual recognition in low-data regimes requires deep neural networks to learn generalized representations from limited training samples.
  • Recently, CLIP-based methods have shown promising few-shot performance.
  • A Cascade of Foundation models (CaFo) is proposed to incorporate diverse prior knowledge of various pre-training paradigms for better few-shot learning.
  • CaFo works by ‘Prompt, Generate, then Cache’ to blend the predictions from CLIP and DINO.

Paper Content

Introduction

  • Convolutional neural networks and transformers have been successful on a range of vision tasks.
  • Few-shot learning is a research hotspot for data-deficient and resource-finite scenarios.
  • Previous works have sought to enhance models' generalization capability.
  • CLIP, pre-trained on language-image pairs, has good zero-shot transfer ability.
  • CoOp, CLIP-Adapter and Tip-Adapter have been developed for few-shot classification.
  • Large-scale pre-training has strong representation ability which benefits few-shot learning.
  • CaFo is proposed to integrate pre-trained knowledge from multiple paradigms.
  • CaFo uses GPT-3 to produce prompts, DALL-E to generate images, and a cache model to adaptively ensemble predictions.
  • Experiments show CaFo achieves state-of-the-art performance without extra annotated data.
  • Vision models are conventionally pre-trained on ImageNet and then fine-tuned on downstream tasks
  • Self-supervised pre-training uses large-scale unlabeled datasets
  • Contrastive learning learns representations by contrasting positive and negative pairs
  • Language-supervised visual pre-training is a novel paradigm closer to natural visual understanding
  • Few-shot learning relies on transferability of trained neural networks
  • Existing methods adapt CLIP to various vision tasks

Cascade of foundation models

  • Four types of pre-training paradigms in CaFo
  • Cascade them by ‘Prompt, Generate, then Cache’

Different pre-training paradigms

  • Contrastive Vision-Language Pre-training maps vision and language into the same embedding space
  • Contrastive Vision Pre-training focuses on discrimination between different images
  • Generative Language Pre-training uses GPT-3 to produce CLIP’s prompts
  • Generative Vision-Language Pre-training uses DALL-E to generate synthetic images
  • Cache model uses two kinds of keys to adaptively fuse knowledge from CLIP and DINO
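
The cache model in the last bullet can be sketched as a key-value lookup: few-shot training features act as keys, their one-hot labels act as values, and a test feature is scored by its affinity to each key. The sketch below assumes the exponential affinity used by Tip-Adapter-style caches, exp(-beta * (1 - cosine similarity)); the value of `beta`, the function names, and the toy vectors are illustrative assumptions, and the features are assumed to be L2-normalized already.

```python
import math


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def cache_logits(query, keys, one_hot_values, beta=5.5):
    """Score a query against cached keys and accumulate their one-hot labels.

    query: unit-length feature of the test image.
    keys: unit-length features of the few-shot training images.
    one_hot_values: one-hot label for each key, in the same order.
    beta: assumed sharpness hyperparameter (illustrative value).
    """
    num_classes = len(one_hot_values[0])
    logits = [0.0] * num_classes
    for key, value in zip(keys, one_hot_values):
        affinity = math.exp(-beta * (1.0 - dot(query, key)))
        for class_index in range(num_classes):
            logits[class_index] += affinity * value[class_index]
    return logits


# Toy cache (hypothetical): one cached sample per class, shared labels.
shared_labels = [[1, 0], [0, 1]]
clip_keys = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for CLIP features
dino_keys = [[0.6, 0.8], [0.8, -0.6]]  # stand-in for DINO features

# Two kinds of keys, one set of values: each encoder yields its own logits.
clip_logits = cache_logits([1.0, 0.0], clip_keys, shared_labels)
dino_logits = cache_logits([0.6, 0.8], dino_keys, shared_labels)
```

Keeping the label values shared while swapping the keys is what lets the two encoders vote on the same class set before their predictions are fused.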

Prompt, generate, then cache

  • Introduction of CaFo with a pipeline of ‘Prompt, Generate, then Cache’

Adaptive inference

  • Extract two visual features from a test image
  • Acquire three predicted classification logits
  • Formulate logits using CLIP’s textual encoder, GPT-3’s created prompts, and query-key affinity matrix
  • Regard language-contrastive prediction as baseline
  • Calculate weights of two logits based on distribution similarity with baseline
  • Normalize scales of three classification logits
  • Calculate distribution similarities as ensemble weights
  • Obtain final ensemble logits using softmax function
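
The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: scale normalization is taken to be per-vector standardization (zero mean, unit variance), distribution similarity is taken to be the dot product between softmax distributions, and the ensemble adds the weighted cache logits to the zero-shot baseline. All function names and toy numbers are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def standardize(values):
    """Assumed scale normalization: zero mean and unit variance per logit vector."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance) or 1.0  # avoid division by zero for constant vectors
    return [(v - mean) / std for v in values]


def adaptive_ensemble(zero_shot, clip_cache, dino_cache):
    """Fuse two cache predictions, weighting each by agreement with the baseline."""
    baseline = standardize(zero_shot)
    candidates = [standardize(clip_cache), standardize(dino_cache)]
    baseline_prob = softmax(baseline)
    # Similarity of each candidate's class distribution to the baseline's.
    similarities = [
        sum(p * q for p, q in zip(softmax(candidate), baseline_prob))
        for candidate in candidates
    ]
    weights = softmax(similarities)
    ensemble = list(baseline)
    for weight, candidate in zip(weights, candidates):
        ensemble = [e + weight * c for e, c in zip(ensemble, candidate)]
    return ensemble, weights


# Toy logits (hypothetical): the CLIP cache agrees with the baseline, DINO does not.
ensemble_logits, ensemble_weights = adaptive_ensemble(
    zero_shot=[2.0, 0.5, 0.1],
    clip_cache=[1.8, 0.4, 0.2],
    dino_cache=[0.1, 0.3, 2.0],
)
```

In this toy case the prediction that disagrees with the language-contrastive baseline receives the smaller weight, so a single unreliable source cannot dominate the final logits.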

Experiments

Settings

  • 11 publicly available datasets used for few-shot experiments
  • CaFo integrates knowledge from pre-trained CLIP, DINO, DALL-E, and GPT-3
  • ResNet-50 used as visual encoder for CLIP
  • Different domain-specific textual templates used for DALL-E
  • Five simple templates used for GPT-3
  • Hyperparameters tuned using official validation sets

Performance

  • Comparison of CaFo with other CLIP-based adaptation methods on ImageNet
  • Evaluation of CaFo’s robustness to distribution shift
  • Larger K does not lead to better few-shot performance
  • Ablation of different ensemble methods of CLIP and DINO’s predictions during inference on ImageNet

Visualization

  • DALL-E can generate synthetic images from ImageNet, OxfordPets and Caltech101.
  • GPT-3 can produce semantically richer texts than CLIP’s handcrafted templates.

Conclusion

  • Propose CaFo, a cascade of foundation models
  • Incorporate GPT-3 for prompting CLIP
  • Adopt DALL-E to expand few-shot training data
  • Adaptively fuse vision-contrastive DINO with CLIP
  • Achieve state-of-the-art performance for few-shot learning on 11 datasets
  • Integrate more existing pre-trained knowledge in future
  • Unified four types of pre-training
  • Prompt, Generate, then Cache pipeline
  • Calculate distribution similarities between different classification logits
  • Visualization of GPT-3’s Prompts for CLIP
  • t-SNE Visualization
  • Efficiency Comparison on ImageNet
  • Quantitative Performance Comparison on ImageNet
  • Distribution Ablation Study of Cascaded Models
  • Ablation Study of Adaptive Inference
  • Ablation Study of Generated Number via DALL-E
  • Ablation Study of CLIP’s Visual Encoders
  • Performance Comparison on 10 Datasets
  • Ablation Study of Zero-shot CaFo via DALL-E on Different Datasets