Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Visual recognition in low-data regimes requires deep neural networks to learn generalized representations from limited training samples.
- Recently, CLIP-based methods have shown promising few-shot performance.
- A Cascade of Foundation models (CaFo) is proposed to incorporate diverse prior knowledge of various pre-training paradigms for better few-shot learning.
- CaFo works by ‘Prompt, Generate, then Cache’ to blend the predictions from CLIP and DINO.
Paper Content
Introduction
- Convolutional neural networks and transformers have been successful on a range of vision tasks.
- Few-shot learning is a research hotspot for data-deficient and resource-finite scenarios.
- Previous works have been proposed to enhance model’s generalization capability.
- CLIP pre-trained by language-image pairs has good zeroshot transfer ability.
- CoOp, CLIP-Adapter and Tip-Adapter have been developed for few-shot classification.
- Large-scale pre-training has strong representation ability which benefits few-shot learning.
- CaFo proposed to integrate pre-trained knowledge from multiple paradigms.
- CaFo uses GPT-3 to produce prompts, DALL-E to generate images, and a cache model to adaptively ensemble predictions.
- Experiments show CaFo achieves state-of-the-art without extra annotated data.
Related work
- Pre-training of vision models is based on ImageNet and fine-tuning on downstream tasks
- Self-supervised pre-training uses large-scale unlabeled datasets
- Contrast learning learns representations by contrasting positive and negative pairs
- Language-supervised visual pre-training is a novel paradigm closer to natural visual understanding
- Few-shot learning relies on transferability of trained neural networks
- Existing methods adapt CLIP to various vision tasks
Cascade of foundation models
- Four types of pretraining paradigms in CaFo
- Cascade them by ‘Prompt, Generate, then Cache’
Different pre-training paradigms
- Contrastive Vision-Language Pre-training maps vision and language into the same embedding space
- Contrastive Vision Pre-training focuses on discrimination between different images
- Generative Language Pre-training uses GPT-3 to produce CLIP’s prompts
- Generative Vision-Language Pre-training uses DALL-E to generate synthetic images
- Cache model uses two kinds of keys to adaptively fuse knowledge from CLIP and DINO
Prompt, generate, then cache
- Introduction of CaFo with a pipeline of ‘Prompt, Generate, then Cache’
Adaptive inference
- Extract two visual features from a test image
- Acquire three predicted classification logits
- Formulate logits using CLIP’s textual encoder, GPT-3’s created prompts, and query-key affinity matrix
- Regard language-contrastive prediction as baseline
- Calculate weights of two logits based on distribution similarity with baseline
- Normalize scales of three classification logits
- Calculate distribution similarities as ensemble weights
- Obtain final ensemble logits using softmax function
Experiments
Settings
- 11 publicly available datasets used for few-shot experiments
- CaFo integrates knowledge from pre-trained CLIP, DINO, DALL-E, and GPT-3
- ResNet-50 used as visual encoder for CLIP
- Different domain-specific textual templates used for DALL-E
- Five simple templates used for GPT-3
- Hyperparameters tuned using official validation sets
Performance
- Comparison of CaFo with other CLIP-based adaptation methods on ImageNet
- Evaluation of CaFo’s robustness to distribution shift
- Larger K does not lead to better few-shot performance
- Ablation of different ensemble methods of CLIP and DINO’s predictions during inference on ImageNet
Visualization
- DALL-E can generate synthetic images from ImageNet, OxfordPets and Caltech101.
- GPT-3 can produce more semantic texts than CLIP’s handcrafted templates.
Conclusion
- Propose CaFo, a cascade of foundation models
- Incorporate GPT-3 for prompting CLIP
- Adopt DALL-E to expand few-shot training data
- Adaptively fuse vision-contrastive DINO with CLIP
- Achieve state-of-the-art performance for few-shot learning on 11 datasets
- Integrate more existing pre-trained knowledge in future
- Unified four types of pre-training
- Prompt, Generate, then Cache pipeline
- Calculate distribution similarities between different classification logits
- Visualization of GPT-3’s Prompts for CLIP
- t-SNE Visualization
- Efficiency Comparison on ImageNet
- Quantative Performance Comparison on ImageNet
- Distribution Ablation Study of Cascaded Models
- Ablation Study of Adaptive Inference
- Ablation Study of Generated Number via DALL-E
- Ablation Study of CLIP’s Visual Encoders
- Performance Comparison on 10 Datasets
- Ablation Study of Zero-shot CaFo via DALL-E on Different Datasets