Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Visual recognition in low-data regimes requires deep neural networks to learn generalized representations from limited training samples.
Recently, CLIP-based methods have shown promising few-shot performance.
A Cascade of Foundation models (CaFo) is proposed to incorporate diverse prior knowledge of various pre-training paradigms for better few-shot learning.
CaFo works by ‘Prompt, Generate, then Cache’ to blend the predictions from CLIP and DINO.

Paper Content

Introduction

Convolutional neural networks and transformers have been successful on a range of vision tasks.
Few-shot learning is a research hotspot for data-deficient and resource-finite scenarios.
Previous works have been proposed to enhance model’s generalization capability.
CLIP pre-trained by language-image pairs has good zeroshot transfer ability.
CoOp, CLIP-Adapter and Tip-Adapter have been developed for few-shot classification.
Large-scale pre-training has strong representation ability which benefits few-shot learning.
CaFo proposed to integrate pre-trained knowledge from multiple paradigms.
CaFo uses GPT-3 to produce prompts, DALL-E to generate images, and a cache model to adaptively ensemble predictions.
Experiments show CaFo achieves state-of-the-art without extra annotated data.

Pre-training of vision models is based on ImageNet and fine-tuning on downstream tasks
Self-supervised pre-training uses large-scale unlabeled datasets
Contrast learning learns representations by contrasting positive and negative pairs
Language-supervised visual pre-training is a novel paradigm closer to natural visual understanding
Few-shot learning relies on transferability of trained neural networks
Existing methods adapt CLIP to various vision tasks

Cascade of foundation models

Four types of pretraining paradigms in CaFo
Cascade them by ‘Prompt, Generate, then Cache’

Different pre-training paradigms

Contrastive Vision-Language Pre-training maps vision and language into the same embedding space
Contrastive Vision Pre-training focuses on discrimination between different images
Generative Language Pre-training uses GPT-3 to produce CLIP’s prompts
Generative Vision-Language Pre-training uses DALL-E to generate synthetic images
Cache model uses two kinds of keys to adaptively fuse knowledge from CLIP and DINO

Prompt, generate, then cache

Introduction of CaFo with a pipeline of ‘Prompt, Generate, then Cache’

Adaptive inference

Extract two visual features from a test image
Acquire three predicted classification logits
Formulate logits using CLIP’s textual encoder, GPT-3’s created prompts, and query-key affinity matrix
Regard language-contrastive prediction as baseline
Calculate weights of two logits based on distribution similarity with baseline
Normalize scales of three classification logits
Calculate distribution similarities as ensemble weights
Obtain final ensemble logits using softmax function

Experiments

Settings

11 publicly available datasets used for few-shot experiments
CaFo integrates knowledge from pre-trained CLIP, DINO, DALL-E, and GPT-3
ResNet-50 used as visual encoder for CLIP
Different domain-specific textual templates used for DALL-E
Five simple templates used for GPT-3
Hyperparameters tuned using official validation sets

Performance

Comparison of CaFo with other CLIP-based adaptation methods on ImageNet
Evaluation of CaFo’s robustness to distribution shift
Larger K does not lead to better few-shot performance
Ablation of different ensemble methods of CLIP and DINO’s predictions during inference on ImageNet

Visualization

DALL-E can generate synthetic images from ImageNet, OxfordPets and Caltech101.
GPT-3 can produce more semantic texts than CLIP’s handcrafted templates.

Conclusion

Propose CaFo, a cascade of foundation models
Incorporate GPT-3 for prompting CLIP
Adopt DALL-E to expand few-shot training data
Adaptively fuse vision-contrastive DINO with CLIP
Achieve state-of-the-art performance for few-shot learning on 11 datasets
Integrate more existing pre-trained knowledge in future
Unified four types of pre-training
Prompt, Generate, then Cache pipeline
Calculate distribution similarities between different classification logits
Visualization of GPT-3’s Prompts for CLIP
t-SNE Visualization
Efficiency Comparison on ImageNet
Quantative Performance Comparison on ImageNet
Distribution Ablation Study of Cascaded Models
Ablation Study of Adaptive Inference
Ablation Study of Generated Number via DALL-E
Ablation Study of CLIP’s Visual Encoders
Performance Comparison on 10 Datasets
Ablation Study of Zero-shot CaFo via DALL-E on Different Datasets

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Cascade of foundation models#

Different pre-training paradigms#

Prompt, generate, then cache#

Adaptive inference#

Experiments#

Settings#

Performance#

Visualization#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Cascade of foundation models

Different pre-training paradigms

Prompt, generate, then cache

Adaptive inference

Experiments

Settings

Performance

Visualization

Conclusion