Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Effective robustness measures the extra out-of-distribution (OOD) robustness beyond what can be predicted from in-distribution (ID) performance
Existing evaluations typically use a single test set to evaluate in-distribution accuracy
Proposes a new evaluation metric to compare effective robustness of models trained on different data distributions
Controls for accuracy on multiple in-distribution test sets that cover the training distributions for all evaluated models

Paper Content

Introduction

Robustness against distribution shifts is important for machine learning models.
Taori et al. (2020) proposed the notion of effective robustness to control for in-distribution (ID) accuracy when evaluating out-of-distribution (OOD) accuracy.
The current definition of effective robustness requires a fixed ID test set.
Models from Contrastive Language-Image Pre-training (CLIP) have recently exhibited unprecedented effective robustness gains.
We propose to use multiple ID test sets to provide a more precise estimate of a model’s effective robustness.

Background of effective robustness

OOD accuracy is often correlated with ID accuracy
A linear trend between transformed ID accuracy and OOD accuracy holds across many datasets and models
For most models, higher OOD accuracy results naturally from better ID performance
Single-ID effective robustness subtracts the predicted OOD accuracy from the actual OOD accuracy
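
The single-ID definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes the "transformed" accuracies are logit-transformed, fits the baseline line by ordinary least squares, and uses made-up accuracy values. The function names are hypothetical.

```python
import numpy as np

def logit(p):
    """Map an accuracy in (0, 1) to the logit scale."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def expit(z):
    """Inverse of logit: map a logit-scale value back to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def fit_single_id_baseline(id_acc, ood_acc):
    """Least-squares line in logit space: logit(ood) ~ w * logit(id) + b."""
    w, b = np.polyfit(logit(id_acc), logit(ood_acc), deg=1)
    return w, b

def single_id_effective_robustness(id_acc, ood_acc, w, b):
    """Actual OOD accuracy minus the OOD accuracy predicted from ID accuracy."""
    predicted = expit(w * logit(id_acc) + b)
    return np.asarray(ood_acc, dtype=float) - predicted

# Made-up accuracies (fractions, not percentages) for four baseline models.
# They are constructed to lie exactly on a logit-linear trend.
id_acc = np.array([0.60, 0.70, 0.80, 0.90])
ood_acc = expit(0.9 * logit(id_acc) - 1.0)

w, b = fit_single_id_baseline(id_acc, ood_acc)
rho = single_id_effective_robustness(id_acc, ood_acc, w, b)
```

Models that sit on the fitted line get an effective robustness of about zero. A model whose OOD accuracy exceeds the prediction gets a positive value.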

Limitation of the single ID test set

Existing robustness evaluation uses a single ID test set for all models
Comparing models trained on different datasets is necessary to know if pre-training techniques yield effective robustness gains
Comparing zero-shot CLIP models and standard ImageNet models as an example
Two different ID test sets used: ImageNet and YFCC-15M
Strong linear trend between scaled ID accuracy and OOD accuracy for ImageNet and YFCC models
Mismatch between training data and ID test data leads to imprecise conclusions on effective robustness

Multi-ID effective robustness

Proposes a new way to evaluate effective robustness using multiple ID test sets
The proposed method uses two ID test sets
The baseline function predicts OOD accuracy from the two ID accuracies
The baseline function is linear on the logit scale
Multi-ID effective robustness is the actual OOD accuracy minus the OOD accuracy predicted from the two ID accuracies
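
The two-test-set baseline described above can be sketched as a plane fitted in logit space. This is a minimal illustration under stated assumptions: the plane is fitted by ordinary least squares, the accuracy values are made up, and the function names are hypothetical rather than taken from the paper's code.

```python
import numpy as np

def logit(p):
    """Map an accuracy in (0, 1) to the logit scale."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def expit(z):
    """Inverse of logit: map a logit-scale value back to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def fit_multi_id_baseline(id1_acc, id2_acc, ood_acc):
    """Least-squares plane: logit(ood) ~ w1 * logit(id1) + w2 * logit(id2) + b."""
    X = np.column_stack([logit(id1_acc), logit(id2_acc), np.ones(len(id1_acc))])
    coef, *_ = np.linalg.lstsq(X, logit(ood_acc), rcond=None)
    return coef  # array [w1, w2, b]

def multi_id_effective_robustness(id1_acc, id2_acc, ood_acc, coef):
    """Actual OOD accuracy minus the OOD accuracy predicted from both ID accuracies."""
    w1, w2, b = coef
    predicted = expit(w1 * logit(id1_acc) + w2 * logit(id2_acc) + b)
    return np.asarray(ood_acc, dtype=float) - predicted

# Made-up accuracies (fractions) of five models on two ID test sets.
# The OOD accuracies are constructed to lie exactly on a logit-space plane.
id1_acc = np.array([0.50, 0.70, 0.30, 0.60, 0.80])
id2_acc = np.array([0.20, 0.25, 0.40, 0.55, 0.35])
ood_acc = expit(0.6 * logit(id1_acc) + 0.3 * logit(id2_acc) - 0.5)

coef = fit_multi_id_baseline(id1_acc, id2_acc, ood_acc)
rho = multi_id_effective_robustness(id1_acc, id2_acc, ood_acc, coef)
```

With only one ID accuracy as input, a model trained on the other distribution can look more or less robust than it is. Adding the second accuracy as a regressor is what lets the baseline account for both training distributions.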

Experiments

Models

Fitting a baseline function requires models with diverse accuracies
Follow Taori et al. (2020) to train models with various proportions of data by subsampling
Combine examples from two datasets with different sampling ratios
Train standard classifiers on CIFAR-10 and ImageNet
Train CLIP models on YFCC-15M and LAION-15M
Discard models with ImageNet accuracy below 5%
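
The data-mixing and filtering steps above might look roughly like the following sketch. It assumes sampling without replacement, a fixed total training-set size, and a simple dictionary per model. The function names, the `imagenet_acc` key, and the toy data are hypothetical.

```python
import random

def mix_datasets(dataset_a, dataset_b, ratio_a, total_size, seed=0):
    """Draw total_size examples: a fraction ratio_a from dataset_a, the rest from dataset_b."""
    if not 0.0 <= ratio_a <= 1.0:
        raise ValueError("ratio_a must be in [0, 1]")
    n_a = round(ratio_a * total_size)
    n_b = total_size - n_a
    if n_a > len(dataset_a) or n_b > len(dataset_b):
        raise ValueError("not enough examples to sample without replacement")
    rng = random.Random(seed)
    mixed = rng.sample(dataset_a, n_a) + rng.sample(dataset_b, n_b)
    rng.shuffle(mixed)
    return mixed

def keep_usable_models(models, min_accuracy=0.05):
    """Drop models whose ImageNet accuracy is below the threshold (5% by default)."""
    return [m for m in models if m["imagenet_acc"] >= min_accuracy]

# Toy stand-ins for two training sets; each example is tagged with its source.
dataset_a = [("a", i) for i in range(100)]
dataset_b = [("b", i) for i in range(100)]
mixed = mix_datasets(dataset_a, dataset_b, ratio_a=0.25, total_size=8)

# Toy model records with made-up accuracies.
models = [
    {"name": "m1", "imagenet_acc": 0.03},
    {"name": "m2", "imagenet_acc": 0.41},
]
kept = keep_usable_models(models)
```

Sweeping `ratio_a` and `total_size` is one way to obtain baseline models that spread out along both ID accuracy axes.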

Test sets

Focus on image classification
Use labeled image classification datasets for evaluating ID accuracy
Automatically generate classification labels for datasets without original labels
Use 3 CIFAR-like OOD test sets with natural distribution shifts
Use 4 ImageNet-like OOD test sets
Use class subsampling and mapping
Single-ID evaluation suggests advantage of ImageNet models
Multi-ID evaluation resolves confounder and shows no advantage of ImageNet models
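
Class subsampling and mapping, mentioned in the list above, can be illustrated with a small sketch. It assumes the model outputs logits over a source label space and that each target class is scored by the maximum logit among the source classes mapped to it. That aggregation rule, the function names, and the toy numbers are assumptions, not details taken from the paper.

```python
import numpy as np

def map_logits(logits, class_groups):
    """Collapse source-class logits into target-class scores.

    class_groups[k] lists the source class indices mapped to target class k.
    Source classes that appear in no group are dropped (class subsampling).
    Each target score is the maximum logit within its group (an assumption).
    """
    logits = np.asarray(logits, dtype=float)
    return np.stack([logits[:, group].max(axis=1) for group in class_groups], axis=1)

def mapped_accuracy(logits, labels, class_groups):
    """Accuracy in the target label space after subsampling and mapping."""
    preds = map_logits(logits, class_groups).argmax(axis=1)
    return float((preds == np.asarray(labels)).mean())

# Toy setup: 5 source classes, 2 target classes; source class 4 is dropped.
class_groups = [[0, 1], [2, 3]]
logits = np.array([
    [2.0, 0.1, 0.3, 0.0, 9.0],  # best kept source class is 0 -> target 0
    [0.0, 0.2, 0.1, 1.5, 0.0],  # best kept source class is 3 -> target 1
    [0.1, 3.0, 0.2, 0.4, 0.0],  # best kept source class is 1 -> target 0
])
labels = [0, 1, 1]
acc = mapped_accuracy(logits, labels, class_groups)
```

In the first row the largest logit belongs to a dropped class, so it is ignored and the prediction comes from the remaining classes.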

Evaluation on CIFAR-like OOD test sets

Visualized multi-ID effective robustness on CIFAR-10.2
Models have similar effective robustness, and their OOD accuracy can be predicted using a simple plane
Single-ID evaluation yields contradictory conclusions on effective robustness
Multi-ID evaluation provides a more holistic view
Multi-ID evaluation improves fitting quality and better predicts OOD accuracies from ID accuracies
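
Fitting quality is summarized elsewhere in this document by R² and mean absolute error (MAE). The sketch below shows how the two metrics could be computed for a baseline's predictions. Whether the paper computes them on raw or logit-transformed accuracies is not stated here, so raw accuracies are assumed, and all numbers are made up for illustration.

```python
import numpy as np

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted OOD accuracy."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

# Made-up OOD accuracies and the predictions of two hypothetical baselines.
actual = np.array([0.40, 0.50, 0.60, 0.70])
single_id_pred = np.array([0.45, 0.45, 0.65, 0.65])
multi_id_pred = np.array([0.41, 0.49, 0.61, 0.69])

single_r2 = r_squared(actual, single_id_pred)
multi_r2 = r_squared(actual, multi_id_pred)
single_mae = mean_absolute_error(actual, single_id_pred)
multi_mae = mean_absolute_error(actual, multi_id_pred)
```

A better-fitting baseline has R² closer to 1 and MAE closer to 0, which in turn pulls the effective robustness values of the fitted models toward zero.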

Evaluation on ImageNet-like OOD test sets

Models evaluated have similar effective robustness
Improvement of fitting quality is significant for models involving LAION
YFCC and LAION models have positive effective robustness values
Multi-ID evaluation suggests all models have similar effective robustness

Evaluation on additional models

Evaluated additional models not used in fitting baseline functions
Downloaded models pre-trained by existing works
Used mean absolute error (MAE) to measure fitting quality
Multi-ID evaluation reduces MAE and more accurately predicts OOD accuracy
Effective robustness values of models become closer to 0
SLIP models and Wise-FT models achieve higher average effective robustness

Related work

Earlier works have observed linear correlations between ID and OOD performance
Taori et al. proposed to evaluate effective robustness by controlling for ID accuracy
Miller et al. validated accuracy-on-the-line with a broader scope
ID accuracy and OOD accuracy can sometimes inversely correlate
Baek et al. proposed agreement-on-the-line which does not require labeled data
We propose accuracy-on-the-plane using multiple ID test sets
CLIP-like models with language-image pre-training have been studied and shown to achieve exceptional effective robustness
Fang et al. and Nguyen et al. suggested pre-training data could determine effective robustness gains
Under the proposed multi-ID evaluation, zero-shot CLIP models do not show effective robustness gains
Kumar et al., Andreassen et al., and Wortsman et al. studied the robustness of fine-tuned CLIP models
Devillers et al. and Santurkar et al. studied the transfer performance of CLIP models

Conclusion

Proposed a new and more precise effective robustness evaluation for models with different training data
OOD accuracy can be better predicted from multiple ID accuracies
Effective robustness of zero-shot CLIP models trained on language-image data is similar to that of standard ImageNet models
Limitations: the evaluation focuses on models which do not significantly alter the training distribution, and “accuracy-on-the-line” does not hold for some models
Future work: efficiently generalize the evaluation to compare models trained on more than two datasets
Visualization of multi-ID effective robustness on various test sets
Fitting quality of single-ID and multi-ID effective robustness evaluated by R² and mean absolute error
Single-ID and multi-ID effective robustness on CIFAR-like OOD test sets
Fitting quality and effective robustness for downloaded and fine-tuned models involving YFCC and LAION
