Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Effective robustness measures the extra out-of-distribution (OOD) robustness beyond what can be predicted from in-distribution (ID) performance
  • Existing evaluations typically use a single test set to evaluate in-distribution accuracy
  • Proposes a new evaluation metric to compare effective robustness of models trained on different data distributions
  • Controls for accuracy on multiple in-distribution test sets that cover the training distributions for all evaluated models

Paper Content

Introduction

  • Robustness against distribution shifts is important for machine learning models.
  • Taori et al. (2020) proposed the notion of effective robustness to control for in-distribution (ID) accuracy when evaluating out-of-distribution (OOD) accuracy.
  • Current definition of effective robustness requires a fixed ID test set.
  • Models trained with Contrastive Language-Image Pre-training (CLIP) have recently exhibited unprecedented effective robustness gains.
  • We propose to use multiple ID test sets to provide a more precise estimate of a model’s effective robustness.

Background of effective robustness

  • OOD accuracy is often correlated with ID accuracy
  • A linear trend between transformed ID accuracy and OOD accuracy holds across many datasets and models
  • For most models, higher OOD accuracy simply results from better ID performance
  • Single-ID effective robustness subtracts the predicted OOD accuracy from the actual OOD accuracy
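
The single-ID definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming the "transformed" accuracy is the logit transform and the baseline is an ordinary least-squares line; the function names and the synthetic accuracies are hypothetical and do not come from the paper.

```python
import numpy as np

def logit(p):
    """Map an accuracy in (0, 1) to the logit scale."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of the logit transform."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def fit_single_id_baseline(id_acc, ood_acc):
    """Least-squares line: logit(ood) ~ w * logit(id) + b."""
    w, b = np.polyfit(logit(id_acc), logit(ood_acc), deg=1)
    return w, b

def single_id_effective_robustness(id_acc, ood_acc, w, b):
    """Actual OOD accuracy minus the OOD accuracy predicted from ID accuracy."""
    predicted = sigmoid(w * logit(id_acc) + b)
    return np.asarray(ood_acc, dtype=float) - predicted

# Synthetic accuracies, made up for illustration only.
id_acc = np.array([0.50, 0.60, 0.70, 0.80, 0.90])
ood_acc = sigmoid(0.9 * logit(id_acc) - 0.8)  # models lying exactly on the line

w, b = fit_single_id_baseline(id_acc, ood_acc)
rho = single_id_effective_robustness(id_acc, ood_acc, w, b)
```

Models that sit on the fitted line get an effective robustness of roughly zero; a model above the line gets a positive value.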

Limitation with the single ID test set

  • Existing robustness evaluation uses a single ID test set for all models
  • Comparing models trained on different datasets is necessary to know if pre-training techniques yield effective robustness gains
  • Comparing zero-shot CLIP models and standard ImageNet models as an example
  • Two different ID test sets used: ImageNet and YFCC-15M
  • Strong linear trend between scaled ID accuracy and OOD accuracy for ImageNet and YFCC models
  • Mismatch between training data and ID test data leads to imprecise conclusions on effective robustness

Multi-ID effective robustness

  • Proposes a new way to evaluate effective robustness using multiple ID test sets
  • Use two ID test sets in proposed method
  • Baseline function predicts OOD accuracy based on two ID accuracies
  • Baseline function is linear in the logit-transformed accuracies
  • Multi-ID effective robustness defined using two ID accuracies
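
As a rough sketch of the two-ID baseline described above, the following Python fits a plane in logit space and subtracts its prediction from the actual OOD accuracy. It assumes an ordinary least-squares fit; the function names, coefficients, and accuracies are hypothetical and only illustrate the idea.

```python
import numpy as np

def logit(p):
    """Map an accuracy in (0, 1) to the logit scale."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of the logit transform."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def fit_multi_id_baseline(id1_acc, id2_acc, ood_acc):
    """Least-squares plane: logit(ood) ~ w1 * logit(id1) + w2 * logit(id2) + b."""
    X = np.column_stack([logit(id1_acc), logit(id2_acc), np.ones(len(id1_acc))])
    coef, *_ = np.linalg.lstsq(X, logit(ood_acc), rcond=None)
    return coef  # (w1, w2, b)

def multi_id_effective_robustness(id1_acc, id2_acc, ood_acc, coef):
    """Actual OOD accuracy minus the accuracy predicted from both ID accuracies."""
    w1, w2, b = coef
    predicted = sigmoid(w1 * logit(id1_acc) + w2 * logit(id2_acc) + b)
    return np.asarray(ood_acc, dtype=float) - predicted

# Synthetic accuracies, made up for illustration only.
id1 = np.array([0.30, 0.50, 0.70, 0.80, 0.60, 0.40])
id2 = np.array([0.60, 0.40, 0.50, 0.30, 0.70, 0.20])
ood = sigmoid(0.5 * logit(id1) + 0.4 * logit(id2) - 0.3)  # exactly on a plane
coef = fit_multi_id_baseline(id1, id2, ood)

# A hypothetical extra model that beats the plane by 5 accuracy points.
new_id1, new_id2 = np.array([0.75]), np.array([0.65])
new_ood = sigmoid(0.5 * logit(new_id1) + 0.4 * logit(new_id2) - 0.3) + 0.05
rho_new = multi_id_effective_robustness(new_id1, new_id2, new_ood, coef)
```

Because the baseline takes both ID accuracies as inputs, a model is not credited with robustness merely because one of the two ID test sets is mismatched with its training data.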

Experiments

Models

  • Fitting a baseline function requires models with diverse accuracies
  • Follow Taori et al. (2020) to train models with various proportions of data by subsampling
  • Combine examples from two datasets with different sampling ratios
  • Train standard classifiers on CIFAR-10 and ImageNet
  • Train CLIP models on YFCC-15M and LAION-15M
  • Discard models with ImageNet accuracy below 5%
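
The subsampling and mixing steps above might look like the following sketch. It assumes each dataset is addressed by integer indices and that a mixture draws a fixed total number of examples, a fraction `ratio` from the first dataset and the remainder from the second. The function name, dataset sizes, and ratio are hypothetical; this summary does not give the paper's exact sampling procedure.

```python
import numpy as np

def sample_mixture(n_a, n_b, total, ratio, seed=0):
    """Return indices into dataset A and dataset B for one mixed training set.

    `ratio` is the fraction of the `total` examples drawn from dataset A;
    the rest are drawn from dataset B. Sampling is without replacement.
    """
    rng = np.random.default_rng(seed)
    k_a = int(round(total * ratio))
    k_b = total - k_a
    if k_a > n_a or k_b > n_b:
        raise ValueError("requested more examples than a dataset contains")
    idx_a = rng.choice(n_a, size=k_a, replace=False)
    idx_b = rng.choice(n_b, size=k_b, replace=False)
    return idx_a, idx_b

# Hypothetical sizes: 25% of a 10,000-example mixture comes from dataset A.
idx_a, idx_b = sample_mixture(n_a=50_000, n_b=80_000, total=10_000, ratio=0.25)
```

Varying `total` and `ratio` across training runs is one way to obtain models that spread out over both ID accuracies, which the baseline fit needs.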

Test sets

  • Focus on image classification
  • Use labeled image classification datasets for evaluating ID accuracy
  • Automatically generate classification labels for datasets without original labels
  • Use 3 CIFAR-like OOD test sets with natural distribution shifts
  • Use 4 ImageNet-like OOD test sets
  • Use class subsampling and mapping
  • Single-ID evaluation suggests advantage of ImageNet models
  • Multi-ID evaluation resolves confounder and shows no advantage of ImageNet models
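
The "class subsampling and mapping" step in the list above can be illustrated with a small sketch. It assumes a many-to-one mapping from the model's classes to the test set's classes, takes the maximum logit among the source classes mapped to each target class, and ignores unmapped classes. This is one common way to do it and is not confirmed to be the paper's procedure; all names and numbers are hypothetical.

```python
import numpy as np

def map_logits(source_logits, class_map, num_target_classes):
    """Collapse model logits onto the target label space.

    source_logits: array of shape (n_examples, n_source_classes)
    class_map: dict {source_class_index: target_class_index}
    Source classes missing from class_map are dropped (class subsampling).
    """
    source_logits = np.asarray(source_logits, dtype=float)
    target = np.full((source_logits.shape[0], num_target_classes), -np.inf)
    for src, tgt in class_map.items():
        target[:, tgt] = np.maximum(target[:, tgt], source_logits[:, src])
    return target

def subsampled_accuracy(source_logits, labels, class_map, num_target_classes):
    """Accuracy after restricting predictions to the mapped classes."""
    preds = map_logits(source_logits, class_map, num_target_classes).argmax(axis=1)
    return float(np.mean(preds == np.asarray(labels)))

# Toy example: 5 source classes, 2 target classes; source classes 2 and 4 are unmapped.
class_map = {0: 0, 1: 0, 3: 1}
logits = [[0.1, 2.0, 5.0, 1.0, 0.0],
          [0.0, 0.2, 9.0, 0.7, 3.0]]
labels = [0, 1]
acc = subsampled_accuracy(logits, labels, class_map, num_target_classes=2)
```

In the toy example the largest raw logit belongs to an unmapped class in both rows, so it is ignored and the prediction is taken among the mapped classes only.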

Evaluation on CIFAR-like OOD test sets

  • Visualized multi-ID effective robustness on CIFAR-10.2
  • Models have similar effective robustness, and their OOD accuracy can be predicted using a simple plane
  • Single-ID evaluation yields contradictory conclusions on effective robustness
  • Multi-ID evaluation provides a more holistic view
  • Multi-ID evaluation improves fitting quality and better predicts OOD accuracies from ID accuracies
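
To make "fitting quality" concrete, the sketch below compares a single-ID fit with a multi-ID fit on synthetic data, using R² and mean absolute error (MAE), the two measures this summary names later. It assumes least-squares fits in logit space and metrics computed on the accuracy scale. The data is generated so that OOD accuracy depends on both ID accuracies, so the numbers are illustrative and are not results from the paper.

```python
import numpy as np

def logit(p):
    """Map an accuracy in (0, 1) to the logit scale."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of the logit transform."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def fit_predict(id_accs, ood_acc):
    """Fit logit(ood) on the logits of one or more ID accuracies; return predictions."""
    X = np.column_stack([logit(a) for a in id_accs] + [np.ones(len(ood_acc))])
    coef, *_ = np.linalg.lstsq(X, logit(ood_acc), rcond=None)
    return sigmoid(X @ coef)

def r_squared(actual, predicted):
    """Coefficient of determination."""
    actual = np.asarray(actual, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mean_absolute_error(actual, predicted):
    return float(np.mean(np.abs(np.asarray(actual, dtype=float) - predicted)))

# Synthetic models whose OOD accuracy depends on both ID accuracies.
rng = np.random.default_rng(0)
id1 = rng.uniform(0.2, 0.9, size=20)
id2 = rng.uniform(0.2, 0.9, size=20)
ood = sigmoid(0.5 * logit(id1) + 0.5 * logit(id2) - 0.3)

pred_single = fit_predict([id1], ood)       # baseline from one ID test set
pred_multi = fit_predict([id1, id2], ood)   # baseline from both ID test sets

r2_single, r2_multi = r_squared(ood, pred_single), r_squared(ood, pred_multi)
mae_single = mean_absolute_error(ood, pred_single)
mae_multi = mean_absolute_error(ood, pred_multi)
```

On data like this, the single-ID fit leaves residuals that the second ID accuracy explains, which is the kind of improvement in fitting quality the bullets above describe.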

Evaluation on ImageNet-like OOD test sets

  • Models evaluated have similar effective robustness
  • Improvement of fitting quality is significant for models involving LAION
  • YFCC and LAION models have positive effective robustness values
  • Multi-ID evaluation suggests all models have similar effective robustness

Evaluation on additional models

  • Evaluated additional models not used in fitting baseline functions
  • Downloaded models pre-trained by existing works
  • Used MAE to measure fitting quality
  • Multi-ID evaluation reduces MAE and more accurately predicts OOD accuracy
  • Effective robustness values of models become closer to 0
  • SLIP models and Wise-FT models achieve higher average effective robustness

Related work

  • Earlier works have observed linear correlations between ID and OOD performance
  • Taori et al. proposed to evaluate effective robustness by controlling for ID accuracy
  • Miller et al. validated accuracy-on-the-line with a broader scope
  • ID accuracy and OOD accuracy can sometimes inversely correlate
  • Baek et al. proposed agreement-on-the-line which does not require labeled data
  • We propose accuracy-on-the-plane using multiple ID test sets
  • CLIP-like models with language-image pre-training have been studied and shown to achieve exceptional effective robustness
  • Fang et al. and Nguyen et al. suggested pre-training data could determine effective robustness gains
  • Under multi-ID evaluation, zero-shot CLIP models do not show effective robustness gains over standard ImageNet models
  • Kumar et al., Andreassen et al., and Wortsman et al. studied the robustness of fine-tuned CLIP models
  • Devillers et al. and Santurkar et al. studied the transfer performance of CLIP models

Conclusion

  • Proposed a new and more precise effective robustness evaluation for models with different training data
  • OOD accuracy can be better predicted from multiple ID accuracies
  • Effective robustness of zero-shot CLIP models trained on language-image data is similar to that of standard ImageNet models
  • Limitations: the evaluation focuses on models that do not significantly alter the training distribution; “accuracy-on-the-line” does not hold for some models
  • Future work: efficiently generalizing the evaluation to compare models trained on more than two datasets
  • Visualization of multi-ID effective robustness on various test sets
  • Fitting quality of single-ID and multi-ID effective robustness evaluated by R² and mean absolute error
  • Single-ID and multi-ID effective robustness on CIFAR-like OOD test sets
  • Fitting quality and effective robustness for downloaded and fine-tuned models involving YFCC and LAION