Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Measures extra out-of-distribution robustness beyond what can be predicted from in-distribution performance
- Existing evaluations typically use a single test set to evaluate in-distribution accuracy
- Proposes a new evaluation metric to compare effective robustness of models trained on different data distributions
- Controls for accuracy on multiple in-distribution test sets that cover the training distributions for all evaluated models
Paper Content
Introduction
- Robustness against distribution shifts is important for machine learning models.
- Taori et al. (2020) proposed the notion of effective robustness to control for in-distribution (ID) accuracy when evaluating out-of-distribution (OOD) accuracy.
- Current definition of effective robustness requires a fixed ID test set.
- Models from Contrastive Language-Image Pre-training have recently exhibited unprecedented effective robustness gains.
- We propose to use multiple ID test sets to provide a more precise estimate of a model’s effective robustness.
Background of effective robustness
- OOD accuracy is often correlated with ID accuracy
- A linear trend between transformed ID accuracy and OOD accuracy holds across many datasets and models
- Most models with higher OOD accuracies are naturally resulted from better ID performance
- Single-ID effective robustness subtracts the predicted OOD accuracy from the actual OOD accuracy
Limitation with the single id test set
- Existing robustness evaluation uses a single ID test set for all models
- Comparing models trained on different datasets is necessary to know if pre-training techniques yield effective robustness gains
- Comparing zero-shot CLIP models and standard ImageNet models as an example
- Two different ID test sets used: ImageNet and YFCC-15M
- Strong linear trend between scaled ID accuracy and OOD accuracy for ImageNet and YFCC models
- Mismatch between training data and ID test data leads to imprecise conclusions on effective robustness
Multi-id effective robustness
- Proposed a new way for effective robustness evaluation using multiple ID test sets
- Use two ID test sets in proposed method
- Baseline function predicts OOD accuracy based on two ID accuracies
- Baseline function is a linear function under logit scale
- Multi-ID effective robustness defined using two ID accuracies
Experiments
Models
- Need to fit a baseline function, need models with diverse accuracies
- Follow Taori et al. (2020) to train models with various proportions of data by subsampling
- Combine examples from two datasets with different sampling ratios
- Train standard classifiers on CIFAR-10 and ImageNet
- Train CLIP models on YFCC-15M and LAION-15M
- Discard models with ImageNet accuracy below 5%
Test sets
- Focus on image classification
- Use labeled image classification datasets for evaluating ID accuracy
- Automatically generate classification labels for datasets without original labels
- Use 3 CIFAR-like OOD test sets with natural distribution shifts
- Use 4 ImageNet-like OOD test sets
- Use class subsampling and mapping
- Single-ID evaluation suggests advantage of ImageNet models
- Multi-ID evaluation resolves confounder and shows no advantage of ImageNet models
Evaluation on cifar-like ood test sets
- Visualized multi-ID effective robustness on CIFAR-10.2
- Models have similar effective robustness and can be predicted using a simple plane
- Single-ID evaluation yields contradictory conclusions on effective robustness
- Multi-ID evaluation provides a more holistic view
- Multi-ID evaluation improves fitting quality and better predicts OOD accuracies from ID accuracies
Evaluation on imagenet-like ood test sets
- Models evaluated have similar effective robustness
- Improvement of fitting quality is significant for models involving LAION
- YFCC and LAION models have positive effective robustness values
- Multi-ID evaluation suggests all models have similar effective robustness
Evaluation on additional models
- Evaluated additional models not used in fitting baseline functions
- Downloaded models pre-trained by existing works
- Used MAE to measure fitting quality
- Multi-ID evaluation reduces MAE and more accurately predicts OOD accuracy
- Effective robustness values of models become closer to 0
- SLIP models and Wise-FT models achieve higher average effective robustness
Related work
- Earlier works have observed linear correlations between ID and OOD performance
- Taori et al. proposed to evaluate effective robustness by controlling for ID accuracy
- Miller et al. validated accuracy-on-the-line with a broader scope
- ID accuracy and OOD accuracy can sometimes inversely correlate
- Baek et al. proposed agreement-on-the-line which does not require labeled data
- We propose accuracy-on-the-plane using multiple ID test sets
- CLIP-like models with language image pre-training have been studied and shown to achieve exceptional effective robustness
- Fang et al. and Nguyen et al. suggested pre-training data could determine effective robustness gains
- Zero-shot CLIP models do not have effective robustness gains
- Kumar et al., Andreassen et al., and Wortsman et al. studied the robustness of fine-tuned CLIP models
- Devillers et al. and Santurkar et al. studied the transfer performance of CLIP models
Conclusion
- Proposed a new and more precise effective robustness evaluation for models with different training data
- OOD accuracy can be better predicted from multiple ID accuracies
- Effective robustness of zero-shot CLIP models trained on language-image data is similar to that of standard ImageNet models
- Limitations: focus on models which do not significantly alter the training distribution, “accuracy-on-the-line” does not hold for some models
- Future work: compare models on more than two datasets, efficiently generalizing
- Visualization of multi-ID effective robustness on various test sets
- Fitting quality of single-ID and multi-ID effective robustness evaluated by R2 and mean absolute error
- Single-ID and multi-ID effective robustness on CIFAR-like OOD test sets
- Fitting quality and effective robustness for downloaded and fine-tuned models involving YFCC and LAION