Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Ensembling independent deep neural networks (DNNs) can improve top-line metrics and outperform larger single models.
  • Ensembling can improve subgroup performances, such as worst-k and minority group performance.
  • Gains in performance from ensembling for the minority group continue for longer than for the majority group.
  • Ensembling can be a powerful tool for alleviating disparate impact from DNN classifiers.
  • Varying sources of stochasticity can result in different fairness outcomes.

Paper Content

Introduction

  • Deep Neural Networks (DNNs) are powerful function approximators
  • Model ensembling is a popular recipe to boost performance
  • Understanding performance on subgroups is important for fairness
  • Ensembling improves aggregate performance and efficiency
  • Ensembling disproportionately benefits bottom-k and minority groups
  • Ensembling self-corrects bias from aggregation

Preliminaries

  • DNNs are mappings with trainable weights
  • Training dataset consists of data points
  • Weights are optimized by minimizing an objective function
  • Ensemble of m classification models
  • Each model is trained with an empirical risk minimization objective
  • Consider impact of ensembling on balanced and imbalanced subgroups
  • Error rates for under-represented groups can be higher
  • Models can reflect social biases

Experimental set-up

  • Evaluated methodology on CIFAR100 and TinyImagenet datasets
  • Used Resnet9/18/34/50, VGG16 and MLP-Mixer architectures
  • Trained 20 models independently
  • Calculated class accuracy on base model and found best/worst 10 performing classes
  • Constructed minority group by modifying CIFAR10 training set
  • Observed accuracy gain for different DNN architectures as number of models in ensemble grows
  • Established that ensembles of DNNs with same architecture and hyperparameters benefit minority/bottom-k group
  • Performed ablation study to observe fairness in deep ensembles

Ensembling provides disproportionate gains to bottom-k and minority classes

  • Ensembling disproportionately benefits bottom-k performance
  • Maximum gain of 55% for bottom-k compared to 5% for top-k
  • Maximum gain of 12.92% absolute accuracy gain for bottom-k over all architectures
  • Minority group benefits more than majority from ensembling
  • Bottom-k gains plateau far slower than top-k

Fair ensemble: improved robustness

  • Ensembling improves uncertainty estimates and can be beneficial for OOD.
  • Ensembles improve fairness by increasing performance on bottom group more than top group.
  • Effect is more prominent for higher severity of corruption.

Difference in churn between models explains ensemble fairness

  • Deep ensembling has a disparate impact on minority groups compared to majority groups
  • Churn is a metric used to measure model disagreement
  • Model ensembling is beneficial when models disagree
  • Churn is higher for bottom-k group than top-k group
  • Ensembling models is useful for improving accuracy on bottom-k samples

Characterizing stochasticity in deep neural networks training

  • Results show gains in homogeneous ensembles
  • Stochasticity is necessary to produce meaningful ensembles

Controlling for the sources of stochasticity in ensembles

  • Change Model Initialization
  • Change Batch Ordering
  • Change Model Initialization and Batch Ordering
  • Change Data Augmentation
  • Change Model Initialization, Batch Ordering and Data Augmentation

Can different sources of stochasticity improve deep ensemble fairness?

  • Standard ensembles improve fairness over individual models
  • Different sources of stochasticity have an impact on individual training episodes of a DNN
  • Batch-ordering minimizes the gap between top and bottom-k class accuracy
  • It is possible to skew the ensemble to favor the bottom group, improving fairness
  • Deep ensembling of neural networks improves top-line metrics
  • Prior work amplifies differences between models in the ensemble
  • Focus on simple ensembles with shared design choices
  • Understanding implications of ensembling on fairness objectives
  • Stochasticity in uniform ensembles impacts subgroup performance

Conclusion and future work

  • Ensembling can improve fairness outcomes in sensitive domains.
  • Certain distributions of stochasticity favor top-k and bottom-k.
  • Future work should optimize to amplify sources of stochasticity.
  • Ensembles improve minority group more than majority group.
  • Controlling sources of randomness can further improve fairness.
  • Top-10 and Bottom-10 class names for CIFAR100 and TinyImagenet.