Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Federated learning (FL) enables collaborative training of deep models on the network edge without centrally aggregating raw data.
  • Most of the previous work has focused on a difference in the distribution of labels or client shifts, but this paper addresses a different problem of FL, e.g., different scanners/sensors in medical imaging.
  • The paper proposes an effective method that uses local batch normalization to alleviate the feature shift before averaging models.
  • The resulting scheme, called FedBN, outperforms both classical FedAvg, as well as the state-of-the-art for non-iid data (FedProx).

Paper Content

Introduction

  • Federated learning (FL) is a way of learning from distributed data.
  • A challenge in FL is the statistical heterogeneity of the data.
  • Standard federated methods suffer from performance degradation or divergence when deployed over non-iid samples.
  • Feature shift is a type of non-iid data that has not been explored in the literature.
  • Batch Normalization (BN) has been proposed as a tool to mitigate domain shifts in domain adaptation tasks.
  • This paper proposes to apply BN for feature shift FL.
  • A toy example is used to illustrate how BN may help harmonizing local feature distributions.
  • The paper proposes a novel federated learning method, called FedBN, for addressing non-iid training data.
  • FedBN has zero parameters to tune and requires minimal additional computational resources.
  • FedBN demonstrates significant practical improvements compared to classical FedAvg and the state-of-the-art for non-iid data (FedProx).
  • FedAvg often suffers when data is heterogeneous
  • Empirical work mainly focuses on label distribution skew
  • FedProx tackles heterogeneity by allowing partial information aggregation
  • Zhao et al. assumes a subset of data is globally shared
  • FedMA proposed an aggregation strategy for non-iid data
  • FedRobust assumes data follows an affine distribution shift
  • SiloBN keeps some untrainable BN parameters
  • FedBN keeps all BN parameters strictly local
  • Reddi et al. and Zhang et al. focus on improving the optimization mechanism
  • BN makes optimization landscape smoother
  • BN provides regularization and discourages single direction reliance
  • BN used for domain adaptation
  • Role of BN in federated learning unexplored

Preliminary

  • Non-IID data is a novel category of client’s data distribution
  • Feature shift covers two categories: covariate shift and concept shift
  • FedAvg is a popular federated learning strategy where clients collaboratively send updates of locally trained models to a global server
  • FedAvg has shown successes in classical Federated Learning tasks, but suffers from slow convergence and low accuracy in most non-IID contents

Federated averaging with local batch normalization

Proposed method -fedbn

  • Proposed learning strategy: FedBN
  • Similar to FedAvg, performs local updates and averages local models
  • Excludes BN layers from averaging step
  • Results in significant empirical improvements in non-iid settings
  • Improves convergence rate under feature shift

Problem setup

Convergence analysis

  • We study the trajectory of two networks, FedAvg and FedBN, through the neural tangent kernel.
  • Recent machine learning theory studies have shown that the convergence rate of finite-width over-parameterized networks is controlled by the least eigenvalue of the induced kernel.
  • We consider the case that the number of local updates is 1.
  • The neural tangent kernel can be decomposed into a magnitude component and direction component.
  • The exponential factor of convergence for FedAvg and FedBN is controlled by the smallest eigenvalue of G(t) and G*(t) respectively.
  • FedBN has a faster convergence rate than FedAvg.

Proof sketch

  • Show that λ min (G ∞ ) is less than or equal to λ min (G * ∞ )
  • Compare equation 4 and 5

Experiments

  • Using local BN parameters is beneficial when data is heterogeneous across clients
  • FedBN is a novel local parameter sharing strategy that is more robust and faster than alternative methods
  • FedBN is shown to have better model performance on benchmark and real-world datasets

Benchmark experiments

  • Extensive empirical analysis conducted on benchmark digits classification task
  • Five datasets used: SVHN Netzer et al. (2011), USPS Hull (1994), SynthDigits Ganin & Lempitsky (2015), MNIST-M Ganin & Lempitsky (2015) and MNIST LeCun et al. (1998)
  • 7438 training samples in each dataset
  • Classification model is a convolutional neural network
  • Cross-entropy loss and SGD optimizer with a learning rate of 10 −2
  • Default setting for local update epochs is E = 1
  • Default setting for the amount of data at each client is 10% of the dataset original size
  • Investigation of convergence rate and analysis of local updating epochs

Analysis of local dataset size:

  • FedBN behaviour is observed over different data capacities at each client.
  • Testing accuracy drops when each client is attributed 20% of its original data amount.
  • FedBN can benefit from collaborative training on distributed data, especially when each client holds a small amount of data.

Effects of statistical heterogeneity:

  • FedBN is compared to FedAvg in a federated setting with varying levels of heterogeneity
  • 10 subsets of data are used, with equal number of samples and same label distribution
  • FedBN achieves higher testing accuracy than FedAvg over all levels of heterogeneity
  • FedBN consistently outperforms state-of-the-art and baseline methods

Experiments on real-world datasets

  • Proposed algorithm can be beneficial in real-world feature-shift noniid
  • Validated effectiveness of FedBN on three real-world datasets
  • Datasets include Office-Caltech10, four medical institutions from ABIDE I, and DomainNet
  • Classification models use AlexNet architecture with BN added after each convolution and fully-connected layer
  • ABIDE I instances represented as 5995-dimensional vector through brain connectome computation
  • Classifier is a three-layer fully connected neural network with two BN layers after the first two fully connected layers

Results and analysis:

  • FedBN significantly outperforms state-of-the-art method on Office-Caltech10
  • FedBN achieves supreme accuracy on DomainNet
  • Alternative FL methods achieve comparable results with SingleSet except Quickdraw
  • FedBN outperforms alternative FL methods by 10%
  • Alternative FL strategies are ineffective in feature shift non-iid datasets
  • FedBN excels by a non-negligible margin on three clients in ABIDE I

Conclusion and discussion

  • Proposes a novel federated learning aggregation method called FedBN
  • Mitigates feature shifts in non-IID data
  • Provides convergence guarantees for FedBN
  • Experiments demonstrate FedBN can improve convergence and model performance
  • Can be combined with different optimization algorithms, communication schemes, and aggregation techniques
  • Can be integrated into existing tool-kits/systems

B.3 proof of corollary 4.6

  • FedAvg and FedBN have different convergence rates when E = 1
  • Comparing the exponential factors in the convergence rates reduces to comparing two matrices

C fedbn algorithm

  • Proposed algorithm: FedBN
  • Initialize model parameters: w (l) 0,k , local update pace: E, and total optimization round T
  • Model Architecture: 6-layer Convolutional Neural Network (CNN)
  • Datasets: Office-31, Caltech-256, Amazon, DSLR, Webcam
  • Training Details: Cross-entropy loss, SGD optimizer, learning rate 10-2, batch size 32, training epochs 300, µ 10-2
  • Data sample number: Office-Caltech10 62, DomainNet 105

D.4 abide dataset and training details

  • Described real-world medical datasets
  • Described preprocessing and training details

Dataset:

  • Study was conducted using rs-fMRI data from ABIDE dataset
  • Data was downloaded from four largest sites (UM, NYU, USM, UCLA)
  • Subjects lacking filename were skipped, resulting in 88, 167, 52, 63 subjects for each site
  • Sliding windows were used to truncate raw time sequences of fMRI
  • Performance of FedBN and alternative methods was compared in Figure 5

Nyu um1 usm ucla1