Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Federated learning (FL) enables collaborative training of deep models on the network edge without centrally aggregating raw data.
- Most of the previous work has focused on a difference in the distribution of labels or client shifts, but this paper addresses a different problem of FL, e.g., different scanners/sensors in medical imaging.
- The paper proposes an effective method that uses local batch normalization to alleviate the feature shift before averaging models.
- The resulting scheme, called FedBN, outperforms both classical FedAvg, as well as the state-of-the-art for non-iid data (FedProx).
Paper Content
Introduction
- Federated learning (FL) is a way of learning from distributed data.
- A challenge in FL is the statistical heterogeneity of the data.
- Standard federated methods suffer from performance degradation or divergence when deployed over non-iid samples.
- Feature shift is a type of non-iid data that has not been explored in the literature.
- Batch Normalization (BN) has been proposed as a tool to mitigate domain shifts in domain adaptation tasks.
- This paper proposes to apply BN for feature shift FL.
- A toy example is used to illustrate how BN may help harmonizing local feature distributions.
- The paper proposes a novel federated learning method, called FedBN, for addressing non-iid training data.
- FedBN has zero parameters to tune and requires minimal additional computational resources.
- FedBN demonstrates significant practical improvements compared to classical FedAvg and the state-of-the-art for non-iid data (FedProx).
Related work
- FedAvg often suffers when data is heterogeneous
- Empirical work mainly focuses on label distribution skew
- FedProx tackles heterogeneity by allowing partial information aggregation
- Zhao et al. assumes a subset of data is globally shared
- FedMA proposed an aggregation strategy for non-iid data
- FedRobust assumes data follows an affine distribution shift
- SiloBN keeps some untrainable BN parameters
- FedBN keeps all BN parameters strictly local
- Reddi et al. and Zhang et al. focus on improving the optimization mechanism
- BN makes optimization landscape smoother
- BN provides regularization and discourages single direction reliance
- BN used for domain adaptation
- Role of BN in federated learning unexplored
Preliminary
- Non-IID data is a novel category of client’s data distribution
- Feature shift covers two categories: covariate shift and concept shift
- FedAvg is a popular federated learning strategy where clients collaboratively send updates of locally trained models to a global server
- FedAvg has shown successes in classical Federated Learning tasks, but suffers from slow convergence and low accuracy in most non-IID contents
Federated averaging with local batch normalization
Proposed method -fedbn
- Proposed learning strategy: FedBN
- Similar to FedAvg, performs local updates and averages local models
- Excludes BN layers from averaging step
- Results in significant empirical improvements in non-iid settings
- Improves convergence rate under feature shift
Problem setup
Convergence analysis
- We study the trajectory of two networks, FedAvg and FedBN, through the neural tangent kernel.
- Recent machine learning theory studies have shown that the convergence rate of finite-width over-parameterized networks is controlled by the least eigenvalue of the induced kernel.
- We consider the case that the number of local updates is 1.
- The neural tangent kernel can be decomposed into a magnitude component and direction component.
- The exponential factor of convergence for FedAvg and FedBN is controlled by the smallest eigenvalue of G(t) and G*(t) respectively.
- FedBN has a faster convergence rate than FedAvg.
Proof sketch
- Show that λ min (G ∞ ) is less than or equal to λ min (G * ∞ )
- Compare equation 4 and 5
Experiments
- Using local BN parameters is beneficial when data is heterogeneous across clients
- FedBN is a novel local parameter sharing strategy that is more robust and faster than alternative methods
- FedBN is shown to have better model performance on benchmark and real-world datasets
Benchmark experiments
- Extensive empirical analysis conducted on benchmark digits classification task
- Five datasets used: SVHN Netzer et al. (2011), USPS Hull (1994), SynthDigits Ganin & Lempitsky (2015), MNIST-M Ganin & Lempitsky (2015) and MNIST LeCun et al. (1998)
- 7438 training samples in each dataset
- Classification model is a convolutional neural network
- Cross-entropy loss and SGD optimizer with a learning rate of 10 −2
- Default setting for local update epochs is E = 1
- Default setting for the amount of data at each client is 10% of the dataset original size
- Investigation of convergence rate and analysis of local updating epochs
Analysis of local dataset size:
- FedBN behaviour is observed over different data capacities at each client.
- Testing accuracy drops when each client is attributed 20% of its original data amount.
- FedBN can benefit from collaborative training on distributed data, especially when each client holds a small amount of data.
Effects of statistical heterogeneity:
- FedBN is compared to FedAvg in a federated setting with varying levels of heterogeneity
- 10 subsets of data are used, with equal number of samples and same label distribution
- FedBN achieves higher testing accuracy than FedAvg over all levels of heterogeneity
- FedBN consistently outperforms state-of-the-art and baseline methods
Experiments on real-world datasets
- Proposed algorithm can be beneficial in real-world feature-shift noniid
- Validated effectiveness of FedBN on three real-world datasets
- Datasets include Office-Caltech10, four medical institutions from ABIDE I, and DomainNet
- Classification models use AlexNet architecture with BN added after each convolution and fully-connected layer
- ABIDE I instances represented as 5995-dimensional vector through brain connectome computation
- Classifier is a three-layer fully connected neural network with two BN layers after the first two fully connected layers
Results and analysis:
- FedBN significantly outperforms state-of-the-art method on Office-Caltech10
- FedBN achieves supreme accuracy on DomainNet
- Alternative FL methods achieve comparable results with SingleSet except Quickdraw
- FedBN outperforms alternative FL methods by 10%
- Alternative FL strategies are ineffective in feature shift non-iid datasets
- FedBN excels by a non-negligible margin on three clients in ABIDE I
Conclusion and discussion
- Proposes a novel federated learning aggregation method called FedBN
- Mitigates feature shifts in non-IID data
- Provides convergence guarantees for FedBN
- Experiments demonstrate FedBN can improve convergence and model performance
- Can be combined with different optimization algorithms, communication schemes, and aggregation techniques
- Can be integrated into existing tool-kits/systems
B.3 proof of corollary 4.6
- FedAvg and FedBN have different convergence rates when E = 1
- Comparing the exponential factors in the convergence rates reduces to comparing two matrices
C fedbn algorithm
- Proposed algorithm: FedBN
- Initialize model parameters: w (l) 0,k , local update pace: E, and total optimization round T
- Model Architecture: 6-layer Convolutional Neural Network (CNN)
- Datasets: Office-31, Caltech-256, Amazon, DSLR, Webcam
- Training Details: Cross-entropy loss, SGD optimizer, learning rate 10-2, batch size 32, training epochs 300, µ 10-2
- Data sample number: Office-Caltech10 62, DomainNet 105
D.4 abide dataset and training details
- Described real-world medical datasets
- Described preprocessing and training details
Dataset:
- Study was conducted using rs-fMRI data from ABIDE dataset
- Data was downloaded from four largest sites (UM, NYU, USM, UCLA)
- Subjects lacking filename were skipped, resulting in 88, 167, 52, 63 subjects for each site
- Sliding windows were used to truncate raw time sequences of fMRI
- Performance of FedBN and alternative methods was compared in Figure 5