Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Feature normalization transforms are essential for deep neural networks.
- Tuning the parameters of these transforms can achieve high accuracy.
- This work investigates the expressive power of tuning normalization layers of frozen networks.
- Tuning normalization layers of random ReLU networks can reconstruct any target network that is $O(\sqrt{\text{width}})$ times smaller.
- This holds even for randomly sparsified networks, under sufficient overparameterization.
Paper Content
Introduction
- Machine learning techniques often use pretrained networks and only train a small part of them
- This is useful for tasks like transfer learning, multitask learning, and few-shot learning
- Training only a subset of the parameters of a large-scale model can lead to better accuracy than training from scratch
- One particular way of model fine-tuning is to train only the Batch Normalization or Layer Normalization parameters
- Frankle et al. discovered that training these normalization parameters in isolation leads to predictive models with accuracy far above random guessing
- Increasing the width/depth of these random networks allowed normalization layers training to reach significant accuracy
- This indicates that training only the normalization layers of a network is expressive enough to allow non-trivial accuracy
- This paper investigates the expressive power of tuning only the normalization layers of a neural network
- Training the normalization layer parameters is equivalent to scaling the input of each neuron and adding a bias term
- Theoretically, any given neural network can be perfectly reconstructed by only tuning the normalization layers of a wider or deeper random network
- If the random network has sparse weight matrices, the total number of parameters only needs to be a factor of O( √ d) larger than the target network
Related work
- Feature Normalization techniques are used to improve generalization performance and accelerate training in deep neural networks.
- Batch Normalization was the first normalization technique introduced.
- Batch Normalization reduces the internal covariate shift, which is the change in network parameters of the layers preceding to any given layer.
- Batch Normalization leads to different neural activation patterns for different inputs.
- Santurkar et al. [2018] showed that adding noise with non-zero mean after Batch Normalization still leads to fast convergence.
- Kohler et al. [2019] showed that Batch Normalization leads to exponentially fast convergence.
Preliminaries
- Notation used to denote matrices, vectors, and products
- Randomly initialized neural networks with ReLU activations
- Equivalence/Realization of two neural networks
- Domain of bounded real matrices
- Definitions of Khatri-Rao and Hadamard products
- Batch Normalization introduced by Ioffe and Szegedy
- Two major variants of Batch Normalization
- Second variant leads to better accuracy
Main results
- Study the expressive power of normalization layers of a frozen or randomly initialized neural network
- Mean and variance are considered to be constants
- There exists a choice of parameters of the normalization layers of another randomly initialized neural network such that it is equivalent to the target network
- A randomly initialized ReLU network with a factor of d overparameterization can reconstruct the target ReLU network
- If the weight matrices of the target neural network are factorized and have ranks r, then the randomly initialized network only needs a width overparameterization of the order r
- Increasing depth, while keeping width fixed, increases the expressive power of normalization layer parameters
- Any fully connected neural network can be realized by a deeper, randomly initialized neural network with skip connections
- Sparsifying the random matrices of the network results in a total number of Õ(d2√d) non-zero parameters
Our techniques
- Derivation of results relies on invertibility of Khatri-Rao product
- Establish full-rankness and non-degeneracy of matrix multiplications
- Exploit randomness of weight matrices
Reconstruction with overparameterization
- Theorem 1 is proven by building a layer-by-layer reconstruction of g
- For each layer of g, two layers of f are constructed
- Parameters of f are used to activate ReLUs and cancel out extra bias
- Γ 2i is set to an identity matrix for convenience
- The first layer of g and the first two layers of f are functionally equivalent
- Parameters of β 1 are set large enough to activate all ReLUs of the first layer of f
- A solution for Γ 1 is constructed using Lemma 2
- Layer-wise equivalence between f and g is proven
Width/depth tradeoff
- Architecture consists of layers with width dk
- Input x is partitioned into blocks of size k
- Blocks are passed to different layers through skip connections
- Blocks are projected through matrices A i to correct dimension
- Final linear layer is added to match dimensions with target network
- Random matrix A i projects input to match dimensions of target network
- Parameters β i are set to activate ReLUs
- System of d k linear systems is feasible
- Elements of Γ d k are non-zero with probability one
Reconstruction of sparse networks
- The proof of this result has two components
- Show that if matrices have i.i.d. entries from a continuous distribution, the sparsity pattern dictates whether the matrix is non-singular or not
- Reduce the problem to the problem of non-singularity of random Boolean matrices with i.i.d. elements
- Introduce the notion of the Boolean determinant for a Boolean matrix
- Prove that the Boolean determinant of M 1 M 2 is non-zero with a small probability
- Replace each column of M 1 M 2 with a new vector of equal length, which has Bern(q) i.i.d. elements
- Find a value of the probability p that ensures small enough probability of non-invertibility of the Khatri-Rao product
- Show that if p ≥ √ 2qd then the requirement is satisfied
- Use existing results on the singularity of Boolean matrices with i.i.d. entries
- Use Theorem 4 to get bounds on probability of non-singularity of the i.i.d. Boolean entries matrix
- Take a union bound over all the layers to ensure that the Khatri-Rao product of the sparse matrices is invertible
- p = Θ( log d/d) suffices for the theorem to hold
Discussion and future work
- Focuses on expressive power of normalization parameters
- Results may not apply to training process through gradient based optimization algorithms
- Could explore regime in which normalization layer is only moderately overparameterized
- Could explore fine-tuning capacity of normalization layers
- Could extend results to other architectures like CNNs and Transformers
A linear algebra
- Fact 1: If two matrices have no intersection between their nullspace and column space, then their rank is the minimum of their dimensions.
- Remark 4: Necessary and sufficient condition for full rank multiplication of two matrices.
- Definition 5: Kronecker product is a block matrix.
- Theorem 5: Kronecker product has rank equal to product of individual ranks.
- Definition 6: Khatri-Rao product is a matrix with column-wise Kronecker products.
- Theorem 6: Khatri-Rao and Kronecker products are connected, Khatri-Rao and Hadamard products are connected.
- Theorem 7: Linear system has no solution if rank of matrix and rank of matrix with b column are equal.
- Lemma 5: If polynomial is not zero, set of zero values is measure zero.
- Lemma 6: Khatri-Rao product of two random matrices is full rank with probability 1.
- Lemma 7: Two layer ReLU network with BatchNorm can be functionally equivalent to one layer network with same architecture and width with probability 1.
- Proposition 1: Linear network with BatchNorm can be expressed as one layer ReLU network with BatchNorm.
- Lemma 8: B inverse is a combination of determinants.