Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Feature normalization transforms are essential for deep neural networks.
Tuning the parameters of these transforms can achieve high accuracy.
This work investigates the expressive power of tuning normalization layers of frozen networks.
Tuning normalization layers of random ReLU networks can reconstruct any target network that is $O(\sqrt{\text{width}})$ times smaller.
This holds even for randomly sparsified networks, under sufficient overparameterization.

Paper Content

Introduction

Machine learning techniques often use pretrained networks and only train a small part of them
This is useful for tasks like transfer learning, multitask learning, and few-shot learning
Training only a subset of the parameters of a large-scale model can lead to better accuracy than training from scratch
One particular way of model fine-tuning is to train only the Batch Normalization or Layer Normalization parameters
Frankle et al. discovered that training these normalization parameters in isolation leads to predictive models with accuracy far above random guessing
Increasing the width or depth of these random networks allowed training of the normalization layers alone to reach significant accuracy
This indicates that training only the normalization layers of a network is expressive enough to allow non-trivial accuracy
This paper investigates the expressive power of tuning only the normalization layers of a neural network
Training the normalization layer parameters is equivalent to scaling the input of each neuron and adding a bias term
Theoretically, any given neural network can be perfectly reconstructed by only tuning the normalization layers of a wider or deeper random network
If the random network has sparse weight matrices, the total number of parameters only needs to be a factor of $O(\sqrt{d})$ larger than the target network

Related work

Feature Normalization techniques are used to improve generalization performance and accelerate training in deep neural networks.
Batch Normalization was the first normalization technique introduced.
Batch Normalization was proposed to reduce internal covariate shift, i.e. the change in the distribution of a layer's inputs caused by changes in the parameters of the preceding layers.
Batch Normalization leads to different neural activation patterns for different inputs.
Santurkar et al. [2018] showed that adding noise with non-zero mean after Batch Normalization still leads to fast convergence.
Kohler et al. [2019] showed that Batch Normalization leads to exponentially fast convergence.

Preliminaries

Notation used to denote matrices, vectors, and products
Randomly initialized neural networks with ReLU activations
Equivalence/Realization of two neural networks
Domain of bounded real matrices
Definitions of Khatri-Rao and Hadamard products
Batch Normalization introduced by Ioffe and Szegedy
Two major variants of Batch Normalization
Second variant leads to better accuracy
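
A minimal sketch of why tuning Batch Normalization parameters amounts to a per-neuron scale and bias, assuming the mean and variance are held constant (as the main results do). The function names, the `eps` constant, and the shapes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def batch_norm(z, mean, var, gamma, beta, eps=1e-5):
    """Batch Normalization with fixed (constant) statistics."""
    return gamma * (z - mean) / np.sqrt(var + eps) + beta

def as_scale_and_bias(mean, var, gamma, beta, eps=1e-5):
    """Fold the fixed statistics into an equivalent per-neuron scale and bias."""
    scale = gamma / np.sqrt(var + eps)
    bias = beta - scale * mean
    return scale, bias

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 5))                 # pre-activations: batch of 8, 5 neurons
mean = rng.normal(size=5)                   # treated as constants
var = rng.uniform(0.5, 2.0, size=5)
gamma, beta = rng.normal(size=5), rng.normal(size=5)

scale, bias = as_scale_and_bias(mean, var, gamma, beta)
bn_out = batch_norm(z, mean, var, gamma, beta)
affine_out = scale * z + bias               # same map, written as scale + bias
```

Because the two outputs coincide, any choice of `gamma` and `beta` corresponds to some diagonal scaling and bias, which is the form used in the reconstruction arguments.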

Main results

Study the expressive power of normalization layers of a frozen or randomly initialized neural network
Mean and variance are considered to be constants
There exists a choice of parameters of the normalization layers of another randomly initialized neural network such that it is equivalent to the target network
A randomly initialized ReLU network whose width is overparameterized by a factor of d can reconstruct the target ReLU network
If the weight matrices of the target neural network are factorized and have ranks r, then the randomly initialized network only needs a width overparameterization of the order r
Increasing depth, while keeping width fixed, increases the expressive power of normalization layer parameters
Any fully connected neural network can be realized by a deeper, randomly initialized neural network with skip connections
Sparsifying the random matrices of the network results in a total number of $\tilde{O}(d^2\sqrt{d})$ non-zero parameters

Our techniques

Derivation of results relies on invertibility of Khatri-Rao product
Establish full-rankness and non-degeneracy of matrix multiplications
Exploit randomness of weight matrices
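
The invertibility claim can be checked numerically with a small sketch. It assumes Gaussian weights and two matrices of shape d × d², so that their Khatri-Rao product is a square d² × d² matrix; the helper `khatri_rao` and the degenerate counterexample are illustrative choices.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: column i is kron(A[:, i], B[:, i])."""
    assert A.shape[1] == B.shape[1]
    return np.stack([np.kron(A[:, i], B[:, i]) for i in range(A.shape[1])], axis=1)

rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d * d))
B = rng.normal(size=(d, d * d))

K = khatri_rao(A, B)                      # shape (d*d, d*d)
rank_random = np.linalg.matrix_rank(K)

# Degenerate case: repeating a column in both factors repeats a column of K,
# so the product can no longer be full rank.
A_bad, B_bad = A.copy(), B.copy()
A_bad[:, 1], B_bad[:, 1] = A_bad[:, 0], B_bad[:, 0]
rank_degenerate = np.linalg.matrix_rank(khatri_rao(A_bad, B_bad))
```

The random case is full rank, while the structured case is not, which illustrates why the argument relies on the randomness of the weight matrices.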

Reconstruction with overparameterization

Theorem 1 is proven by building a layer-by-layer reconstruction of g
For each layer of g, two layers of f are constructed
Parameters of f are used to activate ReLUs and cancel out extra bias
$\Gamma_{2i}$ is set to an identity matrix for convenience
The first layer of g and the first two layers of f are functionally equivalent
Parameters of $\beta_1$ are set large enough to activate all ReLUs of the first layer of f
A solution for $\Gamma_1$ is constructed using Lemma 2
Layer-wise equivalence between f and g is proven
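
The steps above can be sketched for a single target layer. This is an illustration under assumptions, not the paper's exact construction: Gaussian frozen weights, hidden width d², inputs bounded in norm by `radius`, and a diagonal scale `gamma1` found by solving the linear system obtained from the identity vec(W2 diag(γ) W1) = (W1ᵀ ⊙ W2) γ, where ⊙ is the Khatri-Rao product. All variable names are hypothetical.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: column i is kron(A[:, i], B[:, i])."""
    return np.stack([np.kron(A[:, i], B[:, i]) for i in range(A.shape[1])], axis=1)

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
d, radius = 3, 1.0                         # target width; inputs satisfy ||x|| <= radius

W_star = rng.normal(size=(d, d))           # target layer g(x) = relu(W_star @ x)
W1 = rng.normal(size=(d * d, d))           # frozen random weights of f, layer 1
W2 = rng.normal(size=(d, d * d))           # frozen random weights of f, layer 2

# Tune the first scale so that W2 @ diag(gamma1) @ W1 == W_star.
K = khatri_rao(W1.T, W2)                   # square (d*d, d*d) system matrix
gamma1 = np.linalg.solve(K, W_star.flatten(order="F"))

# Bias large enough that every hidden ReLU is active on the bounded domain.
beta1 = np.linalg.norm(gamma1[:, None] * W1, axis=1) * radius + 1.0
# Second scale is the identity; its bias cancels the extra term W2 @ beta1.
beta2 = -W2 @ beta1

def hidden_preact(x):
    return gamma1 * (W1 @ x) + beta1

def g(x):
    return relu(W_star @ x)

def f(x):
    return relu(W2 @ relu(hidden_preact(x)) + beta2)

errors, min_preact = [], np.inf
for _ in range(100):
    x = rng.normal(size=d)
    x *= radius * rng.uniform() / np.linalg.norm(x)   # random point with ||x|| <= radius
    errors.append(np.max(np.abs(f(x) - g(x))))
    min_preact = min(min_preact, hidden_preact(x).min())
max_error = max(errors)
```

Only `gamma1`, `beta1`, and `beta2` are tuned; `W1` and `W2` stay frozen. On the bounded domain the hidden ReLUs act as the identity, so the two layers of f collapse to the single target layer.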

Width/depth tradeoff

Architecture consists of layers with width dk
Input x is partitioned into blocks of size k
Blocks are passed to different layers through skip connections
Blocks are projected through matrices $A_i$ to correct dimension
Final linear layer is added to match dimensions with target network
Random matrix $A_i$ projects input to match dimensions of target network
Parameters $\beta_i$ are set to activate ReLUs
System of d k linear systems is feasible
Elements of Γ d k are non-zero with probability one

Reconstruction of sparse networks

The proof of this result has two components
Show that if matrices have i.i.d. entries from a continuous distribution, the sparsity pattern dictates whether the matrix is non-singular or not
Reduce the problem to the problem of non-singularity of random Boolean matrices with i.i.d. elements
Introduce the notion of the Boolean determinant for a Boolean matrix
Prove that the Boolean determinant of the Khatri-Rao product of $M_1$ and $M_2$ is zero only with small probability
Replace each column of the Khatri-Rao product of $M_1$ and $M_2$ with a new vector of equal length, which has Bern(q) i.i.d. elements
Find a value of the probability p that ensures small enough probability of non-invertibility of the Khatri-Rao product
Show that if p ≥ √ 2qd then the requirement is satisfied
Use existing results on the singularity of Boolean matrices with i.i.d. entries
Use Theorem 4 to get bounds on probability of non-singularity of the i.i.d. Boolean entries matrix
Take a union bound over all the layers to ensure that the Khatri-Rao product of the sparse matrices is invertible
$p = \Theta(\sqrt{\log d / d})$ suffices for the theorem to hold
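
The first component of the proof, that the sparsity pattern decides non-singularity, can be illustrated on small matrices. The sketch below assumes one plausible reading of the Boolean determinant (an OR over permutations of the AND of the selected entries); this definition and the function names are assumptions and may differ from the paper's. The brute-force search over permutations is only practical for tiny matrices.

```python
import itertools
import numpy as np

def boolean_determinant(pattern):
    """1 if some permutation selects a non-zero entry in every row and column, else 0."""
    n = len(pattern)
    return int(any(all(pattern[i][perm[i]] for i in range(n))
                   for perm in itertools.permutations(range(n))))

def random_matrix_with_pattern(pattern, rng):
    """Continuous i.i.d. entries placed on the support given by the 0/1 pattern."""
    pattern = np.asarray(pattern, dtype=float)
    return pattern * rng.normal(size=pattern.shape)

rng = np.random.default_rng(0)
good = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]   # the diagonal is a valid selection
bad = [[1, 1, 1], [1, 0, 0], [1, 0, 0]]    # rows 2 and 3 compete for column 1

det_good = np.linalg.det(random_matrix_with_pattern(good, rng))
det_bad = np.linalg.det(random_matrix_with_pattern(bad, rng))
```

For the `good` pattern the determinant is a non-trivial polynomial in the random entries, so it is non-zero with probability one; for the `bad` pattern it vanishes for every choice of entries.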

Discussion and future work

Focuses on expressive power of normalization parameters
Results may not apply to training process through gradient based optimization algorithms
Could explore regime in which normalization layer is only moderately overparameterized
Could explore fine-tuning capacity of normalization layers
Could extend results to other architectures like CNNs and Transformers

A. Linear algebra

Fact 1: If the nullspace of a matrix A and the column space of a matrix B intersect only at zero, then the product AB has the same rank as B.
Remark 4: Necessary and sufficient condition for full rank multiplication of two matrices.
Definition 5: Kronecker product is a block matrix.
Theorem 5: Kronecker product has rank equal to product of individual ranks.
Definition 6: Khatri-Rao product is a matrix with column-wise Kronecker products.
Theorem 6: Khatri-Rao and Kronecker products are connected, Khatri-Rao and Hadamard products are connected.
Theorem 7: A linear system Ax = b has a solution if and only if the rank of A equals the rank of the augmented matrix [A | b].
Lemma 5: If a polynomial is not identically zero, its set of zeros has measure zero.
Lemma 6: Khatri-Rao product of two random matrices is full rank with probability 1.
Lemma 7: Two layer ReLU network with BatchNorm can be functionally equivalent to one layer network with same architecture and width with probability 1.
Proposition 1: Linear network with BatchNorm can be expressed as one layer ReLU network with BatchNorm.
Lemma 8: B inverse is a combination of determinants.
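
Several of the statements above can be checked numerically. The sketch below uses standard forms of these results, which may be phrased differently in the paper: rank(A ⊗ B) = rank(A) · rank(B), the Khatri-Rao product as a column subset of the Kronecker product, the Gram identity (C ⊙ D)ᵀ(C ⊙ D) = (CᵀC) ∘ (DᵀD), and the rank criterion for solvability of a linear system. Matrix sizes and names are arbitrary illustrative choices.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: column i is kron(A[:, i], B[:, i])."""
    return np.stack([np.kron(A[:, i], B[:, i]) for i in range(A.shape[1])], axis=1)

rng = np.random.default_rng(0)
rank = np.linalg.matrix_rank

# Kronecker product: rank is the product of the individual ranks.
A = rng.normal(size=(4, 2)) @ rng.normal(size=(2, 5))    # rank 2 by construction
B = rng.normal(size=(3, 3))                               # rank 3 with probability 1
kron_rank = rank(np.kron(A, B))

# Khatri-Rao product: the columns of the Kronecker product with matching indices.
n = 4
C, D = rng.normal(size=(3, n)), rng.normal(size=(2, n))
kr = khatri_rao(C, D)
kron_cols = np.kron(C, D)[:, [i * n + i for i in range(n)]]

# Khatri-Rao and Hadamard products: Gram matrix identity.
gram_lhs = kr.T @ kr
gram_rhs = (C.T @ C) * (D.T @ D)                          # element-wise product

# Solvability: Mx = b has a solution iff rank(M) == rank([M | b]).
def is_solvable(M, b):
    return rank(M) == rank(np.column_stack([M, b]))

M = np.array([[1.0, 2.0], [2.0, 4.0]])                    # rank 1
solvable = is_solvable(M, np.array([1.0, 2.0]))           # b in the column space
unsolvable = is_solvable(M, np.array([1.0, 0.0]))         # b outside the column space
```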
