Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Feature normalization transforms are essential for deep neural networks.
  • Tuning the parameters of these transforms can achieve high accuracy.
  • This work investigates the expressive power of tuning normalization layers of frozen networks.
  • Tuning normalization layers of random ReLU networks can reconstruct any target network that is $O(\sqrt{\text{width}})$ times smaller.
  • This holds even for randomly sparsified networks, under sufficient overparameterization.

Paper Content

Introduction

  • Machine learning techniques often use pretrained networks and only train a small part of them
  • This is useful for tasks like transfer learning, multitask learning, and few-shot learning
  • Training only a subset of the parameters of a large-scale model can lead to better accuracy than training from scratch
  • One particular way of model fine-tuning is to train only the Batch Normalization or Layer Normalization parameters
  • Frankle et al. discovered that training these normalization parameters in isolation leads to predictive models with accuracy far above random guessing
  • Increasing the width/depth of these random networks allowed normalization layers training to reach significant accuracy
  • This indicates that training only the normalization layers of a network is expressive enough to allow non-trivial accuracy
  • This paper investigates the expressive power of tuning only the normalization layers of a neural network
  • Training the normalization layer parameters is equivalent to scaling the input of each neuron and adding a bias term
  • Theoretically, any given neural network can be perfectly reconstructed by only tuning the normalization layers of a wider or deeper random network
  • If the random network has sparse weight matrices, the total number of parameters only needs to be a factor of O( √ d) larger than the target network
  • Feature Normalization techniques are used to improve generalization performance and accelerate training in deep neural networks.
  • Batch Normalization was the first normalization technique introduced.
  • Batch Normalization reduces the internal covariate shift, which is the change in network parameters of the layers preceding to any given layer.
  • Batch Normalization leads to different neural activation patterns for different inputs.
  • Santurkar et al. [2018] showed that adding noise with non-zero mean after Batch Normalization still leads to fast convergence.
  • Kohler et al. [2019] showed that Batch Normalization leads to exponentially fast convergence.

Preliminaries

  • Notation used to denote matrices, vectors, and products
  • Randomly initialized neural networks with ReLU activations
  • Equivalence/Realization of two neural networks
  • Domain of bounded real matrices
  • Definitions of Khatri-Rao and Hadamard products
  • Batch Normalization introduced by Ioffe and Szegedy
  • Two major variants of Batch Normalization
  • Second variant leads to better accuracy

Main results

  • Study the expressive power of normalization layers of a frozen or randomly initialized neural network
  • Mean and variance are considered to be constants
  • There exists a choice of parameters of the normalization layers of another randomly initialized neural network such that it is equivalent to the target network
  • A randomly initialized ReLU network with a factor of d overparameterization can reconstruct the target ReLU network
  • If the weight matrices of the target neural network are factorized and have ranks r, then the randomly initialized network only needs a width overparameterization of the order r
  • Increasing depth, while keeping width fixed, increases the expressive power of normalization layer parameters
  • Any fully connected neural network can be realized by a deeper, randomly initialized neural network with skip connections
  • Sparsifying the random matrices of the network results in a total number of Õ(d2√d) non-zero parameters

Our techniques

  • Derivation of results relies on invertibility of Khatri-Rao product
  • Establish full-rankness and non-degeneracy of matrix multiplications
  • Exploit randomness of weight matrices

Reconstruction with overparameterization

  • Theorem 1 is proven by building a layer-by-layer reconstruction of g
  • For each layer of g, two layers of f are constructed
  • Parameters of f are used to activate ReLUs and cancel out extra bias
  • Γ 2i is set to an identity matrix for convenience
  • The first layer of g and the first two layers of f are functionally equivalent
  • Parameters of β 1 are set large enough to activate all ReLUs of the first layer of f
  • A solution for Γ 1 is constructed using Lemma 2
  • Layer-wise equivalence between f and g is proven

Width/depth tradeoff

  • Architecture consists of layers with width dk
  • Input x is partitioned into blocks of size k
  • Blocks are passed to different layers through skip connections
  • Blocks are projected through matrices A i to correct dimension
  • Final linear layer is added to match dimensions with target network
  • Random matrix A i projects input to match dimensions of target network
  • Parameters β i are set to activate ReLUs
  • System of d k linear systems is feasible
  • Elements of Γ d k are non-zero with probability one

Reconstruction of sparse networks

  • The proof of this result has two components
  • Show that if matrices have i.i.d. entries from a continuous distribution, the sparsity pattern dictates whether the matrix is non-singular or not
  • Reduce the problem to the problem of non-singularity of random Boolean matrices with i.i.d. elements
  • Introduce the notion of the Boolean determinant for a Boolean matrix
  • Prove that the Boolean determinant of M 1 M 2 is non-zero with a small probability
  • Replace each column of M 1 M 2 with a new vector of equal length, which has Bern(q) i.i.d. elements
  • Find a value of the probability p that ensures small enough probability of non-invertibility of the Khatri-Rao product
  • Show that if p ≥ √ 2qd then the requirement is satisfied
  • Use existing results on the singularity of Boolean matrices with i.i.d. entries
  • Use Theorem 4 to get bounds on probability of non-singularity of the i.i.d. Boolean entries matrix
  • Take a union bound over all the layers to ensure that the Khatri-Rao product of the sparse matrices is invertible
  • p = Θ( log d/d) suffices for the theorem to hold

Discussion and future work

  • Focuses on expressive power of normalization parameters
  • Results may not apply to training process through gradient based optimization algorithms
  • Could explore regime in which normalization layer is only moderately overparameterized
  • Could explore fine-tuning capacity of normalization layers
  • Could extend results to other architectures like CNNs and Transformers

A linear algebra

  • Fact 1: If two matrices have no intersection between their nullspace and column space, then their rank is the minimum of their dimensions.
  • Remark 4: Necessary and sufficient condition for full rank multiplication of two matrices.
  • Definition 5: Kronecker product is a block matrix.
  • Theorem 5: Kronecker product has rank equal to product of individual ranks.
  • Definition 6: Khatri-Rao product is a matrix with column-wise Kronecker products.
  • Theorem 6: Khatri-Rao and Kronecker products are connected, Khatri-Rao and Hadamard products are connected.
  • Theorem 7: Linear system has no solution if rank of matrix and rank of matrix with b column are equal.
  • Lemma 5: If polynomial is not zero, set of zero values is measure zero.
  • Lemma 6: Khatri-Rao product of two random matrices is full rank with probability 1.
  • Lemma 7: Two layer ReLU network with BatchNorm can be functionally equivalent to one layer network with same architecture and width with probability 1.
  • Proposition 1: Linear network with BatchNorm can be expressed as one layer ReLU network with BatchNorm.
  • Lemma 8: B inverse is a combination of determinants.