Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Taking the width and depth of a deep neural network to infinity with a specific scaling results in the same covariance structure.
- This explains why the standard infinite-width-then-depth approach works even for networks with depth and width of the same order.
- Pre-activations have Gaussian distributions, which has applications in Bayesian deep learning.
Paper Content
Introduction
- Deep neural networks have achieved success in tasks such as image classification and natural language processing
- The behavior of these networks in the limit of large depth and large width is not fully understood
- Research has been done on two main limits: large-width and large-depth
- Large-width limit is relatively well understood, but large-depth and interaction between the two have not been studied as much
- Question: do large-width and large-depth limits commute?
- Results show that, at initialization, for a certain kind of network, the limits do commute
- Proof technique takes depth limit first, then width limit
- Results provide new insights into behavior of deep neural networks
Related work
- Research has been done on randomly initialized neural networks with an infinite number of parameters
- Research has focused on networks with infinite width and fixed depth, and recently on networks with large depth
Infinite-width limit
- Study of infinite-width limit of neural networks has yielded various theoretical and algorithmic innovations
- Initialization methods and selection of activation functions have practical benefits
- Weak limit of neural networks converges to a distribution modeled by a Gaussian process
- Neural Tangent Kernel converges to a deterministic kernel in the infinite-width limit
Infinite-depth limit
- Three approaches to studying the infinite-depth limit of neural networks with random initialization
- Edge of Chaos initialization scheme derived from infinite-width-then-depth limit
- Impact of activation function studied in infinite-width-then-depth limit
- Neural Tangent Kernel studied in joint infinite-width-and-depth limit
- Output of ResNets exhibits log-normal behavior in joint limit
- Covariance kernel of MLP weakly converges to SDE in joint limit
- Pre-activations of ResNets converge to diffusion process in infinite-depth limit of finite-width neural networks
Setup and definitions
- Convergence in distribution (weak convergence): pre-activations converge weakly to a Gaussian distribution
- Wasserstein metric used to quantify weak convergence rate
- Convergence in L2 (strong convergence): neural covariance converges to a deterministic limit
- Wasserstein distance used to measure weak convergence rate
Warmup: depth and width generally do not commute
- Width and depth of a network are denoted by n and L, respectively
- Input dimension is denoted by d
- Network has no bias
- Pre-activations and post-activations are vectors
- In the limit n, L → ∞, pre-activations exhibit Gaussian behavior
- Gaussian behavior facilitates exact Bayesian inference
- When width and depth both approach infinity, the limiting behavior can vary depending on the relative rates
- Empirical evidence supports the existence of this difference in the limiting behavior of the distribution
Neural covariance/correlation
- Researchers have studied the covariance of the pre-activation vectors Y tL (a) and Y tL (b) for two different inputs a, b ∈ R d.
- In the limit of “n → ∞, then L → ∞”, the network outputs Y L (a) and Y L (b) become perfectly correlated (correlation=1).
- This can lead to unstable behavior of the gradients and make the model untrainable.
- Several techniques have been proposed to address this issue.
- In the joint limit n, L → ∞ with n/L fixed, the correlation still converges to 1.
- Taking the width to infinity first, then depth, the result is a deterministic covariance structure.
Main results: width and depth commute in resnets
- ResNet architecture of width n and depth L uses ReLU activation function
- Weights are randomly initialized with iid Gaussian variables
- Networks have no bias
- Distribution of pre-activations Y tL is a zero-mean Gaussian distribution in the limit of min(n, L) → ∞
- Wasserstein distance between distribution of neuron Y 1 tL and zero-mean Gaussian random variable is provided
- Pre-activations become similar to a Gaussian distribution as min(n, L) → ∞
- Residual network can be seen as a Gaussian process in this limit with a well-specified kernel function
Neural covariance
- MLPs and ResNets have different limiting behaviors when width and depth are taken to infinity.
- ResNets have a deterministic kernel in the limit min(n, L) → ∞.
- Theorem 2 unifies previous approaches to understanding the covariance structure in large width and depth ResNets.
- Theorem 3 expresses the result of Theorem 2 in terms of correlation.
Experiments and practical implications
- Validated theoretical results with simulations
- Used large width and depth residual neural networks
Gaussian behavior and independence of neurons
- Neurons converge weakly to a Gaussian distribution
- Histograms of first neuron in last layer fit Gaussian distribution better as width and depth increase
- Kolmogorov-Smirnov normality test shows decreasing statistic and p-value as width and depth increase
- Distribution of neurons in last layer of MLP is heavy-tailed
- Theorem 1 predicts independence of neurons, which is validated by pair-wise joint distributions matching an isotropic 2-dimensional Gaussian distribution
Convergence of the neural covariance
- Theorem 2 predicts that the covariance qt converges in L 2 norm to q t when min(n, L) → ∞.
- Fig. 5 compares the empirical covariance qt with the theoretical prediction q t for different values of (n, L).
- The L 2 error is smaller with depth L = 5000 than with depth L = 5.
- The theoretical prediction q t is approximated with a PDE solver for t ∈ [0, 1] with a discretization step ∆t =1e-4.
Role of width and depth
- Width is more important than depth in convergence to limiting values
- Theorem 1 and Theorem 2 provide bounds on the convergence rate
- A better bound of the form C1 √ n + C2 √ L can be obtained
- For large widths, finite-depth variance is close to infinite-depth variance
Conclusion and limitations
- Residual neural networks (resnets) have been shown to have a large-depth and large-width limit that commute at initialization.
- A novel proof technique and concentration of measure result for a McKean-Vlasov process was used to prove this.
- Prior works analyzing deep and wide neural networks take the width limit first then depth.
- Different behaviors may occur depending on the learning rate chosen as a function of width and depth.
- Further analysis would require more mathematical machinery than presented here.
A a more comprehensive literature review
- Initialization schemes: Edge of Chaos initialization scheme improves performance
- Gaussian process behaviour: Randomly initialized neural network behaves like a Gaussian process
- Neural Tangent Kernel (NTK): Extensively studied, optimization and generalization properties studied
- Tensor programs: Generalization and unification of existing works
- Others: Used for network pruning, regularization, feature learning, and ensembling methods
- Infinite-width-then-infinite-depth limit: Edge of Chaos initialization, impact of activation function, Neural Tangent Kernel
- Joint infinite-width-and-depth limit: Log-normal behavior, Stochastic Differential Equation
- Infinite-depth limit of finite-width neural networks: Diffusion process, activation function, Neural Tangent Kernel
- Mathematical framework: Stochastic Differential Equations, Itô diffusion process, Itô’s lemma, Euler discretization scheme
- Width-uniform convergence: Result for infinite-depth