Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Machine learning methods use iterative optimization algorithms for training.
- Choices of optimizer, learning rate, batch size, etc. must be made for deep neural networks.
- Koopman operator theory can be used to identify when choices lead to equivalent or non-equivalent optimization trajectories.
- Analysis of feedforward, fully connected neural networks provides a general characterization of when choices lead to equivalent or non-equivalent evolution of network parameters.
Paper Content
Introduction
Koopman operator theory
Iterative algorithms as dynamical systems
- Iterative algorithms can be viewed as discrete-time dynamical systems
- Algorithm T acts as a dynamical map with the number of iterations acting as “time”
- To be useful, an algorithm must converge, and must do so to sufficient accuracy within a finite number of iterations
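
The dynamical-systems view above can be sketched in a few lines. This is only an illustration: the helper `iterate`, the objective f(x) = x², and the step size 0.1 are assumptions chosen for the example, not taken from the paper. One step of gradient descent plays the role of the map T, and the iteration count plays the role of time.

```python
def iterate(T, x0, n_steps):
    """Apply the map T repeatedly; returns [x0, T(x0), T(T(x0)), ...]."""
    trajectory = [x0]
    for _ in range(n_steps):
        trajectory.append(T(trajectory[-1]))
    return trajectory


def gradient_step(x, lr=0.1):
    # One gradient descent step on the assumed objective f(x) = x**2,
    # whose gradient is 2 * x. This is the dynamical map T.
    return x - lr * 2.0 * x


# The iteration index k acts as discrete time.
trajectory = iterate(gradient_step, 1.0, 50)
```

For this particular map each step multiplies the state by 0.8, so the trajectory decays geometrically toward the minimizer at 0.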
Koopman mode analysis
- Dynamical systems theory is difficult to leverage when algorithms are nonlinear.
- Koopman operator theory is a data-driven dynamical systems theory that is related to invariant geometrical objects.
- Koopman operator is an infinite dimensional linear operator that describes the time evolution of observables.
- Koopman operator can be decomposed into eigenfunctions and eigenvalues.
- Any observable can be used to identify eigenvalues.
- Koopman operator can be approximated with a finite number of eigenvalues.
- Dynamic mode decomposition (DMD) is the most popular method for numerically computing the Koopman mode decomposition.
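
For reference, the objects named in the list above can be written out as follows. This is the standard textbook form rather than the paper's own notation: the symbols $U$, $g$, $\varphi_j$, $\lambda_j$ and $v_j$ are our labels, and the expansion assumes the observable lies in the span of the eigenfunctions.

```latex
\begin{align}
  % Koopman operator U acting on an observable g, for the map x_{k+1} = T(x_k)
  (U g)(x) &= g\bigl(T(x)\bigr), \\
  % Eigenfunctions \varphi_j and eigenvalues \lambda_j of U
  U \varphi_j &= \lambda_j \varphi_j, \\
  % Koopman mode decomposition of a vector-valued observable g,
  % with Koopman modes v_j
  g\bigl(T^k(x)\bigr) &= \sum_{j} \lambda_j^{k} \, \varphi_j(x) \, v_j .
\end{align}
```

Because $U$ is linear even when $T$ is not, the time dependence of each term is carried entirely by the powers $\lambda_j^{k}$.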
Koopman spectra and conjugacy
- Koopman operator can be used to compare ML optimization algorithms
- Two dynamical systems are conjugate if there is a smooth invertible mapping between them
- If two dynamical systems are conjugate, they have the same principal eigenvalues
- If two dynamical systems are semi-conjugate, the set of principal eigenvalues of one is a subset of the other
- An example of identifying conjugacies between algorithms is presented in the Appendix
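
A small numerical illustration of the conjugacy statement above, under assumptions of our own choosing: `T1` is the linear map x → x/2, `h` is the smooth invertible change of variables h(x) = exp(x) − 1, and `T2` is built to be conjugate to `T1` through `h`. For a one-dimensional map with a stable fixed point, the late-time contraction ratio estimates the eigenvalue of the linearization at that fixed point, and both maps should give the same value.

```python
import math


def T1(x):
    # Assumed reference system: linear contraction with eigenvalue 1/2.
    return 0.5 * x


def h(x):
    # Assumed smooth, invertible change of variables (fixed point 0 -> 0).
    return math.expm1(x)


def h_inv(y):
    return math.log1p(y)


def T2(y):
    # Conjugate system: T2 = h o T1 o h^{-1}. It is nonlinear in y.
    return h(T1(h_inv(y)))


def late_ratio(T, x0, n_steps=20):
    """Estimate the contraction factor near the fixed point at 0."""
    x = x0
    for _ in range(n_steps):
        x = T(x)
    return T(x) / x


ratio_1 = late_ratio(T1, 1.0)
ratio_2 = late_ratio(T2, 3.0)
```

Although `T2` looks quite different from `T1` (it equals sqrt(1 + y) − 1), both ratios approach 1/2.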
Online mirror and gradient descent
- Gradient descent in ML is challenging due to non-convex loss landscapes.
- Recent discovery of a mapping between online gradient descent and online mirror descent offers promise of alleviating challenges.
- Mapping proved to be exact in continuous-time dynamics and approximate in discrete-time.
Equivalence of online mirror and gradient descent
- OMD is used to find a minimum of a function $f$ on a convex set $K$
- OGD is used to find a minimum of a reparameterized function $\tilde{f}$ on a non-convex set $\tilde{K}$
- Algorithms 1 and 2 provide pseudocode implementations of OMD and OGD
- If $\nabla \tilde{f}(u_t) = \nabla f[q(u_t)]$ and $\tilde{K} = q^{-1}(K)$, where $q$ is the reparameterization map, the outputs of the two algorithms are equivalent
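
The correspondence can be tried out on a one-dimensional toy problem. The sketch below is not the paper's Algorithms 1 and 2. It assumes a log-barrier regularizer R(w) = −log w for mirror descent, an exponential reparameterization w = exp(u) for gradient descent, the objective f(w) = ½(w − 2)², a step size of 0.01, and an unconstrained positive domain. Gradient descent is applied to the composite objective f(exp(u)) via the chain rule, which is our reading of the reparameterized problem.

```python
import math


def grad_f(w):
    # Gradient of the assumed objective f(w) = 0.5 * (w - 2)**2.
    return w - 2.0


def omd_log_barrier(w0, lr, n_steps):
    """Mirror descent with R(w) = -log(w), i.e. mirror map -1/w."""
    w = w0
    path = [w]
    for _ in range(n_steps):
        # Solve -1/w_next = -1/w - lr * grad_f(w) for w_next.
        w = w / (1.0 + lr * w * grad_f(w))
        path.append(w)
    return path


def ogd_exponential(w0, lr, n_steps):
    """Gradient descent on u, where w = exp(u)."""
    u = math.log(w0)
    path = [math.exp(u)]
    for _ in range(n_steps):
        w = math.exp(u)
        # Chain rule: d/du f(exp(u)) = exp(u) * grad_f(exp(u)).
        u = u - lr * w * grad_f(w)
        path.append(math.exp(u))
    return path


omd_path = omd_log_barrier(1.0, 0.01, 500)
ogd_path = ogd_exponential(1.0, 0.01, 500)
max_gap = max(abs(a - b) for a, b in zip(omd_path, ogd_path))
```

With a small step size the two trajectories stay close but do not coincide exactly, in line with the mapping being exact in continuous time and approximate in discrete time.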
Identifying a conjugacy between online mirror and gradient descent
- Tested log barrier regularization with exponential reparameterization
- Ran OMD and OGD on two functions for 500 time steps
- 100 independent initial conditions sampled from a grid
- Koopman mode decomposition computed from trajectories
- OMD and OGD successfully mapped to one another for all time
- Koopman spectra closely aligned, properly identifying conjugate optimization methods
Neural network training
- Training DNNs requires making decisions about optimizer, activation functions, batch size, learning rate, architecture, etc.
- General guidelines and numerical results can help make decisions, but are qualitative.
- A method to identify when decisions lead to equivalent or distinct training does not exist.
- Koopman mode decomposition is used to study when equivalent optimization of feedforward, fully connected DNNs arises.
- Observables are chosen as the last layer of weights.
- Experiments are done on MNIST and weights are initialized uniformly.
- Results show that for some hyper-parameters, there are choices that lead to conjugate or non-conjugate FCN training.
Batch size and learning rate
- Changing the batch size impacts the training of an FCN.
- When batch sizes are sufficiently large, training with different batch sizes is expected to be conjugate.
- Training losses of FCNs with b = 64 and b = 128 are similar.
- Koopman operator theoretic framework can identify expected conjugacy of FCN training.
Number of hidden units per layer
- Increasing the number of hidden units in each layer of a DNN leads to a decrease in loss.
- It is not known if this decrease in loss is due to a change in training dynamics or an increase in capacity.
- When the number of units is small, the Koopman spectrum has a larger Wasserstein distance from the spectrum found when the number of units is large.
- This suggests that decreasing the size of the network layers beyond a certain point leads to non-conjugate training.
- Where training remains conjugate, the decrease in loss is attributed to the increased capacity that comes with additional hidden units rather than to a change in training dynamics.
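
To make the spectral comparison concrete, here is a minimal sketch of a Wasserstein-1 distance between two small sets of complex eigenvalues. It assumes both spectra have the same number of eigenvalues with equal weights, in which case the optimal transport plan is a one-to-one matching; the brute-force search over matchings is only practical for a handful of eigenvalues. The function name and the example spectra are hypothetical, and the paper's own computation may differ.

```python
from itertools import permutations


def wasserstein_spectra(spec_a, spec_b):
    """Wasserstein-1 distance between equal-size, equally weighted spectra."""
    if len(spec_a) != len(spec_b):
        raise ValueError("this sketch assumes spectra of equal size")
    n = len(spec_a)
    # Try every one-to-one matching and keep the cheapest total cost.
    best_total = min(
        sum(abs(a - spec_b[j]) for a, j in zip(spec_a, perm))
        for perm in permutations(range(n))
    )
    return best_total / n


# Hypothetical spectra: the same eigenvalues in a different order,
# and a copy shifted by 0.1 along the real axis.
spectrum = [0.9 + 0j, 0.5 + 0.2j, 0.5 - 0.2j]
reordered = [0.5 - 0.2j, 0.9 + 0j, 0.5 + 0.2j]
shifted = [lam - 0.1 for lam in spectrum]

same_distance = wasserstein_spectra(spectrum, reordered)
shift_distance = wasserstein_spectra(spectrum, shifted)
```

A distance near zero indicates closely aligned spectra, while a larger distance points to different dynamics.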
Data set
- Training an FCN on different data sets affects its training
- MNIST achieves lower asymptotic loss than FMNIST and KMNIST
- Training on handwritten characters vs. synthetic man-made objects may lead to non-conjugate optimization
Activation functions
- Choice of activation function affects training performance
- ReLU and tanh lead to fast convergence of training loss
- Sigmoid activation function leads to worse performance
- Wasserstein distance between Koopman spectra is smallest with ReLU and tanh, but difference is minor
- Choice of activation function has historically been important for dynamic behavior of training DNNs
Discussion
- Data is being generated and ML methods are being developed to analyze it
- A way to classify, analyze, and understand the nature and interrelations of ML methods is needed
- Koopman operator theory can be used to identify conjugacies in a data-driven manner
- Koopman operator theory can be used to investigate proprietary ML methods
- Koopman operator theory can be used to optimize, design, and analyze algorithms from a dynamical systems perspective
- Koopman operator theory can be used to identify conjugacies in ML settings such as reinforcement learning and curriculum learning
- Koopman operator theory respects permutation symmetry present in DNNs
- Koopman operator theory can be used to transform one ML method to another
- An example of using Koopman operator theory to identify conjugate and non-conjugate applications of optimization algorithms is given in the Appendix
- The bisection and regula falsi methods can be used to find roots with linear convergence
- The Koopman eigenvalue associated with the evolution of $b_k$ is 1/2 for the bisection method
- The Koopman eigenvalue associated with the evolution of $b_k$ is 1/2 for the regula falsi method when $\delta = 2$
- Koopman mode analysis provides additional information beyond the binary identification of equivalence
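
The bisection eigenvalue can be recovered from data with a one-dimensional least-squares fit, which is what dynamic mode decomposition reduces to for a single scalar observable. In this sketch the observable is taken to be the width of the bracketing interval, which is our assumption about the quantity tracked. The test function x² − 2, the starting bracket [0, 2], and the helper names are likewise illustrative.

```python
def bisection_widths(f, a, b, n_steps):
    """Run bisection and record the bracket width after each step."""
    widths = [b - a]
    for _ in range(n_steps):
        mid = 0.5 * (a + b)
        # Keep the half-interval on which f changes sign.
        if f(a) * f(mid) <= 0:
            b = mid
        else:
            a = mid
        widths.append(b - a)
    return widths


def fit_eigenvalue(series):
    """Least-squares fit of lam in series[k + 1] ~ lam * series[k]."""
    num = sum(x0 * x1 for x0, x1 in zip(series[:-1], series[1:]))
    den = sum(x0 * x0 for x0 in series[:-1])
    return num / den


widths = bisection_widths(lambda x: x * x - 2.0, 0.0, 2.0, 20)
eigenvalue = fit_eigenvalue(widths)
```

Since every bisection step halves the bracket, the fitted value is 1/2 whichever function is being solved.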