Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Machine learning methods use iterative optimization algorithms for training.
Choices of optimizer, learning rate, batch size, etc. must be made for deep neural networks.
Koopman operator theory can be used to identify when choices lead to equivalent or non-equivalent optimization trajectories.
Analysis of feedforward, fully connected neural networks provides a general characterization of when choices lead to equivalent or non-equivalent evolution of network parameters.

Paper Content

Introduction

Koopman operator theory

Iterative algorithms as dynamical systems

Iterative algorithms can be viewed as discrete-time dynamical systems
Algorithm T acts as a dynamical map with the number of iterations acting as “time”
The algorithms considered must converge, and must come sufficiently close to convergence within a finite number of iterations
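
As a minimal sketch of this viewpoint, the snippet below treats one gradient descent step on a toy quadratic as the map T and iterates it, with the step index playing the role of time. The names `iterate` and `gd_step`, the objective, and the learning rate are illustrative assumptions and are not taken from the paper.

```python
from typing import Callable, List


def iterate(T: Callable[[float], float], x0: float, n_steps: int) -> List[float]:
    """Apply the map T repeatedly; the iteration count plays the role of time."""
    trajectory = [x0]
    for _ in range(n_steps):
        trajectory.append(T(trajectory[-1]))
    return trajectory


def gd_step(x: float, lr: float = 0.1) -> float:
    """One gradient descent step on the toy objective f(x) = x**2 / 2 (gradient is x)."""
    return x - lr * x


trajectory = iterate(gd_step, x0=1.0, n_steps=50)
print(trajectory[:3], trajectory[-1])
```

The recorded trajectory is the kind of time series from which a Koopman mode decomposition can later be estimated.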

Koopman mode analysis

Dynamical systems theory is difficult to leverage when algorithms are nonlinear.
Koopman operator theory is a dynamical systems theory that can be applied in a data-driven manner and is related to invariant geometrical objects.
The Koopman operator is an infinite-dimensional linear operator that describes the time evolution of observables.
The Koopman operator can be decomposed into eigenfunctions and eigenvalues.
Any observable can be used to identify eigenvalues.
In practice, the Koopman operator is approximated with a finite number of eigenvalues.
Dynamic mode decomposition (DMD) is the most popular method for numerically computing the Koopman mode decomposition.
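
To make the DMD step concrete, here is a minimal sketch of its simplest form: fit the best linear map between consecutive snapshots by least squares and take its eigenvalues. The function name `dmd_eigenvalues` and the toy linear system are assumptions made for illustration; practical DMD variants usually add rank truncation and other refinements that are omitted here.

```python
import numpy as np


def dmd_eigenvalues(snapshots: np.ndarray) -> np.ndarray:
    """Eigenvalues of the least-squares linear map taking each snapshot to the next.

    snapshots has shape (n_observables, n_times).
    """
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    A = Y @ np.linalg.pinv(X)  # best-fit A with Y ≈ A X
    return np.linalg.eigvals(A)


# Toy linear system with known eigenvalues 0.9 and 0.5 (an illustrative assumption).
A_true = np.array([[0.9, 0.2], [0.0, 0.5]])
x = np.array([1.0, 1.0])
states = [x]
for _ in range(20):
    x = A_true @ x
    states.append(x)

eigs = np.sort(dmd_eigenvalues(np.array(states).T).real)
print(eigs)
```

On this toy system the estimated eigenvalues should match those of `A_true` up to numerical error, because the dynamics are exactly linear in the chosen observables.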

Koopman spectra and conjugacy

The Koopman operator can be used to compare ML optimization algorithms
Two dynamical systems are conjugate if there is a smooth invertible mapping between them
If two dynamical systems are conjugate, they have the same principal eigenvalues
If two dynamical systems are semi-conjugate, the set of principal eigenvalues of one is a subset of the other
An example of identifying conjugacies between algorithms is presented in the Appendix
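
The link between conjugacy and shared eigenvalues follows from a short calculation, sketched here under assumed notation: $T_1, T_2$ are the two dynamical maps, $h$ is the mapping between them, and $U_i g = g \circ T_i$ are the corresponding Koopman operators. These symbols are introduced for illustration and may differ from the paper's.

```latex
\begin{align*}
  &\text{Assume } h \circ T_1 = T_2 \circ h
    \text{ and } U_2 \varphi = \varphi \circ T_2 = \lambda \varphi. \\
  &\text{Then } U_1(\varphi \circ h)
    = \varphi \circ h \circ T_1
    = \varphi \circ T_2 \circ h
    = \lambda \, (\varphi \circ h).
\end{align*}
```

So every eigenvalue of $U_2$ is also an eigenvalue of $U_1$, with eigenfunction $\varphi \circ h$. If $h$ is invertible, the same argument applied to $h^{-1}$ gives the reverse inclusion and the two sets of eigenvalues coincide; if $h$ is not invertible (a semi-conjugacy), only the inclusion holds.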

Online mirror and gradient descent

Gradient descent in ML is challenging due to non-convex loss landscapes.
A recently discovered mapping between online gradient descent and online mirror descent offers a way to alleviate these challenges.
The mapping has been proven to be exact in continuous time and approximate in discrete time.

Equivalence of online mirror and gradient descent

OMD is used to find a minimum of a function f on a convex set K
OGD is used to find a minimum of a function f on a non-convex set K
Algorithms 1 and 2 provide pseudocode implementations of OMD and OGD
If $\nabla \tilde{f}(u_t) = \nabla f[q(u_t)]$ and $\tilde{K} = q^{-1}(K)$, the outputs of the two algorithms are equivalent

Identifying a conjugacy between online mirror and gradient descent

Tested log barrier regularization with exponential reparameterization
Ran OMD and OGD on two functions for 500 time steps
100 independent initial conditions sampled from a grid
Koopman mode decomposition computed from trajectories
OMD and OGD successfully mapped to one another for all time
Koopman spectra closely aligned, properly identifying conjugate optimization methods
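
The following sketch illustrates the kind of comparison described above on a single trajectory. It assumes a one-dimensional, fixed (not online) toy objective, a mirror descent update whose link function is the gradient of the log barrier $-\log w$, and gradient descent on $u$ with the exponential reparameterization $w = e^u$. The objective, step size, initial condition, and function names are all assumptions; only the 500 steps mirror the setup listed above.

```python
import math


def grad_f(w: float) -> float:
    """Gradient of the toy objective f(w) = (w - 2)**2 / 2."""
    return w - 2.0


def md_log_barrier_step(w: float, lr: float) -> float:
    """Mirror descent step with link -1/w (gradient of the log barrier -log w):
    -1/w_next = -1/w - lr * grad_f(w)."""
    return w / (1.0 + lr * w * grad_f(w))


def gd_exp_step(u: float, lr: float) -> float:
    """Gradient descent step on u for the reparameterized objective f(exp(u))."""
    w = math.exp(u)
    return u - lr * w * grad_f(w)


lr, n_steps = 0.01, 500
w_md, u_gd = 1.0, 0.0  # same starting point, since exp(0) = 1
max_gap = 0.0
for _ in range(n_steps):
    w_md = md_log_barrier_step(w_md, lr)
    u_gd = gd_exp_step(u_gd, lr)
    max_gap = max(max_gap, abs(w_md - math.exp(u_gd)))

print(w_md, math.exp(u_gd), max_gap)
```

With a small step size the two discrete trajectories stay close without coinciding exactly, which is consistent with the mapping being exact only in continuous time.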

Neural network training

Training DNNs requires making decisions about optimizer, activation functions, batch size, learning rate, architecture, etc.
General guidelines and numerical results can help make decisions, but are qualitative.
A method to identify when decisions lead to equivalent or distinct training does not exist.
Koopman mode decomposition is used to study when equivalent optimization of feedforward, fully connected DNNs arises.
Observables are chosen as the last layer of weights.
Experiments are done on MNIST and weights are initialized uniformly.
Results show that for some hyper-parameters, there are choices that lead to conjugate or non-conjugate FCN training.

Batch size and learning rate

Changing the batch size impacts the training of an FCN.
When batch sizes are sufficiently large, there should be a conjugacy.
The training losses of FCNs with b = 64 and b = 128 are similar.
The Koopman operator theoretic framework can identify the expected conjugacy of FCN training.

Number of hidden units per layer

Increasing the number of hidden units in each layer of a DNN leads to a decrease in loss.
It is not known if this decrease in loss is due to a change in training dynamics or an increase in capacity.
When the number of units is small, the Koopman spectrum has a larger Wasserstein distance from the spectrum found when the number of units is large.
This suggests that decreasing the size of the network layers beyond a certain point leads to non-conjugate training.
The decrease in loss arises from an increased capacity that occurs with the addition of more hidden units.
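
A Wasserstein distance between two spectra can be sketched as an optimal matching between two sets of points in the complex plane. The brute-force version below assumes both spectra contain the same small number of eigenvalues with equal weights. The function name and the three example spectra are made up for illustration and are not results from the paper, whose exact computation may differ.

```python
from itertools import permutations
from typing import Sequence


def wasserstein_spectra(a: Sequence[complex], b: Sequence[complex]) -> float:
    """1-Wasserstein distance between two equal-size sets of eigenvalues,
    each treated as a uniform point cloud in the complex plane.

    Brute force over all matchings, so only suitable for a handful of eigenvalues.
    """
    if len(a) != len(b) or not a:
        raise ValueError("this sketch assumes two non-empty sets of equal size")
    best = min(
        sum(abs(x - y) for x, y in zip(a, perm)) for perm in permutations(b)
    )
    return best / len(a)


# Hypothetical spectra, chosen only to show a small and a large distance.
spec_wide = [1.0, 0.9 + 0.1j, 0.9 - 0.1j]
spec_similar = [1.0, 0.88 + 0.1j, 0.88 - 0.1j]
spec_narrow = [1.0, 0.5, 0.2]

print(wasserstein_spectra(spec_wide, spec_similar))
print(wasserstein_spectra(spec_wide, spec_narrow))
```

For larger spectra the factorial cost of enumerating matchings becomes prohibitive, and a dedicated assignment or optimal transport solver would be the usual choice.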

Data set

Training an FCN on different data sets affects its training
MNIST achieves lower asymptotic loss than FMNIST and KMNIST
Training on handwritten characters vs. synthetic man-made objects may lead to non-conjugate optimization

Activation functions

Choice of activation function affects training performance
ReLU and tanh lead to fast convergence of training loss
Sigmoid activation function leads to worse performance
Wasserstein distance between Koopman spectra is smallest with ReLU and tanh, but difference is minor
Choice of activation function has historically been important for dynamic behavior of training DNNs

Discussion

Data is being generated and ML methods are being developed to analyze it
A way to classify, analyze, and understand the nature and interrelations of ML methods is needed
Koopman operator theory can be used to identify conjugacies in a data-driven manner
Koopman operator theory can be used to investigate proprietary ML methods
Koopman operator theory can be used to optimize, design, and analyze algorithms from a dynamical systems perspective
Koopman operator theory can be used to identify conjugacies in ML settings such as reinforcement learning and curriculum learning
Koopman operator theory respects permutation symmetry present in DNNs
Koopman operator theory can be used to transform one ML method to another
An example of using Koopman operator theory to identify conjugate and non-conjugate applications of optimization algorithms is given
The bisection and regula falsi methods can be used to find roots with linear convergence
The Koopman eigenvalue associated with the evolution of b_k is 1/2 for the bisection method
The Koopman eigenvalue associated with the evolution of b_k is 1/2 for the regula falsi method when δ = 2
Koopman mode analysis provides additional information beyond the binary identification of equivalence
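
The bisection eigenvalue can be checked numerically with a short sketch. It assumes that the tracked quantity is the width of the bracketing interval (this summary does not define b_k, so that identification is a guess) and estimates a single eigenvalue by a one-dimensional least-squares fit. The test function and all names are illustrative.

```python
from typing import Callable, List


def bisection_widths(f: Callable[[float], float], a: float, b: float, n_steps: int) -> List[float]:
    """Run bisection on [a, b] and record the bracket width after each step."""
    widths = [b - a]
    for _ in range(n_steps):
        mid = 0.5 * (a + b)
        if f(a) * f(mid) <= 0:
            b = mid
        else:
            a = mid
        widths.append(b - a)
    return widths


def one_dim_eigenvalue(series: List[float]) -> float:
    """Least-squares fit of series[k + 1] ≈ lam * series[k] (a one-dimensional DMD)."""
    num = sum(x_next * x for x, x_next in zip(series, series[1:]))
    den = sum(x * x for x in series[:-1])
    return num / den


widths = bisection_widths(lambda x: x**3 - 2.0, a=0.0, b=2.0, n_steps=20)
lam = one_dim_eigenvalue(widths)
print(lam)
```

Because bisection halves the bracket at every step regardless of the function, the fitted value is 1/2 up to rounding.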
