Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Machine learning methods use iterative optimization algorithms for training.
  • Choices of optimizer, learning rate, batch size, etc. must be made for deep neural networks.
  • Koopman operator theory can be used to identify when choices lead to equivalent or non-equivalent optimization trajectories.
  • Analysis of feedforward, fully connected neural networks provides a general characterization of when choices lead to equivalent or non-equivalent evolution of network parameters.

Paper Content

Introduction

Koopman operator theory

Iterative algorithms as dynamical systems

  • Iterative algorithms can be viewed as discrete-time dynamical systems
  • Algorithm T acts as a dynamical map with the number of iterations acting as “time”
  • The algorithms of interest are expected to converge, and to get sufficiently close to their limit within a finite number of iterations
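
To make the "algorithm as a map" view concrete, the following is a minimal sketch, not taken from the paper. It treats one gradient descent step on the toy loss f(x) = x**2 as the map T, with the iteration count playing the role of time. The names `gradient_descent_map` and `trajectory`, the loss, and the learning rate are all illustrative assumptions.

```python
def gradient_descent_map(x, lr=0.1):
    """One iteration T(x) = x - lr * f'(x) for the toy loss f(x) = x**2 (assumed)."""
    return x - lr * 2.0 * x


def trajectory(T, x0, num_iterations):
    """Apply the map T repeatedly; the list index is the discrete 'time'."""
    xs = [x0]
    for _ in range(num_iterations):
        xs.append(T(xs[-1]))
    return xs


# The orbit of the initial condition x0 = 1.0 under 50 iterations of T.
orbit = trajectory(gradient_descent_map, x0=1.0, num_iterations=50)
```

Here every step multiplies x by 0.8, so the orbit contracts geometrically toward the minimizer at 0. A list like `orbit` is the kind of trajectory data that the data-driven analysis works with.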

Koopman mode analysis

  • Dynamical systems theory is difficult to leverage when algorithms are nonlinear.
  • Koopman operator theory is a data-driven dynamical systems theory that is related to invariant geometrical objects.
  • The Koopman operator is an infinite-dimensional linear operator that describes the time evolution of observables.
  • The action of the Koopman operator can be decomposed using its eigenfunctions and eigenvalues.
  • Any observable can be used to identify eigenvalues.
  • In practice, the Koopman operator is approximated with a finite number of eigenvalues.
  • Dynamic mode decomposition (DMD) is the most popular method for numerically computing the Koopman mode decomposition.
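
As an illustration of how DMD estimates Koopman eigenvalues from trajectory data, here is a minimal sketch of the standard "exact DMD" recipe using NumPy. It is not the paper's implementation. The helper name `dmd_eigenvalues`, the 2x2 test matrix, and the initial condition are assumptions chosen so that the true eigenvalues (0.9 and 0.5) are known in advance.

```python
import numpy as np


def dmd_eigenvalues(X, Y, rank):
    """Eigenvalues of the rank-reduced best-fit linear operator A with Y ~ A X."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    # Project A onto the leading singular vectors: A_tilde = U* Y V S^{-1}.
    # Dividing by s scales column j by 1 / s[j], i.e. right-multiplies by S^{-1}.
    A_tilde = (U.conj().T @ Y @ Vh.conj().T) / s
    return np.linalg.eigvals(A_tilde)


# Toy linear system (assumed for illustration) with eigenvalues 0.9 and 0.5.
A = np.array([[0.9, 0.1],
              [0.0, 0.5]])
snapshots = [np.array([1.0, 1.0])]
for _ in range(20):
    snapshots.append(A @ snapshots[-1])
S = np.stack(snapshots, axis=1)  # shape (2, 21): one column per time step

# Pair each snapshot with its successor and fit the operator.
eigs = dmd_eigenvalues(S[:, :-1], S[:, 1:], rank=2)
```

For a linear system observed through its full state, the recovered eigenvalues coincide with those of A. For nonlinear algorithms, the same recipe is applied to chosen observables and the result is only an approximation.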

Koopman spectra and conjugacy

  • Koopman operator can be used to compare ML optimization algorithms
  • Two dynamical systems are conjugate if there is a smooth invertible mapping between their state spaces that carries the trajectories of one onto the trajectories of the other
  • If two dynamical systems are conjugate, they have the same principal eigenvalues
  • If two dynamical systems are semi-conjugate, the set of principal eigenvalues of one is a subset of the other
  • An example of identifying conjugacies between algorithms is presented in the Appendix
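
The link between conjugacy and shared eigenvalues follows from a short standard argument, sketched below. The symbols $T_1$, $T_2$, $h$, $U_{T}$, $\phi$, and $\lambda$ are notation assumed here for illustration and may differ from the paper's.

```latex
% Setup: maps T_1 : X_1 \to X_1 and T_2 : X_2 \to X_2, and a mapping
% h : X_1 \to X_2 that intertwines them.
h \circ T_1 = T_2 \circ h
% The Koopman operator acts on observables by composition: U_T g = g \circ T.
% Suppose \phi is an eigenfunction of U_{T_2} with eigenvalue \lambda:
U_{T_2}\,\phi = \phi \circ T_2 = \lambda\,\phi
% Then \phi \circ h is an eigenfunction of U_{T_1} with the same eigenvalue:
U_{T_1}(\phi \circ h)
  = \phi \circ h \circ T_1
  = \phi \circ T_2 \circ h
  = (\lambda\,\phi) \circ h
  = \lambda\,(\phi \circ h)
```

If $h$ is invertible, the same argument applied to $h^{-1}$ gives the reverse inclusion, so the two eigenvalue sets coincide. If $h$ is only onto (a semi-conjugacy), only the one-way inclusion holds, which matches the subset statement above.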

Online mirror and gradient descent

  • Gradient descent in ML is challenging due to non-convex loss landscapes.
  • Recent discovery of a mapping between online gradient descent and online mirror descent offers promise of alleviating challenges.
  • Mapping proved to be exact in continuous-time dynamics and approximate in discrete-time.

Equivalence of online mirror and gradient descent

  • OMD is used to find a minimum of a function f on a convex set K
  • OGD is used to find a minimum of a function f on a non-convex set K
  • Algorithms 1 and 2 provide pseudocode implementations of OMD and OGD
  • If $\nabla \tilde{f}(u_t) = \nabla f[q(u_t)]$ and $\tilde{K} = q^{-1}(K)$, the outputs of the two algorithms are equivalent
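
The following toy comparison is a sketch under stated assumptions, not the paper's Algorithms 1 and 2. It runs mirror descent with a log barrier regularizer R(w) = -log(w) alongside gradient descent on the exponential reparameterization w = exp(u), for the assumed one-dimensional objective f(w) = 0.5 * (w - 2)**2 on w > 0. The objective, step size, and function names are illustrative, and no projection step is included because the iterates stay positive.

```python
import math


def grad_f(w, target=2.0):
    """Gradient of the assumed toy objective f(w) = 0.5 * (w - target)**2."""
    return w - target


def omd_step(w, lr):
    """Mirror descent step with the log barrier, whose mirror map is -1/w."""
    dual = -1.0 / w - lr * grad_f(w)  # gradient step in the dual variable
    return -1.0 / dual                # map back to the primal variable


def ogd_step(u, lr):
    """Gradient descent step on f(exp(u)), using the chain rule."""
    w = math.exp(u)
    return u - lr * grad_f(w) * w


lr, steps = 0.01, 500
u = 0.0                       # w = exp(0) = 1, same start for both methods
w_omd, w_ogd = [1.0], [math.exp(u)]
for _ in range(steps):
    w_omd.append(omd_step(w_omd[-1], lr))
    u = ogd_step(u, lr)
    w_ogd.append(math.exp(u))

# Largest disagreement between the two trajectories over all iterations.
max_gap = max(abs(a - b) for a, b in zip(w_omd, w_ogd))
```

In continuous time both schemes reduce to dw/dt = -w**2 * f'(w), so the discrete trajectories differ only by terms of higher order in the step size. `max_gap` therefore stays small without being exactly zero.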

Identifying a conjugacy between online mirror and gradient descent

  • Tested log barrier regularization with exponential reparameterization
  • Ran OMD and OGD on two functions for 500 time steps
  • 100 independent initial conditions sampled from a grid
  • Koopman mode decomposition computed from trajectories
  • OMD and OGD successfully mapped to one another for all time
  • Koopman spectra closely aligned, properly identifying conjugate optimization methods

Neural network training

  • Training DNNs requires making decisions about optimizer, activation functions, batch size, learning rate, architecture, etc.
  • General guidelines and numerical results can help make decisions, but are qualitative.
  • A method to identify when decisions lead to equivalent or distinct training does not exist.
  • Koopman mode decomposition is used to study when equivalent optimization of feedforward, fully connected DNNs arises.
  • Observables are chosen as the last layer of weights.
  • Experiments are done on MNIST and weights are initialized uniformly.
  • Results show that for some hyper-parameters, there are choices that lead to conjugate or non-conjugate FCN training.

Batch size and learning rate

  • Changing the batch size impacts the training of a FCN.
  • When batch sizes are sufficiently large, there should be a conjugacy.
  • The training losses of FCNs with b = 64 and b = 128 are similar.
  • Koopman operator theoretic framework can identify expected conjugacy of FCN training.

Number of hidden units per layer

  • Increasing the number of hidden units in each layer of a DNN leads to a decrease in loss.
  • It is not known if this decrease in loss is due to a change in training dynamics or an increase in capacity.
  • When the number of units is small, the Koopman spectrum has a larger Wasserstein distance from the spectrum found when the number of units is large.
  • This suggests that decreasing the size of the network layers beyond a certain point leads to non-conjugate training.
  • The decrease in loss arises from an increased capacity that occurs with the addition of more hidden units.
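
One way to compare two sets of Koopman eigenvalues by Wasserstein distance is sketched below. It assumes both spectra have the same number of eigenvalues with equal weights, in which case the 1-Wasserstein distance equals the cost of the best one-to-one matching divided by the number of eigenvalues. The function name `spectral_wasserstein` and the example spectra are hypothetical, and the paper's exact computation may differ. The brute-force search over permutations is only practical for a handful of eigenvalues.

```python
from itertools import permutations


def spectral_wasserstein(eigs_a, eigs_b):
    """1-Wasserstein distance between two equally weighted sets of complex numbers."""
    if len(eigs_a) != len(eigs_b):
        raise ValueError("this sketch assumes spectra of equal size")
    n = len(eigs_a)
    # Try every one-to-one matching and keep the cheapest total cost.
    best = min(
        sum(abs(a - eigs_b[j]) for a, j in zip(eigs_a, perm))
        for perm in permutations(range(n))
    )
    return best / n


# Hypothetical spectra: a reordered copy, and one with a shifted eigenvalue.
reference = [0.9, 0.5 + 0.1j, 0.5 - 0.1j]
reordered = [0.5 - 0.1j, 0.9, 0.5 + 0.1j]
shifted = [0.8, 0.5 + 0.1j, 0.5 - 0.1j]

d_same = spectral_wasserstein(reference, reordered)
d_diff = spectral_wasserstein(reference, shifted)
```

A distance near zero, as for `d_same`, is consistent with conjugate dynamics. A clearly larger distance, as for `d_diff`, points toward non-conjugate dynamics.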

Data set

  • Training a FCN on different data sets affects its training
  • MNIST achieves lower asymptotic loss than FMNIST and KMNIST
  • Training on handwritten characters vs. synthetic man-made objects may lead to non-conjugate optimization

Activation functions

  • Choice of activation function affects training performance
  • ReLU and tanh lead to fast convergence of training loss
  • Sigmoid activation function leads to worse performance
  • Wasserstein distance between Koopman spectra is smallest with ReLU and tanh, but difference is minor
  • Choice of activation function has historically been important for dynamic behavior of training DNNs

Discussion

  • Data is being generated and ML methods are being developed to analyze it
  • A way to classify, analyze, and understand the nature and interrelations of ML methods is needed
  • Koopman operator theory can be used to identify conjugacies in a data-driven manner
  • Koopman operator theory can be used to investigate proprietary ML methods
  • Koopman operator theory can be used to optimize, design, and analyze algorithms from a dynamical systems perspective
  • Koopman operator theory can be used to identify conjugacies in ML settings such as reinforcement learning and curriculum learning
  • Koopman operator theory respects permutation symmetry present in DNNs
  • Koopman operator theory can be used to transform one ML method to another
  • An example of using Koopman operator theory to identify conjugate and non-conjugate applications of optimization algorithms is given
  • The bisection and regula falsi methods are root-finding algorithms with linear convergence
  • The Koopman eigenvalue associated with the evolution of b_k is 1/2 for the bisection method
  • The Koopman eigenvalue associated with the evolution of b_k is 1/2 for the regula falsi method when δ = 2
  • Koopman mode analysis provides additional information beyond the binary identification of equivalence
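
The eigenvalue of 1/2 for bisection can be checked numerically with a small sketch. The summary does not define b_k, so the code below assumes the observable is the width of the bracketing interval at iteration k. The test function x**2 - 2, the starting bracket, and the helper names are also assumptions. The eigenvalue is estimated with a one-dimensional least-squares fit of each value against its predecessor, which is a scalar analogue of DMD.

```python
def bisection_widths(f, lo, hi, num_iterations):
    """Run bisection and record the bracket width after every iteration."""
    widths = [hi - lo]
    for _ in range(num_iterations):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid   # sign change in the left half: root lies there
        else:
            lo = mid   # otherwise the root lies in the right half
        widths.append(hi - lo)
    return widths


def fit_eigenvalue(series):
    """Least-squares estimate of lam in series[k + 1] ~ lam * series[k]."""
    num = sum(x * y for x, y in zip(series[:-1], series[1:]))
    den = sum(x * x for x in series[:-1])
    return num / den


# Assumed test problem: the root of x**2 - 2 bracketed by [0, 2].
widths = bisection_widths(lambda x: x * x - 2.0, 0.0, 2.0, 20)
eigenvalue = fit_eigenvalue(widths)
```

Because bisection halves the bracket at every step regardless of the function, the fitted eigenvalue is 1/2 for any valid starting bracket.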