Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Deep Learning (DL) has problems such as feature redundancy and vanishing/exploding gradients.
  • Riemannian-based DL uses geometric optimization to update parameters on Riemannian manifolds.
  • This article surveys the application of geometric optimization in DL networks for AI tasks.
  • Toolboxes that implement optimization on manifolds are discussed.
  • The performance of different deep geometric optimization methods is compared.

Paper Content

Introduction

  • Increasing computing power has enabled deep neural networks to be successful in various tasks.
  • Deep learning models often contain many layers and parameters, which can be challenging to optimize.
  • Geometric optimization can reduce parameters and convert constrained optimization problems into unconstrained ones.
  • Geometric optimization has been applied to various deep neural networks, such as CNN, RNN and ViT.
  • This article reviews the theory and applications of geometric optimization in shallow and deep learning.
  • It also investigates representative manifold optimization toolboxes and compares performance of different geometric deep learning methods.

Geometric optimization theory

  • Optimization problems are used to find the maximum or minimum value of a cost function.
  • Conventional optimization methods can be used to solve unconstrained optimization problems.
  • Constrained optimization problems can be transformed into unconstrained problems using Lagrange multipliers or a barrier penalty function.
  • Geometric optimization methods are developed to exploit the underlying geometry of a cost function.
  • Geometric optimization methods use Riemannian optimizers to find an optimal solution.

Geometric optimization process on manifolds

  • Figure 3 shows the update process in geometric optimization
  • Each point on the manifold has a corresponding tangent space
  • Each tangent space carries an inner product (the Riemannian metric), which defines lengths and angles of tangent vectors
  • The Riemannian gradient is a vector in the tangent space
  • A geodesic is a locally shortest path between two points on the manifold
  • The geodesic defined by the negative Riemannian gradient reveals the next point in the optimization direction
  • The exponential map, or a cheaper retraction, maps a point from the tangent space back onto the manifold
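
The update process above can be sketched on the simplest manifold, the unit sphere. This is a minimal illustration, not the paper's algorithm: the cost f(x) = -x^T A x, the step size, and all function names are assumptions chosen for the example. The Euclidean gradient is projected onto the tangent space to obtain the Riemannian gradient, and the step along its negative is mapped back to the sphere by either the exponential map or a retraction.

```python
import numpy as np

def project_to_tangent(x, g):
    """Project a Euclidean gradient g onto the tangent space of the unit sphere at x."""
    return g - np.dot(x, g) * x

def exp_map(x, v):
    """Exponential map on the sphere: follow the geodesic from x along tangent vector v."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return x
    return np.cos(norm_v) * x + np.sin(norm_v) * v / norm_v

def retract(x, v):
    """Retraction: take the step in the ambient space, then renormalize."""
    y = x + v
    return y / np.linalg.norm(y)

def riemannian_gd(A, x0, step=0.1, iters=500, use_exp=True):
    """Minimize the assumed cost f(x) = -x^T A x over the unit sphere (A symmetric)."""
    move = exp_map if use_exp else retract
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        egrad = -2.0 * A @ x                  # Euclidean gradient
        rgrad = project_to_tangent(x, egrad)  # Riemannian gradient
        x = move(x, -step * rgrad)            # step along the negative gradient
    return x

A = np.diag([3.0, 2.0, 1.0])
x_star = riemannian_gd(A, np.array([1.0, 1.0, 1.0]))
x_retr = riemannian_gd(A, np.array([1.0, 1.0, 1.0]), use_exp=False)
```

Both variants stay on the sphere at every iteration, so the unit-norm constraint never has to be enforced with a penalty term; for this cost the iterates approach the leading eigenvector of A.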

Gradient descent optimizers

  • Optimization problems can be abstracted as minimizing a cost function over θ ∈ E (Equation (7) in the paper), where θ denotes the trainable parameters and E the Euclidean space
  • A variety of standard optimizers exist for Equation (7)
  • Gradient descent is a basic optimization strategy
  • SGD can accelerate convergence
  • SGD-M is developed to maintain the inertia of the previous step
  • RMSProp can adaptively determine the learning rate of parameters
  • Gradient descent takes the form θ_{t+1} = θ_t − λ g_t, where g_t is the gradient of the cost at θ_t and λ is a hyper-parameter representing the step size
  • SGD uses random mini-batches of training data to update parameters
  • SGD-M exerts the influence of the last update on the current update
  • RMSProp scales each parameter's step by a running average of its recent squared gradients
  • Euclidean gradient descent can be transferred to Riemannian manifolds
  • Constrained SGD-M and constrained RMSProp generalize gradient descent optimizers from Euclidean space to manifolds
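
The Euclidean update rules listed above can be written out in a few lines. This is a sketch under assumptions: the toy cost f(θ) = 0.5·||θ||², the hyper-parameter values, and the function names are illustrative and not taken from the paper, and the full gradient stands in for a mini-batch gradient.

```python
import numpy as np

def gd_step(theta, grad, lr):
    """Plain gradient descent: step against the gradient."""
    return theta - lr * grad

def sgdm_step(theta, grad, velocity, lr, momentum=0.9):
    """SGD with momentum: the previous update carries over as inertia."""
    velocity = momentum * velocity - lr * grad
    return theta + velocity, velocity

def rmsprop_step(theta, grad, sq_avg, lr, decay=0.9, eps=1e-8):
    """RMSProp: divide each coordinate's step by a running RMS of its gradients."""
    sq_avg = decay * sq_avg + (1.0 - decay) * grad ** 2
    return theta - lr * grad / (np.sqrt(sq_avg) + eps), sq_avg

def grad_fn(theta):
    """Gradient of the assumed toy cost f(theta) = 0.5 * ||theta||^2."""
    return theta

theta_gd = np.array([1.0, -2.0])
theta_m, velocity = theta_gd.copy(), np.zeros(2)
theta_r, sq_avg = theta_gd.copy(), np.zeros(2)
for _ in range(500):
    theta_gd = gd_step(theta_gd, grad_fn(theta_gd), lr=0.1)
    theta_m, velocity = sgdm_step(theta_m, grad_fn(theta_m), velocity, lr=0.1)
    theta_r, sq_avg = rmsprop_step(theta_r, grad_fn(theta_r), sq_avg, lr=0.01)
```

A Riemannian counterpart would replace `grad` with its projection onto the tangent space and the addition with a retraction, which is the general idea behind the constrained variants.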

Manifold examples

  • Different kinds of matrix manifolds have different geometry structures and advantages when applying geometric optimization to deep learning.
  • Oblique manifold is useful for dictionary learning due to its property of unit-norm columns.
  • Stiefel manifold helps optimize RNNs since matrices on the Stiefel manifold have orthogonal and uncorrelated columns.
  • Common manifold structures include Stiefel, oblique, Graßmann, product, quotient, SPD, sphere, and unitary.
  • SPD matrices are used for image and video statistical representations.
  • Stiefel manifold is compact (closed and bounded), so a continuous cost function attains its optimum on it.
  • Oblique manifold is the set of matrices with unit-norm columns.
  • Unlike the Stiefel manifold, whose points are individual orthonormal bases, a point on the Graßmann manifold represents an entire subspace.
  • Unitary matrices are the extension of orthogonal matrices to the complex domain.
  • Lie groups are real or complex manifolds with group structure.
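
The defining constraints of several of these manifolds are easy to check numerically. The sketch below is illustrative only: the helper names are hypothetical, and the constructions (column normalization, QR factorization, Gram matrix plus a small ridge) are common ways to produce a point on each manifold, not the only ones.

```python
import numpy as np

def to_oblique(M):
    """Oblique manifold: rescale every column to unit norm."""
    return M / np.linalg.norm(M, axis=0, keepdims=True)

def to_stiefel(M):
    """Stiefel manifold: orthonormal columns, here obtained from a QR factorization."""
    Q, _ = np.linalg.qr(M)
    return Q

def to_spd(M, eps=1e-6):
    """SPD manifold: a Gram matrix plus a small ridge is symmetric positive definite."""
    return M.T @ M + eps * np.eye(M.shape[1])

def subspace_projector(Q):
    """Graßmann point as a projector: identical for every orthonormal basis of a subspace."""
    return Q @ Q.T

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))
O, Q, S = to_oblique(M), to_stiefel(M), to_spd(M)

# Rotating the basis changes the Stiefel point but not the Graßmann point.
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Q_rot = Q @ R
```

Here `Q` and `Q_rot` are different points on the Stiefel manifold, yet `subspace_projector` returns the same matrix for both, which is the distinction between the two manifolds.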

Applications in classical machine learning

  • Classical machine learning methods have been successful in solving AI problems.
  • Solving large categories of constrained classical machine learning problems in Euclidean space is difficult.
  • Geometric optimization can decrease the difficulty by treating constrained problems as unconstrained ones on Riemannian manifolds.

Dimension reduction

  • Dimension reduction (DR) is a process of finding a lower-dimensional representation of given data samples.
  • DR approaches can use linear or nonlinear transformations.
  • A generic algorithmic framework to find an optimal solution involves a maximization problem.
  • Solutions of the maximization problem are rotation invariant.
  • Most linear DR methods begin by maximizing tr(V^T A V), while nonlinear DR methods construct a graph by connecting nearby points.
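
As a concrete instance of the trace maximization, the sketch below assumes A is a sample covariance matrix and V has orthonormal columns, which makes the problem ordinary PCA; the data and variable names are invented for illustration. It also checks the rotation invariance noted above: multiplying V by any rotation leaves the objective unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)
A = Xc.T @ Xc / (len(Xc) - 1)          # assumed choice of A: sample covariance

k = 2
eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
V = eigvecs[:, -k:]                    # top-k eigenvectors maximize tr(V^T A V)
objective = np.trace(V.T @ A @ V)

# Rotation invariance: V and V @ R span the same subspace and score the same.
angle = 0.7
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
objective_rotated = np.trace((V @ R).T @ A @ (V @ R))

Z = Xc @ V                             # lower-dimensional representation
```

Because the solution is only determined up to rotation, the natural search space is the Graßmann manifold of subspaces rather than individual matrices V.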

Inverse problem

  • Inverse problems have a significant impact on practical applications.
  • Inverse problems involve reconstructing inputs from outputs.
  • Solutions to inverse problems can be achieved by confining the parameter matrix W to reside on a smooth Riemannian manifold.

Dictionary learning

  • Dictionary learning is used to obtain the most essential features of input data.
  • X is expanded into a linear combination of D1, …, Dn.
  • Dictionary learning aims to learn a D that makes most of the coefficients Φ zero or close to zero, i.e. a sparse representation.
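
One way to make this concrete is the alternating scheme below. It is a sketch under assumptions, not the method surveyed in the paper: it uses an l1 penalty with one ISTA step for the coefficients, a gradient step on D followed by column normalization (which keeps D on the oblique manifold), and hypothetical names and hyper-parameters throughout.

```python
import numpy as np

def soft_threshold(Z, t):
    """Shrink entries toward zero; entries with magnitude below t become exactly zero."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def normalize_columns(D):
    """Map D back to the oblique manifold (unit-norm columns)."""
    return D / np.linalg.norm(D, axis=0, keepdims=True)

def dictionary_learning(X, n_atoms, sparsity=0.5, iters=100, seed=0):
    """Alternate sparse coding and dictionary updates for X ~ D @ Phi."""
    rng = np.random.default_rng(seed)
    D = normalize_columns(rng.standard_normal((X.shape[0], n_atoms)))
    Phi = np.zeros((n_atoms, X.shape[1]))
    for _ in range(iters):
        # Sparse coding: one ISTA step on 0.5*||X - D Phi||^2 + sparsity*||Phi||_1.
        L = np.linalg.norm(D, 2) ** 2
        Phi = soft_threshold(Phi - D.T @ (D @ Phi - X) / L, sparsity / L)
        # Dictionary update: gradient step, then renormalize the columns.
        grad_D = (D @ Phi - X) @ Phi.T
        step = 1.0 / (np.linalg.norm(Phi, 2) ** 2 + 1e-12)
        D = normalize_columns(D - step * grad_D)
    return D, Phi

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 100))
D, Phi = dictionary_learning(X, n_atoms=12)
```

The normalization step removes the scale ambiguity between D and Φ; without it, the penalty on Φ could be made arbitrarily small by inflating the columns of D.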

Analysis operator learning

  • Analysis operator learning assumes that a few operators are enough to represent high-dimensional variables.
  • The operators are hidden and not observed.
  • The goal is to find these hidden operators to simplify the original variables.
  • The analysis operator learning is formulated as an optimization problem on the positive manifold M.

Temporal model

  • Temporal probability model composed of transition and sensor models
  • Transition model describes state evolution over time
  • Sensor model describes observation process
  • Temporal model used for filtering, prediction and smoothing
  • Transition process of states modeled with Gaussian noise
  • Observation process modeled with Gaussian noise
  • Temporal models divided into hidden Markov models and linear dynamic systems
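
For the linear dynamic system case, filtering with Gaussian transition and observation noise is commonly done with a Kalman filter. The sketch below assumes that setting; the matrices F, H, Q, R, the constant-velocity state, and the function names are illustrative choices, not taken from the paper.

```python
import numpy as np

def kalman_predict(mean, cov, F, Q):
    """Transition model: push the state estimate one step forward in time."""
    return F @ mean, F @ cov @ F.T + Q

def kalman_update(mean, cov, z, H, R):
    """Sensor model: correct the prediction with a noisy observation z."""
    S = H @ cov @ H.T + R                  # innovation covariance
    K = cov @ H.T @ np.linalg.inv(S)       # Kalman gain
    mean = mean + K @ (z - H @ mean)
    cov = (np.eye(len(mean)) - K @ H) @ cov
    return mean, cov

F = np.array([[1.0, 1.0], [0.0, 1.0]])     # assumed state: position and velocity
H = np.array([[1.0, 0.0]])                 # only the position is observed
Q = 0.01 * np.eye(2)                       # transition noise covariance
R = np.array([[1.0]])                      # observation noise covariance

rng = np.random.default_rng(0)
true_state = np.array([0.0, 1.0])
mean, cov = np.zeros(2), 10.0 * np.eye(2)
for _ in range(50):
    true_state = F @ true_state + rng.multivariate_normal(np.zeros(2), Q)
    z = H @ true_state + rng.normal(0.0, 1.0, size=1)
    mean, cov = kalman_predict(mean, cov, F, Q)
    mean, cov = kalman_update(mean, cov, z, H, R)
```

Calling `kalman_predict` repeatedly without an update gives prediction; the predict and update pair gives filtering.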

Applications in deep learning

  • Deep learning methods are increasingly combined with geometric optimization
  • Geometric optimization techniques vary with different deep learning backbones (e.g., CNN, RNN and GNN)
  • Orthogonal manifold is widely used in geometric CNNs to reduce feature redundancy

Geometric CNN

  • Deep CNNs have achieved success in computer vision tasks
  • CNNs learn features from large-scale data using convolution, activation, and pooling structures
  • Problems such as training instability and feature redundancy can be alleviated by geometric optimization approaches
  • Kernel space maps original features to a higher dimensional space
  • Geometric regularization imposes restrictions on the parameters of the optimization function
  • Quasi-CNN architectures mimic traditional CNN architecture and establish a new architecture suitable for the manifold structure
  • SPDNet and GrNet are examples of quasi-CNN architectures
  • SPDNet uses bilinear mapping, eigenvalue rectification, and eigenvalue logarithm layers
  • GrNet uses full-rank mapping, re-orthonormalization, inner product, and orthonormal mapping layers
  • Manifold regularization can be used to enhance the nonlinear locality constraints of CNN parameters
  • SURFMNet uses orthogonality constraints to regularize a convolution layer
  • Huang et al. incorporated a Lie group structure into the parameter matrices of a deep human action recognition network
  • Chen et al. proposed a deep manifold learning framework to learn manifold information and deep representations of action videos

Geometric RNN

  • RNNs are designed to process sequential data
  • RNNs can capture spatial and temporal dependencies within sequential input
  • RNNs can be applied in tasks such as speech recognition, text prediction, and machine translation
  • An RNN computes its current hidden state from the input weight matrix, recurrent weight matrix, previous hidden state, input bias, and a pointwise nonlinearity, then generates output predictions from that hidden state using the output weight matrix and output bias
  • The gradient of the loss function with respect to the hidden state can be computed
  • The exploding and vanishing gradient problems of RNNs can be alleviated with orthogonal constraints
  • uRNN parameterizes the unitary hidden-to-hidden matrix by composing simple unitary matrices
  • Full-capacity uRNN is proposed to cover all N × N unitary matrices
  • ExpRNN exploits the exponential map to achieve orthogonal constraints
  • OMDSM optimizes DNN over multiple dependent Stiefel manifolds
  • Soft orthogonal constraints can be explored
  • Householder matrices can be used to reduce the time complexity of parameterizing unitary matrices
  • GORU adds a forget gate so that the network can discard extraneous information
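
Why orthogonality helps can be seen in a stripped-down recurrence. The sketch below is an assumption-laden illustration: it drops the inputs and the nonlinearity, builds the recurrent matrix as a product of Householder reflections, and uses hypothetical function names. It also includes the kind of penalty a soft orthogonal constraint would add to the loss.

```python
import numpy as np

def householder(v):
    """Householder reflection I - 2 v v^T / (v^T v), which is orthogonal."""
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

def orthogonal_from_reflections(vectors):
    """Compose one reflection per row of `vectors` into an orthogonal matrix."""
    W = np.eye(vectors.shape[1])
    for v in vectors:
        W = W @ householder(v)
    return W

def soft_orthogonality_penalty(W):
    """Soft constraint: squared Frobenius distance of W^T W from the identity."""
    return np.linalg.norm(W.T @ W - np.eye(W.shape[1]), "fro") ** 2

rng = np.random.default_rng(0)
W = orthogonal_from_reflections(rng.standard_normal((4, 16)))

h = rng.standard_normal(16)
h /= np.linalg.norm(h)
h_orth, h_small, h_large = h.copy(), h.copy(), h.copy()
for _ in range(100):                  # 100 recurrent steps
    h_orth = W @ h_orth               # orthogonal: norm preserved
    h_small = (0.5 * W) @ h_small     # contracting: signal vanishes
    h_large = (1.5 * W) @ h_large     # expanding: signal explodes
```

The same norm behaviour applies to gradients propagated backward through the recurrent matrix, which is the motivation for orthogonal and unitary parameterizations.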

Geometric GNN

  • GNN can be used to construct a learning network based on irregular graphs
  • GNN encodes vertices as feature vectors and models edges as a relationship matrix
  • GNN can take advantage of the graph structure and update the feature information of each vertex
  • GIL incorporates Euclidean space with hyperbolic geometry to model both low-dimensional regular data and complex hierarchical structures
  • MRDGCN integrates manifold regularization into GCN to model dynamic structure information
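
The vertex update and the idea of manifold regularization can be illustrated on a tiny graph. This sketch makes several assumptions: the layer is the widely used symmetric-normalized graph convolution rather than the exact GIL or MRDGCN formulation, the regularizer is the generic graph-Laplacian smoothness term tr(H^T L H), and the graph, features, and names are invented.

```python
import numpy as np

def normalize_adjacency(A):
    """Add self-loops, then scale symmetrically by the inverse square root of the degrees."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, H, W):
    """One propagation step: mix neighbor features, apply weights, then ReLU."""
    return np.maximum(A_norm @ H @ W, 0.0)

def laplacian_regularizer(A, H):
    """Smoothness penalty tr(H^T L H) with L = D - A; small when neighbors have similar features."""
    L = np.diag(A.sum(axis=1)) - A
    return np.trace(H.T @ L @ H)

# Relationship matrix of a 4-vertex path graph.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 3))       # one feature vector per vertex
W = rng.standard_normal((3, 2))       # trainable weights (random here)

H_next = gcn_layer(normalize_adjacency(A), H, W)
smoothness = laplacian_regularizer(A, H)
```

Adding `smoothness` to the training loss with a small weight encourages connected vertices to keep nearby representations.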

Geometric optimization for other deep learning methods

  • Robust Time Series Prediction uses low-rank constraint and feature selection to deal with noisy disturbances
  • Medical Reconstruction combines CNN and SToRM with conjugate gradients for fast, high-quality MRI reconstruction
  • Transfer Learning uses knowledge distillation to transfer model knowledge from a well-trained model to a compact model
  • Optimal Transport uses Riemannian gradient descent and generalized doubly stochastic manifold to measure distance between two probability distributions
  • Robotics uses geometry-aware Bayesian optimization with a Matérn kernel to incorporate domain geometry into the optimization algorithm
  • Continual Learning uses low-rank orthogonal manifold to project gradient into disjoint subspace and alleviate catastrophic forgetting

Toolbox

  • Toolboxes can help build neural networks
  • Manopt, Pymanopt, McTorch, and Geomstats are classic toolboxes for manifold geometries and optimization algorithms
  • Manopt and Pymanopt are limited to shallow learning optimizations
  • McTorch extends PyTorch for deep learning optimizations
  • Geoopt is cheaper than McTorch
  • Geomstats has two core modules for geometry and learning
  • TheanoGeometry uses Theano for symbolic calculations and Riemannian geometry

Performance evaluation

  • GORU outperforms other ORNNs on the MNIST dataset
  • ExpRNN uses a surjective exponential map to realize orthogonal parameterization
  • uRNN uses simple unitary matrices to construct the unitary hidden-to-hidden matrix
  • Full-capacity uRNN overcomes the capacity bottleneck of uRNN
  • soRNN uses regularization terms to realize orthogonal parameterization
  • ORNN exploits Householder matrices to enforce the orthogonal constraint
  • SPDNet and GrNet achieve better classification results than state-of-the-art methods
  • The method of Hariri et al. achieves the highest precision on the BU-3DFE and Bosphorus datasets
  • SPDNet and GrNet outperform state-of-the-art methods on action recognition and face recognition tasks
  • SRMR outperforms state-of-the-art non-manifold methods on scene recognition datasets
  • Different architecture settings affect classification accuracy

Conclusions and future work

  • Reviewed progress of optimizing deep learning networks on manifolds
  • Needs further research on dataset-oriented geometric optimization
  • Needs further research on model-oriented geometric optimization
  • Needs further research on manifold-oriented geometric optimization