Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Deep Learning (DL) has problems such as feature redundancy and vanishing/exploding gradients.
- Riemannian-based DL uses geometric optimization to update parameters on Riemannian manifolds.
- This article surveys the application of geometric optimization in DL networks for AI tasks.
- Toolboxes that implement optimization on manifold are discussed.
- Performance comparison between deep geometric optimization methods is made.
Paper Content
Introduction
- Increasing computing power has enabled deep neural networks to be successful in various tasks.
- Deep learning models often contain many layers and parameters, which can be challenging to optimize.
- Geometric optimization can reduce parameters and convert constrained optimization problems into unconstrained ones.
- Geometric optimization has been applied to various deep neural networks, such as CNN, RNN and ViT.
- This article reviews the theory and applications of geometric optimization in shallow and deep learning.
- It also investigates representative manifold optimization toolboxes and compares performance of different geometric deep learning methods.
Geometric optimization theory
- Optimization problems are used to find the maximum or minimum value of a cost function.
- Conventional optimization methods can be used to solve unconstrained optimization problems.
- Constrained optimization problems can be transformed into unconstrained problems using Lagrange multipliers or a barrier penalty function.
- Geometric optimization methods are developed to exploit the underlying geometry of a cost function.
- Geometric optimization methods use Riemannian optimizers to find an optimal solution.
Geometric optimization process on manifolds
- Figure 3 shows the update process in geometric optimization
- Each point on the manifold has a corresponding tangent space
- The tangent space has an inner product which helps with vector metrics
- A Riemannian gradient is a tangent vector on the tangent space
- A geodesic is a locally shortest path between two points on the manifold
- The geodesic defined by the negative Riemannian gradient reveals the next point in the optimization direction
- Exponential mapping and retraction operation are used to map a point from the tangent space to the manifold
Gradient descent optimizers
- Optimization problems can be abstracted as where θ are trainable parameters and E means the Euclidean space
- There are a variety of standard optimizers for Equation (7)
- Gradient descent is a basic optimization strategy
- SGD can accelerate convergence
- SGD-M is developed to maintain the inertia of the previous step
- RMSProp can adaptively determine the learning rate of parameters
- Gradient descent takes the form where λ is a hyper-parameter representing the step size
- SGD uses random mini-batches of training data to update parameters
- SGD-M exerts the influence of the last update on the current update
- RMSProp considers the influence of the last update when calculating the upcoming update
- Euclidean gradient descent can be transferred to Riemannian manifolds
- Constraint SGD-M and constraint RMSProp are instances of generalizing gradient descent optimizers from Euclidean space to manifolds
Manifold examples
- Different kinds of matrix manifolds have different geometry structures and advantages when applying geometric optimization to deep learning.
- Oblique manifold is useful for dictionary learning due to its property of unit-norm columns.
- Stiefel manifold helps optimize RNNs since matrices on the Stiefel manifold have orthogonal and uncorrelated columns.
- Common manifold structures include Stiefel, oblique, Graßmann, product, quotient, SPD, sphere, and unitary.
- SPD matrices are used for image and video statistical representations.
- Stiefel manifold has an upper bound which allows it to achieve an optimal solution.
- Oblique manifold is the set of matrices with unit-norm columns.
- Graßmann manifold is different from Stiefel manifold, representing an entire subspace.
- Unitary matrices are the extension of orthogonal matrices to the complex domain.
- Lie groups are real or complex manifolds with group structure.
Applications in classical machine learning
- Classical machine learning methods have been successful in solving AI problems.
- Solving large categories of constrained classical machine learning problems in Euclidean space is difficult.
- Geometric optimization can decrease the difficulty by treating constrained problems as unconstrained ones on Riemannian manifolds.
Dimension reduction
- Dimension reduction (DR) is a process of finding a lower-dimensional representation of given data samples.
- DR approaches can use linear or nonlinear transformations.
- A generic algorithmic framework to find an optimal solution involves a maximization problem.
- Solutions of the maximization problem are rotation invariant.
- Most linear DR methods begin with solving tr(V T AV ) while nonlinear DR methods construct a graph by connecting nearby points.
Inverse problem
- Inverse problems have a significant impact on practical applications.
- Inverse problems involve reconstructing inputs from outputs.
- Solutions to inverse problems can be achieved by confining the parameter matrix W to reside on a smooth Riemannian manifold.
Dictionary learning
- Dictionary learning is used to obtain the most essential features of input data.
- X is expanded into a linear combination of D1, …, Dn.
- Dictionary learning aims to learn a D that makes the coefficients Φ be zero or close to zero.
Analysis operator learning
- Analysis operator learning assumes that a few operators are enough to represent high-dimensional variables.
- The operators are hidden and not observed.
- The goal is to find these hidden operators to simplify the original variables.
- The analysis operator learning is formulated as an optimization problem on the positive manifold M.
Temporal model
- Temporal probability model composed of transition and sensor models
- Transition model describes state evolution over time
- Sensor model describes observation process
- Temporal model used for filtering, prediction and smoothing
- Transition process of states modeled with Gaussian noise
- Observation process modeled with Gaussian noise
- Temporal models divided into hidden Markov models and linear dynamic systems
Applications in deep learning
- Deep learning methods are combining with geometric optimization
- Geometric optimization techniques vary with different deep learning backbones (e.g., CNN, RNN and GNN)
- Orthogonal manifold is widely used in geometric CNNs to reduce feature redundancy
Geometric cnn
- Deep CNNs have achieved success in computer vision tasks
- CNNs learn features from large-scale data using convolution, activation, and pooling structures
- Problems such as training instability and feature redundancy can be alleviated by geometric optimization approaches
- Kernel space maps original features to a higher dimensional space
- Geometric regularization imposes restrictions on the parameters of the optimization function
- Quasi-CNN architectures mimic traditional CNN architecture and establish a new architecture suitable for the manifold structure
- SPDNet and GrNet are examples of quasi-CNN architectures
- SPDNet uses Bilinear mapping, eigenvalue rectification, and eigenvalue logarithm layers
- GrNet uses Full rank mapping, re-orthonormalization, inner product, and orthonormal mapping layers
- Manifold regularization can be used to enhance the nonlinear locality constraints of CNN parameters
- SURFMNet uses orthogonality constraints to regularize a convolution layer
- Huang et al. incorporated a Lie group structure to parameter matrices in the deep human action recognition network
- Chen et al. proposed a deep manifold learning framework to learn manifold information and deep representations of action videos
Geometric rnn
- RNNs are designed to process sequential data
- RNNs can capture spatial and temporal dependencies between the sequential input
- RNNs can be applied in tasks such as speech recognition, text prediction, and machine translation
- RNNs generate output predictions based on input weight matrix, recurrent weight matrix, previous hidden state, input bias, pointwise nonlinearity function, current hidden state, output weight matrix, and output bias
- Gradient of the loss function for the hidden state can be computed
- Exploding and vanishing gradient problem of RNN can be alleviated with orthogonal constraints
- uRNN parameterizes the unitary hidden-to-hidden matrix by composing simple unitary matrices
- Full-capacity uRNN is proposed to cover all N x N unitary matrices
- ExpRNN exploits the exponential map to achieve orthogonal constraints
- OMDSM optimizes DNN over multiple dependent Stiefel manifolds
- Soft orthogonal constraints can be explored
- Householder matrix can be used to reduce time complexity of parameterizing unitary matrices
- GORU designs a forget gate to pay little attention to extraneous information
Geometric gnn
- GNN can be used to construct a learning network based on irregular graphs
- GNN encodes vertexes as feature vectors and models edges as a relationship matrix
- GNN can take advantage of the graph structure and update the feature information of each vertex
- GIL incorporates Euclidean space with hyperbolic geometry to model both low-dimensional regular data and complex hierarchical structures
- MRDGCN integrates manifold regularization into GCN to model dynamic structure information
Geometric optimization for other deep learning methods
- Robust Time Series Prediction uses low-rank constraint and feature selection to deal with noisy disturbances
- Medical Reconstruction combines CNN and SToRM with conjugate gradients for fast and high quality MRI data
- Transfer Learning uses knowledge distillation to transfer model knowledge from a well-trained model to a compact model
- Optimal Transport uses Riemannian gradient descent and generalized doubly stochastic manifold to measure distance between two probability distributions
- Robots use geometry-aware Bayesian optimization with Matérn kernel to incorporate domain geometry into optimization algorithm
- Continual Learning uses low-rank orthogonal manifold to project gradient into disjoint subspace and alleviate catastrophic forgetting
Toolbox
- Toolboxes can help build neural networks
- Manopt, Pymanopt, McTorch, and Geomstats are classic toolboxes for manifold geometries and optimization algorithms
- Manopt and Pymanopt are limited to shallow learning optimizations
- McTorch extends Pytorch for deep learning optimizations
- Geoopt is cheaper than McTorch
- Geomstats has two core modules for geometry and learning
- TheanoGeometry uses Theano for symbolic calculations and Riemannian geometry
Performance evaluation
- GORU outperforms other ORNNs on the MNIST dataset
- expRNN uses surjective exponential map to realize orthogonal parameterization
- uRNN uses simple unitary matrices to construct the unitary hidden-to-hidden matrix
- full-capacity uRNN overcomes bottleneck of uRNN
- soRNN uses regularization terms to realize orthogonal parameterization
- ORNN exploits householder matrix to enforce orthogonal constraint
- SPDNet and GrNet achieve better classification results than state-of-the-art methods
- Hariri et al. method achieves highest precision on BU-3DFE and Bosphorus datasets
- SPDNet and GrNet outperform state-of-the-art methods on action recognition and face recognition tasks
- SRMR outperforms state-of-the-art non-manifold methods on scene recognition datasets
- Different architecture settings affect classification accuracy
Conclusions and future work
- Reviewed progress of optimizing deep learning networks on manifolds
- Needs further research on dataset-oriented geometric optimization
- Needs further research on model-oriented geometric optimization
- Needs further research on manifold-oriented geometric optimization