Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Compiled as lecture notes for a course at the University of Southern California
Accessible to engineering graduate students with a strong background in Applied Mathematics
Introduce students to topics in deep learning
Exploit connections between deep learning algorithms and conventional techniques of computational physics
Use concepts from computational physics to develop understanding of deep learning algorithms
Novel deep learning algorithms can be used to solve challenging problems in computational physics

Paper Content

Computational physics

Computational physics solves problems in science and engineering
Involves collecting measurements of an observable
Postulating a physical law based on observations
Writing a mathematical description of the law
Solving the system using exact or approximate methods

Machine learning

ML does not require physical laws
Collect data from physical phenomena, measurements, or numerical solvers
Train algorithm to discover patterns or relations
Use ML algorithm to make predictions and validate with data

Examples of ML

Regression algorithms are used to approximate a function given a set of pairwise data
Decision trees can be used to predict the probability of a new individual owning a house
Clustering algorithms are used to find patterns in a set of data

Types of ML algorithms based on learning task

Supervised learning: Predicting labels for new data based on existing data
Unsupervised learning: Finding relations among different regions of data

Artificial intelligence, machine learning and deep learning

AI, ML and DL are related but different concepts
AI refers to a system with human-like intelligence
ML is a key component of an AI system
Self-driving cars are an example of AI
ML algorithms are trained using data
DL is a subset of ML algorithms
DL architecture is loosely motivated by how signals are transmitted by the central nervous system

Machine learning and computational physics

Combining computational physics and ML can provide an alternate route to representing mathematical laws.
Physics knowledge can help reduce the amount of data required to train ML algorithms.
Tools from computational physics can be applied to ML to better understand and design ML algorithms.

MLP architecture

MLP is used to approximate a function f: x ∈ R^d → y ∈ R^D
MLP consists of a source layer, hidden layers and an output layer
Weights and biases are associated with each neuron
Activation function is applied component-wise
Output function may be used at the end of the output layer
Depth of the network is the number of computing layers
Parameters of the network are the weights and biases
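The architecture described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the notes: the layer widths, the tanh activation, the random initialization, and the names `init_mlp` and `mlp_forward` are all assumptions, and no output function is applied.

```python
import numpy as np

def init_mlp(layer_widths, seed=0):
    """Create one (W, b) pair per computing layer; layer_widths = [d, H, ..., D]."""
    rng = np.random.default_rng(seed)
    return [
        (rng.standard_normal((n_out, n_in)) / np.sqrt(n_in), np.zeros(n_out))
        for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:])
    ]

def mlp_forward(params, x, activation=np.tanh):
    """Hidden layers: affine map then component-wise activation. Output layer: affine map only."""
    for W, b in params[:-1]:
        x = activation(W @ x + b)
    W, b = params[-1]
    return W @ x + b

params = init_mlp([3, 8, 8, 2])      # d = 3, two hidden layers of width 8, D = 2
y = mlp_forward(params, np.ones(3))

depth = len(params)                                   # number of computing layers
n_params = sum(W.size + b.size for W, b in params)    # weights plus biases
```

Here the depth counts the two hidden layers and the output layer, and the parameter count adds up every weight matrix and bias vector.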

Activation functions

Activation function is important in MLP
Many activation functions available with different advantages and disadvantages
Figure 2.2 provides more information

Linear activation

Linear activation function is infinitely smooth.
Range of the function is from negative infinity to positive infinity.
Network reduced to single affine transformation of input, making it a linear approximation of target function.
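The collapse to a single affine transformation can be checked numerically. The sketch below assumes a hypothetical two-layer network with random weights and the identity as activation.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 3)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((2, 5)), rng.standard_normal(2)
x = rng.standard_normal(3)

# Two layers with the linear (identity) activation.
two_layer = W2 @ (W1 @ x + b1) + b2

# Equivalent single affine map: W = W2 W1, b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b
```

The two results agree to rounding error, so adding depth with a linear activation adds no expressive power.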

Rectified linear unit (ReLU)

Function is piecewise linear
Function is continuous
Derivative is piecewise constant with a jump at ξ = 0
Second derivative is a Dirac delta function concentrated at ξ = 0
Range of the function is [0, ∞)

Leaky ReLU

ReLU activation leads to null output if affine transformation is negative.
Leaky ReLU was designed to overcome the challenge of dying neurons.
Derivatives of Leaky ReLU behave the same way as those of ReLU.
Range of Leaky ReLU is (−∞, ∞).
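A short sketch of the two functions makes the difference on negative inputs concrete. The slope `alpha = 0.01` for negative inputs is an assumed value, not one taken from the notes.

```python
import numpy as np

def relu(xi):
    return np.maximum(0.0, xi)

def leaky_relu(xi, alpha=0.01):
    # alpha: small positive slope for negative inputs (assumed value)
    return np.where(xi > 0, xi, alpha * xi)

xi = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
relu_out = relu(xi)           # negative inputs are mapped to zero
leaky_out = leaky_relu(xi)    # negative inputs keep a small non-zero output
```

Because the Leaky ReLU output and its derivative stay non-zero for negative inputs, a neuron whose affine transformation is negative can still receive gradient updates.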

Logistic function

Function is smooth and monotonic.
Range of function is (0, 1).
Derivative quickly decays to zero away from 0, leading to slow convergence.

Tanh

Tanh is a symmetrical extension of the logistic function.
Tanh is smooth, monotonic and bounded between -1 and 1.
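The relation to the logistic function can be written explicitly. Using σ for the logistic function (the symbol is an assumption made here), the standard identity is:

```latex
\sigma(\xi) = \frac{1}{1 + e^{-\xi}}, \qquad
\tanh(\xi) = \frac{e^{\xi} - e^{-\xi}}{e^{\xi} + e^{-\xi}} = 2\,\sigma(2\xi) - 1
```

So tanh is a rescaled and shifted logistic function, which maps the range (0, 1) to (−1, 1) and makes the function odd about the origin.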

Sine

Sine function is an efficient activation function.
Sine function is infinitely smooth and bounded.

Expressivity of a network

Increasing the number of network parameters N_θ changes what a network with ReLU activations can represent.
A simple example is used to illustrate the effects.
The number of kinks in the output increases as the depth and width of the network increases.

Universal approximation results

MLPs can approximate continuous functions with a single hidden layer
MLPs can approximate continuous vector-valued functions with multiple hidden layers
MLPs with ReLU activations can approximate functions with two continuous derivatives
Numerical results help demystify the “black-box” nature of neural networks

Training, validation and testing of neural networks

MLPs are used to approximate a target function
Parameters of MLPs are set in a supervised learning framework
Dataset is split into 3 parts: training, validation, and testing
Training phase finds optimal values of θ for a fixed Θ
Validation phase finds optimal values of Θ
Testing phase evaluates performance of trained network on unseen data
Loss function is used to optimize network parameters

Generalizability

Training a network with a small value of Π_train and Π_val does not guarantee a small value of Π_test.
Regularization is a technique used to avoid data overfitting.

Regularization

Neural networks are often over-parametrized
Loss function can have many local minima
Regularization technique can be used to nudge the choice of θ
Regularization encourages selection of minima with smaller values of θ
Regularization helps avoid overfitting
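As an illustration of how a penalty nudges the choice of θ, the sketch below assumes an L2 penalty with a hypothetical weight `alpha` and a tiny over-parametrized linear model (two parameters, one data point). Both parameter vectors fit the data exactly, but the penalty favours the one with smaller values.

```python
import numpy as np

def mse(theta, X, y):
    return np.mean((X @ theta - y) ** 2)

def regularized_loss(theta, X, y, alpha=1e-2):
    # alpha is a hypothetical regularization weight; the penalty grows with |theta|^2
    return mse(theta, X, y) + alpha * np.sum(theta ** 2)

X = np.array([[1.0, 1.0]])     # one sample, two parameters: over-parametrized
y = np.array([2.0])

theta_small = np.array([1.0, 1.0])
theta_large = np.array([10.0, -8.0])

fit_small, fit_large = mse(theta_small, X, y), mse(theta_large, X, y)
reg_small = regularized_loss(theta_small, X, y)
reg_large = regularized_loss(theta_large, X, y)
```

The unregularized loss cannot distinguish the two minima, while the regularized loss is lower for `theta_small`.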

Gradient descent

We wish to solve a minimization problem using gradient descent
Taylor expansion is used to approximate the problem
Gradient descent requires a step-size (learning-rate) to be tuned
GD converges when the local curvature is less than 2/η
GD prefers flat minima
Flat minima tend to generalize better
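The curvature condition can be seen on a one-dimensional quadratic loss, where the curvature is a constant c. The sketch below is an illustration under that assumption; the function name and the values of c and η are chosen here for the example.

```python
def gradient_descent(grad, theta0, eta, n_steps):
    theta = theta0
    for _ in range(n_steps):
        theta = theta - eta * grad(theta)
    return theta

# Loss 0.5 * c * theta^2 has gradient c * theta and curvature c.
eta = 0.1                                                      # threshold 2 / eta = 20
flat = gradient_descent(lambda t: 5.0 * t, 1.0, eta, 100)      # c = 5  < 20: converges
sharp = gradient_descent(lambda t: 25.0 * t, 1.0, eta, 100)    # c = 25 > 20: diverges
```

Each step multiplies θ by (1 − ηc), so the iteration contracts only when c < 2/η. For a fixed learning rate, minima that are too sharp are unstable, which is one way to read the preference for flat minima.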

Some advanced optimization algorithms

GD can be used to solve the optimization problem involved in training neural networks.
Most optimization algorithms use a formula that includes a component-wise learning rate and a vector-valued function that approximates the gradient.
If the learning rate is too large, the updates will zig-zag their way towards the minima.
Two popular methods can help overcome some of the issues faced by GD.

Momentum methods

Momentum methods use the history of the gradient, not just the previous step.
The formula for the update includes a weighted moving average of the gradient.
This weighting is expected to reduce zig-zagging and move more smoothly towards the minima.
A commonly used value for β_1 is 0.9.
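A minimal sketch of a momentum update follows, assuming the weighted moving average takes the exponential form m ← β_1 m + (1 − β_1) g. The exact formula in the notes may differ; the test problem and learning rate are also assumptions.

```python
import numpy as np

def momentum_gd(grad, theta0, eta=0.1, beta1=0.9, n_steps=300):
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)                    # moving average of past gradients
    for _ in range(n_steps):
        m = beta1 * m + (1.0 - beta1) * grad(theta)
        theta = theta - eta * m
    return theta

# Ill-conditioned quadratic loss 0.5 * (theta_0^2 + 10 * theta_1^2).
curvatures = np.array([1.0, 10.0])
theta = momentum_gd(lambda t: curvatures * t, [1.0, 1.0])
```

Because the update direction averages over past gradients, components that flip sign from step to step partly cancel, which is the mechanism expected to reduce zig-zagging.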

Adam

Adam optimization was introduced by Kingma and Ba
Adam optimization uses the history of the gradient and the second moment of the gradient
Adam updates are given by a formula with a learning rate η
Recommended values for hyper-parameters are β_1 = 0.9, β_2 = 0.999 and ε = 10^−8
Learning rate is different for each component: a larger gradient magnitude for a component means a smaller learning rate
Adam algorithm has additional correction steps to improve efficiency
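The points above can be sketched as follows. This follows the commonly stated form of the Adam update with bias correction; the function name, the default learning rate, and the test gradient are assumptions made for the example, and `eps` is the small constant 10^−8.

```python
import numpy as np

def adam(grad, theta0, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)     # moving average of the gradient
    v = np.zeros_like(theta)     # moving average of the squared gradient
    for k in range(1, n_steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** k)     # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** k)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Gradient components differ by a factor of 100, yet the first step has the same size.
scales = np.array([1.0, 100.0])
one_step = adam(lambda t: scales * t, [1.0, 1.0], eta=0.01, n_steps=1)
```

After one step both components have moved by about η = 0.01, which shows the component-wise scaling: the component with the larger gradient gets a proportionally smaller effective learning rate.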

Stochastic optimization

Training loss can be rewritten
Gradient of loss function is expensive to calculate
Stochastic gradient descent (SGD) is an easy way to circumvent this problem
SGD converges if Π_i(θ) is convex and differentiable
SGD needs to have a decaying learning rate
SGD algorithm keeps overshooting the minima without decay
SGD algorithm moves closer to θ* with decaying learning rate
Purely stochastic (single-sample) optimization algorithms are rarely used in practice because of their chaotic behaviour and under-utilization of computational resources
Mini-batch optimization is a compromise between the two
Mini-batch stochastic optimization algorithm evaluates batch gradient and updates θ
SGD might help in selecting minima that generalize better
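A minimal mini-batch sketch on a linear least-squares problem is given below. The decay schedule, batch size, and synthetic noiseless data are assumptions chosen so that the example is easy to check; they are not taken from the notes.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=10, eta0=0.05, n_epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for epoch in range(n_epochs):
        eta = eta0 / (1.0 + 0.05 * epoch)        # decaying learning rate (assumed schedule)
        order = rng.permutation(n)               # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ theta - y[idx]
            batch_grad = 2.0 * X[idx].T @ residual / len(idx)   # gradient of the batch MSE
            theta = theta - eta * batch_grad
    return theta

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))
theta_true = np.array([2.0, -1.0])
y = X @ theta_true                               # noiseless synthetic data
theta = minibatch_sgd(X, y)
```

Setting `batch_size=1` gives the purely stochastic variant and `batch_size=len(X)` gives full-batch gradient descent, so the batch size interpolates between the two.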

Calculating gradients using back-propagation

Output of layer l+1 is given by an affine transform and a non-linear transform
Loss/objective function can be evaluated using a forward pass
To update network parameters, need derivatives of loss with respect to parameters
Chain rule is used to evaluate derivatives
Computational graph is used to represent evaluation of loss and its gradient
Back propagation is used to traverse in backward direction
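The forward and backward passes can be written out by hand for a small network. The sketch below assumes one hidden layer with tanh activation and a squared-error loss; the sizes and names are hypothetical. A finite-difference estimate is used to check one entry of the gradient.

```python
import numpy as np

def forward_backward(W1, b1, W2, b2, x, y):
    # Forward pass: keep the intermediate values needed by the backward pass.
    z1 = W1 @ x + b1                  # affine transform
    a1 = np.tanh(z1)                  # non-linear transform
    y_hat = W2 @ a1 + b2              # output layer (affine only)
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Backward pass: chain rule applied from the loss back towards the input.
    d_yhat = y_hat - y
    dW2, db2 = np.outer(d_yhat, a1), d_yhat
    d_a1 = W2.T @ d_yhat
    d_z1 = d_a1 * (1.0 - a1 ** 2)     # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = np.outer(d_z1, x), d_z1
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x, y = rng.standard_normal(3), rng.standard_normal(2)

loss, (dW1, db1, dW2, db2) = forward_backward(W1, b1, W2, b2, x, y)

# Finite-difference check on a single weight.
h = 1e-6
W1_pert = W1.copy()
W1_pert[0, 0] += h
loss_pert, _ = forward_backward(W1_pert, b1, W2, b2, x, y)
fd_estimate = (loss_pert - loss) / h
```

The backward pass reuses quantities stored during the forward pass, which is why one backward traversal of the computational graph yields all the derivatives at once.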

Regression versus classification

Two types of losses considered: mean square error (MSE) and mean absolute error (MAE)
Neural networks can be used to solve regression problems with nonlinear inputs/outputs
Examples of classification problems given
Output labels need to be one-hot encoded
Cross-entropy loss function used to measure discrepancy between two distributions
Cross-entropy loss function penalizes incorrect confident predictions
Residual networks (ResNets) introduced by He et al. in 2015
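The one-hot encoding and cross-entropy loss for classification can be sketched as follows. The softmax output function and the three-class example are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def one_hot(label, n_classes):
    v = np.zeros(n_classes)
    v[label] = 1.0
    return v

def cross_entropy(p_true, p_pred):
    return -np.sum(p_true * np.log(p_pred))

target = one_hot(0, 3)                # true class is 0
confident_right = cross_entropy(target, softmax(np.array([5.0, 0.0, 0.0])))
unsure = cross_entropy(target, softmax(np.array([0.0, 0.0, 0.0])))
confident_wrong = cross_entropy(target, softmax(np.array([0.0, 5.0, 0.0])))
```

The loss grows from the confident correct prediction, through the uniform one, to the confident incorrect prediction, which is the penalization of incorrect confident predictions noted above.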

Vanishing gradients in deep networks

Training neural networks can lead to gradients becoming very small.
This can happen when the network is very deep, e.g. L ≥ 20.
This can lead to the benefit of deep networks being lost.
The product of scalars in the equation can become very small, leading to vanishing gradients.
This can lead to an increase in training and validation error.

ResNets

Skip connections add an identity term to each layer, so gradients do not vanish even if all weights and biases are null
Computational graph for forward and back-propagation of a ResNet is shown in Figure 3.3

Connections with ODEs

ResNet with d = D = H has relation (3.4)
ResNet is a discretization of a non-linear system of ODEs
Training a ResNet means determining parameters θ of the network
Training a system of ODEs means determining the right hand side V(x, t)
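The link between residual layers and time stepping can be illustrated with a sketch in which each layer performs one forward Euler step, x_{l+1} = x_l + Δt V(x_l, t_l). The right-hand side `V` below is a hypothetical stand-in for a trained layer (a rotation field with a known exact solution), not something learned from data.

```python
import numpy as np

def V(x, t):
    # Hypothetical right-hand side: rotation field, so dx/dt = V(x) traces a circle.
    return np.array([-x[1], x[0]])

def resnet_forward(x0, n_layers, T=1.0):
    """Each residual block adds dt * V(x, t) to its input: one forward Euler step."""
    dt = T / n_layers
    x = np.asarray(x0, dtype=float)
    for l in range(n_layers):
        x = x + dt * V(x, l * dt)
    return x

x0 = np.array([1.0, 0.0])
exact = np.array([np.cos(1.0), np.sin(1.0)])     # exact solution at t = 1
err_coarse = np.linalg.norm(resnet_forward(x0, 10) - exact)
err_fine = np.linalg.norm(resnet_forward(x0, 100) - exact)
```

Increasing the number of layers corresponds to a smaller time step, and the output approaches the exact ODE solution at the final time.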

Neural ODEs

Neural ODEs proposed in [4] transform a regression problem to one of finding the nonlinear, time-dependent RHS of a system of ODEs
Computational cost for Neural ODEs and ResNets is O(L)
Memory cost for Neural ODEs is independent of the number of time-steps used to solve the ODE
Memory cost for ResNets increases linearly with the number of layers
Neural ODEs can use any time-integrator, while ResNets use a forward Euler type method
Scalar advection-diffusion problem in one-dimension is described by equation (4.1)
Solution to (4.1) for f ≡ 0 can be analytically written as (4.2)
Thickness of boundary layer is given by δ ≈ l / Pe, where Pe is the Péclet number

Finite difference method

Discretize domain into grid of points
Approximate derivatives with finite difference approximations
Solve system of algebraic equations
Use Thomas tridiagonal algorithm to solve system
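The steps above can be sketched for a one-dimensional advection-diffusion problem. Equation (4.1) is not reproduced in this summary, so the form a u' − κ u'' = f with Dirichlet conditions u(0) = 0, u(l) = 1 is an assumption, as are the central differences, the parameter values, and the function names.

```python
import numpy as np

def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system; lower[0] and upper[-1] are ignored."""
    n = len(diag)
    c, d = np.zeros(n), np.zeros(n)
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):                         # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = np.zeros(n)
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):                # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def solve_advection_diffusion(a, kappa, n, length=1.0):
    x = np.linspace(0.0, length, n)               # grid of points
    h = x[1] - x[0]
    # Central differences for a*u' - kappa*u'' = 0 at interior nodes.
    lower = np.full(n, -a / (2 * h) - kappa / h ** 2)
    diag = np.full(n, 2 * kappa / h ** 2)
    upper = np.full(n, a / (2 * h) - kappa / h ** 2)
    rhs = np.zeros(n)
    # Dirichlet boundary rows: u(0) = 0, u(length) = 1.
    lower[0] = upper[0] = 0.0
    lower[-1] = upper[-1] = 0.0
    diag[0] = diag[-1] = 1.0
    rhs[-1] = 1.0
    return x, thomas(lower, diag, upper, rhs)

a, kappa = 1.0, 0.5
x, u = solve_advection_diffusion(a, kappa, n=101)
exact = (np.exp(a * x / kappa) - 1.0) / (np.exp(a / kappa) - 1.0)
max_err = np.max(np.abs(u - exact))
```

For larger a/κ the boundary layer sharpens, and central differences on a coarse grid start to oscillate, which is one motivation for looking at alternative solution methods.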

Spectral collocation method

Select a set of global basis functions
Use Chebyshev polynomials on the interval (0, 1)
Evaluate derivatives for the PDE
Use boundary conditions of the PDE
Solve linear system for coefficients
Use least-square variant to find coefficients
Use gradient-based methods to solve non-linear PDEs

Physics-informed neural networks (PINNs)

Idea of using neural networks to solve PDEs introduced in the 1990s and 2000s
Rediscovered in 2019 and given the term PINNs (physics-informed neural networks)
Loss function contains derivative operators arising in the PDE
Neural network used as function representation of PDE solution
Representation must be complete, smooth, easy to evaluate, and easy to differentiate
Find θ such that PDE is satisfied in some suitable form
After training, solution is u*(x) = F(x; θ*)
Improve accuracy by increasing number of collocation points, changing hyper-parameter λ_b, or increasing size of network
Boundary conditions enforced as soft constraint via penalization term Π_b(θ)
Hyper-parameter λ_b balances interplay between two loss terms during minimization process
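One common way to write such a loss is shown below. The mean-square form of the residuals and the names Π_int, N_int and N_b are assumptions made for illustration; L and B denote the differential and boundary operators of the PDE, with source term f and boundary data g.

```latex
\begin{aligned}
\Pi(\theta) &= \Pi_{\mathrm{int}}(\theta) + \lambda_b \, \Pi_b(\theta), \\
\Pi_{\mathrm{int}}(\theta) &= \frac{1}{N_{\mathrm{int}}} \sum_{i=1}^{N_{\mathrm{int}}}
  \left| \mathcal{L}\big(\mathcal{F}(x_i;\theta)\big) - f(x_i) \right|^2, \\
\Pi_b(\theta) &= \frac{1}{N_b} \sum_{j=1}^{N_b}
  \left| \mathcal{B}\big(\mathcal{F}(x_j;\theta)\big) - g(x_j) \right|^2
\end{aligned}
```

The first term penalizes the PDE residual at interior collocation points and the second penalizes the boundary mismatch, so a larger λ_b pushes the minimization towards satisfying the boundary conditions more closely.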

Extending PINNs to a more general PDE

PDE is a general equation to find a solution u
Example is the three-dimensional incompressible Navier-Stokes equation
Input to the network is [s1, s2, s3, t]
Output is a vector u with four components
Loss function is given
Regularization term is added to the loss
Exact solution to the PDE is not guaranteed
Error e = u* - u is related to the loss value

Error analysis for PINNs

Error e = u* − u needs the exact solution u, which is not available
Analysis considers linear PDEs (L and B are linear operators)
Error can be controlled if residuals of u * are small
Error can be controlled by reducing loss functions and increasing number of interior and boundary collocation points

Data assimilation using PINNs

Data assimilation is a problem encountered in science and engineering where sparse measurements of a quantity are used to evaluate it everywhere on a fine grid.
Data assimilation is solved using PINNs (Physics-Informed Neural Networks) which represent the quantity using a neural network and define a loss function to train the network.

Functions and images

Function u(x) defines a grayscale image with pixel intensity values.
Color images are three-dimensional tensors with red, blue and green channels.
Fully-connected neural networks applied to images require a very large input dimension and, in turn, very large connected layers.
Unravelling an image loses spatial context.
Local operations should be the same in any region of the image.

Convolutions of functions

Convolution operator maps functions to functions
Kernel function g(x) decays as |x| → ∞
Convolution operator samples u by varying x
Kernel shifts to different locations to sample u in different windows

Example 1

Gaussian kernel is a popular choice for smoothing/blurring filter
It is isotropic and the integral over the whole domain is unity
Parameterized by σ, which filters out scales finer than σ

Example 2

Kernel produces derivative of a smooth version of u
Kernel is shown in 1D and 2D in Figure 5.3
Action of kernel looks like smoothed finite difference operation
Region to left of center of kernel is weighted by negative value
Region to right of center of kernel is weighted by positive value

Discrete convolutions

Discrete convolution in 2D can be evaluated using quadrature.
Kernel has a finite width set by N (2N + 1 pixels), and the kernel is taken to be zero for pixels outside this window.

Connection to finite differences

Convolution is related to the stencil of a finite difference scheme
A function u(x1, x2) is represented on a finite grid
Taylor series expansion is used to approximate derivatives along the 1-direction
Convolution with a specific kernel approximates the computation of the second derivative along the 1-direction
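The connection can be checked numerically in one dimension, which corresponds to a single grid line along the 1-direction. The test function, grid size, and use of `np.convolve` are assumptions made for this sketch.

```python
import numpy as np

n = 101
x1 = np.linspace(0.0, 1.0, n)
h = x1[1] - x1[0]
u = np.sin(2 * np.pi * x1)

# Central second-difference stencil written as a convolution kernel.
kernel = np.array([1.0, -2.0, 1.0]) / h ** 2
d2u = np.convolve(u, kernel, mode="valid")       # interior grid points only

exact = -(2 * np.pi) ** 2 * np.sin(2 * np.pi * x1[1:-1])
max_err = np.max(np.abs(d2u - exact))
```

The kernel weights are exactly the finite difference stencil coefficients, and the error shrinks as the grid is refined, as the Taylor series argument predicts.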

Convolution layers

Convolution layers consist of multiple kernels.
Weights of kernels are learnable parameters.
Network learns operations that are appropriate for its task.

Average and max pooling

Pooling operations reduce the size of an image and allow you to step through different scales.
Pooling operations do not have any trainable parameters.
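A minimal sketch of non-overlapping pooling follows. The function name, the 2 × 2 window, and the requirement that the image size is divisible by the window are assumptions of this example.

```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    """Non-overlapping pooling; height and width must be divisible by size."""
    h, w = image.shape
    blocks = image.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(image, mode="max")    # [[5, 7], [13, 15]]
avg_pooled = pool2d(image, mode="avg")    # [[2.5, 4.5], [10.5, 12.5]]
```

The 4 × 4 image is reduced to 2 × 2, and the function contains no weights, consistent with pooling having no trainable parameters.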

Convolution for inputs with multiple channels

Convolution layer takes an input of size N_1 × N_2 × C and produces an output of size M_1 × M_2 × P.
Each of the P convolutions uses a kernel of width k = 2N + 1, which has k × k weights for each of the C input channels, i.e. k × k × C weights in total.
Output of each convolution is stacked together to give the final output of the convolution layer.
Total number of trainable weights is (2N + 1) × (2N + 1) × P × C.
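The shapes and the weight count can be verified with a direct, loop-based sketch. It assumes stride 1, no padding, and no bias terms, and it applies the kernel without flipping, as is usual in deep learning libraries. The function name and sizes are hypothetical.

```python
import numpy as np

def conv_layer(x, kernels):
    """x: (N1, N2, C) input; kernels: (P, k, k, C); returns (M1, M2, P)."""
    n1, n2, c = x.shape
    p, k, _, _ = kernels.shape
    m1, m2 = n1 - k + 1, n2 - k + 1               # output size without padding
    out = np.zeros((m1, m2, p))
    for i in range(m1):
        for j in range(m2):
            window = x[i:i + k, j:j + k, :]       # k x k x C patch of the input
            # Each of the P kernels is summed against the patch over all C channels.
            out[i, j, :] = np.tensordot(kernels, window, axes=3)
    return out

N, C, P = 1, 3, 4                                 # kernel width k = 2N + 1 = 3
k = 2 * N + 1
x = np.ones((8, 8, C))
kernels = np.ones((P, k, k, C))
out = conv_layer(x, kernels)
n_weights = kernels.size
```

With all-ones input and weights, every output entry equals k × k × C = 27, and the kernel array holds (2N + 1) × (2N + 1) × P × C = 108 weights.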

Convolution neural network (CNN)

CNN used to solve image classification problem
Output of final pooling layer flattened to form vector
Vector passed through fully connected layers with activation function
Final fully connected layer reduces the vector to length C, the number of classes
Cross-entropy function used as loss function
Transpose convolution layers also called fractionally-strided layers
Upscaling done with reduction in number of channels

Image-to-image transformations

Image-to-image transformations are similar to function-to-function transformations.
U-Nets are a type of network used for these transformations.
U-Nets have a downward branch which downscales the image and an upward branch which upscales the image.
U-Nets also use skip connections to combine information from the downward and upward branches.

The problem with PINNs

MLP takes input x and gives output y with trainable weights θ
PINN is a network that takes input x of underlying PDE and gives solution u(x) as output
PINN is trained by minimizing weighted sum of PDE and boundary residual
Changing f or g in PDE requires retraining of network

Parametrized PDEs

Source term f is given as a parametric function
Network takes as input both x and α
Train network by minimizing loss function
Solution to PDE is u(x, α) = F(x, α; θ*)
Find solution for arbitrary, non-parametric f by approximating operators that map functions to functions

Operators

Operator N maps conductivity κ and boundary condition g to solution u of PDE
Operator N maps initial condition u0 to solution u at final time T
Examples 1 and 4 are linear operators, examples 2, 3 and 5 are non-linear
Interested in networks that approximate operator N
Two popular versions of networks: DeepONet and Fourier Neural Operator

Deep operator network (DeepONet) architecture

Operator networks proposed by Chen and Chen
DeepONets comprise two neural networks
DeepONet approximates an operator N
Branch net B takes the input function, sampled at the sensor points, and outputs a vector
Trunk net T takes the coordinate x at which the output is evaluated and outputs a vector
Dot product of branch and trunk net outputs is the final output of DeepONet
DeepONet can approximate N(a)(x) for any input function a and point x
DeepONet uses pre-defined sensor points
Branch net is fully connected or convolutional
DeepONet expressivity can be improved by increasing parameters or latent vector dimension
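The branch and trunk structure can be sketched with two small, untrained fully connected networks. All sizes, the tanh activation, the random weights, and the function names are assumptions made for illustration. The sketch only shows how the pieces connect, not a trained operator.

```python
import numpy as np

def init_mlp(widths, rng):
    return [(rng.standard_normal((n_out, n_in)) / np.sqrt(n_in), np.zeros(n_out))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def mlp(params, z):
    for W, b in params[:-1]:
        z = np.tanh(W @ z + b)
    W, b = params[-1]
    return W @ z + b

rng = np.random.default_rng(0)
n_sensors, latent_dim, x_dim = 20, 16, 1          # hypothetical sizes
branch = init_mlp([n_sensors, 32, latent_dim], rng)
trunk = init_mlp([x_dim, 32, latent_dim], rng)

def deeponet(a_at_sensors, x):
    """Dot product of the branch output (from the function) and trunk output (from the point)."""
    return float(np.dot(mlp(branch, a_at_sensors), mlp(trunk, x)))

sensors = np.linspace(0.0, 1.0, n_sensors)        # pre-defined sensor points
a = np.sin(np.pi * sensors)                       # input function sampled at the sensors
value = deeponet(a, np.array([0.3]))              # untrained estimate of N(a)(0.3)
```

Both networks map to vectors of the same latent dimension, so increasing `latent_dim` or the hidden widths are the two ways of adding expressivity mentioned above.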

Training DeepONets

Error analysis for DeepONets

Chen and Chen [5] have a theorem that states a DeepONet with a single hidden layer can approximate any nonlinear, continuous operator mapping with a certain accuracy.
The theorem has been extended to a deeper version of the network and generalized by removing the compactness assumptions on the spaces.
An estimate of the error in a DeepONet has been developed in [20] which states the error is bounded by the numerical solver used, the final training error, and the distance of any input from the training set.
