Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Compiled as lecture notes for a course at the University of Southern California
  • Accessible to engineering graduate students with a strong background in Applied Mathematics
  • Introduce students to topics in deep learning
  • Exploit connections between deep learning algorithms and conventional techniques of computational physics
  • Use concepts from computational physics to develop understanding of deep learning algorithms
  • Novel deep learning algorithms can be used to solve challenging problems in computational physics

Paper Content

Computational physics

  • Computational physics solves problems in science and engineering
  • Involves collecting measurements of an observable
  • Postulating a physical law based on observations
  • Writing a mathematical description of the law
  • Solving the system using exact or approximate methods

Machine learning

  • ML does not require physical laws
  • Collect data from physical phenomena, measurements, or numerical solvers
  • Train algorithm to discover patterns or relations
  • Use ML algorithm to make predictions and validate with data

Examples of ML

  • Regression algorithms are used to approximate a function given a set of pairwise data
  • Decision trees can be used to predict the probability of a new individual owning a house
  • Clustering algorithms are used to find patterns in a set of data

Types of ML algorithms based on learning task

  • Supervised learning: Predicting labels for new data based on existing data
  • Unsupervised learning: Finding relations among different regions of data

Artificial intelligence, machine learning and deep learning

  • AI, ML and DL are related but different concepts
  • AI refers to a system with human-like intelligence
  • ML is a key component of an AI system
  • Self-driving cars are an example of AI
  • ML algorithms are trained using data
  • DL is a subset of ML algorithms
  • DL architecture is loosely motivated by how signals are transmitted by the central nervous system

Machine learning and computational physics

  • Combining computational physics and ML can provide an alternate route to representing mathematical laws.
  • Physics knowledge can help reduce the amount of data required to train ML algorithms.
  • Tools from computational physics can be applied to ML to better understand and design ML algorithms.

MLP architecture

  • MLP is used to approximate a function f: x ∈ R^d → y ∈ R^D
  • MLP consists of a source layer, hidden layers and an output layer
  • Weights and biases are associated with each neuron
  • Activation function is applied component-wise
  • Output function may be used at the end of the output layer
  • Depth of the network is the number of computing layers
  • Parameters of the network are the weights and biases
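
As a minimal sketch of the architecture described above, the following NumPy code evaluates an MLP with tanh hidden activations and no output function. The layer widths, the random initialization and the names `init_mlp` and `mlp_forward` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def init_mlp(widths, seed=0):
    """Create (weights, biases) for consecutive layers of the given widths."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((n_out, n_in)) / np.sqrt(n_in), np.zeros(n_out))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def mlp_forward(params, x, activation=np.tanh):
    """Affine transform followed by a component-wise activation in each hidden layer."""
    for W, b in params[:-1]:
        x = activation(W @ x + b)
    W, b = params[-1]
    return W @ x + b  # output layer: affine transform only (no output function)

widths = [3, 16, 16, 2]   # assumed sizes: d = 3 inputs, two hidden layers, D = 2 outputs
params = init_mlp(widths)
y = mlp_forward(params, np.ones(3))

# The network parameters are all the weights and biases
n_params = sum(W.size + b.size for W, b in params)
```

With these assumed widths the network has three computing layers (two hidden layers plus the output layer).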

Activation functions

  • Activation function is important in MLP
  • Many activation functions available with different advantages and disadvantages
  • Figure 2.2 provides more information

Linear activation

  • Linear activation function is infinitely smooth.
  • Range of the function is from negative infinity to positive infinity.
  • With linear activations the network reduces to a single affine transformation of the input, so it can only give a linear approximation of the target function.

Rectified linear unit (ReLU)

  • Function is piecewise linear
  • Function is continuous
  • Derivative is piecewise constant with a jump at ξ = 0
  • Second derivative is a Dirac delta function concentrated at ξ = 0
  • Range of the function is [0, ∞)

Leaky ReLU

  • ReLU gives a null output whenever the affine transformation is negative, which can lead to dying neurons.
  • Leaky ReLU was designed to overcome the problem of dying neurons.
  • Derivatives of Leaky ReLU behave the same as those of ReLU.
  • Range of Leaky ReLU is (−∞, ∞).

Logistic function

  • Function is smooth and monotonic.
  • Range of function is (0, 1).
  • Derivative quickly decays to zero away from 0, leading to slow convergence.

Tanh

  • Tanh is a symmetrical extension of the logistic function.
  • Tanh is smooth, monotonic and bounded between -1 and 1.

Sine

  • Sine function is an efficient activation function.
  • Sine function is infinitely smooth and bounded.
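
The activation functions listed in the preceding sections can be sketched in NumPy as follows. The Leaky ReLU slope `alpha = 0.01` is an assumed value, since the summary does not state the one used in the paper.

```python
import numpy as np

def linear(xi):
    return xi

def relu(xi):
    return np.maximum(0.0, xi)

def leaky_relu(xi, alpha=0.01):
    # alpha is an assumed slope for negative inputs
    return np.where(xi > 0, xi, alpha * xi)

def logistic(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def tanh(xi):
    return np.tanh(xi)

def sine(xi):
    return np.sin(xi)

xi = np.linspace(-5.0, 5.0, 11)
values = {f.__name__: f(xi) for f in (linear, relu, leaky_relu, logistic, tanh, sine)}
```

The identity tanh(ξ) = 2·logistic(2ξ) − 1 is one way to see tanh as a symmetric extension of the logistic function.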

Expressivity of a network

  • The effect of increasing N_θ (the number of network parameters) is examined for a network with ReLU activation.
  • A simple example is used to illustrate the effects.
  • The number of kinks in the output increases as the depth and width of the network increases.

Universal approximation results

  • MLPs can approximate continuous functions with a single hidden layer
  • MLPs can approximate continuous vector-valued functions with multiple hidden layers
  • MLPs with ReLU activations can approximate functions with two continuous derivatives
  • Numerical results help demystify the “black-box” nature of neural networks

Training, validation and testing of neural networks

  • MLPs are used to approximate a target function
  • Parameters of MLPs are set in a supervised learning framework
  • Dataset is split into 3 parts: training, validation, and testing
  • Training phase finds optimal values of the parameters θ for fixed hyper-parameters Θ
  • Validation phase finds optimal values of the hyper-parameters Θ
  • Testing phase evaluates performance of trained network on unseen data
  • Loss function is used to optimize network parameters
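
A minimal sketch of the three-way split, assuming a 70/15/15 ratio and a random shuffle (neither is specified in the summary). The function name `split_dataset` and the toy data are hypothetical.

```python
import numpy as np

def split_dataset(x, y, frac_train=0.7, frac_val=0.15, seed=0):
    """Shuffle the samples and split them into training, validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_train = int(round(frac_train * len(x)))
    n_val = int(round(frac_val * len(x)))
    i_train, i_val, i_test = np.split(idx, [n_train, n_train + n_val])
    return (x[i_train], y[i_train]), (x[i_val], y[i_val]), (x[i_test], y[i_test])

x = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x)          # toy target function
train, val, test = split_dataset(x, y)
```

The training set is used to fit θ, the validation set to compare choices of Θ, and the test set is touched only once at the end.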

Generalizability

  • Training a network with a small value of Π_train and Π_val does not guarantee a small value of Π_test.
  • Regularization is a technique used to avoid data overfitting.

Regularization

  • Neural networks are often over-parametrized
  • Loss function can have many local minima
  • Regularization technique can be used to nudge the choice of θ
  • Regularization encourages selection of minima with smaller values of θ
  • Regularization helps avoid overfitting
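
To illustrate how a penalty nudges the choice of θ, the sketch below adds a squared-norm term to a mean square error loss. The L2 form of the penalty, the weight `alpha` and the toy over-parametrized linear model are assumptions made for illustration.

```python
import numpy as np

def mse(theta, X, y):
    return np.mean((X @ theta - y) ** 2)

def regularized_loss(theta, X, y, alpha=1e-2):
    # alpha is an assumed regularization weight; the penalty favours small theta
    return mse(theta, X, y) + alpha * np.sum(theta ** 2)

# Over-parametrized toy model: the two columns are identical,
# so many different theta fit the data exactly.
X = np.array([[1.0, 1.0], [2.0, 2.0]])
y = np.array([2.0, 4.0])
theta_small = np.array([1.0, 1.0])
theta_large = np.array([5.0, -3.0])

loss_small = regularized_loss(theta_small, X, y)
loss_large = regularized_loss(theta_large, X, y)
```

Both parameter vectors have zero data misfit, but the regularized loss is lower for the one with the smaller norm.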

Gradient descent

  • We wish to solve a minimization problem using gradient descent
  • Taylor expansion is used to approximate the problem
  • Gradient descent requires a step-size (learning-rate) to be tuned
  • GD converges when the local curvature is less than 2/η
  • GD prefers flat minima
  • Flat minima tend to generalize better
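
The curvature condition can be checked on a one-dimensional quadratic Π(θ) = ½cθ², whose gradient is cθ and whose curvature is c. This is a sketch on an assumed toy problem, with an assumed learning rate η = 0.1 so that the threshold 2/η equals 20.

```python
def gradient_descent(grad, theta0, eta, n_steps):
    """Plain gradient descent with a fixed learning rate eta."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - eta * grad(theta)
    return theta

eta = 0.1  # threshold curvature is 2 / eta = 20

# Curvature 5 < 20: each step multiplies theta by (1 - 0.5), so the iterates converge
flat = gradient_descent(lambda t: 5.0 * t, 1.0, eta, 100)

# Curvature 25 > 20: each step multiplies theta by (1 - 2.5), so the iterates diverge
sharp = gradient_descent(lambda t: 25.0 * t, 1.0, eta, 100)
```

For a fixed η, minima sharper than 2/η are unstable for GD, which is one way to read the statement that GD prefers flat minima.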

Some advanced optimization algorithms

  • GD can be used to solve the optimization problem involved in training neural networks.
  • Most optimization algorithms use a formula that includes a component-wise learning rate and a vector-valued function that approximates the gradient.
  • If the learning rate is too large, the updates will zig-zag their way towards the minimum.
  • Two popular methods can help overcome some of the issues faced by GD.

Momentum methods

  • Momentum methods use the history of the gradient, not just the previous step.
  • The formula for the update includes a weighted moving average of the gradient.
  • This weighting is expected to reduce zig-zagging and move more smoothly towards the minimum.
  • A commonly used value for β_1 is 0.9.
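
A minimal sketch of a momentum update, assuming the weighted moving average takes the exponential form g ← β_1 g + (1 − β_1)∇Π. The summary does not give the exact formula, so this form, the learning rate and the ill-conditioned quadratic test problem are assumptions.

```python
import numpy as np

def momentum_descent(grad, theta0, eta=0.03, beta1=0.9, n_steps=500):
    theta = np.asarray(theta0, dtype=float)
    g_avg = np.zeros_like(theta)
    for _ in range(n_steps):
        # weighted moving average of the gradient history
        g_avg = beta1 * g_avg + (1.0 - beta1) * grad(theta)
        theta = theta - eta * g_avg
    return theta

# Quadratic loss 0.5 * (1 * theta_1^2 + 50 * theta_2^2): very different curvatures
curvatures = np.array([1.0, 50.0])
theta = momentum_descent(lambda t: curvatures * t, [1.0, 1.0])
```

With the same η = 0.03, plain GD would flip the sign of the second component at every step (factor 1 − 0.03·50 = −0.5), which is the zig-zag behaviour the averaging is meant to damp.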

Adam

  • Adam optimization was introduced by Kingma and Ba
  • Adam optimization uses the history of the gradient and the second moment of the gradient
  • Adam updates are given by a formula with a learning rate η
  • Recommended values for hyper-parameters are β_1 = 0.9, β_2 = 0.999 and ε = 10^−8
  • Each component has its own effective learning rate: the larger the gradient magnitude for a component, the smaller its learning rate
  • Adam algorithm has additional correction steps to improve efficiency
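
The sketch below follows the standard Adam update of Kingma and Ba, including the bias-correction steps. The function name `adam_step`, the state tuple and the toy gradient are illustrative assumptions.

```python
import numpy as np

def adam_step(theta, grad, state, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; state holds (first moment, second moment, step count)."""
    m, v, k = state
    k += 1
    m = beta1 * m + (1 - beta1) * grad           # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2      # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** k)                 # bias corrections for the zero initialization
    v_hat = v / (1 - beta2 ** k)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, v, k)

theta0 = np.array([1.0, 1.0])
state = (np.zeros(2), np.zeros(2), 0)
grad = np.array([100.0, 0.01]) * theta0          # gradient components of very different size
theta1, state = adam_step(theta0, grad, state, eta=0.1)
```

Although the two gradient components differ by four orders of magnitude, the first step moves both components by roughly η, which illustrates the component-wise scaling of the learning rate.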

Stochastic optimization

  • Training loss can be rewritten
  • Gradient of loss function is expensive to calculate
  • Stochastic gradient descent (SGD) is an easy way to circumvent this problem
  • SGD converges if Π_i(θ) is convex and differentiable
  • SGD needs to have a decaying learning rate
  • SGD algorithm keeps overshooting the minima without decay
  • SGD algorithm moves closer to θ* with decaying learning rate
  • Pure SGD (one sample per update) is rarely used in practice because of its chaotic behaviour and under-utilization of computational resources
  • Mini-batch optimization is a compromise between full-batch GD and single-sample SGD
  • Mini-batch stochastic optimization algorithm evaluates batch gradient and updates θ
  • SGD might help in selecting minima that generalize better
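
A minimal sketch of mini-batch stochastic optimization with a decaying learning rate, applied to a linear least-squares problem. The data, batch size, decay schedule and noise-free targets are all assumptions chosen so that the example is easy to verify.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(-1.0, 1.0, 200), np.ones(200)])
theta_star = np.array([2.0, 1.0])
y = X @ theta_star                        # noise-free targets from a known theta*

theta = np.zeros(2)
batch_size, eta0 = 20, 0.3
for epoch in range(200):
    eta = eta0 / (1.0 + 0.01 * epoch)     # assumed decay schedule
    perm = rng.permutation(len(X))        # reshuffle the samples every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        residual = X[idx] @ theta - y[idx]
        grad = 2.0 * X[idx].T @ residual / len(idx)   # gradient of the batch MSE
        theta = theta - eta * grad
```

Each update touches only 20 of the 200 samples, so one epoch performs ten cheap updates instead of a single full-gradient step.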

Calculating gradients using back-propagation

  • Output of layer l+1 is given by an affine transform and a non-linear transform
  • Loss/objective function can be evaluated using a forward pass
  • To update network parameters, need derivatives of loss with respect to parameters
  • Chain rule is used to evaluate derivatives
  • Computational graph is used to represent evaluation of loss and its gradient
  • Back propagation is used to traverse in backward direction
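
The forward and backward passes can be sketched for a network with one hidden layer. The tanh activation, the squared-error loss and the layer sizes are assumptions; the finite-difference comparison at the end is only a sanity check of the chain-rule derivation.

```python
import numpy as np

def forward(params, x, y):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1                 # affine transform
    a1 = np.tanh(z1)                 # non-linear transform
    out = W2 @ a1 + b2
    loss = 0.5 * np.sum((out - y) ** 2)
    return loss, (x, a1, out)

def backward(params, cache, y):
    """Traverse the computational graph backwards using the chain rule."""
    W1, b1, W2, b2 = params
    x, a1, out = cache
    d_out = out - y                          # d(loss)/d(out)
    dW2, db2 = np.outer(d_out, a1), d_out
    d_a1 = W2.T @ d_out                      # propagate through the output layer
    d_z1 = d_a1 * (1.0 - a1 ** 2)            # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = np.outer(d_z1, x), d_z1
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
params = [rng.standard_normal((4, 3)), rng.standard_normal(4),
          rng.standard_normal((2, 4)), rng.standard_normal(2)]
x, y = rng.standard_normal(3), rng.standard_normal(2)
loss, cache = forward(params, x, y)
grads = backward(params, cache, y)

# Finite-difference check for a single weight
h = 1e-6
W1_shift = params[0].copy()
W1_shift[0, 0] += h
loss_shift, _ = forward([W1_shift] + params[1:], x, y)
fd_estimate = (loss_shift - loss) / h
```

The backward pass reuses the intermediate values stored during the forward pass, which is why a single backward traversal yields all the derivatives.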

Regression versus classification

  • Two types of losses considered: mean square error (MSE) and mean absolute error (MAE)
  • Neural networks can be used to solve regression problems with nonlinear inputs/outputs
  • Examples of classification problems given
  • Output labels need to be one-hot encoded
  • Cross-entropy loss function used to measure discrepancy between two distributions
  • Cross-entropy loss function penalizes incorrect confident predictions
  • Residual networks (ResNets) introduced by He et al. in 2015
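
One-hot encoding and the cross-entropy loss can be sketched as follows. The small constant `tiny` inside the logarithm is an assumed numerical safeguard, and the probability vectors are made-up examples.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Map integer labels to one-hot encoded rows."""
    return np.eye(n_classes)[labels]

def cross_entropy(probs, targets, tiny=1e-12):
    # tiny avoids log(0); it is a numerical safeguard, not part of the definition
    return -np.mean(np.sum(targets * np.log(probs + tiny), axis=1))

targets = one_hot(np.array([0]), 3)               # true class is 0
confident_right = np.array([[0.98, 0.01, 0.01]])
confident_wrong = np.array([[0.01, 0.98, 0.01]])

loss_right = cross_entropy(confident_right, targets)
loss_wrong = cross_entropy(confident_wrong, targets)
```

The confident but incorrect prediction receives a much larger loss, which is the penalization behaviour noted in the list above.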

Vanishing gradients in deep networks

  • Training neural networks can lead to gradients becoming very small.
  • This can happen when the network is very deep, e.g. L ≥ 20.
  • This can lead to the benefit of deep networks being lost.
  • The product of scalars in the equation can become very small, leading to vanishing gradients.
  • This can lead to an increase in training and validation error.

ResNets

  • Skip connections add the input of a layer to its output, so even if all weights and biases are null the layer reduces to the identity and gradients will not vanish
  • Computational graph for forward and back-propagation of a ResNet is shown in Figure 3.3

Connections with ODEs

  • ResNet with d = D = H has relation (3.4)
  • ResNet is a discretization of a non-linear system of ODEs
  • Training a ResNet means determining parameters θ of the network
  • Training a system of ODEs means determining the right-hand side V(x, t)
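
The link between ResNets and ODEs can be sketched by comparing a residual forward pass with a forward Euler integration. The layer form x ← x + σ(Wx + b), the tanh activation, the unit time interval and the sizes are assumptions, since relation (3.4) is not reproduced in this summary.

```python
import numpy as np

def resnet_forward(x, layers, activation=np.tanh):
    for W, b in layers:
        x = x + activation(W @ x + b)      # skip connection adds the layer input
    return x

def forward_euler(x, rhs, dt, n_steps):
    for step in range(n_steps):
        x = x + dt * rhs(x, step * dt)
    return x

H, L = 3, 10                               # assumed width and number of layers
rng = np.random.default_rng(0)
layers = [(0.1 * rng.standard_normal((H, H)), np.zeros(H)) for _ in range(L)]
x0 = rng.standard_normal(H)

dt = 1.0 / L
def rhs(x, t):
    # Time-dependent right-hand side V(x, t) built from the layer active at time t
    W, b = layers[min(int(round(t / dt)), L - 1)]
    return np.tanh(W @ x + b) / dt

x_resnet = resnet_forward(x0, layers)
x_ode = forward_euler(x0, rhs, dt, L)

# With null weights and biases every layer is the identity map
x_identity = resnet_forward(x0, [(np.zeros((H, H)), np.zeros(H))] * L)
```

Each layer plays the role of one time step, so the depth of the network corresponds to the number of steps of the integrator.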

Neural ODEs

  • Neural ODEs proposed in [4] transform a regression problem to one of finding the nonlinear, time-dependent RHS of a system of ODEs
  • Computational cost for Neural ODEs and ResNets is O(L)
  • Memory cost for Neural ODEs is independent of the number of time-steps used to solve the ODE
  • Memory cost for ResNets increases linearly with the number of layers
  • Neural ODEs can use any time-integrator, while ResNets use a forward Euler type method
  • Scalar advection-diffusion problem in one-dimension is described by equation (4.1)
  • Solution to (4.1) for f ≡ 0 can be analytically written as (4.2)
  • Thickness of boundary layer is given by δ ≈ l / Pe, so the layer becomes thinner as the Péclet number Pe grows

Finite difference method

  • Discretize domain into grid of points
  • Approximate derivatives with finite difference approximations
  • Solve system of algebraic equations
  • Use Thomas tridiagonal algorithm to solve system
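
The steps above can be sketched for a one-dimensional advection-diffusion problem. Equation (4.1) is not reproduced in this summary, so the form a u' − κ u'' = 0 on (0, l) with u(0) = 0 and u(l) = 1, the central differences and all parameter values are assumptions.

```python
import numpy as np

def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system; lower[0] and upper[-1] are unused."""
    n = len(diag)
    c, d = np.zeros(n), np.zeros(n)
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):                       # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    u = np.zeros(n)
    u[-1] = d[-1]
    for i in range(n - 2, -1, -1):              # back substitution
        u[i] = d[i] - c[i] * u[i + 1]
    return u

a, kappa, length, n = 1.0, 0.1, 1.0, 200        # assumed values, a * length / kappa = 10
h = length / (n + 1)
x = np.linspace(h, length - h, n)               # interior grid points

# Central differences for a u' - kappa u'' at each interior point
lower = np.full(n, -a / (2 * h) - kappa / h ** 2)
diag = np.full(n, 2 * kappa / h ** 2)
upper = np.full(n, a / (2 * h) - kappa / h ** 2)
rhs = np.zeros(n)
rhs[-1] = -upper[-1] * 1.0                      # boundary value u(length) = 1 moved to the right-hand side

u_fd = thomas(lower, diag, upper, rhs)
u_exact = (np.exp(a * x / kappa) - 1.0) / (np.exp(a * length / kappa) - 1.0)
max_error = np.max(np.abs(u_fd - u_exact))
```

The grid here is fine enough to resolve the boundary layer near x = l; on coarser grids central differences are known to oscillate for advection-dominated problems.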

Spectral collocation method

  • Select a set of global basis functions
  • Use Chebyshev polynomials on the interval (0, 1)
  • Evaluate derivatives for the PDE
  • Use boundary conditions of the PDE
  • Solve linear system for coefficients
  • Use least-square variant to find coefficients
  • Use gradient-based methods to solve non-linear PDEs

Physics-informed neural networks (PINNs)

  • Idea of using neural networks to solve PDEs was introduced in the 1990s and 2000s
  • Rediscovered in 2019 and given the name PINNs (physics-informed neural networks)
  • Loss function contains the derivative operators arising in the PDE
  • Neural network used as function representation of PDE solution
  • Representation must be complete, smooth, easy to evaluate and easy to differentiate
  • Find θ such that PDE is satisfied in some suitable form
  • After training, solution is u*(x) = F(x; θ*)
  • Improve accuracy by increasing number of collocation points, changing hyper-parameter λ_b, or increasing size of network
  • Boundary conditions enforced as soft constraint via penalization term Π_b(θ)
  • Hyper-parameter λ_b balances interplay between two loss terms during minimization process
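
As a sketch of how such a loss can be assembled, the code below evaluates an interior residual term plus λ_b times a boundary penalty for a one-hidden-layer tanh network, whose derivatives are written out by hand instead of using automatic differentiation. The PDE a u' − κ u'' = f with Dirichlet data, the network form and all numerical values are assumptions. Only the loss evaluation is shown; minimizing it over θ with a gradient-based optimizer is omitted.

```python
import numpy as np

def network(theta, x):
    """Return u, du/dx and d2u/dx2 of u(x) = sum_j c_j tanh(w_j x + b_j)."""
    w, b, c = theta
    t = np.tanh(np.outer(x, w) + b)         # shape (n_points, n_hidden)
    dt = 1.0 - t ** 2                       # derivative of tanh
    u = t @ c
    du = dt @ (c * w)
    d2u = (-2.0 * t * dt) @ (c * w ** 2)
    return u, du, d2u

def pinn_loss(theta, x_int, a, kappa, f, g0, gl, length, lambda_b):
    _, du, d2u = network(theta, x_int)
    residual = a * du - kappa * d2u - f(x_int)       # PDE residual at collocation points
    pi_int = np.mean(residual ** 2)
    u_bnd, _, _ = network(theta, np.array([0.0, length]))
    pi_b = (u_bnd[0] - g0) ** 2 + (u_bnd[1] - gl) ** 2   # soft boundary constraint
    return pi_int + lambda_b * pi_b

rng = np.random.default_rng(0)
theta = (rng.standard_normal(8), rng.standard_normal(8), rng.standard_normal(8))
x_int = np.linspace(0.0, 1.0, 50)[1:-1]              # interior collocation points
loss = pinn_loss(theta, x_int, a=1.0, kappa=0.1, f=lambda x: np.zeros_like(x),
                 g0=0.0, gl=1.0, length=1.0, lambda_b=10.0)
```

Because the network is smooth in x, the residual can be evaluated at any collocation point without a grid.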

Extending PINNs to a more general PDE

  • PDE is a general equation to find a solution u
  • Example is the three-dimensional incompressible Navier-Stokes equation
  • Input to the network is [s1, s2, s3, t]
  • Output is a vector u with four components
  • Loss function is given
  • Regularization term is added to the loss
  • Exact solution to the PDE is not guaranteed
  • Error e = u* - u is related to the loss value

Error analysis for PINNs

  • Evaluating the error e = u* − u needs the exact solution u, which is not available
  • Analysis is carried out for linear PDEs (L and B are linear operators)
  • Error can be controlled if residuals of u* are small
  • Error can be controlled by reducing loss functions and increasing number of interior and boundary collocation points

Data assimilation using PINNs

  • Data assimilation is a problem encountered in science and engineering where sparse measurements of a quantity are used to evaluate it everywhere on a fine grid.
  • Data assimilation can be solved using PINNs, which represent the quantity with a neural network and define a loss function to train it.

Functions and images

  • Function u(x) defines a grayscale image with pixel intensity values.
  • Color images are three-dimensional tensors with red, green and blue channels.
  • Feeding an image to a fully-connected network requires a very large input dimension and densely connected layers.
  • Unravelling an image loses spatial context.
  • Local operations should be the same in any region of the image.

Convolutions of functions

  • Convolution operator maps functions to functions
  • Kernel function g(x) decays as |x| → ∞
  • Convolution operator samples u by varying x
  • Kernel shifts to different locations to sample u in different windows

Example 1

  • Gaussian kernel is a popular choice for smoothing/blurring filter
  • It is isotropic and the integral over the whole domain is unity
  • Parameterized by σ, which filters out scales finer than σ

Example 2

  • Kernel produces derivative of a smooth version of u
  • Kernel is shown in 1D and 2D in Figure 5.3
  • Action of kernel looks like smoothed finite difference operation
  • Region to left of center of kernel is weighted by negative value
  • Region to right of center of kernel is weighted by positive value

Discrete convolutions

  • Discrete convolution in 2D can be evaluated using quadrature.
  • Kernel width is measured by N (a kernel of width 2N + 1), and the kernel is zero for pixels outside this window.

Connection to finite differences

  • Convolution is related to the stencil of a finite difference scheme
  • A function u(x1, x2) is represented on a finite grid
  • Taylor series expansion is used to approximate derivatives along the 1-direction
  • Convolution with a specific kernel approximates the computation of the second derivative along the 1-direction
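
The connection can be checked numerically: applying the kernel [1, −2, 1]/h² along the 1-direction reproduces the standard central-difference approximation of the second derivative. The test function, grid spacing and the helper name `convolve_rows` are assumptions.

```python
import numpy as np

def convolve_rows(u, kernel):
    """Apply a 1D kernel along the first index of a 2D grid (interior points only).

    The kernel used below is symmetric, so flipping it (as a true convolution
    would) makes no difference.
    """
    n = len(kernel) // 2
    out = np.zeros_like(u)
    for shift, weight in zip(range(-n, n + 1), kernel):
        out[n:-n, :] += weight * u[n + shift: u.shape[0] - n + shift, :]
    return out

h = 0.01
x1 = np.arange(0.0, 3.0, h)
x2 = np.linspace(0.0, 1.0, 5)
u = np.sin(x1)[:, None] * np.cos(x2)[None, :]     # u(x1, x2) on a finite grid

kernel = np.array([1.0, -2.0, 1.0]) / h ** 2      # second-derivative stencil
d2u = convolve_rows(u, kernel)

# For this u the exact second derivative along x1 is -u
max_error = np.max(np.abs(d2u[1:-1] + u[1:-1]))
```

The error is of order h², as expected from the Taylor series argument.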

Convolution layers

  • Convolution layers consist of multiple kernels.
  • Weights of kernels are learnable parameters.
  • Network learns operations that are appropriate for its task.

Average and max pooling

  • Pooling operations reduce the size of an image and allow you to step through different scales.
  • Pooling operations do not have any trainable parameters.
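
A minimal sketch of max and average pooling over non-overlapping windows. The window size of 2 and the helper name `pool2d` are assumptions.

```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    """Pool a 2D image over non-overlapping size x size windows."""
    n1, n2 = image.shape[0] // size, image.shape[1] // size
    # Group pixels into blocks: axis 1 and axis 3 index positions inside a window
    blocks = image[:n1 * size, :n2 * size].reshape(n1, size, n2, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)
pooled_max = pool2d(image, mode="max")
pooled_avg = pool2d(image, mode="average")
```

Each pooling step halves the resolution in both directions here, and the function contains no trainable quantities.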

Convolution for inputs with multiple channels

  • Convolution layer takes an input of size N_1 × N_2 × C and produces an output of size M_1 × M_2 × P.
  • Each of the P convolutions uses a kernel of width k = 2N + 1, which has k × k weights for each of the C input channels, i.e. k × k × C weights per kernel.
  • Output of each convolution is stacked together to give the final output of the convolution layer.
  • Total number of trainable parameters is (2N + 1) × (2N + 1) × P × C.
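
The shapes and the parameter count can be checked with a naive implementation. Stride 1, no padding, no bias and the cross-correlation convention (no kernel flip) are assumptions, as are the sizes used in the example.

```python
import numpy as np

def conv_layer(x, kernels):
    """x has shape (N1, N2, C); kernels has shape (P, k, k, C).

    Returns an array of shape (N1 - k + 1, N2 - k + 1, P).
    """
    P, k, _, C = kernels.shape
    M1, M2 = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((M1, M2, P))
    for p in range(P):                     # one output channel per kernel
        for i in range(M1):
            for j in range(M2):
                out[i, j, p] = np.sum(x[i:i + k, j:j + k, :] * kernels[p])
    return out

rng = np.random.default_rng(0)
N, C, P = 1, 3, 8                          # assumed: kernel width k = 2N + 1 = 3
k = 2 * N + 1
x = rng.standard_normal((10, 10, C))
kernels = rng.standard_normal((P, k, k, C))
kernels[0] = 1.0                           # first kernel simply sums its window
out = conv_layer(x, kernels)
n_params = kernels.size                    # k * k * C * P trainable weights
```

The explicit loops are slow but make the indexing visible; practical libraries use optimized routines for the same operation.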

Convolution neural network (CNN)

  • CNN used to solve image classification problem
  • Output of final pooling layer flattened to form vector
  • Vector passed through fully connected layers with activation function
  • Final fully connected layer reduces the vector to length C, the number of classes
  • Cross-entropy function used as loss function
  • Transpose convolution layers also called fractionally-strided layers
  • Upscaling done with reduction in number of channels

Image-to-image transformations

  • Image-to-image transformations are similar to function-to-function transformations.
  • U-Nets are a type of network used for these transformations.
  • U-Nets have a downward branch which downscales the image and an upward branch which upscales the image.
  • U-Nets also use skip connections to combine information from the downward and upward branches.

The problem with PINNs

  • MLP takes input x and gives output y with trainable weights θ
  • PINN is a network that takes input x of underlying PDE and gives solution u(x) as output
  • PINN is trained by minimizing weighted sum of PDE and boundary residual
  • Changing f or g in PDE requires retraining of network

Parametrized PDEs

  • Source term f is given as a parametric function
  • Network takes as input both x and α
  • Train network by minimizing loss function
  • Solution to PDE is u(x, α) = F(x, α; θ*)
  • Find solution for arbitrary, non-parametric f by approximating operators that map functions to functions

Operators

  • Operator N maps conductivity κ and boundary condition g to solution u of PDE
  • Operator N maps initial condition u0 to solution u at final time T
  • Examples 1 and 4 are linear operators, examples 2, 3 and 5 are non-linear
  • Interested in networks that approximate operator N
  • Two popular versions of networks: DeepONet and Fourier Neural Operator

Deep operator network (DeepONet) architecture

  • Operator networks proposed by Chen and Chen
  • DeepONets comprise two neural networks
  • DeepONet approximates an operator N
  • Branch net B takes the input function a, sampled at the sensor points, and outputs a vector
  • Trunk net T takes the coordinate x at which the output is evaluated and outputs a vector of the same dimension
  • Dot product of branch and trunk net outputs is the final output of DeepONet
  • DeepONet can approximate any N(a)(x)
  • DeepONet uses pre-defined sensor points
  • Branch net is fully connected or convolutional
  • DeepONet expressivity can be improved by increasing parameters or latent vector dimension
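
A minimal sketch of a DeepONet forward pass with fully connected branch and trunk nets. The network widths, the number of sensor points `M`, the latent dimension `p`, the input function and the helper names are all illustrative assumptions, and the weights are untrained.

```python
import numpy as np

def init(widths, rng):
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(widths[:-1], widths[1:])]

def mlp(params, x):
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

def deeponet(branch, trunk, a_sensors, x):
    beta = mlp(branch, a_sensors)     # (p,) latent vector from the sampled input function
    tau = mlp(trunk, x)               # (n_points, p) latent vectors from the coordinates
    return tau @ beta                 # dot product at every evaluation point

rng = np.random.default_rng(0)
M, p = 20, 16                         # assumed number of sensors and latent dimension
branch = init([M, 32, p], rng)
trunk = init([1, 32, p], rng)

sensors = np.linspace(0.0, 1.0, M)    # pre-defined sensor points
a = np.sin(np.pi * sensors)           # input function sampled at the sensors
x = np.linspace(0.0, 1.0, 7)[:, None] # points where the output function is evaluated
u = deeponet(branch, trunk, a, x)
```

The branch net is evaluated once per input function, while the trunk net can be queried at any number of points x.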

Training DeepONets

Error analysis for DeepONets

  • Chen and Chen [5] have a theorem that states a DeepONet with a single hidden layer can approximate any nonlinear, continuous operator mapping with a certain accuracy.
  • The theorem has been extended to a deeper version of the network and generalized by removing the compactness assumptions on the spaces.
  • An estimate of the error in a DeepONet has been developed in [20] which states the error is bounded by the numerical solver used, the final training error, and the distance of any input from the training set.