Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Compiled as lecture notes for a course at the University of Southern California
- Accessible to engineering graduate students with a strong background in Applied Mathematics
- Introduce student to topics in deep learning
- Exploit connections between deep learning algorithms and conventional techniques of computational physics
- Use concepts from computational physics to develop understanding of deep learning algorithms
- Novel deep learning algorithms can be used to solve challenging problems in computational physics
Paper Content
Computational physics
- Computational physics solves problems in science and engineering
- Involves collecting measurements of an observable
- Postulating a physical law based on observations
- Writing a mathematical description of the law
- Solving the system using exact or approximate methods
Machine learning
- ML does not require physical laws
- Collect data from physical phenomena, measurements, or numerical solvers
- Train algorithm to discover patterns or relations
- Use ML algorithm to make predictions and validate with data
Examples of ml
- Regression algorithms are used to approximate a function given a set of pairwise data
- Decision trees can be used to predict the probability of a new individual owning a house
- Clustering algorithms are used to find patterns in a set of data
Types of ml algorithms based leaning task
- Supervised learning: Predicting labels for new data based on existing data
- Unsupervised learning: Finding relations among different regions of data
Artificial intelligence, machine learning and deep learning
- AI, ML and DL are related but different concepts
- AI refers to a system with human-like intelligence
- ML is a key component of an AI system
- Self-driving cars are an example of AI
- ML algorithms are trained using data
- DL is a subset of ML algorithms
- DL architecture is loosely motivated by how signals are transmitted by the central nervous system
Machine learning and computational physics
- Combining computational physics and ML can provide an alternate route to representing mathematical laws.
- Physics knowledge can help reduce the amount of data required to train ML algorithms.
- Tools from computational physics can be applied to ML to better understand and design ML algorithms.
Mlp architecture
- MLP is used to approximate a function f: x ∈ R d → y ∈ R D
- MLP consists of a source layer, hidden layers and an output layer
- Weights and biases are associated with each neuron
- Activation function is applied component-wise
- Output function may be used at the end of the output layer
- Depth of the network is the number of computing layers
- Parameters of the network are the weights and biases
Activation functions
- Activation function is important in MLP
- Many activation functions available with different advantages and disadvantages
- Figure 2.2 provides more information
Linear activation
- Linear activation function is infinitely smooth.
- Range of the function is from negative infinity to positive infinity.
- Network reduced to single affine transformation of input, making it a linear approximation of target function.
Rectified linear unit (relu)
- Function is piecewise linear
- Function is continuous
- Derivative is piecewise constant with a jump at ξ = 0
- Second derivative is a dirac function concentrated at ξ = 0
- Range of the function is [0, ∞)
Leaky relu
- ReLU activation leads to null output if affine transformation is negative.
- Leaky ReLU was designed to overcome challenge of dying neurons.
- Derivatives of Leaky ReLU behave same as ReLU.
- Range of Leaky ReLU is (-infinity, infinity).
Logistic function
- Function is smooth and monotonic.
- Range of function is (0, 1).
- Derivative quickly decays to zero away from 0, leading to slow convergence.
Tanh
- Tanh is a symmetrical extension of the logistic function.
- Tanh is smooth, monotonic and bounded between -1 and 1.
Sine
- Sine function is an efficient activation function.
- Sine function is infinitely smooth and bounded.
Expressivity of a network
- Increasing N θ has an effect on the ReLU activation function.
- A simple example is used to illustrate the effects.
- The number of kinks in the output increases as the depth and width of the network increases.
Universal approximation results
- MLPs can approximate continuous functions with a single hidden layer
- MLPs can approximate continuous vector-valued functions with multiple hidden layers
- MLPs with ReLU activations can approximate functions with two continuous derivatives
- Numerical results help demystify the “black-box” nature of neural networks
Training, validation and testing of neural networks
- MLPs are used to approximate a target function
- Parameters of MLPs are set in a supervised learning framework
- Dataset is split into 3 parts: training, validation, and testing
- Training phase finds optimal values of θ for a fixed Θ
- Validation phase finds optimal values of Θ
- Testing phase evaluates performance of trained network on unseen data
- Loss function is used to optimize network parameters
Generalizability
- Training a network with a small value of Π train and Π val does not guarantee a small value of Π test.
- Regularization is a technique used to avoid data overfitting.
Regularization
- Neural networks are often over-parametrized
- Loss function can have many local minimas
- Regularization technique can be used to nudge the choice of θ
- Regularization encourages selection of minima with smaller values of θ
- Regularization helps avoid over fitting
Gradient descent
- We wish to solve a minimization problem using gradient descent
- Taylor expansion is used to approximate the problem
- Gradient descent requires a step-size (learning-rate) to be tuned
- GD converges when the local curvature is less than 2/η
- GD prefers flat minima
- Flat minima tend to generalize better
Some advanced optimization algorithms
- GD can be used to solve the optimization problem involved in training neural networks.
- Most optimization algorithms use a formula that includes a component-wise learning rate and a vector-valued function that approximates the gradient.
- If the learning rate is too large, the updates will zig-zag their way towards the minima.
- Two popular methods can help overcome some of the issues faced by GD.
Momentum methods
- Momentum methods use the history of the gradient, not just the previous step.
- The formula for the update includes a weighted moving average of the gradient.
- This weighting is expected to reduce zig-zagging and move more smoothly towards the minima.
- A commonly used value for β 1 is 0.9.
Adam
- Adam optimization was introduced by Kingma and Ba
- Adam optimization uses the history of the gradient and the second moment of the gradient
- Adam updates are given by a formula with a learning rate η
- Recommended values for hyper-parameters are β 1 = 0.9, β 2 = 0.999 and = 10 −8
- Learning rate for each component is different, larger magnitude of the gradient for a component means smaller learning rate
- Adam algorithm has additional correction steps to improve efficiency
Stochastic optimization
- Training loss can be rewritten
- Gradient of loss function is expensive to calculate
- Stochastic gradient descent (SGD) is an easy way to circumvent this problem
- SGD converges if Π i (θ) is convex and differentiable
- SGD needs to have a decaying learning rate
- SGD algorithm keeps overshooting the minima without decay
- SGD algorithm moves closer to θ * with decaying learning rate
- Stochastic optimization algorithms are not used due to chaotic behaviour and under-utilization of resources
- Mini-batch optimization is a compromise between the two
- Mini-batch stochastic optimization algorithm evaluates batch gradient and updates θ
- SGD might help in selecting minima that generalize better
Calculating gradients using back-propagation
- Output of layer l+1 is given by an affine transform and a non-linear transform
- Loss/objective function can be evaluated using a forward pass
- To update network parameters, need derivatives of loss with respect to parameters
- Chain rule is used to evaluate derivatives
- Computational graph is used to represent evaluation of loss and its gradient
- Back propagation is used to traverse in backward direction
Regression versus classification
- Two types of losses considered: mean square error (MSE) and mean absolute error (MAE)
- Neural networks can be used to solve regression problems with nonlinear inputs/outputs
- Examples of classification problems given
- Output labels need to be one-hot encoded
- Cross-entropy loss function used to measure discrepancy between two distributions
- Cross-entropy loss function penalizes incorrect confident predictions
- Residual networks (ResNets) introduced by He et al. in 2015
Vanishing gradients in deep networks
- Training neural networks can lead to gradients becoming very small.
- This can happen when the network is very deep, e.g. L ≥ 20.
- This can lead to the benefit of deep networks being lost.
- The product of scalars in the equation can become very small, leading to vanishing gradients.
- This can lead to an increase in training and validation error.
Resnets
- If all weights and biases are null, then gradients will not vanish
- Computational graph for forward and back-propagation of a ResNet is shown in Figure 3.3
Connections with odes
- ResNet with d = D = H has relation (3.4)
- ResNet is a discretization of a non-linear system of ODEs
- Training a ResNet means determining parameters θ of the network
- Training a system of ODEs means determining the right hand side V (x, t)
Neural odes
- Neural ODEs proposed in [4] transform a regression problem to one of finding the nonlinear, time-dependent RHS of a system of ODEs
- Computational cost for Neural ODEs and ResNets is O(L)
- Memory cost for Neural ODEs is independent of the number of time-steps used to solve the ODE
- Memory cost for ResNets increases linearly with the number of layers
- Neural ODEs can use any time-integrator, while ResNets use a forward Euler type method
- Scalar advection-diffusion problem in one-dimension is described by equation (4.1)
- Solution to (4.1) for f ≡ 0 can be analytically written as (4.2)
- Thickness of boundary layer is given by δ ≈ Pe × l
Finite difference method
- Discretize domain into grid of points
- Approximate derivatives with finite difference approximations
- Solve system of algebraic equations
- Use Thomas tridiagonal algorithm to solve system
Spectral collocation method
- Select a set of global basis functions
- Use Chebyshev polynomials on the interval (0, 1)
- Evaluate derivatives for the PDE
- Use boundary conditions of the PDE
- Solve linear system for coefficients
- Use least-square variant to find coefficients
- Use gradient-based methods to solve non-linear PDEs
Physics-informed neural networks (pinns)
- Idea of using neural networks to solve PDEs introduced in 1990-2000s
- Rediscovered in 2019 and given the term PINNs (physics-informed neural networks)
- Loss function contains derivate operators arising in the PDE
- Neural network used as function representation of PDE solution
- Representation must be complete, smooth, easy to evaluate and easy to evaluate derivatives
- Find θ such that PDE is satisfied in some suitable form
- After training, solution is u*(x) = F(x; θ*)
- Improve accuracy by increasing number of collocation points, changing hyper-parameter λb, or increasing size of network
- Boundary conditions enforced as soft constraint via penalization term Πb(θ)
- Hyper-parameter λb balances interplay between two loss terms during minimization process
Extending pinns to a more general pde
- PDE is a general equation to find a solution u
- Example is the three-dimensional incompressible Navier-Stokes equation
- Input to the network is [s1, s2, s3, t]
- Output vector is u4
- Loss function is given
- Regularization term is added to the loss
- Exact solution to the PDE is not guaranteed
- Error e = u* - u is related to the loss value
Error analysis for pinns
- Error e = u * − u needs exact solution u which is not available
- Linear PDEs (L and B are linear operators)
- Error can be controlled if residuals of u * are small
- Error can be controlled by reducing loss functions and increasing number of interior and boundary collocation points
Data assimilation using pinns
- Data assimilation is a problem encountered in science and engineering where sparse measurements of a quantity are used to evaluate it everywhere on a fine grid.
- Data assimilation is solved using PINNs (Physics-Informed Neural Networks) which represent the quantity using a neural network and define a loss function to train the network.
Functions and images
- Function u(x) defines a grayscale image with pixel intensity values.
- Color images are three-dimensional tensors with red, blue and green channels.
- Fully-connected neural networks require large input dimensions and connected layers.
- Unravelling an image loses spatial context.
- Local operations should be the same in any region of the image.
Convolutions of functions
- Convolution operator maps functions to functions
- Kernel function g(x) decays as |x| → ∞
- Convolution operator samples u by varying x
- Kernel shifts to different locations to sample u in different windows
Example 1
- Gaussian kernel is a popular choice for smoothing/blurring filter
- It is isotropic and the integral over the whole domain is unity
- Parameterized by σ, which filters out scales finer than σ
Example 2
- Kernel produces derivative of a smooth version of u
- Kernel is shown in 1D and 2D in Figure 5.3
- Action of kernel looks like smoothed finite difference operation
- Region to left of center of kernel is weighted by negative value
- Region to right of center of kernel is weighted by positive value
Discrete convolutions
- Discrete convolution in 2D can be evaluated using quadrature.
- Kernel width is defined by a measure of N, and convolution is zero for pixels outside this measure.
Connection to finite differences
- Convolution is related to the stencil of a finite difference scheme
- A function u(x1, x2) is represented on a finite grid
- Taylor series expansion is used to approximate derivatives along the 1-direction
- Convolution with a specific kernel approximates the computation of the second derivative along the 1-direction
Convolution layers
- Convolution layers consist of multiple kernels.
- Weights of kernels are learnable parameters.
- Network learns operations that are appropriate for its task.
Average and max pooling
- Pooling operations reduce the size of an image and allow you to step through different scales.
- Pooling operations do not have any trainable parameters.
Convolution for inputs with multiple channels
- Convolution layer takes an input of size N 1 × N 2 × C and produces an output of size M 1 × M 2 × P.
- Each convolution uses a kernel of width k = 2 N + 1, which has k × k × C weights for each of the C input channels.
- Output of each convolution is stacked together to give the final output of the convolution layer.
- Total number of trainable parameters is (2 N + 1) × (2 N + 1) × P × C.
Convolution neural network (cnn)
- CNN used to solve image classification problem
- Output of final pooling layer flattened to form vector
- Vector passed through fully connected layers with activation function
- Final fully connected layer reduces vector to C
- Cross-entropy function used as loss function
- Transpose convolution layers also called fractionally-strided layers
- Upscaling done with reduction in number of channels
Image-to-image transformations
- Image-to-image transformations are similar to function-to-function transformations.
- U-Nets are a type of network used for these transformations.
- U-Nets have a downward branch which downscales the image and an upward branch which upscales the image.
- U-Nets also use skip connections to combine information from the downward and upward branches.
The problem with pinns
- MLP takes input x and gives output y with trainable weights θ
- PINN is a network that takes input x of underlying PDE and gives solution u(x) as output
- PINN is trained by minimizing weighted sum of PDE and boundary residual
- Changing f or g in PDE requires retraining of network
Parametrized pdes
- Source term f is given as a parametric function
- Network takes as input both x and α
- Train network by minimizing loss function
- Solution to PDE is u(x, α) = F(x, α; θ*)
- Find solution for arbitrary, non-parametric f by approximating operators that map functions to functions
Operators
- Operator N maps conductivity κ and boundary condition g to solution u of PDE
- Operator N maps initial condition u0 to solution u at final time T
- Examples 1 and 4 are linear operators, examples 2, 3 and 5 are non-linear
- Interested in networks that approximate operator N
- Two popular versions of networks: DeepONet and Fourier Neural Operator
Deep operator network (deeponet) architecture
- Operator networks proposed by Chen and Chen
- DeepONets comprise two neural networks
- DeepONet approximates an operator N
- Branch net B takes input from a set of functions and outputs a vector
- Trunk net T takes input from a set of functions and outputs a vector
- Dot product of branch and trunk net outputs is the final output of DeepONet
- DeepONet can approximate any N (a)(x)
- DeepONet uses pre-defined sensor points
- Branch net is fully connected or convolutional
- DeepONet expressivity can be improved by increasing parameters or latent vector dimension
Training deeponets
Error analysis for deeponets
- Chen and Chen [5] have a theorem that states a DeepONet with a single hidden layer can approximate any nonlinear, continuous operator mapping with a certain accuracy.
- The theorem has been extended to a deeper version of the network and generalized by removing the compactness assumptions on the spaces.
- An estimate of the error in a DeepONet has been developed in [20] which states the error is bounded by the numerical solver used, the final training error, and the distance of any input from the training set.