Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Learning algorithms used in neuroscience and neuromorphic chips use Contrastive Learning (CL)
CL is traditionally implemented with rigid, temporally non-local, and periodic learning dynamics
Recent work explores how CL might be implemented by biological or neuromorphic systems
CL can be made temporally local and still function even if many dynamical requirements are relaxed
Theorems and numerical experiments provide theoretical foundations for CL methods for biological and neuromorphic neural networks

Paper Content

Introduction

CL is a family of algorithms for learning representations
CL can be divided into equilibrium-based and nonequilibrium-based methods
CL has been proposed as a normative model for biological learning
Investigating alternative training dynamics for CL
CL learns representations by leveraging statistical differences between positive and negative samples
CL algorithms have a periodic, biphasic time course of learning
Investigating temporal non-locality, periodicity, learning rate modulation, and deterministic phase length
Proposing an importance sampling-inspired approach to estimating the gradient
Not always optimal to spend equal amount of time learning from positive and negative samples
Proving that equilibrium-based CL can still occur with no learning rate modulation and with noise in phase length
Relaxing requirements on learning dynamics of CL

Background

AI algorithms are being used as normative models of learning in biological neural networks
AI algorithms often exhibit properties that are computationally incompatible with neural hardware
Bio-plausible learning is relevant to both neuroscience and AI
Backpropagation is an example of an AI algorithm that is not bio-plausible
Neuromorphic computing is a potential solution to AI energy efficiency problems
CL algorithms do not require backpropagation for neural network-based learning
CL algorithms exhibit properties that have not been observed in the brain
CL algorithms require a specific pattern of learning rate scheduling and periodic stimulus/sample presentation
The Boltzmann machine is an example of a classic equilibrium-based CL method
Equilibrium-based CL must run internal dynamics on the space of the given model’s hidden unit activations
Non-equilibrium-based CL has been used for unsupervised generative modelling
Temporally local learning has been explored in equilibrium-based and non-equilibrium-based CL methods

Theoretical results

CL can be generalized to temporally local learning dynamics
Learning rate and phase length can be studied in equilibrium-based CL systems
SGD is used to optimize a two-term objective function
Separate passes through the neural network are needed to estimate each term of the gradient
Classic CL algorithms have this form of gradient
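To make the two-term form concrete, here is a minimal sketch using one-step contrastive divergence (CD-1) for a bias-free binary RBM. This is a standard textbook instance of a positive and a negative gradient term, not necessarily the paper's exact setup; the function name `cd1_terms`, the layer sizes, and the learning rate are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_terms(W, v_data, rng):
    """Return (g_plus, g_minus) for the weights of a bias-free binary RBM.

    g_plus is the positive-phase statistic (visible units clamped to data);
    g_minus is the negative-phase statistic after one Gibbs step (CD-1).
    """
    # Positive phase: hidden probabilities given the data.
    h_prob = sigmoid(v_data @ W)
    g_plus = v_data.T @ h_prob / len(v_data)

    # Negative phase: sample hidden units, reconstruct visibles, recompute hiddens.
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    v_model = (rng.random(v_data.shape) < sigmoid(h_sample @ W.T)).astype(float)
    h_model_prob = sigmoid(v_model @ W)
    g_minus = v_model.T @ h_model_prob / len(v_data)
    return g_plus, g_minus

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((6, 3))        # 6 visible units, 3 hidden units
v = (rng.random((8, 6)) < 0.5).astype(float)  # toy batch of binary samples

g_plus, g_minus = cd1_terms(W, v, rng)
# The classic update needs both terms in hand before the parameters change,
# which is the temporal non-locality discussed in this section.
W_new = W + 0.1 * (g_plus - g_minus)
```

Note that the update line cannot run until both passes have finished, so the result of the first pass has to be stored while the second is computed.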

Gradient sampling to avoid non-locality

We can resolve the temporal non-locality of Equation 2 by considering the situation where the terms g+ and g− are each used separately to do parameter updates.
We propose a probabilistic phase selection process, where the next phase type (positive or negative) to be performed during learning is chosen by a Bernoulli random variable B.
This gradient estimator remains unbiased for any b ∈ (0, 1).
The variance of the gradient estimator given by Equation 3 is a convex function of b.
Networks can still learn effectively with this higher-variance estimator.
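The phase-selection idea can be sketched as follows. Equation 3 is not reproduced in this summary, so the code assumes the usual importance-sampling form: with probability b run a positive phase and use g+/b, otherwise run a negative phase and use −g−/(1 − b). The function names and the toy vectors are hypothetical.

```python
import numpy as np

def isd_estimate(g_plus, g_minus, b, rng):
    """One-phase estimate of g = g_plus - g_minus (assumed form of Equation 3)."""
    if rng.random() < b:          # B = 1: positive phase, reweighted by 1/b
        return g_plus / b
    return -g_minus / (1.0 - b)   # B = 0: negative phase, reweighted by 1/(1-b)

def isd_expectation(g_plus, g_minus, b):
    """Exact expectation over B for fixed g_plus and g_minus."""
    return b * (g_plus / b) + (1.0 - b) * (-g_minus / (1.0 - b))

rng = np.random.default_rng(0)
g_plus = np.array([1.0, -2.0, 0.5])    # toy positive-phase term
g_minus = np.array([0.25, 1.0, -1.5])  # toy negative-phase term
g = g_plus - g_minus                   # two-term gradient

# The importance weights cancel the selection probabilities for every b in (0, 1).
for b in (0.2, 0.5, 0.8):
    assert np.allclose(isd_expectation(g_plus, g_minus, b), g)

# Monte Carlo check: the average of many single-phase estimates approaches g.
samples = np.stack([isd_estimate(g_plus, g_minus, 0.3, rng) for _ in range(20000)])
mc_mean = samples.mean(axis=0)
```

Each call uses only one of the two terms, so a parameter update can follow every phase without storing the result of the other phase.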

Bias of always-on-learning and variable phase length in equilibrium-based CL

Investigated less restrictive learning dynamics for equilibrium-based CL
Dropped subscripts + or - on network states
Learning occurs via gradient estimate ĝ
Equation 3 applied to standard equilibrium-based CL paradigm
Learning rate fixed to η(t) = η/τ for all t
Learning rate fixed to same value and phase length chosen randomly
Learning dynamics of AoL Random T results in Equation 4 being updated
Explicit identification of end of phase no longer necessary
Theorems 3.4 and 3.5 provide bounds on the bias introduced by training
Theorems apply to any method satisfying assumptions that uses equilibrium-based gradient updates
Illustrate how relaxed CL conditions maintain successful learning at limited cost

Experimental findings

Proposed algorithms were tested in practice
Two different CL methods were used: maximum-likelihood (ML) learning in energy-based models and the Forward-Forward (FF) algorithm
An RBM and a feed-forward neural network with two hidden layers were trained
Testing was done on the binarized MNIST (bMNIST), Bars and Stripes (BAS), and MNIST datasets
SGD and ADAM were used for training

Estimators learn comparably to classic methods

The Importance Sampling-Discrete (ISD) estimator was compared to the traditional two-term CL estimator for training a feed-forward neural network and an RBM
Loss was measured over phase counts and number of time steps in network dynamics
ISD updates parameters more often than standard CL
ISD exhibited higher variance learning curves
ISD was still able to learn quite robustly and to a degree of accuracy only slightly worse than standard two-term gradient estimates
ISD estimators tended to require a slightly lower learning rate and converged slightly slower than the two-term estimators

Robustness to changes in positive phase proportion

Investigated effect of parameter b on RBM model
Found learning is robust to changes in b
Optimal value not always for equal time spent in positive/negative phases
Better results for higher b on bMNIST dataset, lower b on BAS dataset
Learning is robust to the choice of b: values between 0.2 and 0.8 do not perform drastically worse than the best-performing b

Variance of the gradient estimate

Investigated how b affects the variance of the ISD gradient estimator
Found that higher/lower values of b lead to less variance when g+/g- has lower/higher variance
Hypothesized that effect of b on variance of estimator may underlie b’s effect on learning
Found that trace of covariance of ĝ(θ) is a convex function of b, so value of b leading to minimal estimator variance is unique
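The convexity claim can be checked with a short derivation. The sketch below assumes the importance-sampling form of the estimator (Equation 3 is not reproduced in this summary) and assumes that B is independent of g+ and g−; the symbols A and C are introduced here for illustration.

```latex
% Assumed form of the ISD estimator, with B ~ Bernoulli(b) independent of g_+ and g_-:
\hat{g} = \frac{B}{b}\, g_+ - \frac{1-B}{1-b}\, g_-,
\qquad g = \mathbb{E}[g_+] - \mathbb{E}[g_-] = \mathbb{E}[\hat{g}].

% Since B(1-B) = 0 the cross term vanishes. With A = E||g_+||^2 and C = E||g_-||^2:
\mathbb{E}\,\|\hat{g}\|^2 = \frac{A}{b} + \frac{C}{1-b},
\qquad
\operatorname{tr}\operatorname{Cov}(\hat{g}) = \frac{A}{b} + \frac{C}{1-b} - \|g\|^2 .

% The second derivative in b is positive on (0, 1), so the trace is strictly convex:
\frac{d^2}{db^2}\operatorname{tr}\operatorname{Cov}(\hat{g})
  = \frac{2A}{b^3} + \frac{2C}{(1-b)^3} > 0 .

% Setting the first derivative to zero gives the unique minimizer:
b^\ast = \frac{\sqrt{A}}{\sqrt{A} + \sqrt{C}} .
```

Under these assumptions the minimizer equals 1/2 only when A = C, which is consistent with the observation that equal time in the two phases is not always optimal.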

Discussion

Implications for theoretical neuroscience and neuromorphic computing
Lack of periodicity or sharp learning rate modulation in a neural circuit is not a basis for discounting CL as a potential algorithm
Performance is robust to changes in positive phase probability
Optimal value is not always at b = 0.5
Performance on BAS dataset for RBM trained using ISD and ISD with constant learning rate
Negative phase variance changes while positive phase is fixed
Trade-off between variance and locality
Manipulating amount of time spent in positive versus negative phase for CL
Trade compute for data in sparse data regimes

Conclusion

Proposed an alternative gradient estimate for CL
Proved a theorem applying to algorithms that require equilibrium dynamics
Theorem shows that equilibrium-based algorithms with stochastic phase lengths will still exhibit asymptotically unbiased gradients
Importance Sampling-Discrete (ISD) Gradient Estimate is an unbiased estimate of the gradient
ISD gradient estimate has strictly greater variance than the standard CL gradient estimate
Variance of ISD gradient estimate is a convex function of b
Investigated a single phase of equilibrium-based learning
Assumed ĝ(θ) is differentiable and bounded, with a bounded derivative
Proved bias of learning updates is of the order of η²/τm
Applied law of total expectation and assumed T ≥ s
Analyzed inner expectation, E[G(z_0, θ) | T]
Error term is quadratic in η and O(s/τ)
