Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Learning algorithms used in neuroscience and neuromorphic chips use Contrastive Learning (CL)
  • CL traditionally implemented with rigid, temporally non-local, and periodic learning dynamics
  • Recent work explores how CL might be implemented by biological or neurmorphic systems
  • CL can be made temporally local and still function even if many dynamical requirements are relaxed
  • Theorems and numerical experiments provide theoretical foundations for CL methods for biological and neuromorphic neural networks

Paper Content

Introduction

  • CL is a family of algorithms for learning representations
  • CL can be divided into equilibrium-based and nonequilibrium-based methods
  • CL has been proposed as a normative model for biological learning
  • Investigating alternative training dynamics for CL
  • CL learns representations by leveraging statistical differences between positive and negative samples
  • CL algorithms have a periodic, biphasic time course of learning
  • Investigating temporal non-locality, periodicity, learning rate modulation, and deterministic phase length
  • Proposing an importance sampling-inspired approach to estimating the gradient
  • Not always optimal to spend equal amount of time learning from positive and negative samples
  • Proving that equilibrium-based CL can still occur with no learning rate modulation and with noise in phase length
  • Relaxing requirements on learning dynamics of CL

Background

  • AI algorithms are being used as normative models of learning in biological neural networks
  • AI algorithms often exhibit properties that are computationally incompatible with neural hardware
  • Bio-plausible learning is relevant to both neuroscience and AI
  • Backpropagation is an example of an AI algorithm that is not bio-plausible
  • Neuromorphic computing is a potential solution to AI energy efficiency problems
  • CL algorithms do not require backpropagation for neural network-based learning
  • CL algorithms exhibit properties that have not been observed in the brain
  • CL algorithms require a specific pattern of learning rate scheduling and periodic stimulus/sample presentation
  • The Boltzmann machine is an example of a classic equilibrium-based CL method
  • Equilibrium-based CL must run internal dynamics on the space of the given model’s hidden unit activations
  • Non-equilibrium-based CL has been used for unsupervised generative modelling
  • Temporally local learning has been explored in equilibrium-based and non-equilibrium-based CL methods

Theoretical results

  • CL can be generalized to temporally local learning dynamics
  • Learning rate and phase length can be studied in equilibrium-based CL systems
  • SGD is used to optimize a two-term objective function
  • Separate passes through the neural network are needed to estimate each term of the gradient
  • Classic CL algorithms have this form of gradient

Gradient sampling to avoid non-locality

  • We can resolve the temporal non-locality of Equation 2 by considering the situation where each term g + and g − are used separately to do parameter updates.
  • We propose a probabilistic phase selection process, where the next phase type (positive or negative) to be performed during learning is chosen by a Bernouilli random variable B.
  • This gradient estimator remains unbiased for any b ∈ (0, 1).
  • The variance of the gradient estimator given by Equation 3 is a convex function of b.
  • Networks can still learn effectively with this higher-variance estimator.

Bias of always-on-learning and variable phase length in equilibrium-based cl

  • Investigated less restrictive learning dynamics for equilibrium-based CL
  • Dropped subscripts + or - on network states
  • Learning occurs via gradient estimate ĝ
  • Equation 3 applied to standard equilibrium-based CL paradigm
  • Learning rate fixed to η(t) = η τ ∀t
  • Learning rate fixed to same value and phase length chosen randomly
  • Learning dynamics of AoL Random T results in Equation 4 being updated
  • Explicit identification of end of phase no longer necessary
  • Theorem 3.4 and 3.5 provide bounds on bias introduced by training
  • Theorems apply to any method satisfying assumptions that uses equilibrium-based gradient updates
  • Illustrate how relaxed CL conditions maintain successful learning at limited cost

Experimental findings

  • Proposed algorithms were tested in practice
  • Two different CL methods were used: ML learning in energy based models and FF algorithm
  • RBM and feed-forward neural network with two hidden layers were trained
  • Testing was done on bMNIST, BAS and MNIST datasets
  • SGD and ADAM were used for training

Estimators learn comparably to classic methods

  • ISD was compared to traditional two-term CL estimator for training a feed-forward neural network and a RBM
  • Loss was measured over phase counts and number of time steps in network dynamics
  • ISD updates parameters more often than standard CL
  • ISD exhibited higher variance learning curves
  • ISD was still able to learn quite robustly and to a degree of accuracy only slightly worse than standard two-term gradient estimates
  • ISD estimators tended to require a slightly lower learning rate and converged slightly slower than the two-term estimators

Robustness to changes in positive phase proportion

  • Investigated effect of parameter b on RBM model
  • Found learning is robust to changes in b
  • Optimal value not always for equal time spent in positive/negative phases
  • Better results for higher b on bMNIST dataset, lower b on BAS dataset
  • Robustness to parameter changes, values 0.8-0.2 not performing drastically worse than best-performing b

Variance of the gradient estimate

  • Investigated how b affects the variance of the ISD gradient estimator
  • Found that higher/lower values of b lead to less variance when g+/g- has lower/higher variance
  • Hypothesized that effect of b on variance of estimator may underlie b’s effect on learning
  • Found that trace of covariance of ĝ(θ) is a convex function of b, so value of b leading to minimal estimator variance is unique

Discussion

  • Implications for theoretical neuroscience and neuromorphic computing
  • Lack of periodicity or sharp learning rate modulation in a neural circuit is not a basis for discounting CL as a potential algorithm
  • Performance is robust to changes in positive phase probability
  • Optimal value is not always at b = 0.5
  • Performance on BARS dataset for RBM trained using ISD and ISD with constant learning rate
  • Negative phase variance changes while positive phase is fixed
  • Trade-off between variance and locality
  • Manipulating amount of time spent in positive versus negative phase for CL
  • Trade compute for data in sparse data regimes

Conclusion

  • Proposed an alternative gradient estimate for CL
  • Proved a theorem applying to algorithms that require equilibrium dynamics
  • Theorem shows that equilibriumbased algorithms with stochastic phase lengths will still exhibit asymptotically unbiased gradients
  • Importance Sampling -Discrete (ISD) Gradient Estimate is an unbiased estimate of the gradient
  • ISD gradient estimate has strictly greater variance than the standard CL gradient estimate
  • Variance of ISD gradient estimate is a convex function of b
  • Investigated a single phase of equilibrium-based learning
  • Assumed ĝ(θ) is differentiable and bounded, with a bounded derivative
  • Proved bias of learning updates is of the order of η2/τm
  • Applied law of total expectation and assumed T ≥ s
  • Analyzed inner expectation, E[G(z 0 , θ)|T ]
  • Error term is quadratic in η and O(s/τ )