Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Learning algorithms used in neuroscience and neuromorphic chips use Contrastive Learning (CL)
- CL traditionally implemented with rigid, temporally non-local, and periodic learning dynamics
- Recent work explores how CL might be implemented by biological or neuromorphic systems
- CL can be made temporally local and still function even if many dynamical requirements are relaxed
- Theorems and numerical experiments provide theoretical foundations for CL methods for biological and neuromorphic neural networks
Paper Content
Introduction
- CL is a family of algorithms for learning representations
- CL can be divided into equilibrium-based and nonequilibrium-based methods
- CL has been proposed as a normative model for biological learning
- Investigating alternative training dynamics for CL
- CL learns representations by leveraging statistical differences between positive and negative samples
- CL algorithms have a periodic, biphasic time course of learning
- Investigating temporal non-locality, periodicity, learning rate modulation, and deterministic phase length
- Proposing an importance sampling-inspired approach to estimating the gradient
- Not always optimal to spend equal amount of time learning from positive and negative samples
- Proving that equilibrium-based CL can still occur with no learning rate modulation and with noise in phase length
- Relaxing requirements on learning dynamics of CL
Background
- AI algorithms are being used as normative models of learning in biological neural networks
- AI algorithms often exhibit properties that are computationally incompatible with neural hardware
- Bio-plausible learning is relevant to both neuroscience and AI
- Backpropagation is an example of an AI algorithm that is not bio-plausible
- Neuromorphic computing is a potential solution to AI energy efficiency problems
- CL algorithms do not require backpropagation for neural network-based learning
- CL algorithms exhibit properties that have not been observed in the brain
- CL algorithms require a specific pattern of learning rate scheduling and periodic stimulus/sample presentation
- The Boltzmann machine is an example of a classic equilibrium-based CL method
- Equilibrium-based CL must run internal dynamics on the space of the given model’s hidden unit activations
- Non-equilibrium-based CL has been used for unsupervised generative modelling
- Temporally local learning has been explored in equilibrium-based and non-equilibrium-based CL methods
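To make the positive and negative phases concrete, here is a minimal sketch of one contrastive divergence (CD-1) step for a restricted Boltzmann machine. It is a textbook illustration, not the paper's code: CD-1 truncates the negative-phase sampling to a single Gibbs step rather than running the dynamics to equilibrium, and the function name `cd1_statistics`, the toy data, and all sizes are assumptions made here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_statistics(v_data, W, a, c, rng):
    """Positive- and negative-phase statistics for the RBM weights (CD-1)."""
    n = v_data.shape[0]
    # Positive phase: visible units clamped to the data.
    h_prob_data = sigmoid(v_data @ W + c)
    pos_stats = v_data.T @ h_prob_data / n
    # Negative phase: one step of block Gibbs sampling from the model.
    h_sample = (rng.random(h_prob_data.shape) < h_prob_data).astype(float)
    v_prob = sigmoid(h_sample @ W.T + a)
    v_sample = (rng.random(v_prob.shape) < v_prob).astype(float)
    h_prob_model = sigmoid(v_sample @ W + c)
    neg_stats = v_sample.T @ h_prob_model / n
    return pos_stats, neg_stats

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4                     # hypothetical toy sizes
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
a = np.zeros(n_visible)                        # visible biases
c = np.zeros(n_hidden)                         # hidden biases
v_data = (rng.random((32, n_visible)) < 0.5).astype(float)  # toy binary batch

pos_stats, neg_stats = cd1_statistics(v_data, W, a, c, rng)
learning_rate = 0.1
# Classic two-term update: both phases must finish before the weights change.
W_new = W + learning_rate * (pos_stats - neg_stats)
```

The update needs both `pos_stats` and `neg_stats` at the same moment, which is the temporal non-locality the paper sets out to relax.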
Theoretical results
- CL can be generalized to temporally local learning dynamics
- Learning rate and phase length can be studied in equilibrium-based CL systems
- SGD is used to optimize a two-term objective function
- Separate passes through the neural network are needed to estimate each term of the gradient
- Classic CL algorithms have this form of gradient
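One standard instance of such a two-term gradient is maximum-likelihood learning in an energy-based model with density p_θ(x) ∝ exp(−E_θ(x)). The notation below (E_θ, g⁺, g⁻) is assumed for illustration; this summary does not reproduce the paper's equations, which may use different symbols or sign conventions.

```latex
\nabla_\theta \, \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[-\log p_\theta(x)\right]
  = \underbrace{\mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\nabla_\theta E_\theta(x)\right]}_{g^{+}}
  - \underbrace{\mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta E_\theta(x)\right]}_{g^{-}}
```

The first expectation is estimated from data samples (positive phase) and the second from model samples (negative phase), which is why each term needs its own pass through the network.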
Gradient sampling to avoid non-locality
- We can resolve the temporal non-locality of Equation 2 by using each term, g+ and g−, separately for parameter updates.
- We propose a probabilistic phase selection process, where the next phase type (positive or negative) to be performed during learning is chosen by a Bernoulli random variable B.
- This gradient estimator remains unbiased for any b ∈ (0, 1).
- The variance of the gradient estimator given by Equation 3 is a convex function of b.
- Networks can still learn effectively with this higher-variance estimator.
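The phase-selection idea can be sketched as follows. The summary does not show Equation 3, so the estimator form used here (return g+/b with probability b, otherwise −g−/(1−b)) is an assumption based on standard importance weighting, and `isd_gradient` is a hypothetical name. The snippet checks numerically that this form averages to the two-term gradient g+ − g−.

```python
import numpy as np

def isd_gradient(g_pos, g_neg, b, rng):
    """One-phase estimate of g_pos - g_neg (assumed importance-weighted form)."""
    if not 0.0 < b < 1.0:
        raise ValueError("b must lie strictly between 0 and 1")
    if rng.random() < b:           # positive phase chosen with probability b
        return g_pos / b
    return -g_neg / (1.0 - b)      # negative phase chosen with probability 1 - b

rng = np.random.default_rng(0)
g_pos = np.array([1.0, -2.0, 0.5])     # toy positive-phase gradient
g_neg = np.array([0.25, 1.0, -1.5])    # toy negative-phase gradient
b = 0.3

samples = np.stack([isd_gradient(g_pos, g_neg, b, rng) for _ in range(200_000)])
mean_estimate = samples.mean(axis=0)
print("empirical mean:", mean_estimate)
print("two-term value:", g_pos - g_neg)
```

Each call uses only one phase, so a parameter update never has to wait for the other phase to complete; the price is the extra variance from the 1/b and 1/(1−b) weights.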
Bias of always-on-learning and variable phase length in equilibrium-based CL
- Investigated less restrictive learning dynamics for equilibrium-based CL
- Dropped subscripts + or - on network states
- Learning occurs via gradient estimate ĝ
- Equation 3 applied to standard equilibrium-based CL paradigm
- Learning rate fixed to η(t) = η τ ∀t
- Learning rate fixed to same value and phase length chosen randomly
- The learning dynamics of AoL Random T result in Equation 4 being updated
- Explicit identification of end of phase no longer necessary
- Theorems 3.4 and 3.5 provide bounds on the bias introduced by training
- Theorems apply to any method satisfying assumptions that uses equilibrium-based gradient updates
- Illustrate how relaxed CL conditions maintain successful learning at limited cost
Experimental findings
- Proposed algorithms were tested in practice
- Two different CL methods were used: maximum-likelihood (ML) learning in energy-based models and the Forward-Forward (FF) algorithm
- RBM and feed-forward neural network with two hidden layers were trained
- Testing was done on bMNIST, BAS and MNIST datasets
- SGD and Adam were used for training
Estimators learn comparably to classic methods
- The Importance Sampling-Discrete (ISD) estimator was compared to the traditional two-term CL estimator for training a feed-forward neural network and an RBM
- Loss was measured over phase counts and number of time steps in network dynamics
- ISD updates parameters more often than standard CL
- ISD exhibited higher variance learning curves
- ISD was still able to learn quite robustly and to a degree of accuracy only slightly worse than standard two-term gradient estimates
- ISD estimators tended to require a slightly lower learning rate and converged slightly slower than the two-term estimators
Robustness to changes in positive phase proportion
- Investigated effect of parameter b on RBM model
- Found learning is robust to changes in b
- Optimal value not always for equal time spent in positive/negative phases
- Better results for higher b on bMNIST dataset, lower b on BAS dataset
- Learning is robust to the choice of b: values between 0.2 and 0.8 do not perform drastically worse than the best-performing b
Variance of the gradient estimate
- Investigated how b affects the variance of the ISD gradient estimator
- Found that higher/lower values of b lead to less variance when g+/g- has lower/higher variance
- Hypothesized that effect of b on variance of estimator may underlie b’s effect on learning
- Found that trace of covariance of ĝ(θ) is a convex function of b, so value of b leading to minimal estimator variance is unique
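The convexity claim can be checked numerically under an assumed estimator form: ĝ = g+/b with probability b and −g−/(1−b) otherwise, with the phase choice independent of g+ and g−. Under that assumption the trace of the covariance is E‖g+‖²/b + E‖g−‖²/(1−b) − ‖E[g+] − E[g−]‖². The second-moment values below are made-up toy numbers, and `isd_trace_cov` is a hypothetical helper, not something taken from the paper.

```python
import numpy as np

def isd_trace_cov(m_pos, m_neg, mean_sq_norm, b):
    """Trace of the covariance of the assumed one-phase estimator.

    m_pos        = E[||g+||^2]
    m_neg        = E[||g-||^2]
    mean_sq_norm = ||E[g+] - E[g-]||^2
    """
    b = np.asarray(b, dtype=float)
    return m_pos / b + m_neg / (1.0 - b) - mean_sq_norm

bs = np.linspace(0.05, 0.95, 181)                 # uniform grid of b values
curve = isd_trace_cov(m_pos=4.0, m_neg=1.0, mean_sq_norm=0.5, b=bs)

second_diff = np.diff(curve, n=2)                 # positive everywhere if convex
b_best = bs[np.argmin(curve)]                     # grid minimizer
print("convex on grid:", bool(np.all(second_diff > 0)))
print("variance-minimizing b on grid:", b_best)
```

With these toy numbers the minimizer sits away from b = 0.5, which is one way the best-performing b can differ from equal time in each phase.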
Discussion
- Implications for theoretical neuroscience and neuromorphic computing
- Lack of periodicity or sharp learning rate modulation in a neural circuit is not a basis for discounting CL as a potential algorithm
- Performance is robust to changes in positive phase probability
- Optimal value is not always at b = 0.5
- Performance on BAS dataset for RBM trained using ISD and ISD with constant learning rate
- Negative phase variance changes while positive phase is fixed
- Trade-off between variance and locality
- Manipulating amount of time spent in positive versus negative phase for CL
- Trade compute for data in sparse data regimes
Conclusion
- Proposed an alternative gradient estimate for CL
- Proved a theorem applying to algorithms that require equilibrium dynamics
- Theorem shows that equilibrium-based algorithms with stochastic phase lengths will still exhibit asymptotically unbiased gradients
- Importance Sampling-Discrete (ISD) Gradient Estimate is an unbiased estimate of the gradient
- ISD gradient estimate has strictly greater variance than the standard CL gradient estimate
- Variance of ISD gradient estimate is a convex function of b
- Investigated a single phase of equilibrium-based learning
- Assumed ĝ(θ) is differentiable and bounded, with a bounded derivative
- Proved bias of learning updates is of the order of η²/τm
- Applied law of total expectation and assumed T ≥ s
- Analyzed inner expectation, E[G(z_0, θ) | T]
- Error term is quadratic in η and O(s/τ)