Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Aim of paper is to introduce new learning procedure for neural networks
- Procedure replaces forward and backward passes of backpropagation with two forward passes
- Each layer has its own objective function to have high goodness for positive data and low goodness for negative data
- Negative passes could be done offline, making learning simpler and allowing video to be pipelined
Paper Content
The forward-forward algorithm
- Forward-Forward algorithm is a greedy multi-layer learning procedure
- Replaces forward and backward passes of backpropagation with two forward passes
- Positive pass operates on real data and adjusts weights to increase goodness in hidden layers
- Negative pass operates on “negative data” and adjusts weights to decrease goodness in hidden layers
- Goodness function is sum of squared neural activities
- Aim is to correctly classify input vectors as positive or negative data
- Layer normalization prevents first hidden layer from being used as input to second hidden layer
- FF algorithm works in small neural networks with few million connections
The backpropagation baseline
- Experiments use MNIST dataset of handwritten digits
- 50,000 images used for training, 10,000 for validation
- 10,000 images used to compute test error rate
- Simple neural networks trained with backpropagation get 0.6% test error
- Permutation-invariant version of task does not give info about spatial layout of pixels
- Neural networks with ReLUs get 1.4% test error
- Regularizers can reduce test error to 1.1%
- Learning procedure works about as well as backpropagation on MNIST
A simple unsupervised example of ff
- Two main questions about FF need to be answered
- Hand-crafted source of negative data used as a temporary crutch
- Linear classifier used to transform input vectors into representation vectors
- Negative data created by adding together one digit image and a different digit image
- Masks created by blurring and thresholding a random bit image
- Network with four hidden layers of 2000 ReLUs gives 1.37% test error
- Local receptive fields (without weight-sharing) improves performance to 1.16% test error
A simple supervised example of ff
- Learning hidden representations without labels is useful for large models.
- Smaller models should use supervised learning with labels included in the input.
- FF can be used to learn MNIST images with 1.36% test errors.
- Augmenting the training data with jittered images can reduce test error to 0.64%.
Using ff to model top-down effects in perception
- Feed-forward neural networks are learned one layer at a time
- Backpropagation is seen as a major improvement
- Treating a static image as a video is a way to overcome the limitation of FF
- A multi-layer recurrent neural network is used
- An experiment was done with a static MNIST image
- The network was trained on MNIST for 60 epochs
- Negative data is generated by doing a single forward pass through the net
Using predictions from the spatial context as a teacher
- Objective is to have good agreement between input from layer above and below for positive data, bad agreement for negative data
- Top-down input is determined by larger region of image, bottom-up input is based on more local region
- Top-down input should learn to predict representations of bottom-up input
- Top-down input should learn to cancel out bottom-up input on positive data
- Layer normalization means plenty of information gets sent to next layer
- Learning by using contextual prediction as teaching signal for local feature extraction has been around for a long time
- CIFAR-10 has 50,000 training images, 32x32 with 3 color channels
- FF compared with backpropagation net, FF is comparable in performance
- FF test performance is slightly worse than backpropagation, gap does not increase with more hidden layers
- Backpropagation reduces training error more quickly
Sleep
- FF would be easier to implement in a brain if positive data is processed when awake and negative data is created and processed during sleep.
- FF can predict the next character in a sequence from the previous ten character window.
- Alternating between weight updates on positive and negative data only works if the learning rate is low and the momentum is high.
- It is possible to separate the positive and negative phases of learning, but this remains to be shown.
- It is interesting to see if eliminating the negative phase updates for a while mimics the effects of sleep deprivation.
Relationship to boltzmann machines
- In the early 1980s two promising learning procedures for deep neural networks were backpropagation and Boltzmann Machines
- Boltzmann Machines are networks of stochastic binary neurons with pairwise connections
- When running freely, a Boltzmann Machine updates each binary neuron by setting it to the on state with a probability equal to the logistic of the total input it receives
- The aim of Boltzmann machine learning is to make the distribution of binary vectors on the visible neurons match the data distribution
- The Kullback-Liebler divergence between the data distribution and the model distribution has a simple derivative w.r.t. any weight
- Boltzmann Machine learning is impractical and implausible as a model of cortical learning
- FF combines the contrastive learning from Boltzmann machines with a simple, local goodness function
Relationship to generative adversarial networks
- GANs use a multi-layer neural network to generate data
- GANs train a generative model by using a discriminative network
- GANs can suffer from mode collapse
- FF can be viewed as a special case of a GAN
- FF does not require backpropagation to learn the discriminative or generative model
- FF eliminates problems that arise when one model learns too fast relative to the other model
- FF eliminates mode collapse
Relationship to contrastive methods that compare representations of two different image crops
- Self-supervised contrastive methods use an objective function to favor agreement between representations of two different crops from the same image and disagreement between crops from two different images.
- These methods use many layers and backpropagation to train the layers.
- FF uses a different way to measure agreement which is easier for a real neural network.
- FF divides each layer into many small blocks and forces each block to decide between positive and negative cases.
A problem with stacked contrastive learning
- Unsupervised learning can be used to learn multiple layers of representation.
- Applying unsupervised learning to activity vectors created by a random weight matrix can lead to structure that has nothing to do with the data.
- Boltzmann Machine learning algorithm was designed to avoid this flaw by contrasting statistics caused by two different external boundary conditions.
Learning fast and slow
- Full connectivity between layers allows weight updates to not affect layer-normalized output for a particular input vector.
- Vector of increments of incoming weights for hidden neuron j is given by a formula.
- Weight update does not change orientation of activity vector.
- Weight updates in earlier layers do not affect activity vectors in later layers.
- It is possible to change all weights in one step to achieve desired goodness of S*.
Mortal computation
- General purpose digital computers were designed to faithfully follow instructions
- Separation of software from hardware is one of the foundations of Computer Science
- It is possible to compute derivatives on huge data-sets by using many copies of the same model running in parallel
- It should be possible to achieve savings in energy and cost of fabricating hardware by abandoning immortality
- Learning procedure to discover parameter values that make effective use of unknown properties of each hardware instance
- Distillation can transfer knowledge from one piece of hardware to another
- Language may provide information about speaker’s internal vector representations
- Training a large neural network on massive amounts of language is an effective way to capture world view of a culture
- Forward-Forward algorithm may be a promising candidate for efficient hardware
- Open questions include best goodness function, activation function, and use of local goodness functions
- Acknowledgements to Jeff Dean, David Fleet, and many others