Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Aim of paper is to introduce new learning procedure for neural networks
Procedure replaces forward and backward passes of backpropagation with two forward passes
Each layer has its own objective function to have high goodness for positive data and low goodness for negative data
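If positive and negative passes can be separated in time, negative passes could be done offline, making learning in the positive pass simpler and allowing video to be pipelined through the network without storing activities or stopping to propagate derivatives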

Paper Content

The forward-forward algorithm

Forward-Forward algorithm is a greedy multi-layer learning procedure
Replaces forward and backward passes of backpropagation with two forward passes
Positive pass operates on real data and adjusts weights to increase goodness in hidden layers
Negative pass operates on “negative data” and adjusts weights to decrease goodness in hidden layers
Goodness function is sum of squared neural activities
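Aim is to correctly classify input vectors as positive or negative data, where the probability that an input is positive is the logistic function applied to the goodness minus a threshold
Hidden vector is length-normalized before being used as input to the next hidden layer, which removes the information used to determine goodness and forces the next layer to rely on the relative activities of the neurons
FF algorithm works in small neural networks with a few million connections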
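
The points above can be sketched for a single fully connected ReLU layer. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the threshold value `theta`, the learning rate, and the toy weights are all hypothetical, and the update is a plain gradient step on log p for positive data and log(1 - p) for negative data.

```python
import math

def relu_layer(W, x):
    """Activities y_j = max(0, w_j . x) for one fully connected layer."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]

def goodness(y):
    """Goodness: sum of squared activities."""
    return sum(v * v for v in y)

def prob_positive(y, theta):
    """Logistic function applied to (goodness - theta)."""
    return 1.0 / (1.0 + math.exp(-(goodness(y) - theta)))

def layer_norm(y, eps=1e-8):
    """Divide by the vector length (no mean subtraction), keeping only orientation."""
    length = math.sqrt(goodness(y))
    return [v / (length + eps) for v in y]

def ff_update(W, x, is_positive, theta, lr):
    """One local gradient step: raise goodness on positive data, lower it on negative."""
    y = relu_layer(W, x)
    p = prob_positive(y, theta)
    # d log(p)/dG = 1 - p and d log(1 - p)/dG = -p, with dG/dw_j = 2 * y_j * x.
    scale = (1.0 - p) if is_positive else -p
    return [[w + lr * scale * 2.0 * yj * xi for w, xi in zip(row, x)]
            for row, yj in zip(W, y)]

# Toy usage with hypothetical numbers.
W = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
x = [1.0, 2.0, 0.5]
print(goodness(relu_layer(W, x)))
print(goodness(relu_layer(ff_update(W, x, True, theta=1.0, lr=0.1), x)))
print(layer_norm(relu_layer(W, x)))
```

Only the output of `layer_norm` would be passed to the next layer, so that layer cannot read off the goodness of the layer below from the length of its input.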

The backpropagation baseline

Experiments use MNIST dataset of handwritten digits
50,000 images used for training, 10,000 for validation
10,000 images used to compute test error rate
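Sensibly engineered convolutional neural networks trained with backpropagation get about 0.6% test error
In the permutation-invariant version of the task, the network is given no information about the spatial layout of the pixels
On this version, feed-forward networks with a few fully connected hidden layers of ReLUs get about 1.4% test error
Regularizers such as dropout or label smoothing can reduce test error to about 1.1%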
Learning procedure works about as well as backpropagation on MNIST

A simple unsupervised example of FF

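Two main questions about FF need to be answered: given a good source of negative data, does it learn effective multi-layer representations, and where does the negative data come from
Hand-crafted source of negative data used as a temporary crutch to answer the first question
Common approach for supervised tasks is to first learn to transform input vectors into representation vectors without labels, then learn a linear classifier on top of those representations
Negative data are hybrid images: one digit image multiplied by a mask is added to a different digit image multiplied by the reverse of the mask
Masks with fairly large regions of ones and zeros are created by repeatedly blurring a random bit image with a [1/4, 1/2, 1/4] filter in both directions and then thresholding at 0.5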
Network with four hidden layers of 2000 ReLUs gives 1.37% test error
Local receptive fields (without weight-sharing) improve performance to 1.16% test error
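
The hand-crafted negative data described above can be sketched as follows. This is an illustrative reading of the procedure, not the paper's code: the function names, the number of blur passes, and the clamped border handling are assumptions, and images are plain lists of lists with pixel values in [0, 1].

```python
import random

def blur(img):
    """One pass of the [1/4, 1/2, 1/4] filter horizontally, then vertically."""
    h, w = len(img), len(img[0])

    def at(a, r, c):
        # Clamp indices at the border (edge handling is an assumption).
        return a[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]

    horiz = [[0.25 * at(img, r, c - 1) + 0.5 * img[r][c] + 0.25 * at(img, r, c + 1)
              for c in range(w)] for r in range(h)]
    return [[0.25 * at(horiz, r - 1, c) + 0.5 * horiz[r][c] + 0.25 * at(horiz, r + 1, c)
             for c in range(w)] for r in range(h)]

def make_mask(h, w, n_blurs, rng):
    """Random bit image, blurred n_blurs times, then thresholded at 0.5."""
    img = [[float(rng.random() < 0.5) for _ in range(w)] for _ in range(h)]
    for _ in range(n_blurs):
        img = blur(img)
    return [[1.0 if v > 0.5 else 0.0 for v in row] for row in img]

def hybrid(a, b, mask):
    """Negative example: a * mask + b * (1 - mask), pixel by pixel."""
    return [[m * pa + (1.0 - m) * pb for pa, pb, m in zip(ra, rb, rm)]
            for ra, rb, rm in zip(a, b, mask)]

# Toy usage: two constant "images" stand in for two different digits.
rng = random.Random(0)
mask = make_mask(28, 28, n_blurs=5, rng=rng)
negative = hybrid([[1.0] * 28 for _ in range(28)],
                  [[0.0] * 28 for _ in range(28)], mask)
```

More blur passes give larger contiguous regions in the mask, so the hybrid keeps the short-range structure of real digits while breaking the long-range structure.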

A simple supervised example of FF

Learning hidden representations without labels is useful for large models.
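For a small model aimed at a single task, supervised learning makes more sense; one way to do this with FF is to include the label in the input, so positive data is an image with the correct label and negative data is an image with an incorrect label.
A fully connected network with four hidden layers of 2000 ReLUs trained this way gets 1.36% test error on MNIST.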
Augmenting the training data with jittered images can reduce test error to 0.64%.
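
Including the label in the input can be sketched as below. This assumes a flattened image whose first ten pixels are overwritten with a one-hot label code, and classification by trying every label and keeping the one with the highest goodness. The names `embed_label`, `classify`, and `goodness_fn` are hypothetical; `goodness_fn` stands in for whatever accumulated goodness a trained network would report.

```python
def embed_label(pixels, label, n_classes=10):
    """Overwrite the first n_classes pixels with a one-hot code for `label`."""
    one_hot = [1.0 if k == label else 0.0 for k in range(n_classes)]
    return one_hot + list(pixels[n_classes:])

def classify(pixels, goodness_fn, n_classes=10):
    """Run the input once per candidate label and pick the label with the highest goodness."""
    scores = [goodness_fn(embed_label(pixels, k, n_classes)) for k in range(n_classes)]
    return max(range(n_classes), key=scores.__getitem__)

# Toy usage: a stand-in goodness function that simply prefers label 7.
pixels = [0.5] * 784
positive = embed_label(pixels, 3)   # image paired with its (assumed) correct label
negative = embed_label(pixels, 8)   # same image paired with an incorrect label
print(classify(pixels, lambda v: v[7]))
```

Because the only difference between positive and negative inputs is the label, the network has to attend to features that correlate with the label in order to tell them apart.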

Using FF to model top-down effects in perception

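The feed-forward networks in the earlier examples are learned greedily one layer at a time
This means that what is learned in later layers cannot affect what is learned in earlier layers, which looks like a major weakness compared with backpropagation
Treating a static image as a rather boring video is a way to overcome this limitation of FF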
A multi-layer recurrent neural network is used
An experiment was done with a static MNIST image
The network was trained on MNIST for 60 epochs
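Negative data is generated by doing a single forward pass through the net to get probabilities for all classes, then choosing among the incorrect classes in proportion to their probabilities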
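
Choosing an incorrect label in proportion to its predicted probability can be sketched with the standard library. The function name and the uniform fallback for the degenerate case are assumptions added for illustration.

```python
import random

def sample_negative_label(probs, true_label, rng):
    """Sample an incorrect class with probability proportional to its predicted probability."""
    weights = [0.0 if k == true_label else p for k, p in enumerate(probs)]
    if sum(weights) == 0.0:
        # All predicted mass is on the true class: fall back to a uniform
        # choice over incorrect classes (an assumption, not from the paper).
        weights = [0.0 if k == true_label else 1.0 for k in range(len(probs))]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

# Toy usage: class 0 is the true label, class 2 is the most confusable one.
rng = random.Random(0)
print(sample_negative_label([0.6, 0.1, 0.3], true_label=0, rng=rng))
```

Sampling this way concentrates the negative examples on the labels the network currently finds most plausible, which are the ones it most needs to learn to reject.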

Using predictions from the spatial context as a teacher

Objective is to have good agreement between input from layer above and below for positive data, bad agreement for negative data
Top-down input is determined by larger region of image, bottom-up input is based on more local region
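If the input changes over time, top-down input is based on older data, so it has to learn to predict the representations of the bottom-up input
If the sign of the objective is reversed to aim for low squared activities on positive data, top-down input should learn to cancel out bottom-up input on positive data, which resembles predictive coding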
Layer normalization means plenty of information gets sent to next layer
Learning by using contextual prediction as teaching signal for local feature extraction has been around for a long time
CIFAR-10 has 50,000 training images, 32x32 with 3 color channels
Compared with a backpropagation net of the same architecture, FF is comparable in performance
FF test performance is slightly worse than backpropagation, but the gap does not increase with more hidden layers
Backpropagation reduces training error more quickly

Sleep

FF would be easier to implement in a brain if positive data is processed when awake and negative data is created and processed during sleep.
Preliminary experiments use FF to predict the next character in a sequence from the previous ten-character window.
Alternating between thousands of weight updates on positive data and thousands on negative data only works if the learning rate is very low and the momentum is extremely high.
A different goodness function might allow the positive and negative phases of learning to be separated, but this remains to be shown.
It would be interesting to see whether eliminating the negative phase updates for a while mimics the effects of sleep deprivation.

Relationship to Boltzmann machines

In the early 1980s two promising learning procedures for deep neural networks were backpropagation and Boltzmann Machines
Boltzmann Machines are networks of stochastic binary neurons with pairwise connections
When running freely, a Boltzmann Machine updates each binary neuron by setting it to the on state with a probability equal to the logistic of the total input it receives
The aim of Boltzmann machine learning is to make the distribution of binary vectors on the visible neurons match the data distribution
The Kullback-Leibler divergence between the data distribution and the model distribution has a simple derivative w.r.t. any weight
Boltzmann Machine learning is impractical and implausible as a model of cortical learning
FF combines the contrastive learning from Boltzmann machines with a simple, local goodness function
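
For reference, the two Boltzmann machine facts mentioned above can be written out in standard notation. The symbols are assumptions chosen for this note: s_i is the binary state of neuron i, w_ij a symmetric weight, b_i a bias, the temperature is taken to be 1, and the angle brackets denote expectations at thermal equilibrium with the visible units clamped to data or with the network running freely.

```latex
% Stochastic update of a binary neuron when the network runs freely
p(s_i = 1) = \sigma\Big(b_i + \sum_j s_j w_{ij}\Big),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% Descent direction of the KL divergence with respect to a weight
-\frac{\partial\, \mathrm{KL}\big(P_{\text{data}} \,\|\, P_{\text{model}}\big)}{\partial w_{ij}}
= \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}}
```

The update for each weight depends only on the pairwise statistics of the two neurons it connects, measured in two different phases. That contrast between two phases is the part FF keeps, while replacing the equilibrium statistics with a local goodness function.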

Relationship to generative adversarial networks

GANs use a multi-layer neural network to generate data
GANs train a generative model by using a discriminative network
GANs can suffer from mode collapse
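FF can be viewed as a special case of a GAN in which every hidden layer of the discriminative network makes its own greedy decision about whether the input is positive or negative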
FF does not require backpropagation to learn the discriminative or generative model
FF eliminates problems that arise when one model learns too fast relative to the other model
FF eliminates mode collapse

Relationship to contrastive methods that compare representations of two different image crops

Self-supervised contrastive methods use an objective function to favor agreement between representations of two different crops from the same image and disagreement between crops from two different images.
These methods use many layers and backpropagation to train the layers.
FF uses a different way to measure agreement, which seems easier for a real neural network.
One proposed refinement is to divide each layer into many small blocks and force each block separately to decide between positive and negative cases.

A problem with stacked contrastive learning

Unsupervised learning can be used to learn multiple layers of representation.
Applying unsupervised learning to activity vectors created by a random weight matrix can lead to structure that has nothing to do with the data.
Boltzmann Machine learning algorithm was designed to avoid this flaw by contrasting statistics caused by two different external boundary conditions.

Learning fast and slow

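With full connectivity between layers, the weight updates that change a layer's goodness for a particular input vector x have no effect on that layer's layer-normalized output for x.
Vector of increments of incoming weights for hidden neuron j is Δw_j = 2ε (∂log(p) / ∂Σ_j y_j²) y_j x, where ε is the learning rate and y_j is the activity of neuron j before layer normalization.
The resulting change in each activity is proportional to y_j, so all activities scale by the same factor and the weight update does not change the orientation of the activity vector.
Weight updates in earlier layers therefore do not change the activity vectors in later layers for that input vector, so many layers can be updated simultaneously.
It is possible to change all weights in one step so that every layer exactly achieves a desired goodness of S* for input vector x.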
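
The scaling argument above can be checked numerically for one ReLU layer. In this sketch the step size is derived directly from that argument rather than copied from the paper: an update of the form Δw_j = c y_j x multiplies every active activity by (1 + c x·x), so choosing c = (sqrt(S*/S) - 1) / (x·x) moves the goodness from S to S* in one step. Function names and toy numbers are hypothetical, and the code assumes S > 0 and a non-zero x.

```python
import math

def relu_layer(W, x):
    """Activities y_j = max(0, w_j . x)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]

def unit(y):
    """Layer-normalized activity vector (divide by length, no mean subtraction)."""
    length = math.sqrt(sum(v * v for v in y))
    return [v / length for v in y]

def one_step_update(W, x, s_star):
    """Change all incoming weights so the layer's goodness for x becomes s_star."""
    y = relu_layer(W, x)
    s = sum(v * v for v in y)        # current goodness S (assumed > 0)
    xx = sum(xi * xi for xi in x)    # x . x (assumed > 0)
    c = (math.sqrt(s_star / s) - 1.0) / xx
    # Delta w_j = c * y_j * x; inactive neurons (y_j = 0) are left unchanged.
    return [[w + c * yj * xi for w, xi in zip(row, x)] for row, yj in zip(W, y)]

# Toy usage: the third neuron is inactive for this input.
W = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6], [-1.0, 0.0, 0.0]]
x = [1.0, 2.0, 0.5]
y_old = relu_layer(W, x)
y_new = relu_layer(one_step_update(W, x, s_star=4.0), x)
print(sum(v * v for v in y_new))   # goodness after the update
print(unit(y_old), unit(y_new))    # orientation before and after
```

The normalized vectors agree up to floating-point rounding, which is why the next layer sees the same input for x before and after the update.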

Mortal computation

General purpose digital computers were designed to faithfully follow instructions
Separation of software from hardware is one of the foundations of Computer Science
It is possible to compute derivatives on huge data-sets by using many copies of the same model running in parallel
Abandoning immortality, so that learned knowledge is tied to one specific piece of hardware, should allow large savings in energy and in the cost of fabricating hardware
This relies on a learning procedure to discover parameter values that make effective use of the unknown properties of each hardware instance
Distillation can transfer knowledge from one piece of hardware to another
Language may provide information about speaker’s internal vector representations
Training a large neural network on massive amounts of language is an effective way to capture world view of a culture
Forward-Forward algorithm may be a promising candidate for learning in efficient hardware whose precise details are unknown
Open questions include best goodness function, activation function, and use of local goodness functions
Acknowledgements to Jeff Dean, David Fleet, and many others
