Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Aim of paper is to introduce new learning procedure for neural networks
  • Procedure replaces forward and backward passes of backpropagation with two forward passes
  • Each layer has its own objective function to have high goodness for positive data and low goodness for negative data
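  • If the positive and negative passes could be separated in time, the negative passes could be done offline, making learning in the positive pass simpler and allowing video to be pipelined through the network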

Paper Content

The forward-forward algorithm

  • Forward-Forward algorithm is a greedy multi-layer learning procedure
  • Replaces forward and backward passes of backpropagation with two forward passes
  • Positive pass operates on real data and adjusts weights to increase goodness in hidden layers
  • Negative pass operates on “negative data” and adjusts weights to decrease goodness in hidden layers
  • Goodness function is sum of squared neural activities
  • Aim is to correctly classify input vectors as positive or negative data
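  • Layer normalization removes the length of each hidden activity vector before it is passed to the next layer, so the next layer cannot separate positive from negative data simply by using that length and must rely on the vector's orientation
  • The paper aims to show that FF works in relatively small neural networks containing a few million connections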
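
The goodness and normalization steps above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes the probability that an input is positive is the logistic of the goodness minus a threshold, and the names `goodness`, `prob_positive`, `normalize_length` and the `threshold` value are chosen here for illustration.

```python
import math

def goodness(activities):
    """Sum of squared activities in one hidden layer."""
    return sum(a * a for a in activities)

def prob_positive(activities, threshold):
    """Logistic of (goodness - threshold); `threshold` is an assumed hyperparameter."""
    return 1.0 / (1.0 + math.exp(-(goodness(activities) - threshold)))

def normalize_length(activities, eps=1e-8):
    """Divide by the vector length so only the orientation reaches the next layer."""
    length = math.sqrt(goodness(activities))
    return [a / (length + eps) for a in activities]

hidden = [3.0, 4.0]
scaled = [2.0 * a for a in hidden]   # same orientation, four times the goodness
```

After `normalize_length`, `hidden` and `scaled` are practically identical, which is why the next layer cannot read the previous layer's goodness off the length of its input.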

The backpropagation baseline

  • Experiments use MNIST dataset of handwritten digits
  • 50,000 images used for training, 10,000 for validation
  • 10,000 images used to compute test error rate
  • Sensibly engineered convolutional neural networks with a few hidden layers, trained with backpropagation, get about 0.6% test error
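  • Permutation-invariant version of the task gives the network no information about the spatial layout of pixels
  • In this version, feed-forward networks with a few fully connected hidden layers of ReLUs get about 1.4% test error
  • Regularizers such as dropout or label smoothing can reduce this to about 1.1%
  • These figures are the baseline for judging whether a new learning procedure works about as well as backpropagation on MNIST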

A simple unsupervised example of FF

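  • Two main questions about FF need to be answered: given a good source of negative data, does FF learn effective multi-layer representations, and where does the negative data come from?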
  • Hand-crafted source of negative data used as a temporary crutch
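  • Representation vectors are first learned without labels, then a simple linear classifier is trained to map them to class logits
  • Negative data are hybrid images: one digit image times a mask, added to a different digit image times the reverse of the mask
  • Masks contain fairly large regions of ones and zeros and are created by repeatedly blurring a random bit image and then thresholding it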
  • Network with four hidden layers of 2000 ReLUs gives 1.37% test error
  • Local receptive fields (without weight-sharing) improve performance to 1.16% test error
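
The mask and hybrid-image construction can be sketched as follows. This is a rough illustration under stated assumptions: a [1/4, 1/2, 1/4] blur filter applied in both directions, a threshold of 0.5, edge pixels that reuse their own value for a missing neighbour, and an arbitrary number of blur passes (`n_blurs`). The function names are hypothetical.

```python
import random

def blur_once(img):
    """Apply the [1/4, 1/2, 1/4] filter horizontally, then vertically.
    Edge pixels reuse their own value for the missing neighbour (an assumption)."""
    h, w = len(img), len(img[0])
    horiz = [[0.25 * row[max(c - 1, 0)] + 0.5 * row[c] + 0.25 * row[min(c + 1, w - 1)]
              for c in range(w)] for row in img]
    return [[0.25 * horiz[max(r - 1, 0)][c] + 0.5 * horiz[r][c]
             + 0.25 * horiz[min(r + 1, h - 1)][c]
             for c in range(w)] for r in range(h)]

def make_mask(height, width, n_blurs=6, seed=0):
    """Random bit image, blurred n_blurs times (assumed count), thresholded at 0.5."""
    rng = random.Random(seed)
    img = [[float(rng.randint(0, 1)) for _ in range(width)] for _ in range(height)]
    for _ in range(n_blurs):
        img = blur_once(img)
    return [[1.0 if v > 0.5 else 0.0 for v in row] for row in img]

def hybrid_image(digit_a, digit_b, mask):
    """digit_a * mask + digit_b * (1 - mask), pixel by pixel."""
    return [[a * m + b * (1.0 - m) for a, b, m in zip(ra, rb, rm)]
            for ra, rb, rm in zip(digit_a, digit_b, mask)]

mask = make_mask(28, 28)
ones = [[1.0] * 28 for _ in range(28)]    # stand-ins for two digit images
zeros = [[0.0] * 28 for _ in range(28)]
negative = hybrid_image(ones, zeros, mask)   # equals the mask for these toy inputs
```

More blur passes give larger coherent regions in the mask, so the hybrid keeps the short-range structure of real digits while breaking the long-range structure.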

A simple supervised example of FF

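  • Learning hidden representations without labels is sensible for large models that must eventually perform a wide variety of tasks.
  • For a small model trained on a single task, supervised learning makes more sense; FF does this by including the label in the input, so positive data is an image with the correct label and negative data is an image with an incorrect label.
  • With this scheme FF reaches 1.36% test error on MNIST.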
  • Augmenting the training data with jittered images can reduce test error to 0.64%.
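
One way to include the label in the input is sketched below. It assumes the label is written as a one-hot code over the first 10 pixels of the flattened image, and that negative examples use a uniformly chosen incorrect label; `embed_label` and `make_pair` are illustrative names, not taken from the paper.

```python
import random

NUM_CLASSES = 10

def embed_label(pixels, label):
    """Return a copy of a flattened image whose first 10 pixels hold a one-hot label."""
    one_hot = [1.0 if k == label else 0.0 for k in range(NUM_CLASSES)]
    return one_hot + list(pixels[NUM_CLASSES:])

def make_pair(pixels, true_label, rng):
    """Positive example: correct label. Negative example: a random incorrect label."""
    wrong = rng.choice([k for k in range(NUM_CLASSES) if k != true_label])
    return embed_label(pixels, true_label), embed_label(pixels, wrong)

rng = random.Random(0)
image = [0.5] * 784          # stand-in for a flattened 28x28 MNIST image
positive, negative = make_pair(image, true_label=3, rng=rng)
```

Because the two examples differ only in the label pixels, the network can only tell them apart by learning which image features go with which label.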

Using FF to model top-down effects in perception

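  • The image classification examples so far use feed-forward networks learned greedily, one layer at a time
  • What is learned in later layers therefore cannot affect what is learned in earlier layers, a major weakness compared with backpropagation
  • Treating a static image as a video is a way to overcome this limitation of FF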
  • A multi-layer recurrent neural network is used
  • An experiment was done with a static MNIST image
  • The network was trained on MNIST for 60 epochs
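  • For efficiency, negative data is generated by doing a single forward pass through the net to get class probabilities, then choosing an incorrect class in proportion to its probability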
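
Choosing the negative label from the net's own predictions might look like the sketch below. It assumes the forward pass has already produced a list of class probabilities; `sample_negative_label`, the uniform fallback, and the `probs` values are hypothetical.

```python
import random

def sample_negative_label(class_probs, true_label, rng):
    """Pick an incorrect class with probability proportional to the net's prediction."""
    labels = [k for k in range(len(class_probs)) if k != true_label]
    weights = [class_probs[k] for k in labels]
    if sum(weights) <= 0.0:          # all mass on the true class: fall back to uniform
        return rng.choice(labels)
    return rng.choices(labels, weights=weights, k=1)[0]

rng = random.Random(0)
probs = [0.05, 0.0, 0.0, 0.6, 0.0, 0.0, 0.0, 0.0, 0.35, 0.0]   # hypothetical net output
draws = [sample_negative_label(probs, true_label=3, rng=rng) for _ in range(1000)]
```

Sampling this way concentrates the negative examples on the labels the net currently confuses with the correct one, rather than on labels it already rejects.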

Using predictions from the spatial context as a teacher

  • Objective is to have good agreement between input from layer above and below for positive data, bad agreement for negative data
  • Top-down input is determined by larger region of image, bottom-up input is based on more local region
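  • If the input changes over time, the top-down input is based on older data, so it must learn to predict the representations of the bottom-up input
  • If the sign of the objective is reversed so that positive data should have low squared activities, the top-down input should learn to cancel out the bottom-up input on positive data, which resembles predictive coding
  • Layer normalization means plenty of information still gets sent to the next layer, even when the cancelling works well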
  • Learning by using contextual prediction as teaching signal for local feature extraction has been around for a long time
  • CIFAR-10 has 50,000 training images, 32x32 with 3 color channels
  • FF is compared with a net trained by backpropagation and is comparable in test performance
  • FF test performance is slightly worse than backpropagation, but the gap does not increase with more hidden layers
  • Backpropagation reduces training error more quickly

Sleep

  • FF would be easier to implement in a brain if positive data is processed when awake and negative data is created and processed during sleep.
  • A simple experiment uses FF to predict the next character in a sequence from a window of the previous ten characters.
  • Alternating between weight updates on positive and negative data only works if the learning rate is low and the momentum is high.
  • It may be possible to separate the positive and negative phases of learning, but this remains to be shown.
  • If the phases can be separated, it would be interesting to see whether eliminating the negative phase updates for a while mimics the effects of sleep deprivation.
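
The ten-character prediction task can be set up as sketched below. The sketch only shows how context windows and their next characters might be cut from a string; the helper `context_windows` and the sample text are illustrative, and how negative strings are produced is left out.

```python
def context_windows(text, window=10):
    """Yield (previous `window` characters, next character) pairs from a string."""
    for i in range(len(text) - window):
        yield text[i:i + window], text[i + window]

pairs = list(context_windows("the quick brown fox", window=10))
```

Each pair gives a ten-character context together with the character that actually follows it.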

Relationship to Boltzmann machines

  • In the early 1980s two promising learning procedures for deep neural networks were backpropagation and Boltzmann Machines
  • Boltzmann Machines are networks of stochastic binary neurons with pairwise connections
  • When running freely, a Boltzmann Machine updates each binary neuron by setting it to the on state with a probability equal to the logistic of the total input it receives
  • The aim of Boltzmann machine learning is to make the distribution of binary vectors on the visible neurons match the data distribution
  • The Kullback-Leibler divergence between the data distribution and the model distribution has a simple derivative w.r.t. any weight
  • Because the network must approach its equilibrium distribution, Boltzmann Machine learning is impractical as a machine learning technique and implausible as a model of cortical learning
  • FF combines the contrastive learning from Boltzmann machines with a simple, local goodness function
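
The free-running update of a single stochastic binary neuron can be sketched as follows. This is a toy illustration: the bias terms, the weight values, and the function name `update_neuron` are assumptions made here, and a real Boltzmann Machine would repeat such updates many times to approach equilibrium.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def update_neuron(states, weights, biases, i, rng):
    """Resample binary neuron i given the current states of all other neurons."""
    total_input = biases[i] + sum(weights[i][j] * states[j]
                                  for j in range(len(states)) if j != i)
    states[i] = 1 if rng.random() < logistic(total_input) else 0
    return states[i]

rng = random.Random(0)
weights = [[0.0, 50.0], [50.0, 0.0]]   # symmetric pairwise connection (toy values)
biases = [0.0, -25.0]
states = [1, 0]
update_neuron(states, weights, biases, 1, rng)   # total input 25, so almost surely on
```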

Relationship to generative adversarial networks

  • GANs use a multi-layer neural network to generate data
  • GANs train a generative model by using a discriminative network
  • GANs can suffer from mode collapse
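  • FF can be viewed as a special case of a GAN in which every hidden layer of the discriminative network makes its own greedy decision about whether the input is positive or negative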
  • FF does not require backpropagation to learn the discriminative or generative model
  • Because both models use the same hidden representations, FF eliminates the problems that arise when one model learns too fast relative to the other model
  • FF eliminates mode collapse

Relationship to contrastive methods that compare representations of two different image crops

  • Self-supervised contrastive methods use an objective function to favor agreement between representations of two different crops from the same image and disagreement between crops from two different images.
  • These methods use many layers and backpropagation to train the layers.
  • FF uses a different way to measure agreement, which seems easier for a real neural network to implement.
  • To get more constraint from each training case, FF can divide each layer into many small blocks and force each block separately to decide between positive and negative cases.
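
Splitting a layer into blocks that each make their own decision could look like this. The sketch assumes consecutive, equally sized blocks and a shared threshold; `block_goodness`, `block_size`, and `threshold` are hypothetical names and values.

```python
def block_goodness(activities, block_size):
    """Split a layer's activities into consecutive blocks; return each block's goodness."""
    return [sum(a * a for a in activities[i:i + block_size])
            for i in range(0, len(activities), block_size)]

layer = [1.0, 2.0, 0.0, 0.0, 3.0, 1.0]
per_block = block_goodness(layer, block_size=2)   # [5.0, 0.0, 10.0]
threshold = 4.0                                    # hypothetical per-block threshold
votes = [g > threshold for g in per_block]         # [True, False, True]
```

Each block then contributes its own positive-or-negative decision, so one training case constrains the weights in several places at once.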

A problem with stacked contrastive learning

  • An obvious way to learn multiple layers of representation is to learn one hidden layer, treat its activity vectors as data, and apply the same unsupervised learning algorithm again.
  • This has a subtle flaw: activity vectors created by passing random noise through a random weight matrix have correlational structure that comes from the weight matrix and has nothing to do with the data.
  • The Boltzmann Machine learning algorithm was designed to avoid this flaw by contrasting statistics caused by two different external boundary conditions.

Learning fast and slow

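  • With full connectivity between layers, the weight updates that change a layer's goodness for a particular input vector have no effect on the layer-normalized output of that layer for the same input vector.
  • The vector of increments of the incoming weights for hidden neuron j is proportional to the neuron's activity y_j times the input vector x.
  • Since y_j is the only term that depends on j, all hidden activities change by the same proportion and the weight update does not change the orientation of the activity vector.
  • Weight updates in earlier layers therefore do not change the activity vectors in later layers for that input vector, so many layers can be updated simultaneously.
  • It is possible to change all weights in one step so that every layer achieves a desired goodness of S* for that input vector.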
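
The claims above can be checked numerically on a toy layer. The sketch assumes ReLU units, a unit-length input vector, and an update of the form delta_w_j = eps * y_j * x with all constant factors absorbed into `eps`; under those assumptions every active unit is scaled by (1 + eps), so choosing eps = sqrt(S*/S) - 1 moves the goodness from S to S*. All names and values are illustrative.

```python
import math

def relu_layer(weights, x):
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

def goodness(y):
    return sum(v * v for v in y)

def one_step_update(weights, x, target_goodness):
    """Apply delta_w_j = eps * y_j * x with eps chosen so the goodness becomes the target.
    Assumes x has unit length and the current goodness is non-zero."""
    y = relu_layer(weights, x)
    eps = math.sqrt(target_goodness / goodness(y)) - 1.0
    return [[w + eps * yj * xi for w, xi in zip(row, x)] for row, yj in zip(weights, y)]

x = [0.6, 0.8]                                     # unit-length input vector
weights = [[1.0, 0.5], [0.5, 1.0], [0.3, -0.9]]    # toy weights, 3 hidden neurons
before = relu_layer(weights, x)
after = relu_layer(one_step_update(weights, x, target_goodness=4.0), x)
```

Here `goodness(after)` equals the target up to floating-point rounding, and `after` is a scaled copy of `before`, so the layer-normalized output for this input is unchanged.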

Mortal computation

  • General purpose digital computers were designed to faithfully follow instructions
  • Separation of software from hardware is one of the foundations of Computer Science
  • It is possible to compute derivatives on huge data-sets by using many copies of the same model running in parallel
  • It should be possible to achieve savings in energy and cost of fabricating hardware by abandoning immortality
  • This requires a learning procedure that discovers parameter values which make effective use of the unknown properties of each hardware instance
  • Distillation can transfer knowledge from one piece of hardware to another
  • Language may provide information about speaker’s internal vector representations
  • Training a large neural network on massive amounts of language is an effective way to capture world view of a culture
  • Forward-Forward algorithm is a promising candidate for a learning procedure that runs efficiently in hardware whose precise details are unknown
  • Open questions include best goodness function, activation function, and use of local goodness functions
  • Acknowledgements to Jeff Dean, David Fleet, and many others