Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Research has focused on developing efficient neural network architectures
  • Explored logic gate networks for machine learning tasks
  • Networks comprise logic gates such as “AND” and “XOR”
  • Difficult to learn logic gate networks as they are non-differentiable
  • Proposed differentiable logic gate networks to allow for effective training
  • Discretized logic gate networks achieve fast inference speeds

Paper Content

Introduction

  • Neural networks have been researched to make computations faster and more efficient.
  • Various techniques have been proposed to solve this problem.
  • This paper proposes an approach for gradient-based training of logic gate networks.
  • Logic gate networks are based on binary logic gates, such as “and” and “xor”.
  • Networks are continuously relaxed to differentiable logic gate networks to allow for gradient descent.
  • Probability distributions are learned for each neuron to decide which logic gate to use.
  • Logic gate networks are faster than fully connected ReLU neural networks by two orders of magnitude.
  • Training can be scaled up to 5 million parameters.
  • Differentiable logics and triangular norms are well-known in fuzzy logics and probabilistic metric spaces
  • Chatterjee explored a method for memorizing binary classification data sets with a network of binary lookup tables
  • Brudermueller et al. proposed a method to translate a neural network into random forests and then into networks of AND-Inverter logic gates
  • Continuous relaxations are used to make discrete structures differentiable
  • Zimmer et al. proposed differentiable logic machines for inductive logic programming
  • Chen proposed Gumbel-Max Equation Learner Networks to learn symbolic expressions from data
  • Mocanu et al. proposed training neural networks with sparse evolutionary training
  • Gaier et al. proposed learning networks of operators such as ReLU, sin, inverse, absolute, step, and tanh using evolutionary strategies
  • Zantedeschi et al. proposed to learn decision trees by quadratically relaxing the decision trees
  • Binary neural networks are conceptually different from logic gate networks
  • Sparse neural networks are neural networks where only a selected subset of connections is present

Logic gate networks

  • Logic gate networks are similar to neural networks but use binary logic gates instead of weights.
  • Logic gate networks are sparse because each neuron has only 2 inputs.
  • Logic gate networks do not need activation functions as they are intrinsically non-linear.
  • Logic gate networks allow for grading of predictions with as many levels as there are neurons per class.
  • Logic gate networks are efficient to execute.

Differentiable logic gate networks

  • Training binary logic gate networks is difficult because they are not differentiable.
  • We propose relaxing logic gate networks to differentiable logic gate networks to allow for gradient-based training.
  • Relaxation involves replacing hard binary activations with probabilistic activations.
  • Logic gates are replaced by computing the expected value of the activation given probabilities of independent inputs.
  • Choice of logic gate is represented by a categorical probability distribution.
  • Output neurons are aggregated to produce graded outputs.

Training considerations

  • Randomly initialize connections and parameters of each neuron
  • Draw elements of w from standard normal distribution
  • Use same number of neurons in each layer (except input)
  • Use 4-8 layers
  • Train with Adam optimizer at constant learning rate of 0.01
  • Discretize probability distributions by taking mode
  • Most neurons converge to one logic gate operation
  • Classification: group output into k groups, count number of 1s
  • Regression: use affine transformation to transform range of predictions

Remarks

  • Computational detail: use larger data types instead of Boolean
  • Computational speedup on CPUs and GPUs
  • Memory considerations: only need to store 4-bit information
  • Pruning model for speedup, but requires storing connections
  • Use Adam optimizer for training

Current limitations and opportunities

  • Differentiable logic gate networks have higher training cost than conventional neural networks
  • Practical computational cost can be reduced
  • Asymptotically, differentiable logic gate networks are cheaper to train
  • Edge computing and embedded machine learning require small architectures
  • Training cost is not a concern for edge computing and embedded machine learning
  • Experiments performed on MONK, Adult Census, Breast Cancer, MNIST and CIFAR-10 data sets
  • Method outperforms logistic regression and larger neural network on MONK-3

Adult and breast cancer

  • Method performs similarly to neural networks and logistic regression on Adult Census data set
  • Method achieves best performance and fastest inference speed on Breast Cancer data set

Mnist

  • Our logic network achieves better performance than the fastest method (FINN) while using less than 10% of the number of binary operations.
  • Our model is 12x faster than FINN on an NVIDIA A6000 GPU, even though it only achieves 7% utilization.
  • Our model requires substantially fewer operations than each of the baselines.

Cifar-10

  • Benchmarked method on MNIST and CIFAR-10
  • Reduced color-channel resolution of CIFAR-10 images
  • Employed binary embedding
  • No data augmentation or dropout
  • Outperforms neural networks with small margin
  • Requires less than 0.1% of memory footprint
  • In comparison to best fully-connected neural network baselines, does not achieve same performance
  • Sparse neural networks more competitive with respect to speed

Distribution of logic gates

  • Constant 0/1 operator is rarely used and not present in the last layer
  • First layer has more ‘and’, ’nand’, ‘or’, and ’nor’
  • Second and third layers have more ‘A’, ‘B’, ‘¬A’, and ‘¬B’s
  • Last layer has more ‘xor’ and ‘xnor’
  • Implications (e.g., A ⇒ B) are rarely used

Conclusion

  • Presented novel approach to train logic gate networks
  • Allows for efficient neural networks with high accuracy
  • Source code released to foster future research
  • Output bits can be aggregated into binary number to reduce memory bandwidth
  • Training times and standard deviations provided
  • Trade-off between depth and accuracy similar to regular neural networks
  • Sparsity arises from binary logic gate operators and number of pairs of neurons covered
  • Randomly selected sparse connections work well
  • T-norms and T-conorms used for real-valued logics
  • Axioms for multi-valued logics defined
  • Results on MONK, Adult, Breast Cancer, MNIST, and CIFAR-10 data sets
  • Inference times per data point for 1 CPU thread
  • GPU and CPU used for inference times