Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Research has focused on developing efficient neural network architectures
- Explored logic gate networks for machine learning tasks
- Networks comprise logic gates such as “AND” and “XOR”
- Difficult to learn logic gate networks as they are non-differentiable
- Proposed differentiable logic gate networks to allow for effective training
- Discretized logic gate networks achieve fast inference speeds
Paper Content
Introduction
- Neural networks have been researched to make computations faster and more efficient.
- Various techniques have been proposed to solve this problem.
- This paper proposes an approach for gradient-based training of logic gate networks.
- Logic gate networks are based on binary logic gates, such as “and” and “xor”.
- Networks are continuously relaxed to differentiable logic gate networks to allow for gradient descent.
- Probability distributions are learned for each neuron to decide which logic gate to use.
- Logic gate networks are faster than fully connected ReLU neural networks by two orders of magnitude.
- Training can be scaled up to 5 million parameters.
Related work
- Differentiable logics and triangular norms are well-known in fuzzy logics and probabilistic metric spaces
- Chatterjee explored a method for memorizing binary classification data sets with a network of binary lookup tables
- Brudermueller et al. proposed a method to translate a neural network into random forests and then into networks of AND-Inverter logic gates
- Continuous relaxations are used to make discrete structures differentiable
- Zimmer et al. proposed differentiable logic machines for inductive logic programming
- Chen proposed Gumbel-Max Equation Learner Networks to learn symbolic expressions from data
- Mocanu et al. proposed training neural networks with sparse evolutionary training
- Gaier et al. proposed learning networks of operators such as ReLU, sin, inverse, absolute, step, and tanh using evolutionary strategies
- Zantedeschi et al. proposed to learn decision trees by quadratically relaxing the decision trees
- Binary neural networks are conceptually different from logic gate networks
- Sparse neural networks are neural networks where only a selected subset of connections is present
Logic gate networks
- Logic gate networks are similar to neural networks but use binary logic gates instead of weights.
- Logic gate networks are sparse because each neuron has only 2 inputs.
- Logic gate networks do not need activation functions as they are intrinsically non-linear.
- Logic gate networks allow for grading of predictions with as many levels as there are neurons per class.
- Logic gate networks are efficient to execute.
Differentiable logic gate networks
- Training binary logic gate networks is difficult because they are not differentiable.
- We propose relaxing logic gate networks to differentiable logic gate networks to allow for gradient-based training.
- Relaxation involves replacing hard binary activations with probabilistic activations.
- Logic gates are replaced by computing the expected value of the activation given probabilities of independent inputs.
- Choice of logic gate is represented by a categorical probability distribution.
- Output neurons are aggregated to produce graded outputs.
Training considerations
- Randomly initialize connections and parameters of each neuron
- Draw elements of w from standard normal distribution
- Use same number of neurons in each layer (except input)
- Use 4-8 layers
- Train with Adam optimizer at constant learning rate of 0.01
- Discretize probability distributions by taking mode
- Most neurons converge to one logic gate operation
- Classification: group output into k groups, count number of 1s
- Regression: use affine transformation to transform range of predictions
Remarks
- Computational detail: use larger data types instead of Boolean
- Computational speedup on CPUs and GPUs
- Memory considerations: only need to store 4-bit information
- Pruning model for speedup, but requires storing connections
- Use Adam optimizer for training
Current limitations and opportunities
- Differentiable logic gate networks have higher training cost than conventional neural networks
- Practical computational cost can be reduced
- Asymptotically, differentiable logic gate networks are cheaper to train
- Edge computing and embedded machine learning require small architectures
- Training cost is not a concern for edge computing and embedded machine learning
- Experiments performed on MONK, Adult Census, Breast Cancer, MNIST and CIFAR-10 data sets
- Method outperforms logistic regression and larger neural network on MONK-3
Adult and breast cancer
- Method performs similarly to neural networks and logistic regression on Adult Census data set
- Method achieves best performance and fastest inference speed on Breast Cancer data set
Mnist
- Our logic network achieves better performance than the fastest method (FINN) while using less than 10% of the number of binary operations.
- Our model is 12x faster than FINN on an NVIDIA A6000 GPU, even though it only achieves 7% utilization.
- Our model requires substantially fewer operations than each of the baselines.
Cifar-10
- Benchmarked method on MNIST and CIFAR-10
- Reduced color-channel resolution of CIFAR-10 images
- Employed binary embedding
- No data augmentation or dropout
- Outperforms neural networks with small margin
- Requires less than 0.1% of memory footprint
- In comparison to best fully-connected neural network baselines, does not achieve same performance
- Sparse neural networks more competitive with respect to speed
Distribution of logic gates
- Constant 0/1 operator is rarely used and not present in the last layer
- First layer has more ‘and’, ’nand’, ‘or’, and ’nor’
- Second and third layers have more ‘A’, ‘B’, ‘¬A’, and ‘¬B’s
- Last layer has more ‘xor’ and ‘xnor’
- Implications (e.g., A ⇒ B) are rarely used
Conclusion
- Presented novel approach to train logic gate networks
- Allows for efficient neural networks with high accuracy
- Source code released to foster future research
- Output bits can be aggregated into binary number to reduce memory bandwidth
- Training times and standard deviations provided
- Trade-off between depth and accuracy similar to regular neural networks
- Sparsity arises from binary logic gate operators and number of pairs of neurons covered
- Randomly selected sparse connections work well
- T-norms and T-conorms used for real-valued logics
- Axioms for multi-valued logics defined
- Results on MONK, Adult, Breast Cancer, MNIST, and CIFAR-10 data sets
- Inference times per data point for 1 CPU thread
- GPU and CPU used for inference times