Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Neural kernels have improved performance on different data modalities.
- Neural kernels require more compute, limiting their application to smaller datasets.
- This work massively parallelizes neural kernel computation across many GPUs.
- This approach enables kernel regression on large datasets (up to 5 million examples).
- Results on protein and small molecule prediction tasks are competitive with SotA methods.
Paper Content
Introduction
- Kernel methods are often contrasted with deep learning
- Recent advances in machine learning have identified and developed correspondences between the two
- Kernel regression has been used to better understand neural networks and deep learning
- Neural networks can be viewed as a random function with a specific covariance function or kernel
- Computing the kernel for the CIFAR-10 dataset takes significantly more compute than standard kernels
- Challenges posed from the cubic scaling in time and quadratic scaling in memory of inference for kernel regression with dataset size limit understanding of infinite-width models to small datasets
Contributions
- Parallelize and scale up existing algorithms to many more machines
- Consider significantly larger datasets up to 5 million examples
- Study performance changes as more data is added
- Consider high resolution images from Tiny Imagenet dataset
- Not restricted to image data, obtain results for protein sequence and small molecule datasets
- Massively parallelize computation of neural kernels
- Use distributed, preconditioned conjugate gradients algorithm for inference
- Demonstrate scaling laws across several orders of magnitude
- Explore other data modalities, obtain competitive results with SotA methods
Related work
- Recent advances in large scale inference for Gaussian processes and kernel regression
- Wang et al. [2019] showed how to solve large-scale linear systems within GPs using conjugate gradients
- Pre-conditioner (partially-pivoted Cholesky decomposition) is important for finite precision
- Rahimi and Recht [2007] showed that stationary kernels can be approximated using a finite set of random basis functions
- Han et al. [2022] developed random-feature approximations for expressive, nonstationary kernels
- Stochastic variational inference (SVI) [Hensman et al., 2013] is a promising approach to scale GPs
- KISS-GP [Wilson andNickisch, 2015, Stanton et al., 2021] interpolates over a grid of inducing points
- EigenPro and EigenPro2 [Ma andBelkin, 2017, 2019] accelerate and scale up kernel regression
- Neural network Gaussian process (NNGP) was first noted by Neal [1994] and analytically derived by Williams [1996]
- Neural Tangent Kernel (NTK) [Jacot et al., 2018] extends to many architectures
- Neural Tangents [Novak et al., 2020] library enables practical applications of neural kernels
Approaches to large-scale kernel methods
- Large-scale kernel regression approaches discussed
- Challenges introduced by neural kernels addressed
- Upper diagonal of large kernel matrix split into smaller blocks for multi-threading of IO
- Blocks batched further and computed on several GPUs
- Kernel split by rows or columns across workers
- Classification error rate compared for four different kernel approximations
- Two main approaches to applying kernel methods to larger datasets
- Low-rank approximations, Nyström and subset of regressors, can be inverted in time O(r2n)
- Block diagonal approximation sets all kernel entries outside of some blocks along the diagonal to zero
- Bayesian committee machine partitions the dataset, fitting separate regressors to the partitions
- Iterative solvers can be used to solve the linear system to a small residual
- Conjugate gradients (CG) used
- Pre-conditioning essential
- Challenges from neural kernels discussed
- Performance comparison of different methods for 10 layer Myrtle NTK on 1.6 million training examples from the CIFAR-5m dataset
Scaling laws for neural kernels
- Neural scaling laws show how performance of neural networks improves with more data and parameters
- Performance improves as a power law, with many works seeking to estimate the exponent
- Projecting potential improvement in performance from scaling up neural networks is important
- It is unclear if power-law scaling originates from data or model choice
- Neural kernels are nonparametric and capacity scales automatically with more data
- CIFAR-5m dataset used to extend analysis over 2 orders of magnitude
- Evaluation on CIFAR-10 and 10.1 test sets for comparison against existing results
- Power-law scaling observed across 4-5 orders of magnitude
- Scaling exponent increases with complexity of kernel
- Tiny ImageNet dataset used to evaluate neural kernels
- 44.7% accuracy achieved on test set, competitive to modern finite neural network architectures without data augmentation
Data augmentation
- Data augmentation is widely used in deep learning applied to image data
- AutoAug and RandAug are more effective augmentation strategies
- Data augmentation has a long history in SVMs
- Data augmentation has been used little in kernel methods
- Horizontal flips, RandAug, and random crops are used to increase training set size
- Highest published accuracy for a kernel method is 91.2%
- Gap in accuracy between kernels and finite neural networks still remains
Sequence and graph data
- Developing neural kernels as a method for structured data modalities
- Protein function prediction benchmark motivated by protein design for targeted gene therapy
- Small molecule dataset called ogbg-molpcba from Open Graph Benchmark
- Derive kernels for deep graph neural networks and fit to training set of 350K examples
- Evaluate predictions using Spearman correlation and mean average precision
- Traditional covariance functions do not capture task-relevant similarities in high-dimensional inputs
- Connection between neural networks and neural kernels in the infinite-width limit
- Tuning hyperparameters using 1k trials of different hyperparameters on Google Vizier service
- Inference on ogbg-molpcba and all splits of the AAV dataset
- Kernel regression scales remarkably well in terms of predictive performance
- Neural kernels offer stable solutions with relatively little optimization and hyperparameter tuning
B data preprocessing
- Data preprocessing used for image classification
- Data preprocessing used for sequence prediction
- Data preprocessing used for molecular property prediction
B.1 image data regularized zca preprocessing
- Preprocess inputs with regularized ZCA whitening for convolutional kernels on image data
- Regularization parameter for CV8 is 3.0, for Myrtle and Tuned Myrtle is 0.1
- Regularized ZCA helps performance of convolutional neural kernels on image classification tasks
B.2 sequence data preprocessing
- AAV data consists of sequences of amino acids
- Sequences were one-hot encoded and padded
- Features and labels were normalized
- Global average pooling after convolution layer was tuned
- Weight and bias initialization standard deviation was tuned
- Filter size of first convolutional layer was tuned
- Number of convolutional layers before each average pooling layer was tuned
- Option to add 0, 1, or 2 dense layers was tuned
- ZCA regularization strength was tuned
- Small subset of 1,280 training examples used for tuning
- Figure 1 shows upper diagonal of large kernel matrix split into smaller blocks
- Figure 2 shows dataset size scaling for neural kernels on CIFAR-5m
- Figure 4 shows neural kernel performance on CIFAR-10 with data augmentation
- Figure 5 shows scatter plot of test MSE