Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Neural kernels have improved performance on different data modalities.
Neural kernels require more compute, limiting their application to smaller datasets.
This work massively parallelizes neural kernel computation across many GPUs.
This approach enables kernel regression on large datasets (up to 5 million examples).
Results on protein and small molecule prediction tasks are competitive with SotA methods.

Paper Content

Introduction

Kernel methods are often contrasted with deep learning
Recent advances in machine learning have identified and developed correspondences between the two
Kernel regression has been used to better understand neural networks and deep learning
Neural networks can be viewed as a random function with a specific covariance function or kernel
Computing the kernel for the CIFAR-10 dataset takes significantly more compute than standard kernels
Challenges posed from the cubic scaling in time and quadratic scaling in memory of inference for kernel regression with dataset size limit understanding of infinite-width models to small datasets

Contributions

Parallelize and scale up existing algorithms to many more machines
Consider significantly larger datasets up to 5 million examples
Study performance changes as more data is added
Consider high resolution images from Tiny Imagenet dataset
Not restricted to image data, obtain results for protein sequence and small molecule datasets
Massively parallelize computation of neural kernels
Use distributed, preconditioned conjugate gradients algorithm for inference
Demonstrate scaling laws across several orders of magnitude
Explore other data modalities, obtain competitive results with SotA methods

Recent advances in large scale inference for Gaussian processes and kernel regression
Wang et al. [2019] showed how to solve large-scale linear systems within GPs using conjugate gradients
Pre-conditioner (partially-pivoted Cholesky decomposition) is important for finite precision
Rahimi and Recht [2007] showed that stationary kernels can be approximated using a finite set of random basis functions
Han et al. [2022] developed random-feature approximations for expressive, nonstationary kernels
Stochastic variational inference (SVI) [Hensman et al., 2013] is a promising approach to scale GPs
KISS-GP [Wilson andNickisch, 2015, Stanton et al., 2021] interpolates over a grid of inducing points
EigenPro and EigenPro2 [Ma andBelkin, 2017, 2019] accelerate and scale up kernel regression
Neural network Gaussian process (NNGP) was first noted by Neal [1994] and analytically derived by Williams [1996]
Neural Tangent Kernel (NTK) [Jacot et al., 2018] extends to many architectures
Neural Tangents [Novak et al., 2020] library enables practical applications of neural kernels

Approaches to large-scale kernel methods

Large-scale kernel regression approaches discussed
Challenges introduced by neural kernels addressed
Upper diagonal of large kernel matrix split into smaller blocks for multi-threading of IO
Blocks batched further and computed on several GPUs
Kernel split by rows or columns across workers
Classification error rate compared for four different kernel approximations
Two main approaches to applying kernel methods to larger datasets
Low-rank approximations, Nyström and subset of regressors, can be inverted in time O(r2n)
Block diagonal approximation sets all kernel entries outside of some blocks along the diagonal to zero
Bayesian committee machine partitions the dataset, fitting separate regressors to the partitions
Iterative solvers can be used to solve the linear system to a small residual
Conjugate gradients (CG) used
Pre-conditioning essential
Challenges from neural kernels discussed
Performance comparison of different methods for 10 layer Myrtle NTK on 1.6 million training examples from the CIFAR-5m dataset

Scaling laws for neural kernels

Neural scaling laws show how performance of neural networks improves with more data and parameters
Performance improves as a power law, with many works seeking to estimate the exponent
Projecting potential improvement in performance from scaling up neural networks is important
It is unclear if power-law scaling originates from data or model choice
Neural kernels are nonparametric and capacity scales automatically with more data
CIFAR-5m dataset used to extend analysis over 2 orders of magnitude
Evaluation on CIFAR-10 and 10.1 test sets for comparison against existing results
Power-law scaling observed across 4-5 orders of magnitude
Scaling exponent increases with complexity of kernel
Tiny ImageNet dataset used to evaluate neural kernels
44.7% accuracy achieved on test set, competitive to modern finite neural network architectures without data augmentation

Data augmentation

Data augmentation is widely used in deep learning applied to image data
AutoAug and RandAug are more effective augmentation strategies
Data augmentation has a long history in SVMs
Data augmentation has been used little in kernel methods
Horizontal flips, RandAug, and random crops are used to increase training set size
Highest published accuracy for a kernel method is 91.2%
Gap in accuracy between kernels and finite neural networks still remains

Sequence and graph data

Developing neural kernels as a method for structured data modalities
Protein function prediction benchmark motivated by protein design for targeted gene therapy
Small molecule dataset called ogbg-molpcba from Open Graph Benchmark
Derive kernels for deep graph neural networks and fit to training set of 350K examples
Evaluate predictions using Spearman correlation and mean average precision
Traditional covariance functions do not capture task-relevant similarities in high-dimensional inputs
Connection between neural networks and neural kernels in the infinite-width limit
Tuning hyperparameters using 1k trials of different hyperparameters on Google Vizier service
Inference on ogbg-molpcba and all splits of the AAV dataset
Kernel regression scales remarkably well in terms of predictive performance
Neural kernels offer stable solutions with relatively little optimization and hyperparameter tuning

B data preprocessing

Data preprocessing used for image classification
Data preprocessing used for sequence prediction
Data preprocessing used for molecular property prediction

B.1 image data regularized zca preprocessing

Preprocess inputs with regularized ZCA whitening for convolutional kernels on image data
Regularization parameter for CV8 is 3.0, for Myrtle and Tuned Myrtle is 0.1
Regularized ZCA helps performance of convolutional neural kernels on image classification tasks

B.2 sequence data preprocessing

AAV data consists of sequences of amino acids
Sequences were one-hot encoded and padded
Features and labels were normalized
Global average pooling after convolution layer was tuned
Weight and bias initialization standard deviation was tuned
Filter size of first convolutional layer was tuned
Number of convolutional layers before each average pooling layer was tuned
Option to add 0, 1, or 2 dense layers was tuned
ZCA regularization strength was tuned
Small subset of 1,280 training examples used for tuning
Figure 1 shows upper diagonal of large kernel matrix split into smaller blocks
Figure 2 shows dataset size scaling for neural kernels on CIFAR-5m
Figure 4 shows neural kernel performance on CIFAR-10 with data augmentation
Figure 5 shows scatter plot of test MSE

Link to paper#

Abstract#

Paper Content#

Introduction#

Contributions#

Related work#

Approaches to large-scale kernel methods#

Scaling laws for neural kernels#

Data augmentation#

Sequence and graph data#

B data preprocessing#

B.1 image data regularized zca preprocessing#

B.2 sequence data preprocessing#

Link to paper

Abstract

Paper Content

Introduction

Contributions

Related work

Approaches to large-scale kernel methods

Scaling laws for neural kernels

Data augmentation

Sequence and graph data

B data preprocessing

B.1 image data regularized zca preprocessing

B.2 sequence data preprocessing