Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Neural kernels have improved performance on different data modalities.
  • Neural kernels require more compute, limiting their application to smaller datasets.
  • This work massively parallelizes neural kernel computation across many GPUs.
  • This approach enables kernel regression on large datasets (up to 5 million examples).
  • Results on protein and small molecule prediction tasks are competitive with SotA methods.

Paper Content

Introduction

  • Kernel methods are often contrasted with deep learning
  • Recent advances in machine learning have identified and developed correspondences between the two
  • Kernel regression has been used to better understand neural networks and deep learning
  • A neural network can be viewed as a random function with a specific covariance function, or kernel
  • Computing a neural kernel for the CIFAR-10 dataset takes significantly more compute than computing a standard kernel
  • Kernel regression inference scales cubically in time and quadratically in memory with dataset size, which has limited the study of infinite-width models to small datasets
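
The cubic-time and quadratic-memory cost mentioned above can be seen in a minimal sketch of exact kernel regression. This is an illustration only: the squared-exponential kernel is a cheap stand-in for a neural kernel, and the names rbf_kernel, kernel_regression_predict, and ridge are assumptions made for this example, not names from the paper.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel; a cheap stand-in for a neural kernel."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)


def kernel_regression_predict(x_train, y_train, x_test, ridge=1e-6):
    """Exact kernel (ridge) regression with a dense direct solve."""
    n = x_train.shape[0]
    # Storing the full n x n kernel matrix costs O(n^2) memory.
    k_train = rbf_kernel(x_train, x_train)
    # A dense direct solve of (K + ridge * I) alpha = y costs O(n^3) time.
    alpha = np.linalg.solve(k_train + ridge * np.eye(n), y_train)
    # Predictions are kernel-weighted combinations of the solved coefficients.
    return rbf_kernel(x_test, x_train) @ alpha
```

At n = 5 million, a float32 kernel matrix alone has 2.5e13 entries (about 100 TB), which is why a dense direct solve like this one stops being practical long before that scale.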

Contributions

  • Parallelize and scale up existing algorithms to many more machines
  • Consider significantly larger datasets up to 5 million examples
  • Study performance changes as more data is added
  • Consider high resolution images from Tiny Imagenet dataset
  • Not restricted to image data, obtain results for protein sequence and small molecule datasets
  • Massively parallelize computation of neural kernels
  • Use distributed, preconditioned conjugate gradients algorithm for inference
  • Demonstrate scaling laws across several orders of magnitude
  • Explore other data modalities, obtain competitive results with SotA methods
  • Recent advances in large scale inference for Gaussian processes and kernel regression
  • Wang et al. [2019] showed how to solve large-scale linear systems within GPs using conjugate gradients
  • Pre-conditioner (partially-pivoted Cholesky decomposition) is important for finite precision
  • Rahimi and Recht [2007] showed that stationary kernels can be approximated using a finite set of random basis functions
  • Han et al. [2022] developed random-feature approximations for expressive, nonstationary kernels
  • Stochastic variational inference (SVI) [Hensman et al., 2013] is a promising approach to scale GPs
  • KISS-GP [Wilson and Nickisch, 2015, Stanton et al., 2021] interpolates over a grid of inducing points
  • EigenPro and EigenPro2 [Ma and Belkin, 2017, 2019] accelerate and scale up kernel regression
  • Neural network Gaussian process (NNGP) was first noted by Neal [1994] and analytically derived by Williams [1996]
  • Neural Tangent Kernel (NTK) [Jacot et al., 2018] extends to many architectures
  • Neural Tangents [Novak et al., 2020] library enables practical applications of neural kernels
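
The preconditioned conjugate gradients idea listed above can be sketched on a single machine as follows. This is not the paper's distributed implementation: the function name preconditioned_cg and its parameters are assumptions for illustration, and the usage note swaps in a simple diagonal (Jacobi) preconditioner in place of the partially pivoted Cholesky preconditioner named in the list.

```python
import numpy as np


def preconditioned_cg(matvec, b, precond, tol=1e-8, max_iters=1000):
    """Solve A x = b for symmetric positive-definite A.

    matvec(v) returns A @ v; precond(r) applies an approximate inverse of A.
    Only matrix-vector products are needed, so A never has to be factorized.
    """
    x = np.zeros_like(b, dtype=float)
    r = b - matvec(x)            # initial residual
    z = precond(r)               # preconditioned residual
    p = z.copy()                 # first search direction
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for _ in range(max_iters):
        # Stop once the relative residual is small enough.
        if np.linalg.norm(r) <= tol * b_norm:
            break
        ap = matvec(p)
        step = rz / (p @ ap)
        x = x + step * p
        r = r - step * ap
        z = precond(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

For a kernel matrix K held in memory, a call might look like preconditioned_cg(lambda v: K @ v, y, lambda r: r / np.diag(K)). In a distributed setting the matvec would presumably be the part split across workers, since it is the only step that touches the full matrix.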

Approaches to large-scale kernel methods

  • Large-scale kernel regression approaches discussed
  • Challenges introduced by neural kernels addressed
  • Upper diagonal of large kernel matrix split into smaller blocks for multi-threading of IO
  • Blocks batched further and computed on several GPUs
  • Kernel split by rows or columns across workers
  • Classification error rate compared for four different kernel approximations
  • Two main approaches to applying kernel methods to larger datasets
  • Low-rank approximations, such as Nyström and subset of regressors, can be inverted in O(r^2 n) time, where r is the rank of the approximation and n the dataset size
  • Block diagonal approximation sets all kernel entries outside of some blocks along the diagonal to zero
  • Bayesian committee machine partitions the dataset, fitting separate regressors to the partitions
  • Iterative solvers can be used to solve the linear system to a small residual
  • Conjugate gradients (CG) used
  • Pre-conditioning essential
  • Challenges from neural kernels discussed
  • Performance comparison of different methods for 10 layer Myrtle NTK on 1.6 million training examples from the CIFAR-5m dataset
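
The block decomposition described above can be sketched as follows. Because a kernel matrix is symmetric, only blocks on or above the diagonal need to be computed. The function names and the round-robin assignment of blocks to workers are assumptions for illustration; the paper's actual scheduling may differ.

```python
def upper_triangle_blocks(n, block_size):
    """Yield (row_start, row_end, col_start, col_end) for every block that
    lies on or above the diagonal of an n x n symmetric matrix."""
    starts = range(0, n, block_size)
    for row_start in starts:
        for col_start in starts:
            if col_start >= row_start:
                yield (
                    row_start,
                    min(row_start + block_size, n),
                    col_start,
                    min(col_start + block_size, n),
                )


def assign_round_robin(blocks, num_workers):
    """Distribute blocks over workers in round-robin order (hypothetical
    scheduling policy, chosen here only for simplicity)."""
    assignment = [[] for _ in range(num_workers)]
    for i, block in enumerate(blocks):
        assignment[i % num_workers].append(block)
    return assignment
```

With b block rows, this enumerates b(b+1)/2 blocks instead of b^2, roughly halving the kernel computation. Each worker would then compute its blocks independently and write them out, which is what makes the IO easy to multi-thread.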

Scaling laws for neural kernels

  • Neural scaling laws show how performance of neural networks improves with more data and parameters
  • Performance improves as a power law, with many works seeking to estimate the exponent
  • Projecting potential improvement in performance from scaling up neural networks is important
  • It is unclear if power-law scaling originates from data or model choice
  • Neural kernels are nonparametric and capacity scales automatically with more data
  • CIFAR-5m dataset used to extend analysis over 2 orders of magnitude
  • Evaluation on CIFAR-10 and 10.1 test sets for comparison against existing results
  • Power-law scaling observed across 4-5 orders of magnitude
  • Scaling exponent increases with complexity of kernel
  • Tiny ImageNet dataset used to evaluate neural kernels
  • 44.7% accuracy achieved on the test set, competitive with modern finite neural network architectures trained without data augmentation
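
A scaling exponent of the kind discussed above is commonly estimated by fitting a straight line in log-log space. The sketch below assumes the model error = prefactor * n^(-exponent); the function name fit_power_law is hypothetical, and the fitting procedure is a generic least-squares fit rather than the one used in the paper.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of errors ~ prefactor * sizes**(-exponent),
    done as a linear regression on log(sizes) vs. log(errors)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    # A falling error curve has negative slope, so the exponent is -slope.
    return math.exp(intercept), -slope
```

A larger fitted exponent means error falls faster as data is added, which is the sense in which the exponent is said to increase with the complexity of the kernel.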

Data augmentation

  • Data augmentation is widely used in deep learning applied to image data
  • AutoAug and RandAug are more effective augmentation strategies
  • Data augmentation has a long history in SVMs
  • Data augmentation has been used little in kernel methods
  • Horizontal flips, RandAug, and random crops are used to increase training set size
  • Highest published accuracy for a kernel method on CIFAR-10 is 91.2%
  • Gap in accuracy between kernels and finite neural networks still remains
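
For kernel regression, augmentation enlarges the training set up front, since every augmented example becomes a row and column of the kernel matrix. A minimal sketch of that idea, covering only horizontal flips and random crops (RandAug is omitted), is shown below. The function names, the reflect padding, and the default pad of 4 pixels are assumptions for illustration.

```python
import numpy as np


def horizontal_flip(images):
    """Flip a batch of images of shape (N, H, W, C) left to right."""
    return images[:, :, ::-1, :]


def random_crop(images, pad, rng):
    """Pad each image by `pad` pixels, then crop back to the original size
    at a random offset."""
    n, h, w, _ = images.shape
    padded = np.pad(
        images, ((0, 0), (pad, pad), (pad, pad), (0, 0)), mode="reflect"
    )
    out = np.empty_like(images)
    for i in range(n):
        top = rng.integers(0, 2 * pad + 1)
        left = rng.integers(0, 2 * pad + 1)
        out[i] = padded[i, top:top + h, left:left + w, :]
    return out


def augment_dataset(images, labels, num_copies, pad=4, seed=0):
    """Return the original data followed by `num_copies` augmented copies."""
    rng = np.random.default_rng(seed)
    all_images, all_labels = [images], [labels]
    for _ in range(num_copies):
        # Flip each image with probability 0.5, then apply a random crop.
        flip_mask = rng.random(images.shape[0]) < 0.5
        batch = np.where(
            flip_mask[:, None, None, None], horizontal_flip(images), images
        )
        all_images.append(random_crop(batch, pad, rng))
        all_labels.append(labels)
    return np.concatenate(all_images), np.concatenate(all_labels)
```

Each extra copy multiplies the number of training points, so the kernel matrix grows quadratically with the augmentation factor.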

Sequence and graph data

  • Developing neural kernels as a method for structured data modalities
  • Protein function prediction benchmark motivated by protein design for targeted gene therapy
  • Small molecule dataset called ogbg-molpcba from Open Graph Benchmark
  • Derive kernels for deep graph neural networks and fit to training set of 350K examples
  • Evaluate predictions using Spearman correlation and mean average precision
  • Traditional covariance functions do not capture task-relevant similarities in high-dimensional inputs
  • Connection between neural networks and neural kernels in the infinite-width limit
  • Hyperparameters tuned with 1k trials on the Google Vizier service
  • Inference on ogbg-molpcba and all splits of the AAV dataset
  • Kernel regression scales remarkably well in terms of predictive performance
  • Neural kernels offer stable solutions with relatively little optimization and hyperparameter tuning
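
Of the two evaluation metrics named above, Spearman correlation is simple enough to sketch directly: it is the Pearson correlation of the ranks of the predictions and targets. The helper names below are hypothetical, and ties are handled with average ranks, which is a common convention but an assumption about the paper's evaluation.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(predictions, targets):
    """Spearman rank correlation between predictions and targets."""
    return pearson(average_ranks(predictions), average_ranks(targets))
```

Because only ranks matter, the metric rewards getting the ordering of examples right even when the predicted values are off by a monotone transformation.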

B Data preprocessing

  • Data preprocessing used for image classification
  • Data preprocessing used for sequence prediction
  • Data preprocessing used for molecular property prediction

B.1 Image data: regularized ZCA preprocessing

  • Preprocess inputs with regularized ZCA whitening for convolutional kernels on image data
  • Regularization parameter for CV8 is 3.0, for Myrtle and Tuned Myrtle is 0.1
  • Regularized ZCA helps performance of convolutional neural kernels on image classification tasks
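
A minimal sketch of regularized ZCA whitening on flattened images follows. The exact form of the regularizer is an assumption: here the regularization parameter reg is scaled by the mean eigenvalue of the data covariance before being added to each eigenvalue, which is one common formulation and may not match the paper's definition. The function names are hypothetical.

```python
import numpy as np


def fit_regularized_zca(x, reg):
    """Fit ZCA whitening on data x of shape (n, d).

    Returns the feature mean and a symmetric (d, d) whitening matrix.
    """
    mean = x.mean(axis=0)
    centered = x - mean
    cov = centered.T @ centered / x.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Guard against tiny negative eigenvalues from round-off.
    eigvals = np.maximum(eigvals, 0.0)
    # Assumed regularizer: reg times the mean eigenvalue, added to each one.
    scale = 1.0 / np.sqrt(eigvals + reg * eigvals.mean())
    # Rotate into the eigenbasis, rescale, and rotate back (the "ZCA" part).
    whitening = (eigvecs * scale) @ eigvecs.T
    return mean, whitening


def apply_zca(x, mean, whitening):
    """Center and whiten data with statistics fitted on the training set."""
    return (x - mean) @ whitening
```

With reg = 0 the whitened training data has identity covariance. Larger values leave low-variance directions less amplified, which limits how much noise in those directions gets blown up.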

B.2 Sequence data preprocessing

  • AAV data consists of sequences of amino acids
  • Sequences were one-hot encoded and padded
  • Features and labels were normalized
  • Global average pooling after convolution layer was tuned
  • Weight and bias initialization standard deviation was tuned
  • Filter size of first convolutional layer was tuned
  • Number of convolutional layers before each average pooling layer was tuned
  • Option to add 0, 1, or 2 dense layers was tuned
  • ZCA regularization strength was tuned
  • Small subset of 1,280 training examples used for tuning
  • Figure 1 shows upper diagonal of large kernel matrix split into smaller blocks
  • Figure 2 shows dataset size scaling for neural kernels on CIFAR-5m
  • Figure 4 shows neural kernel performance on CIFAR-10 with data augmentation
  • Figure 5 shows scatter plot of test MSE
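
The one-hot encoding and padding step for amino-acid sequences listed in this section can be sketched as follows. The 20-letter alphabet, the function name one_hot_pad, and the choice to represent padding positions as all-zero rows are assumptions for illustration; the paper may use a different alphabet or a dedicated padding token.

```python
# The 20 standard amino acids in one-letter code (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_pad(sequences, max_len=None, alphabet=AMINO_ACIDS):
    """Encode sequences as nested lists of shape
    (num_sequences, max_len, len(alphabet)).

    Positions past the end of a sequence are all-zero rows. Sequences longer
    than max_len are truncated.
    """
    index = {letter: i for i, letter in enumerate(alphabet)}
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)
    encoded = []
    for seq in sequences:
        rows = []
        for pos in range(max_len):
            row = [0.0] * len(alphabet)
            if pos < len(seq):
                row[index[seq[pos]]] = 1.0
            rows.append(row)
        encoded.append(rows)
    return encoded
```

Padding every sequence to a common length gives fixed-size inputs, which is what a convolutional kernel over the sequence dimension needs.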