Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • The convergence speed of gradient descent depends on the choice of learning rate
  • A single-loop method can achieve optimal convergence rate without knowledge of distance to solution set
  • Method does not require additional multiplicative log factors
  • Experiments show method matches hand-tuned learning rates
  • Method is practical, efficient and requires no additional evaluations
  • Open-source implementation available

Paper Content

Introduction

  • Problem of unconstrained convex minimization
  • Standard approach is subgradient method
  • Step size (learning rate) affects convergence
  • Optimal step size requires knowledge of distance to solution
  • Dual Averaging with D-Adaptation algorithm achieves optimal rate of convergence
  • No need for hyper-parameter grid searches
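To illustrate why the step size matters, the sketch below runs the plain subgradient method on the one-dimensional function f(x) = |x|. The test function, the starting point, and the helper name `subgradient_method` are assumptions chosen for illustration; they are not taken from the paper. With the classical constant step D / (G sqrt(n)) the averaged iterate meets the DG / sqrt(n) bound, while a step built from a 100-fold underestimate of D barely moves.

```python
import math


def sign(v):
    """Return -1, 0 or 1; a valid subgradient of |x| at v."""
    return (v > 0) - (v < 0)


def subgradient_method(subgrad, x0, step, n):
    """Run n constant-step subgradient steps; return the average of x_0..x_{n-1}."""
    x = x0
    total = 0.0
    for _ in range(n):
        total += x
        x -= step * subgrad(x)
    return total / n


# Assumed toy setup: f(x) = |x|, so x_* = 0, D = |x0 - x_*| = 1 and G = 1.
x0, D, G, n = 1.0, 1.0, 1.0, 100
bound = D * G / math.sqrt(n)

# Step size that uses the true distance D.
tuned = subgradient_method(sign, x0, D / (G * math.sqrt(n)), n)
# Same formula with D underestimated by a factor of 100.
underestimated = subgradient_method(sign, x0, (D / 100) / (G * math.sqrt(n)), n)

print("f(avg) with true D:         ", abs(tuned))
print("f(avg) with underestimate:  ", abs(underestimated))
print("DG/sqrt(n) reference bound: ", bound)
```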

Algorithm

  • Algorithm 1 is a modification of AdaGrad step size applied to weighted dual averaging
  • Algorithm 1 uses a lower bound $d_k$ on D
  • Algorithm 1 has two key differences from the classical bound
  • Theorem 1 states that, for a convex G-Lipschitz function, Algorithm 1 returns a point $\hat{x}_n$ such that $f(\hat{x}_n) - f_* = O(DG/\sqrt{n+1})$ as n → ∞, where $D = \|x_0 - x_*\|$
  • Theorem 2 states that Algorithm 1 run for $n \ge \log_2(D/d_0)$ steps has a guarantee that is significantly better than using the subgradient method
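The following is a minimal sketch of dual averaging with a D-Adaptation style lower bound, written from the description above rather than copied from the paper. The function name, the exact indexing of the step-size sequence, and the toy problem (f(x) = |x| started at 1.0 with d_0 = 0.1) are assumptions; consult the paper's Algorithm 1 for the authoritative form. The estimate `d` only ever increases, and for a convex problem with d_0 ≤ D it stays below the true distance D.

```python
import math


def sign(v):
    return (v > 0) - (v < 0)


def dual_averaging_d_adaptation(subgrad, x0, d0, G, n):
    """Sketch: AdaGrad-norm step size on d-weighted dual averaging (assumed form)."""
    p = len(x0)
    s = [0.0] * p              # weighted gradient sum s_k
    gamma = 1.0 / G            # gamma_0
    grad_sq_sum = 0.0          # sum of ||g_i||^2
    weighted = 0.0             # sum of gamma_i * d_i^2 * ||g_i||^2
    d = d0
    x = list(x0)
    d_history = [d]
    avg_num, avg_den = [0.0] * p, 0.0
    for _ in range(n + 1):
        g = subgrad(x)
        # Accumulate the d-weighted average of the iterates.
        avg_num = [a + d * xj for a, xj in zip(avg_num, x)]
        avg_den += d
        g_sq = sum(gj * gj for gj in g)
        weighted += gamma * d * d * g_sq      # uses gamma_k, before the update
        s = [sj + d * gj for sj, gj in zip(s, g)]
        grad_sq_sum += g_sq
        gamma = 1.0 / math.sqrt(G * G + grad_sq_sum)
        s_norm = math.sqrt(sum(sj * sj for sj in s))
        if s_norm > 0.0:
            # Candidate lower bound on D; keep the running maximum.
            d_hat = (gamma * s_norm ** 2 - weighted) / (2.0 * s_norm)
            d = max(d, d_hat)
        x = [x0j - gamma * sj for x0j, sj in zip(x0, s)]
        d_history.append(d)
    x_hat = [a / avg_den for a in avg_num]
    return x_hat, d_history


# Assumed toy problem: f(x) = |x| with x_0 = 1.0, so D = 1.0 and G = 1.0.
x_hat, d_hist = dual_averaging_d_adaptation(
    lambda x: [sign(x[0])], [1.0], 0.1, 1.0, 200
)
print("final estimate of D:", d_hist[-1], " f(x_hat):", abs(x_hat[0]))
```

The guard on `s_norm` is an implementation choice in this sketch: when the weighted gradient sum is zero the candidate bound is undefined, so the previous estimate is kept.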

D-Adapted AdaGrad

  • D-Adaptation technique can be applied to coordinate-wise scaling variant of AdaGrad
  • Algorithm 2 presents this method
  • Estimates distance to solution in ∞-norm instead of Euclidean norm
  • Same adaptive convergence rate as AdaGrad up to constant factors
  • Theorem 3 gives the corresponding guarantee for a convex p-dimensional function, with the gradient bound $G_\infty$ used to initialize $a_0$
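A coordinate-wise variant can be sketched in the same spirit. The code below is an assumption-laden illustration, not the paper's Algorithm 2: the function name, the indexing of the per-coordinate scaling `a`, and the two-dimensional test function are all hypothetical. It shows the two ingredients named above, per-coordinate AdaGrad scaling and a distance estimate measured in the ∞-norm (hence the 1-norm of `s` in the denominator).

```python
import math


def sign(v):
    return (v > 0) - (v < 0)


def d_adapted_adagrad(subgrad, x0, d0, g_inf, n):
    """Sketch of coordinate-wise D-Adaptation (assumed form)."""
    p = len(x0)
    s = [0.0] * p
    a_sq = [g_inf ** 2] * p    # a_0 = G_inf in every coordinate
    weighted = 0.0             # sum of d_i^2 * sum_j g_ij^2 / a_ij
    d = d0
    x = list(x0)
    d_history = [d]
    for _ in range(n):
        g = subgrad(x)
        # Uses the scaling a_k from before this step's update.
        weighted += d * d * sum(gj * gj / math.sqrt(aj) for gj, aj in zip(g, a_sq))
        s = [sj + d * gj for sj, gj in zip(s, g)]
        a_sq = [aj + gj * gj for aj, gj in zip(a_sq, g)]
        a = [math.sqrt(aj) for aj in a_sq]
        s_l1 = sum(abs(sj) for sj in s)
        if s_l1 > 0.0:
            scaled_sq = sum(sj * sj / aj for sj, aj in zip(s, a))
            d_hat = (scaled_sq - weighted) / (2.0 * s_l1)
            d = max(d, d_hat)
        x = [x0j - sj / aj for x0j, sj, aj in zip(x0, s, a)]
        d_history.append(d)
    return x, d_history


# Assumed test function: f(x) = |x_1 - 1| + |x_2 + 2|, minimized at (1, -2).
# From x_0 = (0, 0) the infinity-norm distance to the solution is 2.
def subgrad(x):
    return [sign(x[0] - 1.0), sign(x[1] + 2.0)]


x_last, d_hist = d_adapted_adagrad(subgrad, [0.0, 0.0], 0.01, 1.0, 300)
print("final estimate:", d_hist[-1], " last iterate:", x_last)
```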

Discussion

  • D-Adaptation is illustrated on a toy problem: minimizing an absolute value function
  • The algorithm starts with a value of $d_0$ that is lower than the known value of D
  • The value of $d_k$ typically does not asymptotically approach D
  • The algorithm uses a hyper-gradient quantity to estimate the magnitude of the optimal learning rate
  • The algorithm is applicable to convex Lipschitz functions and can be extended to the stochastic setting
  • Major classes of approaches to optimizing Lipschitz functions are reviewed

Polyak step size

  • Polyak step size can replace the requirement of knowledge of D
  • Polyak step size gives optimal rate of convergence without additional log factors
  • Estimates or approximations of f* can lead to unstable convergence
  • Restarting scheme with lower bounds on f* can converge within a log factor of the optimal rate
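For reference, the classical Polyak step size can be sketched as below. It needs the optimal value f* instead of D, which is the trade-off discussed above. The function name and the two-dimensional test problem are assumptions made for illustration.

```python
import math


def sign(v):
    return (v > 0) - (v < 0)


def polyak_subgradient(f, subgrad, x0, f_star, n):
    """Subgradient method with step (f(x_k) - f_star) / ||g_k||^2."""
    x = list(x0)
    iterates = [list(x)]
    for _ in range(n):
        g = subgrad(x)
        g_sq = sum(gj * gj for gj in g)
        if g_sq == 0.0:
            break  # zero subgradient: x is already a minimizer
        step = (f(x) - f_star) / g_sq
        x = [xj - step * gj for xj, gj in zip(x, g)]
        iterates.append(list(x))
    return iterates


# Assumed test problem: f(x) = |x_1| + 2|x_2|, with f_star = 0 at the origin.
def f(x):
    return abs(x[0]) + 2.0 * abs(x[1])


def subgrad(x):
    return [sign(x[0]), 2.0 * sign(x[1])]


iterates = polyak_subgradient(f, subgrad, [1.0, 1.0], 0.0, 20)
distances = [math.hypot(x[0], x[1]) for x in iterates]
print("final objective value:", f(iterates[-1]))
```

With the exact f* the distance to the minimizer never increases from one iterate to the next; that property is lost when f* is only estimated, which is the instability noted in the list above.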

Exact line searches

  • Method gives optimal rate without requiring knowledge of problem parameters
  • Relaxing exact line search to approximate line search introduces additional dependencies on problem constants
  • Bisection algorithm used to replace log dependence on $d_{\max}$ with log log dependence
  • Estimating $d_{\max}$ allows for replacing $d_{\max}$ with D in bound

Coin-betting

  • Coin-betting approaches can be used when knowledge of G is assumed but not D.
  • Coin-betting methods achieve the best regret attainable without knowledge of D; this is still worse than the best possible regret when D is known.
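As a concrete point of comparison, the sketch below implements the standard one-dimensional Krichevsky-Trofimov (KT) coin-betting scheme. It is a textbook construction and is not taken from the paper summarized here; the function name, the initial wealth, and the test function are assumptions. The method needs subgradients bounded by 1 in absolute value (knowledge of G) but no estimate of D.

```python
def sign(v):
    return (v > 0) - (v < 0)


def kt_coin_betting(subgrad, n, initial_wealth=1.0):
    """One-dimensional KT bettor; assumes |subgrad(x)| <= 1 for all x."""
    wealth = initial_wealth
    outcome_sum = 0.0          # sum of past "coin outcomes" c_i = -g_i
    total = 0.0
    for t in range(1, n + 1):
        beta = outcome_sum / t         # betting fraction, |beta| < 1
        x = beta * wealth              # the bet doubles as the iterate
        c = -subgrad(x)
        wealth += c * x                # wealth is multiplied by (1 + beta * c)
        outcome_sum += c
        total += x
    return total / n, wealth


# Assumed test function: f(x) = |x - 2|, minimized at x = 2, started from 0.
average_iterate, final_wealth = kt_coin_betting(lambda x: sign(x - 2.0), 500)
print("average iterate:", average_iterate, " final wealth:", final_wealth)
```

Because the betting fraction is always strictly inside (-1, 1), the wealth stays positive, which is what lets the scheme keep betting without a learning rate.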

Reward doubling

  • Streeter and McMahan’s reward-doubling technique is similar to the approach in the paper.
  • It tracks the sum of the quantity $x_k g_k$ and compares it to a pre-specified upper bound.
  • When the reward sum exceeds the upper bound, the step size is doubled and the optimizer state is reset.
  • It produces similar results to the coin betting approach.

Machine learning applications

  • Adapt D-Adaptation technique to stochastic optimization
  • Algorithms 3 and 4 are versions of D-Adaptation for SGD and Adam
  • Remove factor of 2 from D bound in Algorithms 3 and 4
  • Norms are weighted instead of unweighted
  • Correction factor of $(1 - \beta_2)$ in the D bound
  • Optional $\gamma_k$ constant sequence as input to the algorithms

Experimental results

  • Compared D-Adapted variants of Adam and SGD on machine learning problems
  • Vary models and datasets to illustrate effectiveness of D-Adaptation
  • Used standard learning rate schedule with base learning rate set by D-Adaptation
  • Plotted mean of multiple seeds with error bars indicating range of 2 standard errors from mean
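The error bars described in the last bullet can be computed as in the sketch below. The function name and the sample values are hypothetical; the sketch assumes the standard error is the sample standard deviation divided by the square root of the number of seeds.

```python
import math
import statistics


def mean_with_error_bars(values, num_se=2.0):
    """Return (mean, lower, upper) with bars of num_se standard errors."""
    mean = statistics.mean(values)
    # Standard error of the mean, using the sample standard deviation.
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - num_se * se, mean + num_se * se


# Hypothetical per-seed metric values, for illustration only.
mean, lower, upper = mean_with_error_bars([1.0, 2.0, 3.0])
print(mean, lower, upper)
```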

Convex problems

  • 5 benchmark problems from LIBSVM repository were used
  • Training runs for 100 epochs, with 10-fold learning-rate decreases at epochs 60, 80, and 95
  • No weight decay and batch-size 16
  • Hyper-parameters set to defaults
  • Learning rate for Adam chosen with grid search
  • D-Adaptation matches or exceeds performance of grid-search based learning rate

Convolutional image classification

  • Used 3 common datasets for optimization method testing: CIFAR10, CIFAR100, ImageNet 2012
  • Used 3 different architectures: Wide-Resnet, DenseNet, vanilla ResNet
  • D-Adaptation matches or exceeds baseline learning rates on each problem

LSTM recurrent neural networks

  • IWSLT14 German-to-English dataset is used to benchmark machine translation models
  • LSTM model is commonly used for this problem
  • Inverse-square-root learning rate schedule is used for baseline and D-Adaptation
  • Model achieves comparable performance to baseline without tuning learning rate

Masked language modelling

  • BERT is a popular approach to pretraining transformer models.
  • We use the 110M parameter RoBERTa variant of BERT for experiments.
  • We train on the Book-Wiki corpus.
  • D-Adaptation matches the baseline in test-set perplexity.

Auto-regressive language modelling

  • Used GPT decoder-only transformer architecture for auto-regressive language modelling
  • Trained on large Book-Wiki corpus
  • D-Adaptation is comparable to baseline with negligible perplexity difference

Object detection

  • COCO 2017 object detection task is a popular benchmark in computer vision
  • Trained Faster-RCNN model using Detectron2
  • Used pretrained ResNeXt-101-32x8d model
  • Experiments showed D-Adaptation overfitting
  • Increasing weight decay from 0.0001 to 0.00015 improved test-set accuracy

Vision transformers

  • Vision Transformers are a new approach to image classification
  • They are closer to the state of the art than ResNet models
  • Training Vision Transformers requires more resources
  • Training is typically done for 300 epochs
  • Adaptive optimizers such as Adam are used to avoid overfitting
  • PyTorch Image Models framework is used
  • Cosine learning rate schedule is used
  • D-Adaptation under-performs the baseline learning rate

FastMRI

  • FastMRI Knee Dataset is a large-scale release of raw MRI data
  • Reconstruction task is to produce a 2-dimensional, grey-scale image from raw sensor data
  • VarNet 2.0 model was trained on the dataset using code and training setup released by Meta

Recommendation systems

  • Criteo Kaggle Display Advertising dataset is a large, sparse dataset of user click-through events
  • DLRM model is a common benchmark for this problem
  • Our method closely matches the performance of the tuned baseline learning rate

Conclusion

  • We have presented a simple approach to achieving parameter-free learning of convex Lipschitz functions
  • We construct successively better lower bounds on the key unknown quantity: the distance to solution $\|x_0 - x_*\|$
  • Our approach for constructing these lower bounds may be of independent interest
  • Our method is highly practical, demonstrating excellent performance across a range of large and diverse machine learning problems
  • We consider a more general form of Algorithm 1 with arbitrary positive weights $\lambda_k$
  • We bound the sum of inner products $\gamma_k \lambda_k \langle g_k, s_k \rangle$ over time
  • We simplify when the weighting sequence is flat
  • We prove that the initial distance to solution, $D = \|x_0 - x_*\|$, can be lower bounded
  • We bound the gradient error term
  • We prove that the iterates of Algorithm 1 satisfy a bound
  • We prove that the norm of $s_{n+1}$ is bounded
  • We prove that the average iterate $\hat{x}_n$ returned by Algorithm 1 satisfies a bound
  • We prove that the 1-norm of $s_{n+1}$ is bounded
  • We prove that for any $x_*$ in the set of minimizers of f, the initial distance to solution, $D = \|x_0 - x_*\|_\infty$, can be lower bounded
  • We prove a guarantee for returning the point $\hat{x}_t$ at a time t chosen by an arg min over k ≤ n, with a bound that involves $d_{n+1}/d_0$ and n + 1
  • We present results from Logistic Regression, CIFAR10, CIFAR100, and ImageNet experiments
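The lower bound on D mentioned in the list above follows from a short convexity argument. The derivation below is a sketch under the assumption that the weights are $\lambda_k = d_k$, so that $s_{n+1} = \sum_{k=0}^{n} d_k g_k$ with $g_k$ a subgradient of f at $x_k$; the paper's own statement may be more general or differ in constants.

```latex
\begin{align*}
0 \le \sum_{k=0}^{n} d_k \bigl( f(x_k) - f_* \bigr)
  &\le \sum_{k=0}^{n} d_k \langle g_k, x_k - x_* \rangle \\
  &= \langle s_{n+1}, x_0 - x_* \rangle
     - \sum_{k=0}^{n} d_k \langle g_k, x_0 - x_k \rangle \\
  &\le \| s_{n+1} \| \, D
     - \sum_{k=0}^{n} d_k \langle g_k, x_0 - x_k \rangle ,
\end{align*}
% The first inequality is optimality of f_*, the second is convexity,
% and the last is Cauchy-Schwarz with D = \|x_0 - x_*\|.
so that, whenever $s_{n+1} \neq 0$,
\[
D \;\ge\; \frac{\sum_{k=0}^{n} d_k \langle g_k, x_0 - x_k \rangle}{\| s_{n+1} \|} .
\]
```

Every quantity on the right-hand side is observable while the algorithm runs, which is what makes it usable as an estimate of the unknown distance.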