Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

The speed of gradient descent on convex Lipschitz functions depends strongly on the learning rate
A single-loop method can achieve the optimal convergence rate without knowing the distance D from the initial point to the solution set
The convergence rate carries no additional multiplicative log factors
Experiments show method matches hand-tuned learning rates
Method is practical, efficient and requires no additional evaluations
Open-source implementation available

Paper Content

Introduction

Problem of unconstrained convex minimization
Standard approach is subgradient method
Step size (learning rate) affects convergence
Optimal step size requires knowledge of distance to solution
Dual Averaging with D-Adaptation algorithm achieves optimal rate of convergence
No need for hyper-parameter grid searches
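
To make the dependence on the distance to solution concrete, the block below recalls the standard textbook guarantee for the fixed-step subgradient method on a convex $G$-Lipschitz function. The symbols $n$ (number of steps), $\gamma$ (step size), $\hat{x}_n$ (average iterate) and $f_*$ (optimal value) are notation assumed here for illustration, not taken from this summary.

```latex
f(\hat{x}_n) - f_* \;\le\; \frac{D^2}{2\gamma n} + \frac{\gamma G^2}{2},
\qquad
\gamma = \frac{D}{G\sqrt{n}}
\;\Longrightarrow\;
f(\hat{x}_n) - f_* \;\le\; \frac{DG}{\sqrt{n}}.
```

The minimizing step size contains $D$ explicitly, which is why a method that estimates $D$ on the fly can avoid a learning rate search.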

Algorithm

Algorithm 1 is a modification of AdaGrad step size applied to weighted dual averaging
Algorithm 1 uses a lower bound $d_k$ on $D$
Algorithm 1 has two key differences from the classical bound
Theorem 1 states that, for a convex $G$-Lipschitz function $f$, Algorithm 1 returns a point $\hat{x}_n$ such that $f(\hat{x}_n) - f(x_*) = O(DG/\sqrt{n+1})$ as $n \to \infty$, where $D = \|x_0 - x_*\|$
Theorem 2 states that Algorithm 1 run for $n \ge \log_2(D/d_0)$ steps has a guarantee that is significantly better than that of the subgradient method
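
The following is a minimal sketch of how a lower bound $d_k$ on $D$ can be maintained inside weighted dual averaging with an AdaGrad-style step size. It is reconstructed under the assumption that the iterate is `x0 - gamma * s` and that the bound comes from the standard dual-averaging inequality; it is not the authors' reference implementation. The function name, argument names, and the toy problem are hypothetical.

```python
import numpy as np

def d_adapted_dual_averaging(subgrad, x0, d0, G, n_steps):
    """Sketch of dual averaging with a running lower bound d on D = ||x0 - x*||."""
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    s = np.zeros_like(x0)          # weighted sum of subgradients
    d = float(d0)                  # current lower bound on D
    gamma = 1.0 / G                # AdaGrad-norm style step size
    grad_sq_sum = 0.0              # sum of ||g_i||^2
    weighted_sq_sum = 0.0          # sum of gamma_i * d_i^2 * ||g_i||^2
    avg_num = np.zeros_like(x0)    # numerator of the d-weighted average iterate
    avg_den = 0.0
    d_hist = [d]
    for _ in range(n_steps):
        g = np.asarray(subgrad(x), dtype=float)
        g_sq = float(g @ g)
        avg_num += d * x
        avg_den += d
        # gamma here is the step size that produced the current x
        weighted_sq_sum += gamma * d ** 2 * g_sq
        s = s + d * g
        grad_sq_sum += g_sq
        gamma = 1.0 / np.sqrt(G ** 2 + grad_sq_sum)
        s_norm = float(np.linalg.norm(s))
        if s_norm > 0.0:
            # candidate lower bound on D; keep the running maximum
            d_hat = (gamma * s_norm ** 2 - weighted_sq_sum) / (2.0 * s_norm)
            d = max(d, d_hat)
        x = x0 - gamma * s
        d_hist.append(d)
    return avg_num / avg_den, d_hist

# Toy problem (an assumption for illustration): f(x) = |x| from x0 = 1, so D = 1.
x_hat, d_hist = d_adapted_dual_averaging(np.sign, [1.0], d0=0.1, G=1.0, n_steps=1000)
```

In this sketch the estimate never decreases because of the running maximum, and the inequality it is derived from keeps it at or below the true distance.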

D-Adapted AdaGrad

The D-Adaptation technique can be applied to the coordinate-wise scaling variant of AdaGrad
Algorithm 2 presents this method
It estimates the distance to solution in the ∞-norm instead of the Euclidean norm
It has the same adaptive convergence rate as AdaGrad up to constant factors
Theorem 3 gives the guarantee for a convex $p$-dimensional function, with $G_\infty$ used in the initialization of $a_0$

Discussion

The discussion illustrates D-Adaptation on a toy problem: minimizing the absolute value function $f(x) = |x|$
The algorithm starts from a value $d_0$ that is lower than the known value of $D$
The value of $d_k$ typically does not asymptotically approach $D$
The algorithm uses a hyper-gradient quantity to estimate the magnitude of the optimal learning rate
The algorithm is applicable to convex Lipschitz functions and can be extended to the stochastic setting

Related work

The major classes of approaches to optimizing Lipschitz functions are reviewed

Polyak step size

Polyak step size can replace the requirement of knowledge of D
Polyak step size gives optimal rate of convergence without additional log factors
Estimates or approximations of f* can lead to unstable convergence
Restarting scheme with lower bounds on f* can converge within a log factor of the optimal rate
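
For reference, the classical Polyak step size sets the step to $(f(x_k) - f_*)/\|g_k\|^2$, which is why it needs the optimal value $f_*$ in place of $D$. The sketch below is a minimal illustration of that rule on a hypothetical toy problem (the L1 norm, where $f_* = 0$ is known); the function and variable names are assumptions made for this example.

```python
import numpy as np

def polyak_subgradient(f, subgrad, x0, f_star, n_steps):
    """Subgradient method with the classical Polyak step size."""
    x = np.asarray(x0, dtype=float)
    best_x, best_f = x.copy(), f(x)
    for _ in range(n_steps):
        g = np.asarray(subgrad(x), dtype=float)
        g_sq = float(g @ g)
        if g_sq == 0.0:
            break  # zero subgradient: x is already a minimizer
        step = (f(x) - f_star) / g_sq  # requires knowing f_star exactly
        x = x - step * g
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x.copy(), fx
    return best_x, best_f

def l1_norm(x):
    return float(np.abs(x).sum())

# Toy problem (an assumption for illustration): minimize ||x||_1, where f_star = 0.
x_best, f_best = polyak_subgradient(l1_norm, np.sign, [3.0, -2.0], f_star=0.0, n_steps=50)
```

Replacing `f_star` with an estimate changes the step length directly, which is one way to see why approximations of $f_*$ can destabilize convergence.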

Exact line searches

Method gives optimal rate without requiring knowledge of problem parameters
Relaxing exact line search to approximate line search introduces additional dependencies on problem constants
A bisection algorithm is used to replace the log dependence on $d_{\max}$ with a log log dependence
Estimating $d_{\max}$ allows for replacing $d_{\max}$ with $D$ in the bound

Coin-betting

Coin betting approaches can be used when knowledge of G is assumed but not D.
Coin betting methods achieve the optimal regret attainable without knowledge of D, which is still worse, by a logarithmic factor, than the best possible regret when D is known.

Reward doubling

Streeter and McMahan’s reward-doubling technique is similar to the approach in the paper.
It tracks the sum of the quantity $x_k g_k$ and compares it to a pre-specified upper bound.
When the reward sum exceeds the upper bound, the step size is doubled and the optimizer state is reset.
It produces similar results to the coin betting approach.

Machine learning applications

The D-Adaptation technique is adapted to stochastic optimization
Algorithms 3 and 4 are versions of D-Adaptation for SGD and Adam
The factor of 2 is removed from the D bound in Algorithms 3 and 4
Norms are weighted instead of unweighted
A correction factor of $(1 - \beta_2)$ is included in the D bound
An optional constant sequence $\gamma_k$ can be given as input to the algorithms

Experimental results

Compared D-Adapted variants of Adam and SGD on machine learning problems
Vary models and datasets to illustrate effectiveness of D-Adaptation
Used standard learning rate schedule with base learning rate set by D-Adaptation
Plotted mean of multiple seeds with error bars indicating range of 2 standard errors from mean
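
As a small illustration of the plotting convention, the sketch below computes the mean over seeds and a band of 2 standard errors around it. The function name, the use of the sample standard deviation (`ddof=1`), and the accuracy numbers are assumptions made for this example, not values from the paper.

```python
import numpy as np

def mean_with_error_bars(runs, n_se=2.0):
    """Return (mean, lower, upper) across seeds; rows are seeds, columns are epochs."""
    runs = np.asarray(runs, dtype=float)
    mean = runs.mean(axis=0)
    # standard error of the mean, using the sample standard deviation
    se = runs.std(axis=0, ddof=1) / np.sqrt(runs.shape[0])
    return mean, mean - n_se * se, mean + n_se * se

# Hypothetical test accuracies: 3 seeds, 2 epochs.
runs = [[0.70, 0.80],
        [0.72, 0.83],
        [0.74, 0.80]]
mean, lower, upper = mean_with_error_bars(runs)
```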

Convex problems

5 benchmark problems from LIBSVM repository were used
100 epochs of training with 10-fold decreases at 60, 80, and 95 epochs
No weight decay and batch-size 16
Hyper-parameters set to defaults
Learning rate for Adam chosen with grid search
D-Adaptation matches or exceeds performance of grid-search based learning rate
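
The step schedule described above can be sketched as a multiplier applied to the base learning rate. The function name is hypothetical, and the sketch assumes zero-indexed epochs with each decrease taking effect at the start of the listed epoch.

```python
def step_decay_multiplier(epoch, milestones=(60, 80, 95), factor=0.1):
    """Multiplier on the base learning rate after 10-fold drops at each milestone."""
    passed = sum(1 for m in milestones if epoch >= m)
    return factor ** passed

# Multipliers for a 100-epoch run (epochs 0..99).
multipliers = [step_decay_multiplier(e) for e in range(100)]
```

With D-Adaptation the base rate is set by the method, so only this multiplier sequence needs to be specified.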

Convolutional image classification

Used 3 common datasets for optimization method testing: CIFAR10, CIFAR100, ImageNet 2012
Used 3 different architectures: Wide-Resnet, DenseNet, vanilla ResNet
D-Adaptation matches or exceeds baseline learning rates on each problem

LSTM recurrent neural networks

IWSLT14 German-to-English dataset is used to benchmark machine translation models
LSTM model is commonly used for this problem
Inverse-square-root learning rate schedule is used for baseline and D-Adaptation
Model achieves comparable performance to baseline without tuning learning rate
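
A common form of the inverse-square-root schedule uses a linear warmup followed by decay proportional to the inverse square root of the step count. The sketch below shows that form as a multiplier on the base learning rate; the function name and the warmup length of 4000 steps are assumptions for illustration, not settings confirmed by this summary.

```python
import math

def inverse_sqrt_multiplier(step, warmup_steps=4000):
    """Linear warmup to 1.0, then decay proportional to 1/sqrt(step)."""
    if step < warmup_steps:
        return step / warmup_steps
    return math.sqrt(warmup_steps / step)

# Example: the multiplier halves when the step count quadruples after warmup.
after_warmup = inverse_sqrt_multiplier(4000)
four_times_later = inverse_sqrt_multiplier(16000)
```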

Masked language modelling

BERT is a popular approach to pretraining transformer models.
We use the 110M parameter RoBERTa variant of BERT for experiments.
We train on the Book-Wiki corpus.
D-Adaptation matches the baseline in test-set perplexity.

Auto-regressive language modelling

Used GPT decoder-only transformer architecture for auto-regressive language modelling
Trained on large Book-Wiki corpus
D-Adaptation is comparable to baseline with negligible perplexity difference

Object detection

COCO 2017 object detection task is a popular benchmark in computer vision
Trained Faster-RCNN model using Detectron2
Used pretrained ResNeXt-101-32x8d model
Experiments showed D-Adaptation overfitting
Increasing the weight decay from 0.0001 to 0.00015 improved test set accuracy

Vision transformers

Vision Transformers are a new approach to image classification
They are closer to the state of the art than ResNet models
Training Vision Transformers requires more resources
Training is typically done for 300 epochs
Adaptive optimizers such as Adam are used to avoid overfitting
PyTorch Image Models framework is used
Cosine learning rate schedule is used
D-Adaptation under-performs the baseline learning rate
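
The cosine schedule mentioned above is commonly written as a half-cosine decay from 1 to 0 over the run. The sketch below shows that multiplier without warmup; the function name is hypothetical and any warmup or minimum learning rate used in the actual experiments is not modelled.

```python
import math

def cosine_multiplier(step, total_steps):
    """Half-cosine decay from 1.0 at step 0 to 0.0 at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Multipliers at the start of each of 300 epochs.
epoch_multipliers = [cosine_multiplier(e, 300) for e in range(300)]
```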

FastMRI

FastMRI Knee Dataset is a large-scale release of raw MRI data
Reconstruction task is to produce a 2-dimensional, grey-scale image from raw sensor data
VarNet 2.0 model was trained on the dataset using code and training setup released by Meta

Recommendation systems

Criteo Kaggle Display Advertising dataset is a large, sparse dataset of user click-through events
DLRM model is a common benchmark for this problem
Our method closely matches the performance of the tuned baseline learning rate

Conclusion

We have presented a simple approach to achieving parameter-free learning of convex Lipschitz functions
We construct successively better lower bounds on the key unknown quantity: the distance to solution $\|x_0 - x_*\|$
Our approach for constructing these lower bounds may be of independent interest
Our method is highly practical, demonstrating excellent performance across a range of large and diverse machine learning problems
We consider a more general form of Algorithm 1 with arbitrary positive weights $\lambda_k$
We bound the sum of inner products $\gamma_k \lambda_k \langle g_k, s_k \rangle$ over time
We simplify when the weighting sequence is flat
We prove that the initial distance to solution, $D = \|x_0 - x_*\|$, can be lower bounded
We bound the gradient error term
We prove that the iterates of Algorithm 1 satisfy a bound
We prove that the norm of $s_{n+1}$ is bounded
We prove that the average iterate $\hat{x}_n$ returned by Algorithm 1 satisfies a bound
We prove that the 1-norm of $s_{n+1}$ is bounded
We prove that for any $x_*$ in the set of minimizers of $f$, the initial distance to solution, $D = \|x_0 - x_*\|_\infty$, can be lower bounded
We prove that we can return the point $\hat{x}_t$ at a time $t$ chosen by an $\arg\min$ over $k \le n$, with a bound depending on $d_{n+1}/d_0$ and $n + 1$
We present results from Logistic Regression, CIFAR10, CIFAR100, and ImageNet experiments
