Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Gradient descent speed is dependent on learning rate
- A single-loop method can achieve optimal convergence rate without knowledge of distance to solution set
- Method does not require additional multiplicative log factors
- Experiments show method matches hand-tuned learning rates
- Method is practical, efficient and requires no additional evaluations
- Open-source implementation available
Paper Content
Introduction
- Problem of unconstrained convex minimization
- Standard approach is subgradient method
- Step size (learning rate) affects convergence
- Optimal step size requires knowledge of distance to solution
- Dual Averaging with D-Adaptation algorithm achieves optimal rate of convergence
- No need for hyper-parameter grid searches
Algorithm
- Algorithm 1 is a modification of AdaGrad step size applied to weighted dual averaging
- Algorithm 1 uses a lower bound dk on D
- Algorithm 1 has two key differences from the classical bound
- Theorem 1 states that Algorithm 1 returns a point xn such that as n → ∞, where D = x 0 − x *
- Theorem 2 states that Algorithm 1 run for n ≥ log 2 (D/d 0 ) steps has a guarantee that is significantly better than using the subgradient method
D-adapted adagrad
- D-Adaptation technique can be applied to coordinate-wise scaling variant of AdaGrad
- Algorithm 2 presents this method
- Estimates distance to solution in ∞-norm instead of Euclidean norm
- Same adaptive convergence rate as AdaGrad up to constant factors
- Theorem 3 for convex p-dimensional function with G ∞ in initialization of a 0
Discussion
- D-Adaptation is a computer science algorithm used to minimize an absolute value function
- The algorithm starts with a value of d 0 which is lower than the known D value
- The value of d k typically does not asymptotically approach D
- The algorithm uses a hyper-gradient quantity to estimate the magnitude of the optimal learning rate
- The algorithm is applicable to convex Lipschitz functions and can be extended to the stochastic setting
Related work
- Optimizing Lipschitz functions
- Major classes of approaches reviewed
Polyak step size
- Polyak step size can replace the requirement of knowledge of D
- Polyak step size gives optimal rate of convergence without additional log factors
- Estimates or approximations of f* can lead to unstable convergence
- Restarting scheme with lower bounds on f* can converge within a log factor of the optimal rate
Exact line searches
- Method gives optimal rate without requiring knowledge of problem parameters
- Relaxing exact line search to approximate line search introduces additional dependencies on problem constants
- Bisection algorithm used to replace log dependence on d max with log log dependence
- Estimating d max allows for replacing d max with D in bound
Coin-betting
- Coin betting approaches can be used when knowledge of G is assumed but not D.
- Coin betting methods achieve optimal regret without knowledge of D, which is worse than the best possible regret with knowledge of D.
Reward doubling
- Streeter and McMahan’s reward-doubling technique is similar to the approach in the paper.
- It tracks the sum of the quantity x k g k and compares it to a pre-specified upper bound.
- When the reward sum exceeds the upper bound, the step size is doubled and the optimizer state is reset.
- It produces similar results to the coin betting approach.
Machine learning applications
- Adapt D-Adaptation technique to stochastic optimization
- Algorithm 3 and 4 are versions of D-Adaptation for SGD and Adam
- Remove factor of 2 from D bound in Algorithms 3 and 4
- Norms are weighted instead of unweighted
- Correction factor of (1 − β 2 ) in the D bound
- Optional γ k constant sequence as input to the algorithms
Experimental results
- Compared D-Adapted variants of Adam and SGD on machine learning problems
- Vary models and datasets to illustrate effectiveness of D-Adaptation
- Used standard learning rate schedule with base learning rate set by D-Adaptation
- Plotted mean of multiple seeds with error bars indicating range of 2 standard errors from mean
Convex problems
- 5 benchmark problems from LIBSVM repository were used
- 100 epochs of training with 10-fold decreases at 60, 80, and 95 epochs
- No weight decay and batch-size 16
- Hyper-parameters set to defaults
- Learning rate for Adam chosen with grid search
- D-Adaptation matches or exceeds performance of grid-search based learning rate
Convolutional image classification
- Used 3 common datasets for optimization method testing: CIFAR10, CIFAR100, ImageNet 2012
- Used 3 different architectures: Wide-Resnet, DenseNet, vanilla ResNet
- D-Adaptation matches or exceeds baseline learning rates on each problem
Lstm recurrent neural networks
- IWSLT14 German-to-English dataset is used to benchmark machine translation models
- LSTM model is commonly used for this problem
- Inverse-square-root learning rate schedule is used for baseline and D-Adaptation
- Model achieves comparable performance to baseline without tuning learning rate
Masked language modelling
- BERT is a popular approach to pretraining transformer models.
- We use the 110M parameter RoBERTA variant of BERT for experiments.
- We train on the Book-Wiki corpus.
- D-Adaptation matches the baseline in test-set perplexity.
Auto-regressive language modelling
- Used GPT decoder-only transformer architecture for auto-regressive language modelling
- Trained on large Book-Wiki corpus
- D-Adaptation is comparable to baseline with negligible perplexity difference
Object detection
- COCO 2017 object detection task is a popular benchmark in computer vision
- Trained Faster-RCNN model using Detectron2
- Used pretrained ResNeXt-101-32x8d model
- Experiments showed D-Adaptation overfitting
- Increased decay from 0.0001 to 0.00015, improved test set accuracy
Vision transformers
- Vision Transformers are a new approach to image classification
- They are more advanced than ResNet models
- Training Vision Transformers requires more resources
- Training is typically done for 300 epochs
- Adaptive optimizers such as Adam are used to avoid overfitting
- PyTorch Image Models framework is used
- Cosine learning rate schedule is used
- D-Adaptation under-performs the baseline learning rate
Fastmri
- FastMRI Knee Dataset is a large-scale release of raw MRI data
- Reconstruction task is to produce a 2-dimensional, grey-scale image from raw sensor data
- VarNet 2.0 model was trained on the dataset using code and training setup released by Meta
Recommendation systems
- Criteo Kaggle Display Advertising dataset is a large, sparse dataset of user click-through events
- DLRM model is a common benchmark for this problem
- Our method closely matches the performance of the tuned baseline learning rate
Conclusion
- We have presented a simple approach to achieving parameter free learning of convex Lipshitz functions
- We construct successively better lower bounds on the key unknown quantity: the distance to solution x 0 − x *
- Our approach for constructing these lower bounds may be of independent interest
- Our method is highly practical, demonstrating excellent performance across a range of large and diverse machine learning problems
- We consider a more general form of Algorithm 1 with arbitrary positive weights λ k
- We bound the sum of inner products γ k λ k g k , s k over time
- We simplify when the weighting sequence is flat
- We prove that the initial distance to solution, D = x 0 − x * , can be lower bounded
- We bound the gradient error term
- We prove that the iterates of Algorithm 1 satisfy a bound
- We prove that the norm of s n+1 is bounded
- We prove that the average iterate xn returned by Algorithm 1 satisfies a bound
- We prove that the 1-norm of s n+1 is bounded
- We prove that for any x * in the set of minimizers of f, the initial distance to solution, D = x 0 − x * ∞, can be lower bounded
- We prove that we can return the point xt and at time t = arg min k≤n d (d n+1 /d 0 ) n + 1
- We present results from Logistic Regression, CIFAR10, CIFAR100, and ImageNet experiments