Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Bernstein’s condition is an assumption that helps machine learning algorithms run faster.
- The Gibbs algorithm has an excess risk of $O(d_{\pi}/n)$ instead of the standard $O(\sqrt{d_{\pi}/n})$.
- This paper examines the Gibbs algorithm in the context of meta-learning.
- Bernstein’s condition always holds at the meta level, regardless of its validity at the observation level.
- The additional cost to learn the Gibbs prior $\pi$ is in $O(1/T)$.
- This result improves on standard rates in three different settings.
Paper Content
Introduction
- Artificial intelligence promises to create autonomous systems that can learn and adapt like living things.
- Meta-learning is a field that has been widely studied in recent literature.
- Transfer learning is a concept that involves two tasks that share similarities.
- Multi-task learning involves multiple learning tasks and a common representation.
- Meta-learning involves two levels of abstraction to improve learning over time.
- Metric-based methods use a metric learned from the meta-training dataset.
- Model-based methods quickly update the parameters in a few learning steps.
- Optimisation-based methods involve learning the hyper-parameters of a within-task algorithm.
- MAML and its variants are some of the best known meta-strategies.
Approach and contributions
- Focus on Gibbs algorithms and variational approximations
- Interpreted in Bayesian statistics framework
- PAC-Bayes and mutual information bounds used to study excess risk
- Intractable Gibbs posteriors, so variational approximations used
- PAC-Bayes bounds can be used on variational approximations
- Meta-learning through PAC-Bayes and information bounds
- Empirical PAC-Bayes bounds for meta-learning
- Excess risk of Gibbs algorithm in O(d π,t /n) when Bernstein’s condition is satisfied
- Meta-level Gibbs algorithm achieves excess risk O(inf π∈M E t [(d π,t /n) α ] + 1/T )
- Gain from meta-learning is blatant in some favorable situations
Problem definition and notations
- Z is a space of observations, Θ is a decision space and ℓ is a bounded loss function
- Learner has to solve T tasks
- Learner receives observations from a distribution P t
- Objective is to find a parameter θ which minimizes prediction risk
- Bayesian approaches seek for ρ t in P(Θ)
- Empirical risk is a standard choice for ρ t
- Variational approximations are defined by F ⊆ P(Θ)
Assumptions on the loss and bernstein’s condition
- Assumed loss function is bounded
- PAC-Bayes bounds for unbounded losses are known
- Variance term for task t is defined
- Assumption 1 (Bernstein’s condition) is crucial
- Excess risk of Gibbs posterior is characterized by Assumption 1
- Bound on excess risk provided with and without Assumption 1
- Assumption 2 (risk is smooth enough) used in specific applications
Learning in isolation
- Learning in isolation considers each task separately.
- Theorem 1 provides a bound for any α > 0.
- Corollary 2 provides an explicit rate of convergence for Gibbs posteriors.
- Bernstein’s condition yields a specific choice of α.
Main results
- Meta-learning considers all tasks to improve learning in each task
- Objective is to learn prior to make meta-risk as small as possible
Bernstein’s condition at the meta level
- Bernstein’s condition is always satisfied at the meta level when using Gibbs posteriors.
- Boundedness assumption (3) is required.
- Lemma 4 is used in the proof.
- Expectations are taken with respect to S t ∼ P t.
Pac-bayes bound for meta-learning
- We seek a prior π that allows us to obtain a small meta-risk.
- We fix a set of possible priors M and a set of distributions G on these priors.
- Theorem 5 states that if the loss ℓ satisfies (3), then the excess risk of Π is bounded.
- Open Question 1 asks under what conditions on F can we replace Π by Π(F) in Theorem 5.
- We study the case where M statisticians propose a different prior, all of which satisfy a prior mass condition.
- Theorem 5 and Corollary 2 give a rate of convergence provided by the best prior among {π 1 , . . . , π M }, with an additional log(M )/T term.
Applications of theorem 5
- Derive explicit bounds on the excess risk of the Gibbs algorithm in the case of discrete priors
- Derive explicit bounds on the excess risk of the Gibbs algorithm in the case of Gaussian priors
- Derive explicit bounds on the excess risk of the Gibbs algorithm in the case of mixtures of Gaussian priors
Learning discrete priors
- Assume |Θ| = M < ∞
- Define A* as smallest possible subset of Θ
- Bernstein’s condition is satisfied
- Excess risk of Gibbs algorithm is 4 log(M) αn
- Set of priors M is set of probability distributions πA
- Prior on priors Λ is defined as drawing m from {1, …, M}
- Excess risk of meta predictor Π is bounded by βT
- Meta-learning rate is larger than learning in isolation in unfavorable case
- Meta-learning improves upon learning in isolation in favorable case
- Benefits of meta-learning mainly expected in T ≫ n regime
Learning gaussian priors
- Prior on priors is defined as Λ
- Assumptions 1 and 4 are assumed to hold
- Excess risk of Π is bounded
- In favorable case, convergence rate is O log T T
Learning mixtures of gaussian priors
- Generalizing the result of the previous section to priors that are mixtures of Gaussians
- Assumptions 1 and 4 hold
- Number of components in the mixture is known
- Dirichlet prior on weights of components in the mixture
- Excess risk of estimator bounded
- Convergence rate of O 1 n + 1 T
- Convergence term at meta level is O dK log T T
- Estimator takes 2 log T βT to find optimal number of mixtures
Discussion
- Meta-learning has received increasing attention in recent years
- Theoretical analysis of meta-learning goes back to Baxter (2000)
- Many other generalization bounds have been provided for different strategies and proof techniques
- PAC-Bayes theory was first proposed in the meta-learning framework in the paper of Pentina and Lampert (2014)
- Excess risk bounds have been provided in the i.i.d. task environment framework
- Denevi et al. (2019a) provides statistical guarantees for Ridge regression with a metalearned bias
- Guan et al. (2022) address fast rates with respect to the number of tasks
- Guan and Lu (2022a) and Rezazadeh (2022) provide fast rate generalization bounds based on Catoni’s PAC-Bayes inequality
Conclusion and open problems
- We provided an analysis of the excess risk in meta-learning the prior via PAC-Bayes bounds
- At the meta-level, conditions for fast rates are always satisfied if one uses exact Gibbs posteriors at the task level
- An important problem is to extend this result to variational approximations of Gibbs posteriors
- Lemma 12 (Hoeffding’s inequality) states that for any s > 0, the probability of U i being outside of an interval [a, b] is bounded
- Lemma 13 (Bernstein’s inequality) states that for any s ∈ (0, 1/C], the probability of U i being outside of an interval [a, b] is bounded
- Lemma 14 (Donsker and Varadhan’s variational inequality) states that for any measurable, bounded function h, the KL divergence between a multinomial distribution of parameters (x 1 , . . . , x T ) and a multinomial distribution of parameters 1 T , . . . , 1 T is bounded
- Theorem 1 provides a lower bound for any π ′
- Theorem 5 states that for any prior on priors Λ, the bound becomes
- Assumption 4 implies that the bound becomes
- In the T > n regime, the bound is significantly improved compared to the learning in isolation
- We assume that priors are mixtures of K Gaussians, and the bound from Theorem 5 becomes, at t = T + 1,