Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Bernstein’s condition is a key assumption under which learning algorithms achieve fast rates of convergence.
When it holds, the Gibbs algorithm has an excess risk of order $O(d_{\pi}/n)$ instead of the standard $O(\sqrt{d_{\pi}/n})$.
This paper examines the Gibbs algorithm in the context of meta-learning.
Bernstein’s condition always holds at the meta level, regardless of its validity at the observation level.
The additional cost to learn the Gibbs prior $\pi$ is of order $O(1/T)$, where $T$ is the number of tasks.
This result improves on standard rates in three different settings.
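To make the gap between the two rates in the abstract concrete, the sketch below evaluates $d_{\pi}/n$ and $\sqrt{d_{\pi}/n}$ for a few sample sizes. Constants are ignored, and the value of `d` is a hypothetical complexity term chosen only for illustration.

```python
import math


def slow_rate(d: float, n: int) -> float:
    """Standard rate sqrt(d / n), constants ignored."""
    return math.sqrt(d / n)


def fast_rate(d: float, n: int) -> float:
    """Fast rate d / n (the rate under Bernstein's condition), constants ignored."""
    return d / n


d = 10.0  # hypothetical value of the complexity term d_pi
for n in (100, 1_000, 10_000):
    print(f"n={n:>6}  slow={slow_rate(d, n):.4f}  fast={fast_rate(d, n):.4f}")
```

As soon as $d_{\pi} < n$, the fast rate is the smaller of the two, and the gap widens as $n$ grows.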

Paper Content

Introduction

Artificial intelligence promises to create autonomous systems that can learn and adapt like living things.
Meta-learning is a field that has been widely studied in recent literature.
Transfer learning is a concept that involves two tasks that share similarities.
Multi-task learning involves multiple learning tasks and a common representation.
Meta-learning involves two levels of abstraction to improve learning over time.
Metric-based methods use a metric learned from the meta-training dataset.
Model-based methods quickly update the parameters in a few learning steps.
Optimisation-based methods involve learning the hyper-parameters of a within-task algorithm.
MAML and its variants are some of the best known meta-strategies.

Approach and contributions

Focus on Gibbs algorithms and variational approximations
Interpreted in Bayesian statistics framework
PAC-Bayes and mutual information bounds used to study excess risk
Intractable Gibbs posteriors, so variational approximations used
PAC-Bayes bounds can be used on variational approximations
Meta-learning through PAC-Bayes and information bounds
Empirical PAC-Bayes bounds for meta-learning
Excess risk of Gibbs algorithm in $O(d_{\pi,t}/n)$ when Bernstein’s condition is satisfied
Meta-level Gibbs algorithm achieves excess risk $O(\inf_{\pi \in \mathcal{M}} \mathbb{E}_t[(d_{\pi,t}/n)^{\alpha}] + 1/T)$
Gain from meta-learning is blatant in some favorable situations

Problem definition and notations

Z is a space of observations, Θ is a decision space and ℓ is a bounded loss function
Learner has to solve T tasks
In task $t$, the learner receives observations from a distribution $P_t$
Objective is to find a parameter θ which minimizes prediction risk
Bayesian approaches seek a distribution $\rho_t$ in $\mathcal{P}(\Theta)$
Empirical risk is a standard choice for $\rho_t$
Variational approximations are defined by $\mathcal{F} \subseteq \mathcal{P}(\Theta)$
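As an illustration of the objects above, here is a minimal sketch of a Gibbs posterior on a finite decision space. It assumes the common form $\rho_t(\theta) \propto \pi(\theta)\exp(-\alpha n \hat{r}_t(\theta))$, where $\hat{r}_t$ is the empirical risk in task $t$. The function name `gibbs_posterior`, the scaling by $\alpha n$ and the toy numbers are assumptions made for this example, not taken from the paper.

```python
import numpy as np


def gibbs_posterior(prior, emp_risk, alpha, n):
    """Gibbs posterior on a finite parameter set.

    Assumed form: rho(theta) proportional to prior(theta) * exp(-alpha * n * emp_risk(theta)).
    """
    prior = np.asarray(prior, dtype=float)
    emp_risk = np.asarray(emp_risk, dtype=float)
    with np.errstate(divide="ignore"):  # allow zero-mass prior entries
        log_w = np.log(prior) - alpha * n * emp_risk
    log_w -= log_w.max()  # stabilise the exponentials
    w = np.exp(log_w)
    return w / w.sum()


# Toy task: four candidate parameters, uniform prior, hypothetical empirical risks.
prior = np.full(4, 0.25)
emp_risk = np.array([0.5, 0.2, 0.4, 0.3])
rho = gibbs_posterior(prior, emp_risk, alpha=1.0, n=50)
print(rho)
```

With a uniform prior the posterior concentrates on the parameter with the smallest empirical risk, and setting `alpha=0` returns the prior unchanged.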

Assumptions on the loss and Bernstein’s condition

Assumed loss function is bounded
PAC-Bayes bounds for unbounded losses are known
Variance term for task t is defined
Assumption 1 (Bernstein’s condition) is crucial
Excess risk of Gibbs posterior is characterized by Assumption 1
Bound on excess risk provided with and without Assumption 1
Assumption 2 (risk is smooth enough) used in specific applications
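For reference, Bernstein’s condition is usually written as a bound on the variance term by the excess risk. The display below gives this usual form; the symbols $c$, $R_t$ and $\theta_t^*$ (a minimizer of the risk $R_t$ in task $t$) are notation assumed here, and the exact statement of Assumption 1 in the paper may differ in its constants.

$$
V_t(\theta, \theta_t^*) := \mathbb{E}_{Z \sim P_t}\left[\left(\ell(Z, \theta) - \ell(Z, \theta_t^*)\right)^2\right] \le c \left(R_t(\theta) - R_t(\theta_t^*)\right) \quad \text{for all } \theta \in \Theta
$$

Intuitively, parameters whose risk is close to optimal must also have losses close to those of $\theta_t^*$, which is what allows rates in $1/n$ rather than $1/\sqrt{n}$.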

Learning in isolation

Learning in isolation considers each task separately.
Theorem 1 provides a bound for any α > 0.
Corollary 2 provides an explicit rate of convergence for Gibbs posteriors.
Bernstein’s condition yields a specific choice of α.

Main results

Meta-learning considers all tasks to improve learning in each task
Objective is to learn prior to make meta-risk as small as possible

Bernstein’s condition at the meta level

Bernstein’s condition is always satisfied at the meta level when using Gibbs posteriors.
Boundedness assumption (3) is required.
Lemma 4 is used in the proof.
Expectations are taken with respect to $S_t \sim P_t$.

PAC-Bayes bound for meta-learning

We seek a prior π that allows us to obtain a small meta-risk.
We fix a set of possible priors M and a set of distributions G on these priors.
Theorem 5 states that if the loss ℓ satisfies (3), then the excess risk of Π is bounded.
Open Question 1 asks under what conditions on F can we replace Π by Π(F) in Theorem 5.
We study the case where M statisticians each propose a different prior, all of which satisfy a prior mass condition.
Theorem 5 and Corollary 2 give the rate of convergence of the best prior among $\{\pi_1, \dots, \pi_M\}$, with an additional $\log(M)/T$ term.
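The finite case with $M$ candidate priors can be sketched as exponential weights over priors. The code below is a hedged illustration, not the paper's estimator: it assumes a finite decision space, a uniform prior on priors, and that each prior is scored by the minimized within-task objective $-\frac{1}{\alpha n}\log \mathbb{E}_{\theta \sim \pi}[\exp(-\alpha n \hat{r}_t(\theta))]$ averaged over tasks. The function names and the parameters `alpha`, `beta` and `n` are assumptions for this example.

```python
import numpy as np


def task_objective(prior, emp_risk, alpha, n):
    """Minimised within-task objective: -log E_prior[exp(-alpha n r)] / (alpha n)."""
    with np.errstate(divide="ignore"):  # zero-mass entries give -inf, handled below
        a = np.log(prior) - alpha * n * emp_risk
    m = a.max()
    return -(m + np.log(np.exp(a - m).sum())) / (alpha * n)


def meta_weights(priors, emp_risks, alpha, beta, n):
    """Exponential weights over candidate priors (uniform prior on priors assumed).

    priors: array of shape (M, K); emp_risks: array of shape (T, K).
    """
    T = emp_risks.shape[0]
    avg = np.array([
        np.mean([task_objective(p, r, alpha, n) for r in emp_risks])
        for p in priors
    ])
    log_w = -beta * T * avg
    log_w -= log_w.max()
    w = np.exp(log_w)
    return w / w.sum()


# Toy environment: K = 5 parameters, T = 20 tasks, the best parameter is always 0 or 1.
K, T = 5, 20
emp_risks = np.full((T, K), 0.5)
for t in range(T):
    emp_risks[t, t % 2] = 0.1

priors = np.array([
    np.full(K, 1.0 / K),        # uniform over all parameters
    [0.5, 0.5, 0.0, 0.0, 0.0],  # mass on the parameters that are actually useful
    [0.0, 0.0, 0.0, 0.5, 0.5],  # mass on poor parameters
])
weights = meta_weights(priors, emp_risks, alpha=1.0, beta=1.0, n=50)
print(weights)
```

In this toy setting the weights favour the prior that concentrates on the parameters shared across tasks.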

Applications of Theorem 5

Derive explicit bounds on the excess risk of the Gibbs algorithm in the case of discrete priors
Derive explicit bounds on the excess risk of the Gibbs algorithm in the case of Gaussian priors
Derive explicit bounds on the excess risk of the Gibbs algorithm in the case of mixtures of Gaussian priors

Learning discrete priors

Assume |Θ| = M < ∞
Define A* as smallest possible subset of Θ
Bernstein’s condition is satisfied
Excess risk of Gibbs algorithm is $4\log(M)/(\alpha n)$
Set of priors M is set of probability distributions πA
Prior on priors Λ is defined as drawing m from {1, …, M}
Excess risk of meta predictor Π is bounded by βT
Meta-learning rate is larger than learning in isolation in unfavorable case
Meta-learning improves upon learning in isolation in favorable case
Benefits of meta-learning mainly expected in T ≫ n regime

Learning Gaussian priors

Prior on priors is defined as Λ
Assumptions 1 and 4 are assumed to hold
Excess risk of Π is bounded
In favorable case, convergence rate is $O(\log T / T)$

Learning mixtures of Gaussian priors

Generalizing the result of the previous section to priors that are mixtures of Gaussians
Assumptions 1 and 4 hold
Number of components in the mixture is known
Dirichlet prior on weights of components in the mixture
Excess risk of estimator bounded
Convergence rate of $O(1/n + 1/T)$
Convergence term at meta level is $O(dK \log T / T)$
Estimator takes $2\log T/(\beta T)$ to find optimal number of mixtures

Discussion

Meta-learning has received increasing attention in recent years
Theoretical analysis of meta-learning goes back to Baxter (2000)
Many other generalization bounds have been provided for different strategies and proof techniques
PAC-Bayes theory was first proposed in the meta-learning framework in the paper of Pentina and Lampert (2014)
Excess risk bounds have been provided in the i.i.d. task environment framework
Denevi et al. (2019a) provides statistical guarantees for Ridge regression with a metalearned bias
Guan et al. (2022) address fast rates with respect to the number of tasks
Guan and Lu (2022a) and Rezazadeh (2022) provide fast rate generalization bounds based on Catoni’s PAC-Bayes inequality

Conclusion and open problems

We provided an analysis of the excess risk in meta-learning the prior via PAC-Bayes bounds
At the meta-level, conditions for fast rates are always satisfied if one uses exact Gibbs posteriors at the task level
An important problem is to extend this result to variational approximations of Gibbs posteriors
Lemma 12 (Hoeffding’s inequality) states that for any $s > 0$, the probability of $U_i$ being outside of an interval $[a, b]$ is bounded
Lemma 13 (Bernstein’s inequality) states that for any $s \in (0, 1/C]$, the probability of $U_i$ being outside of an interval $[a, b]$ is bounded
Lemma 14 (Donsker and Varadhan’s variational inequality) states that for any measurable, bounded function $h$, the KL divergence between a multinomial distribution of parameters $(x_1, \dots, x_T)$ and a multinomial distribution of parameters $(1/T, \dots, 1/T)$ is bounded
Theorem 1 provides a lower bound for any $\pi'$
Theorem 5 states that for any prior on priors Λ, the bound becomes
Assumption 4 implies that the bound becomes
In the T > n regime, the bound is significantly improved compared to the learning in isolation
We assume that priors are mixtures of K Gaussians, and the bound from Theorem 5 becomes, at t = T + 1,
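The Donsker and Varadhan variational formula behind Lemma 14 can be checked numerically on a finite space. The sketch below assumes its standard form, $\log \mathbb{E}_{\pi}[e^{h}] = \sup_{\rho}\{\mathbb{E}_{\rho}[h] - \mathrm{KL}(\rho \| \pi)\}$, with the supremum attained at the Gibbs distribution $\rho \propto \pi e^{h}$. The random distributions and the function `h` are arbitrary test inputs, not quantities from the paper.

```python
import numpy as np


def kl(rho, pi):
    """KL divergence between two discrete distributions (0 * log 0 treated as 0)."""
    mask = rho > 0
    return float(np.sum(rho[mask] * np.log(rho[mask] / pi[mask])))


rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(6))    # arbitrary reference distribution
h = rng.uniform(-1.0, 1.0, size=6)  # arbitrary bounded function

# Left-hand side: log of the exponential moment of h under pi.
lhs = float(np.log(np.sum(pi * np.exp(h))))

# Right-hand side evaluated at the Gibbs distribution rho proportional to pi * exp(h).
gibbs = pi * np.exp(h)
gibbs /= gibbs.sum()
rhs_at_gibbs = float(gibbs @ h) - kl(gibbs, pi)

# Any other distribution should give a value no larger than lhs.
others = [float(rho @ h) - kl(rho, pi) for rho in rng.dirichlet(np.ones(6), size=1000)]

print(lhs, rhs_at_gibbs, max(others))
```

The first two printed values coincide up to floating-point error, and the third does not exceed them.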
