Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

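  • Bernstein’s condition is a key assumption that guarantees fast rates of convergence in machine learning.
  • Under this condition, the Gibbs algorithm with prior $\pi$ has an excess risk in $O(d_{\pi}/n)$ instead of the standard $O(\sqrt{d_{\pi}/n})$, where $n$ is the number of observations and $d_{\pi}$ is a complexity parameter that depends on the prior.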
  • This paper examines the Gibbs algorithm in the context of meta-learning.
  • Bernstein’s condition always holds at the meta level, regardless of its validity at the observation level.
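  • The additional cost to learn the Gibbs prior $\pi$ from $T$ tasks is in $O(1/T)$, instead of the expected $O(1/\sqrt{T})$.
  • This result improves on standard rates in three different settings: discrete priors, Gaussian priors and mixtures of Gaussian priors.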

Paper Content

Introduction

  • Artificial intelligence promises to create autonomous systems that can learn and adapt like living things.
  • Meta-learning is a field that has been widely studied in recent literature.
  • Transfer learning is a concept that involves two tasks that share similarities.
  • Multi-task learning involves multiple learning tasks and a common representation.
  • Meta-learning involves two levels of abstraction to improve learning over time.
  • Metric-based methods use a metric learned from the meta-training dataset.
  • Model-based methods quickly update the parameters in a few learning steps.
  • Optimisation-based methods involve learning the hyper-parameters of a within-task algorithm.
  • MAML and its variants are some of the best known meta-strategies.

Approach and contributions

  • Focus on Gibbs algorithms and variational approximations
  • Interpreted in Bayesian statistics framework
  • PAC-Bayes and mutual information bounds used to study excess risk
  • Intractable Gibbs posteriors, so variational approximations used
  • PAC-Bayes bounds can be used on variational approximations
  • Meta-learning through PAC-Bayes and information bounds
  • Empirical PAC-Bayes bounds for meta-learning
  • Excess risk of Gibbs algorithm in $O(d_{\pi,t}/n)$ when Bernstein’s condition is satisfied
  • Meta-level Gibbs algorithm achieves excess risk $O(\inf_{\pi \in \mathcal{M}} \mathbb{E}_t[(d_{\pi,t}/n)^{\alpha}] + 1/T)$
  • Gain from meta-learning is blatant in some favorable situations

Problem definition and notations

  • Z is a space of observations, Θ is a decision space and ℓ is a bounded loss function
  • Learner has to solve T tasks
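  • In task $t$, the learner receives observations from a distribution $P_t$
  • Objective is to find a parameter θ which minimizes prediction risk
  • Bayesian approaches seek a distribution $\rho_t$ in $\mathcal{P}(\Theta)$
  • The Gibbs posterior, which trades off the empirical risk against the KL divergence to the prior, is a standard choice for $\rho_t$
  • Variational approximations are defined by a family $\mathcal{F} \subseteq \mathcal{P}(\Theta)$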
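As an illustration of the Gibbs posterior mentioned above, here is a minimal sketch on a finite decision space, assuming the usual form $\rho(\theta) \propto \pi(\theta)\exp(-\alpha n \hat{R}(\theta))$. The function name `gibbs_posterior` and the toy numbers are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence


def gibbs_posterior(prior: Sequence[float], emp_risk: Sequence[float],
                    alpha: float, n: int) -> List[float]:
    """Gibbs posterior on a finite parameter set (illustrative sketch).

    prior[i]    : prior mass of parameter i
    emp_risk[i] : empirical risk of parameter i on n observations
    Returns weights proportional to prior[i] * exp(-alpha * n * emp_risk[i]).
    """
    if len(prior) != len(emp_risk):
        raise ValueError("prior and emp_risk must have the same length")
    # Work with log-weights; parameters with zero prior mass keep zero mass.
    log_w = [math.log(p) - alpha * n * r if p > 0 else -math.inf
             for p, r in zip(prior, emp_risk)]
    top = max(log_w)
    if top == -math.inf:
        raise ValueError("prior must put positive mass on some parameter")
    # Subtract the maximum before exponentiating for numerical stability.
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return [x / total for x in w]


if __name__ == "__main__":
    # Hypothetical toy task: three parameters, uniform prior.
    print(gibbs_posterior([1 / 3, 1 / 3, 1 / 3], [0.10, 0.30, 0.25],
                          alpha=1.0, n=50))
```

Larger values of `alpha * n` concentrate the posterior on the empirical risk minimizer, while smaller values keep it closer to the prior.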

Assumptions on the loss and Bernstein’s condition

  • Assumed loss function is bounded
  • PAC-Bayes bounds for unbounded losses are known
  • Variance term for task t is defined
  • Assumption 1 (Bernstein’s condition) is crucial
  • Excess risk of Gibbs posterior is characterized by Assumption 1
  • Bound on excess risk provided with and without Assumption 1
  • Assumption 2 (risk is smooth enough) used in specific applications
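For readers unfamiliar with Bernstein’s condition, it is usually stated in the following way. This is a sketch of the standard form, not a quotation: the symbols $V_t$, $R_{P_t}$, $\theta_t^*$ and the constant $c$ are assumed notation, and the paper's exact statement may differ in its constants.

```latex
% Variance term for task t, relative to a risk minimizer \theta_t^*:
V_t(\theta, \theta_t^*) := \mathbb{E}_{Z \sim P_t}\!\left[ \big( \ell(Z, \theta) - \ell(Z, \theta_t^*) \big)^2 \right]

% Bernstein's condition: the variance term is controlled by the excess risk,
% for some constant c > 0 and every \theta in \Theta:
V_t(\theta, \theta_t^*) \le c \, \big( R_{P_t}(\theta) - R_{P_t}(\theta_t^*) \big)
```

Intuitively, parameters with a small excess risk must also have losses that fluctuate little around those of the best parameter, which is what allows the faster rate.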

Learning in isolation

  • Learning in isolation considers each task separately.
  • Theorem 1 provides a bound for any α > 0.
  • Corollary 2 provides an explicit rate of convergence for Gibbs posteriors.
  • Bernstein’s condition yields a specific choice of α.
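To give a feel for the gap between the two rates discussed above, the following sketch evaluates $\sqrt{d_\pi/n}$ and $d_\pi/n$ for a few sample sizes. The value of the complexity parameter `d` is a hypothetical choice for illustration, and constants in the actual bounds are ignored.

```python
import math


def slow_rate(d: float, n: int) -> float:
    """Standard rate sqrt(d / n), without Bernstein's condition."""
    return math.sqrt(d / n)


def fast_rate(d: float, n: int) -> float:
    """Fast rate d / n, available under Bernstein's condition."""
    return d / n


if __name__ == "__main__":
    d = 5.0  # hypothetical complexity parameter d_pi
    for n in (100, 1_000, 10_000):
        print(f"n={n:>6}  slow={slow_rate(d, n):.4f}  fast={fast_rate(d, n):.4f}")
```

Whenever $d_\pi < n$, the fast rate is the smaller of the two, and the gap widens as $n$ grows.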

Main results

  • Meta-learning considers all tasks jointly to improve learning in each task
  • Objective is to learn a prior that makes the meta-risk as small as possible

Bernstein’s condition at the meta level

  • Bernstein’s condition is always satisfied at the meta level when using Gibbs posteriors.
  • Boundedness assumption (3) is required.
  • Lemma 4 is used in the proof.
  • Expectations are taken with respect to $S_t \sim P_t$.

PAC-Bayes bound for meta-learning

  • We seek a prior π that allows us to obtain a small meta-risk.
  • We fix a set of possible priors M and a set of distributions G on these priors.
  • Theorem 5 states that if the loss ℓ satisfies (3), then the excess risk of Π is bounded.
  • Open Question 1 asks under what conditions on $\mathcal{F}$ we can replace Π by Π(F) in Theorem 5.
  • We study the case where M statisticians each propose a different prior, all of which satisfy a prior mass condition.
  • Theorem 5 and Corollary 2 give a rate of convergence provided by the best prior among $\{\pi_1, \dots, \pi_M\}$, with an additional $\log(M)/T$ term.
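The finite case with M candidate priors can be sketched in code. The example below assumes that the meta-level posterior takes the usual Gibbs form, with weights proportional to $\Lambda(\pi_m)\exp(-\beta \sum_t \hat{R}_t(\pi_m))$, where $\hat{R}_t(\pi_m)$ is the within-task Gibbs objective (empirical risk plus KL penalty), computed in closed form on a finite decision space. All function names and toy values are hypothetical, and this is an illustration rather than the paper's exact procedure.

```python
import math
from typing import List, Optional, Sequence


def log_sum_exp(xs: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(x) for x in xs))."""
    top = max(xs)
    return top + math.log(sum(math.exp(x - top) for x in xs))


def task_objective(prior: Sequence[float], emp_risk: Sequence[float],
                   alpha: float, n: int) -> float:
    """Minimum over rho of E_rho[emp_risk] + KL(rho || prior) / (alpha * n).

    On a finite set this equals
    -(1 / (alpha * n)) * log(sum_i prior[i] * exp(-alpha * n * emp_risk[i])).
    """
    terms = [math.log(p) - alpha * n * r
             for p, r in zip(prior, emp_risk) if p > 0]
    return -log_sum_exp(terms) / (alpha * n)


def meta_gibbs_weights(priors: Sequence[Sequence[float]],
                       task_risks: Sequence[Sequence[float]],
                       alpha: float, beta: float, n: int,
                       hyperprior: Optional[Sequence[float]] = None) -> List[float]:
    """Weights over candidate priors, proportional to
    hyperprior[m] * exp(-beta * sum over tasks of task_objective).

    Assumes every hyperprior weight is strictly positive.
    """
    m = len(priors)
    if hyperprior is None:
        hyperprior = [1.0 / m] * m  # uniform prior on priors
    scores = []
    for prior, lam in zip(priors, hyperprior):
        total = sum(task_objective(prior, risks, alpha, n) for risks in task_risks)
        scores.append(math.log(lam) - beta * total)
    log_z = log_sum_exp(scores)
    return [math.exp(s - log_z) for s in scores]


if __name__ == "__main__":
    # Two hypothetical candidate priors over three parameters.
    informed = [0.98, 0.01, 0.01]
    uniform = [1 / 3, 1 / 3, 1 / 3]
    # Three toy tasks in which parameter 0 is always the best one.
    tasks = [[0.0, 0.5, 0.5]] * 3
    print(meta_gibbs_weights([informed, uniform], tasks, alpha=1.0, beta=1.0, n=10))
```

In this toy setting the weight shifts toward the prior that already concentrates on the parameter shared by all tasks, and the shift becomes stronger as more tasks are observed.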

Applications of Theorem 5

  • Derive explicit bounds on the excess risk of the Gibbs algorithm in the case of discrete priors
  • Derive explicit bounds on the excess risk of the Gibbs algorithm in the case of Gaussian priors
  • Derive explicit bounds on the excess risk of the Gibbs algorithm in the case of mixtures of Gaussian priors

Learning discrete priors

  • Assume $|\Theta| = M < \infty$
  • Define $A^*$ as the smallest possible subset of Θ that contains the optimal parameter of every task
  • Bernstein’s condition is satisfied
  • Excess risk of Gibbs algorithm is $4\log(M)/(\alpha n)$
  • Set of priors $\mathcal{M}$ is the set of probability distributions $\pi_A$
  • Prior on priors Λ is defined as drawing m from {1, …, M}
  • Excess risk of meta predictor Π is bounded, with a meta-level term in $1/(\beta T)$
  • Meta-learning rate is larger than learning in isolation in unfavorable case
  • Meta-learning improves upon learning in isolation in favorable case
  • Benefits of meta-learning mainly expected in T ≫ n regime
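The gain in the favorable case can be illustrated with a simple fact: the KL divergence between a point mass at θ and a uniform prior on a subset A containing θ equals $\log|A|$. The sketch below compares a uniform prior on all of Θ with one restricted to a small subset. The sizes used are hypothetical, and this only illustrates the complexity term, not the paper's full bound.

```python
import math
from typing import Set


def kl_point_mass_vs_uniform(theta: int, support: Set[int]) -> float:
    """KL divergence between a point mass at theta and the uniform
    distribution on `support`; infinite if theta is outside the support."""
    if theta not in support:
        return math.inf
    return math.log(len(support))


if __name__ == "__main__":
    full_space = set(range(1000))   # hypothetical decision space, M = 1000
    small_subset = {3, 17, 42, 99, 256}  # hypothetical subset of size 5
    theta_star = 42
    print(kl_point_mass_vs_uniform(theta_star, full_space))    # log(1000)
    print(kl_point_mass_vs_uniform(theta_star, small_subset))  # log(5)
```

A prior supported on a small subset that still contains the task optimum lowers the complexity term from $\log(M)$ to the logarithm of the subset size, while a prior that misses the optimum gives an infinite penalty.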

Learning Gaussian priors

  • Prior on priors is defined as Λ
  • Assumptions 1 and 4 are assumed to hold
  • Excess risk of Π is bounded
  • In favorable case, convergence rate is $O(\log T / T)$
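To see why learning a Gaussian prior can help, recall that complexity terms in PAC-Bayes bounds are driven by a KL divergence to the prior. The sketch below uses the standard closed form for the KL divergence between two isotropic Gaussians and shows that a prior mean close to the task optimum gives a smaller penalty. The function name and the numbers are hypothetical, and the paper's actual complexity term may be defined differently.

```python
import math
from typing import Sequence


def kl_isotropic_gaussians(mean_q: Sequence[float], std_q: float,
                           mean_p: Sequence[float], std_p: float) -> float:
    """KL( N(mean_q, std_q^2 I) || N(mean_p, std_p^2 I) ) in dimension d."""
    if len(mean_q) != len(mean_p):
        raise ValueError("means must have the same dimension")
    if std_q <= 0 or std_p <= 0:
        raise ValueError("standard deviations must be positive")
    d = len(mean_q)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mean_q, mean_p))
    var_ratio = (std_q / std_p) ** 2
    return 0.5 * (d * var_ratio + sq_dist / std_p ** 2 - d - d * math.log(var_ratio))


if __name__ == "__main__":
    posterior_mean = [1.0, 1.0]  # hypothetical task optimum
    far_prior = [0.0, 0.0]       # uninformed prior mean
    near_prior = [0.9, 1.1]      # prior mean learned from earlier tasks (hypothetical)
    print(kl_isotropic_gaussians(posterior_mean, 0.1, far_prior, 1.0))
    print(kl_isotropic_gaussians(posterior_mean, 0.1, near_prior, 1.0))
```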

Learning mixtures of Gaussian priors

  • Generalizing the result of the previous section to priors that are mixtures of Gaussians
  • Assumptions 1 and 4 hold
  • Number of components in the mixture is known
  • Dirichlet prior on weights of components in the mixture
  • Excess risk of estimator bounded
  • Convergence rate of $O(1/n + 1/T)$
  • Convergence term at meta level is $O(dK \log T / T)$
  • Estimator pays an additional $2\log T/(\beta T)$ to find the optimal number of mixture components

Discussion

  • Meta-learning has received increasing attention in recent years
  • Theoretical analysis of meta-learning goes back to Baxter (2000)
  • Many other generalization bounds have been provided for different strategies and proof techniques
  • PAC-Bayes theory was first proposed in the meta-learning framework in the paper of Pentina and Lampert (2014)
  • Excess risk bounds have been provided in the i.i.d. task environment framework
  • Denevi et al. (2019a) provide statistical guarantees for Ridge regression with a meta-learned bias
  • Guan et al. (2022) address fast rates with respect to the number of tasks
  • Guan and Lu (2022a) and Rezazadeh (2022) provide fast rate generalization bounds based on Catoni’s PAC-Bayes inequality

Conclusion and open problems

  • We provided an analysis of the excess risk in meta-learning the prior via PAC-Bayes bounds
  • At the meta-level, conditions for fast rates are always satisfied if one uses exact Gibbs posteriors at the task level
  • An important problem is to extend this result to variational approximations of Gibbs posteriors
  • Lemma 12 (Hoeffding’s inequality) bounds, for any $s > 0$, the exponential moment of a centered sum of independent random variables $U_i$ taking values in an interval $[a, b]$
  • Lemma 13 (Bernstein’s inequality) gives a similar bound on the exponential moment of the $U_i$, in terms of their variance, for any $s \in (0, 1/C]$
  • Lemma 14 (Donsker and Varadhan’s variational inequality) applies to any measurable, bounded function h; in addition, the KL divergence between a multinomial distribution with parameters $(x_1, \dots, x_T)$ and a multinomial distribution with parameters $(1/T, \dots, 1/T)$ is bounded
  • Theorem 1 provides a lower bound for any $\pi'$
  • Theorem 5 states that for any prior on priors Λ, the bound becomes
  • Assumption 4 implies that the bound becomes
  • In the T > n regime, the bound is significantly improved compared to learning in isolation
  • We assume that priors are mixtures of K Gaussians, and the bound from Theorem 5 becomes, at t = T + 1,