Abstract

  • Estimate discrete distribution in KL divergence
  • Concentration bounds for Laplace estimator
  • Deviation from mean scales as $\sqrt{k}/n$ when $n \ge k$
  • Establish matching lower bound, tight up to polylogarithmic factors

Paper Content

Introduction

  • Discrete distribution estimation is a fundamental problem in Statistics
  • Estimating an arbitrary discrete distribution in Kullback-Leibler (KL) divergence with vanishing probability of error is the goal
  • KL divergence is non-negative, unbounded and asymmetric
  • Maximum likelihood estimator (empirical estimator) is commonly used
  • Minimax rates and concentration bounds are typically studied
  • Laplace (add-one) estimator is commonly used; its worst-case expected KL loss is of order $k/n$
  • Concentration bounds of the form “with probability 1 − δ” are desired
  • McDiarmid’s inequality can be used for the $\ell_1$ distance
  • Best known bound for KL divergence is given by [4]
  • Main result is a high-probability concentration bound on the KL loss of the Laplace estimator
  • Lower bound on the variance of $\mathrm{KL}(p \,\|\, p^{+1})$ is established, where $p^{+1}$ denotes the Laplace estimator
  • Direct consequence of main result is improved sample complexity for estimating a tree-structured Bayesian network
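
As a concrete reference for the objects named above, here is a minimal Python sketch of the Laplace (add-one) estimator and the KL divergence. The function names and the toy distribution are illustrative assumptions, not taken from the paper. It shows why the empirical estimator is awkward for KL loss: a symbol with zero count makes the divergence infinite, while add-one smoothing keeps it finite.

```python
import math

def empirical_estimator(counts):
    """Maximum likelihood estimate: N_i / n."""
    n = sum(counts)
    return [c / n for c in counts]

def laplace_estimator(counts):
    """Add-one estimate: (N_i + 1) / (n + k)."""
    n, k = sum(counts), len(counts)
    return [(c + 1) / (n + k) for c in counts]

def kl_divergence(p, q):
    """KL(p || q) in nats; infinite if q puts zero mass where p does not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken to be 0
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

# Toy example (hypothetical numbers): the third symbol is never observed.
p = [0.5, 0.3, 0.2]
counts = [6, 4, 0]

q_emp = empirical_estimator(counts)
q_lap = laplace_estimator(counts)

print("KL(p || empirical):", kl_divergence(p, q_emp))
print("KL(p || Laplace):  ", kl_divergence(p, q_lap))
print("KL(Laplace || p):  ", kl_divergence(q_lap, p))
```

The last two lines also illustrate the asymmetry of KL divergence: swapping the arguments gives a different value.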

Analysis sketch

  • McDiarmid’s inequality is a standard way to provide concentration bounds for discrete distribution estimators
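  • Lemma 1 (McDiarmid’s inequality) states that if changing the $i$-th variable changes the value of the function $f$ by at most $c_i$ in absolute value, then with probability at least $1 - \delta$, $|f - \mathbb{E}[f]| \le \sqrt{\tfrac{1}{2} \sum_i c_i^2 \log(2/\delta)}$; when every $c_i \le c_\infty(f)$, the bound simplifies to $c_\infty(f) \sqrt{\tfrac{n}{2} \log(2/\delta)}$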
  • The goal is to obtain a good enough bound on $c_\infty(f)$ to show that $n \, c_\infty(f)^2$ decreases to 0
  • [6] observes that the $\ell_1$ distance between the true distribution and the empirical distribution satisfies $c_\infty \le 2/n$
  • A direct application of McDiarmid’s inequality to $\mathrm{KL}(p \,\|\, p^{+1})$ results in a vacuous bound
  • To provide a stronger bound, the KL divergence is written as a function of the $k$ counts $N_1, N_2, \ldots, N_k$
  • The counts $N_i$ are not independent of each other, so the Poisson sampling process is used
  • A concentration bound for KL under Poisson sampling yields a concentration bound for KL under multinomial sampling
  • A high-probability bound on $c_\infty(\mathrm{KL})$ is provided
  • Lemma 2 states that the expectations under multinomial and Poisson sampling are similar
  • A careful coupling between binomial and Poisson random variables with the same mean is used to obtain the required bounds
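
The Poisson sampling step can be illustrated with a small simulation. The sketch below is an illustration under stated assumptions, not the paper's proof technique: it draws counts either from a multinomial with exactly $n$ samples or as independent $N_i \sim \mathrm{Poisson}(n p_i)$, and compares the average KL loss of the Laplace estimator under the two schemes. Under Poisson sampling the estimator is normalized by the realized total $\sum_i N_i$, which is an assumption made here for simplicity; the helper names, the uniform test distribution, and the Poisson sampler (Knuth's multiplication method) are likewise illustrative choices.

```python
import math
import random

def laplace_estimator(counts):
    """Add-one estimate, normalized by the realized total count."""
    total, k = sum(counts), len(counts)
    return [(c + 1) / (total + k) for c in counts]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def multinomial_counts(p, n, rng):
    """Counts from exactly n i.i.d. draws; the N_i are dependent (they sum to n)."""
    counts = [0] * len(p)
    for i in rng.choices(range(len(p)), weights=p, k=n):
        counts[i] += 1
    return counts

def poisson_draw(lam, rng):
    """Knuth's method; adequate for the moderate means used here."""
    threshold = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count

def poisson_counts(p, n, rng):
    """Independent counts N_i ~ Poisson(n * p_i); the total is random."""
    return [poisson_draw(n * pi, rng) for pi in p]

def mean_kl(sampler, p, n, trials, rng):
    total = 0.0
    for _ in range(trials):
        total += kl_divergence(p, laplace_estimator(sampler(p, n, rng)))
    return total / trials

rng = random.Random(0)
p = [0.1] * 10   # hypothetical test distribution: uniform over k = 10 symbols
n = 200

mean_multinomial = mean_kl(multinomial_counts, p, n, 500, rng)
mean_poisson = mean_kl(poisson_counts, p, n, 500, rng)
print("mean KL, multinomial sampling:", mean_multinomial)
print("mean KL, Poisson sampling:    ", mean_poisson)
```

The two averages should be close, which is the qualitative content of Lemma 2; the simulation says nothing about the constants in the lemma.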

Analysis

  • Combining the preceding bounds yields the concentration result for the KL divergence
  • Lemma 4 states a bound involving $p_i$, where $\gamma = 311/n + 160k/n^{3/2}$
  • Both $\sum_i p_i = 1$ and $\sum_i p^{+1}_i = 1$
  • $\mathrm{KL}(p \,\|\, p^{+1})$ is bounded with high probability
  • Jensen’s inequality used to bound right-hand side of equation
  • Lemma 3 and union bound used to bound each term
  • McDiarmid’s inequality applied to function with parameter δ/2
  • Theorem 3 provides lower bounds on variance of KL divergence of Laplace estimator
  • Corollary 1 shows that the $\sqrt{k}$ dependence is tight
  • Argument in Theorem 3 can be extended to the regime where $n$ is smaller than $k$
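
The $\sqrt{k}/n$ scaling of the deviation can be checked numerically. The following Monte Carlo sketch is an illustration only: the uniform distribution, the two $(k, n)$ pairs, the trial count, and the function names are all assumptions chosen here. It estimates the standard deviation of $\mathrm{KL}(p \,\|\, p^{+1})$ and rescales it by $n/\sqrt{k}$; if the deviation scales as $\sqrt{k}/n$, the rescaled values should be of constant order and roughly equal across the two settings.

```python
import math
import random
import statistics

def kl_to_laplace(p, counts):
    """KL(p || p^{+1}) where p^{+1}_i = (N_i + 1) / (n + k)."""
    n, k = sum(counts), len(counts)
    return sum(pi * math.log(pi * (n + k) / (c + 1))
               for pi, c in zip(p, counts) if pi > 0)

def kl_std(k, n, trials, rng):
    """Sample standard deviation of the KL loss for uniform p over k symbols."""
    p = [1.0 / k] * k
    losses = []
    for _ in range(trials):
        counts = [0] * k
        for i in rng.choices(range(k), k=n):  # n uniform draws from {0, ..., k-1}
            counts[i] += 1
        losses.append(kl_to_laplace(p, counts))
    return statistics.stdev(losses)

rng = random.Random(0)
scaled = {}
for k, n in [(20, 400), (80, 1600)]:   # both settings satisfy n >= k
    scaled[(k, n)] = kl_std(k, n, 300, rng) * n / math.sqrt(k)
    print(f"k={k}, n={n}: std * n / sqrt(k) = {scaled[(k, n)]:.3f}")
```

Quadrupling $k$ while keeping $n/k$ fixed leaves the rescaled value roughly unchanged, whereas a deviation of order $k/n$ would make it grow by a factor of about two.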

Conclusion

  • KL divergence between underlying distribution and Laplace estimator can be bounded
  • Previous bound of $\tilde{O}(k \log(1/\delta)/n)$ improved to $\tilde{O}(\sqrt{k} \log^{5/2}(1/\delta)/n)$
  • Lower bound of $\Omega(\sqrt{k}/n)$ on variance and tail bound of KL loss of Laplace estimator established
  • Heuristic computation of leading constant done by asymptotic expansion of KL divergence
  • As $n \to \infty$, $p^{+1}_i \to p_i$ and Laplace and empirical estimator are similar
  • Chi-squared distribution used to approximate $\sum_{i=1}^{k} (p^{\mathrm{emp}}_i - 1/k)^2$
  • Poisson and Binomial random variables used to define convenient coupling
  • Standard fact used to upper bound equation
  • Figure 1 shows sample standard deviation vs. estimated standard deviation of Laplace estimator
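
The heuristic behind the leading constant can be sketched for the uniform distribution $p_i = 1/k$. This is a standard second-order expansion written out under the assumption that $n$ is large enough for the Laplace and empirical estimators to be interchangeable; it is not necessarily the exact computation in the paper. Write the estimate as $q_i = 1/k + \varepsilon_i$ with $\sum_i \varepsilon_i = 0$:

$$
\mathrm{KL}(p \,\|\, q) = -\frac{1}{k} \sum_{i=1}^{k} \log(1 + k \varepsilon_i) \approx -\frac{1}{k} \sum_{i=1}^{k} \left( k \varepsilon_i - \frac{k^2 \varepsilon_i^2}{2} \right) = \frac{k}{2} \sum_{i=1}^{k} \varepsilon_i^2 .
$$

Taking $\varepsilon_i \approx p^{\mathrm{emp}}_i - 1/k$, the quantity $n k \sum_{i=1}^{k} (p^{\mathrm{emp}}_i - 1/k)^2$ is Pearson's chi-squared statistic, which is approximately $\chi^2_{k-1}$ distributed. Hence

$$
\mathrm{KL}(p \,\|\, q) \approx \frac{\chi^2_{k-1}}{2n}, \qquad \text{with standard deviation} \approx \frac{\sqrt{2(k-1)}}{2n} = \frac{1}{n} \sqrt{\frac{k-1}{2}},
$$

which is consistent with the $\sqrt{k}/n$ scaling stated above.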