Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Discrete distribution estimation is a fundamental problem in statistics
Estimating an arbitrary discrete distribution in Kullback-Leibler (KL) divergence with vanishing probability of error is the goal
KL divergence is non-negative, unbounded and asymmetric
Maximum likelihood estimator (empirical estimator) is commonly used
Minimax rates and concentration bounds are typically studied
Laplace estimator is commonly used and has a rate of convergence of max
Concentration bounds of the form “with probability 1 − δ” are desired
McDiarmid’s inequality can be used for the ℓ1 distance
Best known bound for KL divergence is given by [4]
Main result is a bound on the minimax rate of the Laplace estimator
Lower bound on the variance of KL(p ‖ p^{+1}), where p^{+1} denotes the Laplace estimator, is established
Direct consequence of main result is improved sample complexity for estimating a tree-structured Bayesian network
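
The estimators named above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes the standard add-one definition of the Laplace estimator, (N_i + 1)/(n + k), and the function names and the toy distribution are hypothetical. It shows why the empirical estimator is problematic under KL divergence: an unseen symbol gets probability zero, which makes the divergence infinite, while the Laplace estimator stays finite.

```python
import math


def empirical_estimator(counts, n):
    """Maximum likelihood estimate: observed frequency of each symbol."""
    return [c / n for c in counts]


def laplace_estimator(counts, n):
    """Add-one (Laplace) estimate: (N_i + 1) / (n + k) for k symbols."""
    k = len(counts)
    return [(c + 1) / (n + k) for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in nats; infinite if q misses a symbol that p supports."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # the convention 0 * log(0 / q) = 0
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


# Toy example (hypothetical numbers): the third symbol is never observed.
p = [0.5, 0.3, 0.2]
counts = [3, 2, 0]
n = sum(counts)

kl_emp = kl_divergence(p, empirical_estimator(counts, n))
kl_lap = kl_divergence(p, laplace_estimator(counts, n))
print("KL to empirical estimator:", kl_emp)
print("KL to Laplace estimator:  ", kl_lap)
```

The asymmetry of KL divergence matters here: the true distribution p is the first argument, so any zero in the estimate is penalized without bound.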

McDiarmid’s inequality is a standard way to provide concentration bounds for discrete distribution estimators
Lemma 1 (McDiarmid’s inequality) states that if changing the i-th variable changes the value of the function f by at most c_i in absolute value, then with probability at least 1 − δ, |f − E[f]| ≤ √(Σ_i c_i² log(2/δ)/2); writing c∞(f) = max_i c_i, the bound simplifies to |f − E[f]| ≤ c∞(f) √(n log(2/δ)/2)
The goal is to obtain a good enough bound on c∞(f) to show that n·c∞(f)² decreases to 0
[6] observes that the ℓ1 distance between the true distribution and the empirical distribution satisfies c∞ ≤ 2/n
A direct application of McDiarmid’s inequality to KL(p ‖ p^{+1}) results in a vacuous bound
To provide a stronger bound, the KL divergence is written as a function of k counts N1, N2, …, Nk
The counts N_i are not independent of each other, so the Poisson sampling process is used
A concentration bound for KL under Poisson sampling yields a concentration bound for KL under multinomial sampling
A high probability bound on c∞(KL) is provided
Lemma 2 states that the expectations under multinomial and Poisson sampling are similar
A careful coupling between binomial and Poisson random variables with the same mean is used to obtain bounds on the quantity
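
As a small worked example of the McDiarmid step, the sketch below evaluates the deviation radius for the ℓ1 distance using c∞ ≤ 2/n. It assumes the standard two-sided form of the inequality, in which the radius is c∞ √(n log(2/δ)/2); the constants in the paper's Lemma 1 may differ, and the function names are hypothetical.

```python
import math


def mcdiarmid_radius(c_inf, n, delta):
    """Deviation radius from two-sided McDiarmid when every c_i <= c_inf.

    With probability at least 1 - delta, |f - E[f]| is at most this value.
    """
    return c_inf * math.sqrt(n * math.log(2.0 / delta) / 2.0)


def l1_radius(n, delta):
    """Radius for the l1 distance, using the bound c_inf <= 2 / n."""
    return mcdiarmid_radius(2.0 / n, n, delta)


# n * c_inf^2 = 4 / n shrinks to 0, so the radius shrinks like 1 / sqrt(n).
for n in (100, 10_000, 1_000_000):
    print(n, n * (2.0 / n) ** 2, l1_radius(n, delta=0.01))
```

The same function makes the difficulty with KL divergence concrete: unless c∞ is small enough that n·c∞² tends to 0, the radius does not shrink as n grows.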

Combining the equations yields the bound needed for the KL divergence result
Lemma 4 gives a bound involving p_i, where γ = 311/n + 160k/n^{3/2}
Both Σ_i p_i = 1 and Σ_i p^{+1}_i = 1
KL(p ‖ p^{+1}) is bounded with high probability
Jensen’s inequality used to bound right-hand side of equation
Lemma 3 and union bound used to bound each term
McDiarmid’s inequality applied to function with parameter δ/2
Theorem 3 provides lower bounds on variance of KL divergence of Laplace estimator
Corollary 1 shows that the √k dependence is tight
Argument in Theorem 3 can be extended to n k regime

KL divergence between underlying distribution and Laplace estimator can be bounded
Previous bound of Õ(k log(1/δ)/n) improved to Õ(√k log^{5/2}(1/δ)/n)
Lower bound of Ω(√k/n) on variance and tail bound of KL loss of Laplace estimator established
Heuristic computation of leading constant done by asymptotic expansion of KL divergence
As n → ∞, p^{+1}_i → p_i and the Laplace and empirical estimators are similar
Chi-squared distribution used to approximate Σ_{i=1}^{k} (p^{emp}_i − 1/k)²
Poisson and Binomial random variables used to define convenient coupling
Standard fact used to upper bound equation
Figure 1 shows sample standard deviation vs. estimated standard deviation of Laplace estimator
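
The comparison behind such a figure can be approximated with a short Monte Carlo sketch. Everything here is an assumption made for illustration: the true distribution is taken to be uniform over k symbols, the values of k, n, and the number of trials are arbitrary, and the heuristic standard deviation comes from a back-of-the-envelope second-order expansion, KL ≈ (n/(n + k))² χ²_{k−1}/(2n), which gives (n/(n + k))² √((k − 1)/2)/n. That constant is not taken from the paper and may differ from its estimate.

```python
import math
import random
import statistics


def laplace_kl_uniform(k, n, rng):
    """Draw n samples from the uniform distribution on k symbols and
    return KL(p || p_laplace) for the add-one (Laplace) estimate."""
    counts = [0] * k
    for symbol in rng.choices(range(k), k=n):
        counts[symbol] += 1
    p = 1.0 / k
    # p_laplace_i = (N_i + 1) / (n + k), so p / p_laplace_i = p (n + k) / (N_i + 1)
    return sum(p * math.log(p * (n + k) / (c + 1)) for c in counts)


def heuristic_std(k, n):
    """Chi-squared approximation of the standard deviation (an assumption)."""
    shrink = (n / (n + k)) ** 2
    return shrink * math.sqrt((k - 1) / 2.0) / n


rng = random.Random(0)  # fixed seed so the run is reproducible
k, n, trials = 10, 1000, 1000
samples = [laplace_kl_uniform(k, n, rng) for _ in range(trials)]

sample_std = statistics.stdev(samples)
ratio = sample_std / heuristic_std(k, n)
print("sample std:   ", sample_std)
print("heuristic std:", heuristic_std(k, n))
print("ratio:        ", ratio)
```

Both quantities scale like √k/n in this regime, which matches the order of the lower bound stated above; the ratio is expected to be close to 1 only when n is much larger than k.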