Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • This paper examines the computational complexity of learning a Hidden Markov Model (HMM).
  • It proposes an interactive access model, in which the algorithm can query for samples from the conditional distributions of the HMMs.
  • This model enables computationally efficient learning algorithms, bypassing cryptographic hardness.
  • Algorithms are presented for two settings: one with query access to exact conditional probabilities, and one with samples from the conditional distributions.
  • The performance of the algorithm depends on a new parameter, called the fidelity of the HMM.
  • The algorithms can be viewed as generalizations and robustifications of Angluin’s $L^*$ algorithm.

Paper Content

Introduction

  • HMMs are used to model temporal and sequential phenomena
  • HMMs have low description complexity, expressivity to capture long-range dependencies, and efficient inference algorithms
  • HMMs are used in many fields
  • Estimating/learning HMMs is computationally difficult
  • We focus on distribution learning in total variation distance
  • Maximum likelihood estimation is known to be statistically efficient, but not computationally efficient
  • We consider interactive access to the HMM
  • We show how Angluin’s L* algorithm can efficiently learn any HMM when exact conditional probabilities are available
  • We show an algorithm that is efficient for all HMMs with “high fidelity”
  • We introduce a new representation for distributions over exponentially large domains
  • We introduce a new perturbation argument for mitigating error amplification over long sequences

Preliminaries

  • Let O denote a finite observation space and O* denote observation sequences of arbitrary length
  • Consider a distribution Pr[•] over T random variables X1, …, XT with a sequential ordering
  • Pr[x1, x2, …, xT] is written in lieu of Pr[X1=x1, …, XT=xT], omitting explicit reference to the random variables
  • Pr[f|h] denotes the conditional probability of a future f given a history h; for a set F of futures and a set H of histories, Pr[F|H] denotes the |F| × |H| matrix whose (i, j)th entry is Pr[f_i|h_j]
  • Hidden Markov Models provide a low-complexity parametrization for distributions over observation sequences
  • An HMM with S hidden states is specified by an initial distribution µ, an emission matrix O, and a state transition matrix T
  • A distribution has rank r if, for every t ∈ [T], the conditional probability matrix $\Pr[O^{\le T-t} \mid O^t]$ has rank at most r
  • An HMM with S hidden states has rank at most S
  • Exact conditional probability oracle is given as input observation sequences h and f of length t ≤ T and T − t respectively and returns the scalar Pr[f|h]
  • Conditional sampling oracle is given as input an observation sequence h of length t ≤ T and returns an observation sequence f of length T − t
  • Learning goal is distribution learning in total variation distance
  • Algorithm should compute an estimate $\widehat{\Pr}[\cdot]$ such that with probability at least 1 − δ, the total variation distance between Pr[•] and $\widehat{\Pr}[\cdot]$ is at most ε
  • Algorithm should have computational complexity that scales polynomially in r, T, |O|, 1/ε and log(1/δ)
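
The two oracles defined above can be simulated on a toy HMM, which may help make the access model concrete. This is a minimal sketch under stated assumptions: the class name `ToyHMM`, the row-stochastic conventions for `trans` and `emit`, and the example parameters are all illustrative and not taken from the paper. A learner would only see the outputs of `cond_prob` and `sample_future`, never the parameters.

```python
import random
from itertools import product


class ToyHMM:
    """Toy HMM used only to simulate the two oracles (all names are illustrative)."""

    def __init__(self, mu, trans, emit):
        self.mu = mu          # mu[s] = Pr[first hidden state is s]
        self.trans = trans    # trans[s][s2] = Pr[next state s2 | current state s]
        self.emit = emit      # emit[s][o] = Pr[observation o | state s]
        self.n_states = len(mu)
        self.n_obs = len(emit[0])

    def prob(self, seq):
        """Pr[x_1, ..., x_t] via the forward recursion; the empty sequence has probability 1."""
        alpha = list(self.mu)
        for o in seq:
            # Weight each state by the chance of emitting o, then move one step.
            weighted = [alpha[s] * self.emit[s][o] for s in range(self.n_states)]
            alpha = [sum(weighted[s] * self.trans[s][s2] for s in range(self.n_states))
                     for s2 in range(self.n_states)]
        return sum(alpha)

    def cond_prob(self, h, f):
        """Exact conditional probability oracle: Pr[f | h]."""
        base = self.prob(list(h))
        if base == 0.0:
            raise ValueError("history has probability zero")
        return self.prob(list(h) + list(f)) / base

    def sample_future(self, h, length, rng):
        """Conditional sampling oracle: draw a future of the given length from Pr[. | h]."""
        seq = list(h)
        for _ in range(length):
            weights = [self.cond_prob(seq, [o]) for o in range(self.n_obs)]
            seq.append(rng.choices(range(self.n_obs), weights=weights)[0])
        return seq[len(h):]


hmm = ToyHMM(mu=[0.6, 0.4],
             trans=[[0.7, 0.3], [0.2, 0.8]],
             emit=[[0.9, 0.1], [0.3, 0.7]])
future = hmm.sample_future([0, 1], 3, random.Random(0))
```

Brute-force checks such as summing `hmm.prob` over all sequences of a fixed length are only feasible for tiny T, which is one way to see why the domain size $|O|^T$ rules out naive tabulation.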

Our results

  • Algorithm 1 can learn any HMM in polynomial time given access to an exact probability oracle
  • Algorithm 1 requires 0 < ε, δ < 1
  • Algorithm 1 returns an efficiently represented approximation of the distribution
  • Algorithm 2 is a robust version of L*
  • Algorithm 2 depends on a spectral property of a distribution called fidelity
  • Open Problem 1.6: Is there a computationally efficient algorithm for learning any low rank distribution given access to a conditional sampling oracle?

Technical overview

  • Low rank distributions are challenging to learn.
  • Notation is introduced to explain the challenges.
  • Estimating matrices is necessary for distribution learning.
  • Low rank property does not provide an efficient representation of the distribution.

Background: observable operators and hard instances

  • HMMs can be used to obtain an efficient algorithm.
  • Probability of a sequence can be written using observable operator representation.
  • Operators can be estimated when T and O have full column rank.
  • Rank deficient HMMs are hard instances.
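
For reference, the observable operator representation mentioned above has the following standard form. The conventions are assumptions made for illustration: $\mathbb{T}$ denotes the transition matrix (written this way to avoid a clash with the sequence length $T$) with $\mathbb{T}_{ij} = \Pr[s_{t+1} = i \mid s_t = j]$, $O_{oj} = \Pr[o \mid s = j]$, and $\mu$ is the initial state distribution. The paper may use a different but equivalent indexing.

$$
A_o = \mathbb{T}\,\mathrm{diag}(O_{o,1}, \dots, O_{o,S}), \qquad
\Pr[x_1, \dots, x_T] = \mathbf{1}^\top A_{x_T} \cdots A_{x_1}\, \mu .
$$

Each $A_o$ is an $S \times S$ matrix, so the whole distribution is described by $|O|$ such matrices and the vector $\mu$, even though it assigns probabilities to $|O|^T$ sequences.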

Efficient representation

  • Rank deficient HMMs require efficient representation of the distribution
  • Any submatrix of $\Pr[F_t \mid H_t]$ with the same rank as the entire matrix can be used to build an efficient representation
  • Exploiting a circulant structure in the matrices $\{\Pr[F_t \mid H_t]\}_{t \le T}$ can model the evolution of the coefficients
  • Sequence probabilities can be expressed by iterated application of the circulant structure
  • Estimating operators requires interactive access and a novel error propagation argument
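
One way to see why a basis leads to operators is sketched below. It assumes that, for a history $h$ of length $t$ with positive probability, the conditional distribution over futures is a linear combination of the conditionals of basis histories $b \in B_t$, and that these coefficients are unique. The symbol $\beta_b(h)$ and the entry formula for $A_{o,t}$ are derived here for illustration and may differ in normalization from the paper's definitions.

$$
\Pr[f \mid h] = \sum_{b \in B_t} \beta_b(h)\, \Pr[f \mid b] \quad \text{for every future } f .
$$

Taking $f = o f'$ and expanding each $\Pr[f' \mid bo]$ in the basis $B_{t+1}$ gives

$$
\Pr[o \mid h]\, \Pr[f' \mid ho]
= \sum_{b' \in B_{t+1}} \Big( \sum_{b \in B_t} \Pr[o \mid b]\, \beta_{b'}(bo)\, \beta_b(h) \Big) \Pr[f' \mid b'] ,
$$

so that

$$
\Pr[o \mid h]\, \beta(ho) = A_{o,t}\, \beta(h), \qquad (A_{o,t})_{b',b} = \Pr[o \mid b]\, \beta_{b'}(bo) .
$$

Iterating from the empty history and using that the coefficients sum to one (take $f$ to be the empty future) yields $\Pr[x_1, \dots, x_T] = \mathbf{1}^\top A_{x_T,T-1} \cdots A_{x_1,0}\, \beta(\emptyset)$, which has the same shape as the observable operator form but does not require knowing the hidden states.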

Error propagation

  • Finding and estimating operators is difficult
  • Error amplification can arise from repeated application of learned operators
  • Estimating operators is discussed in Section 2.4 and finding the basis in Section 2.5
  • Estimator is defined in terms of estimated operators
  • Total variation distance is defined
  • Two strategies for bounding expression are discussed
  • New perturbation analysis is introduced
  • Tracks error in space of coefficients
  • Sum of scalars is small via inductive argument
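
The total variation distance used as the error measure above can be computed by brute force on toy examples. The sketch below is only an illustration of the definition: the helper names `tv_distance` and `iid_coin` are hypothetical, and the product distributions are a stand-in chosen because their per-step errors are easy to read off. It does not reproduce the paper's perturbation argument.

```python
from itertools import product


def tv_distance(p, q, n_obs, length):
    """TV(p, q) = 1/2 * sum over all sequences x of |p(x) - q(x)| (exponential time)."""
    return 0.5 * sum(abs(p(x) - q(x)) for x in product(range(n_obs), repeat=length))


def iid_coin(p_zero):
    """Distribution over binary sequences with independent symbols, Pr[0] = p_zero."""
    def prob(x):
        out = 1.0
        for o in x:
            out *= p_zero if o == 0 else 1.0 - p_zero
        return out
    return prob


true_dist, learned_dist = iid_coin(0.5), iid_coin(0.6)
tv_by_length = [tv_distance(true_dist, learned_dist, 2, t) for t in (1, 2, 3)]
```

In this toy case a per-step error of 0.1 grows with the sequence length but stays below the sum of the per-step errors. With learned operators that are applied repeatedly, no such bound comes for free, which is the amplification issue the bullets above describe.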

Estimating operators

  • Estimate operators using conditional sampling oracle
  • Linear regression may not work due to small singular values
  • Preconditioner introduced to stabilize system
  • Preconditioner reduces size of matrix
  • Entries of matrix can be estimated using conditional samples
  • Preconditioner amplifies singular values of matrix
  • Fidelity introduced to ensure large singular values of matrix
  • Fidelity captures previously studied positive results for learning HMMs
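
Estimating a matrix entry from conditional samples, as mentioned above, amounts to a frequency count. The sketch below assumes a hypothetical oracle signature `sample_future(h, rng)` that returns a full future for history `h`; the `coin_oracle` stand-in ignores the history and is there only so the example runs.

```python
import random


def estimate_cond_prob(sample_future, h, f, m, rng):
    """Estimate Pr[f | h] as the fraction of m sampled futures that begin with f."""
    f = tuple(f)
    hits = 0
    for _ in range(m):
        drawn = sample_future(h, rng)
        hits += int(tuple(drawn[:len(f)]) == f)
    return hits / m


def coin_oracle(h, rng, horizon=4, p_one=0.3):
    """Stand-in oracle (assumption): i.i.d. symbols with Pr[1] = p_one, history ignored."""
    return [1 if rng.random() < p_one else 0 for _ in range(horizon - len(h))]


estimate = estimate_cond_prob(coin_oracle, h=(0,), f=(1,), m=20000, rng=random.Random(0))
```

The estimate has additive error on the order of $1/\sqrt{m}$, so entries much smaller than that are indistinguishable from zero. This is one reason a linear system built from such estimates becomes unreliable when the underlying matrix has small singular values.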

Finding the basis

  • Challenge is to find bases $\{B_t\}_{t \in [T]}$
  • Random sampling approach works for high fidelity distributions
  • Basis finding is the final issue to address for Theorem 1
  • Adaptation of Angluin’s L* algorithm used to find bases
  • Algorithm checks if predictions are accurate for polynomially many random sequences
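
The core test behind basis finding, whether a new history's column already lies in the span of the columns kept so far, can be sketched as follows. This is not the L* adaptation itself: it is a simplified greedy selection that assumes a pool of candidate histories and a fixed list of test futures are given, and the names `grow_basis` and `column_of` are hypothetical. In the L*-style procedure, new histories would instead come from sequences on which the current predictions are found to be inaccurate.

```python
import math


def grow_basis(candidates, column_of, tol=1e-9):
    """Greedily keep candidate histories whose columns are not (numerically)
    in the span of the columns already kept (modified Gram-Schmidt)."""
    basis, ortho = [], []  # kept histories and their orthonormalised columns
    for h in candidates:
        v = list(column_of(h))
        for q in ortho:
            # Remove the component of v along each kept direction.
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > tol:
            basis.append(h)
            ortho.append([a / norm for a in v])
    return basis


# Toy columns: each entry would be Pr[f | h] for one test future f.
columns = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0],
           "c": [0.5, 0.5, 0.0], "d": [0.0, 0.0, 2.0]}
selected = grow_basis(["a", "b", "c", "d"], columns.__getitem__)
```

With exact conditional probabilities the tolerance can be tiny. With sampled estimates it would have to be set relative to the estimation error, which is where a robustness condition on the basis becomes necessary.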

Learning with conditional probabilities (theorem 1)

  • Theorem 1 states that Algorithm 1 can return an approximation of a probability distribution in poly(r, T, 1/ε, log(1/δ)) time
  • Notation is introduced to define histories and futures of length t
  • Probabilities associated to empty string are defined
  • Bases for the distribution are formally defined
  • Structural result is introduced to generate coefficients using |O|T matrices of size r × r
  • Equation (7) is introduced to represent the probability distribution
  • Solution $A_{o,t}$ is introduced for Equation (7)

Algorithm

  • Algorithm 2 requires the user to provide ε, δ, ∆* and r.
  • Algorithm 2 relies on the efficient representation provided by Proposition 3.2.
  • Algorithm 2 estimates the operators $A_{o,t-1}$ using conditional samples and linear regression.

Analysis

  • Access to exact conditional probability oracle is no longer available, only samples can be obtained
  • Robust bases must be defined to control estimation errors
  • Estimation algorithm and proof provided in Appendix B.6
  • Estimation error for operators $A_{o,t}$ can be characterized
  • Errors in induced distributions can be bounded using structured error

Discussion

  • Interactive access to hidden Markov models can circumvent computational barriers to efficient learning.
  • All low rank distributions with a certain fidelity property can be efficiently learned.
  • Fidelity captures assumptions considered in prior work on learning of HMMs.
  • Overcomplete setting of Sharan et al. admits bases of size S with fidelity 1/poly(S).
  • Reliance on fidelity parameter is the main limitation of results.
  • Open problem is to show that ignoring small directions preserves low rank property.
  • Algorithm 1 with access to an exact probability oracle runs in poly(r, T, 1/ε, log(1/δ)) time.

B.1 finding robust basis

  • Finding a robust basis can be defined by a covariance matrix
  • The norm of the basis is upper bounded
  • The two distributions are only a small factor apart
  • Approximations of projections and coefficients need to be learned
  • Error of the approximation is small
  • Process requires many conditional samples

B.3 perturbation analysis: error in coefficients

  • Learn approximations of operators A o,t to compute probabilities
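  • Let $A_{x_{1:t}}$ and $\widehat{A}_{x_{1:t}}$ denote the products of matrices $A_{x_t,t-1} \cdots A_{x_1,0}$ and $\widehat{A}_{x_t,t-1} \cdots \widehat{A}_{x_1,0}$, respectively
  • $B_{x_{1:t}} \subset H_{t+1}$ is a subset of histories of length t+1
  • The $\ell_1$ norms of the associated coefficients $\gamma_{x_{1:t}}$ and $\gamma^{\perp}_{x_{1:t}}$ grow moderately
  • For any observation sequence x, the $\ell_1$ norm of the coefficients can be bounded
  • Let $\alpha(x_t, B_{x_{1:t-1}})$ denote the matrix whose columns are $\alpha(x_t, b)$ for $b \in B_{x_{1:t-1}}$
  • Define $\alpha^{\perp}(x_t, B_{x_{1:t-1}})$, $\alpha(x_t, V^{\perp}_{t-1})$ and $\alpha^{\perp}(x_t, V^{\perp}_{t-1})$ analogously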
  • Recursion from Proposition B.5 has solution
  • Algorithm 2 with access to conditional sampling oracle runs in polynomial time
  • Returns approximation $\widehat{\Pr}[\cdot]$ satisfying $\mathrm{TV}(\Pr, \widehat{\Pr}) \le \varepsilon$ with probability at least 1 − δ
  • Let $\{B_t\}_{t \in [T]}$ be a basis of the distribution Pr[•]
  • Define operators $A_{o,t}$ under basis $\{B_t\}_{t \in [T]}$
  • Covariance matrix associated to $B_t$ has eigenvalue decomposition
  • Let $d^*_t$ be the restriction of distribution $d_t$ over the set $\mathrm{dom}^+(t)$
  • Let $\beta(x) \in \mathrm{span}(V_t)$ be coefficients associated to history x
  • β(x) are uniquely defined in $\mathrm{span}(V_t)$
  • β(x) sum to one, even though some entries could be negative
  • Existence of operators which can be used to construct coefficients

B.6 estimating covariance matrix in frobenius norm

  • Estimate objects needed for operator A o,t
  • Lemma B.12 states that with probability 1 − δ, we can learn an estimate of $s(b^*, x)$
  • Define $s(b^*, x)$ as a sum where $b^* \in B_t$ and x is a history of length t
  • Define Pr[•|b] for $b \in B_t$
  • Sample $m = (1/2c^2 n^3 p^2) \log(2/\delta)$ random futures from Pr[•|x]
  • Estimate $\Pr[f \mid b^*]$ and $d(f)$
  • Define α-regular future
  • Perform test A(f, b) for each future f and basis history b
  • Estimate $q(bo)$ and $\Sigma_{B_t}$ for all $b \in B_t$, observations o and time t ∈ [T]
  • Parity with noise, and the settings of all previously known positive results, can be learned by the algorithm
  • Define distribution induced by parity with noise
  • Proposition C.4 shows that distribution has rank ≤ 2T and fidelity $(1-2\alpha)^2/2$
  • Define overcomplete HMMs
  • Proposition C.7 shows that distribution has rank S and fidelity 1/poly(S)

D general algorithm for finding approximate basis

  • Definition D.1 defines an approximate basis for a probability vector
  • Theorem 3 presents a main result on how to build an approximate basis for a regular low rank distribution
  • The regularity assumption on the distribution can be removed using ideas from Appendix B.6

D.1 learning coefficients

  • Check if there exists a β(x) such that a certain condition is met
  • Define an $\ell_2$ approximation error
  • Use relative probabilities for regular distributions to build a guess for approximate basis
  • Use poly(T, 1/ε, 1/α, log(1/δ)) many conditional samples to get estimates
  • Use a 2-smooth function and a standard uniform convergence argument
  • Use Hoeffding’s inequality
  • Use an Elliptical Potential Lemma
  • Choose C, H and n
  • Find a counterexample
  • Show that the overall error of the basis is small
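
The Hoeffding step above translates into a concrete sample count. For independent samples taking values in [0, 1], the two-sided bound $2\exp(-2m\varepsilon^2) \le \delta$ gives the number of samples needed for an empirical mean to be within ε of its expectation with probability at least 1 − δ. The function name `hoeffding_samples` is illustrative, and the paper's own constants may differ.

```python
import math


def hoeffding_samples(eps, delta):
    """Smallest m with 2 * exp(-2 * m * eps**2) <= delta for [0, 1]-valued samples."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


m = hoeffding_samples(eps=0.1, delta=0.05)
```

The count grows as $1/\varepsilon^2$ but only logarithmically in $1/\delta$, which is why a union bound over polynomially many estimated quantities keeps the total number of samples polynomial.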

E helper propositions

  • Proposition E.1 (Hoeffding’s inequality) states that for independent random variables with a lower and upper bound, the sum of these random variables can be bounded by a certain value.
  • Davis-Kahan theorem is used in the work.
  • Algorithm 2 with access to a conditional sampling oracle runs in poly(r, T, |O|, 1/∆*, 1/ε, log(1/δ)) time and returns an efficiently represented approximation $\widehat{\Pr}[\cdot]$ satisfying $\mathrm{TV}(\Pr, \widehat{\Pr}) \le \varepsilon$ with probability at least 1 − δ.
  • Lemma 4.2 states that with probability 1 − δ, $\{S_t\}_{t \in [T]}$ form ∆-robust bases for Pr[•].
  • Lemma 4.4 states that we can learn approximations $\widehat{A}_{o,t}$ for all observations o ∈ O and t ∈ [T] in poly(r, |O|, T, 1/ε, 1/∆, log(1/δ)) time such that with probability 1 − δ, for any unit vector v.
  • Lemma 4.5 states that the functions Pr[•] and $\widehat{\Pr}[\cdot]$ are close in TV distance: $\mathrm{TV}(\Pr, \widehat{\Pr}) \le 2|O|T\varepsilon$.
  • Definition B.14 states that Test A(f, b) passes if the empirical estimate $\widehat{\Pr}[f_\tau \mid b f_{1:\tau-1}] > 2\alpha$ for all τ ∈ [t] and fails otherwise.
  • Proposition B.17 states that $\Pr[F_b \mid b] \le O(|O|T\alpha)$.
  • Definition B.15 states that a future f is α-irregular for history $b \in B_t$ if there exists some τ ∈ [t] (τ can depend on b) such that $\Pr[f_\tau \mid b f_{1:\tau-1}] < \alpha$.
  • Proposition B.18 states that $\Pr[F_b \mid b] \le |O|T\alpha$.
  • Definition B.19 states that the estimate $\widehat{\Pr}[f \mid b]$ is set to 0 if test A(f, b) fails and is set to the estimate from Proposition B.16 if test A(f, b) passes.
  • Proposition B.20 states that $|\widehat{\Pr}[f \mid b] - \Pr[f \mid b]| \le \gamma \Pr[f \mid b]$.
  • Proposition D.4 states that there exists h ≤ H such that $\min_{\beta \in \mathbb{R}^h,\ \|\beta\|_2 \le C} L_{B_h, b_{h+1}}(\beta) \le \varepsilon$.
  • Proposition D.5 states that there exists h ≤ H such that $\min_{\beta \in \mathbb{R}^h,\ \|\beta\|_2 \le C} L_{B_h, b_{h+1}}(\beta) \le \varepsilon$.