Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

This paper examines the computational complexity of learning a Hidden Markov Model (HMM).
It proposes an interactive access model, in which the algorithm can query for samples from the conditional distributions of the HMMs.
This model enables computationally efficient learning algorithms, bypassing cryptographic hardness.
Algorithms are presented for two settings: one with query access to exact conditional probabilities, and one with samples from the conditional distributions.
The performance of the second algorithm depends on a new parameter, called the fidelity of the HMM.
The algorithms can be viewed as generalizations and robustifications of Angluin’s $L^*$ algorithm.

Paper Content

Introduction

HMMs are used to model temporal and sequential phenomena
HMMs have low description complexity, expressivity to capture long-range dependencies, and efficient inference algorithms
HMMs are used in many fields
Estimating/learning HMMs is computationally difficult
We focus on distribution learning in total variation distance
Maximum likelihood estimation is known to be statistically efficient, but not computationally efficient
We consider interactive access to the HMM
We show how a generalization of Angluin's L* algorithm can efficiently learn any HMM given exact conditional probabilities
We show an algorithm that is efficient for all HMMs with “high fidelity”
We introduce a new representation for distributions over exponentially large domains
We introduce a new perturbation argument for mitigating error amplification over long sequences

Preliminaries

Let O denote a finite observation space and O* denote observation sequences of arbitrary length
Consider a distribution Pr[•] over T random variables x1, …, xT with a sequential ordering
Pr[x1, x2, …, xT] is written in lieu of Pr[x1=x1, …, xT=xT], omitting explicit reference to the random variables
For a set of futures F and a set of histories H, Pr[F|H] is written to denote the |F| x |H| matrix whose (i, j)th entry is Pr[f_i|h_j]
Hidden Markov Models provide a low-complexity parametrization for distributions over observation sequences
An HMM with S hidden states is specified by an initial distribution µ, an emission matrix O, and a state transition matrix T
A distribution has rank r if, for every t ∈ [T], the conditional probability matrix $\Pr[O^{\le T-t} \mid O^{t}]$ has rank at most r
An HMM with S hidden states has rank at most S
Exact conditional probability oracle is given as input observation sequences h and f of length t ≤ T and T − t respectively and returns the scalar Pr[f|h]
Conditional sampling oracle is given as input an observation sequence h of length t ≤ T and returns an observation sequence f of length T − t
Learning goal is distribution learning in total variation distance
Algorithm should compute an estimate $\widehat{\Pr}[\cdot]$ such that with probability at least 1 − δ, the total variation distance between Pr[•] and $\widehat{\Pr}[\cdot]$ is at most ε
Algorithm should have computational complexity that scales polynomially in r, T, |O|, 1/ε and log(1/δ)
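The two access models above can be made concrete on a toy HMM. The following is a minimal sketch, not the paper's code: the class name `TinyHMM`, the column-stochastic layout (`emit[o][s]` is the probability of observing `o` in state `s`, `trans[s2][s]` is the probability of moving from `s` to `s2`) and the example parameters are all assumptions made for illustration.

```python
import itertools
import random


class TinyHMM:
    """Toy HMM exposing the two oracles (hypothetical helper, not from the paper)."""

    def __init__(self, mu, emit, trans, horizon):
        self.mu, self.emit, self.trans, self.T = mu, emit, trans, horizon
        self.S, self.n_obs = len(mu), len(emit)

    def _filter(self, seq):
        """Return (Pr[seq], unnormalised next-state weights after seq)."""
        w = list(self.mu)
        for o in seq:
            # Weight each state by the chance it emits o, then take one transition.
            w = [w[s] * self.emit[o][s] for s in range(self.S)]
            w = [sum(self.trans[s2][s] * w[s] for s in range(self.S))
                 for s2 in range(self.S)]
        return sum(w), w

    def prob(self, seq):
        return self._filter(seq)[0]

    def cond_prob(self, f, h):
        """Exact conditional probability oracle: the scalar Pr[f | h]."""
        ph = self.prob(h)
        if ph == 0.0:
            raise ValueError("history has probability zero")
        return self.prob(tuple(h) + tuple(f)) / ph

    def sample_future(self, h, rng):
        """Conditional sampling oracle: a future of length T - len(h) drawn from Pr[. | h]."""
        _, w = self._filter(h)
        state = rng.choices(range(self.S), weights=w)[0]
        future = []
        for _ in range(self.T - len(h)):
            obs_weights = [self.emit[o][state] for o in range(self.n_obs)]
            future.append(rng.choices(range(self.n_obs), weights=obs_weights)[0])
            next_weights = [self.trans[s2][state] for s2 in range(self.S)]
            state = rng.choices(range(self.S), weights=next_weights)[0]
        return tuple(future)


# Assumed example parameters: 2 hidden states, 2 observations, horizon 3.
hmm = TinyHMM(mu=[0.6, 0.4],
              emit=[[0.9, 0.2], [0.1, 0.8]],
              trans=[[0.7, 0.3], [0.3, 0.7]],
              horizon=3)
total_mass = sum(hmm.prob(x) for x in itertools.product(range(2), repeat=3))
sampled = hmm.sample_future((0, 1), random.Random(0))
```

A learner in the first setting would call only `cond_prob`; a learner in the second setting would call only `sample_future` and has to estimate every probability it needs from the returned futures.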

Our results

Algorithm 1 can learn any HMM given access to an exact probability oracle in poly time
Algorithm 1 requires 0 < ε, δ < 1
Algorithm 1 returns an efficiently represented approximation of the distribution
Algorithm 2 is a robust version of L*
Algorithm 2 depends on a spectral property of a distribution called fidelity
Open Problem 1.6: Is there a computationally efficient algorithm for learning any low rank distribution given access to a conditional sampling oracle?

Technical overview

Low rank distributions are challenging to learn.
Notation is introduced to explain the challenges.
Estimating matrices is necessary for distribution learning.
Low rank property does not provide an efficient representation of the distribution.

Background: observable operators and hard instances

HMMs can be used to obtain an efficient algorithm.
Probability of a sequence can be written using observable operator representation.
Operators can be estimated when T and O have full column rank.
Rank deficient HMMs are hard instances.
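The observable operator representation mentioned above can be checked numerically. The sketch below assumes the common convention $A_o = T\,\mathrm{diag}(O_{o,\cdot})$ with column-stochastic `trans` and `emit`; the parameters and function names are illustrative and not taken from the paper. It compares the operator product with a brute-force sum over hidden paths.

```python
import itertools

# Assumed toy parameters (columns of emit and trans sum to one).
mu = [0.6, 0.4]
emit = [[0.9, 0.2], [0.1, 0.8]]    # emit[o][s]  = Pr[o | s]
trans = [[0.7, 0.3], [0.3, 0.7]]   # trans[s2][s] = Pr[s2 | s]
S = 2


def operator(o):
    """A_o[s2][s] = Pr[s2 | s] * Pr[o | s]: emit o in state s, then move to s2."""
    return [[trans[s2][s] * emit[o][s] for s in range(S)] for s2 in range(S)]


def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]


def prob_via_operators(seq):
    """Pr[x_1..x_t] = 1^T A_{x_t} ... A_{x_1} mu."""
    v = list(mu)
    for o in seq:
        v = matvec(operator(o), v)
    return sum(v)


def prob_brute_force(seq):
    """Sum the joint probability over every hidden state path."""
    if not seq:
        return 1.0
    total = 0.0
    for path in itertools.product(range(S), repeat=len(seq)):
        p = mu[path[0]]
        for t, o in enumerate(seq):
            p *= emit[o][path[t]]
            if t + 1 < len(seq):
                p *= trans[path[t + 1]][path[t]]
        total += p
    return total


max_gap = max(abs(prob_via_operators(x) - prob_brute_force(x))
              for x in itertools.product(range(2), repeat=3))
```

The operators here are built from the true parameters. The learning problem is harder because the operators must be estimated from data, which is where the full column rank condition and the rank deficient hard instances enter.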

Efficient representation

Rank deficient HMMs require efficient representation of the distribution
Any submatrix of $\Pr[F_t \mid H_t]$ with the same rank as the entire matrix can be used to build an efficient representation
Exploiting a circulant structure in the matrices $\{\Pr[F_t \mid H_t]\}_{t \le T}$ can model the evolution of the coefficients
Sequence probabilities can be expressed by iterated application of the circulant structure
Estimating operators requires interactive access and a novel error propagation argument
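The way a basis turns into operators can be sketched in a few lines of algebra. This is a reconstruction under assumptions, not the paper's exact statement: $B_t$ is taken to be a set of basis histories of length $t$ whose columns span the columns of $\Pr[F_t \mid H_t]$, $\beta(x)$ denotes the coefficients of a history $x$ in that basis, and the entry-wise formula for $A_{o,t}$ is one natural normalization that may differ from the one used in the paper.

```latex
% Assumption: every history x of length t has coefficients \beta(x) over B_t
\Pr[f \mid x] \;=\; \sum_{b \in B_t} \beta_b(x)\, \Pr[f \mid b]
  \qquad \text{for all futures } f .

% Chain rule on a basis history b followed by an observation o,
% then expansion of the history bo in the basis B_{t+1}
\Pr[o f' \mid b] \;=\; \Pr[o \mid b]\, \Pr[f' \mid bo]
  \;=\; \sum_{b' \in B_{t+1}} \Pr[o \mid b]\, \beta_{b'}(bo)\, \Pr[f' \mid b'] .

% Collecting the coefficients into an |B_{t+1}| x |B_t| operator
(A_{o,t})_{b',b} \;:=\; \Pr[o \mid b]\, \beta_{b'}(bo)
  \quad\Longrightarrow\quad
  \Pr[o F_{t+1} \mid B_t] \;=\; \Pr[F_{t+1} \mid B_{t+1}]\, A_{o,t} .

% Iterating from the empty history (the only future of length 0 has probability 1)
\Pr[x_{1:T}] \;=\; \mathbf{1}^{\top} A_{x_T,\,T-1} \cdots A_{x_1,\,0}\, \beta(\varnothing) .
```

Under these assumptions each operator has at most $r \times r$ entries, so the whole distribution is described by $|O| \cdot T$ small matrices even though it lives on a domain of size $|O|^T$.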

Error propagation

Finding and estimating operators is difficult
Error amplification can arise from repeated application of learned operators
Estimating operators is discussed in Section 2.4 and finding the basis in Section 2.5
Estimator is defined in terms of estimated operators
Total variation distance is defined
Two strategies for bounding expression are discussed
New perturbation analysis is introduced
Tracks error in space of coefficients
Sum of scalars is small via inductive argument
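For intuition about how error accumulates with sequence length, total variation distance can be computed exactly on toy problems by enumerating every sequence. The sketch below is only a sanity-check tool under assumed names (`tv_distance`, `iid`); it takes time exponential in T and is not how the paper bounds the error.

```python
import itertools


def tv_distance(p, q, n_obs, T):
    """TV(p, q) = 1/2 * sum over all length-T sequences of |p(x) - q(x)|."""
    seqs = itertools.product(range(n_obs), repeat=T)
    return 0.5 * sum(abs(p(x) - q(x)) for x in seqs)


def iid(bias):
    """Sequence distribution with independent binary symbols, Pr[1] = bias."""
    def prob(x):
        out = 1.0
        for o in x:
            out *= bias if o == 1 else 1.0 - bias
        return out
    return prob


# A per-step error of 0.1 grows with the horizon, but at most linearly.
p, q = iid(0.5), iid(0.6)
per_step = tv_distance(p, q, 2, 1)
for T in (1, 2, 4, 8):
    print(T, round(tv_distance(p, q, 2, T), 4), "<=", round(T * per_step, 4))
```

For independent symbols the distance is bounded by T times the per-step distance. With learned operators applied repeatedly, no such bound comes for free, which is the reason a dedicated perturbation analysis is needed.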

Estimating operators

Estimate operators using conditional sampling oracle
Linear regression may not work due to small singular values
Preconditioner introduced to stabilize system
Preconditioner reduces size of matrix
Entries of matrix can be estimated using conditional samples
Preconditioner amplifies singular values of matrix
Fidelity introduced to ensure large singular values of matrix
Fidelity captures previously studied positive results for learning HMMs

Finding the basis

Challenge is to find bases $\{B_t\}_{t \in [T]}$
Random sampling approach works for high fidelity distributions
Basis finding is the final issue to address for Theorem 1
Adaptation of Angluin’s L* algorithm used to find bases
Algorithm checks if predictions are accurate for polynomially many random sequences

Learning with conditional probabilities (theorem 1)

Theorem 1 states that Algorithm 1 can return an approximation of a probability distribution in poly(r, T, 1/ε, log(1/δ)) time
Notation is introduced to define histories and futures of length t
Probabilities associated to empty string are defined
Bases for the distribution are formally defined
Structural result is introduced to generate coefficients using |O|T matrices of size r x r
Equation (7) is introduced to represent the probability distribution
Solution $A_{o,t}$ is introduced for Equation (7)

Algorithm

Algorithm 2 requires the user to provide ε, δ, ∆* and r.
Algorithm 2 relies on the efficient representation provided by Proposition 3.2.
Algorithm 2 estimates the operators $A_{o,t-1}$ using conditional samples and linear regression.

Analysis

Access to exact conditional probability oracle is no longer available, only samples can be obtained
Robust bases must be defined to control estimation errors
Estimation algorithm and proof provided in Appendix B.6
Estimation error for operators $A_{o,t}$ can be characterized
Errors in induced distributions can be bounded using structured error

Discussion

Interactive access to hidden Markov models can circumvent computational barriers to efficient learning.
All low rank distributions with a certain fidelity property can be efficiently learned.
Fidelity captures assumptions considered in prior work on learning of HMMs.
Overcomplete setting of Sharan et al. admits bases of size S with fidelity 1/poly(S).
Reliance on fidelity parameter is the main limitation of results.
Open problem is to show that ignoring small directions preserves low rank property.
Algorithm 1 with access to an exact probability oracle runs in poly(r, T, 1/ε, log(1/δ)) time.

B.1 finding robust basis

Finding a robust basis can be defined by a covariance matrix
The norm of the basis is upper bounded
The two distributions are only a small factor apart
Approximations of projections and coefficients need to be learned
Error of the approximation is small
Process requires many conditional samples

B.3 perturbation analysis: error in coefficients

Learn approximations $\hat{A}_{o,t}$ of operators $A_{o,t}$ to compute probabilities
Let $A_{x_{1:t}}$ and $\hat{A}_{x_{1:t}}$ represent the products of the exact and estimated operators $A_{x_\tau,\tau-1}$ along the sequence $x_{1:t}$
$B_{x_{1:t}} \subset H_{t+1}$, a subset of histories of length t+1
ℓ1 norm of associated coefficients $\gamma_{x_{1:t}}$ and $\gamma^{\perp}_{x_{1:t}}$ grows moderately
For any observation sequence x, ℓ1 norm of coefficients can be bounded
Let $\alpha(x_t, B_{x_{1:t-1}})$ represent the matrix with columns given by $\alpha(x_t, b)$ for $b \in B_{x_{1:t-1}}$
Define $\alpha^{\perp}(x_t, B_{x_{1:t-1}})$, $\alpha(x_t, V^{\perp}_{t-1})$ and $\alpha^{\perp}(x_t, V^{\perp}_{t-1})$ analogously
Recursion from Proposition B.5 has solution
Algorithm 2 with access to conditional sampling oracle runs in poly time
Returns approximation $\widehat{\Pr}[\cdot]$ satisfying $\mathrm{TV}(\Pr, \widehat{\Pr}) \le \varepsilon$ with probability at least 1 − δ
Let $\{B_t\}_{t \in [T]}$ be basis of distribution Pr[•]
Define operators $A_{o,t}$ under basis $\{B_t\}_{t \in [T]}$
Covariance matrix associated to $B_t$ has an eigenvalue decomposition
Let $d^*_t$ be the restriction of distribution $d_t$ to the set $\mathrm{dom}^+(t)$
Let $\beta(x) \in \mathrm{span}(V_t)$ be coefficients associated to history x
β(x) are uniquely defined in $\mathrm{span}(V_t)$
β(x) sum to one, even though some entries could be negative
Existence of operators which can be used to construct coefficients

B.6 estimating covariance matrix in frobenius norm

Estimate objects needed for operator $A_{o,t}$
Lemma B.12 states that with probability 1 − δ, we can learn an estimate of $s(b^*, x)$
Define $s(b^*, x)$ as a sum where $b^* \in B_t$ and x is a history of length t
Define Pr[•|b] for $b \in B_t$
Sample $m = (1/2c^2 n^3 p^2) \log(2/\delta)$ random futures from Pr[•|x]
Estimate $\Pr[f \mid b^*]$ and $d(f)$
Define α-regular future
Perform test A(f, b) for each future f and basis history b
Estimate $q(bo)$ and $\Sigma_{B_t}$ for all $b \in B_t$, observations o and times $t \in [T]$
Parity with noise and all previously known positive results can be learned by algorithm
Define distribution induced by parity with noise
Proposition C.4 shows that distribution has rank ≤ 2T and fidelity $(1-2\alpha)^2/2$
Define overcomplete HMMs
Proposition C.7 shows that distribution has rank S and fidelity $\mathrm{poly}(S)^{-1}$

D general algorithm for finding approximate basis

Definition D.1 defines an approximate basis for a probability vector
Theorem 3 presents a main result on how to build an approximate basis for a regular low rank distribution
The regularity assumption on the distribution can be removed using ideas from Appendix B.6

D.1 learning coefficients

Check if there exists a β(x) such that a certain condition is met
Define an ℓ2 approximation error
Use relative probabilities for regular distributions to build a guess for approximate basis
Use poly(T, 1/ε, 1/α, log(1/δ)) many conditional samples to get estimates
Use a 2-smooth function and a standard uniform convergence argument
Use Hoeffding’s inequality
Use an Elliptical Potential Lemma
Choose C, H and n
Find a counterexample
Show that the overall error of the basis is small
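The Hoeffding step in the list above fixes how many conditional samples are needed per estimated probability. As a minimal sketch, assuming each sample yields an independent indicator in [0, 1] (for example, whether the sampled future equals a fixed f), the standard two-sided bound gives the sample size below. The function names are illustrative.

```python
import math
import random


def hoeffding_samples(eps, delta):
    """Smallest n with 2 * exp(-2 * n * eps**2) <= delta for [0, 1]-valued samples."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def empirical_mean(draw, n, rng):
    """Average of n independent draws from `draw(rng)`."""
    return sum(draw(rng) for _ in range(n)) / n


n = hoeffding_samples(eps=0.1, delta=0.05)
# Assumed toy target: an event of probability 0.3.
estimate = empirical_mean(lambda rng: 1.0 if rng.random() < 0.3 else 0.0,
                          n, random.Random(0))
```

The sample size grows as 1/ε² but only logarithmically in 1/δ, so a union bound over polynomially many estimated entries costs only a logarithmic factor.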

E helper propositions

Proposition E.1 (Hoeffding’s inequality) states that for independent random variables with a lower and upper bound, the sum of these random variables can be bounded by a certain value.
Davis-Kahan theorem is used in the work.
Algorithm 2 with access to a conditional sampling oracle runs in poly(r, T, |O|, 1/∆*, 1/ε, log(1/δ)) time and returns an efficiently represented approximation $\widehat{\Pr}[\cdot]$ satisfying $\mathrm{TV}(\Pr, \widehat{\Pr}) \le \varepsilon$ with probability at least 1 − δ.
Lemma 4.2 states that with probability 1 − δ, $\{S_t\}_{t \in [T]}$ form ∆-robust bases for Pr[•].
Lemma 4.4 states that we can learn approximations $\hat{A}_{o,t}$ for all observations o ∈ O and t ∈ [T] in poly(r, |O|, T, 1/ε, 1/∆, log(1/δ)) time such that with probability 1 − δ, for any unit vector v.
Lemma 4.5 states that the functions Pr[•] and $\widehat{\Pr}[\cdot]$ are close in TV distance: $\mathrm{TV}(\Pr, \widehat{\Pr}) \le 2|O|T\varepsilon$.
Definition B.14 states that Test A(f, b) passes if the empirical estimate $\widehat{\Pr}[f_\tau \mid b f_{1:\tau-1}] > 2\alpha$ for all τ ∈ [t] and fails otherwise.
Proposition B.17 states that $\Pr[F_b \mid b] \le O(|O|T\alpha)$.
Definition B.15 states that a future f is α-irregular for history $b \in B_t$ if there exists some τ ∈ [t] (τ can depend on b) such that $\Pr[f_\tau \mid b f_{1:\tau-1}] < \alpha$.
Proposition B.18 states that $\Pr[F_b \mid b] \le |O|T\alpha$.
Definition B.19 states that $\widehat{\Pr}[f \mid b]$ is set to 0 if test A(f, b) fails and is set to the estimate from Proposition B.16 if test A(f, b) passes.
Proposition B.20 states that $|\widehat{\Pr}[f \mid b] - \Pr[f \mid b]| \le \gamma \Pr[f \mid b]$.
Proposition D.4 states that there exists h ≤ H such that $\min_{\beta \in \mathbb{R}^h,\ \|\beta\|_2 \le C} L_{B_h, b_{h+1}}(\beta) \le \varepsilon$.
Proposition D.5 states that there exists h ≤ H such that $\min_{\beta \in \mathbb{R}^h,\ \|\beta\|_2 \le C} L_{B_h, b_{h+1}}(\beta) \le \varepsilon$.
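Test A(f, b) from Definition B.14 and the thresholded estimate from Definition B.19 can be sketched as follows. Several pieces are assumptions: `sample_next(prefix, rng)` is a hypothetical helper returning the next observation after a prefix (for instance, the first symbol of a future drawn from a conditional sampling oracle), and the product of one-step estimates stands in for the estimate of Proposition B.16, which this summary does not spell out.

```python
import random


def empirical_next(sample_next, prefix, symbol, n, rng):
    """Fraction of n sampled next observations that equal `symbol`."""
    hits = sum(sample_next(prefix, rng) == symbol for _ in range(n))
    return hits / n


def passes_test_a(sample_next, b, f, alpha, n, rng):
    """Return (passed, one-step estimates); pass iff every estimate exceeds 2 * alpha."""
    estimates = []
    prefix = tuple(b)
    for sym in f:
        est = empirical_next(sample_next, prefix, sym, n, rng)
        estimates.append(est)
        if est <= 2 * alpha:
            return False, estimates
        prefix += (sym,)
    return True, estimates


def thresholded_estimate(sample_next, b, f, alpha, n, rng):
    """0 if the test fails; otherwise the product of one-step estimates (assumed stand-in)."""
    passed, estimates = passes_test_a(sample_next, b, f, alpha, n, rng)
    if not passed:
        return 0.0
    out = 1.0
    for est in estimates:
        out *= est
    return out


def iid_next(prefix, rng):
    """Assumed toy source: symbol 0 with probability 0.95, regardless of the prefix."""
    return 0 if rng.random() < 0.95 else 1


rng = random.Random(0)
rare = thresholded_estimate(iid_next, (0,), (1, 0), alpha=0.1, n=2000, rng=rng)
common = thresholded_estimate(iid_next, (0,), (0, 0), alpha=0.1, n=2000, rng=rng)
```

Futures that contain a low-probability step are zeroed out rather than estimated, which trades a small amount of probability mass (bounded in Propositions B.17 and B.18) for multiplicative accuracy on the futures that pass.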
