Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Language modeling is a task in natural language processing.
- Probability mass can “leak” onto infinite sequences in some cases.
- This paper offers a measure-theoretic treatment of language modeling.
- Popular language model families are tight and will not leak.
Paper Content
Introduction
- Language modeling is a core task in natural language processing
- It involves estimating a distribution over the set of strings over a given alphabet
- It has been used to estimate statistical properties of language and is essential for computational linguistics research
- It is also central to a wide range of natural language processing applications
- Language models are typically described as a distribution over the countably infinite set of all (finite) strings
- Some classes of autoregressive language models have parameter settings in which the generative process terminates with probability < 1
- Transformer-based language models are always tight and recurrent neural language models are always tight when they employ a bounded activation function
Motivating examples
- An alphabet is a finite set of symbols, including an end-of-sequence symbol
- A string is a finite sequence of symbols from the alphabet
- A language model is a distribution over all strings
- An autoregressive sequence model is a conditional probability distribution
- An ASM can match the conditional probabilities of a known language model
- The probability of all strings is ≈ 0.702
- An ASM can be tight if the probability of EOS decays more slowly
- An infinite coin toss model assigns probability (1/2)∞ to each infinite sequence
- Care must be taken to characterize tightness in NLP literature
The language model measure
- An ASM can lose probability mass to the set of infinite sequences.
- The set of infinite sequences is uncountable.
Measure-theoretic background
- Measure-theoretic probability assigns probability to subsets of an outcome space.
- It is not possible to assign probabilities to all outcome spaces in a way that satisfies a set of reasonable desiderata.
- Probability is only assigned to certain “nice” subsets of the outcome space, referred to as measurable subsets or events.
- A probability measure is a function that assigns a probability to measurable subsets.
- A measurable space guarantees that countable operations on sets are always valid.
Language models as a measure
- A sequence model is a probability space over the uncountable set Σ * ∪ Σ ∞.
- A language model is a probability space over Σ *.
- The set Σ ∞ represents the event where the language model is nonterminating.
Pre-measure
- It is often impossible to assign a probability measure to every single subset of Ω.
- There is no probability measure over ({H, T} ∞ , P({H, T} ∞ )) such that P({ω}) = 0 for all ω ∈ {H, T} ∞.
- Carathéodory’s extension theorem provides a way to construct a probability space for sequences.
- Definition 3.7: A probability pre-measure over (Ω, A) is a function.
- Cylinder sets of rank k are the set of infinite strings that share their k-prefix with some string x ∈ H ⊆ Σ k.
- Lemma 3.10: P 0 is a pre-measure over C.
Extension of pre-measure
- Carathéodory’s Extension Theorem states that given an algebra A and a probability pre-measure P0, there exists a probability space (Ω, F, P) such that A ⊂ F and P| A = P 0.
- The σ-algebra F is minimal and unique, and the probability measure P is unique.
A string-valued random variable
- Constructed a probability space over Σ* ∪ Σ∞
- Used a random variable to construct a σ-algebra over Σ* ∪ Σ∞
- Defined a random variable X
- Constructed a probability measure P
- Showed that EOS probability is the conditional probability of generating a string with a prefix x ∈ Σ*
Characterizing tightness
- A goal of the paper is to provide an exact characterization of tightness in autoregressive sequence models.
- The event A k is the event that an EOS symbol appears at position k in the string.
- The probability of generating an infinite string is expressed using Eq. (16).
- A sequence model is tight if P (X ∈ Σ ∞ ) = 0.
A lower bound result
- Tightness of a language model can be determined using Lemma 4.2.
- Probability of EOS is lower bounded by a function that depends on length, not content.
The borel-cantelli lemmata
- Proposition 4.3 admits a converse statement
- Borel-Cantelli lemmata relate probability measure of sets to a series
- Lemma 4.5 (Borel-Cantelli II) states that if events are independent, then the probability of them occurring infinitely often is 0
- Corollary 4.6 states that if a sequence of events has a probability between 0 and 1, then the series of probabilities will diverge
- Theorem 4.7 states that an autoregressive sequence model is tight if and only if the EOS probability is 1 for some t or the summands of the EOS probability are well-defined
- Proposition 4.3 is the main tool for determining tightness
- Examples 2.4 and 2.5 show that Proposition 4.3 can be used to determine tightness
Analysis of common language models
- Discuss tightness of autoregressive sequence models
- Put foundations from previous sections into practice
Stochastic finite-state language models
- N-gram language models are a type of stochastic finite-state language models
- Tightness of language models can be characterized in the more general setting
- SFSSMs have a symbol-specific transition matrix, initial state probabilities, and termination probabilities
- Accessible and co-accessible states are important for SFSSMs
- Trimming an SFSSM removes non-useful states
- Matrix inversion formula can be used to calculate total weight of accepting paths in a weighted graph
Transformer language models
- Transformer language models are proven to be tight.
- A basic fact in topology is used to prove tightness of various neural architectures, including the Transformer.
- A function is defined to address the variable-length nature of modern deep NLP models.
- Lemma 5.7 states that a compact set exists such that for any finite number of Transformer layers, the function is contained in the set.
- Theorem 5.8 states that a sequence model as defined by any (finite-layer) transformer is tight.
Recurrent neural language models
- RNNs have hidden states defined by a recurrence and conditional probabilities defined by an equation
- Theorem 5.6 and the same strategy of proof as Theorem 5.8 can prove the tightness of RNN language models with bounded activations
- Unbounded activations can lead to non-tightness when the probability of EOS decays too fast
- Theorem 4.7 determines how fast the decay can be without losing tightness
- Proposition 5.9 generalizes Theorem 4.7 and Lemma 3.2 of Welleck et al. (2020)
- If the norm of the activations eventually grows sub-logarithmically, the RNN is still tight
Conclusion
- Presents a measure-theoretic treatment of language modeling
- Goal is to determine when sampling from an autoregressive sequence model is guaranteed to terminate
- Defines components of language modeling in measure-theoretic terminology
- Computes termination probabilities of a sequence model
- Understands portion of probability mass allocated to infinite-length strings
- Formalizes a definition of sequence modeling where probability of producing an infinite-length sequence is non-zero
- Studies Transformer language model and its ability to place probability mass on infinite-length strings
- Characterizes tightness of language models
- Studies limitations of common neural network architectures
- Casts language models as stochastic processes
- Calculates P(Σ*)
- Invokes Caratheory’s Extension Theorem
- Constructs a non-measurable set in the cylinder σ-algebra
- Constructs σ(C) by performing construction for every countable cardinal α