Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Language modeling is a task in natural language processing.
Probability mass can “leak” onto infinite sequences in some cases.
This paper offers a measure-theoretic treatment of language modeling.
Many popular language model families are proven to be tight, so they do not leak in this sense.

Paper Content

Introduction

Language modeling is a core task in natural language processing
It involves estimating a distribution over the set of strings over a given alphabet
It has been used to estimate statistical properties of language and is essential for computational linguistics research
It is also central to a wide range of natural language processing applications
Language models are typically described as a distribution over the countably infinite set of all (finite) strings
Some classes of autoregressive language models have parameter settings in which the generative process terminates with probability < 1
Transformer-based language models are always tight and recurrent neural language models are always tight when they employ a bounded activation function

Motivating examples

An alphabet is a finite set of symbols, including an end-of-sequence symbol
A string is a finite sequence of symbols from the alphabet
A language model is a distribution over all strings
An autoregressive sequence model (ASM) is a conditional probability distribution over the next symbol given a prefix
An ASM can match the conditional probabilities of a known language model
In a non-tight example, the total probability of all finite strings is only ≈ 0.702, so the remaining mass leaks to infinite sequences
An ASM can be tight if the probability of EOS decays more slowly
An infinite coin toss model assigns probability (1/2)^∞ = 0 to each infinite sequence
Care must be taken to characterize tightness in NLP literature
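
The leak described above can be checked numerically. The following is a minimal sketch, assuming the EOS probability depends only on the position t; the two schedules `1/(e^t + 1)` and `1/(t + 2)` are illustrative choices made here, not necessarily the paper's exact examples, and `termination_probability` is a hypothetical helper name.

```python
import math

def termination_probability(p_eos, max_steps):
    """Probability that EOS is emitted within max_steps steps, assuming the
    EOS probability at step t (t = 0, 1, ...) depends only on t."""
    survive = 1.0  # probability that no EOS has been emitted so far
    for t in range(max_steps):
        survive *= 1.0 - p_eos(t)
    return 1.0 - survive

# Assumed fast-decaying schedule: mass leaks to infinite sequences.
fast = termination_probability(lambda t: 1.0 / (math.exp(t) + 1.0), 200)
# Assumed slowly decaying schedule: the survival product telescopes to 1/(T + 1).
slow = termination_probability(lambda t: 1.0 / (t + 2.0), 200)

print(round(fast, 3))  # approximately 0.702
print(slow)            # approaches 1 as max_steps grows
```

With the fast schedule the termination probability stalls near 0.702 however many steps are allowed, while with the slow schedule it keeps approaching 1.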

The language model measure

An ASM can lose probability mass to the set of infinite sequences.
The set of infinite sequences is uncountable.

Measure-theoretic background

Measure-theoretic probability assigns probability to subsets of an outcome space.
It is not always possible to assign probabilities to all subsets of an outcome space in a way that satisfies a set of reasonable desiderata.
Probability is only assigned to certain “nice” subsets of the outcome space, referred to as measurable subsets or events.
A probability measure is a function that assigns a probability to measurable subsets.
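A measurable space pairs the outcome space with a σ-algebra of events, which is closed under complements and countable unions, so countable set operations on events always yield events.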

Language models as a measure

A sequence model is a probability space over the uncountable set Σ* ∪ Σ∞.
A language model is a probability space over Σ*.
The set Σ∞ represents the event where the language model is nonterminating.

Pre-measure

It is often impossible to assign a probability measure to every single subset of Ω.
Assuming the axiom of choice and the continuum hypothesis, there is no probability measure P over ({H, T}∞, 𝒫({H, T}∞)), where 𝒫 denotes the power set, such that P({ω}) = 0 for all ω ∈ {H, T}∞.
Carathéodory’s extension theorem provides a way to construct a probability space for sequences.
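Definition 3.7: A probability pre-measure over (Ω, A), where A is an algebra of subsets of Ω, is a function P0 : A → [0, 1] with P0(Ω) = 1 that is countably additive on disjoint sets in A whose union is also in A.
A cylinder set of rank k is the set of infinite strings whose length-k prefix belongs to some set H ⊆ Σ^k.
Lemma 3.10: P0 is a pre-measure over C, the collection of cylinder sets.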
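
To make the pre-measure on cylinder sets concrete, here is a minimal sketch under stated assumptions. The toy alphabet, the conditional probabilities in `cond_prob`, and the convention that EOS is followed only by EOS are all hypothetical choices for illustration, not the paper's construction. The value assigned to a rank-k cylinder set is the sum, over the prefixes in H, of the product of next-symbol conditionals.

```python
import math
from itertools import product

ALPHABET = ("a", "b", "EOS")  # hypothetical toy alphabet with EOS included

def cond_prob(symbol, prefix):
    """Hypothetical conditionals p(symbol | prefix); they sum to 1 for every prefix."""
    if prefix and prefix[-1] == "EOS":
        # Assumed convention: once EOS is emitted, only EOS can follow.
        return 1.0 if symbol == "EOS" else 0.0
    return {"a": 0.5, "b": 0.3, "EOS": 0.2}[symbol]

def prefix_prob(prefix):
    """Product of the conditionals along a finite prefix (a tuple of symbols)."""
    p = 1.0
    for i, symbol in enumerate(prefix):
        p *= cond_prob(symbol, prefix[:i])
    return p

def cylinder_premeasure(H):
    """Value of the rank-k cylinder set generated by the set H of length-k prefixes."""
    return sum(prefix_prob(x) for x in H)

rank1 = {("a",)}                         # all sequences starting with "a"
rank2 = {("a", s) for s in ALPHABET}     # the same set, described at rank 2
whole = set(product(ALPHABET, repeat=3)) # the entire outcome space at rank 3

print(math.isclose(cylinder_premeasure(rank1), cylinder_premeasure(rank2)))
print(math.isclose(cylinder_premeasure(whole), 1.0))
```

The first check illustrates why the value is well defined: the same cylinder set described at two different ranks receives the same mass, because the conditionals sum to 1 at every step.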

Extension of pre-measure

Carathéodory’s Extension Theorem states that given an algebra A and a probability pre-measure P0, there exists a probability space (Ω, F, P) such that A ⊂ F and P|A = P0.
The σ-algebra F is minimal and unique, and the probability measure P is unique.

A string-valued random variable

Constructs a probability space over Σ* ∪ Σ∞
Uses a random variable to construct a σ-algebra over Σ* ∪ Σ∞
Defines a random variable X
Constructs a probability measure P
Shows that the EOS probability given a prefix x ∈ Σ* is the conditional probability that the generated string is exactly x, given that it begins with x

Characterizing tightness

A goal of the paper is to provide an exact characterization of tightness in autoregressive sequence models.
The event A_k is the event that an EOS symbol appears at position k in the string.
The probability of generating an infinite string is the probability that none of the events A_k occurs (Eq. (16) in the paper).
A sequence model is tight if P(X ∈ Σ∞) = 0.

A lower bound result

Tightness of a language model can be determined using Lemma 4.2.
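If the probability of EOS is lower bounded by a function f(t) that depends only on the length t of the prefix, not its content, and the sum of f(t) over t diverges, then the model is tight (Proposition 4.3).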

The Borel-Cantelli lemmata

Proposition 4.3 admits a converse statement
Borel-Cantelli lemmata relate probability measure of sets to a series
Lemma 4.5 (Borel-Cantelli II) states that if events are independent and their probabilities sum to infinity, then the probability that infinitely many of them occur is 1
Corollary 4.6 states that for a sequence of numbers p_n in [0, 1), the infinite product of (1 − p_n) equals 0 if and only if the series of the p_n diverges
Theorem 4.7 states that an autoregressive sequence model is tight if and only if the conditional EOS probability at step t (the probability of emitting EOS at step t given that no EOS was emitted earlier) equals 1 for some t, or these conditional probabilities sum to infinity over t
Proposition 4.3 is the main tool for determining tightness
Examples 2.4 and 2.5 show that Proposition 4.3 can be used to determine tightness
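
The connection between the divergence condition and tightness can be sketched with the chain rule. The derivation below is an illustration rather than the paper's proof. The symbol $\tilde p_{\mathrm{EOS}}(t)$ is notation assumed here for the probability of emitting EOS at step $t$ given no earlier EOS, and the sketch presumes $\tilde p_{\mathrm{EOS}}(t) < 1$ for all $t$ so that every conditional is defined.

```latex
\begin{align*}
P(X \in \Sigma^{\infty})
  &= P\Big(\bigcap_{t=1}^{\infty} A_t^{c}\Big)
   = \lim_{T \to \infty} \prod_{t=1}^{T}
     P\Big(A_t^{c} \,\Big|\, \bigcap_{s<t} A_s^{c}\Big) \\
  &= \lim_{T \to \infty} \prod_{t=1}^{T} \big(1 - \tilde p_{\mathrm{EOS}}(t)\big)
   \le \lim_{T \to \infty} \exp\Big(-\sum_{t=1}^{T} \tilde p_{\mathrm{EOS}}(t)\Big),
\end{align*}
% using 1 - p <= e^{-p} for p in [0, 1].
% Example: with \tilde p_EOS(t) = 1/(t+1), the product telescopes:
\[
  \prod_{t=1}^{T} \frac{t}{t+1} = \frac{1}{T+1} \xrightarrow[T \to \infty]{} 0 .
\]
```

So whenever the conditional EOS probabilities sum to infinity, the right-hand side vanishes and no mass is left on infinite strings.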

Analysis of common language models

Discuss tightness of autoregressive sequence models
Put foundations from previous sections into practice

Stochastic finite-state language models

N-gram language models are a type of stochastic finite-state language model
Tightness of language models can be characterized in this more general setting
Stochastic finite-state sequence models (SFSSMs) have a symbol-specific transition matrix, initial state probabilities, and termination probabilities
Accessible and co-accessible states are important for SFSSMs
Trimming an SFSSM removes non-useful states
Matrix inversion formula can be used to calculate total weight of accepting paths in a weighted graph
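
The matrix inversion formula can be illustrated on a tiny automaton. This is a minimal sketch, assuming the total mass of finite strings equals init^T (I − P)^(-1) term, where P is the transition matrix summed over symbols. The names `init`, `trans`, `term`, the helper functions, and the two toy automata are hypothetical. The formula is applied after trimming, because a state that can never terminate makes I − P singular.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def total_finite_mass(init, trans, term):
    """Compute init^T (I - trans)^(-1) term for a trimmed automaton."""
    n = len(trans)
    i_minus_p = [[(1.0 if i == j else 0.0) - trans[i][j] for j in range(n)]
                 for i in range(n)]
    v = solve(i_minus_p, term)  # v[q] = probability of terminating from state q
    return sum(s * x for s, x in zip(init, v))

# Toy automaton 1: state 0 loops w.p. 0.5, stops w.p. 0.2, and moves w.p. 0.3
# to a state that never terminates. Trimming removes that state.
leaky = total_finite_mass([1.0], [[0.5]], [0.2])

# Toy automaton 2: the second state now stops w.p. 0.5, so every state can terminate.
tight = total_finite_mass([1.0, 0.0], [[0.5, 0.3], [0.0, 0.5]], [0.2, 0.5])

print(leaky, tight)
```

In the first automaton the mass 0.3 that flows to the removed state is lost at every visit, so the finite strings receive less than total probability 1; in the second, all states are useful and the mass sums to 1.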

Transformer language models

Transformer language models are proven to be tight.
A basic fact in topology is used to prove tightness of various neural architectures, including the Transformer.
A function is defined to address the variable-length nature of modern deep NLP models.
Lemma 5.7 states that a compact set exists such that for any finite number of Transformer layers, the function is contained in the set.
Theorem 5.8 states that a sequence model as defined by any (finite-layer) transformer is tight.
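
One way to see why compactness helps: if the logits fed to the softmax stay in a bounded range, the EOS probability cannot approach 0. The sketch below is an illustration of that idea only, not the paper's proof. The logit bound `B`, the vocabulary size `V`, and the explicit bound exp(−2B)/V are assumptions made here.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def eos_lower_bound(bound, vocab_size):
    """If every logit lies in [-bound, bound], each softmax probability is at
    least exp(-bound) / (vocab_size * exp(bound)) = exp(-2 * bound) / vocab_size."""
    return math.exp(-2.0 * bound) / vocab_size

random.seed(0)
B, V = 3.0, 5  # hypothetical logit bound and vocabulary size; EOS is index 0
eps = eos_lower_bound(B, V)

# Empirically, the EOS probability never falls below the bound.
worst = min(
    softmax([random.uniform(-B, B) for _ in range(V)])[0]
    for _ in range(10_000)
)
print(eps <= worst)
```

A constant lower bound eps > 0 on the EOS probability at every step sums to infinity over t, which is the situation covered by the length-only lower bound result.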

Recurrent neural language models

RNNs have hidden states defined by a recurrence and conditional probabilities defined by an equation
Theorem 5.6 and the same strategy of proof as Theorem 5.8 can prove the tightness of RNN language models with bounded activations
Unbounded activations can lead to non-tightness when the probability of EOS decays too fast
Theorem 4.7 determines how fast the decay can be without losing tightness
Proposition 5.9 generalizes Theorem 4.7 and Lemma 3.2 of Welleck et al. (2020)
If the norm of the activations eventually grows sub-logarithmically, the RNN is still tight
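
A rough calculation shows why logarithmic growth is the natural threshold. This is a sketch under assumptions made here, not the statement of Proposition 5.9: suppose every logit at step $t$ has absolute value at most $c \log t$ for a hypothetical constant $c > 0$, and write $|\bar\Sigma|$ for the number of symbols including EOS.

```latex
\[
  p(\mathrm{EOS} \mid x)
  \;\ge\; \frac{e^{-c \log t}}{|\bar\Sigma| \, e^{c \log t}}
  \;=\; \frac{t^{-2c}}{|\bar\Sigma|}
  \qquad \text{for every prefix } x \text{ of length } t,
\]
\[
  \sum_{t=1}^{\infty} \frac{t^{-2c}}{|\bar\Sigma|} = \infty
  \quad \Longleftrightarrow \quad 2c \le 1 .
\]
```

Under these assumptions the length-only lower bound diverges when $c \le 1/2$, so the model is tight; faster growth of the logits lets the EOS probability decay quickly enough for the series to converge.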

Conclusion

Presents a measure-theoretic treatment of language modeling
Goal is to determine when sampling from an autoregressive sequence model is guaranteed to terminate
Defines components of language modeling in measure-theoretic terminology
Computes termination probabilities of a sequence model
Understands portion of probability mass allocated to infinite-length strings
Formalizes a definition of sequence modeling where probability of producing an infinite-length sequence is non-zero
Studies Transformer language model and its ability to place probability mass on infinite-length strings
Characterizes tightness of language models
Studies limitations of common neural network architectures
Casts language models as stochastic processes
Calculates P(Σ*)
Invokes Carathéodory’s Extension Theorem
Constructs a set that is not measurable with respect to the cylinder σ-algebra
Constructs σ(C) by performing the construction for every countable ordinal α
