

  • Language modeling is a task in natural language processing.
  • Probability mass can “leak” onto infinite sequences in some cases.
  • This paper offers a measure-theoretic treatment of language modeling.
  • Many popular language model families are provably tight, meaning they do not leak probability mass in this way.

Paper Content


  • Language modeling is a core task in natural language processing
  • It involves estimating a distribution over the set of strings over a given alphabet
  • It has been used to estimate statistical properties of language and is essential for computational linguistics research
  • It is also central to a wide range of natural language processing applications
  • Language models are typically described as a distribution over the countably infinite set of all (finite) strings
  • Some classes of autoregressive language models have parameter settings in which the generative process terminates with probability < 1
  • Transformer-based language models are always tight and recurrent neural language models are always tight when they employ a bounded activation function

Motivating examples

  • An alphabet is a finite set of symbols, augmented with a distinguished end-of-sequence (EOS) symbol
  • A string is a finite sequence of symbols from the alphabet
  • A language model is a distribution over all strings
  • An autoregressive sequence model (ASM) is a conditional probability distribution over the next symbol (or EOS) given the preceding string
  • An ASM can match the conditional probabilities of a known language model
  • In the non-tight example, the total probability assigned to all finite strings is ≈ 0.702, which is less than 1
  • An ASM can be tight if the probability of EOS decays more slowly
  • An infinite coin toss model assigns probability (1/2)^∞ = 0 to each infinite sequence
  • Care must be taken to characterize tightness in NLP literature
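
The effect of the EOS decay rate can be checked numerically. The sketch below is illustrative only: it assumes the EOS probability depends solely on the prefix length t, and the two decay schedules (`fast_decay` and `slow_decay`) are hypothetical choices rather than the paper's exact examples. The fast schedule happens to give a finite-string mass close to the 0.702 figure quoted above.

```python
import math

def finite_string_mass(eos_prob, max_len=10_000):
    """Mass on strings that terminate within max_len steps.

    eos_prob(t) is the probability of emitting EOS after a prefix of length t.
    """
    survive = 1.0  # probability that no EOS has been emitted so far
    mass = 0.0
    for t in range(max_len):
        p = eos_prob(t)
        mass += survive * p      # terminate exactly at step t
        survive *= 1.0 - p       # continue past step t
    return mass

def fast_decay(t):
    # Equals 1 / (e^t + 1), written with exp(-t) to avoid overflow for large t.
    e = math.exp(-t)
    return e / (1.0 + e)

def slow_decay(t):
    # Decays like 1/t, so the series of EOS probabilities diverges.
    return 1.0 / (t + 2.0)

if __name__ == "__main__":
    print(finite_string_mass(fast_decay))  # about 0.702: mass leaks to infinite sequences
    print(finite_string_mass(slow_decay))  # approaches 1 as max_len grows
```

With the slow schedule the surviving mass after T steps is exactly 1/(T + 1), so the truncated total tends to 1, whereas the fast schedule converges to a value strictly below 1.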

The language model measure

  • An ASM can lose probability mass to the set of infinite sequences.
  • The set of infinite sequences is uncountable.

Measure-theoretic background

  • Measure-theoretic probability assigns probability to subsets of an outcome space.
  • For some outcome spaces, it is not possible to assign a probability to every subset in a way that satisfies a set of reasonable desiderata.
  • Probability is only assigned to certain “nice” subsets of the outcome space, referred to as measurable subsets or events.
  • A probability measure is a function that assigns a probability to measurable subsets.
  • In a measurable space, the collection of events is a σ-algebra: it is closed under complements and countable unions, so countable set operations on events always yield events.

Language models as a measure

  • A sequence model is a probability space over the uncountable set Σ* ∪ Σ^∞.
  • A language model is a probability space over Σ*.
  • The set Σ^∞ represents the event where the language model is nonterminating.


  • It is often impossible to assign a probability measure to every single subset of Ω.
  • There is no probability measure over ({H, T}^∞, 𝒫({H, T}^∞)), where 𝒫 denotes the power set, such that P({ω}) = 0 for all ω ∈ {H, T}^∞.
  • Carathéodory’s extension theorem provides a way to construct a probability space for sequences.
  • Definition 3.7: A probability pre-measure over (Ω, A) is a function P0 : A → [0, 1] with P0(Ω) = 1 that is countably additive on disjoint sets whose union lies in A.
  • A cylinder set of rank k is the set of infinite strings that share their k-prefix with some string x ∈ H ⊆ Σ^k.
  • Lemma 3.10: P0 is a pre-measure over C.
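
To make the cylinder-set construction concrete, here is a minimal sketch of a pre-measure on cylinder sets. It assumes the pre-measure of a rank-k cylinder is the sum, over its generating prefixes, of the product of conditional probabilities along each prefix. The names `prefix_prob`, `cylinder_premeasure`, and `fair_coin` are hypothetical, and the fair-coin model stands in for a general sequence model.

```python
from fractions import Fraction

def prefix_prob(x, cond_prob):
    """Probability that a sequence starts with prefix x (chain rule)."""
    p = Fraction(1)
    for t, sym in enumerate(x):
        p *= cond_prob(x[:t], sym)
    return p

def cylinder_premeasure(H, cond_prob):
    """Pre-measure of the cylinder set generated by the equal-length prefixes in H."""
    H = set(H)
    if len({len(x) for x in H}) != 1:
        raise ValueError("H must be a non-empty set of prefixes of one length k")
    return sum(prefix_prob(x, cond_prob) for x in H)

def fair_coin(prefix, sym):
    # Hypothetical model: every symbol has probability 1/2 regardless of history.
    return Fraction(1, 2)

if __name__ == "__main__":
    # A rank-k cylinder with a single generating prefix has mass 2^-k,
    # which shrinks to 0 as k grows, matching P({ω}) = 0 for single sequences.
    for k in (1, 2, 10):
        print(k, cylinder_premeasure({"H" * k}, fair_coin))
```

The same cylinder can be written at different ranks (the set generated by "H" equals the set generated by "HH" and "HT"), and the function assigns both descriptions the same value, which is what makes the pre-measure well defined.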

Extension of pre-measure

  • Carathéodory’s Extension Theorem states that given an algebra A and a probability pre-measure P0, there exists a probability space (Ω, F, P) such that A ⊂ F and P|A = P0.
  • The σ-algebra F is minimal and unique, and the probability measure P is unique.

A string-valued random variable

  • Constructed a probability space over Σ* ∪ Σ∞
  • Used a random variable to construct a σ-algebra over Σ* ∪ Σ∞
  • Defined a random variable X
  • Constructed a probability measure P
  • Showed that the EOS probability is the conditional probability that the string ends exactly at x, given that it has prefix x ∈ Σ*

Characterizing tightness

  • A goal of the paper is to provide an exact characterization of tightness in autoregressive sequence models.
  • The event A_k is the event that an EOS symbol appears at position k in the string.
  • The probability of generating an infinite string is expressed using Eq. (16).
  • A sequence model is tight if P(X ∈ Σ^∞) = 0.

A lower bound result

  • Tightness of a language model can be determined using Lemma 4.2.
  • Probability of EOS is lower bounded by a function that depends on length, not content.

The Borel-Cantelli lemmata

  • Proposition 4.3 admits a converse statement
  • Borel-Cantelli lemmata relate probability measure of sets to a series
  • Lemma 4.5 (Borel-Cantelli II) states that if events are independent and the series of their probabilities diverges, then the probability of them occurring infinitely often is 1
  • Corollary 4.6 states that for a sequence of probabilities p_n ∈ [0, 1), the product ∏(1 − p_n) equals 0 if and only if the series ∑ p_n diverges
  • Theorem 4.7 states that an autoregressive sequence model is tight if and only if the EOS probability at step t, conditioned on no earlier EOS, is 1 for some t or the series of these conditional EOS probabilities diverges
  • Proposition 4.3 is the main tool for determining tightness
  • Examples 2.4 and 2.5 show that Proposition 4.3 can be used to determine tightness
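
The link between a divergent series and tightness can be written out as a short derivation. This is a sketch of the "if" direction only. It assumes that \tilde p_{\mathrm{EOS}}(t) denotes the probability of EOS at step t given that no EOS appeared earlier (the symbol is introduced here for illustration), and that every conditioning event has positive probability.

```latex
\begin{align*}
P(X \in \Sigma^\infty)
  &= P\Big(\bigcap_{t=1}^{\infty} A_t^{c}\Big)
   = \prod_{t=1}^{\infty} \big(1 - \tilde p_{\mathrm{EOS}}(t)\big),
  \qquad
  \tilde p_{\mathrm{EOS}}(t) = P\big(A_t \mid A_1^{c} \cap \dots \cap A_{t-1}^{c}\big), \\
\prod_{t=1}^{T} \big(1 - \tilde p_{\mathrm{EOS}}(t)\big)
  &\le \exp\Big(-\sum_{t=1}^{T} \tilde p_{\mathrm{EOS}}(t)\Big)
  \qquad \text{since } 1 - p \le e^{-p} \text{ for } p \in [0, 1].
\end{align*}
```

If the series diverges, the right-hand side tends to 0 as T grows, so P(X ∈ Σ^∞) = 0 and the model is tight.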

Analysis of common language models

  • Discuss tightness of autoregressive sequence models
  • Put foundations from previous sections into practice

Stochastic finite-state language models

  • N-gram language models are a type of stochastic finite-state language model
  • Tightness of language models can be characterized in the more general setting
  • Stochastic finite-state sequence models (SFSSMs) have a transition matrix for each symbol, initial state probabilities, and termination probabilities
  • Accessible and co-accessible states are important for SFSSMs
  • Trimming an SFSSM removes non-useful states
  • A matrix inversion formula can be used to calculate the total weight of accepting paths in a weighted graph
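
A small numerical sketch of the trimming and matrix-inversion steps follows. It assumes the total finite-string mass is s^T (I − P)^(-1) t, where s holds the initial state probabilities, t the termination probabilities, and P is the sum of the symbol-specific transition matrices. The function names and the two-state models are hypothetical. The code drops only the states that cannot reach termination, since those are what make I − P singular.

```python
def solve(A, b):
    """Solve A y = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            raise ValueError("singular matrix")
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def coaccessible(P, t):
    """States from which a state with positive termination probability is reachable."""
    n = len(t)
    keep = {i for i in range(n) if t[i] > 0}
    changed = True
    while changed:
        changed = False
        for i in range(n):
            if i not in keep and any(P[i][j] > 0 and j in keep for j in range(n)):
                keep.add(i)
                changed = True
    return sorted(keep)

def total_finite_mass(s, P, t):
    """Total weight of accepting paths: s^T (I - P)^(-1) t on the trimmed model."""
    keep = coaccessible(P, t)
    m = len(keep)
    A = [[(1.0 if a == b else 0.0) - P[keep[a]][keep[b]] for b in range(m)]
         for a in range(m)]
    y = solve(A, [t[i] for i in keep])
    return sum(s[i] * yi for i, yi in zip(keep, y))

if __name__ == "__main__":
    s = [1.0, 0.0]
    # State 1 loops forever and never terminates, so mass leaks through it.
    leaky = total_finite_mass(s, [[0.7, 0.2], [0.0, 1.0]], [0.1, 0.0])
    # Every state can terminate, so all mass lands on finite strings.
    tight = total_finite_mass(s, [[0.7, 0.2], [0.0, 0.9]], [0.1, 0.1])
    print(leaky, tight)
```

In the first model the absorbing state is removed before inversion and the result is 1/3. In the second model the result is 1, so that model is tight.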

Transformer language models

  • Transformer language models are proven to be tight.
  • A basic fact in topology is used to prove tightness of various neural architectures, including the Transformer.
  • A function is defined to address the variable-length nature of modern deep NLP models.
  • Lemma 5.7 states that, for a function defined by any finite number of Transformer layers, there exists a compact set that contains the function's outputs regardless of input length.
  • Theorem 5.8 states that a sequence model as defined by any (finite-layer) transformer is tight.
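
The compactness argument matters because bounded representations give bounded logits, and bounded logits keep the softmax probability of EOS away from zero. The sketch below illustrates only that last step. It assumes all logits lie in [-B, B]; the bound B, the vocabulary size, and the function names are hypothetical values chosen for illustration, not quantities from the paper.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def eos_lower_bound(vocab_size, bound):
    """Lower bound on any softmax probability when all logits lie in [-bound, bound].

    The worst case puts the EOS logit at -bound and every other logit at +bound.
    """
    return 1.0 / (1.0 + (vocab_size - 1) * math.exp(2.0 * bound))

def min_observed_eos_prob(vocab_size, bound, trials=1000, seed=0):
    """Smallest EOS probability seen over random logit vectors within the bound."""
    rng = random.Random(seed)
    worst = 1.0
    for _ in range(trials):
        logits = [rng.uniform(-bound, bound) for _ in range(vocab_size)]
        worst = min(worst, softmax(logits)[0])  # index 0 plays the role of EOS
    return worst

if __name__ == "__main__":
    n, B = 50, 3.0
    print(eos_lower_bound(n, B), min_observed_eos_prob(n, B))
```

Because the bound is a positive constant that does not depend on the prefix or its length, the series of EOS lower bounds diverges, which is the kind of condition the lower-bound result uses to conclude tightness.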

Recurrent neural language models

  • RNNs have hidden states defined by a recurrence and conditional probabilities defined by an equation
  • Theorem 5.6, combined with the proof strategy of Theorem 5.8, establishes the tightness of RNN language models with bounded activations
  • Unbounded activations can lead to non-tightness when the probability of EOS decays too fast
  • Theorem 4.7 determines how fast the decay can be without losing tightness
  • Proposition 5.9 generalizes Theorem 4.7 and Lemma 3.2 of Welleck et al. (2020)
  • If the norm of the activations eventually grows sub-logarithmically, the RNN is still tight


  • Presents a measure-theoretic treatment of language modeling
  • Goal is to determine when sampling from an autoregressive sequence model is guaranteed to terminate
  • Defines components of language modeling in measure-theoretic terminology
  • Computes termination probabilities of a sequence model
  • Understands portion of probability mass allocated to infinite-length strings
  • Formalizes a definition of sequence modeling where probability of producing an infinite-length sequence is non-zero
  • Studies Transformer language model and its ability to place probability mass on infinite-length strings
  • Characterizes tightness of language models
  • Studies limitations of common neural network architectures
  • Casts language models as stochastic processes
  • Calculates P(Σ*)
  • Invokes Carathéodory’s Extension Theorem
  • Constructs a set that is non-measurable with respect to the cylinder σ-algebra
  • Constructs σ(C) by performing the construction for every countable ordinal α