
Abstract

  • Language modeling studies probability distributions over strings of text
  • Used in text generation, speech recognition, machine translation
  • Conventional language models (CLMs) predict probability of linguistic sequences in a causal manner
  • Pre-trained language models (PLMs) cover broader concepts and can be used for both causal sequential modeling and fine-tuning
  • PLMs have their own training paradigms and serve as foundation models in modern NLP systems
  • Overview paper provides introduction to CLMs and PLMs from five aspects
  • Discusses relationship between CLMs and PLMs and future directions of language modeling in pre-trained era

Paper Content


  • Language modeling studies probability distributions over sequences of words
  • Used in many computational linguistic problems
  • Two major approaches: statistical and data-driven
  • Conventional language models (CLMs) predict probability of linguistic sequences in a causal manner
  • Data-driven approach uses neural-network models, leading to pre-trained language models (PLMs)
  • Five perspectives: linguistic units, structures, training methods, evaluation methods, applications
  • CLMs attempt to predict next linguistic unit in a text sequence given its preceding contexts
  • Linguistic units in CLMs can be characters, words, phrases, etc.
  • PLMs do not necessarily follow CLMs in predicting linguistic units
  • PLMs instead provide representations such as word embeddings and sentence embeddings
  • Overview paper serves two objectives: introducing basic concepts and future research directions

Types of language models

  • Auto-regressive models predict the next linguistic units given the preceding context
  • Chain rule is used to compute the probability of a text sequence
  • Goal of CLMs is to estimate the probability of text sequences in a causal manner
  • This section introduces more LMs that go beyond CLMs
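
The chain-rule factorization mentioned in the list above can be written as follows. The symbols $u_1, \dots, u_T$ for the linguistic units of a sequence are notation assumed here for illustration, not taken from the paper.

```latex
P(u_1, u_2, \dots, u_T) \;=\; \prod_{t=1}^{T} P(u_t \mid u_1, \dots, u_{t-1})
```

Each factor is the causal prediction a CLM makes: the next unit given everything before it.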

Structural LM

  • Structural LMs predict linguistic units based on pre-defined linguistic structures
  • Structural LMs use the linguistic structure to bring relevant context closer to the linguistic unit to be predicted
  • Structural LMs have been successfully applied to sentence completion and speech recognition

Bidirectional LM

  • Bidirectional LMs use contexts from both directions to make predictions.
  • Masked LM is a representative bidirectional LM that masks out linguistic units in a text sequence and predicts them.
  • The goal of bidirectional LMs is to learn the inner dependency between linguistic units in an unsupervised manner.
  • Pre-trained bidirectional LMs are used as the backbone for further fine-tuning in various downstream applications.

Permutation LM

  • CLMs and masked LMs have advantages and disadvantages.
  • Permutation LM is a recently proposed LM that combines the advantages of CLMs and masked LMs.

Linguistic units

  • Tokenization is the process of partitioning text sequences into small linguistic units.


Characters

  • Language models (LMs) can be based on characters.
  • Characters give a smaller vocabulary than other linguistic units.
  • It is difficult to predict the next character, usually requiring a long historical context.
  • Character-level LMs have poorer performance than word-level LMs.
  • Input and output lengths need to be longer to accurately model character distribution.
  • Computational costs are higher for auto-regressive decoding.
  • Combining words and characters can help to alleviate the issue.

Words and subwords

  • Natural tokenization for English is to decompose text into words by white spaces
  • Many language models (LMs) apply word tokenization
  • Issues with naive word tokenization include Out-Of-Vocabulary (OOV) problem
  • OOV problem occurs when words are not stored in pre-defined vocabulary
  • Subword segmentation algorithms are developed to boost LM performance
  • Two subword segmentation approaches: statistics-based and linguistics-based
  • Statistics-based subword tokenizers generate subword vocabulary based on corpus
  • Linguistics-based subword tokenizers exploit linguistic knowledge and decompose words into smaller grammatical units
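
To make the statistics-based idea concrete, here is a minimal sketch of byte-pair encoding (BPE), one widely used scheme of this kind. It learns merges from word frequencies alone. The toy word counts and the function names are assumptions for illustration, and production tokenizers add details (end-of-word markers, byte fallback) that are omitted here.

```python
from collections import Counter


def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(pair, vocab):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Greedily merge the most frequent adjacent pair, `num_merges` times."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties go to the first pair seen
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

Frequent fragments such as "est" become single vocabulary entries, so a rare word can still be represented as a sequence of known subwords instead of falling out of vocabulary.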


Phrases

  • Semantic meaning of a single word can be ambiguous, since it depends on the context and on the words it combines with.
  • Word-level language models ignore the relationship between words.
  • Phrase-level language models replace common and cohesive word sequences with phrases.
  • Phrase-level language models are suitable for some applications, such as automatic speech recognition.


Sentences

  • Language models built on smaller linguistic units use conditional probabilities to estimate the probability of text sequences.
  • Sentence-level language models generate sentence features and model the sentence probability directly.
  • This is more convenient than using the chain rule and easier to encode inter-sentence information.

Model structures

  • N-gram, maximum entropy, and neural network models are used to model the probability distributions of text sequences
  • PLMs typically use continuous representations in probability modeling built upon RNNs or transformers

N-gram models

  • N-grams consist of N consecutive linguistic units from a text sequence
  • N-gram LMs assume the probability of a word depends on its preceding N-1 linguistic units
  • N-gram LMs calculate the conditional probability by counting occurrences of N-grams
  • N-gram LMs simplify the word probability calculation based on previous N-1 words
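  • N-gram LMs encounter two sparsity issues: an N-gram unseen in training gets zero probability, and an unseen (N-1)-gram context leaves the conditional probability undefined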
  • Smoothing techniques are used to alleviate the sparsity issues
  • Additive smoothing adds a small value to the count for every N-gram
  • Advanced smoothing techniques such as back-off and interpolation are used for better probability estimation
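
The counting and additive-smoothing steps above can be sketched for the bigram case (N = 2). This is an illustrative toy under stated assumptions: whitespace tokenization, the `<s>` and `</s>` boundary symbols, the two-sentence corpus, and `delta = 1.0` are all choices made for this example.

```python
from collections import Counter


def train_bigram_counts(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized sentences."""
    context_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])            # every unit that can be predicted
        context_counts.update(tokens[:-1])  # every unit that serves as a context
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, vocab


def bigram_prob(word, prev, context_counts, bigram_counts, vocab, delta=1.0):
    """Additively smoothed estimate of P(word | prev)."""
    numerator = bigram_counts[(prev, word)] + delta
    denominator = context_counts[prev] + delta * len(vocab)
    return numerator / denominator


corpus = ["the cat sat", "the dog sat"]
ctx, big, vocab = train_bigram_counts(corpus)
p_seen = bigram_prob("cat", "the", ctx, big, vocab)    # (1 + 1) / (2 + 5)
p_unseen = bigram_prob("sat", "the", ctx, big, vocab)  # (0 + 1) / (2 + 5)
print(p_seen, p_unseen)
```

The unseen bigram "the sat" receives a small non-zero probability, and the smoothed probabilities over the vocabulary still sum to one for a given context.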

Maximum entropy models

  • Maximum Entropy models estimate the probability of text sequences using feature functions.
  • Features are usually generated from N-grams and the Generalized Iterative Scaling algorithm is used to derive the parameter vector.
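
A common way to write the maximum entropy model is given below. The notation is assumed here for illustration: $h$ is the history, $w$ the candidate word, $f(w, h)$ the feature vector, and $a$ the parameter vector.

```latex
P(w \mid h) \;=\; \frac{\exp\!\big(a^{\top} f(w, h)\big)}{\sum_{w'} \exp\!\big(a^{\top} f(w', h)\big)}
```

The denominator sums over all candidate words $w'$ so that the probabilities for a given history add up to one.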

Feed-forward neural network (FNN) models

  • Discrete nature of the N-gram model is a performance bottleneck, even with smoothing
  • Neural LMs use continuous embedding space to overcome data sparsity
  • Feed-forward Neural Network (FNN) LMs use historical contexts as input and output probability distribution of words
  • FNN LM uses a fixed window to collect fixed-length contexts
  • FNN LM can handle unseen N-grams and is storage-efficient

Recurrent neural network (RNN) models

  • A fixed-length historical context is insufficient to predict the next word.
  • RNN LMs can exploit arbitrarily long histories to predict the next word.
  • RNN LMs use a non-linear activation function and weight matrices to compute the hidden state.
  • RNN LMs can theoretically use all preceding history to predict the next word, but in practice the gradient vanishing problem hampers the learning of the model.
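
The hidden-state recurrence described above can be sketched in a simple Elman-style form. The symbols are assumptions for illustration: $x_t$ is the input embedding at step $t$, $h_t$ the hidden state, $W$, $U$, $V$ the weight matrices, and $f$ a non-linear activation such as $\tanh$.

```latex
h_t = f\big(W x_t + U h_{t-1}\big), \qquad
P(w_{t+1} \mid w_1, \dots, w_t) = \operatorname{softmax}\big(V h_t\big)
```

Because $h_t$ depends on $h_{t-1}$, the state can in principle carry information from the entire history. Gradients propagated back through many such steps tend to shrink, which is the vanishing-gradient problem noted above.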


Transformers

  • Transformer architecture uses self-attention to capture long-term dependencies and important sequence components.
  • Transformer is easy to parallelize in both training and inference.
  • Transformer consists of an encoder and a decoder.
  • Different transformer models are suitable for different tasks.

Pre-trained language models

  • PLMs are widely used in NLP
  • Deep learning has changed the way PLMs are trained and used
  • PLMs are pre-trained on large corpora to learn universal representations
  • PLMs are fine-tuned for downstream tasks to transfer knowledge
  • Several survey papers on PLMs exist


Pre-training

  • Pre-training tasks are used to train language models
  • Common pre-training task is “missing word prediction”
  • Other pre-training tasks include next-sentence prediction and masked language model
  • Pre-training tasks can help language models learn better linguistic knowledge, such as sentence relationships
  • Other pre-training objectives include token deletion, text infilling, sentence permutation, and document rotation

Fine-tuning and prompt-tuning

  • PLMs learn language knowledge in pre-training stage
  • Fine-tuning adapts model for downstream tasks
  • Model parameters are updated in fine-tuning stage
  • Prompt-tuning mimics pre-training objectives in fine-tuning/inference stage

Model evaluation

  • Intrinsic evaluation examines internal properties of an LM
  • Extrinsic evaluation studies performance in downstream tasks

Intrinsic evaluation

  • Auto-regressive LM estimates probability of text sequences
  • Perplexity is a common evaluation metric for this purpose
  • Good LM should maximize text set probability, which is equivalent to minimizing perplexity
  • Bidirectional LMs cannot apply the chain rule directly, so sentence scores are computed in a different way
  • Pseudo-log-likelihood (PLL) and pseudo-perplexity (PPPL) are intrinsic metrics that measure naturalness of sentences for bidirectional LMs
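
The link between sequence probability and perplexity can be illustrated with a small helper. It assumes the LM has already produced one conditional probability per token. The function name and the example probabilities are made up for this sketch.

```python
import math


def perplexity(token_probs):
    """Perplexity from the per-token conditional probabilities of a sequence.

    Computed as exp of the negative average log-probability, so a higher
    sequence probability always gives a lower perplexity.
    """
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)


uncertain = perplexity([0.25] * 8)  # model spreads mass over about 4 choices
confident = perplexity([0.5] * 8)   # model spreads mass over about 2 choices
print(uncertain, confident)
```

Applying the same averaging to the conditional probabilities of masked tokens, rather than left-to-right ones, gives a pseudo-perplexity in the spirit of PPPL.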

Extrinsic evaluation

  • Downstream tasks of language models can be used for evaluation
  • GLUE and SuperGLUE are two popular evaluation benchmarks for natural language understanding

Relation between intrinsic and extrinsic evaluations

  • Pre-training tasks (based on word prediction) can help LMs learn linguistic knowledge
  • Empirical studies show pre-training tasks help LMs learn grammar and semantic roles
  • Theoretical studies attempt to build a connection between LM’s perplexities and performance on downstream tasks
  • Text classification tasks can be reformulated as sentence completion tasks

Applications in text generation

  • Text generation is an important application of LMs
  • Text generation tasks vary depending on the purpose and input
  • Examples of text generation tasks include ASR, machine translation, and story generation
  • Common techniques used in text generation are introduced
  • LMs can be applied to each of the representative tasks

Decoding methods

  • Decoding decides the next output linguistic unit to generate text
  • Decoding methods are important for text generation
  • Poor decoding methods lead to bad generated texts
  • Two main decoding methods: maximization-based and sampling-based
  • Maximization-based searches for tokens with highest probability
  • Sampling-based increases diversity of generated texts
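
The two decoding families above can be contrasted on a single next-token distribution. The sketch uses greedy selection as the maximization-based example and top-k sampling as one common sampling-based scheme. The distribution, the function names, and `k = 2` are assumptions for illustration.

```python
import random


def greedy_pick(probs):
    """Maximization-based: return the single most probable token."""
    return max(probs, key=probs.get)


def top_k_sample(probs, k, rng):
    """Sampling-based: draw from the k most probable tokens, renormalized."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [p for _, p in top]
    # random.choices treats weights as relative, so no explicit renormalization is needed
    return rng.choices(tokens, weights=weights, k=1)[0]


next_token_probs = {"mat": 0.5, "sofa": 0.3, "roof": 0.15, "moon": 0.05}
rng = random.Random(0)  # seeded only to make the sketch repeatable

choice = greedy_pick(next_token_probs)
samples = [top_k_sample(next_token_probs, k=2, rng=rng) for _ in range(5)]
print(choice)   # mat
print(samples)  # a mix of 'mat' and 'sofa'
```

Greedy selection always returns the same token for the same distribution, while sampling can return different tokens across calls, which is the source of the added diversity.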

Dialogue systems

  • Dialogue systems simulate human responses when conversing with humans
  • ChatGPT and LaMDA are examples of dialogue systems
  • Dialogue systems can be task-oriented or open-domain
  • ChatGPT is built on a generative language model from the GPT-3 family
  • Language models are important for natural language understanding and generation
  • Language models are evaluated for dialogue tasks
  • Language models are evaluated for few-shot capability in dialogue tasks

Automatic speech recognition

  • Automatic speech recognition (ASR) is a speech-to-text generation task.
  • ASR systems contain an acoustic model and a language model.
  • Language models help resolve acoustically ambiguous utterances and lower computational cost.
  • Different types of language models have been explored in ASR, such as N-gram, FNN, RNN and Transformer.

Machine translation

  • Machine translation is a text-to-text generation task
  • Transformer-based models have been successful in machine translation

Detection of generated texts

  • Performance of LMs is approaching or exceeding human performance on some tasks
  • Misuse of LMs is a serious problem
  • Detection of machine-generated texts is important
  • Two types of detection problems: human written vs. machine generated and inveracious vs. veracious
  • Two common approaches to detecting machine-generated text: exploiting the probability distribution of the text and training classifiers with supervised learning

Efficient models

  • Recent PLMs demand substantial computational resources and energy
  • Efficient pre-training methods proposed to use data more efficiently
  • Bridging pre-training and downstream tasks with prompt tuning

Model size

  • Training efficiency can be improved by designing smaller models.
  • Model compression is a widely studied topic to reduce model size for mobile or edge devices.

Future research directions

  • Investigate the use of language models in other applications
  • Explore ways to improve the accuracy of language models

Integration of LMs and KGs

  • Knowledge Graphs (KGs) are used in many NLP applications.
  • There is a growing interest in evaluating knowledge learned in PLMs.
  • KGs and LMs can interact with each other and KGs can serve as an information database for LMs.

Incremental learning

  • Incremental learning aims to incorporate new information without re-training existing models entirely
  • Problem of catastrophic forgetting associated with neural network models
  • Solution proposed to remember prior important tasks by slowing down learning on weights
  • Difficult to define important tasks in LMs
  • Re-training a large LM too expensive
  • Incremental learning challenging for neural networks
  • Easy for KGs to add/remove data from existing database
  • Integration of KGs and LMs provides solution for incremental learning

Lightweight models

  • PLMs require a lot of computational resources and energy
  • LLMs have a high carbon footprint
  • Green learning (GL) focuses on low carbon footprint solutions
  • Design of lightweight models with lower complexity and smaller size is popular
  • Model compression is used to shrink model size
  • Incorporating linguistic and domain knowledge can reduce model size and the amount of training data needed

Universal versus domain-specific models

  • Universal LM can handle tasks in general domain
  • Domain-specific LMs are designed to handle domain-specific tasks
  • Universal LM requires large model size, training examples, and computational resources
  • Domain-specific LMs require less training data
  • Domain-specific LMs can provide a solid foundation for specialized fields, for example biomedical NLP

Interpretable models

  • Deep-learning-based LMs are black-box methods without mathematical transparency.
  • Efforts have been made to explain black-box LMs.
  • Providing theoretical explanations or establishing explainable LMs is a challenging and open issue.
  • Incorporating KGs with LMs may provide a logical path for each prediction, making predictions more explainable.

Machine generated text detection

  • Common application of LMs is text generation
  • LMs can be used for malicious purposes
  • Challenge is to determine if text is generated by LMs or written by humans
  • Detecting disinformation is harder than telling machine-generated from human-written text, because it also requires assessing factuality


Conclusion

  • Overview of CLMs and PLMs presented
  • Different levels of linguistic units introduced
  • Tokenization methods discussed
  • Language model structures and training paradigm of PLMs reviewed
  • Evaluation and applications of language models studied
  • Need for explainable, reliable, domain-specific, and lightweight language models emphasized