Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Language modeling studies probability distributions over strings of text
Used in text generation, speech recognition, machine translation
Conventional language models (CLMs) predict probability of linguistic sequences in a causal manner
Pre-trained language models (PLMs) cover broader concepts and can be used for both causal sequential modeling and fine-tuning
PLMs have their own training paradigms and serve as foundation models in modern NLP systems
Overview paper provides introduction to CLMs and PLMs from five aspects
Discusses relationship between CLMs and PLMs and future directions of language modeling in pre-trained era

Paper Content

Introduction

Language modeling studies probability distributions over sequences of words
Used in many computational linguistic problems
Two major approaches: statistical and data-driven
Conventional language models (CLMs) predict probability of linguistic sequences in a causal manner
Data-driven approach uses neural-network models, leading to pre-trained language models (PLMs)
Five perspectives: linguistic units, structures, training methods, evaluation methods, applications
CLMs attempt to predict next linguistic unit in a text sequence given its preceding contexts
Represented by characters, words, phrases, etc.
PLMs do not necessarily follow CLMs in predicting linguistic units
Represented by word embeddings and sentence embeddings
Overview paper serves two objectives: introducing basic concepts and future research directions

Types of language models

Auto-regressive models predict the next linguistic units given the preceding context
Chain rule is used to compute the probability of a text sequence
Goal of CLMs is to decode the probability of text sequences in a causal manner
This section introduces more LMs that go beyond CLMs
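
The chain-rule factorization mentioned above can be written out as follows. The symbols are an assumption for illustration, not the paper's notation: u_1, ..., u_T denote the linguistic units of a sequence of length T.

```latex
P(u_1, u_2, \dots, u_T) = \prod_{t=1}^{T} P(u_t \mid u_1, \dots, u_{t-1})
```

Each factor is the "next unit given preceding context" prediction that an auto-regressive model makes.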

Structural LM

Structural LMs predict linguistic units based on pre-defined linguistic structures
Structural LMs use the linguistic structure to bring relevant context closer to the linguistic unit to be predicted
Structural LMs have been successfully applied to sentence completion and speech recognition

Bidirectional LM

Bidirectional LMs use contexts from both directions to make predictions.
Masked LM is a representative bidirectional LM that masks out linguistic units in a text sequence and predicts them.
The goal of bidirectional LMs is to learn the inner dependency between linguistic units in an unsupervised manner.
Pre-trained bidirectional LMs are used as the backbone for further fine-tuning in various downstream applications.
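
A minimal sketch of how masked-LM training examples might be built: some units are replaced by a mask symbol and the originals become prediction targets. The function name `mask_tokens`, the `[MASK]` symbol, and the 15% masking rate are assumptions for illustration, not details taken from the paper.

```python
import random

MASK = "[MASK]"  # assumed placeholder symbol


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets); targets maps position -> original token."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            # Hide this unit; the model must recover it from both sides of context.
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets
```

A model trained on such pairs sees context on both the left and the right of every masked position, which is what distinguishes it from a causal model.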

Permutation LM

CLMs and masked LMs have advantages and disadvantages.
Permutation LM is a recently proposed LM that combines the advantages of CLMs and masked LMs.

Linguistic units

Tokenization is the process of partitioning text sequences into small linguistic units.

Characters

Language models (LMs) can be based on characters.
Character-level LMs have a smaller vocabulary than LMs based on other linguistic units.
It is difficult to predict the next character, usually requiring a long historical context.
Character-level LMs have poorer performance than word-level LMs.
Input and output lengths need to be longer to accurately model character distribution.
Computational costs are higher for auto-regressive decoding.
Combining words and characters can help to alleviate the issue.

Words and subwords

Natural tokenization for English is to decompose text into words by white spaces
Many language models (LMs) apply word tokenization
Issues with naive word tokenization include Out-Of-Vocabulary (OOV) problem
OOV problem occurs when words are not stored in pre-defined vocabulary
Subword segmentation algorithms are developed to boost LM performance
Two subword segmentation approaches: statistics-based and linguistics-based
Statistics-based subword tokenizers generate subword vocabulary based on corpus
Linguistics-based subword tokenizers exploit linguistic knowledge and decompose words into smaller grammatical units
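
As one illustration of a statistics-based subword tokenizer, the sketch below learns merge rules in the style of byte-pair encoding (BPE): it repeatedly fuses the most frequent adjacent symbol pair in the corpus. BPE is chosen here as an assumed example; the function name `learn_bpe` and the toy corpus format are hypothetical.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn BPE-style merges from a {word: count} dict; return the merged pairs."""
    # Start with every word split into single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges
```

Because the vocabulary starts from characters, a word never seen during training can still be segmented into known subwords, which is how this family of tokenizers sidesteps the OOV problem.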

Phrases

Semantic meaning of a single word can be ambiguous due to different contexts and combinations of words.
Word-level language models ignore the relationship between words.
Phrase-level language models replace common and cohesive word sequences with phrases.
Phrase-level language models are suitable for some applications, such as automatic speech recognition.

Sentences

Language models use conditional probabilities to estimate the probability of text sequences.
Sentence-level language models generate sentence features and model the sentence probability directly.
This is more convenient than using the chain rule and easier to encode inter-sentence information.

Model structures

N-gram, maximum entropy, and neural network models are used to model the probability distributions of text sequences
PLMs typically use continuous representations in probability modeling built upon RNNs or transformers

N-gram models

N-grams consist of N consecutive linguistic units from a text sequence
N-gram LMs assume the probability of a word depends on its preceding N-1 linguistic units
N-gram LMs calculate the conditional probability by counting occurrences of N-grams
N-gram LMs simplify the word probability calculation based on previous N-1 words
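N-gram LMs encounter two sparsity issues: an unseen N-gram gets zero probability, and an unseen (N-1)-gram context leaves the conditional probability undefined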
Smoothing techniques are used to alleviate the sparsity issues
Additive smoothing adds a small value to the count for every N-gram
Advanced smoothing techniques such as back-off and interpolation are used for better probability estimation
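
The counting and additive-smoothing steps above can be sketched for the bigram case (N = 2). This is a minimal illustration under stated assumptions: the function names, the `<s>` and `</s>` boundary symbols, and the default `delta=1.0` are all hypothetical choices, not the paper's.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over a list of tokenized sentences."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def bigram_prob(prev, cur, context_counts, bigram_counts, vocab, delta=1.0):
    """P(cur | prev) with additive smoothing: add delta to every bigram count."""
    numerator = bigram_counts[(prev, cur)] + delta
    denominator = context_counts[prev] + delta * len(vocab)
    return numerator / denominator
```

With `delta > 0`, a bigram that never occurred in training still receives a small non-zero probability, and an unseen context falls back to a uniform distribution over the vocabulary.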

Maximum entropy models

Maximum Entropy models estimate the probability of text sequences using feature functions.
Features are usually generated from N-grams and the Generalized Iterative Scaling algorithm is used to derive the parameter vector.
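
A common way to write a maximum entropy LM is the log-linear form below. The notation is an assumption for illustration: w is the unit to predict, h its history, f(w, h) the feature vector, and a the parameter vector.

```latex
P(w \mid h) = \frac{\exp\left(a^{\top} f(w, h)\right)}{\sum_{w'} \exp\left(a^{\top} f(w', h)\right)}
```

The denominator sums over all candidate units w', so the scores form a valid probability distribution.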

Feed-forward neural network (FNN) models

Discrete nature of the N-gram model is a performance bottleneck, even with smoothing
Neural LMs use continuous embedding space to overcome data sparsity
Feed-forward Neural Network (FNN) LMs use historical contexts as input and output probability distribution of words
FNN LM uses a fixed window to collect fixed-length contexts
FNN LM can handle unseen N-grams and is storage-efficient

Recurrent neural network (RNN) models

A fixed-length historical context is insufficient to predict the next word.
RNN LMs can exploit arbitrarily long histories to predict the next word.
RNN LMs use a non-linear activation function and weight matrices to compute the hidden state.
RNN LMs can theoretically use all preceding history to predict the next word, but in practice the gradient vanishing problem hampers the learning of the model.
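
The hidden-state computation can be sketched with a standard simple-RNN formulation. This is an assumed textbook form, not necessarily the paper's exact notation: x_t is the input embedding at step t, h_t the hidden state, W, U and V are weight matrices, and f is a non-linear activation such as tanh.

```latex
h_t = f\left(W x_t + U h_{t-1}\right), \qquad
P(w_{t+1} \mid w_1, \dots, w_t) = \mathrm{softmax}\left(V h_t\right)
```

Since h_t depends on h_{t-1}, it summarizes the whole preceding history. Gradients must flow back through every step of that recurrence, which is where the vanishing gradient problem arises.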

Transformers

Transformer architecture can capture long-term dependencies and important sequence components.
Transformer is easy to parallelize in both training and inference.
Transformer consists of an encoder and a decoder.
Different transformer models are suitable for different tasks.

Pre-trained language models

PLMs are widely used in NLP
Deep learning has changed the way PLMs are trained and used
PLMs are pre-trained on large corpora to learn universal representations
PLMs are fine-tuned for downstream tasks to transfer knowledge
Several survey papers on PLMs exist

Pre-training

Pre-training tasks are used to train language models
Most common pre-training task is “missing word prediction”, which includes masked language modeling
Other pre-training tasks include next-sentence prediction
Pre-training tasks can help language models learn better linguistic knowledge, such as sentence relationships
Other pre-training objectives include token deletion, text infilling, sentence permutation, and document rotation

Fine-tuning and prompt-tuning

PLMs learn language knowledge in pre-training stage
Fine-tuning adapts model for downstream tasks
Model parameters are updated in fine-tuning stage
Prompt-tuning mimics pre-training objectives in fine-tuning/inference stage

Model evaluation

Intrinsic evaluation examines internal properties of an LM
Extrinsic evaluation studies performance in downstream tasks

Intrinsic evaluation

Auto-regressive LM estimates probability of text sequences
Perplexity is a common evaluation metric for this purpose
Good LM should maximize text set probability, which is equivalent to minimizing perplexity
Perplexity does not apply directly to bidirectional LMs, since they do not factorize the sequence probability with the chain rule
Pseudo-log-likelihood (PLL) scores and pseudo-perplexity (PPPL) are intrinsic metrics that measure naturalness of sentences for bidirectional LMs
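
Perplexity for an auto-regressive LM can be computed from the per-token conditional probabilities, as in the minimal sketch below. The function name and the input format (a list of probabilities, one per token) are assumptions for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sequence, given P(token_t | preceding tokens) for each token."""
    n = len(token_probs)
    # Sum log-probabilities rather than multiplying, to avoid numerical underflow.
    log_prob = sum(math.log(p) for p in token_probs)
    # Perplexity is the inverse sequence probability, normalized by length.
    return math.exp(-log_prob / n)
```

A model that assigns higher probability to the observed tokens gets a lower perplexity, which is why maximizing text probability and minimizing perplexity are equivalent.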

Extrinsic evaluation

Downstream tasks of language models can be used for evaluation
GLUE and SuperGLUE are two popular evaluation benchmarks for natural language understanding

Relation between intrinsic and extrinsic evaluations

Pre-training tasks (based on word prediction) can help LMs learn linguistic knowledge
Empirical studies show pre-training tasks help LMs learn grammar and semantic roles
Theoretical studies attempt to build a connection between LM’s perplexities and performance on downstream tasks
Text classification tasks can be reformulated as sentence completion tasks
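
One way to picture classification as sentence completion: append a prompt to the input and let the LM choose which label word best completes it. The sketch below is hypothetical throughout; `score_completion` stands in for a real LM that scores a candidate word given a prompt, and the template and label words are made up for illustration.

```python
def classify_by_completion(text, label_words, score_completion):
    """Pick the label whose word the scorer rates highest as the completion."""
    prompt = f"{text} Overall, it was"  # assumed template
    scores = {
        label: score_completion(prompt, word)
        for label, word in label_words.items()
    }
    return max(scores, key=scores.get)


def toy_scorer(prompt, word):
    """Toy stand-in for an LM score; a real system would use model probabilities."""
    return 1.0 if ("loved" in prompt) == (word == "great") else 0.0


label_words = {"positive": "great", "negative": "terrible"}
prediction = classify_by_completion("I loved this film.", label_words, toy_scorer)
```

Swapping `toy_scorer` for the log-probability a pre-trained LM assigns to each label word turns this into a prompt-based classifier with no task-specific output layer.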

Applications in text generation

Text generation is an important application of LMs
Text generation tasks vary depending on the purpose and input
Examples of text generation tasks include ASR, machine translation, and story generation
Common techniques used in text generation are introduced
LMs can be applied to each of the representative tasks

Decoding methods

Decoding decides the next output linguistic unit to generate text
Decoding methods are important for text generation
Poor decoding methods lead to bad generated texts
Two main decoding methods: maximization-based and sampling-based
Maximization-based searches for tokens with highest probability
Sampling-based increases diversity of generated texts
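
The two families of decoding methods can be contrasted on a single decoding step. In this minimal sketch, greedy selection represents the maximization-based family and top-k sampling represents the sampling-based family. Both are assumed examples, and the function names and the dict-of-probabilities input format are hypothetical.

```python
import random


def greedy_pick(probs):
    """Maximization-based step: return the token with the highest probability."""
    return max(probs, key=probs.get)


def top_k_sample(probs, k, rng):
    """Sampling-based step: sample among the k most probable tokens."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [p for _, p in top]
    # random.Random.choices normalizes the weights internally.
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Greedy selection always returns the same token for the same distribution. Sampling can return different tokens across calls, which is the source of the added diversity.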

Dialogue systems

Dialogue systems simulate human responses when conversing with humans
ChatGPT and LaMDA are examples of dialogue systems
Dialogue systems can be task-oriented or open-domain
ChatGPT is built on a generative language model (GPT-3)
Language models are important for natural language understanding and generation
Language models are evaluated for dialogue tasks
Language models are evaluated for few-shot capability in dialogue tasks

Automatic speech recognition

Automatic speech recognition (ASR) is a speech-to-text generation task.
ASR systems contain an acoustic model and a language model.
Language models help solve acoustically ambiguous utterances and lower computational cost.
Different types of language models have been explored in ASR, such as N-gram, FNN, RNN and Transformer.

Machine translation

Machine translation is a text-to-text generation task
Transformer-based models have been successful in machine translation

Detection of generated texts

Performance of LMs is approaching or exceeding human performance on some tasks
Misuse of LMs is a serious problem
Detection of machine-generated texts is important
Two types of detection problems: human written vs. machine generated and inveracious vs. veracious
Two common approaches to detecting machine-generated text: exploit probability distribution and train classifiers with supervised learning
Recent PLMs require high computational resources and energy consumption
Efficient pre-training methods proposed to use data more efficiently
Bridging pre-training and downstream tasks with prompt tuning

Model size

Training efficiency can be improved by designing smaller models.
Model compression is a widely studied topic to reduce model size for mobile or edge devices.

Future research directions

Investigate the use of language models in other applications
Explore ways to improve the accuracy of language models

Integration of LMs and KGs

Knowledge Graphs (KGs) are used in many NLP applications.
There is a growing interest in evaluating knowledge learned in PLMs.
KGs and LMs can interact with each other and KGs can serve as an information database for LMs.

Incremental learning

Incremental learning aims to incorporate new information without re-training existing models entirely
Problem of catastrophic forgetting associated with neural network models
Solution proposed to remember prior important tasks by slowing down learning on the weights important for those tasks
Difficult to define important tasks in LMs
Re-training a large LM too expensive
Incremental learning challenging for neural networks
Easy for KGs to add/remove data from existing database
Integration of KGs and LMs provides solution for incremental learning

Lightweight models

PLMs require a lot of computational resources and energy
LLMs have a high carbon footprint
Green learning (GL) focuses on low carbon footprint solutions
Design of lightweight models with lower complexity and smaller size is popular
Model compression is used to shrink model size
Incorporating linguistic and domain knowledge can reduce model size and training data

Universal versus domain-specific models

Universal LM can handle tasks in general domain
Domain-specific LMs are designed to handle domain-specific tasks
Universal LM requires large model size, training examples, and computational resources
Domain-specific LMs require less training data
Domain-specific LMs can provide a solid foundation for biomedical NLP

Interpretable models

Deep-learning-based LMs are black-box methods without mathematical transparency.
Efforts have been made to explain black-box LMs.
Providing theoretical explanations or establishing explainable LMs is a challenging and open issue.
Incorporating KGs with LMs may provide a logical path for each prediction, making predictions more explainable.

Machine-generated text detection

Common application of LMs is text generation
LMs can be used for malicious purposes
Challenge is to determine if text is generated by LMs or written by humans
Detecting disinformation, which requires assessing factuality, is more difficult than distinguishing machine-generated from human-written text

Conclusion

Overview of CLMs and PLMs presented
Different levels of linguistic units introduced
Tokenization methods discussed
Language model structures and training paradigm of PLMs reviewed
Evaluation and applications of language models studied
Need for explainable, reliable, domain-specific, and lightweight language models emphasized
