Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Language modeling studies probability distributions over strings of texts
- Used in text generation, speech recognition, machine translation
- Conventional language models (CLMs) predict probability of linguistic sequences in a causal manner
- Pre-trained language models (PLMs) cover broader concepts and can be used for both causal sequential modeling and fine-tuning
- PLMs have their own training paradigms and serve as foundation models in modern NLP systems
- Overview paper provides introduction to CLMs and PLMs from five aspects
- Discusses relationship between CLMs and PLMs and future directions of language modeling in pre-trained era
Paper Content
Introduction
- Language modeling studies probability distributions over sequences of words
- Used in many computational linguistic problems
- Two major approaches: statistical and data-driven
- Conventional language models (CLMs) predict probability of linguistic sequences in a causal manner
- Data-driven approach uses neural-network models, leading to pre-trained language models (PLMs)
- Five perspectives: linguistic units, structures, training methods, evaluation methods, applications
- CLMs attempt to predict next linguistic unit in a text sequence given its preceding contexts
- Represented by characters, words, phrases, etc.
- PLMs do not necessarily follow CLMs in predicting linguistic units
- Represented by word embeddings and sentence embeddings
- Overview paper serves two objectives: introducing basic concepts and future research directions
Types of language models
- Auto-regressive models predict the next linguistic units given the preceding context
- Chain rule is used to access the probability of a text sequence
- Goal of CLMs is to decode the probability of text sequences in a causal manner
- This section introduces more LMs that go beyond CLMs
Structural lm
- Structural LMs predict linguistic units based on pre-defined linguistic structures
- Structural LMs use the linguistic structure to bring relevant context closer to the linguistic unit to be predicted
- Structural LMs have been successfully applied to sentence completion and speech recognition
Bidirectional lm
- Bidirectional LMs use contexts from both directions to make predictions.
- Masked LM is a representative bidirectional LM that masks out linguistic units in a text sequence and predicts them.
- The goal of bidirectional LMs is to learn the inner dependency between linguistic units in an unsupervised manner.
- Pre-trained bidirectional LMs are used as the backbone for further fine-tuning in various downstream applications.
Permutation lm
- CLMs and masked LMs have advantages and disadvantages.
- Permutation LM is a recently proposed LM that combines CLMs and masked LMs.
- Tokenization is the process of partitioning text sequences into small linguistic units.
Characters
- Language models (LMs) can be based on characters.
- Using characters has a smaller vocabulary size than other linguistics units.
- It is difficult to predict the next character, usually requiring a long historical context.
- Character-level LMs have poorer performance than word-level LMs.
- Input and output lengths need to be longer to accurately model character distribution.
- Computational costs are higher for auto-regressive decoding.
- Combining words and characters can help to alleviate the issue.
Words and subwords
- Natural tokenization for English is to decompose text into words by white spaces
- Many language models (LMs) apply word tokenization
- Issues with naive word tokenization include Out-Of-Vocabulary (OOV) problem
- OOV problem occurs when words are not stored in pre-defined vocabulary
- Subword segmentation algorithms are developed to boost LM performance
- Two subword segmentation approaches: statistics-based and linguistics-based
- Statistics-based subword tokenizers generate subword vocabulary based on corpus
- Linguistics-based subword tokenizers exploit linguistic knowledge and decompose words into smaller grammatical units
Phrases
- Semantic meaning of a single word can be ambiguous due to different contexts and combinations of words.
- Word-level language models ignore the relationship between words.
- Phrase-level language models replace common and cohesive word sequences with phrases.
- Phrase-level language models are suitable for some applications, such as automatic speech recognition.
Sentences
- Language models use conditional probabilities to estimate the probability of text sequences.
- Sentence-level language models generate sentence features and model the sentence probability directly.
- This is more convenient than using the chain rule and easier to encode inter-sentence information.
Model structures
- N-gram, maximum entropy, and neural network models are used to model the probability distributions of text sequences
- PLMs typically use continuous representations in probability modeling built upon RNNs or transformers
N-gram models
- N-grams consist of N consecutive linguistic units from a text sequence
- N-gram LMs assume the probability of a word depends on its preceding N-1 linguistic units
- N-gram LMs calculate the conditional probability by counting the occurrence time of N-grams
- N-gram LMs simplify the word probability calculation based on previous N-1 words
- N-gram LMs encounter two sparsity issues
- Smoothing techniques are used to alleviate the sparsity issues
- Additive smoothing adds a small value to the count for every N-gram
- Advanced smoothing techniques such as back-off and interpolation are used for better probability estimation
Maximum entropy models
- Maximum Entropy models estimate the probability of text sequences using feature functions.
- Features are usually generated from N-grams and the Generalized Iterative Scaling algorithm is used to derive the parameter vector.
Feed-forward neural network (fnn) models
- N-gram model has a performance bottleneck
- Neural LMs use continuous embedding space to overcome data sparsity
- Feed-forward Neural Network (FNN) LMs use historical contexts as input and output probability distribution of words
- FNN LM uses a fixed window to collect fixed-length contexts
- FNN LM can handle unseen N-grams and is storage-efficient
Recurrent neural network (rnn) models
- Historical context is insufficient to predict the next word.
- RNN LMs can exploit arbitrarily long histories to predict the next word.
- RNN LMs use a non-linear activation function and weight matrices to compute the hidden state.
- RNN LMs can theoretically use all preceding history to predict the next word, but in practice the gradient vanishing problem hampers the learning of the model.
Transformers
- Transformer architecture can capture long-term dependencies and important sequence components.
- Transformer is easy to parallelize in both training and inference.
- Transformer consists of an encoder and a decoder.
- Different transformer models are suitable for different tasks.
Pre-trained language models
- PLMs are widely used in NLP
- Deep learning has changed the way PLMs are trained and used
- PLMs are pre-trained on large corpora to learn universal representations
- PLMs are fine-tuned for downstream tasks to transfer knowledge
- Several survey papers on PLMs exist
Pre-training
- Pre-training tasks are used to train language models
- Common pre-training task is “missing word prediction”
- Other pre-training tasks include next-sentence prediction and masked language model
- Pre-training tasks can help language models learn better linguistic knowledge, such as sentence relationships
- Other pre-training objectives include token deletion, text infilling, sentence permutation, and document rotation
Fine-tuning and prompt-tuning
- PLMs learn language knowledge in pre-training stage
- Fine-tuning adapts model for downstream tasks
- Model parameters are updated in fine-tuning stage
- Prompt-tuning mimics pre-training objectives in fine-tuning/inference stage
Model evaluation
- Intrinsic evaluation examines internal properties of an LM
- Extrinsic evaluation studies performance in downstream tasks
Intrinsic evaluation
- Auto-regressive LM estimates probability of text sequences
- Perplexity is a common evaluation metric for this purpose
- Good LM should maximize text set probability, which is equivalent to minimizing perplexity
- Bidirectional Language Model uses different approach to calculate inverse probability
- Intrinsic evaluation metrics (PLL and PPPL) measure naturalness of sentences for bidirectional LM
Extrinsic evaluation
- Downstream tasks of language models can be used for evaluation
- GLUE and SuperGLUE are two popular evaluation benchmarks for natural language understanding
Relation between intrinsic and extrinsic evaluations
- Pre-training tasks (based on word prediction) can help LMs learn linguistic knowledge
- Empirical studies show pre-training tasks help LMs learn grammar and semantic roles
- Theoretical studies attempt to build a connection between LM’s perplexities and performance on downstream tasks
- Text classification tasks can be reformulated as sentence completion tasks
Applications in text generation
- Text generation is an important application of LMs
- Text generation tasks vary depending on the purpose and input
- Examples of text generation tasks include ASR, machine translation, and story generation
- Common techniques used in text generation are introduced
- LMs can be applied to each of the representative tasks
Decoding methods
- Decoding decides the next output linguistic unit to generate text
- Decoding methods are important for text generation
- Poor decoding methods lead to bad generated texts
- Two main decoding methods: maximization-based and sampling-based
- Maximization-based searches for tokens with highest probability
- Sampling-based increases diversity of generated texts
Dialogue systems
- Dialogue systems simulate human responses when conversing with humans
- ChatGPT and LaMDA are examples of dialogue systems
- Dialogue systems can be task-oriented or open-domain
- ChatGPT is built on a generative language model (GPT-3)
- Language models are important for natural language understanding and generation
- Language models are evaluated for dialogue tasks
- Language models are evaluated for few-shot capability in dialogue tasks
Automatic speech recognition
- Automatic speech recognition (ASR) is a speech-to-text generation task.
- ASR systems contain an acoustic model and a language model.
- Language models help solve acoustically ambiguous utterances and lower computational cost.
- Different types of language models have been explored in ASR, such as N-gram, FFNN, RNN and Transformer.
Machine translation
- Machine translation is a text-to-text generation task
- Transformer-based models have been successful in machine translation
Detection of generated texts
- Performance of LMs is getting close to or better than humans
- Misuse of LMs is a serious problem
- Detection of machine-generated texts is important
- Two types of detection problems: human written vs. machine generated and inveracious vs. veracious
- Two common approaches to detecting machine-generated text: exploit probability distribution and train classifiers with supervised learning
- Recent PLMs require high computational resources and energy consumption
- Efficient pre-training methods proposed to use data more efficiently
- Bridging pre-training and downstream tasks with prompt tuning
Model size
- Training efficiency can be improved by designing smaller models.
- Model compression is a widely studied topic to reduce model size for mobile or edge devices.
Future research directions
- Investigate the use of language models in other applications
- Explore ways to improve the accuracy of language models
Integration of lms and kgs
- Knowledge Graphs (KGs) are used in many NLP applications.
- There is a growing interest in evaluating knowledge learned in PLMs.
- KGs and LMs can interact with each other and KGs can serve as an information database for LMs.
Incremental learning
- Incremental learning aims to incorporate new information without re-training existing models entirely
- Problem of catastrophic forgetting associated with neural network models
- Solution proposed to remember prior important tasks by slowing down learning on weights
- Difficult to define important tasks in LMs
- Re-training a large LM too expensive
- Incremental learning challenging for neural networks
- Easy for KGs to add/remove data from existing database
- Integration of KGs and LMs provides solution for incremental learning
Lightweight models
- PLMs require a lot of computational resources and energy
- LLMs have a high carbon footprint
- GL focuses on low carbon footprint solutions
- Design of lightweight models with lower complexity and smaller size is popular
- Model compression is used to shrink model size
- Incorporating linguistic and domain knowledge can reduce model size and training data
Universal versus domain-specific models
- Universal LM can handle tasks in general domain
- Domain-specific LMs are designed to handle domain-specific tasks
- Universal LM requires large model size, training examples, and computational resources
- Domain-specific LMs require less training data
- Domain-specific LMs can provide a solid foundation for biomedical NLP
Interpretable models
- Deep-learning-based LMs are black-box methods without mathematical transparency.
- Efforts have been made to explain black-box LMs.
- Providing theoretical explanations or establishing explainable LMs is a challenging and open issue.
- Incorporating KGs with LMs may provide a logical path for each prediction, making predictions more explainable.
Machine generated text detection
- Common application of LMs is text generation
- LMs can be used for malicious purposes
- Challenge is to determine if text is generated by LMs or written by humans
- Detecting disinformation is more difficult than detecting machine/human-generated text without assessing the factuality
Conclusion
- Overview of CLMs and PLMs presented
- Different levels of linguistic units introduced
- Tokenization methods discussed
- Language model structures and training paradigm of PLMs reviewed
- Evaluation and applications of language models studied
- Need for explainable, reliable, domain-specific, and lightweight language models emphasized