Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Standard probabilistic models for language generation have difficulty estimating the right probability distribution over next tokens.
- Models identify a simple, loss-minimising behaviour of outputting the unigram distribution of the target training corpus.
- A separate module can be implemented to reflect unigram frequency statistics as prior knowledge.
- Initialising the bias term in a model’s final linear layer with the log-unigram distribution improves learning efficiency and overall performance.
Paper Content
Introduction
- Predictions in NLP tasks are usually based on existing knowledge of language
- If context is not understood, the optimal prediction is the language’s most frequent word
- Neural language generation models take several hundred or thousand training updates to learn the unigram frequency distribution
- A factorisation of a language model’s final linear layer can encode and integrate unigram frequency knowledge prior to any optimisation
- This initialisation technique leads to increased training efficiency and improved overall performance
Probabilistic language generators
- Probabilistic models are used for language generation
- These models are parameterized by deep neural networks
- Output of the model is projected onto the probability simplex
- Model parameters are estimated by minimizing a loss function
- Decoding algorithm is used to generate strings from the model
A natural bias
- Prior studies have investigated whether and when NLP models learn various linguistic phenomena.
- Language models reflect the statistical tendencies of their training corpora.
- Some of these tendencies are learnt early on in training.
- After 1000 training updates, language models output the unigram distribution regardless of context.
- Providing the unigram distribution as prior knowledge can help bypass the frequency-learning stage early on in training.
- The unigram prior knowledge can be encoded in the bias term of the final, pre-softmax linear layer.
Experiments
Setup
- Exploring the effects of unigram bias initialisation on neural machine translation systems
- Comparing to standard initialisation technique of initialising bias to 0s or omitting bias term
- Unigram frequencies computed on respective training sets after tokenisation
- Parameters of projection matrix in final linear layer initialised using normal random variables
- Experiments with several language pairs: WMT'14 German-to-English, IWSLT'14 German-to-English, and multiple language pairs in AfroMT dataset
- Preprocess data using byte-pair tokenisation
- Experiments using Transformer encoder-decoder architecture
- Parameter estimation using stochastic gradient-descent techniques
- Decoding done with length-normalised beam search with beam size of 5
Results
- Unigram bias initialisation technique leads to better test set performance than standard bias term initialisation techniques
- Unigram bias initialisation approach reaches better performance in fewer iterations
- Unigram log-frequency is plotted against average log-probability assigned to it
- Unigram-initialised models have frequency biases encoded in the bias term
Discussion
- Models can learn superficial statistical tendencies of language
- Takahashi and Tanaka-Ishii (2019) found evidence that more powerful language models have a natural bias for learning them
- Initialising model parameters with prior knowledge of unigram distribution can improve training efficiency and performance
- Two possible explanations for improvement: changes model learning dynamics and disentangles frequency in modelling of contextual probabilities
Related work
- Prior works have explored strategies for model weight initialisation
- Other works focus on analysing variance caused by different weight initialisation techniques
- Proposed initialisation technique is intuited by learning trends observed in language generation models
- Other works have embraced frameworks akin to product or mixture of experts in language modelling or generation tasks
- Suggested initialisation method does not require training additional models or major changes to model architectures
- Ben Zaken et al. (2022) investigate usefulness of bias term for efficient finetuning techniques
Conclusion and future work
- Explores a simple initialisation technique for language generation models
- Sets the bias term in the final linear projection layer to the log-unigram distribution of (sub)words within the training corpus
- Leads to more efficient training and better overall performance in machine translation experiments
- Analysis and discussion of the cause of these trends
- Could be used to mitigate problems with lexically infrequent words
- Applicable in any classification setting
- Uses the standard Transformer architecture
- Adam optimizer with (β 1 , β 2 ) = (0.9, 0.997)
- Dropout set to 0.1
- Feedforward hidden dimension set to 512 for WMT model and 256 for all other models
- Linear projection matrix W initialised using normal random variables
- Could have downstream negative side-effects
- Regularisation away from the unigram distribution leads to worse models
- Initialising the bias term with the log-unigram distribution improves generalisation performance
- Investigates the effects of this technique in low resource settings