Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Standard probabilistic models for language generation have difficulty estimating the right probability distribution over next tokens.
Models identify a simple, loss-minimising behaviour of outputting the unigram distribution of the target training corpus.
A separate module can be implemented to reflect unigram frequency statistics as prior knowledge.
Initialising the bias term in a model’s final linear layer with the log-unigram distribution improves learning efficiency and overall performance.

Predictions in NLP tasks are usually based on existing knowledge of language
If context is not understood, the optimal prediction is the language’s most frequent word
Neural language generation models take several hundred or thousand training updates to learn the unigram frequency distribution
A factorisation of a language model’s final linear layer can encode and integrate unigram frequency knowledge prior to any optimisation
This initialisation technique leads to increased training efficiency and improved overall performance

Prior studies have investigated whether and when NLP models learn various linguistic phenomena.
Language models reflect the statistical tendencies of their training corpora.
Some of these tendencies are learnt early on in training.
After 1000 training updates, language models output the unigram distribution regardless of context.
Providing the unigram distribution as prior knowledge can help bypass the frequency-learning stage early on in training.
The unigram prior knowledge can be encoded in the bias term of the final, pre-softmax linear layer.

Exploring the effects of unigram bias initialisation on neural machine translation systems
Comparing to standard initialisation technique of initialising bias to 0s or omitting bias term
Unigram frequencies computed on respective training sets after tokenisation
Parameters of projection matrix in final linear layer initialised using normal random variables
Experiments with several language pairs: WMT'14 German-to-English, IWSLT'14 German-to-English, and multiple language pairs in AfroMT dataset
Preprocess data using byte-pair tokenisation
Experiments using Transformer encoder-decoder architecture
Parameter estimation using stochastic gradient-descent techniques
Decoding done with length-normalised beam search with beam size of 5

Unigram bias initialisation technique leads to better test set performance than standard bias term initialisation techniques
Unigram bias initialisation approach reaches better performance in fewer iterations
Unigram log-frequency is plotted against average log-probability assigned to it
Unigram-initialised models have frequency biases encoded in the bias term

Models can learn superficial statistical tendencies of language
Takahashi and Tanaka-Ishii (2019) found evidence that more powerful language models have a natural bias for learning them
Initialising model parameters with prior knowledge of unigram distribution can improve training efficiency and performance
Two possible explanations for improvement: changes model learning dynamics and disentangles frequency in modelling of contextual probabilities

Prior works have explored strategies for model weight initialisation
Other works focus on analysing variance caused by different weight initialisation techniques
Proposed initialisation technique is intuited by learning trends observed in language generation models
Other works have embraced frameworks akin to product or mixture of experts in language modelling or generation tasks
Suggested initialisation method does not require training additional models or major changes to model architectures
Ben Zaken et al. (2022) investigate usefulness of bias term for efficient finetuning techniques

Explores a simple initialisation technique for language generation models
Sets the bias term in the final linear projection layer to the log-unigram distribution of (sub)words within the training corpus
Leads to more efficient training and better overall performance in machine translation experiments
Analysis and discussion of the cause of these trends
Could be used to mitigate problems with lexically infrequent words
Applicable in any classification setting
Uses the standard Transformer architecture
Adam optimizer with (β 1 , β 2 ) = (0.9, 0.997)
Dropout set to 0.1
Feedforward hidden dimension set to 512 for WMT model and 256 for all other models
Linear projection matrix W initialised using normal random variables
Could have downstream negative side-effects
Regularisation away from the unigram distribution leads to worse models
Initialising the bias term with the log-unigram distribution improves generalisation performance
Investigates the effects of this technique in low resource settings