

  • After only a few hundred training updates, a standard probabilistic language generation model has likely not yet learnt many syntactic or semantic rules, so it struggles to estimate the probability distribution over next tokens.
  • Around this point, models settle on a simple, loss-minimising behaviour: outputting the unigram distribution of the target training corpus.
  • Standard neural language generation models can be given a separate module that reflects unigram frequency statistics as prior knowledge.
  • Concretely, initialising the bias term in a model’s final linear layer with the log-unigram distribution improves learning efficiency and overall performance.

Paper Content


  • Predictions in NLP tasks are usually based on existing knowledge of language
  • If context is not understood, the optimal prediction is the language’s most frequent word
  • Neural language generation models take several hundred or thousand training updates to learn the unigram frequency distribution
  • A factorisation of a language model’s final linear layer can encode and integrate unigram frequency knowledge prior to any optimisation
  • This initialisation technique leads to increased training efficiency and improved overall performance

Probabilistic language generators

  • Probabilistic models are used for language generation
  • These models are parameterized by deep neural networks
  • Output of the model is projected onto the probability simplex
  • Model parameters are estimated by minimizing a loss function
  • Decoding algorithm is used to generate strings from the model
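
The projection and loss steps listed above can be illustrated with a minimal sketch. It assumes a softmax projection and a negative log-likelihood loss, which are the usual choices but are not spelled out in this summary; the three-token vocabulary and the scores are made up for illustration.

```python
import math


def softmax(logits):
    """Project real-valued scores onto the probability simplex."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(logits, target_index):
    """Loss for a single next-token prediction (assumed: cross-entropy)."""
    return -math.log(softmax(logits)[target_index])


# Hypothetical pre-softmax scores for a 3-token vocabulary.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
loss = negative_log_likelihood(logits, target_index=0)
```

Here `probs` is non-negative and sums to one, and `loss` shrinks as the model puts more probability on the target token.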

A natural bias

  • Prior studies have investigated whether and when NLP models learn various linguistic phenomena.
  • Language models reflect the statistical tendencies of their training corpora.
  • Some of these tendencies are learnt early on in training.
  • Early in training (after roughly 1000 updates), language models tend to output the unigram distribution regardless of context.
  • Providing the unigram distribution as prior knowledge can help bypass the frequency-learning stage early on in training.
  • The unigram prior knowledge can be encoded in the bias term of the final, pre-softmax linear layer.
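
A minimal sketch of the idea in the last bullet, in plain Python rather than a deep learning framework. The function names, the toy corpus and the add-one smoothing are assumptions made for illustration; the summary does not say how unseen vocabulary items are handled. The point it shows: if the bias holds log-unigram probabilities and the projection contributes nothing, the output distribution is exactly the unigram distribution.

```python
import math
from collections import Counter


def log_unigram_bias(tokenised_corpus, vocab, smoothing=1.0):
    """Log-unigram probabilities in vocabulary order.

    `smoothing` (add-k) is an assumption to avoid log(0) for unseen types.
    """
    counts = Counter(tok for sent in tokenised_corpus for tok in sent)
    total = sum(counts[tok] + smoothing for tok in vocab)
    return [math.log((counts[tok] + smoothing) / total) for tok in vocab]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_distribution(W, b, h):
    """softmax(W h + b) for one hidden state h."""
    logits = [sum(w * x for w, x in zip(row, h)) + b_i for row, b_i in zip(W, b)]
    return softmax(logits)


# Toy tokenised target-side corpus (hypothetical).
vocab = ["the", "cat", "sat", "</s>"]
corpus = [["the", "cat", "sat", "</s>"], ["the", "cat", "</s>"], ["the", "</s>"]]

bias = log_unigram_bias(corpus, vocab)

# With a zero projection matrix the context is ignored entirely,
# so the model outputs the (smoothed) unigram distribution.
W = [[0.0, 0.0] for _ in vocab]
probs = output_distribution(W, bias, h=[0.3, -1.2])
```

In a real model the projection matrix is initialised randomly rather than to zero, so the initial output is only close to the unigram distribution; the zero matrix is used here to make the identity exact.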

Experiments

  • Exploring the effects of unigram bias initialisation on neural machine translation systems
  • Baselines are the standard approaches: initialising the bias to zeros or omitting the bias term altogether
  • Unigram frequencies computed on respective training sets after tokenisation
  • Parameters of projection matrix in final linear layer initialised using normal random variables
  • Experiments with several language pairs: WMT'14 German-to-English, IWSLT'14 German-to-English, and multiple language pairs in AfroMT dataset
  • Preprocess data using byte-pair tokenisation
  • Experiments using Transformer encoder-decoder architecture
  • Parameter estimation using stochastic gradient-descent techniques
  • Decoding done with length-normalised beam search with beam size of 5
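
For readers unfamiliar with the decoding step in the last bullet, here is a minimal sketch of length-normalised beam search. It is not the authors' implementation: the toy next-token model is invented, and applying the normalisation only when ranking the final hypotheses (dividing the summed log-probability by the number of generated tokens) is one common variant, assumed here.

```python
import math


def beam_search(next_log_probs, bos, eos, beam_size=5, max_len=20):
    """Return the hypothesis with the best average log-probability per token.

    `next_log_probs(prefix)` maps each candidate next token to its log-probability.
    """
    beams = [([bos], 0.0)]  # (token sequence, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # Keep the best `beam_size` expansions by summed log-probability.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    # Length normalisation: rank by average log-probability per generated token.
    return max(finished, key=lambda c: c[1] / max(1, len(c[0]) - 1))


def toy_model(prefix):
    """Hypothetical context-independent model that forces an end after 3 tokens."""
    if len(prefix) >= 4:
        return {"</s>": 0.0}
    return {"a": math.log(0.5), "b": math.log(0.3), "</s>": math.log(0.2)}


best_seq, best_score = beam_search(toy_model, bos="<s>", eos="</s>", beam_size=5)
```

Without the normalisation, the immediate end-of-sequence hypothesis would win in this toy setting because it has the highest summed log-probability; dividing by length removes that preference for short outputs.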

Results

  • Unigram bias initialisation leads to better test-set performance than the standard bias initialisation techniques
  • Unigram bias initialisation approach reaches better performance in fewer iterations
  • The analysis plots each token's unigram log-frequency against the average log-probability the model assigns to that token
  • Unigram-initialised models have frequency biases encoded in the bias term

Discussion and related work

  • Models can learn superficial statistical tendencies of language
  • Takahashi and Tanaka-Ishii (2019) found evidence that more powerful language models have a natural bias for learning them
  • Initialising model parameters with prior knowledge of unigram distribution can improve training efficiency and performance
  • Two possible explanations for the improvement: the initialisation changes the model's learning dynamics, and it disentangles frequency from the modelling of contextual probabilities
  • Prior works have explored strategies for model weight initialisation
  • Other works focus on analysing variance caused by different weight initialisation techniques
  • The proposed initialisation technique is motivated by learning trends observed in language generation models
  • Other works have embraced frameworks akin to product or mixture of experts in language modelling or generation tasks
  • Suggested initialisation method does not require training additional models or major changes to model architectures
  • Ben Zaken et al. (2022) investigate usefulness of bias term for efficient finetuning techniques

Conclusion and future work

  • The paper explores a simple initialisation technique for language generation models
  • The technique sets the bias term in the final linear projection layer to the log-unigram distribution of (sub)words within the training corpus
  • It leads to more efficient training and better overall performance in machine translation experiments
  • The paper analyses and discusses possible causes of these trends
  • The technique could be used to mitigate problems with lexically infrequent words
  • It is also applicable, in principle, in any classification setting
  • The experiments use the standard Transformer architecture
  • Adam optimizer with (β₁, β₂) = (0.9, 0.997)
  • Dropout set to 0.1
  • Feedforward hidden dimension set to 512 for WMT model and 256 for all other models
  • Linear projection matrix W initialised using normal random variables
  • The technique could have negative downstream side-effects
  • Regularising the model away from the unigram distribution leads to worse models
  • Initialising the bias term with the log-unigram distribution improves generalisation performance
  • The paper also investigates the effects of this technique in low-resource settings