Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Most language models are trained and applied in a left-to-right fashion.
This paper proposes a new pre-training paradigm to improve training data efficiency and capabilities of language models.
The proposed pre-training paradigm includes a training objective and a bidirectional inference procedure.
Experiments show the effectiveness of the pre-training paradigm, outperforming strong baselines.

Paper Content

Preliminaries

Notation is introduced to be used throughout the paper
Notation is used to denote prefixes and suffixes of a sequence of tokens
Dependence of models on learnable parameters is suppressed
Arrows are used to distinguish the two models and their outputs

The infilling task

The infilling task is given a sequence of tokens, an insertion position, and a length M
Task is to generate a plausible sequence of tokens to fill the gap
Finding the sequence that maximizes probability requires time exponential in M
Fill in the Middle (FIM) approach allows LM to use context from both sides
Advantages of FIM are that it can be applied to any pre-trained LM and is computationally efficient
Drawbacks of FIM are unnatural contexts and difficulty balancing influence of prefix and suffix
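
As a rough illustration of how FIM lets a left-to-right LM see both sides, the sketch below moves the span to infill to the end of the sequence, after the prefix and the suffix. The sentinel strings `<PRE>`, `<SUF>`, `<MID>` and the function name `to_fim_example` are placeholders chosen for this example, not tokens or names taken from the paper.

```python
def to_fim_example(tokens, start, length,
                   pre="<PRE>", suf="<SUF>", mid="<MID>"):
    """Rearrange tokens so the span to infill comes last.

    tokens: list of token strings
    start:  index where the gap begins
    length: number of tokens in the gap
    The sentinel names are hypothetical placeholders.
    """
    if start < 0 or length < 0 or start + length > len(tokens):
        raise ValueError("gap must lie inside the sequence")
    prefix = tokens[:start]
    middle = tokens[start:start + length]
    suffix = tokens[start + length:]
    # The model conditions on prefix and suffix, then predicts the middle
    # left to right as an ordinary continuation.
    context = [pre] + prefix + [suf] + suffix + [mid]
    return context, middle


context, target = to_fim_example(
    ["def", "add", "(", "a", ",", "b", ")", ":"], start=3, length=3
)
print(context)
print(target)
```

In the rearranged context the suffix is read before the text that precedes it in the document, which is one way to picture the "unnatural contexts" drawback listed above.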

Bidirectional language modeling

Bidirectional language modeling has been used to train non-autoregressive LMs
Autoregressive LMs produce better representations than non-autoregressive LMs
Our model remains autoregressive and future tokens are used to regularize the model
We do not attempt to produce a single probability for every token, instead there are two probabilities

Meet in the middle

Pre-training involves training two models to predict the next token using different views of the input.
The two models need to balance two goals: predicting the next token well and having the probability distributions assigned to the next token agree.

Pre-training

Two decoder-only language models are used, which share all parameters
The forward model predicts next tokens in the forward direction
The backward model predicts previous tokens in the backward direction
A regularizer is used to encourage the models to agree on their predictions
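
A minimal sketch of the objective described above, for a single position: the negative log-likelihood of the true token under each model, plus a weighted disagreement term. Total variation distance is used here as the disagreement measure, and `agreement_weight` with its default value is a hypothetical hyperparameter; both are assumptions of this example rather than details stated in this summary.

```python
import math


def total_variation(p, q):
    """Half the L1 distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def mim_position_loss(p_forward, p_backward, target, agreement_weight=0.1):
    """Loss at one position of the sequence.

    p_forward:  distribution over the token from the forward model (given the prefix)
    p_backward: distribution over the same token from the backward model (given the suffix)
    target:     vocabulary index of the true token
    agreement_weight: assumed name and value, chosen only for illustration
    """
    nll_forward = -math.log(p_forward[target])
    nll_backward = -math.log(p_backward[target])
    disagreement = total_variation(p_forward, p_backward)
    return nll_forward + nll_backward + agreement_weight * disagreement


p_fwd = [0.7, 0.2, 0.1]
p_bwd = [0.5, 0.3, 0.2]
print(mim_position_loss(p_fwd, p_bwd, target=0))
```

The disagreement term is zero only when the two distributions coincide, so lowering it pushes the models toward agreement without replacing either likelihood term.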

Infilling

Goal is to have an efficient and low-latency generation procedure
Naive procedure would generate from two models until each meets a condition
Proposed procedure interleaves generation and scoring and can terminate quickly
Models start building a completion from their own side and try to meet in the middle
For each generated token from one model, check if it is in the generated tokens from the other
If the two models produce the same sequence, they have met in the middle
Agreement regularizer helps if the two models produce different sequences
Use n-grams instead of single token to reduce false positives
Parallel verification procedure adapted from [GXS+22]
If no partial match, return to autoregressive generation
Optional modifications trade compatibility with existing autoregressive LMs for a more powerful attention mechanism
These modifications improve infilling metrics
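
The interleaved procedure above can be sketched with two token streams standing in for the forward and backward models. This is a simplified illustration only: the function names, the `max_steps` cap, and the use of precomputed streams instead of real models are assumptions of the example, and the parallel verification step is left out.

```python
def find_ngram(sequence, ngram):
    """Return the start index of the first occurrence of ngram in sequence, or -1."""
    n = len(ngram)
    for i in range(len(sequence) - n + 1):
        if sequence[i:i + n] == ngram:
            return i
    return -1


def meet_in_the_middle(forward_stream, backward_stream, n=2, max_steps=50):
    """Interleave two token streams and join them at the first shared n-gram.

    forward_stream yields the completion left to right;
    backward_stream yields it right to left.
    Returns the joined completion, or None if the two sides never meet.
    """
    fwd, bwd = [], []  # bwd is stored in left-to-right order
    fwd_iter, bwd_iter = iter(forward_stream), iter(backward_stream)
    for _ in range(max_steps):
        tok = next(fwd_iter, None)
        if tok is not None:
            fwd.append(tok)
            if len(fwd) >= n:
                # Does the newest forward n-gram already appear on the backward side?
                j = find_ngram(bwd, fwd[-n:])
                if j != -1:
                    return fwd + bwd[j + n:]
        tok = next(bwd_iter, None)
        if tok is not None:
            bwd.insert(0, tok)
            if len(bwd) >= n:
                # Does the newest backward n-gram already appear on the forward side?
                i = find_ngram(fwd, bwd[:n])
                if i != -1:
                    return fwd[:i] + bwd
    return None


completion = ["a", "+", "b", "*", "c", "-", "d"]
joined = meet_in_the_middle(completion, reversed(completion), n=2)
print(joined)
```

In this toy run the two sides meet once the forward stream has produced five tokens and the backward stream four, instead of one side producing all seven. A `None` result corresponds to the no-match case, where the procedure falls back to ordinary autoregressive generation.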

Experiments

Pre-training experiments conducted
Evaluation setup described
Main results presented
Ablation studies conducted

Data and models

Pre-trained models on a large and diverse corpus of public code
Python, Java, C++ are the dominant languages
Corpus contains 300 billion tokens
Six times larger than the pre-training dataset used in the original InCoder model
Also pre-trained on natural language datasets
Model sizes of 350M, 1.3B and 2.7B parameters
Baseline models pre-trained with FIM

Benchmarks and metrics

FIM implementation outperforms InCoder models on all metrics and datasets
FIM-2.7B surpasses other strong baselines
MIM consistently outperforms FIM across all metrics and datasets
MIM pre-training receives denser supervision from the agreement regularizer
MIM pre-training leads to high quality left-to-right generative model
MIM consistently outperforms FIM in infilling setting
MIM consistently outperforms FIM in natural language datasets

Ablation study

Ablation study conducted to assess effect of optional enhancements
Comparing autoregressive MIM model with Synchronous Bidirectional Attention layer
Perplexity used to select value of λ
Bidirectional context models outperform unidirectional context models
MIM inference faster than FIM baselines

Related work

Bidirectional language modeling has been studied extensively
Early models such as BERT use masked language modeling, while XLNet uses permutation language modeling to maximize likelihood of training sequences
Representation learning and in-context learning can be difficult
Two works train neural models using similar ideas
One work regularizes forward and backward RNNs by requiring representations to be close in Euclidean distance
Another work encourages agreement in probability space
Our approach focuses on achieving better infilling accuracy than FIM while also reducing latency

Conclusion

Addressed two challenges faced by large LMs: pre-training data efficiency and use of context on both sides for infilling
Proposed “Meet in the Middle” method
Uses both forward and backward LMs
Parameters shared between LMs
Inference procedure reduces latency by up to 50%
Standard transformer-based autoregressive language models used
Multi Query Attention used
Bidirectional context used during pre-training
Adam optimizer used
Training done in two stages
Pre-training data statistics given
In-domain and out-of-domain perplexity results given
Masked tokens and spans of tokens used
