Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

It is possible to expand pretrained Masked Language Models (MLMs) to new languages by learning a new set of embeddings.
Training the new embeddings requires a full forward and backward pass over the entire model.
Mini-model adaptation is a compute-efficient alternative that builds a shallow mini-model from a fraction of a large model’s parameters.
Experiments show that mini-model adaptation matches the performance of the standard approach using up to 2.4x less compute.

Paper Content

Introduction

Recent work on multilingual NLP has focused on pretraining language models on unlabeled corpora in multiple languages
Models can then be finetuned using labeled downstream data in a single language and zero-shot transferred to the rest of the languages
Existing models rarely cover more than a few dozen languages
A recent line of work has explored pretraining an initial model in a few languages and expanding it to new languages posthoc
Artetxe et al. (2020) showed that it is possible to expand an English masked language model by freezing the transformer body and learning a new embedding layer
Subsequent work reported improved results by using a better initialization scheme or by learning additional language-specific parameters through adapters
Proposed approach is both parameter- and compute-efficient
Mini-models are shallow models aligned with a larger parent model
Two approaches to learn mini-models: MINIJOINT and MINIPOST
Evaluated on natural language inference, question answering and paraphrase identification
Mini-model adaptation can match the performance of the standard method using less compute

Mini-model adaptation

Proposed approach follows a four-step training paradigm
Step 1: learn two aligned models (primary model and shallow mini-model)
Step 2: learn L_trg (target language) embeddings over the mini-model
Steps 3 and 4: run as usual over the primary model
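The steps above can be pictured with a minimal structural sketch. It assumes the mini-model reuses the bottom layers of the primary model and adds its own head, which is consistent with the "secondary head" description later in this summary. The class names, the `build_aligned_models` and `adapt_embeddings` helpers, and the language code `de` are hypothetical and for illustration only; no actual training happens here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Layer:
    name: str


@dataclass
class Model:
    embeddings: str      # label of the embedding table currently plugged in
    layers: List[Layer]  # transformer body
    head: str            # label of the MLM head


def build_aligned_models(num_layers: int, mini_depth: int) -> Tuple[Model, Model]:
    """Step 1 (sketch): a primary model and a mini-model sharing the bottom layers."""
    body = [Layer(f"layer_{i}") for i in range(num_layers)]
    primary = Model("emb_src", body, "primary_head")
    # Assumption: the mini-model reuses the same layer objects, not copies.
    mini = Model("emb_src", body[:mini_depth], "mini_head")
    return primary, mini


def adapt_embeddings(mini: Model, target_lang: str) -> Dict[str, object]:
    """Step 2 (sketch): only new embeddings are learned; the mini-model body is frozen."""
    return {
        "new_embeddings": f"emb_{target_lang}",
        "frozen_layers": [layer.name for layer in mini.layers],
    }


primary, mini = build_aligned_models(num_layers=12, mini_depth=4)
result = adapt_embeddings(mini, "de")

# Steps 3 and 4 use the primary model; the embeddings learned over the
# mini-model are plugged into it.
primary.embeddings = result["new_embeddings"]
print(primary.embeddings, len(result["frozen_layers"]), len(primary.layers))
```

The point of the sketch is that step 2 only touches `mini_depth` layers, while the full-depth primary model is still the one used afterwards.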

Experimental settings

Source language is English; there are 14 target languages
Training corpus is CC-100
Preprocessing uses SentencePiece with a vocabulary size of 50,000
Models use RoBERTa BASE architecture
4 systems compared in experiments
Evaluated on 3 tasks: XNLI, MLQA, PAWS-X
Finetuning uses different learning rates for each task
FLOPs and GPU training days estimated analytically
BASELARGE has highest performance but highest cost
MINIJOINT retains 98.7% of BASELARGE’s performance at 37.2% of its cost
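The cost and performance percentages above can be turned into ratios with a small worked example. The function names are hypothetical; the only inputs taken from the text are the 98.7% retained performance and the 37.2% relative cost for MINIJOINT, both measured against BASELARGE.

```python
def cost_reduction(relative_cost: float) -> float:
    """How many times cheaper a system is, given its cost as a fraction of the baseline."""
    return 1.0 / relative_cost


def performance_per_cost(retained: float, relative_cost: float) -> float:
    """Retained performance per unit of relative cost (baseline scores 1.0)."""
    return retained / relative_cost


minijoint_reduction = cost_reduction(0.372)
minijoint_efficiency = performance_per_cost(0.987, 0.372)

print(f"MINIJOINT is about {minijoint_reduction:.2f}x cheaper than BASELARGE")
print(f"Performance per unit cost relative to BASELARGE: {minijoint_efficiency:.2f}")
```

This is plain arithmetic on the two reported percentages, not a reproduction of how the paper estimates cost.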

Main results

Performance at training completion

MINIPOST retains 99.3% of BASELARGE’s performance at nearly half of its cost.
Mini-models do not need to be trained from scratch, but can be built post-hoc.
BASESMALL performs substantially worse than MINIJOINT.
Having two aligned models (a shallow one for efficient adaptation and a deep one for best performance at test time) is critical.

GPU days to near-maximal performance

Early stopping can be used to compare different approaches in terms of efficiency
MINIJOINT requires less than half of the compute of BASELARGE to achieve the same performance in all tasks
MINIPOST is substantially faster than standard adaptation to hit the desired performance
There is a considerable variance across tasks and languages
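One way to make "GPU days to near-maximal performance" concrete is to scan a training curve for the first checkpoint that reaches a target score. The sketch below is a hedged illustration: the `gpu_days_to_target` helper, the 95% threshold, and the checkpoint values are all assumptions made for the example and are not taken from the paper.

```python
from typing import List, Optional, Tuple


def gpu_days_to_target(curve: List[Tuple[float, float]], target: float) -> Optional[float]:
    """Return the GPU days of the first checkpoint whose score reaches `target`.

    `curve` holds (gpu_days, score) pairs; returns None if the target is never reached.
    """
    for gpu_days, score in sorted(curve):
        if score >= target:
            return gpu_days
    return None


# Hypothetical checkpoints (GPU days, score); not results from the paper.
slow_curve = [(1.0, 60.0), (2.0, 68.0), (4.0, 72.0), (8.0, 74.0)]
fast_curve = [(0.5, 62.0), (1.0, 69.0), (2.0, 71.5), (3.0, 72.0)]

# Assumed definition of "near-maximal": 95% of the slow system's best score.
target = 0.95 * max(score for _, score in slow_curve)

print(gpu_days_to_target(slow_curve, target))
print(gpu_days_to_target(fast_curve, target))
```

Comparing systems at a shared target score, rather than at the end of training, is what lets a faster system with a slightly lower final score still come out ahead on efficiency.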

Analysis

Training curves

MINIJOINT is the fastest system but has a slightly lower final score
BASELARGE is the slowest system and approaches peak performance
BASESMALL gets stuck at a poor performance
All methods adapt rapidly in PAWS-X, suggesting it is an easier task
BASESMALL never achieves near-maximal performance, except for Turkish and Urdu

Mini-model depth

MINIJOINT has 4 layers, MINIPOST has 6
Experiments to study effect of mini-model depth on efficiency and performance
Experiments done on Arabic, German and Turkish
Shallow attachment of secondary head leads to more rapid adaptation, but with cost to final performance
Optimal depth of mini-model is language-dependent
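The link between depth and cost can be sketched with a deliberately crude model. It assumes every transformer layer costs the same and ignores embeddings and output heads, so it is only a back-of-the-envelope estimate and not the analytical FLOPs formula used in the paper. The 12-layer default corresponds to the RoBERTa BASE architecture; the `relative_body_cost` function is hypothetical.

```python
def relative_body_cost(mini_depth: int, full_depth: int = 12) -> float:
    """Fraction of transformer-body compute used by a mini-model of `mini_depth` layers.

    Assumes equal per-layer cost and ignores embeddings and heads.
    """
    if not 0 < mini_depth <= full_depth:
        raise ValueError("mini_depth must be between 1 and full_depth")
    return mini_depth / full_depth


minijoint_fraction = relative_body_cost(4)  # MINIJOINT: 4 layers
minipost_fraction = relative_body_cost(6)   # MINIPOST: 6 layers

print(f"MINIJOINT body cost: {minijoint_fraction:.1%} of the full model")
print(f"MINIPOST body cost: {minipost_fraction:.1%} of the full model")
```

Under these assumptions the estimates land in the same neighborhood as the costs reported in this summary (37.2% for MINIJOINT and nearly half for MINIPOST); the remaining gap would come from the parts this model leaves out.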

English performance

MINIPOST and MINIJOINT are the two approaches to learning mini-models.
MINIJOINT jointly pretrains a primary model and its aligned mini-model.
Comparing MINIJOINT with BASELARGE and BASESMALL shows that dual-head training does not damage performance.

Variance across languages

MINIJOINT takes more than 7 V100 days to achieve near-maximal performance on XNLI for Hindi, Turkish and Urdu.
German, Spanish and French are the closest languages to English and generally obtain the fastest adaptation times.
Swahili has the smallest training corpus but exceeds the other 3 languages on both adaptation speed and raw performance on XNLI.

Related work

Training a language model from scratch requires a lot of data
Multilingual models can be pretrained on unlabeled data from many languages and then finetuned on labeled data
Multilingual models are large and expensive to train and can suffer from the curse of multilinguality
Not all languages are created equal in multilingual models
Adapters are a parameter-efficient way to extend language models
Vocabulary adaptation methods reduce the need to extensively finetune a model
Variation between languages is observed in LM adaptation

Conclusion and future work

Possible to extend pretrained models to new languages using only a fraction of their parameters
Two approaches to learn mini-models: MINIJOINT and MINIPOST
Shallower mini-models converge faster but plateau at lower performance
Explore combining multiple mini-models of different sizes
Explore other potential applications of mini-model adaptation
Limited to adaptation of MLMs to new languages
Variance across languages observed
Averaged results over 5 finetuning runs
