Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- It is possible to expand pretrained Masked Language Models (MLMs) to new languages by learning a new set of embeddings.
- Training the new embeddings requires a full forward and backward pass over the entire model.
- Mini-model adaptation is a compute-efficient alternative that builds a shallow mini-model from a fraction of a large model’s parameters.
- Experiments show that mini-model adaptation matches the performance of the standard approach using up to 2.4x less compute.
Paper Content
Introduction
- Recent work on multilingual NLP has focused on pretraining language models on unlabeled corpora in multiple languages
- Models can then be finetuned using labeled downstream data in a single language and zero-shot transferred to the rest of the languages
- Existing models rarely cover more than a few dozen languages
- A recent line of work has explored pretraining an initial model in a few languages and expanding it to new languages posthoc
- Artetxe et al. (2020) showed that it is possible to expand an English masked language model by freezing the transformer body and learning a new embedding layer
- Improved results reported by using a better initialization scheme or learning additional language-specific parameters through adapters
- Proposed approach is both parameter- and compute-efficient
- Mini-models are shallow models aligned with a larger parent model
- Two approaches to learn mini-models: MINIJOINT and MINIPOST
- Evaluated on natural language inference, question answering and paraphrase identification
- Mini-model adaptation can match the performance of the standard method using less compute
Mini-model adaptation
- Proposed approach follows a four-step training paradigm
- Step 1: learn two aligned models (primary model and shallow mini-model)
- Step 2: learn L trg embeddings over mini-model
- Step 3 and 4: run as usual over primary model
Experimental settings
- Source language is English, target languages are 14
- Training corpus is CC-100
- Preprocessing uses SentencePiece with a vocabulary size of 50,000
- Models use RoBERTa BASE architecture
- 4 systems compared in experiments
- Evaluated on 3 tasks: XNLI, MLQA, PAWS-X
- Finetuning uses different learning rates for each task
- FLOPs and GPU training days estimated analytically
- BASELARGE has highest performance but highest cost
- MINIJOINT retains 98.7% of BASELARGE’s performance at 37.2% of its cost
Main results
Performance at training completion
- MINIPOST retains 99.3% of BASELARGE’s performance at nearly half of its cost.
- Mini-models do not need to be trained from scratch, but can be built post-hoc.
- BASESMALL performs substantially worse than MINIJOINT.
- Having two aligned models-a shallow one for efficient adaptation and a deep one for best performance at test time-is critical.
Gpu days to near-maximal performance
- Early stopping can be used to compare different approaches in terms of efficiency
- MINIJOINT requires less than half of the compute of BASELARGE to achieve the same performance in all tasks
- MINIPOST is substantially faster than standard adaptation to hit the desired performance
- There is a considerable variance across tasks and languages
Analysis
Training curves
- MINIJOINT is the fastest system but has a slightly lower final score
- BASELARGE is the slowest system and approaches peak performance
- BASESMALL gets stuck at a poor performance
- All methods adapt rapidly in PAWS-X, suggesting it is an easier task
- BASESMALL never achieves near-maximal performance, except for Turkish and Urdu
Mini-model depth
- MINIJOINT has 4 layers, MINIPOST has 6
- Experiments to study effect of mini-model depth on efficiency and performance
- Experiments done on Arabic, German and Turkish
- Shallow attachment of secondary head leads to more rapid adaptation, but with cost to final performance
- Optimal depth of mini-model is language-dependent
English performance
- MINI-POST and MINI-JOINT are two computer science models.
- MINI-JOINT jointly pretrains a primary model and its aligned mini-model.
- Comparing MINI-JOINT and BASELARGE/BASESMALL shows that dual-head training does not damage performance.
Variance across languages
- MINIJOINT takes more than 7 V100 days to achieve near-maximal performance on XNLI for Hindi, Turkish and Urdu.
- German, Spanish and French are the closest languages to English and generally obtain the fastest adaptation times.
- Swahili has the smallest training corpus but exceeds the other 3 languages on both adaptation speed and raw performance on XNLI.
Related work
- Training a language model from scratch requires a lot of data
- Multilingual models can be pretrained on unlabeled data from many languages and then finetuned on labeled data
- Multilingual models are large and expensive to train and can suffer from the curse of multilinguality
- Not all languages are created equal in multilingual models
- Adapters are a parameter-efficient way to extend language models
- Vocabulary adaptation methods reduce the need to extensively finetune a model
- Variation between languages is observed in LM adaptation
Conclusion and future work
- Possible to extend pretrained models to new languages using only a fraction of their parameters
- Two approaches to learn mini-models: MINIJOINT and MINIPOST
- Shallower mini-models converge faster but plateau at lower performance
- Explore combining multiple mini-models of different sizes
- Explore other potential applications of mini-model adaptation
- Limited to adaptation of MLMs to new languages
- Variance across languages observed
- Averaged results over 5 finetuning runs