
  • It is possible to expand pretrained Masked Language Models (MLMs) to new languages by learning a new set of embeddings.
  • Training the new embeddings requires a full forward and backward pass over the entire model.
  • Mini-model adaptation is a compute-efficient alternative that builds a shallow mini-model from a fraction of a large model’s parameters.
  • Experiments show that mini-model adaptation matches the performance of the standard approach using up to 2.4x less compute.
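
The reason a frozen body still costs a full backward pass follows from the chain rule. The sketch below uses notation that is not taken from the paper: E is the new embedding matrix, h_0 is the embedding output, h_i is the output of layer i in an N-layer body, and L is the MLM loss. It also assumes, for simplicity, that the input embeddings are not tied to the output layer.

```latex
\frac{\partial \mathcal{L}}{\partial E}
  = \frac{\partial \mathcal{L}}{\partial h_N}
    \cdot \frac{\partial h_N}{\partial h_{N-1}}
    \cdots
    \frac{\partial h_1}{\partial h_0}
    \cdot \frac{\partial h_0}{\partial E}
```

Every layer contributes a Jacobian factor even though its own weights receive no update. A mini-model with k < N layers shortens the chain to k factors, which is where the compute saving comes from.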

Paper Content


  • Recent work on multilingual NLP has focused on pretraining language models on unlabeled corpora in multiple languages
  • Models can then be finetuned using labeled downstream data in a single language and zero-shot transferred to the rest of the languages
  • Existing models rarely cover more than a few dozen languages
  • A recent line of work has explored pretraining an initial model in a few languages and expanding it to new languages posthoc
  • Artetxe et al. (2020) showed that it is possible to expand an English masked language model by freezing the transformer body and learning a new embedding layer
  • Later work reported improved results by using a better initialization scheme or learning additional language-specific parameters through adapters
  • Proposed approach is both parameter- and compute-efficient
  • Mini-models are shallow models aligned with a larger parent model
  • Two approaches to learn mini-models: MINIJOINT and MINIPOST
  • Evaluated on natural language inference, question answering and paraphrase identification
  • Mini-model adaptation can match the performance of the standard method using less compute

Mini-model adaptation

  • Proposed approach follows a four-step training paradigm
  • Step 1: learn two aligned models (primary model and shallow mini-model)
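  • Step 2: learn target-language (L_trg) embeddings over the mini-model
  • Steps 3 and 4: run as usual over the primary model (finetune on source-language task data, then zero-shot transfer with the target-language embeddings)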
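
The four steps can be sketched as a plan of which parameter groups are trained and how many layers each step runs through. This is a minimal illustration, not the authors' implementation: the `Step` dataclass, the parameter group names, and the default depths (a 12-layer primary model whose bottom 4 layers form the mini-model) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    layers_used: int   # transformer layers in the forward/backward pass
    trainable: tuple   # parameter groups updated in this step (hypothetical names)
    embeddings: str    # which embedding table is plugged in: "src" or "trg"


def mini_model_plan(primary_layers=12, mini_layers=4):
    """Return the four-step plan; the depths are illustrative defaults."""
    return [
        # Step 1: pretrain the primary model and its aligned mini-model in L_src.
        Step("pretrain", primary_layers, ("src_embeddings", "body", "heads"), "src"),
        # Step 2: learn L_trg embeddings over the shallow mini-model, body frozen.
        Step("adapt_embeddings", mini_layers, ("trg_embeddings",), "trg"),
        # Step 3: finetune the primary model on labeled L_src task data.
        Step("finetune", primary_layers, ("body", "task_head"), "src"),
        # Step 4: zero-shot transfer: swap in the L_trg embeddings, no training.
        Step("zero_shot", primary_layers, (), "trg"),
    ]


plan = mini_model_plan()
for step in plan:
    trained = ", ".join(step.trainable) or "nothing"
    print(f"{step.name}: layers={step.layers_used}, trains={trained}")
```

Only step 2 touches the target language during training, and it is the only step that runs through the shallow model, so the adaptation cost per new language is confined to that step.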

Experimental settings

  • Source language is English; there are 14 target languages
  • Training corpus is CC-100
  • Preprocessing uses SentencePiece with a vocabulary size of 50,000
  • Models use RoBERTa BASE architecture
  • 4 systems compared in experiments: BASELARGE, BASESMALL, MINIJOINT and MINIPOST
  • Evaluated on 3 tasks: XNLI, MLQA, PAWS-X
  • Finetuning uses different learning rates for each task
  • FLOPs and GPU training days estimated analytically
  • BASELARGE has highest performance but highest cost
  • MINIJOINT retains 98.7% of BASELARGE’s performance at 37.2% of its cost
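
The link between a relative cost and an "x times less compute" factor is simple arithmetic, sketched below. Only the 98.7% and 37.2% figures come from the summary above. The function name is hypothetical, and the depth-only proxy is a rough assumption that ignores embedding and output-layer cost; it is not the paper's analytic FLOPs estimate.

```python
def speedup_from_relative_cost(relative_cost):
    """Convert a cost fraction (e.g. 0.372) into an 'x times less compute' factor."""
    if not 0 < relative_cost <= 1:
        raise ValueError("relative_cost must be in (0, 1]")
    return 1.0 / relative_cost


# Figures reported above for MINIJOINT relative to BASELARGE.
minijoint_cost = 0.372
minijoint_retained = 0.987

speedup = speedup_from_relative_cost(minijoint_cost)
print(f"compute reduction implied by {minijoint_cost:.1%} cost: {speedup:.2f}x")
print(f"performance retained: {minijoint_retained:.1%}")

# Rough proxy (assumption): if cost scaled only with depth, a 4-layer
# mini-model under a 12-layer primary model would cost 4/12 of full adaptation.
depth_proxy = 4 / 12
print(f"depth-only proxy for relative cost: {depth_proxy:.1%}")
```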

Main results

Performance at training completion

  • MINIPOST retains 99.3% of BASELARGE’s performance at nearly half of its cost.
  • Mini-models do not need to be trained from scratch, but can be built post-hoc.
  • BASESMALL performs substantially worse than MINIJOINT.
  • Having two aligned models (a shallow one for efficient adaptation and a deep one for best performance at test time) is critical.

GPU days to near-maximal performance

  • Early stopping can be used to compare different approaches in terms of efficiency
  • MINIJOINT requires less than half of the compute of BASELARGE to achieve the same performance in all tasks
  • MINIPOST is substantially faster than standard adaptation to hit the desired performance
  • There is a considerable variance across tasks and languages
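
As an illustration of this efficiency metric, the sketch below scans a training curve and returns the first checkpoint whose score reaches a target. The curve values, the function name `gpu_days_to_target`, and the 95% threshold are all hypothetical, chosen only to show the mechanics; they are not results or settings from the paper.

```python
def gpu_days_to_target(curve, target_score):
    """Return the GPU days of the first checkpoint reaching target_score.

    curve: list of (gpu_days, score) pairs sorted by gpu_days.
    Returns None if the target is never reached (the system plateaus below it).
    """
    for gpu_days, score in curve:
        if score >= target_score:
            return gpu_days
    return None


# Hypothetical curves: (GPU days, downstream accuracy).
reference_final = 80.0
target = 0.95 * reference_final  # assumed definition of "near-maximal"

fast_curve = [(0.25, 60.0), (0.5, 72.0), (1.0, 76.5), (2.0, 78.0)]
slow_curve = [(0.5, 55.0), (1.0, 68.0), (2.0, 75.0), (4.0, 80.0)]
stuck_curve = [(0.25, 58.0), (0.5, 64.0), (1.0, 66.0), (2.0, 66.5)]

print(gpu_days_to_target(fast_curve, target))   # first checkpoint at or above target
print(gpu_days_to_target(slow_curve, target))
print(gpu_days_to_target(stuck_curve, target))  # None: never reaches the target
```

Returning `None` rather than the last checkpoint keeps a system that plateaus early from being mistaken for one that converged.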


Training curves

  • MINIJOINT is the fastest system but has a slightly lower final score
  • BASELARGE is the slowest system but eventually approaches peak performance
  • BASESMALL gets stuck at a poor performance
  • All methods adapt rapidly in PAWS-X, suggesting it is an easier task
  • BASESMALL never achieves near-maximal performance, except for Turkish and Urdu

Mini-model depth

  • The mini-model has 4 layers in MINIJOINT and 6 in MINIPOST
  • Further experiments study the effect of mini-model depth on efficiency and performance
  • These experiments cover Arabic, German and Turkish
  • Attaching the secondary head at a shallower layer leads to faster adaptation, at a cost to final performance
  • Optimal depth of mini-model is language-dependent

English performance

  • MINIPOST and MINIJOINT are the two mini-model variants.
  • MINIJOINT jointly pretrains a primary model and its aligned mini-model.
  • Comparing MINIJOINT with BASELARGE and BASESMALL shows that dual-head training does not damage English performance.

Variance across languages

  • MINIJOINT takes more than 7 V100 days to achieve near-maximal performance on XNLI for Hindi, Turkish and Urdu.
  • German, Spanish and French are the closest languages to English and generally obtain the fastest adaptation times.
  • Swahili has the smallest training corpus but exceeds the other 3 languages on both adaptation speed and raw performance on XNLI.

Related work

  • Training a language model from scratch requires a lot of data
  • Multilingual models can be pretrained on unlabeled data from many languages and then finetuned on labeled data
  • Multilingual models are large and expensive to train and can suffer from the curse of multilinguality
  • Not all languages are created equal in multilingual models
  • Adapters are a parameter-efficient way to extend language models
  • Vocabulary adaptation methods reduce the need to extensively finetune a model
  • Variation between languages is observed in LM adaptation

Conclusion and future work

  • Possible to extend pretrained models to new languages using only a fraction of their parameters
  • Two approaches to learn mini-models: MINIJOINT and MINIPOST
  • Shallower mini-models converge faster but plateau at lower performance
  • Explore combining multiple mini-models of different sizes
  • Explore other potential applications of mini-model adaptation

Limitations

  • Limited to adaptation of MLMs to new languages
  • Variance across languages observed
  • Averaged results over 5 finetuning runs