Abstract

  • BLOOM is a multilingual language model capable of zero-shot learning
  • Previous works have only explored adapting small language models
  • Language adaptation can improve zero-shot performance in new languages
  • Adapter-based finetuning is more effective than continued pretraining for large models
  • Prompting performance is primarily determined by the size of the language adaptation data
  • Including a new language in the multitask fine-tuning mixture is the most effective method to teach BLOOMZ a new language
  • Language adaptation can generalize well to diverse languages with sufficient training data

Paper Content

Introduction

  • Current multilingual language models have limited coverage of languages
  • BLOOM (Scao et al., 2022) covers 46 languages, but excludes high-resource languages such as Korean and Russian
  • Limited availability of unlabeled text and the consideration of the curse of multilinguality are reasons for limited coverage
  • Studying language adaptation to new languages is important
  • Previous work has investigated different language adaptation strategies
  • Little work has explored the effects of language adaptation on prompting
  • This work focuses on language adaptation of BLOOM models to 8 new languages
  • Finetuning adapters is recommended for BLOOM models with at least 3 billion parameters to obtain better prompting performance
  • Monolingual language adaptation improves prompting performance of BLOOM
  • Language adaptation enables pretrained language models to support languages outside of their pretraining data
  • Language adaptation approaches can be categorized into three categories: continued pretraining, training of language-specific adapters, and training of a sparse subset of model parameters
  • Multilingual prompting reformulates NLP tasks into masked or generative language modeling problems
  • Finetuning XLM-R on cloze-style prompts yields better performance than standard finetuning under a low-resource regime
  • Multitask prompt-based training on a variety of tasks and English or translated prompts improves zero-shot cross-lingual and cross-task performance
  • Multilingual prompt-based learning can be achieved without performing gradient updates for downstream tasks

Bloom pretrained models

  • BLOOM language model has a decoder-only Transformer architecture
  • Uses ALiBi positional embeddings and layer normalization after embedding layers
  • Tokenizer is trained with byte-level Byte Pair Encoding algorithm
  • Vocabulary size of 250,680
  • Pretrained for around 350 billion tokens on the ROOTS corpus
  • ROOTS corpus covers 46 natural languages and 13 programming languages
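
As a rough illustration of the ALiBi positional scheme mentioned above, the sketch below computes per-head slopes and the linear bias that is added to attention scores. It follows the geometric slope schedule described in the ALiBi paper for a power-of-two number of heads. The function names and the tiny sizes are illustrative assumptions and are not taken from BLOOM's actual implementation.

```python
def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... for a power-of-two head count n."""
    if num_heads <= 0 or num_heads & (num_heads - 1) != 0:
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """bias[h][i][j] = -slope_h * (i - j) for keys j <= query i (causal attention)."""
    bias = []
    for slope in alibi_slopes(num_heads):
        head = []
        for i in range(seq_len):
            # Future positions (j > i) are masked out with -inf.
            row = [-slope * (i - j) if j <= i else float("-inf")
                   for j in range(seq_len)]
            head.append(row)
        bias.append(head)
    return bias
```

Because the penalty depends only on the distance between query and key, no learned position embeddings are needed, which is one reason the scheme is attractive for decoder-only models.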

New languages

  • 6 languages unsupported by BLOOM: German, Bulgarian, Russian, Greek, Turkish, Thai
  • Korean included to follow up on past work
  • Guarani included as a low-resource Native American language
  • Languages cover different language families and some do not share scripts with BLOOM’s supported languages

Language adaptation strategies

  • Use 3 language adaptation strategies to analyze their effects on zero-shot prompting
  • Continued pretraining strategy: train model with monolingual text of new language
  • MAD-X: use language adapter and invertible adapter to adapt BLOOM to new languages
  • (IA)³: parameter-efficient finetuning method to rescale inner Transformer block activations
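
To make the two parameter-efficient strategies above more concrete, here is a minimal pure-Python sketch of their core operations: an (IA)³-style element-wise rescaling of activations, and a bottleneck adapter with a residual connection of the kind used as a language adapter. The helper names, shapes, and values are assumptions for illustration only, and MAD-X's invertible adapters are deliberately left out.

```python
def ia3_rescale(activations, scale):
    """(IA)^3-style rescaling: element-wise product with a learned vector."""
    return [a * s for a, s in zip(activations, scale)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def bottleneck_adapter(hidden, w_down, w_up):
    """Down-project, apply ReLU, up-project, then add the residual."""
    z = [max(0.0, v) for v in matvec(w_down, hidden)]
    up = matvec(w_up, z)
    return [h + u for h, u in zip(hidden, up)]


hidden = [1.0, -1.0]

# A scale vector of ones leaves the frozen model's activations unchanged.
print(ia3_rescale(hidden, [1.0, 1.0]))

# w_down has shape (bottleneck, hidden); w_up has shape (hidden, bottleneck).
print(bottleneck_adapter(hidden, [[1.0, 0.0]], [[2.0], [3.0]]))
```

In both cases only the small added parameters (the scale vector, or the two projection matrices) would be trained, while the pretrained weights stay frozen.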

Language adaptation setting

  • Randomly sampled 100K samples from OSCAR subcorpora
  • Used Jojajovai parallel corpora for Guarani (30K sentences)
  • 25K language adaptation training steps, batch size 8, sequence length 1,024
  • No retraining of tokenizer, embedding layer adapted in two different fashions
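
The training budget above implies a simple upper bound on how many tokens the model processes during adaptation. The short calculation below assumes every sequence is filled to the full length, which may not hold in practice; the function name is hypothetical.

```python
def max_tokens_processed(steps, batch_size, seq_len):
    """Upper bound on tokens processed, assuming every sequence is full length."""
    return steps * batch_size * seq_len


budget = max_tokens_processed(steps=25_000, batch_size=8, seq_len=1_024)
print(f"{budget:,} tokens at most")
```

With the settings listed above this comes to 204,800,000 tokens, spread over 200,000 sequences.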

Tasks and prompt templates

  • Evaluated models on five multilingual NLU tasks
  • Covered natural language inference, commonsense reasoning, anaphora resolution, and paraphrasing
  • Zero-shot prompting without task-specific finetuning
  • Reused prompt templates from XGLM model
  • Translated prompt templates using automatic translation APIs
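
Zero-shot prompting of this kind is often implemented by filling a template once per candidate label and choosing the label whose prompt the model scores highest. The sketch below shows that selection loop. The template string and label words are an assumption modeled on an XGLM-style XNLI prompt, and `toy_score` is a hypothetical stand-in for a real language model's log-likelihood.

```python
TEMPLATE = "{premise}, right? {label}, {hypothesis}"  # assumed XGLM-style template
VERBALIZERS = {"entailment": "Yes", "neutral": "Also", "contradiction": "No"}


def zero_shot_classify(premise, hypothesis, verbalizers, template, score_fn):
    """Return the label whose filled-in prompt receives the highest score."""
    scores = {}
    for label, word in verbalizers.items():
        prompt = template.format(premise=premise, label=word, hypothesis=hypothesis)
        scores[label] = score_fn(prompt)
    return max(scores, key=scores.get), scores


def toy_score(prompt):
    # Hypothetical stand-in: a real setup would return the language model's
    # log-likelihood of the prompt. Here, shorter prompts simply score higher,
    # so the prediction is meaningless and only exercises the code path.
    return -len(prompt)


label, scores = zero_shot_classify(
    "It is raining", "The ground is wet", VERBALIZERS, TEMPLATE, toy_score
)
print(label, scores)
```

No gradient updates are involved: the model is only queried for scores, which is what makes the evaluation zero-shot.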

Baselines

  • Compared adapted BLOOM model to generative multilingual language models
  • Reported prompting performance of original BLOOM models
  • XGLM models cover 30 natural languages and come in five model sizes
  • mGPT is a 1.3B-parameter GPT model trained on 60 languages from 25 language families
  • BLOOMZ and mT0 are BLOOM and mT5 models finetuned on a multilingual task mixture
  • Performance reported on best prompts with instructions in English and context/label in non-English
  • XGLM, mGPT, and mT0 have seen all new languages except Guarani during model pretraining

Results and discussion

Zero-shot prompting performance

  • Language adaptation improves BLOOM’s zero-shot prompting for unseen languages
  • Performance gains correlate with model sizes
  • Continued pretraining yields the best prompting performance for smaller models, while adapters perform better for larger ones
  • BLOOM adapts well to new languages regardless of language family, word order, and script system
  • Adapted BLOOM matches mGPT’s performance in several XNLI tasks and outperforms XGLM and mT0 on German PAWS-X and Russian XWinograd tasks
  • Adapted BLOOM performs poorly on Guarani due to limited adaptation training data

Perplexity

  • Perplexity is a measure of uncertainty when predicting the next token in a sequence.
  • Lower perplexity means better language modeling ability.
  • Perplexity during language adaptation training does not necessarily correlate with prompting performance.
  • As model capacity increases, perplexity decreases but XWinograd performance drops.
  • Even though continued pretraining has lower perplexity, it underperforms for larger model sizes.
  • Perplexity does not always correlate with downstream task performance.
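
For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token natural-log probabilities; the function name and inputs are illustrative and do not reflect the paper's evaluation code.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among four options.
print(perplexity([math.log(0.25)] * 10))
```

Since the metric only reflects next-token prediction on held-out text, it can improve without a matching improvement in prompt-based task accuracy.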

Connection to language independent representation

  • Sentence retrieval accuracy is used to measure quality of language independent representation
  • For MAD-X adapted models, SR accuracy improves as model grows in parameters
  • For continued pretraining, best SR accuracy is achieved with smallest model
  • Need around 100 million tokens of new language for effective language adaptation
  • Low-resource setting has limited effect on certain tasks
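
Sentence retrieval accuracy, as used in this section, can be sketched as nearest-neighbor search over embeddings of parallel sentences: a source sentence counts as correct when its most similar target embedding is its own translation. How the embeddings are obtained from the model is not specified here, so the inputs below are plain vectors and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_retrieval_accuracy(src_embs, tgt_embs):
    """src_embs[i] and tgt_embs[i] are embeddings of a translation pair."""
    if not src_embs:
        raise ValueError("need at least one sentence pair")
    correct = 0
    for i, src in enumerate(src_embs):
        # Retrieve the target sentence most similar to this source sentence.
        best = max(range(len(tgt_embs)), key=lambda j: cosine(src, tgt_embs[j]))
        correct += int(best == i)
    return correct / len(src_embs)


src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.9, 0.1], [0.2, 0.8]]
print(sentence_retrieval_accuracy(src, tgt))
```

Higher accuracy suggests that translations land close together in representation space, which is the sense of "language independent" used above.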

Adapters’ capacity

  • Investigated the effect of adapter capacity
  • Varied the reduction factor in the adapter’s bottleneck layer
  • A smaller reduction factor leads to more adapter parameters
  • Positive correlation between adapter parameters and performance
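
The relationship between reduction factor and adapter size can be made concrete with a small parameter count. The sketch assumes a standard bottleneck adapter (one down-projection and one up-projection, each with a bias) whose bottleneck width is the hidden size divided by the reduction factor. The hidden size of 1024 is purely illustrative.

```python
def adapter_param_count(hidden_size, reduction_factor, use_bias=True):
    """Parameters in one bottleneck adapter: down-projection plus up-projection."""
    bottleneck = hidden_size // reduction_factor
    weights = 2 * hidden_size * bottleneck
    biases = (bottleneck + hidden_size) if use_bias else 0
    return weights + biases


for factor in (1, 16, 48):
    print(factor, adapter_param_count(1024, factor))
```

Halving the reduction factor roughly doubles the adapter's parameter count, so the factor acts as a direct capacity knob.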

Placement of adapters

  • Examined how adapter placement impacts performance
  • Kept single adapter at different layers of model
  • Last layers benefit most from language adaptation
  • Analyzed performance of MAD-X with and without invertible adapters
  • Invertible adapters only improve performance for German, Bulgarian, and Turkish
  • Applied language adaptation with continued pretraining and MAD-X language adapters to a randomly initialized BLOOM
  • Without pretraining, adapted BLOOM model behaves like a random classifier
  • Investigated language adaptation strategies for BLOOMZ
  • BLOOMZ loses prompting capability after language adaptation on free-form text of monolingual OSCAR corpora
  • Experimented with learning a new language during instruction tuning
  • Finetuning on only Russian without other languages and tasks in xP3 mixture shows tiny improvements
  • Adding new languages during multitask finetuning can be effective but requires additional diverse tasks
  • Did not explore vocabulary and embedding adaptation
  • Evaluated sentence retrieval accuracy for subset of languages used to train BLOOM model
  • Found batch size of 8 is optimal considering performance-compute trade-off
  • Experimented with parameter-efficient finetuning strategies for language adaptation
  • Among the parameter-efficient strategies, MAD-X language adapters yield the best prompting performance
  • Trained for 25,000 steps with batch size of 8 and sequence length of 1024
  • Evaluated every 5,000 steps on perplexity of 1,000 held-out validation samples
  • Performed hyperparameter search on learning rates, linear and cosine decay, and warm-up ratio
  • Found that different sets of hyperparameters changed XNLI accuracy by only around 1-2%
  • Reported number of tokens after preprocessing with BLOOM’s BPE tokenizer
  • Reported distribution of natural and programming languages in ROOTS pretraining data
  • Compared different language adaptation strategies for BLOOM models on number of trainable parameters, total training time, inference time per prompt on XNLI test set, and maximum GPU memory usage
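
The hyperparameter search mentioned above covers linear and cosine decay together with a warm-up ratio. The sketch below shows one common way such schedules are defined: linear warm-up from zero, followed by either linear or cosine decay to zero. The base learning rate of 1e-4 and the warm-up ratio of 0.1 are hypothetical values chosen for illustration, not the settings used in the paper.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_ratio, decay="linear"):
    """Learning rate with linear warm-up, then linear or cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    # Fraction of the post-warm-up phase that has elapsed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    if decay == "linear":
        return base_lr * (1.0 - progress)
    if decay == "cosine":
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown decay: {decay}")


for step in (0, 2_500, 13_750, 25_000):
    print(step, lr_at_step(step, 25_000, 1e-4, 0.1, decay="cosine"))
```

Both variants reach the same peak after warm-up and the same value at the end; they differ only in how quickly the rate falls in between.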