Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- BLOOM is a multilingual language model capable of zero-shot learning
- Previous works have only explored adapting small language models
- Language adaptation can improve zero-shot performance in new languages
- Adapter-based finetuning is more effective than continued pretraining for large models
- Prompting performance is determined by the size of the language adaptation data
- Including a new language in the multitask fine-tuning mixture is the most effective method to teach BLOOMZ a new language
- Language adaptation can generalize well to diverse languages with sufficient training data
Paper Content
Introduction
- Current multilingual language models have limited coverage of languages
- BLOOM (Scao et al., 2022) covers 46 languages, but excludes high-resource languages such as Korean and Russian
- Limited availability of unlabeled text and the consideration of the curse of multilinguality are reasons for limited coverage
- Studying language adaptation to new languages is important
- Previous work has investigated different language adaptation strategies
- Little work has explored the effects of language adaptation on prompting
- This work focuses on language adaptation of BLOOM models to 8 new languages
- Finetuning adapters recommended for BLOOM with at least 3 billion parameters for better prompting performance
- Monolingual language adaptation improves prompting performance of BLOOM
Related work
- Language adaptation enables pretrained language models to support languages outside of their pretraining data
- Language adaptation approaches can be categorized into three categories: continued pretraining, training of language-specific adapters, and training of a sparse subset of model parameters
- Multilingual prompting reformulates NLP tasks into masked or generative language modeling problem
- Finetuning XLM-R on clozestyle prompts yields better performance than standard finetuning under a low-resource regime
- Multitask prompt-based training on a variety of tasks and English or translated prompts improves zero-shot cross-lingual and cross-task performance
- Multilingual prompt-based learning can be achieved without performing gradient updates for downstream tasks
Bloom pretrained models
- BLOOM language model has a decoder-only Transformer architecture
- Uses AliBi positional embeddings and layer normalization after embedding layers
- Tokenizer is trained with byte-level Byte Pair Encoding algorithm
- Vocabulary size of 250,680
- Pretrained for around 350 billion tokens on the ROOTS corpus
- ROOTS corpus covers 46 natural languages and 13 programming languages
New languages
- 6 languages unsupported by BLOOM: German, Bulgarian, Russian, Greek, Turkish, Thai
- Korean included to follow up on past work
- Guarani included as a low-resource Native American language
- Languages cover different language families and some do not share scripts with BLOOM’s supported languages
Language adaptation strategies
- Use 3 language adaptation strategies to analyze their effects on zero-shot prompting
- Continued pretraining strategy: train model with monolingual text of new language
- MAD-X: use language adapter and invertible adapter to adapt BLOOM to new languages
- (IA) 3: parameter-efficient finetuning method to rescale inner Transformer block activations
Language adaptation setting
- Randomly sampled 100K samples from OSCAR subcorpora
- Used Jojajovai parallel corpora for Guarani (30K sentences)
- 25K language adaptation training steps, batch size 8, sequence length 1,024
- No retraining of tokenizer, embedding layer adapted in two different fashions
Tasks and prompt templates
- Evaluated models on five multilingual NLU tasks
- Covered natural language inference, commonsense reasoning, anaphora resolution, and paraphrasing
- Zero-shot prompting without task-specific finetuning
- Reused prompt templates from XGLM model
- Translated prompt templates using automatic translation APIs
Baselines
- Compared adapted BLOOM model to generative multilingual language models
- Reported prompting performance of original BLOOM models
- XGLM models cover 30 natural languages and come in 5 different numbers of parameters
- mGPT is a GPT model trained on 60 languages from 25 language families with 1.3B parameters
- BLOOMZ and mT0 are BLOOM and mT5 models finetuned on a multilingual task mixture
- Performance reported on best prompts with instructions in English and context/label in non-English
- XGLM, mGPT, and mT0 have seen all new languages except Guarani during model pretraining
Results and discussion
Zero-shot prompting performance
- Language adaptation improves BLOOM’s zero-shot prompting for unseen languages
- Performance gains correlate with model sizes
- Continued pretraining yields best prompting performance
- BLOOM adapts well to new languages regardless of language family, word order, and script system
- Adapted BLOOM matches mGPT’s performance in several XNLI tasks and outperforms XGLM and mT0 on German PAWS-X and Russian XWinograd tasks
- Adapted BLOOM performs poorly on Guarani due to limited adaptation training data
Perplexity
- Perplexity is a measure of uncertainty when predicting the next token in a sequence.
- Lower perplexity means better language modeling ability.
- Perplexity during language adaptation training does not necessarily correlate with prompting performance.
- As model capacity increases, perplexity decreases but XWinograd performance drops.
- Even though continued pretraining has lower perplexity, it underperforms for larger model sizes.
- Perplexity does not always correlate with downstream task performance.
Connection to language independent representation
- Sentence retrieval accuracy is used to measure quality of language independent representation
- For MAD-X adapted models, SR accuracy improves as model grows in parameters
- For continued pretraining, best SR accuracy is achieved with smallest model
- Need around 100 million tokens of new language for effective language adaptation
- Low-resource setting has limited effect on certain tasks
Adapters’ capacity
- Investigated effect of size of adapter’s capacity
- Varied reduction factor in adapter’s bottleneck layer
- Smaller reduction value leads to more adapter parameters
- Positive correlation between adapter parameters and performance
Placement of adapters
- Examined how adapter placement impacts performance
- Kept single adapter at different layers of model
- Last layers benefit most from language adaptation
- Analyzed performance of MAD-X with and without invertible adapters
- Invertible adapters only improve performance for German, Bulgarian, and Turkish
- Language adaptation with continued pretraining and MAD-X language adapters on randomly initialized BLOOM
- Without pretraining, adapted BLOOM model behaves like a random classifier
- Investigated language adaptation strategies for BLOOMZ
- BLOOMZ loses prompting capability after language adaptation on free-form text of monolingual OSCAR corpora
- Experimented with learning a new language during instruction tuning
- Finetuning on only Russian without other languages and tasks in xP3 mixture shows tiny improvements
- Adding new languages during multitask finetuning can be effective but requires additional diverse tasks
- Did not explore vocabulary and embedding adaptation
- Evaluated sentence retrieval accuracy for subset of languages used to train BLOOM model
- Found batch size of 8 is optimal considering performance-compute trade-off
- Experimented with parameter-efficient finetuning strategies for language adaptation
- MAD-X language adapters yield best prompting performance
- Trained for 25,000 steps with batch size of 8 and sequence length of 1024
- Evaluated every 5,000 steps on perplexity of 1,000 held-out validation samples
- Performed hyperparameter search on learning rates, linear and cosine decay, and warm-up ratio
- Found different sets of hyperparameters caused around 1-2% small difference in XNLI accuracy
- Reported number of tokens after preprocessed by BLOOM’s BPE tokenizer
- Reported distribution of natural and programming languages in ROOTS pretraining data
- Compared different language adaptation strategies for BLOOM models on number of trainable parameters, total training time, inference time per prompt on XNLI test set, and maximum GPU memory usage