Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

BLOOM is a multilingual language model capable of zero-shot learning
Previous works have only explored adapting small language models
Language adaptation can improve zero-shot performance in new languages
Adapter-based finetuning is more effective than continued pretraining for large models
Prompting performance is determined by the size of the language adaptation data
Including a new language in the multitask fine-tuning mixture is the most effective method to teach BLOOMZ a new language
Language adaptation can generalize well to diverse languages with sufficient training data

Paper Content

Introduction

Current multilingual language models have limited coverage of languages
BLOOM (Scao et al., 2022) covers 46 languages, but excludes high-resource languages such as Korean and Russian
Coverage is limited by the scarcity of unlabeled text and by the curse of multilinguality
Studying language adaptation to new languages is important
Previous work has investigated different language adaptation strategies
Little work has explored the effects of language adaptation on prompting
This work focuses on language adaptation of BLOOM models to 8 new languages
Training adapters is recommended for BLOOM models with at least 3 billion parameters for better prompting performance
Monolingual language adaptation improves prompting performance of BLOOM

Related work

Language adaptation enables pretrained language models to support languages outside of their pretraining data
Language adaptation approaches fall into three categories: continued pretraining, training of language-specific adapters, and training of a sparse subset of model parameters
Multilingual prompting reformulates NLP tasks into masked or generative language modeling problems
Finetuning XLM-R on cloze-style prompts yields better performance than standard finetuning under a low-resource regime
Multitask prompt-based training on a variety of tasks and English or translated prompts improves zero-shot cross-lingual and cross-task performance
Multilingual prompt-based learning can be achieved without performing gradient updates for downstream tasks

BLOOM pretrained models

BLOOM language model has a decoder-only Transformer architecture
Uses ALiBi positional embeddings and layer normalization after embedding layers
Tokenizer is trained with byte-level Byte Pair Encoding algorithm
Vocabulary size of 250,680
Pretrained for around 350 billion tokens on the ROOTS corpus
ROOTS corpus covers 46 natural languages and 13 programming languages
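
As a rough illustration of the ALiBi positional scheme mentioned above, the sketch below builds the per-head linear attention bias in plain Python. It follows the general ALiBi recipe (a geometric sequence of head slopes, valid as written when the number of heads is a power of two). The function names `alibi_slopes` and `alibi_bias` are illustrative assumptions and are not taken from the BLOOM codebase.

```python
def alibi_slopes(num_heads):
    """Head slopes 2^(-8*h/num_heads) for h = 1..num_heads.

    Assumes num_heads is a power of two (illustrative simplification).
    """
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Return bias[head][query][key], to be added to attention scores.

    Keys further in the past get a more negative bias. Future positions are
    set to 0.0 here because a causal mask would remove them anyway.
    """
    slopes = alibi_slopes(num_heads)
    return [
        [
            [m * (j - i) if j <= i else 0.0 for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for m in slopes
    ]


bias = alibi_bias(num_heads=8, seq_len=4)
for row in bias[0]:  # bias matrix of the first head
    print(row)
```

Because the bias depends only on the distance between query and key, no learned position table is needed.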

New languages

6 languages unsupported by BLOOM: German, Bulgarian, Russian, Greek, Turkish, Thai
Korean included to follow up on past work
Guarani included as a low-resource Native American language
Languages cover different language families and some do not share scripts with BLOOM’s supported languages

Language adaptation strategies

Use 3 language adaptation strategies to analyze their effects on zero-shot prompting
Continued pretraining strategy: train model with monolingual text of new language
MAD-X: use language adapter and invertible adapter to adapt BLOOM to new languages
(IA)³: parameter-efficient finetuning method to rescale inner Transformer block activations
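
To make the two parameter-efficient strategies above more concrete, here is a minimal sketch of their core operations on a single hidden vector. It assumes a standard bottleneck adapter with a residual connection (the building block of MAD-X language adapters) and an elementwise learned scaling vector for (IA)³. The invertible adapter is omitted, and all names and weights are made up for illustration rather than taken from the paper's code.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def bottleneck_adapter(h, W_down, W_up):
    """Down-project, apply ReLU, up-project, then add the residual."""
    z = [max(0.0, v) for v in matvec(W_down, h)]
    delta = matvec(W_up, z)
    return [hi + di for hi, di in zip(h, delta)]


def ia3_rescale(activation, scale):
    """Rescale an activation elementwise by a learned vector."""
    return [a * s for a, s in zip(activation, scale)]


h = [1.0, -2.0, 0.5, 3.0]                      # toy hidden state, size 4
W_down = [[0.5, 0.0, 0.0, 0.0],                # 4 -> 2 (bottleneck)
          [0.0, 0.0, 0.0, 0.5]]
W_up = [[1.0, 0.0], [0.0, 0.0],                # 2 -> 4
        [0.0, 0.0], [0.0, 1.0]]

print(bottleneck_adapter(h, W_down, W_up))
print(ia3_rescale(h, [1.0, 1.0, 1.0, 1.0]))    # all-ones scale leaves h unchanged
```

In both cases only the small added parameters (`W_down`, `W_up`, or `scale`) would be trained, while the pretrained model weights stay frozen.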

Language adaptation setting

Randomly sampled 100K samples from OSCAR subcorpora
Used Jojajovai parallel corpora for Guarani (30K sentences)
25K language adaptation training steps, batch size 8, sequence length 1,024
No retraining of tokenizer, embedding layer adapted in two different fashions
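
For a sense of scale, the training budget above can be turned into a token count. This is a back-of-the-envelope upper bound, assuming every sequence is filled to the full 1,024 tokens. The helper name `training_token_budget` is illustrative.

```python
def training_token_budget(steps, batch_size, seq_len):
    """Upper bound on token positions processed during adaptation."""
    sequences = steps * batch_size
    tokens = sequences * seq_len
    return sequences, tokens


sequences, tokens = training_token_budget(steps=25_000, batch_size=8, seq_len=1_024)
print(f"{sequences:,} sequences, at most {tokens:,} token positions")
```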

Tasks and prompt templates

Evaluated models on five multilingual NLU tasks
Covered natural language inference, commonsense reasoning, anaphora resolution, and paraphrasing
Zero-shot prompting without task-specific finetuning
Reused prompt templates from XGLM model
Translated prompt templates using automatic translation APIs
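
Zero-shot prompting for a classification task is commonly done by filling a template once per candidate label and picking the label whose filled prompt the language model scores highest. The sketch below shows that selection logic only. The template is an illustrative NLI-style example and not necessarily the exact one used in the paper, and `toy_score` is a hypothetical stand-in for a real model's log-likelihood.

```python
LABEL_WORDS = {"entailment": "Yes", "neutral": "Also", "contradiction": "No"}


def zero_shot_classify(premise, hypothesis, label_words, score_fn):
    """Fill the template for each label and return the best-scoring label."""
    candidates = {
        label: f"{premise}, right? {word}, {hypothesis}"
        for label, word in label_words.items()
    }
    scores = {label: score_fn(text) for label, text in candidates.items()}
    return max(scores, key=scores.get), scores


def toy_score(text):
    # Hypothetical placeholder: a real setup would sum the token
    # log-probabilities that the language model assigns to `text`.
    return 1.0 if " Yes, " in text else 0.0


prediction, scores = zero_shot_classify(
    "It is raining", "The ground is wet", LABEL_WORDS, toy_score
)
print(prediction, scores)
```

No gradient updates are involved: the model is only queried for scores, which is what makes the evaluation zero-shot.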

Baselines

Compared adapted BLOOM model to generative multilingual language models
Reported prompting performance of original BLOOM models
XGLM models cover 30 natural languages and come in five model sizes
mGPT is a GPT model trained on 60 languages from 25 language families with 1.3B parameters
BLOOMZ and mT0 are BLOOM and mT5 models finetuned on a multilingual task mixture
Performance reported on best prompts with instructions in English and context/label in non-English
XGLM, mGPT, and mT0 have seen all new languages except Guarani during model pretraining

Results and discussion

Zero-shot prompting performance

Language adaptation improves BLOOM’s zero-shot prompting for unseen languages
Performance gains correlate with model sizes
Continued pretraining yields the best prompting performance for smaller models, while adapter-based strategies work better for larger ones
BLOOM adapts well to new languages regardless of language family, word order, and script system
Adapted BLOOM matches mGPT’s performance in several XNLI tasks and outperforms XGLM and mT0 on German PAWS-X and Russian XWinograd tasks
Adapted BLOOM performs poorly on Guarani due to limited adaptation training data

Perplexity

Perplexity is a measure of uncertainty when predicting the next token in a sequence.
Lower perplexity means better language modeling ability.
Perplexity during language adaptation training does not necessarily correlate with prompting performance.
As model capacity increases, perplexity decreases but XWinograd performance drops.
Even though continued pretraining has lower perplexity, it underperforms for larger model sizes.
Perplexity does not always correlate with downstream task performance.
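
For reference, perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming the per-token natural-log probabilities have already been obtained from a model (the function name is illustrative):

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-probability over the tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among 4 options.
print(perplexity([math.log(0.25)] * 10))
```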

Connection to language independent representation

Sentence retrieval accuracy is used to measure quality of language independent representation
For MAD-X adapted models, SR accuracy improves as model grows in parameters
For continued pretraining, best SR accuracy is achieved with smallest model
Need around 100 million tokens of new language for effective language adaptation
Low-resource setting has limited effect on certain tasks
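
Sentence retrieval accuracy, as used above, can be sketched as nearest-neighbor matching between embeddings of parallel sentences. The version below assumes one fixed-size embedding per sentence and cosine similarity as the matching criterion. How the embeddings are extracted from the model is outside this sketch, and the toy vectors are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sentence_retrieval_accuracy(src_embs, tgt_embs):
    """Fraction of source sentences whose nearest target is their translation.

    src_embs[i] and tgt_embs[i] are assumed to embed parallel sentences.
    """
    correct = 0
    for i, src in enumerate(src_embs):
        best = max(range(len(tgt_embs)), key=lambda j: cosine(src, tgt_embs[j]))
        correct += int(best == i)
    return correct / len(src_embs)


src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[1.0, 0.1], [1.0, 0.2]]
print(sentence_retrieval_accuracy(src, tgt))
```

A higher score suggests that sentences with the same meaning land close together regardless of language.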

Adapters’ capacity

Investigated the effect of adapter capacity
Varied the reduction factor in the adapter’s bottleneck layer
A smaller reduction factor yields more adapter parameters
Positive correlation between adapter parameters and performance
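
The link between reduction factor and parameter count can be worked out directly. The sketch assumes a standard bottleneck adapter (one down-projection and one up-projection, each with a bias) whose bottleneck width is the hidden size divided by the reduction factor. The hidden size of 1,024 is an arbitrary illustrative value.

```python
def adapter_param_count(hidden_size, reduction_factor):
    """Parameters in one bottleneck adapter (weights plus biases)."""
    bottleneck = hidden_size // reduction_factor
    down = hidden_size * bottleneck + bottleneck   # down-projection
    up = bottleneck * hidden_size + hidden_size    # up-projection
    return down + up


for factor in (64, 16, 4, 1):
    print(factor, adapter_param_count(hidden_size=1024, reduction_factor=factor))
```

Halving the reduction factor roughly doubles the number of trainable adapter parameters per layer.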

Placement of adapters

Examined how adapter placement impacts performance
Kept single adapter at different layers of model
Last layers benefit most from language adaptation
Analyzed performance of MAD-X with and without invertible adapters
Invertible adapters only improve performance for German, Bulgarian, and Turkish
Also ran language adaptation with continued pretraining and MAD-X language adapters on a randomly initialized BLOOM
Without pretraining, adapted BLOOM model behaves like a random classifier
Investigated language adaptation strategies for BLOOMZ
BLOOMZ loses prompting capability after language adaptation on free-form text of monolingual OSCAR corpora
Experimented with learning a new language during instruction tuning
Finetuning on only Russian without other languages and tasks in xP3 mixture shows tiny improvements
Adding new languages during multitask finetuning can be effective but requires additional diverse tasks
Did not explore vocabulary and embedding adaptation
Evaluated sentence retrieval accuracy for subset of languages used to train BLOOM model
Found batch size of 8 is optimal considering performance-compute trade-off
Experimented with parameter-efficient finetuning strategies for language adaptation
MAD-X language adapters yield best prompting performance
Trained for 25,000 steps with batch size of 8 and sequence length of 1024
Evaluated every 5,000 steps on perplexity of 1,000 held-out validation samples
Performed hyperparameter search on learning rates, linear and cosine decay, and warm-up ratio
Found that different sets of hyperparameters caused only a small difference of around 1-2% in XNLI accuracy
Reported the number of tokens after preprocessing with BLOOM’s BPE tokenizer
Reported distribution of natural and programming languages in ROOTS pretraining data
Compared different language adaptation strategies for BLOOM models on number of trainable parameters, total training time, inference time per prompt on XNLI test set, and maximum GPU memory usage
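
The hyperparameter search mentioned above covers linear and cosine decay together with a warm-up ratio. As a sketch of what those options mean, the function below computes the learning rate at a given step. The peak learning rate and warm-up ratio in the example are hypothetical values, not the ones selected in the paper.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_ratio, decay="linear"):
    """Linear warm-up to peak_lr, then linear or cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    if decay == "linear":
        return peak_lr * (1.0 - progress)
    if decay == "cosine":
        return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown decay schedule: {decay}")


# Hypothetical settings: 25,000 steps, peak LR 1e-4, 10% warm-up.
for step in (0, 2_500, 13_750, 25_000):
    linear = lr_at_step(step, 25_000, 1e-4, 0.1, "linear")
    cosine = lr_at_step(step, 25_000, 1e-4, 0.1, "cosine")
    print(step, linear, cosine)
```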
