Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Generative AI models have impressive performance on NLP tasks.
Evaluating generative AI is challenging.
MEGA is a benchmark for generative LLMs, covering 8 tasks and 33 languages.
Comparing generative LLMs to SOTA non-autoregressive models.
Analysis of performance across languages and directions for future progress.

Paper Content

Introduction

Generative Large Language Models (LLMs) have created a lot of interest due to their capabilities
LLMs have been tested on languages other than English with varying results
GPT-4 model was evaluated on the MMLU multiple choice questions benchmark in 26 languages
GPT-3, BLOOM and PaLM models have been trained on multiple languages
It is unclear how well LLMs perform across diverse tasks and languages
Most of the world’s population is under-served in terms of availability of data for their languages
Evaluation of LLMs has been an active area of research with multiple benchmarks
Challenges of scaling up multilingual evaluation due to lack of resources and infrastructure
Limitations of using translated datasets for evaluation
Holistic evaluation of generative AI models for English (HELM)
Bang et al. (2023) evaluate ChatGPT across tasks and languages
Hendy et al. (2023) evaluate translation abilities of three GPT models
Comprehensive Multilingual Evaluation of Generative AI (MEGA) to quantify how well generative LLMs perform across languages
Blueprint of strategies for building systems using generative AI for multilingual users
MEGA benchmarking framework to evaluate systems on specific tasks across languages

Experiments

Adapting NLP tasks to in-context learning setting
Describing prompting strategies for benchmark
Models, tasks and datasets included in initial study

Problem formulation

Adopt prompt-based few-shot learning strategy
Define four components of prompts: test example, few-shot exemplars, prompt template, answer verbalizer
Prompt is composed of template and verbalizer
Evaluation score is determined by exact-match and F1-score for QA tasks

Prompting strategies

Choice of prompt affects performance of Large Language Models
Generative models are sensitive to simple prompting variations
Variations include language of prompt examples, language of prompt template, and language of test examples
Monolingual prompting uses same language for k1 randomly selected examples and test examples
Zero-Shot Cross-Lingual prompting uses k-shot examples from a pivot language different from language of test example
Translate-Test prompting modifies test example by translating it to English
English-Template uses English prompts for all experiments
Native-Language-Template uses native language prompts, but performs poorly

Models

Used OpenAI’s GPT text-davinci-003 model for benchmarking experiments
Compared performance of DV003 with BLOOMZ, TULRv6 and MuRIL models

Tasks and datasets

Two broad families of NLU tasks: Classification and Question Answering
Performance of different models measured in terms of classification accuracy
Exact match between generated output and verbalized label to determine if example was classified correctly
Span Prediction type of Question Answering tasks
Challenge of fitting context and question pairs into maximum context size of 4096
Exact Match and F1 score used to measure performance
Four tasks used for benchmarking: TyDiQA, MLQA, XQuAD, IndicQA

Few-shot examples

Randomly chose few-shot examples from development set
Better choices of few-shot examples can lead to higher performance

Choice of prompts

Use PromptSource from BigScience community to create prompts for tasks
Evaluate performance of English prompts on 10% of English test set
Select best performing English prompt for entire test set
Translate English prompt to target language using Bing translator

Results

Analyzed the results of MEGA over each task in two parts
Compared performance of best DV003 system to BLOOMZ, TULR and MuRIL
Y axis ordered by language class according to Joshi et al. (2020)
Class 5 corresponds to high resource languages, Class 1 and 2 are underresourced languages
Included results on English for each task to show gap between performance in English and other languages

Comparison across prompting strategies

Translatetest performs best across all languages
Monolingual prompting works best for some languages
Translatetest outperforms other settings by a large margin
Performance is much worse for non-Latin script languages

Comparison across models

DV003 compared to generative and SOTA non-autoregressive models
TULRv6 outperforms other models
BLOOMZ performs worse than DV003 translate-test, but better for some languages
MuRIL model outperforms DV003 on all languages
TULRv6 performs best for XCOPA, PAWS-X, TyDiQA and XQUAD
TULRv6 outperforms DV003 in all languages, with higher gap in low-resource and non-Latin script languages
Table 3 provides summary of results averaged over all languages

Discussion

Translate-test setting works best for DV003 across all tasks
Generative LLMs (DV003 and BLOOMZ) have lower performance than SOTA models
Gap between generative LLMs and SOTA models is reduced for high-resource languages
Gap is higher for under-resourced languages

Tokenization

Tokenization is a key component that influences the performance of multilingual models
Poor tokenization for some lower-resource languages can lead to poor context encapsulation and poor semantic representations of inputs

Translating prompts to multiple languages

Translating prompts often results in semantically meaningless prompts for some moderate-resource languages.
Human supervision or post-editing is necessary to generate meaningful prompts for such languages.
Translation appears to favor languages which share scripts with higher-resource languages.
Sharing dominant scripts leads to better cross-lingual transfer and improved performance on downstream tasks.

Translate-test

Translate-test prompt setting works best overall for all languages and tasks.
Translating into English and querying the system is feasible for languages supported by translators.
Translation performance of M2M is presented in Table 4.

Looking forward

Benchmarking generative AI models across languages
Expand language coverage to include low-resource languages and typologically diverse languages
Include more standard NLP tasks and tasks from real-world applications
Compare against BLOOMZ and include more generative models such as GPT4
Include other dimensions such as calibration, fairness, toxicity, bias and disinformation
Incorporate datasets that represent multilingual communication, such as code-switching

Conclusion

Generative LLMs perform worse than SOTA models
Perform better on higher-resource languages and Latin script
Tokenization in GPT may be a reason for the gap
Choice of prompting strategy matters
Tuning prompts is challenging to scale
Need to scale up multilingual prompt generation
Translate-test is currently the best strategy
Prioritize automatic benchmarking and human evaluation
Little representation from African and Indigenous languages
Restricted to accuracy dimension of evaluation
Figures 1-26 included in paper

Link to paper#

Abstract#

Paper Content#

Introduction#

Experiments#

Problem formulation#

Prompting strategies#

Models#

Tasks and datasets#

Few-shot examples#

Choice of prompts#

Results#

Comparison across prompting strategies#

Comparison across models#

Discussion#

Tokenization#

Translating prompts to multiple languages#

Translate-test#

Looking forward#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Experiments

Problem formulation

Prompting strategies

Models

Tasks and datasets

Few-shot examples

Choice of prompts

Results

Comparison across prompting strategies

Comparison across models

Discussion

Tokenization

Translating prompts to multiple languages

Translate-test

Looking forward

Conclusion