Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Multilingual generative language models are becoming more fluent in many languages.
  • It is unknown what cultural biases are present in the predictions of these models.
  • This work focuses on formality, a language property highly influenced by culture.
  • Two popular multilingual language models were analyzed in 5 languages.
  • The models were found to be biased towards the formal style when prompted neutrally.

Paper Content

Introduction

  • Natural Language Processing (NLP) systems are used worldwide across multiple cultures, audiences, contexts, communication goals, demographics, and languages.
  • Linguistic style is one of the major dimensions by which cultures vary in NLP technologies.
  • Formality is a stylistic property of language that can impact how we perceive a text.
  • Generating text with a desired level of formality can be useful for different NLP applications.
  • Multilingual language models are trained with large amounts of text from different sources.
  • This work analyzes the formality level of two multilingual language models across five languages.

Formality across different languages

  • In Arabic, a piece of text is considered formal if it contains no words from any dialect that is not considered Fusha.
  • Bengali formality is determined by the relationship between the speaker and the audience, the frequency of Sanskrit-originated words, agglutination/compound words, pronouns, verbs, and negation modifiers.
  • English formality is defined by careful selection of pronunciation, words, and structure.
  • French formality is divided into three classes: soutenu, courant, and familier.
  • Spanish formality is determined by the T/V distinction in the singular second-person pronoun, the presence of typographical or grammatical errors, the topic, and the layout.
  • LLMs have been shown to have social bias and prejudice against minorities
  • LLMs produce damaging content
  • Evaluating social bias in multilingual settings is difficult
  • Curation of culturally aware datasets and knowledge of cultural differences is necessary
  • Papers have focused on measuring social biases and stereotypes against disadvantaged groups
  • Formality analysis has been done for only a limited number of languages
  • Proposals have been made to create social bias verification pipelines for LLMs
  • Formality-sensitive machine translation has received attention in recent years
  • Datasets with formality annotations have been introduced in multiple languages

Experiments

  • Evaluated formality of two state-of-the-art multilingual language models in five languages
  • Hypothesized that high-resource languages in the training corpus can cause biases in the formality of the models
  • Employed variations of prompt lengths and formality
  • Tweaked parameters to avoid incohesive outputs

Language models

  • XGLM is a multilingual generative language model
  • XGLM is trained with 500 billion tokens from 30 languages
  • XGLM aims to achieve multilingual zero-shot and few-shot learning
  • BLOOM is a multilingual generative language model trained on 341 billion tokens from 59 languages
  • BLOOM and XGLM are decoder-based transformers pre-trained on a similar set of languages

Prompting for formality evaluation

  • Employ two prompting strategies to condition the generation of models
  • Short Neutral Prompts composed of up to three words to condition language without impacting formality level
  • Long Informal/Formal Prompts composed of truncated sentences from existing formal/informal sources
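
The two prompting strategies above could be sketched as follows. This is only an illustration: the helper name truncate_prompt, the word limit, and the example strings are assumptions made here, not the prompts or truncation rule used in the paper.

```python
def truncate_prompt(sentence: str, max_words: int = 8) -> str:
    """Keep only the first max_words whitespace-separated words.

    max_words is a hypothetical cutoff; the summary does not state
    where the source sentences were truncated.
    """
    return " ".join(sentence.split()[:max_words])


# Hypothetical short neutral prompts (at most three words each).
short_neutral_prompts = ["I think", "Today", "In my opinion"]

# Hypothetical formal source sentence, truncated to form a long prompt.
formal_source = (
    "We are pleased to inform you that your application "
    "has been approved by the committee."
)
long_formal_prompt = truncate_prompt(formal_source)

print(long_formal_prompt)
```

The truncated sentence leaves the continuation open, so the model has to complete it in whatever style it infers from the prompt.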

Generation parameters

  • Decoding parameters can affect the output of a language model.
  • Parameters are chosen to minimize the impact on the formality level of the model.
  • Prompts used in the experiments are listed in Table 2.
  • Global generation parameters are used to produce assessable outputs.

Formality evaluation

  • Assessed formality of generated outputs
  • Native or proficient speakers of each language classified all 1,200 generated sequences
  • Manual annotation was used because no multilingual formality classifier covers Arabic, Bengali, English, Spanish, and French
  • Annotated without looking at prompt and in randomized order
  • Classification categories: formal, informal, and incohesive
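
A minimal sketch of how such a blind, randomized annotation batch might be prepared is shown below. The record layout, the function name prepare_annotation_batch, and the fixed seed are assumptions for illustration; the summary does not describe the tooling the annotators used.

```python
import random

# The three categories named above.
LABELS = ("formal", "informal", "incohesive")


def prepare_annotation_batch(records, seed=0):
    """Shuffle generations and hide their prompts from the annotator.

    Returns (items, key): items carry only an id and the generated text;
    key maps each item id back to the index of the original record.
    """
    rng = random.Random(seed)
    order = list(range(len(records)))
    rng.shuffle(order)
    items = [
        {"item_id": item_id, "text": records[idx]["generation"]}
        for item_id, idx in enumerate(order)
    ]
    key = dict(enumerate(order))
    return items, key


# Hypothetical records, for illustration only.
records = [
    {"prompt": "I think", "generation": "I think we should meet tomorrow."},
    {"prompt": "Today", "generation": "Today lol idk what to do"},
    {"prompt": "Dear", "generation": "Dear the the of and"},
]
items, key = prepare_annotation_batch(records)
```

Keeping the key separate from the items lets the labels be joined back to prompts and models only after annotation is finished.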

Results & analysis

  • Analyzed cohesiveness of each model
  • Excluded incohesive text from formality analysis

Cohesiveness of generation

  • BLOOM(7.1B) generates more cohesive texts than XGLM(7.5B) for English, French, and Spanish
  • A larger model does not necessarily lead to more cohesive generations
  • BLOOM(3B) generates more cohesive texts than BLOOM(7.1B) for Bengali and English
  • XGLM(2.9B) generates more cohesive texts than XGLM(7.5B) for English, French, and Spanish
  • Percentage of incohesive texts is higher for some languages than others for both BLOOM and XGLM

Formality-level bias

  • Neutral prompts should lead to equitable distributions of formal and informal generations
  • XGLM(2.9B), BLOOM(3B), and BLOOM(7.1B) are almost neutral, with small differences of -3%, -6%, and -3%, respectively
  • XGLM(7.5B) shows significantly more bias toward formal generations than BLOOM(7.1B) with a difference of 33%
  • BLOOM(3B) shows only a bias of 1% toward informal generations and BLOOM(7.1B) shows 14% towards formal generations
  • XGLM(2.9B) shows significantly more bias than BLOOM(3B) toward formal generations with a difference of 41%
  • XGLM and BLOOM both show a small bias towards different directions for English
  • BLOOM shows extreme bias toward formal generations for French and Spanish
  • XGLM exhibits less bias towards formal generations than BLOOM
  • Bias is mostly toward formal generations for all the models and for all the languages
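
One way to express the bias figures above is as the difference between the share of formal and informal generations, after dropping incohesive ones. The sketch below assumes that definition and assumes a positive value means a lean toward formal; the paper's exact formula and sign convention are not given in this summary.

```python
from collections import Counter


def formality_bias(labels):
    """Return formal% minus informal% over cohesive generations.

    Assumed convention: positive means biased toward formal,
    zero means an even split, negative means biased toward informal.
    """
    counts = Counter(label for label in labels if label != "incohesive")
    total = counts["formal"] + counts["informal"]
    if total == 0:
        raise ValueError("no cohesive generations to analyze")
    return 100.0 * (counts["formal"] - counts["informal"]) / total


# Hypothetical annotations for one model and one language.
labels = ["formal"] * 6 + ["informal"] * 4 + ["incohesive"] * 5
print(formality_bias(labels))
```

With 6 formal and 4 informal cohesive generations, the shares are 60% and 40%, so the sketch reports a 20-point lean toward formal.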

Formality-level preservation

  • Formality is preserved when the generation has the same formality level as the prompt
  • Some models preserve the formality level of the prompt well for some languages
  • BLOOM preserves formality of 94.2% of Arabic samples
  • BLOOM does not pay attention to informal prompts
  • XGLM preserves informal style of prompts significantly better than BLOOM
  • BLOOM and XGLM do not preserve formal style of English and French prompts
  • Spanish formal and informal styles are preserved consistently across models
  • Model size is not an indicator of how well a model preserves formality style
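
A preservation rate such as the 94.2% quoted above could be computed along these lines. The function name and the choice to ignore incohesive generations are assumptions for this sketch, not a description of the authors' procedure.

```python
def preservation_rate(pairs):
    """Percentage of cohesive generations whose label matches the prompt style.

    pairs is an iterable of (prompt_style, generation_label) tuples.
    Incohesive generations are skipped (an assumption of this sketch).
    """
    cohesive = [(p, g) for p, g in pairs if g != "incohesive"]
    if not cohesive:
        raise ValueError("no cohesive generations to analyze")
    preserved = sum(1 for p, g in cohesive if p == g)
    return 100.0 * preserved / len(cohesive)


# Hypothetical (prompt style, annotated label) pairs.
pairs = [
    ("formal", "formal"),
    ("formal", "informal"),
    ("informal", "informal"),
    ("informal", "informal"),
    ("informal", "incohesive"),
]
print(preservation_rate(pairs))
```

Computing the rate separately for formal and informal prompts would show asymmetries such as a model following formal prompts but not informal ones.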

General statistics about generations

  • BLOOM generates longer texts than XGLM
  • BLOOM generates more and shorter sentences when generating informal text
  • Informal generations tend to have more emojis, especially in Bengali
  • Informal generations tend to have more punctuation marks than formal ones
  • BLOOM tends to generate conversational text
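
Surface statistics like these can be approximated with simple counting. The sketch below is a rough illustration: the sentence splitter, the ASCII-only punctuation count, and the emoji code point ranges are simplifying assumptions, not the measurement method used in the paper.

```python
import re
import string


def text_stats(text):
    """Return rough word, sentence, punctuation, and emoji counts."""
    words = text.split()
    # Split on Latin sentence enders and the Bengali danda; drop empty pieces.
    sentences = [s for s in re.split(r"[.!?\u0964]+", text) if s.strip()]
    # string.punctuation covers ASCII punctuation only.
    punctuation = sum(ch in string.punctuation for ch in text)
    # Heuristic emoji ranges; not an exhaustive list of emoji code points.
    emojis = sum(
        1
        for ch in text
        if 0x1F300 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF
    )
    return {
        "words": len(words),
        "sentences": len(sentences),
        "punctuation": punctuation,
        "emojis": emojis,
    }


stats = text_stats("Hi there \U0001F600! How are you?")
print(stats)
```

Averaging these counts separately over formal and informal generations gives the kind of comparison listed above.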

Discussion

  • Formality bias can lead to undesirable outcomes
  • Prompting techniques are used for zero-shot learning and conversational chatbots
  • Formality bias can lead to misunderstandings and conflicts
  • Recent work has taken formality bias into consideration

Conclusion

  • Analyzed formality level of two large-scale generative language models
  • Found BLOOM(7.1B) predicts more cohesive text than XGLM(7.5B) for English, French, and Spanish
  • Both models tend to generate formal text when prompted neutrally
  • Formality of the prompt highly impacts both models
  • Released 1,200 generations in Arabic, Bengali, English, French, and Spanish
  • Visualized the data for each language to give an overview of all the results