Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Multilingual generative language models are becoming more fluent in many languages.
- It is unknown what cultural biases are present in the predictions of these models.
- This work focuses on formality, a language property highly influenced by culture.
- Two popular multilingual language models were analyzed in 5 languages.
- The models were found to be biased towards the formal style when prompted neutrally.
Paper Content
Introduction
- Natural Language Processing (NLP) systems used worldwide across multiple cultures, audiences, contexts, communication goals, demographics, and languages.
- Linguistic style is one of the major dimensions by which cultures vary in NLP technologies.
- Formality is a stylistic property of language that can impact how we perceive a text.
- Generating text with a desired level of formality can be useful for different NLP applications.
- Multilingual language models are trained with large amounts of text from different sources.
- This work analyzes the formality level of two multilingual language models across five languages.
Formality across different languages
- Arabic formality is defined as a piece of text containing no words from any dialect which is not considered Fusha.
- Bengali formality is determined by the relationship between the speaker and the audience, the frequency of Sanskrit-originated words, agglutination/compound words, pronouns, verbs, and negation modifiers.
- English formality is defined by careful selection of pronunciation, words, and structure.
- French formality is divided into three classes: soutenu, courant, and familier.
- Spanish formality is determined by the T/V distinction in the singular second-person pronoun, the presence of typographical or grammatical errors, the topic, and the layout.
Related work
- LLMs have been shown to have social bias and prejudice against minorities
- LLMs produce damaging content
- Evaluating social bias in multilingual settings is difficult
- Curation of culturally aware datasets and knowledge of cultural differences is necessary
- Papers have focused on measuring social biases and stereotypes against disadvantaged groups
- Formality analysis has been done for limited number of languages
- Proposals have been made to create social bias verification pipelines for LLMs
- Formality-sensitive machine translation has received attention in recent years
- Datasets with formality annotations have been introduced in multiple languages
Experiments
- Evaluated formality of two state-of-the-art multilingual language models in five languages
- Hypothesized that high-resource languages in corpus can cause biases in formality of models
- Employed variations of prompt lengths and formality
- Tweaked parameters to avoid incohesive outputs
Language models
- XGLM is a multilingual generative language model
- XGLM is trained with 500 billion tokens from 30 languages
- XGLM aims to achieve multilingual zero-shot and few-shot learning
- BLOOM is a multilingual generative language model trained on 341 billion tokens from 59 languages
- BLOOM and XGLM are decoder-based transformers pre-trained on a similar set of languages
Prompting for formality evaluation
- Employ two prompting strategies to condition the generation of models
- Short Neutral Prompts composed of up to three words to condition language without impacting formality level
- Long Informal/Formal Prompts composed of truncated sentences from existing formal/informal sources
Generation parameters
- Decoding parameters can affect the output of a language model.
- Parameters are chosen to minimize the impact on the formality level of the model.
- Prompts used in the experiments are listed in Table 2.
- Global generation parameters are used to produce assessable outputs.
Formality evaluation
- Assessed formality of generated outputs
- Native/proficient speaker of each language classified all 1200 generated sequences
- No multilingual formality classifier models that include Arabic, Bengali, English, Spanish, and French
- Annotated without looking at prompt and in randomized order
- Classification categories: formal, informal, and incohesive
Results & analysis
- Analyzed cohesiveness of each model
- Excluded incohesive text from formality analysis
Cohesiveness of generation
- BLOOM(7.1B) generates more cohesive texts than XGLM(7.5B) for English, French, and Spanish
- A larger model does not necessarily lead to more cohesive generations
- BLOOM(3B) generates more cohesive texts than BLOOM(7.1B) for Bengali and English
- XGLM(2.9B) generates more cohesive texts than XGLM(7.5B) for English, French, and Spanish
- Percentage of incohesive texts is higher for some languages than others for both BLOOM and XGLM
Formality-level bias
- Neutral prompts should lead to equitable distributions of formal and informal generations
- XGLM(2.9B), BLOOM(3B) and BLOOM(7.1B) are almost neutral with small differences of -3% -6% and -3%
- XGLM(7.5B) shows significantly more bias toward formal generations than BLOOM(7.1B) with a difference of 33%
- BLOOM(3B) shows only a bias of 1% toward informal generations and BLOOM(7.1B) shows 14% towards formal generations
- XGLM(2.9) shows significantly more bias than BLOOM(3B) toward formal generations with a difference of 41%
- XGLM and BLOOM both show a small bias towards different directions for English
- BLOOM shows extreme bias towards the formal generations for French and Spanish
- XGLM exhibits less bias towards formal generations than BLOOM
- Bias is mostly toward formal generations for all the models and for all the languages
Formality-level preservation
- Formality level of generation is same as prompt
- Some models preserve formality level of prompt efficiently for some languages
- BLOOM preserves formality of 94.2% of Arabic samples
- BLOOM does not pay attention to informal prompts
- XGLM preserves informal style of prompts significantly better than BLOOM
- BLOOM and XGLM do not preserve formal style of English and French prompts
- Spanish formal and informal styles are preserved consistently across models
- Model size is not an indicator of how well model preserves formality style
General statistics about generations
- BLOOM generates longer texts than XGLM
- BLOOM generates more and shorter sentences when generating informal text
- Informal generations tend to have more emojis, especially in Bengali
- Informal generations tend to have more punctuation marks than formal ones
- BLOOM tends to generate conversational text
Discussion
- Formality bias can lead to undesirable outcomes
- Prompting techniques are used for zero-shot learning and conversational chatbots
- Formality bias can lead to misunderstandings and conflicts
- Recent work has taken formality bias into consideration
Conclusion
- Analyzed formality level of two large-scale generative language models
- Found BLOOM(7.1B) predicts more cohesive text than XGLM(7.5B) for English, French, and Spanish
- Both models tend to generate formal text when prompted neutrally
- Formality of the prompt highly impacts both models
- Released 1,200 generations in Arabic, Bengali, English, French, and Spanish
- Visualized data for each language to help in seeing an overview of all the results