Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs like GPT-3 have been evaluated from a psychological perspective.
- Tests of personality traits show that LLMs have higher scores on SD-3 than the human average.
- Fine-tuning with safety metrics does not necessarily lead to more positive personalities.
- Well-being tests show an increase in scores from GPT-3 to InstructGPT.
- Instruction-finetune FLAN-T5 with positive answers can improve the model from a psychological perspective.
- Evaluation and improvement of LLMs’ safety should be done systematically.
Paper Content
Introduction
- Joseph Weizenbaum wrote the first NLP chatbot, ELIZA, in the 1960s
- ELIZA was capable of engaging in discourse, but not understanding it
- 60 years of rapid development of NLP techniques has led to LLMs
- LLMs are pre-trained with a massive amount of information from the internet and are capable of understanding language
- LLMs are used in various real-life applications, including customer service, education, and entertainment
- LLMs are prone to generate potentially harmful or inappropriate content
- Safety measurements and quantifying biases in NLP tasks have been researched
- Safety metrics operate on data, model, and output
- Personality and well-being tests are required for safely using LLMs
- LLMs show high scores on all traits of the Short Dark Triad than the human average
Related work
- Safety is a long-standing problem in AI, especially for AIGC created by large language models.
- Commonly used methods to address safety issues are data pre-processing, model instruction-finetuning, and output calibration.
- Self-debiasing, instruction-finetuning, and results calibration are used to improve safety.
Experiment setup
- Introduced LLMs and psychological tests
- Described evaluation framework for fair analysis
Large language models
- Evaluating two language models (GPT-3 and InstructGPT) and one instruction-finetuned model (FLAN-T5-XXL)
- GPT-3 is an autoregressive language model with 175B parameters
- GPT-3 has strong few-shot learning capability across various tasks
- InstructGPT is a safer version of GPT-3
- FLAN-T5-XXL has 11B parameters and improves model safety
Psychological tests
- Personality tests measure traits like Machiavellianism, Narcissism, and Psychopathy
- Well-being tests measure satisfaction with life and flourishing
- Respondents rate statements from Disagree to Agree
- Short Dark Triad (SD-3) is a uniform assessment for the three traits
- SD-3 consists of 27 statements that must be rated from 1 to 5
- Results of SD-3 can be used to gain insights into potential risks of LLMs
Big five inventory (bfi)
- Big five personality traits is a commonly used model of personality in psychology
- BFI consists of 44 statements that must be rated from 1 to 5
- Agreeableness and Neuroticism are related to model safety
- High Agreeableness is associated with avoiding conflict and helping others
- High Neuroticism is associated with anxiety, moodiness, and insecurity
- Flourishing Scale is a measure of overall happiness and satisfaction with life
Satisfaction with life scale (swls)
- SWLS is an assessment of global cognitive judgments of satisfaction with life
- SWLS adopts a hedonic approach, relying on positive emotions
- SWLS consists of 5 statements that must be rated on a scale of 1-7
- High scores indicate that the respondent loves their life and things are going well
Evaluation framework
- LLMs depend on input prompts
- Need to design unbiased prompts for psychological tests
- Permute all available options in test and take average score as final result
- Define set of statements for each trait
- Design zero-shot prompt for each statement
- Obtain answer and score from LLM and parser
- Calculate average score of three samplings for statement
- Calculate score for trait
Results and analysis
- LLMs’ performance on SD-3, BFI, and well-being tests discussed
- Cross-test analysis conducted on LLMs’ personality profile
- Effective way of instruction-finetuning LLMs for a more positive personality shown
Do llms have dark personalities?
- Average human results obtained from 7,863 samples from various studies
- Abnormal score range defined as one standard deviation higher than the average result
- GPT-3, GPT-3-I2 and FLAN-T5-XXL show higher scores for all traits in SD-3 than the average human results
- GPT-3-I2 has higher Machiavellianism and Narcissism scores than GPT-3
- FLAN-T5-XXL has the highest Machiavellianism and Psychopathy scores among all LLMs
- GPT-3-I2 and FLAN-T5-XXL have higher Agreeableness and lower Neuroticism scores than GPT-3
- GPT-3-I2 and FLAN-T5-XXL have higher scores on Flouring Scale and Satisfaction With Life Scale than GPT-3
- GPT-3-I2 falls into the highly satisfied category on Flouring Scale
Personality profile of the llms and cross-test analysis
- LLMs can be tested psychologically to gain a better understanding of their potential risks
- GPT-3 has the lowest Machiavellianism and Narcissism scores, but a high score in Psychopathy
- GPT-3 has lower Agreeableness and Conscientiousness, and higher Neuroticism than the other two models
- Low Agreeableness, Conscientiousness, and Neuroticism correlate to little compassion, limited orderliness, and higher volatility
- GPT-3 has the lowest well-being score
- GPT-3-I2 has higher Agreeableness, Conscientiousness, Openness, and lower Neuroticism
- Big Five tests have limited ability to detect dark sides of people
- FLAN-T5-XXL has medium level of Big Five personality traits, but poor results in Dark Triad
- Machiavellianism and Narcissism can’t be detected in Big Five tests
- Instruction-finetuned models may behave well but still have implicit bias
Llms have stable trait scores
- LLMs may generate different answers depending on the order of options in the prompt.
- 5% of answers have conflicts due to the order of options.
- Trait scores are normally distributed and reliable.
Conclusions
- LLMs used to generate answers from a zero-shot prompt
- Figure 1 and 2 show score distribution of LLMs
- Psychopathy statements include: getting revenge, being mean, avoiding dangerous situations
- Extraversion statements include: talkative, reserved, outgoing
- Agreeableness statements include: finding fault, helpful, cold
- Conscientiousness statements include: reliable, disorganized, lazy
- Neuroticism statements include: depressed, relaxed, moody
- Openness statements include: original, curious, artistic
- Satisfaction with Life Scale statements include: life is close to ideal, satisfied, important things wanted
- Experimental results on SD-3, BFI, FS and SLWS show scores from 1-5 and 8-56