Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Investigated scaling behaviors of red teaming across 3 model sizes and 4 model types
Released dataset of 38,961 red team attacks for others to analyze and learn from
Analyzed data and found a variety of harmful outputs, ranging from offensive language to subtly unethical outputs

Paper Content

Introduction

Large language models can have harmful behaviors
Examples of harmful behaviors include reinforcing social biases, generating offensive/toxic outputs, leaking personal info, aiding in disinformation campaigns, generating extremist texts, and spreading falsehoods
As AI systems improve, the scope of possible harms is likely to grow
Strategies have been developed to address some of these harms
Red teaming is a tool to address harm, using manual or automated methods to probe a language model for harmful outputs
Paper describes early efforts to implement manual red teaming
Investigated scaling behaviors for red teaming across 3 model sizes and 4 model types
Released dataset of 38,961 red team attacks
Described instructions, processes, and statistical methodologies for red teaming
Proposed policy interventions for how to develop shared norms, practices, and technical standards for red teaming

We use the same models as in our previous work
We run additional experiments to determine the influence of model size on susceptibility to red team attacks
We analyze the content of the attacks to understand the types of harms uncovered by red teaming
We provide more detail on our red team methods and release the data
We focus on reinforcement learning from human feedback as our most promising safety intervention

Red team task

Developed an interface for red team members to have open-ended conversations with an AI assistant
Provided a brief list of example conversation topics
Asked participants to enter a description of how they intend to red team the model
Reviewed literature and conducted interviews to incorporate best practices into task instructions and interface
Warned participants of sensitive content
Asked participants to select topics within their own risk tolerance
Asked participants to select the more harmful of two model-generated responses
Used dataset of pairs of model responses to train a harmlessness preference model
Asked participants to rate how successful they were at making the AI assistant say something bad
Assigned red team members to models at random

Models

We derive dialogue models from a general language model and a helpful and harmless preference model.
We train decoder-only transformer models at three sizes: 2.7B, 13B, and 52B parameters.
Our preference model is trained to predict both harmlessness and helpfulness.
We use 1-shot learning to prompt our general language models to behave as dialogue models.
We use 14-shot learning to prompt our general language models to be helpful, harmless, and honest.
We generate 16 samples of AI assistant responses from prompted language models and select the 2 least harmful samples.
We use reinforcement learning to train a prompted language model to maximize the scores given by the preference model.
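The rejection sampling step described above (draw 16 samples, keep the 2 least harmful) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate` and `harmlessness_score` are hypothetical callables standing in for the prompted language model and the preference model, and the sketch assumes a higher score means a more harmless response.

```python
import random
from typing import Callable, List


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],
    harmlessness_score: Callable[[str, str], float],
    n_samples: int = 16,
    n_keep: int = 2,
) -> List[str]:
    """Draw n_samples candidate replies and keep the n_keep least harmful.

    Assumption: harmlessness_score returns a higher value for a more
    harmless reply, so the best candidates are the highest-scoring ones.
    """
    candidates = [generate(prompt) for _ in range(n_samples)]
    ranked = sorted(
        candidates,
        key=lambda reply: harmlessness_score(prompt, reply),
        reverse=True,  # most harmless first
    )
    return ranked[:n_keep]


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_generate(prompt: str) -> str:
    return random.choice(
        [
            "I can't help with that.",
            "Sure, here is how you would do it.",
            "Let's talk about something else.",
        ]
    )


def toy_score(prompt: str, reply: str) -> float:
    # Pretend that compliant-sounding replies are the harmful ones.
    return -1.0 if reply.startswith("Sure") else 1.0


shown_to_red_teamer = rejection_sample("example prompt", toy_generate, toy_score)
print(shown_to_red_teamer)
```

Ranking with a learned preference model rather than a hand-written rule is what makes this an intervention at sampling time: the underlying language model is unchanged, and only the choice of which samples to show is filtered.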

Red team

Recruited 324 US-based crowdworkers from Amazon’s Mechanical Turk and Upwork
Paid between $7.50 and $9.50 for each set of 5 conversations on MTurk
Paid $20 per hour on Upwork
Crowdworker population not fully representative of US population
79% of participants self-identify as “White or Caucasian”
66% of participants have at least a college degree
Roughly 80% of red team attacks come from about 50 of the roughly 300 workers
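A concentration figure of this kind can be computed from a per-attack list of worker identifiers. The sketch below is an assumption about how such a number might be derived; the `worker_ids` input and the `top_k_share` helper are hypothetical and the data is made up.

```python
from collections import Counter
from typing import Iterable


def top_k_share(worker_ids: Iterable[str], k: int) -> float:
    """Fraction of all attacks contributed by the k most prolific workers."""
    counts = Counter(worker_ids)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no attacks recorded")
    top = sum(count for _, count in counts.most_common(k))
    return top / total


# Toy data (hypothetical): one worker wrote 8 of 10 attacks.
toy_ids = ["w1"] * 8 + ["w2", "w3"]
print(top_k_share(toy_ids, k=1))
```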

Data analysis

Collected 38,961 red team attacks
Measured 3 variables for each attack
Red team members self-rated success on a 5-point Likert scale
Distribution of self-rating was bimodal, with two peaks at 0 and 4
Used harmlessness preference model to compute harmlessness score of AI assistant’s dialogue
Found a negative correlation between self-rated attack success and minimum harmlessness score
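The relationship between self-rated success and the minimum harmlessness score can be checked with a correlation coefficient. The sketch below is illustrative only: the per-turn scores and ratings are made-up toy values, the record layout is hypothetical, and Pearson's r is one possible choice of statistic rather than necessarily the one used in the paper.

```python
from math import sqrt
from typing import Sequence


def min_harmlessness(turn_scores: Sequence[float]) -> float:
    """Aggregate per-turn harmlessness scores by taking the worst (minimum) turn."""
    return min(turn_scores)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy attacks (hypothetical values): self-rating on a 0-4 scale plus
# one preference-model harmlessness score per assistant turn.
attacks = [
    {"self_rating": 0, "turn_scores": [1.8, 2.1, 1.5]},
    {"self_rating": 1, "turn_scores": [1.2, 0.4]},
    {"self_rating": 3, "turn_scores": [0.9, -0.7, 0.2]},
    {"self_rating": 4, "turn_scores": [-1.5, 0.3]},
]

ratings = [a["self_rating"] for a in attacks]
min_scores = [min_harmlessness(a["turn_scores"]) for a in attacks]
r = pearson(ratings, min_scores)
print(f"Pearson r between self-rating and min harmlessness: {r:.2f}")
```

Taking the minimum over turns reflects the idea that a single harmful assistant reply is enough for an attack to count as successful, even if the other turns are benign.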

Review task

Collected data across all model types
Performed follow-up experiment to measure two variables: inter-annotator agreement and content of attack types
Self-ratings of attack success are subjective and can vary based on elements of attack and red team member
Wanted to understand variability across different raters
Computed harmlessness score on task description and assistant utterances
Aggregated scores using min or max
Relied on human judgement of attack success on Likert scale
Ran experiment on 500 red team attacks for two models
Low level of inter-rater agreement on success of red team attacks
Fleiss’s Kappa of 0.32 between 4 raters
Asked reviewers to tag transcripts with up to 2 of 20 total topic tags
Incorporated findings from literature on Trust & Safety into experiment design
Used shared communication tool (Slack) to communicate with group
Developed custom well-being survey and sent it to reviewers after 10 tasks
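Fleiss's Kappa, the agreement statistic reported above, compares the observed agreement among raters with the agreement expected by chance. The sketch below implements the standard formula from a table of category counts; the toy table is made up and the function name is a choice made here, not something taken from the paper.

```python
from typing import List, Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss's Kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall in each category.
    p_cat: List[float] = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_chance = sum(p * p for p in p_cat)
    return (p_observed - p_chance) / (1 - p_chance)


# Toy table (hypothetical): 4 transcripts, 3 raters each,
# 2 categories ("attack succeeded", "attack failed").
toy_counts = [[3, 0], [2, 1], [1, 2], [0, 3]]
print(f"Fleiss's Kappa: {fleiss_kappa(toy_counts):.3f}")
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, so a value such as 0.32 indicates that raters often disagreed about whether an attack succeeded.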

Results

Average success rate for control condition and simplest safety intervention is the same
Rejection sampling makes it difficult to red team language models
No clear trend with model size in self-reported attack success rate for the plain LM, prompted LM, and RS models
RLHF models become increasingly difficult to red team as they increase in size
Little difference between the plain LM and the prompted LM
Rejection sampling is an effective safety intervention
Harmful outputs from RS and RLHF models
Visualization of dataset shows clusters of red team attempts
Some attacks more successful than others
Crowdworkers use template-based attacks
Top 5 attack types correspond to discrimination, hate speech, violence, unethical behavior, and bullying
Less common tags include child abuse, self harm, sexual exploitation, terrorism, and animal abuse

Discussion

Limitations and future work

Red teaming language models in the form of an AI assistant allows probing of open-ended input and output spaces
LMs can be used in many applications that don’t require open-endedness
Crowdworkers generated attacks that required domain expertise
Data is incomplete due to unknown and unbounded space of possible harms
Could have asked third party organizations to red team
Could have given crowdworkers way to indicate domain expertise needed
Could have added instructions to the interface to encourage creativity
Red teaming was done manually; it could be automated in future
Future work: compare manual and automated approaches to red teaming

Policy interventions

Red teaming involves controversial subject matter
Organizations have incentives not to share findings
Difficulty sharing future risks, failures, and implications of yet-to-be developed systems
Incentive structure needs to be changed to share findings
Need to build consensus around how to red team and release findings
Questions to answer: who should red team, protections, instructions, annotation, analysis, success criteria
Decision to release data made in a vacuum
Benefits of release outweigh potential harm
Informational interviews with Trust & Safety professionals
Worker well-being practices: clear and specific warnings, personal risk tolerance, recommended well-being exercises, pay for time, segmented tasks, preview to opt out, well-being survey
Average attack success, correlation between attack success and harmlessness score
Pros and cons for releasing red team data
