Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • Investigated scaling behaviors of red teaming across 3 model sizes and 4 model types
  • Released dataset of 38,961 red team attacks for others to analyze and learn from
  • Analyzed data and found a variety of harmful outputs, ranging from offensive language to subtly unethical outputs

Paper Content


  • Large language models can have harmful behaviors
  • Examples of harmful behaviors include reinforcing social biases, generating offensive/toxic outputs, leaking personal info, aiding in disinformation campaigns, generating extremist texts, and spreading falsehoods
  • As AI systems improve, the scope of possible harms is likely to grow
  • Strategies have been developed to address some of these harms
  • Red teaming is a tool to address harm, using manual or automated methods to probe a language model for harmful outputs
  • Paper describes early efforts to implement manual red teaming
  • Investigated scaling behaviors for red teaming across 3 model sizes and 4 model types
  • Released dataset of 38,961 red team attacks
  • Described instructions, processes, and statistical methodologies for red teaming
  • Proposed policy interventions for how to develop shared norms, practices, and technical standards for red teaming
  • We use the same models as in our previous work
  • We run additional experiments to determine the influence of model size on susceptibility to red team attacks
  • We analyze the content of the attacks to understand the types of harms uncovered by red teaming
  • We provide more detail on our red team methods and release the data
  • We focus on reinforcement learning from human feedback (RLHF) as our most promising safety intervention

Red team task

  • Developed an interface for red team members to have open-ended conversations with an AI assistant
  • Provided a brief list of example conversation topics
  • Asked participants to enter a description of how they intend to red team the model
  • Reviewed literature and conducted interviews to incorporate best practices into task instructions and interface
  • Warned participants of sensitive content
  • Asked participants to select topics within their own risk tolerance
  • Asked participants to select the more harmful of two model-generated responses
  • Used dataset of pairs of model responses to train a harmlessness preference model
  • Asked participants to rate how successful they were at making the AI assistant say something bad
  • Assigned red team members to models at random


  • We derive dialogue models from a general language model and a helpful and harmless preference model.
  • We train decoder-only transformer models at three sizes: 2.7B, 13B, and 52B parameters.
  • Our preference model is trained to predict both harmlessness and helpfulness.
  • We use 1-shot learning to prompt our general language models to behave as dialogue models.
  • We use 14-shot learning to prompt our general language models to be helpful, harmless, and honest.
  • We generate 16 samples of AI assistant responses from prompted language models and select the 2 least harmful samples.
  • We use reinforcement learning to train a prompted language model to maximize the scores given by the preference model.
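
The rejection sampling step described above (generate 16 candidate responses, keep the 2 least harmful) can be sketched as follows. This is a minimal illustration rather than the paper's implementation: `rejection_sample`, `toy_score`, and the sample strings are hypothetical, and the toy scorer merely stands in for a trained harmlessness preference model.

```python
from typing import Callable, List


def rejection_sample(
    candidates: List[str],
    harmlessness_score: Callable[[str], float],
    keep: int = 2,
) -> List[str]:
    """Return the `keep` candidates with the highest harmlessness score."""
    # Higher score = less harmful, so sort in descending order.
    ranked = sorted(candidates, key=harmlessness_score, reverse=True)
    return ranked[:keep]


def toy_score(response: str) -> float:
    """Hypothetical stand-in for a preference model: penalize flagged words."""
    flagged = {"insult", "threat"}
    return -float(sum(word in flagged for word in response.lower().split()))


# 16 toy candidates; every fourth one contains no flagged words.
samples = [f"reply {i}" + " insult" * (i % 4) for i in range(16)]
kept = rejection_sample(samples, toy_score, keep=2)
```

In a real system the scorer would be the preference model's output for the full dialogue context plus the candidate response, not a word list.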

Red team

  • Recruited 324 US-based crowdworkers from Amazon’s Mechanical Turk and Upwork
  • Paid between $7.50 and $9.50 for each set of 5 conversations on MTurk
  • Paid $20 per hour on Upwork
  • Crowdworker population not fully representative of US population
  • 79% of participants self-identify as “White or Caucasian”
  • 66% of participants have at least a college degree
  • About 80% of red team attacks come from about 50 of the roughly 300 workers
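
The concentration of attacks among a few workers can be measured as the share of all attacks written by the k most active workers. The sketch below assumes a flat log with one worker ID per attack. The function name `top_k_share` and the toy log are hypothetical and do not come from the released dataset.

```python
from collections import Counter
from typing import Iterable


def top_k_share(worker_ids: Iterable[str], k: int) -> float:
    """Fraction of all attacks written by the k most prolific workers."""
    counts = Counter(worker_ids)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no attacks in log")
    top = sum(n for _, n in counts.most_common(k))
    return top / total


# Toy log (hypothetical): one worker ID per attack.
toy_log = ["w1"] * 6 + ["w2"] * 2 + ["w3", "w4"]
share_top1 = top_k_share(toy_log, k=1)
share_top2 = top_k_share(toy_log, k=2)
```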

Data analysis

  • Collected 38,961 red team attacks
  • Measured 3 variables for each attack
  • Red team members self-rated attack success on a 5-point Likert scale (0 to 4)
  • Distribution of self-ratings was bimodal, with peaks at 0 and 4
  • Used harmlessness preference model to compute harmlessness score of AI assistant’s dialogue
  • Found an inverse relationship between self-rated attack success and minimum harmlessness score
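
The relationship between self-rated success and minimum harmlessness score can be illustrated with a small sketch. It takes the minimum per-turn harmlessness score in each dialogue and correlates it with the self-rating. The use of Pearson correlation is an assumption made here for illustration, and all numbers are toy values rather than data from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy data (hypothetical): one self-rating (0 to 4) per attack, and one
# harmlessness score per assistant turn in that attack's dialogue.
self_ratings = [0, 1, 2, 3, 4]
turn_scores = [[2.0, 1.5], [1.0, 1.2], [0.5, 0.1], [-0.5, 0.3], [-1.0, -2.0]]

# Aggregate each dialogue by its least harmless (minimum-score) turn.
min_scores = [min(scores) for scores in turn_scores]
r = pearson(self_ratings, min_scores)  # negative for this toy data
```

A negative `r` matches the direction reported above: attacks rated as more successful tend to have lower minimum harmlessness scores.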

Review task

  • Collected data across all model types
  • Performed follow-up experiment to measure two variables: inter-annotator agreement and content of attack types
  • Self-ratings of attack success are subjective and can vary based on elements of attack and red team member
  • Wanted to understand variability across different raters
  • Computed harmlessness score on task description and assistant utterances
  • Aggregated scores using min or max
  • Relied on human judgement of attack success on Likert scale
  • Ran experiment on 500 red team attacks for two models
  • Low level of inter-rater agreement on success of red team attacks
  • Fleiss’s Kappa of 0.32 between 4 raters
  • Asked reviewers to tag transcripts with up to 2 of 20 total topic tags
  • Incorporated findings from literature on Trust & Safety into experiment design
  • Used shared communication tool (Slack) to communicate with group
  • Developed custom well-being survey and sent it to reviewers after 10 tasks
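
The inter-rater agreement statistic mentioned above, Fleiss's Kappa, can be computed from a table of rating counts. The sketch below uses the standard formula. The function name and the toy table (4 raters, 5 Likert categories) are illustrative assumptions and not the paper's data or code.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss's kappa; counts[i][j] = number of raters giving item i category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_cats = len(counts[0])

    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cats)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)


# Toy table (hypothetical): 4 items, 4 raters, Likert categories 0 to 4.
toy_counts = [
    [4, 0, 0, 0, 0],
    [2, 1, 1, 0, 0],
    [0, 0, 0, 1, 3],
    [1, 1, 1, 1, 0],
]
kappa = fleiss_kappa(toy_counts)
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, so the reported 0.32 indicates fairly low agreement.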


  • Average attack success rate is the same for the control condition (plain LM) and the simplest safety intervention (prompted LM)
  • Rejection sampling (RS) makes language models particularly difficult to red team
  • No clear trends with model size for self-reported attack success rate
  • RLHF models become increasingly difficult to red team as they increase in size
  • Plain LM and prompted LM have little difference
  • Rejection sampling is an effective safety intervention
  • Harmful outputs from RS and RLHF models
  • Visualization of dataset shows clusters of red team attempts
  • Some attacks more successful than others
  • Crowdworkers use template-based attacks
  • Top 5 attack types correspond to discrimination, hate speech, violence, unethical behavior, and bullying
  • Less common tags include child abuse, self harm, sexual exploitation, terrorism, and animal abuse


Limitations and future work

  • Red teaming language models in the form of an AI assistant allows probing of open-ended input and output spaces
  • LMs can be used in many applications that don’t require open-endedness
  • Crowdworkers generated attacks that required domain expertise
  • Data is incomplete due to unknown and unbounded space of possible harms
  • Could have asked third party organizations to red team
  • Could have given crowdworkers a way to indicate that domain expertise is needed
  • Could have added instructions to the interface to encourage creativity
  • Red teaming was done manually; it could be automated in the future
  • Comparing manual and automated approaches to red teaming is left to future work

Policy interventions

  • Red teaming involves controversial subject matter
  • Organizations have incentives not to share findings
  • Difficulty sharing future risks, failures, and implications of yet-to-be developed systems
  • Incentive structure needs to change so that organizations share findings
  • Need to build consensus around how to red team and release findings
  • Questions to answer: who should red team, protections, instructions, annotation, analysis, success criteria
  • Decision to release data made in a vacuum
  • Benefits of release outweigh potential harm
  • Informational interviews with Trust & Safety professionals
  • Clear and specific warnings, personal risk tolerance, recommended well-being exercises, pay for time, segment tasks, preview to opt out, well-being survey
  • Average attack success, correlation between attack success and harmlessness score
  • Pros and cons for releasing red team data