Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Generating a chain of thought can improve LLM performance.
  • Zero-shot CoT evaluations have been done mainly on logical tasks.
  • This paper evaluates zero-shot CoT on two sensitive domains.
  • Using zero-shot CoT can increase the likelihood of undesirable output.
  • Zero-shot CoT should be avoided on tasks with marginalized groups or harmful topics.

Paper Content

Introduction

  • LLMs improve performance on a range of tasks
  • Popular approach to implementing CoT involves zero-shot generation
  • Zero-shot CoT produces undesirable biases and toxicity
  • Models can sabotage performance when requiring social knowledge
  • Zero-shot CoT increases model bias and generation toxicity
  • Zero-shot CoT increases stereotypical reasoning and encourages toxic behaviour
  • LLMs can use intermediate reasoning steps to improve performance on tasks like arithmetic, metaphor generation, and commonsense/symbolic reasoning
  • Adding “Let’s think step by step” to a prompt can improve zero-shot performance on reasoning benchmarks
  • Other prompting methods have also yielded performance increases
  • LLMs are sensitive to prompting perturbations
  • LLMs are prone to generating unreliable explanations
  • Instruct-tuned and value-aligned LLMs aim to increase reliability and robustness
  • NLP models exhibit a wide range of social and cultural biases
  • LLMs also exhibit a range of biases and risks

Stereotype & toxicity benchmarks

  • Leveraged 3 widely used stereotype benchmark datasets: CrowS Pairs, Stereoset, and BBQ
  • Bootstrapped a small set of explicitly harmful questions (HarmfulQ)
  • Converted each dataset into a zero-shot reasoning task
  • Evaluated out-of-the-box performance in a zero-shot setting

Stereotype benchmarks

  • CrowS Pairs is a dataset of 1508 sentences covering 9 stereotype dimensions
  • StereoSet is a dataset of 17K instances of stereotypical bias annotated by crowd workers
  • BBQ is a dataset of 50K questions targeting 11 stereotype categories
  • All datasets are used to evaluate model bias

Toxicity benchmark

  • Evaluate how models handle open-ended toxic requests
  • Created a benchmark of 200 explicitly toxic questions
  • Prompted text-davinci-002 to generate harmful questions
  • Manually removed repetitive questions with high text overlap
  • Prompted LLM to generate questions across 6 adjectives: racist, stereotypical, sexist, illegal, toxic, and harmful
  • Seeded prompt with three few-shot examples

Methods

  • Evaluating problematic outputs in a prompt-based setting
  • Outlining prompt construction for each benchmark
  • Discussing reasoning strategies

Framing benchmarks as prompting tasks

  • BBQ, HarmfulQ, CrowS Pairs, and Stereoset are framed as QA tasks
  • For CrowS Pairs and Stereoset, models are prompted to select the more accurate sentence between the stereotypical and anti-stereotypical setting
  • For stereotype datasets, target stereotype and anti-stereotype examples are included as options, with an “Unknown” option as the correct answer
  • Synonyms for “Unknown” are randomly selected for each question to account for potential preference for a specific lexical item
  • Positional bias is reduced by randomly shuffling the type of answer associated with each of the options

Scoring bias and toxicity

  • Evaluate biases in model completions using accuracy
  • Models should not rely on stereotypes or antistereotypes
  • Evaluate models by percent of pattern-matched unknown selections
  • Manually label model outputs as encouraging or discouraging
  • Calculate percent of model generations that encourage harmful behaviour
  • Compute % point differences between CoT and Standard Prompting

Models

  • Evaluated best performing GPT-3 model from zero-shot CoT work
  • Standard parameters provided by OpenAI’s API
  • Generated 5 completions for both Standard and CoT Prompt settings
  • Evaluations ran between Oct 28th and Dec 14th, 2022
  • Analyzed instruction-tuned davinci models in §5.2
  • TD1 and TD2 finetuned on high quality human-written examples & model generations
  • TD3 variant switches to improved reinforcement learning strategy

Results

  • Average % point decrease of 8.8% between CoT and Standard prompting
  • Average % point decrease of 19.4% between HarmfulQ and davinci models
  • Replicate zero-shot CoT on selected benchmarks
  • Analyze davinci-00X variants
  • Characterize trends across scale
  • Evaluate explicit mitigation instructions

Analyzing td2

  • TD2 generally selects a biased output when using CoT
  • Model performance decreased by 18% on average
  • 95% confidence intervals are narrow
  • CoT may have minimal impact on prompts that prefer biased/toxic output
  • Errors in reasoning fall into two categories: explicit and implicit

Instruction tuning behaviour

  • Instruction tuning strategies affect CoT’s impact on tasks.
  • CoT effects generally decrease as instruction tuning behaviour improves.
  • CoT effects are still mixed despite improved human preference alignment.
  • In 1/3 of the stereotype settings, CoT reduces model accuracy.
  • TD3 sees substantially larger decreases on HarmfulQ when using CoT.

Scaling behaviour

  • Chain of Thought is an emergent behaviour that appears at large model scale
  • Performance is tested on smaller GPT models using a single prompt setting
  • As model scale increases, harms induced by CoT appear to get worse
  • % point differences between CoT/non-CoT increase monotonically across scale
  • U-shaped effect may exist, further analysis needed

Prompting with instruction mitigations

  • Instruction-tuned models can follow natural language interventions
  • Adding explicit mitigation instructions to the prompt can reduce biases
  • TD2 accuracy decreases significantly with an explicit instruction, but TD3 accuracy only decreases slightly
  • Adding a prompt-based intervention may be a viable solution for improved instruction-following performance

Conclusion

  • Editing prompt-based reasoning strategies is powerful
  • Auditing reasoning steps is recommended
  • Current value alignment efforts are similar to “Lipstick on a Pig”
  • Findings expected to generalize to other domains
  • Red-teaming models with CoT is an important extension
  • Carefully analyze model behaviours after inducing reasoning steps
  • Faulty CoTs can heavily influence downstream results
  • Viewing chain of thought prompting as a design pattern
  • Analysis ran across 3 separate benchmarks
  • Manual, qualitative analysis of failures
  • General agreement across analyses mitigates flaws of each benchmark
  • Accuracy degradations across standard/CoT settings
  • Religion has a relatively high % point decrease
  • Hand-code 50 random generations from each benchmark
  • Zero-shot CoT reduces likelihood of selecting unknown or generating non-toxic answer
  • Pretending to be an evil AI is a creative workaround for value alignment
  • Small perturbations in task prompt can dramatically change LLM output
  • Applying CoT can exacerbate biases in downstream tasks
  • Models should be explicitly uncertain for generation output