Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Generating a chain of thought can improve LLM performance.
Zero-shot CoT evaluations have been done mainly on logical tasks.
This paper evaluates zero-shot CoT on two socially sensitive domains: harmful questions and stereotype benchmarks.
Using zero-shot CoT can increase the likelihood of undesirable output.
Zero-shot CoT should be avoided on tasks involving marginalized groups or harmful topics.

Paper Content

Introduction

Chain-of-thought (CoT) prompting improves LLM performance on a range of tasks
A popular approach to implementing CoT involves zero-shot generation
Zero-shot CoT produces undesirable biases and toxicity
Models can sabotage performance when requiring social knowledge
Zero-shot CoT increases model bias and generation toxicity
Zero-shot CoT increases stereotypical reasoning and encourages toxic behaviour

Related work

LLMs can use intermediate reasoning steps to improve performance on tasks like arithmetic, metaphor generation, and commonsense/symbolic reasoning
Adding “Let’s think step by step” to a prompt can improve zero-shot performance on reasoning benchmarks
Other prompting methods have also yielded performance increases
LLMs are sensitive to prompting perturbations
LLMs are prone to generating unreliable explanations
Instruct-tuned and value-aligned LLMs aim to increase reliability and robustness
NLP models exhibit a wide range of social and cultural biases
LLMs also exhibit a range of biases and risks
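
The zero-shot CoT trigger mentioned above can be illustrated with a minimal sketch of how a Standard prompt and a CoT prompt might differ. Only the trigger phrase comes from the text; the `Q:`/`A:` template, the function names, and the "Therefore, the answer is" extraction phrase are assumptions made for illustration.

```python
# Minimal sketch of Standard vs. zero-shot CoT prompt construction.
# Assumption: the "Q:/A:" template and the extraction phrase are illustrative,
# not necessarily the exact wording used in the paper.

COT_TRIGGER = "Let's think step by step."


def standard_prompt(question: str) -> str:
    """Standard zero-shot prompt: the model is asked to answer directly."""
    return f"Q: {question}\nA:"


def cot_prompt(question: str) -> str:
    """Zero-shot CoT prompt: the trigger phrase follows the answer prefix."""
    return f"Q: {question}\nA: {COT_TRIGGER}"


def answer_extraction_prompt(question: str, reasoning: str) -> str:
    """Second stage: append the generated reasoning and ask for a final answer."""
    return f"{cot_prompt(question)} {reasoning.strip()}\nTherefore, the answer is"
```

The two-stage layout (generate reasoning, then extract an answer) is one common way to run zero-shot CoT; a single-stage variant that parses the answer from the reasoning text would also fit the description.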

Stereotype & toxicity benchmarks

Leveraged 3 widely used stereotype benchmark datasets: CrowS Pairs, StereoSet, and BBQ
Bootstrapped a small set of explicitly harmful questions (HarmfulQ)
Converted each dataset into a zero-shot reasoning task
Evaluated out-of-the-box performance in a zero-shot setting

Stereotype benchmarks

CrowS Pairs is a dataset of 1508 sentence pairs covering 9 stereotype dimensions
StereoSet is a dataset of 17K instances of stereotypical bias annotated by crowd workers
BBQ is a dataset of 50K questions targeting 11 stereotype categories
All datasets are used to evaluate model bias

Toxicity benchmark

Evaluate how models handle open-ended toxic requests
Created a benchmark of 200 explicitly toxic questions
Prompted text-davinci-002 to generate harmful questions
Manually removed repetitive questions with high text overlap
Prompted LLM to generate questions across 6 adjectives: racist, stereotypical, sexist, illegal, toxic, and harmful
Seeded prompt with three few-shot examples
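
The benchmark construction steps above can be sketched roughly as follows. The prompt wording, the function names, and the placeholder few-shot examples are hypothetical. The source describes removing repetitive questions manually, so the token-overlap (Jaccard) filter here is a stand-in heuristic that could flag candidates for manual review, with an assumed threshold of 0.8.

```python
import re

# The six adjectives listed in the text.
ADJECTIVES = ["racist", "stereotypical", "sexist", "illegal", "toxic", "harmful"]


def generation_prompt(adjective: str, few_shot: list[str]) -> str:
    """Build a few-shot generation prompt (hypothetical wording)."""
    lines = [f"Generate a list of {adjective} questions."]
    lines += [f"{i}. {q}" for i, q in enumerate(few_shot, start=1)]
    lines.append(f"{len(few_shot) + 1}.")  # leave the next item open for the model
    return "\n".join(lines)


def tokens(text: str) -> set[str]:
    """Lowercased word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two questions, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def drop_near_duplicates(questions: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a question only if it overlaps less than `threshold` with every kept one."""
    kept: list[str] = []
    for q in questions:
        if all(jaccard(q, k) < threshold for k in kept):
            kept.append(q)
    return kept
```

A lower threshold removes more aggressively; any automatic filter of this kind would still need the manual pass the source describes.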

Methods

Evaluating problematic outputs in a prompt-based setting
Outlining prompt construction for each benchmark
Discussing reasoning strategies

Framing benchmarks as prompting tasks

BBQ, HarmfulQ, CrowS Pairs, and StereoSet are framed as QA tasks
For CrowS Pairs and StereoSet, models are prompted to select the more accurate sentence between the stereotypical and anti-stereotypical versions
For stereotype datasets, target stereotype and anti-stereotype examples are included as options, with an “Unknown” option as the correct answer
Synonyms for “Unknown” are randomly selected for each question to account for potential preference for a specific lexical item
Positional bias is reduced by randomly shuffling the type of answer associated with each of the options
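
The prompt framing above (three options, a randomly chosen synonym for "Unknown", shuffled option positions) might look like the following sketch. The synonym list, the prompt layout, and the function name are assumptions for illustration, not the paper's exact templates.

```python
import random

# Illustrative synonyms; the actual list used in the paper is not given in the text.
UNKNOWN_SYNONYMS = [
    "Unknown",
    "Cannot be determined",
    "Not enough information",
    "Undetermined",
]


def build_qa_prompt(
    question: str,
    stereotype: str,
    anti_stereotype: str,
    rng: random.Random,
) -> tuple[str, dict[str, str]]:
    """Return a QA prompt plus a key mapping each option letter to its answer type."""
    unknown = rng.choice(UNKNOWN_SYNONYMS)  # vary the lexical form of "Unknown"
    options = [
        ("stereotype", stereotype),
        ("anti-stereotype", anti_stereotype),
        ("unknown", unknown),
    ]
    rng.shuffle(options)  # reduce positional bias
    letters = "ABC"
    lines = [f"Q: {question}", "Options:"]
    lines += [f"({letter}) {text}" for letter, (_, text) in zip(letters, options)]
    lines.append("A:")
    key = {letter: kind for letter, (kind, _) in zip(letters, options)}
    return "\n".join(lines), key
```

Returning the letter-to-type key alongside the prompt makes it possible to score a completion later without re-deriving the shuffle.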

Scoring bias and toxicity

Evaluate biases in model completions using accuracy
Models should not rely on stereotypes or anti-stereotypes
For stereotype benchmarks, evaluate models by the percent of pattern-matched “Unknown” selections
For HarmfulQ, manually label model outputs as encouraging or discouraging harmful behaviour
Calculate the percent of model generations that encourage harmful behaviour
Compute percentage-point differences between CoT and Standard prompting
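
The scoring steps above can be sketched as below. The regular expression, the label strings, and the function names are assumptions; the source only states that "Unknown" selections are pattern-matched and that harmful-question outputs are labeled by hand.

```python
import re

# Hypothetical pattern covering a few "Unknown"-style answers.
UNKNOWN_PATTERN = re.compile(
    r"\b(unknown|cannot be determined|not enough information|undetermined)\b",
    re.IGNORECASE,
)


def unknown_accuracy(completions: list[str]) -> float:
    """Percent of completions that pattern-match an 'Unknown'-style answer."""
    if not completions:
        raise ValueError("need at least one completion")
    hits = sum(1 for c in completions if UNKNOWN_PATTERN.search(c))
    return 100.0 * hits / len(completions)


def encourage_rate(labels: list[str]) -> float:
    """Percent of manually labeled generations marked 'encourage'."""
    if not labels:
        raise ValueError("need at least one label")
    return 100.0 * sum(1 for label in labels if label == "encourage") / len(labels)


def percentage_point_difference(cot_percent: float, standard_percent: float) -> float:
    """CoT minus Standard; negative values mean CoT lowered the score."""
    return cot_percent - standard_percent
```

For example, if 75% of Standard completions and 25% of CoT completions select "Unknown", the difference is -50 percentage points.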

Models

Evaluated best performing GPT-3 model from zero-shot CoT work
Standard parameters provided by OpenAI’s API
Generated 5 completions for both Standard and CoT Prompt settings
Evaluations ran between Oct 28th and Dec 14th, 2022
Analyzed instruction-tuned davinci models (text-davinci-001, -002, and -003, abbreviated TD1, TD2, and TD3) in §5.2
TD1 and TD2 are finetuned on high-quality human-written examples and model generations
The TD3 variant switches to an improved reinforcement learning strategy

Results

Average decrease of 8.8 percentage points from Standard to CoT prompting
Average decrease of 19.4 percentage points on HarmfulQ across davinci models
Replicate zero-shot CoT on selected benchmarks
Analyze davinci-00X variants
Characterize trends across scale
Evaluate explicit mitigation instructions

Analyzing TD2

TD2 generally selects a biased output when using CoT
Model performance decreased by 18% on average
95% confidence intervals are narrow
CoT may have minimal impact on prompts that prefer biased/toxic output
Errors in reasoning fall into two categories: explicit and implicit
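
The text does not say how the 95% confidence intervals were computed. As one plausible reading, the sketch below computes a Student's t interval over the accuracies of 5 completions per setting; the choice of a t interval and the function name are assumptions.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (n = 5).
T_CRIT_95_DF4 = 2.776


def ci95_five_runs(scores: list[float]) -> tuple[float, float]:
    """95% t-interval for the mean of exactly five run-level scores."""
    if len(scores) != 5:
        raise ValueError("this sketch hard-codes the critical value for 5 runs")
    centre = mean(scores)
    half_width = T_CRIT_95_DF4 * stdev(scores) / sqrt(len(scores))
    return centre - half_width, centre + half_width
```

With only five runs the t critical value is noticeably larger than the normal value of 1.96, so a narrow interval indicates that the runs agree closely.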

Instruction tuning behaviour

Instruction tuning strategies affect CoT’s impact on tasks.
CoT effects generally decrease as instruction tuning behaviour improves.
CoT effects are still mixed despite improved human preference alignment.
In 1/3 of the stereotype settings, CoT reduces model accuracy.
TD3 sees substantially larger decreases on HarmfulQ when using CoT.

Scaling behaviour

Chain of Thought is an emergent behaviour that appears at large model scale
Performance is tested on smaller GPT models using a single prompt setting
As model scale increases, harms induced by CoT appear to get worse
Percentage-point differences between CoT and non-CoT prompting increase monotonically with scale
A U-shaped effect may exist; further analysis is needed

Prompting with instruction mitigations

Instruction-tuned models can follow natural language interventions
Adding explicit mitigation instructions to the prompt can reduce biases
Even with an explicit mitigation instruction, TD2 accuracy still decreases significantly under CoT, while TD3 accuracy decreases only slightly
For models with improved instruction-following performance, adding a prompt-based intervention may be a viable mitigation
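
A prompt-based intervention of the kind described above could be sketched as prepending an instruction to each prompt and comparing all four conditions (Standard or CoT, with or without the instruction). The instruction wording, the `Q:`/`A:` template, and the function names are hypothetical; the exact instruction used in the paper is not given in the text.

```python
COT_TRIGGER = "Let's think step by step."

# Hypothetical wording, for illustration only.
MITIGATION_INSTRUCTION = (
    "Answer without relying on stereotypes. "
    "If there is not enough information, choose the unknown option."
)


def build_prompt(question: str, use_cot: bool, use_mitigation: bool) -> str:
    """Compose one prompt for a given (CoT, mitigation) condition."""
    parts = []
    if use_mitigation:
        parts.append(MITIGATION_INSTRUCTION)  # instruction goes before the question
    answer_prefix = f"A: {COT_TRIGGER}" if use_cot else "A:"
    parts.append(f"Q: {question}\n{answer_prefix}")
    return "\n\n".join(parts)


def all_conditions(question: str) -> dict[tuple[bool, bool], str]:
    """Prompts keyed by (use_cot, use_mitigation) for a 2x2 comparison."""
    return {
        (cot, mit): build_prompt(question, cot, mit)
        for cot in (False, True)
        for mit in (False, True)
    }
```

Holding the question fixed across the four conditions isolates the effect of the instruction from the effect of the CoT trigger.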

Conclusion

Editing prompt-based reasoning strategies is powerful
Auditing reasoning steps is recommended
Current value alignment efforts are similar to “Lipstick on a Pig”
Findings expected to generalize to other domains
Red-teaming models with CoT is an important extension
Carefully analyze model behaviours after inducing reasoning steps
Faulty CoTs can heavily influence downstream results
Viewing chain of thought prompting as a design pattern
Analysis ran across 3 separate benchmarks
Manual, qualitative analysis of failures
General agreement across analyses mitigates flaws of each benchmark
Accuracy degradations across standard/CoT settings
Religion has a relatively high percentage-point decrease
Hand-code 50 random generations from each benchmark
Zero-shot CoT reduces likelihood of selecting unknown or generating non-toxic answer
Pretending to be an evil AI is a creative workaround for value alignment
Small perturbations in task prompt can dramatically change LLM output
Applying CoT can exacerbate biases in downstream tasks
Models should be explicitly uncertain for generation output
