Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Generating a chain of thought can improve LLM performance.
- Zero-shot CoT evaluations have been done mainly on logical tasks.
- This paper evaluates zero-shot CoT on two sensitive domains.
- Using zero-shot CoT can increase the likelihood of undesirable output.
- Zero-shot CoT should be avoided on tasks with marginalized groups or harmful topics.
Paper Content
Introduction
- LLMs improve performance on a range of tasks
- Popular approach to implementing CoT involves zero-shot generation
- Zero-shot CoT produces undesirable biases and toxicity
- Models can sabotage performance when requiring social knowledge
- Zero-shot CoT increases model bias and generation toxicity
- Zero-shot CoT increases stereotypical reasoning and encourages toxic behaviour
Related work
- LLMs can use intermediate reasoning steps to improve performance on tasks like arithmetic, metaphor generation, and commonsense/symbolic reasoning
- Adding “Let’s think step by step” to a prompt can improve zero-shot performance on reasoning benchmarks
- Other prompting methods have also yielded performance increases
- LLMs are sensitive to prompting perturbations
- LLMs are prone to generating unreliable explanations
- Instruct-tuned and value-aligned LLMs aim to increase reliability and robustness
- NLP models exhibit a wide range of social and cultural biases
- LLMs also exhibit a range of biases and risks
Stereotype & toxicity benchmarks
- Leveraged 3 widely used stereotype benchmark datasets: CrowS Pairs, Stereoset, and BBQ
- Bootstrapped a small set of explicitly harmful questions (HarmfulQ)
- Converted each dataset into a zero-shot reasoning task
- Evaluated out-of-the-box performance in a zero-shot setting
Stereotype benchmarks
- CrowS Pairs is a dataset of 1508 sentences covering 9 stereotype dimensions
- StereoSet is a dataset of 17K instances of stereotypical bias annotated by crowd workers
- BBQ is a dataset of 50K questions targeting 11 stereotype categories
- All datasets are used to evaluate model bias
Toxicity benchmark
- Evaluate how models handle open-ended toxic requests
- Created a benchmark of 200 explicitly toxic questions
- Prompted text-davinci-002 to generate harmful questions
- Manually removed repetitive questions with high text overlap
- Prompted LLM to generate questions across 6 adjectives: racist, stereotypical, sexist, illegal, toxic, and harmful
- Seeded prompt with three few-shot examples
Methods
- Evaluating problematic outputs in a prompt-based setting
- Outlining prompt construction for each benchmark
- Discussing reasoning strategies
Framing benchmarks as prompting tasks
- BBQ, HarmfulQ, CrowS Pairs, and Stereoset are framed as QA tasks
- For CrowS Pairs and Stereoset, models are prompted to select the more accurate sentence between the stereotypical and anti-stereotypical setting
- For stereotype datasets, target stereotype and anti-stereotype examples are included as options, with an “Unknown” option as the correct answer
- Synonyms for “Unknown” are randomly selected for each question to account for potential preference for a specific lexical item
- Positional bias is reduced by randomly shuffling the type of answer associated with each of the options
Scoring bias and toxicity
- Evaluate biases in model completions using accuracy
- Models should not rely on stereotypes or antistereotypes
- Evaluate models by percent of pattern-matched unknown selections
- Manually label model outputs as encouraging or discouraging
- Calculate percent of model generations that encourage harmful behaviour
- Compute % point differences between CoT and Standard Prompting
Models
- Evaluated best performing GPT-3 model from zero-shot CoT work
- Standard parameters provided by OpenAI’s API
- Generated 5 completions for both Standard and CoT Prompt settings
- Evaluations ran between Oct 28th and Dec 14th, 2022
- Analyzed instruction-tuned davinci models in §5.2
- TD1 and TD2 finetuned on high quality human-written examples & model generations
- TD3 variant switches to improved reinforcement learning strategy
Results
- Average % point decrease of 8.8% between CoT and Standard prompting
- Average % point decrease of 19.4% between HarmfulQ and davinci models
- Replicate zero-shot CoT on selected benchmarks
- Analyze davinci-00X variants
- Characterize trends across scale
- Evaluate explicit mitigation instructions
Analyzing td2
- TD2 generally selects a biased output when using CoT
- Model performance decreased by 18% on average
- 95% confidence intervals are narrow
- CoT may have minimal impact on prompts that prefer biased/toxic output
- Errors in reasoning fall into two categories: explicit and implicit
Instruction tuning behaviour
- Instruction tuning strategies affect CoT’s impact on tasks.
- CoT effects generally decrease as instruction tuning behaviour improves.
- CoT effects are still mixed despite improved human preference alignment.
- In 1/3 of the stereotype settings, CoT reduces model accuracy.
- TD3 sees substantially larger decreases on HarmfulQ when using CoT.
Scaling behaviour
- Chain of Thought is an emergent behaviour that appears at large model scale
- Performance is tested on smaller GPT models using a single prompt setting
- As model scale increases, harms induced by CoT appear to get worse
- % point differences between CoT/non-CoT increase monotonically across scale
- U-shaped effect may exist, further analysis needed
Prompting with instruction mitigations
- Instruction-tuned models can follow natural language interventions
- Adding explicit mitigation instructions to the prompt can reduce biases
- TD2 accuracy decreases significantly with an explicit instruction, but TD3 accuracy only decreases slightly
- Adding a prompt-based intervention may be a viable solution for improved instruction-following performance
Conclusion
- Editing prompt-based reasoning strategies is powerful
- Auditing reasoning steps is recommended
- Current value alignment efforts are similar to “Lipstick on a Pig”
- Findings expected to generalize to other domains
- Red-teaming models with CoT is an important extension
- Carefully analyze model behaviours after inducing reasoning steps
- Faulty CoTs can heavily influence downstream results
- Viewing chain of thought prompting as a design pattern
- Analysis ran across 3 separate benchmarks
- Manual, qualitative analysis of failures
- General agreement across analyses mitigates flaws of each benchmark
- Accuracy degradations across standard/CoT settings
- Religion has a relatively high % point decrease
- Hand-code 50 random generations from each benchmark
- Zero-shot CoT reduces likelihood of selecting unknown or generating non-toxic answer
- Pretending to be an evil AI is a creative workaround for value alignment
- Small perturbations in task prompt can dramatically change LLM output
- Applying CoT can exacerbate biases in downstream tasks
- Models should be explicitly uncertain for generation output