Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Context is important for moral reasoning.
  • Lying to a friend can be wrong or okay depending on the context.
  • ClarifyDelphi is an interactive system that generates clarification questions to elicit missing contexts of a moral situation.
  • Reinforcement learning is used to generate questions whose hypothetical answers lead to diverging moral judgments.
  • Human evaluation shows ClarifyDelphi generates more relevant, informative and defeasible questions than competitive baselines.
  • ClarifyDelphi assists moral reasoning by seeking additional context to disambiguate social and moral situations.

Paper Content

Introduction

  • Reasoning about social or moral situations involves thinking about different contexts.
  • Offering a cup of coffee generally receives a positive moral judgement.
  • Context can strengthen or weaken the judgement, depending on who it is offered to and when.
  • Asking questions can help to elicit more salient context for moral situations.

Hypothetical answer simulation

  • Prompting is used to generate opposing answers
  • A reinforcement learning approach is used to generate questions
  • The reward is based on the difference in moral judgements
  • Questions are generated to expose model-specific ambiguities
  • Evaluations show that the approach outperforms other baselines

Problem setup

  • Aim is to generate questions that are relevant to making a social or moral judgement
  • Questions should be able to weaken or strengthen the default judgement
  • Task is to predict a question given a base-situation and a default moral judgement
  • Answers to the questions should result in an updated situation with an updated moral judgement

Approach

  • Approach is based on a pipeline of components
  • Components are wrapped up in an RL algorithm
  • Section 3.2 describes the policy
  • Section 3.4 describes the reward

Collecting a dataset of clarification questions

  • Collected dataset of clarification questions for social and moral situations
  • Dataset consists of crowdsourced questions and questions generated by GPT-3
  • Annotations collected on Amazon Mechanical Turk
  • Silver data from defeasible inference dataset
  • Most frequent WH-word used in crowdsourced questions is ‘what’
  • Polar (yes/no) questions appear less frequently
  • 53% of situations have at least one question generated by GPT-3 for both weakener and strengthener updates
  • Most frequent question start for forking-path questions in silver data is ‘why’

Supervised question generation

  • A basic question generation system is used to output a question based on a situation
  • A model is trained on a dataset enriched with questions
  • The input/output of the model includes a judgement, situation, type (weakener/strengthener) and question
  • The model is trained on 77k instances from a question-enriched dataset and 4k instances obtained through prompting
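
As a rough illustration of the input/output layout described above, the sketch below turns one training instance into a source/target pair for a seq2seq model. The field markers, the `Example` class and the sample text are assumptions made for illustration; the paper's exact serialization is not given in this summary.

```python
from dataclasses import dataclass
from typing import Tuple

UPDATE_TYPES = ("weakener", "strengthener")


@dataclass
class Example:
    """One question-enriched instance (hypothetical container)."""
    situation: str
    judgement: str
    update_type: str  # "weakener" or "strengthener"
    question: str


def to_seq2seq_pair(ex: Example) -> Tuple[str, str]:
    """Serialize an instance into (source, target) strings.

    Assumption: judgement, situation and update type form the input,
    and the question is the target to be generated.
    """
    if ex.update_type not in UPDATE_TYPES:
        raise ValueError(f"unknown update type: {ex.update_type!r}")
    source = (
        f"judgement: {ex.judgement} | "
        f"situation: {ex.situation} | "
        f"type: {ex.update_type}"
    )
    return source, ex.question


# Invented example, not taken from the dataset.
example = Example(
    situation="offering a cup of coffee",
    judgement="it's nice",
    update_type="weakener",
    question="Who is it offered to?",
)
source, target = to_seq2seq_pair(example)
```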

Question selection

  • Generate questions and quantify how well they elicit consequential answers
  • Use Delphi (Jiang et al., 2022) to provide feedback
  • Use GPT-3 (text-curie-001) to fuse situations, questions and answers into updated situations
  • Delphi provides a probability distribution over three classes: bad, ok, good
  • Calculate Jensen-Shannon divergence between Delphi probability distributions to assess if simulated weakener and strengthener answers lead to varying judgement
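
The divergence step can be sketched in a few lines of Python. This is a minimal sketch, assuming each Delphi output is a list of three probabilities over (bad, ok, good); the example distributions are invented, and base-2 logarithms are chosen here so the score falls between 0 and 1.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in bits; terms where p is zero contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence between two discrete distributions."""
    # The mixture m is positive wherever p or q is positive,
    # so the KL terms never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Invented judgement distributions over (bad, ok, good).
after_weakener = [0.7, 0.2, 0.1]
after_strengthener = [0.1, 0.2, 0.7]
score = js_divergence(after_weakener, after_strengthener)
```

A question whose simulated weakener and strengthener answers leave the judgement unchanged scores 0, while answers that push the judgement to opposite classes score close to 1.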

PPO

  • Aim to optimize for questions that lead to maximally divergent answers
  • Define a reward function using JS-Divergence
  • View question generation model as a policy
  • Maximize reward using Proximal Policy Optimization
  • Filter out generated answers that contradict or are entailed by the given situation
  • Use WaNLI as an off-the-shelf NLI model
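
One way the reward and the answer filter might fit together is sketched below. The `judge` and `nli_label` callables are hypothetical stand-ins for Delphi and an NLI model such as WaNLI, plain string concatenation stands in for the GPT-3 fusion step, and returning a zero reward for filtered answers is an assumption rather than the paper's stated behaviour.

```python
import math
from typing import Callable, Sequence

Dist = Sequence[float]  # probabilities over (bad, ok, good)


def js_divergence(p: Dist, q: Dist) -> float:
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    def kl(a: Dist, b: Dist) -> float:
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def is_informative(
    situation: str, answer: str, nli_label: Callable[[str, str], str]
) -> bool:
    """Keep an answer only if it neither contradicts nor is entailed."""
    return nli_label(situation, answer) == "neutral"


def defeasibility_reward(
    situation: str,
    weak_answer: str,
    strong_answer: str,
    judge: Callable[[str], Dist],
    nli_label: Callable[[str, str], str],
) -> float:
    """Reward a question by how far its two simulated answers diverge."""
    answers = (weak_answer, strong_answer)
    if not all(is_informative(situation, a, nli_label) for a in answers):
        return 0.0  # assumption: filtered answers earn no reward
    # Assumption: concatenation replaces the learned fusion step.
    p_weak = judge(f"{situation} {weak_answer}")
    p_strong = judge(f"{situation} {strong_answer}")
    return js_divergence(p_weak, p_strong)
```

In PPO training, a scalar of this kind would be the signal the question-generation policy is optimized to maximize.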

Baselines

  • RL approach compared to four other baselines
  • Supervised question generation model on its own
  • Two baselines based on pipeline approach
  • First step: generate diverse set of questions
  • Second step: select best question according to score
  • Two approaches to scoring and ranking questions: discriminator and divergence ranking
  • Why-baseline generates causal questions
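
The two-step pipeline baselines share a generate-then-rank shape, which can be sketched as follows. The scoring function is passed in as a parameter, so either a discriminator score or a divergence score could be plugged in; the candidate questions and scores below are invented for illustration.

```python
from typing import Callable, List, Sequence


def rank_questions(
    questions: Sequence[str], score: Callable[[str], float]
) -> List[str]:
    """Order candidate questions from highest to lowest score."""
    # sorted() is stable, so tied questions keep their original order.
    return sorted(questions, key=score, reverse=True)


def select_best(questions: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the top-scoring question from a non-empty candidate set."""
    ranked = rank_questions(questions, score)
    if not ranked:
        raise ValueError("no candidate questions to select from")
    return ranked[0]


# Invented candidates and scores standing in for step one's output.
toy_scores = {
    "Why did you do it?": 0.31,
    "Who was it offered to?": 0.62,
    "When did this happen?": 0.18,
}
best = select_best(list(toy_scores), toy_scores.get)
```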

Human evaluation

  • Automatic evaluation of questions and their usefulness for clarifying moral situations is difficult.
  • Humans produce diverse questions for the same situation.
  • Human evaluation of the models’ outputs was performed on Amazon Mechanical Turk.
  • Turkers were asked to rate questions on Grammaticality, Relevance and Informativeness.
  • Most importantly, Turkers were asked to evaluate the defeasibility of the questions.

Results of human evaluation

  • Run grammaticality, relevance and informativeness evaluation
  • Exclude questions with lowest rating from second evaluation
  • ClarifyDelphi has the largest percentage of relevant and informative questions
  • Differences in grammaticality among models are minimal
  • A large majority of questions from all models are relevant and informative
  • ClarifyDelphi outperforms baselines in terms of defeasibility
  • Adding answer-filtering with NLI step improves question selection

How much supervision does the policy require?

  • RL used in conjunction with supervised policy to generate questions
  • RL on top of a supervised policy outperforms RL on top of a “vanilla” LM policy
  • Policy trained on varying percentages of δ-CLARIFY training data (25%, 50%, 75%, 100%)
  • Policy trained on SQuAD v1.1 data for comparison
  • More training data leads to more informative questions

Analysis

  • Model succeeds at generating diverse weakener and strengthener answers
  • Analysis also covers question-guided defeasible update generation
  • Input includes situation, moral judgement, and update type
  • Question generation functions as macro planning
  • Model improved upon generating defeasible updates
  • Question classes include specification, reason, elaboration, manner, and temporal

Interactive judgements

  • PPO training uses answer simulation
  • Inference only requires a situation as input
  • Clarification questions can be used to elicit additional context
  • Interaction is limited to three turns
  • After two questions, it is unlikely that there is missing context
  • Situation is updated with more context after each turn
  • CLARIFYDELPHI’s questions change depending on user answers
  • Clarification question generation has been studied for multiple domains
  • A dataset of more than 30,000 questions was crowdsourced for social and moral situations
  • Most general question generation approaches are based on seq2seq models
  • Some works have incorporated an RL-based approach
  • Delphi is a commonsense moral reasoning model trained on a dataset with 1.7M instances of descriptive knowledge
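
Returning to the turn-limited interaction described at the start of this section, the loop can be sketched as below. The `ask`, `answer` and `fuse` callables are hypothetical stand-ins for the question generator, the user, and the step that merges an answer into the situation; treating one turn as one question-answer exchange is an assumption.

```python
from typing import Callable, List, Tuple


def clarify(
    situation: str,
    ask: Callable[[str], str],
    answer: Callable[[str], str],
    fuse: Callable[[str, str, str], str],
    max_turns: int = 3,
) -> Tuple[str, List[Tuple[str, str]]]:
    """Ask up to max_turns clarification questions, updating the situation."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_turns):
        # Only the current situation is needed to produce a question.
        question = ask(situation)
        reply = answer(question)
        # The situation gains context after every turn, so the next
        # question depends on what the user has already said.
        situation = fuse(situation, question, reply)
        history.append((question, reply))
    return situation, history
```

Because each question is generated from the updated situation, different user answers lead to different follow-up questions.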

Conclusion