Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Context is important for moral reasoning.
- Lying to a friend can be wrong or okay depending on the context.
- ClarifyDelphi is an interactive system that generates clarification questions to elicit missing contexts of a moral situation.
- Reinforcement Learning is used to generate questions that lead to diverging moral judgments.
- Human evaluation shows ClarifyDelphi generates more relevant, informative and defeasible questions.
- ClarifyDelphi assists moral reasoning by seeking additional context to disambiguate social and moral situations.
Paper Content
Introduction
- Reasoning about social or moral situations involves thinking about different contexts.
- Generally, offering a cup of coffee is seen as a positive moral judgement.
- Context can strengthen or weaken the judgement, depending on who it is offered to and when.
- Asking questions can help to elicit more salient context for moral situations.
Hypothetical answer simulation
- Prompting is used to generate opposing answers
- A reinforcement learning approach is used to generate questions
- The reward is based on the difference in moral judgements
- Questions are generated to expose model-specific ambiguities
- Evaluations show that the approach outperforms other baselines
Problem setup
- Aim is to generate questions that are relevant to making a social or moral judgement
- Questions should be able to weaken or strengthen the default judgement
- Task is to predict a question given a base-situation and a default moral judgement
- Answers to the questions should result in an updated situation with an updated moral judgement
Approach
- Approach is based on a pipeline of components
- Components are wrapped up in an RL algorithm
- Section 3.2 describes the policy
- Section 3.4 describes the reward
Collecting a dataset of clarification questions
- Collected dataset of clarification questions for social and moral situations
- Dataset consists of crowdsourced questions and questions generated by GPT-3
- Annotations collected on Amazon Mechanical Turk
- Silver data from defeasible inference dataset
- Most frequent WH-word used in crowdsourced questions is ‘what’
- Polar (yes/no) questions appear less frequently
- 53% of situations have at least one question generated by GPT-3 for both weakener and strengthener updates
- Most frequent question start for forking-path questions in silver data is ‘why’
Supervised question generation
- A basic question generation system is used to output a question based on a situation
- A model is trained on a dataset enriched with questions
- The input/output of the model includes a judgement, situation, type (weakener/strengthener) and question
- The model is trained on 77k instances from a question-enriched dataset and 4k instances obtained through prompting
Question selection
- Generate questions and quantify how well they elicit consequential answers
- Use Delphi (Jiang et al., 2022) to provide feedback
- Use GPT-3 (text-curie-001) to fuse situations, questions and answers into updated situations
- Delphi provides a probability distribution over three classes: bad, ok, good
- Calculate Jensen-Shannon divergence between Delphi probability distributions to assess if simulated weakener and strengthener answers lead to varying judgement
Ppo
- Aim to optimize for questions that lead to maximally divergent answers
- Define a reward function using JS-Divergence
- View question generation model as a policy
- Maximize reward using Proximal Policy Optimization
- Filter out generated answers that contradict or are entailed by the given situation
- Use WaNLI as an off-the-shelf NLI model
Baselines
- RL approach compared to four other baselines
- Supervised question generation model on its own
- Two baselines based on pipeline approach
- First step: generate diverse set of questions
- Second step: select best question according to score
- Two approaches to scoring and ranking questions: discriminator and divergence ranking
- Why-baseline generates causal questions
Human evaluation
- Automatic evaluation of questions and their usefulness for clarifying moral situations is difficult.
- Humans produce diverse questions for the same situation.
- Human evaluation of the models’ outputs was performed on Amazon Mechanical Turk.
- Turkers were asked to rate questions on Grammaticality, Relevance and Informativeness.
- Most importantly, Turkers were asked to evaluate the defeasibility of the questions.
Results of human evaluation
- Run grammaticality, relevance and informativeness evaluation
- Exclude questions with lowest rating from second evaluation
- CLARIFYDELPHI has biggest percentage of relevant and informative questions
- Differences in grammaticality among models minimal
- Big majority of questions from all models relevant and informative
- CLARIFYDELPHI outperforms baselines in terms of defeasibility
- Adding answer-filtering with NLI step improves question selection
How much supervision does the policy require?
- RL used in conjunction with supervised policy to generate questions
- Supervised policy outperforms RL on top of “vanilla” lm-policy
- Policy trained on varying percentages of δ-CLARIFY training data (25%, 50%, 75%, 100%)
- Policy trained on SQuAD v1.1 data for comparison
- More training data leads to more informative questions
Analysis
- Model succeeds at generating diverse weakener and strengthener answers
- Model looks at question-guided defeasible update generation
- Input includes situation, moral judgement, and update type
- Question generation functions as macro planning
- Model improved upon generating defeasible updates
- Question classes include specification, reason, elaboration, manner, and temporal
Interactive judgements
- PPO training uses answer simulation
- Inference only requires a situation as input
- Clarification questions can be used to elicit additional context
- Interaction is limited to three turns
- After two questions, it is unlikely that there is missing context
- Situation is updated with more context after each turn
- CLARIFYDELPHI’s questions change depending on user answers
Related work
- Clarification question generation has been studied for multiple domains
- A dataset of more than 30,000 questions was crowdsourced for social and moral situations
- Most general question generation approaches are based on seq2seq models
- Some works have incorporated an RL-based approach
- Delphi is a commonsense moral reasoning model trained on a dataset with 1.7M instances of descriptive knowledge