Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Context is important for moral reasoning.
Lying to a friend can be wrong or okay depending on the context.
ClarifyDelphi is an interactive system that generates clarification questions to elicit missing contexts of a moral situation.
Reinforcement Learning is used to generate questions that lead to diverging moral judgments.
Human evaluation shows that ClarifyDelphi generates more relevant, informative and defeasible questions than competitive baselines.
ClarifyDelphi assists moral reasoning by seeking additional context to disambiguate social and moral situations.

Paper Content

Introduction

Reasoning about social or moral situations involves thinking about different contexts.
Offering a cup of coffee is generally judged to be morally positive.
Context can strengthen or weaken the judgement, depending on who it is offered to and when.
Asking questions can help to elicit more salient context for moral situations.

Hypothetical answer simulation

Prompting is used to generate opposing answers
A reinforcement learning approach is used to generate questions
The reward is based on the difference in moral judgements
Questions are generated to expose model-specific ambiguities
Evaluations show that the approach outperforms other baselines

Problem setup

Aim is to generate questions that are relevant to making a social or moral judgement
Questions should be able to weaken or strengthen the default judgement
Task is to predict a question given a base-situation and a default moral judgement
Answers to the questions should result in an updated situation with an updated moral judgement

Approach

Approach is based on a pipeline of components
Components are wrapped up in an RL algorithm
Section 3.2 describes the policy
Section 3.4 describes the reward

Collecting a dataset of clarification questions

Collected dataset of clarification questions for social and moral situations
Dataset consists of crowdsourced questions and questions generated by GPT-3
Annotations collected on Amazon Mechanical Turk
Silver data from defeasible inference dataset
Most frequent WH-word used in crowdsourced questions is ‘what’
Polar (yes/no) questions appear less frequently
53% of situations have at least one question generated by GPT-3 for both weakener and strengthener updates
Most frequent question start for forking-path questions in silver data is ‘why’

Supervised question generation

A basic question generation system is used to output a question based on a situation
A model is trained on a dataset enriched with questions
The input/output of the model includes a judgement, situation, type (weakener/strengthener) and question
The model is trained on 77k instances from a question-enriched dataset and 4k instances obtained through prompting
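The input/output layout described above could be serialized roughly as follows. This is a minimal sketch: the class `QGExample`, the function `to_seq2seq_pair`, and the field order and separators are illustrative assumptions, not the format used in the paper.

```python
from dataclasses import dataclass


@dataclass
class QGExample:
    """One training instance (hypothetical container, not from the paper)."""
    situation: str
    judgement: str
    update_type: str  # "weakener" or "strengthener"
    question: str


def to_seq2seq_pair(ex: QGExample) -> tuple[str, str]:
    """Turn an example into a (source, target) text pair.

    The separators and field order are assumptions made for illustration.
    """
    if ex.update_type not in {"weakener", "strengthener"}:
        raise ValueError(f"unknown update type: {ex.update_type!r}")
    source = (
        f"judgement: {ex.judgement} | "
        f"situation: {ex.situation} | "
        f"type: {ex.update_type}"
    )
    return source, ex.question


example = QGExample(
    situation="offering a cup of coffee",
    judgement="it's nice",
    update_type="weakener",
    question="Who is it offered to?",
)
source, target = to_seq2seq_pair(example)
```

Here the question is the generation target, and the judgement, situation and update type condition the model.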

Question selection

Generate questions and quantify how well they elicit consequential answers
Use Delphi (Jiang et al., 2022) to provide feedback
Use GPT-3 (text-curie-001) to fuse situations, questions and answers into updated situations
Delphi provides a probability distribution over three classes: bad, ok, good
Calculate Jensen-Shannon divergence between Delphi probability distributions to assess if simulated weakener and strengthener answers lead to varying judgement
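The divergence step can be illustrated with a small standard-library sketch. It assumes two probability distributions over the three classes (bad, ok, good), one for the weakener-updated situation and one for the strengthener-updated situation. The function names and the choice of log base 2 are assumptions for illustration.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence between two distributions over the same classes."""
    # The mixture m is positive wherever p or q is positive, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy distributions over (bad, ok, good); the numbers are made up.
weakened = [0.7, 0.2, 0.1]
strengthened = [0.1, 0.2, 0.7]
score = js_divergence(weakened, strengthened)
```

With base-2 logarithms the score lies between 0 (identical judgements) and 1 (completely disjoint judgements), so a higher value means the two simulated answers pull the judgement further apart.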

PPO

Aim to optimize for questions that lead to maximally divergent answers
Define a reward function using JS-Divergence
View question generation model as a policy
Maximize reward using Proximal Policy Optimization
Filter out generated answers that contradict or are entailed by the given situation
Use WaNLI as an off-the-shelf NLI model
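A reward with answer filtering might be wired together as in the sketch below. All callables are hypothetical placeholders: `nli` stands in for an NLI model such as WaNLI, `fuse` for the step that merges situation, question and answer, `judge` for Delphi, and `divergence` for a measure such as the JS divergence. Returning a zero reward when an answer is filtered out is an assumption; the summary above does not say how filtered answers affect the reward.

```python
from typing import Callable, Sequence

Distribution = Sequence[float]


def defeasibility_reward(
    situation: str,
    question: str,
    weak_answer: str,
    strong_answer: str,
    fuse: Callable[[str, str, str], str],
    judge: Callable[[str], Distribution],
    nli: Callable[[str, str], str],
    divergence: Callable[[Distribution, Distribution], float],
) -> float:
    """Score a question by how far its two simulated answers move the judgement."""
    for answer in (weak_answer, strong_answer):
        # Keep only answers that add new information: drop those that the
        # situation already entails or contradicts.
        if nli(situation, answer) != "neutral":
            return 0.0  # assumption: filtered answers yield no reward
    p_weak = judge(fuse(situation, question, weak_answer))
    p_strong = judge(fuse(situation, question, strong_answer))
    return divergence(p_weak, p_strong)
```

In a PPO setup, this scalar would serve as the reward for the question sampled from the policy.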

Baselines

RL approach compared to four other baselines
Supervised question generation model on its own
Two baselines based on pipeline approach
First step: generate diverse set of questions
Second step: select best question according to score
Two approaches to scoring and ranking questions: discriminator and divergence ranking
Why-baseline generates causal questions
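The two-step pipeline baselines (generate candidates, then pick the best one) reduce to a rank-and-select step. In this minimal sketch, `score` is a hypothetical stand-in for either the discriminator or the divergence-based scorer, and the toy scores are made up.

```python
from typing import Callable, Sequence


def select_question(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate question with the highest score."""
    if not candidates:
        raise ValueError("need at least one candidate question")
    return max(candidates, key=score)


# Toy scores standing in for a real scorer.
toy_scores = {"Why?": 0.1, "Who is it offered to?": 0.7, "When?": 0.4}
best = select_question(list(toy_scores), lambda q: toy_scores[q])
```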

Human evaluation

Automatic evaluation of questions and their usefulness for clarifying moral situations is difficult.
Humans produce diverse questions for the same situation.
Human evaluation of the models’ outputs was performed on Amazon Mechanical Turk.
Turkers were asked to rate questions on Grammaticality, Relevance and Informativeness.
Most importantly, Turkers were asked to evaluate the defeasibility of the questions.

Results of human evaluation

Grammaticality, relevance and informativeness are evaluated first
Questions with the lowest rating are excluded from the second evaluation
ClarifyDelphi has the largest percentage of relevant and informative questions
Differences in grammaticality among models are minimal
A large majority of questions from all models are relevant and informative
ClarifyDelphi outperforms the baselines in terms of defeasibility
Adding answer-filtering with NLI step improves question selection

How much supervision does the policy require?

RL used in conjunction with supervised policy to generate questions
RL on top of a supervised policy outperforms RL on top of a “vanilla” LM policy
Policy trained on varying percentages of δ-CLARIFY training data (25%, 50%, 75%, 100%)
Policy trained on SQuAD v1.1 data for comparison
More training data leads to more informative questions

Analysis

Model succeeds at generating diverse weakener and strengthener answers
Analysis also covers question-guided defeasible update generation
Input includes situation, moral judgement, and update type
Question generation functions as macro planning
Question-guided generation improves upon generating defeasible updates directly
Question classes include specification, reason, elaboration, manner, and temporal

Interactive judgements

PPO training uses answer simulation
Inference only requires a situation as input
Clarification questions can be used to elicit additional context
Interaction is limited to three turns
After two questions, it is unlikely that there is missing context
Situation is updated with more context after each turn
ClarifyDelphi’s questions change depending on user answers
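The interaction described above can be sketched as a short loop. The callables `ask`, `answer` and `fuse` are hypothetical placeholders for the question generator, the user, and the step that folds an answer back into the situation. Only the three-turn limit comes from the text.

```python
from typing import Callable


def clarify_interactively(
    situation: str,
    ask: Callable[[str], str],
    answer: Callable[[str], str],
    fuse: Callable[[str, str, str], str],
    max_turns: int = 3,
) -> tuple[str, list[tuple[str, str]]]:
    """Ask up to max_turns clarification questions, updating the situation each turn."""
    history: list[tuple[str, str]] = []
    for _ in range(max_turns):
        question = ask(situation)       # the question depends on the current situation
        reply = answer(question)        # in practice, the user's reply
        situation = fuse(situation, question, reply)  # add the new context
        history.append((question, reply))
    return situation, history
```

Because each question is generated from the updated situation, later questions depend on earlier answers.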

Related work

Clarification question generation has been studied for multiple domains
A dataset of more than 30,000 questions was crowdsourced for social and moral situations
Most general question generation approaches are based on seq2seq models
Some works have incorporated an RL-based approach
Delphi is a commonsense moral reasoning model trained on a dataset with 1.7M instances of descriptive knowledge

Reinforced Clarification Question Generation with Defeasibility Rewards for Disambiguating Social and Moral Situations
