Abstract

  • Disagreements are usually studied from one of two perspectives: detecting toxicity or analysing argument structure
  • The proposed framework unifies these perspectives and adds dialogue acts such as asking questions and providing clarification
  • The framework includes a preferential ordering of rebuttal tactics
  • 213 disagreements annotated from Wikipedia Talk pages to investigate research questions
  • Models developed for multilabel prediction of dispute tactics in an utterance
  • Auxiliary task used to incorporate ordering of rebuttal tactics

Paper Content

Introduction

  • Disagreements are common in online communication
  • Debate and disagreement can lead to better supported beliefs
  • People can be biased and not consider evidence
  • NLP research has looked at detecting negative aspects of online disagreements
  • Argumentation mining looks at identifying argument structures and inferring argument quality
  • Real world disagreements contain both well-structured arguments and attacks
  • Proposed framework of dispute tactics consisting of rebuttal and coordination strategies
  • 213 disputes annotated with dispute tactics
  • Lower mean rebuttal level in a disagreement is correlated with less constructive dispute resolutions
  • People use a range of rebuttal levels more often than adhering to only one
  • Models developed to predict dispute tactics used in an utterance
  • Annotations can be used to improve predicting whether a dispute will be resolved without escalating to a moderator

Online disagreements

  • Wikipedia Talk pages are used for NLP studies and to coordinate edits and resolve disputes
  • WikiDisputes is a dataset of Talk page discussions tagged as “disputes”
  • Wikipedia recommends Graham’s hierarchy of disagreement as a guide for constructive dispute resolution
  • Graham’s hierarchy has 7 levels of disagreement, ranging from name calling to refuting the central point
  • Tang and Wang used this taxonomy to analyse the rationality of online discussions
  • Walker et al. introduced the Internet Arguments Corpus
  • Benesch et al. proposed a taxonomy of counterspeech related to Graham’s hierarchy
  • Lukin et al. and Habernal and Gurevych studied argument quality

Disagreement annotation schema

  • Distinguish between two categories of dispute tactics: rebuttal and coordination
  • Expand Graham’s hierarchy of rebuttal tactics to include “repeated argument” and “attempted derailing or off-topic comments”
  • Draw on previous work on Wikipedia Talk page discussions for coordination tactics
  • Annotate for “opening statement”, negotiating compromises, and “retreat” moves

Data annotation

  • 213 disputes were annotated
  • Median conversation length is 21 utterances
  • Average utterance length is 54 tokens
  • Average number of speakers is 4
  • Agreement scores improved after each pilot round

Analysing dispute tactics

  • Does mean rebuttal level correlate with escalation?
  • What tactics co-occur with personal attacks?

Correlation with escalation

  • Investigated how informative dispute tactics are in predicting outcome of dispute
  • Calculated Spearman correlation between mean observed rebuttal level and escalation tags
  • Used micro-averaged mean over all rebuttal scores assigned during conversation
  • Weak negative correlation (ρ = −0.19, P = 0.005): disputes with a higher mean rebuttal level are escalated less often
  • Macro-averaged mean yielded similar results (ρ = −0.24, P = 0.001)
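
The correlation analysis above can be sketched in a few lines. This is a minimal illustration on made-up data, not the paper's code: the toy conversations, the 0/1 escalation tags, and the exact definition of the macro-averaged mean (here taken to be the mean of per-utterance means) are assumptions. It computes Spearman's ρ only and does not compute a p-value.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def micro_mean(conversation):
    """Mean over every rebuttal level assigned anywhere in the conversation."""
    return mean(level for utterance in conversation for level in utterance)

def macro_mean(conversation):
    """Mean of per-utterance means (assumed definition of the macro average)."""
    return mean(mean(utterance) for utterance in conversation if utterance)

# Toy data: each conversation is a list of utterances, and each utterance
# is a list of the rebuttal levels assigned to it (hypothetical values).
conversations = [
    [[5], [6, 4], [5]],
    [[1], [0, 3], [2]],
    [[4], [4, 2], [3]],
    [[1, 5], [2], [0]],
]
escalated = [0, 1, 0, 1]  # 1 = dispute was escalated (hypothetical tags)

rho_micro = spearman([micro_mean(c) for c in conversations], escalated)
rho_macro = spearman([macro_mean(c) for c in conversations], escalated)
print(f"micro: {rho_micro:.3f}, macro: {rho_macro:.3f}")
```

A negative ρ on this toy data has the same reading as in the summary: conversations with higher mean rebuttal levels are the ones less likely to carry an escalation tag.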

The context of personal attacks

  • There are two types of personal attacks: name calling and attacks to credibility.
  • Level 1 attacks (attacks to credibility) are more common than level 0 attacks (name calling).
  • The most common multilabel combination is credibility attacks with counterargument.
  • Pointwise mutual information (PMI) was used to evaluate how much more than chance the classes co-occur.
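
As a rough illustration of the PMI measure mentioned above, the sketch below estimates label probabilities from utterance-level labelsets and compares the joint probability of two labels with what independence would predict. The label names and the toy labelsets are hypothetical, and the base-2 logarithm is an assumption; the summary does not state which base was used.

```python
from math import log2

def pmi(labelsets, a, b):
    """PMI(a, b) = log2( p(a, b) / (p(a) * p(b)) ), estimated over utterances."""
    n = len(labelsets)
    p_a = sum(a in s for s in labelsets) / n
    p_b = sum(b in s for s in labelsets) / n
    p_ab = sum(a in s and b in s for s in labelsets) / n
    if p_ab == 0:
        return float("-inf")  # the two labels never co-occur
    return log2(p_ab / (p_a * p_b))

# Toy annotations: one set of tactic labels per utterance (hypothetical).
labelsets = [
    {"credibility_attack", "counterargument"},
    {"credibility_attack", "counterargument"},
    {"counterargument"},
    {"name_calling"},
    {"coordinating_edits"},
    {"coordinating_edits"},
    {"counterargument", "coordinating_edits"},
    {"name_calling", "credibility_attack"},
]

score = pmi(labelsets, "credibility_attack", "counterargument")
print(f"PMI = {score:.3f}")
```

A positive value means the pair co-occurs more often than chance would predict, zero means the labels look independent, and a negative value means they co-occur less often than chance.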

The effects of personal attacks

  • Personal attacks can co-occur with higher level rebuttals.
  • Half of disputes recover after a personal attack.
  • Personal attacks are more frequent in escalated conversations.
  • 25.7% of cases involve immediate retaliation.
  • 53% probability of initial offender re-offending.
  • 64% probability of another user using a personal attack.

Individual user rebuttal levels

  • 734 individuals contributed utterances to WikiTactics
  • Median difference between highest and lowest rebuttal level employed by users with more than 1 utterance is 4
  • 167 users only used levels 4 and higher, 18 used only levels 3 and below
  • 66 users participated in more than one dispute; 57% of these cases showed positive mirroring behaviour
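
The per-user range statistic above (the difference between the highest and lowest rebuttal level a user employs) could be computed along these lines. This is a simplified sketch: the user names and levels are made up, and each utterance is assumed to carry a single rebuttal level, which is not true of multilabel data in general.

```python
from collections import defaultdict
from statistics import median

def rebuttal_ranges(utterances):
    """Map each user with more than one utterance to max level minus min level."""
    by_user = defaultdict(list)
    for user, level in utterances:
        by_user[user].append(level)
    return {
        user: max(levels) - min(levels)
        for user, levels in by_user.items()
        if len(levels) > 1
    }

# Toy data: (user, rebuttal level) pairs (hypothetical).
utterances = [("A", 1), ("A", 5), ("B", 4), ("B", 6), ("C", 3)]

ranges = rebuttal_ranges(utterances)
median_range = median(ranges.values())
print(ranges, median_range)
```

User C is excluded because a single utterance cannot show any variation in rebuttal level.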

Multilabel classification

  • Multilabel classification task: predicting dispute tactics used in an utterance
  • Vector representation for labelset: each label is either relevant or not
  • Binary relevance (BR) classification: set of L binary classifiers to predict whether label applies or not
  • Label powerset (LP) method: multiclass classification problem over powerset of all possible label combinations
  • Deep learning: neural network model to directly predict labelset vectors, sigmoid function to determine relevance of each output
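
The three formulations above differ mainly in how the targets are encoded. The sketch below shows the encodings only, not any model: the three label names are a hypothetical subset of the schema, and the 0.5 sigmoid threshold is an assumption rather than a setting taken from the paper.

```python
from math import exp

# Hypothetical subset of the tactic labels, for illustration only.
LABELS = ["counterargument", "credibility_attack", "asking_questions"]

def to_vector(labelset):
    """Labelset vector: 1 if the label is relevant to the utterance, else 0."""
    return [1 if label in labelset else 0 for label in LABELS]

def binary_relevance_targets(labelsets):
    """Binary relevance: one binary target column per label (L classifiers)."""
    vectors = [to_vector(s) for s in labelsets]
    return {label: [v[i] for v in vectors] for i, label in enumerate(LABELS)}

def label_powerset_targets(labelsets):
    """Label powerset: each distinct label combination becomes one class id."""
    class_ids = {}
    targets = []
    for s in labelsets:
        key = tuple(to_vector(s))
        class_ids.setdefault(key, len(class_ids))
        targets.append(class_ids[key])
    return targets, class_ids

def sigmoid_decode(logits, threshold=0.5):
    """Neural output layer: one sigmoid per label, thresholded independently."""
    return [1 if 1 / (1 + exp(-z)) >= threshold else 0 for z in logits]

samples = [
    {"counterargument"},
    {"counterargument", "asking_questions"},
    {"counterargument"},
    {"credibility_attack", "counterargument"},
]

br_targets = binary_relevance_targets(samples)
lp_targets, lp_classes = label_powerset_targets(samples)
print(br_targets)
print(lp_targets)
print(sigmoid_decode([2.0, -1.0, 0.0]))
```

Binary relevance treats the labels as independent, while the label powerset can capture label co-occurrence at the cost of a class count that grows with the number of distinct combinations observed.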

Metrics

  • Simple accuracy measures fraction of samples with full labelset predicted correctly
  • Hamming loss measures fraction of incorrectly assigned labels
  • Jaccard score measures the overlap between predicted and true labelsets: the size of their intersection divided by the size of their union
  • Evaluation reports all three metrics to capture different perspectives, but prioritises Jaccard score
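
To make the difference between the three metrics concrete, here is a small sketch over binary label vectors. The toy predictions are made up, and two details are assumptions: the Jaccard score is averaged per sample, and a sample whose true and predicted labelsets are both empty is scored as 1.0.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full labelset is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def hamming_loss(y_true, y_pred):
    """Fraction of individual label decisions that are wrong."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))

def jaccard_score(y_true, y_pred):
    """Per-sample intersection over union of the positive labels, averaged."""
    scores = []
    for t, p in zip(y_true, y_pred):
        inter = sum(1 for a, b in zip(t, p) if a and b)
        union = sum(1 for a, b in zip(t, p) if a or b)
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)

# Toy labelset vectors over three labels (hypothetical).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0]]

print(subset_accuracy(y_true, y_pred))  # 1 of 3 samples matches exactly
print(hamming_loss(y_true, y_pred))     # 2 of 9 label decisions are wrong
print(jaccard_score(y_true, y_pred))    # mean of 1.0, 0.5 and 0.5
```

On this example the second and third samples each get one label wrong, so subset accuracy counts them as complete misses while the Jaccard score still gives partial credit.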

Experimental setting

  • Truncated LP used to reduce number of labels considered
  • 85% of samples covered by top 20 labelsets
  • 175 samples ignored during training
  • Predictions cast back to multilabel setting for evaluation
  • Context-agnostic and context-aware models used
  • Auxiliary task to predict direction of rebuttal level
  • Models evaluated: BoW, LSTM, BERT
  • Data split into train, test and validation sets
  • Dropout and Adam optimiser used
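
The truncated label powerset step listed above can be sketched as follows: keep only the most frequent labelsets as classes, drop training samples whose labelset falls outside that set, and map predicted classes back to label vectors for evaluation. The function names, toy labels, and `top_k=2` are illustrative assumptions; the summary reports the top 20 labelsets being used.

```python
from collections import Counter

def truncate_labelsets(labelsets, top_k):
    """Keep the top_k most frequent labelsets as classes.

    Returns the labelset-to-class mapping, the (sample index, class id)
    pairs kept for training, and the fraction of samples covered.
    """
    keys = [frozenset(s) for s in labelsets]
    counts = Counter(keys)
    kept = [key for key, _ in counts.most_common(top_k)]
    class_ids = {key: i for i, key in enumerate(kept)}
    train = [(i, class_ids[k]) for i, k in enumerate(keys) if k in class_ids]
    coverage = len(train) / len(keys)
    return class_ids, train, coverage

def to_multilabel(class_id, class_ids, labels):
    """Cast a predicted class back to a binary label vector for evaluation."""
    inverse = {i: key for key, i in class_ids.items()}
    return [1 if label in inverse[class_id] else 0 for label in labels]

# Toy data with hypothetical labels "a", "b" and "c".
labelsets = [{"a"}, {"a"}, {"a"}, {"a", "b"}, {"a", "b"}, {"c"}]

class_ids, train, coverage = truncate_labelsets(labelsets, top_k=2)
ignored = len(labelsets) - len(train)
print(coverage, ignored)
print(to_multilabel(1, class_ids, ["a", "b", "c"]))
```

Casting predictions back to label vectors lets the truncated label powerset model be scored with the same multilabel metrics as the other formulations.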

Results

  • LSTM and BERT models outperform BoW models
  • Adding conversation context improves performance
  • Best performing model is BERT with context
  • Truncated LP method outperforms binary relevance formulation
  • Most correctly predicted label is coordinating edits
  • Refutation and refuting the central point are never correctly predicted

Predicting escalation

  • Dispute tactic annotations can provide useful learning signals for predicting escalation.
  • Multitask training is used to incorporate features that are predictive of dispute tactics.
  • A hierarchical attention network (HAN) achieved the best results on this task.
  • Modified context-aware LSTM model used to predict escalation.
  • PR-AUC score of 0.487 obtained, indicating knowledge of dispute tactics is useful for tasks beyond classifying tactics employed.
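
For readers unfamiliar with PR-AUC, the sketch below computes average precision, one common way to summarise the area under a precision-recall curve. Whether the paper used this exact estimator is an assumption, and the scores and labels are made up. Tied scores are not handled specially here.

```python
def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    positives = sum(y_true)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this recall point
    return total / positives

# Toy escalation labels (1 = escalated) and model scores (hypothetical).
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]

ap = average_precision(y_true, scores)
print(f"average precision = {ap:.3f}")
```

Unlike ROC-AUC, the chance level for this metric is not 0.5: a random ranking scores roughly the proportion of positive examples, so a PR-AUC value is best read against the escalation rate in the data.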

Conclusion

  • Introduced a framework and dataset for dispute tactics
  • Analysed how different tactics are used in disagreements
  • Examined how users alter their tactics to mirror the level of rebuttal used
  • Developed multilabel models for classifying dispute tactics
  • Knowledge of tactics improves performance on predicting escalation
  • Findings from Wikipedia Talk pages may not transfer well to other domains
  • Dataset is small due to the difficulty of the annotation task
  • Data was annotated by a single annotator, which may introduce biases
  • Only English data was used, so insights may not hold for other languages