Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Disagreements are studied from two perspectives: toxicity and argument structure
Framework proposed unifies perspectives and includes dialogue acts such as asking questions and providing clarification
Framework includes preferential ordering of rebuttal tactics
213 disagreements annotated from Wikipedia Talk pages to investigate research questions
Models developed for multilabel prediction of dispute tactics in an utterance
Auxiliary task used to incorporate ordering of rebuttal tactics

Paper Content

Introduction

Disagreements are common in online communication
Debate and disagreement can lead to better supported beliefs
People can be biased and not consider evidence
NLP research has looked at detecting negative aspects of online disagreements
Argumentation mining looks at identifying argument structures and inferring argument quality
Real world disagreements contain both well-structured arguments and attacks
Proposed framework of dispute tactics consisting of rebuttal and coordination strategies
213 disputes annotated with dispute tactics
Lower mean rebuttal level in a disagreement is correlated with less constructive dispute resolutions
People use a range of rebuttal levels more often than adhering to only one
Models developed to predict dispute tactics used in an utterance
Annotations can be used to improve predicting whether a dispute will be resolved without escalating to a moderator

Online disagreements

Wikipedia Talk pages are used for NLP studies and to coordinate edits and resolve disputes
WikiDisputes is a dataset of Talk page discussions tagged as “disputes”
Wikipedia recommends Graham’s hierarchy of disagreement as a guide for constructive dispute resolution
Graham’s hierarchy has 7 levels of disagreement, ranging from name-calling (lowest) to refuting the central point (highest)
Tang and Wang used this taxonomy to analyse the rationality of online discussions
Walker et al. introduced the Internet Arguments Corpus
Benesch et al. proposed a taxonomy of counterspeech related to Graham’s hierarchy
Lukin et al. and Habernal and Gurevych studied argument quality

Disagreement annotation schema

Distinguish between two categories of dispute tactics: rebuttal and coordination
Expand Graham’s hierarchy of rebuttal tactics to include “repeated argument” and “attempted derailing or off-topic comments”
Draw on previous work on Wikipedia Talk page discussions for coordination tactics
Annotate for “opening statement”, negotiating compromises, and “retreat” moves

Data annotation

213 disputes were annotated
Median conversation length is 21 utterances
Average utterance length is 54 tokens
Average number of speakers is 4
Agreement scores improved after each pilot round

Analysing dispute tactics

Does mean rebuttal level correlate with escalation?
What tactics co-occur with personal attacks?

Correlation with escalation

Investigated how informative dispute tactics are in predicting outcome of dispute
Calculated Spearman correlation between mean observed rebuttal level and escalation tags
Used micro-averaged mean over all rebuttal scores assigned during conversation
Weak negative correlation (ρ = −0.19, P = 0.005)
Macro-averaged mean yielded similar results (ρ = −0.24, P = 0.001)
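The two averaging schemes can be sketched in a few lines. This is a minimal illustration on invented toy data, not the paper's code: it assumes that "micro" pools every rebuttal level assigned in a conversation, that "macro" averages within each utterance first, and that escalation is a 0/1 tag. The helper names are hypothetical.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def micro_mean(conversation):
    # Assumption: pool every rebuttal level assigned anywhere in the conversation.
    return mean(level for utterance in conversation for level in utterance)

def macro_mean(conversation):
    # Assumption: average within each utterance first, then across utterances.
    return mean(mean(utterance) for utterance in conversation if utterance)

# Toy data: each conversation is a list of utterances, and each utterance
# is the list of rebuttal levels (0-6) assigned to it.
conversations = [
    [[6], [6, 2, 1]],
    [[1], [0, 2], [3]],
    [[4], [3, 5]],
    [[2], [1]],
]
escalated = [0, 1, 0, 1]  # invented escalation tags

rho_micro = spearman([micro_mean(c) for c in conversations], escalated)
rho_macro = spearman([macro_mean(c) for c in conversations], escalated)
print(rho_micro, rho_macro)
```

On this toy data both coefficients are negative, matching the direction (though not the magnitude) of the reported correlation.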

The context of personal attacks

There are two types of personal attacks: name-calling (level 0) and attacks on credibility (level 1).
Level 1 attacks are more common than level 0 attacks.
The most common multilabel combination is credibility attacks with counterargument.
Pointwise mutual information (PMI) was used to evaluate how much more than chance the classes co-occur.
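For reference, PMI between two labels a and b is log2(P(a, b) / (P(a) P(b))), with probabilities estimated from utterance counts. The sketch below is a minimal illustration on invented labelsets; the label names and the function `label_pmi` are hypothetical and not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations

def label_pmi(labelsets):
    """PMI (base 2) for every pair of labels that co-occurs at least once."""
    n = len(labelsets)
    single = Counter()
    pair = Counter()
    for labels in labelsets:
        unique = sorted(set(labels))
        single.update(unique)
        pair.update(combinations(unique, 2))
    pmi = {}
    for (a, b), count in pair.items():
        p_ab = count / n
        pmi[(a, b)] = math.log2(p_ab / ((single[a] / n) * (single[b] / n)))
    return pmi

# Invented toy annotations: one set of labels per utterance.
utterances = [
    {"credibility_attack", "counterargument"},
    {"credibility_attack", "counterargument"},
    {"counterargument"},
    {"coordinating_edits"},
    {"name_calling", "credibility_attack"},
    {"coordinating_edits", "counterargument"},
    {"coordinating_edits"},
    {"coordinating_edits"},
]

pmi = label_pmi(utterances)
for labels, score in sorted(pmi.items(), key=lambda item: -item[1]):
    print(labels, round(score, 3))
```

A positive score means the pair co-occurs more often than independence would predict, and a negative score means less often. Pair keys are stored in alphabetical order.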

The effects of personal attacks

Personal attacks can co-occur with higher level rebuttals.
Half of disputes recover after a personal attack.
Personal attacks are more frequent in escalated conversations.
25.7% of cases involve immediate retaliation.
53% probability of initial offender re-offending.
64% probability of another user using a personal attack.

Individual user rebuttal levels

734 individuals contributed utterances to WikiTactics
Median difference between highest and lowest rebuttal level employed by users with more than 1 utterance is 4
167 users used only levels 4 and higher; 18 used only levels 3 and below
66 users participated in more than one dispute; 57% of cases showed positive mirroring behaviour

Multilabel classification

Multilabel classification task: predicting dispute tactics used in an utterance
Vector representation for labelset: each label is either relevant or not
Binary relevance (BR) classification: set of L binary classifiers to predict whether label applies or not
Label powerset (LP) method: multiclass classification problem over powerset of all possible label combinations
Deep learning: neural network model to directly predict labelset vectors, sigmoid function to determine relevance of each output
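The three formulations differ mainly in how the targets are encoded. The following is a minimal sketch of those encodings, with no model training. The label inventory is a hypothetical subset, and the 0.5 sigmoid threshold is an assumption, since the summary does not state one.

```python
import math

# Hypothetical subset of the tactic labels, for illustration only.
LABELS = ["name_calling", "credibility_attack", "counterargument", "coordinating_edits"]

def to_vector(labelset, labels=LABELS):
    """Labelset vector: 1 if the label is relevant, 0 otherwise."""
    return [1 if label in labelset else 0 for label in labels]

def binary_relevance_targets(labelsets, labels=LABELS):
    """BR: one binary target column per label, each fed to its own classifier."""
    vectors = [to_vector(s, labels) for s in labelsets]
    return {label: [v[i] for v in vectors] for i, label in enumerate(labels)}

def label_powerset_targets(labelsets, labels=LABELS):
    """LP: each distinct label combination becomes a single class id."""
    class_ids = {}
    targets = []
    for s in labelsets:
        key = tuple(to_vector(s, labels))
        targets.append(class_ids.setdefault(key, len(class_ids)))
    return targets, class_ids

def sigmoid_decode(logits, threshold=0.5):
    """Neural output: one sigmoid per label; threshold 0.5 is an assumption."""
    return [1 if 1 / (1 + math.exp(-z)) >= threshold else 0 for z in logits]

labelsets = [
    {"credibility_attack", "counterargument"},
    {"coordinating_edits"},
    {"counterargument", "credibility_attack"},
    {"counterargument"},
]

br = binary_relevance_targets(labelsets)
lp_targets, lp_classes = label_powerset_targets(labelsets)
print(br["counterargument"])                   # one column for one BR classifier
print(lp_targets)                              # one class id per utterance
print(sigmoid_decode([2.0, -1.0, 0.3, -3.0]))  # predicted labelset vector
```

BR treats labels independently, whereas LP can capture label co-occurrence at the cost of a class space that grows with the number of observed combinations.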

Metrics

Simple accuracy measures fraction of samples with full labelset predicted correctly
Hamming loss measures fraction of incorrectly assigned labels
Jaccard score measures the overlap between predicted and true positive labels (size of the intersection divided by size of the union)
Evaluation reports all three metrics to capture different perspectives, but prioritises Jaccard score
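The three metrics can be written out directly on binary labelset vectors. This is a minimal sketch assuming a sample-averaged Jaccard score and treating an empty union as a perfect match; the summary does not say which averaging the paper uses.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full labelset is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def hamming_loss(y_true, y_pred):
    """Fraction of individual label decisions that are wrong."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))

def jaccard_score(y_true, y_pred):
    """Mean over samples of |intersection| / |union| of the positive labels."""
    scores = []
    for t, p in zip(y_true, y_pred):
        inter = sum(1 for a, b in zip(t, p) if a and b)
        union = sum(1 for a, b in zip(t, p) if a or b)
        # Assumption: two empty labelsets count as a perfect match.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)

# Invented toy predictions over four labels.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 0, 0]]

print(subset_accuracy(y_true, y_pred))
print(hamming_loss(y_true, y_pred))
print(jaccard_score(y_true, y_pred))
```

In this toy case only one of three samples is exactly right, yet few individual labels are wrong. That contrast shows why exact-match accuracy alone is a harsh measure for multilabel output and why partial-credit metrics are reported alongside it.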

Experimental setting

Truncated LP used to reduce number of labels considered
85% of samples covered by top 20 labelsets
175 samples ignored during training
Predictions cast back to multilabel setting for evaluation
Context-agnostic and context-aware models used
Auxiliary task to predict direction of rebuttal level
Models evaluated: BoW, LSTM, BERT
Data split into train, test and validation sets
Dropout and Adam optimiser used
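The truncation step can be sketched as follows: keep only the most frequent labelsets as classes, drop training samples whose labelset falls outside them, and map a predicted class back to a multilabel vector for evaluation. This is an illustration under assumptions. The function names and toy data are hypothetical, and `top_k=2` stands in for the 20 labelsets mentioned above.

```python
from collections import Counter

# Hypothetical subset of the tactic labels, for illustration only.
LABELS = ["name_calling", "credibility_attack", "counterargument", "coordinating_edits"]

def truncate_label_powerset(labelsets, top_k):
    """Keep the top_k most frequent labelsets as classes.

    Returns the kept labelsets, the (sample index, class id) pairs usable
    for training, and the fraction of samples covered.
    """
    counts = Counter(frozenset(s) for s in labelsets)
    kept = [s for s, _ in counts.most_common(top_k)]
    class_of = {s: i for i, s in enumerate(kept)}
    train = [
        (i, class_of[frozenset(s)])
        for i, s in enumerate(labelsets)
        if frozenset(s) in class_of
    ]
    return kept, train, len(train) / len(labelsets)

def class_to_vector(class_id, kept, labels=LABELS):
    """Cast a predicted class id back to a multilabel vector."""
    return [1 if label in kept[class_id] else 0 for label in labels]

labelsets = [
    {"counterargument"},
    {"counterargument"},
    {"coordinating_edits"},
    {"counterargument", "credibility_attack"},
    {"coordinating_edits"},
    {"name_calling"},
    {"counterargument"},
]

kept, train, coverage = truncate_label_powerset(labelsets, top_k=2)
print(len(train), "of", len(labelsets), "samples kept; coverage", round(coverage, 3))
print(class_to_vector(1, kept))
```

Samples with rare labelsets are excluded from training only. Evaluation still happens in the multilabel space, so a truncated model can never fully match those rare combinations.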

Results

LSTM and BERT models outperform BoW models
Adding conversation context improves performance
Best performing model is BERT with context
Truncated LP method outperforms binary relevance formulation
Most correctly predicted label is coordinating edits
Refutation and refuting the central point are never correctly predicted

Predicting escalation

Dispute tactic annotations can provide useful learning signals for predicting escalation.
Multitask training is used to incorporate features that are predictive of dispute tactics.
HAN network achieved best results on this task.
Modified context-aware LSTM model used to predict escalation.
PR-AUC score of 0.487 obtained, indicating knowledge of dispute tactics is useful for tasks beyond classifying tactics employed.

Conclusion

Introduced a framework and dataset for dispute tactics
Analysed how different tactics are used in disagreements
Examined how users alter their tactics to mirror the level of rebuttal used
Developed multilabel models for classifying dispute tactics
Knowledge of tactics improves performance on predicting escalation
Findings from Wikipedia Talk pages may not transfer well to other domains
Dataset is small due to difficulty of the annotation task
Annotated by one annotator, which may introduce biases
Only English data used, so insights may not hold for other languages
