Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Disagreements are studied from two perspectives: toxicity and argument structure
- Framework proposed unifies perspectives and includes dialogue acts such as asking questions and providing clarification
- Framework includes preferential ordering of rebuttal tactics
- 213 disagreements annotated from Wikipedia Talk pages to investigate research questions
- Models developed for multilabel prediction of dispute tactics in an utterance
- Auxiliary task used to incorporate ordering of rebuttal tactics
Paper Content
Introduction
- Disagreements are common in online communication
- Debate and disagreement can lead to better supported beliefs
- People can be biased and not consider evidence
- NLP research has looked at detecting negative aspects of online disagreements
- Argumentation mining looks at identifying argument structures and inferring argument quality
- Real world disagreements contain both well-structured arguments and attacks
- Proposed framework of dispute tactics consisting of rebuttal and coordination strategies
- 213 disputes annotated with dispute tactics
- Lower mean rebuttal level in a disagreement is correlated with less constructive dispute resolutions
- People use a range of rebuttal levels more often than adhering to only one
- Models developed to predict dispute tactics used in an utterance
- Annotations can be used to improve predicting whether a dispute will be resolved without escalating to a moderator
Online disagreements
- Wikipedia Talk pages are used for NLP studies and to coordinate edits and resolve disputes
- WikiDisputes is a dataset of Talk page discussions tagged as “disputes”
- Wikipedia recommends Graham’s hierarchy of disagreement as a guide for constructive dispute resolution
- Graham’s hierarchy has 7 levels of disagreement, ranging from namecalling to refuting the central point
- Tang and Wang used this taxonomy to analyse the rationality of online discussions
- Walker et al. introduced the Internet Arguments Corpus
- Benesch et al. proposed a taxonomy of counterspeech related to Graham’s hierarchy
- Lukin et al. and Habernal and Gurevych studied argument quality
Disagreement annotation schema
- Distinguish between two categories of dispute tactics: rebuttal and coordination
- Expand Graham’s hierarchy of rebuttal tactics to include “repeated argument” and “attempted derailing or off-topic comments”
- Draw on previous work on Wikipedia Talk page discussions for coordination tactics
- Annotate for “opening statement”, negotiating compromises, and “retreat” moves
Data annotation
- 213 disputes were annotated
- Median conversation length is 21 utterances
- Average utterance length is 54 tokens
- Average number of speakers is 4
- Agreement scores improved after each pilot round
Analysing dispute tactics
- Does mean rebuttal level correlate with escalation?
- What tactics co-occur with personal attacks?
Correlation with escalation
- Investigated how informative dispute tactics are in predicting outcome of dispute
- Calculated Spearman correlation between mean observed rebuttal level and escalation tags
- Used micro-averaged mean over all rebuttal scores assigned during conversation
- Weak negative correlation (ρ = −0.19, P = 0.005)
- Macro-averaged mean yielded similar results (ρ = −0.24, P = 0.001)
The context of personal attacks
- There are two types of personal attacks: name calling and attacks to credibility.
- Level 1 attacks are more common than level 0 attacks.
- The most common multilabel combination is credibility attacks with counterargument.
- Pointwise mutual information (PMI) was used to evaluate how much more than chance the classes co-occur.
The effects of personal attacks
- Personal attacks can co-occur with higher level rebuttals.
- Half of disputes recover after a personal attack.
- Personal attacks are more frequent in escalated conversations.
- 25.7% of cases involve immediate retaliation.
- 53% probability of initial offender re-offending.
- 64% probability of another user using a personal attack.
Individual user rebuttal levels
- 734 individuals contributed utterances to WikiTactics
- Median difference between highest and lowest rebuttal level employed by users with more than 1 utterance is 4
- 167 users only used levels 4 and higher, 18 used only levels 3 and below
- 66 users participated in more than one dispute, 57% of cases showed positive mirroring behaviour
Multilabel classification
- Multilabel classification task: predicting dispute tactics used in an utterance
- Vector representation for labelset: each label is either relevant or not
- Binary relevance (BR) classification: set of L binary classifiers to predict whether label applies or not
- Label powerset (LP) method: multiclass classification problem over powerset of all possible label combinations
- Deep learning: neural network model to directly predict labelset vectors, sigmoid function to determine relevance of each output
Metrics
- Simple accuracy measures fraction of samples with full labelset predicted correctly
- Hamming loss measures fraction of incorrectly assigned labels
- Jaccard score examines proportion of correctly predicted positive labels
- Evaluation reports all three metrics to capture different perspectives, but prioritises Jaccard score
Experimental setting
- Truncated LP used to reduce number of labels considered
- 85% of samples covered by top 20 labelsets
- 175 samples ignored during training
- Predictions cast back to multilabel setting for evaluation
- Context-agnostic and context-aware models used
- Auxiliary task to predict direction of rebuttal level
- Models evaluated: BoW, LSTM, BERT
- Data split into train, test and validation sets
- Dropout and Adam optimiser used
Results
- LSTM and BERT models outperform BoW models
- Adding conversation context improves performance
- Best performing model is BERT with context
- Truncated LP method outperforms binary relevance formulation
- Most correctly predicted label is coordinating edits
- Refutation and refuting the central point are never correctly predicted
Predicting escalation
- Dispute tactic annotations can provide useful learning signals for predicting escalation.
- Multitask training is used to incorporate features that are predictive of dispute tactics.
- HAN network achieved best results on this task.
- Modified context-aware LSTM model used to predict escalation.
- PR-AUC score of 0.487 obtained, indicating knowledge of dispute tactics is useful for tasks beyond classifying tactics employed.
Conclusion
- Introduced a framework and dataset for dispute tactics
- Analysed how different tactics are used in disagreements
- Examined how users alter their tactics to mirror the level of rebuttal used
- Developed multilabel models for classifying dispute tactics
- Knowledge of tactics increases accuracy on predicting escalation
- Talk pages may not transfer well to other domains
- Dataset is small due to difficulty of task
- Annotated by one annotator, may introduce biases
- Only English data used, insights may not hold for other languages