Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Present SODA: a million-scale high-quality social dialogue dataset
- Train COSMO: a generalizable conversation agent
- Dialogues in SODA are more consistent, specific, and natural than prior datasets
- COSMO is more natural and consistent than best-performing dialogue models
- Data, models, and code are made public
Paper Content
Introduction
- Progress on open-domain social dialogue agents has been hindered by lack of diversity, scale, and quality of training corpora
- Most dialogue agents are trained on large amounts of unfiltered conversations or highly curated/specialized crowdsourced dialogues
- Issues of unnaturalness, toxicity, incoherence, blandness, and lack of commonsense remain
- Introduce SODA, a million-scale dialogue dataset covering a wide variety of social interactions
- SODA is the largest publicly available open-domain conversation dataset
- Human evaluation shows SODA surpasses existing human-authored dialogue corpora
- Proposed CO 3 framework for distilling conversations from large pre-trained language models
- CO 3 adds context information to social commonsense knowledge step-by-step
- COSMO conversation model trained on SODA outperforms existing dialogue models
Background
- Conversation is a form of social interaction
- Narratives and scripts are abstracted from social experiences
- Social experiences form our knowledge for explaining everyday events and inferring the mental states of others
- Attribution in social psychology has been studied in NLP as social commonsense
Commonsense knowledge graph
- Start with a commonsense knowledge graph
- Represented by symbolic triples
- Use Atomic 10x as knowledge graph
- Retrieve triples with social commonsense relations
- Prompt PLM to rewrite commonsense into narrative
- PLMs known for writing capabilities, especially in narratives
From narrative to conversation
- Inferring who is speaking in the dialogue is easier when the narrative contains person variables.
- For narratives with only one person, the PLM is prompted to predict the other interlocutor.
- The PLM is prompted to generate a likely conversation happening in the scene.
Validating narratives and conversations with commonsense knowledge
- Automatic validation of generated dialogues to check if seed commonsense triple is implied
- Validation done by asking PLM two questions: is head of triple represented and are relation and tail?
- Ranking of choices done using conditional pointwise mutual information
- Collected SODA dataset, a large-scale high-quality conversation dataset
Postprocessing the conversations
- Initial set of 2.2 million conversations distilled from Instruct-GPT
- Lexical pattern matching to filter out conversations with erroneous patterns
- Remove conversations with less than four turns or more than twenty turns
- Remove conversations with more than two participants
- Remove conversations with non-human speakers
- Safety filters to avoid dangerous and harmful contents
- Automatic evaluation to identify conversations with seed commonsense knowledge
- Human evaluation with 100 randomly sampled narrative-conversation pairs
- Replace all names with Top-10K names of US SSN applicants
- Head-to-head human evaluations with DailyDialog and BlendedSkillTalk
- Collecting SODA via contextualization framework is cost and time efficient
Contextualization is important
- Isolating the effect of contextualization vs. sampling from a large PLM
- Comparing SODA with dialogues generated by naively sampling from InstructGPT without context
- Human evaluators prefer context-grounded conversations significantly more than ones generated without context
- Conversations sampled without context are less specific, less interesting, and have lower lexical diversity
- Utilizing SODA to train COSMO, a conversation model that can converse in a wide range of social situations
- Training COSMO with structured components of SODA
- Including Prosocial-Dialog for additional training data
- Building COSMO on top of LM-adapted T5
- Comparing COSMO to other conversational agents on social conversation datasets
- Relying on human evaluation for dialogue responses
- Comparing COSMO with Di-aloGPT, BlenderBot, GODEL, Instruct-GPT, and ChatGPT
- Evaluating head-to-head comparison between two responses with four criteria: naturalness, consistency, specificity, and overall
Out-of-domain setting
- Evaluated models on unseen dialogue dataset, DailyDialog
- COSMO outperforms other models with significant margin
- COSMO trained on smaller amount of data
- Human judges prefer COSMO’s responses over original gold responses
One-sided out-of-domain setting
- COSMO outperforms BlenderBot on BlendedSkillTalk (BST)
- BlenderBot shows low performance on SODA
In-domain setting
- COSMO’s responses are more specific and favored than its teacher model
- COSMO is on par with ChatGPT in terms of overall quality
- ChatGPT’s responses are more specific but lack naturalness
- SODA and COSMO focus on modeling natural conversations, not knowledge-enhanced responses
Related work
- Existing dialogue datasets come from online learning websites, movie/drama scripts, crowdsourcing, and noisy web conversations.
- Augmented dialogue datasets use PLMs and additional annotations.
- SODA contributes to existing corpora with improved scale, quality, and contextualization.
Conclusion
- Presented SODA, a million-scale dialogue dataset
- Dataset is larger and better than existing datasets
- Trained COSMO, a conversation model that generalizes better
- Aim to alleviate data scarcity in dialogue field
- Precaution taken to vet safety of distilled conversations
- Manual validation of commonsense and human evaluation
- Limitations of current dataset and future work
- Intention of work is to build better assistive technologies
- Need for improved regulations on use and misuse of conversational AI
- Results of head-to-head comparison between SODA and other datasets
- Statistics of SODA compared to other large-scale dialogue datasets