Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Present SODA: a million-scale high-quality social dialogue dataset
  • Train COSMO: a generalizable conversation agent
  • Dialogues in SODA are more consistent, specific, and natural than prior datasets
  • COSMO is more natural and consistent than best-performing dialogue models
  • Data, models, and code are made public

Paper Content

Introduction

  • Progress on open-domain social dialogue agents has been hindered by lack of diversity, scale, and quality of training corpora
  • Most dialogue agents are trained on large amounts of unfiltered conversations or highly curated/specialized crowdsourced dialogues
  • Issues of unnaturalness, toxicity, incoherence, blandness, and lack of commonsense remain
  • Introduce SODA, a million-scale dialogue dataset covering a wide variety of social interactions
  • SODA is the largest publicly available open-domain conversation dataset
  • Human evaluation shows SODA surpasses existing human-authored dialogue corpora
  • Proposed CO 3 framework for distilling conversations from large pre-trained language models
  • CO 3 adds context information to social commonsense knowledge step-by-step
  • COSMO conversation model trained on SODA outperforms existing dialogue models

Background

  • Conversation is a form of social interaction
  • Narratives and scripts are abstracted from social experiences
  • Social experiences form our knowledge for explaining everyday events and inferring the mental states of others
  • Attribution in social psychology has been studied in NLP as social commonsense

Commonsense knowledge graph

  • Start with a commonsense knowledge graph
  • Represented by symbolic triples
  • Use Atomic 10x as knowledge graph
  • Retrieve triples with social commonsense relations
  • Prompt PLM to rewrite commonsense into narrative
  • PLMs known for writing capabilities, especially in narratives

From narrative to conversation

  • Inferring who is speaking in the dialogue is easier when the narrative contains person variables.
  • For narratives with only one person, the PLM is prompted to predict the other interlocutor.
  • The PLM is prompted to generate a likely conversation happening in the scene.

Validating narratives and conversations with commonsense knowledge

  • Automatic validation of generated dialogues to check if seed commonsense triple is implied
  • Validation done by asking PLM two questions: is head of triple represented and are relation and tail?
  • Ranking of choices done using conditional pointwise mutual information
  • Collected SODA dataset, a large-scale high-quality conversation dataset

Postprocessing the conversations

  • Initial set of 2.2 million conversations distilled from Instruct-GPT
  • Lexical pattern matching to filter out conversations with erroneous patterns
  • Remove conversations with less than four turns or more than twenty turns
  • Remove conversations with more than two participants
  • Remove conversations with non-human speakers
  • Safety filters to avoid dangerous and harmful contents
  • Automatic evaluation to identify conversations with seed commonsense knowledge
  • Human evaluation with 100 randomly sampled narrative-conversation pairs
  • Replace all names with Top-10K names of US SSN applicants
  • Head-to-head human evaluations with DailyDialog and BlendedSkillTalk
  • Collecting SODA via contextualization framework is cost and time efficient

Contextualization is important

  • Isolating the effect of contextualization vs. sampling from a large PLM
  • Comparing SODA with dialogues generated by naively sampling from InstructGPT without context
  • Human evaluators prefer context-grounded conversations significantly more than ones generated without context
  • Conversations sampled without context are less specific, less interesting, and have lower lexical diversity
  • Utilizing SODA to train COSMO, a conversation model that can converse in a wide range of social situations
  • Training COSMO with structured components of SODA
  • Including Prosocial-Dialog for additional training data
  • Building COSMO on top of LM-adapted T5
  • Comparing COSMO to other conversational agents on social conversation datasets
  • Relying on human evaluation for dialogue responses
  • Comparing COSMO with Di-aloGPT, BlenderBot, GODEL, Instruct-GPT, and ChatGPT
  • Evaluating head-to-head comparison between two responses with four criteria: naturalness, consistency, specificity, and overall

Out-of-domain setting

  • Evaluated models on unseen dialogue dataset, DailyDialog
  • COSMO outperforms other models with significant margin
  • COSMO trained on smaller amount of data
  • Human judges prefer COSMO’s responses over original gold responses

One-sided out-of-domain setting

  • COSMO outperforms BlenderBot on BlendedSkillTalk (BST)
  • BlenderBot shows low performance on SODA

In-domain setting

  • COSMO’s responses are more specific and favored than its teacher model
  • COSMO is on par with ChatGPT in terms of overall quality
  • ChatGPT’s responses are more specific but lack naturalness
  • SODA and COSMO focus on modeling natural conversations, not knowledge-enhanced responses
  • Existing dialogue datasets come from online learning websites, movie/drama scripts, crowdsourcing, and noisy web conversations.
  • Augmented dialogue datasets use PLMs and additional annotations.
  • SODA contributes to existing corpora with improved scale, quality, and contextualization.

Conclusion

  • Presented SODA, a million-scale dialogue dataset
  • Dataset is larger and better than existing datasets
  • Trained COSMO, a conversation model that generalizes better
  • Aim to alleviate data scarcity in dialogue field
  • Precaution taken to vet safety of distilled conversations
  • Manual validation of commonsense and human evaluation
  • Limitations of current dataset and future work
  • Intention of work is to build better assistive technologies
  • Need for improved regulations on use and misuse of conversational AI
  • Results of head-to-head comparison between SODA and other datasets
  • Statistics of SODA compared to other large-scale dialogue datasets