Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Task-oriented dialogue systems have been used to help people achieve goals
  • Systems are usually built for one language and don’t work well beyond that
  • MULTI3NLU++ is a dataset that extends NLU++ to multiple languages and domains
  • MULTI3NLU++ includes Spanish, Marathi, Turkish and Amharic
  • MULTI3NLU++ is used to benchmark language models and machine translation systems

Paper Content

Introduction

  • Task-oriented dialogue (TOD) systems are used to automate customer service tasks in domains such as travel, finance, and hotel booking
  • Natural Language Understanding module performs intent detection and slot labelling
  • Intent detection task is to identify goal of user’s utterance from pre-defined classes
  • Slot labelling task is to label each token in an utterance with a label that describes the type of semantic information
  • Existing datasets are limited to single intent, single domain, and small set of slot types
  • Existing datasets are evaluated on small set of higher-resource languages
  • Inability to handle multiple intents is a serious limitation
  • MULTI3NLU++ is a multilingual, multi-intent, multi-domain dataset for training and evaluating TOD systems
  • MULTI3NLU++ extends NLU++, a multi-intent, multi-domain dataset for BANKING and HOTELS domains
  • MULTI3NLU++ includes manual translations of 3,080 utterances in NLU++ to four languages
  • MULTI3NLU++ captures language diversity and allows for cross-domain and cross-lingual training and experimentation
  • MULTI3NLU++ enables systematic comparisons of dialogue NLU systems in few-shot setups for cross-lingual and cross-domain transfer for low-, medium-, and high-resource languages
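
To make the two NLU tasks described above concrete, here is a minimal sketch of how a single multi-intent example could be represented. The `Example` class, the BIO tagging scheme, and the label names (`booking`, `change`, `people`) are assumptions made for illustration and are not taken from the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Example:
    tokens: list[str]
    intents: set[str]     # multi-label: an utterance may carry several intents
    slot_tags: list[str]  # one tag per token (BIO scheme assumed here)


def extract_slots(tokens, tags):
    """Collect (slot_type, value) pairs from per-token BIO tags."""
    slots, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new slot starts; close any slot that was still open.
            if current_type:
                slots.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" tag (or an inconsistent "I-" tag) closes the open slot.
            if current_type:
                slots.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type:
        slots.append((current_type, " ".join(current_tokens)))
    return slots


# Hypothetical utterance with two intents and one slot.
example = Example(
    tokens=["change", "my", "booking", "to", "two", "people"],
    intents={"change", "booking"},
    slot_tags=["O", "O", "O", "O", "B-people", "O"],
)
print(extract_slots(example.tokens, example.slot_tags))
```

Storing intents as a set reflects the multi-intent property: a single-intent dataset would need only one label per utterance.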

Dataset collection

  • Selected Spanish, Marathi, Turkish, and Amharic as languages
  • Recruited professional translators to perform manual translation
  • Instructed translators to treat task as creative writing and maintain colloquial nature of utterances
  • Conducted pilot task with 50 sentences per domain
  • In-house evaluation by native speakers to verify translations
  • Automatic checker to ensure slot values present in translations
  • Data collection process took 5 months and cost £7,611
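
The automatic check mentioned above can be pictured as a simple substring test. This is only a sketch under assumptions: the function name `missing_slot_values`, the case-insensitive matching, and the example sentence are hypothetical, and the authors' actual checker may work differently.

```python
def missing_slot_values(translation: str, slot_values: dict[str, str]) -> list[str]:
    """Return the names of slots whose translated value does not appear
    verbatim (ignoring case) in the translated utterance."""
    normalized = translation.casefold()
    return [
        name
        for name, value in slot_values.items()
        if value.casefold() not in normalized
    ]


# Hypothetical Spanish translation with its annotated slot values.
translation = "Quiero reservar una habitación para dos noches"
print(missing_slot_values(translation, {"nights": "dos noches"}))
print(missing_slot_values(translation, {"date": "mañana"}))
```

A non-empty result would flag the translation for manual review, since a slot value that cannot be located in the sentence cannot be used for slot labelling.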

Baseline experiments

  • MULTI3NLU++ is a multilingual dialogue NLU dataset
  • It covers intent detection and slot labelling tasks
  • It is used to provide reference points and demonstrate aspects of multilingual dialogue NLU systems
  • It is tested using N-fold cross-validation
  • It contains data for two domains: BANKING and HOTELS
  • It allows for comparison of multilingual dialogue NLU systems on languages with different amounts of resources
  • It is tested using in-language and cross-lingual setups
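
As a rough illustration of N-fold evaluation, the sketch below partitions example indices into folds and yields each fold alongside the remaining data. The helper name `n_fold_splits` and the round-robin assignment are assumptions. Which side serves as training data (the single fold, for a low-data setup, or the remaining folds) is left to the caller, because the summary does not specify it.

```python
def n_fold_splits(n_items: int, n_folds: int):
    """Yield (single_fold, remaining) index lists for each of the N folds."""
    # Round-robin assignment of item indices to folds.
    folds = [list(range(i, n_items, n_folds)) for i in range(n_folds)]
    for held in range(n_folds):
        remaining = [
            idx
            for j, fold in enumerate(folds)
            if j != held
            for idx in fold
        ]
        yield folds[held], remaining


for single_fold, remaining in n_fold_splits(n_items=10, n_folds=5):
    print(single_fold, remaining)
```

Because the translations are aligned across languages, the same index split can be reused for every language, which keeps in-language and cross-lingual runs comparable.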

Classification-based methods

  • Evaluate two standard classification approaches to intent detection
  • MLP-based with a fixed encoder and full-model fine-tuning
  • Use a fixed efficient sentence encoder to encode sentences and train only the MLP classifier
  • Use a sigmoid layer on top of the classifier
  • Threshold of 0.3
  • Evaluate two state-of-the-art multilingual sentence encoders
  • Same hyperparameters for all classification models
  • MLP-based approach works better than full finetuning
  • MLP-based approach is more parameter efficient
  • Results suggest that using a lower-resource language as the source language leads to stronger target-language results
  • This helps unearth the multilingual capabilities of the multilingual sentence encoder
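
The sigmoid-plus-threshold step listed above can be sketched as follows. Only the threshold value of 0.3 comes from the text. The function names, the example logits, and the use of `>=` rather than `>` are assumptions, and the encoder and MLP that would produce the logits are omitted.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_intents(logits: dict[str, float], threshold: float = 0.3) -> set[str]:
    """Multi-label decision: keep every intent whose independent
    sigmoid probability reaches the threshold."""
    return {label for label, logit in logits.items() if sigmoid(logit) >= threshold}


# Hypothetical per-intent logits from the classifier head.
print(predict_intents({"cancel": 2.0, "booking": -0.5, "refund": -3.0}))
```

Scoring each intent independently, instead of applying a softmax over all intents, is what allows zero, one, or several intents to be predicted for the same utterance.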

Question answering baseline

  • Implemented intent detection baselines using question-answering models
  • Formulated intent detection as an extractive question-answering task
  • “yes. no.” is prepended to the utterance, giving the input “yes. no. [UTTERANCE]”
  • Intent labels converted into questions
  • QA model must learn to predict span as “yes” or “no”
  • Fine-tuned multilingual language model with general-purpose QA dataset
  • Investigated if extractive QA can act as strong baseline for multilingual multi-label intent detection
  • Performance lower than results reported in Casanueva et al. (2022)
  • Zero-shot transfer performance lower across all languages and domains
  • Performance inversely correlated with amount of training data present
  • Using questions in target language improves performance
  • Direct transfer performance lower than MLP-based intent detection models
  • Combining translation methods with multilingual models provides best performance
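
The extractive QA formulation above can be sketched as plain string manipulation. The "yes. no." prefix comes from the text. The function names, the wording of the example question, and the character-offset span format are assumptions, and the QA model itself is not shown.

```python
PREFIX = "yes. no. "


def build_qa_input(utterance: str, intent_question: str) -> dict[str, str]:
    """Pair an intent question with a context that starts with 'yes. no.'."""
    return {"question": intent_question, "context": PREFIX + utterance}


def answer_span(intent_is_present: bool) -> tuple[int, int]:
    """Character span (start, end) of the gold answer inside the context."""
    answer = "yes" if intent_is_present else "no"
    start = PREFIX.index(answer)
    return start, start + len(answer)


# Hypothetical question derived from a "cancel" intent label.
qa_input = build_qa_input(
    "I want to cancel my card",
    "is the intent to cancel something?",
)
start, end = answer_span(intent_is_present=True)
print(qa_input["context"][start:end])
```

One such question is asked per intent label, so multi-label prediction falls out naturally: every question answered with the "yes" span contributes one predicted intent.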

Discussion and future work

  • Multilingual BERT and XLM-R have been pretrained on over 100 languages
  • Representational power is uneven for high- and low-resource languages
  • MULTI3NLU++ includes the same training and evaluation data for all languages
  • Performance increases with more training data
  • Low-resource languages have lower performance than high-resource languages
  • Cross-domain setup has lower performance than in-domain setup
  • High-resource languages benefit more from increase in training data size than low-resource languages

Conclusion

  • Collected MULTI3NLU++ dataset for multilingual, multi-label, multi-domain Natural Language Understanding
  • Dataset incorporates core properties from NLU++ (Casanueva et al., 2022)
  • Investigated properties in multilingual setting for Spanish, Marathi, Turkish, and Amharic
  • Implemented MLP-based and QA-based intent detection baselines
  • Performance drops significantly across all languages compared to NLU++
  • Zero-shot performance improves when source language has lower resources
  • Multilingual QA models rely on shallower heuristics than cross-lingual language understanding
  • Dataset covers diverse languages to help improve access to conversational language technologies
  • Utterances extracted from real dialogues and synthetic human-authored utterances
  • Dataset cannot be used to evaluate systems with respect to discourse-level phenomena
  • Evaluating modern automatic machine translation systems for building better multilingual chatbots
  • Sentences should be colloquial, no exact translations
  • Sentences spoken to customer service bot and hotel reception tasks
  • Maintain meaning and style as close to English text as possible
  • Maintain pronouns if present
  • Translate proper names and time values in most natural form for target language
  • Substitute concepts if no exact translation or concept is absent from culture
  • Copy corresponding phrases from Spanish sentence in slot columns
  • Do not change order of values
  • Table 6 provides details on dataset collection costs
  • Tables 7 to 12 provide full results for MLP-based baseline
  • Figures 2, 3, and 4 compare in-domain and cross-domain results