Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Task-oriented dialogue systems have been used to help people achieve goals
- Systems are usually built for one language and don’t work well beyond that
- MULTI3NLU++ is a dataset that extends NLU++ to multiple languages and domains
- MULTI3NLU++ includes Spanish, Marathi, Turkish and Amharic
- MULTI3NLU++ is used to benchmark language models and machine translation systems
Paper Content
Introduction
- Task-oriented dialogue systems used to automate customer service tasks in travel, finance, and hotel booking
- Natural Language Understanding module performs intent detection and slot labelling
- Intent detection task is to identify goal of user’s utterance from pre-defined classes
- Slot labelling task is to label each token in an utterance with a label that describes the type of semantic information
- Existing datasets are limited to single intent, single domain, and small set of slot types
- Existing datasets are evaluated on small set of higher-resource languages
- Inability to handle multiple intents is a serious limitation
- MULTI 3 NLU ++ is a multilingual, multi-intent, multi-domain dataset for training and evaluating TOD systems
- MULTI 3 NLU ++ extends NLU++, a multi-intent, multi-domain dataset for BANKING and HOTELS domains
- MULTI 3 NLU ++ includes manual translations of 3,080 utterances in NLU++ to four languages
- MULTI 3 NLU ++ captures language diversity and allows for cross-domain and cross-lingual training and experimentation
- MULTI 3 NLU ++ enables systematic comparisons of dialogue NLU systems in few-shot setups for cross-lingual and cross-domain transfer for low-, medium-and high-resource languages
Dataset collection
- Selected Spanish, Marathi, Turkish, and Amharic as languages
- Recruited professional translators to perform manual translation
- Instructed translators to treat task as creative writing and maintain colloquial nature of utterances
- Conducted pilot task with 50 sentences per domain
- In-house evaluation by native speakers to verify translations
- Automatic checker to ensure slot values present in translations
- Data collection process took 5 months and cost £7,611
Baseline experiments
- MULTI 3 NLU ++ is a multilingual dialogue NLU dataset
- It covers intent detection and slot labelling tasks
- It is used to provide reference points and demonstrate aspects of multilingual dialogue NLU systems
- It is tested using N-fold cross-validation
- It contains data for two domains: BANKING and HOTELS
- It allows for comparison of multilingual dialogue NLU systems on languages with different amounts of resources
- It is tested using in-language and cross-lingual setups
Classification-based methods
- Evaluate two standard classification approaches to intent detection
- MLP-based with a fixed encoder and full-model fine-tuning
- Use a fixed efficient sentence encoder to encode sentences and train only the MLP classifier
- Use a sigmoid layer on top of the classifier
- Threshold of 0.3
- Evaluate two state-of-the-art multilingual sentence encoders
- Same hyperparameters for all classification models
- MLP-based approach works better than full finetuning
- MLP-based approach is more parameter efficient
- Results demonstrate that low-resource language as source language leads to stronger target language results
- Unearthing the multilingual sentence encoder’s multi-lingual capabilities
Question answering baseline
- Implemented intent detection baselines using question-answering models
- Formulated intent detection as an extractive question-answering task
- Utterance is appended with “yes. no. [UTTER-ANCE]”
- Intent labels converted into questions
- QA model must learn to predict span as “yes” or “no”
- Fine-tuned multilingual language model with general-purpose QA dataset
- Investigated if extractive QA can act as strong baseline for multilingual multi-label intent detection
- Performance lower than results reported in Casanueva et al. (2022)
- Zero-shot transfer performance lower across all languages and domains
- Performance inversely correlated with amount of training data present
- Using questions in target language improves performance
- Direct transfer performance lower than MLP-based intent detection models
- Combining translation methods with multilingual models provides best performance
Discussion and future work
- Multilingual BERT and XLM-R have been pretrained on over 100 languages
- Representational power is uneven for high- and low-resource languages
- MULTI 3 NLU ++ includes same training and evaluation data for all languages
- Performance increases with more training data
- Low-resource languages have lower performance than high-resource languages
- Cross-domain setup has lower performance than in-domain setup
- High-resource languages benefit more from increase in training data size than low-resource languages
Conclusion
- Collected MULTI 3 NLU ++ dataset for multilingual, multi-label, multi-domain Natural Language Understanding
- Dataset incorporates core properties from NLU++ (Casanueva et al., 2022)
- Investigated properties in multilingual setting for Spanish, Marathi, Turkish, and Amharic
- Implemented MLP-based and QA-based intent detection baselines
- Performance drops significantly across all languages compared to NLU++
- Zero-shot performance improves when source language has lower resources
- Multilingual QA models rely on shallower heuristics than cross-lingual language understanding
- Dataset covers diverse languages to help improve access to conversational language technologies
- Utterances extracted from real dialogues and synthetic human-authored utterances
- Dataset cannot be used to evaluate systems with respect to discourse-level phenomena
- Evaluating modern automatic machine translation systems for building better multilingual chatbots
- Sentences should be colloquial, no exact translations
- Sentences spoken to customer service bot and hotel reception tasks
- Maintain meaning and style as close to English text as possible
- Maintain pronouns if present
- Translate proper names and time values in most natural form for target language
- Substitute concepts if no exact translation or concept is absent from culture
- Copy corresponding phrases from Spanish sentence in slot columns
- Do not change order of values
- Table 6 provides details on dataset collection costs
- Tables 7 to 12 provide full results for MLP-based baseline
- Figures 2, 3, and 4 compare in-domain and cross-domain results