Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Task-oriented dialogue systems have been used to help people achieve goals
Systems are usually built for one language and don’t work well beyond that
MULTI3NLU++ is a dataset that extends NLU++ to multiple languages and domains
MULTI3NLU++ includes Spanish, Marathi, Turkish and Amharic
MULTI3NLU++ is used to benchmark language models and machine translation systems

Paper Content

Introduction

Task-oriented dialogue (TOD) systems are used to automate customer service tasks in travel, finance, and hotel booking
Natural Language Understanding module performs intent detection and slot labelling
Intent detection task is to identify goal of user’s utterance from pre-defined classes
Slot labelling task is to label each token in an utterance with a label that describes the type of semantic information
Existing datasets are limited to single intent, single domain, and small set of slot types
Existing datasets are evaluated on small set of higher-resource languages
Inability to handle multiple intents is a serious limitation
MULTI3NLU++ is a multilingual, multi-intent, multi-domain dataset for training and evaluating TOD systems
MULTI3NLU++ extends NLU++, a multi-intent, multi-domain dataset for BANKING and HOTELS domains
MULTI3NLU++ includes manual translations of 3,080 utterances in NLU++ to four languages
MULTI3NLU++ captures language diversity and allows for cross-domain and cross-lingual training and experimentation
MULTI3NLU++ enables systematic comparisons of dialogue NLU systems in few-shot setups for cross-lingual and cross-domain transfer for low-, medium-, and high-resource languages
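
To make the two NLU tasks described above concrete, here is a minimal sketch of what one multi-intent example might look like and how a slot annotation maps to one label per token. The intent names, the slot name, the span format, and the "O" label for tokens outside any slot are illustrative assumptions, not the dataset's actual schema.

```python
def token_labels(tokens, slot_spans):
    """Assign one label per token from slot spans.

    slot_spans maps a slot name to (start_token, end_token_exclusive).
    Tokens outside every span get "O" (an assumed convention).
    """
    labels = ["O"] * len(tokens)
    for slot, (start, end) in slot_spans.items():
        for i in range(start, end):
            labels[i] = slot
    return labels


# Hypothetical example: one utterance carrying several intents at once.
tokens = "book a room for two nights".split()
example = {
    "tokens": tokens,
    "intents": ["book", "room"],    # multi-intent: a set of labels, not one class
    "slots": {"number": (4, 5)},    # covers the token "two"
}

labels = token_labels(example["tokens"], example["slots"])
print(list(zip(tokens, labels)))
```

Intent detection predicts the `intents` list for the whole utterance, while slot labelling predicts the per-token `labels` sequence.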

Dataset collection

Selected Spanish, Marathi, Turkish, and Amharic as languages
Recruited professional translators to perform manual translation
Instructed translators to treat task as creative writing and maintain colloquial nature of utterances
Conducted pilot task with 50 sentences per domain
In-house evaluation by native speakers to verify translations
Automatic checker to ensure slot values present in translations
Data collection process took 5 months and cost £7,611
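
The automatic checker mentioned above could work roughly as follows. This is a sketch under the assumption that each translated slot value should appear as a substring of the translated sentence; the function name, the case-insensitive matching, and the sample data are hypothetical, not the authors' tool.

```python
def missing_slot_values(translation, slot_values):
    """Return the slot names whose value does not occur in the translation.

    Matching is case-insensitive substring search (an assumption).
    """
    text = translation.casefold()
    return [
        name
        for name, value in slot_values.items()
        if value.casefold() not in text
    ]


# Hypothetical translated sentence with its annotated slot values.
translation = "Quiero reservar una habitación para el 5 de mayo"
slots = {"date": "5 de mayo", "time": "a las 3"}

# The "time" value is not in the sentence, so it would be flagged for review.
print(missing_slot_values(translation, slots))
```

A flagged slot name would signal that the translator either dropped the value or wrote it differently in the slot column than in the sentence.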

Baseline experiments

MULTI3NLU++ is a multilingual dialogue NLU dataset
It covers intent detection and slot labelling tasks
Baselines provide reference points and demonstrate aspects of multilingual dialogue NLU systems
Models are evaluated using N-fold cross-validation
The dataset contains data for two domains: BANKING and HOTELS
It allows for comparison of multilingual dialogue NLU systems on languages with different amounts of resources
Models are evaluated in in-language and cross-lingual setups
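
As an illustration of N-fold cross-validation, the sketch below splits example indices into folds and yields one train/test split per fold. The summary does not say whether the single fold serves as the training or the test portion, so the `train_on_single_fold` flag (a hypothetical parameter) covers both readings; the round-robin fold assignment is also an assumption.

```python
def n_fold_splits(num_examples, n_folds, train_on_single_fold=False):
    """Yield (train_indices, test_indices) for each of n_folds splits."""
    # Round-robin assignment of example indices to folds.
    folds = [list(range(i, num_examples, n_folds)) for i in range(n_folds)]
    for held in range(n_folds):
        single = folds[held]
        rest = [
            idx
            for j, fold in enumerate(folds)
            if j != held
            for idx in fold
        ]
        if train_on_single_fold:
            # Low-data reading: train on one fold, test on all others.
            yield single, rest
        else:
            # Standard reading: train on all other folds, test on one.
            yield rest, single


splits = list(n_fold_splits(num_examples=10, n_folds=5))
for train, test in splits:
    print(len(train), len(test))
```

Reported scores would then be averaged over the splits.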

Classification-based methods

Evaluate two standard classification approaches to intent detection
MLP-based with a fixed encoder and full-model fine-tuning
Use a fixed efficient sentence encoder to encode sentences and train only the MLP classifier
Use a sigmoid layer on top of the classifier
Threshold of 0.3
Evaluate two state-of-the-art multilingual sentence encoders
Same hyperparameters for all classification models
MLP-based approach works better than full-model fine-tuning
MLP-based approach is more parameter efficient
Results demonstrate that using a low-resource language as the source language leads to stronger target-language results
Unearthing the multilingual sentence encoder’s multilingual capabilities
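
The sigmoid layer and the 0.3 threshold described above turn the classifier into a multi-label predictor: every intent is scored independently, and all intents above the threshold are returned. A minimal sketch, assuming the classifier outputs one logit per intent; the intent names and logit values are made up, and treating a score exactly equal to the threshold as positive is an assumption.

```python
import math

THRESHOLD = 0.3  # threshold stated in the text


def sigmoid(x):
    """Map a logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_intents(logits, threshold=THRESHOLD):
    """Return every intent whose sigmoid score reaches the threshold."""
    return [
        label
        for label, logit in logits.items()
        if sigmoid(logit) >= threshold
    ]


# Hypothetical classifier outputs for one utterance.
logits = {"change": 2.0, "booking": -0.5, "cancel": -3.0}
predicted = predict_intents(logits)
print(predicted)
```

Unlike a softmax, the sigmoid scores do not compete with each other, so zero, one, or several intents can be predicted for the same utterance.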

Question answering baseline

Implemented intent detection baselines using question-answering models
Formulated intent detection as an extractive question-answering task
Utterance is prepended with “yes. no.”, giving “yes. no. [UTTERANCE]”
Intent labels converted into questions
QA model must learn to predict span as “yes” or “no”
Fine-tuned multilingual language model with general-purpose QA dataset
Investigated if extractive QA can act as strong baseline for multilingual multi-label intent detection
Performance lower than results reported in Casanueva et al. (2022)
Zero-shot transfer performance lower across all languages and domains
Performance inversely correlated with amount of training data present
Using questions in target language improves performance
Direct transfer performance lower than MLP-based intent detection models
Combining translation methods with multilingual models provides best performance
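
The extractive QA formulation can be sketched as follows: the utterance is prefixed with "yes. no." so that an extractive model can answer by selecting one of those two spans. The question wording, the function names, and the dictionary layout are hypothetical; only the "yes. no. [UTTERANCE]" context format comes from the text.

```python
PREFIX = "yes. no. "
YES_SPAN = (0, 3)  # character offsets of "yes" in the prefix
NO_SPAN = (5, 7)   # character offsets of "no" in the prefix


def build_qa_input(utterance, intent_question):
    """Build one QA instance asking whether an intent is present."""
    return {"question": intent_question, "context": PREFIX + utterance}


def gold_span(intent_present):
    """Character span the QA model should extract for this instance."""
    return YES_SPAN if intent_present else NO_SPAN


# Hypothetical utterance and intent question.
qa = build_qa_input(
    "I want to cancel my booking",
    "is the intent to cancel something?",
)
start, end = gold_span(intent_present=True)
print(qa["context"][start:end])
```

One such instance is built per intent label, so a single utterance yields as many question-context pairs as there are intents, and the intents answered "yes" form the multi-label prediction.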

Discussion and future work

Multilingual BERT and XLM-R have been pretrained on over 100 languages
Representational power is uneven for high- and low-resource languages
MULTI3NLU++ includes the same training and evaluation data for all languages
Performance increases with more training data
Low-resource languages have lower performance than high-resource languages
Cross-domain setup has lower performance than in-domain setup
High-resource languages benefit more from increase in training data size than low-resource languages

Conclusion

Collected MULTI3NLU++ dataset for multilingual, multi-label, multi-domain Natural Language Understanding
Dataset incorporates core properties from NLU++ (Casanueva et al., 2022)
Investigated properties in multilingual setting for Spanish, Marathi, Turkish, and Amharic
Implemented MLP-based and QA-based intent detection baselines
Performance drops significantly across all languages compared to NLU++
Zero-shot performance improves when source language has lower resources
Multilingual QA models rely on shallower heuristics than cross-lingual language understanding
Dataset covers diverse languages to help improve access to conversational language technologies
Utterances extracted from real dialogues and synthetic human-authored utterances
Dataset cannot be used to evaluate systems with respect to discourse-level phenomena
Evaluating modern automatic machine translation systems for building better multilingual chatbots
Sentences should be colloquial, no exact translations
Sentences spoken to customer service bot and hotel reception tasks
Maintain meaning and style as close to English text as possible
Maintain pronouns if present
Translate proper names and time values in most natural form for target language
Substitute concepts if no exact translation or concept is absent from culture
Copy corresponding phrases from Spanish sentence in slot columns
Do not change order of values
Table 6 provides details on dataset collection costs
Tables 7 to 12 provide full results for MLP-based baseline
Figures 2, 3, and 4 compare in-domain and cross-domain results
