Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- NusaCrowd is a collaborative initiative to collect and unite existing resources for Indonesian languages.
- 137 datasets and 117 standardized data loaders have been brought together.
- Quality of datasets has been assessed manually and automatically.
- NusaCrowd enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and its local languages.
- NusaCrowd also brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and its local languages.
- Intended to help advance natural language processing research in under-represented languages.
Paper Content
Introduction
- Indonesia is a linguistically diverse and populous country
- There are over 700 spoken languages in Indonesia
- NLP research in Indonesian languages is restrained by various factors
- Existing NLP research mainly focuses on high-resource languages
- Indonesian NLP resources are scattered, undocumented, and not publicly available
- NusaCrowd is an open collaborative effort to gather and unify existing resources in Indonesian languages
- NusaCrowd has collected 137 datasheets with 117 standardized data loaders
- NusaNLU, NusaNLG, and NusaASR are introduced as zero-shot benchmarks
- NusaCrowd covers 100+ datasets and 200+ tasks in 19 Indonesian languages
- 200+ hours of ASR corpora are available in 10 Indonesian languages
Related work
- Lack of labeled datasets for NLP research in Indonesian languages
- Building large LMs to allow zero-shot and few-shot transfer learning
- Open source initiatives to gather available datasets and build useful models
Nusacrowd
- Overview of NusaCrowd
- Description of NusaCrowd framework
- Dataset curation process
- Summary and statistics of datasets collected in NusaCrowd
Overview of nusacrowd
- NusaCrowd is a crowdsourcing initiative to collect, open-source, and standardize access to datasets in Indonesian and its 700+ local languages
- NusaCrowd aims to address the resource limitation problem in Indonesian NLP through three solutions
- NusaCrowd promotes public data access for non-public datasets with publications
- NusaCrowd opens up access to 13 previously non-public datasets, some of which are multilingual
- NusaCrowd collects datasets in both text modality and other modalities, e.g., speech and image
- NusaCrowd does not store nor copy any of the hosted datasets, the control and ownership of the hosted datasets belong to the original owner
Nusacrowd framework
- NusaCrowd consists of two platforms: NusaCatalogue and NusaCrowd Data Hub
- The two platforms interact to support dataset registration and standardization
- NusaCatalogue stores datasheets and NusaCrowd Data Hub stores data loaders
- Dataset registration and standardization pipeline consists of four stages
Dataset standardization and curation
- Standardized tasks from datasets collected in NusaCrowd into categories according to a specific schema
- Defined 13 schemas to cover all tasks and modalities
- Single-label text classification schema consists of three attributes (id, text, label)
- Manual curation process for each datasheet submission based on two criteria (language correctness and annotation process)
- Language correctness checked automatically and manually for English, Indonesian, Sundanese, and Javanese
- Manual curation for other local languages
- Annotation process manually checked and categorized into five categories
Datasets in nusacrowd
- 137 datasheets collected with 117 dataloaders
- 14 previously private datasets covering various tasks and local languages
- 36 task types, including machine translation, summarization, sentiment analysis, etc.
- 3 modalities: image, text, and speech
- 19 Indonesian languages, plus some non-Indonesian languages
- 29 out of 36 task types related to natural language understanding and generation
- 3 vision tasks and 4 speech tasks
Nusacrowd benchmarks
- Developed three different benchmarks from subsets of datasets in NusaCrowd
- Benchmarks for Indonesian and its local languages
- Zero-shot NLU benchmark (NusaNLU)
- Zero-shot NLG benchmark (NusaNLG)
- Multilingual ASR benchmark (NusaASR)
Nusanlu
- Existing NLU benchmarks in Indonesian only cover one language and focus on comparing traditional ML and pre-trained LMs
- NusaNLU is the first zero-shot NLU benchmark in Indonesian and its local languages, covering 12 languages and 26 datasets
- Tasks include emotion classification, sentiment analysis, review score regression, hate speech detection, abusive language detection, next tweet prediction, and NLI
- Models evaluated are XLM-R, XGLM, and BLOOMZ
- Results show that single-task training on wav2vec 2.0-pt fails to produce a good result, while ASR fine-tuned wav2vec 2.0-ft model yields a decent result in most languages
- Multilingual multi-task training yields โผ20% WER across all languages
- Results suggest a large difference between vocabulary and speech features from one language to another
- Acehnese (ace) yields different performance to other local languages, suggesting a speech feature distinction
Nusanlg
- Recent works in Indonesian NLG benchmarks use transformer-based models.
- NusaNLG is a new NLG benchmark covering 12 languages.
- Models are evaluated using SacreBLEU and ROUGE-L.
- Prompts are used to explore zero-shot generalization of large LMs.
Nusaasr
- Developed first multilingual ASR benchmark for Indonesian and its local languages
- Employed pre-trained wav2vec 2.0 models in experiment
- Explored three training settings: single-task monolingual, multi-task monolingual, and joint multi-task multilingual
- Experimented with two different wav2vec 2.0 checkpoints
Discussion
Impact of nusacrowd
- NusaCrowd provides access to 137 datasets, which is much higher than existing resource pools and benchmarks.
- NusaCrowd covers more local languages and modalities.
- NusaCrowd presents two solutions to resource limitation in Indonesian NLP: standardization and an active resource pool.
Multilinguality for extremely low-resource languages
- Multilinguality is important in low-resource NLP
- Multilingual LMs are effective for tackling low-resource languages
- Multilingual LMs are more scalable than monolingual or regional LMs
- Multilingual LMs benefit from positive transfer between related languages
Viability of large models for indonesian
- Larger LMs have better performance
- Indonesian NLP data is limited compared to high-resource languages
- Computational resources are limited for Indonesian research institutions and industries
- Suggest making more effort to provide efficient solutions, such as pre-trained models of smaller sizes and efficiency techniques like factorization, pruning, quantization, and distillation
Conclusion
- Introduce NusaCrowd, a resource pool for Indonesian and its local languages
- 137 datasets, 118 with standardized loader
- 3 modalities: text, vision, and speech
- Manual and automatic curation processes to verify quality
- 3 use cases: zero-shot NLU, NLG, and multilingual ASR
- Experiments focus on zero-shot methods
- Majority of datasets skewed towards MT, sentiment, abusive text classification, and ASR
- 700+ languages in Indonesia, but only focus on a small fraction
- Explorations in speech and image modalities for Indonesian and its local languages limited
- 137 datasets listed, 3 use cases shown, but huge potential for utilization
- Schema to define and format attributes of dataset returned by data loader
- Image-text, speech-text, speech-to-speech, unlabeled text, single-label text classification, multi-label text classification, text-to-text, sequence labeling, question answering, single-label text pair classification, single-label text pair classification with continuous values or regression, multi-label text pair classification, knowledge base
- 4 model checkpoints for NLU experiment
- 2 model checkpoints for NLG experiment
- 3 different prompts for each task type
- 2 model checkpoints for ASR experiment
- 3 largest and most widely-used languages in Indonesia: Indonesian, Javanese, and Sundanese