Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

NusaCrowd is a collaborative initiative to collect and unite existing resources for Indonesian languages.
137 datasets and 117 standardized data loaders have been brought together.
Quality of datasets has been assessed manually and automatically.
NusaCrowd enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and its local languages.
NusaCrowd also brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and its local languages.
Intended to help advance natural language processing research in under-represented languages.

Paper Content

Introduction

Indonesia is a linguistically diverse and populous country
There are over 700 spoken languages in Indonesia
NLP research in Indonesian languages is restrained by various factors
Existing NLP research mainly focuses on high-resource languages
Indonesian NLP resources are scattered, undocumented, and not publicly available
NusaCrowd is an open collaborative effort to gather and unify existing resources in Indonesian languages
NusaCrowd has collected 137 datasheets with 117 standardized data loaders
NusaNLU, NusaNLG, and NusaASR are introduced as zero-shot benchmarks
NusaCrowd covers 100+ datasets and 200+ tasks in 19 Indonesian languages
200+ hours of ASR corpora are available in 10 Indonesian languages

Lack of labeled datasets for NLP research in Indonesian languages
Building large LMs to allow zero-shot and few-shot transfer learning
Open source initiatives to gather available datasets and build useful models

Nusacrowd

Overview of NusaCrowd
Description of NusaCrowd framework
Dataset curation process
Summary and statistics of datasets collected in NusaCrowd

Overview of nusacrowd

NusaCrowd is a crowdsourcing initiative to collect, open-source, and standardize access to datasets in Indonesian and its 700+ local languages
NusaCrowd aims to address the resource limitation problem in Indonesian NLP through three solutions
NusaCrowd promotes public data access for non-public datasets with publications
NusaCrowd opens up access to 13 previously non-public datasets, some of which are multilingual
NusaCrowd collects datasets in both text modality and other modalities, e.g., speech and image
NusaCrowd does not store nor copy any of the hosted datasets, the control and ownership of the hosted datasets belong to the original owner

Nusacrowd framework

NusaCrowd consists of two platforms: NusaCatalogue and NusaCrowd Data Hub
The two platforms interact to support dataset registration and standardization
NusaCatalogue stores datasheets and NusaCrowd Data Hub stores data loaders
Dataset registration and standardization pipeline consists of four stages

Dataset standardization and curation

Standardized tasks from datasets collected in NusaCrowd into categories according to a specific schema
Defined 13 schemas to cover all tasks and modalities
Single-label text classification schema consists of three attributes (id, text, label)
Manual curation process for each datasheet submission based on two criteria (language correctness and annotation process)
Language correctness checked automatically and manually for English, Indonesian, Sundanese, and Javanese
Manual curation for other local languages
Annotation process manually checked and categorized into five categories

Datasets in nusacrowd

137 datasheets collected with 117 dataloaders
14 previously private datasets covering various tasks and local languages
36 task types, including machine translation, summarization, sentiment analysis, etc.
3 modalities: image, text, and speech
19 Indonesian languages, plus some non-Indonesian languages
29 out of 36 task types related to natural language understanding and generation
3 vision tasks and 4 speech tasks

Nusacrowd benchmarks

Developed three different benchmarks from subsets of datasets in NusaCrowd
Benchmarks for Indonesian and its local languages
Zero-shot NLU benchmark (NusaNLU)
Zero-shot NLG benchmark (NusaNLG)
Multilingual ASR benchmark (NusaASR)

Nusanlu

Existing NLU benchmarks in Indonesian only cover one language and focus on comparing traditional ML and pre-trained LMs
NusaNLU is the first zero-shot NLU benchmark in Indonesian and its local languages, covering 12 languages and 26 datasets
Tasks include emotion classification, sentiment analysis, review score regression, hate speech detection, abusive language detection, next tweet prediction, and NLI
Models evaluated are XLM-R, XGLM, and BLOOMZ
Results show that single-task training on wav2vec 2.0-pt fails to produce a good result, while ASR fine-tuned wav2vec 2.0-ft model yields a decent result in most languages
Multilingual multi-task training yields ∼20% WER across all languages
Results suggest a large difference between vocabulary and speech features from one language to another
Acehnese (ace) yields different performance to other local languages, suggesting a speech feature distinction

Nusanlg

Recent works in Indonesian NLG benchmarks use transformer-based models.
NusaNLG is a new NLG benchmark covering 12 languages.
Models are evaluated using SacreBLEU and ROUGE-L.
Prompts are used to explore zero-shot generalization of large LMs.

Nusaasr

Developed first multilingual ASR benchmark for Indonesian and its local languages
Employed pre-trained wav2vec 2.0 models in experiment
Explored three training settings: single-task monolingual, multi-task monolingual, and joint multi-task multilingual
Experimented with two different wav2vec 2.0 checkpoints

Discussion

Impact of nusacrowd

NusaCrowd provides access to 137 datasets, which is much higher than existing resource pools and benchmarks.
NusaCrowd covers more local languages and modalities.
NusaCrowd presents two solutions to resource limitation in Indonesian NLP: standardization and an active resource pool.

Multilinguality for extremely low-resource languages

Multilinguality is important in low-resource NLP
Multilingual LMs are effective for tackling low-resource languages
Multilingual LMs are more scalable than monolingual or regional LMs
Multilingual LMs benefit from positive transfer between related languages

Viability of large models for indonesian

Larger LMs have better performance
Indonesian NLP data is limited compared to high-resource languages
Computational resources are limited for Indonesian research institutions and industries
Suggest making more effort to provide efficient solutions, such as pre-trained models of smaller sizes and efficiency techniques like factorization, pruning, quantization, and distillation

Conclusion

Introduce NusaCrowd, a resource pool for Indonesian and its local languages
137 datasets, 118 with standardized loader
3 modalities: text, vision, and speech
Manual and automatic curation processes to verify quality
3 use cases: zero-shot NLU, NLG, and multilingual ASR
Experiments focus on zero-shot methods
Majority of datasets skewed towards MT, sentiment, abusive text classification, and ASR
700+ languages in Indonesia, but only focus on a small fraction
Explorations in speech and image modalities for Indonesian and its local languages limited
137 datasets listed, 3 use cases shown, but huge potential for utilization
Schema to define and format attributes of dataset returned by data loader
Image-text, speech-text, speech-to-speech, unlabeled text, single-label text classification, multi-label text classification, text-to-text, sequence labeling, question answering, single-label text pair classification, single-label text pair classification with continuous values or regression, multi-label text pair classification, knowledge base
4 model checkpoints for NLU experiment
2 model checkpoints for NLG experiment
3 different prompts for each task type
2 model checkpoints for ASR experiment
3 largest and most widely-used languages in Indonesia: Indonesian, Javanese, and Sundanese

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Nusacrowd#

Overview of nusacrowd#

Nusacrowd framework#

Dataset standardization and curation#

Datasets in nusacrowd#

Nusacrowd benchmarks#

Nusanlu#

Nusanlg#

Nusaasr#

Discussion#

Impact of nusacrowd#

Multilinguality for extremely low-resource languages#

Viability of large models for indonesian#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Nusacrowd

Overview of nusacrowd

Nusacrowd framework

Dataset standardization and curation

Datasets in nusacrowd

Nusacrowd benchmarks

Nusanlu

Nusanlg

Nusaasr

Discussion

Impact of nusacrowd

Multilinguality for extremely low-resource languages

Viability of large models for indonesian

Conclusion