Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • NusaCrowd is a collaborative initiative to collect and unite existing resources for Indonesian languages.
  • 137 datasets and 117 standardized data loaders have been brought together.
  • Quality of datasets has been assessed manually and automatically.
  • NusaCrowd enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and its local languages.
  • NusaCrowd also brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and its local languages.
  • Intended to help advance natural language processing research in under-represented languages.

Paper Content

Introduction

  • Indonesia is a linguistically diverse and populous country
  • There are over 700 spoken languages in Indonesia
  • NLP research in Indonesian languages is restrained by various factors
  • Existing NLP research mainly focuses on high-resource languages
  • Indonesian NLP resources are scattered, undocumented, and not publicly available
  • NusaCrowd is an open collaborative effort to gather and unify existing resources in Indonesian languages
  • NusaCrowd has collected 137 datasheets with 117 standardized data loaders
  • NusaNLU, NusaNLG, and NusaASR are introduced as zero-shot benchmarks
  • NusaCrowd covers 100+ datasets and 200+ tasks in 19 Indonesian languages
  • 200+ hours of ASR corpora are available in 10 Indonesian languages
  • Lack of labeled datasets for NLP research in Indonesian languages
  • Building large LMs to allow zero-shot and few-shot transfer learning
  • Open source initiatives to gather available datasets and build useful models

Nusacrowd

  • Overview of NusaCrowd
  • Description of NusaCrowd framework
  • Dataset curation process
  • Summary and statistics of datasets collected in NusaCrowd

Overview of nusacrowd

  • NusaCrowd is a crowdsourcing initiative to collect, open-source, and standardize access to datasets in Indonesian and its 700+ local languages
  • NusaCrowd aims to address the resource limitation problem in Indonesian NLP through three solutions
  • NusaCrowd promotes public data access for non-public datasets with publications
  • NusaCrowd opens up access to 13 previously non-public datasets, some of which are multilingual
  • NusaCrowd collects datasets in both text modality and other modalities, e.g., speech and image
  • NusaCrowd does not store nor copy any of the hosted datasets, the control and ownership of the hosted datasets belong to the original owner

Nusacrowd framework

  • NusaCrowd consists of two platforms: NusaCatalogue and NusaCrowd Data Hub
  • The two platforms interact to support dataset registration and standardization
  • NusaCatalogue stores datasheets and NusaCrowd Data Hub stores data loaders
  • Dataset registration and standardization pipeline consists of four stages

Dataset standardization and curation

  • Standardized tasks from datasets collected in NusaCrowd into categories according to a specific schema
  • Defined 13 schemas to cover all tasks and modalities
  • Single-label text classification schema consists of three attributes (id, text, label)
  • Manual curation process for each datasheet submission based on two criteria (language correctness and annotation process)
  • Language correctness checked automatically and manually for English, Indonesian, Sundanese, and Javanese
  • Manual curation for other local languages
  • Annotation process manually checked and categorized into five categories

Datasets in nusacrowd

  • 137 datasheets collected with 117 dataloaders
  • 14 previously private datasets covering various tasks and local languages
  • 36 task types, including machine translation, summarization, sentiment analysis, etc.
  • 3 modalities: image, text, and speech
  • 19 Indonesian languages, plus some non-Indonesian languages
  • 29 out of 36 task types related to natural language understanding and generation
  • 3 vision tasks and 4 speech tasks

Nusacrowd benchmarks

  • Developed three different benchmarks from subsets of datasets in NusaCrowd
  • Benchmarks for Indonesian and its local languages
  • Zero-shot NLU benchmark (NusaNLU)
  • Zero-shot NLG benchmark (NusaNLG)
  • Multilingual ASR benchmark (NusaASR)

Nusanlu

  • Existing NLU benchmarks in Indonesian only cover one language and focus on comparing traditional ML and pre-trained LMs
  • NusaNLU is the first zero-shot NLU benchmark in Indonesian and its local languages, covering 12 languages and 26 datasets
  • Tasks include emotion classification, sentiment analysis, review score regression, hate speech detection, abusive language detection, next tweet prediction, and NLI
  • Models evaluated are XLM-R, XGLM, and BLOOMZ
  • Results show that single-task training on wav2vec 2.0-pt fails to produce a good result, while ASR fine-tuned wav2vec 2.0-ft model yields a decent result in most languages
  • Multilingual multi-task training yields โˆผ20% WER across all languages
  • Results suggest a large difference between vocabulary and speech features from one language to another
  • Acehnese (ace) yields different performance to other local languages, suggesting a speech feature distinction

Nusanlg

  • Recent works in Indonesian NLG benchmarks use transformer-based models.
  • NusaNLG is a new NLG benchmark covering 12 languages.
  • Models are evaluated using SacreBLEU and ROUGE-L.
  • Prompts are used to explore zero-shot generalization of large LMs.

Nusaasr

  • Developed first multilingual ASR benchmark for Indonesian and its local languages
  • Employed pre-trained wav2vec 2.0 models in experiment
  • Explored three training settings: single-task monolingual, multi-task monolingual, and joint multi-task multilingual
  • Experimented with two different wav2vec 2.0 checkpoints

Discussion

Impact of nusacrowd

  • NusaCrowd provides access to 137 datasets, which is much higher than existing resource pools and benchmarks.
  • NusaCrowd covers more local languages and modalities.
  • NusaCrowd presents two solutions to resource limitation in Indonesian NLP: standardization and an active resource pool.

Multilinguality for extremely low-resource languages

  • Multilinguality is important in low-resource NLP
  • Multilingual LMs are effective for tackling low-resource languages
  • Multilingual LMs are more scalable than monolingual or regional LMs
  • Multilingual LMs benefit from positive transfer between related languages

Viability of large models for indonesian

  • Larger LMs have better performance
  • Indonesian NLP data is limited compared to high-resource languages
  • Computational resources are limited for Indonesian research institutions and industries
  • Suggest making more effort to provide efficient solutions, such as pre-trained models of smaller sizes and efficiency techniques like factorization, pruning, quantization, and distillation

Conclusion

  • Introduce NusaCrowd, a resource pool for Indonesian and its local languages
  • 137 datasets, 118 with standardized loader
  • 3 modalities: text, vision, and speech
  • Manual and automatic curation processes to verify quality
  • 3 use cases: zero-shot NLU, NLG, and multilingual ASR
  • Experiments focus on zero-shot methods
  • Majority of datasets skewed towards MT, sentiment, abusive text classification, and ASR
  • 700+ languages in Indonesia, but only focus on a small fraction
  • Explorations in speech and image modalities for Indonesian and its local languages limited
  • 137 datasets listed, 3 use cases shown, but huge potential for utilization
  • Schema to define and format attributes of dataset returned by data loader
  • Image-text, speech-text, speech-to-speech, unlabeled text, single-label text classification, multi-label text classification, text-to-text, sequence labeling, question answering, single-label text pair classification, single-label text pair classification with continuous values or regression, multi-label text pair classification, knowledge base
  • 4 model checkpoints for NLU experiment
  • 2 model checkpoints for NLG experiment
  • 3 different prompts for each task type
  • 2 model checkpoints for ASR experiment
  • 3 largest and most widely-used languages in Indonesia: Indonesian, Javanese, and Sundanese