Abstract

  • Recent work attributes progress in NLP to large language models with increased model size and large quantities of pretraining data.
  • Current state-of-the-art LMs for Hebrew are both under-parameterized and under-trained compared to LMs in other languages.
  • Multilingual sequence-to-sequence models present a promising building block for NLP for morphologically rich languages.

Paper Content

Introduction

  • Large pretrained language models have been used in a variety of NLP tasks, domains, and languages.
  • These models are trained in a self-supervised fashion on large corpora.
  • Language-specific models have been developed to improve benchmark results in a variety of languages, including Hebrew.
  • Hebrew language models are trained on relatively little pretraining data and are under-parameterized.
  • Encoder-only architectures are not effective for morpho-syntactic tasks.
  • Hebrew NLP therefore faces a three-faceted problem: under-parameterization, limited training data, and a suboptimal pretraining architecture.
  • mT5 is proposed to address these challenges.
  • Text-only formulations are proposed to adapt classification, span prediction, and token/morpheme classification tasks to mT5’s text-to-text paradigm.
  • This paradigm change yields empirical improvements over the previous state of the art on all evaluated tasks.

Modeling

  • We use mT5, a multilingual generative text-to-text version of T5, trained on 101 languages.
  • We evaluate mT5 on all its available sizes, ranging from 300M to 13B parameters.
  • We propose casting all Hebrew NLP tasks as text-to-text tasks, where the input text is fed into the model and targets are produced in a generative manner.
  • Text-to-text formulations of token classification are less common in the literature, especially when tokens consist of multiple morphemes.
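
The text-to-text casting described above can be illustrated with a minimal sketch. The summary does not give the paper's actual input and output formats, so the task prefixes ("sentiment:", "question:", "tag:"), the "token/tag" separator, and every name below are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Seq2SeqExample:
    source: str  # text fed to the encoder
    target: str  # text the decoder is trained to generate


def cast_classification(text: str, label: str) -> Seq2SeqExample:
    # Hypothetical "sentiment:" prefix; the label is generated as a plain word.
    return Seq2SeqExample(f"sentiment: {text}", label)


def cast_span_prediction(question: str, context: str, answer: str) -> Seq2SeqExample:
    # The answer is generated as text instead of predicted as start/end indices.
    return Seq2SeqExample(f"question: {question} context: {context}", answer)


def cast_token_classification(tokens: list[str], tags: list[str]) -> Seq2SeqExample:
    # One "token/tag" pair per input token; the separator is an assumption.
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    target = " ".join(f"{tok}/{tag}" for tok, tag in zip(tokens, tags))
    return Seq2SeqExample("tag: " + " ".join(tokens), target)
```

Because every task reduces to a (source, target) pair of strings, a single sequence-to-sequence model can be fine-tuned on all of them with the same training loop. For morpheme-level tasks the target string could list morphemes instead of whole tokens, which an encoder-only tagger cannot emit directly.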

Experiments

  • The goal of the study is to assess the performance of the mT5 language model on Hebrew tasks.
  • The mT5 models were fine-tuned on the Hebrew tasks for 4096 steps.
  • The mT5 models were compared to YAP, mBERT, HeBERT, AlephBERT, and ABG.
  • Results are reported in Tables 1 and 2 of the paper.

Tasks

  • Question answering (QA), named entity recognition (NER), sentiment analysis, and morpho-syntactic tasks were evaluated.
  • The ParaShoot dataset was used for QA.
  • The NEMO dataset was used for NER.
  • The mT5 models were also evaluated manually.
  • The sentiment analysis dataset is from Amram et al.
  • Two additional tokens were used for the output.
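
Since the model generates its answers as text, QA predictions have to be scored by comparing strings. The summary reports QA results in F1 points without defining the metric; the sketch below assumes the SQuAD-style token-overlap F1 that is common for extractive QA, with plain whitespace tokenization and no normalization. The function name is hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each token counts at most as often as it
    # appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction that contains the whole reference plus one extra token is penalized only on precision, so partially correct generated spans still receive credit.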

Results

  • mT5 produces better results than the evaluated baselines on Hebrew NLP benchmarks.
  • The biggest performance boost is on the ParaShoot question-answering task, where mT5 outperforms AlephBERT by 27.9 F1 points.
  • For sentiment analysis, mT5 outperforms the baseline models by a small margin.
  • For segmentation and POS tagging, mT5 reduces error by 30.3% and 32.8%, respectively, compared to the previous state of the art.
  • For lemmatization, mT5 gains 16.93 mset F1 points over YAP.
  • HeBERT and OSCAR were used for sentiment analysis.
  • AlephBERT was pretrained on Hebrew Wikipedia and tweets.
  • Guetta et al. used a vocabulary 2.5x the size of AlephBERT's to improve performance.
  • Keren et al. proposed character-level LMs to mitigate data sparseness.
  • mT5 outperforms baseline models on multilingual datasets.
  • Papers on monolingual Hebrew LMs compared their models against mBERT.
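
The error-reduction figures reported above for segmentation and POS tagging are relative, not absolute, differences. The summary does not spell out the formula; the sketch below assumes the usual definition, in which error is 100 minus a percentage score and the reduction is measured against the baseline's error. The function name and the scores in the usage note are illustrative and are not taken from the paper.

```python
def relative_error_reduction(baseline_score: float, new_score: float) -> float:
    """Percentage of the baseline's error (100 - score) removed by the new system."""
    baseline_error = 100.0 - baseline_score
    if baseline_error <= 0:
        raise ValueError("baseline has no error left to reduce")
    new_error = 100.0 - new_score
    return 100.0 * (baseline_error - new_error) / baseline_error
```

With illustrative scores of 90.0 for a baseline and 93.0 for a new model, the absolute gain is 3 points but the relative error reduction is 30%, because 3 of the baseline's 10 error points are removed. This is why a large error reduction can correspond to a small absolute gain when the baseline is already strong.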

Conclusions

  • All Hebrew LMs to date are encoder-only models, which cannot generate morpheme sequences.
  • mT5 is a publicly available multilingual large language model that was trained on multilingual and Hebrew data.
  • mT5 outperforms all previous baselines.
  • Multilingual sequence-to-sequence models should be used for Hebrew and other MRLs.