Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Recent work attributes progress in NLP to large language models with increased model size and large quantities of pretraining data.
Current state-of-the-art LMs for Hebrew are both under-parameterized and under-trained compared to LMs in other languages.
Multilingual sequence-to-sequence models present a promising building block for NLP for morphologically rich languages.

Large pretrained language models have been used in a variety of NLP tasks, domains, and languages.
These models are trained in a self-supervised fashion on large corpora.
Language-specific models have been developed to improve benchmark results in a variety of languages, including Hebrew.
Hebrew language models are trained on relatively small amounts of pretraining data and are under-parameterized.
Encoder-only architectures are poorly suited to morpho-syntactic tasks, because they cannot generate morpheme sequences.
A three-faceted problem exists in Hebrew NLP: under-parameterization, limited training data, and a suboptimal pretraining architecture.
mT5 is proposed to address these challenges.
Text-only formulations are proposed to adapt classification, span prediction, and token/morpheme classification tasks to mT5’s text-to-text paradigm.
This paradigm change yields empirical improvements over the previous state of the art on all evaluated tasks.

We use mT5, a multilingual generative text-to-text version of T5, trained on 101 languages.
We evaluate mT5 at all of its available sizes, ranging from 300M to 13B parameters.
We propose casting all Hebrew NLP tasks as text-to-text tasks, where the input text is fed into the model and targets are produced in a generative manner.
Text-to-text formulations of token classification are less common in the literature, especially when tokens consist of multiple morphemes.
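
The text-to-text casting described above can be sketched as follows. This is a minimal illustration, not the paper's actual formats: the task prefixes (`sentiment:`, `question:`, `pos:`), the `@` and ` | ` separators, and all class and function names are assumptions introduced here, and the transliterated tokens are toy placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical separators, chosen only for this sketch.
TAG_SEP = "@"      # joins a morpheme to its tag
TOKEN_SEP = " | "  # marks the boundary between whitespace-delimited tokens


@dataclass
class Seq2SeqExample:
    source: str  # text fed to the encoder
    target: str  # text the decoder is trained to generate


def classification_example(text: str, label: str) -> Seq2SeqExample:
    # Classification: the label string itself is the generation target.
    return Seq2SeqExample(source=f"sentiment: {text}", target=label)


def span_prediction_example(question: str, context: str, answer: str) -> Seq2SeqExample:
    # Span prediction (e.g. question answering): the answer span is generated as text.
    return Seq2SeqExample(
        source=f"question: {question} context: {context}",
        target=answer,
    )


def morpheme_tagging_example(
    tokens: List[str],
    analyses: List[List[Tuple[str, str]]],
) -> Seq2SeqExample:
    # Token/morpheme classification: each token expands into one or more
    # (morpheme, tag) pairs, so the target has to spell out the morphemes too.
    if len(tokens) != len(analyses):
        raise ValueError("need exactly one analysis per token")
    per_token = [
        " ".join(f"{morpheme}{TAG_SEP}{tag}" for morpheme, tag in analysis)
        for analysis in analyses
    ]
    return Seq2SeqExample(
        source="pos: " + " ".join(tokens),
        target=TOKEN_SEP.join(per_token),
    )


# Toy usage with transliterated placeholder tokens.
example = morpheme_tagging_example(
    tokens=["bbit", "gdwl"],
    analyses=[
        [("b", "ADP"), ("h", "DET"), ("bit", "NOUN")],
        [("gdwl", "ADJ")],
    ],
)
```

Because the target for the tagging task contains the morphemes themselves, a model must generate text rather than assign one label per input token, which is what a sequence-to-sequence model allows.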

mT5 produces better results than the evaluated baselines on Hebrew NLP benchmarks.
mT5 produces the largest performance gain on the question-answering task of ParaShoot, where it outperforms AlephBERT by 27.9 F1 points.
mT5 outperforms the baseline models for sentiment analysis by a small margin.
mT5 reduces error by 30.3% and 32.8% on the segmentation and POS tagging tasks, respectively, compared to the previous state of the art.
mT5 gains 16.93 mset F1 points over YAP on the lemmatization task.
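
To make the error-reduction figures concrete: relative error reduction is conventionally the share of the baseline's remaining error that the new model removes. The sketch below assumes scores on a 0-100 scale; the function name and the example scores are hypothetical and are not taken from the paper.

```python
def relative_error_reduction(
    baseline_score: float,
    new_score: float,
    max_score: float = 100.0,
) -> float:
    """Return the percentage of the baseline's error removed by the new model."""
    baseline_error = max_score - baseline_score
    new_error = max_score - new_score
    if baseline_error <= 0:
        raise ValueError("baseline has no error left to reduce")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical scores: a baseline at 90.0 and a new model at 93.0.
# The error shrinks from 10 points to 7 points, a 30% relative reduction.
reduction = relative_error_reduction(90.0, 93.0)
```

Under this definition, a gain of a few absolute points near the top of the scale can correspond to a large relative reduction, which is why results on high-scoring tasks are often reported this way.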

All Hebrew LMs to date are encoder-only models, which cannot generate morpheme sequences.
mT5 is a publicly available multilingual large language model that was trained on multilingual and Hebrew data.
mT5 outperforms all previous baselines.
Multilingual sequence-to-sequence models should be used for Hebrew and other morphologically rich languages (MRLs).