Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Recent work attributes progress in NLP to large language models with increased model size and large quantities of pretraining data.
- Current state-of-the-art LMs for Hebrew are both under-parameterized and under-trained compared to LMs in other languages.
- Multilingual sequence-to-sequence models present a promising building block for NLP for morphologically rich languages.
Paper Content
Introduction
- Large pretrained language models have been used in a variety of NLP tasks, domains, and languages.
- These models are trained in a self-supervised fashion on large corpora.
- Language-specific models have been developed to improve benchmark results in a variety of languages, including Hebrew.
- Hebrew language models are trained with a relatively small pretraining data and are under-parameterized.
- Encoder-only architectures are not effective for morpho-syntactic tasks.
- A three-faceted problem exists in Hebrew NLP: underparameterization, limited training data, and suboptimal pre-training architecture.
- mT5 is proposed to address these challenges.
- Text-only formulations are proposed to adapt classification, span prediction, and token/morpheme classification tasks to mT5’s text-to-text paradigm.
- This paradigm change produces empirical improvements on all tasks evaluated compared to previous state-of-the-art.
Modeling
- We use mT5, a multilingual generative text-to-text version of T5, trained on 101 languages.
- We evaluate mT5 on all its available sizes, ranging from 300M to 13B parameters.
- We propose casting all Hebrew NLP tasks as text-to-text tasks, where the input text is fed into the model and targets are produced in a generative manner.
- Token classification is not as common in the literature, especially when the tokens consist of multiple morphemes.
Experiments
- Goal of study is to assess performance of mT5 language model on Hebrew tasks
- mT5 models were fine-tuned on Hebrew tasks for 4096 steps
- Compared mT5 models to YAP, mBERT, HeBERT, AlephBERT and ABG
- Results can be found in Tables 1 and 2
Tasks
- QA, NER, Sentiment Analysis and morpho-syntactic tasks evaluated
- ParaShoot dataset used for QA
- NEMO dataset used for NER
- Manual evaluation of mT5 models
- Sentiment Analysis dataset from Amram et al.
- Two additional tokens used for output
Results
- mT5 produces better results than evaluated baselines on Hebrew NLP benchmarks
- mT5 produces the biggest performance boost for the Question-Answering task of ParaShoot
- mT5 outperforms AlephBERT by 27.9 F1 points
- mT5 outperforms baseline models for sentiment analysis by a small fraction
- Error reduction of 30.3% and 32.8% for the segmentation and POS tagging tasks compared to previous state-of-the-art
- Increase of 16.93 mset F1 points compared to YAP for lemmatization task
Related work
- HeBERT and OSCAR used for sentiment analysis
- AlephBERT pretrained on Hebrew Wikipedia and tweets
- Guetta et al. used 2.5x AlephBERT vocabulary size to improve performance
- Keren et al. proposed char-level LMs to mitigate data sparseness
- mT5 outperforms baseline models on multilingual datasets
- Monolingual Hebrew LM papers compared against mBERT
Conclusions
- All Hebrew LMs to date are encoder-only models, which cannot generate morpheme sequences.
- mT5 is a publicly available multilingual large language model that was trained on multilingual and Hebrew data.
- mT5 outperforms all previous baselines.
- Multilingual sequence-to-sequence models should be used for Hebrew and other MRLs.