Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Parallel data is beneficial for cross-lingual learning.
- It is unclear if the improvements come from the data or the modeling of parallel interactions.
- Unsupervised machine translation can generate synthetic parallel data.
- Synthetic parallel data can be useful for downstream tasks.
- Real parallel data still yields the best results.
- Multilingual models do not exploit the full potential of monolingual data.
Paper Content
Introduction
- Multilingual models can generalize across languages without data in the target language
- Models are pretrained on monolingual corpora and finetuned with labeled data in the source language
- Parallel data can be incorporated at pretraining or finetuning time
- Research is needed to understand the contribution of monolingual and parallel data
Experimental setup
Tasks
- Training set is in English
- Evaluate transfer performance in other languages
- Languages evaluated: English, Arabic, German, Hindi, French, Swahili, Russian, Thai, Vietnamese
- Finetuning incorporation experiments involve machine translating training data into target languages
- For XNLI, translate premise and hypothesis, leave label unchanged
- For XQuAD and WikiANN, translate input text and project answer spans using word aligner
Model
- Used XLM-R base for experiments
- Trained XLM-R base with Masked Language Modeling on CC-100
- Finetuned with learning rates of 1e-5, 5e-5, and 1e-4 using Adam optimizer
- Trained for up to 10 epochs and chose checkpoint with best validation performance
Parallel data sources
- Gold: Ground-truth parallel data generated by humans
- Supervised MT: Synthetic parallel data generated through a conventional MT system
- Unsupervised MT: Synthetic parallel data generated through an unsupervised MT system
- Dictionary: Synthetic parallel data generated through random word replacement with a dictionary
Pretraining incorporation
- Incorporated parallel data into pretraining process
- Pretrained on MLM and TLM for 70k steps
- Compared different sources of parallel data
- All variants incorporating parallel data outperform original XLM-R model
- Supervised MT performs at par with gold data
- Unsupervised MT lags behind but outperforms baseline
Finetuning incorporation
- Incorporated parallel data into finetuning process
- Results in Table 2 show incorporating parallel data outperforms baseline in all tasks and languages for all data sources
- Supervised MT obtains best results, followed by unsupervised MT and word-by-word translation with dictionaries
- Synthetic parallel data can bring improvements from monolingual data, but real parallel data brings further improvements
- Simplistic ways to incorporate parallel signals can bring improvements
Discussion
- Incorporating parallel data for cross-lingual transfer learning can partly be attributed to the explicit use of a parallel training signal.
- Unsupervised MT can achieve the same improvement without the need for real parallel data.
- Facilitation of parallel interactions is more important than the use of real parallel data in all tasks but XQuAD.
- Multilingual pretrained models rely on monolingual data, which calls into question the extent to which existing approaches are able to exploit the full potential of such monolingual data.
- Similar results are obtained for both pretraining and finetuning incorporation, as well as supervised MT and gold standard parallel data.
Reconsidering the categorization of cross-lingual learning approaches
- Different data types can be used in a pipeline (monolingual source corpora, monolingual target corpora, parallel corpora, downstream data)
- Different stages of the pipeline can incorporate the data (pretraining, finetuning, testing)
- Different procedures can be used to incorporate the data (directly or indirectly through MT)
Related work
- Prior work has explored the extent to which monolingual pretraining relies on knowledge transfer from unlabeled corpora.
- Cross-lingual learning has not been examined to see if it relies on knowledge transfer from parallel data.
- Our unsupervised MT variant does not use any additional data compared to regular pretraining.
Conclusions
- Model-generated parallel data can be used for cross-lingual learning.
- Investigating the optimal way to leverage monolingual and/or parallel data for cross-lingual learning is advocated.
- Pretraining incorporation results show that XLM-R model can be finetuned on English downstream data and zero-shot transferred to the target language.