Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Parallel data is beneficial for cross-lingual learning.
It is unclear if the improvements come from the data or the modeling of parallel interactions.
Unsupervised machine translation can generate synthetic parallel data.
Synthetic parallel data can be useful for downstream tasks.
Real parallel data still yields the best results.
Multilingual models do not exploit the full potential of monolingual data.

Paper Content

Introduction

Multilingual models can generalize across languages without data in the target language
Models are pretrained on monolingual corpora and finetuned with labeled data in the source language
Parallel data can be incorporated at pretraining or finetuning time
Research is needed to understand the contribution of monolingual and parallel data

Experimental setup

Tasks

Training set is in English
Evaluate transfer performance in other languages
Languages evaluated: English, Arabic, German, Hindi, French, Swahili, Russian, Thai, Vietnamese
Finetuning incorporation experiments involve machine translating training data into target languages
For XNLI, translate premise and hypothesis, leave label unchanged
For XQuAD and WikiANN, translate input text and project answer spans using word aligner
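
The span projection step for XQuAD and WikiANN can be sketched as follows. This is a minimal illustration, assuming the word aligner returns (source index, target index) pairs and that the projected span is the smallest contiguous target span covering all aligned tokens. The function name `project_span` and the toy alignment are hypothetical and not taken from the paper.

```python
from typing import List, Optional, Tuple


def project_span(
    src_span: Tuple[int, int],
    alignment: List[Tuple[int, int]],
) -> Optional[Tuple[int, int]]:
    """Map an inclusive source token span onto the target sentence.

    `alignment` holds (source_index, target_index) pairs, in the form a
    word aligner would typically produce. Returns None when no token in
    the source span is aligned to any target token.
    """
    start, end = src_span
    targets = [t for s, t in alignment if start <= s <= end]
    if not targets:
        return None
    # Assumption: use the smallest contiguous target span that covers
    # every target token aligned to the source span.
    return min(targets), max(targets)


# Toy example with word reordering (hypothetical alignment).
src_tokens = ["the", "red", "house"]
tgt_tokens = ["la", "maison", "rouge"]
toy_alignment = [(0, 0), (1, 2), (2, 1)]

projected = project_span((1, 2), toy_alignment)  # span for "red house"
if projected is not None:
    lo, hi = projected
    print(tgt_tokens[lo : hi + 1])
```

Returning None for unaligned spans lets the caller decide whether to drop the example or fall back to another heuristic.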

Model

Used XLM-R base for experiments
XLM-R base was pretrained with Masked Language Modeling (MLM) on CC-100
Finetuned with learning rates of 1e-5, 5e-5, and 1e-4 using the Adam optimizer
Trained for up to 10 epochs and chose checkpoint with best validation performance
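
The finetuning protocol above can be sketched as a small selection loop. This is only an illustration: `validate` is a hypothetical callback standing in for a real training and evaluation run, and selecting across learning rates as well as epochs is an assumption, since the summary only states that the checkpoint with the best validation performance was chosen.

```python
from typing import Callable, Optional, Tuple

LEARNING_RATES = (1e-5, 5e-5, 1e-4)  # values listed in the setup above
MAX_EPOCHS = 10


def select_best_checkpoint(
    validate: Callable[[float, int], float],
) -> Tuple[float, int, float]:
    """Return (learning_rate, epoch, score) with the highest validation score.

    `validate(lr, epoch)` is a hypothetical callback that returns the
    validation score of the checkpoint saved after `epoch` epochs of
    finetuning with learning rate `lr`.
    """
    best: Optional[Tuple[float, int, float]] = None
    for lr in LEARNING_RATES:
        for epoch in range(1, MAX_EPOCHS + 1):
            score = validate(lr, epoch)
            # Strict comparison keeps the earliest checkpoint on ties.
            if best is None or score > best[2]:
                best = (lr, epoch, score)
    assert best is not None  # the grid is never empty
    return best


# Made-up validation scores, used only to exercise the loop.
fake_scores = {(1e-5, 3): 0.71, (5e-5, 4): 0.74, (1e-4, 2): 0.69}


def fake_validate(lr: float, epoch: int) -> float:
    return fake_scores.get((lr, epoch), 0.5)


best_lr, best_epoch, best_score = select_best_checkpoint(fake_validate)
```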

Parallel data sources

Gold: Ground-truth parallel data generated by humans
Supervised MT: Synthetic parallel data generated through a conventional MT system
Unsupervised MT: Synthetic parallel data generated through an unsupervised MT system
Dictionary: Synthetic parallel data generated through random word replacement with a dictionary
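
The dictionary variant is the simplest of the four sources, so a short sketch may help. The code below assumes a bilingual dictionary that maps lowercased source words to lists of candidate translations, and replaces each covered word with a fixed probability. The function name, the `replace_prob` parameter, and the toy dictionary are illustrative assumptions, not details from the paper.

```python
import random
from typing import Dict, List


def dictionary_translate(
    tokens: List[str],
    dictionary: Dict[str, List[str]],
    replace_prob: float,
    rng: random.Random,
) -> List[str]:
    """Create a pseudo-translation by random word replacement.

    Each token found in `dictionary` is swapped for a randomly chosen
    translation with probability `replace_prob`. Tokens that are not in
    the dictionary are copied unchanged, so the output has the same
    length and word order as the input.
    """
    output = []
    for token in tokens:
        candidates = dictionary.get(token.lower())
        if candidates and rng.random() < replace_prob:
            output.append(rng.choice(candidates))
        else:
            output.append(token)
    return output


# Toy English-German dictionary (hypothetical entries).
toy_dictionary = {"the": ["die"], "cat": ["Katze"]}
sentence = ["the", "cat", "sleeps"]

synthetic = dictionary_translate(sentence, toy_dictionary, 1.0, random.Random(0))
synthetic_pair = (sentence, synthetic)
```

Because the replacement is word by word, the synthetic side keeps the source word order, which is the main way this signal differs from MT output.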

Pretraining incorporation

Incorporated parallel data into pretraining process
Pretrained with the MLM and Translation Language Modeling (TLM) objectives for 70k steps
Compared different sources of parallel data
All variants incorporating parallel data outperform the original XLM-R model
Supervised MT performs on par with gold data
Unsupervised MT lags behind but outperforms baseline
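
To make the difference between MLM and TLM concrete: TLM concatenates a sentence with its translation and masks tokens on both sides, so the model can attend to the other language when predicting a masked word. The sketch below builds one such training example. The separator and mask strings, the `mask_prob` value of 0.15, and the choice to always substitute the mask token are simplifying assumptions rather than the paper's exact configuration.

```python
import random
from typing import List, Optional, Tuple

MASK = "<mask>"  # placeholder mask token (assumption)
SEP = "</s>"     # placeholder separator token (assumption)


def make_tlm_example(
    src: List[str],
    tgt: List[str],
    mask_prob: float,
    rng: random.Random,
) -> Tuple[List[str], List[Optional[str]]]:
    """Build a TLM input from a parallel sentence pair.

    The two sentences are concatenated around a separator and each
    non-separator token is masked independently with probability
    `mask_prob`. `labels[i]` holds the original token at every masked
    position and None elsewhere.
    """
    tokens = src + [SEP] + tgt
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for token in tokens:
        if token != SEP and rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(token)
        else:
            inputs.append(token)
            labels.append(None)
    return inputs, labels


english = ["The", "cat", "sleeps"]
german = ["Die", "Katze", "schläft"]
tlm_inputs, tlm_labels = make_tlm_example(english, german, 0.15, random.Random(0))
```

An MLM example is the same construction applied to a single monolingual sentence, without the second segment.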

Finetuning incorporation

Incorporated parallel data into finetuning process
Results in Table 2 show that incorporating parallel data outperforms the baseline in all tasks and languages for all data sources
Supervised MT obtains best results, followed by unsupervised MT and word-by-word translation with dictionaries
Synthetic parallel data derived from monolingual data alone can bring improvements, but real parallel data brings further gains
Simplistic ways to incorporate parallel signals can bring improvements
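
For XNLI, the translation step described under Tasks (translate premise and hypothesis, leave the label unchanged) can be sketched as a simple data augmentation routine. The `translate` callable is a hypothetical stand-in for whichever parallel data source is used (supervised MT, unsupervised MT, or dictionary replacement), and keeping the original English examples alongside the translations is an assumption.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]


def augment_xnli(
    examples: List[Example],
    translate: Callable[[str, str], str],
    target_langs: List[str],
) -> List[Example]:
    """Add translated copies of each XNLI training example.

    `translate(text, lang)` is a hypothetical function that returns
    `text` rendered in language `lang`. Premise and hypothesis are
    translated; the label is copied unchanged.
    """
    augmented = list(examples)  # assumption: keep the English originals
    for example in examples:
        for lang in target_langs:
            augmented.append(
                {
                    "premise": translate(example["premise"], lang),
                    "hypothesis": translate(example["hypothesis"], lang),
                    "label": example["label"],
                    "lang": lang,
                }
            )
    return augmented


def toy_translate(text: str, lang: str) -> str:
    # Placeholder that only tags the text; a real system would translate it.
    return f"[{lang}] {text}"


english_data = [
    {"premise": "A man eats.", "hypothesis": "Someone eats.", "label": "entailment"}
]
train_data = augment_xnli(english_data, toy_translate, ["de", "fr"])
```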

Discussion

The improvements from incorporating parallel data in cross-lingual transfer learning can partly be attributed to the explicit use of a parallel training signal.
Unsupervised MT can recover part of this improvement without the need for real parallel data.
Facilitating parallel interactions matters more than using real parallel data in all tasks except XQuAD.
Multilingual pretrained models rely on monolingual data, which calls into question the extent to which existing approaches are able to exploit the full potential of such monolingual data.
Similar trends hold for both pretraining and finetuning incorporation, and supervised MT performs comparably to gold-standard parallel data.

Reconsidering the categorization of cross-lingual learning approaches

Different data types can be used in a pipeline (monolingual source corpora, monolingual target corpora, parallel corpora, downstream data)
Different stages of the pipeline can incorporate the data (pretraining, finetuning, testing)
Different procedures can be used to incorporate the data (directly or indirectly through MT)
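
The three axes above can be written down as a small data model, which makes it easier to compare approaches by the data they consume rather than by their label. The sketch below is one possible encoding. The class names and the two example descriptions are illustrative readings of this summary, not definitions from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Tuple


class DataType(Enum):
    MONO_SOURCE = "monolingual source corpora"
    MONO_TARGET = "monolingual target corpora"
    PARALLEL = "parallel corpora"
    DOWNSTREAM = "downstream data"


class Stage(Enum):
    PRETRAINING = "pretraining"
    FINETUNING = "finetuning"
    TESTING = "testing"


class Procedure(Enum):
    DIRECT = "direct"
    VIA_MT = "indirect, through MT"


@dataclass(frozen=True)
class DataUse:
    """One use of a data type at a pipeline stage via a procedure."""

    data: DataType
    stage: Stage
    procedure: Procedure


def data_types(approach: Tuple[DataUse, ...]) -> FrozenSet[DataType]:
    """Collect the distinct data types an approach relies on."""
    return frozenset(use.data for use in approach)


# Illustrative descriptions (assumptions based on this summary).
baseline = (
    DataUse(DataType.MONO_SOURCE, Stage.PRETRAINING, Procedure.DIRECT),
    DataUse(DataType.MONO_TARGET, Stage.PRETRAINING, Procedure.DIRECT),
    DataUse(DataType.DOWNSTREAM, Stage.FINETUNING, Procedure.DIRECT),
)
unsupervised_mt_variant = baseline + (
    DataUse(DataType.MONO_SOURCE, Stage.PRETRAINING, Procedure.VIA_MT),
    DataUse(DataType.MONO_TARGET, Stage.PRETRAINING, Procedure.VIA_MT),
)

same_data = data_types(baseline) == data_types(unsupervised_mt_variant)
```

Under this encoding the two example approaches differ only in procedure, not in the data types they use.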

Related work

Prior work has explored the extent to which monolingual pretraining relies on knowledge transfer from unlabeled corpora.
Whether cross-lingual learning similarly relies on knowledge transfer from parallel data has not been examined.
Our unsupervised MT variant does not use any additional data compared to regular pretraining.

Conclusions

Model-generated parallel data can be used for cross-lingual learning.
The authors advocate investigating the optimal way to leverage monolingual and/or parallel data for cross-lingual learning.
Pretraining incorporation results show that the XLM-R model can be finetuned on English downstream data and zero-shot transferred to the target language.
