Abstract

  • Parallel data is beneficial for cross-lingual learning.
  • It is unclear if the improvements come from the data or the modeling of parallel interactions.
  • Unsupervised machine translation can generate synthetic parallel data.
  • Synthetic parallel data can be useful for downstream tasks.
  • Real parallel data still yields the best results.
  • Multilingual models do not exploit the full potential of monolingual data.

Paper Content

Introduction

  • Multilingual models can generalize across languages without data in the target language
  • Models are pretrained on monolingual corpora and finetuned with labeled data in the source language
  • Parallel data can be incorporated at pretraining or finetuning time
  • Research is needed to understand the contribution of monolingual and parallel data

Experimental setup

Tasks

  • Training set is in English
  • Evaluate transfer performance in other languages
  • Languages evaluated: English, Arabic, German, Hindi, French, Swahili, Russian, Thai, Vietnamese
  • Finetuning incorporation experiments involve machine translating training data into target languages
  • For XNLI, translate premise and hypothesis, leave label unchanged
  • For XQuAD and WikiANN, translate the input text and project the answer or entity spans using a word aligner
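
The translation steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `translate` callable stands in for whatever MT system is used, the alignment is assumed to be a list of (source index, target index) token pairs produced by some word aligner, and the "smallest covering span" rule in `project_span` is an assumed heuristic that the summary does not specify.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumed alignment format: (source_token_index, target_token_index) pairs.
Alignment = List[Tuple[int, int]]


def translate_nli_example(
    example: Dict[str, str], translate: Callable[[str], str]
) -> Dict[str, str]:
    """Translate premise and hypothesis; copy the label unchanged."""
    return {
        "premise": translate(example["premise"]),
        "hypothesis": translate(example["hypothesis"]),
        "label": example["label"],
    }


def project_span(
    src_span: Tuple[int, int], alignment: Alignment
) -> Optional[Tuple[int, int]]:
    """Project an inclusive source token span onto the target side.

    Returns the smallest target span covering every target token that is
    aligned to a source token inside the span, or None if nothing aligns.
    """
    start, end = src_span
    targets = [t for s, t in alignment if start <= s <= end]
    if not targets:
        return None
    return min(targets), max(targets)


# Toy alignment with a local reordering between source tokens 1 and 2.
alignment = [(0, 0), (1, 2), (2, 1), (3, 3)]
print(project_span((1, 2), alignment))  # (1, 2)
```

Examples whose span cannot be projected (the `None` case) would need to be dropped or handled separately; how the paper treats them is not stated here.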

Model

  • Used XLM-R base for experiments
  • XLM-R base was pretrained with Masked Language Modeling (MLM) on CC-100
  • Finetuned with the Adam optimizer using learning rates of 1e-5, 5e-5, and 1e-4
  • Trained for up to 10 epochs and chose the checkpoint with the best validation performance
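
The model selection described above can be sketched as a small search loop. This is only an illustration under assumptions: `validation_score` is a hypothetical callable that finetunes (or looks up) a run and returns its validation metric, and selecting jointly across learning rates and epochs is an assumption, since the summary only says the best-validation checkpoint was chosen.

```python
from typing import Callable, Sequence, Tuple

LEARNING_RATES = (1e-5, 5e-5, 1e-4)  # values listed in the summary
MAX_EPOCHS = 10                      # "up to 10 epochs"


def select_best_checkpoint(
    validation_score: Callable[[float, int], float],
    learning_rates: Sequence[float] = LEARNING_RATES,
    max_epochs: int = MAX_EPOCHS,
) -> Tuple[float, int, float]:
    """Return (learning_rate, epoch, score) with the highest validation score."""
    best = None
    for lr in learning_rates:
        for epoch in range(1, max_epochs + 1):
            score = validation_score(lr, epoch)
            # Keep the first checkpoint seen in case of ties.
            if best is None or score > best[2]:
                best = (lr, epoch, score)
    return best
```

In practice `validation_score` would wrap an actual finetuning run and evaluate the saved checkpoint after each epoch.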

Parallel data sources

  • Gold: Ground-truth parallel data generated by humans
  • Supervised MT: Synthetic parallel data generated through a conventional MT system
  • Unsupervised MT: Synthetic parallel data generated through an unsupervised MT system
  • Dictionary: Synthetic parallel data generated through random word replacement with a dictionary
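
As a rough illustration of the dictionary variant, the sketch below replaces words with dictionary translations. The details are assumptions rather than the paper's exact procedure: the function name, the `replace_prob` parameter, lowercased lookup, and picking one translation uniformly at random when a word has several are all hypothetical choices.

```python
import random
from typing import Dict, List, Optional


def dictionary_translate(
    tokens: List[str],
    dictionary: Dict[str, List[str]],
    replace_prob: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Replace tokens with a randomly chosen dictionary translation.

    Tokens without a dictionary entry are kept as they are.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    out = []
    for tok in tokens:
        candidates = dictionary.get(tok.lower())
        if candidates and rng.random() < replace_prob:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out


toy_dict = {"the": ["der"], "cat": ["Katze"]}
print(dictionary_translate(["the", "cat", "sleeps"], toy_dict))
# ['der', 'Katze', 'sleeps']
```

Pairing each original sentence with its word-replaced version gives the synthetic "parallel" pair; no MT system is involved.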

Pretraining incorporation

  • Incorporated parallel data into pretraining process
  • Pretrained with the MLM and Translation Language Modeling (TLM) objectives for 70k steps
  • Compared different sources of parallel data
  • All variants incorporating parallel data outperform the original XLM-R model
  • Supervised MT performs on par with gold data
  • Unsupervised MT lags behind but outperforms baseline
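
To make the TLM objective mentioned above more concrete, here is a minimal sketch of how a TLM training example can be built: a sentence and its translation are concatenated into one sequence and tokens on both sides are masked, so the model can use either language as context. The special-token strings, the 15% default masking rate, and always substituting the mask token (rather than a mixed replacement scheme) are simplifying assumptions, not details taken from this summary.

```python
import random
from typing import List, Optional, Tuple

MASK, BOS, EOS = "<mask>", "<s>", "</s>"  # assumed special-token strings


def make_tlm_example(
    src_tokens: List[str],
    tgt_tokens: List[str],
    mask_prob: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Concatenate a sentence pair and mask tokens on both sides.

    Returns (inputs, labels); labels hold the original token at masked
    positions and None everywhere else.
    """
    if rng is None:
        rng = random.Random(0)
    tokens = [BOS] + src_tokens + [EOS, EOS] + tgt_tokens + [EOS]
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        # Special tokens are never masked.
        if tok not in (BOS, EOS) and rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels
```

With plain MLM the same masking would be applied to a single monolingual sentence; the only change for TLM in this sketch is that the input is a concatenated translation pair, which may come from any of the parallel data sources listed earlier.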

Finetuning incorporation

  • Incorporated parallel data into finetuning process
  • Table 2 shows that incorporating parallel data outperforms the baseline in all tasks and languages, for all data sources
  • Supervised MT obtains the best results, followed by unsupervised MT and word-by-word translation with dictionaries
  • Synthetic parallel data derived from monolingual data alone already brings improvements, but real parallel data brings further improvements
  • Simplistic ways to incorporate parallel signals can bring improvements

Discussion

  • The improvement from incorporating parallel data for cross-lingual transfer learning can partly be attributed to the explicit use of a parallel training signal.
  • Unsupervised MT can achieve part of the improvement without the need for real parallel data.
  • Facilitation of parallel interactions is more important than the use of real parallel data in all tasks but XQuAD.
  • Multilingual pretrained models rely on monolingual data, which calls into question the extent to which existing approaches are able to exploit the full potential of such monolingual data.
  • Similar results are obtained for both pretraining and finetuning incorporation, as well as supervised MT and gold standard parallel data.

Reconsidering the categorization of cross-lingual learning approaches

  • Different data types can be used in a pipeline (monolingual source corpora, monolingual target corpora, parallel corpora, downstream data)
  • Different stages of the pipeline can incorporate the data (pretraining, finetuning, testing)
  • Different procedures can be used to incorporate the data (directly or indirectly through MT)
  • Prior work has explored the extent to which monolingual pretraining relies on knowledge transfer from unlabeled corpora.
  • Whether cross-lingual learning likewise relies on knowledge transfer from parallel data has not been examined.
  • Our unsupervised MT variant does not use any additional data compared to regular pretraining.
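
The three axes listed in this section (data types, pipeline stage, procedure) can be written down as a small data structure, which makes the last point easy to check mechanically. This is only one possible encoding: the class and field names are hypothetical, and modeling each approach with a single stage and procedure for its parallel signal is a simplification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Data(Enum):
    MONO_SRC = "monolingual source corpora"
    MONO_TGT = "monolingual target corpora"
    PARALLEL = "parallel corpora"
    DOWNSTREAM = "downstream data"


class Stage(Enum):
    PRETRAINING = "pretraining"
    FINETUNING = "finetuning"
    TESTING = "testing"


class Procedure(Enum):
    DIRECT = "directly"
    VIA_MT = "indirectly through MT"


@dataclass(frozen=True)
class Approach:
    name: str
    data: FrozenSet[Data]
    # Stage and procedure of the parallel signal; None if there is none.
    stage: Optional[Stage] = None
    procedure: Optional[Procedure] = None


BASE_DATA = frozenset({Data.MONO_SRC, Data.MONO_TGT, Data.DOWNSTREAM})

baseline = Approach("regular pretraining", BASE_DATA)
unsup_mt = Approach(
    "unsupervised MT", BASE_DATA, Stage.PRETRAINING, Procedure.VIA_MT
)
gold = Approach(
    "gold parallel data",
    BASE_DATA | {Data.PARALLEL},
    Stage.PRETRAINING,
    Procedure.DIRECT,
)

# The unsupervised MT variant uses no data beyond regular pretraining.
print(unsup_mt.data == baseline.data)  # True
```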

Conclusions

  • Model-generated parallel data can be used for cross-lingual learning.
  • The authors advocate investigating the optimal way to leverage monolingual and/or parallel data for cross-lingual learning.
  • Pretraining incorporation results show that the XLM-R model can be finetuned on English downstream data and zero-shot transferred to the target language.