Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Few-shot translation systems can be trained with unpaired language data.
  • With only 5 examples of high-quality translation data, a transformer decoder-only model can match specialized supervised state-of-the-art models.
  • Few-shot translation systems do not require joint multilingual training or back-translation.
  • Few-shot translation systems are two orders of magnitude smaller than state-of-the-art language models.
  • Quality of few-shot demonstrations heavily determines the quality of translations.
  • Few-shot translation systems can be used to control certain attributes of the translation.

Paper Content

Introduction

  • Current state-of-the-art machine translation systems rely on parallel data mined from the web
  • This is not feasible for most languages
  • Reliance on mined parallel data has potential downsides
  • Unsupervised translation is an alternative
  • Unsupervised translation systems have demonstrated promising performance
  • Few-shot learning is a middle ground between supervised and unsupervised translation
  • Few-shot translation models outperform commercial baselines
  • Few-shot translation models work in low-resource scenarios
  • Demonstrations can be used to constrain output to a desired language register

Experiments on high-resource languages & results

  • Monolingual data and evaluation datasets discussed
  • Architecture and training methods described
  • Experiments conducted using JAX, T5X, and FLAX
  • Performance of few-shot translation models compared to state-of-the-art models

Datasets

  • Training data consists of language-specific corpora
  • For English, similar mix of filtered web pages, Wikipedia, and books used
  • For other languages, only high-quality webpages used
  • Evaluation datasets focus on recent WMT datasets
  • Train-test overlap measured using n-gram matching

Architecture and training procedure

  • Use a Transformer decoder-only architecture with 32 layers, 16 heads, a hidden dimension of 4096, and multi-query attention
  • Use a Sentencepiece model with 128,000 Sentencepieces
  • Use a variant of the UL2 objective specialized for decoder-only models
  • Use two separate span corruption instances with different hyperparameter configurations
  • Include a standard causal language modeling objective
  • Trilingual models use the same amount of data per-language as the bilingual models
  • Perform one epoch over the combined corpus
  • Use Adafactor optimizer with cosine learning rate decay schedule
  • Evaluate models using BLEURT
  • Use development set as a pool of demonstrations for few-shot generation
  • Use MBR decoding with BLEURT for the utility function and beam search with beam size 4 and ฮฑ=0.6

Main results and discussion

  • Three systems from WMT'21 are used for comparison
  • WMT submissions may overestimate performance of general-purpose translation systems
  • Baselines include PaLM and Google Translate
  • Bilingual and Trilingual LMs outperform PaLM in most cases
  • Commercial system outperformed by WMT baselines
  • Few-shot translations outperform WMT baselines in English-XX directions
  • Few-shot translations underperform in XX-English directions
  • MBR consistently improves BLEURT

Performance on a low-resource language

  • Low-resource languages require leveraging data beyond the parallel data available for the language pair.
  • Icelandic is used as a low-resource language of study.
  • Models are warm-started with English-German model.
  • Models are trained on a mixture of English and Icelandic.
  • Learning rate is set to 0.001.
  • Results are competitive with WMT baselines and surpass commercial baseline for English-Icelandic direction.
  • Models develop the ability to extract meaning from text earlier than when they can reliably generate fluent text.

Influence of few-shot demonstrations

  • Few-shot translation models can be competitive with state-of-the-art supervised models.
  • Quality of few-shot demonstrations is an influential predictor for translation quality.
  • Style of few-shot demonstrations influences the style of the translation.

Quality of the demonstrations strongly influences quality of generated translations

  • Concurrent work has shown that the quality of few-shot demonstrations impacts the output.
  • Used an English-German translation dataset with normalized Contrastive Data Selection (CDS) scores.
  • Partitioned examples into 3 buckets based on CDS scores.
  • Quality of translations produced by model decreases as CDS scores increase.

Controllability of output language variety through few-shot demonstrations

  • Example quality is not the only attribute of demonstrations that can influence output quality.
  • Style of demonstrations also influences output in a measurable way.
  • This allows for generation of translations in a given style.
  • FRMT dataset used to measure style influence.
  • FRMT score and lexical accuracy used as metrics.
  • Using demonstrations in the right language variety improves both FRMT score and lexical accuracy.

Controllability of formality through few-shot demonstrations

  • Task of generating translations satisfying a given formality level
  • Languages have built-in rules for expressing formality
  • Focus on English-German language pair
  • Evaluate models using BLEURT and accuracy
  • Draw demonstrations from topical chat split of train set
  • Consider Google Translate and gold finetuned mBART-large UMD submission
  • Initial trilingual experiments showed benefit from multilinguality
  • Quality of parallel data is critical
  • Leverage larger sets of high-quality data while retaining flexibility of few-shot translation paradigm
  • Rely on MBR decoding rather than beam search

Conclusion

  • Investigated potential value of fewshot translation models
  • Showed that these systems can be competitive with supervised models
  • Quality of demonstrations heavily influences quality of translations
  • Paradigm gives way of controlling style of translations
  • Analyzed BLEURT scores on WMT'21 English-Icelandic and Icelandic-English task
  • Followed 15-gram protocol of Chowdhery et al. (2022)
  • Recomputed metrics from textual outputs for fair comparison