Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Few-shot translation systems can be trained with unpaired language data.
With only 5 examples of high-quality translation data, a transformer decoder-only model can match specialized supervised state-of-the-art models.
Few-shot translation systems do not require joint multilingual training or back-translation.
Few-shot translation systems are two orders of magnitude smaller than state-of-the-art language models.
Quality of few-shot demonstrations heavily determines the quality of translations.
Few-shot translation systems can be used to control certain attributes of the translation.

Paper Content

Introduction

Current state-of-the-art machine translation systems rely on parallel data mined from the web
This is not feasible for most languages
Reliance on mined parallel data has potential downsides
Unsupervised translation is an alternative
Unsupervised translation systems have demonstrated promising performance
Few-shot learning is a middle ground between supervised and unsupervised translation
Few-shot translation models outperform commercial baselines
Few-shot translation models work in low-resource scenarios
Demonstrations can be used to constrain output to a desired language register

Experiments on high-resource languages & results

Monolingual data and evaluation datasets discussed
Architecture and training methods described
Experiments conducted using JAX, T5X, and FLAX
Performance of few-shot translation models compared to state-of-the-art models

Datasets

Training data consists of language-specific corpora
For English, similar mix of filtered web pages, Wikipedia, and books used
For other languages, only high-quality webpages used
Evaluation datasets focus on recent WMT datasets
Train-test overlap measured using n-gram matching

Architecture and training procedure

Use a Transformer decoder-only architecture with 32 layers, 16 heads, a hidden dimension of 4096, and multi-query attention
Use a Sentencepiece model with 128,000 Sentencepieces
Use a variant of the UL2 objective specialized for decoder-only models
Use two separate span corruption instances with different hyperparameter configurations
Include a standard causal language modeling objective
Trilingual models use the same amount of data per-language as the bilingual models
Perform one epoch over the combined corpus
Use Adafactor optimizer with cosine learning rate decay schedule
Evaluate models using BLEURT
Use development set as a pool of demonstrations for few-shot generation
Use MBR decoding with BLEURT for the utility function and beam search with beam size 4 and α=0.6

Main results and discussion

Three systems from WMT'21 are used for comparison
WMT submissions may overestimate performance of general-purpose translation systems
Baselines include PaLM and Google Translate
Bilingual and Trilingual LMs outperform PaLM in most cases
Commercial system outperformed by WMT baselines
Few-shot translations outperform WMT baselines in English-XX directions
Few-shot translations underperform in XX-English directions
MBR consistently improves BLEURT

Performance on a low-resource language

Low-resource languages require leveraging data beyond the parallel data available for the language pair.
Icelandic is used as a low-resource language of study.
Models are warm-started with English-German model.
Models are trained on a mixture of English and Icelandic.
Learning rate is set to 0.001.
Results are competitive with WMT baselines and surpass commercial baseline for English-Icelandic direction.
Models develop the ability to extract meaning from text earlier than when they can reliably generate fluent text.

Influence of few-shot demonstrations

Few-shot translation models can be competitive with state-of-the-art supervised models.
Quality of few-shot demonstrations is an influential predictor for translation quality.
Style of few-shot demonstrations influences the style of the translation.

Quality of the demonstrations strongly influences quality of generated translations

Concurrent work has shown that the quality of few-shot demonstrations impacts the output.
Used an English-German translation dataset with normalized Contrastive Data Selection (CDS) scores.
Partitioned examples into 3 buckets based on CDS scores.
Quality of translations produced by model decreases as CDS scores increase.

Controllability of output language variety through few-shot demonstrations

Example quality is not the only attribute of demonstrations that can influence output quality.
Style of demonstrations also influences output in a measurable way.
This allows for generation of translations in a given style.
FRMT dataset used to measure style influence.
FRMT score and lexical accuracy used as metrics.
Using demonstrations in the right language variety improves both FRMT score and lexical accuracy.

Controllability of formality through few-shot demonstrations

Task of generating translations satisfying a given formality level
Languages have built-in rules for expressing formality
Focus on English-German language pair
Evaluate models using BLEURT and accuracy
Draw demonstrations from topical chat split of train set
Consider Google Translate and gold finetuned mBART-large UMD submission
Initial trilingual experiments showed benefit from multilinguality
Quality of parallel data is critical
Leverage larger sets of high-quality data while retaining flexibility of few-shot translation paradigm
Rely on MBR decoding rather than beam search

Conclusion

Investigated potential value of fewshot translation models
Showed that these systems can be competitive with supervised models
Quality of demonstrations heavily influences quality of translations
Paradigm gives way of controlling style of translations
Analyzed BLEURT scores on WMT'21 English-Icelandic and Icelandic-English task
Followed 15-gram protocol of Chowdhery et al. (2022)
Recomputed metrics from textual outputs for fair comparison

Link to paper#

Abstract#

Paper Content#

Introduction#

Experiments on high-resource languages & results#

Datasets#

Architecture and training procedure#

Main results and discussion#

Performance on a low-resource language#

Influence of few-shot demonstrations#

Quality of the demonstrations strongly influences quality of generated translations#

Controllability of output language variety through few-shot demonstrations#

Controllability of formality through few-shot demonstrations#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Experiments on high-resource languages & results

Datasets

Architecture and training procedure

Main results and discussion

Performance on a low-resource language

Influence of few-shot demonstrations

Quality of the demonstrations strongly influences quality of generated translations

Controllability of output language variety through few-shot demonstrations

Controllability of formality through few-shot demonstrations

Conclusion