Abstract
- Few-shot translation systems can be trained with unpaired language data.
- With only 5 examples of high-quality translation data, a transformer decoder-only model can match specialized supervised state-of-the-art models.
- Few-shot translation systems do not require joint multilingual training or back-translation.
- Few-shot translation systems are two orders of magnitude smaller than state-of-the-art language models.
- Quality of few-shot demonstrations heavily determines the quality of translations.
- Few-shot translation systems can be used to control certain attributes of the translation.
Paper Content
Introduction
- Current state-of-the-art machine translation systems rely on parallel data mined from the web
- This is not feasible for most languages
- Reliance on mined parallel data has potential downsides
- Unsupervised translation is an alternative
- Unsupervised translation systems have demonstrated promising performance
- Few-shot learning is a middle ground between supervised and unsupervised translation
- Few-shot translation models outperform commercial baselines
- Few-shot translation models work in low-resource scenarios
- Demonstrations can be used to constrain output to a desired language register
Experiments on high-resource languages & results
- Monolingual data and evaluation datasets discussed
- Architecture and training methods described
- Experiments conducted using JAX, T5X, and FLAX
- Performance of few-shot translation models compared to state-of-the-art models
Datasets
- Training data consists of language-specific corpora
- For English, similar mix of filtered web pages, Wikipedia, and books used
- For other languages, only high-quality webpages used
- Evaluation datasets focus on recent WMT datasets
- Train-test overlap measured using n-gram matching
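The overlap check can be sketched as below. This is a minimal illustration, not the authors' code: whitespace tokenization, the function names, and the rule that a single shared n-gram flags an example are all assumptions. The default of n=15 follows the 15-gram protocol mentioned later in this summary.

```python
def ngrams(tokens, n):
    """Return the set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlapping_examples(train_docs, test_examples, n=15):
    """Return indices of test examples that share at least one n-gram with training data.

    Assumption: text is tokenized on whitespace; a real pipeline may normalize differently.
    """
    train_ngrams = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc.split(), n)

    flagged = []
    for idx, example in enumerate(test_examples):
        if ngrams(example.split(), n) & train_ngrams:
            flagged.append(idx)
    return flagged
```

Examples shorter than n tokens produce no n-grams and are therefore never flagged under this sketch.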
Architecture and training procedure
- Use a Transformer decoder-only architecture with 32 layers, 16 heads, a hidden dimension of 4096, and multi-query attention
- Use a Sentencepiece model with 128,000 Sentencepieces
- Use a variant of the UL2 objective specialized for decoder-only models
- Use two separate span corruption instances with different hyperparameter configurations
- Include a standard causal language modeling objective
- Trilingual models use the same amount of data per-language as the bilingual models
- Perform one epoch over the combined corpus
- Use Adafactor optimizer with cosine learning rate decay schedule
- Evaluate models using BLEURT
- Use development set as a pool of demonstrations for few-shot generation
- Use MBR decoding with BLEURT for the utility function and beam search with beam size 4 and α=0.6
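The MBR decoding step can be illustrated with a short sketch. It assumes the candidate translations have already been generated, and it uses a toy token-overlap F1 as a stand-in for BLEURT; `mbr_decode` and `token_f1` are hypothetical names introduced here, not part of the paper or of any library.

```python
def mbr_decode(candidates, utility):
    """Return the candidate with the highest average utility against the other candidates."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    best, best_score = None, float("-inf")
    for i, hyp in enumerate(candidates):
        # Treat every other candidate as a pseudo-reference for this hypothesis.
        total = sum(utility(hyp, ref) for j, ref in enumerate(candidates) if j != i)
        score = total / (len(candidates) - 1)
        if score > best_score:
            best, best_score = hyp, score
    return best


def token_f1(hyp, ref):
    """Toy stand-in for BLEURT (assumption): F1 over unique whitespace tokens."""
    h, r = set(hyp.split()), set(ref.split())
    common = len(h & r)
    if common == 0:
        return 0.0
    precision, recall = common / len(h), common / len(r)
    return 2 * precision * recall / (precision + recall)
```

In practice the utility would be a learned metric such as BLEURT, and the quadratic number of utility calls is the main cost of this approach.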
Main results and discussion
- Three systems from WMT'21 are used for comparison
- WMT submissions may overestimate performance of general-purpose translation systems
- Baselines include PaLM and Google Translate
- Bilingual and Trilingual LMs outperform PaLM in most cases
- Commercial system outperformed by WMT baselines
- Few-shot translations outperform WMT baselines in English-XX directions
- Few-shot translations underperform in XX-English directions
- MBR consistently improves BLEURT
Performance on a low-resource language
- Low-resource languages require leveraging data beyond the parallel data available for the language pair.
- Icelandic is used as a low-resource language of study.
- Models are warm-started from an English-German model.
- Models are trained on a mixture of English and Icelandic.
- Learning rate is set to 0.001.
- Results are competitive with WMT baselines and surpass commercial baseline for English-Icelandic direction.
- Models develop the ability to extract meaning from text earlier than when they can reliably generate fluent text.
Influence of few-shot demonstrations
- Few-shot translation models can be competitive with state-of-the-art supervised models.
- Quality of few-shot demonstrations is an influential predictor for translation quality.
- Style of few-shot demonstrations influences the style of the translation.
Quality of the demonstrations strongly influences quality of generated translations
- Concurrent work has shown that the quality of few-shot demonstrations impacts the output.
- Used an English-German translation dataset with normalized Contrastive Data Selection (CDS) scores.
- Partitioned examples into 3 buckets based on CDS scores.
- Quality of translations produced by model decreases as CDS scores increase.
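The bucketing step might look like the following sketch. The summary does not say how the bucket boundaries were chosen, so equal-sized buckets over the sorted scores are an assumption, as are the function name and the `(score, example)` pair layout.

```python
def partition_by_score(examples, num_buckets=3):
    """Sort (score, example) pairs by score and split them into near-equal buckets.

    Assumption: buckets are equal-sized quantiles; the actual thresholds may differ.
    """
    ranked = sorted(examples, key=lambda pair: pair[0])
    size, remainder = divmod(len(ranked), num_buckets)

    buckets, start = [], 0
    for b in range(num_buckets):
        # Spread any leftover items over the first `remainder` buckets.
        end = start + size + (1 if b < remainder else 0)
        buckets.append(ranked[start:end])
        start = end
    return buckets
```

Demonstrations can then be drawn from one bucket at a time to compare translation quality across score ranges.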
Controllability of output language variety through few-shot demonstrations
- Example quality is not the only attribute of demonstrations that can influence output quality.
- Style of demonstrations also influences output in a measurable way.
- This allows for generation of translations in a given style.
- FRMT dataset used to measure style influence.
- FRMT score and lexical accuracy used as metrics.
- Using demonstrations in the right language variety improves both FRMT score and lexical accuracy.
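Selecting demonstrations by language variety can be sketched as follows. The prompt template, the `variety` field, and the default labels are assumptions made for illustration; the summary does not specify the exact prompt format. The default of five demonstrations matches the number quoted in the abstract.

```python
import random


def build_prompt(pool, source_sentence, variety, k=5, seed=0,
                 src_label="English", tgt_label="Portuguese"):
    """Build a few-shot prompt using only demonstrations from the requested variety.

    Assumption: each pool entry is a dict with 'source', 'target', and 'variety' keys.
    """
    matching = [demo for demo in pool if demo["variety"] == variety]
    if len(matching) < k:
        raise ValueError(f"only {len(matching)} demonstrations for variety {variety!r}")

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    demos = rng.sample(matching, k)

    lines = []
    for demo in demos:
        lines.append(f"{src_label}: {demo['source']}")
        lines.append(f"{tgt_label}: {demo['target']}")
    # End with the new source sentence and an empty target for the model to complete.
    lines.append(f"{src_label}: {source_sentence}")
    lines.append(f"{tgt_label}:")
    return "\n".join(lines)
```

Swapping the `variety` argument changes which demonstrations the model sees while leaving the rest of the prompt untouched, which is the mechanism the bullets above describe.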
Controllability of formality through few-shot demonstrations
- Task of generating translations satisfying a given formality level
- Languages have built-in rules for expressing formality
- Focus on English-German language pair
- Evaluate models using BLEURT and accuracy
- Draw demonstrations from topical chat split of train set
- Compare against Google Translate and the gold-finetuned mBART-large UMD submission
- Initial trilingual experiments showed benefit from multilinguality
- Quality of parallel data is critical
- Leverage larger sets of high-quality data while retaining flexibility of few-shot translation paradigm
- Rely on MBR decoding rather than beam search
Conclusion
- Investigated the potential value of few-shot translation models
- Showed that these systems can be competitive with supervised models
- Quality of demonstrations heavily influences quality of translations
- The paradigm offers a way to control the style of translations
- Analyzed BLEURT scores on WMT'21 English-Icelandic and Icelandic-English task
- Followed 15-gram protocol of Chowdhery et al. (2022)
- Recomputed metrics from textual outputs for fair comparison