Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Automatic machine translation metrics are used to measure translation quality across large test sets.
- It is unclear if these metrics are reliable at distinguishing good and bad translations at the sentence level.
- This paper investigates how useful MT metrics are at detecting the success of a machine translation component in a larger platform.
- Experiments show that all metrics have negligible correlation with the extrinsic evaluation of the downstream outcomes.
- Neural metrics are not interpretable due to undefined ranges.
- Future MT metrics should be designed to produce error labels rather than scores.
Paper Content
Introduction
- Machine translation is increasingly used as part of complex NLP platforms
- Errors in machine translation can lead to poor user experience
- It is important to detect incorrect translations before they propagate
- MT metrics can pick up subtle differences between MT systems
- Human judgements have not been well correlated with metrics at the segment level
- A new classification task was derived to measure how indicative segment level scores are for downstream performance of an extrinsic cross-lingual task
- Nine metrics had minimal correlation with extrinsic task performance
- Recommendations proposed for developing MT metrics to produce useful segment-level output
Related work
- Evaluation of machine translation has been of great research interest
- Conference on Machine Translation (WMT) has been organizing annual shared tasks on automatic MT evaluation since 2006
- Human evaluation of MT systems has been carried out based on guidelines for fluency, adequacy and/or comprehensibility
- Different methods to compute the correlation between the scores produced by the metrics and this human evaluation have been suggested
- Meta evaluation progress is generally documented in a metrics shared task overview
- Task based MT evaluation has been well studied in the literature
- Our work focuses on the usability of the metrics judged on their ability to work with downstream tasks
Methodology
- Aim is to determine how reliable MT metrics are for predicting success on downstream tasks
- Evaluation setup and metrics are described
- Fig 1 provides an illustration
Setup
- Train a model for a task on a monolingual setup
- Evaluate the source language on each task
- Use Translate-Test paradigm to translate test data from target language to source language
- Use OPUS, M2M100 or dataset-provided translations
- Data across target languages is parallel
- Construct binary classification benchmark using predictions from target language
- Consider only correct predictions from source language to avoid errors from task complexity
- All incorrect predictions from target language are from erroneous translations
- Score triples of source, hypothesis and reference using metrics
- Define threshold for scores to turn metrics into classifiers
- Evaluate metrics on how well predictions for good/bad translation correlate with success/failure of end task for target language
Tasks
- Evaluate metrics on dialogue state tracking, multi-turn conversations, and extrinsic tasks
- MultiWoZ 2.1 dataset for dialogue state tracking
- Multi 2 WoZ dataset for development and test set in German, Russian, Chinese, and Arabic
- Dialogue state tracking model trained on English dataset by Lee et al. (2019)
- Semantic parsing transforms natural language utterances into logical forms
- MultiATIS++SQL dataset from Sherborne and Lapata (2022)
- Extractive question answering using XQuAD dataset (Artetxe et al., 2020)
- Model finetunes RoBERTa (Liu et al., 2019) on SQuAD training set
- Metrics scores produced for binary classification task
Metrics
- Metrics are designed based on surface level overlap, embedding similarity, and neural metrics
- BLEU is a string-matching metric that compares token-level n-grams of hypothesis and reference
- chrF is a character-level metric that computes F-score based on overlap between hypothesis and reference
- BERTScore uses pre-trained language models to compute similarity between tokens in reference and generated translation
- WMT organizes annual shared task to develop MT models and human evaluation is used to determine best-performing system
- Human evaluation follows two protocols: Direct Assessment and Expert-based Evaluation
- Metrics are either reference-based or reference-free
- COMET and UniTE are neural translation metrics that use WMT evaluation data for training
- COMET-QE and UniTE-QE are reference-free versions of COMET and UniTE
Metric evaluation
- Meta-evaluation is carried out using binary classification task
- Macro-F1 and Matthew’s Correlation Coefficient (MCC) are used to evaluate
- Macro-F1 range is 0 to 1 and gives equal weightage to positive and negative class
- MCC range is -1 to 1, 0 indicates no correlation, 0-0.3 indicates negligible correlation, 0.3-0.5 indicates low correlation
Results
- Results for dialogue state tracking, semantic parsing, and question answering are reported in Table 1.
- Fine-grained results are reported in Appendix B.
- Random baseline used for comparison.
Performance on extrinsic tasks
- Metrics perform better than random baseline on macro-F1 metric
- Metrics show negligible correlation under almost all settings
- Neural metrics not better than surface overlap metrics
- Decreasing order of performance: semantic parsing, question answering, dialogue state tracking
- Reference-based versions better than reference-free for semantic parsing and question answering
- Reference-free performs same or better than reference-based for dialogue state tracking
- COMET-QE-DA and COMET-DA better than COMET-QE-MQM and COMET-MQM for question answering
- No clear winner for dialogue state tracking and semantic parsing
- No specific language pairs stand out as easier or harder across tasks
- Poor performance across all languages
Case study
- Semantic parsing task studied with COMET-DA
- Macro-F1 score of 0.59 and MCC value of 0.187
- Threshold of -0.028, suggesting good translations receive negative scores
- Metric makes mistakes throughout the bins
- 54% of wrong predictions due to inability to detect mistranslation
- 20% of errors due to task-specific model
Finding the threshold
- Automatic metrics require additional context to interpret system-level scores.
- Finding the right threshold to identify if a translation needs correction is not straightforward.
- Thresholds of metrics are not consistent and vary greatly across language pairs.
- Some language pairs have negative thresholds.
Reference-based metrics in an online setting
- Reference-based methods are tested in an online setting without access to references.
- A round-trip translation task is used to evaluate the reference-based methods.
- Most metrics improve their performance when using the new reference, except when the target language is Russian.
Recommendations
- Prefer MQM for Human Evaluation of MT outputs
- MT Metrics could produce Labels over Scores
- Use a Combination of MT metrics for the End Tasks
- Add Diverse References during Training
Conclusion
- Proposed method for evaluating MT metrics is reliable and does not depend on human judgements
- Evaluated nine different metrics on ability to detect errors in generated translations
- Used three extrinsic tasks - dialogue state tracking, question answering, and semantic parsing
- Segment-level scores produced by all metrics show negligible correlation with success/failure outcomes of end task
- Recommended predicting error types instead of error scores
- Evaluated 37 language pairs, mostly high-resource languages
- Results provide conclusive trends to metric developers
- Results can be extended to other datasets, tasks, and languages
- Future work should incorporate partially correct outputs