Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Automatic machine translation metrics are used to measure translation quality across large test sets.
It is unclear if these metrics are reliable at distinguishing good and bad translations at the sentence level.
This paper investigates how useful MT metrics are at detecting the success of a machine translation component in a larger platform.
Experiments show that all metrics have negligible correlation with the extrinsic evaluation of the downstream outcomes.
Scores from neural metrics are hard to interpret because their ranges are undefined.
Future MT metrics should be designed to produce error labels rather than scores.

Paper Content

Introduction

Machine translation is increasingly used as part of complex NLP platforms
Errors in machine translation can lead to poor user experience
It is important to detect incorrect translations before they propagate
MT metrics can pick up subtle differences between MT systems
Human judgements have not been well correlated with metrics at the segment level
A new classification task was derived to measure how indicative segment level scores are for downstream performance of an extrinsic cross-lingual task
Nine metrics had minimal correlation with extrinsic task performance
Recommendations proposed for developing MT metrics to produce useful segment-level output

Related work

Evaluation of machine translation has been of great research interest
Conference on Machine Translation (WMT) has been organizing annual shared tasks on automatic MT evaluation since 2006
Human evaluation of MT systems has been carried out based on guidelines for fluency, adequacy and/or comprehensibility
Different methods to compute the correlation between the scores produced by the metrics and this human evaluation have been suggested
Meta evaluation progress is generally documented in a metrics shared task overview
Task based MT evaluation has been well studied in the literature
Our work focuses on the usability of the metrics judged on their ability to work with downstream tasks

Methodology

Aim is to determine how reliable MT metrics are for predicting success on downstream tasks
Evaluation setup and metrics are described
Fig 1 provides an illustration

Setup

Train a model for a task on a monolingual setup
Evaluate the model on the source-language test data for each task
Use Translate-Test paradigm to translate test data from target language to source language
Use OPUS, M2M100 or dataset-provided translations
Data across target languages is parallel
Construct binary classification benchmark from the model's predictions on the translated target-language data
Keep only examples the model predicts correctly in the source language, to rule out errors caused by task complexity
Any remaining incorrect prediction on the target-language data is then attributed to an erroneous translation
Score triples of source, hypothesis and reference using metrics
Define threshold for scores to turn metrics into classifiers
Evaluate metrics on how well predictions for good/bad translation correlate with success/failure of end task for target language
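The steps above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Example` record, its field names, the toy scores, and the threshold of 0.5 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Example:
    # All field names are hypothetical, chosen for this sketch.
    source_correct: bool  # task model was right on the original source-language input
    target_correct: bool  # task model was right on the translated target-language input
    metric_score: float   # segment-level score assigned by an MT metric


def build_benchmark(examples):
    """Keep examples solved in the source language; label them by target-side outcome."""
    kept = [ex for ex in examples if ex.source_correct]
    # Label 1: task succeeded, translation treated as good. Label 0: treated as bad.
    labels = [1 if ex.target_correct else 0 for ex in kept]
    scores = [ex.metric_score for ex in kept]
    return scores, labels


def classify(scores, threshold):
    """Turn metric scores into good (1) / bad (0) predictions."""
    return [1 if s >= threshold else 0 for s in scores]


examples = [
    Example(True, True, 0.82),
    Example(True, False, 0.31),
    Example(False, False, 0.10),  # dropped: the task model already fails on the source
    Example(True, True, 0.47),
    Example(True, False, 0.55),
]
scores, labels = build_benchmark(examples)
predictions = classify(scores, threshold=0.5)
```

The metric is then judged by how well `predictions` agrees with `labels`.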

Tasks

Evaluate metrics on three extrinsic tasks: dialogue state tracking, semantic parsing, and extractive question answering
MultiWoZ 2.1 dataset for dialogue state tracking
Multi2WoZ dataset for development and test set in German, Russian, Chinese, and Arabic
Dialogue state tracking model trained on English dataset by Lee et al. (2019)
Semantic parsing transforms natural language utterances into logical forms
MultiATIS++SQL dataset from Sherborne and Lapata (2022)
Extractive question answering using XQuAD dataset (Artetxe et al., 2020)
Model finetunes RoBERTa (Liu et al., 2019) on SQuAD training set
Metrics scores produced for binary classification task

Metrics

Metrics are designed based on surface level overlap, embedding similarity, and neural metrics
BLEU is a string-matching metric that compares token-level n-grams of hypothesis and reference
chrF is a character-level metric that computes F-score based on overlap between hypothesis and reference
BERTScore uses pre-trained language models to compute similarity between tokens in reference and generated translation
WMT organizes annual shared task to develop MT models and human evaluation is used to determine best-performing system
Human evaluation follows two protocols: Direct Assessment and Expert-based Evaluation
Metrics are either reference-based or reference-free
COMET and UniTE are neural translation metrics that use WMT evaluation data for training
COMET-QE and UniTE-QE are reference-free versions of COMET and UniTE
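To make the idea of a character-level overlap metric concrete, here is a simplified chrF-style score. It is a sketch under stated assumptions, not the reference implementation: the defaults `max_n=6` and `beta=2.0` are assumed settings, whitespace is treated like any other character, and the function name `simple_chrf` is hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count all character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-score between a hypothesis and a reference."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Identical strings score 1.0 and strings with no characters in common score 0.0. Unlike the neural metrics, the range here is bounded and known in advance.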

Metric evaluation

Meta-evaluation is carried out using binary classification task
Macro-F1 and Matthews Correlation Coefficient (MCC) are used to evaluate
Macro-F1 ranges from 0 to 1 and gives equal weight to the positive and negative class
MCC ranges from -1 to 1: 0 indicates no correlation, 0 to 0.3 negligible correlation, and 0.3 to 0.5 low correlation
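Both measures can be computed directly from the confusion matrix of the binary task. The sketch below uses the standard definitions; the function names and the toy labels are illustrative assumptions.

```python
import math


def confusion(labels, preds):
    """Return (tp, tn, fp, fn) for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return tp, tn, fp, fn


def f1(hits, false_alarms, misses):
    denom = 2 * hits + false_alarms + misses
    return 2 * hits / denom if denom else 0.0


def macro_f1(labels, preds):
    """Unweighted mean of the F1 scores of the positive and negative class."""
    tp, tn, fp, fn = confusion(labels, preds)
    f1_pos = f1(tp, fp, fn)
    f1_neg = f1(tn, fn, fp)  # for the negative class the error roles swap
    return (f1_pos + f1_neg) / 2


def mcc(labels, preds):
    """Matthews Correlation Coefficient; returns 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion(labels, preds)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


labels = [1, 0, 1, 0]
chance_like = [1, 0, 0, 1]  # half right, half wrong
```

For `chance_like`, macro-F1 is 0.5 and MCC is 0.0, which is why the two are reported together: a macro-F1 above a random baseline can still coincide with negligible correlation.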

Results

Results for dialogue state tracking, semantic parsing, and question answering are reported in Table 1.
Fine-grained results are reported in Appendix B.
Random baseline used for comparison.

Performance on extrinsic tasks

Metrics perform better than random baseline on macro-F1 metric
Metrics show negligible correlation under almost all settings
Neural metrics not better than surface overlap metrics
Decreasing order of performance: semantic parsing, question answering, dialogue state tracking
Reference-based versions better than reference-free for semantic parsing and question answering
Reference-free performs same or better than reference-based for dialogue state tracking
COMET-QE-DA and COMET-DA better than COMET-QE-MQM and COMET-MQM for question answering
No clear winner for dialogue state tracking and semantic parsing
No specific language pairs stand out as easier or harder across tasks
Poor performance across all languages

Case study

Semantic parsing task studied with COMET-DA
Macro-F1 score of 0.59 and MCC value of 0.187
Threshold of -0.028, suggesting that even good translations can receive negative scores
Metric makes mistakes throughout the bins
54% of wrong predictions due to inability to detect mistranslation
20% of errors due to task-specific model

Finding the threshold

Automatic metrics require additional context to interpret system-level scores.
Finding the right threshold to identify if a translation needs correction is not straightforward.
Thresholds of metrics are not consistent and vary greatly across language pairs.
Some language pairs have negative thresholds.
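One way to pick such a threshold is a sweep over a development set, keeping the cut-off that maximises macro-F1. The selection criterion, the function names, and the toy development data below are assumptions for illustration; the summary above does not spell out the exact procedure.

```python
def macro_f1(labels, preds):
    """Unweighted mean of positive-class and negative-class F1."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    f1_pos = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    f1_neg = 2 * tn / (2 * tn + fn + fp) if (2 * tn + fn + fp) else 0.0
    return (f1_pos + f1_neg) / 2


def best_threshold(scores, labels):
    """Try every observed score as a cut-off; keep the one with the highest macro-F1."""
    best_t, best_f = None, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        f = macro_f1(labels, preds)
        if f > best_f:  # ties keep the lowest threshold
            best_t, best_f = t, f
    return best_t, best_f


# Toy development set (made up): some good translations score below zero.
dev_scores = [-0.40, -0.10, -0.03, 0.05, 0.20, 0.60]
dev_labels = [0, 0, 1, 1, 0, 1]
threshold, score = best_threshold(dev_scores, dev_labels)
```

On this toy data the selected threshold is negative (-0.03), mirroring the observation above. A threshold found this way is specific to the metric and language pair it was tuned on.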

Reference-based metrics in an online setting

Reference-based methods are tested in an online setting without access to references.
A round-trip translation task is used to evaluate the reference-based methods.
Most metrics improve their performance when using the new reference, except when the target language is Russian.

Recommendations

Prefer MQM for Human Evaluation of MT outputs
MT Metrics could produce Labels over Scores
Use a Combination of MT metrics for the End Tasks
Add Diverse References during Training
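The recommendation to combine metrics does not prescribe a mechanism here. As one simple possibility, each metric can be thresholded separately and the decisions combined by majority vote. Everything in this sketch is an assumption: the voting scheme, the metric names, the scores, and the thresholds are hypothetical.

```python
def majority_vote(score_by_metric, threshold_by_metric):
    """Return 1 (good translation) if most metrics clear their own threshold, else 0."""
    votes = [
        1 if score_by_metric[name] >= threshold_by_metric[name] else 0
        for name in score_by_metric
    ]
    # Strict majority; an even split counts as bad to stay on the cautious side.
    return 1 if 2 * sum(votes) > len(votes) else 0


# Hypothetical metrics with per-metric thresholds tuned separately.
thresholds = {"metric_a": 0.50, "metric_b": -0.03, "metric_c": 0.30}
segment_scores = {"metric_a": 0.60, "metric_b": -0.10, "metric_c": 0.40}
decision = majority_vote(segment_scores, thresholds)
```

Per-metric thresholds matter because the metrics do not share a score range.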

Conclusion

Proposed method for evaluating MT metrics is reliable and does not depend on human judgements
Evaluated nine different metrics on ability to detect errors in generated translations
Used three extrinsic tasks - dialogue state tracking, question answering, and semantic parsing
Segment-level scores produced by all metrics show negligible correlation with success/failure outcomes of end task
Recommended predicting error types instead of error scores
Evaluated 37 language pairs, mostly high-resource languages
Results provide conclusive trends to metric developers
Results can be extended to other datasets, tasks, and languages
Future work should incorporate partially correct outputs
