

  • Automatic machine translation metrics are used to measure translation quality across large test sets.
  • It is unclear if these metrics are reliable at distinguishing good and bad translations at the sentence level.
  • This paper investigates how useful MT metrics are at detecting the success of a machine translation component in a larger platform.
  • Experiments show that all metrics have negligible correlation with the extrinsic evaluation of the downstream outcomes.
  • Scores from neural metrics are hard to interpret because their ranges are undefined.
  • Future MT metrics should be designed to produce error labels rather than scores.

Paper Content


  • Machine translation is increasingly used as part of complex NLP platforms
  • Errors in machine translation can lead to poor user experience
  • It is important to detect incorrect translations before they propagate
  • MT metrics can pick up subtle differences between MT systems
  • Metric scores have not correlated well with human judgements at the segment level
  • A new classification task was derived to measure how indicative segment level scores are for downstream performance of an extrinsic cross-lingual task
  • Nine metrics had minimal correlation with extrinsic task performance
  • Recommendations proposed for developing MT metrics to produce useful segment-level output
  • Evaluation of machine translation has been of great research interest
  • Conference on Machine Translation (WMT) has been organizing annual shared tasks on automatic MT evaluation since 2006
  • Human evaluation of MT systems has been carried out based on guidelines for fluency, adequacy and/or comprehensibility
  • Different methods to compute the correlation between the scores produced by the metrics and this human evaluation have been suggested
  • Meta evaluation progress is generally documented in a metrics shared task overview
  • Task based MT evaluation has been well studied in the literature
  • Our work focuses on the usability of the metrics judged on their ability to work with downstream tasks


  • Aim is to determine how reliable MT metrics are for predicting success on downstream tasks
  • Evaluation setup and metrics are described
  • Fig 1 provides an illustration


  • Train a model for a task on a monolingual setup
  • Evaluate the model on the source-language test set for each task
  • Use Translate-Test paradigm to translate test data from target language to source language
  • Use OPUS, M2M100 or dataset-provided translations
  • Data across target languages is parallel
  • Construct binary classification benchmark using predictions from target language
  • Consider only correct predictions from source language to avoid errors from task complexity
  • Incorrect predictions on the target language are then attributed to erroneous translations
  • Score triples of source, hypothesis and reference using metrics
  • Define threshold for scores to turn metrics into classifiers
  • Evaluate metrics on how well predictions for good/bad translation correlate with success/failure of end task for target language
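
The setup above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Example` record, its field names, the toy scores, and the `threshold=0.5` value are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Example:
    # Hypothetical record; field names are illustrative only.
    source_correct: bool  # task model is right on the source-language input
    target_correct: bool  # task model is right on the translated input
    metric_score: float   # segment-level MT metric score for the translation


def build_benchmark(examples):
    """Keep only examples solved in the source language, then label each
    translation 1 (good) if the end task still succeeds, else 0 (bad)."""
    kept = [ex for ex in examples if ex.source_correct]
    labels = [1 if ex.target_correct else 0 for ex in kept]
    scores = [ex.metric_score for ex in kept]
    return scores, labels


def classify(scores, threshold):
    """Turn a metric into a classifier: scores at or above the threshold
    are predicted to be good translations."""
    return [1 if s >= threshold else 0 for s in scores]


examples = [
    Example(True, True, 0.82),
    Example(True, False, 0.35),
    Example(False, False, 0.10),  # dropped: the task itself failed in the source language
    Example(True, True, 0.41),
    Example(True, False, 0.67),
]
scores, labels = build_benchmark(examples)
predictions = classify(scores, threshold=0.5)
```

Comparing `predictions` against `labels` is then an ordinary binary classification evaluation.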


  • Evaluate metrics on three extrinsic tasks: dialogue state tracking, semantic parsing, and extractive question answering
  • MultiWoZ 2.1 dataset for dialogue state tracking
  • Multi2WoZ dataset provides development and test sets in German, Russian, Chinese, and Arabic
  • Dialogue state tracking model trained on English dataset by Lee et al. (2019)
  • Semantic parsing transforms natural language utterances into logical forms
  • MultiATIS++SQL dataset from Sherborne and Lapata (2022)
  • Extractive question answering using XQuAD dataset (Artetxe et al., 2020)
  • Model finetunes RoBERTa (Liu et al., 2019) on SQuAD training set
  • Metrics scores produced for binary classification task


  • Metrics fall into three groups: surface-level overlap, embedding similarity, and trained neural metrics
  • BLEU is a string-matching metric that compares token-level n-grams of hypothesis and reference
  • chrF is a character-level metric that computes F-score based on overlap between hypothesis and reference
  • BERTScore uses pre-trained language models to compute similarity between tokens in reference and generated translation
  • WMT organizes annual shared task to develop MT models and human evaluation is used to determine best-performing system
  • Human evaluation follows two protocols: Direct Assessment and Expert-based Evaluation
  • Metrics are either reference-based or reference-free
  • COMET and UniTE are neural translation metrics that use WMT evaluation data for training
  • COMET-QE and UniTE-QE are reference-free versions of COMET and UniTE
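
To make the surface-overlap family concrete, here is a simplified character n-gram F-score in the spirit of chrF. It is a sketch under stated assumptions, not the reference implementation: the function names are hypothetical, spaces are dropped before counting, and the defaults (`max_n=6`, `beta=2.0`) are assumed values that weight recall more heavily than precision.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (an assumption of this sketch)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Average n-gram precision and recall over orders 1..max_n,
    then combine them into an F-beta score."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


identical = simple_chrf("the cat", "the cat")
disjoint = simple_chrf("abc", "xyz")
partial = simple_chrf("the cat sat", "the cat sits")
```

A score like `partial` reflects only string overlap; it carries no signal about whether the mismatch changes the meaning the downstream task depends on.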

Metric evaluation

  • Meta-evaluation is carried out using binary classification task
  • Macro-F1 and the Matthews Correlation Coefficient (MCC) are used for evaluation
  • Macro-F1 ranges from 0 to 1 and gives equal weight to the positive and negative classes
  • MCC range is -1 to 1, 0 indicates no correlation, 0-0.3 indicates negligible correlation, 0.3-0.5 indicates low correlation
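
Both measures follow directly from the binary confusion matrix. The sketch below uses the standard definitions; the function names and the toy labels are illustrative assumptions.

```python
import math


def confusion(labels, predictions):
    """Return (tp, tn, fp, fn), treating label 1 as the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return tp, tn, fp, fn


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(labels, predictions):
    """Unweighted mean of the F1 scores of the positive and negative classes."""
    tp, tn, fp, fn = confusion(labels, predictions)
    positive_f1 = f1(tp, fp, fn)
    negative_f1 = f1(tn, fn, fp)  # roles of fp and fn swap for the negative class
    return (positive_f1 + negative_f1) / 2


def mcc(labels, predictions):
    """Matthews Correlation Coefficient; returns 0.0 when it is undefined."""
    tp, tn, fp, fn = confusion(labels, predictions)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


labels = [1, 0, 1, 0]
chance_like = [1, 0, 0, 1]  # right on half of each class
perfect_mcc = mcc(labels, labels)
chance_mcc = mcc(labels, chance_like)
chance_f1 = macro_f1(labels, chance_like)
```

Note that a classifier that is right on half of each class gets an MCC of 0 while still scoring 0.5 on macro-F1, which is why the two measures are read together.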


  • Results for dialogue state tracking, semantic parsing, and question answering are reported in Table 1.
  • Fine-grained results are reported in Appendix B.
  • Random baseline used for comparison.

Performance on extrinsic tasks

  • Metrics perform better than random baseline on macro-F1 metric
  • Metrics show negligible correlation under almost all settings
  • Neural metrics not better than surface overlap metrics
  • Decreasing order of performance: semantic parsing, question answering, dialogue state tracking
  • Reference-based versions better than reference-free for semantic parsing and question answering
  • Reference-free performs same or better than reference-based for dialogue state tracking
  • COMET-QE-DA and COMET-DA better than COMET-QE-MQM and COMET-MQM for question answering
  • No clear winner for dialogue state tracking and semantic parsing
  • No specific language pairs stand out as easier or harder across tasks
  • Poor performance across all languages

Case study

  • Semantic parsing task studied with COMET-DA
  • Macro-F1 score of 0.59 and MCC value of 0.187
  • Threshold of -0.028, which means translations classified as good can still receive negative scores
  • Metric makes mistakes throughout the bins
  • 54% of wrong predictions due to inability to detect mistranslation
  • 20% of errors due to task-specific model

Finding the threshold

  • Automatic metrics require additional context to interpret system-level scores.
  • Finding the right threshold to identify if a translation needs correction is not straightforward.
  • Thresholds of metrics are not consistent and vary greatly across language pairs.
  • Some language pairs have negative thresholds.
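
One way to pick such a threshold is to sweep candidate values on held-out data and keep the one with the best macro-F1. This is a sketch under that assumption; the selection criterion, the function names, and the toy development scores are illustrative and not taken from the paper.

```python
def macro_f1(labels, predictions):
    """Unweighted mean of per-class F1 for a binary task."""
    total = 0.0
    for cls in (0, 1):
        pairs = list(zip(labels, predictions))
        tp = sum(1 for y, p in pairs if y == cls and p == cls)
        fp = sum(1 for y, p in pairs if y != cls and p == cls)
        fn = sum(1 for y, p in pairs if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        total += 2 * tp / denom if denom else 0.0
    return total / 2


def best_threshold(scores, labels):
    """Try every observed score as a threshold and return the
    (threshold, macro_f1) pair with the highest macro-F1."""
    best = (None, -1.0)
    for t in sorted(set(scores)):
        predictions = [1 if s >= t else 0 for s in scores]
        score = macro_f1(labels, predictions)
        if score > best[1]:
            best = (t, score)
    return best


# Toy development data for one hypothetical language pair.
dev_scores = [-0.40, -0.10, -0.02, 0.05, 0.30]
dev_labels = [0, 0, 1, 1, 1]
threshold, dev_f1 = best_threshold(dev_scores, dev_labels)
```

On this toy data the selected threshold is negative, which mirrors the observation above: a threshold tuned for one language pair says little about where to cut for another.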

Reference-based metrics in an online setting

  • Reference-based methods are tested in an online setting without access to references.
  • A round-trip translation task is used to evaluate the reference-based methods.
  • Most metrics improve their performance when using the new reference, except when the target language is Russian.


  • Prefer MQM for Human Evaluation of MT outputs
  • MT Metrics could produce Labels over Scores
  • Use a Combination of MT metrics for the End Tasks
  • Add Diverse References during Training


  • Proposed method for evaluating MT metrics is reliable and does not depend on human judgements
  • Evaluated nine different metrics on ability to detect errors in generated translations
  • Used three extrinsic tasks - dialogue state tracking, question answering, and semantic parsing
  • Segment-level scores produced by all metrics show negligible correlation with success/failure outcomes of end task
  • Recommended predicting error types instead of error scores
  • Evaluated 37 language pairs, mostly high-resource languages
  • Results provide conclusive trends to metric developers
  • Results can be extended to other datasets, tasks, and languages
  • Future work should incorporate partially correct outputs