Abstract

  • GEMBA is a GPT-based metric for assessing translation quality
  • It works with and without a reference translation
  • Four prompt variants were compared in two modes
  • Seven versions of GPT models were investigated, including ChatGPT
  • GPT 3.5 and larger models are needed for the method to work
  • Results from WMT22’s Metrics shared task show state-of-the-art accuracy
  • Results are valid on the system level for three language pairs: English into German, English into Russian, and Chinese into English
  • Code and prompt templates used for experiments are publicly released

Paper Content

Introduction

  • LLMs can be used for multilingual Q&A
  • LLMs can be used to translate text between languages
  • LLMs can differentiate good from bad translations
  • GPT can be used for automated assessment of translation quality
  • GPT can be used for system-level evaluation of translation quality

Prompt variants

  • Four distinct prompt types are experimented with: two scoring tasks and two classification tasks
  • Two scoring tasks: one based on direct assessment, one based on scalar quality metrics
  • Two classification tasks: one based on one-to-five stars ranking, one based on five discrete quality classes
  • Two modes for each prompt type: one with access to a human reference, one without
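
The combination of four prompt types and two modes can be pictured as a small template builder. The following is a minimal sketch only: the task wording, the name `build_prompt`, the answer cues, and the argument names are illustrative assumptions and do not reproduce the released templates.

```python
from typing import Optional

# Illustrative task descriptions (assumed wording, not the released templates).
TASKS = {
    "DA": "Score the following translation from {src} to {tgt}{ref} "
          "on a continuous scale from 0 to 100.",
    "SQM": "Score the following translation from {src} to {tgt}{ref} "
           "on a continuous quality scale from 0 to 100.",
    "stars": "Rate the following translation from {src} to {tgt}{ref} "
             "with one to five stars.",
    "classes": "Classify the quality of the following translation "
               "from {src} to {tgt}{ref} into one of five quality classes.",
}

# Hypothetical cue that ends the prompt and invites the answer.
ANSWER_CUE = {"DA": "Score:", "SQM": "Score:", "stars": "Stars:", "classes": "Class:"}


def build_prompt(task: str, src_lang: str, tgt_lang: str, source: str,
                 hypothesis: str, reference: Optional[str] = None) -> str:
    """Build a prompt; passing no reference gives the quality estimation mode."""
    ref_clause = " with respect to the human reference" if reference is not None else ""
    lines = [TASKS[task].format(src=src_lang, tgt=tgt_lang, ref=ref_clause), ""]
    lines.append(f'{src_lang} source: "{source}"')
    if reference is not None:
        lines.append(f'{tgt_lang} human reference: "{reference}"')
    lines.append(f'{tgt_lang} translation: "{hypothesis}"')
    lines.append(ANSWER_CUE[task])
    return "\n".join(lines)
```

Calling `build_prompt("DA", "English", "German", "Hello.", "Hallo.")` yields the reference-free variant, and adding `reference="Hallo."` yields the reference-based one.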

Scoring process

  • GEMBA-DA and GEMBA-SQM output scores in the range 0-100, GEMBA-stars outputs one to five stars, and GEMBA-classes assigns one of five quality class labels.
  • System-level scores are calculated by averaging segment-level scores.
  • If an invalid answer is returned, randomness is added and more responses are sampled until a valid answer is found.
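
The scoring loop described above could look roughly like the sketch below. It rests on several assumptions: `query_model` is a hypothetical callable standing in for whatever API returns the model's text, the temperature schedule and the number of samples per retry are made up for illustration, and segments that never produce a valid answer are simply left out of the average.

```python
import re
from statistics import mean
from typing import Callable, Iterable, Optional


def parse_score(answer: str, low: float, high: float) -> Optional[float]:
    """Return the first number in the answer if it lies in [low, high], else None."""
    match = re.search(r"-?\d+(?:\.\d+)?", answer)
    if match is None:
        return None
    value = float(match.group())
    return value if low <= value <= high else None


def score_segment(prompt: str,
                  query_model: Callable[[str, float], str],
                  low: float = 0, high: float = 100,
                  temperatures: Iterable[float] = (0.0, 1.0, 2.0),
                  samples_per_retry: int = 5) -> Optional[float]:
    """Query once deterministically, then sample with more randomness on failure."""
    for temperature in temperatures:
        # Sampling repeatedly at temperature 0 would return the same answer.
        attempts = 1 if temperature == 0.0 else samples_per_retry
        for _ in range(attempts):
            score = parse_score(query_model(prompt, temperature), low, high)
            if score is not None:
                return score  # first answer inside the expected range wins
    return None  # no valid answer found under this (assumed) schedule


def system_score(segment_scores: Iterable[Optional[float]]) -> float:
    """Average the valid segment-level scores (assumption: invalid ones are skipped)."""
    valid = [s for s in segment_scores if s is not None]
    return mean(valid)  # raises StatisticsError if no segment was scored
```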

GPT models

  • Seven GPT models are used in the experiment
  • The Davinci-003 model is used as the default
  • The models range from GPT 2 through GPT 3 and GPT 3.5 variants to ChatGPT
  • The GEMBA-DA prompt asks the model to score a translation from the source language into the target language on a continuous scale from 0 to 100

Experiments

  • Measured performance of GEMBA metric using test data from WMT22 Metrics shared task
  • Compared GEMBA against best-performing automatic metrics: COMET, BLEURT, and MetricX XXL

Test set

  • MQM 2022 test set contains human judgments for 3 translation directions
  • Test set contains 54 machine translation system outputs or human translations
  • Test set contains 106k segments
  • Translation systems mainly from participants of WMT22 General MT shared task
  • Source segments and human reference translations for each language pair contain around 2,000 sentences from 4 different text domains
  • Gold standard for scoring translation quality based on human MQM ratings

Evaluation methods

  • Measured system-level, pairwise accuracy
  • Used Kendall’s Tau for segment-level evaluation
  • Accuracy is defined as the number of system pairs that the metric ranks in the same order as the human ranking, divided by the total number of system pairs
  • Initially used Kendall’s Tau-a, changed to Kendall’s Tau-b
  • Ties in automatic metrics are rare for non-identical translations
  • Kendall’s Tau is susceptible to noise in gold pairwise rankings
  • Reproduced scores reported in WMT22 Metrics shared task findings paper
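
System-level pairwise accuracy can be illustrated with a few lines of code. This is a sketch under stated assumptions: both inputs are dictionaries mapping a system name to a score where higher means better, the function name `pairwise_accuracy` is hypothetical, and a pair with a zero difference on either side is counted as incorrect, which may differ from the shared task's handling.

```python
from itertools import combinations
from typing import Dict


def pairwise_accuracy(metric_scores: Dict[str, float],
                      human_scores: Dict[str, float]) -> float:
    """Share of system pairs that the metric orders the same way as humans do."""
    systems = sorted(metric_scores)
    correct = 0
    total = 0
    for a, b in combinations(systems, 2):
        metric_delta = metric_scores[a] - metric_scores[b]
        human_delta = human_scores[a] - human_scores[b]
        total += 1
        # The pair is ranked correctly when both differences share a sign.
        if metric_delta * human_delta > 0:
            correct += 1
    return correct / total


# Toy data, invented for illustration only.
human = {"A": 0.9, "B": 0.5, "C": 0.1}
metric = {"A": 80.0, "B": 85.0, "C": 60.0}
print(pairwise_accuracy(metric, human))  # the metric swaps A and B, so 2 of 3 pairs agree
```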

Results

  • Investigated GEMBA’s performance in two modes
  • Compared GEMBA-DA against best-performing metrics from WMT22 Metrics shared task

Reference-based

  • GEMBA-Dav3-DA metric sets a new state of the art
  • Outperforms all other reference-based metrics from WMT22 Metrics shared task
  • Human labels used as gold standard are noisy, accuracy of 100% impossible to obtain

Quality estimation

  • GEMBA-Dav3-DA[noref] achieves highest performance for quality estimation mode
  • GEMBA-Dav3-DA[noref] outperforms all other referenceless metrics and all other reference-based metrics, performing only slightly worse than the reference-based GEMBA-Dav3-DA
  • Evaluation based on three language pairs and MQM human labels shows unexpectedly high level of assessment quality

Comparison of GPT models

  • GPT versions compared as automatic metric
  • Babbage and Curie models produce close to random guessing
  • Performance jump for GPT 3.5 and larger models
  • ChatGPT has the lowest quality among the three GPT 3.5 and larger models
  • Davinci-003 has best performance

Segment-level performance

  • Previous results are reported on the system level
  • Investigated how GEMBA metric performs on the segment level
  • GEMBA-Dav3-DA slightly behind top-performing metrics but still has high correlation with human judgment
  • Quality estimation GEMBA-Dav3-DA has lower segment-level performance
  • GEMBA-Dav3-DA outperforms string-based metrics
  • The lower segment-level correlation could be attributed to Kendall’s Tau, which penalizes ties
  • GEMBA-Dav3-DA returns a discrete value between 0 and 100, so two translations often receive the same score
  • 79.9% of all scores are of value “95”
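
The effect of ties on Kendall's Tau can be seen in a toy example. The sketch below uses the textbook Tau-a and Tau-b formulas rather than the shared task's exact implementation, and the scores are invented for illustration. A metric that never ranks a pair wrongly still loses correlation once many of its scores collapse onto the same value.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence, Tuple


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (tau_a, tau_b) for two equally long score lists."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    tau_a = (concordant - discordant) / n_pairs
    # Tau-b shrinks the denominator by the tied pairs on each side.
    denominator = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    tau_b = (concordant - discordant) / denominator if denominator else float("nan")
    return tau_a, tau_b


human = [1, 2, 3, 4]                  # invented gold scores
continuous = [0.11, 0.25, 0.31, 0.48]  # float-valued metric, no ties
discrete = [90, 95, 95, 95]           # discrete metric, three tied pairs

print(kendall_tau(human, continuous))  # perfect agreement on both variants
print(kendall_tau(human, discrete))    # no discordant pair, yet both values drop below 1
```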

Failure rate

  • LLMs may answer with an invalid answer
  • When an answer is invalid, the temperature is increased and the first answer that matches the expected score range is taken
  • Table 6 shows the number of invalid answers
  • LLMs generally understand the prompt and provide valid answers; less than 1% of answers are invalid
  • Answers are parsed separately for GEMBA-stars prompts
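
A separate parser for star ratings, together with an invalid-answer rate, might look like the following sketch. The accepted answer forms (star symbols, a digit, or a number word, each optionally followed by "star" or "stars") are assumptions made for illustration, as are the names `parse_stars` and `failure_rate`; the summary does not specify which forms the models actually produce.

```python
import re
from typing import Callable, Iterable, Optional

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def parse_stars(answer: str) -> Optional[int]:
    """Map a free-form star answer to 1..5, or None if it is not recognised."""
    text = answer.strip().lower()
    # Form 1: a run of star symbols such as "***".
    if re.fullmatch(r"[*★]+", text):
        return len(text) if 1 <= len(text) <= 5 else None
    # Form 2: a single digit, e.g. "4" or "4 stars".
    digit = re.fullmatch(r"([1-5])(?:\s*stars?)?\.?", text)
    if digit:
        return int(digit.group(1))
    # Form 3: a number word, e.g. "five stars."
    word = re.fullmatch(r"(one|two|three|four|five)(?:\s*stars?)?\.?", text)
    if word:
        return NUMBER_WORDS[word.group(1)]
    return None


def failure_rate(answers: Iterable[str],
                 parser: Callable[[str], Optional[int]]) -> float:
    """Fraction of answers that the parser rejects as invalid."""
    answers = list(answers)
    invalid = sum(1 for a in answers if parser(a) is None)
    return invalid / len(answers)
```

For example, `failure_rate(["***", "4 stars", "excellent"], parse_stars)` counts one invalid answer out of three.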

Conclusion

  • GEMBA (GPT Estimation Metric Based Assessment) is a GPT-based method for assessing translation quality
  • GEMBA achieved state-of-the-art performance on the MQM 2022 test set
  • Future research will focus on few-shot prompting and model fine-tuning
  • GPT-enhanced evaluation metrics may allow for document-level evaluation
  • GEMBA performance may suffer for under-resourced languages