Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

GEMBA is a GPT-based metric for assessing translation quality
It works with and without a reference translation
Four prompt variants were compared in two modes
Seven versions of GPT models were investigated, including ChatGPT
GPT 3.5 and larger models are needed for the method to work
Results from WMT22’s Metrics shared task show state-of-the-art accuracy
Results are valid for three language pairs
Code and prompt templates used for experiments are publicly released

Paper Content

Introduction

LLMs can be used for multilingual Q&A
LLMs can be used to translate text between languages
LLMs can differentiate good from bad translations
GPT can be used for automated assessment of translation quality
GPT can be used for system-level evaluation of translation quality

Prompt variants

Four distinct prompt types are experimented with: two scoring tasks and two classification tasks
Two scoring tasks: one based on direct assessment, one based on scalar quality metrics
Two classification tasks: one based on one-to-five stars ranking, one based on five discrete quality classes
Two modes for each prompt type: one with access to a human reference, one without
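
The two modes can be illustrated with a small prompt builder. This is a minimal sketch: the template wording, the function name build_prompt, and its parameters are assumptions made for illustration and are not the released templates.

```python
from typing import Optional

# Hypothetical instruction wording; the publicly released templates may differ.
SCORING_INSTRUCTION = (
    "Score the following translation from {src_lang} to {tgt_lang}{ref_clause} "
    "on a continuous scale from 0 to 100."
)


def build_prompt(src_lang: str, tgt_lang: str, source: str,
                 hypothesis: str, reference: Optional[str] = None) -> str:
    """Build a scoring prompt, with or without a human reference."""
    # The reference clause and the reference line appear only in reference-based mode.
    ref_clause = " with respect to the human reference" if reference is not None else ""
    lines = [
        SCORING_INSTRUCTION.format(
            src_lang=src_lang, tgt_lang=tgt_lang, ref_clause=ref_clause
        ),
        "",
        f'{src_lang} source: "{source}"',
    ]
    if reference is not None:
        lines.append(f'{tgt_lang} human reference: "{reference}"')
    lines.append(f'{tgt_lang} translation: "{hypothesis}"')
    lines.append("Score:")
    return "\n".join(lines)
```

Calling build_prompt without the reference argument yields the reference-free (quality estimation) variant of the same prompt.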

Scoring process

GEMBA-DA and GEMBA-SQM output scores in the range 0-100, GEMBA-stars outputs one to five stars, and GEMBA-classes outputs one of five discrete quality classes.
System-level scores are calculated by averaging segment-level scores.
If an invalid answer is returned, randomness is added and more responses are sampled until a valid answer is found.
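
Answer validation and system-level averaging might look like the sketch below. The helper names parse_score and system_score are hypothetical, and taking the first number found in the answer is an assumed parsing rule, not necessarily the one used in the released code.

```python
import re
from statistics import mean
from typing import Iterable, Optional


def parse_score(answer: str, low: float = 0, high: float = 100) -> Optional[float]:
    """Return the first number in the answer if it lies in [low, high], else None."""
    match = re.search(r"-?\d+(?:\.\d+)?", answer)
    if match is None:
        return None  # no number at all: invalid answer
    value = float(match.group())
    # Numbers outside the expected range are also treated as invalid.
    return value if low <= value <= high else None


def system_score(segment_scores: Iterable[float]) -> float:
    """System-level score as the plain average of segment-level scores."""
    return mean(segment_scores)
```

An answer for which parse_score returns None would be the trigger for sampling another response.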

GPT models

Seven GPT models are used in the experiment
The Davinci-003 model is used as the default
The models range from GPT 2 through GPT 3 variants to ChatGPT
The default prompt asks the model to score a translation from one language to another on a scale from 0 to 100

Experiments

Measured performance of GEMBA metric using test data from WMT22 Metrics shared task
Compared GEMBA against best-performing automatic metrics: COMET, BLEURT, and MetricX XXL

Test set

MQM 2022 test set contains human judgments for 3 translation directions
Test set contains 54 machine translation system outputs or human translations
Test set contains 106k segments
Translation systems mainly from participants of WMT22 General MT shared task
Source segments and human reference translations for each language pair contain around 2,000 sentences from 4 different text domains
Gold standard for scoring translation quality based on human MQM ratings

Evaluation methods

Measured system-level, pairwise accuracy
Used Kendall’s Tau for segment-level evaluation
Accuracy is defined as the number of system pairs ranked correctly by the metric, relative to the human ranking, divided by the total number of system pairs compared
Initially used Kendall’s Tau-a, changed to Kendall’s Tau-b
Ties in automatic metrics are rare for non-identical translations
Kendall’s Tau is susceptible to noise in gold pairwise rankings
Reproduced scores reported in WMT22 Metrics shared task findings paper
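
System-level pairwise accuracy can be sketched as follows. The function name and the dictionary-based interface are assumptions, and counting tied pairs as incorrect is a simplification chosen for this example.

```python
from itertools import combinations
from typing import Dict


def pairwise_accuracy(metric: Dict[str, float], human: Dict[str, float]) -> float:
    """Fraction of system pairs whose score difference has the same sign
    under the metric as under the human scores (needs at least two systems)."""
    pairs = list(combinations(sorted(human), 2))
    agree = 0
    for a, b in pairs:
        metric_delta = metric[a] - metric[b]
        human_delta = human[a] - human[b]
        # A pair counts as correct when both deltas point the same way.
        # Assumption: a tie on either side counts as incorrect.
        if metric_delta * human_delta > 0:
            agree += 1
    return agree / len(pairs)
```

With three systems there are three pairs, so a metric that swaps the order of only one pair reaches an accuracy of two thirds.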

Results

Investigated GEMBA’s performance in two modes
Compared GEMBA-DA against best-performing metrics from WMT22 Metrics shared task

Reference-based

GEMBA-Dav3-DA metric sets a new state of the art
Outperforms all other reference-based metrics from WMT22 Metrics shared task
Because the human labels used as the gold standard are noisy, an accuracy of 100% is impossible to obtain

Quality estimation

GEMBA-Dav3-DA[noref] achieves highest performance for quality estimation mode
GEMBA-Dav3-DA[noref] outperforms all other referenceless and reference-based metrics
Evaluation based on three language pairs and MQM human labels shows unexpectedly high level of assessment quality

Comparison of GPT models

GPT versions compared as automatic metric
Babbage and Curie models produce close to random guessing
Performance jumps for GPT 3.5 and larger models
Among the three models at the GPT 3.5 level and above, ChatGPT has the lowest quality
Davinci-003 has best performance

Segment-level performance

Previous results are reported on the system level
Investigated how GEMBA metric performs on the segment level
GEMBA-Dav3-DA slightly behind top-performing metrics but still has high correlation with human judgment
Quality estimation GEMBA-Dav3-DA has lower segment-level performance
GEMBA-Dav3-DA outperforms string-based metrics
Lower segment-level correlation could be attributed to Kendall’s Tau, which penalizes ties
GEMBA-Dav3-DA returns a discrete value between 0 and 100, so ties between segments are likely
79.9% of all scores are of value “95”
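
The effect of tied scores on Kendall’s Tau can be seen in a small pure-Python sketch of Tau-b. It is written from the standard textbook definition and is not the shared task's evaluation code; the function name is an assumption.

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's Tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty))."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: excluded from every term
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x_only)
                      * (concordant + discordant + ties_y_only))
    if denom == 0:
        return float("nan")  # undefined when one ranking is constant
    return (concordant - discordant) / denom
```

For human scores [1, 2, 3, 4], a metric that returns [95, 95, 95, 100] has no discordant pairs, yet its Tau-b is lower than that of a metric returning [10, 20, 30, 40], because the three tied pairs enlarge the denominator.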

Failure rate

LLMs may answer with an invalid answer
When an answer is invalid, the temperature is increased and the model is queried again; the first answer that falls within the expected score range is taken
Table 6 shows the number of invalid answers
LLMs understand the prompt and provide valid answers with less than 1% of answers being invalid
Answers are parsed separately for GEMBA-stars prompts
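
The retry procedure can be sketched as a loop over increasing temperatures. The callables query_model and parse are hypothetical stand-ins supplied by the caller, and the temperature schedule is an assumed value, not one reported in this summary.

```python
from typing import Callable, Optional, Sequence


def first_valid_score(
    prompt: str,
    query_model: Callable[[str, float], str],   # hypothetical LLM wrapper
    parse: Callable[[str], Optional[float]],    # returns None for invalid answers
    temperatures: Sequence[float] = (0.0, 1.0, 2.0),  # assumed schedule
) -> Optional[float]:
    """Query at increasing temperature and keep the first valid score."""
    for temperature in temperatures:
        score = parse(query_model(prompt, temperature))
        if score is not None:
            return score
    return None  # every attempt produced an invalid answer
```

Returning None when every attempt fails lets the caller count such segments as invalid answers rather than silently assigning them a score.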

Conclusion

GEMBA (GPT Estimation Metric Based Assessment) is a GPT-based method for assessing translation quality
GEMBA achieved state-of-the-art performance on the MQM 2022 test set
Research will focus on few-shot methodology and model fine-tuning
GPT-enhanced evaluation metrics may allow for document-level evaluation
GEMBA performance may suffer for under-resourced languages
