Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- GEMBA is a GPT-based metric for assessing translation quality
- It works with and without a reference translation
- Four prompt variants were compared in two modes
- Seven versions of GPT models were investigated, including ChatGPT
- GPT 3.5 and larger models are needed for the method to work
- Results from WMT22’s Metrics shared task show state-of-the-art accuracy
- Results are valid for three language pairs
- Code and prompt templates used for experiments are publicly released
Paper Content
Introduction
- LLMs can be used for multilingual Q&A
- LLMs can be used to translate text between languages
- LLMs can differentiate good from bad translations
- GPT can be used for automated assessment of translation quality
- GPT can be used for system-level evaluation of translation quality
Prompt variants
- Four distinct prompt types are experimented with: two scoring tasks and two classification tasks
- Two scoring tasks: one based on direct assessment, one based on scalar quality metrics
- Two classification tasks: one based on one-to-five stars ranking, one based on five discrete quality classes
- Two modes for each prompt type: one with access to a human reference, one without
Scoring process
- GEMBA-DA, GEMBA-SQM, and GEMBAstars output scores in the range of 0-100, 1-5, and 0-4 respectively.
- System-level scores are calculated by averaging segment-level scores.
- If an invalid answer is returned, randomness is added and more responses are sampled until a valid answer is found.
Gpt models
- Seven GPT models are used in the experiment
- The Davinci-003 model is used as the default
- GPT 2, ChatGPT, and GPT 3 models are used
- A score is given to a translation from one language to another on a scale from 0 to 100
Experiments
- Measured performance of GEMBA metric using test data from WMT22 Metrics shared task
- Compared GEMBA against best-performing automatic metrics: COMET, BLEURT, and MetricX XXL
Test set
- MQM 2022 test set contains human judgments for 3 translation directions
- Test set contains 54 machine translation system outputs or human translations
- Test set contains 106k segments
- Translation systems mainly from participants of WMT22 General MT shared task
- Source segments and human reference translations for each language pair contain around 2,000 sentences from 4 different text domains
- Gold standard for scoring translation quality based on human MQM ratings
Evaluation methods
- Measured system-level, pairwise accuracy
- Used Kendall’s Tau for segment-level evaluation
- Accuracy is defined as number of system pairs ranked correctly by metric compared to human ranking
- Initially used Kendall’s Tau-a, changed to Kendall’s Tau-b
- Ties in automatic metrics are rare for non-identical translations
- Kendall’s Tau is susceptible to noise in gold pairwise rankings
- Reproduced scores reported in WMT22 Metrics shared task findings paper
Results
- Investigated GEMBA’s performance in two modes
- Compared GEMBA-DA against best-performing metrics from WMT22 Metrics shared task
Reference-based
- GEMBA-Dav3-DA metric sets a new state of the art
- Outperforms all other reference-based metrics from WMT22 Metrics shared task
- Human labels used as gold standard are noisy, accuracy of 100% impossible to obtain
Quality estimation
- GEMBA-Dav3-DA[noref] achieves highest performance for quality estimation mode
- GEMBA-Dav3-DA[noref] outperforms all other referenceless and reference-based metrics
- Evaluation based on three language pairs and MQM human labels shows unexpectedly high level of assessment quality
Comparison of gpt models
- GPT versions compared as automatic metric
- Babbage and Curie models produce close to random guessing
- Performance jump for GPT 3.5 and larger models
- ChatGPT has lowest quality among these three models
- Davinci-003 has best performance
Segment-level performance
- Previous results are reported on the system level
- Investigated how GEMBA metric performs on the segment level
- GEMBA-Dav3-DA slightly behind top-performing metrics but still has high correlation with human judgment
- Quality estimation GEMBA-Dav3-DA has lower segment-level performance
- GEMBA-Dav3-DA outperforms string-based metrics
- Lower performance of segment-level correlation could be attributed to Kendall’s Tau
- GEMBA-Dav3-DA returns discrete value between 0-100
- 79.9% of all scores are of value “95”
Failure rate
- LLMs may answer with an invalid answer
- Temperature is increased to take the first answer matching the expected score output range
- Table 6 shows the number of invalid answers
- LLMs understand the prompt and provide valid answers with less than 1% of answers being invalid
- Answers are parsed separately for GEMBA-stars prompts
Conclusion
- GEMBA is a GPT-based estimation metric-based assessment method
- GEMBA achieved state-of-the-art performance on the MQM 2022 test set
- Research will focus on few-shot methodology and model fine-tuning
- GPT-enhanced evaluation metrics may allow for document-level evaluation
- GEMBA performance may suffer for under-resourced languages