Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Machine translation quality estimation (QE) predicts human judgements of a translation hypothesis without seeing the reference.
State-of-the-art QE systems based on pretrained language models have been achieving remarkable correlations with human judgements.
Limitations of these systems include being computationally heavy and requiring human annotations.
Metric estimation (ME) predicts automated metric scores without the reference.
An ME model can estimate automated metrics at the sentence level.
Automated metrics correlate with human judgements, allowing for pre-training of a QE model.
Pre-training on TER is better than training from scratch for the QE task.

Paper Content

Introduction

Quality estimation (QE) is used in machine translation production pipelines.
QE is used to decide whether to send an MT output for post-editing or use it directly.
Human-annotated judgements of translation quality are scarce and costly.
Automated metrics can be used to reduce the cost of learning QE models.
Automated metrics can generate large amounts of QE data.
Automated metrics can be used to partially substitute for human data during training.
Research questions investigate if automated metrics can be reliably predicted without the reference.
Experiments are conducted on the English → German language direction.
BLEU is predictable with a sentence-level Pearson's correlation of 60.4%.
Authentic parallel data is needed for ME models.
An ME system can be trained on the output of one MT system and applied to a different MT system.
Pre-training on Translation Edit Rate (TER) leads to better results than training on the QE data directly.

Our metric estimation model

Two ME models are built, one glassbox and one blackbox
Glassbox model uses hand-crafted features and features from the MT system
Blackbox model only uses access to the MT system output
Both models are optimized with mean-squared error against a particular metric
Features used include decoder confidence, source and target lengths, average distance and variance between hypotheses
Baselines include linear regression on TF-IDF features, linear regression on all text and MT features, and fine-tuned mBERT
Automated metrics used include BLEU, ChrF, TER, METEOR, and COMET
QE is done at the sentence level
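
The feature list and the mean-squared-error objective above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the use of mean token log-probability as "decoder confidence", and the Jaccard token distance between hypotheses are all assumptions made for illustration.

```python
from statistics import mean, pvariance


def jaccard_distance(a, b):
    """Token-set distance in [0, 1]; an illustrative choice, not the paper's."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def glassbox_features(source, hypothesis, token_logprobs, alternatives):
    """Hypothetical hand-crafted features for one translated sentence.

    token_logprobs: decoder log-probabilities of the hypothesis tokens.
    alternatives: other hypotheses from the same MT system (e.g. an n-best list).
    """
    distances = [jaccard_distance(hypothesis, alt) for alt in alternatives]
    return {
        "src_len": len(source.split()),
        "hyp_len": len(hypothesis.split()),
        # Assumed definition of decoder confidence: mean token log-probability.
        "confidence": mean(token_logprobs) if token_logprobs else 0.0,
        "avg_dist": mean(distances) if distances else 0.0,
        "var_dist": pvariance(distances) if distances else 0.0,
    }


def mse(predictions, targets):
    """Mean squared error between predicted and true metric scores."""
    return mean([(p - t) ** 2 for p, t in zip(predictions, targets)])
```

A blackbox variant would drop `token_logprobs` and `alternatives` and rely only on the source and hypothesis texts.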

Experiment setup

Translated 500k English→German sentences of the WMT14 dataset
Used pre-trained WMT19 model by Ng et al.
Used 14k human-direct-assessment annotated segments for human scores
Evaluated model performance with Pearson’s coefficient on a dev set of 10k sentences
Evaluated correlations with human judgement on 1k WMT21 Sentence-Level QE data
Estimated human z-scores per-annotator
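
Per-annotator z-scoring can be sketched as follows. The record layout and the use of the population standard deviation are assumptions; the summary does not state which variant was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_annotator_z_scores(records):
    """Standardize raw scores separately for each annotator.

    records: list of (annotator_id, segment_id, raw_score) tuples
             (a hypothetical layout chosen for this sketch).
    Returns the same tuples with raw_score replaced by its z-score.
    """
    scores_by_annotator = defaultdict(list)
    for annotator, _segment, score in records:
        scores_by_annotator[annotator].append(score)

    # Mean and (population) standard deviation per annotator.
    stats = {
        annotator: (mean(scores), pstdev(scores))
        for annotator, scores in scores_by_annotator.items()
    }

    normalized = []
    for annotator, segment, score in records:
        mu, sigma = stats[annotator]
        # An annotator who always gives the same score carries no signal.
        z = (score - mu) / sigma if sigma > 0 else 0.0
        normalized.append((annotator, segment, z))
    return normalized
```

Standardizing per annotator removes differences in how strictly or leniently individual annotators use the scoring scale.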

Results

Studies single-feature baselines
Studies possibility of robust ME model and data size requirements
Checks model on different MT system output
Fine-tuning and evaluation on human data
Experiment on using joint prediction to improve ME model

Feature analysis

Confidence-based features correlate more with automatic metrics than other features.
Metrics and human z-scores are highly correlated with source and target lengths.
Few hypothesis space metrics correlate highly.
Low individual correlations, but may still be useful in combination or for full model.
Automated metrics and humans have low correlations.
BLEURT and COMET have highest correlation.

Metric estimation performance

ME text has access to source and hypothesis texts while ME all has access to extra hand-crafted features
Simple linear regression based on features from Figure 3 achieves > 40% correlations with automated metrics and ∼13% correlation with human judgement
ME all outperforms ME text, possibly because it has access to extra features
Pre-training of mBERT model on language modelling helps only marginally

Data requirements for ME

ME models require different amounts of data depending on the type of input
Linear regression gains little from using 500x more data
Main ME model requires larger amount of data
Hypothesis expansion can reduce data requirements in low-resource scenarios
ME model requires 500k parallel sentences before performance plateaus

Generalization of ME across MT systems

Examined whether ME model overfits on specific errors of MT system or generalizes to other MT systems
Translated same data using different English → German models
Evaluation of translations by different MT systems showed varying decrease in correlation with automated metrics
Most systems showed a drop of ∼2-3%, suggesting that the ME model generalizes well
Exception was prompted T5 LM, for which transfer mostly failed

From ME to QE

Pre-training on ME helps on the QE task
Fine-tuning improved performance over zero-shot
Only TER was able to outperform training on z-scores from scratch
Pre-training & fine-tuning regime does not perform well with limited target-domain data
Very little human-annotated data is needed for training
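
Because TER is the pre-training target singled out above, the following sketch shows how reference-based targets for ME pre-training might be produced from parallel data. It is a simplification under stated assumptions: the score is a plain word-level edit distance divided by reference length, whereas real TER also allows block shifts; `build_me_examples` and the `translate` callable are hypothetical names, not part of the paper or of any library.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, hyp_word in enumerate(hyp_tokens, 1):
        cur = [i]
        for j, ref_word in enumerate(ref_tokens, 1):
            cur.append(min(
                prev[j] + 1,                           # delete hyp_word
                cur[j - 1] + 1,                        # insert ref_word
                prev[j - 1] + (hyp_word != ref_word),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def simplified_ter(hypothesis, reference):
    """TER-like score without shifts: edits / reference length (lower is better)."""
    hyp_tokens, ref_tokens = hypothesis.split(), reference.split()
    if not ref_tokens:
        return 0.0 if not hyp_tokens else 1.0
    return word_edit_distance(hyp_tokens, ref_tokens) / len(ref_tokens)


def build_me_examples(parallel_corpus, translate):
    """Turn (source, reference) pairs into (source, hypothesis, target) triples.

    translate: any callable mapping a source sentence to an MT hypothesis
               (hypothetical stand-in for a real MT system).
    The reference is used only to compute the target, never as model input.
    """
    examples = []
    for source, reference in parallel_corpus:
        hypothesis = translate(source)
        examples.append((source, hypothesis, simplified_ter(hypothesis, reference)))
    return examples
```

A model pre-trained on such triples sees only the source and hypothesis at prediction time, which is what allows it to be fine-tuned afterwards on human z-scores.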

Joint prediction of multiple metrics

Investigated using all automated metrics in a single model instead of multiple individual models
Trained two models to predict all available metrics at once
Joint learning mostly helps in metric prediction but not in human z-score prediction

Complexity & fluency estimation

Model was dependent on source and hypothesis
Model can consider fluency and other factors with only hypothesis
Model can estimate sentence difficulty/complexity with only source
High correlations close to full text-only model’s performance
Model is not able to utilize relationship between source and hypothesis
Results are in line with general findings
Model’s inadequacy is shown by imperfect performance when given access to hypothesis and reference

Related work

The ME task is related to an older task of confidence estimation
Confidence estimation is a binary classification task based on two thresholded MT metrics
QE models use features such as source & target lengths, number of translations, etc.
Deeper QE models regress directly from source and hypothesis texts
Pre-training of QE models uses both artificial and authentic data
QE models have been applied to other NLP tasks
MT QE has a more constrained hypothesis space than other tasks

Discussion

Automated metrics can be predicted without access to the reference.
Pre-training on TER outperformed training from scratch.
Zero-shot correlation is large in absolute terms, but remains unexplained.
ME approach can be used for improving models outside of MT field.

Error analysis

Model predictions are generally more conservative than target scores
Model predictions are rescaled when evaluated with Pearson’s correlation coefficient
Model predictions are not always 100% accurate
Model predictions can be more accurate than the metric it was trained to estimate
Model predictions can be very far from human judgement
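
The point about rescaling can be checked numerically: Pearson's correlation is unchanged by any positive affine transformation of the predictions, so conservative (compressed) predictions are not penalized by it, although they are penalized by mean squared error. The numbers below are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mse(xs, ys):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Illustrative target metric scores and model predictions.
targets = [0.1, 0.4, 0.5, 0.9]
predictions = [0.15, 0.35, 0.55, 0.85]

# "Conservative" predictions: the same values pulled toward 0.5.
conservative = [0.5 + 0.3 * (p - 0.5) for p in predictions]

r_original = pearson(predictions, targets)
r_conservative = pearson(conservative, targets)
mse_original = mse(predictions, targets)
mse_conservative = mse(conservative, targets)
```

Here `r_original` and `r_conservative` agree up to floating-point error, while `mse_conservative` is larger than `mse_original`.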

Negative results

Attempted to leverage mBERT representations in the main LSTM-based model
Results were on par with the main LSTM-based model alone
Experimented with expressivity of the used model architecture
Model was unable to fit it perfectly
Future work should use models with larger capacities

Conclusion

Proposed task of metric estimation for machine translation
Attempted to solve it with a baseline BiLSTM model
Predicted metric without seeing reference
Data for training ME models can be generated from any parallel corpus
Pre-training on TER performed better than training from scratch
Features in hypothesis space should be explored for tasks beyond ME/QE
ME models should be evaluated for cross-domain performance
Non-perfect correlation with metrics shows importance of exploring more complex architectures
Model provides less explainability
Model required longer training than baselines
Variance in segment-level metrics/evaluations can be dealt with by ME models
Detailed error analysis should be performed before deploying QE system
Used SacreBLEU, multilingual BERT, T5-small, dynamicconv.glu.wmt16.en-de, conv.wmt17.en-de, transformer.wmt16.en-de, transformer.wmt18.en-de
Used sigmoid as final activation function
Optimization loss is mean squared error
Adam optimization
TF-IDF featurizer used in linear regression TF-IDF baseline
Classifier suffers from low accuracy
