Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Machine translation quality estimation (QE) predicts human judgements of a translation hypothesis without seeing the reference.
  • State-of-the-art QE systems based on pretrained language models have been achieving remarkable correlations with human judgements.
  • Limitations of these systems include being computationally heavy and requiring human annotations.
  • Metric estimation (ME) predicts automated metric scores without the reference.
  • ME model can estimate automated metrics at the sentence-level.
  • Automated metrics correlate with human judgements, allowing for pre-training of a QE model.
  • Pre-training on TER is better than training for scratch for the QE task.

Paper Content

Introduction

  • Quality estimation (QE) is used in machine translation production pipelines.
  • QE is used to decide whether to send an MT output for post-editing or use it directly.
  • Human-annotated judgements of translation quality are scarce and costly.
  • Automated metrics can be used to reduce the cost of learning QE models.
  • Automated metrics can generate large amounts of QE data.
  • Automated metrics can be used to partially substitute for human data during training.
  • Research questions investigate if automated metrics can be reliably predicted without the reference.
  • Experiments are conducted on English → German language direction.
  • BLEU is predictable with 60.4% sentence-level Pearson’s correlation.
  • Authentic parallel data is needed for ME models.
  • ME system can be trained on one MT system and used on a different MT system.
  • Pre-training on Translation Edit Rate (TER) leads to better results than training on the QE data directly.

Our metric estimation model

  • Two ME models are built, one glassbox and one blackbox
  • Glassbox model uses hand-crafted features and features from the MT system
  • Blackbox model only uses access to the MT system output
  • Both models are optimized with mean-squared error against a particular metric
  • Features used include decoder confidence, source and target lengths, average distance and variance between hypotheses
  • Baselines include linear regression on TF-IDF features, linear regression on all text and MT features, and fine-tuned mBERT
  • Automated metrics used include BLEU, ChrF, TER, METEOR, and COMET
  • QE is done at the sentence level

Experiment setup

  • Translated 500k English→German sentences of the WMT14 dataset
  • Used pre-trained WMT19 model by Ng et al.
  • Used 14k human-direct-assessment annotated segments for human scores
  • Evaluated model performance with Pearson’s coefficient on a dev set of 10k sentences
  • Evaluated correlations with human judgement on 1k WMT21 Sentence-Level QE data
  • Estimated human z-scores per-annotator

Results

  • Studies single-feature baselines
  • Studies possibility of robust ME model and data size requirements
  • Checks model on different MT system output
  • Fine-tuning and evaluation on human data
  • Experiment on using joint prediction to improve ME model

Feature analysis

  • Confidence-based features correlate more with automatic metrics than other features.
  • Metrics and human z-scores are highly correlated with source and target lengths.
  • Few hypothesis space metrics correlate highly.
  • Low individual correlations, but may still be useful in combination or for full model.
  • Automated metrics and humans have low correlations.
  • BLEURT and COMET have highest correlation.

Metric estimation performance

  • ME text has access to source and hypothesis texts while ME all has access to extra hand-crafted features
  • Simple linear regression based on features from Figure 3 achieves > 40% correlations with automated metrics and ∼13% correlation with human judgement
  • ME all outperforms ME text, possibly because it has access to extra features
  • Pre-training of mBERT model on language modelling helps only marginally

Data requirements for me

  • ME models require different amounts of data depending on the type of input
  • Linear regression gains little from using 500x more data
  • Main ME model requires larger amount of data
  • Hypothesis expansion can reduce data requirements in low-resource scenarios
  • ME model requires 500k parallel sentences before performance plateaus

Generalization of me across mt systems

  • Examined whether ME model overfits on specific errors of MT system or generalizes to other MT systems
  • Translated same data using different English → German models
  • Evaluation of translations by different MT systems showed varying decrease in correlation with automated metrics
  • Most systems had drop of ∼2-3%, meaning ME estimator generalizes well
  • Exception was prompted T5 LM, for which transfer mostly failed

From me to qe

  • Pre-training on ME helps on the QE task
  • Fine-tuning improved performance over zero-shot
  • Only TER was able to outperform training on z-scores from scratch
  • Pre-training & fine-tuning regime does not perform well with limited target-domain data
  • Very little human-annotated data is needed for training

Joint prediction of multiple metrics

  • Investigated using all automated metrics in a single model instead of multiple individual models
  • Trained two models to predict all available metrics at once
  • Joint learning mostly helps in metric prediction but not in human z-score prediction

Complexity & fluency estimation

  • Model was dependent on source and hypothesis
  • Model can consider fluency and other factors with only hypothesis
  • Model can estimate sentence difficulty/complexity with only source
  • High correlations close to full text-only model’s performance
  • Model is not able to utilize relationship between source and hypothesis
  • Results are in line with general findings
  • Model’s inadequacy is shown by imperfect performance when given access to hypothesis and reference
  • ME task is related to an older task of confidence estimation
  • Confidence estimation is a binary class based on two thresholded MT metrics
  • QE models use features such as source & target lengths, number of translations, etc.
  • Deeper QE models regress directly from source and hypothesis texts
  • Pre-training QE models use artificial and authentic data
  • QE models have been applied to other NLP tasks
  • MT QE has a more constrained hypothesis space than other tasks

Discussion

  • Automated metrics can be predicted without access to the reference.
  • Pre-training on TER outperformed training from scratch.
  • Large absolute zero-shot correlation, but no explanation.
  • ME approach can be used for improving models outside of MT field.

Error analysis

  • Model predictions are generally more conservative than target scores
  • Model predictions are rescaled when evaluated with Pearson’s correlation coefficient
  • Model predictions are not always 100% accurate
  • Model predictions can be more accurate than the metric it was trained to estimate
  • Model predictions can be very far from human judgement

Negative results

  • Attempted to leverage mBERT representations in the main LSTM-based model
  • Results were on par with the main LSTM-based model alone
  • Experimented with expressivity of the used model architecture
  • Model was unable to fit it perfectly
  • Future work should use models with larger capacities

Conclusion

  • Proposed task of metric estimation for machine translation
  • Attempted to solve it with a baseline BiLSTM model
  • Predicted metric without seeing reference
  • Data for training ME models can be generated from any parallel corpus
  • Pre-training on TER did not perform better than baseline
  • Features in hypothesis space should be explored for tasks beyond ME/QE
  • ME models should be evaluated for cross-domain performance
  • Non-perfect correlation with metrics shows importance of exploring more complex architectures
  • Model provides less explainability
  • Model required longer training than baselines
  • Variance in segment-level metrics/evaluations can be dealt with by ME models
  • Detailed error analysis should be performed before deploying QE system
  • Used SacreBLEU, multilingual BERT, T5-small, dynamicconv.glu.wmt16.en-de, conv.wmt17.en-de, transformer.wmt16.en-de, transformer.wmt18.en-de
  • Used sigmoid as final activation function
  • Optimization loss is mean squared error
  • Adam optimization
  • TF-IDF featurizer used in linear regression TF-IDF baseline
  • Classifier suffers from low accuracy