Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Machine translation quality estimation (QE) predicts human judgements of a translation hypothesis without seeing the reference.
- State-of-the-art QE systems based on pretrained language models have been achieving remarkable correlations with human judgements.
- Limitations of these systems include being computationally heavy and requiring human annotations.
- Metric estimation (ME) predicts automated metric scores without the reference.
- ME model can estimate automated metrics at the sentence-level.
- Automated metrics correlate with human judgements, allowing for pre-training of a QE model.
- Pre-training on TER is better than training for scratch for the QE task.
Paper Content
Introduction
- Quality estimation (QE) is used in machine translation production pipelines.
- QE is used to decide whether to send an MT output for post-editing or use it directly.
- Human-annotated judgements of translation quality are scarce and costly.
- Automated metrics can be used to reduce the cost of learning QE models.
- Automated metrics can generate large amounts of QE data.
- Automated metrics can be used to partially substitute for human data during training.
- Research questions investigate if automated metrics can be reliably predicted without the reference.
- Experiments are conducted on English → German language direction.
- BLEU is predictable with 60.4% sentence-level Pearson’s correlation.
- Authentic parallel data is needed for ME models.
- ME system can be trained on one MT system and used on a different MT system.
- Pre-training on Translation Edit Rate (TER) leads to better results than training on the QE data directly.
Our metric estimation model
- Two ME models are built, one glassbox and one blackbox
- Glassbox model uses hand-crafted features and features from the MT system
- Blackbox model only uses access to the MT system output
- Both models are optimized with mean-squared error against a particular metric
- Features used include decoder confidence, source and target lengths, average distance and variance between hypotheses
- Baselines include linear regression on TF-IDF features, linear regression on all text and MT features, and fine-tuned mBERT
- Automated metrics used include BLEU, ChrF, TER, METEOR, and COMET
- QE is done at the sentence level
Experiment setup
- Translated 500k English→German sentences of the WMT14 dataset
- Used pre-trained WMT19 model by Ng et al.
- Used 14k human-direct-assessment annotated segments for human scores
- Evaluated model performance with Pearson’s coefficient on a dev set of 10k sentences
- Evaluated correlations with human judgement on 1k WMT21 Sentence-Level QE data
- Estimated human z-scores per-annotator
Results
- Studies single-feature baselines
- Studies possibility of robust ME model and data size requirements
- Checks model on different MT system output
- Fine-tuning and evaluation on human data
- Experiment on using joint prediction to improve ME model
Feature analysis
- Confidence-based features correlate more with automatic metrics than other features.
- Metrics and human z-scores are highly correlated with source and target lengths.
- Few hypothesis space metrics correlate highly.
- Low individual correlations, but may still be useful in combination or for full model.
- Automated metrics and humans have low correlations.
- BLEURT and COMET have highest correlation.
Metric estimation performance
- ME text has access to source and hypothesis texts while ME all has access to extra hand-crafted features
- Simple linear regression based on features from Figure 3 achieves > 40% correlations with automated metrics and ∼13% correlation with human judgement
- ME all outperforms ME text, possibly because it has access to extra features
- Pre-training of mBERT model on language modelling helps only marginally
Data requirements for me
- ME models require different amounts of data depending on the type of input
- Linear regression gains little from using 500x more data
- Main ME model requires larger amount of data
- Hypothesis expansion can reduce data requirements in low-resource scenarios
- ME model requires 500k parallel sentences before performance plateaus
Generalization of me across mt systems
- Examined whether ME model overfits on specific errors of MT system or generalizes to other MT systems
- Translated same data using different English → German models
- Evaluation of translations by different MT systems showed varying decrease in correlation with automated metrics
- Most systems had drop of ∼2-3%, meaning ME estimator generalizes well
- Exception was prompted T5 LM, for which transfer mostly failed
From me to qe
- Pre-training on ME helps on the QE task
- Fine-tuning improved performance over zero-shot
- Only TER was able to outperform training on z-scores from scratch
- Pre-training & fine-tuning regime does not perform well with limited target-domain data
- Very little human-annotated data is needed for training
Joint prediction of multiple metrics
- Investigated using all automated metrics in a single model instead of multiple individual models
- Trained two models to predict all available metrics at once
- Joint learning mostly helps in metric prediction but not in human z-score prediction
Complexity & fluency estimation
- Model was dependent on source and hypothesis
- Model can consider fluency and other factors with only hypothesis
- Model can estimate sentence difficulty/complexity with only source
- High correlations close to full text-only model’s performance
- Model is not able to utilize relationship between source and hypothesis
- Results are in line with general findings
- Model’s inadequacy is shown by imperfect performance when given access to hypothesis and reference
Related work
- ME task is related to an older task of confidence estimation
- Confidence estimation is a binary class based on two thresholded MT metrics
- QE models use features such as source & target lengths, number of translations, etc.
- Deeper QE models regress directly from source and hypothesis texts
- Pre-training QE models use artificial and authentic data
- QE models have been applied to other NLP tasks
- MT QE has a more constrained hypothesis space than other tasks
Discussion
- Automated metrics can be predicted without access to the reference.
- Pre-training on TER outperformed training from scratch.
- Large absolute zero-shot correlation, but no explanation.
- ME approach can be used for improving models outside of MT field.
Error analysis
- Model predictions are generally more conservative than target scores
- Model predictions are rescaled when evaluated with Pearson’s correlation coefficient
- Model predictions are not always 100% accurate
- Model predictions can be more accurate than the metric it was trained to estimate
- Model predictions can be very far from human judgement
Negative results
- Attempted to leverage mBERT representations in the main LSTM-based model
- Results were on par with the main LSTM-based model alone
- Experimented with expressivity of the used model architecture
- Model was unable to fit it perfectly
- Future work should use models with larger capacities
Conclusion
- Proposed task of metric estimation for machine translation
- Attempted to solve it with a baseline BiLSTM model
- Predicted metric without seeing reference
- Data for training ME models can be generated from any parallel corpus
- Pre-training on TER did not perform better than baseline
- Features in hypothesis space should be explored for tasks beyond ME/QE
- ME models should be evaluated for cross-domain performance
- Non-perfect correlation with metrics shows importance of exploring more complex architectures
- Model provides less explainability
- Model required longer training than baselines
- Variance in segment-level metrics/evaluations can be dealt with by ME models
- Detailed error analysis should be performed before deploying QE system
- Used SacreBLEU, multilingual BERT, T5-small, dynamicconv.glu.wmt16.en-de, conv.wmt17.en-de, transformer.wmt16.en-de, transformer.wmt18.en-de
- Used sigmoid as final activation function
- Optimization loss is mean squared error
- Adam optimization
- TF-IDF featurizer used in linear regression TF-IDF baseline
- Classifier suffers from low accuracy