Abstract
- Problem of hallucinations in neural machine translation has been recognized for a long time.
- Progress on alleviating the problem has been limited.
- Standard sequence log-probability is more informative than previously thought.
- Method proposed to evaluate percentage of source contribution to generated translation.
- Method improves detection accuracy for severe hallucinations by a factor of 2.
- Method can alleviate hallucinations at test time.
- Using sentence similarity from cross-lingual embeddings further improves results.
Paper Content
Introduction
- Hallucinations in machine translation are cases when the model generates output unrelated to the source sentence.
- Hallucinations can have a dramatic impact on user experience.
- Addressing hallucinations is challenging because they are rare and hard to identify with automatic metrics.
- Previous work mostly resorted to settings where models are encouraged to hallucinate.
- Recently, a large dataset with professional annotations of translations was gathered to evaluate the performance of various detection methods.
- The best realization of the detection framework uses sequence log-probability.
- We propose to use a method that evaluates the percentage of the source contribution to a generated translation.
- We hypothesize that instead of targeting quality evaluation, it might be beneficial to use models trained with a different objective.
- Cross-lingual sentence similarity and natural language inference models can detect all types of hallucinations.
Background and setting
- Framework proposed by Guerreiro et al. (2022) for evaluation of hallucination detection and mitigation methods
- Large dataset of annotated translations and model that produced them
Model
- Model is Transformer base from fairseq
- Standard hyperparameters setting
- Trained on WMT'18 German-English news translation data
- Excluding Paracrawl, 5.8M sentence pairs
- 1/3 of dataset used as held-out set for analysis
- Model released by Guerreiro et al. (2022) used for analysis
Hallucination dataset
- Guerreiro et al. (2022) released a dataset with manual annotations of 3415 German-to-English translations
- The translations were chosen from a set of 1.8M translations of held-out data
- 10 methods were used to flag the translations, including heuristics, quality estimation models and uncertainty detectors
- The dataset contains 323 hallucinations, 1044 less severe translation errors, and 2048 correct translations (the remainder of the 3415)
Hallucination detection methods
- Internal methods for handling hallucinations use only information from the translation model
- External methods for handling hallucinations use auxiliary models
- Oracles rely on reference translations
- Oracles cannot be used in preventive settings when references are not available
Reference-based oracles
- Use chrF and COMET metrics
- chrF is a character n-gram F-score; the chrF++ variant also takes into account word unigrams and bigrams
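The character n-gram F-score behind chrF can be illustrated with a minimal sketch. This is a simplified version written for illustration only: the function names, the whitespace handling, and the defaults (n-gram orders 1 to 6, beta = 2) are assumptions here, and the word n-gram component of chrF++ is omitted. It is not the scorer used in the paper.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams; whitespace is dropped (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged character n-gram precision/recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)
```

Because the score needs a reference translation, it can only serve as an oracle: it is unavailable when a translation has to be checked at test time.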
Internal measures
- Seq-Logprob is a standard length-normalized sequence log-probability
- Seq-Logprob performs the best compared to other methods targeting hallucinations
- ALTI+ is used to measure the percentage of source contribution to a generated translation
- ALTI+ and LRP-based methods show that source influence is lower for artificially created hallucinations
- This is the first time ALTI+ is tested in a real setting
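Both internal measures reduce to simple averages once the model's per-token quantities are available. The sketch below assumes the token log-probabilities and the per-token contributions have already been extracted; it does not implement ALTI+ itself. The function names, the input layout (source-token contributions first in each row), and the choice to average per-token source shares over generated tokens are assumptions made for illustration.

```python
import math


def seq_logprob(token_logprobs):
    """Length-normalized sequence log-probability: mean of token log-probs."""
    if not token_logprobs:
        raise ValueError("empty translation")
    return sum(token_logprobs) / len(token_logprobs)


def source_contribution(contributions, num_source_tokens):
    """Average share of the source in the total contribution per generated token.

    contributions[t] holds non-negative contributions to generated token t:
    the first num_source_tokens entries belong to source tokens, the rest to
    the target prefix (hypothetical layout).
    """
    shares = []
    for row in contributions:
        total = sum(row)
        shares.append(sum(row[:num_source_tokens]) / total)
    return sum(shares) / len(shares)
```

Under both measures a lower value is more suspicious: a low Seq-Logprob means the model was unsure of its output, and a low source contribution means the translation was driven mostly by the target prefix rather than the source.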
Main results
- ALTI performs comparably to Seq-Logprob for all hallucinations
- ALTI has twice the precision of Seq-Logprob for fully detached hallucinations
- LaBSE and XNLI substantially outperform Seq-Logprob
- LaBSE shows big improvements compared to Seq-Logprob, while LASER noticeably lags behind
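Sentence-similarity detectors such as LaBSE and LASER score a translation by comparing source and translation embeddings. A minimal sketch of that comparison, assuming the embeddings have already been produced by some cross-lingual encoder (the encoder call is not shown), might look like this. The function names and the threshold value are hypothetical and would need tuning on annotated data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def flag_hallucination(source_emb, translation_emb, threshold=0.5):
    """Flag a translation whose embedding is far from the source embedding.

    The threshold of 0.5 is an arbitrary placeholder, not a value from the paper.
    """
    return cosine_similarity(source_emb, translation_emb) < threshold
```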
Analysing distributions of the scores
- ALTI and Seq-Logprob show bimodal distribution for partial hallucinations
- COMET and COMET-QE do not separate hallucinations from less severe errors
- LaBSE ranks hallucination severity best
- LASER ignores most non-hallucinated translation errors
- XNLI separates fully detached hallucinations from correct translations well, but makes it hard to estimate the severity of an error
Detected pathology types
- Distribution of pathology types among detected examples
- Recall for different translation types
- LASER flags correct translations more
- XNLI flags undergenerations
- Fully detached hallucinations are easiest to detect
Mitigating hallucinations at test time
- Generating hypotheses using MC dropout outperforms more involved methods
- Reranking hypotheses with COMET-QE and internal ALTI lowers the hallucination rate
Evaluation methodology
- Experiments use automatic and manual evaluation
- Metrics used: COMET, BLEU, LaBSE, XNLI
- 150 sentences randomly sampled from each group of the hallucination dataset
- 600 sentences used in total
- Mitigation techniques applied to all 600 sentences, but in a practical application would only be applied to translations labeled as potential hallucinations
Generation strategies
- Monte Carlo dropout used to generate alternative hypotheses
- Beam search used to produce diverse translations
- MC BEAM method outperforms other methods
- Variability in hypotheses comes from model predictive uncertainty
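The idea behind Monte Carlo dropout is to keep dropout active at inference time, so that repeated forward passes over the same input give different outputs. The toy sketch below shows only that mechanism on a single linear layer; it is not a translation model, and every name and default in it (`toy_forward`, `mc_dropout_samples`, `p=0.1`) is hypothetical.

```python
import random


def dropout(values, p, rng):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def toy_forward(features, weights, p, rng):
    """One stochastic forward pass through a linear layer with dropout ON."""
    dropped = dropout(features, p, rng)
    return [sum(w * x for w, x in zip(row, dropped)) for row in weights]


def mc_dropout_samples(features, weights, n_passes=10, p=0.1, seed=0):
    """Run several stochastic passes; each pass sees a different dropout mask."""
    rng = random.Random(seed)
    return [toy_forward(features, weights, p, rng) for _ in range(n_passes)]
```

In a real system each stochastic pass would drive a full decoding run (greedy or beam search), producing one alternative hypothesis per pass.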
Reranking approaches
- Methods proposed can be used as rerankers in “detect-then-rewrite” pipeline
- LaBSE is the best reranker and performs better than COMET-QE baseline
- Reranking with any method is better than no reranking for all groups of original translations
- ALTI performs better than COMET-QE for fully detached hallucinations
- All reranking methods reduce the hallucination rate by a factor of 2.5-3
- LaBSE produces fewer hallucinations and more correct translations than ALTI
- COMET-QE has fewer errors than other reranking methods
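A "detect-then-rewrite" pipeline with a reranker can be sketched generically. The code below is an illustration under stated assumptions, not the authors' implementation: `detector`, `generate`, and `scorer` are hypothetical callables standing in for a detection score, a hypothesis generator (for example MC dropout decoding), and a reranking score such as LaBSE similarity or source contribution. Higher scores are assumed to mean better translations.

```python
def rerank(source, hypotheses, scorer):
    """Return the hypothesis with the highest score for this source."""
    return max(hypotheses, key=lambda hyp: scorer(source, hyp))


def detect_then_rewrite(source, translation, detector, threshold, generate, scorer):
    """Keep a translation that passes detection; otherwise regenerate and rerank.

    detector(source, translation) -> float, higher means less suspicious
    generate(source)              -> iterable of alternative hypotheses
    scorer(source, hypothesis)    -> float, higher means better
    """
    if detector(source, translation) >= threshold:
        return translation  # not flagged, leave it untouched
    # Keep the original among the candidates so reranking cannot make it worse
    # under the chosen scorer.
    candidates = [translation] + list(generate(source))
    return rerank(source, candidates, scorer)
```

The detector and the scorer can be the same measure or two different ones, so an internal method can be used for detection and an external one for reranking, or the reverse.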
Conclusions
- Evaluated percentage of source contribution to generated translation to detect and mitigate hallucinations
- Improved results of overall “detect-then-rewrite” pipeline
- Method doubles previous results for detecting the most severe type of hallucinations
- Matches hallucination reduction rate of previous method based on COMET-QE
- Motivates future work on model understanding and helps practitioners
- Expands methods for handling hallucinations beyond models specialized for quality estimation, e.g. to sentence similarity from cross-lingual embeddings
- LaBSE improves previous results significantly
- Models so far overlooked in the context of machine translation can be beneficial
- Fleiss’ Kappa for inter-annotator agreement is 0.57
- Paired two-sided Student's t-test used to compare methods
- Manual annotators provided with guidelines
- Results of manual evaluation reported
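For readers unfamiliar with the agreement statistic, Fleiss' kappa can be computed from a table of category counts per annotated item. The sketch below follows the standard definition; the function name and the input layout (one row per item, one column per category, the same number of raters for every item) are assumptions for illustration, and the numbers in the example are not the paper's annotation data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table counts[i][j] = raters assigning item i to category j.

    Every row must sum to the same number of raters (at least 2).
    """
    num_items = len(counts)
    num_raters = sum(counts[0])
    if any(sum(row) != num_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    # Observed agreement: mean over items of the pairwise agreement P_i.
    per_item = [
        (sum(c * c for c in row) - num_raters) / (num_raters * (num_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / num_items

    # Chance agreement: sum of squared overall category proportions.
    num_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (num_items * num_raters)
        for j in range(num_categories)
    ]
    expected = sum(p * p for p in proportions)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)


# Three raters, two items, perfect agreement on different categories.
print(fleiss_kappa([[3, 0], [0, 3]]))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so the reported 0.57 sits between the two.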