Abstract

  • The problem of hallucinations in neural machine translation has long been recognized.
  • Progress on alleviating the problem has been limited.
  • Standard sequence log-probability is more informative than previously thought.
  • Method proposed to evaluate percentage of source contribution to generated translation.
  • Method improves detection accuracy for severe hallucinations by a factor of 2.
  • Method can alleviate hallucinations at test time.
  • Using sentence similarity from cross-lingual embeddings further improves results.

Paper Content

Introduction

  • Hallucinations in machine translation are cases when the model generates output unrelated to the source sentence.
  • Hallucinations can have a dramatic impact on user experience.
  • Addressing hallucinations is challenging because they are rare and hard to identify with automatic metrics.
  • Previous work mostly resorted to settings where models are encouraged to hallucinate.
  • Recently, a large dataset with professional annotations of translations was gathered to evaluate the performance of various detection methods.
  • The best realization of the detection framework uses sequence log-probability.
  • We propose to use a method that evaluates the percentage of the source contribution to a generated translation.
  • We hypothesize that instead of targeting quality evaluation, it might be beneficial to use models trained with a different objective.
  • Cross-lingual sentence similarity and natural language inference models can detect all types of hallucinations.

Background and setting

  • Framework proposed by Guerreiro et al. (2022) for evaluation of hallucination detection and mitigation methods
  • Large dataset of annotated translations and model that produced them

Model

  • Model is Transformer base from fairseq
  • Standard hyperparameters setting
  • Trained on WMT'18 German-English news translation data
  • Excluding Paracrawl, 5.8M sentence pairs
  • 1/3 of dataset used as held-out set for analysis
  • Model released by Guerreiro et al. (2022) used for analysis

Hallucination dataset

  • Guerreiro et al. (2022) released a dataset with manual annotations of 3415 German-to-English translations
  • The translations were chosen from a set of 1.8M translations of held-out data
  • 10 methods were used to flag the translations, including heuristics, quality estimation models and uncertainty detectors
  • The dataset contains 323 hallucinations, 1044 less severe translation errors, and 2048 correct translations

Hallucination detection methods

  • Internal methods for handling hallucinations use only information from the translation model
  • External methods for handling hallucinations use auxiliary models
  • Oracles rely on reference translations
  • Oracles cannot be used in preventive settings when references are not available

Reference-based oracles

  • Use chrF and COMET metrics
  • chrF is a character n-gram F-score; the chrF++ variant used here also takes into account word unigrams and bigrams
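
To make the character n-gram idea behind chrF concrete, here is a minimal sketch. It is a simplified illustration and not the metric implementation used in the paper: the function names, the choice to drop spaces, and the averaging of per-order F-scores are assumptions made for brevity, and the word-level n-grams of chrF++ are left out.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Average the character n-gram F-beta score over orders 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # this order is longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
        scores.append(f_beta)
    return sum(scores) / len(scores) if scores else 0.0
```

An oracle of this kind needs a reference translation, so it can only serve as an upper-bound comparison for detectors that work without references.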

Internal measures

  • Seq-Logprob is a standard length-normalized sequence log-probability
  • Seq-Logprob performs the best compared to other methods targeting hallucinations
  • ALTI+ is used to measure the percentage of source contribution to a generated translation
  • ALTI+ and LRP-based methods show that source influence is lower for artificially created hallucinations
  • This is the first time ALTI+ is tested in a real setting
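
The two internal measures above can be sketched as follows. This is an illustration under stated assumptions: `token_logprobs` and `contributions` are hypothetical inputs that a translation model and a token attribution method such as ALTI+ would have to supply, and averaging the source share over target tokens is an assumption of this sketch, not a detail confirmed by the summary.

```python
import math


def seq_logprob(token_logprobs):
    """Length-normalized sequence log-probability: mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one generated token")
    return sum(token_logprobs) / len(token_logprobs)


def source_contribution(contributions, num_source_tokens):
    """Share of attribution that comes from the source, averaged over target tokens.

    contributions[t] holds non-negative contributions to target token t,
    source tokens first, then target-prefix tokens (hypothetical layout).
    """
    if not contributions:
        raise ValueError("need at least one target token")
    shares = []
    for row in contributions:
        total = sum(row)
        source_part = sum(row[:num_source_tokens])
        shares.append(source_part / total if total > 0 else 0.0)
    return sum(shares) / len(shares)


# Toy values: two tokens, each generated with probability 0.5.
example_score = seq_logprob([math.log(0.5), math.log(0.5)])
```

Low values of either score would mark a translation as a hallucination candidate; where to put the threshold is a tuning decision that this sketch does not cover.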

Main results

  • ALTI performs comparably to Seq-Logprob for all hallucinations
  • ALTI has twice the precision of Seq-Logprob for fully detached hallucinations
  • LaBSE and XNLI substantially outperform Seq-Logprob
  • LaBSE shows big improvements compared to Seq-Logprob, while LASER noticeably lags behind
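
A similarity-based detector of the kind compared above can be sketched like this. The sketch assumes source and translation embeddings have already been computed by a cross-lingual encoder such as LaBSE; the toy vectors, the threshold, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def flag_hallucinations(embedding_pairs, threshold):
    """Flag a translation when source/translation similarity falls below threshold."""
    return [cosine(src, tgt) < threshold for src, tgt in embedding_pairs]


def precision(flags, labels):
    """Fraction of flagged examples that are annotated as hallucinations."""
    flagged = [label for flag, label in zip(flags, labels) if flag]
    return sum(flagged) / len(flagged) if flagged else 0.0


# Toy data: the second translation is unrelated to its source.
toy_pairs = [([1.0, 0.0], [0.9, 0.1]), ([1.0, 0.0], [0.0, 1.0])]
toy_labels = [False, True]
toy_flags = flag_hallucinations(toy_pairs, threshold=0.5)
```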

Analysing distributions of the scores

  • ALTI and Seq-Logprob show bimodal distribution for partial hallucinations
  • COMET and COMET-QE do not separate hallucinations from less severe errors
  • LaBSE ranks hallucination severity best
  • LASER ignores most non-hallucinated translation errors
  • XNLI separates fully detached hallucinations from correct translations well, but makes it hard to estimate the severity of an error

Detected pathology types

  • Distribution of pathology types among detected examples
  • Recall for different translation types
  • LASER flags correct translations more
  • XNLI flags undergenerations
  • Fully detached hallucinations are easiest to detect

Mitigating hallucinations at test time

  • Generating hypotheses using MC dropout outperforms more involved methods
  • Reranking hypotheses with COMET-QE and internal ALTI reduces the hallucination rate

Evaluation methodology

  • Experiments use automatic and manual evaluation
  • Metrics used: COMET, BLEU, LaBSE, XNLI
  • 150 sentences randomly sampled from each group of the hallucination dataset
  • 600 sentences used in total (four groups of 150)
  • Mitigation techniques applied to all 600 sentences, but in a practical application would only be applied to translations labeled as potential hallucinations

Generation strategies

  • Monte Carlo dropout used to generate alternative hypotheses
  • Beam search used to produce diverse translations
  • MC BEAM method outperforms other methods
  • Variability in hypotheses comes from model predictive uncertainty

Reranking approaches

  • Methods proposed can be used as rerankers in “detect-then-rewrite” pipeline
  • LaBSE is the best reranker and performs better than COMET-QE baseline
  • Reranking with any method is better than no reranking for all groups of original translations
  • ALTI performs better than COMET-QE for fully detached hallucinations
  • All reranking methods reduce the hallucination rate by a factor of 2.5-3
  • LaBSE produces fewer hallucinations and more correct translations than ALTI
  • COMET-QE has fewer errors than other reranking methods
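
The detect-then-rewrite idea can be sketched as a small function: keep a translation that passes the detector, otherwise generate alternatives and return the best-scoring one. Everything named here is hypothetical. `score` stands in for any detector or reranker (LaBSE similarity, ALTI, COMET-QE), `generate` stands in for a hypothesis generator such as MC dropout decoding, and the toy lookup table replaces a real model.

```python
from typing import Callable, Iterable, Sequence


def rerank(source: str, hypotheses: Sequence[str],
           score: Callable[[str, str], float]) -> str:
    """Return the hypothesis with the highest score for this source."""
    return max(hypotheses, key=lambda hyp: score(source, hyp))


def detect_then_rewrite(source: str, translation: str,
                        score: Callable[[str, str], float],
                        threshold: float,
                        generate: Callable[[str], Iterable[str]]) -> str:
    """Keep the translation if it passes the detector, else rerank alternatives."""
    if score(source, translation) >= threshold:
        return translation
    candidates = [translation] + list(generate(source))
    return rerank(source, candidates, score)


# Toy stand-ins: a lookup table plays the role of the scoring model.
TOY_SCORES = {"The hotel was nice.": 0.9, "The hotel.": 0.6, "I don't know.": 0.1}


def toy_score(source: str, hypothesis: str) -> float:
    return TOY_SCORES.get(hypothesis, 0.0)


def toy_generate(source: str) -> list:
    return ["The hotel was nice.", "The hotel."]


fixed = detect_then_rewrite("Das Hotel war schön.", "I don't know.",
                            toy_score, 0.5, toy_generate)
```

Using one scorer for both detection and reranking is a simplification; the two roles could be filled by different models.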

Conclusions

  • Evaluated percentage of source contribution to generated translation to detect and mitigate hallucinations
  • Improved results of overall “detect-then-rewrite” pipeline
  • Method doubles previous results for detecting the most severe type of hallucinations
  • Matches hallucination reduction rate of previous method based on COMET-QE
  • Motivates future work on model understanding and helps practitioners
  • Expands methods for handling hallucinations from models specialized for quality estimation
  • LaBSE improves previous results significantly
  • Models so far overlooked in the context of machine translation can be beneficial
  • Fleiss’ Kappa for inter-annotator agreement is 0.57
  • Paired two-sided Student's t-test used to compare methods
  • Manual annotators provided with guidelines
  • Results of manual evaluation reported
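
For readers unfamiliar with the agreement statistic reported above, here is a minimal sketch of Fleiss' kappa. The input layout (`counts[i][j]` is the number of annotators who put item i into category j) and the toy matrices are assumptions for illustration; they are not the paper's annotation data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for items rated by the same number of annotators.

    counts[i][j] = number of annotators who assigned item i to category j.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement per item, then averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)
```

A value of 1 means perfect agreement and 0 means agreement at chance level.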