Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Problem of hallucinations in neural machine translation has been recognized for a long time.
Progress on alleviating the problem has been limited.
Standard sequence log-probability is more informative than previously thought.
Method proposed to evaluate percentage of source contribution to generated translation.
Method improves detection accuracy for severe hallucinations by a factor of 2.
Method can alleviate hallucinations at test time.
Using sentence similarity from cross-lingual embeddings further improves results.

Paper Content

Introduction

Hallucinations in machine translation are cases when the model generates output unrelated to the source sentence.
Hallucinations can have a dramatic impact on user experience.
Addressing hallucinations is challenging because they are rare and hard to identify with automatic metrics.
Previous work mostly resorted to settings where models are encouraged to hallucinate.
Recently, a large dataset with professional annotations of translations was gathered to evaluate the performance of various detection methods.
The best realization of the detection framework uses sequence log-probability.
We propose to use a method that evaluates the percentage of the source contribution to a generated translation.
We hypothesize that instead of targeting quality evaluation, it might be beneficial to use models trained with a different objective.
Cross-lingual sentence similarity and natural language inference models can detect all types of hallucinations.

Background and setting

Framework proposed by Guerreiro et al. (2022) for evaluation of hallucination detection and mitigation methods
Large dataset of annotated translations and model that produced them

Model

Model is Transformer base from fairseq
Standard hyperparameters setting
Trained on WMT'18 German-English news translation data
Excluding Paracrawl, 5.8M sentence pairs
1/3 of dataset used as held-out set for analysis
Model released by Guerreiro et al. (2022) used for analysis

Hallucination dataset

Guerreiro et al. (2022) released a dataset with manual annotations of 3415 German-to-English translations
The translations were chosen from a set of 1.8M translations of held-out data
10 methods were used to flag the translations, including heuristics, quality estimation models and uncertainty detectors
The dataset contains 323 examples of hallucinations, 1044 of less severe translation errors and the rest as correct translations

Hallucination detection methods

Internal methods for handling hallucinations use only information from the translation model
External methods for handling hallucinations use auxiliary models
Oracles rely on reference translations
Oracles cannot be used in preventive settings when references are not available

Reference-based oracles

Use chrF and COMET metrics
chrF takes into account word unigrams and bigrams

Internal measures

Seq-Logprob is a standard length-normalized sequence log-probability
Seq-Logprob performs the best compared to other methods targeting hallucinations
ALTI+ is used to measure the percentage of source contribution to a generated translation
ALTI+ and LRP-based methods show that source influence is lower for artificially created hallucinations
This is the first time ALTI+ is tested in a real setting

Main results

ALTI performs comparably to Seq-Logprob for all hallucinations
ALTI has twice better precision than Seq-Logprob for fully detached hallucinations
LaBSE and XNLI substantially outperform Seq-Logprob
LaBSE shows big improvements compared to Seq-Lobprob, while LASER noticeably lags behind

Analysing distributions of the scores

ALTI and Seq-Logprob show bimodal distribution for partial hallucinations
COMET and COMET-QE do not separate hallucinations from less severe errors
LaBSE ranks hallucination severity best
LASER ignores most non-hallucinated translation errors
XNLI provides good separation between fully detached hallucinations and correct translations, but is hard to estimate severity of an error

Detected pathology types

Distribution of pathology types among detected examples
Recall for different translation types
LASER flags correct translations more
XNLI flags undergenerations
Fully detached hallucinations are easiest to detect

Mitigating hallucinations at test time

Generating hypotheses using MC dropout outperforms more involved methods
Reranking hypotheses with COMET-QE and internal ALTI improves hallucination rate

Evaluation methodology

Experiments use automatic and manual evaluation
Metrics used: COMET, BLEU, LaBSE, XNLI
150 sentences randomly sampled from each group of the hallucination dataset
600 sentences used in total
Mitigation techniques applied to all 600 sentences, but in a practical application would only be applied to translations labeled as potential hallucinations

Generation strategies

Monte Carlo dropout used to generate alternative hypotheses
Beam search used to produce diverse translations
MC BEAM method outperforms other methods
Variability in hypotheses comes from model predictive uncertainty

Reranking approaches

Methods proposed can be used as rerankers in “detect-than-rewrite” pipeline
LaBSE is the best reranker and performs better than COMET-QE baseline
Reranking with any method is better than no reranking for all groups of original translations
ALTI performs better than COMET-QE for fully detached hallucinations
All reranking methods reduce hallucinatory rate by a factor of 2.5-3
LaBSE produces less hallucinations and more correct translations than ALTI
COMET-QE has less errors than other reranking methods

Conclusions

Evaluated percentage of source contribution to generated translation to detect and mitigate hallucinations
Improved results of overall “detect-then-rewrite” pipeline
Method improves previous results twice for detecting most severe type of hallucinations
Matches hallucination reduction rate of previous method based on COMET-QE
Motivates future work on model understanding and helps practitioners
Expands methods for handling hallucinations from models specialized for quality estimation
LaBSE improves previous results significantly
Models so far overlooked in the context of machine translation can be beneficial
Fleiss’ Kappa for inter-annotation agreement is 0.57
Paired two-sided Student test used to compare methods
Manual annotators provided with guidelines
Results of manual evaluation reported

Link to paper#

Abstract#

Paper Content#

Introduction#

Background and setting#

Model#

Hallucination dataset#

Hallucination detection methods#

Reference-based oracles#

Internal measures#

Main results#

Analysing distributions of the scores#

Detected pathology types#

Mitigating hallucinations at test time#

Evaluation methodology#

Generation strategies#

Reranking approaches#

Conclusions#

Link to paper

Abstract

Paper Content

Introduction

Background and setting

Model

Hallucination dataset

Hallucination detection methods

Reference-based oracles

Internal measures

Main results

Analysing distributions of the scores

Detected pathology types

Mitigating hallucinations at test time

Evaluation methodology

Generation strategies

Reranking approaches

Conclusions