Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Neural sequence generation models can produce outputs that are unrelated to the source text.
It is unclear what conditions cause these hallucinations and how to prevent them.
This work identifies internal model symptoms of hallucinations and uses them to design a detector.
The detector outperforms model-free baselines and strong classifiers on English-Chinese and German-English translation test beds.

Paper Content

Introduction

Neural language generation models can generate high quality text, but also fail in counter-intuitive ways.
Detached hallucinations are the most severe case, and can risk misleading users and undermining trust.
We lack a systematic understanding of the conditions where hallucinations arise.
Prior work has focused on black-box detection methods and studying hallucinations given artificially perturbed inputs.
We identify internal model symptoms that characterize hallucinations given artificial inputs and test them on translations of natural texts.

Hallucinations: definition and hypotheses

Definition of “hallucinations” in MT and NLG
Unlike previous work, this paper focuses on detached hallucinations
Prior work on understanding conditions that lead to hallucinations focused on training conditions and data noise
Complementary approach to diagnosing hallucinations is to identify symptoms via model introspection
Low Source Contribution Hypothesis states that hallucinations occur when NMT overly relies on target context over source
Local Source Contribution Hypothesis states that hallucinations occur when NMT overly relies on a small subset of source tokens
Static Source Contribution Hypothesis states that distribution of source contributions remains static when NMT model hallucinates
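
The three hypotheses above can each be turned into a number computed from per-step token contributions. The following is a minimal sketch, not the paper's exact definitions: the function names, the top-k concentration measure, and the use of consecutive-step cosine similarity for staticity are all illustrative assumptions. It assumes `src_rows[t][i]` holds the non-negative contribution of source token `i` to target step `t`, and `tgt_totals[t]` holds the total contribution of the target prefix at step `t`.

```python
import math

def source_share(src_rows, tgt_totals):
    """Low Source Contribution: mean fraction of relevance assigned to the source."""
    shares = []
    for src, tgt in zip(src_rows, tgt_totals):
        total = sum(src) + tgt
        shares.append(sum(src) / total if total > 0 else 0.0)
    return sum(shares) / len(shares)

def top_k_concentration(src_rows, k=1):
    """Local Source Contribution: mean share of source relevance held by the k largest tokens."""
    values = []
    for src in src_rows:
        total = sum(src)
        top = sum(sorted(src, reverse=True)[:k])
        values.append(top / total if total > 0 else 0.0)
    return sum(values) / len(values)

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def staticity(src_rows):
    """Static Source Contribution: mean cosine similarity between consecutive steps."""
    pairs = list(zip(src_rows, src_rows[1:]))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy example: contributions that move along the source vs. stay fixed on one token.
moving = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
fixed = [[0.9, 0.05, 0.05]] * 3
print(staticity(moving), staticity(fixed))
```

Under these assumed measures, a hallucination would show a low `source_share`, a high `top_k_concentration`, or a `staticity` close to 1, depending on which hypothesis holds.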

Study of hallucinations under perturbations via model introspection

Hallucinations are rare and hard to identify in natural datasets.
Source perturbations can be used to test hypotheses at scale.
A counterfactual hallucination dataset was constructed to test hypotheses.
Token contributions were computed using LRP.
Controlled comparison of patterns was conducted on original and hallucinated samples.

Perturbation-based hallucination data

Randomly select 50k seed sentence pairs from NMT training corpora
Misspell words with 0.1 probability
Title-case words with 0.1 probability
Insert random token at beginning of source sentence
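
The perturbations listed above can be sketched as follows. This is an illustration under stated assumptions: the summary does not say how a word is misspelled, so the sketch swaps two adjacent characters; it also does not say whether the perturbations are applied separately or together, so each one is its own function. The names `perturb_misspell`, `perturb_title_case`, and `perturb_insert`, and the `vocab` argument, are hypothetical.

```python
import random

def misspell(word, rng):
    """Swap two adjacent characters (one simple kind of misspelling; an assumption)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def perturb_misspell(tokens, rng, p=0.1):
    """Misspell each word independently with probability p."""
    return [misspell(t, rng) if rng.random() < p else t for t in tokens]

def perturb_title_case(tokens, rng, p=0.1):
    """Title-case each word independently with probability p."""
    return [t.capitalize() if rng.random() < p else t for t in tokens]

def perturb_insert(tokens, rng, vocab):
    """Insert one random token from vocab at the beginning of the sentence."""
    return [rng.choice(vocab)] + list(tokens)

rng = random.Random(0)  # fixed seed so the perturbations are reproducible
source = "the quick brown fox jumps over the lazy dog".split()
print(perturb_misspell(source, rng))
print(perturb_title_case(source, rng))
print(perturb_insert(source, rng, vocab=["however", "seven", "blue"]))
```

Passing an explicit `random.Random` instance keeps the perturbed dataset reproducible across runs.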

Measuring relative token contributions

We tested 3 hypotheses on a dataset using LRP (Bach et al., 2015).
LRP decomposes the prediction of a neural model into relevance scores for input dimensions.
LRP-αβ (Bach et al., 2015; Binder et al., 2016) was used to back-propagate relevance scores from the last layer to the first layer.
We tested the hypotheses based on the distribution of relative token contributions and compared it with the attention matrix.
We built strong Transformer models on two high-resource language pairs: English→Chinese (En-Zh) and German→English (De-En).
Data was tokenized using Moses scripts (Koehn et al., 2007) and Jieba segmenter.
Models were based on the base Transformer (Vaswani et al., 2017) and trained using the Adam optimizer (Kingma and Ba, 2015).
Decoding was done with beam search with a beam size of 4.
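
For reference, the LRP-αβ rule mentioned above is commonly written as below for a single linear layer. This is the standard textbook form with bias terms omitted, in the sign convention where α − β = 1; the α and β values used in the paper are not stated in this summary. Here $z_{ij} = x_i w_{ij}$ is the contribution of input neuron $i$ to output neuron $j$, and $z^{+}$ and $z^{-}$ denote its positive and negative parts.

```latex
R_i^{(l)} = \sum_j \left(
    \alpha \, \frac{z_{ij}^{+}}{\sum_{i'} z_{i'j}^{+}}
  - \beta  \, \frac{z_{ij}^{-}}{\sum_{i'} z_{i'j}^{-}}
\right) R_j^{(l+1)},
\qquad \alpha - \beta = 1
```

Because the weights for each output neuron $j$ sum to $\alpha - \beta = 1$ over the inputs $i$, the total relevance is conserved from layer to layer, which is what allows the scores at the input to be read as relative token contributions.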

Findings

Compared different classifiers to baselines using Precision, Recall and F1 scores
Reported Area Under the Receiver Operating Characteristic Curve (AUC) to measure discriminative power of each method
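
The metrics named above can be computed as in the following minimal sketch, written in plain Python for clarity. It assumes binary labels where 1 marks a hallucination; the function names are illustrative. AUC is computed by pairwise comparison: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, counting ties as one half.

```python
def precision_recall_f1(labels, preds):
    """Precision, recall and F1 for binary labels/predictions (1 = hallucination)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(precision_recall_f1([1, 0, 1, 0], [1, 1, 0, 0]))
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

Unlike Precision, Recall and F1, AUC does not depend on a decision threshold, which is why it is useful for comparing the discriminative power of detectors that output scores on different scales.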

A classifier to detect natural hallucinations

Introduced hallucination detector using features extracted from source contributions
Features include normalized source contribution of first and last tokens and source contribution staticity
Training data constructed using perturbation-based samples with source length between 20 and 60
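
A rough sketch of the length filter and the first/last-token features described above. The exact feature definitions are not given in this summary, so this is an assumption: each feature is the share of source relevance assigned to the first (or last) source token, averaged over target steps. The names `in_length_range` and `edge_token_features` are hypothetical, and the staticity feature is left out of this sketch.

```python
def in_length_range(src_tokens, lo=20, hi=60):
    """Keep only training samples whose source length is between lo and hi."""
    return lo <= len(src_tokens) <= hi

def edge_token_features(src_rows):
    """Mean normalized contribution of the first and the last source token.

    src_rows[t][i] is the non-negative contribution of source token i
    to target step t; each row is normalized to sum to 1 before averaging.
    """
    first, last = [], []
    for row in src_rows:
        total = sum(row)
        if total <= 0:
            continue  # skip steps that assign no relevance to the source
        first.append(row[0] / total)
        last.append(row[-1] / total)
    if not first:
        return [0.0, 0.0]
    return [sum(first) / len(first), sum(last) / len(last)]

rows = [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]]
print(edge_token_features(rows))
```

The resulting feature vector is small and fixed-size regardless of sentence length, which keeps the detector lightweight.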

Detecting natural hallucinations

Tested a hallucination classifier built on insights from perturbation-based hallucinations
Evaluated on a human-annotated test bed for hallucinations generated on natural source inputs
Compared against a wide range of relevant models

Natural hallucination evaluation set

Zhou et al. (2021) created the only publicly available dataset of annotated MT hallucinations.
Test bed for detached hallucination detection was built for different language pairs and translation directions.
Data and underlying NMT models will be released.
Samples collected from large pools of out-of-domain data.
10 bilingual annotators assess faithfulness of NMT output.

Experimental conditions

Implemented LRP-based classifier
Clipped source length at 40 and considered influence of most recent 10 target tokens
Tuned hyper-parameters based on average F1 score
Compared with attention-based classifier
Used 3 simple baselines to characterize task
Random classifier, degeneration detector, NMT probability scores
COMET-QE classifier achieved highest AUC and F1 scores
LASER classifier achieved higher AUC than XLM-R classifier
LRP-based classifier benefited most from Source Contribution Staticity features
Highest false positive rate for LRP classifier on incomprehensible but aligned translations
LASER and LRP-based classifiers achieved 35% and 45% Precision@20
After tuning threshold, LRP + LASER ensemble detected 9 hallucinations with 89% precision
Introspection-based classifiers more robust than pre-trained multilingual models
LRP-based classifier best hallucination detector overall
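
Precision@20, reported above, measures how many of the 20 outputs a detector ranks as most likely hallucinated are truly hallucinations. A minimal sketch, assuming binary labels (1 = hallucination) and detector scores where higher means more suspicious; the function name is illustrative.

```python
def precision_at_k(labels, scores, k):
    """Fraction of true hallucinations among the k highest-scoring outputs."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(label for _, label in top) / len(top)

labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
print(precision_at_k(labels, scores, k=2))
```

This metric suits the setting because hallucinations are rare: it reflects what a human reviewer would see when inspecting only the top-ranked outputs.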

Limitations

Focus on understanding and detecting detached hallucinations in MT
Experiments based on parallel data from WMT, not monolingual data
Open question how findings generalize to models trained on more diverse data types
Hallucinations are expected to be rare; the small-scale experiment shows promising results

Related work

Hallucinations occur in all applications of neural models to language generation
Existing approaches to detecting hallucinations view the generation model as a black-box
Heuristics and external models are used to measure the faithfulness of the outputs
Guerreiro et al. (2022) explore glass-box detection methods based on model confidence scores or attention patterns
This paper investigates varying types of glass-box patterns based on the relative token contributions
MT quality estimation literature does not distinguish adequacy and fluency errors
Glass-box methods rely on model probabilities, uncertainty quantification, and the entropy of the attention distribution
Saliency interpretation method is used to measure the importance of each input unit

Conclusion

Hallucinations in NMT are poorly understood
Distinctive source contribution patterns can indicate hallucinations better than relative contribution of source and target
Quality estimation models and black-box classifiers can be used to detect natural hallucinations
Human-annotated test beds of English-Chinese and German-English hallucinations released
Minimum Risk Training can reduce frequency of hallucinations
