Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Neural sequence generation models can produce outputs that are unrelated to the source text.
- It is unclear what conditions cause these hallucinations and how to prevent them.
- This work identifies internal model symptoms of hallucinations and uses them to design a detector.
- The detector outperforms model-free baselines and strong classifiers on English-Chinese and German-English translation test beds.
Paper Content
Introduction
- Neural language generation models can generate high quality text, but also fail in counter-intuitive ways.
- Detached hallucinations are the most severe case, and can risk misleading users and undermining trust.
- We lack a systematic understanding of the conditions where hallucinations arise.
- Prior work has focused on black-box detection methods and studying hallucinations given artificially perturbed inputs.
- We identify internal model symptoms that characterize hallucinations given artificial inputs and test them on translations of natural texts.
Hallucinations: definition and hypotheses
- Definition of “hallucinations” in MT and NLG
- Different from previous work, focus on detached hallucinations
- Prior work on understanding conditions that lead to hallucinations focused on training conditions and data noise
- Complementary approach to diagnosing hallucinations is to identify symptoms via model introspection
- Low Source Contribution Hypothesis states that hallucinations occur when NMT overly relies on target context over source
- Local Source Contribution Hypothesis states that hallucinations occur when NMT overly relies on a small subset of source tokens
- Static Source Contribution Hypothesis states that distribution of source contributions remains static when NMT model hallucinates
Study of hallucinations under perturbations via model introspection
- Hallucinations are rare and hard to identify in natural datasets.
- Source perturbations can be used to test hypotheses at scale.
- A counterfactual hallucination dataset was constructed to test hypotheses.
- Token contributions were computed using LRP.
- Controlled comparison of patterns was conducted on original and hallucinated samples.
Perturbation-based hallucination data
- Randomly select 50k seed sentence pairs from NMT training corpora
- Misspell words with 0.1 probability
- Title-case words with 0.1 probability
- Insert random token at beginning of source sentence
Measuring relative token contributions
- We tested 3 hypotheses on a dataset using LRP (Bach et al., 2015).
- LRP decomposes the prediction of a neural model into relevance scores for input dimensions.
- LRP-α β (Bach et al., 2015;Binder et al., 2016) was used to back-propagate relevance scores from the last layer to the first layer.
- We tested the hypotheses based on the distribution of relative token contributions and compared it with the attention matrix.
- We built strong Transformer models on two high-resource language pairs: English→Chinese (En-Zh) and German→English (De-En).
- Data was tokenized using Moses scripts (Koehn et al., 2007) and Jieba segmenter.
- Models were based on the base Transformer (Vaswani et al., 2017) and trained using the Adam optimizer (Kingma and Ba, 2015).
- Decoding was done with beam search with a beam size of 4.
Findings
- Compared different classifiers to baselines using Precision, Recall and F1 scores
- Reported Area Under the Receiver Operating Characteristic Curve (AUC) to measure discriminative power of each method
A classifier to detect natural hallucinations
- Introduced hallucination detector using features extracted from source contributions
- Features include normalized source contribution of first and last tokens and source contribution staticity
- Training data constructed using perturbation-based samples with source length between 20 and 60
Detecting natural hallucinations
- Tested a hallucination classifier built on insights from perturbation-based hallucinations
- Evaluated on a human-annotated test bed for hallucinations generated on natural source inputs
- Compared against a wide range of relevant models
Natural hallucination evaluation set
- Zhou et al. (2021) created the only publicly available dataset of annotated MT hallucinations.
- Test bed for detached hallucination detection was built for different language pairs and translation directions.
- Data and underlying NMT models will be released.
- Samples collected from large pools of out-of-domain data.
- 10 bilingual annotators assess faithfulness of NMT output.
Experimental conditions
- Implemented LRP-based classifier
- Clipped source length at 40 and considered influence of most recent 10 target tokens
- Tuned hyper-parameters based on average F1 accuracy
- Compared with attention-based classifier
- Used 3 simple baselines to characterize task
- Random classifier, degeneration detector, NMT probability scores
- COMET-QE classifier achieved highest AUC and F1 scores
- LASER classifier achieved higher AUC than XLM-R classifier
- LRP-based classifier benefited most from Source Contribution Staticity features
- Highest false positive rate for LRP classifier on incomprehensible but aligned translations
- LASER and LRP-based classifiers achieved 35% and 45% Precision@20
- After tuning threshold, LRP + LASER ensemble detected 9 hallucinations with 89% precision
- Introspection-based classifiers more robust than pre-trained multilingual models
- LRP-based classifier best hallucination detector overall
Limitations
- Focus on understanding and detecting detached hallucinations in MT
- Experiments based on parallel data from WMT, not monolingual data
- Open question how findings generalize to models trained on more diverse data types
- Hallucinations expected to be rare, small-scale experiment shows promising results
Related work
- Hallucinations occur in all applications of neural models to language generation
- Existing approaches to detecting hallucinations view the generation model as a black-box
- Heuristics and external models are used to measure the faithfulness of the outputs
- Guerreiro et al. (2022) explore glass-box detection methods based on model confidence scores or attention patterns
- This paper investigates varying types of glass-box patterns based on the relative token contributions
- MT quality estimation literature does not distinguish adequacy and fluency errors
- Glass-box methods rely on model probabilities, uncertainty quantification, and the entropy of the attention distribution
- Saliency interpretation method is used to measure the importance of each input unit
Conclusion
- Hallucinations in NMT are poorly understood
- Distinctive source contribution patterns can indicate hallucinations better than relative contribution of source and target
- Quality estimation models and black-box classifiers can be used to detect natural hallucinations
- Human-annotated test beds of English-Chinese and German-English hallucinations released
- Minimum Risk Training can reduce frequency of hallucinations