

  • Neural sequence generation models can produce outputs that are unrelated to the source text.
  • It is unclear what conditions cause these hallucinations and how to prevent them.
  • This work identifies internal model symptoms of hallucinations and uses them to design a detector.
  • The detector outperforms model-free baselines and strong classifiers based on quality estimation or large pre-trained models on human-annotated English-Chinese and German-English translation test beds.

Paper Content


  • Neural language generation models can generate high quality text, but also fail in counter-intuitive ways.
  • Detached hallucinations are the most severe case, and can risk misleading users and undermining trust.
  • We lack a systematic understanding of the conditions where hallucinations arise.
  • Prior work has focused on black-box detection methods and studying hallucinations given artificially perturbed inputs.
  • We identify internal model symptoms that characterize hallucinations given artificial inputs and test them on translations of natural texts.

Hallucinations: definition and hypotheses

  • The section defines “hallucinations” in MT and NLG.
  • Unlike previous work, it focuses on detached hallucinations.
  • Prior work on the conditions that lead to hallucinations focused on training conditions and data noise.
  • A complementary approach to diagnosing hallucinations is to identify their symptoms via model introspection.
  • The Low Source Contribution Hypothesis states that hallucinations occur when the NMT model relies too heavily on the target context rather than the source.
  • The Local Source Contribution Hypothesis states that hallucinations occur when the NMT model relies too heavily on a small subset of source tokens.
  • The Static Source Contribution Hypothesis states that the distribution of source contributions remains static across generation steps when the NMT model hallucinates.
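
The three hypotheses can each be turned into a number computed from per-step token contributions. The sketch below is one possible formalization, not the paper's exact definitions. It assumes `src[t][i]` holds a non-negative contribution of source token i at target step t and `tgt[t][j]` the contribution of previously generated target token j. The function names and the choice of entropy and cosine similarity are illustrative assumptions.

```python
import math

def source_share(src, tgt):
    """Low Source Contribution: mean fraction of relevance assigned to the source."""
    shares = []
    for s_row, t_row in zip(src, tgt):
        total = sum(s_row) + sum(t_row)
        shares.append(sum(s_row) / total if total > 0 else 0.0)
    return sum(shares) / len(shares)

def source_entropy(src):
    """Local Source Contribution: mean entropy of the per-step source distribution.
    Low entropy means a few source tokens dominate (assumed proxy for locality)."""
    entropies = []
    for row in src:
        z = sum(row)
        probs = [x / z for x in row if x > 0] if z > 0 else []
        entropies.append(-sum(p * math.log(p) for p in probs))
    return sum(entropies) / len(entropies)

def staticity(src):
    """Static Source Contribution: mean cosine similarity between the source
    contribution vectors of consecutive target steps (assumed proxy)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na > 0 and nb > 0 else 0.0
    sims = [cosine(src[t - 1], src[t]) for t in range(1, len(src))]
    return sum(sims) / len(sims) if sims else 0.0
```

Under these assumptions, a hallucinated output would be expected to show a low source share, low entropy, or high staticity, depending on which hypothesis holds.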

Study of hallucinations under perturbations via model introspection

  • Hallucinations are rare and hard to identify in natural datasets.
  • Source perturbations can be used to test hypotheses at scale.
  • A counterfactual hallucination dataset was constructed to test hypotheses.
  • Token contributions were computed using Layer-wise Relevance Propagation (LRP).
  • Controlled comparison of patterns was conducted on original and hallucinated samples.

Perturbation-based hallucination data

  • Randomly select 50k seed sentence pairs from NMT training corpora
  • Misspell words with 0.1 probability
  • Title-case words with 0.1 probability
  • Insert random token at beginning of source sentence
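
The perturbations listed above can be sketched as three small functions. This is a minimal illustration only. The summary does not say how words are misspelled, so the adjacent-character swap is an assumption, as are the function names, the whitespace tokenization, and the `vocab` list used for insertion.

```python
import random

def misspell(word, rng):
    """Swap two adjacent characters (one hypothetical way to misspell a word)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def perturb_misspell(tokens, rng, p=0.1):
    """Misspell each word independently with probability p."""
    return [misspell(t, rng) if rng.random() < p else t for t in tokens]

def perturb_title_case(tokens, rng, p=0.1):
    """Title-case each word independently with probability p."""
    return [t.capitalize() if rng.random() < p else t for t in tokens]

def perturb_insert(tokens, rng, vocab):
    """Insert one random token from vocab at the beginning of the sentence."""
    return [rng.choice(vocab)] + list(tokens)
```

Each function takes a `random.Random` instance so that a perturbed dataset can be regenerated from a fixed seed, for example `perturb_title_case("the quick brown fox".split(), random.Random(0))`.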

Measuring relative token contributions

  • We tested the three hypotheses on the perturbation-based hallucination dataset using LRP (Bach et al., 2015).
  • LRP decomposes the prediction of a neural model into relevance scores for input dimensions.
  • LRP-αβ (Bach et al., 2015; Binder et al., 2016) was used to back-propagate relevance scores from the last layer to the first layer.
  • We tested the hypotheses based on the distribution of relative token contributions and compared it with the attention matrix.
  • We built strong Transformer models on two high-resource language pairs: English→Chinese (En-Zh) and German→English (De-En).
  • Data was tokenized using Moses scripts (Koehn et al., 2007) and Jieba segmenter.
  • Models were based on the base Transformer (Vaswani et al., 2017) and trained using the Adam optimizer (Kingma and Ba, 2015).
  • Decoding was done with beam search with a beam size of 4.
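
To make the LRP-αβ rule concrete, the sketch below redistributes relevance through a single bias-free linear layer. It is a toy illustration of the general rule from Bach et al. (2015), not the paper's Transformer implementation. The function name, the list-based data layout, and the default α = 2, β = 1 are assumptions. The rule treats positive and negative contributions z_ij = x_i · w_ij separately and requires α − β = 1.

```python
def lrp_alpha_beta(x, W, R_out, alpha=2.0, beta=1.0):
    """Propagate relevance R_out (one value per output unit) back to the inputs.

    x: list of input activations, W[i][j]: weight from input i to output j.
    Returns one relevance value per input.
    """
    if abs(alpha - beta - 1.0) > 1e-9:
        raise ValueError("LRP-alpha-beta requires alpha - beta = 1")
    n_in = len(x)
    R_in = [0.0] * n_in
    for j, r_j in enumerate(R_out):
        # Contribution of each input to output unit j.
        z = [x[i] * W[i][j] for i in range(n_in)]
        pos = sum(v for v in z if v > 0)
        neg = sum(v for v in z if v < 0)
        for i in range(n_in):
            if z[i] > 0:
                R_in[i] += alpha * (z[i] / pos) * r_j
            elif z[i] < 0:
                R_in[i] -= beta * (z[i] / neg) * r_j
    return R_in
```

In this sketch the total relevance is conserved for every output unit that receives both positive and negative contributions, and also in the special case α = 1, β = 0 when all contributions are positive.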


  • Compared different classifiers to baselines using Precision, Recall and F1 scores
  • Reported Area Under the Receiver Operating Characteristic Curve (AUC) to measure discriminative power of each method
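
For reference, the metrics named above can be computed as follows. This is a plain-Python sketch using the standard definitions, where label 1 marks a hallucination. The function names are illustrative, and the AUC is computed through its pairwise-ranking interpretation rather than by tracing the ROC curve.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, Recall and F1 for the positive (hallucination) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.
    Ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Precision, Recall and F1 depend on a decision threshold, while AUC summarizes how well the raw scores separate the two classes across all thresholds.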

A classifier to detect natural hallucinations

  • Introduced hallucination detector using features extracted from source contributions
  • Features include normalized source contribution of first and last tokens and source contribution staticity
  • Training data constructed using perturbation-based samples with source length between 20 and 60
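
A rough sketch of how such features might be extracted from a source contribution matrix follows. The summary does not give the exact feature definitions, so everything here is an assumption: the averaging over target steps, the windowed cosine-similarity form of staticity, the `window` default, the inclusive length bounds, and all function names. `src[t][i]` is taken to be a non-negative contribution of source token i at target step t.

```python
import math

def in_length_range(source_tokens, lo=20, hi=60):
    """Keep a training sample only if its source length is within [lo, hi]."""
    return lo <= len(source_tokens) <= hi

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def extract_features(src, window=10):
    """Return [first-token share, last-token share, staticity] for one output."""
    # Normalize each step so the source contributions sum to 1.
    dists = []
    for row in src:
        z = sum(row)
        dists.append([x / z for x in row] if z > 0 else [0.0] * len(row))
    steps = len(dists)
    first = sum(d[0] for d in dists) / steps
    last = sum(d[-1] for d in dists) / steps
    # Staticity: similarity of each step to the mean of the preceding window.
    sims = []
    for t in range(1, steps):
        prev = dists[max(0, t - window):t]
        mean_prev = [sum(col) / len(prev) for col in zip(*prev)]
        sims.append(_cosine(dists[t], mean_prev))
    stat = sum(sims) / len(sims) if sims else 0.0
    return [first, last, stat]
```

The resulting three-element vector could then be passed to any lightweight classifier trained on the perturbation-based samples.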

Detecting natural hallucinations

  • Tested a hallucination classifier built on insights from perturbation-based hallucinations
  • Evaluated on a human-annotated test bed for hallucinations generated on natural source inputs
  • Compared against a wide range of relevant models

Natural hallucination evaluation set

  • Zhou et al. (2021) created the only publicly available dataset of annotated MT hallucinations.
  • Test bed for detached hallucination detection was built for different language pairs and translation directions.
  • Data and underlying NMT models will be released.
  • Samples collected from large pools of out-of-domain data.
  • 10 bilingual annotators assess faithfulness of NMT output.

Experimental conditions

  • Implemented LRP-based classifier
  • Clipped source length at 40 and considered influence of most recent 10 target tokens
  • Tuned hyper-parameters based on average F1 score
  • Compared with attention-based classifier
  • Used three simple baselines to characterize the task: a random classifier, a degeneration detector, and NMT probability scores
  • COMET-QE classifier achieved highest AUC and F1 scores
  • LASER classifier achieved higher AUC than XLM-R classifier
  • LRP-based classifier benefited most from Source Contribution Staticity features
  • The LRP classifier's false positive rate was highest on incomprehensible but aligned translations
  • LASER and LRP-based classifiers achieved 35% and 45% Precision@20, respectively
  • After tuning threshold, LRP + LASER ensemble detected 9 hallucinations with 89% precision
  • Introspection-based classifiers more robust than pre-trained multilingual models
  • LRP-based classifier best hallucination detector overall
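
Precision@20, reported above, measures how many of the 20 highest-scoring outputs are true hallucinations, so 45% corresponds to 9 of 20 and 35% to 7 of 20. A minimal sketch of the metric and of thresholding a detector's scores is shown below. The function names are illustrative, and higher scores are assumed to mean "more likely a hallucination".

```python
def precision_at_k(y_true, scores, k):
    """Fraction of true hallucinations among the k highest-scoring outputs."""
    if k <= 0 or not scores:
        raise ValueError("k must be positive and scores non-empty")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)

def flag_hallucinations(scores, threshold):
    """Binary decisions obtained by thresholding detector scores."""
    return [1 if s >= threshold else 0 for s in scores]
```

Raising the threshold trades recall for precision, which is the usual way to reach a high-precision operating point when hallucinations are rare.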


  • Focus on understanding and detecting detached hallucinations in MT
  • Experiments based on parallel data from WMT, not monolingual data
  • Open question how findings generalize to models trained on more diverse data types
  • Hallucinations are expected to be rare; a small-scale experiment shows promising results
  • Hallucinations occur in all applications of neural models to language generation
  • Existing approaches to detecting hallucinations view the generation model as a black-box
  • Heuristics and external models are used to measure the faithfulness of the outputs
  • Guerreiro et al. (2022) explore glass-box detection methods based on model confidence scores or attention patterns
  • This paper investigates varying types of glass-box patterns based on the relative token contributions
  • MT quality estimation literature does not distinguish adequacy and fluency errors
  • Glass-box methods rely on model probabilities, uncertainty quantification, and the entropy of the attention distribution
  • Saliency interpretation method is used to measure the importance of each input unit


  • Hallucinations in NMT are poorly understood
  • Distinctive source contribution patterns can indicate hallucinations better than the relative contributions of source and target
  • Quality estimation models and black-box classifiers can be used to detect natural hallucinations
  • Human-annotated test beds of English-Chinese and German-English hallucinations released
  • Minimum Risk Training can reduce frequency of hallucinations