Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Generative models have been used to solve extractive tasks.
- Tokenization inconsistency is commonly neglected in training these models.
- This issue can lead to performance drop and hallucination.
- A simple fix is proposed and a case study is conducted on extractive QA.
- With consistent tokenization, the model performs better and converges faster.
Paper Content
Introduction
- Pretrained seq2seq models have achieved success in a range of tasks
- Tokenization is an important component of the models
- Tokenization inconsistency can affect the performance of seq2seq models
- Extractive QA is used as a case study to identify when tokenization inconsistency happens
- An approach is proposed to mitigate the issue of tokenization inconsistency
Related work
- Byte-Pair-Encoding (BPE) and language-model-based segmentation are commonly used tokenizers for NLP models
- Poor model generalization can be caused by these tokenization approaches
- BPE-Dropout and Dynamic Programming Encoding (DPE) have been proposed to improve model robustness and generalization
- Domain-specific approaches have been proposed to improve segmentation on specific texts
- Character-and byte-level modeling is another line of research that is free of tokens
Consistent tokenization: what it is and how to achieve it
- Seq2seq tasks take text as input and output a sequence of tokens.
- Tokenization is a function that maps text into a sequence of token ids.
- Inconsistent tokenization occurs when the tokenized input and output do not match.
- Inconsistent tokenization can be caused by preceding space, numbers, or punctuation.
- Training with inconsistently tokenized instances is a predicament for models.
- A proposed fix is to retrieve tokenized output from tokenized input to ensure consistency.
Case study: extractive qa with generative models
- Extractive QA is used to represent extractive tasks
- Generative models can lead to performance gains for Extractive QA
Task description
- Extractive QA models are given a question and context and output a span of the context as an answer.
- Extractive models are configured with the question and context as input and trained to return start and end indices to indicate the answer location.
- Generative models replace the index prediction task with a task of directly predicting the answer string from a seq2seq model.
Experimental setup
- Data MRQA improves in-domain performance by 1.0, 1.1 and 0.6 for SQuAD, Trivi-aQA and NewsQA respectively.
- Consistent tokenization allows the model to extract answers from the context instead of needing to learn to paraphrase.
- Consistent tokenization improves zero-shot QA performance on out-of-domain datasets.
- Training with consistent tokenization leads to faster model convergence and improved model confidence on the gold answer.
- Training with consistent tokenization leads to less textual hallucination.
Conclusion
- Issue of tokenization consistency in extractive tasks
- Proposed method to guarantee consistent tokenization
- Model benefits in several aspects when trained with consistent tokenization
- Improvement on in-domain performance, convergence speed, out-of-domain adaptation, and textual hallucination
- Suggestion to apply consistent tokenization to inputs and outputs for extractive tasks
- Exact match accuracy and learning curves show better performance when training with consistent tokenization
- Log perplexity difference between consistent and original models on in-domain and out-of-domain datasets
- Out-of-context answer generated by the model
- F1 and EM of BART models fine-tuned on different datasets with original and consistent tokenization