Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Generative models have been used to solve extractive tasks.
  • Tokenization inconsistency is commonly neglected in training these models.
  • This issue can lead to a performance drop and to hallucination.
  • A simple fix is proposed and a case study is conducted on extractive QA.
  • With consistent tokenization, the model performs better and converges faster.

Paper Content

Introduction

  • Pretrained seq2seq models have achieved success in a range of tasks
  • Tokenization is an important component of the models
  • Tokenization inconsistency can affect the performance of seq2seq models
  • Extractive QA is used as a case study to identify when tokenization inconsistency happens
  • An approach is proposed to mitigate the issue of tokenization inconsistency
  • Byte-Pair-Encoding (BPE) and language-model-based segmentation are commonly used tokenizers for NLP models
  • Poor model generalization can be caused by these tokenization approaches
  • BPE-Dropout and Dynamic Programming Encoding (DPE) have been proposed to improve model robustness and generalization
  • Domain-specific approaches have been proposed to improve segmentation on specific texts
  • Character- and byte-level modeling is another line of research; it avoids tokens altogether.

Consistent tokenization: what it is and how to achieve it

  • Seq2seq tasks take text as input and output a sequence of tokens.
  • Tokenization is a function that maps text into a sequence of token ids.
  • Inconsistent tokenization occurs when the tokenized input and output do not match.
  • Inconsistent tokenization can be caused by preceding space, numbers, or punctuation.
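  • Training with inconsistently tokenized instances is a predicament for models: the target token sequence does not appear in the tokenized input, so the model cannot learn to simply extract it.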
  • A proposed fix is to retrieve tokenized output from tokenized input to ensure consistency.
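
The inconsistency and the fix described above can be sketched with a toy tokenizer. This is an illustration under stated assumptions, not the paper's implementation: the regex pre-tokenizer below is a hypothetical stand-in for a BPE tokenizer that attaches a preceding space to the following token, token strings stand in for token ids, and the helper names are made up for this example.

```python
import re

# Toy pre-tokenizer (assumption): a single leading space is kept as part of
# the token that follows it, loosely imitating byte-level BPE behaviour.
TOKEN_RE = re.compile(r" ?\w+| ?[^\w\s]+|\s+")


def tokenize_with_offsets(text):
    """Return (token, start, end) triples that cover `text`."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def tokenize(text):
    """Tokenize `text` on its own, ignoring any surrounding context."""
    return [tok for tok, _, _ in tokenize_with_offsets(text)]


def consistent_answer_tokens(context, answer_start, answer_end):
    """Retrieve the answer tokens from the tokenized context.

    Keeps every context token whose character span overlaps the
    answer span [answer_start, answer_end).
    """
    return [
        tok
        for tok, start, end in tokenize_with_offsets(context)
        if start < answer_end and end > answer_start
    ]


context = "The Titanic sank in 1912."
answer = "1912"
start = context.find(answer)
end = start + len(answer)

# Tokenized separately, the answer has no preceding space.
print(tokenize(answer))                               # ['1912']
# Retrieved from the tokenized context, it matches what the model sees.
print(consistent_answer_tokens(context, start, end))  # [' 1912']
```

In this toy setting the separately tokenized answer `'1912'` never occurs in the tokenized context, whereas the retrieved token `' 1912'` does, so the training target becomes a sub-sequence of the input.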

Case study: extractive QA with generative models

  • Extractive QA is used as a representative extractive task.
  • Generative models can lead to performance gains on extractive QA.

Task description

  • Extractive QA models are given a question and context and output a span of the context as an answer.
  • Extractive models are configured with the question and context as input and trained to return start and end indices to indicate the answer location.
  • Generative models replace the index prediction task with a task of directly predicting the answer string from a seq2seq model.
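
The two formulations above differ only in the training target. The sketch below makes that concrete; the `QAExample` class, the helper names, and the `question: ... context: ...` input template are assumptions for illustration, and character indices are used for simplicity where a real extractive model would typically predict token indices.

```python
from dataclasses import dataclass


@dataclass
class QAExample:
    question: str
    context: str
    answer: str  # assumed to be a verbatim span of `context`


def extractive_target(ex):
    """Extractive formulation: the target is the (start, end) answer location."""
    start = ex.context.find(ex.answer)
    if start < 0:
        raise ValueError("answer is not a span of the context")
    return start, start + len(ex.answer)


def generative_pair(ex):
    """Generative formulation: a (source, target) pair of strings for a seq2seq model."""
    # Hypothetical input template; the actual format is model-specific.
    source = f"question: {ex.question} context: {ex.context}"
    return source, ex.answer


ex = QAExample(
    question="When did the Titanic sink?",
    context="The Titanic sank in 1912.",
    answer="1912",
)
print(extractive_target(ex))   # (20, 24)
print(generative_pair(ex)[1])  # 1912
```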

Experimental setup

  • Data: the experiments use datasets from MRQA.
  • Consistent tokenization improves in-domain performance by 1.0, 1.1, and 0.6 on SQuAD, TriviaQA, and NewsQA, respectively.
  • Consistent tokenization allows the model to extract answers from the context instead of needing to learn to paraphrase.
  • Consistent tokenization improves zero-shot QA performance on out-of-domain datasets.
  • Training with consistent tokenization leads to faster model convergence and improved model confidence on the gold answer.
  • Training with consistent tokenization leads to less textual hallucination.

Conclusion

  • The paper identifies the issue of tokenization inconsistency in extractive tasks.
  • It proposes a method that guarantees consistent tokenization.
  • Models trained with consistent tokenization benefit in several respects: in-domain performance, convergence speed, out-of-domain adaptation, and less textual hallucination.
  • The authors suggest applying consistent tokenization to inputs and outputs for extractive tasks.
  • Exact match accuracy and learning curves show better performance when training with consistent tokenization
  • Log perplexity difference between consistent and original models on in-domain and out-of-domain datasets
  • Out-of-context answer generated by the model
  • F1 and EM of BART models fine-tuned on different datasets with original and consistent tokenization
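
For reference, exact match (EM) and F1 for extractive QA are commonly computed in the SQuAD style sketched below. This summary does not state which normalization the paper applies, so the normalization steps here (lowercasing, stripping punctuation and English articles) are an assumption, as are the function names.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, otherwise 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Harmonic mean of word-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Two empty answers count as a match; one empty answer does not.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Titanic", "titanic"))  # 1.0
print(round(f1_score("in 1912", "1912"), 3))  # 0.667
```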