Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Presents TextBox 2.0, a comprehensive and unified library for text generation research
- Covers 13 common text generation tasks and 83 datasets
- Incorporates 45 pre-trained language models
- Implements 4 efficient training strategies and provides 4 generation objectives
- Easy to use through Python API or command line
Paper Content
Introduction
- Text generation is a fundamental technique in many text applications
- Pre-trained language models (PLMs) are the mainstream approach to developing effective text generation models
- TextBox 2.0 is an extension of TextBox 1.0 which focuses on building a comprehensive and unified framework for PLM-based text generation models
- TextBox 2.0 supports 13 text generation tasks and 83 datasets
- TextBox 2.0 includes 45 pre-trained language models
- TextBox 2.0 provides four efficient and robust training strategies and four pre-training objectives for text generation
- TextBox 2.0 is aligned with a survey on PLM-based text generation
Library design
- TextBox 2.0 introduced new features for PLM-based text generation research
- Features from three aspects: generation tasks, generation models, and training strategies
Generation tasks
- Includes 13 widely studied tasks and 83 datasets
- 13 tasks include text summarization, machine translation, open-ended dialogue system, etc.
- Includes 4 categories of automatic metrics
- Provides visualization tools to explore and analyze generated texts
Generation models
- TextBox 2.0 incorporates 45 PLMs
- Includes general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight modules
- Can be used for different text generation tasks, such as dialogue systems and Chinese generation tasks
Training strategies
- TextBox 2.0 provides four pre-training objectives
- Supports distributed data parallel to implement models on multiple GPUs and machines
- Integrates FastSeq to optimize the decoding process
- Enables users to adjust and select hyper-parameters automatically
Library usage
- Reproducing existing models with TextBox 2.0
- Pre-training a new model with TextBox 2.0
- Analyzing generated results with TextBox 2.0
- TextBox 2.0 supports distributed data parallel and efficient decoding
Experiments
- Conducted extensive experiments
- Verified generation abilities of TextBox 2.0
Result reproduction
- TextBox 2.0 is an open-source library
- 13 tasks were evaluated using 14 datasets
- BART LARGE 5 was used for text generation
- AdamW was used for optimization
- Results were reported based on 3 random seeds
- TextBox 2.0 reproduced results from existing work
- TextBox 2.0 achieved better performance on 37 of 44 metrics
Efficiency comparison
- TextBox 2.0 optimized for computational efficiency
- TextBox 2.0 more efficient than Fairseq and Hugging Face
- TextBox 2.0 simplifies training process and reduces time spent on non-essential functions
- TextBox 2.0 incorporates efficient decoding strategies
Visualization analysis
- Reproducing a model is important
- Comparing existing methods is important
- Analyzing generated texts is important
- Exploring directions for improvement is important
- Leaderboard for each dataset is available
- Visualization analysis can be conducted
- Boxplot of ROUGE-L score for different input lengths can be plotted
- N-gram overlap of target and generated texts with the input document can be plotted
- T5 excels at short document summarization
- BART excels at long document summarization
- BART and T5 tend to “copy” the input document rather than “summarize” it
Conclusion
- TextBox 2.0 is a library for conducting research on PLM-based text generation
- It includes 13 tasks, 83 datasets, 45 PLMs, and training strategies
- Experiments show that the library can accurately reproduce existing models
- It also provides tools to analyze and explore generated results
- It is useful for text generation research and will be regularly updated