Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Large language models have improved natural language understanding, generation, and reasoning.
- A system was developed that trained a trillion-parameter language model on a cluster of Ascend 910 AI processors and MindSpore framework.
- The language model was named PanGu-{\Sigma} and had 1.085T parameters.
- Random Routed Experts (RRE) was used to extend the dense Transformer model to a sparse one.
- 329B tokens were efficiently trained using Expert Computation and Storage Separation (ECSS).
- This resulted in a 6.3x increase in training throughput.
- PanGu-{\Sigma} provided state-of-the-art performance in zero-shot learning of various Chinese NLP downstream tasks.
- It also demonstrated strong abilities when fine-tuned in application data of open-domain dialogue, question answering, machine translation and code generation.
Paper Content
Introduction
- Large Language Models (LLMs) have demonstrated capabilities and potential in natural language understanding, generation and reasoning.
- LLMs performance scales up with compute budget and model parameters.
- Several large language models with hundreds of billion parameters have been published since GPT-3.
- Researchers are building even larger language models with more than one trillion parameters.
- Primary difficulty lies in the scaling efficiency.
- Model performance of LLMs is expected to scale up with larger model size.
- System Scaling is necessary to achieve optimal performance.
- PanGu-Σ is a large language model with sparse architecture containing 1.085 trillion parameters.
- PanGu-Σ is trained on a cluster with only 512 Ascend 910 AI Accelerators.
- PanGu-Σ outperforms SOTA models in zero-shot setting and fine-tuned applications.
Model
Design principles
- Aims to achieve performance, efficiency, usability and deployment
- Training trillion parameters model with maximum system performance on a modest cluster
- Extendable to various domains or tasks without retraining
- Easily customizable and deployable in various real-world settings
- Language model should have a large number of parameters and be trained on large amount of data
- High-end cluster is mandatory for training
- Model allows for grouping and separation of parameters based on various training and deployment setups
- Auto-regressive language modeling with stacked transformer decoder layers and a query layer
- Bottom M layers are globally shared across all the domains, top N layers are sparsely activated
- Three modes: mixed, dense and sparse
- Token embedding layer uses different embedding matrices for different domains
- Random Routed Experts (RRE) mechanism for routing tokens to experts
Collection
- Collected datasets in 40 domains
- 4 major domains: Chinese, English, Bilingual, Code
- 26 other monolingual natural languages, 6 programming languages, textual data from finance, health, law, and poetry domains
- WuDaoCorpora 2.0 (200GB), CLUECorpus2020 (100GB), Pile dataset (800GB), C4 dataset (750GB), Python code (147GB), Java code (161GB)
- More than 300B tokens for 4 major domains, more than 25B tokens for 36 domains
- Data formats of 26 monolingual domains, finance, health, law, and poetry domains are the same as Chinese and English domains, data format of 6 programming language domains is the same as code domain
System
- PanGu-Σ is implemented with MindSpore 1.6 framework and trained on 512 Ascend 910 accelerators
- Training a trillion parameters language model requires 16TB memory for parameters, gradients and optimizer states
- Training requires more than 32TB memory and 1,000 Ascend 910 accelerators or NVIDIA V100 GPUs
- Heterogeneous training and optimizer states offloading to CPU is used
- Expert Computation and Storage Separation (ECSS) is proposed to improve training throughput
- 8-ways model parallel, 64-ways expert parallel and 64-ways data parallel are used
- Rematerialization and optimizer parallel are used to reduce peak memory consumption
- PanGu-Σ is trained with global batch size of 512 and sequence length of 1024
- Hybrid Hyper-parameter ADAM Optimizer is used to provide stability during pretraining phase
Inheritance learning
- PanGu-Σ model inherits capabilities of existing model and trains in four domains simultaneously
- Vocabulary extended to support Chinese and English texts
- Byte-level BPE used instead of BPE adopted by PanGu-α
- Vocabulary formulated by adding T5 small vocabulary and removing repeated sub-words
- Special tokens added to vocab, classified into two types
- Word embedding and experts in RRE layer initialized with corresponding embedding and feed-forward layers from PanGu-α
- Code domain and other domain updated in different embedding slots
- Loss-free expert pruning method proposed to transfer abilities of PanGu-Σ to various downstream tasks
- Machine reading comprehension, natural language inference, text classification, semantic similarity, Winograd schema challenge, and cloze and completion tasks evaluated
- PanGu-Σ outperformed ERNIE 3.0 Titan on 11 out of 16 datasets
Chinese dialogue generation
- PanGu-Σ model is fine-tuned on 51.5M dataset
- PanGu-Σ outperforms baselines on self-chat, topic-grounded dialogue and question answering
- CDialGPT and PanGu-Bot are two versions of PanGu-α
- Self-chat evaluation uses 50 prompts with 9 turns each
- Human annotators judge sensibility, specificity, interestingness, hallucination and safety
- PanGu-Σ has higher response quality than baselines
- Topic-grounded dialogue evaluation uses semantic consistency, distinct-1 and distinct-2, and Bleu metrics
- Example of PanGu-Σ introducing knowledge about 郎平 in Appendix A.2
Machine translation
- PanGu-Σ model compared to state-of-the-art model CeMAT and benchmark pre-trained large models
- PanGu-Σ model used to fine-tune directly on translation task dataset
- Validation on two mainstream datasets, WMT17 and WMT20
- PanGu-Σ model has large improvement over baseline models
- Outperforms full data fine-tuning of other pre-trained models
- ERINE3.0 is a unified framework for pre-training large-scale knowledge enhanced models
- CeMAT is a universal Conditional Masked Language Pre-training for both Autoregressive and non-Autoregressive machine translation tasks
- Language tag “” used as prefix for Chinese and English text sequences
- Verified PanGu-Σ on WMT20 Chinese-English dataset
- PanGu-Σ exceeded mT5-XXL model by 12.6 BLEU
- PanGu-Σ outperformed Ernie3.0 by 9.8 BLEU
- Verified PanGu-Σ’s performance in low-resource scenarios
- PanGu-Σ outperformed Ernie3.0 by 3.19 BLEU
- PanGu-Σ showed meaningful quality improvement compared to CeMAT
- PanGu-Σ had 0.7 BLEU improvement on Chinese-English task
- PanGu-Σ achieved SOTA results
- Translation results of PanGu-Σ model have higher fidelity compared to CeMAT
Code generation
- PanGu-Σ was evaluated on MBPP, a benchmark to measure the ability of pre-trained models to generate Python programs from natural language descriptions
- MBPP contains 374 programming problems for fine-tuning and 500 programming tasks as test dataset
- PanGu-Coder introduces additional datasets which contain APPS and Code Contests (CC) datasets
- 56k instances for fine-tuning were filtered from APPS and CC
- PanGu-Σ outperformed previous Chinese-English SOTA pre-trained large model
- Data was formatted to make it easier for the model to distinguish between task descriptions and solutions
- Pass@1 was used to evaluate the performance of the fine-tuned code domain mode of PanGu-Σ
English natural language understanding
- Sparse models offer benefits of larger model size with reduced computation cost.
- Generating valuable signals that align with the real world is a crucial research topic.
- Utilizing language models as a foundation and incorporating multiple modalities for perception input is important.
- Deployment cost of large language models is a major hurdle to overcome.
- Online knowledge updates are critical for optimal performance of the large language model system.
- PanGu-Σ is a trillion parameters language model architecture.
- Random Routed Experts (RRE) and Expert Computation Storage Separation (ECSS) are used in PanGu-Σ.
- PanGu-Σ achieves high system performance under the MindSpore framework.
- PanGu-Σ outperforms GPT-3 on SuperGLUE benchmark.
- mT5 is a multilingual variant of T5 with 13B parameter size.
- CPM-2 is a large-scale cost-efficient pre-trained language model.
- EVA and EVA2.0 are encoder-decoder-based Chinese dialogue models.
- PanGu-Σ outperforms CeMAT 3.0 on WMT17 translation task.