Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Large language models have improved natural language understanding, generation, and reasoning.
A system was developed that trained a trillion-parameter language model on a cluster of Ascend 910 AI processors using the MindSpore framework.
The language model was named PanGu-Σ and had 1.085T parameters.
Random Routed Experts (RRE) was used to extend the dense Transformer model to a sparse one.
The model was trained efficiently over 329B tokens using Expert Computation and Storage Separation (ECSS).
This resulted in a 6.3x increase in training throughput.
PanGu-Σ provided state-of-the-art performance in zero-shot learning on various Chinese NLP downstream tasks.
It also demonstrated strong abilities when fine-tuned on application data for open-domain dialogue, question answering, machine translation and code generation.

Paper Content

Introduction

Large Language Models (LLMs) have demonstrated capabilities and potential in natural language understanding, generation and reasoning.
LLM performance scales up with compute budget and model parameters.
Several large language models with hundreds of billions of parameters have been published since GPT-3.
Researchers are building even larger language models with more than one trillion parameters.
The primary difficulty lies in scaling efficiency.
Model performance of LLMs is expected to scale up with larger model size.
System scaling is necessary to achieve optimal performance.
PanGu-Σ is a large language model with sparse architecture containing 1.085 trillion parameters.
PanGu-Σ is trained on a cluster with only 512 Ascend 910 AI Accelerators.
PanGu-Σ outperforms SOTA models in the zero-shot setting and in fine-tuned applications.

Model

Design principles

Aims to achieve performance, efficiency, usability and deployment
Train a trillion-parameter model with maximum system performance on a modest cluster
Extendable to various domains or tasks without retraining
Easily customizable and deployable in various real-world settings
Language model should have a large number of parameters and be trained on a large amount of data
High-end cluster is mandatory for training
Model allows for grouping and separation of parameters based on various training and deployment setups
Auto-regressive language modeling with stacked transformer decoder layers and a query layer
Bottom M layers are globally shared across all the domains, top N layers are sparsely activated
Three modes: mixed, dense and sparse
Token embedding layer uses different embedding matrices for different domains
Random Routed Experts (RRE) mechanism for routing tokens to experts
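
The routing idea in the last item can be illustrated with a minimal sketch. It assumes a two-level scheme: experts are grouped by domain, and within a domain group each token ID is mapped to an expert by a fixed random table rather than by a learned gate. The function names, the vocabulary size and the number of experts per domain are hypothetical and chosen only for illustration.

```python
import random

def build_routing_table(vocab_size, experts_per_domain, domains, seed=0):
    """Build a fixed random token-to-expert table, one row per domain.

    Assumption: experts of one domain occupy a contiguous index range,
    and the table is drawn once and never trained.
    """
    rng = random.Random(seed)
    table = {}
    for d_idx, domain in enumerate(domains):
        first = d_idx * experts_per_domain  # first expert index of this domain
        table[domain] = [first + rng.randrange(experts_per_domain)
                         for _ in range(vocab_size)]
    return table

def route(table, domain, token_ids):
    """Return the global expert index for each token of a sequence from `domain`."""
    row = table[domain]
    return [row[t] for t in token_ids]

DOMAINS = ["chinese", "english", "bilingual", "code"]
table = build_routing_table(vocab_size=1000, experts_per_domain=4, domains=DOMAINS)
experts = route(table, "code", [5, 17, 5, 999])
print(experts)
```

Because the table is fixed, the same token in the same domain always reaches the same expert, and tokens of one domain never touch the experts of another. Under this assumption, dropping the experts of unused domains leaves the outputs for the remaining domain unchanged.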

Collection

Collected datasets in 40 domains
4 major domains: Chinese, English, Bilingual, Code
26 other monolingual natural languages, 6 programming languages, textual data from finance, health, law, and poetry domains
WuDaoCorpora 2.0 (200GB), CLUECorpus2020 (100GB), Pile dataset (800GB), C4 dataset (750GB), Python code (147GB), Java code (161GB)
More than 300B tokens for 4 major domains, more than 25B tokens for 36 domains
Data formats of the 26 monolingual domains and of the finance, health, law, and poetry domains are the same as those of the Chinese and English domains; the data format of the 6 programming language domains is the same as that of the code domain

System

PanGu-Σ is implemented with MindSpore 1.6 framework and trained on 512 Ascend 910 accelerators
Training a trillion-parameter language model requires 16TB of memory for parameters, gradients and optimizer states
Without further optimization, training would require more than 32TB of memory, that is, more than 1,000 Ascend 910 accelerators or NVIDIA V100 GPUs
Heterogeneous training and offloading of optimizer states to CPU are used
Expert Computation and Storage Separation (ECSS) is proposed to improve training throughput
8-way model parallel, 64-way expert parallel and 64-way data parallel are used
Rematerialization and optimizer parallel are used to reduce peak memory consumption
PanGu-Σ is trained with global batch size of 512 and sequence length of 1024
Hybrid Hyper-parameter ADAM Optimizer is used to provide stability during pretraining phase
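
The memory and parallelism figures above can be checked with a short back-of-the-envelope calculation. This is a sketch under stated assumptions, not the paper's own accounting: it assumes 16 bytes of training state per parameter (a common mixed-precision Adam breakdown), 32 GB of memory per accelerator, and that expert parallelism and data parallelism share the same 64-way axis.

```python
PARAMS = 10**12  # "a trillion parameters", rounded

# Assumed per-parameter training state with mixed-precision Adam:
# fp16 weight + fp16 gradient + fp32 master weight + Adam first and second moments.
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4

DEVICE_MEMORY_BYTES = 32 * 10**9  # assumed 32 GB per accelerator

state_bytes = PARAMS * BYTES_PER_PARAM
state_tb = state_bytes / 10**12

total_bytes = 32 * 10**12  # the "more than 32TB" figure, taken as a lower bound
devices_needed = (total_bytes + DEVICE_MEMORY_BYTES - 1) // DEVICE_MEMORY_BYTES

MODEL_PARALLEL = 8
EXPERT_PARALLEL = 64  # applies to expert layers
DATA_PARALLEL = 64    # assumed to apply to non-expert layers on the same axis
devices_in_grid = MODEL_PARALLEL * DATA_PARALLEL

tokens_per_step = 512 * 1024  # global batch size x sequence length

print(f"training state: {state_tb} TB")
print(f"devices needed to hold 32 TB: {devices_needed}")
print(f"devices in the parallel grid: {devices_in_grid}")
print(f"tokens per optimizer step: {tokens_per_step}")
```

Under these assumptions the parallel grid covers exactly the 512 accelerators of the cluster, about half of what the raw memory estimate calls for, which is the gap that offloading and ECSS are meant to close.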

Inheritance learning

PanGu-Σ model inherits the capabilities of an existing model and trains in four domains simultaneously
Vocabulary extended to support Chinese and English texts
Byte-level BPE used instead of BPE adopted by PanGu-α
Vocabulary formulated by adding T5 small vocabulary and removing repeated sub-words
Special tokens added to vocab, classified into two types
Word embedding and experts in RRE layer initialized with corresponding embedding and feed-forward layers from PanGu-α
Code domain and other domains updated in different embedding slots
Loss-free expert pruning method proposed to transfer abilities of PanGu-Σ to various downstream tasks
Machine reading comprehension, natural language inference, text classification, semantic similarity, Winograd schema challenge, and cloze and completion tasks evaluated
PanGu-Σ outperformed ERNIE 3.0 Titan on 11 out of 16 datasets
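
The vocabulary step above (adding a second vocabulary and removing repeated sub-words) can be sketched as an order-preserving merge. This is only an illustration: the function name `merge_vocab`, the toy token lists and the special-token strings are hypothetical and do not reproduce the actual vocabularies.

```python
def merge_vocab(base_tokens, extra_tokens, special_tokens=()):
    """Merge token lists into one vocabulary, dropping repeated sub-words.

    Tokens keep first-seen order, so IDs of the base vocabulary stay stable.
    """
    seen = set()
    merged = []
    for tok in list(base_tokens) + list(extra_tokens) + list(special_tokens):
        if tok not in seen:
            seen.add(tok)
            merged.append(tok)
    return {tok: idx for idx, tok in enumerate(merged)}

base = ["ni", "hao", "the", "ing"]        # stand-in for the existing vocabulary
extra = ["the", "ing", "hello", "world"]  # stand-in for the added vocabulary
special = ["<pad>", "<eot>"]              # hypothetical special tokens

vocab = merge_vocab(base, extra, special)
print(len(vocab), vocab)
```

Keeping base-vocabulary IDs unchanged matters when embeddings are initialized from an earlier model, since each inherited row must still line up with its token.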

Chinese dialogue generation

PanGu-Σ model is fine-tuned on 51.5M dataset
PanGu-Σ outperforms baselines on self-chat, topic-grounded dialogue and question answering
Baselines include CDialGPT and PanGu-Bot; PanGu-Bot comes in two versions built on PanGu-α
Self-chat evaluation uses 50 prompts with 9 turns each
Human annotators judge sensibility, specificity, interestingness, hallucination and safety
PanGu-Σ has higher response quality than baselines
Topic-grounded dialogue evaluation uses semantic consistency, distinct-1 and distinct-2, and BLEU metrics
Example of PanGu-Σ introducing knowledge about 郎平 in Appendix A.2
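
The distinct-1 and distinct-2 metrics mentioned above measure lexical diversity. A minimal sketch follows, assuming the common corpus-level definition (unique n-grams divided by total n-grams over all responses) and pre-tokenized input; the tokenization used in the actual evaluation is not stated here.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over all tokenized responses."""
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

responses = [["i", "like", "tea"], ["i", "like", "coffee"]]
d1 = distinct_n(responses, 1)  # 4 unique unigrams out of 6
d2 = distinct_n(responses, 2)  # 3 unique bigrams out of 4
print(d1, d2)
```

Higher values indicate less repetitive responses; a model that always answers with the same phrase scores close to zero.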

Machine translation

PanGu-Σ model compared to state-of-the-art model CeMAT and benchmark pre-trained large models
PanGu-Σ model is fine-tuned directly on the translation task dataset
Validation on two mainstream datasets, WMT17 and WMT20
PanGu-Σ model has large improvement over baseline models
Outperforms full data fine-tuning of other pre-trained models
ERNIE 3.0 is a unified framework for pre-training large-scale knowledge enhanced models
CeMAT is a universal Conditional Masked Language Pre-training for both Autoregressive and non-Autoregressive machine translation tasks
Language tag “” used as prefix for Chinese and English text sequences
Verified PanGu-Σ on WMT20 Chinese-English dataset
PanGu-Σ exceeded mT5-XXL model by 12.6 BLEU
PanGu-Σ outperformed ERNIE 3.0 by 9.8 BLEU
Verified PanGu-Σ’s performance in low-resource scenarios
PanGu-Σ outperformed ERNIE 3.0 by 3.19 BLEU
PanGu-Σ showed meaningful quality improvement compared to CeMAT
PanGu-Σ had 0.7 BLEU improvement on Chinese-English task
PanGu-Σ achieved SOTA results
Translation results of PanGu-Σ model have higher fidelity compared to CeMAT

Code generation

PanGu-Σ was evaluated on MBPP, a benchmark to measure the ability of pre-trained models to generate Python programs from natural language descriptions
MBPP contains 374 programming problems for fine-tuning and 500 programming tasks as test dataset
PanGu-Coder introduces additional datasets which contain APPS and Code Contests (CC) datasets
56k instances for fine-tuning were filtered from APPS and CC
PanGu-Σ outperformed previous Chinese-English SOTA pre-trained large model
Data was formatted to make it easier for the model to distinguish between task descriptions and solutions
Pass@1 was used to evaluate the performance of the fine-tuned code domain mode of PanGu-Σ
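
Pass@1 is the special case k = 1 of the pass@k metric. A minimal sketch follows, assuming the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the unit tests. Whether the evaluation summarized here draws several samples per problem or a single one is not stated, and the per-problem counts below are made up for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over problems; `results` holds (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

results = [(10, 3), (10, 0), (10, 10)]  # hypothetical outcomes for three problems
score = mean_pass_at_k(results, k=1)
print(f"pass@1 = {score:.4f}")
```

For k = 1 the estimator reduces to c / n per problem, so with a single sample per problem pass@1 is simply the fraction of problems solved.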

English natural language understanding

Sparse models offer benefits of larger model size with reduced computation cost.
Generating valuable signals that align with the real world is a crucial research topic.
Utilizing language models as a foundation and incorporating multiple modalities for perception input is important.
Deployment cost of large language models is a major hurdle to overcome.
Online knowledge updates are critical for optimal performance of the large language model system.
PanGu-Σ is a trillion-parameter language model architecture.
Random Routed Experts (RRE) and Expert Computation Storage Separation (ECSS) are used in PanGu-Σ.
PanGu-Σ achieves high system performance under the MindSpore framework.
PanGu-Σ outperforms GPT-3 on SuperGLUE benchmark.
mT5 is a multilingual variant of T5; its largest version, mT5-XXL, has 13B parameters.
CPM-2 is a large-scale cost-efficient pre-trained language model.
EVA and EVA2.0 are encoder-decoder-based Chinese dialogue models.
PanGu-Σ outperforms CeMAT on the WMT17 translation task.
