Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Large language models have improved natural language understanding, generation, and reasoning.
  • A system was developed to train a trillion-parameter language model on a cluster of Ascend 910 AI processors using the MindSpore framework.
  • The language model was named PanGu-Σ and had 1.085T parameters.
  • Random Routed Experts (RRE) was used to extend the dense Transformer model to a sparse one.
  • The model was trained efficiently on 329B tokens using Expert Computation and Storage Separation (ECSS).
  • This resulted in a 6.3x increase in training throughput.
  • PanGu-Σ provided state-of-the-art performance in zero-shot learning of various Chinese NLP downstream tasks.
  • It also demonstrated strong abilities when fine-tuned on application data for open-domain dialogue, question answering, machine translation and code generation.

Paper Content

Introduction

  • Large Language Models (LLMs) have demonstrated capabilities and potential in natural language understanding, generation and reasoning.
  • LLM performance scales up with compute budget and model parameters.
  • Several large language models with hundreds of billions of parameters have been published since GPT-3.
  • Researchers are building even larger language models with more than one trillion parameters.
  • The primary difficulty lies in scaling efficiency.
  • Model performance of LLMs is expected to scale up with larger model size.
  • System Scaling is necessary to achieve optimal performance.
  • PanGu-Σ is a large language model with sparse architecture containing 1.085 trillion parameters.
  • PanGu-Σ is trained on a cluster with only 512 Ascend 910 AI Accelerators.
  • PanGu-Σ outperforms SOTA models in zero-shot setting and fine-tuned applications.

Model

Design principles

  • Aims to achieve performance, efficiency, usability and deployment
  • Training a trillion-parameter model with maximum system performance on a modest cluster
  • Extendable to various domains or tasks without retraining
  • Easily customizable and deployable in various real-world settings
  • Language model should have a large number of parameters and be trained on a large amount of data
  • High-end cluster is mandatory for training
  • Model allows for grouping and separation of parameters based on various training and deployment setups
  • Auto-regressive language modeling with stacked transformer decoder layers and a query layer
  • Bottom M layers are globally shared across all the domains, top N layers are sparsely activated
  • Three modes: mixed, dense and sparse
  • Token embedding layer uses different embedding matrices for different domains
  • Random Routed Experts (RRE) mechanism for routing tokens to experts
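
The summary above does not spell out how RRE routes tokens, so the following is only a minimal sketch under stated assumptions: a two-level scheme in which the domain selects a group of experts, and a fixed random table (no learned gate) maps each token ID to one expert inside that group. The function names and the sizes (`NUM_DOMAINS`, `EXPERTS_PER_DOMAIN`, `VOCAB_SIZE`) are hypothetical and chosen for illustration only.

```python
import random

def build_routing_table(vocab_size, experts_per_domain, seed=0):
    """Assign every token ID to one expert slot, uniformly at random, once."""
    rng = random.Random(seed)
    return [rng.randrange(experts_per_domain) for _ in range(vocab_size)]

def route(token_id, domain_id, table, experts_per_domain):
    """Two-level lookup: the domain picks the expert group,
    the fixed table picks the expert inside that group."""
    return domain_id * experts_per_domain + table[token_id]

# Hypothetical sizes, not taken from the paper.
NUM_DOMAINS = 4
EXPERTS_PER_DOMAIN = 8
VOCAB_SIZE = 1000

table = build_routing_table(VOCAB_SIZE, EXPERTS_PER_DOMAIN)
expert = route(token_id=42, domain_id=2, table=table,
               experts_per_domain=EXPERTS_PER_DOMAIN)
print(f"token 42 in domain 2 goes to expert {expert}")
```

Because the table is fixed, the same token in the same domain always reaches the same expert, and the experts of one domain form a contiguous block that could in principle be separated from the rest of the model.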

Collection

  • Collected datasets in 40 domains
  • 4 major domains: Chinese, English, Bilingual, Code
  • 26 other monolingual natural languages, 6 programming languages, textual data from finance, health, law, and poetry domains
  • WuDaoCorpora 2.0 (200GB), CLUECorpus2020 (100GB), Pile dataset (800GB), C4 dataset (750GB), Python code (147GB), Java code (161GB)
  • More than 300B tokens for 4 major domains, more than 25B tokens for 36 domains
  • Data formats of 26 monolingual domains, finance, health, law, and poetry domains are the same as Chinese and English domains, data format of 6 programming language domains is the same as code domain

System

  • PanGu-Σ is implemented with the MindSpore 1.6 framework and trained on 512 Ascend 910 accelerators
  • Training a trillion-parameter language model requires 16TB memory for parameters, gradients and optimizer states
  • Training requires more than 32TB memory and 1,000 Ascend 910 accelerators or NVIDIA V100 GPUs
  • Heterogeneous training and offloading of optimizer states to CPU are used
  • Expert Computation and Storage Separation (ECSS) is proposed to improve training throughput
  • 8-way model parallelism, 64-way expert parallelism and 64-way data parallelism are used
  • Rematerialization and optimizer parallelism are used to reduce peak memory consumption
  • PanGu-Σ is trained with a global batch size of 512 and a sequence length of 1024
  • Hybrid Hyper-parameter ADAM Optimizer is used to provide stability during the pretraining phase
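
The figures in this list can be sanity-checked with a little arithmetic. The sketch below is one plausible reading, not the authors' own accounting: it assumes mixed-precision Adam at about 16 bytes per parameter, and it assumes that expert parallelism and data parallelism share the same 64-way axis, so that the device count is 8 × 64. Both assumptions are flagged in the comments.

```python
# ASSUMPTION (not stated in the source): mixed-precision Adam keeps
# these tensors for every parameter.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}
bytes_per_param = sum(BYTES_PER_PARAM.values())

TRILLION = 1e12
# Memory for parameters, gradients and optimizer states, in TB.
state_tb = TRILLION * bytes_per_param / 1e12

# ASSUMPTION: the 64-way expert-parallel and 64-way data-parallel
# groups map onto the same devices.
MODEL_PARALLEL = 8
DATA_PARALLEL = 64
devices = MODEL_PARALLEL * DATA_PARALLEL

# Even share of that state per device if nothing were offloaded, in GB.
per_device_gb = state_tb * 1e12 / devices / 1e9

# Tokens consumed per optimizer step (global batch x sequence length).
tokens_per_step = 512 * 1024

print(f"{bytes_per_param} bytes/param -> {state_tb:.0f} TB of training state")
print(f"{devices} devices -> {per_device_gb:.2f} GB of state per device")
print(f"{tokens_per_step} tokens per step")
```

Under these assumptions the training state alone comes to roughly 31 GB per device before any activations are counted, which gives some intuition for why offloading optimizer states to the CPU and rematerialization appear in the list above.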

Inheritance learning

  • PanGu-Σ model inherits capabilities of existing model and trains in four domains simultaneously
  • Vocabulary extended to support Chinese and English texts
  • Byte-level BPE used instead of BPE adopted by PanGu-α
  • Vocabulary formulated by adding T5 small vocabulary and removing repeated sub-words
  • Special tokens added to vocab, classified into two types
  • Word embedding and experts in RRE layer initialized with corresponding embedding and feed-forward layers from PanGu-α
  • Code domain and other domain updated in different embedding slots
  • Loss-free expert pruning method proposed to transfer abilities of PanGu-Σ to various downstream tasks
  • Machine reading comprehension, natural language inference, text classification, semantic similarity, Winograd schema challenge, and cloze and completion tasks evaluated
  • PanGu-Σ outperformed ERNIE 3.0 Titan on 11 out of 16 datasets
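
The vocabulary extension step (adding a second vocabulary and removing repeated sub-words) can be sketched as a simple de-duplicating merge. This is an illustration only: the function name `merge_vocab`, the toy token lists, and the choice to keep the base vocabulary's IDs unchanged are assumptions, not details taken from the paper.

```python
def merge_vocab(base_tokens, extra_tokens):
    """Append tokens from extra_tokens that base_tokens lacks.

    IDs of the base vocabulary stay stable, so embeddings inherited
    from an existing model would still line up (assumes base_tokens
    has no duplicates).
    """
    vocab = {tok: idx for idx, tok in enumerate(base_tokens)}
    for tok in extra_tokens:
        if tok not in vocab:  # skip repeated sub-words
            vocab[tok] = len(vocab)
    return vocab

# Toy data, hypothetical.
base = ["你", "好", "the"]
extra = ["the", "cat", "好", "sat"]

merged = merge_vocab(base, extra)
print(merged)
```

Keeping the original IDs fixed is one way to make inherited embeddings reusable; new tokens then occupy fresh slots that have to be initialized separately.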

Chinese dialogue generation

  • PanGu-Σ model is fine-tuned on a 51.5M dataset
  • PanGu-Σ outperforms baselines on self-chat, topic-grounded dialogue and question answering
  • Baselines include CDialGPT and PanGu-Bot; PanGu-Bot is built on PanGu-α
  • Self-chat evaluation uses 50 prompts with 9 turns each
  • Human annotators judge sensibility, specificity, interestingness, hallucination and safety
  • PanGu-Σ has higher response quality than baselines
  • Topic-grounded dialogue evaluation uses semantic consistency, distinct-1 and distinct-2, and BLEU metrics
  • Example of PanGu-Σ introducing knowledge about 郎平 in Appendix A.2
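
For readers unfamiliar with distinct-1 and distinct-2, the sketch below shows the usual definition: the number of unique n-grams divided by the total number of n-grams in the generated responses. It is a corpus-level variant on pre-tokenized toy data; the paper's exact tokenization and aggregation are not given in this summary, so treat those choices as assumptions.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over tokenized responses."""
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

# Toy responses, hypothetical.
responses = [
    ["i", "like", "tea"],
    ["i", "like", "coffee"],
]

d1 = distinct_n(responses, 1)  # 4 unique unigrams out of 6
d2 = distinct_n(responses, 2)  # 3 unique bigrams out of 4
print(f"distinct-1 = {d1:.3f}, distinct-2 = {d2:.3f}")
```

Higher values indicate less repetitive, more diverse responses.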

Machine translation

  • PanGu-Σ model compared to state-of-the-art model CeMAT and benchmark pre-trained large models
  • PanGu-Σ model is fine-tuned directly on the translation task datasets
  • Validation on two mainstream datasets, WMT17 and WMT20
  • PanGu-Σ model has large improvement over baseline models
  • Outperforms full data fine-tuning of other pre-trained models
  • ERNIE 3.0 is a unified framework for pre-training large-scale knowledge enhanced models
  • CeMAT is a universal Conditional Masked Language Pre-training for both Autoregressive and non-Autoregressive machine translation tasks
  • Language tag “” used as prefix for Chinese and English text sequences
  • Verified PanGu-Σ on WMT20 Chinese-English dataset
  • PanGu-Σ exceeded mT5-XXL model by 12.6 BLEU
  • PanGu-Σ outperformed ERNIE 3.0 by 9.8 BLEU
  • Verified PanGu-Σ’s performance in low-resource scenarios
  • PanGu-Σ outperformed ERNIE 3.0 by 3.19 BLEU
  • PanGu-Σ showed meaningful quality improvement compared to CeMAT
  • PanGu-Σ had 0.7 BLEU improvement on Chinese-English task
  • PanGu-Σ achieved SOTA results
  • Translation results of PanGu-Σ model have higher fidelity compared to CeMAT

Code generation

  • PanGu-Σ was evaluated on MBPP, a benchmark to measure the ability of pre-trained models to generate Python programs from natural language descriptions
  • MBPP contains 374 programming problems for fine-tuning and 500 programming tasks as test dataset
  • PanGu-Coder introduces additional datasets which contain APPS and Code Contests (CC) datasets
  • 56k instances for fine-tuning were filtered from APPS and CC
  • PanGu-Σ outperformed previous Chinese-English SOTA pre-trained large model
  • Data was formatted to make it easier for the model to distinguish between task descriptions and solutions
  • Pass@1 was used to evaluate the performance of the fine-tuned code domain model of PanGu-Σ
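
As a reference for the metric, the sketch below implements the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n completions are sampled per problem and c of them pass all tests. For k = 1 this reduces to c / n. Whether the authors sampled several completions per problem or a single one is not stated in this summary, and the per-problem counts below are made up for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c pass all tests."""
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical (n, c) counts for three problems.
results = [(10, 3), (10, 0), (10, 10)]

scores = [pass_at_k(n, c, k=1) for n, c in results]
benchmark_pass_at_1 = sum(scores) / len(scores)
print(f"pass@1 = {benchmark_pass_at_1:.3f}")
```

The benchmark score is the mean of the per-problem estimates.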

English natural language understanding

  • Sparse models offer benefits of larger model size with reduced computation cost.
  • Generating valuable signals that align with the real world is a crucial research topic.
  • Utilizing language models as a foundation and incorporating multiple modalities for perception input is important.
  • Deployment cost of large language models is a major hurdle to overcome.
  • Online knowledge updates are critical for optimal performance of the large language model system.
  • PanGu-Σ is a trillion-parameter language model architecture.
  • Random Routed Experts (RRE) and Expert Computation and Storage Separation (ECSS) are used in PanGu-Σ.
  • PanGu-Σ achieves high system performance under the MindSpore framework.
  • PanGu-Σ outperforms GPT-3 on SuperGLUE benchmark.
  • mT5 is a multilingual variant of T5 with 13B parameters.
  • CPM-2 is a large-scale cost-efficient pre-trained language model.
  • EVA and EVA2.0 are encoder-decoder-based Chinese dialogue models.
  • PanGu-Σ outperforms CeMAT on the WMT17 translation task.