Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- GLM-130B is a bilingual (English and Chinese) pre-trained language model with 130 billion parameters.
- It is an attempt to open-source a 100B-scale model at least as good as GPT-3.
- The paper introduces the training process of GLM-130B, including design choices, training strategies, and engineering efforts.
- GLM-130B outperforms GPT-3 175B and ERNIE TITAN 3.0 260B on a range of popular English and Chinese benchmarks.
- GLM-130B supports efficient inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs.
- The GLM-130B model weights, code, training logs, related toolkit, and lessons learned are open-sourced.
Paper Content
Introduction
- Large language models (LLMs) with over 100 billion parameters have been developed
- GPT-3 with 175B parameters is a pioneer study of 100B-scale LLMs
- Training a dense LLM at such a scale raises unexpected technical and engineering challenges
- GLM-130B is a bilingual (English and Chinese) bidirectional dense model with 130 billion parameters
- GLM-130B outperforms GPT-3 on a wide range of benchmarks
- GLM-130B is associated with significantly less bias and generation toxicity than its 100B-scale counterparts
- GLM-130B is designed to empower as many people as possible to conduct 100B-scale LLM studies
- GLM-130B supports inference on a single A100 server, and fast inference with performance guarantees on a server of 4×RTX 3090 or 8×RTX 2080 Ti
- Model checkpoints, code, training logs, related toolkits, and lessons learned are open-sourced
The design choices of GLM-130B
- GLM-130B uses a bidirectional GLM as its backbone
- GLM uses autoregressive blank infilling as its training objective
- GLM-130B uses bidirectional attention over unmasked contexts
- GLM-130B mixes two corruption objectives, [MASK] and [gMASK]
- GLM-130B uses Rotary Positional Encoding and GLU with GeLU activation
- Training instability is a major challenge for training LLMs
- DeepNorm initialization is used to stabilize GLM-130B’s training
- GLM-130B offers a record-high accuracy of 80.2% on zero-shot LAMBADA
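The combination of bidirectional attention over the context and autoregressive blank infilling can be pictured as an attention mask. The sketch below is illustrative only: `build_glm_mask` and the sequence lengths are hypothetical names and values, not taken from the GLM-130B code. It assumes every position may attend to the whole (unmasked) context, while generated tokens additionally attend to themselves and to earlier generated tokens.

```python
def build_glm_mask(context_len, gen_len):
    """Return mask[i][j] == 1 if position i may attend to position j.

    Assumption: the first `context_len` positions are the unmasked context,
    the remaining `gen_len` positions are the tokens generated for the blank.
    """
    total = context_len + gen_len
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < context_len:
                # Every position sees the whole context (bidirectional part).
                mask[i][j] = 1
            elif i >= context_len and j <= i:
                # Generated tokens see themselves and earlier generated tokens.
                mask[i][j] = 1
    return mask


for row in build_glm_mask(context_len=3, gen_len=2):
    print(row)
```

Context rows never attend to generated positions, so the context representation does not leak the tokens the model is asked to predict.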
GLM-130B’s pre-training setup
- GLM-130B pre-training objective includes self-supervised GLM autoregressive blank infilling and multi-task learning
- Self-supervised blank infilling uses [MASK] for 30% of training sequences and [gMASK] for the other 70%
- Pre-training data includes the 1.2T English Pile corpus, the 1.0T Chinese WudaoCorpora, and 250G of Chinese corpora crawled from the web
- Multi-Task Instruction Pre-Training (MIP) accounts for 5% of tokens and is set in the pre-training stage
- 74 prompted datasets from (Sanh et al., 2022;Wang et al., 2022a) are included in GLM-130B
Platform-aware parallel strategies and model configurations
- GLM-130B is trained on a cluster of 96 DGX-A100 GPU servers
- Data parallelism and tensor model parallelism are used to train billion-scale models
- 3D Parallel Strategy combines pipeline model parallelism with the other two strategies
- 4-way tensor parallelism and 8-way pipeline parallelism used
- GLM-130B configured to run on a single DGX-A100 node in FP16 precision
- 400 billion tokens trained with a fixed sequence length of 2,048 per sample
- AdamW optimizer used with β1 and β2 set to 0.9 and 0.95, and a weight decay value of 0.1
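The configuration above implies some simple arithmetic that is worth checking. The sketch below derives the data-parallel degree, the FP16 weight footprint, and the number of training samples from the figures in this section. Two values are assumptions not stated in the summary: 8 GPUs per DGX-A100 node and 40 GB of memory per GPU.

```python
NODES = 96                 # DGX-A100 servers (from the text)
GPUS_PER_NODE = 8          # assumption: GPUs per DGX-A100 node
GPU_MEMORY_GB = 40         # assumption: 40 GB A100 variant
TENSOR_PARALLEL = 4        # from the text
PIPELINE_PARALLEL = 8      # from the text
PARAMS = 130_000_000_000   # 130 billion parameters
BYTES_PER_FP16 = 2
TRAIN_TOKENS = 400_000_000_000
SEQ_LEN = 2048

total_gpus = NODES * GPUS_PER_NODE
gpus_per_replica = TENSOR_PARALLEL * PIPELINE_PARALLEL
data_parallel = total_gpus // gpus_per_replica

fp16_weights_gb = PARAMS * BYTES_PER_FP16 / 1e9
node_memory_gb = GPUS_PER_NODE * GPU_MEMORY_GB

samples = TRAIN_TOKENS // SEQ_LEN

print("GPUs in cluster:", total_gpus)
print("GPUs per model replica:", gpus_per_replica)
print("Data-parallel replicas:", data_parallel)
print("FP16 weights (GB):", fp16_weights_gb, "vs node memory (GB):", node_memory_gb)
print("Training samples of length 2,048:", samples)
```

Under these assumptions the FP16 weights alone (260 GB) fit within one node's 320 GB, which is consistent with the single-node FP16 inference target.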
The training stability of GLM-130B
- GLM-130B’s quality is largely determined by the number of tokens it is trained on
- Low-precision floating-point formats such as FP16 improve computing efficiency but are prone to overflow and underflow errors
- Mixed-precision strategy is used to reduce GPU memory usage and improve training efficiency
- Training of GLM-130B faces frequent loss spikes
- Value scale can be extremely large in deeper layers if using Pre-LN
- Attention scores can grow too large and exceed FP16’s range
- Gradient norm can serve as an informative indicator of training collapses
- Gradient shrink on the embedding layer can help stabilize the GLM-130B training
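Gradient shrink on the embedding layer can be illustrated with a scalar toy example. The sketch assumes the common formulation `alpha * e + (1 - alpha) * stop_gradient(e)`: the forward value is unchanged, but the gradient flowing into the embedding is scaled by `alpha`. The function names are hypothetical, and the stop-gradient is emulated by passing the detached copy as a constant; a real implementation would use the framework's own detach operation.

```python
ALPHA = 0.1  # shrink factor; 0.1 is the value mentioned in this summary's lessons


def shrunk_embedding(e, e_detached, alpha=ALPHA):
    """alpha * e + (1 - alpha) * stop_gradient(e).

    `e_detached` stands in for the detached copy: it has the same value as `e`
    but is treated as a constant when differentiating.
    """
    return alpha * e + (1.0 - alpha) * e_detached


def numerical_grad(f, x, h=1e-6):
    """Central finite difference, used here instead of an autograd engine."""
    return (f(x + h) - f(x - h)) / (2 * h)


e0 = 0.75
forward = shrunk_embedding(e0, e0)
grad = numerical_grad(lambda e: shrunk_embedding(e, e0), e0)

print("forward value:", forward)      # same as e0 up to rounding
print("gradient w.r.t. e:", grad)     # approximately ALPHA
```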
GLM-130B inference on RTX 2080 Ti
- GLM-130B is designed to reduce hardware requirements for accessing 100B-scale LLMs
- GLM-130B inference is implemented with FasterTransformer and is 7-8.4x faster than the PyTorch implementation of BLOOM-176B
- INT4 quantization is used to compress GLM-130B while maintaining performance
- Because GLM-130B’s activations contain outliers, only the model weights are quantized; activations are kept in FP16 precision
The results
- Evaluated GLM-130B on English and Chinese benchmarks
- Clarified scope of zero-shot learning in GLM-130B
- Criterion for picking GLM-130B’s zero-shot datasets based on domain transfer from MIP
Language modeling
- LAMBADA is a dataset to test language modeling capability
- GLM-130B achieved a zero-shot accuracy of 80.2% on LAMBADA
- On the Pile, GLM-130B performed best on 18 shared test sets in terms of weighted bits per byte (BPB) when compared to GPT-3 and Jurassic-1
Massive multitask language understanding (MMLU)
- MMLU is a benchmark for multi-choice question answering tasks
- GPT-3 result is adopted from MMLU
- GLM-130B’s few-shot performance on MMLU approaches GPT-3
- GLM-130B’s accuracy increases as training proceeds
Beyond the imitation game benchmark (BIG-bench)
- BIG-bench is a benchmark of challenging tasks that test models’ abilities in reasoning, knowledge, and commonsense.
- GLM-130B outperforms GPT-3 175B and PaLM 540B in zero-shot setting.
- GLM-130B’s performance increases with the number of shots.
- GLM-130B’s performance growth with few-shot samples is not as significant as GPT-3’s.
Chinese language understanding evaluation (CLUE)
- Evaluated GLM-130B’s Chinese zero-shot performance on CLUE and FewCLUE
- Compared GLM-130B to ERNIE TITAN 3.0
- GLM-130B outperformed ERNIE TITAN 3.0 across 12 tasks
- GLM-130B performed at least 260% better than ERNIE on two abstractive MRC datasets
Related work
- Pre-training of language models is done on web-scale corpora
- Transformer-based language models have a scaling law
- GLM-130B is an open-sourced LLM
- Transfer learning for LLMs is concentrated on prompting and in-context learning
- Most publicly accessible LLMs offer inference only via limited APIs
- GLM-130B is efficient and fast to infer on GPUs
Conclusion and lessons
- Introduces GLM-130B, a bilingual pre-trained language model
- Aims to facilitate open and inclusive LLM research
- Generates insight into LLMs’ architectures, pre-training objectives, training stability and efficiency, and affordable inference
- High quality of GLM-130B in terms of language performance and ethical results
- Suggests bidirectional-attention GLM as a strong architecture alternative
Lesson (platform-aware configuration).
- Configure LLMs based on cluster and parallel strategy.
- DeepNorm, a type of Post-LN, is the option that stabilizes GLM-130B.
Lesson (training stability categorization).
- LLMs suffer from unexpected training instability.
- FP16 induces more instability than BF16, but allows training and inference on different platforms.
- Shrinking the embedding layer’s gradient to 0.1 of its original scale can solve most numerical instability problems.
Lesson (GLM’s INT4 quantization scaling law).
- GLM has a unique weight quantization scaling law.
- This law is not observed in GPT-style BLOOM.
Lesson (future direction).
- More and better data needed to create powerful LLMs
- Better architectures and pre-training objectives needed to create powerful LLMs
- More thorough training needed to create powerful LLMs
- Evaluated GLM-130B on ethical evaluation benchmarks
- Evaluation implies algorithm design choices can mitigate biases and toxicity while keeping strong language performance
Reproducibility
- Most existing LLMs (large language models), including GPT-3 175B, are closed-source
- GLM-130B differs from them: its weights, code, and training logs are open-sourced
A a brief history of GLM-130B
- GLM-130B project was conceived in Dec. 2021 at Tsinghua KEG
- Pre-training a highly accurate language model is of value for Chinese and English
- GPT-3 is the pioneer for this effort, but not available to most people and supports English only
2022.4
- Optimize A100 kernel’s computing efficiency
- Train a bilingual pre-trained dense model
- Lack of computational resources
- Lack of a robust pre-training algorithm
- Lack of fast inference solutions
- Train a GLM model of 130 billion parameters
- Pre-training algorithm is GLM
- Attempt to reduce resource requirements
B ethics: evaluation on biases and toxicity
- LLMs can produce toxic and illegal content
- GLM-130B requires applicants to agree not to use it for any harmful deeds
- Self-diagnoses can help reduce harmful generation
- GLM-130B shows fewer biases on most stereotypes
- Multi-lingual pre-training may help LLMs present less harmful biases
- GLM-130B may present special Chinese biases which lack testing benchmarks
B.2 bias measurement: StereoSet
- StereoSet is a bias and stereotype evaluation benchmark
- It reports a series of metrics, including Language Modeling Score (LMS), Stereotype Score (SS), and Idealized Context Association Test Score (ICAT)
- Common practice is to calibrate the likelihood of an option according to its length
- Scores are normalized over tokens rather than characters
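The metrics above can be sketched in a few lines. The ICAT formula used here, `lms * min(ss, 100 - ss) / 50`, is assumed from the StereoSet benchmark's definition. The helper names and the input numbers are purely illustrative and are not results from the paper.

```python
def icat(lms, ss):
    """Idealized CAT score, assuming icat = lms * min(ss, 100 - ss) / 50.

    lms: Language Modeling Score in [0, 100]
    ss:  Stereotype Score in [0, 100]; 50 means no preference either way
    """
    return lms * min(ss, 100.0 - ss) / 50.0


def length_normalized_logprob(token_logprobs):
    """Average log-probability per token, so longer options are not penalized."""
    return sum(token_logprobs) / len(token_logprobs)


# Illustrative inputs only (not measured values).
print(icat(100.0, 50.0))   # ideal model: perfect LMS, unbiased SS
print(icat(80.0, 60.0))    # biased toward stereotypes, so ICAT drops below LMS
print(length_normalized_logprob([-1.0, -2.0, -3.0]))
```

Normalizing by token count rather than character count corresponds to dividing by `len(token_logprobs)` as done here.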
B.3 hate speech detection: ETHOS
- Social media corpora may contain hate speech
- Investigating LLMs’ ability to identify hate speech is important
- ETHOS dataset used to detect sexist and racist speech
- GPT-3 Davinci and OPT 175B tested on benchmark
- GLM-130B outperforms other LLMs
- GLM-130B pre-trained on unsupervised diverse corpora
- Evaluating generation toxicity of GLM-130B on the RealToxicPrompts dataset
B.4 toxic generation: RealToxicPrompts
- Results are shown in Figure 10
- As prompt toxicity increases, continuation toxicity probability increases in both models
- GLM-130B has lower toxicity rate than GPT-3 Davinci
- Results from Zhang et al. (2022) not included due to API update
C technical details
- Post-LN is a normalization technique used in transformer architectures
- Pre-LN is a substitute for Post-LN and is used in existing language models
- Pre-LN still cannot keep training stable when models scale up
- Sandwich-LN is a remedy for Pre-LN, but is also prone to collapsing in training
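The layer-normalization variants above differ only in where normalization sits relative to the residual connection. The following sketch contrasts them on plain Python lists. It is a simplified illustration without learnable gain or bias, and `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward block. The DeepNorm scaling `alpha = sqrt(2 * num_layers)` is taken to be the setting used for GLM-130B and should be checked against the paper.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln(x, sublayer):
    # Post-LN: normalize after adding the residual.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln(x, sublayer):
    # Pre-LN: normalize the sublayer input; the residual stream stays unnormalized.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def sandwich_ln(x, sublayer):
    # Sandwich-LN: Pre-LN plus an extra normalization on the sublayer output.
    return [a + b for a, b in zip(x, layer_norm(sublayer(layer_norm(x))))]


def deepnorm(x, sublayer, num_layers):
    # DeepNorm: Post-LN with the residual up-weighted by alpha (assumed sqrt(2N)).
    alpha = math.sqrt(2 * num_layers)
    return layer_norm([alpha * a + b for a, b in zip(x, sublayer(x))])


toy_sublayer = lambda v: [0.5 * t for t in v]  # hypothetical sublayer
x = [1.0, 2.0, 3.0, 4.0]
print(post_ln(x, toy_sublayer))
print(pre_ln(x, toy_sublayer))
print(sandwich_ln(x, toy_sublayer))
print(deepnorm(x, toy_sublayer, num_layers=4))
```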
C.3 positional encoding and feed-forward network
- Vanilla transformer uses absolute position encoding
- Relative position encoding captures word relevance better
- RoPE is a relative position encoding implemented as absolute position encoding
- GLM uses two-dimensional absolute position encoding
- GLM-130B uses one-dimensional positional encoding
- FFN is improved with GLU and GeLU activation
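The statement that RoPE is a relative encoding implemented as an absolute one can be checked numerically. The sketch below rotates consecutive pairs of a vector by a position-dependent angle and shows that the dot product between a rotated query and key depends only on their position offset. The pairing convention, the base of 10000, and the example vectors are assumptions for illustration; actual implementations may pair dimensions differently.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i + 1]) by the angle position * base**(-i / d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]   # illustrative query
k = [1.0, 0.4, -0.7, 0.2]   # illustrative key

# Same offset (3) at different absolute positions gives the same attention score.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 13), rope(k, 10))
# A different offset generally gives a different score.
score_c = dot(rope(q, 5), rope(k, 4))

print(score_a, score_b, score_c)
```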
C.4 pipeline parallel analysis
- Pipeline parallelism consists of three operations: forward, backward, and optimizer step
- Naive sequential pipeline implementation leads to an unbearable amount of bubbles
- Gpipe (Huang et al., 2019) reduces bubbles by splitting data into microbatches
- PipeDream-Flush (Narayanan et al., 2021) optimizes GPU memory usage
- For a fixed number of GPUs, increasing the tensor parallel size shrinks the pipeline depth and thus reduces the bubble ratio
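The effect of micro-batching on pipeline bubbles can be made concrete with the standard GPipe-style estimate: with `p` pipeline stages and `m` micro-batches, the idle fraction is `(p - 1) / (m + p - 1)`, assuming every stage takes equal time per micro-batch. The micro-batch count and the fixed budget of 32 GPUs per model replica below are hypothetical values chosen for illustration.

```python
def gpipe_bubble_ratio(pipeline_stages, micro_batches):
    """Idle fraction of a GPipe-style schedule, assuming equal time per stage."""
    p, m = pipeline_stages, micro_batches
    return (p - 1) / (m + p - 1)


# Naive sequential pipeline: one micro-batch, so most of the time is bubbles.
print(gpipe_bubble_ratio(8, 1))

# Fixed (hypothetical) budget of 32 GPUs per replica: a larger tensor-parallel
# size t leaves fewer pipeline stages p = 32 / t, which lowers the bubble ratio.
GPUS_PER_REPLICA = 32
MICRO_BATCHES = 32
for t in (1, 2, 4, 8):
    p = GPUS_PER_REPLICA // t
    print(t, p, round(gpipe_bubble_ratio(p, MICRO_BATCHES), 3))
```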
C.5 inference acceleration
- A plain PyTorch implementation of the model is easy to read and run, but can be slow for LLMs
- NVIDIA’s FasterTransformer used to speed up inference
- Optimize time-consuming operations
- Reduce GPU kernel calls
- Improve computing efficiency
C.6 activation outlier analysis
- GLM-130B’s weights can be quantized into INT4 to reduce parameter redundancy.
- GLM-130B’s activations cannot be appropriately quantized due to value outliers.
- 30% of GLM-130B’s dimensions may present value outliers.
C.8 quantization settings
- Goal is to save GPU memory without hurting model performance
- Only quantize linear layers, leave other parts unchanged
- Quantization precision of INT4, two INT4 weights compressed into one INT8 weight
- Absmax quantization used, found to be enough to maintain model performance
- During inference, only quantized weights are stored in GPU memory; they are dequantized to FP16 at runtime
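The settings above can be sketched end to end: symmetric absmax quantization to 4-bit integers, packing two INT4 values into one byte, and dequantizing at runtime. This is a minimal pure-Python illustration, not the project's kernel. The function names, the per-row scale, the [-7, 7] integer range, and the low-nibble-first packing order are all assumptions.

```python
def quantize_int4_absmax(weights):
    """Symmetric absmax quantization of one weight row to integers in [-7, 7]."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 7.0 if absmax > 0 else 1.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale


def pack_int4_pairs(q):
    """Pack two signed 4-bit values into one byte (low nibble first)."""
    assert len(q) % 2 == 0
    return bytes((q[i] & 0xF) | ((q[i + 1] & 0xF) << 4) for i in range(0, len(q), 2))


def unpack_int4_pairs(packed):
    """Inverse of pack_int4_pairs: recover signed 4-bit values."""
    def signed(nibble):
        return nibble - 16 if nibble >= 8 else nibble

    out = []
    for byte in packed:
        out += [signed(byte & 0xF), signed(byte >> 4)]
    return out


def dequantize(q, scale):
    """Runtime step: map stored integers back to floating-point weights."""
    return [v * scale for v in q]


row = [0.7, -0.3, 0.12, 0.0]           # illustrative weights
q, scale = quantize_int4_absmax(row)
packed = pack_int4_pairs(q)            # half as many bytes as weights
restored = dequantize(unpack_int4_pairs(packed), scale)
print(q, len(packed), restored)
```

Only `packed` and `scale` would need to be kept in memory; the dequantized values are recomputed whenever the layer runs.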
C.8.1 quantization results at scales
- GLM models at 110M to 10B scale are from GLM’s original paper (Du et al., 2022).
- Training objective is key factor for quantization.
- Table 10 shows performance of GLM and BLOOM family models at different scales on LAMBADA dataset with varying quantization methods.
- Almost all models maintain performance at INT8 precision.
- GLM maintains better performance than BLOOM at INT4 precision as it scales.
C.8.2 weight distribution analysis
- Analyzed weight value distribution of two models in a histogram
- GLM-130B’s weight values are well-shaped and suitable for INT4 quantization
- Included prompted instruction datasets in GLM-130B’s MIP training
- Datasets from T0 and PromptSource used for natural language understanding and generation
- Created instructions and prompts for part of DeepStruct datasets
- Sampled 50,000 samples from KELM and PropBank datasets
D.2 data and prompts in MIP for DeepStruct
- Prompts and instructions for datasets in DeepStruct were created manually.
- Prompts were written into Jinja templates.
- Evaluation of GLM-130B’s information extraction ability is left for future work.
D.2.1 dialogue state tracking
- Adopted MultiWOZ 2.1 dialogue state tracking dataset
- Reformulated into two tasks: dialogue state tracking and slot filling
- Adopted three classical joint entity and relation extraction datasets
- Formulated challenges into sequence-to-sequence generation
- Extracted knowledge triples consisting of “head entity”, “relation”, and “tail entity”
- Evaluation adopted English prompts and verbalizers
- Evaluation metrics reported accuracy
- GPT-3 175B and PaLM 540B results reported
- Weights and code of GLM-130B open to anyone
- Lowered hardware requirements for inference
E.1 impact on ai research
- Most research institutions cannot afford to pre-train large language models.
- Researchers have limited access to inference APIs with fees.
- GLM-130B allows researchers to analyze model parameters and internal states.
- GLM-130B can be used with widely available GPUs from cloud services.
E.2 impact on individual developers and small companies
- Paid inference APIs can be expensive for individual developers and small companies.
- GLM-130B can be deployed on widely available hardware or cloud services to reduce cost.
- Distillation techniques can be used to obtain smaller models with comparable performance.
F environmental impact
- Large language models have high energy usage and carbon emissions
- Training GPT-3 is estimated to have produced 500 tons of carbon emissions
- Training GLM-130B released 257.01 metric tons of CO2
- Equivalent to the yearly emissions of 18 average Americans
- Open-sourcing GLM-130B can save the carbon emissions of reproducing 100B-scale LLMs