Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • GLM-130B is a bilingual (English and Chinese) pre-trained language model with 130 billion parameters.
  • It is an attempt to open-source a 100B-scale model at least as good as GPT-3.
  • The paper introduces the training process of GLM-130B, including design choices, training strategies, and engineering efforts.
  • GLM-130B outperforms GPT-3 175B and ERNIE TITAN 3.0 260B on a range of popular English and Chinese benchmarks.
  • GLM-130B can be effectively inferred on 4$\times$RTX 3090 (24G) or 8$\times$RTX 2080 Ti (11G) GPUs.
  • The GLM-130B model weights, code, training logs, related toolkit, and lessons learned are open-sourced.

Paper Content

Introduction

  • Large language models (LLMs) with over 100 billion parameters have been developed
  • GPT-3 with 175B parameters is a pioneer study of 100B-scale LLMs
  • Training a dense LLM at such a scale raises unexpected technical and engineering challenges
  • GLM-130B is a bilingual (English and Chinese) bidirectional dense model with 130 billion parameters
  • GLM-130B outperforms GPT-3 on a wide range of benchmarks
  • GLM-130B is associated with significantly less bias and generation toxicity than its 100B-scale counterparts
  • GLM-130B is designed to empower as many people as possible to conduct 100B-scale LLM studies
  • GLM-130B can be used on a single A100 server and fast inference with performance guarantee on a server of 4×RTX 3090 or 8×RTX 2080 Ti
  • Model checkpoints, code, training logs, related toolkits, and lessons learned are open-sourced

The design choices of glm-130b

  • GLM-130B uses a bidirectional GLM as its backbone
  • GLM uses autoregressive blank infilling as its training objective
  • GLM-130B uses bidirectional attention over unmasked contexts
  • GLM-130B mixes two corruption objectives, [MASK] and [gMASK]
  • GLM-130B uses Rotary Positional Encoding and GLU with GeLU activation
  • Training instability is a major challenge for training LLMs
  • DeepNorm initialization is used to stabilize GLM-130B’s training
  • GLM-130B offers a record-high accuracy of 80.2% on zero-shot LAMBADA

Glm-130b’s pre-training setup

  • GLM-130B pre-training objective includes self-supervised GLM autoregressive blank infilling and multi-task learning
  • Self-supervised blank infilling uses [MASK] and [gMASK] for 30% and 70% of tokens respectively
  • Pre-training data includes 1.2T English, 1.0T Chinese and 250G Chinese corpora
  • Multi-Task Instruction Pre-Training (MIP) accounts for 5% of tokens and is set in the pre-training stage
  • 74 prompted datasets from (Sanh et al., 2022;Wang et al., 2022a) are included in GLM-130B

Platform-aware parallel strategies and model configurations

  • GLM-130B is trained on a cluster of 96 DGX-A100 GPU servers
  • Data parallelism and tensor model parallelism are used to train billion-scale models
  • 3D Parallel Strategy combines pipeline model parallelism with the other two strategies
  • 4-way tensor parallelism and 8-way pipeline parallelism used
  • GLM-130B configured to run on a single DGX-A100 node in FP16 precision
  • 400 billion tokens trained with a fixed sequence length of 2,048 per sample
  • AdamW optimizer used with β 1 and β 2 set to 0.9 and 0.95, and a weight decay value of 0.1

The training stability of glm-130b

  • GLM-130B’s quality is largely impacted by the number of tokens it passes through
  • Low-precision FP formats improve computing efficiency but are prone to errors
  • Mixed-precision strategy is used to reduce GPU memory usage and improve training efficiency
  • Training of GLM-130B faces frequent loss spikes
  • Value scale can be extremely large in deeper layers if using Pre-LN
  • Attention scores can grow too large and exceed FP16’s range
  • Gradient norm can serve as an informative indicator of training collapses
  • Gradient shrink on the embedding layer can help stabilize the GLM-130B training

Glm-130b inference on rtx 2080 ti

  • GLM-130B is designed to reduce hardware requirements for accessing 100Bscale LLMs
  • GLM-130B inference is accelerated by FasterTransformer and is 7-8.4x faster than BLOOM-176B
  • INT4 quantization is used to compress GLM-130B while maintaining performance
  • Outliers in GLM-130B activations are solved by quantizing model weights to FP16 precision

The results

  • Evaluated GLM-130B on English and Chinese benchmarks
  • Clarified scope of zero-shot learning in GLM-130B
  • Criterion for picking GLM-130B’s zero-shot datasets based on domain transfer from MIP

Language modeling

  • LAMBADA is a dataset to test language modeling capability
  • GLM-130B achieved a zero-shot accuracy of 80.2 on LAMBADA
  • GLM-130B performed the best on 18 shared test sets in terms of weighted BPB when compared to GPT-3 and Jurassic-1

Massive multitask language understanding (mmlu)

  • MMLU is a benchmark for multi-choice question answering tasks
  • GPT-3 result is adopted from MMLU
  • GLM-130B’s few-shot performance on MMLU approaches GPT-3
  • GLM-130B’s accuracy increases as training proceeds

Beyond the imitation game benchmark (big-bench)

  • BIG-bench is a computer science benchmark that tests models’ ability on reasoning, knowledge, and commonsense.
  • GLM-130B outperforms GPT-3 175B and PaLM 540B in zero-shot setting.
  • GLM-130B’s performance increases with the number of shots.
  • GLM-130B’s performance growth with few-shot samples is not as significant as GPT-3’s.

Chinese language understanding evaluation (clue)

  • Evaluated GLM-130B’s Chinese zero-shot performance on CLUE and FewCLUE
  • Compared GLM-130B to ERNIE Titan 3.0
  • GLM-130B outperformed ERNIE Titan 3.0 across 12 tasks
  • GLM-130B performed at least 260% better than ERNIE on two abstractive MRC datasets
  • Pre-training of language models is done on web-scale corpora
  • Transformer-based language models have a scaling law
  • GLM-130B is an open-sourced LLM
  • Transfer learning for LLMs is concentrated on prompting and in-context learning
  • Inference of LLMs is done via limited APIs
  • GLM-130B is efficient and fast to infer on GPUs

Conclusion and lessons

  • Introduces GLM-130B, a bilingual pre-trained language model
  • Aims to facilitate open and inclusive LLM research
  • Generates insight into LLMs’ architectures, pre-training objectives, training stability and efficiency, and affordable inference
  • High quality of GLM-130B in terms of language performance and ethical results
  • Suggests bidirectional-attention GLM as a strong architecture alternative

Lesson (platform-aware configuration).

  • Configure LLMs based on cluster and parallel strategy.
  • DeepNorm is a type of Post-LN to stabilize GLM-130B.

Lesson (training stability categorization).

  • LLMs suffer from unexpected training instability.
  • FP16 induces more instability, but allows training and inference on different platforms.
  • Shrinking embedding layer’s gradient to 0.1 can solve numerical instability problems.

Lesson (glm’s int4 quantization scaling law).

  • GLM has a unique weight quantization scaling law.
  • This law is not observed in GPT-style BLOOM.

Lesson (future direction).

  • More and better data needed to create powerful LLMs
  • Better architectures and pre-training objectives needed to create powerful LLMs
  • More sufficient training needed to create powerful LLMs
  • Evaluated GLM-130B on ethical evaluation benchmarks
  • Evaluation implies algorithm design choices can mitigate biases and toxicity while keeping strong language performance

Reproducibility

  • LLMs (Language Models) exist
  • GPT-3 175B is an example of an LLM
  • This paper is different from most existing LLMs, including GPT-3 175B

A a brief history of glm-130b

  • GLM-130B project was conceived in Dec. 2021 at Tsinghua KEG
  • Pre-training a highly accurate language model is of value for Chinese and English
  • GPT-3 is the pioneer for this effort, but not available to most people and supports English only

2022.4

  • Optimize A100 kernel’s computing efficiency
  • Train a bilingual pre-trained dense model
  • Lack of computational resources
  • Lack of a robust pre-training algorithm
  • Lack of fast inference solutions
  • Train a GLM model of 130 billion parameters
  • Pre-training algorithm is GLM
  • Attempt to reduce resource requirements

B ethics: evaluation on biases and toxicity

  • LLMs can produce toxic and illegal content
  • GLM-130B requires applicants to agree not to use it for any harmful deeds
  • Self-diagnoses can help reduce harmful generation
  • GLM-130B shows fewer biases on most stereotypes
  • Multi-lingual pre-training may help LLMs present less harmful biases
  • GLM-130B may present special Chinese biases which lack testing benchmarks

B.2 bias measurement: stereoset

  • StereoSet is a bias and stereotype evaluation benchmark
  • It reports a series of metrics, including Language Modeling Scores (LMS), Stereotype Score (SS), and Idealized Context Association Test Score (ICAT)
  • Common practice is to calibrate the likelihood of an option according to its length
  • Scores are normalized over tokens rather than characters

B.3 hate speech detection: ethos

  • Social media corpus may contain hate speeches
  • Investigating LLMs’ ability to identify hate speeches is important
  • ETHOS dataset used to detect sexism and racism speech
  • GPT-3 Davinci and OPT 175B tested on benchmark
  • GLM-130B outperforms other LLMs
  • GLM-130B pre-trained on unsupervised diverse corpora
  • Evaluating generation toxicity of GLM-130B on RealTox-icPrompts dataset

B.4 toxic genearation: realtoxicprompts

  • Results are shown in Figure 10
  • As prompt toxicity increases, continuation toxicity probability increases in both models
  • GLM-130B has lower toxicity rate than GPT-3 Davinci
  • Results from Zhang et al. (2022) not included due to API update

C technical details

  • Post-LN is a normalization technique used in transformer architectures
  • Pre-LN is a substitute for Post-LN and is used in existing language models
  • Pre-LN is unable to handle vulnerable training when models scale up
  • Sandwich-LN is a remedy for Pre-LN, but is also prone to collapsing in training

C.3 positional encoding and feed-forward network

  • Vanilla transformer uses absolute position encoding
  • Relative position encoding captures word relevance better
  • RoPE is a relative position encoding implemented as absolute position encoding
  • GLM uses two-dimensional absolute position encoding
  • GLM-130B uses one-dimensional positional encoding
  • FFN is improved with GLU and GeLU activation

C.4 pipeline parallel analysis

  • Pipeline parallelism consists of three operations: forward, backward, and optimizer step
  • Naive sequential pipeline implementation leads to an unbearable amount of bubbles
  • Gpipe (Huang et al., 2019) reduces bubbles by splitting data into microbatches
  • PipeDream-Flush (Narayanan et al., 2021) optimizes GPU memory usage
  • Increasing the size of tensor parallelism reduces the bubble ratio

C.5 inference acceleration

  • Easy to read and run PyTorch implementation of model
  • NVIDIA’s FasterTransformer used to speed up inference
  • Optimize time-costing operations
  • Reduce GPU kernel calls
  • Improve computing efficiency

C.6 activation outlier analysis

  • GLM-130B’s weight can be quantized into INT4 to reduce parameter redundancy.
  • GLM-130B’s activations cannot be appropriately quantized due to value outliers.
  • 30% of GLM-130B’s dimensions may present value outliers.

C.8 quantization settings

  • Goal is to save GPU memory without hurting model performance
  • Only quantize linear layers, leave other parts unchanged
  • Quantization precision of INT4, two INT4 weights compressed into one INT8 weight
  • Absmax quantization used, found to be enough to maintain model performance
  • During inference, only quantized weights stored in GPU memory, FP16 weights dequantized at runtime

C.8.1 quantization results at scales

  • GLM models at 110M to 10B scale are from GLM’s original paper (Du et al., 2022).
  • Training objective is key factor for quantization.
  • Table 10 shows performance of GLM and BLOOM family models at different scales on LAMBADA dataset with varying quantization methods.
  • Almost all models maintain performance at INT8 precision.
  • GLM maintains better performance than BLOOM at INT4 precision as it scales.

C.8.2 weight distribution analysis

  • Analyzed weight value distribution of two models in a histogram
  • GLM-130B’s weight values are well-shaped and suitable for INT4 quantization
  • Included prompted instruction datasets in GLM-130B’s MIP training
  • Datasets from T0 and PromptSource used for natural language understanding and generation
  • Created instructions and prompts for part of DeepStruct datasets
  • Sampled 50,000 samples from KELM and PropBank datasets

D.2 data and prompts in mip for deepstruct

  • Prompts and instructions for datasets in DeepStruct were created manually.
  • Prompts were written into Jinja10 templates.
  • Evaluation of GLM-130B’s information extraction ability is left for future work.

D.2.1 dialogue state tracking

  • Adopted Multiwoz 2.1 dialogue state tracking dataset
  • Reformulated into two tasks: dialogue state tracking and slot filling
  • Adopted three classical joint entity and relation extraction datasets
  • Formulated challenges into sequence-to-sequence generation
  • Extracted knowledge triples consisting of “head entity”, “relation”, and “tail entity”
  • Evaluation adopted English prompts and verbalizers
  • Evaluation metrics reported accuracy
  • GPT-3 175B and PaLM 540B results reported
  • Weights and code of GLM-130B open to anyone
  • Lowered hardware requirements for inference

E.1 impact on ai research

  • Most research institutions cannot afford to pre-train large language models.
  • Researchers have limited access to inference APIs with fees.
  • GLM-130B allows researchers to analyze model parameters and internal states.
  • GLM-130B can be used with popularized GPUs from cloud service.

E.2 impact on individual developers and small companies

  • Paid inference APIs can be expensive for individual developers and small companies.
  • GLM-130B can be deployed on popularized hardware or cloud service to reduce cost.
  • Distillation techniques can be used to obtain smaller models with comparable performance.

F environmental impact

  • Large language models have high energy usage and carbon emissions
  • GPT-3 estimated to use 500 tons of carbon emissions
  • Training released 257.01 metric tons of CO2
  • Equivalent to the yearly emissions of 18 average Americans
  • GLM-130B can save more carbon emissions for reproducing 100B-scale LLMs