Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Large language models require significant GPU memory for inference.
- A procedure is developed to reduce memory needed for inference by half while retaining full precision performance.
- A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
- A two-part quantization procedure is developed to cope with emergent features in transformer language models.
- Inference in LLMs with up to 175B parameters can be performed without any performance degradation.
- Software is open-sourced.
Paper Content
Introduction
- Large pretrained language models are widely used in NLP
- These models require significant memory for inference
- 95% of consumed parameters and 65-85% of all computation come from feed-forward and attention projection layers
- 8-bit quantization methods have been developed to reduce memory use
- Degradation-free quantization up to 350M parameters is poorly understood
- Multi-billion parameter quantization is an open challenge
- This paper presents a multi-billion-scale Int8 quantization procedure for transformers
- This procedure does not incur any performance degradation
- Two key challenges are solved: higher quantization precision and explicit representation of sparse but systematic large magnitude outlier features
- Vector-wise quantization is used to retain performance up to 2.7B parameters
Background
- Push quantization techniques to their breaking point by scaling transformer models
- Study high-precision asymmetric and symmetric quantization techniques
- Zeropoint quantization offers high precision but is rarely used
- Absolute maximum quantization is the most commonly used technique
8-bit data types and quantization
- Absmax quantization scales inputs into the 8-bit range [-127, 127]
- Zeropoint quantization shifts the input distribution into the full range [-127, 127]
- Zeropoint quantization uses a special instruction to add a zeropoint to each element of an input tensor before performing a 16-bit integer operation
- Int8 Matrix Multiplication with 16-bit Float Inputs and Outputs is performed using either absmax or zeropoint quantization
Int8 matrix multiplication at scale
- Quantization methods that use a single scaling constant per tensor can be affected by outliers.
- Block-wise constants can be used to limit the effect of outliers.
- Row-wise quantization is improved by vector-wise quantization.
- Mixed-precision decomposition is used to handle large magnitude outlier features.
- Memory reduction of up to 1.96x is achieved.
Vector-wise quantization
- Matrix multiplication can be viewed as a sequence of independent inner products.
- Different scaling constants can be assigned to each row and column of the matrices.
- Denormalization of the inner product results is done by the outer product of the scaling constants.
The core of llm.int8(): mixed-precision decomposition
- 8-bit transformers have large magnitude features
- Vector-wise quantization is ineffective for outlier features
- Outlier features are sparse and systematic
- New decomposition technique focuses on high precision multiplication for outlier features
- Mixed-precision decomposition for matrix multiplication separates outlier feature dimensions into a set
- Threshold of 6.0 reduces transformer performance degradation close to zero
Experimental setup
- Measure robustness of quantization methods as size of pretrained language models is scaled up to 175B parameters
- Two setups used for experiments: language modeling perplexity and zeroshot accuracy degradation
- Pretrained models range from 125M to 13B parameters
- NVIDIA A40 GPUs used for evaluation
- OPT models used to measure zeroshot performance
Main results
- Table 1 shows that absmax, row-wise, and zeropoint quantization fail as the model size increases.
- LLM.int8() is the only method that preserves perplexity and has a favorable scaling trend.
- Figure 1 shows that LLM.int8() maintains full 16-bit performance when scaling from 125M to 175B parameters.
- 8-bit absmax vector-wise quantization scales poorly and degenerates into random performance.
- Quantization overhead can slow inference for models with less than 6.7B parameters.
- LLM.int8() run times is about two times faster for large matrix multiplications equivalent to those in 175B models.
Emergent large magnitude features in transformers at scale
- Outlier features with large magnitudes emerge and affect all layers of a transformer.
- Outlier features strongly affect attention and the overall predictive performance of transformers.
- Up to 150k outliers exist per 2048 token sequence for a 13B model.
- Outlier features are highly systematic and only represent at most 7 unique feature dimensions.
- Insights from this analysis were critical to developing mixed-precision decomposition.
- Analysis explains the advantages of zeropoint quantization and why they disappear with the use of mixed-precision decomposition.
Finding outlier features
- Difficulty with quantitative analysis of emergent phenomena is two-fold
- Aim to select small subset of features for analysis
- Use empirical approach to find constraints
- Outliers defined by magnitude of feature, affects at least 25% of layers, affects at least 6% of sequence dimensions
- Track dimensions with magnitude of 6 or higher
- Ignore attention function and FFN contraction layer
- Thresholds set to limit detection to single outlier in smallest model
- Test models up to 13B parameters
- Evaluate transformers trained in 3 different software frameworks
- Perform analysis in 2 different inference software frameworks
Measuring the effect of outlier features
- Emergence of large magnitude features across all layers of a transformer occurs suddenly between 6B and 6.7B parameters.
- Emergence of large magnitude features across all layers of the transformer can be seen as emerging smoothly according to an exponential function of decreasing perplexity.
- Median outlier feature magnitude rapidly increases once outlier features occur in all layers of the transformer.
- Number of outliers features increases strictly monotonically with respect to decreasing C4 perplexity.
- Removing outliers reduces mean top-1 softmax probability from 40% to 20%, and increases validation perplexity by 600-1000%.
- Removing random feature dimensions reduces top-1 probability by 0.02-0.3%, and increases perplexity by 0.1%.
Interpretation of quantization performance
- Outliers in feature dimensions are common in large transformers.
- Row-wise and vectorwise quantization cannot deal with outliers effectively.
- Outliers have a strict asymmetric distribution.
- Zeropoint quantization is effective for these outliers.
- At 13B scale, even zeropoint quantization fails.
- Mixed-precision decomposition is needed to retain full precision performance.
Related work
- 8-bit Data Types are used by GPUs and have a sign bit and different exponent and fraction bit combinations
- Outlier Features in Language Models are related to Layer Normalization and the token frequency distribution
- Two methods of quantization are nuQmm and ZeroQuant, which use group-wise quantization
- ZeroQuant achieves zero-degradation performance for 8-bit quantization of a 20B model
- LLM.int8() and GLM-130B use insights from the paper to achieve zero-degradation 8-bit quantization
Discussion and limitations
- Multi-billion parameter transformers can be quantized to Int8 and used for inference without performance degradation
- Mixed-precision decomposition isolates outlier features in a separate 16-bit matrix multiplication
- LLM.int8() recovers full inference performance of models with up to 175B parameters
Broader impacts
- Enabling access to large models that previously could not fit into GPU memory
- Allowing research and applications that were not possible before due to limited GPU memory
- Increasing disparities between resource-rich and poor organizations
- Making large pretrained models more accessible
- Potential beneficial and detrimental effects on society
- Quantization of transformers with fewer than 1B parameters
- Using less than 8-bits for data types for convolutional networks
- Learning adjustments to the quantization procedure to improve quantization errors
- Int8 matrix multiplication yields an advantage if the entire GPU is well saturated
- Int8 inference is slightly slower but close to the millisecond latency per token compared to 16-bit inference
- Using 8-bit feed-forward networks with and without 8-bit linear projections in the attention layer
- Degradation of finetuning with 8-bit FFN layers and 8-bit attention projection layers compared to 32-bit
- Improvements if mixed-precision decomposition is used
- Outliers are essential for large softmax probabilities
- Outliers disrupt symmetric absmax quantization and favor asymmetric zeropoint quantization