Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Large language models require significant GPU memory for inference.
A procedure is developed to reduce memory needed for inference by half while retaining full precision performance.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
A two-part quantization procedure is developed to cope with emergent features in transformer language models.
Inference in LLMs with up to 175B parameters can be performed without any performance degradation.
Software is open-sourced.
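
As a rough illustration of the halving claim above, the arithmetic below counts only the bytes needed to store the weights. It is a back-of-the-envelope sketch: it ignores activations, caches, and any values kept in 16-bit, and the function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 175e9                              # 175B parameters
fp16_gb = weight_memory_gb(params, 16)      # 16-bit checkpoint
int8_gb = weight_memory_gb(params, 8)       # same weights stored as Int8
print(fp16_gb, int8_gb, fp16_gb / int8_gb)
```

Under these assumptions the Int8 copy needs half the weight memory of the 16-bit one.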

Paper Content

Introduction

Large pretrained language models are widely used in NLP
These models require significant memory for inference
95% of consumed parameters and 65-85% of all computation come from feed-forward and attention projection layers
8-bit quantization methods have been developed to reduce memory use
Degradation-free quantization up to 350M parameters is poorly understood
Multi-billion parameter quantization is an open challenge
This paper presents a multi-billion-scale Int8 quantization procedure for transformers
This procedure does not incur any performance degradation
Two key challenges are solved: higher quantization precision and explicit representation of sparse but systematic large magnitude outlier features
Vector-wise quantization is used to retain performance up to 2.7B parameters

Background

Quantization techniques are pushed to their breaking point by scaling transformer models
High-precision asymmetric and symmetric quantization techniques are studied
Zeropoint quantization offers high precision but is rarely used
Absolute maximum quantization is the most commonly used technique

8-bit data types and quantization

Absmax quantization scales inputs into the 8-bit range [-127, 127]
Zeropoint quantization shifts the input distribution into the full range [-127, 127]
Zeropoint quantization uses a special instruction to add a zeropoint to each element of an input tensor before performing a 16-bit integer operation
Int8 Matrix Multiplication with 16-bit Float Inputs and Outputs is performed using either absmax or zeropoint quantization
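
The two schemes above can be sketched in plain Python. This is one common formulation, not necessarily the exact one in the paper: the function names, the clamping step, and the convention of subtracting the zeropoint at quantization time are assumptions made for illustration, and inputs are assumed to be non-constant so the scale is finite.

```python
def clamp(q, lo=-127, hi=127):
    """Keep a quantized value inside the 8-bit range used in the text."""
    return max(lo, min(hi, q))


def absmax_quantize(xs):
    """Symmetric: scale so the largest magnitude maps to 127."""
    scale = 127.0 / max(abs(x) for x in xs)   # assumes some x != 0
    return [clamp(round(x * scale)) for x in xs], scale


def absmax_dequantize(qs, scale):
    return [q / scale for q in qs]


def zeropoint_quantize(xs):
    """Asymmetric: map [min(xs), max(xs)] onto the full range [-127, 127]."""
    lo, hi = min(xs), max(xs)
    scale = 254.0 / (hi - lo)                 # assumes hi != lo
    zeropoint = round(lo * scale) + 127       # shift so that lo lands on -127
    return [clamp(round(x * scale) - zeropoint) for x in xs], scale, zeropoint


def zeropoint_dequantize(qs, scale, zeropoint):
    return [(q + zeropoint) / scale for q in qs]


values = [0.0, 0.5, 1.0, 2.0]
q_abs, s_abs = absmax_quantize(values)
q_zp, s_zp, zp = zeropoint_quantize(values)
print(q_abs, absmax_dequantize(q_abs, s_abs))
print(q_zp, zeropoint_dequantize(q_zp, s_zp, zp))
```

For this all-positive input, absmax leaves the negative half of the range unused, while zeropoint spreads the same values over the whole range.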

Int8 matrix multiplication at scale

Quantization methods that use a single scaling constant per tensor can be affected by outliers.
Block-wise constants can be used to limit the effect of outliers.
Row-wise quantization is improved by vector-wise quantization.
Mixed-precision decomposition is used to handle large magnitude outlier features.
Memory reduction of up to 1.96x is achieved.
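
A toy example of why a single scaling constant is sensitive to outliers, and how block-wise constants confine the damage. The values, the block size of 4, and the helper names are illustrative assumptions, not taken from the paper.

```python
def absmax_roundtrip(xs):
    """Quantize to [-127, 127] with one constant, then dequantize."""
    peak = max(abs(x) for x in xs)
    if peak == 0:
        return list(xs)
    scale = 127.0 / peak
    return [round(x * scale) / scale for x in xs]


def blockwise_roundtrip(xs, block_size):
    """Same round trip, but with one constant per block of values."""
    out = []
    for start in range(0, len(xs), block_size):
        out.extend(absmax_roundtrip(xs[start:start + block_size]))
    return out


def mean_abs_error(xs, ys):
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Seven small values plus one large outlier at the end.
values = [0.1 * i for i in range(1, 8)] + [60.0]

per_tensor_err = mean_abs_error(values, absmax_roundtrip(values))
blockwise_err = mean_abs_error(values, blockwise_roundtrip(values, block_size=4))
print(per_tensor_err, blockwise_err)
```

With one constant, the outlier stretches the scale for every value; with two blocks, only the block that contains the outlier loses precision.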

Vector-wise quantization

Matrix multiplication can be viewed as a sequence of independent inner products.
Different scaling constants can be assigned to each row and column of the matrices.
Denormalization of the inner product results is done by the outer product of the scaling constants.
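
The three points above can be sketched as follows: each row of the left matrix and each column of the right matrix gets its own absmax constant, the product is accumulated on integers, and entry (i, j) is rescaled by the product of the row-i and column-j constants, which is the outer product of the two constant vectors. This is a minimal pure-Python sketch with hypothetical helper names; a real implementation would run the integer product on GPU Int8 kernels.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    cols = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def rowwise_absmax(m):
    """Quantize each row to integers in [-127, 127] with its own constant."""
    scales = [127.0 / (max(abs(v) for v in row) or 1.0) for row in m]
    quantized = [[round(v * s) for v in row] for row, s in zip(m, scales)]
    return quantized, scales


def vectorwise_int8_matmul(x, w):
    x_q, x_scales = rowwise_absmax(x)                 # one constant per row of x
    w_q_t, w_scales = rowwise_absmax(transpose(w))    # one constant per column of w
    acc = matmul(x_q, transpose(w_q_t))               # integer accumulation
    # Denormalize entry (i, j) by x_scales[i] * w_scales[j].
    return [[acc[i][j] / (x_scales[i] * w_scales[j])
             for j in range(len(w_scales))]
            for i in range(len(x_scales))]


x = [[0.5, -1.0, 2.0], [10.0, 20.0, -5.0]]
w = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
exact = matmul(x, w)
approx = vectorwise_int8_matmul(x, w)
print(exact)
print(approx)
```

The second row of `x` is ten times larger than the first, yet both rows keep similar relative precision because each has its own constant.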

The core of llm.int8(): mixed-precision decomposition

8-bit transformers have large magnitude features
Vector-wise quantization is ineffective for outlier features
Outlier features are sparse and systematic
New decomposition technique focuses on high precision multiplication for outlier features
Mixed-precision decomposition for matrix multiplication separates outlier feature dimensions into a set
Threshold of 6.0 reduces transformer performance degradation close to zero
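
A minimal sketch of the decomposition described above, under stated assumptions: feature dimensions of the input that contain any value with magnitude of at least 6.0 are multiplied in high precision (Python floats stand in for 16-bit here), the remaining dimensions go through a vector-wise Int8 product, and the two partial results are summed. The helper names and the toy matrices are hypothetical.

```python
THRESHOLD = 6.0  # outlier magnitude threshold named in the text


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    cols = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def rowwise_absmax(m):
    """Quantize each row to integers in [-127, 127] with its own constant."""
    scales = [127.0 / (max(abs(v) for v in row) or 1.0) for row in m]
    return [[round(v * s) for v in row] for row, s in zip(m, scales)], scales


def int8_matmul(x, w):
    """Vector-wise quantized product, rescaled back to floats."""
    x_q, x_s = rowwise_absmax(x)
    w_q_t, w_s = rowwise_absmax(transpose(w))
    acc = matmul(x_q, transpose(w_q_t))
    return [[acc[i][j] / (x_s[i] * w_s[j]) for j in range(len(w_s))]
            for i in range(len(x_s))]


def mixed_precision_matmul(x, w, threshold=THRESHOLD):
    dims = range(len(w))
    # Feature dimensions (columns of x) that hold at least one outlier.
    outlier = [j for j in dims if any(abs(row[j]) >= threshold for row in x)]
    regular = [j for j in dims if j not in outlier]
    parts = []
    if outlier:   # high-precision path
        parts.append(matmul([[r[j] for j in outlier] for r in x],
                            [w[j] for j in outlier]))
    if regular:   # 8-bit path
        parts.append(int8_matmul([[r[j] for j in regular] for r in x],
                                 [w[j] for j in regular]))
    out = [[0.0] * len(w[0]) for _ in x]
    for part in parts:
        for i, row in enumerate(part):
            for j, v in enumerate(row):
                out[i][j] += v
    return out, outlier


def max_abs_error(a, b):
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


x = [[0.5, -40.0, 1.0, -0.25], [-1.0, 35.0, 0.75, 0.5]]   # column 1 is an outlier feature
w = [[0.2, -0.1], [0.05, 0.1], [-0.3, 0.4], [0.6, -0.2]]
exact = matmul(x, w)
mixed, outlier_dims = mixed_precision_matmul(x, w)
int8_err = max_abs_error(exact, int8_matmul(x, w))
mixed_err = max_abs_error(exact, mixed)
print(outlier_dims, int8_err, mixed_err)
```

In this toy case, pulling the single outlier column out of the Int8 path lets the remaining small values use the full 8-bit range, so the mixed result is closer to the exact product.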

Experimental setup

Measure robustness of quantization methods as size of pretrained language models is scaled up to 175B parameters
Two setups used for experiments: language modeling perplexity and zeroshot accuracy degradation
Pretrained models range from 125M to 13B parameters
NVIDIA A40 GPUs used for evaluation
OPT models used to measure zeroshot performance

Main results

Table 1 shows that absmax, row-wise, and zeropoint quantization fail as the model size increases.
LLM.int8() is the only method that preserves perplexity and has a favorable scaling trend.
Figure 1 shows that LLM.int8() maintains full 16-bit performance when scaling from 125M to 175B parameters.
8-bit absmax vector-wise quantization scales poorly and degenerates into random performance.
Quantization overhead can slow inference for models with less than 6.7B parameters.
LLM.int8() run times are about two times faster for large matrix multiplications equivalent to those in 175B models.

Emergent large magnitude features in transformers at scale

Outlier features with large magnitudes emerge and affect all layers of a transformer.
Outlier features strongly affect attention and the overall predictive performance of transformers.
Up to 150k outliers exist per 2048 token sequence for a 13B model.
Outlier features are highly systematic and only represent at most 7 unique feature dimensions.
Insights from this analysis were critical to developing mixed-precision decomposition.
Analysis explains the advantages of zeropoint quantization and why they disappear with the use of mixed-precision decomposition.

Finding outlier features

Difficulty with quantitative analysis of emergent phenomena is two-fold
Aim to select small subset of features for analysis
Use empirical approach to find constraints
Outliers are defined by three criteria: the feature magnitude is at least 6, the feature affects at least 25% of layers, and it affects at least 6% of sequence dimensions
Track dimensions with magnitude of 6 or higher
Ignore attention function and FFN contraction layer
Thresholds set to limit detection to single outlier in smallest model
Test models up to 13B parameters
Evaluate transformers trained in 3 different software frameworks
Perform analysis in 2 different inference software frameworks
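
The selection criteria above can be sketched as a filter over hidden states. This is one possible reading, stated as an assumption: a layer counts as affected by a feature dimension if any token in that layer reaches the magnitude threshold there, and a sequence position counts as affected if it reaches the threshold in any layer. The function name and the nested-list layout `hidden_states[layer][token][dim]` are hypothetical.

```python
def find_outlier_dims(hidden_states, magnitude=6.0,
                      min_layer_frac=0.25, min_seq_frac=0.06):
    """Return feature dimensions that meet all three criteria."""
    n_layers = len(hidden_states)
    n_tokens = len(hidden_states[0])
    n_dims = len(hidden_states[0][0])
    outliers = []
    for d in range(n_dims):
        # Layers in which dimension d reaches the threshold for some token.
        layers_hit = sum(
            any(abs(tok[d]) >= magnitude for tok in layer)
            for layer in hidden_states
        )
        # Sequence positions at which dimension d reaches it in some layer.
        tokens_hit = sum(
            any(abs(layer[t][d]) >= magnitude for layer in hidden_states)
            for t in range(n_tokens)
        )
        if (layers_hit / n_layers >= min_layer_frac
                and tokens_hit / n_tokens >= min_seq_frac):
            outliers.append(d)
    return outliers


n_layers, n_tokens, n_dims = 8, 4, 3
states = [[[0.1] * n_dims for _ in range(n_tokens)] for _ in range(n_layers)]
for layer in (0, 3, 5):
    states[layer][2][1] = -7.5    # systematic: same dimension in several layers
states[6][0][2] = 9.0             # isolated spike: one layer only

print(find_outlier_dims(states))
```

Dimension 1 qualifies because it is large in 3 of 8 layers; the isolated spike in dimension 2 touches only 1 of 8 layers and is filtered out by the layer criterion.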

Measuring the effect of outlier features

Emergence of large magnitude features across all layers of a transformer occurs suddenly between 6B and 6.7B parameters.
Emergence of large magnitude features across all layers of the transformer can be seen as emerging smoothly according to an exponential function of decreasing perplexity.
Median outlier feature magnitude rapidly increases once outlier features occur in all layers of the transformer.
Number of outlier features increases strictly monotonically with respect to decreasing C4 perplexity.
Removing outliers reduces mean top-1 softmax probability from 40% to 20%, and increases validation perplexity by 600-1000%.
Removing random feature dimensions reduces top-1 probability by 0.02-0.3%, and increases perplexity by 0.1%.

Interpretation of quantization performance

Outliers in feature dimensions are common in large transformers.
Row-wise and vectorwise quantization cannot deal with outliers effectively.
Outliers have a strictly asymmetric distribution.
Zeropoint quantization is effective for these outliers.
At 13B scale, even zeropoint quantization fails.
Mixed-precision decomposition is needed to retain full precision performance.
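
A toy comparison of why a one-sided value distribution favors zeropoint over absmax quantization. The input values and helper names are illustrative assumptions; the sketch only shows the mechanism (absmax spends half of its range on a sign that never occurs) and does not reproduce the paper's measurements.

```python
def absmax_roundtrip(xs):
    """Symmetric quantize/dequantize with one constant."""
    scale = 127.0 / max(abs(x) for x in xs)
    return [round(x * scale) / scale for x in xs]


def zeropoint_roundtrip(xs):
    """Asymmetric quantize/dequantize mapping [min, max] onto [-127, 127]."""
    lo, hi = min(xs), max(xs)
    scale = 254.0 / (hi - lo)
    zeropoint = round(lo * scale) + 127
    qs = [max(-127, min(127, round(x * scale) - zeropoint)) for x in xs]
    return [(q + zeropoint) / scale for q in qs]


def mean_abs_error(xs, ys):
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Solely positive values, mimicking a one-sided (asymmetric) distribution.
values = [i * 0.37 for i in range(20)]

absmax_err = mean_abs_error(values, absmax_roundtrip(values))
zeropoint_err = mean_abs_error(values, zeropoint_roundtrip(values))
print(absmax_err, zeropoint_err)
```

Because zeropoint uses all 255 levels for the positive range while absmax uses only 128 of them, the zeropoint round trip has the smaller error on this input.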

Related work

8-bit data types are used by GPUs and have a sign bit and different exponent and fraction bit combinations
Outlier features in language models are related to layer normalization and the token frequency distribution
Two related quantization methods, nuQmm and ZeroQuant, use group-wise quantization
ZeroQuant achieves zero-degradation performance for 8-bit quantization of a 20B model
GLM-130B uses insights from this paper to achieve zero-degradation 8-bit quantization

Discussion and limitations

Multi-billion parameter transformers can be quantized to Int8 and used for inference without performance degradation
Mixed-precision decomposition isolates outlier features in a separate 16-bit matrix multiplication
LLM.int8() recovers full inference performance of models with up to 175B parameters

Broader impacts

Enabling access to large models that previously could not fit into GPU memory
Allowing research and applications that were not possible before due to limited GPU memory
Increasing disparities between resource-rich and poor organizations
Making large pretrained models more accessible
Potential beneficial and detrimental effects on society
Quantization of transformers with fewer than 1B parameters
Using less than 8-bits for data types for convolutional networks
Learning adjustments to the quantization procedure to improve quantization errors
Int8 matrix multiplication yields an advantage if the entire GPU is well saturated
Int8 inference is slightly slower than 16-bit inference, but its latency per token in milliseconds stays close
Using 8-bit feed-forward networks with and without 8-bit linear projections in the attention layer
Degradation of finetuning with 8-bit FFN layers and 8-bit attention projection layers compared to 32-bit
Improvements if mixed-precision decomposition is used
Outliers are essential for large softmax probabilities
Outliers disrupt symmetric absmax quantization and favor asymmetric zeropoint quantization
