Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Large language models require significant GPU memory for inference.
  • A procedure is developed to reduce memory needed for inference by half while retaining full precision performance.
  • A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
  • A two-part quantization procedure is developed to cope with emergent features in transformer language models.
  • Inference in LLMs with up to 175B parameters can be performed without any performance degradation.
  • Software is open-sourced.
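
The "half the memory" claim can be checked with back-of-the-envelope arithmetic. The sketch below counts weight storage only and assumes 1 GB = 1e9 bytes; the helper name `weight_memory_gb` is hypothetical, and activations and other runtime buffers are ignored.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 175e9  # parameter count mentioned in the abstract
fp16_gb = weight_memory_gb(params, 16)
int8_gb = weight_memory_gb(params, 8)
print(f"16-bit: {fp16_gb:.0f} GB, Int8: {int8_gb:.0f} GB")
# prints "16-bit: 350 GB, Int8: 175 GB"
```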

Paper Content

Introduction

  • Large pretrained language models are widely used in NLP
  • These models require significant memory for inference
  • 95% of consumed parameters and 65-85% of all computation come from feed-forward and attention projection layers
  • 8-bit quantization methods have been developed to reduce memory use
  • Degradation-free quantization up to 350M parameters is poorly understood
  • Multi-billion parameter quantization is an open challenge
  • This paper presents a multi-billion-scale Int8 quantization procedure for transformers
  • This procedure does not incur any performance degradation
  • Two key challenges are solved: higher quantization precision and explicit representation of sparse but systematic large magnitude outlier features
  • Vector-wise quantization is used to retain performance up to 2.7B parameters

Background

  • The paper pushes quantization techniques to their breaking point by scaling transformer models
  • It studies two high-precision techniques: asymmetric (zeropoint) and symmetric (absolute maximum) quantization
  • Zeropoint quantization offers high precision but is rarely used
  • Absolute maximum quantization is the most commonly used technique

8-bit data types and quantization

  • Absmax quantization scales inputs into the 8-bit range [-127, 127]
  • Zeropoint quantization shifts the input distribution into the full range [-127, 127]
  • Zeropoint quantization uses a special instruction to add a zeropoint to each element of an input tensor before performing a 16-bit integer operation
  • Int8 matrix multiplication with 16-bit float inputs and outputs is performed using either absmax or zeropoint quantization
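
The two schemes above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's kernels: the zeropoint variant uses a common affine formulation (scale, then shift by an integer zeropoint), and all function names are hypothetical. It assumes the input is not constant, so the scaling constants are finite.

```python
import numpy as np

def absmax_quantize(x):
    """Symmetric: scale so the largest magnitude maps to 127."""
    scale = 127.0 / float(np.max(np.abs(x)))
    q = np.round(scale * x).astype(np.int8)
    return q, scale

def absmax_dequantize(q, scale):
    return q.astype(np.float32) / scale

def zeropoint_quantize(x):
    """Asymmetric: stretch [min, max] over [-127, 127], then shift."""
    x_min, x_max = float(np.min(x)), float(np.max(x))
    scale = 254.0 / (x_max - x_min)
    zeropoint = int(np.round(-scale * x_min)) - 127
    q = np.clip(np.round(scale * x) + zeropoint, -127, 127).astype(np.int8)
    return q, scale, zeropoint

def zeropoint_dequantize(q, scale, zeropoint):
    return (q.astype(np.float32) - zeropoint) / scale

# A one-sided (non-negative) input, chosen for illustration
x = np.array([0.0, 0.5, 1.0, 2.0, 4.0], dtype=np.float32)

q_abs, s_abs = absmax_quantize(x)
q_zp, s_zp, zp = zeropoint_quantize(x)
x_abs = absmax_dequantize(q_abs, s_abs)
x_zp = zeropoint_dequantize(q_zp, s_zp, zp)
print(q_abs, q_zp)
```

For a non-negative input like this one, absmax only ever produces non-negative codes, while the zeropoint variant spreads the values over the whole [-127, 127] range.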

Int8 matrix multiplication at scale

  • Quantization methods that use a single scaling constant per tensor can be affected by outliers.
  • Block-wise constants can be used to limit the effect of outliers.
  • Row-wise quantization is improved by vector-wise quantization.
  • Mixed-precision decomposition is used to handle large magnitude outlier features.
  • Memory reduction of up to 1.96x is achieved.
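
To see why a single scaling constant per tensor is fragile, the sketch below compares per-tensor and per-row absmax constants on a matrix with one injected outlier. The data, the outlier value of 60, and the helper name `absmax_roundtrip` are illustrative assumptions, and per-row constants stand in here for the more general idea of block-wise constants.

```python
import numpy as np

def absmax_roundtrip(x, axis=None):
    """Quantize to Int8 with absmax scaling, then dequantize.

    axis=None uses one constant for the whole tensor,
    axis=1 uses one constant per row.
    """
    scale = 127.0 / np.max(np.abs(x), axis=axis, keepdims=True)
    q = np.round(x * scale).astype(np.int8)
    return q.astype(np.float32) / scale

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(4, 64)).astype(np.float32)
x[0, 0] = 60.0  # hypothetical outlier in the first row

per_tensor = absmax_roundtrip(x, axis=None)
per_row = absmax_roundtrip(x, axis=1)

# Mean absolute error on the rows that do NOT contain the outlier
err_tensor = float(np.abs(per_tensor[1:] - x[1:]).mean())
err_row = float(np.abs(per_row[1:] - x[1:]).mean())
print(f"per-tensor error: {err_tensor:.4f}, per-row error: {err_row:.4f}")
```

With one constant for the whole tensor, the outlier stretches the quantization grid for every row; with per-row constants, only the row that contains the outlier pays for it.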

Vector-wise quantization

  • Matrix multiplication can be viewed as a sequence of independent inner products.
  • A different scaling constant can be assigned to each row of the input matrix and to each column of the weight matrix.
  • The inner product results are denormalized using the outer product of the row and column scaling constants.
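
A minimal NumPy sketch of this idea, assuming absmax scaling for each vector and Int32 accumulation. The function name is hypothetical and the code assumes no row of `x` or column of `w` is entirely zero.

```python
import numpy as np

def vectorwise_int8_matmul(x, w):
    """Approximate x @ w with Int8 operands and per-vector absmax constants."""
    # One constant per row of x, shape (rows, 1)
    c_x = 127.0 / np.max(np.abs(x), axis=1, keepdims=True)
    # One constant per column of w, shape (1, cols)
    c_w = 127.0 / np.max(np.abs(w), axis=0, keepdims=True)

    x_i8 = np.round(x * c_x).astype(np.int8)
    w_i8 = np.round(w * c_w).astype(np.int8)

    # Accumulate the Int8 products in Int32 to avoid overflow
    acc = x_i8.astype(np.int32) @ w_i8.astype(np.int32)

    # c_x * c_w broadcasts to the outer product of the two constant vectors;
    # dividing by it undoes the scaling of every inner product
    return acc / (c_x * c_w)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 32))
w = rng.normal(size=(32, 16))

exact = x @ w
approx = vectorwise_int8_matmul(x, w)
rel_err = float(np.linalg.norm(approx - exact) / np.linalg.norm(exact))
print(f"relative error: {rel_err:.4f}")
```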

The core of LLM.int8(): mixed-precision decomposition

  • Billion-scale 8-bit transformers have large magnitude features that are important for performance and require high-precision quantization
  • Vector-wise quantization is ineffective for outlier features
  • Outlier features are sparse and systematic
  • New decomposition technique focuses on high precision multiplication for outlier features
  • Mixed-precision decomposition separates the outlier feature dimensions into a set that is multiplied in 16-bit, while all remaining dimensions are multiplied in 8-bit
  • Threshold of 6.0 reduces transformer performance degradation close to zero
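
The decomposition can be sketched as follows. This is an illustrative NumPy version, not the released software: the high-precision path uses NumPy's default float type in place of 16-bit floats, the function names are hypothetical, and the code assumes at least one feature dimension stays below the threshold.

```python
import numpy as np

def int8_vectorwise_matmul(x, w):
    """Vector-wise absmax Int8 matmul (row constants for x, column constants for w)."""
    c_x = 127.0 / np.maximum(np.abs(x).max(axis=1, keepdims=True), 1e-8)
    c_w = 127.0 / np.maximum(np.abs(w).max(axis=0, keepdims=True), 1e-8)
    x_i8 = np.round(x * c_x).astype(np.int8)
    w_i8 = np.round(w * c_w).astype(np.int8)
    acc = x_i8.astype(np.int32) @ w_i8.astype(np.int32)
    return acc / (c_x * c_w)

def mixed_precision_matmul(x, w, threshold=6.0):
    """Split feature dimensions into outlier and regular sets, then sum both paths."""
    # A feature dimension (column of x) is an outlier if any entry reaches the threshold
    outlier = (np.abs(x) >= threshold).any(axis=0)
    # Outlier dimensions: high-precision multiplication
    high = x[:, outlier] @ w[outlier, :]
    # Remaining dimensions: Int8 multiplication
    low = int8_vectorwise_matmul(x[:, ~outlier], w[~outlier, :])
    return high + low

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 64))
x[:, [3, 10]] *= 20.0  # two hypothetical outlier feature dimensions
w = rng.normal(size=(64, 32))

exact = x @ w

def rel_err(approx):
    return float(np.linalg.norm(approx - exact) / np.linalg.norm(exact))

err_int8_only = rel_err(int8_vectorwise_matmul(x, w))
err_mixed = rel_err(mixed_precision_matmul(x, w))
print(f"Int8 only: {err_int8_only:.4f}, mixed precision: {err_mixed:.4f}")
```

Because the outlier columns no longer set the row scaling constants, the Int8 path keeps a fine quantization grid for the regular values.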

Experimental setup

  • Measure robustness of quantization methods as size of pretrained language models is scaled up to 175B parameters
  • Two setups used for experiments: language modeling perplexity and zeroshot accuracy degradation
  • Pretrained models range from 125M to 13B parameters
  • NVIDIA A40 GPUs used for evaluation
  • OPT models used to measure zeroshot performance

Main results

  • Table 1 shows that absmax, row-wise, and zeropoint quantization fail as the model size increases.
  • LLM.int8() is the only method that preserves perplexity and has a favorable scaling trend.
  • Figure 1 shows that LLM.int8() maintains full 16-bit performance when scaling from 125M to 175B parameters.
  • 8-bit absmax vector-wise quantization scales poorly and degenerates into random performance.
  • Quantization overhead can slow inference for models with less than 6.7B parameters.
  • LLM.int8() run time is about two times faster for large matrix multiplications equivalent to those in 175B models.

Emergent large magnitude features in transformers at scale

  • Outlier features with large magnitudes emerge and affect all layers of a transformer.
  • Outlier features strongly affect attention and the overall predictive performance of transformers.
  • Up to 150k outliers exist per 2048 token sequence for a 13B model.
  • Outlier features are highly systematic and are concentrated in at most 7 unique feature dimensions.
  • Insights from this analysis were critical to developing mixed-precision decomposition.
  • Analysis explains the advantages of zeropoint quantization and why they disappear with the use of mixed-precision decomposition.

Finding outlier features

  • Difficulty with quantitative analysis of emergent phenomena is two-fold
  • Aim to select small subset of features for analysis
  • Use empirical approach to find constraints
  • Outliers are defined by three criteria: the feature magnitude is at least 6.0, it affects at least 25% of layers, and it affects at least 6% of sequence dimensions
  • Track dimensions with magnitude of 6 or higher
  • Ignore attention function and FFN contraction layer
  • Thresholds set to limit detection to single outlier in smallest model
  • Test models up to 13B parameters
  • Evaluate transformers trained in 3 different software frameworks
  • Perform analysis in 2 different inference software frameworks
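
The three detection criteria can be sketched as a filter over stacked hidden states. This is one possible reading of the criteria, not the authors' analysis code: the array layout `(layers, sequence, hidden)`, the function name, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def find_outlier_dims(hidden, magnitude=6.0, min_layer_frac=0.25, min_seq_frac=0.06):
    """Return hidden dimensions that satisfy all three criteria.

    hidden: array of shape (layers, sequence, hidden_dim).
    """
    big = np.abs(hidden) >= magnitude
    # Fraction of layers in which the dimension has at least one large value
    layer_frac = big.any(axis=1).mean(axis=0)
    # Fraction of sequence positions at which the dimension is large in some layer
    seq_frac = big.any(axis=0).mean(axis=0)
    keep = (layer_frac >= min_layer_frac) & (seq_frac >= min_seq_frac)
    return np.flatnonzero(keep)

rng = np.random.default_rng(0)
# Synthetic hidden states: 8 layers, 100 tokens, 32 dimensions, all small values
hidden = rng.uniform(-1.0, 1.0, size=(8, 100, 32))

hidden[0:4, 0:10, 5] = 8.0   # 50% of layers, 10% of positions: qualifies
hidden[0, 0:50, 7] = 8.0     # only 1 of 8 layers: fails the layer criterion
hidden[0:4, 0:3, 9] = -8.0   # only 3% of positions: fails the sequence criterion

outlier_dims = find_outlier_dims(hidden).tolist()
print(outlier_dims)  # prints [5]
```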

Measuring the effect of outlier features

  • Emergence of large magnitude features across all layers of a transformer occurs suddenly between 6B and 6.7B parameters.
  • Measured against perplexity rather than parameter count, the emergence of large magnitude features across all layers is smooth and follows an exponential function of decreasing perplexity.
  • Median outlier feature magnitude rapidly increases once outlier features occur in all layers of the transformer.
  • The number of outlier features increases strictly monotonically with respect to decreasing C4 perplexity.
  • Removing outliers reduces mean top-1 softmax probability from 40% to 20%, and increases validation perplexity by 600-1000%.
  • Removing random feature dimensions reduces top-1 probability by 0.02-0.3%, and increases perplexity by 0.1%.

Interpretation of quantization performance

  • Outliers in feature dimensions are common in large transformers.
  • Row-wise and vector-wise quantization cannot deal with outliers effectively.
  • Outliers have a strictly asymmetric distribution.
  • Zeropoint quantization is effective for these outliers.
  • At 13B scale, even zeropoint quantization fails.
  • Mixed-precision decomposition is needed to retain full precision performance.

Related work

  • The paper focuses on Int8 because GPUs support it; 8-bit floating point (FP8) data types instead have a sign bit and different exponent and fraction bit combinations
  • Outlier features in language models have been related to layer normalization and the token frequency distribution
  • Two methods developed in parallel, nuQmm and ZeroQuant, both use group-wise quantization
  • ZeroQuant achieves zero-degradation performance for 8-bit quantization of a 20B model
  • GLM-130B uses insights from LLM.int8() to achieve zero-degradation 8-bit quantization

Discussion and limitations

  • Multi-billion parameter transformers can be quantized to Int8 and used for inference without performance degradation
  • Mixed-precision decomposition isolates outlier features in a separate 16-bit matrix multiplication
  • LLM.int8() recovers full inference performance of models with up to 175B parameters

Broader impacts

  • Enabling access to large models that previously could not fit into GPU memory
  • Allowing research and applications that were not possible before due to limited GPU memory
  • Potentially increasing disparities between resource-rich and resource-poor organizations
  • Making large pretrained models more accessible
  • Potential beneficial and detrimental effects on society

Appendix

  • Quantization of transformers with fewer than 1B parameters
  • Using data types with fewer than 8 bits for convolutional networks
  • Learning adjustments to the quantization procedure to improve quantization errors
  • Int8 matrix multiplication yields an advantage if the entire GPU is well saturated
  • Int8 inference is slightly slower but close to the millisecond latency per token compared to 16-bit inference
  • Using 8-bit feed-forward networks with and without 8-bit linear projections in the attention layer
  • Degradation of finetuning with 8-bit FFN layers and 8-bit attention projection layers compared to 32-bit
  • Improvements if mixed-precision decomposition is used
  • Outliers are essential for large softmax probabilities
  • Outliers disrupt symmetric absmax quantization and favor asymmetric zeropoint quantization