Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Quantization methods reduce the number of bits required to represent parameters in a model.
  • The final model size depends on the number of parameters and the rate of compression.
  • This work studies the trade-off between accuracy and model size.
  • 35,000 zero-shot experiments were run with 16-bit inputs and k-bit parameters.
  • 4-bit precision is almost universally optimal for total model bits and zero-shot accuracy.

Paper Content

Introduction

  • LLMs are widely used for zero/few-shot inference
  • LLMs can be challenging to use due to their large memory footprint and high latency
  • Memory footprint and latency are largely determined by the total number of bits in the parameters
  • Reducing model bits through quantization reduces latency
  • Trade-off between accuracy and total model bits
  • Bit-level scaling laws to determine precision that maximizes zero-shot accuracy
  • 4-bit parameters yield optimal zero-shot accuracy
  • 35,000 zero-shot experiments to vary quantization precision
  • Choice of data type and a small quantization block size are the most effective ways to improve bit-level scaling at 4-bit precision
  • Outlier-dependent quantization not effective for bit-level scaling

Background

  • Reducing the number of bits of a model is related to inference latency for LLMs
  • Section provides background to understand this relationship
  • Thorough background on quantization data types and methods

Relationship between inference latency and total model bits

  • Main goal is to find trade-offs between total model bits and zero-shot accuracy for LLMs
  • Total model bits related to inference latency if batch size is below 60
  • Overall computation latency determined by two factors: loading data from main memory and performing computation on data in registers
  • Caching effective for training LLMs but not for inference
  • Reducing memory loaded from matrix W can reduce inference latency
  • Relationship between total model bits and inference latency starts to crumble as inference batch size increases
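
As a rough illustration of the relationship above, the sketch below estimates how long it takes just to stream the weights from main memory once, assuming this loading step dominates latency at small batch sizes. The parameter count and the memory bandwidth are hypothetical values chosen for the arithmetic, not figures from the paper.

```python
def total_model_bits(num_params, bits_per_param):
    """Total bits needed to store the model parameters."""
    return num_params * bits_per_param


def weight_load_time_ms(num_params, bits_per_param, bandwidth_gb_per_s):
    """Time to read every parameter once from main memory, in milliseconds.

    Assumes inference is memory-bound, so this is only a lower bound on
    the latency of one forward pass.
    """
    bytes_to_load = total_model_bits(num_params, bits_per_param) / 8
    return bytes_to_load / (bandwidth_gb_per_s * 1e9) * 1e3


NUM_PARAMS = 6_000_000_000   # hypothetical 6B-parameter model
BANDWIDTH = 1000             # hypothetical memory bandwidth in GB/s

for bits in (16, 8, 4):
    ms = weight_load_time_ms(NUM_PARAMS, bits, BANDWIDTH)
    print(f"{bits:>2}-bit weights: {ms:.1f} ms to load all parameters")
```

Under these assumptions the estimate scales linearly with the bits per parameter. It ignores the compute time that becomes dominant at larger batch sizes.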

Data types

  • Quantization is a mapping from k-bit integers to floating point values in the range [-1, 1]
  • Data types are one of the few things that improve bit-level scaling laws
  • Blocking is another quantization method that improves bit-level scaling
  • Integer data types map the integers to themselves, with an offset of 128 for a signed 8-bit integer
  • Floating point data types are represented by a combination of exponent and mantissa bits
  • Dynamic exponent data types use one bit for the sign and the number of following zero bits represents the exponent with base 10
  • Quantile quantization is an information-theoretically optimal data type for a k-bit quantization
  • SRAM Quantiles algorithm approximates the quantiles of an input tensor through the empirical cumulative distribution function
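
To make the idea of a data type as a set of values in [-1, 1] concrete, here is a minimal sketch that builds a uniform (integer-style) codebook and a quantile-based codebook, then counts how often each codebook value is used for a synthetic Gaussian tensor. Quantizing a value means storing the index of its nearest codebook entry; dequantizing means looking that entry up and multiplying by the absmax constant. The helper names are hypothetical, and the quantile codebook uses a simplified estimator (the sample at the centre of each equal-probability bin), not the SRAM Quantiles algorithm.

```python
import random


def linear_codebook(k):
    """2**k evenly spaced values in [-1, 1] (integer / uniform data type)."""
    n = 2 ** k
    return [-1 + 2 * i / (n - 1) for i in range(n)]


def quantile_codebook(normalized, k):
    """One value per equal-probability bin of the empirical distribution.

    Simplification: takes the sample at the centre of each bin rather
    than using the estimator described in the paper.
    """
    n = 2 ** k
    ordered = sorted(normalized)
    return [ordered[int((i + 0.5) * len(ordered) / n)] for i in range(n)]


def nearest_index(codebook, x):
    """Index of the codebook value closest to x (the stored k-bit code)."""
    return min(range(len(codebook)), key=lambda j: abs(codebook[j] - x))


def bin_usage(codebook, normalized):
    """How many input values land on each codebook entry."""
    counts = [0] * len(codebook)
    for x in normalized:
        counts[nearest_index(codebook, x)] += 1
    return counts


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic tensor
absmax = max(abs(w) for w in weights)
normalized = [w / absmax for w in weights]               # now in [-1, 1]

for name, codebook in (("linear", linear_codebook(4)),
                       ("quantile", quantile_codebook(normalized, 4))):
    counts = bin_usage(codebook, normalized)
    print(name, "least-used value holds", min(counts), "of", len(normalized), "inputs")
```

The quantile codebook spreads the inputs far more evenly across its 16 values, while the uniform codebook leaves its outermost values nearly unused for bell-shaped inputs.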

Blocking / grouping & distribution centering

  • Quantization precision is determined by how many quantization bins are used.
  • Methods can be used to increase the average use of all quantization bins.
  • Centering is used to center distributions and increase quantization precision.
  • Blocking/grouping chunks the tensor into smaller pieces and quantizes each block independently.
  • Blocking provides a measure of additional bits per parameter independent of the hidden dimension.
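
The effect of blocking can be sketched with a small example. The code below assumes symmetric integer quantization with one absmax normalization constant per block, stored in 16 bits; the function names and the toy tensor are hypothetical. It shows that a single outlier only degrades the block it falls in, and that the storage overhead is the constant's bits divided by the block size.

```python
def quantize_blockwise(values, k, block_size):
    """Quantize each block independently with its own absmax constant."""
    levels = 2 ** (k - 1) - 1                 # e.g. 7 for symmetric 4-bit integers
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0   # guard against all-zero blocks
        codes = [round(v / absmax * levels) for v in block]
        blocks.append((absmax, codes))
    return blocks


def dequantize_blockwise(blocks, k):
    """Map the integer codes back to floats using each block's constant."""
    levels = 2 ** (k - 1) - 1
    return [c * absmax / levels for absmax, codes in blocks for c in codes]


def bits_per_param(k, block_size, constant_bits=16):
    """Storage cost per parameter, including the per-block constant."""
    return k + constant_bits / block_size


def total_abs_error(values, k, block_size):
    """Sum of absolute round-trip errors for a given block size."""
    restored = dequantize_blockwise(quantize_blockwise(values, k, block_size), k)
    return sum(abs(a - b) for a, b in zip(values, restored))


weights = [0.1, -0.2, 0.05, 0.15, 8.0, 0.1, -0.1, 0.2]   # one outlier at 8.0
print("one block of 8: ", total_abs_error(weights, k=4, block_size=8))
print("two blocks of 4:", total_abs_error(weights, k=4, block_size=4))
print("bits per parameter at block size 64:", bits_per_param(4, 64))
```

With a 16-bit constant and a block size of 64, the overhead is 16/64 = 0.25 bits per parameter, regardless of the hidden dimension.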

Outlier-dependent quantization through proxy quantization

  • 16-bit inputs and 8-bit weights can avoid disruption
  • Outlier features can cause degradation with 16-bit inputs and weights below 8-bit
  • Developed outlier-dependent quantization through proxy quantization
  • Model-independent metric needed with constant memory footprint
  • Standard deviation of hidden states unreliable
  • Standard deviation of hidden unit weights of previous layer better measure of detecting outlier dimensions
  • Proxy quantization input-independent and task-independent
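
The proxy criterion can be sketched as follows: rank hidden units by the standard deviation of the previous layer's weights that produce them, and flag the top fraction as likely outlier dimensions. The function name, the row-per-hidden-unit layout, and the fraction used here are assumptions made for illustration.

```python
import statistics


def outlier_dims_by_proxy(prev_layer_weights, fraction):
    """Return indices of hidden units with the largest weight spread.

    prev_layer_weights[j] is assumed to hold the weights that produce
    hidden unit j. No input data is needed, only the weights.
    """
    stds = [statistics.pstdev(row) for row in prev_layer_weights]
    num_outliers = max(1, round(fraction * len(stds)))
    ranked = sorted(range(len(stds)), key=lambda j: stds[j], reverse=True)
    return sorted(ranked[:num_outliers])


prev_layer = [
    [0.01, -0.02, 0.03, 0.00],
    [0.02, 0.01, -0.01, 0.02],
    [0.90, -1.10, 1.20, -0.80],   # unusually spread-out weights
    [-0.03, 0.02, 0.01, -0.01],
]
print(outlier_dims_by_proxy(prev_layer, fraction=0.25))
```

The flagged dimensions would then be kept at higher precision in the next layer while the remaining weights are quantized to k bits. Because the ranking depends only on the weights, it stays the same for every input and task.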

Experimental setup

  • 16-bit inputs and k-bit quantized parameters used for experiments (3 ≤ k ≤ 8)
  • Attention matrices not quantized
  • 16-bit baseline used (no quantization)
  • Perplexity and zero-shot performance used to measure inference performance
  • Zero-shot tasks used: LAMBADA, Winogrande, HellaSwag, PiQA
  • Perplexity found to be superior metric
  • 4-bit precision found to be optimal for almost all models
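
For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below assumes natural-log token probabilities are already available from a model; the function name and the toy values are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy case: every token is assigned probability 0.25.
log_probs = [math.log(0.25)] * 4
print(perplexity(log_probs))
```

Lower is better. A model that assigns probability 0.25 to every token has a perplexity of about 4, as if it were choosing uniformly among four options.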

Results & analysis

Bit-level inference scaling laws

  • Figure 3 shows mean zero-shot accuracy for 4 models
  • 4-bit precision yields optimal scaling for most models
  • Scaling curves are almost parallel, except for 3-bit quantization
  • 3-bit inference is close to random for largest models

Improving scaling laws

  • Small block size improves scaling
  • Data types improve scaling
  • Distribution centering is ineffective
  • Outlier-dependent quantization improves stability, but not scaling
  • Small block size adds only 0.25 bits per parameter
  • Outlier-dependent quantization only useful for 3-bit precision weights
  • 4-bit precision remains optimal
  • Quantile quantization is the best data type across all models
  • Float data type better than Integer quantization for 4-bit precision
  • The most closely related prior work is on quantization of LLMs with more than a billion parameters
  • Quantization methods can be grouped into specific categories, such as blocking and grouping, centering, learned data types, and direct codebook optimization
  • This work studies grouping and blocking, and one data type that groups similar weights through their quantiles of the entire input tensor

Recommendations & future work

  • Use 4-bit quantization for LLM inference by default
  • Use block size of 128 or lower to improve zero-shot performance
  • Use floating point or quantile quantization data type
  • Future work should focus on data types and methods for precise outlier quantization

Discussion & limitations

  • 35,000 experiments conducted
  • Certain classes of quantization methods not considered
  • Optimization from weights alone similar to quantile quantization
  • Study an essential step towards recognizing importance of quantization methods
  • Lack of optimized GPU implementations
  • 4-bit data types small enough to be stored in fast registers
  • Float data type effective and does not require lookup table
  • Scaling laws are only valid for inference batches of fewer than roughly 60-200 sequences
  • New set of scaling laws required for high throughput systems
  • Loading weight matrix only one part of inference latency

Conclusion

  • Large-scale study of 35,000 zero-shot experiments on a wide variety of LLMs and parameter scales
  • 4-bit quantization is almost universally optimal to reduce the model bits and maximize zero-shot accuracy
  • Data types and block size are the most critical measures to improve bit-level scaling
  • 6-8 bit precision is sufficient to model the weights with enough precision to not cause any major quantization precision problems
  • Float data types with relatively many exponent bits do well if we have row-wise quantized inputs and block-wise quantized weights
  • 4-bit precision is optimal for all models at all scales with few exceptions
  • Choice of data types does not affect scaling behavior at 6-bit precision
  • Choice of the blocksize does not affect scaling behavior at 6-bit precision
  • Outlier-dependent quantization is only useful for 3-bit precision weights, and 4-bit precision remains optimal
  • Choice of block size improves bit-level scaling for most models at most scales