Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Quantization methods reduce the number of bits required to represent parameters in a model.
- The final model size depends on the number of parameters and the rate of compression.
- This work studies the trade-off between accuracy and model size.
- 35,000 zero-shot experiments were run with 16-bit inputs and k-bit parameters.
- 4-bit precision is almost universally optimal for total model bits and zero-shot accuracy.
Paper Content
Introduction
- LLMs are widely used for zero/few-shot inference
- LLMs can be challenging to use due to large memory and high latency
- Memory and latency are determined by number of bits in parameters
- Reducing model bits through quantization reduces latency
- Trade-off between accuracy and total model bits
- Bit-level scaling laws to determine precision that maximizes zero-shot accuracy
- 4-bit parameters yield optimal zero-shot accuracy
- 35,000 zero-shot experiments to vary quantization precision
- Data types and small quantization block size best for 4-bit precision
- Outlier-dependent quantization not effective for bit-level scaling
Background
- Reducing the number of bits of a model is related to inference latency for LLMs
- Section provides background to understand this relationship
- Thorough background on quantization data types and methods
Relationship between inference latency and total model bits
- Main goal is to find trade-offs between total model bits and zero-shot accuracy for LLMs
- Total model bits related to inference latency if batch size is below 60
- Overall computation latency determined by two factors: loading data from main memory and performing computation on data in registers
- Caching effective for training LLMs but not for inference
- Reducing memory loaded from matrix W can reduce inference latency
- Relationship between total model bits and inference latency starts to crumble as inference batch size increases
Data types
- Quantization is a mapping from k-bit integers to floating point values in the range [-1, 1]
- Data types are one of the few things that improve bit-level scaling laws
- Blocking is another quantization method that improves bit-level scaling
- Integer data types map integers to itself with an offset of 128 for a signed integer
- Floating point data types are represented by a combination of exponent and mantissa bits
- Dynamic exponent data types use one bit for the sign and the number of following zero bits represents the exponent with base 10
- Quantile quantization is an information-theoretically optimal data type for a k-bit quantization
- SRAM Quantiles algorithm approximates the quantiles of an input tensor through the empirical cumulative distribution function
Blocking / grouping & distribution centering
- Quantization precision is determined by how many quantization bins are used.
- Methods can be used to increase the average use of all quantization bins.
- Centering is used to center distributions and increase quantization precision.
- Blocking/grouping chunks the tensor into smaller pieces and quantizes each block independently.
- Blocking provides a measure of additional bits per parameter independent of the hidden dimension.
Outlier-dependent quantization through proxy quantization
- 16-bit inputs and 8-bit weights can avoid disruption
- Outlier features can cause degradation with 16-bit inputs and weights below 8-bit
- Developed outlier-dependent quantization through proxy quantization
- Model-independent metric needed with constant memory footprint
- Standard deviation of hidden states unreliable
- Standard deviation of hidden unit weights of previous layer better measure of detecting outlier dimensions
- Proxy quantization input-independent and task-independent
Experimental setup
- 16-bit inputs and k-bit quantized parameters used for experiments (3 โฅ k โฅ 8)
- Attention matrices not quantized
- 16-bit baseline used (no quantization)
- Perplexity and zero-shot performance used to measure inference performance
- Zero-shot tasks used: LAMBADA, Winogrande, HellaSwag, PiQA
- Perplexity found to be superior metric
- 4-bit precision found to be optimal for almost all models
Results & analysis
Bit-level inference scaling laws
- Figure 3 shows mean zero-shot accuracy for 4 models
- 4-bit precision yields optimal scaling for most models
- Scaling curves are almost parallel, except for 3-bit quantization
- 3-bit inference is close to random for largest models
Improving scaling laws
- Small block size improves scaling
- Data types improve scaling
- Distribution centering is ineffective
- Outlier-dependent quantization improves stability, but not scaling
- Small block size adds only 0.25 bits per parameter
- Outlier-dependent quantization only useful for 3-bit precision weights
- 4-bit precision remains optimal
- Quantile quantization is the best data type across all models
- Float data type better than Integer quantization for 4-bit precision
Related work
- Large language model quantization is closely related to models with more than a billion parameters
- Quantization methods can be grouped into specific categories, such as blocking and grouping, centering, learned data types, and direct codebook optimization
- This work studies grouping and blocking, and one data type that groups similar weights through their quantiles of the entire input tensor
Recommendations & future work
- Use 4-bit quantization for LLM inference by default
- Use block size of 128 or lower to improve zero-shot performance
- Use floating point or quantile quantization data type
- Future work should focus on data types and methods for precise outlier quantization
Discussion & limitations
- 35,000 experiments conducted
- Certain classes of quantization methods not considered
- Optimization from weights alone similar to quantile quantization
- Study an essential step towards recognizing importance of quantization methods
- Lack of optimized GPU implementations
- 4-bit data types small enough to be stored in fast registers
- Float data type effective and does not require lookup table
- Scaling laws only valid for cases with less than 60-200 sequences
- New set of scaling laws required for high throughput systems
- Loading weight matrix only one part of inference latency
Conclusion
- Large-scale study of 35,000 zero-shot experiments on a wide variety of LLMs and parameter scales
- 4-bit quantization is almost universally optimal to reduce the model bits and maximize zero-shot accuracy
- Data types and block size are the most critical measures to improve bit-level scaling
- 6-8 bit precision is sufficient to model the weights with enough precision to not cause any major quantization precision problems
- Float data types with relatively many exponent bits do well if we have row-wise quantized inputs and block-wise quantized weights
- 4-bit precision is optimal for all models at all scales with few exceptions
- Choice of data types does not affect scaling behavior at 6-bit precision
- Choice of the blocksize does not affect scaling behavior at 6-bit precision
- Outlier-dependent quantization is only useful for 3-bit precision weights, and 4-bit precision remains optimal
- Choice of block size improves bit-level scaling for most models at most scales